
LIP-SYNC

MuseTalk

Overview

MuseTalk is a real-time, high-quality, audio-driven lip-syncing model trained in the latent space of ft-mse-vae. It:

  1. modifies an unseen face according to the input audio, with a face region size of 256 x 256;
  2. supports audio in various languages, such as Chinese, English, and Japanese;
  3. supports real-time inference at 30 fps+ on an NVIDIA Tesla V100;
  4. supports modification of the center point of the face region proposal, which significantly affects the generation results;
  5. provides a checkpoint trained on the HDTF dataset.

MuseTalk-Doc

Getting Started

We provide a detailed tutorial on the installation and basic usage of MuseTalk for new users.

The pipeline was implemented step by step in Colab, covering all aspects: installation, configuration, downloading weights and models, inference, and real-time inference. Everything needed for the process is included.

1. Download weights

You can copy the directory ("/content/drive/MyDrive/Lip-Sync/MuseTalk/musetalk/models") from my Google Drive to the same path in your Drive and use it directly.

OR

You can download the weights manually as follows (a Python download sketch follows the layout below):

  1. Download the trained MuseTalk weights.

  2. Download the weights of the other components.

Finally, these weights should be organized under ./models as follows:

./models/
├── musetalk
│   ├── musetalk.json
│   └── pytorch_model.bin
├── dwpose
│   └── dw-ll_ucoco_384.pth
├── face-parse-bisent
│   ├── 79999_iter.pth
│   └── resnet18-5c106cde.pth
├── sd-vae-ft-mse
│   ├── config.json
│   └── diffusion_pytorch_model.bin
└── whisper
    └── tiny.pt
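
For reference, below is a minimal Python sketch of the manual download using huggingface_hub. The repository IDs and file patterns are assumptions based on where the MuseTalk and sd-vae-ft-mse weights are commonly hosted; the remaining components (dwpose, face-parse-bisent, whisper tiny.pt) come from their own upstream sources, so verify every path against the MuseTalk documentation before relying on this.

# Hedged sketch: fetch the MuseTalk and VAE weights with huggingface_hub.
# Repo IDs and file layouts are assumptions; check the MuseTalk docs.
from huggingface_hub import hf_hub_download, snapshot_download

MODELS_DIR = "./models"

# MuseTalk UNet weights (assumed to be hosted under TMElyralab/MuseTalk)
snapshot_download(
    repo_id="TMElyralab/MuseTalk",
    allow_patterns=["musetalk/*"],
    local_dir=MODELS_DIR,
)

# sd-vae-ft-mse config and weights
for fname in ("config.json", "diffusion_pytorch_model.bin"):
    hf_hub_download(
        repo_id="stabilityai/sd-vae-ft-mse",
        filename=fname,
        local_dir=f"{MODELS_DIR}/sd-vae-ft-mse",
    )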

Resources Provided

  • Reference video: 13_K.mp4
  • Reference audio: audio_folder

Lip Syncing Model

  • Selected an open-source lip-syncing model, specifically MuseTalk.
  • Implemented the pipeline step-by-step in Colab, covering:
    • Environment setup (see the setup sketch after this list)
    • Configuration
    • Downloading weights and models
    • Inference and real-time inference
    • Adjusting parameters for better results
  • Generated the final lip-synced video (see Results below).
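
For readers reproducing the Colab from scratch, a rough sketch of the environment setup is shown below. It assumes the upstream MuseTalk repository is cloned directly and follows its installation notes (PyTorch plus the OpenMMLab packages used by DWPose); exact package versions are assumptions, so defer to the requirements file in the repository.

!git clone https://github.com/TMElyralab/MuseTalk.git
%cd MuseTalk
!pip install -r requirements.txt
# OpenMMLab packages used by the DWPose detector (versions are assumptions)
!pip install --no-cache-dir -U openmim
!mim install mmengine
!mim install "mmcv>=2.0.1"
!mim install "mmdet>=3.1.0"
!mim install "mmpose>=1.1.0"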

Evaluation: Compared the generated video with the reference video [13_K.mp4] and verified that the model provided good synchronization and overall video quality.

Fine-Tuning: Instead of fine-tuning the model, parameters such as bbox_shift were adjusted to control mouth openness, resulting in better synchronization:

!python -m scripts.inference --inference_config configs/inference/test.yaml --bbox_shift -7
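
The inference config referenced above is a small YAML file listing the video/audio pairs to process. Below is a minimal sketch of writing it from Python; the task key and field names (video_path, audio_path) are assumptions based on the MuseTalk example configs, and the audio filename is hypothetical, so substitute your own paths.

# Hedged sketch: write configs/inference/test.yaml for a single task.
# Field names follow the MuseTalk example configs and may differ upstream.
import yaml

config = {
    "task_0": {
        "video_path": "data/video/13_K.mp4",   # reference video from this repo
        "audio_path": "data/audio/input.wav",  # hypothetical file from audio_folder
    }
}

with open("configs/inference/test.yaml", "w") as f:
    yaml.safe_dump(config, f)

bbox_shift is then passed on the command line as above; per the MuseTalk documentation, negative values generally reduce mouth openness and positive values increase it.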

Results

  • Generated a high-quality video with excellent synchronization of lip movements with the audio.

Inference

Real-time inference
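
For the real-time path, the Colab used MuseTalk's real-time script; a sketch of the command is shown below. The module name, config path, and batch size follow the upstream MuseTalk examples and may differ in newer versions.

!python -m scripts.realtime_inference --inference_config configs/inference/realtime.yaml --batch_size 4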


Reference

For more details, refer to the documentation provided in the repository. If you encounter any issues or have questions, feel free to open an issue or contact the maintainer.
