Demo videos: real-time-dmd-1_720p.mp4, real-time-dmd-2_720p.mp4
This workflow consists of:
- Dreamshaper 8 for the base SD1.5 model.
  - 1-step distilled (using DMD) UNet.
  - Compiled into a fp8 TensorRT engine.
- TAESD for the VAE.
  - Compiled using `torch.compile`.
- Depth ControlNet
  - Compiled using `torch.compile`.
- DepthAnythingV2
  - Compiled into a fp16 TensorRT engine.
This workflow has achieved an output FPS of ~14-15 for a video stream with an input FPS of 14 using the ComfyStream server + UI when testing with the following setup:
- OS: Ubuntu
- GPU: Nvidia RTX 4090
- Driver: 550.127.05
- CUDA: 12.5
- torch: 2.5.1+cu121
The first few runs will be slower due to JIT compilation via `torch.compile`.
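As a rough, illustrative sketch (not the workflow's actual code), this is the kind of pattern the `torch.compile`-based nodes rely on, and why a short warm-up with dummy inputs pays that compilation cost up front; the module and input shapes below are placeholders:

```python
import torch

# Illustrative only: wrap a stand-in module (think TAESD decoder or Depth ControlNet)
# with torch.compile, then warm it up so JIT compilation happens before real frames arrive.
module = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1).cuda().half()
compiled = torch.compile(module, mode="reduce-overhead")

dummy = torch.randn(1, 3, 512, 512, device="cuda", dtype=torch.float16)
with torch.no_grad():
    for _ in range(3):       # the first call triggers compilation; later calls reuse the kernels
        _ = compiled(dummy)
torch.cuda.synchronize()     # make sure compilation and the first kernels have actually finished
```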
Check out some of the ideas for enhancing and extending this workflow!
Check out some of the notes for thoughts regarding this workflow!
- ComfyUI_TensorRT repo (this is a fork of the contentis fork of the original repo, with support for fp8 quantization and processing ControlNet inputs):
cd custom_nodes
git clone -b quantization_with_controlnet_fixes https://github.com/yondonfu/ComfyUI_TensorRT.git
cd ComfyUI_TensorRT
pip install -r requirements.txt
comfy node install comfyui-torch-compile
comfy node install depth-anything-tensorrt
Download the dreamshaper-8 model weights here and copy them into the `models/checkpoints` folder of your ComfyUI workspace.
Download the dreamshaper-8-dmd-1kstep model weights here and copy them into the `models/unet` folder of your ComfyUI workspace.
The original OpenDMD repo, which contains an open source implementation of the DMD paper, can be found here.
Download the model weights for the `_vits.onnx` version here and copy them into the `custom_nodes/ComfyUI-Depth-Anything-Tensorrt` folder of your ComfyUI workspace. We will use these weights to build a TensorRT engine later on.
Download the model weights here and copy them into the `models/controlnet` folder of your ComfyUI workspace.
Download the model weights for `taesd_decoder.pth` and `taesd_encoder.pth` and copy them into the `models/vae_approx` folder of your ComfyUI workspace.
In ComfyUI:
- Generate fp8 ONNX weights for the `dreamshaper-8-dmd-1kstep` weights.
- Compile a TensorRT engine using the fp8 ONNX weights.
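For intuition, here is roughly what an ONNX-to-TensorRT build looks like with the TensorRT Python API. The ComfyUI_TensorRT nodes handle this for you (including dynamic shape optimization profiles, which this sketch omits), the file names are hypothetical, and the assumption that the fp8 quantization lives in the exported ONNX is just that, an assumption:

```python
import tensorrt as trt

# Hedged sketch of an ONNX -> TensorRT engine build. The custom node performs the real
# build, including optimization profiles for dynamic input shapes, which are omitted here.
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("dreamshaper-8-dmd-1kstep-fp8.onnx", "rb") as f:  # hypothetical path to the fp8 ONNX export
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # assumption: the fp8 quantization itself comes from the ONNX export
serialized_engine = builder.build_serialized_network(network, config)
with open("dreamshaper-8-dmd-1kstep-fp8.engine", "wb") as f:
    f.write(serialized_engine)
```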
Follow these instructions to compile a TensorRT engine.
For Step 2, make the following updates to `export_trt.py`:
- Set `trt_path` to `./depth_anything_v2_vits14-fp16.engine`
- Set `onnx_path` to `./depth_anything_v2_vits14.onnx`
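In other words, after the edits the relevant lines in `export_trt.py` should look something like this (paths relative to the `ComfyUI-Depth-Anything-Tensorrt` folder):

```python
# custom_nodes/ComfyUI-Depth-Anything-Tensorrt/export_trt.py
trt_path = "./depth_anything_v2_vits14-fp16.engine"
onnx_path = "./depth_anything_v2_vits14.onnx"
```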
Start ComfyUI with the `--disable-cuda-malloc` flag (e.g. `python main.py --disable-cuda-malloc`), which is required to avoid issues with the `torch.compile` configuration for the workflow.
Download workflow.json and drag-and-drop it into ComfyUI.
Download api-workflow.json and upload it when using the ComfyStream UI.
A couple of ideas to experiment with using this workflow as a base (note: in the long term, I suspect video models that are trained on actual videos to learn motion will yield better quality than stacking different techniques together with image models, so think of these as short-term experiments to squeeze as much juice as possible out of the open image models we already have):
Requires less coding:
- Use ComfyUI-Background-Edit to separate the background/foreground of each frame, diffuse the background and/or foreground for each frame separately and then composite them back together. For example, we could re-style the background and keep the foreground character or vice-versa.
- Use SAM2 to mask specific regions of frames and selectively diffuse those regions. For example, we could re-style a basketball in each frame, but keep everything else the same.
- Use a ControlNet with a different conditioning input. For example, we could use an OpenPose ControlNet and then pass in input video with a moving OpenPose skeleton.
- Use RealTimeNodes to control the parameter values in this workflow over time (i.e. change the text prompt over time, change the denoise level over time).
Requires more coding:
- Use the same scheduling/sampling for DMD as the OpenDMD repo.
- Use DMD2, which is supposed to be an improvement over the original DMD, for the UNet. The DMD2 repo contains pretrained weights that use SDXL as the base model, which might result in better quality outputs than the DMD weights, which use SD1.5 as the base model.
- Use additional speed up techniques found in StreamDiffusion outside of the stream batch (this workflow already uses a 1-step model so the stream batch technique shouldn't make a difference, but there may be additional techniques there that could offer speed gains).
- Use TAESDV, which supposedly can decode a continuous sequence of latents into a smoother sequence of frames than TAESD, for the VAE.
- Use TemporalNet as an additional ControlNet in the workflow and use the optical flow for pairs of frames as the conditioning input to try to improve temporal consistency (i.e. reduce flickering and drastic frame-to-frame changes). In order to make this fast, we can use `torch.compile` to speed up the optical flow model via a custom node (see the sketch after this list).
- Use IP-Adapters to use an image prompt instead of a text prompt to guide diffusion. In order to make this fast, we'd probably want to accelerate the IP-Adapter with something like `torch.compile`.
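For the TemporalNet idea above, here is a hedged sketch of what compiling an off-the-shelf optical flow model could look like, assuming torchvision's RAFT implementation; none of this is part of the current workflow, and the frame tensors are stand-ins for real video frames:

```python
import torch
from torchvision.models.optical_flow import raft_small, Raft_Small_Weights

# Hypothetical: compile a small RAFT model so per-frame optical flow (e.g. as TemporalNet
# conditioning) stays cheap enough for real-time use.
weights = Raft_Small_Weights.DEFAULT
model = raft_small(weights=weights).eval().cuda()
model = torch.compile(model, mode="reduce-overhead")

transforms = weights.transforms()
frame_a = torch.rand(1, 3, 512, 512, device="cuda")  # stand-ins for two consecutive video frames
frame_b = torch.rand(1, 3, 512, 512, device="cuda")
frame_a, frame_b = transforms(frame_a, frame_b)       # normalize the pair the way RAFT expects

with torch.no_grad():
    flows = model(frame_a, frame_b)                   # RAFT returns a list of flow fields
flow = flows[-1]                                      # (1, 2, H, W); the most refined estimate
```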
Why create this workflow instead of just using one of the StreamDiffusion custom nodes?
The StreamDiffusion library is built on top of the `diffusers` library, so any custom node that just wraps it would not be immediately composable with nodes built around ComfyUI's core diffusion nodes. As a result, you wouldn't be able to use the UI to do things like swap the sampler, ControlNets, VAEs, etc. without building additional nodes.
How does the speed of diffusion in this workflow compare with StreamDiffusion?
Since this workflow uses a 1-step distilled model, and StreamDiffusion's stream batch technique only provides speed-ups when the number of steps used by the model is > 1, diffusion for this particular model shouldn't be much slower than with StreamDiffusion. However, since this workflow only reaches ~14 FPS on an Nvidia RTX 4090, there may be additional techniques used in StreamDiffusion that, if adopted here, could close the remaining speed gap.