mmagic.models.editors.animatediff.animatediff
Module Contents
Attributes
- class mmagic.models.editors.animatediff.animatediff.AnimateDiff(vae: ModelType, text_encoder: ModelType, tokenizer: str, unet: ModelType, scheduler: ModelType, test_scheduler: Optional[ModelType] = None, dtype: str = 'fp32', enable_xformers: bool = True, noise_offset_weight: float = 0, tomesd_cfg: Optional[dict] = None, data_preprocessor=dict(type='DataPreprocessor'), motion_module_cfg: Optional[dict] = None, dream_booth_lora_cfg: Optional[dict] = None)
Bases: mmengine.model.BaseModel
Implementation of AnimateDiff (https://arxiv.org/abs/2307.04725).
- Parameters
vae (Union[dict, nn.Module]) – The config or module for VAE model.
text_encoder (Union[dict, nn.Module]) – The config or module for text encoder.
tokenizer (str) – The name for CLIP tokenizer.
unet (Union[dict, nn.Module]) – The config or module for Unet model.
scheduler (Union[dict, nn.Module]) – The config or module for diffusion scheduler.
test_scheduler (Union[dict, nn.Module], optional) – The config or module for diffusion scheduler in test stage (self.infer). If not passed, the same scheduler as scheduler will be used. Defaults to None.
lora_config (dict, optional) – The config for LoRA finetuning. Defaults to None.
val_prompts (Union[str, List[str]], optional) – The prompts for validation. Defaults to None.
class_prior_prompt (str, optional) – The prompt for class prior loss.
num_class_images (int, optional) – The number of images for class prior. Defaults to 3.
prior_loss_weight (float, optional) – The weight for class prior loss. Defaults to 0.
fine_tune_text_encoder (bool, optional) – Whether to fine-tune text encoder. Defaults to False.
dtype (str, optional) – The dtype for the model. Defaults to 'fp32'.
enable_xformers (bool, optional) – Whether to use xformers. Defaults to True.
noise_offset_weight (float, optional) – The weight of the noise offset introduced in https://www.crosslabs.org/blog/diffusion-with-offset-noise. Defaults to 0.
tomesd_cfg (dict, optional) – The config for TOMESD. Please refer to https://github.com/dbolya/tomesd and https://github.com/open-mmlab/mmagic/blob/main/mmagic/models/utils/tome_utils.py for details. Defaults to None.
data_preprocessor (dict, optional) – The pre-process config of BaseDataPreprocessor. Defaults to dict(type='DataPreprocessor').
init_cfg (dict, optional) – The weight initialization config for BaseModule. Defaults to None.
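As a rough illustration of how the constructor arguments fit together, the sketch below builds a config dict and checks its keys against the parameter names listed in the signature above. The helper `check_animatediff_cfg` and the `'<...>'` placeholder values are hypothetical and not part of mmagic; real configs need actual sub-config types and checkpoint paths.

```python
# Parameter names copied from the AnimateDiff signature documented above.
REQUIRED = ('vae', 'text_encoder', 'tokenizer', 'unet', 'scheduler')
OPTIONAL = ('test_scheduler', 'dtype', 'enable_xformers',
            'noise_offset_weight', 'tomesd_cfg', 'data_preprocessor',
            'motion_module_cfg', 'dream_booth_lora_cfg')


def check_animatediff_cfg(cfg):
    """Hypothetical helper: return (missing, unknown) key lists for a config."""
    keys = set(cfg) - {'type'}  # 'type' selects the class, it is not an argument
    missing = sorted(set(REQUIRED) - keys)
    unknown = sorted(keys - set(REQUIRED) - set(OPTIONAL))
    return missing, unknown


# Placeholder values only; every '<...>' string is an assumption to be replaced.
model = dict(
    type='AnimateDiff',
    vae=dict(type='<vae type>'),
    text_encoder=dict(type='<text encoder type>'),
    tokenizer='<tokenizer name or path>',
    unet=dict(type='<unet type>'),
    scheduler=dict(type='<scheduler type>'),
    dtype='fp32',
    enable_xformers=True,
    noise_offset_weight=0,
)

missing, unknown = check_animatediff_cfg(model)
print(missing, unknown)
```

A check like this catches misspelled keys such as `schedule` (instead of `scheduler`) before the model is built.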
- set_xformers(module: Optional[torch.nn.Module] = None) -> torch.nn.Module
Set xformers for the model.
- Returns
The model with xformers.
- Return type
nn.Module
- set_tomesd() -> torch.nn.Module
Set ToMe for the stable diffusion model.
- Returns
The model with ToMe.
- Return type
nn.Module
- _encode_prompt(prompt, device, num_videos_per_prompt, do_classifier_free_guidance, negative_prompt)
Encodes the prompt into text encoder hidden states.
- prepare_extra_step_kwargs(generator, eta)
Prepare extra kwargs for the scheduler step, since not all schedulers have the same signature. eta (η) is only used with the DDIMScheduler and is ignored by other schedulers.
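One common way to handle differing scheduler signatures is to inspect the `step` method and forward only the arguments it accepts. The sketch below shows that technique on its own; the two `Fake*Scheduler` classes are hypothetical stand-ins, and the function takes the scheduler explicitly instead of reading it from `self`, so it is not the method's actual implementation.

```python
import inspect


class FakeDDIMScheduler:
    """Hypothetical stand-in whose step() accepts eta and generator."""

    def step(self, model_output, timestep, sample, eta=0.0, generator=None):
        return sample


class FakeOtherScheduler:
    """Hypothetical stand-in whose step() accepts neither."""

    def step(self, model_output, timestep, sample):
        return sample


def prepare_extra_step_kwargs(scheduler, generator, eta):
    # Collect the parameter names that scheduler.step actually declares.
    accepted = set(inspect.signature(scheduler.step).parameters)
    extra = {}
    if 'eta' in accepted:
        extra['eta'] = eta
    if 'generator' in accepted:
        extra['generator'] = generator
    return extra


print(prepare_extra_step_kwargs(FakeDDIMScheduler(), None, 0.0))
print(prepare_extra_step_kwargs(FakeOtherScheduler(), None, 0.0))
```

The returned dict can then be unpacked into each call as `scheduler.step(..., **extra)`, so the denoising loop does not need scheduler-specific branches.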
- convert_lora(state_dict, LORA_PREFIX_UNET='lora_unet', LORA_PREFIX_TEXT_ENCODER='lora_te', alpha=0.6)
Convert LoRA weights for the unet and text_encoder.
TODO: use this function to convert lora
- Parameters
state_dict (_type_) – _description_
LORA_PREFIX_UNET (str, optional) – _description_. Defaults to 'lora_unet'.
LORA_PREFIX_TEXT_ENCODER (str, optional) – _description_. Defaults to 'lora_te'.
alpha (float, optional) – _description_. Defaults to 0.6.
- Returns
check each output type _type_: unet && text_encoder
- Return type
TODO
- prepare_latents(batch_size, num_channels_latents, video_length, height, width, dtype, device, generator, latents=None)
Prepare latent variables.
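For orientation, video latents in diffusion pipelines of this kind are typically laid out as (batch, channels, frames, height, width), with the spatial size reduced by the VAE downsampling factor. The sketch below computes such a shape; the helper name `latent_shape` and the default `vae_scale_factor=8` are assumptions for illustration and are not confirmed by this page.

```python
def latent_shape(batch_size, num_channels_latents, video_length,
                 height, width, vae_scale_factor=8):
    """Hypothetical helper: shape of a video latent tensor.

    vae_scale_factor=8 is an assumed value (typical for Stable Diffusion
    style VAEs), not one taken from the documentation above.
    """
    if height % vae_scale_factor or width % vae_scale_factor:
        raise ValueError('height and width must be divisible by '
                         f'{vae_scale_factor}')
    return (batch_size, num_channels_latents, video_length,
            height // vae_scale_factor, width // vae_scale_factor)


print(latent_shape(1, 4, 16, 512, 512))  # (1, 4, 16, 64, 64)
```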
- prepare_model()
Prepare model for training.
Move model to target dtype and disable gradient for some models.
- val_step(data: dict) -> mmagic.utils.typing.SampleList
Gets the generated image of given data. Calls self.data_preprocessor and self.infer in order. Returns the generated results, which will be passed to the evaluator or visualizer.
- Parameters
data (dict or tuple or list) – Data sampled from dataset.
- Returns
Generated image or image dict.
- Return type
SampleList
- test_step(data: dict) -> mmagic.utils.typing.SampleList
Gets the generated image of given data. Calls self.data_preprocessor and self.infer in order. Returns the generated results, which will be passed to the evaluator or visualizer.
- Parameters
data (dict or tuple or list) – Data sampled from dataset.
- Returns
Generated image or image dict.
- Return type
SampleList
- infer(prompt: Union[str, List[str]], video_length: Optional[int] = 16, height: Optional[int] = None, width: Optional[int] = None, num_inference_steps: int = 50, guidance_scale: float = 7.5, negative_prompt: Optional[Union[str, List[str]]] = None, num_videos_per_prompt: Optional[int] = 1, eta: float = 0.0, generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None, latents: Optional[torch.FloatTensor] = None, return_type: Optional[str] = 'tensor', show_progress: bool = True, seed: Optional[int] = 1007)
Function invoked when calling the pipeline for generation.
- Parameters
prompt (str or List[str]) – The prompt or prompts to guide the video generation.
video_length (int, Optional) – The number of frames of the generated video. Defaults to 16.
height (int, Optional) – The height in pixels of the generated image. If not passed, the height will be self.unet_sample_size * self.vae_scale_factor. Defaults to None.
width (int, Optional) – The width in pixels of the generated image. If not passed, the width will be self.unet_sample_size * self.vae_scale_factor. Defaults to None.
num_inference_steps (int) – The number of denoising steps. More denoising steps usually lead to a higher quality video at the expense of slower inference. Defaults to 50.
guidance_scale (float) – Guidance scale as defined in Classifier-Free Diffusion Guidance (https://arxiv.org/abs/2207.12598). Defaults to 7.5.
negative_prompt (str or List[str], optional) – The prompt or prompts not to guide the video generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1). Defaults to None.
num_videos_per_prompt (int) – The number of videos to generate per prompt. Defaults to 1.
eta (float) – Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to DDIMScheduler, will be ignored for others. Defaults to 0.0.
generator (torch.Generator, optional) – A torch generator to make generation deterministic. Defaults to None.
latents (torch.FloatTensor, optional) – Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator. Defaults to None.
return_type (str) – The return type of the inference results. Supported types are 'video', 'numpy', 'tensor'. If 'video' is passed, a list of PIL images will be returned. If 'numpy' is passed, a numpy array with shape [N, C, H, W] will be returned, and the value range will be the same as the decoder's output range. If 'tensor' is passed, the decoder's output will be returned. Defaults to 'tensor'.
- Returns
A dict containing the generated video.
- Return type
dict
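To make the role of guidance_scale concrete, the sketch below applies the classifier-free guidance combination from the paper cited above to two noise predictions, using plain Python lists in place of tensors. The function name `apply_guidance` is hypothetical; how infer organizes this step internally is not shown on this page.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Classifier-free guidance: uncond + scale * (text - uncond), elementwise.

    noise_uncond is the prediction for the negative (or empty) prompt and
    noise_text is the prediction for the actual prompt.
    """
    return [u + guidance_scale * (t - u)
            for u, t in zip(noise_uncond, noise_text)]


uncond = [0.0, 1.0]
text = [1.0, 3.0]
print(apply_guidance(uncond, text, 7.5))  # [7.5, 16.0]
print(apply_guidance(uncond, text, 1.0))  # [1.0, 3.0]
```

With a scale of 1 the result equals the text-conditioned prediction, so the unconditional branch, and with it negative_prompt, has no effect. Larger values push the result further away from the unconditional prediction.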