mmagic.models.editors.stable_diffusion_xl
Package Contents
Classes
- class mmagic.models.editors.stable_diffusion_xl.StableDiffusionXL(vae: ModelType, text_encoder_one: ModelType, tokenizer_one: str, text_encoder_two: ModelType, tokenizer_two: str, unet: ModelType, scheduler: ModelType, test_scheduler: Optional[ModelType] = None, dtype: Optional[str] = None, enable_xformers: bool = True, noise_offset_weight: float = 0, tomesd_cfg: Optional[dict] = None, data_preprocessor: Optional[ModelType] = dict(type='DataPreprocessor'), lora_config: Optional[dict] = None, val_prompts: Union[str, List[str]] = None, finetune_text_encoder: bool = False, force_zeros_for_empty_prompt: bool = True, init_cfg: Optional[dict] = None)
Bases:
mmengine.model.BaseModel
Class for Stable Diffusion XL. Refer to https://github.com/Stability-AI/generative-models and https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py
- Parameters
unet (Union[dict, nn.Module]) – The config or module for Unet model.
text_encoder_one (Union[dict, nn.Module]) – The config or module for text encoder.
tokenizer_one (str) – The name for CLIP tokenizer.
text_encoder_two (Union[dict, nn.Module]) – The config or module for text encoder.
tokenizer_two (str) – The name for CLIP tokenizer.
vae (Union[dict, nn.Module]) – The config or module for VAE model.
scheduler (Union[dict, nn.Module]) – The config or module for diffusion scheduler.
test_scheduler (Union[dict, nn.Module], optional) – The config or module for diffusion scheduler in test stage (self.infer). If not passed, the same scheduler as scheduler is used. Defaults to None.
dtype (str, optional) – The dtype for the model. This argument has no effect when dtype is defined for the submodels. Defaults to None.
enable_xformers (bool, optional) – Whether to use xformers. Defaults to True.
noise_offset_weight (float, optional) – The weight of noise offset introduced in https://www.crosslabs.org/blog/diffusion-with-offset-noise. Defaults to 0.
tomesd_cfg (dict, optional) – The config for TOMESD. Please refer to https://github.com/dbolya/tomesd and https://github.com/open-mmlab/mmagic/blob/main/mmagic/models/utils/tome_utils.py for details. Defaults to None.
data_preprocessor (dict, optional) – The pre-process config of BaseDataPreprocessor.
lora_config (dict, optional) – The config for LoRA finetuning. Defaults to None.
val_prompts (Union[str, List[str]], optional) – The prompts for validation. Defaults to None.
finetune_text_encoder (bool, optional) – Whether to fine-tune text encoder. Defaults to False.
force_zeros_for_empty_prompt (bool) – Whether the negative prompt embeddings shall be forced to always be set to 0. Defaults to True.
init_cfg (dict, optional) – The weight initialization config for BaseModule.
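The constructor arguments above map naturally onto a dict-style model config. The following is a minimal sketch under stated assumptions: only the top-level keys come from the signature on this page, while the checkpoint name and every nested `type`, `from_pretrained`, and `subfolder` value are illustrative placeholders that this page does not confirm.

```python
# Hypothetical checkpoint identifier (an assumption, not taken from this page).
pretrained = 'stabilityai/stable-diffusion-xl-base-1.0'

# Top-level keys follow the StableDiffusionXL signature documented above.
# Every nested dict is an assumed placeholder layout, not a verified config.
model = dict(
    type='StableDiffusionXL',
    vae=dict(type='AutoencoderKL', from_pretrained=pretrained, subfolder='vae'),
    unet=dict(type='UNet2DConditionModel', from_pretrained=pretrained, subfolder='unet'),
    text_encoder_one=dict(type='ClipWrapper', subfolder='text_encoder'),
    tokenizer_one=pretrained,          # documented as a str (tokenizer name)
    text_encoder_two=dict(type='ClipWrapper', subfolder='text_encoder_2'),
    tokenizer_two=pretrained,          # documented as a str (tokenizer name)
    scheduler=dict(type='DDPMScheduler', from_pretrained=pretrained, subfolder='scheduler'),
    test_scheduler=dict(type='DDIMScheduler', from_pretrained=pretrained, subfolder='scheduler'),
    # The remaining values are the documented defaults, spelled out for clarity.
    dtype=None,
    enable_xformers=True,
    noise_offset_weight=0,
    lora_config=None,
    val_prompts=['A photo of a cat'],  # example prompt, purely illustrative
    finetune_text_encoder=False,
    force_zeros_for_empty_prompt=True,
)
```

If `test_scheduler` is omitted, the documentation above states that the training scheduler is reused for inference.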
- property device
- prepare_model()
Prepare model for training.
Move model to target dtype and disable gradient for some models.
- set_xformers(module: Optional[torch.nn.Module] = None) → torch.nn.Module
Set xformers for the model.
- Returns
The model with xformers.
- Return type
nn.Module
- set_tomesd() → torch.nn.Module
Set ToMe for the stable diffusion model.
- Returns
The model with ToMe.
- Return type
nn.Module
- train(mode: bool = True)
Set train/eval mode.
- Parameters
mode (bool, optional) – Whether to set train mode. Defaults to True.
- infer(prompt: Union[str, List[str]], prompt_2: Optional[Union[str, List[str]]] = None, height: Optional[int] = None, width: Optional[int] = None, num_inference_steps: int = 50, denoising_end: Optional[float] = None, guidance_scale: float = 7.5, negative_prompt: Optional[Union[str, List[str]]] = None, negative_prompt_2: Optional[Union[str, List[str]]] = None, num_images_per_prompt: Optional[int] = 1, eta: float = 0.0, generator: Optional[torch.Generator] = None, latents: Optional[torch.FloatTensor] = None, show_progress: bool = True, seed: int = 1, original_size: Optional[Tuple[int, int]] = None, crops_coords_top_left: Tuple[int, int] = (0, 0), target_size: Optional[Tuple[int, int]] = None, negative_original_size: Optional[Tuple[int, int]] = None, negative_crops_coords_top_left: Tuple[int, int] = (0, 0), negative_target_size: Optional[Tuple[int, int]] = None, return_type='image')
Function invoked when calling the pipeline for generation.
- Parameters
prompt (str or List[str]) – The prompt or prompts to guide the image generation.
prompt_2 (str or List[str], optional) – The prompt or prompts to be sent to the tokenizer_two and text_encoder_two. If not defined, prompt is used in both text-encoders. Defaults to None.
height (int, optional) – The height in pixels of the generated image. Defaults to self.unet_sample_size * self.vae_scale_factor.
width (int, optional) – The width in pixels of the generated image. Defaults to self.unet_sample_size * self.vae_scale_factor.
num_inference_steps (int, optional, defaults to 50) – The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
denoising_end (float, optional) – When specified, determines the fraction (between 0.0 and 1.0) of the total denoising process to be completed before it is intentionally terminated early. As a result, the returned sample will still retain a substantial amount of noise as determined by the discrete timesteps selected by the scheduler. The denoising_end parameter should ideally be used when this pipeline forms part of a “Mixture of Denoisers” multi-pipeline setup, as elaborated in Refining the Image Output (https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl#refining-the-image-output).
guidance_scale (float, optional, defaults to 7.5) – Guidance scale as defined in Classifier-Free Diffusion Guidance (https://arxiv.org/abs/2207.12598).
negative_prompt (str or List[str], optional) – The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
negative_prompt_2 (str or List[str], optional) – The negative_prompt to be sent to the tokenizer_two and text_encoder_two. If not defined, negative_prompt is used in both text-encoders. Defaults to None.
num_images_per_prompt (int, optional, defaults to 1) – The number of images to generate per prompt.
eta (float, optional, defaults to 0.0) – Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to [schedulers.DDIMScheduler], will be ignored for others.
generator (torch.Generator, optional) – A [torch generator] to make generation deterministic.
latents (torch.FloatTensor, optional) – Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
show_progress (bool) – Whether to show progress. Defaults to True.
seed (int) – Seed to be used. Defaults to 1.
original_size (Tuple[int], optional) – If original_size is not the same as target_size the image will appear to be down- or upsampled. original_size defaults to (width, height) if not specified. Defaults to None.
crops_coords_top_left (Tuple[int], optional) – crops_coords_top_left can be used to generate an image that appears to be “cropped” from the position. Favorable, well-centered images are usually achieved by setting crops_coords_top_left to (0, 0). Defaults to (0, 0).
target_size (Tuple[int], optional) – For most cases, target_size should be set to the desired height and width of the generated image. If not specified it will be (width, height). Defaults to None.
negative_original_size (Tuple[int], optional) – To negatively condition the generation process based on a specific image resolution. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208. Defaults to None.
negative_crops_coords_top_left (Tuple[int], optional) – To negatively condition the generation process based on a specific crop coordinates. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208. Defaults to (0, 0).
negative_target_size (Tuple[int], optional) – To negatively condition the generation process based on a target image resolution. It should be the same as target_size in most cases. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208. Defaults to None.
return_type (str) – The return type of the inference results. Supported types are ‘image’, ‘numpy’, ‘tensor’. If ‘image’ is passed, a list of PIL images will be returned. If ‘numpy’ is passed, a numpy array with shape [N, C, H, W] will be returned, and the value range will be same as decoder’s output range. If ‘tensor’ is passed, the decoder’s output will be returned. Defaults to ‘image’.
- Returns
A dict containing the generated images.
- Return type
dict
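Two of the parameters above are simple arithmetic, and a small sketch may help. The first function shows the standard classifier-free guidance combination that `guidance_scale` refers to; the second shows the documented default for `height` and `width`. Both function names are hypothetical, the noise predictions are plain Python lists standing in for tensors, and the values 128 and 8 in the usage below are example numbers rather than values confirmed by this page.

```python
from typing import List


def apply_guidance(noise_uncond: List[float],
                   noise_text: List[float],
                   guidance_scale: float) -> List[float]:
    """Classifier-free guidance: move from the unconditional prediction
    towards the text-conditioned one, scaled by guidance_scale."""
    return [u + guidance_scale * (t - u)
            for u, t in zip(noise_uncond, noise_text)]


def default_image_size(unet_sample_size: int, vae_scale_factor: int) -> int:
    """Default height/width in pixels, as documented for infer()."""
    return unet_sample_size * vae_scale_factor


# Example values only (assumed, not read from a real model).
guided = apply_guidance([0.0, 1.0], [1.0, 1.0], guidance_scale=7.5)
size = default_image_size(unet_sample_size=128, vae_scale_factor=8)
```

With `guidance_scale` equal to 1.0 the combination reduces to the text-conditioned prediction, which is why the negative prompt has no effect in that regime.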
- _get_add_time_ids(original_size: Optional[Tuple[int, int]], crops_coords_top_left: Tuple[int, int], target_size: Optional[Tuple[int, int]], dtype)
Get add_time_ids.
- Parameters
original_size (Tuple[int], optional) – If original_size is not the same as target_size the image will appear to be down- or upsampled. original_size defaults to (width, height) if not specified. Defaults to None.
crops_coords_top_left (Tuple[int], optional) – crops_coords_top_left can be used to generate an image that appears to be “cropped” from the position. Favorable, well-centered images are usually achieved by setting crops_coords_top_left to (0, 0). Defaults to (0, 0).
target_size (Tuple[int], optional) – For most cases, target_size should be set to the desired height and width of the generated image. If not specified it will be (width, height). Defaults to None.
dtype (str, optional) – The dtype for the embeddings.
- Returns
time ids for time embeddings layer.
- Return type
add_time_ids (torch.Tensor)
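As a rough illustration of what the time ids contain, the sketch below concatenates the three size tuples into one flat list of six integers, which is how the diffusers reference pipeline linked at the top of this page builds them. The function name is hypothetical, and the conversion to a tensor of the requested dtype is left out so the example runs without torch.

```python
from typing import List, Tuple


def build_add_time_ids(original_size: Tuple[int, int],
                       crops_coords_top_left: Tuple[int, int],
                       target_size: Tuple[int, int]) -> List[int]:
    """Flatten the micro-conditioning tuples into one list of six ints."""
    return list(original_size + crops_coords_top_left + target_size)


ids = build_add_time_ids((1024, 1024), (0, 0), (1024, 1024))
```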
- output_to_pil(image) → List[PIL.Image.Image]
Convert output tensor to PIL image. Output tensor will be de-normed to [0, 255] by DataPreprocessor.destruct. Because no data_samples is passed, color order conversion will not be performed.
- Parameters
image (torch.Tensor) – The output tensor of the decoder.
- Returns
The list of processed PIL images.
- Return type
List[Image.Image]
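The de-normalization step can be pictured with a scalar sketch. It assumes the decoder output lies in [-1, 1] and that the data preprocessor uses a mean and std of 127.5; both are assumptions for illustration, since the real values come from the configured DataPreprocessor. The function name is hypothetical.

```python
def denorm_to_uint8(value: float, mean: float = 127.5, std: float = 127.5) -> int:
    """Map a normalized value back to the [0, 255] pixel range.

    mean/std are assumed defaults, not values read from DataPreprocessor.
    """
    pixel = value * std + mean
    # Clamp to the valid range, then truncate to an integer pixel value.
    return int(min(255.0, max(0.0, pixel)))
```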
- _encode_prompt(prompt, prompt_2, device, num_images_per_prompt, do_classifier_free_guidance, negative_prompt, negative_prompt_2)
Encodes the prompt into text encoder hidden states.
- Parameters
prompt (str or list(int)) – prompt to be encoded.
prompt_2 (str or list(int)) – prompt to be encoded. Send to the tokenizer_two and text_encoder_two. If not defined, prompt is used in both text-encoders.
device (torch.device) – torch device.
num_images_per_prompt (int) – number of images that should be generated per prompt.
do_classifier_free_guidance (bool) – whether to use classifier free guidance or not.
negative_prompt (str or List[str]) – The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
negative_prompt_2 (str or List[str]) – The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1). Sent to tokenizer_two and text_encoder_two. If not defined, negative_prompt is used in both text-encoders.
- Returns
text embeddings generated by clip text encoder.
- Return type
text_embeddings (torch.Tensor)
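The fallback rules described for `prompt_2` and `negative_prompt_2` can be sketched as plain list handling. This is an illustrative helper with a hypothetical name, not the method's actual code; in particular, substituting an empty string for a missing negative prompt is an assumption borrowed from common diffusion pipelines.

```python
from typing import List, Optional, Tuple, Union

Prompt = Union[str, List[str]]


def resolve_prompts(
    prompt: Prompt,
    prompt_2: Optional[Prompt] = None,
    negative_prompt: Optional[Prompt] = None,
    negative_prompt_2: Optional[Prompt] = None,
) -> Tuple[List[str], List[str], List[str], List[str]]:
    """Normalize prompts to lists and apply the documented fallbacks."""

    def as_list(p: Prompt) -> List[str]:
        return [p] if isinstance(p, str) else list(p)

    prompts = as_list(prompt)
    # prompt_2 falls back to prompt for the second tokenizer/encoder.
    prompts_2 = as_list(prompt_2) if prompt_2 is not None else prompts
    # Assumption: a missing negative prompt becomes one empty string per prompt.
    negatives = (as_list(negative_prompt) if negative_prompt is not None
                 else [''] * len(prompts))
    # negative_prompt_2 falls back to negative_prompt.
    negatives_2 = (as_list(negative_prompt_2) if negative_prompt_2 is not None
                   else negatives)
    return prompts, prompts_2, negatives, negatives_2
```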
- decode_latents(latents)
Use the VAE to decode latents.
- Parameters
latents (torch.Tensor) – latents to decode.
- Returns
image result.
- Return type
image (torch.Tensor)
- prepare_extra_step_kwargs(generator, eta)
Prepare extra kwargs for the scheduler step.
- Parameters
generator (torch.Generator) – generator for random functions.
eta (float) – eta (η) is only used with the DDIMScheduler and is ignored by other schedulers. It corresponds to η in the DDIM paper (https://arxiv.org/abs/2010.02502) and should lie in [0, 1].
- Returns
A dict containing ‘generator’ and ‘eta’.
- Return type
extra_step_kwargs (dict)
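Because not every scheduler accepts `eta` or `generator`, a common way to build these kwargs is to inspect the signature of the scheduler's `step` method and forward only what it accepts. The sketch below follows that pattern, which is the approach used in the diffusers reference pipeline; whether this method does exactly the same is an assumption. The function name and the two scheduler classes are hypothetical stand-ins so the example runs on its own.

```python
import inspect


def build_extra_step_kwargs(scheduler, generator=None, eta: float = 0.0) -> dict:
    """Forward eta/generator only if scheduler.step() accepts them."""
    accepted = inspect.signature(scheduler.step).parameters
    kwargs = {}
    if 'eta' in accepted:
        kwargs['eta'] = eta
    if 'generator' in accepted:
        kwargs['generator'] = generator
    return kwargs


class DDIMLikeScheduler:
    """Hypothetical scheduler whose step() takes eta and generator."""

    def step(self, model_output, timestep, sample, eta=0.0, generator=None):
        return sample


class PlainScheduler:
    """Hypothetical scheduler whose step() takes neither."""

    def step(self, model_output, timestep, sample):
        return sample
```

Filtering by signature lets the same denoising loop call `scheduler.step(..., **kwargs)` regardless of which scheduler is configured.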
- prepare_test_scheduler_extra_step_kwargs(generator, eta)
Prepare extra kwargs for the test scheduler step.
- Parameters
generator (torch.Generator) – generator for random functions.
eta (float) – eta (η) is only used with the DDIMScheduler and is ignored by other schedulers. It corresponds to η in the DDIM paper (https://arxiv.org/abs/2010.02502) and should lie in [0, 1].
- Returns
A dict containing ‘generator’ and ‘eta’.
- Return type
extra_step_kwargs (dict)
- prepare_latents(batch_size, num_channels_latents, height, width, dtype, device, generator, latents=None)
Prepare latents for diffusion to run in latent space.
- Parameters
batch_size (int) – batch size.
num_channels_latents (int) – latent channel nums.
height (int) – image height.
width (int) – image width.
dtype (torch.dtype) – float type.
device (torch.device) – torch device.
generator (torch.Generator) – generator for random functions, defaults to None.
latents (torch.Tensor) – Pre-generated noisy latents, defaults to None.
- Returns
prepared latents.
- Return type
latents (torch.Tensor)
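The latent tensor is smaller than the output image because the VAE downsamples spatially. A minimal sketch of the shape computation follows, assuming the image height and width are divided by a VAE scale factor of 8; the function name and the default factor are assumptions for illustration, and the random sampling itself is omitted.

```python
from typing import Tuple


def latent_shape(batch_size: int,
                 num_channels_latents: int,
                 height: int,
                 width: int,
                 vae_scale_factor: int = 8) -> Tuple[int, int, int, int]:
    """Shape of the latent tensor for images of the given pixel size.

    vae_scale_factor=8 is an assumed example value.
    """
    return (batch_size,
            num_channels_latents,
            height // vae_scale_factor,
            width // vae_scale_factor)


shape = latent_shape(batch_size=2, num_channels_latents=4, height=1024, width=1024)
```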
- val_step(data: dict) → mmagic.utils.typing.SampleList
Gets the generated image of given data.
- Parameters
data (dict) – Data sampled from metric specific sampler. More details in Metrics and Evaluator.
- Returns
Generated image or image dict.
- Return type
SampleList
- test_step(data: dict) → mmagic.utils.typing.SampleList
Gets the generated image of given data. Same as val_step().
- Parameters
data (dict) – Data sampled from metric specific sampler. More details in Metrics and Evaluator.
- Returns
A list of DataSample containing the generated results.
- Return type
SampleList
- encode_prompt_train(text_one, text_two)
Encode prompt for training.
- Parameters
text_one (torch.tensor) – Input ids from tokenizer_one.
text_two (torch.tensor) – Input ids from tokenizer_two.
- Returns
Prompt embeddings, together with pooled_prompt_embeds (torch.tensor), the pooled prompt embeddings.
- Return type
prompt_embeds (torch.tensor)
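In SDXL-style pipelines the per-token hidden states of the two text encoders are typically concatenated along the feature dimension to form the prompt embeddings. The sketch below illustrates that concatenation with nested Python lists in place of tensors. It is an assumption about how the two encoders are combined, modeled on the diffusers reference pipeline, and the function name and toy feature widths are hypothetical.

```python
from typing import List


def concat_hidden_states(hidden_one: List[List[float]],
                         hidden_two: List[List[float]]) -> List[List[float]]:
    """Concatenate per-token features from two encoders.

    Each argument is a list over tokens; each token is a list of features.
    """
    if len(hidden_one) != len(hidden_two):
        raise ValueError('Both encoders must produce the same number of tokens.')
    return [a + b for a, b in zip(hidden_one, hidden_two)]


# Toy example: 2 tokens, feature widths 3 and 5 (real widths are much larger).
tokens_one = [[0.1] * 3, [0.2] * 3]
tokens_two = [[0.3] * 5, [0.4] * 5]
prompt_embeds = concat_hidden_states(tokens_one, tokens_two)
```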
- train_step(data: List[dict], optim_wrapper: mmengine.optim.OptimWrapperDict)
Train step function.
- Parameters
data (List[dict]) – Batch of data as input.
optim_wrapper (OptimWrapperDict) – Dict with optimizers for generator and discriminator (if any).
- Returns
Dict with loss, information for logger, the number of samples and results for visualization.
- Return type
dict