mmagic.models.editors.ddpm
Package Contents
Classes
- class mmagic.models.editors.ddpm.DenoisingUnet(image_size, in_channels=3, out_channels=None, base_channels=128, resblocks_per_downsample=3, num_timesteps=1000, use_rescale_timesteps=False, dropout=0, embedding_channels=-1, num_classes=0, use_fp16=False, channels_cfg=None, output_cfg=dict(mean='eps', var='learned_range'), norm_cfg=dict(type='GN', num_groups=32), act_cfg=dict(type='SiLU', inplace=False), shortcut_kernel_size=1, use_scale_shift_norm=False, resblock_updown=False, num_heads=4, time_embedding_mode='sin', time_embedding_cfg=None, resblock_cfg=dict(type='DenoisingResBlock'), attention_cfg=dict(type='MultiHeadAttention'), encoder_channels=None, downsample_conv=True, upsample_conv=True, downsample_cfg=dict(type='DenoisingDownsample'), upsample_cfg=dict(type='DenoisingUpsample'), attention_res=[16, 8], pretrained=None, unet_type='', down_block_types: Tuple[str] = (), up_block_types: Tuple[str] = (), cross_attention_dim=768, layers_per_block: int = 2)
Bases: mmengine.model.BaseModule
Denoising Unet. This network receives a diffused image x_t and the current timestep t, and returns an output_dict corresponding to the passed output_cfg.

output_cfg defines the number of channels and the meaning of the output. It mainly contains the keys mean and var, which denote how the network outputs the mean and variance required for the denoising process.

For mean:
1. dict(mean='EPS'): The model predicts the noise added in the diffusion process, and the output_dict will contain a key named eps_t_pred.
2. dict(mean='START_X'): The model directly predicts the mean of the original image x_0, and the output_dict will contain a key named x_0_pred.
3. dict(mean='X_TM1_PRED'): The model predicts the mean of the diffused image at timestep t-1, and the output_dict will contain a key named x_tm1_pred.

For var:
1. dict(var='FIXED_SMALL') or dict(var='FIXED_LARGE'): The variance in the denoising process is treated as a fixed value. Therefore only the mean is predicted, and the number of output channels equals that of the input image (e.g., three channels for an RGB image).
2. dict(var='LEARNED'): The model predicts log_variance in the denoising process, and the output_dict will contain a key named log_var.
3. dict(var='LEARNED_RANGE'): The model predicts an interpolation factor, and log_variance is calculated as factor * upper_bound + (1 - factor) * lower_bound. The output_dict will contain a key named factor.
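The mean and var options listed above can be summarized in a small lookup. The following is only an illustrative sketch, not the library's implementation: the helper names expected_output_keys and learned_range_log_var are hypothetical, and the case-insensitive handling of mode names is an assumption based on the documentation using both 'eps' and 'EPS'.

```python
# Hypothetical helpers that restate the documented output_cfg rules.
# They do not import or call mmagic; names are for illustration only.

# Documented key added to output_dict for each `mean` mode.
MEAN_KEYS = {
    "EPS": "eps_t_pred",
    "START_X": "x_0_pred",
    "X_TM1_PRED": "x_tm1_pred",
}

# Documented key added to output_dict for each learned `var` mode.
# FIXED_SMALL / FIXED_LARGE add no key because only the mean is predicted.
VAR_KEYS = {
    "LEARNED": "log_var",
    "LEARNED_RANGE": "factor",
}


def expected_output_keys(output_cfg):
    """Return the output_dict keys the documentation describes for a config."""
    # Assumption: mode names are compared case-insensitively.
    mean_mode = output_cfg["mean"].upper()
    var_mode = output_cfg["var"].upper()
    keys = [MEAN_KEYS[mean_mode]]
    if var_mode in VAR_KEYS:
        keys.append(VAR_KEYS[var_mode])
    return keys


def learned_range_log_var(factor, lower_bound, upper_bound):
    """Interpolate log-variance as described for var='LEARNED_RANGE'."""
    return factor * upper_bound + (1 - factor) * lower_bound


if __name__ == "__main__":
    print(expected_output_keys(dict(mean="eps", var="learned_range")))
    print(learned_range_log_var(0.25, lower_bound=-4.0, upper_bound=0.0))
```

With a factor of 0 the interpolation returns the lower bound, and with a factor of 1 it returns the upper bound, which is why the network only needs to predict a single value per element in this mode.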
If var is not FIXED_SMALL or FIXED_LARGE, the number of output channels is double the number of input channels: the first half contains the predicted mean values and the second half contains the predicted variance values. Otherwise, the number of output channels equals the number of input channels, containing only the predicted mean values.
Parameters
image_size (int | list[int]) – The size of the image to denoise.
in_channels (int, optional) – The number of channels of the input image. Defaults to 3.
out_channels (int, optional) – The number of channels of the output prediction. Defaults to None, in which case it is assigned automatically by var_mode.
base_channels (int, optional) – The basic channel number of the generator. The other layers contain channels based on this number. Defaults to 128.
resblocks_per_downsample (int, optional) – Number of ResBlocks used between two downsample operations. The number of ResBlocks between upsample operations is the same, to keep symmetry. Defaults to 3.
num_timesteps (int, optional) – The total number of timesteps of the denoising process and the diffusion process. Defaults to 1000.
use_rescale_timesteps (bool, optional) – Whether to rescale the input timesteps to the range [0, 1000]. Defaults to False.
dropout (float, optional) – The dropout probability of each ResBlock. Pass 0 to disable dropout. Defaults to 0.
embedding_channels (int, optional) – The output channels of the time embedding layer and the label embedding layer. If not passed (or passed as -1), the output channels of the embedding layers are set to four times base_channels. Defaults to -1.
num_classes (int, optional) – The number of conditional classes. If set to 0, this model degrades to an unconditional model. Defaults to 0.
channels_cfg (list | dict[list], optional) – Config for input channels of the intermediate blocks. If a list is passed, each element indicates the scale factor for the input channels of the current block with regard to base_channels. For block i, the input and output channels should be channels_cfg[i] * base_channels and channels_cfg[i+1] * base_channels, respectively. If a dict is provided, each key should be the output scale and the corresponding value should be a list defining the channels. Default: please refer to _default_channels_cfg.
output_cfg (dict, optional) – Config for output variables. Defaults to dict(mean='eps', var='learned_range').
norm_cfg (dict, optional) – The config for normalization layers. Defaults to dict(type='GN', num_groups=32).
act_cfg (dict, optional) – The config for activation layers. Defaults to dict(type='SiLU', inplace=False).
shortcut_kernel_size (int, optional) – The kernel size for the shortcut conv in ResBlocks. The value of this argument overwrites the default value in resblock_cfg. Defaults to 1.
use_scale_shift_norm (bool, optional) – Whether to perform scale and shift after the normalization operation. Defaults to False.
num_heads (int, optional) – The number of attention heads. Defaults to 4.
time_embedding_mode (str, optional) – Embedding method of time_embedding. Defaults to 'sin'.
time_embedding_cfg (dict, optional) – Config for time_embedding. Defaults to None.
resblock_cfg (dict, optional) – Config for ResBlock. Defaults to dict(type='DenoisingResBlock').
attention_cfg (dict, optional) – Config for the attention operation. Defaults to dict(type='MultiHeadAttention').
upsample_conv (bool, optional) – Whether to use conv in the upsample block. Defaults to True.
downsample_conv (bool, optional) – Whether to use conv in the downsample block. Defaults to True.
upsample_cfg (dict, optional) – Config for upsample blocks. Defaults to dict(type='DenoisingUpsample').
downsample_cfg (dict, optional) – Config for downsample blocks. Defaults to dict(type='DenoisingDownsample').
attention_res (int | list[int], optional) – Resolution of feature maps at which to apply the attention operation. Defaults to [16, 8].
pretrained (str | dict, optional) – Path to the pretrained model, or a dict containing information for pretrained models whose required key is 'ckpt_path'. You can also provide 'prefix' to load the generator part from the whole state dict. Defaults to None.
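Several of the parameters above derive their effective values from other arguments. The sketch below restates those documented rules in plain Python. It does not import mmagic, every function name is hypothetical, and the linear form of the timestep rescaling and the case-insensitive comparison of var modes are assumptions rather than confirmed library behavior.

```python
# Hypothetical helpers restating the documented parameter rules.
# None of these functions exist in mmagic; they are for illustration only.

FIXED_VAR_MODES = {"FIXED_SMALL", "FIXED_LARGE"}


def resolve_out_channels(in_channels, var_mode, out_channels=None):
    """Output channels: doubled unless the variance is fixed."""
    if out_channels is not None:
        return out_channels
    # Assumption: the var mode is compared case-insensitively.
    if var_mode.upper() in FIXED_VAR_MODES:
        return in_channels
    return 2 * in_channels


def resolve_embedding_channels(base_channels, embedding_channels=-1):
    """Embedding channels: four times base_channels when left at -1."""
    if embedding_channels == -1:
        return 4 * base_channels
    return embedding_channels


def block_channels(channels_cfg, base_channels):
    """(input, output) channels of block i for a list-style channels_cfg."""
    return [
        (channels_cfg[i] * base_channels, channels_cfg[i + 1] * base_channels)
        for i in range(len(channels_cfg) - 1)
    ]


def rescale_timestep(t, num_timesteps):
    """Assumed linear mapping of a timestep onto the range [0, 1000]."""
    return t * 1000.0 / num_timesteps


if __name__ == "__main__":
    # The channels_cfg list here is a made-up example, not a library default.
    print(resolve_out_channels(3, "learned_range"))
    print(resolve_embedding_channels(128))
    print(block_channels([1, 2, 2, 4], base_channels=128))
    print(rescale_timestep(500, num_timesteps=2000))
```

For an RGB input with the default output_cfg, this gives six output channels: three for the predicted mean and three for the variance-related prediction.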
- _default_channels_cfg
- forward(x_t, t, encoder_hidden_states=None, label=None, return_noise=False)
Forward function.
Parameters
x_t (torch.Tensor) – Diffused image at timestep t to denoise.
t (torch.Tensor) – Current timestep.
label – You can directly give a batch of labels through a torch.Tensor or offer a callable function to sample a batch of label data. Otherwise, None indicates that the default label sampler is used.
return_noise (bool, optional) – If True, the input x_t and t will be returned in a dict together with the output specified by output_cfg. Defaults to False.
Returns
If not return_noise
Return type
torch.Tensor | dict