Published Mar 15, 2023
In school, I went through engineering (read: CAD) drawing. I can draw an isometric view of something I want to build or a part I need to fabricate. By hand it’s not great quality, but it gets the job done. In CAD software (like SketchUp), I can do a pretty good job. I am absolutely abysmal when it comes to free-hand drawing or any real kind of artistic drawing, even on a computer.
With Stable Diffusion and other generative AI of the same ilk, the ability to create some truly amazing content is well within reach for those of us whose artistic skill peaked somewhere around ten years old. Being an engineer, I’m not satisfied with just going to https://www.craiyon.com/ or https://stablediffusionweb.com/ and entering my prompts. I want to code something. More specifically, I want to code something that will generate super cool, exciting headshots of myself to add to the internet instead of my dull, real-world face (I want to look like an awesome, techno-wizard Dumbledore, from some noir-esque world that is a cross between Blade Runner/Sin City and Harry Potter!)
I had a tough time finding a Jupyter notebook that I could run in something like SageMaker Studio and fine-tune my own model. So, I threw one together. You can find a sample notebook at my GitHub repo.
Let’s dive right in and take a look at the notebook.
I’ve endeavored to comment on the notebook pretty well, so I will skip the uninteresting stuff like imports and PIP installs.
Initial configuration of the fine-tuning
We need to set some information about how we fine-tune. By supplying this list of concepts, we can tell Dreambooth about additional items we want to teach it. For the sample I have, I am teaching it one extra thing; however, this is an array so you can teach multiple new concepts in one training step.
concepts_list = [
    {
        "instance_prompt": "photo of cc person",
        "class_prompt": "photo of a person",
        "instance_data_dir": "./content/data/cc",
        "class_data_dir": "./content/data/person"
    },
]
Here is a quick breakdown of what each of those parameters means:
- instance_prompt - the prompt we would type to generate the image we are attempting to fine-tune
- class_prompt - a prompt without the unique identifier/instance. This prompt is used for generating "class images" for prior preservation. For our example, this is "a photo of a person" versus a photo of a specific person.
- instance_data_dir - the location where our training images for fine-tuning are stored
- class_data_dir - sample images for the general class of prompt we are fine-tuning. If there are no images here, samples will be generated for you. Otherwise, you can provide ~20 images of the general concept you want to generate (but not the actual instance images that we fine-tune on)
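The training command later in the notebook reads the concepts from a JSON file (concepts_list_cc.json in my case), so after editing the list you need to write it out. Here is a minimal sketch using only the standard library; the file name and directory paths are the ones from the example above:

```python
import json
import os

concepts_list = [
    {
        "instance_prompt": "photo of cc person",
        "class_prompt": "photo of a person",
        "instance_data_dir": "./content/data/cc",
        "class_data_dir": "./content/data/person",
    },
]

# Make sure the instance/class image folders exist before training.
for concept in concepts_list:
    os.makedirs(concept["instance_data_dir"], exist_ok=True)
    os.makedirs(concept["class_data_dir"], exist_ok=True)

# The training script consumes the concepts as JSON.
with open("concepts_list_cc.json", "w") as f:
    json.dump(concepts_list, f, indent=4)
```
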
In your notebook, feel free to change ‘photo of cc person’ to ‘photo of YOUR_NAME person’. Keep the rest of the prompt the same and just swap ‘cc’ for YOUR_NAME. This prompt snippet teaches Stable Diffusion how to use this new data when generating new images.
The only other thing you may want to change here is instance_data_dir. This is the relative path to the folder that holds the training images you want to use. Whether you change this directory or not, now is the time to upload the images you want to train with. So go ahead and do that; I’ll wait here.
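Before kicking off training, it is worth a quick sanity check that the images actually landed where the concepts list expects them. This is my own sketch, not part of the notebook; the path is the one from the example concepts list, and the 10–20 image range reflects what I have seen commonly recommended for DreamBooth:

```python
from pathlib import Path

# Path from the example concepts_list; change it if you used a different instance_data_dir.
instance_dir = Path("./content/data/cc")
instance_dir.mkdir(parents=True, exist_ok=True)  # harmless if it already exists

image_files = sorted(
    p for p in instance_dir.glob("*")
    if p.suffix.lower() in {".jpg", ".jpeg", ".png"}
)

print(f"Found {len(image_files)} training images in {instance_dir}")
if len(image_files) < 10:
    print("Warning: DreamBooth results tend to improve with roughly 10-20 instance images.")
```
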
HuggingFace Accelerate training
Welcome back. Now you have some images uploaded and are ready for a training run. I want to walk through this next section of code since it is doing all the heavy lifting for us.
MODEL_NAME = "runwayml/stable-diffusion-v1-5"
PRECISION = "fp16"
MAX_TRAIN_STEPS = 1200

#!accelerate launch --help
!accelerate launch --mixed_precision="fp16" --num_processes=1 --num_machines=1 --num_cpu_threads_per_process=2 \
train_dreambooth_ShivamShrirao.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--pretrained_vae_name_or_path="stabilityai/sd-vae-ft-mse" \
--output_dir=$OUTPUT_DIR \
--revision=$PRECISION \
--with_prior_preservation --prior_loss_weight=1.0 \
--seed=1337 \
--resolution=512 \
--train_batch_size=1 \
--train_text_encoder \
--mixed_precision=$PRECISION \
--use_8bit_adam \
--gradient_accumulation_steps=1 \
--learning_rate=1e-6 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--num_class_images=50 \
--sample_batch_size=4 \
--max_train_steps=$MAX_TRAIN_STEPS \
--save_interval=100000 \
--save_sample_prompt="photo of cc person" \
--concepts_list="concepts_list_cc.json"
We are using the HuggingFace Accelerate Python package to do the training for us: we supply the training script and some hyperparameters, and it does the magic behind the scenes.
The training file is a fork of the HuggingFace example located here.
The custom training file focuses on performance improvements and memory optimization and can be found here.
Parameters worth calling out:
MODEL_NAME - which model version you want to pull from HuggingFace. There are a few providers of SD models you can choose from. For instance, you may want to try out v2 (v1-5 gives better outputs in my opinion)
PRECISION - leave this as fp16, especially if you are running in a low-memory (i.e., <16 GB) environment
MAX_TRAIN_STEPS - 800-1200 usually gives good results based on my testing
seed - you can make this randomly generated, but it is normal to see this hardcoded once you find a training run that generates a good set of initial images
learning_rate - can be altered, but I tend to see 1e-6 to 5e-6
save_interval - keep this higher than MAX_TRAIN_STEPS or you may run out of CUDA memory during the save step
save_sample_prompt - if you updated this prompt in the concepts list, update it here as well
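As a rough sanity check on MAX_TRAIN_STEPS, you can relate it to how many instance images you uploaded. The arithmetic below is just the standard relationship between steps, batch size, and gradient accumulation; the 20-image count is a hypothetical example, not something from the notebook:

```python
# Values matching the launch command above.
train_batch_size = 1
gradient_accumulation_steps = 1
max_train_steps = 1200
num_instance_images = 20  # hypothetical: however many photos you uploaded

# Each optimizer step consumes this many training examples.
effective_batch = train_batch_size * gradient_accumulation_steps

# Approximate passes over your instance images during training.
approx_epochs = max_train_steps * effective_batch / num_instance_images
print(f"~{approx_epochs:.0f} passes over your {num_instance_images} images")
```

If the number of passes looks extreme in either direction, adjust MAX_TRAIN_STEPS toward the 800-1200 range discussed above.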
That will take several minutes to run. You should see some output from the training itself. A quick word of warning: I originally put this notebook together in Nov/Dec 2022 and had no issues with the training. However, while I was working on the code for this post, I noticed that something has changed in the underlying libraries, and I have run into numerous CUDA out-of-memory errors. I have rolled back packages and found settings that still work, but your mileage may vary. I will revisit the notebook in a few months and see if future updates fix the issue, but you should still have a functional notebook to play with until then.
Generate a grid of preview images
Now that the training has completed, running the next couple of cells in the notebook will bring you to a grid of the preview images. WARNING: these will not be flattering images. They will look weird; you may have too many fingers, or noses, or eyes. That’s ok, but if they look too surreal, change the seed value we discussed above and re-run until you get something that looks better.
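If you want to assemble a preview grid like this yourself, a small helper along these lines works. This is my own sketch using Pillow (the notebook's own grid cell may differ); the solid-color placeholders stand in for generated samples:

```python
from PIL import Image

def image_grid(images, cols=4):
    """Tile equally sized PIL images into a single grid image."""
    rows = (len(images) + cols - 1) // cols
    w, h = images[0].size
    grid = Image.new("RGB", (cols * w, rows * h))
    for i, img in enumerate(images):
        grid.paste(img, ((i % cols) * w, (i // cols) * h))
    return grid

# Example with placeholder images standing in for generated samples.
samples = [Image.new("RGB", (64, 64), c) for c in ("red", "green", "blue", "gray")]
preview = image_grid(samples, cols=2)  # a 2x2 grid, 128x128 pixels
```
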
See what I mean? They look weird right now, but generally ok. In the two right-most pictures, there is something wrong with the hands it has generated for me. The second image in the grid has the wrong eye color. Image number 7 decided to grow a baby out of my shoulder (honest, I didn’t add any training images of me holding a baby; I have no idea why, but even after running multiple iterations, it will randomly do this).
Inference
Finally, it’s time for the fun part. Run the rest of the cells until you get to this point.
#this is the normal prompt you are used to when generating images from text
#make sure to include the phrase 'photo of XX person'
#to force the model to use your finetuned results as a starting point
prompt = ("hyper-maximalist overdetailed comic book illustration headshot "
          "photo of cc person as hero. Give him a long, luxurious beard like "
          "Dumbledore. Make the image dark and gritty")

#negative prompts allow for removing/limiting what will be included
#commonly you would use 'duplicate' to ensure you don't get multiple copies
#of the instance image in a single output
negative_prompt = "duplicate"
with autocast("cuda"), torch.inference_mode():
    images = pipe(
        prompt,
        height=512,
        width=512,
        negative_prompt=negative_prompt,
        num_images_per_prompt=1,
        num_inference_steps=100,
        guidance_scale=8.5,
        generator=g_cuda
    ).images

for img in images:
    dt = datetime.now()
    ts = datetime.timestamp(dt)
    display(img)
    img.save('./content/ccOutputs/' + str(ts) + ".jpg", "JPEG")
Again, parameters worth noting:
height/width - leave these at 512, as that is what the underlying diffusers expect
num_images_per_prompt - you can generate one or multiple images at a time
guidance_scale - how closely the algorithm will follow your prompt versus how much "creativity" it can have in generation. I tend to see this between 7 and 8.5, but feel free to experiment here; it only affects these newly generated images.
num_inference_steps - how many denoising steps the algorithm will use when generating your image. Stable Diffusion should do well with as few as 50 steps, but for more detailed images you can go higher. I've had good luck with 100.
That’s it! When you run that cell, your model will generate a new image of you, display it on the screen, and save a copy, so you don’t lose it when you inevitably get caught up in the moment and power-run this cell a few dozen times looking at all the fantastic (and sometimes incredibly horrific) results you get.