Published Mar 15, 2023
In school, I went through engineering (read: CAD) drawing. I can draw an isometric view of something I want to build or a part I need to fabricate. By hand it’s not great quality, but it gets the job done. In CAD software (like SketchUp), I can do a pretty good job. I am absolutely abysmal when it comes to free-hand drawing or any real kind of artistic drawing, even on a computer.
With Stable Diffusion and other generative AI of the same ilk, the ability to create some truly amazing content is well within reach for those of us whose artistic skill peaked somewhere around ten years old. Being an engineer, I’m not satisfied with just going to https://www.craiyon.com/ or https://stablediffusionweb.com/ and entering my prompts. I want to code something. More specifically, I want to code something that will generate super cool, exciting headshots of myself to add to the internet instead of my dull, real-world face (I want to look like an awesome, techno-wizard Dumbledore, from some noir-esque world that is a cross between Blade Runner/Sin City and Harry Potter!)
I had a tough time finding a Jupyter notebook that I could run in something like SageMaker Studio and fine-tune my own model. So, I threw one together. You can find a sample notebook at my GitHub repo.
Let’s dive right in and take a look at the notebook.
I’ve endeavored to comment on the notebook pretty well, so I will skip the uninteresting stuff like imports and PIP installs.
Initial configuration of the fine-tuning
We need to set some information about how we fine-tune. By supplying this list of concepts, we can tell Dreambooth about additional items we want to teach it. For the sample I have, I am teaching it one extra thing; however, this is an array so you can teach multiple new concepts in one training step.
concepts_list = [
    {
        "instance_prompt": "photo of cc person",
        "class_prompt": "photo of a person",
        "instance_data_dir": "./content/data/cc",
        "class_data_dir": "./content/data/person"
    },
]
Here is a quick breakdown of what each of those parameters means:
- instance_prompt - the prompt we would type to generate the image we are attempting to fine-tune
- class_prompt - a prompt without the unique identifier/instance. This prompt is used for generating "class images" for prior preservation. For our example, this is "a photo of a person" versus a photo of a specific person.
- instance_data_dir - the location where our training images for fine-tuning are stored
- class_data_dir - sample images for the general class of prompt we are fine-tuning. If there are no images here, samples will be generated for you. Otherwise, you can provide ~20 images of the general concept you want to generate (but not the actual instance images that we fine-tune on)
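The training command later in the notebook reads the concepts from a JSON file (concepts_list_cc.json in my case), so after editing the list you need to write it out. Here is a minimal sketch using only the standard library; the file name and directory paths are the ones from the example above:

```python
import json
import os

concepts_list = [
    {
        "instance_prompt": "photo of cc person",
        "class_prompt": "photo of a person",
        "instance_data_dir": "./content/data/cc",
        "class_data_dir": "./content/data/person",
    },
]

# Make sure the instance/class image folders exist before training.
for concept in concepts_list:
    os.makedirs(concept["instance_data_dir"], exist_ok=True)
    os.makedirs(concept["class_data_dir"], exist_ok=True)

# The training script consumes the concepts as JSON.
with open("concepts_list_cc.json", "w") as f:
    json.dump(concepts_list, f, indent=4)
```
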
In your notebook, feel free to change ‘photo of cc person’ to ‘photo of YOUR_NAME person’. Keep the rest of the prompt the same and just swap ‘cc’ for YOUR_NAME. This prompt snippet teaches Stable Diffusion how to use this new data when generating new images.
The only other thing you may want to change here is instance_data_dir. This is the relative path to the folder that holds the training images you want to use. Whether you change this directory or not, now is the time to upload the images you want to train with. So go ahead and do that; I’ll wait here.
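Before kicking off training, it is worth a quick sanity check that the images actually landed where the concepts list expects them. This is my own sketch, not part of the notebook; the path is the one from the example concepts list, and the 10–20 image range reflects what I have seen commonly recommended for DreamBooth:

```python
from pathlib import Path

# Path from the example concepts_list; change it if you used a different instance_data_dir.
instance_dir = Path("./content/data/cc")
instance_dir.mkdir(parents=True, exist_ok=True)  # harmless if it already exists

image_files = sorted(
    p for p in instance_dir.glob("*")
    if p.suffix.lower() in {".jpg", ".jpeg", ".png"}
)

print(f"Found {len(image_files)} training images in {instance_dir}")
if len(image_files) < 10:
    print("Warning: DreamBooth results tend to improve with roughly 10-20 instance images.")
```
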
HuggingFace Accelerate training
Welcome back. Now you have some images uploaded and are ready for a training run. I want to walk through this next section of code since it is doing all the heavy lifting for us.
MODEL_NAME = "runwayml/stable-diffusion-v1-5"
PRECISION = "fp16"
MAX_TRAIN_STEPS = 1200

#!accelerate launch --help
!accelerate launch --mixed_precision="fp16" --num_processes=1 --num_machines=1 --num_cpu_threads_per_process=2 \
train_dreambooth_ShivamShrirao.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--pretrained_vae_name_or_path="stabilityai/sd-vae-ft-mse" \
--output_dir=$OUTPUT_DIR \
--revision=$PRECISION \
--with_prior_preservation --prior_loss_weight=1.0 \
--seed=1337 \
--resolution=512 \
--train_batch_size=1 \
--train_text_encoder \
--mixed_precision=$PRECISION \
--use_8bit_adam \
--gradient_accumulation_steps=1 \
--learning_rate=1e-6 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--num_class_images=50 \
--sample_batch_size=4 \
--max_train_steps=$MAX_TRAIN_STEPS \
--save_interval=100000 \
--save_sample_prompt="photo of cc person" \
--concepts_list="concepts_list_cc.json"
We are using the HuggingFace Accelerate Python package to do the training for us: we supply the training script and some hyperparameters, and it does the magic behind the scenes.
The training file is a fork of the HuggingFace example located here.
The custom training file focuses on performance improvements and memory optimization and can be found here.
Parameters worth calling out:
MODEL_NAME - which model version you want to pull from HuggingFace. There are a few providers of SD models you can choose from. For instance, you may want to try out v2 (v1-5 gives better outputs in my opinion)
PRECISION - leave this as fp16, especially if you are running in a low-memory (i.e., <16 GB) environment
MAX_TRAIN_STEPS - 800-1200 usually gives good results based on my testing
seed - you can make this randomly generated, but it is normal to see this hardcoded once you find a training run that generates a good set of initial images
learning_rate - can be altered, but I tend to see 1e-6 to 5e-6
save_interval - keep this higher than MAX_TRAIN_STEPS or you may run out of CUDA memory during the save step
save_sample_prompt - if you updated this prompt in the concepts list, update it here as well
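As a rough sanity check on MAX_TRAIN_STEPS, you can relate it to how many instance images you uploaded. The arithmetic below is just the standard relationship between steps, batch size, and gradient accumulation; the 20-image count is a hypothetical example, not something from the notebook:

```python
# Values matching the launch command above.
train_batch_size = 1
gradient_accumulation_steps = 1
max_train_steps = 1200
num_instance_images = 20  # hypothetical: however many photos you uploaded

# Each optimizer step consumes this many training examples.
effective_batch = train_batch_size * gradient_accumulation_steps

# Approximate passes over your instance images during training.
approx_epochs = max_train_steps * effective_batch / num_instance_images
print(f"~{approx_epochs:.0f} passes over your {num_instance_images} images")
```

If the number of passes looks extreme in either direction, adjust MAX_TRAIN_STEPS toward the 800-1200 range discussed above.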
That will take several minutes to run. You should see some output from the training itself. A quick word of warning: I originally put this notebook together in Nov/Dec 2022 and had no issues with the training. However, while I was working on the code for this post, I noticed that something has changed in the underlying libraries, and I have run into numerous CUDA out-of-memory errors. I have rolled back packages and found settings that still work, but your mileage may vary. I will revisit the notebook in a few months and see if future updates fix the issue, but you should still have a functional notebook to play with until then.
Generate a grid of preview images
Now that the training has completed, running the next couple of cells in the notebook will bring you to a grid of the preview images. WARNING: these will not be flattering images. They will look weird; you may have too many fingers, or noses, or eyes. That’s ok, but if they look too surreal, change the seed value we discussed above and re-run until you get something that looks better.
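If you want to assemble a preview grid like this yourself, a small helper along these lines works. This is my own sketch using Pillow (the notebook's own grid cell may differ); the solid-color placeholders stand in for generated samples:

```python
from PIL import Image

def image_grid(images, cols=4):
    """Tile equally sized PIL images into a single grid image."""
    rows = (len(images) + cols - 1) // cols
    w, h = images[0].size
    grid = Image.new("RGB", (cols * w, rows * h))
    for i, img in enumerate(images):
        grid.paste(img, ((i % cols) * w, (i // cols) * h))
    return grid

# Example with placeholder images standing in for generated samples.
samples = [Image.new("RGB", (64, 64), c) for c in ("red", "green", "blue", "gray")]
preview = image_grid(samples, cols=2)  # a 2x2 grid, 128x128 pixels
```
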
See what I mean? They look weird right now, but generally ok. In the two right-most pictures, there is something wrong with the hands it has generated for me. The second image in the grid has the wrong eye color. Image number 7 decided to grow a baby out of my shoulder (honest, I didn’t add any training images of me holding a baby; I have no idea why, but even after running multiple iterations, it will randomly do this).
Inference
Finally, it’s time for the fun part. Run the rest of the cells until you get to this point.
#this is the normal prompt you are used to when generating images from text
#make sure to include the phrase 'photo of XX person'
#to force the model to use your finetuned results as a starting point
prompt = ("hyper-maximalist overdetailed comic book illustration headshot "
          "photo of cc person as hero. Give him a long, luxurious beard like "
          "Dumbledore. Make the image dark and gritty")

#negative prompts allow for removing/limiting what will be included
#commonly you would use 'duplicate' to ensure you don't get multiple copies
#of the instance image in a single output
negative_prompt = "duplicate"
with autocast("cuda"), torch.inference_mode():
    images = pipe(
        prompt,
        height=512,
        width=512,
        negative_prompt=negative_prompt,
        num_images_per_prompt=1,
        num_inference_steps=100,
        guidance_scale=8.5,
        generator=g_cuda
    ).images

for img in images:
    dt = datetime.now()
    ts = datetime.timestamp(dt)
    display(img)
    img.save('./content/ccOutputs/' + str(ts) + ".jpg", "JPEG")
Again, parameters worth noting:
height/width - leave these at 512, as that is what the underlying diffusers expect
num_images_per_prompt - you can generate one or multiple images at a time
guidance_scale - how closely the algorithm will follow your prompt versus how much "creativity" it can have in generation. I tend to see this between 7 and 8.5, but feel free to experiment here; it only affects these newly generated images.
num_inference_steps - how many denoising steps the algorithm will use when generating your image. Stable Diffusion should do well with as few as 50 steps, but for more detailed images you can go higher. I've had good luck with 100.
That’s it! When you run that cell, your model will generate a new image of you, display it on the screen, and save a copy, so you don’t lose it when you inevitably get caught up in the moment and power-run this cell a few dozen times looking at all the fantastic (and sometimes incredibly horrific) results you get.