Antalpha.ai
Apr 7, 2023
Have you ever imagined seeing a picture of your lovely pet photographed in front of the Eiffel Tower, or generating an AI image with the faces of your friends? It’s completely possible with some fine-tuning of Stable Diffusion!
The process of creating these scenarios may seem complicated, as it involves placing a particular subject or object into a new environment in a way that looks natural and effortless. Out of the box, Stable Diffusion can generate an image of a human face, but it will tend to produce the face of a stranger. Even if you prompt with a celebrity or character name, the result won’t be a close likeness; most likely it will generate an image with a similarity of under 50%.
With fine-tuning, Stable Diffusion can easily achieve those images. It supports training on your own subjects through techniques such as LoRA, Dreambooth, and Textual Inversion. These three training techniques are widely used to teach the model to generate an exact, specific subject. Training is not limited to human faces: it can cover any kind of subject, from animals and vehicles to your house’s flower vase, or even a picture style.
Nowadays, most people use it to train on human faces or picture styles (pop-art style, vector style, etc.). LoRA, Dreambooth, and Textual Inversion are AI techniques for fine-tuning diffusion models such as Stable Diffusion. They work by incorporating a specific subject as an input into the model. Below is a technical explanation of how each of these fine-tuning methods differs from the others.
- Dreambooth
Dreambooth was first published in 2022 by a research team at Google. Dreambooth captures a subject and allows it to be integrated into any desired context. The name derives from the idea of a photo booth: once the main subject is captured, it can be recreated however your dreams would have it.
The Google team demonstrated Dreambooth in a research paper with an example that used merely four photographs of a Corgi as input. From these, the Dreambooth model was able to produce numerous images of the Corgi in multiple scenarios. DreamBooth is also powerful enough to capture the essence of a picture or style from any artwork. It lets users fine-tune and customize text-to-image models using a few subject images along with a corresponding class name (e.g., “dog,” “human,” “building”).
Dreambooth is able to generate high-quality and diverse outputs, and many consider it excellent at capturing the essence of specific things or people. Its training method requires a specific, rare identifier word for the subject, one that carries little existing meaning; this prevents the AI from mixing the new concept with common, already-learned words. Secondly, Dreambooth also has a prior-preservation class method. It works through so-called preservation images: the parts of the model we want to protect from alteration are represented by the class images, while the subject we want to train is excluded from them. However, if your settings are incorrect, your outputs may end up looking identical to the class images, or too similar to your training images. Compared to LoRA and Textual Inversion, Dreambooth has a greater tendency to distort color balance and specific objects.
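The prior-preservation idea above can be sketched as a loss function. This is a minimal illustration, not the actual DreamBooth implementation: it assumes the standard diffusion mean-squared-error objective and shows how the class-image ("prior") term is added to the subject term with a tunable weight.

```python
import numpy as np

def dreambooth_loss(instance_pred, instance_target,
                    class_pred, class_target,
                    prior_loss_weight=1.0):
    # Standard diffusion MSE on the subject (instance) batch...
    instance_loss = np.mean((instance_pred - instance_target) ** 2)
    # ...plus a weighted MSE on the class ("prior preservation") batch,
    # which keeps the model's general idea of the class (e.g. "dog")
    # from being overwritten by the few subject photos.
    prior_loss = np.mean((class_pred - class_target) ** 2)
    return instance_loss + prior_loss_weight * prior_loss
```

Setting `prior_loss_weight` too low is one way the misconfiguration mentioned above happens: the model drifts toward the training images and every "dog" starts to look like your dog.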
DreamBooth is a powerful training method that preserves subject identity and is faithful to prompts. However, it can be frustrating to use and requires at least 12 GB of VRAM. Custom models are biased toward specific subjects or styles and can produce higher-quality images than standard models. DreamBooth has limitations: it can only generate what it has been trained to know, and detailed style models may fail to generate anything unfamiliar. The output of Dreambooth is usually a full .ckpt (checkpoint) model.
For example, the Anything model produces good results, but it can only generate backgrounds it has seen in images. If we ask it to create a “plain background,” it won’t be able to, because it doesn’t know what that is. This means that detailed style models may fail when generating anything unfamiliar, like people, places, or things. Despite its limitations, DreamBooth can still produce great results, but it is not a replacement for other methods like LoRA or Textual Inversion. LoRA models are widely used, but DreamBooth is still considered superior in image quality because a much larger amount of image input can be fed into the custom model.
Users need to ensure that all images are appropriately labeled, use a smaller learning rate, apply the prior preservation loss, and take care not to overfit the data, among other considerations.
The reason many people don’t train with Dreambooth is that it is more “costly” than the other training methods. Training usually takes around 15 to 20 minutes and mostly generates high-quality, diverse outputs, with a large file size ranging from 3 GB to 8 GB depending on the quality and quantity of the input images. Dreambooth is also better at capturing all the information about an image style and subject in one checkpoint, with highly detailed subject features.
If you want to learn more about Dreambooth and how to run it, especially using Google Colab, this article will help you tackle it.
- LoRA (Low-Rank Adaptation)
LoRA is a mathematical technique that reduces the number of trained parameters, and it is the most recently released of these fine-tuning methods. It can be compared to saving a small patch for the model instead of saving the entire model. Researchers at Microsoft developed LoRA, and Simo Ryu first applied it to Stable Diffusion. A LoRA is like a patch or an injected part of a model; it is not as highly detailed as a checkpoint, but according to most consensus, it is about 95% as good as a checkpoint (Dreambooth) model.
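The "small patch" intuition can be made concrete with a few lines of numpy. This is a toy sketch of the low-rank idea, not Stable Diffusion's actual implementation: a frozen weight matrix `W` gets an additive update factored into two skinny matrices `A` and `B`, so only a tiny fraction of the parameters is trained and shipped. The dimensions are illustrative.

```python
import numpy as np

d, r = 768, 4                      # layer dimension vs. LoRA rank (r << d)
W = np.random.randn(d, d)          # frozen pretrained weight, never trained
A = np.random.randn(r, d) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))               # starts at zero, so the patch is a no-op at init

def lora_forward(x, alpha=4.0):
    # y = x W^T + (alpha / r) * x A^T B^T
    # The low-rank product B @ A is added on top of the frozen weight
    # instead of rewriting it -- that product IS the "patch" file.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

full_params = W.size           # 589,824 parameters to fine-tune this layer fully
lora_params = A.size + B.size  # only 6,144 parameters for the rank-4 patch
```

Saving just `A` and `B` per adapted layer is why LoRA files are megabytes while checkpoints are gigabytes.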
After reading multiple forums and subreddits, many comments suggest that LoRA models outperform Textual Inversion. LoRA is preferred because it is nearly as powerful as Dreambooth, but with faster training time, less memory consumption, and smaller disk usage. Dreambooth, on the other hand, can alter color balance and objects in ways that LoRA and Textual Inversion do not. Note that a LoRA can be used with any model trained on SD 1.4 or 1.5, regardless of whether ChilloutMix or another model was used when training the LoRA file. However, the best results are obtained when the LoRA was trained on the same base model that is used to generate the final output.
LoRA is highly recommended for use across multiple models, and its files are small: below 150 MB, and sometimes as little as 1 MB. Training is also faster (ranging from 5 to 10 minutes) and requires less VRAM. It is ideal for small datasets of only 5 to 10 images; the higher their quality, the better the result. LoRA is best for training faces and styles, though it is not recommended for realistic faces.
To create your own reusable LoRA concept, we suggest using the Kohya WebUI for training. Remember that a LoRA model cannot be used independently; it must be combined with a checkpoint model. When using a LoRA in a text prompt, the usual format is <lora:name of the model:weight of the LoRA>, for example <lora:AngelinaJolieV1:0.8>.
The 0.8 after the model name indicates how much weight you want the LoRA to carry in the output image; 0.8 represents 80%. The higher the weight, the harder the AI tries to retain the LoRA model’s features. This can be troublesome if the LoRA is based on anime or animated-character models while the checkpoint model is based on realistic, 3D images: the output may come out slightly distorted. For a first try, set the weight between 0.6 and 0.7 to check whether the LoRA blends well with the model.
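The `<lora:name:weight>` tag format described above is easy to handle programmatically. Here is a small sketch that extracts the pairs from an AUTOMATIC1111-style prompt; the function name and regex are my own, not part of any official tool.

```python
import re

# Matches AUTOMATIC1111-WebUI-style tags like <lora:AngelinaJolieV1:0.8>
LORA_TAG = re.compile(r"<lora:([^:>]+):([0-9.]+)>")

def extract_loras(prompt):
    """Return (name, weight) pairs for every LoRA tag in the prompt,
    plus the prompt text with the tags stripped out."""
    loras = [(m.group(1), float(m.group(2))) for m in LORA_TAG.finditer(prompt)]
    clean = LORA_TAG.sub("", prompt).strip()
    return loras, clean
```

For example, `extract_loras("portrait photo <lora:AngelinaJolieV1:0.8>")` yields the pair `("AngelinaJolieV1", 0.8)` and the cleaned prompt `"portrait photo"`, mirroring how the WebUI applies the LoRA at 80% strength while the rest of the text conditions the image as usual.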
One potential downside of LoRA models is that they seem highly dependent on the specific base model used in training. For instance, a LoRA trained on ChilloutMix may perform well with the ChilloutMix model, but not with the Dreamshaper model. Textual Inversion embeddings, on the other hand, appear to work well across various 1.5-based models.
If you want an in-depth tutorial on how to use LoRA in Stable Diffusion, refer to the links below:
- Textual Inversion
Textual Inversion is a method for teaching a model a concept, like a person or object, in a small file. It’s great because it takes up very little disk space and is easy to use in prompts. The smallest outputs range from just 40–100 KB (kilobytes), which is beneficial if you don’t have much storage but still want to train various subjects on your PC.
Generally, Textual Inversion involves capturing images of an object or person, naming it (e.g., Abcdboy), and incorporating it into Stable Diffusion so the name can be used in image prompts (e.g., Abcdboy).
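The steps above can be sketched in code. This toy illustrates the core mechanism, under assumed dimensions (SD 1.5’s text encoder uses 768-dimensional token embeddings) and with a made-up placeholder token borrowed from the Abcdboy example: the method adds one new token to the vocabulary and learns only that token’s vector, leaving every other weight frozen.

```python
import numpy as np

EMBED_DIM = 768                        # token embedding size in SD 1.5's text encoder
vocab = {"a": 0, "photo": 1, "of": 2}  # toy stand-in for the real tokenizer
table = np.random.randn(len(vocab), EMBED_DIM)  # frozen embedding table

# Textual Inversion adds ONE new token and optimizes only its vector;
# the rest of the model never changes.
vocab["<abcdboy>"] = len(vocab)
table = np.vstack([table, np.random.randn(1, EMBED_DIM) * 0.01])

def embed(prompt):
    """Look up the embedding vector for each whitespace-separated token."""
    return np.stack([table[vocab[tok]] for tok in prompt.lower().split()])
```

The trained file stores little more than that one row (768 floats, a few KB); real embedding files often use several vectors per token plus metadata, which is why they land in the 40–100 KB range mentioned above.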
Using Textual Inversion for facial training is an excellent choice, since it is more adaptable than other training techniques and requires minimal space. This approach achieves comparable outcomes with less effort by using the model’s pre-existing knowledge to teach it the desired appearance of the person. When done correctly, the results are reliably accurate and very flexible to work with. Textual Inversion works best when trained on one subject at a time.
A Textual Inversion embedding, which gives the generator guidance on how to create an image, typically contains only 10–30 KB of hints, while a custom model may contain several GB of data. It is not feasible for Textual Inversion to pack that much information into such a limited space. Therefore, Textual Inversion is restricted to a narrow “concept” and cannot encompass a broad one like “anime style.” An anime Textual Inversion may only be capable of generating one or two poses, the ones used in training, instead of the multitude of poses available in a custom model. It is advisable to use only a few images when training your Textual Inversion, as overwhelming or overtraining it will render it ineffective.
It is correct that Textual Inversion can impact the entire image, but the same can be said for any word added to a prompt. The purpose of Textual Inversion, like a text prompt, is to guide the image generator to a specific location within the model’s latent space, whereas a custom model actually modifies the latent space itself, resulting in a more significant impact.
Textual Inversion is primarily focused on teaching a model a concept, such as a person or an object, by acting as a prompt helper. This approach has its drawbacks, such as taking up token slots in the prompt and not being ideal for perfect replication. However, when combined with a good model, textual inversion can produce excellent results.
It is essential to note that textual inversion generally only works with the model it was trained on and works better with photo-realistic models rather than anime models. Textual Inversion essentially contains a vector that describes a person’s facial features, such as nose size and eye shape, making it more suited to realistic models.
We can even use multiple Textual Inversion embeddings in one prompt (unlike Dreambooth, where we can only use one checkpoint at a time). However, they are not as effective as other methods because they contain fewer hints for the generator. Note also that you can’t use two checkpoints at the same time, so you would have to merge them at the cost of losing some information.
- Textual Inversion Tutorial (1; 2; 3)
- Textual Inversion Further Explanation
There are numerous factors at play, including the skill of the trainer and the quality of the input resources. If you only want to train specific people or objects, we recommend LoRA as the main training method in Stable Diffusion, due to its efficiency and easy implementation. People today use LoRA far more than the other types of training because of its lower hardware requirements and shorter training times, which means more efficiency for model creators and more opportunities for experimentation and fine-tuning. On the other hand, if you want to train a whole concept of subjects and styles, Dreambooth is the perfect choice for a full training method.