Antalpha.ai
Apr 7, 2023
Have you ever imagined seeing a picture of your lovely pet photographed in front of the Eiffel Tower, or generating an AI image with the faces of your friends? It’s completely possible with some fine-tuning of Stable Diffusion!
The process of creating these scenarios may seem complicated, as it involves placing a particular subject or object into a new environment in a way that looks natural and effortless. Out of the box, Stable Diffusion can generate an image of a human face, but it will tend to produce the face of a stranger. Even if you prompt with a celebrity or character name, the result won’t be a close likeness; most likely it will generate an image with a similarity of under 50%.
With fine-tuning, Stable Diffusion can easily achieve those images. It supports training on your own subjects through techniques such as LoRA, Dreambooth, and Textual Inversion. These three training techniques are widely used to teach the model to generate an exact, specific subject. Training is not limited to human faces: it can cover any kind of subject, from animals and vehicles to your house’s flower vase, or even a picture style.
Nowadays, most people use it to train on human faces or picture styles (pop-art style, vector style, etc.). LoRA, Dreambooth, and Textual Inversion are AI techniques for fine-tuning diffusion models such as Stable Diffusion. They work by incorporating a specific subject as an input into the model. Below is a technical explanation of how each of these fine-tuning methods differs from the others.
- Dreambooth
Dreambooth was first published in 2022 by a research team at Google. Dreambooth captures a subject and allows it to be integrated into any desired context. The name derives from the idea of a photo booth: once the main subject is captured, it can be recreated however your dreams would have it.
The Google team demonstrated Dreambooth in a research paper with an example that used merely four photographs of a Corgi as input. From these, the Dreambooth model was able to produce numerous images of the Corgi in multiple scenarios. DreamBooth is also powerful enough to capture the essence of a picture or style from any artwork. It lets users fine-tune and customize text-to-image models using a few subject images along with a corresponding class name (e.g., “dog,” “human,” “building”).
Dreambooth is able to generate high-quality and diverse outputs, and many consider it excellent at capturing the essence of specific things or people. Its training method requires a specific, rare identifier word for the subject, one that carries little existing meaning; this prevents the AI from mixing the new concept with common, already-learned words. Secondly, Dreambooth also has a prior-preservation class method. It works through so-called preservation images: the parts of the model we want to protect from alteration are represented by the class images, while the subject we want to train is excluded from them. However, if your settings are incorrect, your outputs may end up looking identical to the class images, or too similar to your training images. Compared to LoRA and Textual Inversion, Dreambooth has a greater tendency to distort color balance and specific objects.
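The prior-preservation idea above can be sketched as a loss function. This is a minimal illustration, not the actual DreamBooth implementation: it assumes the standard diffusion mean-squared-error objective and shows how the class-image ("prior") term is added to the subject term with a tunable weight.

```python
import numpy as np

def dreambooth_loss(instance_pred, instance_target,
                    class_pred, class_target,
                    prior_loss_weight=1.0):
    # Standard diffusion MSE on the subject (instance) batch...
    instance_loss = np.mean((instance_pred - instance_target) ** 2)
    # ...plus a weighted MSE on the class ("prior preservation") batch,
    # which keeps the model's general idea of the class (e.g. "dog")
    # from being overwritten by the few subject photos.
    prior_loss = np.mean((class_pred - class_target) ** 2)
    return instance_loss + prior_loss_weight * prior_loss
```

Setting `prior_loss_weight` too low is one way the misconfiguration mentioned above happens: the model drifts toward the training images and every "dog" starts to look like your dog.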
DreamBooth is a powerful training method that preserves subject identity and is faithful to prompts. However, it can be frustrating to use and requires at least 12 GB of VRAM. Custom models are biased toward specific subjects or styles and can produce higher-quality images than standard models. DreamBooth has limitations: it can only generate what it has been trained to know, and detailed style models may fail to generate anything unfamiliar. The output of Dreambooth is usually a full .ckpt (checkpoint) model.
For example, the Anything model produces good results, but it can only generate backgrounds it has seen in images. If we ask it to create a “plain background,” it won’t be able to, because it doesn’t know what that is. This means that detailed style models may fail when generating anything unfamiliar, like people, places, or things. Despite its limitations, DreamBooth can still produce great results, but it is not a replacement for other methods like LoRA or Textual Inversion. LoRA models are widely used, but DreamBooth is still considered superior in image quality because a much larger amount of image input can be fed into the custom model.
Users need to ensure that all images are appropriately labeled, use a smaller learning rate, apply the prior preservation loss, and take care not to overfit the data, among other considerations.
The reason many people don’t train with Dreambooth is that it is more “costly” than the other training methods. Training usually takes around 15 to 20 minutes and mostly generates high-quality, diverse outputs, with a large file size ranging from 3 GB to 8 GB depending on the quality and quantity of the input images. Dreambooth is also better at capturing all the information about an image style and subject in one checkpoint, with highly detailed subject features.
If you want to learn more about Dreambooth and how to run it, especially using Google Colab, this article will help you tackle it.
- LoRA (Low-Rank Adaptation)
LoRA is a mathematical technique that reduces the number of trained parameters, and it is the most recently released of these fine-tuning methods. It can be compared to saving a small patch for the model instead of saving the entire model. Researchers at Microsoft developed LoRA, and Simo Ryu first applied it to Stable Diffusion. A LoRA is like a patch or an injected part of a model; it is not as highly detailed as a checkpoint, but according to most consensus, it is about 95% as good as a checkpoint (Dreambooth) model.
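The "small patch" intuition can be made concrete with a few lines of numpy. This is a toy sketch of the low-rank idea, not Stable Diffusion's actual implementation: a frozen weight matrix `W` gets an additive update factored into two skinny matrices `A` and `B`, so only a tiny fraction of the parameters is trained and shipped. The dimensions are illustrative.

```python
import numpy as np

d, r = 768, 4                      # layer dimension vs. LoRA rank (r << d)
W = np.random.randn(d, d)          # frozen pretrained weight, never trained
A = np.random.randn(r, d) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))               # starts at zero, so the patch is a no-op at init

def lora_forward(x, alpha=4.0):
    # y = x W^T + (alpha / r) * x A^T B^T
    # The low-rank product B @ A is added on top of the frozen weight
    # instead of rewriting it -- that product IS the "patch" file.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

full_params = W.size           # 589,824 parameters to fine-tune this layer fully
lora_params = A.size + B.size  # only 6,144 parameters for the rank-4 patch
```

Saving just `A` and `B` per adapted layer is why LoRA files are megabytes while checkpoints are gigabytes.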
After reading multiple forums and subreddits, many comments suggest that LoRA models outperform Textual Inversion. LoRA is preferred because it is nearly as powerful as Dreambooth, but with faster training time, less memory consumption, and smaller disk usage. Dreambooth, on the other hand, can alter color balance and objects in ways that LoRA and Textual Inversion do not. Note that a LoRA can be used with any model trained on SD 1.4 or 1.5, regardless of whether ChilloutMix or another model was used when training the LoRA file. However, the best results are obtained when the LoRA was trained on the same base model that is used to generate the final output.
LoRA is highly recommended for use across multiple models, and its files are small: below 150 MB, and sometimes as little as 1 MB. Training is also faster (ranging from 5 to 10 minutes) and requires less VRAM. It is ideal for small datasets of only 5 to 10 images; the higher their quality, the better the result. LoRA is best for training faces and styles, though it is not recommended for realistic faces.
To create your own reusable LoRA concept, we suggest using the Kohya WebUI for training. Remember that a LoRA model cannot be used independently; it must be combined with a checkpoint model. When using a LoRA in a text prompt, the usual format is <lora:name of the model:weight of the LoRA>, for example <lora:AngelinaJolieV1:0.8>.
The 0.8 after the model name indicates how much weight you want the LoRA to carry in the output image; 0.8 represents 80%. The higher the weight, the harder the AI tries to retain the LoRA model’s features. This can be troublesome if the LoRA is based on anime or animated-character models while the checkpoint model is based on realistic, 3D images: the output may come out slightly distorted. For a first try, set the weight between 0.6 and 0.7 to check whether the LoRA blends well with the model.
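The `<lora:name:weight>` tag format described above is easy to handle programmatically. Here is a small sketch that extracts the pairs from an AUTOMATIC1111-style prompt; the function name and regex are my own, not part of any official tool.

```python
import re

# Matches AUTOMATIC1111-WebUI-style tags like <lora:AngelinaJolieV1:0.8>
LORA_TAG = re.compile(r"<lora:([^:>]+):([0-9.]+)>")

def extract_loras(prompt):
    """Return (name, weight) pairs for every LoRA tag in the prompt,
    plus the prompt text with the tags stripped out."""
    loras = [(m.group(1), float(m.group(2))) for m in LORA_TAG.finditer(prompt)]
    clean = LORA_TAG.sub("", prompt).strip()
    return loras, clean
```

For example, `extract_loras("portrait photo <lora:AngelinaJolieV1:0.8>")` yields the pair `("AngelinaJolieV1", 0.8)` and the cleaned prompt `"portrait photo"`, mirroring how the WebUI applies the LoRA at 80% strength while the rest of the text conditions the image as usual.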
One potential downside of LoRA models is that they seem highly dependent on the specific base model used in training. For instance, a LoRA trained on ChilloutMix may perform well with the ChilloutMix model, but not with the Dreamshaper model. Textual Inversion embeddings, on the other hand, appear to work well across various 1.5-based models.
If you want an in-depth tutorial on how to use LoRA in Stable Diffusion, refer to the links below:
- Textual Inversion
Textual Inversion is a method for teaching a model a concept, like a person or object, in a small file. It’s great because it takes up very little disk space and is easy to use in prompts. The smallest outputs range from just 40–100 KB (kilobytes), which is beneficial if you don’t have much storage but still want to train various subjects on your PC.
Generally, Textual Inversion involves capturing images of an object or person, naming it (e.g., Abcdboy), and incorporating it into Stable Diffusion so the name can be used in image prompts (e.g., Abcdboy).
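The steps above can be sketched in code. This toy illustrates the core mechanism, under assumed dimensions (SD 1.5’s text encoder uses 768-dimensional token embeddings) and with a made-up placeholder token borrowed from the Abcdboy example: the method adds one new token to the vocabulary and learns only that token’s vector, leaving every other weight frozen.

```python
import numpy as np

EMBED_DIM = 768                        # token embedding size in SD 1.5's text encoder
vocab = {"a": 0, "photo": 1, "of": 2}  # toy stand-in for the real tokenizer
table = np.random.randn(len(vocab), EMBED_DIM)  # frozen embedding table

# Textual Inversion adds ONE new token and optimizes only its vector;
# the rest of the model never changes.
vocab["<abcdboy>"] = len(vocab)
table = np.vstack([table, np.random.randn(1, EMBED_DIM) * 0.01])

def embed(prompt):
    """Look up the embedding vector for each whitespace-separated token."""
    return np.stack([table[vocab[tok]] for tok in prompt.lower().split()])
```

The trained file stores little more than that one row (768 floats, a few KB); real embedding files often use several vectors per token plus metadata, which is why they land in the 40–100 KB range mentioned above.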
Using Textual Inversion for facial training is an excellent choice, since it is more adaptable than other training techniques and requires minimal space. This approach achieves comparable outcomes with less effort by using the model’s pre-existing knowledge to teach it the desired appearance of the person. When done correctly, the results are reliably accurate and very flexible to work with. Textual Inversion works best when trained on one subject at a time.
A Textual Inversion embedding, which gives the generator guidance on how to create an image, typically contains only 10–30 KB of hints, while a custom model may contain several GB of data. It is not feasible for Textual Inversion to pack that much information into such a limited space. Therefore, Textual Inversion is restricted to a narrow “concept” and cannot encompass a broad one like “anime style.” An anime Textual Inversion may only be capable of generating one or two poses, the ones used in training, instead of the multitude of poses available in a custom model. It is advisable to use only a few images when training your Textual Inversion, as overwhelming or overtraining it will render it ineffective.
It is correct that Textual Inversion can impact the entire image, but the same can be said for any word added to a prompt. The purpose of Textual Inversion, like a text prompt, is to guide the image generator to a specific location within the model’s latent space, whereas a custom model actually modifies the latent space itself, resulting in a more significant impact.
Textual Inversion is primarily focused on teaching a model a concept, such as a person or an object, by acting as a prompt helper. This approach has its drawbacks, such as taking up token slots in the prompt and not being ideal for perfect replication. However, when combined with a good model, textual inversion can produce excellent results.
It is essential to note that textual inversion generally only works with the model it was trained on and works better with photo-realistic models rather than anime models. Textual Inversion essentially contains a vector that describes a person’s facial features, such as nose size and eye shape, making it more suited to realistic models.
We can even use multiple Textual Inversion embeddings in one prompt (unlike Dreambooth, where we can only use one checkpoint at a time). However, they are not as effective as other methods because they contain fewer hints for the generator. Note also that you can’t use two checkpoints at the same time, so you would have to merge them at the cost of losing some information.
- Textual Inversion Tutorial (1; 2; 3)
- Textual Inversion Further Explanation
There are numerous factors at play, including the skill of the trainer and the quality of the input resources. If you only want to train specific people or objects, we recommend LoRA as the main training method in Stable Diffusion, due to its efficiency and easy implementation. People today use LoRA far more than the other types of training because of its lower hardware requirements and shorter training times, which means more efficiency for model creators and more opportunities for experimentation and fine-tuning. On the other hand, if you want to train a whole concept of subjects and styles, Dreambooth is the perfect choice for a full training method.