After my first attempt at fine-tuning Stable Diffusion didn’t yield the desired results, I realized that I may have caused the embedding space to collapse by using too few images with very similar descriptions. For my second attempt, I decided to collect more images using a simple approach: extracting frames from a playthrough video of the game. This increased the diversity of the images and will hopefully improve the fine-tuning results.
This seven-hour-long video clocked in at about 1,200,000 frames. I used YOLOv7 to identify frames that contained recognizable objects and extracted about 10,000 of them, which I then cropped and scaled to 512x512.
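Roughly, that extraction step looks like the sketch below. This is a reconstruction rather than the exact script: the video path, the sampling interval, the confidence threshold, and the YOLOv5-style torch.hub interface for loading YOLOv7 are all assumptions.

```python
import os
import cv2
import torch

# Loading YOLOv7 through torch.hub -- the 'custom' entry point and the
# YOLOv5-style results API used below are assumptions about the WongKinYiu/yolov7 repo.
model = torch.hub.load("WongKinYiu/yolov7", "custom", "yolov7.pt")

SAMPLE_EVERY = 120    # only inspect every 120th frame (assumed sampling interval)
CONF_THRESHOLD = 0.5  # minimum detection confidence to keep a frame (assumed)

os.makedirs("frames", exist_ok=True)
cap = cv2.VideoCapture("playthrough.mp4")  # assumed path to the playthrough video
frame_idx = kept = 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % SAMPLE_EVERY == 0:
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        results = model(rgb)
        detections = results.xyxy[0]  # rows: x1, y1, x2, y2, confidence, class
        if detections.shape[0] and (detections[:, 4] > CONF_THRESHOLD).any():
            # Centre-crop to a square and resize to 512x512 for Stable Diffusion.
            h, w = frame.shape[:2]
            side = min(h, w)
            y0, x0 = (h - side) // 2, (w - side) // 2
            crop = cv2.resize(frame[y0:y0 + side, x0:x0 + side], (512, 512),
                              interpolation=cv2.INTER_AREA)
            cv2.imwrite(f"frames/frame_{kept:06d}.png", crop)
            kept += 1
    frame_idx += 1

cap.release()
```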
The next challenge was to generate accurate captions for these images. To do this, I used BLIP (Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation) to generate four captions for each image: from each of two different BLIP models, one caption generated deterministically and one generated by sampling. Finally, I ran the captions through a transformer-based summarizer.
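A minimal sketch of this captioning step is below, assuming the Hugging Face checkpoints Salesforce/blip-image-captioning-base and Salesforce/blip-image-captioning-large stand in for the two BLIP models and a default summarization pipeline stands in for the summarizer; the exact models and generation settings are assumptions.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration, pipeline

# Two BLIP models -- the specific checkpoints are assumptions.
MODEL_NAMES = [
    "Salesforce/blip-image-captioning-base",
    "Salesforce/blip-image-captioning-large",
]
captioners = [
    (BlipProcessor.from_pretrained(name), BlipForConditionalGeneration.from_pretrained(name))
    for name in MODEL_NAMES
]
summarizer = pipeline("summarization")  # default summarization model, also an assumption

def caption_image(path: str) -> str:
    image = Image.open(path).convert("RGB")
    captions = []
    for processor, model in captioners:
        inputs = processor(images=image, return_tensors="pt")
        # One deterministic caption (beam search) and one sampled caption per model.
        for do_sample in (False, True):
            out = model.generate(**inputs, do_sample=do_sample,
                                 num_beams=1 if do_sample else 3, max_new_tokens=30)
            captions.append(processor.decode(out[0], skip_special_tokens=True))
    # Condense the four captions into a single caption for training.
    return summarizer(". ".join(captions), max_length=40, min_length=5)[0]["summary_text"]
```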
Here are two example images with their corresponding captions:
These are not the best captions, but they are more diverse than what I used before and hopefully still reasonably accurate. To make this work with the Stable Diffusion training code, I had to generate a metadata.jsonl file that maps every image to its caption. A simple Jupyter notebook did the trick.
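In essence, the notebook just has to write one JSON object per image, following the Hugging Face imagefolder convention of a file_name and a text field; the directory layout and the reuse of caption_image from the sketch above are assumptions.

```python
import json
import os

frames_dir = "frames"  # directory holding the 512x512 crops (assumed path)

with open(os.path.join(frames_dir, "metadata.jsonl"), "w") as f:
    for file_name in sorted(os.listdir(frames_dir)):
        if not file_name.endswith(".png"):
            continue
        # caption_image is the helper from the captioning sketch above.
        caption = caption_image(os.path.join(frames_dir, file_name))
        f.write(json.dumps({"file_name": file_name, "text": caption}) + "\n")
```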
Thanks to Lambda Labs’ cloud GPUs, I rented an NVIDIA A100 for a couple of days. This allowed me to run the fine-tuning for about 25 epochs, saving a checkpoint after each epoch. To evaluate the quality of the checkpoints, I asked txt2img to generate 4 images for each of 4 prompts for every checkpoint. The video below shows the result. It remains to be seen how to interpret it: while the results are definitely better than my previous attempt, they’re still not quite where I would like them to be.
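The evaluation loop can be sketched with diffusers instead of the txt2img script: for every checkpoint, render four images for each of four fixed prompts with a fixed seed so the checkpoints are directly comparable. The checkpoint layout, the placeholder prompts, and the assumption that each checkpoint is stored in diffusers format are not taken from the original setup.

```python
import os
import torch
from diffusers import StableDiffusionPipeline

PROMPTS = [  # placeholder prompts, not the ones used for the video
    "a castle on a hill",
    "a character standing in a forest",
    "a dark cave entrance",
    "a village marketplace",
]

os.makedirs("eval", exist_ok=True)
for epoch in range(1, 26):
    ckpt_dir = f"checkpoints/epoch-{epoch:02d}"  # assumed checkpoint layout, diffusers format
    pipe = StableDiffusionPipeline.from_pretrained(ckpt_dir, torch_dtype=torch.float16).to("cuda")
    for p_idx, prompt in enumerate(PROMPTS):
        # Fixed seed per prompt so every checkpoint is rendered under identical conditions.
        generator = torch.Generator(device="cuda").manual_seed(42)
        images = pipe(prompt, num_images_per_prompt=4, generator=generator).images
        for i_idx, image in enumerate(images):
            image.save(f"eval/epoch{epoch:02d}_prompt{p_idx}_{i_idx}.png")
```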
Let me know what you think.