So, on my continuing quest to understand how this AI image generation stuff works under the hood, I thought it was time to actually train a neural network. Well, training one from scratch would have been both too hard without knowing a lot more about it and far too resource-intensive, so a LoRA was what I was going for. A LoRA is basically a small add-on model that you train to augment a more generic base model for a special purpose. First, I followed Mickmumpitz’s tutorial for creating consistent characters. And yes, I had to switch over to my Nvidia 3080 Ti for that. After that, things went kinda smoothly for a bit. I had some issues with sharpness, but the character I created was clearly recognizable. The LoRA ended up being rather inflexible, though. Because of the very few images I had for training, poses had little variation and clothes were always the same; even when I forced the generation of different clothes with very specific prompts, there were always obvious similarities to the original clothes from the training data.
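As an aside, in case “small add-on model that augments a base model” sounds hand-wavy: the usual LoRA formulation (low-rank adaptation, give or take scaling conventions) leaves the base model’s weights frozen and, for selected weight matrices W, learns a small additive update

W' = W + (alpha / r) * B * A

where B is d×r, A is r×k and the rank r is small (that’s the network_dim you’ll see in the kohya settings later, with alpha as a scaling factor). That is why a LoRA file is tiny compared to the base model and can simply be switched on and off or merged in.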
I could potentially have used that initial LoRA to generate a gazillion pictures and feed the best, I dunno, five or ten percent back into training the model all over again. But knowing me, I wouldn’t have the patience to put that work into some random character, where getting it just right held no particular meaning for me. Instead I decided to broaden the scope and teach FLUX about my Cyberpunk 2077 female V.
Dance like nobody’s watching
First Attempts
I stuck with fluxgym on Mickmumpitz’s recommendation, installed alongside ComfyUI using pinokio. At some point I considered moving over to bmaltais’s UI for kohya_ss/sd_scripts to tinker with some advanced options. It turned out that with my limited hardware resources (a measly 12 GB VRAM and 64 GB RAM *cough, cough*) most of those advanced options were way out of my league, and the only configurations that would work for me were the ones fluxgym had selected for me anyway. So I ended up copying and augmenting the batch file from the original fluxgym run, adding things like logging for TensorBoard and saving states so I could resume training. All in all I stayed very close to the fluxgym suggestion, I just didn’t click my way through the UI every time. The bmaltais Kohya UI I mostly used for quick WD14 tagging of my training images.
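Concretely, that just meant appending a few flags to the command fluxgym had generated; roughly like this (paths and the state-folder name are placeholders, everything fluxgym generated stays as it was):

accelerate launch sd-scripts/flux_train_network.py \
  ...all the arguments fluxgym generated, unchanged... \
  --logging_dir=logs --log_with=tensorboard \
  --save_state \
  --resume=outputs/cpFV-000010-state
# --logging_dir / --log_with write loss curves that TensorBoard can display (view with: tensorboard --logdir logs)
# --save_state stores optimizer state next to each checkpoint so a run can be picked up again
# --resume is only added when actually resuming, pointing at one of those saved ...-state folders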
I had about twenty-five screenshots I had made throughout my Cyberpunk 2077 playthroughs, made some new ones to have material similar to that in Mickmumpitz’s character sheet, faces from various angles, various facial expressions, simple full-body shots (in the nude *ahem*) and had the LoRA have a go at it. Early on I had read about difficulties with training LoRAs for FLUX and was using the Flux-Dev2Pro model mentioned there. That was probably a good idea as some later test further down seem to indicate. Still the results were a little underwhelming.
billowing green dress in epoch 4 / billowing green dress in epoch 5 / billowing green dress in epoch 10
I didn’t hate the black and white profile shot, though it had more of a digital art look to it than I intended. I realized that I had the implicit expectation of eventually using the in-game screenshots to create photorealistic images. That was a complication I had not hitherto considered. In other words: consider your own bias, too.
Concept bleeding
The other images were clearly showing my character, but they were not very sharp and by epoch ten, when the gorilla arms finally came through, the training had turned the billowing dress from the prompt into something with pants, because Cyberpunk 2077 just doesn’t do billowing skirts or dresses. At the same time, the concept of a cpFemmeV was already very much bleeding into the AI model’s concept of a woman. Three random women standing in a field in colourful dresses suddenly looked very much like a cpFemmeV triplet:
Three women without LoRA activation keyword
Turns out that when people say your activation keyword should be unique and should not contain other concepts the model you are training already knows about, that includes French. I’m guessing it’s not actually FLUX that speaks French but the T5 or CLIP text encoders; still, if you let your vanilla FLUX generate an image of a “Femme on a beach” or a “voiture”, it becomes obvious that the language just doesn’t matter. Consequently, you need to check your activation keywords against any language, not just the one you normally use. After switching over to cpFV, a group of women without the activation keyword looked much better.
Same three women without LoRA activation keyword cpFV (plus one extra)
Though you can tell the LoRA has some kind of impact even without activation keyword. But chest tattoos in only one image and no pink hair or gorilla arms was definitely progress.
Woman with the LoRA turned off compared to the LoRA turned on but without the activation keyword
None of this, however, helped much with the issue that, once activated, the LoRA would also cause people in the background to look like my character.
Photorealism
To tackle the digital-artwork style of the black and white shot, I tagged all my training images with “unreal engine render”, in the hope that the AI would later realize that the training images all had a certain look because they were game renderings, and that it could then extrapolate photorealistic images from them if I explicitly prompted for those. Let’s just say, it did zilch. My guess is that FLUX’s idea of a 3D render is just too different, as evidenced by what you get when you prompt for a 3D render style. Or maybe it just doesn’t work in the direction from less detail to more detail.
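Mechanically that part is trivial: with single-line WD14 captions sitting next to the images as .txt files, something like this prepends the tag everywhere (the folder name is a placeholder):

sed -i 's/^/unreal engine render, /' dataset/*.txt   # GNU sed, edits the caption files in place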
What about the sharpness, then? I went and processed all my training images with some decent, but in my opinion not exaggerated sharpening. The hair I actually blurred a bit here and there, because it was already looking pixelated. Well …
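I did that by hand and selectively (which is exactly why I could also blur the hair), but the uniform part could just as well be batched, for example with ImageMagick’s unsharp mask; the numbers here are just a starting point, not what I used:

mogrify -unsharp 0x1.5+0.8+0.02 *.png   # radius x sigma + amount + threshold; overwrites the files, so keep backups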
Learned from manually sharpened images
While I appreciate the background and especially how at epoch 17 the waiter still doesn’t have purple hair, the foreground has even more of a digital art look to it than previous images. It is pretty obvious that the weird sharpness of the training data was interpreted as an integral characteristic of cpFV. So, this is when it dawned on me that (probably in general but definitely for more complex things) …
the quality of your training data is the single most important variable to your training!
Improving the Training Data
Not only did the image quality leave much to be desired, I had also accidentally included screenshots from various playthroughs: in some of them the character had gorilla arms, while in others she had the monowire implant. And I had tagged neither. That was noticeably messing with what the model learned my character’s arms were supposed to look like. But without the monowire images I didn’t have enough images showing various clothes and natural poses from different angles. While I was debating with myself whether or not to create an entirely new set of screenshots, I remembered that I had previously used GeForce Experience game filters to add sharpening to the game on the fly, and that GeForce Experience also lets you take screenshots with that sharpening applied. That prospect convinced me to create new screenshots at the maximum resolution my 4K display would allow, sharpened by a Sharpen+ game filter at that native resolution. Once that was done, I divided my images into three distinct subsets with different numbers of repetitions, in decreasing order: one for the basic face and naked body, one for various facial expressions, and one with various types of shots, various clothes and various poses.
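In kohya terms, each of those subsets gets its own entry and repeat count in the dataset config. A minimal sketch of what mine conceptually looked like in dataset.toml (folder names and repeat counts here are placeholders, not my actual values):

[general]
caption_extension = ".txt"

[[datasets]]
resolution = 1024
batch_size = 1

  [[datasets.subsets]]
  image_dir = "train/face_body"        # basic face and full-body shots: most repeats
  class_tokens = "cpFV"
  num_repeats = 8

  [[datasets.subsets]]
  image_dir = "train/expressions"      # facial expressions
  class_tokens = "cpFV"
  num_repeats = 4

  [[datasets.subsets]]
  image_dir = "train/clothes_poses"    # various clothes, poses and shot types: fewest repeats
  class_tokens = "cpFV"
  num_repeats = 2

The training script then just gets pointed at it with --dataset_config=dataset.toml.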
Billowing green dress? / The black and white profile shot looks different / Monochrome, but not the image
The sharpness was much improved. But notice how the black and white profile shot suddenly was neither black and white nor a profile shot. It had all the vibrant colour that many repetitions of the face had taught the model to consider as part of the character. Even when I tried to force the model to create a monochrome image, it could no longer quite do it. Also the particular angle the middle image shows the face from (instead of from the side) is one of the most common angles in my screenshots of facial expressions. Due to the haircut you mostly only really see the facial expression from this angle. But I had obviously exacerbated the overfitting effect you always seem to get at higher epochs by giving those facial expression shots way too many repetitions. So, again …
the quality of your training data is the single most important variable to your training!
And that does not only apply to image quality and consistency, but also to balance: whether there is any unwanted bias towards learning certain things over others, mostly caused by giving certain kinds of images too many repetitions compared to the rest.
So, after a bit more tinkering with the repetitions for my three subsets of data, I got these:
Almost time to party / Cheers! / Exuberant joy? / Hugging a ball to her chest?
The first two I was rather happy with. But anything that had the character’s arms raised above her head would get the V-pose from my nude full-body training images, which in turn were in the subset with the most repetitions. This gave way too much weight to that peculiar pose. Whenever the arms came to about shoulder height, there was nothing I could do in the prompt to keep them from going into that V shape. In the preview you could watch the sampler consider different arm positions, but it would invariably settle on the V shape. I could only get other arm poses if I forced the arms to go down. So, how did that go again?
the quality of your training data is the single most important variable to your training!
The Final Cut?
After another makeover of the training data, I changed the nude full-body shots to a more neutral pose with the arms down at the sides, aggressively cropped images to the relevant parts (so that not every other background would automatically look cyberpunkish), and added more shots of the character in various clothes, poses and types of crop, some in black and white. I made sure that the repetitions multiplied by the number of images in each subset would be about the same as for the subset with faces, body parts and tattoos. Then I added a batch of regularisation images from actual photos.
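To make the balancing concrete with made-up numbers: if the face-and-tattoo subset has 25 images at 8 repeats, it contributes 25 × 8 = 200 samples per epoch, so a clothes-and-poses subset of 50 images would get 4 repeats (50 × 4 = 200) to carry roughly the same weight during training.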
I also tinkered with various of the training settings, but many of those did not improve much. I tried a higher network_dim (up to 128), newer optimizers that don’t need a scheduler, scale_weight_norms (with values that actually do something and don’t just log to tensorboard), etc. I dumped some money into a Novita.ai GPU instance to even be able to test most of that stuff, and first had to teach myself how to create a Docker image for the sd3 branch of kohya_ss. But even after trying to understand how the whole diffusion stuff works, how a neural network learns diffusion, and what loss is all about, I have a feeling that, even if I could actually do the relevant calculations, selecting a configuration from a gazillion configuration options would still require a whole lot of trial and error. Apparently it’s very much about having experience with certain settings and how they will affect the result in which circumstances, kinda like with everything except writing a program. Eventually, after reading this, I ended up with these simpler settings (which may or may not work as well for anything other than a character LoRA):
From that and previous runs I was expecting best results around epochs 15 to 20, with a conveniently low loss at 20.
Hyper-realistic, 16k, crisp, sharp and detailed, photo (ultra), (masterpiece, award winning artwork), the scene shows a stylish cafe with chairs and tables, far in the background a waiter is busy cleaning a table, ((cpFV)) is casually lounging back in a chair in the forground and holding a glass of wine in her right hand. She is wearing a fancy dress with feathers on her sleeves and high heels kicked up on a table. She is (grinning:1.5) into the camera.
Generally speaking, you would expect the low epochs to be lacking in accuracy and the high epochs to cause odd effects. The face and tattoos start to look really good to me at epoch 14, and from epoch 20 on you see odd stuff, from odd glasses to somewhat weird proportions to the odd clothing in epoch 40. Some of these things could probably be worked around by explicit prompting, but I really only want to do that every now and then and not for every other image. My favourite is epoch 16, though notice how at 17 the waiter doesn’t have pink hair.
Hyper-realistic, 16k, crisp, sharp and detailed, photo (ultra), (masterpiece, award winning artwork), action shot of cpFV running towards the camera firing an assault rifle, the muzzle fire lighting her grim facial expression and clenched teeth. She is wearing a black latex bodysuit and a short military jacket with a lot of pockets. Behind her there are explosions and and soldiers fighting. Smoke is clouding the distant cityscape in the background. The atmosphere of the scene is very hectic and dynamic, the lighting dark as night.
I love epoch 14 for the dynamics of the pose, the direction the gun is pointing and the facial expression. Epochs 16 and 17 aren’t bad either, but 14 has the superior facial expression. Epoch 40 is surprisingly good.
Hyper-realistic, 16k, crisp, sharp and detailed, photo (ultra), (masterpiece, award winning artwork), the scene shows an old-fashioned library in a well-lit spacious historic building filled with old bookshelves, tables and chairs. In the foreground cpFV is sitting on a table cradling an open book in her lap. She is wearing a short and frilly white dress and golden mirror sunglasses. cpFV flashes a flirtatious smile at the camera while other readers are busy in the blurry background.
Oh, the hands! I mean, we all know hands are a problem, even if FLUX generally seems to do a better job of them than older SD variants. But the gorilla arms aren’t helping. They are just too different from what the AI considers hands. In some cases the AI seems to think they’re gloves, especially in earlier epochs. Epoch 20 seems to get lucky with the hands, though I like the pose in epoch 16 best. I do like the pose in epoch 26, but where is the smile I prompted for? There is a tendency for clothes to come out a bit plastic-like. Another thing I like about epoch 16 is that it isn’t quite there yet. Epoch 40 is surprisingly good again.
Hyper-realistic, 16k,Shot on Canon EOS 5D Mark IV, 85mm lens, Deep focus (f/11), on the (left side of the image) a cpFV with asymmetrical pink hair is standing in the foreground , one hand on her hip, the other waving at the camera. She is smiling brightly for the photo, wearing a white crop top and a long floral skirt. Behind her is rocky dirt road with a small oriental market with stalls and several Indian people buying and selling. The Taj Mahal is in the background. It is the evening of a sunny day. The light is coming from behind the camera.
Notice how the concrete slab floors that feature prominently in Night City (and hence my training data) shine through, even though I explicitly asked for a rocky dirt road. Epoch 20 does a pretty good job here. Also the left arm isn’t as stubby as in some other ones and people in the background don’t have pink hair. Also, the prompting for camera and lens type works a bit better for realism.
Hyper-realistic, 16k,Shot on Canon EOS 5D Mark IV, 85mm lens, Deep focus (f/11), the photo shows Irish landscape of rolling green hills under an overcast sky. In the foreground there are a Connemara pony and some sheep grazing. Behind them cpFV woman with asymmetrical pink hair is climbing on top of a huge boulder. She wears a long white dress billowing in the wind. The rolling hills meet a sweeping coastline in the distance.
The pink hair on the horses is cracking me up.
The lighting is pretty good in epoch 16. That was the point of asking for an overcast sky, when the training data mostly has extremely bright Californian sunlight. I also like the rock. But overall I like epoch 26 best here. Something odd is happening with the legs and the dress in epoch 20, and V’s rear end in epoch 40, while everything else there is pretty good, just doesn’t match the training data.
Hyper-realistic, 16k, Shot on Canon EOS 5D Mark IV, 85mm lens, cpFV with asymmetrical pink hair running along the beach on a sunny day splashing in the water exuberantly happy. She is wearing a reflective golden bikini and large sunglasses, tosses a beach ball at the camera, grinning. Palm trees are swaying in the background.
Epoch 8 has good image composition, but the hair and tattoos just don’t look like my character enough, yet. I like the dynamics of epoch 14 a lot, even if the left hand looks a little weird. The shadows in epoch 11 and 30 are a hoot. From epoch 20 on, the ball just looks disconnected from the rest of the picture, except for epoch 30, but the head is a bit large, there. Rear hand considered, my favourite is probably epoch 17.
Captured with a Sony A7 III, 50mm f/1.8 lens, Hyper-realistic, 16k, sharp and detailed, photo (ultra), (masterpiece, award winning artwork), closeup of a slim cpFV woman with asymmetrical pink hair lifting weights in a gym, sitting on a bench, shot from the back, Shallow depth of field (f/1.8), subject in sharp focus, bokeh background
The hands are the nemesis of image generation, again. Just look at the right hand in the otherwise awesome epoch 40 result. Epoch 30 would be my favourite, if the bar actually went through the hands in a natural manner. Epoch 26 would be alright, if the left hand wasn’t missing a finger. So, all in all it’s epoch 16 or 17 again, minus gorilla arm hands and bluish tattoos on the shoulders.
Captured with a Sony A7 III, macro photography, extreme close-up: eye of caucasian (cpFV:1.2), profile of cpFV wearing an opuent glittering diamond (necklace:1.1) and a feather boa. extreme close-up: cpFV’s eye is half closed and lowered her lips are pursed. It’s a (dark:3.0) low key black and white shot at night showing a (sharp:1.2) silhouette against a dim candlelight from the back, barely visible. Shallow depth of field (f/1.8), subject in sharp focus, bokeh
This time it’s a win for epoch 14, for the most extreme closeup, decent photorealism and even a little reflection in the eye. I seem to get closer closeups with aspect ratios closer to 1:1. Epoch 16 isn’t bad as such, and neither is 20, except for a slightly less realistic eye and an odd skin texture. Epoch 40 has some nice contrast, but the overall winner is 14. Yes, the tattooed text isn’t correct, but I have never seen the LoRA get that right.
There is one thing of note about these, though: they make most visible a problem that some of my epochs have, and which is the reason I’m not even showing epoch 15 here, although that one had some otherwise great results. If you look closely, you can see artifacts like vertical bands in the images, especially in areas with a simple homogeneous tone or a soft gradient, and especially when those get very bright. To make the effect easier to spot, I’ve increased the contrast in the images below. You can see it most prominently near the candle in epoch 26, to a lesser degree in 17 and 16. In 14, 30 and 40 I can’t really see it. In epoch 15 it was so visible even in the other images that it’s not really usable, unless you want to post-process with either a 2-pass Ultimate Upscale using a different sampler and scheduler, or even an img2img pass at low denoise using an epoch that does not have the problem.
It turned out this was caused by a combination of using the Dev2Pro model for training (but not inference) and the --apply_t5_attn_mask parameter. If I train without the T5 attention mask, I don’t get the vertical bands. If I train on the Dev2Pro model and also use it for inference, I don’t get the vertical bands (but boring images). If I train on vanilla Flux-Dev and use it for inference too, I also don’t get the vertical bands. I’ve commented on an issue on the sd_scripts GitHub, but I’m not sure this isn’t simply an effect of the way Dev2Pro works. I have tried training on vanilla Flux-Dev and on another de-distilled version, but I haven’t had much success with either: one moment you have an underfitted LoRA that doesn’t even represent your character closely, the next you have a completely overfitted one with crazy anatomy or every wall looking like the one I used as a simple background in my training data. I’m guessing there must be a sweet spot for the learning rate somewhere to get those models to work, but I’m running a bit out of steam. So instead I just did the same training as above again, just without --apply_t5_attn_mask, and now I have options.
Conclusion
My results were obviously not perfect. The photorealism sometimes needs a bit of extra nudging to come out, or else the images will look just a bit less than real. I’ve never managed to keep the LoRA from applying my character’s properties to people in the background, too, though I have a feeling T5 attention masking helps a little with that. This always gets worse the better the LoRA actually learns my character’s features. And I’ve still never managed to get the LoRA to always accurately draw the same tattoos on the character, let alone the little bit on her face that spells out “NUSA”. I feel like what I’ve tried to do is on the more complicated end of character LoRA creation, compared to a 2D anime character, for example. There are a lot of details to learn, from the hair and facial features to the tattoos and gorilla arms. And the latter are obviously in conflict with FLUX’s idea of a hand. Looking across the epochs, you can see the AI start out thinking of the hands as gloves; and soon after it learns that they are something different, uncanny-valley effects become more frequent. There is a fine balance to strike between representing the character accurately and overfitting or eventually burning the LoRA completely.
There is also the issue of the vertical bands with --apply_t5_attn_mask that would require more tests with other models, and also, in order to not have cyberpunkish buildings and concrete floors bleed into all sorts of environments, I should train again with loss masks. That’s a feature not marketed tremendously well, but there is some documentation here. What it does is allow you to specify for every training image exactly where your subject is, so that the training process doesn’t look at anything else. That seems very promising for a character LoRA, but creating those masks is just more work than I want to invest right now. Before knowing about the feature, I had tried to do this manually by blurring everything in my training images that is not my subject. But the AI looks at the images more closely than a casual viewer. My masks were far from perfect, and my generated images started to get halos of random pixels around the character, because the AI had learned that my character usually has some sharp pixels around her before the blur starts. I don’t know how exact loss masks need to be, but I’m guessing it’s a fairly similar situation and they have to be pretty precise. So while there are tools to help with creating loss masks, I expect they would still need a whole lot of manual post-processing. Maybe some other time.
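For the record, and as far as I understand the sd_scripts documentation (I haven’t actually run this against the FLUX branch), the masks go in as an extra folder of black-and-white images per subset, named like the training images, plus one extra switch on the training command. Folder names below are placeholders:

  [[datasets.subsets]]
  image_dir = "train/clothes_poses"
  conditioning_data_dir = "train/clothes_poses_masks"   # white = subject (learn from this area), black = ignore
  class_tokens = "cpFV"
  num_repeats = 2

The training call then additionally gets --masked_loss.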
Was it all not worth it, then?
Well, is it worth having an AI that can generate images of one of your game characters in the first place? I mean, I obviously learned some things, and learned where I feel I still know way too little. But I feel like I also actually got usable results. The result will probably not be a single LoRA. Maybe I will keep several epochs, and maybe the same one won’t always work best for every prompt. I’ll probably create images with an earlier epoch for more natural poses and then use a later epoch and img2img to increase the likeness to the character. Or I’ll pick depending on the type of shot: an earlier epoch for wide-angle shots, where you don’t see the details that well anyway, and a later one for closeups. As long as I can generate stuff like this without too much hassle, I guess I’m fine. Even though for the time being it’s a bit more involved than I’d wish.
Since I’m trying to get a Docker image running with a gazillion dependencies and keep going back and forth between versions, I thought I’d compile this table of which xformers version is compatible with which torch version.
xformers          torch
v0.0.2            torch >= 1.8.1
v0.03             torch >= 1.8.1
v0.0.4            torch >= 1.8.1
v0.0.5            torch >= 1.8.1
v0.0.6            torch >= 1.8.1
v0.0.7            torch >= 1.8.1
v0.0.8            torch >= 1.8.1
v0.0.9            torch >= 1.8.1
v0.0.10           torch >= 1.8.1
v0.0.11           torch >= 1.8.1
v0.0.12           torch >= 1.12
v0.0.13           torch >= 1.12
v0.0.16rc423      torch >= 1.12
v0.0.16rc424      torch >= 1.12
v0.0.16rc425      torch >= 1.12
v0.0.16           torch >= 1.12
v0.0.17rc481      torch >= 1.12
v0.0.17rc482      torch >= 1.12
v0.0.17           torch >= 1.12
v0.0.18           torch >= 1.12
v0.0.19           torch >= 1.12
v0.0.20           torch >= 1.12
v0.0.21           torch >= 1.12
v0.0.22           torch >= 1.12
v0.0.22.post1     torch >= 1.12
v0.0.22.post2     torch >= 1.12
v0.0.22.post3     torch >= 1.12
v0.0.22.post4     torch >= 1.12
v0.0.22.post5     torch >= 1.12
v0.0.22.post6     torch >= 1.12
v0.0.22.post7     torch >= 1.12
v0.0.23           torch >= 1.12
v0.0.23.post1     torch >= 1.12
v0.0.24           torch >= 2.1
v0.0.25           torch >= 2.1
v0.0.25.post1     torch >= 2.1
v0.0.26           torch >= 2.1
v0.0.26.post1     torch >= 2.1
v0.0.27           torch >= 2.2
v0.0.27.post1     torch >= 2.2
v0.0.27.post2     torch >= 2.2
v0.0.28           torch >= 2.4
v0.0.28.post1     torch >= 2.4
v0.0.28.post2     torch >= 2.4
v0.0.28.post3     torch >= 2.4
v0.0.29           torch >= 2.4
v0.0.29.post1     torch >= 2.4
v0.0.29.post2     torch >= 2.6
v0.0.29.post3     torch >= 2.6
Easier than finding it manually:
# Walk every xformers tag and read the torch requirement from that tag's
# requirements.txt, printing the result as rows of an HTML table.
git clone https://github.com/facebookresearch/xformers.git
cd xformers
echo "<table>"
git tag --list | while read -r tag ; do
    echo "<tr>"
    echo -n "<td>$tag</td>"
    git checkout -q "$tag"
    echo -n "<td>"
    VER=$(grep -i torch requirements.txt)   # e.g. "torch >= 2.4"
    echo "$VER</td>"
    echo "</tr>"
done
echo "</table>"
git checkout main
cd ..