
loss=nan on 1660 SUPER 6GB #293

@martianbit

Description

Hey,
I have an NVIDIA GeForce GTX 1660 SUPER (6 GB) and I want to train LoRA models with it.
This is my configuration:

accelerate launch --num_cpu_threads_per_process 4 train_network.py \
  --network_module="networks.lora" \
  --pretrained_model_name_or_path=/mnt/models/animefull-final-pruned.ckpt \
  --vae=/mnt/models/animevae.pt \
  --train_data_dir=/mnt/datasets/character \
  --output_dir=/mnt/out --output_name=character \
  --caption_extension=.txt --shuffle_caption \
  --prior_loss_weight=1 --network_alpha=128 \
  --resolution=512 --enable_bucket --min_bucket_reso=320 --max_bucket_reso=768 \
  --train_batch_size=1 --gradient_accumulation_steps=1 \
  --learning_rate=0.0001 --text_encoder_lr=0.00005 \
  --max_train_epochs=20 \
  --mixed_precision=fp16 --save_precision=fp16 \
  --use_8bit_adam --xformers \
  --save_every_n_epochs=1 --save_model_as=safetensors \
  --clip_skip=2 --flip_aug --color_aug --face_crop_aug_range="2.0,4.0" \
  --network_dim=128 --max_token_length=225 --lr_scheduler=constant

The training image directory is named 3_Concept1, so 3 repeats are used.
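For reference, the dataset layout looks roughly like this (the image and caption file names below are just placeholders):

/mnt/datasets/character/
└── 3_Concept1/
    ├── 001.png
    ├── 001.txt
    ├── 002.png
    └── 002.txt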
The script does not throw any errors, but loss=nan is reported and the saved U-Nets come out corrupted.
I've tried setting mixed_precision to no, but then I run out of VRAM.
I've also tried disabling xformers, but again I run out of VRAM.
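In case it helps narrow things down, this is the kind of minimal fp16 check I can run outside of train_network.py (just a sketch, the tensor sizes are arbitrary):

import torch

# Standalone fp16 sanity check on the GPU, independent of the training script.
# If NaNs already show up here, the problem is in the card/driver fp16 path
# rather than in the LoRA training code.
device = torch.device("cuda")
x = torch.randn(1024, 1024, device=device, dtype=torch.float16)
w = torch.randn(1024, 1024, device=device, dtype=torch.float16)

y_fp16 = x @ w
print("NaN in fp16 matmul:", torch.isnan(y_fp16).any().item())

# Same multiplication in fp32 for comparison.
y_fp32 = x.float() @ w.float()
print("NaN in fp32 matmul:", torch.isnan(y_fp32).any().item())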
I've compiled xformers myself from source, using pip install ninja && MAX_JOBS=4 pip install -v . inside the xformers checkout.
I've also tried several other xformers versions, such as 0.0.16 and the one suggested in the README.
I've tried both CUDA 11.6 and 11.7.
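Similarly, a quick way I can exercise the xformers path directly (again just a sketch; the shapes loosely mimic SD attention with head size 40, nothing from the training code):

import torch
import xformers
import xformers.ops as xops

print("xformers version:", xformers.__version__)

# Small fp16 memory-efficient attention call, independent of the training script.
# Shapes are (batch * heads, sequence, head_dim), chosen only for this check.
q = torch.randn(8, 64, 40, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = xops.memory_efficient_attention(q, k, v)
print("NaN in xformers attention output:", torch.isnan(out).any().item())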

Python version: 3.10.6
PyTorch version: torch==1.12.1+cu116 torchvision==0.13.1+cu116

Any help is much appreciated!
Thank you!
