
Guide to Preserving HuggingFace Models in Google Colab Environments

Conclusion: 

Step 1: 

Find the model path: ls ~/.cache
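A minimal sketch of this step in a Colab cell, assuming the default cache location used by recent transformers/huggingface_hub versions (~/.cache/huggingface/hub); model folders are named models--<org>--<name>:

import os

# Hugging Face stores downloaded models in this cache by default;
# each model gets a folder named models--<org>--<name>.
cache_root = os.path.expanduser("~/.cache/huggingface/hub")
print(os.listdir(cache_root))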



Step 2: 

Copy the entire folder to Google Drive: 
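A sketch of the copy, assuming Drive mounts at /content/drive and using models--google--flan-t5-base (the model from my test) as the example; adjust both paths to your case:

import os
import shutil
from google.colab import drive

# Mount Google Drive so MyDrive appears under /content/drive.
drive.mount('/content/drive')

src = os.path.expanduser("~/.cache/huggingface/hub/models--google--flan-t5-base")
dst = "/content/drive/MyDrive/models--google--flan-t5-base"

# Copy the whole folder. copytree resolves symlinks by default, so the
# files under snapshots/ become real copies on Drive (Drive has no symlinks).
shutil.copytree(src, dst)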



Step 3: 

Set the model path to the version subfolder under snapshots:
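Then point from_pretrained() at the commit-hash subdirectory inside snapshots/. The hash below is a placeholder; use the one that appears in your own folder:

from transformers import T5Tokenizer, T5ForConditionalGeneration

# The single subfolder under snapshots/ is named after a commit hash;
# <commit-hash> is a placeholder -- check your own directory listing.
model_path = "/content/drive/MyDrive/models--google--flan-t5-base/snapshots/<commit-hash>"

tokenizer = T5Tokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)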




My Story:

I initially began exploring Generative AI (GAI) and Google Colab through Stable Diffusion. Since I had mainly written server services and console applications, I was less familiar with data-science tools like R and Jupyter, which can maintain a paused state. I didn't quite understand how heavy a burden it was on Colab to spin up a temporary Stable Diffusion WebUI from an .ipynb, as popular guides suggested. I just found it troublesome that connections often took a long time to establish and then dropped, requiring a restart.


Recently, while testing new versions of the Stable Diffusion model, and facing Colab policies that made various versions of the WebUI difficult to run successfully, I started researching how to write my own test programs in Colab. Eventually, I understood that Colab is essentially a VM, capable of executing limited system commands as well as Python programs in Jupyter. Using Colab became much simpler for me; the only unfamiliar part was the AI side. To avoid the disconnections likely caused by long-running processes, I started looking for ways to download and save the models first.

At that time, I couldn't find the "runwayml/stable-diffusion-v1-5" model on Colab, but I was very keen on using a model available for download from Civitai. The model was a single file, and while StableDiffusionPipeline.from_pretrained() required other related files, I finally succeeded in using the pre-downloaded model for image creation via from_single_file().
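A minimal sketch of that approach with diffusers, using a placeholder path for the downloaded checkpoint:

import torch
from diffusers import StableDiffusionPipeline

# from_single_file() loads a whole pipeline from one checkpoint file
# (e.g. a .safetensors download from Civitai); the path is a placeholder.
pipe = StableDiffusionPipeline.from_single_file(
    "/content/drive/MyDrive/models/my_checkpoint.safetensors",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("a watercolor landscape").images[0]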


After that experience, I dedicated more time to learning about Generative AI. Following a recommendation, I tested the CodeLlama-7b-hf model, which took a considerable amount of time to download and then upload to MyDrive. However, after testing, I realized that this was not the model I wanted to use.

This time, following the course's suggestion, I tested the Flan-t5-base model. I found the "download, then upload" process too time-consuming, so I tried downloading directly on Colab by passing Hugging Face's model name, which was much faster. After extensive searching on the internet, I finally discovered that Hugging Face's models are stored under the ~/.cache directory. So I moved that directory to MyDrive, but then T5Tokenizer.from_pretrained() complained about missing tokenizer files. Checking the models--google--flan-t5-base directory, I indeed found no tokenizer.json at the top level; it was located in a version subdirectory within the snapshots subdirectory. Hence, I moved the version subdirectory out and deleted the rest. However, when I tried again, T5Tokenizer still complained about the missing tokenizer.








At this point, I was somewhat perplexed. I could clearly see the tokenizer.json file. I decided to check the config.json file downloaded by Colab for any clues and discovered that the cached structure differed from the online version provided by Hugging Face. The cache stores the actual file contents in a separate directory (blobs), and the entries under snapshots are links into it. Therefore, when executing from_pretrained(), it's necessary to set the model path to the version subdirectory under snapshots while keeping the complete cache structure intact.
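In other words, the cached layout looks roughly like this (hash names are placeholders):

models--google--flan-t5-base/
├── blobs/                      # actual file contents, named by hash
├── refs/                       # maps refs like "main" to a commit hash
└── snapshots/
    └── <commit-hash>/          # pass this directory to from_pretrained()
        ├── config.json         # link into blobs/
        └── tokenizer.json      # link into blobs/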



So, I downloaded it again, moved the entire folder to MyDrive, and set the correct model path. This solved the problem of having to download repeatedly and also facilitated future extended training.

2024/03/10 update

The cache_dir parameter of from_pretrained() achieves the same thing: point it at a folder on Drive, and the library downloads to, and later reloads from, that cache:

import torch
from transformers import CLIPTextModel

# cache_dir redirects the download cache, e.g. to a folder on Drive:
# cache_folder = "/content/drive/MyDrive/hf_cache"
text_encoder = CLIPTextModel.from_pretrained(
    model_path,                 # e.g. "runwayml/stable-diffusion-v1-5"
    subfolder="text_encoder",
    cache_dir=cache_folder,
    num_hidden_layers=11,       # keep 11 of 12 layers (a "clip skip" tweak)
    torch_dtype=torch.float16,
)



