
Guide to Preserving HuggingFace Models in Google Colab Environments

Conclusion: 

Step 1: 

Find the model path: ls ~/.cache
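A minimal sketch of this step in a Colab cell, assuming the default cache location used by recent transformers/huggingface_hub versions (~/.cache/huggingface/hub); model folders are named models--<org>--<name>:

import os

# Hugging Face stores downloaded models in this cache by default;
# each model gets a folder named models--<org>--<name>.
cache_root = os.path.expanduser("~/.cache/huggingface/hub")
print(os.listdir(cache_root))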



Step 2: 

Copy the entire folder to Google Drive: 
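A sketch of the copy, assuming Drive mounts at /content/drive and using models--google--flan-t5-base (the model from my test) as the example; adjust both paths to your case:

import os
import shutil
from google.colab import drive

# Mount Google Drive so MyDrive appears under /content/drive.
drive.mount('/content/drive')

src = os.path.expanduser("~/.cache/huggingface/hub/models--google--flan-t5-base")
dst = "/content/drive/MyDrive/models--google--flan-t5-base"

# Copy the whole folder. copytree resolves symlinks by default, so the
# files under snapshots/ become real copies on Drive (Drive has no symlinks).
shutil.copytree(src, dst)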



Step 3: 

Set the model path to the version subfolder under snapshots:
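Then point from_pretrained() at the commit-hash subdirectory inside snapshots/. The hash below is a placeholder; use the one that appears in your own folder:

from transformers import T5Tokenizer, T5ForConditionalGeneration

# The single subfolder under snapshots/ is named after a commit hash;
# <commit-hash> is a placeholder -- check your own directory listing.
model_path = "/content/drive/MyDrive/models--google--flan-t5-base/snapshots/<commit-hash>"

tokenizer = T5Tokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)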




My Story:

I initially began exploring Generative AI (GAI) and Google Colab through Stable Diffusion. Since I had mainly written server services and console applications, I was less familiar with data-science tools like R and Jupyter, which can maintain a paused state. I didn't quite understand how heavy a burden it was on Colab to spin up a temporary Stable Diffusion WebUI from an .ipynb, as popular guides suggested. I just found it troublesome that connections often took a long time to establish and then dropped, requiring a restart.


Recently, while testing new versions of the Stable Diffusion model, and facing Colab policies that made various versions of the WebUI difficult to run successfully, I started researching how to write my own test programs in Colab. Eventually, I understood that Colab is essentially a VM, capable of executing limited system commands as well as Python programs in Jupyter. Using Colab became much simpler for me; the only unfamiliar part was the AI side. To avoid the disconnections likely caused by long-running processes, I started looking for ways to download and save the models first.

At that time, I couldn't find the "runwayml/stable-diffusion-v1-5" model on Colab, but I was very keen on using a model available for download from Civitai. The model was a single file, and while StableDiffusionPipeline.from_pretrained() required other related files, I finally succeeded in using the pre-downloaded model for image creation via from_single_file().
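A minimal sketch of that approach with diffusers, using a placeholder path for the downloaded checkpoint:

import torch
from diffusers import StableDiffusionPipeline

# from_single_file() loads a whole pipeline from one checkpoint file
# (e.g. a .safetensors download from Civitai); the path is a placeholder.
pipe = StableDiffusionPipeline.from_single_file(
    "/content/drive/MyDrive/models/my_checkpoint.safetensors",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("a watercolor landscape").images[0]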


After that experience, I dedicated more time to learning about Generative AI. Following a recommendation, I tested the CodeLlama-7b-hf model, which took a considerable amount of time to download and then upload to MyDrive. However, after testing, I realized that this was not the model I wanted to use.

This time, following the course's suggestion, I tested the Flan-t5-base model. I found the "download, then upload" process too time-consuming, so I tried downloading directly on Colab by passing Hugging Face's model name, which was much faster. After extensive searching on the internet, I finally discovered that Hugging Face's models are stored under the ~/.cache directory. So I moved that directory to MyDrive, but then T5Tokenizer.from_pretrained() complained about missing tokenizer files. Checking the models--google--flan-t5-base directory, I indeed found no tokenizer.json at the top level; it was located in a version subdirectory within the snapshots subdirectory. Hence, I moved the version subdirectory out and deleted the rest. However, when I tried again, T5Tokenizer still complained about the missing tokenizer.








At this point, I was somewhat perplexed. I could clearly see the tokenizer.json file. I decided to check the config.json file downloaded by Colab for any clues and discovered that the cached structure differed from the online version provided by Hugging Face. The cache stores the actual file contents in a separate directory (blobs), and the entries under snapshots are links into it. Therefore, when executing from_pretrained(), it's necessary to set the model path to the version subdirectory under snapshots while keeping the complete cache structure intact.
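In other words, the cached layout looks roughly like this (hash names are placeholders):

models--google--flan-t5-base/
├── blobs/                      # actual file contents, named by hash
├── refs/                       # maps refs like "main" to a commit hash
└── snapshots/
    └── <commit-hash>/          # pass this directory to from_pretrained()
        ├── config.json         # link into blobs/
        └── tokenizer.json      # link into blobs/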



So, I downloaded it again, moved the entire folder to MyDrive, and set the correct model path. This solved the problem of having to download repeatedly and also facilitated future extended training.

2024/03/10 update

The cache_dir parameter of from_pretrained() achieves the same thing: point it at a folder on Drive, and the library downloads to, and later reloads from, that cache:

import torch
from transformers import CLIPTextModel

# cache_dir redirects the download cache, e.g. to a folder on Drive:
# cache_folder = "/content/drive/MyDrive/hf_cache"
text_encoder = CLIPTextModel.from_pretrained(
    model_path,                 # e.g. "runwayml/stable-diffusion-v1-5"
    subfolder="text_encoder",
    cache_dir=cache_folder,
    num_hidden_layers=11,       # keep 11 of 12 layers (a "clip skip" tweak)
    torch_dtype=torch.float16,
)



