
Guide to Preserving HuggingFace Models in Google Colab Environments

Conclusion: 

Step 1: 

Find the model path: ls ~/.cache (by default, Hugging Face models are under ~/.cache/huggingface/hub)
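The same lookup can be done from a Python cell. This is only a minimal sketch: list_cached_models is a hypothetical helper name, and it assumes the default cache location has not been changed through environment variables.

```python
import os
from pathlib import Path

# Default Hugging Face hub cache location on Linux (assumed unchanged on Colab).
DEFAULT_HUB_DIR = Path(os.path.expanduser("~/.cache/huggingface/hub"))


def list_cached_models(hub_dir=DEFAULT_HUB_DIR):
    """Return the names of cached model folders, e.g. 'models--google--flan-t5-base'."""
    hub_dir = Path(hub_dir)
    if not hub_dir.is_dir():
        return []  # nothing has been downloaded yet
    return sorted(
        p.name
        for p in hub_dir.iterdir()
        if p.is_dir() and p.name.startswith("models--")
    )


# Prints an empty list if no model has been downloaded in this session.
print(list_cached_models())
```

Each returned name is one model folder; the full path is the cache directory joined with that name.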



Step 2: 

Copy the entire model folder (for example models--google--flan-t5-base) to Google Drive.
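A minimal sketch of the copy step, assuming Google Drive is already mounted in Colab (typically at /content/drive). copy_model_folder and the hf_models destination folder are hypothetical names used for illustration. The sketch follows symbolic links while copying, so the copy does not rely on links being supported at the destination; the cost is that linked files are written as full copies.

```python
import shutil
from pathlib import Path


def copy_model_folder(model_dir, drive_dir):
    """Copy one cached model folder, with all its subfolders, into drive_dir."""
    model_dir = Path(model_dir)
    target = Path(drive_dir) / model_dir.name
    # symlinks=False (the default) copies the linked file contents, not the links.
    shutil.copytree(model_dir, target, symlinks=False, dirs_exist_ok=True)
    return target


# Example call in Colab (paths are assumptions, adjust to your own setup):
# copy_model_folder(
#     Path.home() / ".cache/huggingface/hub/models--google--flan-t5-base",
#     "/content/drive/MyDrive/hf_models",
# )
```

The function returns the path of the copied folder, which is the folder to look inside in the next step.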



Step 3: 

Set the model path to the version subfolder under snapshots.
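The lookup of that subfolder can be sketched as below. find_snapshot_path is a hypothetical helper, and it assumes the copied folder holds exactly one downloaded version; with several versions you would have to choose one yourself.

```python
from pathlib import Path


def find_snapshot_path(model_dir):
    """Return the single version subfolder under <model_dir>/snapshots."""
    snapshots = Path(model_dir) / "snapshots"
    versions = sorted(p for p in snapshots.iterdir() if p.is_dir())
    if len(versions) != 1:
        # Zero or several versions: refuse to guess which one is wanted.
        raise ValueError(f"expected one version under {snapshots}, found {len(versions)}")
    return versions[0]


# Example use in Colab (the Drive path is an assumption):
# from transformers import T5Tokenizer
# model_path = find_snapshot_path(
#     "/content/drive/MyDrive/hf_models/models--google--flan-t5-base"
# )
# tokenizer = T5Tokenizer.from_pretrained(str(model_path))
```

The rest of the model folder stays where it is; only the path handed to from_pretrained() points at the version subfolder.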




My Story:

I first started exploring Generative AI (GAI) and Google Colab through Stable Diffusion. Since I had mainly written server services and console applications, I was less familiar with data-science environments such as R and Jupyter, which can hold a session in a paused state. I didn't really understand how heavy a load it puts on Colab to create a temporary Stable Diffusion WebUI from an .ipynb notebook, as popular guides suggest. I just found it troublesome that connections often took a long time to establish and then dropped, forcing a restart.


Recently, while testing new versions of the Stable Diffusion model, I found that Colab's policies made the various WebUI versions difficult to run, so I started researching how to write my own test programs in Colab. Eventually I understood that Colab is essentially a VM that can execute a limited set of system commands as well as Python programs in Jupyter. Since then, using Colab has become much simpler for me; the only unfamiliar part is the AI itself. To avoid the disconnections, which were probably caused by processes running for too long, I started looking for ways to download and save the models first.

At that time, I couldn't find the "runwayml/stable-diffusion-v1-5" model on Colab, but I was very keen to use the model available for download from Civitai. That model was a single file, whereas StableDiffusionPipeline.from_pretrained() requires other related files. I finally succeeded in creating images from the pre-downloaded model by using from_single_file() instead.


After that experience, I dedicated more time to learning about Generative AI. Following a recommendation, I tested the CodeLlama-7b-hf model, which took a considerable amount of time to download and then upload to MyDrive. After testing it, however, I realized it was not the model I wanted to use.

This time, following the course's suggestion, I tested the flan-t5-base model. The "download, then upload" process was too time-consuming, so I tried downloading directly from Hugging Face by model name on Colab, which seemed much faster. After extensive searching on the internet, I finally discovered that Hugging Face models are stored under the ~/.cache directory. So I moved that directory to MyDrive, but then T5Tokenizer.from_pretrained() complained that the tokenizer files were missing. Checking the models--google--flan-t5-base directory, I indeed found no tokenizer.json at the top level; it was in a version subdirectory inside the snapshots subdirectory. So I moved the version subdirectory out and deleted the rest. When I tried again, however, T5Tokenizer still complained about the missing tokenizer.








At this point, I was somewhat perplexed: I could clearly see the tokenizer.json file. I decided to check the config.json file downloaded in Colab for clues and discovered that the structure of the cached copy differs from the online version on Hugging Face. The cache stores the data files in a separate directory (blobs). Therefore, when calling from_pretrained(), the model path must point to the version subdirectory under snapshots, while the complete folder structure is kept intact.
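One possible way to check this: in the standard Hugging Face cache layout, the files in a version subdirectory are symbolic links into blobs, so a file can be visible in a listing while its content is gone. Whether that is exactly what happened on MyDrive is an assumption. The sketch below uses a hypothetical helper, broken_links, to list links whose target no longer exists.

```python
from pathlib import Path


def broken_links(snapshot_dir):
    """Return names of entries in snapshot_dir that are symlinks with a missing target."""
    broken = []
    for path in sorted(Path(snapshot_dir).iterdir()):
        # exists() follows the link, so it is False when the target is gone.
        if path.is_symlink() and not path.exists():
            broken.append(path.name)
    return broken


# Example use (the path is an assumption, and <version> stands for the real folder name):
# print(broken_links("/root/.cache/huggingface/hub/"
#                    "models--google--flan-t5-base/snapshots/<version>"))
```

An empty result means every link in the version subdirectory still resolves; any name in the list is a file that from_pretrained() would not be able to read.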



So I downloaded the model again, moved the entire folder to MyDrive, and set the correct model path. This solved the problem of having to download the model repeatedly, and it will also make further training easier later on.

2024/03/10 update

The cache_dir parameter of from_pretrained() does the same thing:

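import torch
from transformers import CLIPTextModel

# model_path and cache_folder are defined elsewhere: the model name or path,
# and the folder (for example on MyDrive) to use as the cache.
text_encoder = CLIPTextModel.from_pretrained(
    model_path,
    subfolder="text_encoder",
    cache_dir=cache_folder,  # download to and load from this folder instead of ~/.cache
    num_hidden_layers=11,
    torch_dtype=torch.float16,
)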


