I’ve been working to containerize various open-source large language models throughout this year, with varying success. I recently got the Llama 2 model containerized with Apptainer and was running some tests with it on GCP, and thought I’d put together a short blog post summarizing where we’re at and how we got here.
This has been a long year for open source large language model development. We opened the year with GPT-J and GPT-NeoX-20B as the standard for the field, with many companies basing their own models on these. Many companies were also using these as the standard offering for their fine-tuning services, where you can give them a formatted dataset, pick which model you want to fine-tune, and then they do the heavy lifting of setting up the backend. There are other models in the space that have had a strong following, like BLOOM, but generally the former two are what most chat-type applications were based on, owing to their ease of access and consistently positive results in chat applications.
However, these models were starting to age a little relative to the state of proprietary tech, having originally been released last year. Codebases had been built up around them to let people easily get up and running with fine-tuning, but after sitting for so long (some of the repos in question were created, became a standard, and then weren't updated for close to a year), they had accumulated enough versioning problems that it was difficult to make much progress beyond the basic creation of a test environment. Around this same time, I noticed a number of "open-source" models based on these two being released by various companies, but that were ultimately tied to their cloud platform, custom AI chip, etc.
Things changed with the leak of the Llama model from Meta earlier this year. While Meta originally announced it as something they would only make available to researchers and other similarly vetted people and institutions, the model weights leaked about a week after launch. A new standard appeared overnight, and a race was on to produce a commercially viable version that could be used without relying on the leaked weights. That arrived with a model called OpenLlama a month or so later, and while I made some early experiments with it, the space continued to move pretty fast and my attention was pulled elsewhere before I could do much more than put together a basic container for a training toolkit that had been implemented with OpenLlama.
I saw not too long ago that Meta had released their Llama 2 model under a permissive license that allows commercial use, and shelved that information for a little while. Recently, I was able to return to this project to follow up on that development - with great success. I've got the Llama 2 model containerized (just add model files from Meta) and deployed onto GPUs on Rocky Linux on GCP. Let's discuss this more.
Firstly, I requested the model files from Meta at this link. This gets you an emailed link that allows you to download the models. I spun up an a2-ultragpu-1g instance on GCP in a region where I have access to a single A100 80GB GPU, and attached about 2TB of block storage to it. It's useful to put your model downloads (which are quite large) onto some kind of detachable block storage, so you can reattach the storage to GPU instances of different sizes to support the different parameter-size versions of each model. After pulling the model's GitHub repo and using the link and instructions Meta emailed to download the models themselves, I started testing the sample code they provide. I did a quick install of the necessary CUDA toolkit components to get the GPU working, along with the requirements Meta lists in their repo, and was pleased to find that the examples "just worked" on the instance once everything was set.
Encouraged, I replicated the same processes that I had just done on the command line in an Apptainer definition file, and came to:
Bootstrap: docker
From: nvidia/cuda:12.0.1-cudnn8-devel-rockylinux8

%post
    dnf -y update
    dnf -y install git python39 python39-devel
    mkdir /llama-farm && cd /llama-farm
    git clone https://github.com/facebookresearch/llama.git
    git clone https://github.com/facebookresearch/codellama.git
    cd llama
    python3.9 -m pip install --upgrade pip
    python3.9 -m pip install -e .
    dnf clean all -y

%environment
    export PATH=/usr/local/bin:$PATH
Let’s explore this, before jumping into how to use this container.
At the top, we have the bootstrap section, which tells Apptainer what kind of source the base image for the container will come from and where to find it:

Bootstrap: docker
From: nvidia/cuda:12.0.1-cudnn8-devel-rockylinux8
In this case, we’re pulling a container from the Docker Hub, specifically one based on Rocky Linux 8 that has the CUDA toolkit and NVIDIA GPU tooling installed inside of it, along with the CuDNN packages. When we build our container, Apptainer will reach out to the Docker Hub and pull the Docker image to start building the rest of the container off of.
Once the base image is pulled, the %post section will start. This section is a series of commands run in the context of the base image, allowing it to be modified as the container builds.
dnf -y update
dnf -y install git python39 python39-devel
In this case, we start by updating the packages in the container and installing Python 3.9, the version of Python that the packages Llama requires are available for.
We then set up a directory to hold the Llama example code and code in general, before pulling the repos from GitHub:
mkdir /llama-farm && cd /llama-farm
git clone https://github.com/facebookresearch/llama.git
git clone https://github.com/facebookresearch/codellama.git
After we’ve pulled the repos, we’ll go ahead and install the Llama dependencies through Python 3.9, and clean the dnf cache to reduce the container size.
cd llama
python3.9 -m pip install --upgrade pip
python3.9 -m pip install -e .
dnf clean all -y
This is overall a fairly simple container, but combined with the Llama model files we downloaded from Meta with the link they provided, it lets us run the model examples as shown below.
We can build this container:
apptainer build --nv llama-2.sif llama-2.def
And once that’s finished, we can run one of the model examples. In this case, we’ll use the chat completion example with the 7B version of the chat-optimized model. From the directory that you have the Meta scripts in, which will probably also be the directory you downloaded the Meta models into, run:
apptainer exec --nv /mnt/llama-barn/llama-2.sif torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 6
Change the path to the container, script, model, and tokenizer accordingly. You may also have to add a --bind option mapping where your models, scripts, etc. are stored on disk to where you need them to appear in the container. After running this, we can see via nvidia-smi that the GPU is being used to run inference from the container:
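For reference, example_chat_completion.py feeds the model a list of dialogs, where each dialog is a list of role/content messages. A minimal sketch of that structure (the roles and keys follow the repo's example script; the validation helper here is my own, hypothetical addition):

```python
# Each dialog is a list of messages; roles are "system", "user", "assistant".
# This mirrors the structure used by example_chat_completion.py.
dialogs = [
    [{"role": "user", "content": "What is Apptainer?"}],
    [
        {"role": "system", "content": "Answer in one sentence."},
        {"role": "user", "content": "What is Rocky Linux?"},
    ],
]

def validate_dialog(dialog):
    """Hypothetical sanity check: a dialog may start with an optional
    system message, must then alternate user/assistant turns, and must
    end on a user turn so the model has something to answer."""
    turns = dialog[1:] if dialog and dialog[0]["role"] == "system" else dialog
    if not turns or turns[-1]["role"] != "user":
        return False
    expected = ["user", "assistant"]
    return all(m["role"] == expected[i % 2] for i, m in enumerate(turns))

assert all(validate_dialog(d) for d in dialogs)
```

Editing the dialogs list in the example script is the quickest way to try your own prompts against the containerized model.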
+-----------------------------------------------------------------------+
| Processes:                                                            |
|  GPU   GI   CI        PID   Type   Process name            GPU Memory |
|        ID   ID                                             Usage      |
|=======================================================================|
|    0   N/A  N/A      2956      C   /usr/bin/python3.9        14942MiB |
+-----------------------------------------------------------------------+
From here, you can easily begin to use the container to iterate on your own code based on the Llama model, deploying onto new GPU resources of different scales to support the different parameter-size models simply by running this same container on them.
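As you iterate on your own code, it helps to know the raw prompt format the chat-optimized models were trained on, which the repo's chat completion code assembles from those role/content messages. A rough sketch of a single-turn prompt (the tag strings match those used in Meta's llama repo; the helper function itself is hypothetical):

```python
# Tag strings used by the Llama 2 chat format (as in Meta's llama repo).
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def format_single_turn(user_msg, system_msg=None):
    """Hypothetical helper: render one user turn as a Llama 2 chat prompt.
    The tokenizer adds the beginning-of-sequence token itself."""
    content = user_msg
    if system_msg is not None:
        # A system prompt is folded into the first user message.
        content = f"{B_SYS}{system_msg}{E_SYS}{user_msg}"
    return f"{B_INST} {content.strip()} {E_INST}"

prompt = format_single_turn("What is Apptainer?", "Answer briefly.")
```

Understanding this format is useful if you move beyond the example scripts, say to a text completion workflow where you build prompts yourself.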
Keep an eye out for further articles from CIQ about cutting-edge use cases running on Apptainer.
Forrest Burt is an HPC systems engineer at CIQ, where he works in-depth with containerized HPC and the Fuzzball platform. He was previously an HPC system administrator while a student at Boise State University, supporting campus and national lab researchers on the R2 and Borah clusters while obtaining a B.S. in computer science.