July 7th, 2024, Open AGI Summit Brussels
Atlas Wang – Associate Professor at The University of Texas at Austin – XTY Labs
Full Session Recording:
Talk Notes
Here are the notes on the presentation “Democratizing LLM Training by Exploiting Low-Rank Gradients,” delivered at the Open AGI Summit in Brussels last week:
The LLM Challenge:
- LLMs are quite powerful, with the ability to handle conversations, generate content, serve as AI agents, perform reasoning, and sometimes even manage planning tasks.
- Still, training LLMs can be prohibitively expensive:
- Training and even fine-tuning the best LLMs require high-end GPUs, significant time, and substantial human resources.
- These economic barriers make LLM training infeasible for individuals and even for small countries.
Vision for Democratization:
- Can LLM training be democratized using cheaper, consumer-grade GPUs?
- You can get some insight into this by comparing high-end GPUs (e.g., H100, A100) with consumer-grade GPUs (e.g., RTX 4090).
- When you do, you find that the main differences are memory and communication capabilities, not computational power, which is not too different between a 4090 and an A100, for example.
Memory and Cost Analysis:
- LLM training involves massive datasets and storage for model parameters and gradients.
- High-end GPUs are significantly more expensive due to superior memory and communication features.
Is Scaling All You Need?
- Current industry practice focuses on scaling models to solve many research challenges.
- Is there any chance we can avoid hitting the GPU “memory wall” instead?
- Yes, you can keep a large model but reduce the memory required for its gradients and optimizer states!
- Technically, this is possible thanks to an inherent low-rank phenomenon: the gradients that arise during LLM training tend to be approximately low-rank (see the toy demo below).
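As a quick illustration of that phenomenon, here is a toy PyTorch snippet (my own, not from the talk) that inspects the singular value spectrum of the gradient of a single weight matrix. Note that in this contrived setup the gradient's rank is also capped by the batch size:

```python
# Toy illustration (not from the talk): the gradient of a large weight
# matrix is often well approximated by a low-rank matrix.
import torch

torch.manual_seed(0)
W = torch.randn(1024, 1024, requires_grad=True)
x = torch.randn(256, 1024)            # a batch of 256 toy inputs
loss = (x @ W).pow(2).mean()          # toy quadratic loss
loss.backward()

# Inspect how fast the gradient's singular values decay.
s = torch.linalg.svdvals(W.grad)
mass = s.cumsum(0) / s.sum()
k = int((mass < 0.90).sum()) + 1
print(f"top {k} of {len(s)} singular values carry ~90% of the spectral mass")
```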
Parameter-Efficient Fine-tuning: LoRA
- LoRA is one of the most familiar low-rank techniques in ML
- LoRA is a method for performing fine-tuning more efficiently
- In LoRA, instead of updating the full weight matrix W0 of a pre-trained (already trained) language model directly, you express the fine-tuned weights as W0 + BA, where B and A are two separate matrices whose product approximates the weight update.
- Then you fine-tune B and A instead of W0 (B and A being much smaller matrices), keeping W0 frozen. This allows you to save lots of memory; a minimal sketch follows below.
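Here is a minimal sketch of the LoRA idea in PyTorch. It is an illustration under simplified assumptions (no dropout, no scaling tricks, no weight merging), not the implementation from the talk or from Hugging Face's peft library:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = base(x) + scale * x @ A^T @ B^T, with W0 frozen."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base.requires_grad_(False)                 # freeze pre-trained W0 (and bias)
        out_f, in_f = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(out_f, rank))        # B @ A = 0 at the start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable: {trainable:,} vs full matrix: {4096 * 4096:,}")
```

With rank 8 on a 4096×4096 layer, only 65,536 of the 16,777,216 weight entries (about 0.4%) are trainable.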
Limitations of LoRA:
- LoRA can’t train models from scratch
- LoRA is also somewhat ad hoc and changes the optimization objective
New Method: GaLore
- A new method, Gradient Low-Rank Projection (GaLore), allows you to save memory both during pre-training and fine-tuning.
- This allows you, for the first time, to pre-train a LLaMA 7B model with a memory cost under 24 GB, below the memory limit of an RTX 4090 (see the back-of-envelope estimate below).
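To see why 24 GB is remarkable, here is my own back-of-envelope arithmetic (not figures from the talk), assuming bf16 weights/gradients and fp32 Adam moments:

```python
# Rough full-rank memory estimate for LLaMA 7B under common precision choices.
params = 7e9                               # LLaMA 7B parameter count

weights = params * 2 / 1e9                 # bf16 weights:        ~14 GB
grads   = params * 2 / 1e9                 # bf16 gradients:      ~14 GB
adam    = params * 4 * 2 / 1e9             # fp32 m and v states: ~56 GB

print(f"full-rank AdamW total: ~{weights + grads + adam:.0f} GB")  # ~84 GB
# The optimizer states dominate. GaLore keeps them in a rank-r projected
# space, shrinking that term by roughly (matrix dimension / r) per layer.
```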
How does GaLore Work?
- GaLore Algorithm Steps:
- Compute gradients in the original space.
- Project gradient into low-rank space.
- Update in projected space and project back to the weight space.
- Compatible with mainstream deep learning optimizers like AdamW.
- The authors prove that the gradient remains low-rank during training, so the projection loses little information (a hand-rolled sketch of these steps follows after this list).
- GaLore is popular in the open-source community: it has been implemented by leading open-source projects, including Hugging Face, PyTorch, and LLaMA-Factory.
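To make the steps concrete, here is a hand-rolled, simplified GaLore-style Adam step in PyTorch. This is my own sketch of the algorithm described above, not the official implementation; it omits bias correction, weight decay, per-layer bookkeeping, and moment handling across projection refreshes:

```python
import torch

def galore_step(W, G, state, rank=4, lr=1e-3, beta1=0.9, beta2=0.999,
                eps=1e-8, update_proj_every=200):
    """One simplified GaLore-style Adam step (sketch, not the official code)."""
    # (1) periodically refresh the projection from the SVD of the current gradient
    if state["step"] % update_proj_every == 0:
        U, _, _ = torch.linalg.svd(G, full_matrices=False)
        state["P"] = U[:, :rank]             # top-r left singular vectors
    state["step"] += 1
    P = state["P"]

    # (2) project the gradient into the low-rank space
    R = P.T @ G                              # (rank, n) instead of (m, n)

    # (3) the Adam moments live in the small projected space
    state["m"] = beta1 * state["m"] + (1 - beta1) * R
    state["v"] = beta2 * state["v"] + (1 - beta2) * R**2
    N = state["m"] / (state["v"].sqrt() + eps)

    # (4) project the update back and apply it to the full weight matrix
    W -= lr * (P @ N)

m, n, rank = 256, 256, 4
W = torch.randn(m, n)
state = {"step": 0, "P": None,
         "m": torch.zeros(rank, n), "v": torch.zeros(rank, n)}
G = torch.randn(m, n)                        # stand-in for a real gradient
galore_step(W, G, state, rank=rank)          # optimizer state is rank x n, not m x n
```

The memory win comes from step (3): the optimizer states are rank × n instead of m × n, while the weights themselves stay full-rank.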
GaLore Meets Quantization:
- Integrated with quantization to reduce memory cost further.
- Some parts of the GaLore algorithm, such as the projection step, can be deeply quantized (a toy simulation follows below).
- This further optimization allows you to pre-train LLaMA on a $499 RTX 4060 Ti with only 16 GB of memory.
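As a toy illustration of quantizing the projection, the snippet below simulates 4-bit quantization of a projection matrix and measures the error it introduces. The bit width and symmetric rounding scheme are my assumptions for illustration; real quantized kernels pack two 4-bit values per byte and differ in detail:

```python
import torch

def fake_quant_int4(t):
    """Simulated 4-bit quantization (quantize then dequantize) of a tensor."""
    scale = t.abs().max() / 7                # map the max magnitude to int4 value 7
    q = (t / scale).round().clamp(-8, 7)     # int4 range is [-8, 7]
    return q * scale                         # dequantized values, 1/4 the storage

# A rank-4 projection matrix, as GaLore would derive from a gradient's SVD.
P = torch.linalg.svd(torch.randn(256, 256), full_matrices=False).U[:, :4]
P_q = fake_quant_int4(P)
rel = ((P - P_q).norm() / P.norm()).item()
print(f"relative projection error from 4-bit quantization: {rel:.3%}")
```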
Conclusion:
- Such optimizations will help democratize AI development by making it possible for anyone to train such models on their own computer.