July 7th, 2024, Open AGI Summit Brussels
Atlas Wang – Associate Professor at The University of Texas at Austin – XTY Labs
Full Session Recording:
Talk Notes
Here are the notes on the presentation “Democratizing LLM Training by Exploiting Low-Rank Gradients,” delivered at the Open AGI Summit in Brussels last week:
The LLM Challenge:
- LLMs are quite powerful, with the ability to handle conversations, generate content, serve as AI agents, perform reasoning, and sometimes even manage planning tasks.
- Still, training LLMs can be prohibitively expensive:
- Training and even fine-tuning the best LLMs require high-end GPUs, significant time, and substantial human resources.
- These economic barriers make LLM training infeasible for individuals and even for small countries.
Vision for Democratization:
- Can LLM training be democratized using cheaper, consumer-grade GPUs?
- You can get some insight into this by comparing high-end GPUs (e.g., H100, A100) with consumer-grade GPUs (e.g., RTX 4090).
- When you do, you find that the main differences are memory and communication capabilities, not computational power, which is not too different between a 4090 and an A100, for example.
Memory and Cost Analysis:
- LLM training involves massive datasets and storage for model parameters and gradients.
- High-end GPUs are significantly more expensive due to superior memory and communication features.
Is Scaling All You Need?
- Current industry practice focuses on scaling models to solve many research challenges.
- Is there any chance we can avoid hitting the GPU “memory wall” instead?
- Yes, you can keep a large model but reduce the memory required for its gradients and optimizer states!
- Technically, this is possible thanks to an inherent low-rank phenomenon: the gradients that arise during LLM training tend to be approximately low-rank (see the toy demo below).
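As a quick illustration of that phenomenon, here is a toy PyTorch snippet (my own, not from the talk) that inspects the singular value spectrum of the gradient of a single weight matrix. Note that in this contrived setup the gradient's rank is also capped by the batch size:

```python
# Toy illustration (not from the talk): the gradient of a large weight
# matrix is often well approximated by a low-rank matrix.
import torch

torch.manual_seed(0)
W = torch.randn(1024, 1024, requires_grad=True)
x = torch.randn(256, 1024)            # a batch of 256 toy inputs
loss = (x @ W).pow(2).mean()          # toy quadratic loss
loss.backward()

# Inspect how fast the gradient's singular values decay.
s = torch.linalg.svdvals(W.grad)
mass = s.cumsum(0) / s.sum()
k = int((mass < 0.90).sum()) + 1
print(f"top {k} of {len(s)} singular values carry ~90% of the spectral mass")
```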
Parameter-Efficient Fine-tuning: LoRA
- LoRA is one of the most familiar low-rank techniques in ML
- LoRA is a method for performing fine-tuning more efficiently
- In LoRA, instead of updating the full weight matrix W0 of a pre-trained (already trained) language model directly, you express the fine-tuned weights as W0 + BA, where B and A are two separate matrices whose product approximates the weight update.
- Then you fine-tune B and A instead of W0 (B and A being much smaller matrices), keeping W0 frozen. This allows you to save lots of memory; a minimal sketch follows below.
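Here is a minimal sketch of the LoRA idea in PyTorch. It is an illustration under simplified assumptions (no dropout, no scaling tricks, no weight merging), not the implementation from the talk or from Hugging Face's peft library:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = base(x) + scale * x @ A^T @ B^T, with W0 frozen."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base.requires_grad_(False)                 # freeze pre-trained W0 (and bias)
        out_f, in_f = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(out_f, rank))        # B @ A = 0 at the start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable: {trainable:,} vs full matrix: {4096 * 4096:,}")
```

With rank 8 on a 4096×4096 layer, only 65,536 of the 16,777,216 weight entries (about 0.4%) are trainable.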
Limitations of LoRA:
- LoRA can’t train models from scratch
- LoRA is also somewhat ad hoc and changes the optimization objective
New Method: GaLore
- A new method, Gradient Low-Rank Projection (GaLore), allows you to save memory both during pre-training and fine-tuning.
- This allows you, for the first time, to pre-train a LLaMA 7B model with a memory cost under 24 GB, below the memory limit of an RTX 4090 (see the back-of-envelope estimate below).
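To see why 24 GB is remarkable, here is my own back-of-envelope arithmetic (not figures from the talk), assuming bf16 weights/gradients and fp32 Adam moments:

```python
# Rough full-rank memory estimate for LLaMA 7B under common precision choices.
params = 7e9                               # LLaMA 7B parameter count

weights = params * 2 / 1e9                 # bf16 weights:        ~14 GB
grads   = params * 2 / 1e9                 # bf16 gradients:      ~14 GB
adam    = params * 4 * 2 / 1e9             # fp32 m and v states: ~56 GB

print(f"full-rank AdamW total: ~{weights + grads + adam:.0f} GB")  # ~84 GB
# The optimizer states dominate. GaLore keeps them in a rank-r projected
# space, shrinking that term by roughly (matrix dimension / r) per layer.
```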
How does GaLore Work?
- GaLore Algorithm Steps:
- Compute gradients in the original space.
- Project gradient into low-rank space.
- Update in projected space and project back to the weight space.
- Compatible with mainstream deep learning optimizers like AdamW.
- The authors prove that the gradient remains low-rank during training, so the projection loses little information (a hand-rolled sketch of these steps follows after this list).
- GaLore is popular in the open-source community: it has been implemented by leading open-source projects, including Hugging Face, PyTorch, and LLaMA-Factory.
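To make the steps concrete, here is a hand-rolled, simplified GaLore-style Adam step in PyTorch. This is my own sketch of the algorithm described above, not the official implementation; it omits bias correction, weight decay, per-layer bookkeeping, and moment handling across projection refreshes:

```python
import torch

def galore_step(W, G, state, rank=4, lr=1e-3, beta1=0.9, beta2=0.999,
                eps=1e-8, update_proj_every=200):
    """One simplified GaLore-style Adam step (sketch, not the official code)."""
    # (1) periodically refresh the projection from the SVD of the current gradient
    if state["step"] % update_proj_every == 0:
        U, _, _ = torch.linalg.svd(G, full_matrices=False)
        state["P"] = U[:, :rank]             # top-r left singular vectors
    state["step"] += 1
    P = state["P"]

    # (2) project the gradient into the low-rank space
    R = P.T @ G                              # (rank, n) instead of (m, n)

    # (3) the Adam moments live in the small projected space
    state["m"] = beta1 * state["m"] + (1 - beta1) * R
    state["v"] = beta2 * state["v"] + (1 - beta2) * R**2
    N = state["m"] / (state["v"].sqrt() + eps)

    # (4) project the update back and apply it to the full weight matrix
    W -= lr * (P @ N)

m, n, rank = 256, 256, 4
W = torch.randn(m, n)
state = {"step": 0, "P": None,
         "m": torch.zeros(rank, n), "v": torch.zeros(rank, n)}
G = torch.randn(m, n)                        # stand-in for a real gradient
galore_step(W, G, state, rank=rank)          # optimizer state is rank x n, not m x n
```

The memory win comes from step (3): the optimizer states are rank × n instead of m × n, while the weights themselves stay full-rank.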
GaLore Meets Quantization:
- Integrated with quantization to reduce memory cost further.
- Some parts of the GaLore algorithm, such as the projection step, can be deeply quantized (a toy simulation follows below).
- This further optimization allows you to pre-train LLaMA on a $499 RTX 4060 Ti with only 16 GB of memory.
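As a toy illustration of quantizing the projection, the snippet below simulates 4-bit quantization of a projection matrix and measures the error it introduces. The bit width and symmetric rounding scheme are my assumptions for illustration; real quantized kernels pack two 4-bit values per byte and differ in detail:

```python
import torch

def fake_quant_int4(t):
    """Simulated 4-bit quantization (quantize then dequantize) of a tensor."""
    scale = t.abs().max() / 7                # map the max magnitude to int4 value 7
    q = (t / scale).round().clamp(-8, 7)     # int4 range is [-8, 7]
    return q * scale                         # dequantized values, 1/4 the storage

# A rank-4 projection matrix, as GaLore would derive from a gradient's SVD.
P = torch.linalg.svd(torch.randn(256, 256), full_matrices=False).U[:, :4]
P_q = fake_quant_int4(P)
rel = ((P - P_q).norm() / P.norm()).item()
print(f"relative projection error from 4-bit quantization: {rel:.3%}")
```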
Conclusion:
- Such optimizations will help democratize AI development by making it possible for anyone to train such models on their own computer.