Atlas Wang: Democratizing LLM Training by Exploiting Low-Rank Gradients

July 7th, 2024, Open AGI Summit Brussels

Atlas Wang – Associate Professor at The University of Texas at Austin; XTY Labs

Full Session Recording:

Talk Notes

Here are the notes on the presentation Democratizing LLM Training by Exploiting Low-Rank Gradients delivered at the Open AGI Summit in Brussels last week:

The LLM Challenge:

  • LLMs are quite powerful, with the ability to handle conversations, generate content, power AI agents, perform reasoning, and sometimes even manage planning tasks.
  • Still, training LLMs can be prohibitively expensive:
    • Training and even fine-tuning the best LLMs require high-end GPUs, significant time, and substantial human resources.
    • These economic barriers put LLM training out of reach for individuals and even small countries.

Vision for Democratization:

  • Can LLM training be democratized using cheaper, consumer-grade GPUs?
  • You can get some insight into this by comparing high-end GPUs (e.g., H100, A100) vs. consumer-grade GPUs (e.g., 4090)

  • When you do this, you find that the main differences are memory and communication capabilities, not computational power, which is not too different between a 4090 and an A100, for example.

Memory and Cost Analysis:

  • LLM training involves massive datasets and storage for model parameters and gradients.
  • High-end GPUs are significantly more expensive due to superior memory and communication features.

Is Scaling All You Need?

  • Current industry practice focuses on scaling models to solve many research challenges.
  • Is there any chance we can avoid hitting the GPU “memory wall” instead?
  • Yes, you can keep a large model but reduce memory requirements for gradients!
  • Technically, this is possible thanks to the inherent low-rank structure of gradients during training (see the toy illustration below).
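
As an illustration of this low-rank structure, here is a toy example in PyTorch (my own, not one from the talk): for a single linear layer, the weight gradient is a product of two thin matrices, so its rank is bounded by the batch size.

```python
import torch

# Toy illustration (not from the talk): for a linear layer y = x @ W, the
# weight gradient is x.T @ (dL/dy), a product of (d, B) and (B, d) matrices,
# so its rank is at most the batch size B -- structurally low-rank.
torch.manual_seed(0)
d, B = 512, 4
W = torch.randn(d, d, requires_grad=True)
x = torch.randn(B, d)
loss = (x @ W).pow(2).mean()
loss.backward()

print(torch.linalg.matrix_rank(W.grad).item())  # at most 4, though W.grad is 512x512
```

In full LLM training the exact bound is looser, but the talk's point is that the gradient still stays approximately low-rank, which is what the methods below exploit.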

Parameter-Efficient Fine-Tuning: LoRA

  • LoRA is one of the most familiar low-rank techniques in ML
  • LoRA is a method for performing fine-tuning more efficiently
  • In LoRA, instead of directly updating the weights W of a pre-trained language model, you express the updated weights as W0 + BA through matrix factorization (where B and A are separate low-rank matrices whose product approximates the weight update)
  • Then you fine-tune B and A instead of W directly (B and A being much smaller matrices), which saves a lot of memory (see the sketch below).
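
A minimal sketch of the idea in PyTorch (illustrative class and hyperparameter names, not a particular library's API): the pre-trained weight W0 is frozen, and only the small factors A and B receive gradients.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA-style linear layer: effective weight is W0 + B @ A."""
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        # Frozen pre-trained weight W0 (random here, purely for illustration).
        self.weight = nn.Parameter(torch.randn(out_features, in_features),
                                   requires_grad=False)
        # Trainable low-rank factors; B starts at zero so training begins
        # from the original model.
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Base output plus the low-rank update, without materializing B @ A.
        return x @ self.weight.T + self.scale * ((x @ self.A.T) @ self.B.T)
```

For a 4096x4096 weight at rank 8, A and B together hold about 65K trainable parameters versus roughly 16.8M frozen ones, which is where the savings in gradient and optimizer-state memory come from.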

Limitations of LoRA:

  • LoRA can’t train models from scratch
  • LoRA is also somewhat ad hoc and changes the optimization objective

New Method: GaLore

  • A new method, Gradient Low-Rank Projection (GaLore), allows you to save memory during both pre-training and fine-tuning.
  • This allows you, for the first time, to pre-train a LLaMA 7B model with a memory cost under 24 GB, i.e., within the memory of a single RTX 4090 (see the rough estimate below).
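
To see why staying under 24 GB is notable, here is a rough back-of-envelope estimate (my own numbers, assuming bf16 storage for weights, gradients, and Adam moments, and ignoring activations):

```python
# Rough estimate for standard Adam training of a 7B-parameter model in bf16
# (2 bytes per value); activations and workspace memory are not counted.
params = 7e9
weights    = params * 2        # ~14 GB
gradients  = params * 2        # ~14 GB
adam_state = params * 2 * 2    # ~28 GB for Adam's two moment buffers
print((weights + gradients + adam_state) / 1e9)  # ~56 GB, far above 24 GB
```

GaLore attacks the gradient/optimizer-state portion of this budget by keeping those quantities at a small rank rather than at full size.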

How does GaLore Work?

  • GaLore Algorithm Steps:
    • Compute gradients in the original space.
    • Project the gradient into a low-rank space.
    • Update in the projected space, then project back to the weight space (a sketch of one update step follows this list).
    • GaLore is compatible with mainstream deep learning optimizers like AdamW.
    • The authors prove that the gradient remains low-rank during training.

  • GaLore is popular in the open-source community: it has been implemented by leading open-source libraries and frameworks, including Hugging Face, PyTorch, and LLaMA-Factory
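
A minimal sketch of one GaLore-style update for a single weight matrix (names and defaults are illustrative, not the official implementation; Adam bias correction and weight decay are omitted):

```python
import torch

def galore_adam_step(weight, grad, exp_avg, exp_avg_sq, proj, step,
                     rank=4, update_proj_gap=200, lr=1e-3,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative GaLore-style Adam update for an (m, n) weight matrix."""
    # 1. Periodically refresh the projection basis from the gradient's top
    #    singular vectors (the gradient is approximately low-rank).
    if proj is None or step % update_proj_gap == 0:
        U, _, _ = torch.linalg.svd(grad, full_matrices=False)
        proj = U[:, :rank]                       # (m, r) orthonormal basis

    # 2. Project the full gradient into the low-rank space.
    low_rank_grad = proj.T @ grad                # (r, n) instead of (m, n)

    # 3. Standard Adam moment updates, kept entirely in the small space, so
    #    the optimizer states cost O(r * n) memory instead of O(m * n).
    exp_avg.mul_(beta1).add_(low_rank_grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(low_rank_grad, low_rank_grad, value=1 - beta2)
    update = exp_avg / (exp_avg_sq.sqrt() + eps)

    # 4. Project the update back to the weight space and apply it.
    weight.add_(proj @ update, alpha=-lr)
    return proj
```

Here `exp_avg` and `exp_avg_sq` are allocated with shape `(rank, n)`, which is where the saving over standard Adam (whose moments have shape `(m, n)`) comes from.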

GaLore Meets Quantization:

  • Integrated with quantization to reduce memory cost further.
  • Some parts of the GaLore algorithm, such as the projection step, can be quantized to very low bit-widths (a rough sketch of the idea follows below).

  • This further optimization allows you to pre-train LLaMA on a $499 RTX 4060 Ti with only 16 GB of memory.
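
As a rough illustration of that idea (my own sketch of int8 quantization for the projection matrix, not the exact scheme presented in the talk), the projection basis can be stored in 8 bits and dequantized on the fly when projecting gradients:

```python
import torch

def quantize_int8(t):
    # Simple symmetric per-tensor quantization to int8.
    scale = t.abs().max() / 127.0
    q = (t / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.to(torch.float32) * scale

proj = torch.randn(4096, 64)            # fp32 projection basis (~1 MB)
q_proj, scale = quantize_int8(proj)     # stored at 1 byte/value (4x smaller)

grad = torch.randn(4096, 11008)
low_rank_grad = dequantize_int8(q_proj, scale).T @ grad   # (64, 11008)
```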

Conclusion:

  • Such optimizations will help democratize AI development by making it possible for anyone to train these models on their own computer.