Series Introduction: Using AI to build AI: Intro
As part of this series on Using AI to build AI, I want to start with AgentInstruct, a highly relevant project and a breakthrough in this line of thought. In AgentInstruct, agents create high-quality synthetic data from raw data, which is then used to train a new LLM: Orca-3.
Motivation of AgentInstruct:
-
The quality of training data (achieved through data curation) has become an important determinant of the performance of SOTA LLMs.
-
Today, data curation requires a lot of manual work from human researchers. This is why projects like DataComp-LM have so many authors: building an LLM takes many researchers manually processing data for training.
-
AgentInstruct demonstrates how you can use agents instead of human researchers to do this data processing.
AgentInstruct:
The goal of AgentInstruct by Mitra et al. is to generate large-scale, diverse, and high-quality synthetic data via agents rather than manual human annotation. Here is a summary of how AgentInstruct works:
-
First the researchers identify 17 different desired high level skills they want to generate data for. Think of this as the buckets of capabilities that they want to teach the final LLM through the synthetic data. Skills include reading comprehension, web usage, multiple-choice question answering ability, RAG, tool use, analytical reasoning and more.
-
Then, for each of these 17 high-level skills, the researchers create three flows that go from raw data (seed content) to high-quality instructions. Note that the focus here is on creating prompts from content, not content from prompts. These flows are presented in the diagram below:
-
(Content Transformation Flow) In the first flow, carefully designed agents create content better suited for training than the original raw content seeds (source code, internet text excerpts, etc…).
-
For example, for the reading comprehension skill, nine content transformation agents take a raw data seed and turn it into argument passages, debates, conversations, long passages, meeting transcripts, poems, etc…
-
(Seed Instruction Generation Flow) In the second flow, agents with a predefined taxonomy (specialized agents) take the content created in the previous flow and generate a diverse set of instructions (prompts) about this content.
-
This flow often relies on predefined instruction types relevant to a given skill, which the agents already understand.
-
For example, for reading comprehension, the researchers compiled a collection of 43 reading comprehension question types.
-
An example of a seed instruction would be like this:
-
(Instruction Refinement Flow) In the third flow, suggester-editor agents go through each of the prompts generated in the last step and suggest ways to make the questions harder to answer.
-
The researchers used AgentInstruct to create 25 million prompt-response pairs to fine-tune Mistral-7b to produce Orca-3.
-
This new model achieves large gains on reasoning benchmarks (and others), as you can see in the figure below.
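To make the three flows more concrete, here is a minimal sketch of how they chain together. It is an illustration under simplifying assumptions, not the authors' implementation: each agent is a plain Python function standing in for an LLM call, and every name (`as_debate`, `as_transcript`, `make_question`, `suggest`, `edit`, `run_flows`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in: an "agent" is a text-to-text function. In a real
# system each of these would wrap an LLM call with its own prompt.
Agent = Callable[[str], str]


@dataclass
class Instruction:
    skill: str
    content: str  # transformed content the prompt is grounded in
    prompt: str   # the instruction shown to the model being trained


# Flow 1 agents (content transformation); toy string templates only.
def as_debate(seed: str) -> str:
    return f"Debate: two speakers argue about '{seed}'."


def as_transcript(seed: str) -> str:
    return f"Meeting transcript: a team discusses '{seed}'."


# Flow 2 agent (seed instruction generation) for one question type.
def make_question(content: str, question_type: str) -> str:
    return f"[{question_type}] Read the passage and answer: {content}"


# Flow 3 agents (instruction refinement): a suggester and an editor.
def suggest(prompt: str) -> str:
    return "justify the answer with evidence from two separate sentences"


def edit(prompt: str, suggestion: str) -> str:
    return f"{prompt} Also, {suggestion}."


def run_flows(
    seed: str,
    skill: str,
    transformers: List[Agent],
    question_types: List[str],
) -> List[Instruction]:
    # Flow 1: one raw seed becomes several training-friendly texts.
    contents = [transform(seed) for transform in transformers]
    # Flow 2: one seed instruction per (content, question type) pair.
    seed_instructions = [
        Instruction(skill, content, make_question(content, qt))
        for content in contents
        for qt in question_types
    ]
    # Flow 3: the suggester proposes a harder variant, the editor applies it.
    return [
        Instruction(i.skill, i.content, edit(i.prompt, suggest(i.prompt)))
        for i in seed_instructions
    ]


instructions = run_flows(
    seed="City council votes to expand bike lanes",
    skill="reading comprehension",
    transformers=[as_debate, as_transcript],
    question_types=["main idea", "inference"],
)
for item in instructions:
    print(item.prompt)
```

With two transformation agents and two question types, a single seed fans out into four refined instructions. That fan-out is how a modest pool of raw seeds can grow into a very large instruction set.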
Thoughts and Conclusions:
This was a thought provoking paper; some of my thoughts are below:
-
Manual work of this kind is likely to be heavily automated by agents, and data curation is clearly a good candidate for that automation.
-
This paper shows how you can use humans to create agents that create data that can in turn be used to improve an LLM.
A final thought:
-
Based on this system, can you build a competitive framework in which different people compete to make the best data-curation agents?
-
If you want competing systems that build towards this, you could assess the quality of each system's data curation by evaluating the performance of a model fine-tuned on the data it produces.
-
This could lead to clear and open competition among data-curation agent designers. As data-curation agents create value for model builders, we can expect a booming market for data-curation agents, which enables more powerful and personalized LLMs.
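As a rough sketch of what such a competition could look like, the snippet below ranks curation systems by the benchmark score of a model trained on each system's data. Everything here is a toy assumption made for illustration: `rank_curators`, the two curators, the lookup-table "fine-tuning", and the four-question arithmetic benchmark are all hypothetical. In a real setup the benchmark would be held out from the seeds, which this toy does not do.

```python
from typing import Callable, Dict, List, Tuple

Dataset = List[Tuple[str, str]]            # (prompt, response) pairs
Curator = Callable[[List[str]], Dataset]   # raw seeds -> synthetic dataset
Model = Callable[[str], str]               # prompt -> response


def rank_curators(
    curators: Dict[str, Curator],
    seeds: List[str],
    fine_tune: Callable[[Dataset], Model],
    evaluate: Callable[[Model], float],
) -> List[Tuple[str, float]]:
    """Score each curator by the model its data produces, best first."""
    scores = {}
    for name, curate in curators.items():
        model = fine_tune(curate(seeds))
        scores[name] = evaluate(model)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy stand-ins (hypothetical): "fine-tuning" just memorizes the pairs.
def toy_fine_tune(data: Dataset) -> Model:
    table = dict(data)
    return lambda prompt: table.get(prompt, "")


BENCHMARK = [("2+3", "5"), ("4+4", "8"), ("7+1", "8"), ("9+0", "9")]


def toy_evaluate(model: Model) -> float:
    correct = sum(model(prompt) == answer for prompt, answer in BENCHMARK)
    return correct / len(BENCHMARK)


def careful_curator(seeds: List[str]) -> Dataset:
    # Produces a correct response for every seed.
    return [(s, str(sum(int(x) for x in s.split("+")))) for s in seeds]


def sloppy_curator(seeds: List[str]) -> Dataset:
    # Only processes the first half of the seeds.
    return careful_curator(seeds[: len(seeds) // 2])


leaderboard = rank_curators(
    {"sloppy": sloppy_curator, "careful": careful_curator},
    seeds=["2+3", "4+4", "7+1", "9+0"],
    fine_tune=toy_fine_tune,
    evaluate=toy_evaluate,
)
print(leaderboard)
```

The point of the design is that curators never see the evaluation step: they only submit a function from seeds to data, and the score comes entirely from the downstream model.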
Sentient’s vision of decentralized AGI clearly aligns with this kind of research. We encourage anyone who is interested to reach out to participate!