A Season of Reasoning + Using AI to build AI Series Post
LLMs get their intelligence from large-scale pre-training on internet-scale datasets. Which dataset to use is a critical question, since data is arguably the main input driving LLM performance today.
The open-source Phi series of models has pursued a distinctive approach to building datasets for LLM training: fundamentally, it asks whether you can train powerful LLMs on much smaller but much higher-quality datasets. Here is a timeline of Phi model development that introduces the work behind this series.
Phi Timeline:
- May 2023: TinyStories shows that a ~10M-parameter model can write reasonably coherent English when trained on a high-quality dataset curated with GPT-4. The TinyStories dataset is built from a tiny vocabulary of just 3,000 words, split roughly evenly between nouns, verbs, and adjectives; an LLM is then prompted, millions of times over, to write a children's story using one noun, one verb, and one adjective drawn from the list (a sketch of this generation loop appears after the timeline).
- June 2023: Textbooks Are All You Need is the first paper to show that a small model (phi-1, at 1.3B parameters) can reach strong performance when trained on a high-quality dataset of only 7B tokens. GPT is used to augment and curate the dataset (see the filtering sketch after the timeline).
- Sep 2023: Phi-1.5 is released. The model size is unchanged, but the dataset is scaled up (phi-1's 7B tokens → 100B tokens for phi-1.5-web), with significantly improved performance.
- Dec 2023: Phi-2 is released. Model size (1.3B → 2.7B) and dataset size (7B → 1.4T tokens) both increase. Note that up through Phi-2 there is no instruction fine-tuning; everything above is pre-training only, which makes the performance all the more impressive. Here is a useful slide on this.
- May 2024: Phi-3 is released. Model sizes grow (2.7B → phi-3-mini at 3.8B, phi-3-small at 7B, and phi-3-medium at 14B), so Phi is no longer just a small-model family. The data size also increases (1.4T → 4.8T tokens), and a multimodal (LMM) variant is introduced. Phi-3-medium is on par with GPT-3.5 in terms of performance.
- Aug 2024: Phi-3.5 is released, including a Mixture-of-Experts version to boost performance.
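
To make the TinyStories recipe concrete, below is a minimal sketch of the generation loop described in the first timeline entry. The word lists, prompt wording, model name, and use of an OpenAI-style chat client are my own illustrative assumptions, not the paper's actual code or vocabulary.

```python
import random

from openai import OpenAI  # assumes the `openai` Python package (v1+) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Tiny illustrative word lists; the real TinyStories vocabulary is far larger
# and chosen to match words a young child would understand.
NOUNS = ["dog", "ball", "tree", "cake", "boat"]
VERBS = ["jump", "find", "share", "build", "sing"]
ADJECTIVES = ["happy", "tiny", "red", "brave", "sleepy"]

def generate_story() -> str:
    """Sample one (noun, verb, adjective) triple and ask an LLM for a short story."""
    noun = random.choice(NOUNS)
    verb = random.choice(VERBS)
    adjective = random.choice(ADJECTIVES)
    prompt = (
        "Write a short, simple story that a 3-4 year old could follow. "
        f"The story must use the noun '{noun}', the verb '{verb}', "
        f"and the adjective '{adjective}'."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name, not what the paper used
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # Repeating this loop millions of times (as the paper does) yields a large,
    # diverse, but vocabulary-constrained pre-training corpus.
    for _ in range(3):  # tiny demo run
        print(generate_story())
        print("---")
```

Constraining every story to a small vocabulary while randomizing the word triple is what keeps the corpus both simple enough for a 10M-parameter model and diverse enough to be worth training on.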
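The phi-1 bullet mentions using GPT to curate the data. Here is a rough sketch of what LLM-based quality filtering can look like; the judge prompt, score threshold, and model name are assumptions for illustration. (In the actual phi-1 pipeline, as I understand it, LLM quality annotations on a subset are used to train a cheaper classifier that then filters the full corpus, since judging every document with GPT-4 would be too expensive.)

```python
from openai import OpenAI  # same assumed OpenAI-style client as above

client = OpenAI()

JUDGE_PROMPT = (
    "On a scale of 1 to 10, rate the educational value of the following text "
    "for a student learning to write code. Reply with a single integer only.\n\n{doc}"
)

def educational_score(doc: str) -> int:
    """Ask an LLM judge to score a document's educational value (assumed prompt)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(doc=doc)}],
    )
    return int(response.choices[0].message.content.strip())

def curate(corpus: list[str], threshold: int = 7) -> list[str]:
    """Keep only documents the judge rates at or above the threshold."""
    return [doc for doc in corpus if educational_score(doc) >= threshold]
```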
My Thoughts on Phi:
It seems the "textbooks are all you need" mojo that Phi started with is showing diminishing returns, as the latest Phi models look more and more like general-purpose models. The move to a Mixture of Experts in Phi-3.5 in particular seems to go against the original thesis. Unless some new fundamental technique is invented, I expect the next Phi release to simply be larger and trained on more data, which would unfortunately be quite boring.