Llama 3.1: The best opensource model in the world

...or the best model in the world, period?

Divyanshu Dixit
July 24, 2024

The launch of Llama 3.1 has the potential to finally drive a step change in the capabilities of foundation models, given how good it is out of the box and that it is open. It is not truly open source by definition, given the limited details on training data; the main public hint is Alexander Wang (CEO, Scale AI) confirming on X that “Scale Data Foundry was utilized to generate frontier data (SFT & RLHF data) to push the performance of Llama-3.1”. To their credit, Meta did release the weights and an accompanying paper explaining in detail how the model came to be. Respect!

Source: Meta

Not a fan of legacy evals? How about Scale’s SEAL evals? Llama 3.1 slots in just below Sonnet 3.5 and above GPT-4o.

🚀 We added Llama 3.1 405B onto the SEAL Leaderboards and it does not disappoint! Here's how it stacks up:
- 🥇 #1 in Instruction Following
- 🥈 #2 in GSM1k
- 💻 #4 in Coding
SEAL evals are private, expert evals that refresh periodically: scale.com/leaderboard
— Summer Yue (@summeryue0)
3:50 PM • Jul 23, 2024

Allen AI’s ZeroEval concurs:

Source: Allen AI - ZeroEval

If you were in the “fine-tuning is dead” camp, this release should inspire a rethink. Here are the results of the initial evals, courtesy of Kyle Corbitt (OpenPipe): “never been an open model this small, this good!”

Guys fine-tuned Llama 3.1 8B is completely cracked. Just ran it through our fine-tuning test suite and blows GPT-4o mini out of the water on every task.
There has never been an open model this small, this good. x.com/i/web/status/1…
— Kyle Corbitt (@corbtt)
7:20 PM • Jul 23, 2024

How about fine-tuned Llama 3.1 vs. fine-tuned GPT-4o mini (yes, OpenAI launched fine-tuning for GPT-4o mini just yesterday)? Llama 3.1 wins on chatbot responses and resume summarization, while underperforming on data extraction. Once again, huge thanks to Kyle @ OpenPipe for this great work!

Source: Kyle Corbitt on X

The kicker is the updated license: you can now use Llama 3.1 outputs for synthetic data generation and to improve other models.

We’ve also updated our license to allow developers to use the outputs from Llama models — including 405B — to improve other models for the first time.
We’re excited about how this will enable new advancements in the field through synthetic data generation and model distillation… x.com/i/web/status/1…
— AI at Meta (@AIatMeta)
3:10 PM • Jul 23, 2024

Alright, so it is good, even the best when it comes to instruction following. But how about cost? At the 405B size, the hardware surely cannot be cheap. Turns out, Fireworks blows everyone out of the water at $3 per million tokens for both input and output.

Source: Artificialanalysis.ai
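To make that price concrete, here is a minimal back-of-the-envelope sketch. It assumes the "3/3" figure means USD 3 per million input tokens and USD 3 per million output tokens; the `Pricing` class, the `request_cost` helper, and the token counts are hypothetical names and numbers chosen for illustration, not anything published by Fireworks or Artificial Analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-token pricing, quoted per one million tokens (assumed to be USD)."""
    input_per_million: float
    output_per_million: float


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Return the cost of a single request under the given pricing."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


# Assumption: "3/3" read as $3 per 1M input tokens and $3 per 1M output tokens.
ASSUMED_405B_PRICING = Pricing(input_per_million=3.0, output_per_million=3.0)

if __name__ == "__main__":
    # Hypothetical workload: 2,000 prompt tokens and 500 completion tokens per call.
    per_call = request_cost(ASSUMED_405B_PRICING, input_tokens=2_000, output_tokens=500)
    print(f"per call: ${per_call:.4f}")
    print(f"per 1M calls: ${per_call * 1_000_000:,.0f}")
```

Swapping in another provider's input and output prices makes the gap easy to compare for your own traffic mix.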

But how is Fireworks doing this? Beating the competition by this margin is quite something! Naturally, we have to speculate. How about this banger from Hamel: buy AMD?

Fireworks is cheaper than everyone else by a wide margin. Speculation is they are using AMD to dramatically lower costs? Pretty interesting
— Hamel Husain (@HamelHusain)
8:52 PM • Jul 23, 2024

Remember the chart below from my post earlier in April? Here is the updated version by Maxime Labonne at Liquid AI. See the gap closing between open-source and closed models on the upper right? It nicely contextualizes the significance of this release.

Source: Maxime Labonne on X

That should have everyone excited, yes? Well, not everyone. Enter the EU and its AI Act, right on cue.

Llama 3 405b is a "systemic risk" to society, according to the European Union and their AI Act.
— Dean W. Ball (@deanwball)
7:10 PM • Jul 23, 2024

But before you pass judgment, I highly recommend going through the 92-page paper Meta has released. It is one of the best in recent memory and extremely detailed on process. TL;DR: supervised fine-tuning (SFT) + synthetic data for the win!

Source: Meta

Notably, they have also open-sourced system-level, prompt-based filtering mechanisms: Prompt Guard and Code Shield. Given how good the paper is, I shall use their definitions as is:

Prompt Guard is a model-based filter designed to detect prompt attacks, which are input strings designed to subvert the intended behavior of an LLM functioning as part of an application. The model is a multi-label classifier that detects two classes of prompt attack risk - direct jailbreaks (techniques that explicitly try to override a model’s safety conditioning or system prompt) and indirect prompt injections (instances where third-party data included in a model’s context window includes instructions inadvertently executed as user commands by an LLM).

Code Shield is an example of a class of system-level protections based on providing inference-time filtering. In particular, it focuses on detecting the generation of insecure code before it might enter a downstream use case such as a production system.

Source: Meta
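To show where these two filters sit in an application, here is a toy sketch of the wrapping pattern: an input-side check before the model call and an output-side check after it. The regex lists are crude stand-ins I made up for illustration; the real Prompt Guard is a trained classifier and the real Code Shield is a dedicated code scanner, and neither is called here. `guarded_generate` and the `generate` callback are hypothetical names.

```python
import re
from typing import Callable, List

# Toy stand-in for a prompt-attack classifier (NOT Prompt Guard itself).
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"disregard (the )?system prompt", re.IGNORECASE),
]

# Toy stand-in for an insecure-code detector (NOT Code Shield itself).
INSECURE_CODE_PATTERNS = [
    re.compile(r"\beval\s*\("),
    re.compile(r"\bos\.system\s*\("),
    re.compile(r"verify\s*=\s*False"),
]


def looks_like_prompt_attack(text: str) -> bool:
    return any(p.search(text) for p in INJECTION_PATTERNS)


def looks_insecure(code: str) -> bool:
    return any(p.search(code) for p in INSECURE_CODE_PATTERNS)


def guarded_generate(
    prompt: str,
    context_docs: List[str],
    generate: Callable[[str, List[str]], str],
) -> str:
    """Wrap a model call with input-side and output-side filters."""
    # Direct jailbreak: the user's own prompt tries to override instructions.
    if looks_like_prompt_attack(prompt):
        return "[blocked: possible jailbreak in user prompt]"
    # Indirect injection: third-party documents carrying instructions are dropped.
    safe_docs = [d for d in context_docs if not looks_like_prompt_attack(d)]
    output = generate(prompt, safe_docs)
    # Inference-time filtering: flag insecure code before it moves downstream.
    if looks_insecure(output):
        return "[blocked: generated code flagged as insecure]"
    return output
```

In a real deployment, the two `looks_*` functions are the places where the released models and scanners would plug in; the control flow around them stays roughly the same.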

There are also capabilities such as tool use and multi-step reasoning (agentic systems) showcased within the paper. I recommend exploring some of it on their GitHub.

These are exciting times. We can now see a future where frontier foundation models are no longer owned by a few; maybe fine-tuning is not dead after all, and maybe Zuck means every word of his essay (also a great read). If audio/video is your thing, here is his interview with Rowan Cheung.

The pace at which AI is moving can be overwhelming and there is always a feeling of not having enough time. But do not worry, I got you!

If you have reached this far and have not subscribed, please do. If you like what you read, please tell others - knowledge shared is knowledge gained!

Also, feel free to share your thoughts and ideas in the comments section, or reach me at:

X: @divyanshu_van

LinkedIn: Divyanshu Dixit

Until next time, Happy Reading!
