Inference Speed Across 30+ LLMs

If I told you the fastest LLM we tested is over 400 times faster than the slowest, would you believe me? Keep on reading to find out more.

Last month, we hosted March Model Madness, a contest between the top language models to crown a winner. If you missed it, you can read more about it here. Running this contest allowed us to track metrics for all of these models in a controlled environment, i.e., with the same prompt and maximum token output.

March Model Madness, and most use cases for that matter, focuses on the quality of the output. While it’s certainly important to get the desired output, there is another factor we should not overlook: inference speed!

For example, a couple of weeks ago, we built an LLM-powered voice assistant. Part of the data flow for this project consisted of an LLM call to extract the category and relevant data in JSON format. Almost every modern LLM can perform this task. So, at that point, the most important metric—especially in a voice assistant—was latency.
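As a rough illustration, here is a minimal sketch of what that extraction step could look like. The `call_llm` helper, the prompt, and the category list are hypothetical stand-ins rather than the actual code from the voice assistant project; the point is that the task itself is simple, so latency dominates.

```python
import json

# Hypothetical helper: sends a prompt to whichever LLM you have configured
# and returns its raw text completion. Wire this up to your provider of choice.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("connect this to your model client")

EXTRACTION_PROMPT = """\
You are part of a voice assistant. Classify the user's request and pull out
the relevant fields. Respond with JSON only, in the form:
{{"category": "<one of: weather, timer, music, other>", "data": {{...}}}}

User request: {utterance}
"""

def extract_intent(utterance: str) -> dict:
    # Almost every modern LLM can produce this structure, so the
    # differentiator in a voice assistant is how fast the answer comes back.
    raw = call_llm(EXTRACTION_PROMPT.format(utterance=utterance))
    return json.loads(raw)

# Example (hypothetical output):
# extract_intent("set a timer for ten minutes")
# -> {"category": "timer", "data": {"duration_minutes": 10}}
```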

The results from March Model Madness show considerable differences in speed. While every LLM had the same settings and maximum output tokens, some generated more tokens than others. So, to normalize the results, we looked only at tokens per second (TPS).
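Tokens per second is simply the number of generated tokens divided by the wall-clock time of the call. A minimal sketch of how such a measurement could be taken is below; the `generate` callable and its return value are placeholders for whatever client and tokenizer you happen to use, not part of our benchmark harness.

```python
import time

def measure_tps(generate, prompt: str) -> float:
    """Return tokens per second for a single generation call.

    `generate` is a placeholder for your model client; it should return the
    list of generated tokens (or anything whose length is the token count).
    """
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed

# Normalizing on TPS keeps runs comparable even when one model emits far
# more tokens than another under the same max-token setting.
```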

The results cover all models available on the Seaplane platform. Some are self-hosted, others are provided by hyperscalers such as Google and Amazon, and some come from third-party providers. There is a lot you can do to optimize TPS for a given model, so depending on your platform of choice your mileage may vary, but this is the performance you can expect on Seaplane.

Now for the results. As mentioned earlier, there are some pretty wild differences between the various models. The fastest model averaged 1109 TPS, while the slowest averaged just 2.5 TPS. To put that in perspective, the fastest model is 443 times faster than the slowest.

During March Model Madness, the models were divided into three categories. Let's look at each of them in more detail.

Chat

At an average of 38.6 TPS, the chat category was the slowest of the three. The crown for the fastest model in this category goes to GPT-3.5 at 88 TPS. The slowest model we tested in this category was falcon-40b-instruct at 8 TPS.

Based on quality, neither of these models made it very far in the competition. GPT-3.5 was knocked out in round one (to everyone's surprise, I might add), and Falcon-40b-instruct only made it to round two.

Gemini Pro, the winner of the chat bracket, had an average of 62 TPS, roughly 1.6 times the bracket average. If you are looking to build a GenAI application that requires a chat-based model, Seaplane recommends Gemini Pro.

Chat models tokens per second

Code

Our code-based models were the fastest group of the three categories, with an average speed of 133 TPS.

The fastest model in this bracket was CodeLlama-7b-instruct with 840 TPS; the slowest model in this bracket (and the slowest model overall) was wizardcoder-34b-v1.0 at only 2.5 TPS.

Quality-wise, wizardcoder didn’t fare much better, getting knocked out in the first round. CodeLlama-7b-instruct, on the other hand, made it to the semi-finals, although GPT-3.5 ultimately walked away with the crown. Which model works best will depend on your use case, but with almost 10x the tokens per second, Seaplane recommends CodeLlama-7b-instruct for projects that require a code-based model.

Code models tokens per second

Instruct

On average, the instruct-based models produced 93 TPS. The fastest model in this category, and across all categories, was llama-2-70b-chat, with a whopping 1109 TPS. While technically a chat-based model, it performed really well with the instruct prompts, both in terms of speed and quality, making it to the semi-finals.

The slowest model in this bracket was once again falcon-40b-instruct; this model performed poorly in terms of speed and quality and, in our opinion, should not be used in a production environment.

GPT-4 took the crown in terms of quality, but in terms of TPS, it scored just above falcon-40b-instruct. Depending on your use case, this might not be the right model, especially if latency is a key factor in your application.

Overall, we recommend Llama-2-70b-chat for your future projects that need an instruct-based model.

Instruct model tokens per second

Conclusion

When we consider which LLMs to incorporate into our projects, the two most crucial aspects are the quality of responses and latency. After evaluating both for all models available on Seaplane, we recommend Gemini Pro, CodeLlama-7b-instruct, and Llama-2-70b-chat for chat, code, and instruct use cases, respectively.

Switching LLMs in your next project just got easier with the recent launch of the Seaplane SDK v0.6, which gives you access to more than 40 models directly from our platform. Sign up for an account here.

If you’d like to follow up and discuss the pros and cons of certain LLMs, check out our growing Discord.

