
Scaling LLM Apps

Lessons learned from OpenAI Dev Day to increase intelligence while reducing cost and latency

How do you scale your app from localhost:3000 to 10M users? Here's what we picked up from OpenAI's Dev Day.

This session was jam-packed with actionable, OpenAI-validated advice on how to increase the accuracy of your apps while reducing cost and latency. The insights they shared are a playbook and roadmap to squeeze more (for less) out of your applications.

Dev Day breakout session

Note: This post is heavily inspired by the breakout session at Dev Day. Any claims or specific numbers are as of Oct '24. It's augmented by my personal beliefs. If you want to talk through what this means for your business, let's chat!

Three areas to consider:

  1. Accuracy - Increasing the performance of your apps
  2. Latency - Making your apps faster
  3. Cost - Getting more tokens for less 🤑

Accuracy

  • Most intelligence first - Start by optimizing for accuracy with the most intelligent model you have. This means you'll use gpt-4o or Claude 3.5 Sonnet before working with 4o-mini or Flash. This follows the "Make it work, make it right, make it fast" mantra. Make sure you have a baseline of success before you optimize for performance.
  • Evals > Set Target > Optimize - Your apps should always start with evals. You won't know if you're going in the right direction without an objective way to measure performance. (A minimal eval sketch follows this list.)
    • Evals - Set up a system or test to measure performance of your apps
    • Set Target - Define what is good enough. You likely won't hit 100% accuracy, so define what your business is comfortable with. You can do this by setting a dollar value for the success and failure cases (picture below; a worked example follows it). Once you compare this to the current performance (likely from a human), you can land on a number that makes sense for your app/business.
    • Optimize - Optimize your performance to your target. Don't fall into the trap of over-optimizing.
  • Don't skip setting the target - Businesses will often skip setting a "good enough" target. If you skip it, judging your app's performance becomes an emotional decision rather than an objective one.
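
Here's a minimal sketch of what a component-level eval could look like. Everything in it is hypothetical: `route_intent` is a stand-in for whatever piece of your app you want to test, and the labeled examples are made up.

```python
# Component eval sketch. `route_intent` is a stand-in for the piece of your
# app under test (intent router, retriever, extractor, ...). Replace the stub
# with a call into your real code.
def route_intent(question: str) -> str:
    # Hypothetical stub -- in your app this would call an LLM or a classifier.
    return "order_status" if "order" in question.lower() else "other"

# Hand-labeled examples: (user question, expected intent)
EVAL_SET = [
    ("Where is my order?", "order_status"),
    ("I want my money back", "refund"),
    ("Do you ship to Canada?", "shipping_policy"),
]

def run_eval() -> float:
    correct = 0
    for question, expected in EVAL_SET:
        predicted = route_intent(question)
        if predicted == expected:
            correct += 1
        else:
            print(f"MISS: {question!r} -> {predicted} (expected {expected})")
    accuracy = correct / len(EVAL_SET)
    print(f"Routing accuracy: {accuracy:.0%}")
    return accuracy

if __name__ == "__main__":
    # Run this on every PR (e.g. in CI) and fail the build below your target.
    assert run_eval() >= 0.90, "Routing accuracy fell below target"
```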

Setting targets: here's an example of how to set targets for your app's performance through dollar values.
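
And to make the dollar-value math concrete, a small sketch; all of the numbers below are invented, so plug in your own.

```python
# Back-of-the-envelope target setting (all numbers are hypothetical).
value_per_success = 20.00   # e.g. a support ticket resolved without a human
cost_per_failure = 40.00    # e.g. an escalation plus an unhappy customer
volume_per_month = 10_000

def expected_value(accuracy: float) -> float:
    """Expected dollars per month at a given accuracy."""
    successes = accuracy * volume_per_month
    failures = (1 - accuracy) * volume_per_month
    return successes * value_per_success - failures * cost_per_failure

# Break-even accuracy: where the value gained cancels the cost of failures.
break_even = cost_per_failure / (value_per_success + cost_per_failure)
print(f"Break-even accuracy: {break_even:.0%}")                  # ~67%
print(f"Value at a human baseline of 85%: ${expected_value(0.85):,.0f}")
print(f"Value at a 90% target:           ${expected_value(0.90):,.0f}")
```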

  • Eval-driven development - Don't ship a component until you have an eval to test it. Two types of evals:
    1. Component evals - Unit tests of your app's components. "Did this question get routed to the right intent?"
    2. End-to-end evals - Tests that check full flows. "Did the customer achieve their objective?" This includes retrieval evals as well.
  • Eval every PR - Every time you raise a PR, run your eval tests to see if performance degraded.
  • How to make your apps more accurate
    • Prompt engineering - Experiment with prompt engineering techniques or meta-prompting (below)
    • RAG - Use retrieval augmented generation to improve performance. Vanilla RAG likely won't be your answer, so experiment with advanced techniques. Check out our content on Full Stack Retrieval for more on this.
    • Fine-tuning - I'm usually hesitant to recommend fine-tuning. It takes a lot of work and your end result is rigid. However, you only need ~50 examples to start experimenting. If you've tried the other options, this may be a good route.
    • Do it all - There isn't a one-size-fits-all solution. What you end up with will likely be a blend of these techniques.
  • Lean into meta-prompting - Meta-prompting is the act of using an LLM to generate a prompt for you. This can be as simple as asking o1 to generate a prompt based on your inputs and hand-crafted outputs (you only need 1-2 examples), or done with a framework like DSPy or DocETL.
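
A minimal meta-prompting sketch using the `openai` Python client; the model name, example pairs, and output schema are placeholders, so swap in your own.

```python
# Meta-prompting sketch: ask a stronger model to write the prompt for you.
# Assumes the `openai` package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# 1-2 hand-crafted input/output pairs are usually enough to get started.
examples = """
Input: "My package never arrived and it's been 3 weeks."
Desired output: {"intent": "lost_package", "sentiment": "negative", "urgency": "high"}

Input: "Can I change the shipping address on order #1234?"
Desired output: {"intent": "update_order", "sentiment": "neutral", "urgency": "medium"}
"""

meta_prompt = f"""You are an expert prompt engineer.
Write a production-ready system prompt for a model that must take a customer
message and return JSON like the examples below. Include instructions,
edge cases, and the output schema.

Examples:
{examples}"""

response = client.chat.completions.create(
    model="o1",  # placeholder: use whatever strong reasoning model you have access to
    messages=[{"role": "user", "content": meta_prompt}],
)
print(response.choices[0].message.content)  # your generated prompt
```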

Techniques to try: different methods to try in your LLM stack.

  • Techniques to try in your LLM stack - This is such a deep topic that a single bullet point cannot do it justice, but let's review what OpenAI recommended: baseline retrieval with cosine similarity, HyDE, fine-tuning, chunking strategies, reranking, classification steps, prompt engineering, tool use, and query expansion. (A sketch of the cosine-similarity baseline follows.)
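
For reference, here's roughly what that cosine-similarity baseline looks like. It's a sketch that assumes the `openai` embeddings endpoint and a tiny in-memory corpus; a real app would swap in a vector store, chunking, and reranking on top.

```python
# Baseline retrieval: embed the corpus, embed the query, rank by cosine similarity.
import numpy as np
from openai import OpenAI

client = OpenAI()

documents = [
    "Refunds are processed within 5-7 business days.",
    "We ship to the US, Canada, and the EU.",
    "Premium support is available 24/7 for enterprise plans.",
]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])

doc_vectors = embed(documents)

def top_k(query: str, k: int = 2) -> list[str]:
    q = embed([query])[0]
    # Cosine similarity = dot product divided by the vector norms.
    sims = (doc_vectors @ q) / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    return [documents[i] for i in np.argsort(sims)[::-1][:k]]

print(top_k("How long does a refund take?"))
```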

Latency

Latency for LLM calls breaks down into 3 types:

  1. Network Latency - How long it takes your request to get to the GPU
  2. Input Tokens Latency - Time to first token
  3. Output Tokens Latency - Time between tokens

Note: Latency matters most when you're using an LLM for user-facing requests. If you're doing async batch processing, you likely care less about this.

Latency breakdown

Network Latency

Network latency is how long it takes for OpenAI to get your request to a GPU and start processing. Not only do they run a series of checks on your prompt first (security, logging, etc.), they also need to route your request to an available GPU.

OpenAI says this will add about 200ms. There isn't much you can do about this, "it's on us" they say.

Adding more global data centers will help with this.

Input Tokens Latency

This is the time to first token. Without going into a deep technical dive: the longer your prompt, the more data the model needs to process before it can emit the first token. Time-to-first-token matters most for applications that use streaming; it eases the user's wait and makes your app feel more responsive. The quicker you get to the first token, the quicker you can distract the user with more content. (A quick way to measure it is sketched after the list below.)

How to improve:

  • Use shorter prompts - Shorter prompts = less data to process = quicker time to first token. This also includes reducing the context you give to the model. Don't be lazy and pass in too much information.
  • Use smaller models - Smaller models require less compute to process the same data (though of course they aren't as 'intelligent')
  • Prompt caching and cache hits - Reusing commonly used prompts reduces the time to first token
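
To see where you stand, here's a rough client-side measurement of time-to-first-token using the `openai` client's streaming mode; the model and prompt are placeholders.

```python
# Rough latency measurement: time-to-first-token and total streaming time.
import time
from openai import OpenAI

client = OpenAI()

start = time.monotonic()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "List three uses for a paperclip."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        if first_token_at is None:
            first_token_at = time.monotonic()
        chunks += 1

total = time.monotonic() - start
print(f"Time to first token: {first_token_at - start:.2f}s")
print(f"Total time: {total:.2f}s across {chunks} streamed chunks")
```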

Output Latency

This is the time between tokens. Each new token you ask the model to produce takes additional compute to generate, so the more output you ask for, the more compute (and the more time) it will take. The bottleneck is shared between your prompt length and OpenAI's processing.

How to improve: OpenAI says it's a matter of supply and demand. Weekends are the fastest; weekday mornings (PT) are the slowest.

Summary

Ask the model to give you the bare minimum information that you actually need. Prompt length matters: longer prompts = higher latency.
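
One concrete lever is to cap the output explicitly and ask for terse answers. A sketch (the model, prompt, and token cap are arbitrary):

```python
# Keep outputs short: ask for the bare minimum and enforce a hard cap.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[
        {"role": "system", "content": "Answer with a single short sentence. No preamble."},
        {"role": "user", "content": "Why is the sky blue?"},
    ],
    max_tokens=60,  # hard cap on output length (and therefore on output latency)
)
print(response.choices[0].message.content)
```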

Cost

We all have the same goal: more inference for the same dollar budget. More requests, less money.

How to reduce cost:

  • Prompt caching - This is a no-brainer if you reuse prompts often (i.e., everyone). OpenAI matches your prompt against a cached prefix, so put your static system prompt at the beginning; as soon as one token differs, everything after it is a cache miss. The cache lives 5-10 minutes, and no matter what, it clears every hour. (A sketch follows this list.)
    • Save 50% on cached tokens - No extra work is required; it's applied automatically and your bill should go down.
  • Batch API - By running your requests async through the Batch API, you can save 50% on token costs. Create a batch file with a large number of requests; it processes faster at off-peak times.
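
For prompt caching, the main thing you control is ordering: keep the static part of the prompt identical and at the front, and push anything that changes per request to the end. A sketch (the system prompt content and model are placeholders; caching kicks in automatically on OpenAI's side for sufficiently long prompts):

```python
# Prompt-caching-friendly ordering: static prefix first, variable content last.
# Your job is just to keep the prefix byte-for-byte identical across requests.
from openai import OpenAI

client = OpenAI()

STATIC_SYSTEM_PROMPT = """You are a support assistant for Acme Inc.
Policies:
- Refunds within 30 days.
- Shipping to the US, Canada, and the EU.
(...imagine a long, unchanging block of instructions and policies here...)"""

def answer(user_message: str, todays_context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            # Identical on every call -> eligible for a cache hit.
            {"role": "system", "content": STATIC_SYSTEM_PROMPT},
            # Varies per request -> goes last so it doesn't break the cached prefix.
            {"role": "user", "content": f"Context for today: {todays_context}\n\n{user_message}"},
        ],
    )
    return response.choices[0].message.content
```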

Summary

First focus on accuracy. Then target latency or cost depending on your business.

As with all advice, the recommendations above aren't black and white; they depend on your business and priorities. There isn't a one-size-fits-all solution.

If you want to jam on problems your LLM apps are facing, let's chat!
