The 6 AI Engineering Patterns In 2025
Learning how to build AI-powered products
Welcome! This post got shared with friends and family first. If you want to get the next one early, come join the newsletter
If you haven't noticed, AI and language models are changing the skills needed to be a successful engineer.
The people who are learning these new skills are getting jobs that pay up to $435K per year.
They’re creating apps using LLMs that generate millions of dollars per year at 90% margins with no employees.
They’re building features that took months, in minutes.
They’re using the 6 AI Engineering Patterns to build.
6 AI Engineering Patterns
Models
Understanding the foundational AI models, their capabilities, and how to effectively integrate them into your applications.
Prompting
Mastering the art of crafting effective prompts to get reliable, consistent outputs from AI models.
Context (RAG)
Implementing context in your applications (Retrieval Augmented Generation) to enhance AI responses with relevant external knowledge and data.
Orchestration (Agents)
Building and managing AI agents that can coordinate multiple tasks and work together to solve complex problems.
Evals & Observability
Implementing robust evaluation frameworks and monitoring systems to measure and improve AI system performance.
Mindset
Developing the right approach to AI engineering, including how to scale and best practices.
Hello World
First, let's start from the beginning. It's good to start aligned, right?
Here we have a hello world example to get you started.
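If you'd rather follow along in code, a minimal version with the OpenAI Python SDK might look something like this (it assumes an OPENAI_API_KEY in your environment; any provider's SDK works the same way in spirit):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from your environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say 'hello, world!'"}],
)

print(response.choices[0].message.content)
```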
Great, now you've mastered LLMs. But wait, want to learn a bit more? Let's go deeper.
1. Models
Getting Started
Start by experimenting with the chat interfaces from the major labs:
Experimenting with "thinking" or "reasoning" models like o1, "balanced" models like claude-3-5-sonnet, and "workhorse" models like gpt-4o-mini will help you build an intuition for which categories of problems and prompt strategies are best suited to each model (more on this later).
Through these interfaces you can start to get a feel for the models. But you'll quickly see that many capabilities are better accessed through API calls:
- Use developer playgrounds like OpenAI's
- Study both API docs and recent cookbook examples. (There's often some drift between the API docs and SDKs, which is why it's important to look at the examples.)
Main parameters we see developers using:
- response_format: "json_schema" - For structured outputs
- temperature - Lower values for more deterministic responses
- max_tokens - Control response length
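Here's a quick sketch of how these look in an OpenAI chat completion call (the model and schema are illustrative; the Prompting section covers a friendlier pydantic-based version of structured outputs):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0.2,   # lower values -> more deterministic responses
    max_tokens=200,    # cap the response length
    response_format={  # structured output via a JSON schema
        "type": "json_schema",
        "json_schema": {
            "name": "sentiment",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {"label": {"type": "string"}},
                "required": ["label"],
                "additionalProperties": False,
            },
        },
    },
    messages=[{"role": "user", "content": "Classify the sentiment of: 'I love this shirt!'"}],
)

print(response.choices[0].message.content)  # e.g. '{"label": "positive"}'
```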
More Advanced Concepts
Start by understanding tool calling, latency and cost features, and open source models. Model training, fine-tuning, and model routing only apply to more mature applications (you don't need to learn these in the beginning). New features from the major labs, like Anthropic's MCP or OpenAI's Realtime API, will unlock a new breed of applications in the near future, but the best practices for using them are still being explored.
Tool Calling
Tool calling gives your LLMs the ability to do things besides generate text. This is a must-know for anyone trying to build "agentic" applications. Take, for example, a support conversation where a customer asks for a refund on a damaged order.
An LLM without tool calling would do nothing in this case, because all it can do is respond to the user with text. An LLM with tool calling enabled could trigger a function in your backend via its response.
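Here's a sketch of what that can look like with the OpenAI SDK; the issue_refund tool, its parameters, and the conversation are hypothetical:

```python
from openai import OpenAI

client = OpenAI()

# A hypothetical refund tool -- the name, parameters, and handler are illustrative.
tools = [
    {
        "type": "function",
        "function": {
            "name": "issue_refund",
            "description": "Issue a refund for a given order",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {"type": "string"},
                    "reason": {"type": "string"},
                },
                "required": ["order_id"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a support agent for a t-shirt store."},
        {"role": "user", "content": "My order #1234 arrived ripped. I'd like my money back."},
    ],
    tools=tools,
)

# If the model decided a refund is warranted, it returns a tool call instead of plain text.
tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    print(tool_calls[0].function.name)       # e.g. "issue_refund"
    print(tool_calls[0].function.arguments)  # JSON string like '{"order_id": "1234", ...}'
```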
Your app is still responsible for actually issuing the refund, but the LLM decided on its own to issue a refund because it made sense given the context of the conversation.
Performance and Cost Optimization
Building a latency sensitive application like a chatbot? Look into streaming and prompt caching.
Aside: Prompt caching happens automatically for OpenAI's APIs, whereas Anthropic requires an explicit cache_control parameter to be set. This is just one example of the many divergences between providers, and it's advised to look at the source documentation for each.
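For illustration, a minimal Anthropic call that marks a large, stable system prompt as cacheable might look like the sketch below (the model alias and prompt contents are placeholders; check Anthropic's docs for current details):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from your environment

LONG_SYSTEM_PROMPT = "...your full support playbook, FAQs, policies..."  # the stable bulk of the prompt

response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # mark this block for caching
        }
    ],
    messages=[{"role": "user", "content": "What's your return policy?"}],
)

print(response.content[0].text)
```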
For latency insensitive use cases, like running an eval suite overnight, you should look at the batch API to run inference at a discounted rate.
Running Open Source Models
Running open-source models is increasingly a viable alternative. There are managed services like together.ai or fireworks.ai that serve open-source models, routers like OpenRouter that sit in front of many providers, and tools like Ollama for running models locally on your own hardware.
Further learning on models
Videos & Podcasts
- Check out the previous show and tell videos for experiences with working with models.
- Lex Fridman podcast with the Cursor team
Blogs and People to Follow
- Applied LLMs - A collective of the top Applied AI consultants
- swyx's Latent Space - All in one community/blog/podcast/etc. covers research and conferences (advanced)
- Jason Liu - Creator of Instructor. Posts lots of tactical advice about running a consultancy and actually doing the work.
- Eugene Yan - Sr. Applied Scientist @ Amazon, expert on ML and Recommendation Systems
- Justine Tunney - Cracked low level hacker
- Deedy Das - VC @ Menlo Ventures
2. Prompting
Prompting is the art of eliciting desired behaviors from AI models. For traditional software engineers this might first come across as gimmicky, but it's a legitimate discipline. Even the top labs are hiring for strong prompting skills! ($375K/yr)
In this section you will learn how to write prompts and how to manage them in the context of an application.
Basics
First, stay away from thinkfluencer clickbait like this:
If there's only one thing you remember: simply write as if you were writing instructions for a generally resourceful person. In other words, if you're throwing a scrambled wall of text at a model and getting frustrated when it inevitably fails to do what you need, just ask yourself: "if someone sent this to me, would I be able to understand and respond?"
Once you have this core tenet committed to memory, you can get more tactical with the following techniques.
Anatomy of a prompt:
- Instruction: The thing you want it to do.
- Context: External information that can aid the response.
- Output Indicator: The type or format of the output.
- Input: If you're familiar with OOP, an input is to a prompt as an instance is to a class.
Let's run through an example:
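Suppose you want the model to analyze customer feedback (the running example in this section). A straightforward first pass that includes each of these parts might look something like this (the wording is illustrative):

```
Instruction: Analyze the customer feedback below and summarize the main complaint.

Context: We sell custom t-shirts online. Common issues are sizing, print quality, and shipping delays.

Output Indicator: Reply with a one-sentence summary and a sentiment label (positive / neutral / negative).

Input: "The design looked great but the shirt shrank two sizes after one wash. Pretty disappointed."
```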
This isn't bad but let's see if we can improve it. There are four key techniques most modern prompt engineers use:
1. Chain of Thought (Think Out Loud): Encourage the model to explain its thought process step-by-step, rather than immediately providing an answer. It can be as simple as appending a line like "Let's think step by step" to the instruction section of your prompt. If you build an agent with CrewAI, it wraps your prompt behind the scenes with exactly this kind of chain-of-thought instruction.
2. Include Examples: Improve prompt performance by providing clear examples of desired input and output. Building upon the customer feedback example, you might add an "Examples" section with a few representative input/output pairs.
3. Use Structured Elements: Consider using XML tags or markdown to structure the prompt into clearly labeled sections (instruction, context, examples, and so on).
4. Structured Outputs: Master the ability to get language models to output structured data like JSON or tables to enable integration with other computer programs and systems. In the OpenAI SDKs, this is usually accomplished by simply passing a pydantic class to the response_format parameter, as shown in the official example here.
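A minimal sketch of that pattern using the SDK's parse helper, continuing the customer feedback example (the class and field names are made up):

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class FeedbackAnalysis(BaseModel):
    sentiment: str
    summary: str
    follow_up_required: bool

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Analyze the customer feedback you are given."},
        {"role": "user", "content": "The shirt shrank after one wash. I want a replacement."},
    ],
    response_format=FeedbackAnalysis,  # the pydantic class doubles as the output schema
)

analysis = completion.choices[0].message.parsed  # a FeedbackAnalysis instance
print(analysis.follow_up_required)
```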
Prompt Management
After the prototyping phase, most applications quickly outgrow the "directly in the code" approach. Some reasons for this include:
- Different models perform better with different formats. For example, Claude's models reportedly perform better with XML tags to mark section breaks instead of markdown.
- A/B testing the performance of different prompts
- Dynamic prompts based on real time input like the user's language to generate a more idiomatic response
Dave Ebbelaar of Datalumina has a great video on how to iteratively manage prompts. Start at the step that works for you, then work your way up to the more sophisticated methods. TLDR:
- Embed it in code directly (your default)
- Put it in text files and import them
- Use a templating system with text files to construct dynamic prompts (a minimal sketch follows this list)
- Use an external tool like Promptfoo (endorsed by Tobi from Shopify) or PromptLayer
- Use your own database.
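To make step 3 concrete, here's a minimal sketch using Python's built-in string templating (the file path and variables are illustrative):

```python
from pathlib import Path
from string import Template

# prompts/support_reply.txt might contain:
#   You are a support agent for $store_name.
#   Respond to the customer in $language.
#   Customer message: $message
template = Template(Path("prompts/support_reply.txt").read_text())

prompt = template.substitute(
    store_name="Acme Tees",
    language="Spanish",  # dynamic, e.g. based on the user's detected locale
    message="¿Cuál es su política de devoluciones?",
)
```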
Resources
- Introduction to Optimized Prompting by Eugene Yan
- Prompt Engineering Guide by DAIR.AI
- Prompt Tuning Playbook, by Google DeepMind engineers
3. Context and Retrieval
This section focuses on enhancing model responses by providing relevant context beyond their training data. We'll first explore a standard Retrieval Augmented Generation (RAG) Pipeline and then build on this understanding with common issues and tips.
What is RAG?
Let's say you sell custom t-shirts online. If you wanted to host a customer support chatbot to answer questions like "what's your return policy?", you would need some way to inform the model about this specialized knowledge. Getting the right context in a timely manner is no simple feat and the AI community has tirelessly iterated on the following approach:
1. Ingestion
You have a set of documents (continuing the above example: FAQs, support docs, etc.) that you want to give to an LLM. You take these documents and stash them in some kind of database, in a format that makes them easy to compare at query time. For example:
- If you're using a vector database, you create embeddings, which are numerical representations of text.
- If you're sending docs to something like Elasticsearch, it calculates term frequencies, field lengths, and other metrics to make ranking and comparison easy.
2. Retrieval
At query time, you want to match a user's live query to a relevant document.
- With a vector database, this is done by transforming the user's query into an embedding and then searching for the closest match using a measure like cosine similarity.
- With Elasticsearch, this is done by extracting the same features (term frequencies, field lengths, etc.) from the user's query and then ranking matches with an algorithm like BM25.
3. Response
You then identify and pass the most relevant chunk(s) to the LLM as context. Sometimes this is as primitive as taking the top n chunks from whichever ranking algorithm you used in the previous step and stuffing them into the model context. Many projects have seen success with Cohere's Rerank model as a discrete "reordering" step to identify the most relevant chunks among a batch of candidates.
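To make this concrete, here is a stripped-down sketch of the vector-database flavor of the pipeline using OpenAI embeddings and plain cosine similarity (no real vector database; the documents are made up):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

docs = [
    "Returns are accepted within 30 days of delivery for unworn items.",
    "Standard shipping takes 3-5 business days within the US.",
]

def embed(texts):
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

# Ingestion: embed the documents once and store the vectors.
doc_vectors = embed(docs)

# Retrieval: embed the live query and rank documents by cosine similarity.
query = "what's your return policy?"
query_vector = embed([query])[0]
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
best_doc = docs[int(np.argmax(scores))]

# Response: stuff the best match into the prompt as context.
answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": f"Answer using this context:\n{best_doc}"},
        {"role": "user", "content": query},
    ],
)
print(answer.choices[0].message.content)
```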
It's important to remember that garbage in == garbage out. Whatever fancy techniques you use fall secondary to your document quality. Do your documents actually contain the information you need to answer your users' questions? Look at your data!
Common Issues and Tips
Questions aren’t well defined and lead to irrelevant matching
- Add a query understanding step to your pipeline. It basically means "don't take the user's query verbatim; instead, reason about what they're really looking for" (see the sketch after these examples).
- Ex. the query "show me the top fashion trends of this year" needs to map "this year", a relative term, to an absolute date.
- Ex. the query "which shirts are the best sellers and are they in stock?" represents two distinct queries and should be split into two separate retrieval steps.
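A sketch of such a step, using structured outputs to rewrite and split queries (the QueryPlan class and the prompt wording are illustrative):

```python
from datetime import date
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class QueryPlan(BaseModel):
    rewritten_queries: list[str]  # one standalone query per retrieval step

plan = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": (
                f"Today is {date.today().isoformat()}. Rewrite the user's question into one or "
                "more standalone search queries, resolving relative dates and splitting "
                "compound questions."
            ),
        },
        {"role": "user", "content": "which shirts are the best sellers and are they in stock?"},
    ],
    response_format=QueryPlan,
)

print(plan.choices[0].message.parsed.rewritten_queries)
# e.g. ["best selling shirts", "inventory status of best selling shirts"]
```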
Chunks aren't storing complete information:
- Explore the 5 levels of text splitting
- Use semantic chunking to ensure discrete chunks of text contain complete information, as opposed to more arbitrary strategies like splitting by character count or sentence count.
- Have the LLM propose questions for every chunk, and then vectorize the pairs. This makes intuitive sense because the user query will likely resemble the question more than the source text.
The model responds poorly even with relevant context
- Don't be fooled by large context windows. We may be past the days of 8k limits, but performance still tends to degrade as prompt length grows.
Resources
- Fullstack Retrieval Series - A comprehensive guide to building a retrieval system written by Greg Kamradt
- Jason Liu's RAG posts - Jason has a ton of free articles on RAG and how to improve it.
- LocalLLaMA subreddit - Tons of alpha (ground level advice) written by real engineers like here
- Jo Kristian Bergum - Chief Scientist at Vespa Engine. Shares implementation tips like this thread
- Chunkviz - A tool to visualize your chunks
4. Orchestration and Agents
An agent is simply a system that has autonomy over "how it accomplishes a given task." Contrast this to a Workflow (think Zapier) which is a process with a predefined set of steps.
In this section we'll first develop an understanding of what agents are actually used for in the real world. Following that, we'll cover the current best practices for building with agents.
Real World Use Cases
Before diving into implementation, awareness of what is and isn't working can guide your decision on whether to use agents or a more deterministic approach.
At the time of writing, the current hype cycle is all about agents. Beware of the golden hammer fallacy: many use cases are better served by a thoughtfully designed workflow (that may not even have an AI component!). There are a lot of sexy experiments and startups that show early promise but aren't creating real enterprise value yet.
The domains that are getting traction seem to be mainly in lead generation and coding.
- Lead Generation
Voice agents: YC has a ton of companies in recent batches that are verticalized voice agents for home services, medicine, and more. These agents can service inbound leads who dial in by answering questions, handling registration and scheduling, and more.
- Coding
Products like Cursor's Composer or Cognition Labs' Devin are capable of planning and writing code. Devin goes the extra mile by validating its own work and opening its own PRs.
- Content Creation (honorable mention)
Tools like Gumloop can build agents that perform a sophisticated sequence of tasks that a growth marketer would normally do (ad performance analysis, generating new iterations, etc.). While this is still a "workflow" by definition, it is seriously impressive, hence the honorable mention.
Tooling and Frameworks
Low Code Tools
First try implementing your idea with low-code tools like Gumloop or Lindy. Devs often carry a negative impression of tools in this category because of the previous generation of low-code builders like Bubble or FlutterFlow, which are notorious for producing buggy products with bloated codebases. The tide is shifting, however: these products are already driving real enterprise value despite launching within the past year.
Graph Frameworks
Developers hit the limits of these tools when they're modeling something that requires routing or orchestration. LangGraph is a thoughtfully designed piece of software, offering features that actually matter when building with agents. Our recommendation is to jump into LangGraph after you've prototyped your agent with a low-code tool. The learning curve on LangGraph can be steep, so it's important to know what you need before you learn the syntax.
A couple pointed examples:
- Breakpoints - allow for human review of the agent's work. Anything that has negative consequences if performed incorrectly, like making large transactions, could benefit from this affordance.
- Checkpoints and Persistence - the idea that you have state that you want to persist between interactions. For an e-commerce chatbot you might want the cart, or the shipping information to persist.
- Tool Calling - steps need the ability to call tools (sometimes successively or in parallel) and update the agent's state.
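Here's a rough sketch of how breakpoints and persistence fit together in LangGraph (node logic is stubbed out and all names are illustrative; check the LangGraph docs for current APIs):

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class OrderState(TypedDict):
    cart: list[str]
    refund_amount: float

def plan_refund(state: OrderState) -> dict:
    # In a real agent this node would call an LLM and/or tools.
    return {"refund_amount": 25.0}

def execute_refund(state: OrderState) -> dict:
    # The risky step: actually move money.
    return {"cart": []}

builder = StateGraph(OrderState)
builder.add_node("plan_refund", plan_refund)
builder.add_node("execute_refund", execute_refund)
builder.add_edge(START, "plan_refund")
builder.add_edge("plan_refund", "execute_refund")
builder.add_edge("execute_refund", END)

graph = builder.compile(
    checkpointer=MemorySaver(),           # persistence: state survives between interactions
    interrupt_before=["execute_refund"],  # breakpoint: pause for human review before the risky step
)

# The thread_id ties checkpoints to a single conversation.
config = {"configurable": {"thread_id": "order-1234"}}
graph.invoke({"cart": ["tshirt-42"], "refund_amount": 0.0}, config=config)
```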
Resources
Frameworks
- LangGraph - A framework for building agents
- CrewAI - A framework for building multi-agent systems
- Haystack - A framework for building agents
Low-Code
Educational Resources
People to Follow
- Harrison Chase - LangChain Founder
- Joao Moura - CrewAI Founder
- Max Brodeur-Urbas - Gumloop Founder
- Alex Reibman - CEO of AgentOps
5. Evaluations and Observability
Evaluations (evals) and observability are what distinguishes the toys from the real apps. Jason put it best here:
This section demonstrates how easy it is to add these aspects to your application.
Evaluations
Evaluations are the unit tests of your language model applications. They help detect regressions or unintended behaviors as your system evolves. This is especially important when dealing with applications that produce more ambiguous outputs, such as creative text generation or nuanced decision-making systems.
1. Step One: What does good and bad output look like?
Let's consider an application that summarizes articles on Substack to create SEO descriptions. We'll evaluate the summaries based on a couple of criteria:
- Relevance: Does the summary capture the main points of the article?
- Conciseness: Is the summary free from unnecessary information?
You can start with just one by the way - what's important is to choose criteria that are relevant to your domain.
2. Step Two: Write a simple test
Don't overcomplicate it: cook up a few examples that represent your inputs. Read over the example first and we'll discuss the tactics below.
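A simple version might look something like this (the thresholds and the judge prompt are illustrative, not a prescribed standard):

```python
from openai import OpenAI

client = OpenAI()

examples = [
    {
        "article": "Long article text about AI engineering patterns...",
        "summary": "A short SEO description of the article.",
    },
]

def evaluate_conciseness(summary: str) -> bool:
    # Yes/no assertion: just a string length check.
    return len(summary) <= 300

def evaluate_relevance(article: str, summary: str) -> bool:
    # LLM-as-a-judge, kept as a last resort for criteria that are hard to codify.
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Does this summary capture the main points of the article? Answer YES or NO.\n\n"
                f"Article:\n{article}\n\nSummary:\n{summary}"
            ),
        }],
    )
    return "YES" in verdict.choices[0].message.content.upper()

for example in examples:
    assert evaluate_conciseness(example["summary"])
    assert evaluate_relevance(example["article"], example["summary"])
```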
Let's explore the tactics:
- Yes/No Assertions: Prefer boolean values over multi-point scales (like low-medium-high, 1-10, etc.), as shown above. It makes interpretation easier and is amenable to more sophistication if needed.
- Use Simple Functions: Things like regex, string includes, etc. can still be useful. The evaluate_conciseness check is just a string length check! Use LLM-as-a-judge as a last resort for criteria that are difficult to codify, like relevance.
The folks at Applied LLMs have a great guide with more advanced content like comparative evaluation, deployment and more here
Observability
Observability enables two things:
- Tracing: Logging your different language model calls to easily debug and understand application behavior.
- Cost Management: Tracking the costs associated with each LLM call, latency, and errors.
Here's a screenshot of LangSmith:
Integrating LangSmith (or any other observability tool) takes a couple of lines of code, and it's free for thousands of calls per month.
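For reference, a minimal LangSmith setup might look like the sketch below (it assumes tracing is enabled via environment variables such as your LangSmith API key; see the LangSmith docs for the exact variable names):

```python
# Assumes tracing is enabled via environment variables (API key plus the
# tracing flag -- check the LangSmith docs for the exact names).
from langsmith import traceable
from langsmith.wrappers import wrap_openai
from openai import OpenAI

client = wrap_openai(OpenAI())  # every call through this client gets traced

@traceable  # traces inputs, outputs, latency, and token usage for this function
def summarize(article: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize this for an SEO description:\n{article}"}],
    )
    return response.choices[0].message.content
```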
A couple real examples of how these tools have helped in the past:
- I was getting empty responses, and beneath layers of error reporting and other libraries the issue was illegible. Looking at the input and output tokens per call revealed immediately that I was hitting the token limit from the input alone.
- I was investigating latency issues for an e-commerce assistant agent. I was able to trace the calls and see that the majority of the time was being spent on fetching data from 3rd party APIs in some cases, and rate limits in others. This helped me prioritize the work to optimize the application.
Resources
Tools
People to Follow
- Hamel Husain - Parlance Labs, prev ML eng at Airbnb, GitHub
- Doug Safreno - CEO of Gentrace
- Jason Lopatecki - CEO of Arize AI
- Haroon Choudery - CEO of Autoblocks
- Eugene Yan - ML, RecSys, LLMs @ Amazon; prev led ML at Alibaba
- Shreya Shankar - DB & HCI & AI PhD student UC Berkeley EECS, building 📜http://docetl.org
- Aparna Dhinakaran - AI Founder: building Arize AI & Arize Phoenix
6. Mindset
Colin Jarvis presenting at OpenAI's Dev Day SF 2024
Mindset is a meta-skill that's easy to dismiss, but it's what ties the rest together.
Tangibly, you need to be able to:
- Be prepared to throw away what you learned last week: Arguably the hardest thing to do is accept how fast information decays in this space. As an example, specific prompting styles that were effective for the GPT-3.5 generation of models, like role-based prompting, were displaced by paradigms like chain of thought in subsequent generations.
- Study New Capabilities and Use Cases: To the first point, you need to keep studying new capabilities and use cases, such as the beta features mentioned in the Models section like Anthropic's Model Context Protocol and OpenAI's Realtime API.
- Build First, Build Quickly: Your learning rate is limited by your ability to quickly prototype and iterate in code. The term "non-technical founder" should go extinct with tools like Cursor, v0, Windsurf, and Claude Projects doing so much of the heavy lifting for you.
Resources
Education
- How Elvis, founder of DAIR.AI, uses AI
- Sam Parr's Hampton AI Report
- Greg Kamradt's Early Signals Idea List
- Balancing accuracy, latency, and cost at scale
Vocabulary
Core Model Concepts
- Streaming: A method of processing data in a continuous flow, rather than all at once. This is used in LLMs to produce text output progressively.
- Batch Processing: Processing data in groups or "batches." This can be more efficient for certain types of tasks but less interactive than streaming.
- Prompt Caching: Storing previously used prompts and their corresponding outputs. This saves compute and time if a prompt is used again.
- Assistants: A specific API feature that allows you to create agents with defined instructions and tools
Prompt Engineering
- Prompt Engineering: The art and science of crafting effective prompts to elicit the desired behavior from a language model.
- Chain of Thought (CoT) Prompting: A prompting technique where the model is asked to explain its reasoning step-by-step, leading to more accurate results.
- Structured Outputs: Requesting the model to respond in a predefined format such as JSON or a table.
Context and Retrieval
- Retrieval Augmented Generation (RAG): A technique that enhances a language model's response by retrieving relevant information and incorporating it into the prompt.
- Embeddings: Vector representations of words, phrases, or documents, capturing their meaning in numerical form for computer processing.
- Semantic Search: A search technique that focuses on the meaning and intent of a user's query rather than just matching keywords.
- Chunking: Breaking up large text into smaller pieces before indexing it in a vector database
Orchestration and Agents
- Orchestration: The process of combining different tools or models to achieve a larger goal.
- Agent: A language model given access to tools that can decide when a job is done
- Long Term Memory: The ability for a model to retain and recall information over extended periods.
Evaluations and Observability
- Evaluations: The process of assessing the performance of your model's output.
- Observability: The ability to track and understand the inner workings of your system. In LLM applications this might include logging requests and understanding costs
- Tracing: A method of following the flow of requests and data within an application, useful for debugging.
Other Terms
- Non-deterministic: Describes the output of an LLM as it might be different each time you call it.
- Vector Representation: A way to represent text as a list of numbers that computers can read and compute against.
This post got shared with friends and family first. If you want to get the next one early, come join the newsletter