The 6 AI Engineering Patterns In 2025
Learning how to build AI-powered products
Welcome! This post got shared with friends and family first. If you want to get the next one early, come join the newsletter
If you haven't noticed, AI and language models are changing the skills needed to be a successful engineer.
The people who are learning these new skills are getting jobs that pay up to $435K per year.
They’re creating apps using LLMs that generate millions of dollars per year at 90% margins with no employees.
They’re building features that took months, in minutes.
They’re using the 6 AI Engineering Patterns to build.
6 AI Engineering Patterns
Models
Understanding the foundational AI models, their capabilities, and how to effectively integrate them into your applications.
Prompting
Mastering the art of crafting effective prompts to get reliable, consistent outputs from AI models.
Context (RAG)
Implementing context in your applications (Retrieval Augmented Generation) to enhance AI responses with relevant external knowledge and data.
Orchestration (Agents)
Building and managing AI agents that can coordinate multiple tasks and work together to solve complex problems.
Evals & Observability
Implementing robust evaluation frameworks and monitoring systems to measure and improve AI system performance.
Mindset
Developing the right approach to AI engineering, including how to scale and best practices.
Hello World
First, let's start from the beginning. It's good to start aligned, right?
Here we have a hello world example to get you started.
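If you'd rather follow along in code, a minimal version with the OpenAI Python SDK might look something like this (it assumes an OPENAI_API_KEY in your environment; any provider's SDK works the same way in spirit):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from your environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say 'hello, world!'"}],
)

print(response.choices[0].message.content)
```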
Great, now you've mastered LLMs. But wait, want to learn a bit more? Let's go deeper.
1. Models
Getting Started
Start by experimenting with the chat interfaces from the major labs:
Experimenting with "thinking" or "reasoning" models like o1, "balanced" models like claude-3-5-sonnet, and "workhorse" models like gpt-4o-mini will help you build an intuition for which categories of problems and prompt strategies are best suited to each model (more on this later).
Through these interfaces you can start to get a feel for the models. But you'll quickly see that many capabilities are better accessed through API calls:
- Use developer playgrounds like OpenAI's
- Study both API docs and recent cookbook examples. (There's often some drift between the API docs and SDKs, which is why it's important to look at the examples.)
Main parameters we see developers using:
- response_format: "json_schema" - For structured outputs
- temperature - Lower values for more deterministic responses
- max_tokens - Control response length
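Here's a quick sketch of how these look in an OpenAI chat completion call (the model and schema are illustrative; the Prompting section covers a friendlier pydantic-based version of structured outputs):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0.2,   # lower values -> more deterministic responses
    max_tokens=200,    # cap the response length
    response_format={  # structured output via a JSON schema
        "type": "json_schema",
        "json_schema": {
            "name": "sentiment",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {"label": {"type": "string"}},
                "required": ["label"],
                "additionalProperties": False,
            },
        },
    },
    messages=[{"role": "user", "content": "Classify the sentiment of: 'I love this shirt!'"}],
)

print(response.choices[0].message.content)  # e.g. '{"label": "positive"}'
```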
More Advanced Concepts
Start by understanding tool calling, latency and cost features, and open source models. Model training, fine-tuning, and model routing only apply to more mature applications (you don't need to learn these in the beginning). New features from the major labs, like Anthropic's MCP or OpenAI's Realtime API, will unlock a new breed of applications in the near future, but the best practices for using them are still being explored.
Tool Calling
Tool calling gives your LLMs the ability to do things besides generate text. This is a must-know for anyone trying to build "agentic" applications. Take, for example, a support conversation where a customer asks for a refund on a damaged order.
An LLM without tool calling would do nothing in this case, because all it can do is respond to the user with text. An LLM with tool calling enabled could trigger a function in your backend via its response.
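Here's a sketch of what that can look like with the OpenAI SDK; the issue_refund tool, its parameters, and the conversation are hypothetical:

```python
from openai import OpenAI

client = OpenAI()

# A hypothetical refund tool -- the name, parameters, and handler are illustrative.
tools = [
    {
        "type": "function",
        "function": {
            "name": "issue_refund",
            "description": "Issue a refund for a given order",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {"type": "string"},
                    "reason": {"type": "string"},
                },
                "required": ["order_id"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a support agent for a t-shirt store."},
        {"role": "user", "content": "My order #1234 arrived ripped. I'd like my money back."},
    ],
    tools=tools,
)

# If the model decided a refund is warranted, it returns a tool call instead of plain text.
tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    print(tool_calls[0].function.name)       # e.g. "issue_refund"
    print(tool_calls[0].function.arguments)  # JSON string like '{"order_id": "1234", ...}'
```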
Your app is still responsible for actually issuing the refund, but the LLM decided on its own to issue a refund because it made sense given the context of the conversation.
Performance and Cost Optimization
Building a latency sensitive application like a chatbot? Look into streaming and prompt caching.
Aside: Prompt caching happens automatically for OpenAI's APIs, whereas Anthropic requires an explicit cache_control parameter to be set. This is just one example of the many divergences between providers, and it's advised to look at the source documentation for each.
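For illustration, a minimal Anthropic call that marks a large, stable system prompt as cacheable might look like the sketch below (the model alias and prompt contents are placeholders; check Anthropic's docs for current details):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from your environment

LONG_SYSTEM_PROMPT = "...your full support playbook, FAQs, policies..."  # the stable bulk of the prompt

response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # mark this block for caching
        }
    ],
    messages=[{"role": "user", "content": "What's your return policy?"}],
)

print(response.content[0].text)
```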
For latency insensitive use cases, like running an eval suite overnight, you should look at the batch API to run inference at a discounted rate.
Running Open Source Models
Running open-source models is increasingly a viable alternative. There are managed services like together.ai or fireworks.ai that serve open-source models, routers like OpenRouter that sit in front of many providers, and tools like Ollama for running models locally on your own hardware.
Further learning on models
Videos & Podcasts
- Check out the previous show and tell videos for experiences with working with models.
- Lex Fridman podcast with the Cursor team
Blogs and People to Follow
- Applied LLMs - A collective of the top Applied AI consultants
- swyx's Latent Space - All in one community/blog/podcast/etc. covers research and conferences (advanced)
- Jason Liu - Creator of Instructor. Posts lots of tactical advice about running a consultancy and actually doing the work.
- Eugene Yan - Sr. Applied Scientist @ Amazon, expert on ML and Recommendation Systems
- Justine Tunney - Cracked low level hacker
- Deedy Das - VC @ Menlo Ventures
2. Prompting
Prompting is the art of eliciting desired behaviors from AI models. For traditional software engineers this might first come across as gimmicky, but it's a legitimate discipline. Even the top labs are hiring for strong prompting skills! ($375K/yr)
In this section you will learn how to write prompts and how to manage them in the context of an application.
Basics
First, stay away from thinkfluencer clickbait like this:
If there's only one thing you remember: simply write as if you were writing instructions for a generally resourceful person. In other words, if you're throwing a scrambled wall of text at a model and getting frustrated when it inevitably fails to do what you need, just ask yourself: "if someone sent this to me, would I be able to understand and respond?"
Once you have this core tenet committed to memory, you can get more tactical with the following techniques.
Anatomy of a prompt:
- Instruction: The thing you want it to do.
- Context: External information that can aid the response.
- Output Indicator: The type or format of the output.
- Input: If you're familiar with OOP, an input is to a prompt as an instance is to a class.
Let's run through an example:
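Suppose you want the model to analyze customer feedback (the running example in this section). A straightforward first pass that includes each of these parts might look something like this (the wording is illustrative):

```
Instruction: Analyze the customer feedback below and summarize the main complaint.

Context: We sell custom t-shirts online. Common issues are sizing, print quality, and shipping delays.

Output Indicator: Reply with a one-sentence summary and a sentiment label (positive / neutral / negative).

Input: "The design looked great but the shirt shrank two sizes after one wash. Pretty disappointed."
```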
This isn't bad but let's see if we can improve it. There are four key techniques most modern prompt engineers use:
1. Chain of Thought (Think Out Loud): Encourage the model to explain its thought process step-by-step, rather than immediately providing an answer. It can be as simple as appending a line like "Let's think step by step" to the instruction section of your prompt. If you build an agent with CrewAI, it wraps your prompt behind the scenes with exactly this kind of chain-of-thought instruction.
2. Include Examples: Improve prompt performance by providing clear examples of desired input and output. Building upon the customer feedback example, you might add an "Examples" section with a few representative input/output pairs.
3. Use Structured Elements: Consider using XML tags or markdown to structure the prompt into clearly labeled sections (instruction, context, examples, and so on).
4. Structured Outputs: Master the ability to get language models to output structured data like JSON or tables to enable integration with other computer programs and systems. In the OpenAI SDKs, this is usually accomplished by simply passing a pydantic class to the response_format parameter, as shown in the official example here.
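A minimal sketch of that pattern using the SDK's parse helper, continuing the customer feedback example (the class and field names are made up):

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class FeedbackAnalysis(BaseModel):
    sentiment: str
    summary: str
    follow_up_required: bool

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Analyze the customer feedback you are given."},
        {"role": "user", "content": "The shirt shrank after one wash. I want a replacement."},
    ],
    response_format=FeedbackAnalysis,  # the pydantic class doubles as the output schema
)

analysis = completion.choices[0].message.parsed  # a FeedbackAnalysis instance
print(analysis.follow_up_required)
```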
Prompt Management
After the prototyping phase, most applications quickly outgrow the "directly in the code" approach. Some reasons for this include:
- Different models perform better with different formats. For example, Claude's models reportedly perform better with XML tags to mark section breaks instead of markdown.
- A/B testing the performance of different prompts
- Dynamic prompts based on real time input like the user's language to generate a more idiomatic response
Dave Ebbelaar of Datalumina has a great video on how to iteratively manage prompts. Start at the step that works for you, then work your way up to the more sophisticated methods. TLDR:
- Embed it in code directly (your default)
- Put it in text files and import them
- Use a templating system with text files to construct dynamic prompts (a minimal sketch follows this list)
- Use an external tool like Promptfoo (endorsed by Tobi from Shopify) or PromptLayer
- Use your own database.
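To make step 3 concrete, here's a minimal sketch using Python's built-in string templating (the file path and variables are illustrative):

```python
from pathlib import Path
from string import Template

# prompts/support_reply.txt might contain:
#   You are a support agent for $store_name.
#   Respond to the customer in $language.
#   Customer message: $message
template = Template(Path("prompts/support_reply.txt").read_text())

prompt = template.substitute(
    store_name="Acme Tees",
    language="Spanish",  # dynamic, e.g. based on the user's detected locale
    message="¿Cuál es su política de devoluciones?",
)
```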
Resources
- Introduction to Optimized Prompting by Eugene Yan
- Prompt Engineering Guide by DAIR.AI
- Prompt Tuning Playbook, by Google DeepMind engineers
3. Context and Retrieval
This section focuses on enhancing model responses by providing relevant context beyond their training data. We'll first explore a standard Retrieval Augmented Generation (RAG) Pipeline and then build on this understanding with common issues and tips.
What is RAG?
Let's say you sell custom t-shirts online. If you wanted to host a customer support chatbot to answer questions like "what's your return policy?", you would need some way to inform the model about this specialized knowledge. Getting the right context in a timely manner is no simple feat and the AI community has tirelessly iterated on the following approach:
1. Ingestion
You have a set of documents (continuing the above example: FAQs, support docs, etc.) that you want to give to an LLM. You take these documents and stash them in some kind of database, in a format that makes them easy to compare at query time. For example:
- If you're using a vector database, you create embeddings, which are numerical representations of text.
- If you're sending docs to something like Elasticsearch, it calculates term frequencies, field lengths, and other metrics to make ranking and comparison easy.
2. Retrieval
At query time, you want to match a user's live query to a relevant document.
- With a vector database, this is done by transforming the user's query into an embedding and then searching for the closest match using a measure like cosine similarity.
- With Elasticsearch, this is done by extracting the same features (term frequencies, field lengths, etc.) from the user's query and then ranking matches with an algorithm like BM25.
3. Response
You then identify and pass the most relevant chunk(s) to the LLM as context. Sometimes this is as primitive as taking the top n chunks from whichever ranking algorithm you used in the previous step and stuffing them into the model context. Many projects have seen success with Cohere's Rerank model as a discrete "reordering" step to identify the most relevant chunks among a batch of candidates.
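To make this concrete, here is a stripped-down sketch of the vector-database flavor of the pipeline using OpenAI embeddings and plain cosine similarity (no real vector database; the documents are made up):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

docs = [
    "Returns are accepted within 30 days of delivery for unworn items.",
    "Standard shipping takes 3-5 business days within the US.",
]

def embed(texts):
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

# Ingestion: embed the documents once and store the vectors.
doc_vectors = embed(docs)

# Retrieval: embed the live query and rank documents by cosine similarity.
query = "what's your return policy?"
query_vector = embed([query])[0]
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
best_doc = docs[int(np.argmax(scores))]

# Response: stuff the best match into the prompt as context.
answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": f"Answer using this context:\n{best_doc}"},
        {"role": "user", "content": query},
    ],
)
print(answer.choices[0].message.content)
```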
It's important to remember that garbage in == garbage out. Whatever fancy techniques you use fall secondary to your document quality. Do your documents actually contain the information you need to answer your users' questions? Look at your data!
Common Issues and Tips
Questions aren’t well defined and lead to irrelevant matching
- Add a query understanding step to your pipeline. It basically means "don't take the user's query verbatim; instead, reason about what they're really looking for" (see the sketch after these examples).
- Ex. the query "show me the top fashion trends of this year" needs to map "this year", a relative term, to an absolute date.
- Ex. the query "which shirts are the best sellers and are they in stock?" represents two distinct queries and should be split into two separate retrieval steps.
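A sketch of such a step, using structured outputs to rewrite and split queries (the QueryPlan class and the prompt wording are illustrative):

```python
from datetime import date
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class QueryPlan(BaseModel):
    rewritten_queries: list[str]  # one standalone query per retrieval step

plan = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": (
                f"Today is {date.today().isoformat()}. Rewrite the user's question into one or "
                "more standalone search queries, resolving relative dates and splitting "
                "compound questions."
            ),
        },
        {"role": "user", "content": "which shirts are the best sellers and are they in stock?"},
    ],
    response_format=QueryPlan,
)

print(plan.choices[0].message.parsed.rewritten_queries)
# e.g. ["best selling shirts", "inventory status of best selling shirts"]
```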
Chunks aren't storing complete information:
- Explore the 5 levels of text splitting
- Use semantic chunking to ensure discrete chunks of text contain complete information, as opposed to more arbitrary strategies like splitting by character count or sentence count.
- Have the LLM propose questions for every chunk, and then vectorize the pairs. This makes intuitive sense because the user query will likely resemble the question more than the source text.
The model responds poorly even with relevant context
- Don't be fooled by large context windows. We may be past the days of 8k limits, but performance still tends to degrade as prompt length grows.
Resources
- Fullstack Retrieval Series - A comprehensive guide to building a retrieval system written by Greg Kamradt
- Jason Liu's RAG posts - Jason has a ton of free articles on RAG and how to improve it.
- LocalLLaMA subreddit - Tons of alpha (ground level advice) written by real engineers like here
- Jo Kristian Bergum - Chief Scientist at Vespa Engine. Shares implementation tips like this thread
- Chunkviz - A tool to visualize your chunks
4. Orchestration and Agents
An agent is simply a system that has autonomy over "how it accomplishes a given task." Contrast this to a Workflow (think Zapier) which is a process with a predefined set of steps.
In this section we'll first develop an understanding of what agents are actually used for in the real world. Following that, we'll cover the current best practices for building with agents.
Real World Use Cases
Before diving into implementation, awareness of what is and isn't working can guide your decision on whether to use agents or a more deterministic approach.
At the time of writing, the current hype cycle is all about agents. Beware of the golden hammer fallacy: many use cases are better served by a thoughtfully designed workflow (that may not even have an AI component!). There are a lot of sexy experiments and startups that show early promise but aren't creating real enterprise value yet.
The domains that are getting traction seem to be mainly in lead generation and coding.
- Lead Generation
Voice agents: YC has a ton of companies in recent batches that are verticalized voice agents for home services, medicine, and more. These agents can service inbound leads who dial in by answering questions, handling registration and scheduling, and more.
- Coding
Products like Cursor's Composer or Cognition Labs' Devin are capable of planning and writing code. Devin goes the extra mile by validating its own work and opening its own PRs.
- Content Creation (honorable mention)
Tools like Gumloop can build agents that perform a sophisticated sequence of tasks that a growth marketer would normally do (ad performance analysis, generating new iterations, etc.). While this is still a "workflow" by definition, it is seriously impressive, hence the honorable mention.
Tooling and Frameworks
Low Code Tools
First try implementing your idea with low-code tools like Gumloop or Lindy. Devs often carry a negative impression of tools in this category because of the previous generation of low-code builders like Bubble or FlutterFlow, which are notorious for producing buggy products with bloated codebases. The tide is shifting, however: these products are already driving real enterprise value despite launching within the past year.
Graph Frameworks
Developers hit the limits of these tools when they're modeling something that requires routing or orchestration. LangGraph is a thoughtfully designed piece of software, offering features that actually matter when building with agents. Our recommendation is to jump into LangGraph after you've prototyped your agent with a low-code tool. The learning curve on LangGraph can be steep, so it's important to know what you need before you learn the syntax.
A couple pointed examples:
- Breakpoints - allow for human review of the agent's work. Anything that has negative consequences if performed incorrectly, like making large transactions, could benefit from this affordance.
- Checkpoints and Persistence - the idea that you have state that you want to persist between interactions. For an e-commerce chatbot you might want the cart, or the shipping information to persist.
- Tool Calling - steps need the ability to call tools (sometimes successively or in parallel) and update the agent's state.
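Here's a rough sketch of how breakpoints and persistence fit together in LangGraph (node logic is stubbed out and all names are illustrative; check the LangGraph docs for current APIs):

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class OrderState(TypedDict):
    cart: list[str]
    refund_amount: float

def plan_refund(state: OrderState) -> dict:
    # In a real agent this node would call an LLM and/or tools.
    return {"refund_amount": 25.0}

def execute_refund(state: OrderState) -> dict:
    # The risky step: actually move money.
    return {"cart": []}

builder = StateGraph(OrderState)
builder.add_node("plan_refund", plan_refund)
builder.add_node("execute_refund", execute_refund)
builder.add_edge(START, "plan_refund")
builder.add_edge("plan_refund", "execute_refund")
builder.add_edge("execute_refund", END)

graph = builder.compile(
    checkpointer=MemorySaver(),           # persistence: state survives between interactions
    interrupt_before=["execute_refund"],  # breakpoint: pause for human review before the risky step
)

# The thread_id ties checkpoints to a single conversation.
config = {"configurable": {"thread_id": "order-1234"}}
graph.invoke({"cart": ["tshirt-42"], "refund_amount": 0.0}, config=config)
```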
Resources
Frameworks
- LangGraph - A framework for building agents
- CrewAI - A framework for building multi-agent systems
- Haystack - A framework for building agents
Low-Code
Educational Resources
People to Follow
- Harrison Chase - LangChain Founder
- Joao Moura - CrewAI Founder
- Max Brodeur-Urbas - Gumloop Founder
- Alex Reibman - CEO of AgentOps
5. Evaluations and Observability
Evaluations (evals) and observability are what distinguishes the toys from the real apps. Jason put it best here:
This section demonstrates how easy it is to add these aspects to your application.
Evaluations
Evaluations are the unit tests of your language model applications. They help detect regressions or unintended behaviors as your system evolves. This is especially important when dealing with applications that produce more ambiguous outputs, such as creative text generation or nuanced decision-making systems.
1. Step One: What does good and bad output look like?
Let's consider an application that summarizes articles on Substack to create SEO descriptions. We'll evaluate the summaries based on a couple of criteria:
- Relevance: Does the summary capture the main points of the article?
- Conciseness: Is the summary free from unnecessary information?
You can start with just one by the way - what's important is to choose criteria that are relevant to your domain.
2. Step Two: Write a simple test
Don't overcomplicate it: cook up a few examples that represent your inputs. Read over the example first and we'll discuss the tactics below.
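A simple version might look something like this (the thresholds and the judge prompt are illustrative, not a prescribed standard):

```python
from openai import OpenAI

client = OpenAI()

examples = [
    {
        "article": "Long article text about AI engineering patterns...",
        "summary": "A short SEO description of the article.",
    },
]

def evaluate_conciseness(summary: str) -> bool:
    # Yes/no assertion: just a string length check.
    return len(summary) <= 300

def evaluate_relevance(article: str, summary: str) -> bool:
    # LLM-as-a-judge, kept as a last resort for criteria that are hard to codify.
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Does this summary capture the main points of the article? Answer YES or NO.\n\n"
                f"Article:\n{article}\n\nSummary:\n{summary}"
            ),
        }],
    )
    return "YES" in verdict.choices[0].message.content.upper()

for example in examples:
    assert evaluate_conciseness(example["summary"])
    assert evaluate_relevance(example["article"], example["summary"])
```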
Let's explore the tactics:
- Yes/No Assertions: Prefer boolean values over multi-point scales (like low-medium-high, 1-10, etc.), as shown above. It makes interpretation easier and is amenable to more sophistication if needed.
- Use Simple Functions: Things like regex, string includes, etc. can still be useful. The evaluate_conciseness check is just a string length check! Use LLM-as-a-judge as a last resort for criteria that are difficult to codify, like relevance.
The folks at Applied LLMs have a great guide with more advanced content like comparative evaluation, deployment and more here
Observability
Observability enables two things:
- Tracing: Logging your different language model calls to easily debug and understand application behavior.
- Cost Management: Tracking the costs associated with each LLM call, latency, and errors.
Here's a screenshot of LangSmith:
Integrating LangSmith (or any other observability tool) takes a couple of lines of code, and it's free for thousands of calls per month.
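For reference, a minimal LangSmith setup might look like the sketch below (it assumes tracing is enabled via environment variables such as your LangSmith API key; see the LangSmith docs for the exact variable names):

```python
# Assumes tracing is enabled via environment variables (API key plus the
# tracing flag -- check the LangSmith docs for the exact names).
from langsmith import traceable
from langsmith.wrappers import wrap_openai
from openai import OpenAI

client = wrap_openai(OpenAI())  # every call through this client gets traced

@traceable  # traces inputs, outputs, latency, and token usage for this function
def summarize(article: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize this for an SEO description:\n{article}"}],
    )
    return response.choices[0].message.content
```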
A couple real examples of how these tools have helped in the past:
- I was getting empty responses, and beneath layers of error reporting and other libraries the issue was illegible. Looking at the input and output tokens per call revealed immediately that I was hitting the token limit from the input alone.
- I was investigating latency issues for an e-commerce assistant agent. I was able to trace the calls and see that the majority of the time was being spent on fetching data from 3rd party APIs in some cases, and rate limits in others. This helped me prioritize the work to optimize the application.
Resources
Tools
People to Follow
- Hamel Husain - Parlance Labs, prev ML eng at Airbnb, GitHub
- Doug Safreno - CEO of Gentrace
- Jason Lopatecki - CEO of Arize AI
- Haroon Choudery - CEO of Autoblocks
- Eugene Yan - ML, RecSys, LLMs @ Amazon; prev led ML at Alibaba
- Shreya Shankar - DB & HCI & AI PhD student UC Berkeley EECS, building 📜http://docetl.org
- Aparna Dhinakaran - AI Founder: building Arize AI & Arize Phoenix
6. Mindset
Colin Jarvis presenting at OpenAI's Dev Day SF 2024
Mindset is a meta-skill that's easy to dismiss, but it's what ties the rest together.
Tangibly, you need to be able to:
- Be prepared to throw away what you learned last week: Arguably the hardest thing to do is accept how fast information decays in this space. As an example, specific prompting styles that were effective for the GPT-3.5 generation of models, like role-based prompting, were displaced by paradigms like chain of thought in subsequent generations.
- Study New Capabilities and Use Cases: To the first point, you need to keep studying new capabilities and use cases, such as the beta features mentioned in the Models section like Anthropic's Model Context Protocol and OpenAI's Realtime API.
- Build First, Build Quickly: Your learning rate is limited by your ability to quickly prototype and iterate in code. The term "non-technical founder" should go extinct with tools like Cursor, v0, Windsurf, and Claude Projects doing so much of the heavy lifting for you.
Resources
Education
- How Elvis, founder of DAIR.AI, uses AI
- Sam Parr's Hampton AI Report
- Greg Kamradt's Early Signals Idea List
- Balancing accuracy, latency, and cost at scale
Vocabulary
Core Model Concepts
- Streaming: A method of processing data in a continuous flow, rather than all at once. This is used in LLMs to produce text output progressively.
- Batch Processing: Processing data in groups or "batches." This can be more efficient for certain types of tasks but less interactive than streaming.
- Prompt Caching: Storing previously used prompts and their corresponding outputs. This saves compute and time if a prompt is used again.
- Assistants: A specific API feature that allows you to create agents with defined instructions and tools
Prompt Engineering
- Prompt Engineering: The art and science of crafting effective prompts to elicit the desired behavior from a language model.
- Chain of Thought (CoT) Prompting: A prompting technique where the model is asked to explain its reasoning step-by-step, leading to more accurate results.
- Structured Outputs: Requesting the model to respond in a predefined format such as JSON or a table.
Context and Retrieval
- Retrieval Augmented Generation (RAG): A technique that enhances a language model's response by retrieving relevant information and incorporating it into the prompt.
- Embeddings: Vector representations of words, phrases, or documents, capturing their meaning in numerical form for computer processing.
- Semantic Search: A search technique that focuses on the meaning and intent of a user's query rather than just matching keywords.
- Chunking: Breaking up large text into smaller pieces before indexing it in a vector database
Orchestration and Agents
- Orchestration: The process of combining different tools or models to achieve a larger goal.
- Agent: A language model given access to tools that can decide when a job is done
- Long Term Memory: The ability for a model to retain and recall information over extended periods.
Evaluations and Observability
- Evaluations: The process of assessing the performance of your model's output.
- Observability: The ability to track and understand the inner workings of your system. In LLM applications this might include logging requests and understanding costs
- Tracing: A method of following the flow of requests and data within an application, useful for debugging.
Other Terms
- Non-deterministic: Describes the output of an LLM as it might be different each time you call it.
- Vector Representation: A way to represent text as a list of numbers that computers can read and compute against.
This post got shared with friends and family first. If you want to get the next one early, come join the newsletter