Sully, CEO of Otto AI
Hacking together GPT-4 and Claude, using AI to write tests and why talking to your computer like a human gets better results
Sully Omar (LinkedIn, X) is the CEO of Cognosys, the company behind Otto. He's one of the best LLM practitioners I've met, and you can tell he has a really deep feel for how these models actually work. He speaks from experience. In this interview, we go through his three-tier system for ranking language models. He shows us how he uses meta prompts to develop the real prompts he runs in production. He also shows us his Cursor development flow, where he has the language model write the tests first and then the actual code. And finally, he walks us through distilling large language models down to small ones without losing performance.
Insights
- Treat LLMs like humans: Speaking to the model naturally, as if it were a human, improves performance. Voice input facilitates this natural conversation style.
- Embrace the "vibe": Developing an intuitive understanding of each model's "personality" and nuances comes from consistent use and experimentation. This "vibe" helps predict how a model might respond to different prompts and tasks.
- Min-maxing is key: Constantly seek ways to optimize workflows by combining different models' strengths. Don't be afraid to experiment with unusual combinations and orchestration strategies.
- The last 5-10% is hard: Getting an AI product from 90-95% accuracy to near-perfect is exceptionally challenging, even with robust evaluations. Be prepared for this final hurdle.
- Model routing is coming: The future likely involves automated model selection (routing) based on task characteristics, but current limitations make manual routing more effective.
- Distillation requires vigilance: While distilling knowledge from larger to smaller models offers efficiency, it requires meticulous data pipelines and evaluations to avoid performance regression.
- Context is king (especially for Tier 1): Thinking models excel when given ample context. Use Tier 2 models to gather and structure information before feeding it to Tier 1 for deeper analysis.
- Structured output struggles: Some models (like Claude) struggle with complex structured outputs. Consider alternative models or workarounds (like using a different model for JSON generation) in these cases.
- Needle in a haystack vs. reasoning: Some models are better at finding specific information within large datasets (Gemini), while others excel at reasoning over that data (GPT-4o mini). Choose the right tool for the job.
- Don't be afraid to "hack": Early-stage LLMs require creative workarounds ("hacks") to achieve desired results. These hacks evolve as the technology matures, but they will likely remain a part of the development process.
Model Tier System & Usage
- Tier 3 (Workhorses): GPT-4o mini and Gemini Flash. Cheap and fast, used for high-volume tasks like document processing and podcast analysis.
- Tier 2 (Balanced): GPT-4, Claude 3.5, Gemini Pro. Good balance of price and performance, ideal for common tasks like coding, writing, and email editing.
- Tier 1 (Thinking Models): o1 and o1-preview, and other "thinking" models. Used for complex reasoning and deep dives. Sully's workflow involves building context in Tier 2, then feeding that context to Tier 1 for improved results. Deduplication is another strong use case for Tier 1.
- Multi-Model Approach: Sully leverages different providers because each model has unique strengths and weaknesses. He even orchestrates models to work together, like Claude managing GPT-4o mini for structured output (a sketch of this pattern follows this list).
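As a rough illustration of that orchestration pattern, here is a minimal sketch assuming the official Anthropic and OpenAI Node SDKs; the model names, tool schema, and helper functions are illustrative, not the production code discussed in the interview.

```ts
// Sketch: Claude handles the conversation but delegates strict JSON generation
// to GPT-4o mini through a tool call.
import Anthropic from "@anthropic-ai/sdk";
import OpenAI from "openai";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY
const openai = new OpenAI();       // reads OPENAI_API_KEY

// The only tool Claude gets: "send these instructions to the JSON model."
const delegateTool = {
  name: "generate_structured_output",
  description: "Delegate strict JSON generation to a smaller model.",
  input_schema: {
    type: "object" as const,
    properties: { instructions: { type: "string" } },
    required: ["instructions"],
  },
};

// Claude plans the task and decides what structured output it needs.
async function askClaude(userTask: string) {
  return anthropic.messages.create({
    model: "claude-3-5-sonnet-20240620",
    max_tokens: 1024,
    tools: [delegateTool],
    messages: [{ role: "user", content: userTask }],
  });
}

// When Claude calls the tool, relay its instructions to GPT-4o mini in JSON mode.
async function handleToolCall(instructions: string) {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    response_format: { type: "json_object" },
    messages: [
      { role: "system", content: "Return only valid JSON, no prose." },
      { role: "user", content: instructions },
    ],
  });
  return JSON.parse(completion.choices[0].message.content ?? "{}");
}
```

In a full loop, the parsed JSON would be sent back to Claude as a tool_result block so it can finish its answer.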
Prompt Engineering
- Meta-Prompting: Begin with a general problem statement and ask an LLM (like Claude or GPT-4) to generate a prompt. Refine this prompt further with a thinking model (like o1-preview) for optimization (see the sketch after this list).
- Voice Input: Sully uses voice input for faster, more natural prompting and context building.
- Prompt Management: Uses LangSmith for prompt evaluation and dataset management. Prompts themselves are version-controlled in GitHub alongside code.
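A minimal sketch of the meta-prompting loop described above, assuming the OpenAI Node SDK; the model names, prompt wording, and the metaPrompt helper are illustrative rather than an exact reproduction of Sully's workflow.

```ts
// Sketch of the two-pass meta-prompting loop: a tier-2 model drafts the prompt,
// then a thinking model critiques and rewrites it.
import OpenAI from "openai";

const openai = new OpenAI();

async function ask(model: string, content: string): Promise<string> {
  const completion = await openai.chat.completions.create({
    model,
    messages: [{ role: "user", content }],
  });
  return completion.choices[0].message.content ?? "";
}

async function metaPrompt(taskDescription: string): Promise<string> {
  // Pass 1: a rough draft, phrased the way you'd ask a person for help.
  const draft = await ask(
    "gpt-4o",
    `I need a bit of help creating a prompt for a use case: ${taskDescription}\n` +
      `Write the system prompt that would handle this well.`
  );

  // Pass 2: hand the draft plus the original ask to a thinking model to optimize.
  return ask(
    "o1-preview",
    `You're going to help me optimize a prompt. Another model drafted this:\n\n${draft}\n\n` +
      `The original request was: ${taskDescription}\n\n` +
      `Point out weak spots, then return the improved prompt.`
  );
}
```

From there, the generated prompt still goes through evals before it touches production, as Sully notes in the interview.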
Development Workflow
- LLM-Driven Test-Driven Development (TDD): Have the LLM generate tests before writing the code. This improves code quality and lets the LLM self-correct by analyzing test failures. The approach is particularly helpful for complex, multi-file projects (see the sketch below).
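Here is a small illustration of the idea in the spirit of the Bun demo from the transcript; the file names, test cases, and reverseString function are hypothetical.

```ts
// reverse.test.ts — the test file the model writes first (Bun's built-in test runner).
import { describe, expect, test } from "bun:test";
import { reverseString } from "./reverse";

describe("reverseString", () => {
  test("reverses a simple word", () => {
    expect(reverseString("hello")).toBe("olleh");
  });
  test("handles an empty string", () => {
    expect(reverseString("")).toBe("");
  });
  test("leaves a single character unchanged", () => {
    expect(reverseString("a")).toBe("a");
  });
});

// reverse.ts — the implementation the model writes second. When `bun test` fails,
// the error output goes back into the composer so the model can fix its own code.
export function reverseString(input: string): string {
  return [...input].reverse().join("");
}
```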
What're the smartest people in AI talking about?
- Test-time compute: Adding more compute at test time to increase performance.
- Thinking models (like o1): Using step by step planning at inference time to increase performance.
- Agentic tasks: Giving the AI more autonomy to get things done.
- Built-in tool usage: Models trained to natively call tools, like Anthropic's computer use.
- Model distillation: Training a smaller model based off the output of a bigger model.
- Rigorous evaluation: Pressure testing with evals.
- Performance plateau: Wondering if AI has hit a peak.
What's in Sully's Toolkit?
- LLM Platforms: Google AI Studio (Gemini), ChatGPT, Claude, OpenAI Playground, Anthropic Workbench.
- Coding Tools: Cursor, Replit, VS Code.
- Other: Excalidraw, Whisperflow (transcription), LangSmith (evals), v0.
Transcript
00:00:00 Sully: It lets you use AI in basically every nook and cranny of your day to day. When that model came out, it actually opened up a lot of things that you could do. We use a lot of different providers, and that's because what we've seen with our internal evals is that they're all so nuanced and different in like a variety of different ways. But you also start to see where they lack. You'll get to an AI product, you'll get into 90%, even 95%. But that last 5, 10% is nearly impossible.
00:00:27 Greg: How do you think about model distillation?
00:00:29 Sully: It's very powerful, but you have to be very careful.
00:00:43 Greg: I just had an amazing conversation with Sully Omar, the CEO of Cognosys, the company behind Otto.ai. Not only is he one of the best LLM practitioners that I've met, but you can tell he has a really deep feeling for how these models are actually working. He speaks from experience. In this interview, we go through his 3 tier system of actually ranking language models. He shows us how he uses meta prompts to develop his real prompts that he uses in production. He also shows us his Cursor development flow where he actually has the language model write the test first and then write the actual code. And finally, he walks us through distilling performance from large language models to small language models without losing performance.
00:01:23 Greg: Let's jump into it and let's see what wisdom our friend Sully has to share. The reason why we're doing this interview here is because I see all the cool stuff you're sharing on Twitter. And I'm like, this guy clearly has not only, like, a checklist learned, ability to manipulate these models, but I could tell you feel them. Like, you really feel how these things are actually going in the personalities and the nuances. And so I wanna dig in dig into that today.
00:01:48 Sully: Yeah. Well, thank you. And and I think it just comes from playing with these things every day, day in, day out, and using them and pushing them to their limit. And, like, as cliche as it is, it's just, like, sometimes you just gotta use them to vibe with them. You know? Like like right? So
00:02:04 Greg: Yeah. Yeah. Yeah. It's so true. Well, I tell you what. I wanna start off with one framework that I saw you document recently, which was your 3 tier model of language models. So tier 1 through tier 3. So could you tell me, like, starting at tier 3, what are those and how do you work your way up?
00:02:20 Sully: Yeah. So that's a that's a framework that I I mean, I don't even know if you wanna call it a framework, but it's I like to categorize it and it's, like, based on intelligence and price, which is correlated. Right? Like, the less intelligent models are gonna be your tier 3 models, and then your more expensive, slower ones are gonna be your more intelligent models. So the reason I I thought of it in 3 tiers was because of the application purposes. So the way that you use something like, let's say, o1, so that would be like a tier 1, is different than the way you would use something like Gemini Flash, which is tier 3.
00:02:53 Sully: And that's because they all provide different purposes. One is super cheap, super fast. The other one's, like, really smart and really slow. So I I broke it down to those 3 tiers. And the 3rd tier is basically what I like to call just like the, you know, the workhorse, the the ones that you're just constantly using 24/7. And within that category, I think there is 3 main models, but it's kinda come down to 2 for me personally. So the first one is the one that I think people are probably more familiar with, which is GPT-4o mini. Now that model is actually like, I I really, really like it because it lets you use AI in a way that previously you couldn't.
00:03:35 Sully: Like, if you were to go back, let's say, 6 months ago when we had no cheap models, you had, let's say, GPT-4 and maybe even Claude 3.5. There was a lot of scenarios where you couldn't just be, like, throwing that at, like, random problems. Like, you couldn't just be like, hey. I have this, you know, 20 page document. I want you to go paragraph by paragraph and, like, extract the details because, realistically, like, you know, you're gonna be paying a lot of money. So when that model came out, it actually opened up a lot of, like, things that you could do. So that was the the first one, GPT-4o mini.
00:04:08 Sully: And then the other one that I'm starting to really like is Flash. So Gemini Flash is actually half the price of GPT-4o mini. And those are the tier 3 because, like I said, they they give you a lot of optionality in the different things that you could do that you couldn't do before. It lets you use AI in basically every nook and cranny of your day to day. Right? Whether you're coding and you wanted to look at, like, you know, 50 different files to summarize to help another model. For example, if you wanted to take a podcast and, you know, look at, you know, when did someone say a specific word in that podcast.
00:04:43 Sully: Right? You're not gonna go to a bigger model. So that was that's what I call the tier 3. And then the second tier that I have is sort of like the the middle, obviously, is the middle tier. And this is where I like to slot in the actual GPT-4, Claude 3.5, Gemini Pro. This is where I think the majority of people use these models and and kinda get the maximum usage out of them. And then the last tier is obviously like the o1, o1-preview, and then what I like to classify as thinking models.
00:05:09 Greg: Yeah. That's so cool. So I wanna dig in more into the use case side. So which use case tasks are you doing tier 2 with? And then I know that o1 and tier 1 is gonna be it's not just, oh, I need it smarter. It's almost like a different type of task you're gonna ask it to do. So how do you differentiate between those 2?
00:05:25 Sully: Right. So the way that I like to differentiate is I like like I pair them. So I will use o1, and I do use this in my day to day. It's like, I'll go to ChatGPT. And if you just go and say, hey. Like, to o1, can you do this task for me? One, it's gonna take a little bit of time. You're probably gonna hit some rate limits because it's highly limited. And, realistically, you're not going to use the model the way that I think it was intended somewhat to be used. So if you say like, hey. How's it going? Like, okay. Sure. You could use it like that. But, realistically, you're better off using, you know, the the tier 2. So how I use the tier 2 is actually the most I use it the most.
00:06:01 Sully: Obviously, everyone uses it for coding, whether it's Claude 3.5, GPT-4. Using it for, like, function calling or tool calling. Like, it it is obviously, like, a good balance between intelligence and price. Mhmm. And and that's kinda, like, what I use it the most, whether I'm writing, whether I'm asking it to, like, hey. Help me edit an email or things like that. I'm using those, like, middle tier ones. Now how I actually use that in tandem with o1 is I'll sort of one of the use cases I have is I'll come to ChatGPT or or Claude, and I'll sit there and I'll just create a giant conversation about a specific topic.
00:06:37 Sully: So let's say, for example, you know, I'm deep diving into a research topic, and I wanna learn more about it. Now I'm not gonna actually go straight into o1 because I feel like, one, it's a bit slow. What I'll what I'll do is I'll start the topic with GPT-4 or Claude, and I'll, like, add files because, obviously, I think right now, o1 doesn't support, like, files and web search. So there's a lot of capabilities that o1 doesn't support. And what I like to call this is the context building. So I will just go and build as much context in this chat as I possibly can or or it could be, you know, in any platform. And and I'll sit there and iterate.
00:07:10 Sully: I'll actually use voice mode as well to sort of give context just a lot quicker, and and that's another workflow. And as soon as I have, like, you know, let's say, like, 2 to 3 pages worth of documents, I'll actually take that and paste it into a chat with o1 or o1-preview. And I'll say, hey. You know, do this gigantic task for me. So for example, I'll I'll give you one thing Uh-huh. To use it for is, like, I was using it to generate use cases for my product. And I was like, okay. I want to generate use cases, and I want to understand, you know, what are some potential customer segments and ICPs. It's just like a pretty technical question.
00:07:46 Sully: And if I were to just go to o1 and ask it that, it would have no context. It doesn't know what my product is. It has no clue what my product does, who my customers are. And if I were to sit there and chat with it, well, I'm gonna hit that limit. But if I go to Claude or ChatGPT, I can upload documents. I can create this basically a PDF and copy paste it into o1. And then I can say, generate me, you know, personas, ICPs. It does a lot better. So that's sort of the the workflow and use case that I have currently running with, like, the the tier 2 and the tier 1 models.
00:08:14 Greg: Yeah. Yeah. Yeah. One of the ways that I found o1 works for me really well is around actually deduplication. So if I have a long list of items, say I've processed 5 different chunks with the same type of workflow for each chunk, well, I'm gonna have a list of duplicated items. I give that whole thing to o1. It's actually really good at deduplicating, and then I'll use one of the tier 2 models to do the structured output after that since o1 doesn't yet support structured output and go from there.
00:08:37 Sully: Yeah. That actually, that's a good one. That's another thing I do as well is I'll take o1 and have it give me, like, a long verbose output and then take that and turn it into structured datasets with the the tier 2. And even sometimes, you could even get away with using it with a tier 3 because it's you don't even need to worry about the output. You're just like, hey. I want this nicely formatted in in whatever shape.
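A minimal sketch of that two-step hand-off, assuming the OpenAI Node SDK; the model names, prompts, and the dedupeAndFormat helper are illustrative, not code from the show.

```ts
// Sketch: a reasoning model does the deduplication in free-form prose, then a
// cheaper model only turns that verbose answer into strict JSON.
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function dedupeAndFormat(chunkResults: string[][]) {
  const merged = chunkResults.flat().join("\n");

  // Step 1: the tier-1 model merges duplicates and near-duplicates.
  const deduped = await openai.chat.completions.create({
    model: "o1-preview", // illustrative tier-1 "thinking" model
    messages: [
      {
        role: "user",
        content:
          "These items were extracted from separate chunks of the same document, " +
          "so many of them overlap. Merge duplicates and near-duplicates, keeping " +
          `the clearest phrasing:\n\n${merged}`,
      },
    ],
  });

  // Step 2: a tier-2/3 model just reformats; no reasoning required here.
  const formatted = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    response_format: { type: "json_object" },
    messages: [
      {
        role: "user",
        content: `Format this list as JSON of the shape {"items": ["..."]}:\n\n${deduped.choices[0].message.content}`,
      },
    ],
  });
  return JSON.parse(formatted.choices[0].message.content ?? "{}");
}
```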
00:08:57 Greg: Yeah. Yeah. Yeah. For sure. So it sounds like you're using different models across different providers too for different use cases, or do you stick with 1 all the time?
00:09:06 Sully: Yes. So we use a lot of different providers, and that's because what we've seen with our internal evals is that they're all so nuanced and different in, like, a variety of different ways. So, obviously, the big one, Gemini, multimodal right off the bat. Like, anything to do with videos or audios, I'll go, you know, dive in straight into that and and kind of use Gemini. But you also start to see where they lack. So for example, a really interesting one is Gemini models are really good at needle in the haystack. And so if you say, hey. I want you to find 1 or 2 pieces of information in this, you know, giant long piece of text or video, it's actually really good.
00:09:51 Sully: But then I started to notice that something like GPT-4o mini is a little bit better at reasoning over that. So if I give it a long piece of context and I say, hey. I want you to sort of understand the context of it. I found that GPT-4o mini is a little bit better. So you start to see where one model does better than the other model in a specific area. So, like, another example is Claude 3.5 and GPT-4o. Now Claude is obviously everyone loves the model. It's a really good model, but one thing it's absolutely horrible at is tool use with structured outputs. And you'll start to see this if you use a very complex tool, like, I want you to create a very deep, like, nested JSON, a very, you know, long structured output.
00:10:35 Sully: Like, a very large amount of the time, it fails, and it gives you XML, and it just breaks all your parsers. Whereas GPT-4o mini does a lot better job. But then the caveat is that GPT-4o mini is not as good at actually, like, thinking through the problem and acting as an assistant. So there's always these, like, tiny trade offs that you don't really, like, notice. One of the things that we did was we set up a like, one of the use cases to get around that was we set up Claude and GPT-4o mini to work together where the tool use for Claude would be to call GPT-4o mini. And we basically built a system where Claude could orchestrate GPT-4o mini to create the structured output.
00:11:17 Sully: So it would say, please do this. So the user would say, I want this task. All Claude would do was relay that information to GPT-4o mini. GPT-4o mini creates the structured output, and then that gets returned. So that was like another use case of, like, how we mix and match so many models across different use cases.
00:11:34 Greg: Yeah. Isn't it wild how all these little mini vibe tricks we have to kind of like hack together in the early days of LLMs here? And then I think back to how far we've already come. Like, because even like, you know, like January '23, we're dealing with, like, 4,000 token context limits and GPT-3.5. And all the hacks that we had then, we've upgraded from them now, but we still have a bunch of hacks like the ones you're talking about. And so it just makes me think we're never gonna get rid of the hacks, and they're always gonna be there for for a long time.
00:12:01 Sully: I would say so too because, yeah, like, you're right. It's funny looking back at it. The hacks that you used in 2023 were so different. You were hacking around context window, and now you're hacking around, well, tool use, which didn't even exist a year ago, right, or, like, you know, a year and a half ago. So I I agree with you that we're always going to be min maxing. As a user of multiple models, you're gonna be min maxing, trying to figure out for your use case, for your product, for your company, where can I, you know, match these together so that I get the best possible outcome for my users? And I I know a lot of people have, and I'm curious what you think.
00:12:35 Sully: A lot of people have spoken about, like, model routers and how Mhmm. You know, at the end of the day, like, a model is just gonna pick it. But my my personal opinion is I I think that it's gonna cause a lot of unintended side like, you know, side effects. But I'm curious what you think on, like, this whole idea of, like, model routing because, you know, we're talking what we're basically doing we're internally with code model routing, but I'm I'm curious what you think.
00:12:56 Greg: So whenever I get asked a question like this, I think, is there any behavior in practice that tells me what the prediction should be? And you just described, basically, you're doing model routing on your own, like, in and and in and of itself. So that tells me, yes, model routing will be a thing. And I do still think that fine tuning models and having bespoke small models is still too much overhead. Like, it's really hard to do that and manage them and do them all right now. All that is gonna get so much easier. So I would imagine that not only will we have model routing for task specific things against like some of the big ones where you have Vibe based fields with regards to structured output or tool use or whatever it may be.
00:13:32 Greg: But then also for task specific things, I will absolutely do model routing. So I'm a fan. I think it's hard. I think it will be the future. We're not quite there yet, though. That's for sure.
00:13:43 Sully: Got you. Yeah. Like, my my my sentiment there was that there and I it could be just because the models are just where we're at right now. What I've noticed is and I'm sure you've seen the same is where you'll get to an AI product. You'll get it to 90%, even 95%. But that last 5, 10% is nearly impossible, I find. Like, even you can run all the evals you want. You can run all the benchmarks. Getting that last 10%, and I my thought process there is that if you have the model sort of choosing other models, that adds to the variance so it causes a lot more potential issues. Like, you know, that that's kinda where my thinking is.
00:14:25 Sully: And that could just be because, like, we're early. Like, realistically, we're so early. Models have, you know, multiple generations to get better. So that was my thought was that maybe in the future, but right now, probably not because it's it's so hard to get a product in specifically, like, LLMs into production where you're handling every potential edge case in a manner that gives you as high of an accuracy as you can and adding models that you might not have an eval for could give you an output that you didn't expect.
00:14:57 Greg: Yeah. Yeah. Totally. Well, I tell you what. One of the other interesting things that came up during research was your opinion on what is kind of becoming known as model distillation. So you have a really, really good model. You perfect the output from there, but then you realize, wow. I can actually come up with a little bit of a better prompt here and give it to a smaller model so that it's faster and it's cheaper. So can you talk me or walk me through how do you think about model distillation
00:15:22 Sully: in your own workflow? Yeah. So that's something I think about a lot, and it's one of those things where you need to be very careful because it's very it's very powerful, but you have to be very careful because it requires a lot of work. And the reason it needs a lot of work is because you need to have a a good data pipeline and understand what you're distilling. So one of the things and mistakes I made previously with the product was that we went we had GPT-4o. And this was actually before GPT-4o. It was GPT-4 Turbo. And we used it, and it was slow. And we're like, hey. Let's distill that to 3.5. OpenAI has a has a really nice way to do it, so we did that.
00:16:01 Sully: And then the problem was that we didn't have good enough evals. We didn't have a good enough dataset. So as the potential you know, the various areas grew that people could use the product, we would notice, okay. We have to revert back to GPT-4 because 3.5 was, at that time, not good enough. Now where I do see distillation in our workflow is when you have a defined eval set, you have, like, all your benchmarks, and you have a very good data pipeline where you can say, okay. In this 500 example set, I'm using Claude 3.5 Sonnet or your, you know, o1, for example. I have my dataset, and you can use a bunch of different there's a lot of different companies that provide you with, like, ways to manage your prompts and evals, whether it's Braintrust or LangSmith.
00:16:46 Sully: And then you can very accurately detect and determine the accuracy of the distilled model, then 10 out of 10 times I would use it. Yes. And the ease and it's it's actually really easy. Like, to actually distill the model down, it's like it's like an it's a single API call. The challenging part is making sure that you don't regress your product when you do the distillation. But I I think it's one of those things that it's gonna become more and more apparent as the tooling around distillation becomes, like, better. I know there's a couple companies working on it. Like, OpenPipe is one of them. Mhmm. And I know OpenAI straight up offers you that.
00:17:24 Sully: So I think as the tooling gets better, you're gonna see this pattern in production of companies launching with the biggest, best model. They collect a bunch of data. They have a good eval set and engineering team to support that. Then they go and they distill it to, whether, you know, GPT-4o mini or an open source model.
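To make that concrete, here is a minimal sketch of what the distill step itself can look like, assuming the OpenAI Node SDK and its fine-tuning endpoints; the LoggedExample shape, file name, and model snapshot are illustrative, and the eval pipeline Sully stresses is only hinted at in a comment.

```ts
// Sketch: turn logged big-model traffic into fine-tuning data for a smaller model.
// Assumes prompts and outputs were already collected and reviewed elsewhere.
import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

type LoggedExample = { systemPrompt: string; userInput: string; bigModelOutput: string };

async function distill(examples: LoggedExample[]): Promise<string> {
  // 1. Write the logged examples in the chat fine-tuning JSONL format.
  const jsonl = examples
    .map((e) =>
      JSON.stringify({
        messages: [
          { role: "system", content: e.systemPrompt },
          { role: "user", content: e.userInput },
          { role: "assistant", content: e.bigModelOutput },
        ],
      })
    )
    .join("\n");
  fs.writeFileSync("distill-train.jsonl", jsonl);

  // 2. Upload the file and kick off the fine-tune; this part really is "a single API call".
  const file = await openai.files.create({
    file: fs.createReadStream("distill-train.jsonl"),
    purpose: "fine-tune",
  });
  const job = await openai.fineTuning.jobs.create({
    training_file: file.id,
    model: "gpt-4o-mini-2024-07-18", // illustrative target snapshot
  });

  // 3. The hard part isn't shown: run the resulting model against your eval set
  //    before shipping, so the distilled model doesn't silently regress the product.
  return job.id;
}
```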
00:17:41 Greg: Yeah. That's beautiful. My favorite line with that is the whole make it work, make it right, make it fast. And so it's like, look, you're gonna use the biggest one to start us off, but then you're gonna make it fast eventually and go from there. This is awesome. I tell you what, though. So I know you're a practical person. I would love to jump into, like, you actually showing us some of the ways that you use these tools. And I think a really cool starting off point would be I know that you're a fan of prompt optimizers or, like, meta prompt writing. And so Yes. Because you had a you had a tweet and literally said, pretty good chance you won't be prompting from scratch in 2 to 3 months.
00:18:16 Greg: So I would love to see the way you kind of prompt engineer your way from, like, an idea to, like, I'm gonna go use this thing.
00:18:24 Sully: Okay. Yeah. Hopefully, my prediction ages well because I feel like it's been a month since I said that, and I don't know if we're 2 to 3 months away from it. But Yeah. Yeah. Yeah. Okay. Let me yeah. I so just to add some context, I do a lot of this sort of meta prompting where I'll come in with a problem.
00:18:39 Greg: What is what is meta prompting? Let's start there.
00:18:42 Sully: You come in with a general idea of what you're trying to do. You have a problem that you're trying to solve. Like, realistically, if you're coming in, you don't know what problem you have that you're trying to solve with AI. It's it's sort of useless. So an example would be the other day, I was trying to get one of the models to write like me, which to to this day, I I cannot for whatever reason. Yeah. Yeah. Yeah. I was like I came into it and I came into chat GPT and I had all my examples. And I was like, okay. What do I write? And I normally, I would write something like you know, you you write like a basic prompt structure, and the reality is that prompt's probably not that good.
00:19:17 Sully: So what meta prompting or what I like to think about this work this idea is that you come in with an idea. Hey. I want to have an AI write like me. I have examples. And then I just give that to o1 or Claude, and I say, please create the prompt for me. And that's sort of what I like to think of like this. I come in with a a rough idea of what I'm trying to do. I don't really know specifically how to optimize it. I'll go to these models and say, hey. Like, actually, give me this prompt structure, and it does a pretty good job. So that's kind of the the rough idea of how it works. But let's That's Let me should we just hop into, like
00:19:48 Greg: Yeah. I would love to jump into it if you could share your screen. And then are you using just a regular chat interface, or are you going to Anthropic's Workbench and doing their prompt optimizer?
00:19:57 Sully: I I just use the chat interface because Cool. I feel like the prompt I mean, people some people do use it. I and I think you can start with it. But I just find it easier because I can iterate a lot better. I can say, hey. Start like this and and do that. So let's actually do it. But I I wanna start and say, do you have some sort of task that, like we should we start we should start with, like, a rough idea. Because I like, do you have any like, what what's the task we could demo?
00:20:22 Greg: Let's do a straightforward one. Let's do I I guess I'll give you a few options. You tell me what you think is best. We could do the classification 1, which is very standard. Hey. I have some data sources, or can you please label them for me? We could do either, like, unstructured to structured extraction, so, like, extracting insights from a piece of text, or we could do idea generation. That's always a fun one too.
00:20:45 Sully: Okay. Let's do the let's do the extracting text one, and I think that's a good one. So let's say we I like to always preface it with, like, the problem or what we're trying to do. So, again, what I like to come into is, like, alright. I have a problem. I'm trying to do a specific task. And usually, this is like my blank slate starting point. So let's say the task that I'm trying to do is I have a large piece of text, and I want to, you know, turn that piece of text into something else, some sort of structured output. And it's it's funny because a lot of people say, like, oh, is it complicated? It's really like I just come to ChatGPT or or Claude, and I basically say that.
00:21:21 Sully: So the way that I go is I'll say, you know, you could use Claude or or ChatGPT. I haven't found which one is really better. Again, I'll and then this is kind of going back to my original workflow Sure. Is what I'll do is I'll actually start with GPT-4 or Claude, and I'll get, like, a rough idea for a prompt. And I'll copy that, and I'll give it to o1. And then I'll start to compare across all three to see which one, like, makes the most sense. So let's say, for example, in this one, I am grabbing transcripts from podcasts, and I want to know, like you know, I I want a nice, like, structured output for Mhmm. All of the key exciting moments.
00:21:57 Sully: Let's say that that's, like, the problem space. So now you could come in and you could create a prompt and say, okay. Given this video, I want you to do this. Or I come to Claude and say, look. Like and actually, the other workflow that I I wish I could demo is I use voice a lot. So I don't know if if you use voice a lot, but I've noticed that with voice here
00:22:17 Greg: I don't use it a ton. Yeah. It hasn't entered my workflow yet, but I'm I'm voice curious. So I I wanna try it now.
00:22:22 Sully: See this. Let's see this. Okay. So I have I have something here. I wanna show you the whole workflow that I use so that I so Nice. Let's see here.
00:22:31 Greg: And let me know if you need a transcript. I have one handy for us.
00:22:34 Sully: Actually, yeah. Could you could you toss me it there? No. And then I will use it. I will copy paste it. So okay. Let me know when you have that transcript, and then let me see if it's okay.
00:22:43 Greg: I'll plug this is MFM Vault, a website I put together that does insight extraction from My First Million. There we go.
00:22:49 Sully: Okay. Cool. So let's say our goal is to extract insights. Now my workflow is I have a tool that transcribes this, so I think it works. So let's say I'll just exactly show you how to do it. Okay. Hey. I need a bit of help creating a prompt for a use case. So what we're doing right now is taking podcast transcripts and trying to extract all of the key moments slash key insights. So I need you to create a a nice prompt that will, you know, help us do that, and I'll I'll give I'm gonna put in the prompt as well later on the actual transcript, but I need you to create the prompt slash system prompt. So boom. So that's that's actually sort of how I do it.
00:23:27 Sully: I it's Nice. There's no real science to it. And I and I'll sit there, and I kinda like here, and I'll copy this. And I'll actually do this. So I'll go into chat GPT. I'll paste it. And I'll actually also place it into Claude. And it's gonna go, and it's gonna give me, like, a starting point. And so right off the bat, like, if you're maybe not as good at prompting or you're new to prompting, like, you can read this. Like, obviously, if you're more experienced and you kinda know, like, what you're doing, these kind of prompts are, like, pretty obvious. But for a lot of people, they'll come in and and be like, okay.
00:23:59 Sully: Cool. I have a a good starting point. So then I'll look at it and say, k. The following is a podcast transcript. Identify some so and I'll compare it to here. So right off the bat, I don't know which one you think is better, but I'm looking at this, and I like the Claude output better. Beautiful. A little bit more what's it called? Clear direction. So I'll actually copy this, and I'll be like, okay. We have a rough outline. I liked the first pass. I liked the one from Claude. I'll take that, and I'll go back to ChatGPT, and I'll open up a new tab. And then I'll say, let's go to o1-preview. So then I'll actually do the same thing.
00:24:39 Sully: I'll say and I'll actually give it more context. So I'll say something along the lines of and, again, I'll I'll go back to the voice mode here. I'll say, hey. You're gonna help me optimize a prompt. So I already got another AI model to give me a rough idea for this prompt. I want you to look at it and tell me if there's any areas in the prompt that we could improve. So I'll give you the prompt, and I'll actually give you the prompt that I gave to the AI that generated this prompt. So it's gonna go, and then I'm gonna go like this. So this is sort of here, you know, original prompt to AI. Mhmm. Paste that in a sec.
00:25:15 Greg: It's amazing just how you speak to it just like a human. Like, it's not complicated. It's literally just being clear in your directions.
00:25:23 Sully: It's something that I recently started to do, and I think it's a very a lot of people talk to the AI as if it's not a human, but they perform the best when you just speak to it naturally. And I found that voice is the best modality to do that in because it's very hard to sound robotic when you're talking to, like, the the chat. It's like you have to just talk naturally. And then I found that it's it's also a lot faster. Like, if I were to sit here and type that, it would take me a lot. So here, I'll go here. I'll I'll paste this original prompt. You know? And then I'll say, okay. Cool. So I like that one. And now this is the second pass.
00:26:02 Sully: And now this is where, again, kinda going back to the workflow that I use, right, is I'll come in here and iterate with voice on this specific subset of a problem, which is generating this kind of, like, mega prompt. We sat there with GPT-4o. We sat there with Claude, iterated a bit. And then I'm I'm like, okay. I have a rough idea. This prompt looks somewhat good, and then I'll come back to o1-preview. And I'll say, okay. Cool. I want you to optimize this. And I haven't found like, I don't have a real scientific method to which one is best because I just kind of sit here. And and this is kinda where I have, like, a good first generation of the prompt.
00:26:36 Sully: Realistically, I'll put this into production. I'll write a couple of, like, you know, evals. I'll say, okay. How does this actually perform? And then kinda iterate back. But this is sort of my starting point. So we'll let this go. Okay. So here and then it gives me some things. Can you please generate the new prompt now? Alright. Cool. It gives me the revised prompt.
00:27:05 Greg: So it did take a minute before you finally got the answer.
00:27:09 Sully: Yeah. And and sort of you can see here and you can always say here this is just for the sake of this. And now what I'll do is I will take this, and then I will actually go to and this is my full workflow. We can use any model, but let's say we're gonna use you have a preference of which model you wanna test out the actual transcription? We can actually do
00:27:31 Greg: I'd love to hear which one you think and why, and let's just test it out.
00:27:35 Sully: Let's let's test it out. So now we go to Studio. So and you see what I mean? It's like there's all these different models. I'll go to Studio, which is Gemini. Now we're gonna go to Gemini, which I found specifically Gemini Pro better at these sorts of tasks. And now I'm here with Gemini Pro, which I'm gonna take and grab the prompt that I crafted with o1, put it into the system prompt of, what's it called, Gemini Pro, paste in the the transcript, and we'll see how it goes.
00:28:07 Greg: Alright. Beautiful. Yeah. That sounds great.
00:28:11 Sully: Alright. Just copy this here. Okay.
00:28:15 Greg: This is how the sausage is made.
00:28:18 Sully: Yeah. It's it's this is how I like to think of, like, the first generation of a prompt where I'm not really sure where I'm starting off with. Obviously, like, is this something that I would use in production? Probably not because you wanna test it out and and have a lot of back and forth. But okay. Cool. Can I is there a way to copy paste the transcript?
00:28:36 Greg: You're just gonna have to select all down at the bottom there. That would be nice to just copy the transcript. Actually, I think I might add that feature in there.
00:28:45 Sully: Yeah. So let me see if I can just copy this. Alright. Cool. Now we go grab this. K. Let me and then I'll obviously, like, do a second pass to make sure that this actually makes sense. Key moments. Yeah. This looks pretty good. Time stamp, 3 to make takeaways, extract one sentence, discussion themes, theme name. Yeah. Like okay. Cool. So here, I'll paste this in, and we'll let it we'll let it run here. So I'm using Gemini Pro. Alright. 17,000 tokens. And and for for people who are curious, like Gemini Pro, I I talked about this recently is that a lot of models can't actually reason over a large context. Like but for something like Gemini Pro, anything under a 100 k tokens, it's it's pretty good at, like, being able to synthesize a relatively intelligent answer.
00:29:44 Sully: So here okay. Cool.
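For reference, that last step of the demo might look roughly like this in code, assuming the @google/generative-ai Node SDK; the model name and the extractKeyMoments helper are illustrative stand-ins for the AI Studio setup shown on screen.

```ts
// Sketch: run the o1-crafted prompt as a system instruction over the whole transcript.
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GOOGLE_API_KEY!);

async function extractKeyMoments(craftedPrompt: string, transcript: string) {
  const model = genAI.getGenerativeModel({
    model: "gemini-1.5-pro", // illustrative; use whichever Gemini Pro variant you're on
    systemInstruction: craftedPrompt, // the prompt produced by the meta-prompting passes
  });

  // Under roughly 100k tokens, Gemini Pro can still reason over the whole thing at once.
  const result = await model.generateContent(transcript);
  return result.response.text();
}
```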
00:29:47 Greg: That's really cool.
00:29:50 Sully: And how, yeah, key moments, how you leverage CrossFit. I I'm actually curious to just, like, see how it just would do against, like, you know, other benchmarks because we don't really know if this is a good output or not, and that's the the whole point of evals. But there you go. You have how I went from an idea to generating, like, a full, I guess, air quotes here, optimized prompt. And the reason for that is just like, for me to sit here and write this probably would have taken, like, an hour, hour and a half maybe, like, give or take depending on how good you are. But, you know, we just did it live in whatever 10 minutes.
00:30:27 Sully: So
00:30:28 Greg: Yeah. That's super super cool. I love that. So then out of curiosity, what are you using for prompt management? So I saw a a tweet by the CEO of PromptLayer, Jared, and he's like, yeah. I see everybody that go through the same they go through the same world. 1st, their prompts are just hard coded in their code. And then second, their prompts are hard coded in text files, but they're still in their code base. And then third, you actually go to a prompt manager. What what are you using for prompt management?
00:30:54 Sully: So for that's an interesting one. We obviously, we use GitHub for our our our prompts. Yeah. So we use a lot of a couple of different things. And maybe maybe we're not, like, we're not prompt managing correctly, but Uh-huh. We just have our prompts that we store in LangSmith and Mhmm. Sort of I'll just have datasets, and I'll compare that prompt to that dataset. So for example, we have a giant dataset of, like, a 1000 examples that I I run our test against different models, different prompts, and that prompt is just, like, stored, you know, in in the dataset. And then whenever I wanna change the prompt, I'll actually change it and and duplicate the dataset, paste in the new prompt, and that's, like, my version control, so to speak.
00:31:41 Sully: So the actual prompt stays in my code base with the latest version of, like, this is the the source of truth. And all previous other versions are different datasets where I can see how they perform. So for example, if I wanna go back to a prompt that was, like, you know, let's say from a week ago, I just look at the dataset that was from a week ago and I can see the prompt is there, and I can also see how it performed. So that's how I manage, like, and version it. I'm not sure if it's the right approach, but that's how I do it.
00:32:08 Greg: Sure. So in your code, is the prompt that's being called, is it actually in your code, or are you calling out to Langhub every single time?
00:32:16 Sully: It's in the code. So the the code our code, it's in GitHub. And the nice part is because it's just all version control. Like, I could look at the Git history and I can actually see, okay, this person changed this line as well, which is nice. So I have the line by line version of control from Git. And then if I wanna see the full prompt, I can look back at, like, you know, the the data management tool.
00:32:38 Greg: Yeah. That's very cool. I tell you what, I had one more demo on here that I was like, this would be so cool if Sully would show us how he uses this. It's a Cursor one actually. So I saw that you tweeted. You you said, well, I actually have the LLM write the test first, then the code. It helps a ton, which that's a framework I don't see too many people doing. Of course, there's test driven development, but, like, not in practice, not usually. I'm not seeing a lot of people do that. Could you walk us through, like, how do you write that test first and then how do you ask it to write code right after that?
00:33:08 Sully: Yeah. Okay. This is one that I the reason I started to do was because the problem I was facing, the model just kept messing up. Like, every single time, it was within our code base, and I was like, this is this is a waste of my time. The model can't figure it out. How about I just get it to generate the test first? And then if the test works, then I can maybe look at the code and say where the issues are because models guess what? If a test fails, you can grab the error output, give it back to the model, and say, hey. Like, please decipher that. So let let's actually see if I can, like, I can spin up
00:33:41 Greg: Like a little mini project or something. Or
00:33:43 Sully: Yeah. Yeah. Let's see here. If I can spin up something new.
00:33:46 Greg: I actually think this is really cool. And this is like something like, really, truly, not enough people are doing this and if it legit helps you write better code because it makes sense. You have the test that's supposed to run successfully, and it can use that as instructions, and it can use that to, like, test to make sure it's actually working.
00:34:01 Sully: I'm surprised not a lot of people not more people are doing this where it's like right? That's like it's just a lot easier for the LLM to, like, do that, and then your code is safe. I guess, like, you know, less spaghetti because you're not you don't you're not worried about, you know, if something changes, like, the model like, you start with the test, and it's really easy for the model to generate it. Okay. So that took a little time. I got, say, Cursor here. So this is just a super quick let me just grab the screen here. Super quick here. So I have this, you know, super basic thing. We can just terminal.
00:34:35 Sully: We can run it, and I can go, you know, bun index. That's yes. Hello, world. Now I actually like to start with Cursor, and I'll just say something along the lines of, like, literally and, again, I actually don't know how to write tests in Bun, so I can just go to Cursor. I open up command I, and those who know this is like the composer. It lets you coordinate and create files. So I'm gonna say, you know, I'm using Bun for now. Create a test file for a method, and then make the method that, let's say, for now, reverses a string. Super simple. And, oh, I guess I'm at a slow request, unfortunately. K.
00:35:19 Greg: Wow.
00:35:19 Sully: So it what it'll first do is it'll create the test. Right? And this is obviously a really simple example. And so here, I'll I'm happy with this. Alright. I'll I'll just accept this. Sure. And now right off the bat, like, there's, you know, how many, whatever, 5 tests here. So, obviously, I have the actual function. So here in this example, it's just reversing a string. Now the nice part is I can go here. I can say, you know, bun I guess it's reverse.test.ts. And I can, again, debug with composer. This is a nice part. I can just go debug with AI. Gotta up I gotta up
00:36:00 Greg: my press.
00:36:02 Sully: Man, I'm I'm out of the free that that's how much I use cursor. I I just Yeah. Always blow to the device. But okay. Cool. So here, it, like, you know, passes the test. But let's actually say that, like, we are using something a little bit more complicated than reversing a string. Now I can go into here, and I can say let's just not reverse it. Let's just say, like let's just break the code. Let's just say here we'll split it like this. Okay. Return dot. K. So now if I go here, I go test, I go test files, so all these tests fail. Right? Now, obviously, it's like a pretty simple example. And it's almost as simple as just clicking this button that says add to composer, and then I say, you know, please fix the reverse method due to errors.
00:36:54 Sully: And now the nice part is here, Cursor will pull in that terminal that'll throw, you know, the errors where it happened. And what Cursor will do is they'll look at that, and they'll say, hey. Look. I see what the issue is, and it'll just fix it. So this is kinda what I like to call it. Like, I I don't actually have a name for it yet. Maybe LLM test driven development, whatever you wanna call it. But it's like you come in and you describe what you're trying to do here. The LLM writes the tests for it, and then it's gonna write the method. And then what you can do is have it run. And now if the method itself, like this function, which is reversing a string, is is complex or confusing, it will be able to sort of, like, essentially agentically, air quote here, fix itself, if that makes sense.
00:37:34 Sully: It'll test the code, see if it passes the tests. If not, it'll update the code and then sort of do that until it can, you know, pass the test. And all you have to do is make sure that the tests you're writing are correct. And I and I use this a lot for obviously, for simple functions, it's not that useful. But when you have code that is across a couple different files, you know, in a in a modern code base, it's not just a single function. It's like you have, like, you know, a bunch of different files and and stuff connecting. And ones that require a lot of, like, conditionals or, like, they're not as simple as this.
00:38:09 Sully: It's like that's where I found that whenever I would try to get, like, Cursor or sorry. I'll get, like, Sonnet to one shot it, it would fail every single time. But then the second that I was like, okay. Please let's write the test for it, and then I would sit there and kinda help it with the test. It was able to debug itself a lot better and go through these, like, bigger, maybe meatier functions that normally wouldn't be be able to even, like, o1 and o1-mini couldn't solve, but the second that I would apply this, like, test driven development, whatever you wanna call it, the model is able to look at the output, see where it messes up, adjust the code, and kind of iterate on itself like that.
00:38:41 Greg: That's cool. So not only does this test first mindset it's kinda like a prompt engineering technique. It's almost like think out loud, but it's almost like like write the goal first and then tell me what you think we should do for it. But you also get tests out the other end, and so you get a little bit of extra utility as a byproduct.
00:38:59 Sully: Exactly. It's it's a it's a win win. You get a little bit of both. And to me, that was the one thing I never understood why people haven't done more of because you would think, well, if it passed all the tests, the the code is, like you know, you're happy that it passed the test, but it's something that I haven't seen a lot of people do.
00:39:14 Greg: Yeah. Yeah. For sure. Well, that's awesome. Well, that's fabulous. Thank you for showing me the cursor example. One of the questions I love asking is I wanna know what the smart people are talking about right now, like an AI. So, like, as you observe on Twitter in your circles, what are the smart people talking about?
00:39:31 Sully: That's a good question. Oh, man. I think what I see a lot of people talking about is sort of the you know, what's it called? Like, test time compute, like, o1 thinking. I see a lot of people talking about those. I see a lot of people talking about having those thinking models do more agentic sorts of tasks and basically bringing this what I like to think of as an agent as a for loop inside to the model thinking process, having and training the the model to just innately be able to call tools, like and we saw that. I think a good example of that is computer use, right, from Anthropic. Right? They they obviously fine tuned it on that.
00:40:12 Sully: So I see a lot of people talking about that. I do see what I started to notice is people starting to talk about whether we've hit some variation of a wall. I don't know if you've seen it too. And I'm hearing the little rumors that, you know, Claude 3.5 Opus is not up to par and, like, the the new Gemini model is not as good. So I I'm hearing that as well. And what else are people really talking about? And I think I think we spoke a lot about the other things, model distillation. And the other thing I've started to see more of is people being a little bit not I guess, talking more about evals. Like, I I think a lot of people didn't really talk about it, and people are saying, hey.
00:40:50 Sully: Like, from a product perspective, if you want your product to be good, you need to write evals, which are just a way of writing tests. So that's kind of what I've been seeing, and I don't know if you see anything different, but just from what I've heard from people talking.
00:41:01 Greg: Yeah. Let me think. Is there anything else I would add to that list? The one thing people aren't talking about, but I think it will be a big deal when it actually comes out, is the whole feature engineering weight manipulation like the Golden Gate Claude from Anthropic. Yeah. I'm still waiting for access to that because that is going to be an alternative to prompt engineering. And I have no idea, like, how easy it's gonna be to work with, what kind of results we're gonna get, but I'm excited to test that whenever it comes out.
00:41:28 Sully: Yeah. I I I remember seeing that, and I was, like, I was blown away, and I kinda forgot about it. So that that I'm actually interested to see if they ever will ever let you have that much interpretability with those models. Like, maybe there's, like, no. No. We're good. Sorry. We're shelving it. Like, you don't get to touch it. Right? But For sure. That would be really interesting.
00:41:44 Greg: Yeah. For sure. For sure. Awesome. Two more questions here. Last one. I love hearing about what is in people's toolkit. So I've seen you use Excalidraw on your YouTube videos. I've seen you use Replit. I've heard rumblings about v0. What else is in your toolkit that is in your kinda day to day workflows?
00:42:03 Sully: Okay. Okay. So there's a lot, I guess. Yeah. You got you got a couple v zero. Obviously, there's cursor. Excalidraw, I like it for drawing the little diagrams. The other one, I guess, that I use a lot is the playground from Anthropic and from OpenAI, which is, like, different than ChatGPT. I use that to iterate on prompts. I use this yeah. The the one that I use for transcribing the actual audio is called Whisperflow. It's the one where I, like, I have a hot key that I press, and it takes the voice and transcribes it into the inputs that you saw me use. The other tooling that I use I mean, we can go in do you wanna go into the technical side, or are we just gonna leave it at, like, the high level tools?
00:42:43 Greg: I let's let's not go, like I don't wanna know your entire tech stack, but, like, what is in, like, the cool AI stuff that, like, you're you're you're grabbing for?
00:42:52 Sully: I think that's pretty much it. I think I think you got it there. I I there's not many other tools that I honestly use. I I just I like a lot of it's yeah. Like, just writing the code. LangSmith is one actually. I will say that we we we use LangSmith a lot for evals. That's, like, the other one. But, yes, that's pretty much it from from me. I think you nailed it. V 0, cursor, Excalidraw, OBS if you're recording videos. Yeah.
00:43:18 Greg: Yeah. Yeah. For sure. Alright. Last question. And this is kind of off topic from the AI side, but I know people would be interested in it. So you've had a few bangers on Twitter, like, just some things that just absolutely pop. And as somebody who does a little bit of Twitter himself too, I can look at a tweet and be like, that person thought about it, and they did a really good job as to how they architected it and and constructed it. And I noticed that with yourself. So what hits on Twitter? And what would what's your advice for people who, like, wanna do better on it?
00:43:44 Sully: Oh, man. Okay. So Twitter is just this hilarious platform that the algorithm changes a lot. So it's you kinda gotta get a feel for what works and what doesn't. And luckily, the cost so if anyone's looking to grow, the cost to post on x slash Twitter is 0. Like, you don't pay anything. If it doesn't do well, no one cares. So it's the one platform where the cost is literally 0 because you're just typing. So type things away. How I craft a banger, it's like a mixture of what I see trending. So what I see, what people are talking about. And there's 2 ways to craft a banger. 1 is you have to be controversial. You are not gonna craft a banger if you're not controversial.
00:44:24 Sully: Now there's pros and cons is if you're posting that kind of stuff all the time, people will be like, hey. You're just posting clickbait. So you gotta be careful with it. You can't be like, this is insane and every single tweet starts with that. Like, no one and no one's gonna believe you. But saying something controversial. And the most important part of crafting a banger is your hook. It I can tell like, honestly, I will post something, and I can tell within 20 minutes if it's gonna be a banger or not. And it's basically how natural does it come. That's one. It's like, how natural did this thought come to me, and how well did I craft that hook?
00:44:58 Sully: Everything in between, like, you could you can kinda sit there in min max, but the the that's how I sit there. And sometimes I'll sit on something, and I'll be like, oh, man. Like, I just don't know the right way to say it. So I won't post it. But then it'll just come to me, and I'll be like, alright. I got this. I all the words, I'm using the right structure. It's like the the right timing, and and that's kinda what goes into crafting it. So the one piece of advice that I will give from my personal experience is don't spend too much time on a tweet. Because I unless you're doing it educational, there's there should be a a diagram where the more time you spend thinking about a tweet, the worse it does.
00:45:36 Sully: Because I swear, the majority of my bangers, I spend, like, 15 minutes thinking about. I'm like, alright. I'm just gonna post it. You know, grab a coffee. I come back, and it ends up blowing up. And I'll
00:45:45 Greg: And then all of a sudden, you see 1.4 million views.
00:45:49 Sully: Oh, man. I do you have time? I have to I have to I have to tell you the story of how this Yeah. I'll be sure. Do I have time for that?
00:45:55 Greg: Yeah. Yeah. Let's hear
00:45:56 Sully: it. Okay. So because it it's so relevant to the banger tweet. So my company, we we started, like, a year and a half ago. And right this is around the time that agents like, people were talking about them, but didn't have any clue. This was, let's say, March 2023. And at this time, I I was no one actually knew of my account. I literally had I had been posting tweets, and no one replied. You know, the classic zero views. You know, that's just what happens. Yeah. Yeah. And then and I remember I saw someone else post something about Auto-GPT. And I saw it, and I was like, oh, it looks pretty cool, but I ignored it. And then it came up again.
00:46:34 Sully: And I was like, no. I can't I cannot not ignore this. Like, this seems something very interesting. And I'd been building actually, like, AI projects, side projects before this. And I was like, you know what? Let me, like, try this thing out. And, obviously, I tried it. And back then, I was like, dude, this is insane. Agents, AI, it's gonna be crazy. So when I was like I just posted about it. And, like, I didn't post anything crazy. And I was like, oh, yeah. This thing is kinda cool. It's pretty crazy. And it, like, got, like I think that was the first post that got over a 1,000 likes. I was like, wait a minute.
00:47:03 Greg: Wow.
00:47:03 Sully: And then I was like, hold up. Hold a second. Then I saw this trend that people wanted to do something about, like, AI agents. And it's interestingly enough, I, like, thought back to an episode of MFM like, My First Million. It's getting so funny. And and I remember them talking about, like, there's sometimes you see, like, this opportunity. And I was like, dude, I gotta sit here and I gotta do 2 things. 1st, I gotta craft something. I gotta make a product that people wanna use, and I gotta figure out the right Twitter thread and narrative and story to craft to get people on it. So that weekend, I spent the whole weekend building v0 of Cognosys, which was like our previous product.
00:47:39 Sully: In the meantime, posting Twitter bangers and threads about how AI agents were going to change everyone's life. And every single post was getting, like, a million views. I'm not even exaggerating. Wow. And I was like, dude and I was like, okay. And all I would be posting, I was like it was kinda clickbaity. I was like, this is gonna change your life. And then getting, like, a million views, a million views, and I post the product. Like, I was like, hey. Like, here. I built this thing for you people to go and try because I know from what you've been telling me, you don't wanna go through GitHub. And I and I posted it, and it was literally built in, like, 3 days.
00:48:14 Sully: And within, like, 2 days, we got 50,000 users. So
00:48:18 Greg: Oh my goodness. That is so crazy. The
00:48:21 Sully: the craziest 2 weeks and the most stressful 2 weeks of my life, and it started all from how can I craft a banger tweet? So I I will say it with that. That was why it's so relevant and so funny. It just shows how powerful writing well and writing with the right timing and structure given what's happening can potentially, you know, help you start a company. So
00:48:43 Greg: And with that, that is an absolutely beautiful story to end on. Sully, thank you very much for joining us today.
00:48:49 Sully: Oh, dude. It it was a pleasure. I I enjoyed it. And hopefully, my workflow is applicable to other people. People can look at it and see that, like, hey, using AI is just not that hard. You just gotta talk to the computer, and it'll do stuff for you.