I’ve been working in tech since the Mosaic and Netscape days, and I can say without a doubt that large language models are the most magical, mysterious, and infuriating technology to emerge over my 20+ year career.
The magic of LLMs is clear: ChatGPT regularly does things that I didn’t think computers would be capable of in my lifetime. But the mystery follows quickly after; LLMs solve hugely complex problems but fail at tasks an eight-year-old would ace. Thus comes the infuriation: how do you make this tool do the right thing most of the time?
To me, this is the most interesting problem in generative AI. Solving it is the difference between gen AI being fundamentally transformative or a vague nice-to-have. We have to unlock consistently great AI results to make gen AI useful, and that’s our mission at Libretto.
It’s also why we raised $3.7 million in seed funding from XYZ Venture Capital and The General Partnership. Now that we have a solution that works, we want to get it to everyone.
The problem & solution
Some of the most brilliant people and most impactful projects in the world are being blocked by bad prompts. We’re all too used to telling computers what to do, and them just doing it. Now we find ourselves in tedious negotiations with our machines: you have to cajole, coerce, and manipulate an LLM into doing what you want, with maddeningly mixed results. It might follow instructions ten times only to go rogue on the eleventh. Nearly imperceptible changes in a prompt can cause divergent behaviors that are totally unpredictable. Prompts that were working great can suddenly produce surprising results in production. And just because you crack the code on one model, doesn’t mean those insights will hold true for the next.
Let’s say you have 30 ideas for how to make a prompt better, but you don’t know what will work. Maybe the first thing you try will get the results you want. Maybe the thirtieth will. Maybe none will. So many of us are burning days and weeks trying to mold prompts that never quite succeed. And unlike building traditional software, prompt engineering is almost entirely empirical. You’ll never know if a task is possible or not – or how long it’s going to take – without testing the heck out of as many prompts as you can.
This is what Libretto is hacking on right now: making this empirical facet of prompt engineering easy. Our hypothesis is that we need a tool to test, improve, and monitor LLM prompts rapidly and intelligently. Today’s version of Libretto already makes this possible, automating away the grunt work of prompt engineering to get you to the best possible answer quickly and predictably.
Now let’s take a look under the hood.
Testing: Finding feasibility, fast
Generally speaking, your first step should always be to see if an LLM can solve the problem at hand. This usually means informally trying different models and prompt texts against a handful of example inputs for your prompt.
As you home in on the answer, you need to build up your test set so you can be more rigorous about testing changes to your prompts. This quickly becomes a massive headache. Every time you change your prompt, you have to run it over all your tests, and you too often end up skimming spreadsheets to see how the LLM did.
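To make that concrete, here’s a minimal sketch of the kind of test harness this implies. It isn’t Libretto’s API; the TestCase shape and the call_llm callable you pass in are placeholders for your own prompt template format and LLM client.

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    variables: dict   # values to substitute into the prompt template
    expected: str     # what a correct answer should look like

def run_suite(template: str, cases: list[TestCase], call_llm) -> float:
    """Run one prompt template over every saved test case and return the pass rate."""
    passed = 0
    for case in cases:
        prompt = template.format(**case.variables)   # fill the template with this case's inputs
        answer = call_llm(prompt)                    # call_llm: your own client wrapper, str -> str
        if answer.strip().lower() == case.expected.strip().lower():
            passed += 1
    return passed / len(cases)
```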
At Libretto, we track all of the test cases you’ve built up for your various prompts, and keep a large library of ways to evaluate the answers that come back from LLMs.
Automated LLM evaluation is tricky and very much dependent on the type of prompt. For some prompts, like sentiment analysis or categorization prompts, you can just do a string compare on the answer: if the LLM gets it exactly right, the test case passes. But for prompts that are more generative, like customer service chats or retrieval-augmented generation, you may want to use a fuzzy string match or another LLM to grade the responses.
In Libretto, we have many different options for evaluating the LLM’s response so that you can tailor your evaluation strategy to your prompt. We can check sentiment, toxicity, and JSON structure; score responses with BLEU, ROUGE, BERTScore, or embedding similarity; grade against custom subjective criteria; and even run custom-written evaluations.
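As a rough illustration, here’s what a few of those evaluator styles can look like in plain Python. The function names and the 0.8 fuzzy-match threshold are our own choices for the sketch, not Libretto’s built-ins.

```python
from difflib import SequenceMatcher

def exact_match(response: str, expected: str) -> bool:
    # Good for classification or sentiment prompts with a single right answer.
    return response.strip().lower() == expected.strip().lower()

def fuzzy_match(response: str, expected: str, threshold: float = 0.8) -> bool:
    # Good for short generative answers where minor wording differences are fine.
    return SequenceMatcher(None, response.lower(), expected.lower()).ratio() >= threshold

def llm_judge(response: str, rubric: str, call_llm) -> bool:
    # For open-ended outputs (support chats, RAG), ask a second model to grade the answer.
    verdict = call_llm(f"Rubric: {rubric}\nResponse: {response}\nReply with PASS or FAIL.")
    return verdict.strip().upper().startswith("PASS")
```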
Once you have test cases and ways to evaluate them, you’re ready to rapidly improve your prompts – and this is where the real magic of Libretto happens. We provide a playground where you can modify your prompt or try out different models or parameters. Then, with the click of a button, you can run all your tests and get empirical, repeatable results. This gives you confidence that, whenever you change your prompt, it’s getting better, not worse. It also ensures that you’re including all the various tests you’ve come up with, not just the one you’re playing with at the moment.
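In spirit, the playground loop looks something like the sketch below, reusing run_suite and the cases list from the earlier example plus whatever call_llm function you wired up; trying a different model or temperature just means handing in a different call_llm. The prompt texts here are illustrative.

```python
# Score a handful of prompt variants against the same test suite, so every
# change is measured on every test, not just the case you happen to be eyeing.
variants = [
    "Classify the sentiment of this review as positive or negative: {text}",
    "You are a careful analyst. Is this review positive or negative? {text}",
    "Is the following review positive or negative? Answer in one word.\n{text}",
]

for template in variants:
    score = run_suite(template, cases, call_llm)
    print(f"accuracy={score:.2%}  {template[:50]}")
```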
The TL;DR: you get to speed through that list of 30 prompt engineering techniques to find the magic bullet.
Improving: Building a real product
But what if we didn’t even have to try those 30 techniques? Libretto’s killer feature is called Experiments, and it takes the pain and labor out of prompt engineering.
The idea behind Experiments is that Libretto can automatically try various techniques, creating dozens, potentially hundreds, of variants of your prompt and figuring out which ones work best. In the time it takes you to go grab a coffee, Libretto can power through a week’s worth of your prompt engineering to-do list and give you a better version of your prompt along with the empirical evidence that shows how much better it is.
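As a toy illustration of the idea (not Libretto’s implementation), an experiment boils down to mechanically generating variants and ranking them against your test suite:

```python
# Generate prompt variants by appending well-known "magic phrases", score each
# one with the run_suite harness from the earlier sketch, and rank by accuracy.
MAGIC_PHRASES = [
    "",                                              # baseline: the unchanged prompt
    "\nTake a deep breath and work step by step.",
    "\nThink carefully before answering.",
    "\nAnswer with a single word.",
]

def run_experiment(base_template: str, cases, call_llm):
    results = []
    for phrase in MAGIC_PHRASES:
        variant = base_template + phrase
        results.append((run_suite(variant, cases, call_llm), variant))
    return sorted(results, reverse=True)             # best-scoring variant first
```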
No more brainstorming new ways to argue or plead with the LLM; Libretto does that for you. To see how we use Experiments to make a prompt 10 percentage points more accurate in under 5 minutes, check out this demo:
Here you can see how Libretto automates the choice of few-shot examples. Giving your LLM example questions and answers to work from can significantly increase accuracy, but the gain still depends on how helpful the model finds the examples you picked (finicky, as always).
When you run a few-shot experiment in Libretto, we take a bunch of your test cases and stuff them into your prompt as few-shot examples, creating several dozen variants of your prompt. Then we test each of those variants against the rest of your test set so you get concrete results. Within minutes of launching the experiment, you can see which few-shot examples work well and which ones flop.
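A rough sketch of that loop, building on the earlier run_suite harness: it assumes each test case has a text variable, and the shot and trial counts are arbitrary placeholders.

```python
import random

def few_shot_variant(base_template: str, examples) -> str:
    # Turn borrowed test cases into in-prompt examples; escape braces so that
    # str.format in run_suite doesn't trip over them.
    shots = "\n".join(
        f"Input: {ex.variables['text']}\nAnswer: {ex.expected}" for ex in examples
    ).replace("{", "{{").replace("}", "}}")
    return f"{shots}\n\n{base_template}"

def few_shot_experiment(base_template: str, cases, call_llm, k: int = 3, trials: int = 20):
    results = []
    for _ in range(trials):
        shots = random.sample(cases, k)                    # borrow k cases as examples
        held_out = [c for c in cases if c not in shots]    # evaluate on the rest
        prompt = few_shot_variant(base_template, shots)
        results.append((run_suite(prompt, held_out, call_llm), shots))
    return max(results, key=lambda r: r[0])                # best accuracy and the shots that earned it
```

The point is that the few-shot choice becomes just another measured variable rather than a guess.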
This isn’t just an academic exercise. A recent experiment I ran with Claude 3 Haiku showed a 17 percentage point difference in prompt accuracy between the best few-shot examples and the worst ones.
Every day, we’re using Libretto Experiments to learn more about practical, empirical prompt engineering (more on few-shot best practices here and here). We’ve currently got experiments that automatically compare different models, add well-known “magic phrases” to your prompts (like “take a deep breath”), and optimize your few-shot examples. We’re working on adding many more.
Monitoring: Learning from your users
Once you’ve optimized your prompt, it’s time to put it into production. This is where you run into yet another snag: your users. You may have come up with a bunch of good test cases, but I can guarantee you haven’t anticipated all the things your users will throw at your prompt. They will find ways to use and misuse it that never occurred to you. The only way through is to continuously monitor your prompts as they’re out there in the world and pull that knowledge into your prompt engineering.
To make this kind of monitoring possible, Libretto has a drop-in wrapper library for popular LLM clients that records all of the arguments being sent into your prompts and the results you get back from your LLMs. This allows you to see in great detail what’s happening in production and how your customers are using your LLM prompts.
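Libretto’s SDK handles this for you, but conceptually the wrapper amounts to something like the following sketch; the JSONL log file and record fields are our own illustration, not the actual library.

```python
import json, time, uuid

def with_logging(call_llm, log_path: str = "llm_calls.jsonl"):
    """Wrap an LLM call so every prompt, parameter set, and response is recorded."""
    def wrapped(prompt: str, **params) -> str:
        response = call_llm(prompt, **params)
        record = {
            "id": str(uuid.uuid4()),
            "timestamp": time.time(),
            "prompt": prompt,
            "params": params,
            "response": response,
        }
        with open(log_path, "a") as f:
            f.write(json.dumps(record) + "\n")   # append-only JSONL log of production calls
        return response
    return wrapped
```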
This is useful partly for debugging but also for finding those use cases that hadn’t occurred to you – great ideas for future tests. Libretto highlights the production calls from your users that might be good candidates to add to your prompt tests, and we have a smooth and streamlined process for moving production data into your prompt test sets. Last but not least, we give you a way to record user feedback on the LLM response, which’ll help you improve your prompt template tests.
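For instance, continuing the hypothetical log format from the sketch above, surfacing candidates could be as simple as pulling out the calls that got a thumbs-down:

```python
import json

def candidate_test_cases(log_path: str = "llm_calls.jsonl",
                         feedback: dict[str, bool] | None = None) -> list[dict]:
    """Return logged production calls with negative user feedback as test-set candidates."""
    feedback = feedback or {}
    candidates = []
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            if feedback.get(record["id"]) is False:   # the user flagged this answer as wrong
                candidates.append(record)             # review it, then promote it to your test set
    return candidates
```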
Putting it all together
We hope this gives you a good sense of what we’re working on here at Libretto and the pain we’re looking to kill at each part of the process. Ultimately, this is all about unleashing more creativity and productivity as we step into this new era of LLMs and learn how they can reach their full potential – so we can reach ours.
As I said at the start, we’re sharing this expressly to get your thoughts on our approach to realizing that mission. Our vision is that prompt engineering should be automated, fast, and fully empirical for everyone, but we know that we’re at the beginning of this journey, and there’s a tremendous amount to learn.
You can sign up to join Libretto’s beta today and see how easy it makes your prompt engineering across projects. Click here to sign up.
Want more battle-tested insights into prompt engineering? Check out Libretto's blog, Notes from the Prompting Lab. Or go ahead and sign up for Libretto's beta.