There have been a lot of really clever proof-of-concept applications that showcase the capabilities of LLMs to generate code, perform basic classification, and of course generate content. However, it can be a real challenge to turn these demo apps into production applications. This article is the first in a series in which we’ll demonstrate how to take a basic LLM app into production using Libretto, which gives us a much better understanding of, and confidence in, the LLM outputs that our customers are seeing.
Our application is called WikiDate - a fun site that lets you generate a dating profile for any person or fictional character in Wikipedia. You can see our initial version on GitHub: https://github.com/libretto-ai/wikidate/tree/v0. On the surface, using the LLM feels simple: read in a Wikipedia article, put the article into a prompt, and pass that prompt to an LLM to generate the profile. For simplicity, we built the basic demonstration app using Next.js and Postgres, with a few techniques that we’ve learned from building other apps:
At its core, this is a pretty typical Retrieval-Augmented Generation (or RAG) app, and once we had the application up and running, everything seemed to be going fine… until we started generating some pretty odd profiles. For instance:
LLMs are pretty good at solving many of these problems with a properly crafted prompt, and we can add additional prompts to determine whether a subject is actually a person, whether a profile is problematic, and so on. But more prompts mean even more unknown LLM behavior. We want to improve our current prompt, but what if we actually end up making it worse? How would we even know? Luckily, Libretto can help.
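Before we dive into Libretto, here is a minimal sketch of the core generation flow described above. It assumes the OpenAI Node SDK and Wikipedia’s public summary endpoint; the function names are illustrative rather than the repo’s actual code, which feeds in the full article text and persists results to Postgres.

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Illustrative helper: fetch a plain-text extract of a Wikipedia article.
// (The real app uses much more of the article than this summary endpoint returns.)
async function fetchWikiText(name: string): Promise<string> {
  const url = `https://en.wikipedia.org/api/rest_v1/page/summary/${encodeURIComponent(name)}`;
  const res = await fetch(url);
  const data = await res.json();
  return data.extract ?? "";
}

// Illustrative helper: drop the article text into a prompt and ask the LLM for a profile.
async function generateProfile(name: string): Promise<string> {
  const wiki_text = await fetchWikiText(name);
  const completion = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      { role: "system", content: "You write playful dating profiles as JSON." },
      { role: "user", content: `Write a dating profile for ${name}.\n\nWikipedia says:\n${wiki_text}` },
    ],
  });
  return completion.choices[0].message.content ?? "";
}
```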
The first step is to get your prompts into Libretto, which you can do either by entering them in the UI or by integrating our SDK. Since we are just starting out, we’ll enter the prompts directly in the Libretto UI.
Create a project in Libretto by clicking on the “New Project” button. We expect to have lots of prompts for WikiDate, and a project can contain all of them.
Now click on the project, and then click the “Create Prompt” button.
Click the “Chat” button to create a standard Chat prompt.
For the prompt name, type “Wiki Dating Profile”
Type in the prompt from the demonstration project. Note that it uses the name and wiki_text variables; Libretto will recognize these and use them in future steps. Click “Create”.
Now click “Generate” to create our first test cases. Libretto and GPT-4 will generate values for the name and wiki_text fields; click “Save” or “Discard” on each suggestion until you have added 10 test cases. You can see your progress at the bottom of the dialog.
Click “Run Test Cases” to run all the test cases that you generated against the prompt. In a few moments you’ll be directed to the reporting page, so you can see the results of the tests.
You’ll note in the reporting page that Libretto has also looked over the prompt and added a few criteria for judging the quality of the response. These “Evals” are part of Libretto’s “LLM-as-Judge” system, a powerful way to leverage LLMs to measure LLM output, and will be covered in a future tutorial.
You'll also notice that the response column is somewhat inconsistent in its JSON output. We'll get to that below.
You should see a mix of evals based on different aspects of the prompt, such as whether the Age calculation is correct or whether the First Date ideas are inspired by the person’s specific personality and interests. These evals provide an easy way to start evaluating the qualitative aspects of the generated responses.
Let’s add one more test case by hand, to make sure our test cases cover a variety of people. Click on the “Test Cases” link on the left. You’ll see the 10 test cases we generated automatically.
Click the Add Test Case button.
In the dialog that appears, fill in name and wiki_text. You can use any values here, even if they aren’t in Wikipedia! Try entering your own name and a sentence or two about yourself! You can leave the “Correct Answer” field blank for now.
Click "Add Test Case". Once you have saved the test case, click the Playground link on the left.
Note on the right, you can see the tests we ran a few minutes ago, but they are labeled “Out of Date.” This is because we added an additional test case. Click the checkbox, and then click “Re-run” to run the tests against the new test case. In a matter of seconds, you should have a new test run that you can explore, including the new test.
One common problem programmers run into when using LLMs is that they can have a hard time returning structured data that can be parsed in code; it’s not uncommon for the JSON to be malformed or to not conform to the schema you asked for. One way to fix this is to add function calling to the prompt, which greatly increases the chances that you will get properly formatted results.
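In raw API terms, function calling looks roughly like the sketch below, which continues our earlier example (so openai, name, and wiki_text are as defined there, and datingProfileSchema stands in for the JSON schema listed in the steps that follow). The next few steps configure the same function through Libretto’s Playground UI instead.

```typescript
// Sketch (continuing the earlier example): attach get_dating_profile as a tool so the
// model returns structured arguments instead of free-form JSON in its text reply.
// `datingProfileSchema` is a stand-in for the JSON schema shown in the steps below.
const completion = await openai.chat.completions.create({
  model: "gpt-4",
  messages: [
    { role: "user", content: `Write a dating profile for ${name}.\n\n${wiki_text}` },
  ],
  tools: [
    {
      type: "function",
      function: {
        name: "get_dating_profile",
        description: "Return a structured dating profile for the person in the wiki text.",
        parameters: datingProfileSchema,
      },
    },
  ],
  // Force the model to call our function rather than answer in prose.
  tool_choice: { type: "function", function: { name: "get_dating_profile" } },
});
```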
Click the Playground link to start editing the prompt.
Click the “Add Function” button.
Add a new function with the name get_dating_profile. Set the Parameters field to the following JSON schema:
{
  "type": "object",
  "properties": {
    "age": {
      "type": "number",
      "description": "The age in years of the person whose wiki profile is being viewed. Even if deceased, calculate their age from their date of birth."
    },
    "aboutMe": {
      "type": "string",
      "description": "A short 3-5 sentence bio of the person."
    },
    "lastName": { "type": "string" },
    "firstName": { "type": "string" },
    "lookingFor": {
      "type": "string",
      "description": "A short paragraph on what this person would look for in people they would potentially date."
    },
    "firstDateIdeas": {
      "type": "string",
      "description": "Based on this person's wiki profile, a paragraph about interesting first date ideas."
    }
  },
  "required": [
    "firstName", "lastName", "age", "aboutMe", "lookingFor", "firstDateIdeas"
  ]
}
Click "Save" to add the function and then “Save & Run Test” to save the prompt and run your tests with the new function. A new version of the prompt is saved, and given a name like “With Function Tool” or something to that effect.
When the tests are done running, click on the test run title. You’ll notice that the “Response” column now contains the actual function call. The same Evals will work against this new format.
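Back in the application code, the payoff of the function-call format is that the response can be parsed with far less defensive handling. Continuing the earlier sketch (the names are illustrative, not the repo’s actual code), the structured arguments come back on the message’s tool calls:

```typescript
// Sketch (continuing the earlier example): the profile now arrives as a JSON string of
// function arguments matching the schema we declared, rather than free-form text.
const toolCall = completion.choices[0].message.tool_calls?.[0];
if (!toolCall) {
  throw new Error("Model did not call get_dating_profile");
}

const profile = JSON.parse(toolCall.function.arguments) as {
  firstName: string;
  lastName: string;
  age: number;
  aboutMe: string;
  lookingFor: string;
  firstDateIdeas: string;
};
```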
This tutorial stepped through setting up a prompt, creating test cases, and running tests. We hope that this begins to show the value of Libretto for understanding how LLMs behave with your prompts. In future parts of this tutorial, we will show you:
Want more battle-tested insights into prompt engineering? Check out Libretto's blog, Notes from the Prompting Lab. Or go ahead and sign up for Libretto's Beta: