There have been a lot of really clever proof-of-concept applications that showcase the capabilities of LLMs to generate code, perform basic classification, and of course generate content. However, it can be a real challenge to turn these demo apps into production applications. This article is the first in a series in which we’ll demonstrate how to take a basic LLM app into production using Libretto, which gives us a much better understanding of, and confidence in, the LLM outputs that our customers are seeing.
Our application is called WikiDate - a fun site that lets you generate a dating profile for any person or fictional character in Wikipedia. You can see our initial version on GitHub: https://github.com/libretto-ai/wikidate/tree/v0. On the surface, using the LLM feels simple: read in a Wikipedia article, put the article into a prompt, and pass that prompt to an LLM to generate the profile. For simplicity, we built the basic demonstration app using Next.js and Postgres, with a few techniques that we’ve learned from building other apps:
At its core, this is a pretty typical Retrieval-Augmented Generation (or RAG) app, and once we had the application up and running, everything seemed to be going fine… until we started generating some pretty odd profiles. For instance:
LLMs are pretty good at solving many of these problems with a properly crafted prompt, and we can add additional prompts to determine whether a subject is actually a person, whether a profile is problematic, and so on. But more prompts mean even more unknown LLM behavior. We want to improve our current prompt, but what if we actually end up making it worse? How would we even know? Luckily, Libretto can help.
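Before we dive into Libretto, here is a minimal sketch of the core generation flow described above. It assumes the OpenAI Node SDK and Wikipedia’s public summary endpoint; the function names are illustrative rather than the repo’s actual code, which feeds in the full article text and persists results to Postgres.

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Illustrative helper: fetch a plain-text extract of a Wikipedia article.
// (The real app uses much more of the article than this summary endpoint returns.)
async function fetchWikiText(name: string): Promise<string> {
  const url = `https://en.wikipedia.org/api/rest_v1/page/summary/${encodeURIComponent(name)}`;
  const res = await fetch(url);
  const data = await res.json();
  return data.extract ?? "";
}

// Illustrative helper: drop the article text into a prompt and ask the LLM for a profile.
async function generateProfile(name: string): Promise<string> {
  const wiki_text = await fetchWikiText(name);
  const completion = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      { role: "system", content: "You write playful dating profiles as JSON." },
      { role: "user", content: `Write a dating profile for ${name}.\n\nWikipedia says:\n${wiki_text}` },
    ],
  });
  return completion.choices[0].message.content ?? "";
}
```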
The first step is to get your prompts into Libretto, which you can do either by entering them in the UI or by integrating our SDK. Since we are just starting out, we’ll enter the prompts directly in the Libretto UI.
Create a project in Libretto by clicking on the “New Project” button. We expect to have lots of prompts for WikiDate, and a project can contain all of them.
Now click on the project, and then click the “Create Prompt” button.
Click the “Chat” button to create a standard Chat prompt.
For the prompt name, type “Wiki Dating Profile”
Type in the prompt from the demonstration project. Note that it uses the name and wiki_text variables; Libretto will recognize these and use them in future steps. Click “Create”.
Now click “Generate” to create our first test cases. Libretto and GPT-4 will generate values for the name and wiki_text fields; click “Save” or “Discard” on each suggestion until you have added 10 test cases. You can see your progress at the bottom of the dialog.
Click “Run Test Cases” to run all the test cases that you generated against the prompt. In a few moments you’ll be directed to the reporting page, so you can see the results of the tests.
You’ll note in the reporting page that Libretto has also looked over the prompt and added a few criteria for judging the quality of the response. These “Evals” are part of Libretto’s “LLM-as-Judge” system, a powerful way to leverage LLMs to measure LLM output, and will be covered in a future tutorial.
You'll also notice that the response column is somewhat inconsistent in its JSON output. We'll get to that below.
You should see a mix of evals based on different aspects of the prompt, such as whether the Age calculation is correct or whether the First Date ideas are inspired by the person’s specific personality and interests. These evals provide an easy way to start evaluating the qualitative aspects of the generated responses.
Let’s add one more test case by hand, to make sure our test cases cover a variety of people. Click on the “Test Cases” link on the left. You’ll see the 10 test cases we generated automatically.
Click the Add Test Case button.
In the dialog that appears, fill in name and wiki_text. You can use any values here, even if they aren’t in Wikipedia! Try entering your own name and a sentence or two about yourself! You can leave the “Correct Answer” field blank for now.
Click "Add Test Case". Once you have saved the test case, click the Playground link on the left.
Note on the right, you can see the tests we ran a few minutes ago, but they are labeled “Out of Date.” This is because we added an additional test case. Click the checkbox, and then click “Re-run” to run the tests against the new test case. In a matter of seconds, you should have a new test run that you can explore, including the new test.
One common problem programmers run into when using LLMs is that they can have a hard time returning structured data that can be parsed in code; it’s not uncommon for the JSON to be malformed or to not conform to the schema you asked for. One way to fix this is to add function calling to the prompt, which greatly increases the chances that you will get properly formatted results.
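In raw API terms, function calling looks roughly like the sketch below, which continues our earlier example (so openai, name, and wiki_text are as defined there, and datingProfileSchema stands in for the JSON schema listed in the steps that follow). The next few steps configure the same function through Libretto’s Playground UI instead.

```typescript
// Sketch (continuing the earlier example): attach get_dating_profile as a tool so the
// model returns structured arguments instead of free-form JSON in its text reply.
// `datingProfileSchema` is a stand-in for the JSON schema shown in the steps below.
const completion = await openai.chat.completions.create({
  model: "gpt-4",
  messages: [
    { role: "user", content: `Write a dating profile for ${name}.\n\n${wiki_text}` },
  ],
  tools: [
    {
      type: "function",
      function: {
        name: "get_dating_profile",
        description: "Return a structured dating profile for the person in the wiki text.",
        parameters: datingProfileSchema,
      },
    },
  ],
  // Force the model to call our function rather than answer in prose.
  tool_choice: { type: "function", function: { name: "get_dating_profile" } },
});
```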
Click the Playground link to start editing the prompt.
Click the “Add Function” button.
Add a new function with the name get_dating_profile. Set the Parameters field to the following JSON schema:
{
  "type": "object",
  "properties": {
    "age": {
      "type": "number",
      "description": "The age in years of the person whose wiki profile is being viewed. Even if deceased, calculate their age from their date of birth."
    },
    "aboutMe": {
      "type": "string",
      "description": "A short 3-5 sentence bio of the person."
    },
    "lastName": { "type": "string" },
    "firstName": { "type": "string" },
    "lookingFor": {
      "type": "string",
      "description": "A short paragraph on what this person would look for in people they would potentially date."
    },
    "firstDateIdeas": {
      "type": "string",
      "description": "Based on this person's wiki profile, a paragraph about interesting first date ideas."
    }
  },
  "required": [
    "firstName", "lastName", "age", "aboutMe", "lookingFor", "firstDateIdeas"
  ]
}
Click "Save" to add the function and then “Save & Run Test” to save the prompt and run your tests with the new function. A new version of the prompt is saved, and given a name like “With Function Tool” or something to that effect.
When the tests are done running, click on the test run title. You’ll notice that the “Response” column now contains the actual function call. The same Evals will work against this new format.
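Back in the application code, the payoff of the function-call format is that the response can be parsed with far less defensive handling. Continuing the earlier sketch (the names are illustrative, not the repo’s actual code), the structured arguments come back on the message’s tool calls:

```typescript
// Sketch (continuing the earlier example): the profile now arrives as a JSON string of
// function arguments matching the schema we declared, rather than free-form text.
const toolCall = completion.choices[0].message.tool_calls?.[0];
if (!toolCall) {
  throw new Error("Model did not call get_dating_profile");
}

const profile = JSON.parse(toolCall.function.arguments) as {
  firstName: string;
  lastName: string;
  age: number;
  aboutMe: string;
  lookingFor: string;
  firstDateIdeas: string;
};
```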
This tutorial stepped through setting up a prompt, creating test cases, and running tests. We hope that this begins to show the value of Libretto for understanding how LLMs behave with your prompts. In future parts of this tutorial, we will show you:
Want more battle-tested insights into prompt engineering? Check out Libretto's blog, Notes from the Prompting Lab. Or go ahead and sign up for Libretto's Beta: