Building an LLM-based App with Libretto, part 2

In part 1, we learned how to use Libretto to create a prompt template, generate test cases for our prompt, and run those test cases.

In part 2, we’ll learn how to integrate Libretto into our TypeScript app, so that we can get real-world data into Libretto and use it to improve our prompts.

Integrating the Libretto SDK

The Libretto TypeScript libraries record calls to LLM providers like OpenAI or Anthropic, sending the results to Libretto as Events. Inside the Libretto app you can turn these Events into Test Cases with just a click. This makes it incredibly easy to build up a library of test cases based on real-world traffic.

We’ll continue to use our demo app, WikiDate, as a real-world example.

  1. Visit the WikiDate project page, and click on “API Keys” [insert screenshot, API Keys circled]
  2. Set up the SDK, using the “Development” API key for working locally. You can use the same project for development, staging, and production by switching API keys.
    1. Add the API key to the Next.js .env file:
      LIBRETTO_API_KEY=XXXX
      Note that there are separate API keys for Development, Staging, and Production.
    2. Install the Libretto TypeScript SDK:
      npm install @libretto/openai
    3. Replace the stock OpenAI object with Libretto’s wrapper in src/util/profile.ts:
      - import OpenAI from "openai";
      + import { OpenAI, objectTemplate } from "@libretto/openai";
  3. Update the prompt to use Libretto’s templating system rather than JavaScript template literals. This is particularly simple when using JavaScript/TypeScript because you can just replace ${variableName} with {variableName}.

    Replace this:
    const datingProfileV1 = {
      promptTemplate: [{
        role: "system",
        content:
          "You are a dating guru and are here to help create dating profiles based on the provided persons wikipedia page.",
      },
      {
        role: "user",
        content: `Using the following Wikipedia content,
                  create a dating profile for the subject of this page, called "${name}".

                  ${wiki_text}.`,
      }],
    };

    With this:

    const datingProfileV1 = {
      promptTemplate: [{
        role: "system",
        content:
          "You are a dating guru and are here to help create dating profiles based on the provided persons wikipedia page.",
      },
      {
        role: "user",
        content: `Using the following Wikipedia content,
                  create a dating profile for the subject of this page, called "{name}".

                  {wiki_text}.`,
      }],
    };
  4. Finally, update the call to the chat completion, using objectTemplate() and the following Libretto configuration (a consolidated sketch of the updated file appears just after these steps):

    openai.chat.completions.create({
      model: "gpt-4o",
      // send the template with objectTemplate(),
      // rather than the raw prompt
      messages: objectTemplate(datingProfileV1.promptTemplate),
      tools: tools,
      response_format: { type: "json_object" },
      libretto: {
        // uniquely identify this prompt, matching the key created
        // in part 1.
        promptTemplateName: "wiki-dating-profile",
        // pass along actual values
        templateParams: { wiki_text, name },
      },
    });
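
To see how the pieces fit together, here is a rough sketch of what src/util/profile.ts might look like after these steps. It’s only a sketch: the generateProfile function is an illustrative stand-in for however the demo app actually structures this code, the tools option from the snippet above is left out for brevity, and we’re assuming the Libretto wrapper accepts the same constructor options as the stock OpenAI client and picks up LIBRETTO_API_KEY from the environment, as the .env setup above suggests.

    // src/util/profile.ts (sketch)
    import { OpenAI, objectTemplate } from "@libretto/openai";

    // Drop-in replacement for the stock OpenAI client. The Libretto API key
    // is assumed to come from the LIBRETTO_API_KEY environment variable we
    // added to .env above.
    const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

    const datingProfileV1 = {
      promptTemplate: [{
        role: "system",
        content:
          "You are a dating guru and are here to help create dating profiles based on the provided persons wikipedia page.",
      },
      {
        role: "user",
        content: `Using the following Wikipedia content,
                  create a dating profile for the subject of this page, called "{name}".

                  {wiki_text}.`,
      }],
    };

    // Illustrative wrapper; the demo app's real function (and its tools
    // configuration, omitted here) may look different.
    export async function generateProfile(name: string, wiki_text: string) {
      return openai.chat.completions.create({
        model: "gpt-4o",
        messages: objectTemplate(datingProfileV1.promptTemplate),
        response_format: { type: "json_object" },
        libretto: {
          promptTemplateName: "wiki-dating-profile",
          templateParams: { wiki_text, name },
        },
      });
    }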

That’s it!

Now we can load up the app and click the “Surprise me!” button a few times to generate a few dating profiles.

Back in Libretto, click on the “Production Calls” section to see these events.

Tip: If you’re already on the page, you can click the little refresh button in the upper right corner of the table.

You’ll see the events in the table:

We now have real production data that we can immediately use to enhance our test cases.  

We can turn this production event into a test case directly from the dropdown menu: select “Edit & Add to Tests”.

Here you are given the opportunity to adjust the test case before saving it.

For this prompt there is no single “correct answer”. As you may recall from Part 1, Libretto generated other evaluations, such as “Accurate Age Calculation”, so we don’t need the “Correct Answer” section. Scroll down to the Correct Answer section and click the “X” in the “Function Names” section to clear the recorded call.

Click Add Test Case, and then go to the “Test Cases” page to see how this new test has been integrated into our current suite of tests.

Trying other models

Back in the Playground, we can now run all of our test cases against other models to compare how they perform.

Click on the “Playground” link on the left, then try running against a few models:

  1. Select a model from the dropdown in the upper left, such as “GPT 4o” or “Claude 3.5 Sonnet”
  2. Click “Save & Run Tests”

In the “Tests” panel on the right, you’ll see the various test runs we’ve tried, with different versions and different models. Select a few of these rows by clicking the checkboxes, and then click “Compare”.

In the report view, you can see how the prompt performs across different LLMs. When we ran it, we saw subtle differences between GPT 3.5 Turbo and Claude 3 Haiku. Our data includes the Kazakh volleyball player “Inna Matveyeva”. For this test case:

  • Claude 3 Haiku got the age calculation correct (44), whereas GPT 3.5 Turbo was wrong, at 43. It looks like this might actually be a reflection of what each model thinks the current year is, an opportunity to improve our prompt.
  • GPT 3.5 Turbo scored higher on the “Looking For” section, but lower on the “About Me” section.

These observations come from a single test case, so we’ll need to look at the rest of the test results to see whether one model is consistently better, but the headers can give us some clues before we start combing through the data. In the test cases that we ran, it looks like Claude 3 Haiku is averaging better on both of these evals, though GPT 3.5 Turbo is generally faster:

Improving our prompt

The “current year” problem exists in both models: Claude 3 Haiku was accurate 8.3% of the time while GPT 3.5 Turbo was never accurate. This could be fixed by simply including the current year in the prompt.

Go back to the Playground. Add “The current year is 2024” to the system message, and add “based on the current year” to the line in the user prompt about “age”. Click “Save & Run Tests”.

Now re-select the other model(s) that you want to test at the top (in our case, GPT 3.5 Turbo and Claude 3 Haiku).

There are now at least two new entries in the Tests panel, for the new runs against this new version of the prompt. Compare these outputs. In our case, this improved accuracy for both models by quite a bit:

Deploying the new prompt

Even though our prompt isn’t perfect, it is still an improvement over where we were before. Let’s take this new prompt back to our code so we can deploy it to our users. Go back to the Playground. Click the Clipboard button near the bottom to get the updated prompt as JSON:

Paste the JSON back into the demo code in src/prompts/dating-profile-v1.ts. The next time that WikiDate generates a profile, it will start using this new prompt.
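
After pasting, the prompt definition might look roughly like the sketch below. This is only an approximation: the exact JSON that Libretto copies isn’t shown here, and the full user message (including its “age” line) isn’t reproduced in this post, so the shape below simply mirrors the role/content array we’ve been editing, with the current-year changes folded in and an export added since the prompt now lives in its own file.

    // src/prompts/dating-profile-v1.ts (sketch; the "age" wording below is illustrative)
    export const datingProfileV1 = {
      promptTemplate: [{
        role: "system",
        content:
          "You are a dating guru and are here to help create dating profiles based on the provided persons wikipedia page. The current year is 2024.",
      },
      {
        role: "user",
        content: `Using the following Wikipedia content,
                  create a dating profile for the subject of this page, called "{name}",
                  and calculate their age based on the current year.

                  {wiki_text}.`,
      }],
    };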

Conclusion

Now you know how to integrate Libretto into your production code, and how to use real production data to create tests so that you can improve your prompt while testing it against real-world usage.

In part 3, we’ll use some of the evals that Libretto provides, including the new LLM-as-Judge evals that we introduced recently.

Want more battle-tested insights into prompt engineering? Check out Libretto's blog, Notes from the Prompting Lab. Or go ahead and sign up for Libretto's Beta:

Sign Up For Libretto Beta