Vibes at Scale: Libretto Makes LLM-as-Judge Easy, Understandable, and Dependable

LLMs judging LLMs: It's LLMs all the way down.

Working with LLMs often feels like magic: tasks that used to require specially trained models and hundreds of lines of code can now be built in a few minutes. But this magic comes in a black box, and we can’t really be sure what will work or how it will fail. That can be incredibly frustrating, particularly for those of us with engineering backgrounds who are used to certainty, rigorous testing, and optimization.

Given this uncertainty, you may be surprised to learn that the most common technique we hear about for checking whether a prompt is good is simply running it on a few examples and manually reading through the answers. Does the summary seem to capture the things I would want? Does the tone of the email sound authentic? We affectionately refer to this type of evaluation as “vibes.” And many companies have shipped good experiences using this evaluation technique.

But vibes checks are necessarily slow and don’t scale. Product managers spend hours poring over logs to make sure nothing is broken. They want to move to a more robust evaluation framework, but each step in the process — setting up examples, choosing and designing evaluation criteria, and reviewing results — is difficult and time consuming. On top of that, many evaluations require a subjective rating, and it is difficult to trust the scores generated by LLMs for these evaluations.

This is what we wanted to solve with Libretto’s new LLM-as-Judge feature. We generate examples for you, then we help you choose and define the right evaluations for all your prompts and deliver easy-to-parse results. We also provide an easy-to-use grading interface so that you can calibrate each evaluation using ground-truth scores from your own content experts.

Wait, will this even work?

I’ll admit that I was quite skeptical at first of using LLMs to judge the outputs of LLMs. It seemed unlikely that asking an LLM whether the answer it just generated was good would give any real signal. But it turns out that you can get usable grades from an LLM with three techniques:

  1. Using state-of-the-art models to judge cheaper models: Developers often don’t use the most capable, most expensive models to generate text, so it is effective to use a stronger, more capable model to do the judging.
  2. Focusing the judge LLM on specific criteria: In a prompt, you often ask for the answer to adhere to multiple criteria (e.g. the response should be professional, it should be three paragraphs long, it should directly answer the user’s question, and it shouldn’t mention any of your competitors). Sometimes the LLM misses on one of those criteria, but if you ask a judge afterward to focus entirely on that one criterion, it can give a good assessment.
  3. Grounding the judge with some human grades: Even state-of-the-art models can have difficulty judging criteria the way a human would without any ground truth or context. If you grade as few as 10 sample outputs on a criterion, though, the LLM can be coached into matching human grades much more closely. (A sketch combining these three techniques follows this list.)
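To make this concrete, here is a minimal sketch of a single-criterion judge call that combines all three techniques. It assumes the OpenAI Python SDK; the model name, the criterion wording, and the human-graded examples are illustrative placeholders, not Libretto’s implementation.

```python
# Minimal sketch of a single-criterion LLM judge (illustrative, not Libretto's code).
# Assumes the OpenAI Python SDK; the model name and criterion are placeholders.
from openai import OpenAI

client = OpenAI()

CRITERION = "The response directly answers the user's question."

# A few human-graded examples ground the judge in how a person scores this criterion.
HUMAN_GRADED_EXAMPLES = [
    {"output": "Our plan costs $20/month and includes unlimited seats.", "grade": "PASS"},
    {"output": "Thanks for reaching out! We love hearing from customers.", "grade": "FAIL"},
]

def judge(candidate_output: str) -> str:
    examples = "\n\n".join(
        f"Output: {ex['output']}\nGrade: {ex['grade']}" for ex in HUMAN_GRADED_EXAMPLES
    )
    # Use a more capable model to judge output produced by a cheaper one,
    # and focus it on exactly one criterion.
    response = client.chat.completions.create(
        model="gpt-4o",  # judge model: stronger than the model that generated the output
        messages=[
            {
                "role": "system",
                "content": (
                    "You grade LLM outputs on a single criterion. "
                    f"Criterion: {CRITERION} Reply with PASS or FAIL only."
                ),
            },
            {"role": "user", "content": f"{examples}\n\nOutput: {candidate_output}\nGrade:"},
        ],
    )
    return response.choices[0].message.content.strip()
```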

These three principles guided the development of our LLM-as-Judge feature, and we’re very proud to announce that it’s in a limited beta right now.

Automated bootstrapping of LLM-as-Judge evals

One of the problems in setting up LLM-as-Judge is that it can be hard to figure out where to start. What criteria matter for your LLM responses, and how should you phrase them? Are the criteria pass/fail or are they more of a continuous scale?

At some level, these questions will be answered as you try out different prompts and different models. As you develop more experience with a particular prompt, you will learn more about the ways that LLMs fail at that task. But starting out with a good set of evaluation criteria is also important, and the blank-page problem at the start is real. To fix this bootstrapping problem, we now suggest up to 5 subjective criteria for every new prompt added to Libretto and automatically add them to the prompt’s tests. As an example, here are the criteria Libretto suggested for a prompt that explains complicated concepts:

In addition to coming up with a description of each criterion, the automatically created LLM-as-Judge evaluations are set up as either boolean Pass/Fail checks or 1-to-5 scales, with detailed descriptions of what each score means.
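To give a sense of what these criteria can look like, here is a hypothetical set of definitions for that concept-explainer prompt, written as plain Python data. The field names and wording are ours, not Libretto’s actual schema.

```python
# Hypothetical criteria for a prompt that explains complicated concepts.
# The structure and wording are illustrative; Libretto's actual schema may differ.
CRITERIA = [
    {
        "name": "Uses plain language",
        "type": "pass_fail",
        "description": "The explanation avoids unexplained jargon.",
    },
    {
        "name": "Faithful to the underlying concept",
        "type": "scale_1_to_5",
        "description": "How accurately the explanation reflects the concept being explained.",
        "scores": {
            1: "Mostly incorrect or misleading",
            3: "Broadly correct, with notable gaps or errors",
            5: "Accurate and complete for the intended audience",
        },
    },
]
```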

Aligning LLM-as-Judge with Human Judgment 

Getting bootstrapped criteria for new prompts is useful, but inevitably you will find cases where the LLM judge doesn’t do as well as you would like. Sometimes this is because a criterion isn’t phrased quite right, or because the 1-to-5 scale isn’t as precise as it could be, or simply because the LLM is being too agreeable. For these cases we recommend grading a few sample outputs to align the judge LLM.

To do this in Libretto, you pull up a grading and alignment interface directly from the test results page. First you review the criteria that are currently being used. Here you can add new criteria, tweak the phrasing of existing criteria, or completely delete criteria that are not useful.

Second, we ask you to grade some sample outputs from LLMs on each of your criteria. We’ve created a user interface that makes this as painless as possible, allowing you to quickly power through the grading process. 

Once you’ve graded at least 10 LLM outputs, we let you move on to the third and last stage, which is calibration. Here, we automatically tweak the LLM judge to get it closer to your grades, and when we’re done, we show you an alignment score that reflects how well we are doing:

We find that grading just a few outputs can radically improve the quality of the LLM judge and get it to an alignment score that makes it useful in day-to-day experimentation.
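Libretto computes the alignment score for you, but as a rough intuition, one simple way to think about such a score is the rate at which the judge’s grades agree with yours. The sketch below is our own simplification, not Libretto’s actual formula.

```python
# A simplified notion of an alignment score: the fraction of graded outputs where
# the judge's grade matches the human grade. (Illustrative; not Libretto's formula.)
def alignment_score(human_grades: list[str], judge_grades: list[str]) -> float:
    assert len(human_grades) == len(judge_grades) > 0
    matches = sum(h == j for h, j in zip(human_grades, judge_grades))
    return matches / len(human_grades)

# Example: the judge agrees with the human on 8 of 10 grades -> prints 0.8.
human_graded = ["PASS"] * 7 + ["FAIL"] * 3
judge_graded = ["PASS"] * 7 + ["FAIL"] * 1 + ["PASS"] * 2
print(alignment_score(human_graded, judge_graded))
```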

But can we trust it?

Even with all of these features, prompt engineers are still sometimes skeptical about LLM-as-judge. And of course they are! They are the ones who see daily that LLMs can and do fail.

The great thing about LLM-as-Judge, though, is that it doesn’t need to be perfect to be useful. Human judges often only agree about 80% of the time when grading LLM outputs, so perfection is not even really possible, much less required.

What a well calibrated LLM-as-Judge lets you do is run through many different models, prompt variations, or hyperparameters and get a quick impression of which ones are very probably doing better and which ones are very probably doing worse. Then you can dig in on the ones that are most likely to be successful and do some manual vibe spot checks to make sure that the LLM judge hasn’t gone off the rails. Rather than gauging vibes from 50 outputs from 20 different prompt variations, you can concentrate on a small handful of outputs from the 2 or 3 best prompt variations to validate that the judge is operating as expected. Think of it as a prompt-engineer-in-the-loop check on your LLM judge.
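In code terms, that workflow might look something like the sketch below. It assumes a judge() function like the one sketched earlier and a hypothetical generate(variation, example) helper that runs one prompt variation on one test example; neither is part of Libretto’s API.

```python
# Sketch of the prompt-engineer-in-the-loop workflow (illustrative only).
# Assumes a judge() like the earlier sketch and a hypothetical
# generate(variation, example) helper; neither is Libretto's API.
from statistics import mean

def score_variation(variation: str, examples: list[str]) -> float:
    # Treat PASS as 1 and FAIL as 0, averaged across the test examples.
    return mean(
        1.0 if judge(generate(variation, ex)) == "PASS" else 0.0 for ex in examples
    )

def shortlist(variations: list[str], examples: list[str], top_n: int = 3) -> list[str]:
    # Rank all variations by judge score, then hand the top few to a human
    # for a manual vibe spot check before trusting the result.
    ranked = sorted(variations, key=lambda v: score_variation(v, examples), reverse=True)
    return ranked[:top_n]
```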

Want to try? Let us know!

We’re eager to get this into the hands of prompt engineers who are hungry for more accurate results. If that’s you, let us know and sign up for the Libretto Beta!

Want more battle-tested insights into prompt engineering? Check out Libretto's blog, Notes from the Prompting Lab. Or go ahead and sign up for Libretto's Beta:

Sign Up For Libretto Beta