If a set of few-shot examples works with one model, will it work with another?
Last time in our stories from prompt engineering, we looked at how few-shot example selection does (or doesn't) affect a prompt's performance. Our conclusion, in line with current research, was that prompt accuracy can vary quite widely across different sets of few-shot examples.
Today we’re going to look at another question about few-shot examples, namely: if you’ve found the best few-shot examples for your prompt in model X, how likely are those examples to be the best in model Y?
This is more than just an academic question for prompt engineers. Sometimes it feels like new models are coming out every week, whether they are new versions of popular LLMs like OpenAI’s GPT 3.5 Turbo or completely new open-source LLMs. When a new model comes out, it’s pretty common to want to know how well your prompt will do with that new model in terms of accuracy, cost, and latency. In addition, hosted providers like OpenAI are deprecating models pretty frequently, so you may be forced to move your prompt from its current model to a new one. If you’ve tuned your prompt’s few-shot examples to its current model, can you just port the prompt over to the new model, or do you need to go through tuning all over again?
To test this, I used our new Experiments feature in Libretto. Libretto Experiments allow you to generate and test dozens of different variations of your prompt to automatically optimize the prompt and cut out a lot of the drudgery of prompt engineering. The first experiment type we have created is one that tries different combinations of few-shot examples to find the ones that perform best on your test set.
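To make that concrete, here's a rough sketch of the idea in Python. This is not Libretto's actual code; the subset size, the number of variations, and the random-sampling strategy are all assumptions for illustration:

```python
import random

def sample_fewshot_variations(candidate_examples, k=3, n_variations=34, seed=0):
    """Draw n_variations random subsets of k few-shot examples each.

    An illustrative sketch of "try different combinations of few-shot
    examples", not Libretto's implementation; k, n_variations, and the
    sampling strategy are assumptions.
    """
    rng = random.Random(seed)
    return [rng.sample(candidate_examples, k) for _ in range(n_variations)]
```

Each of those subsets becomes one prompt variation, and every variation gets scored against the same test set so the results are comparable.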
To test this particular question, I once again turned to the Emoji Movie task from the BIG-bench LLM benchmarking suite. It's a pretty simple task: 100 questions, each asking the LLM which movie is represented by a string of emojis. For example:
Q: What movie does this emoji describe? 👩❤️🌊👹
A: the shape of water
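A few-shot version of this prompt just stacks a handful of already-solved Q/A pairs in that same format ahead of the new question. Here's a minimal sketch of how such a prompt might be assembled, purely for illustration (this isn't how Libretto builds prompts internally):

```python
def build_prompt(fewshot_examples, emoji_query):
    """Assemble a few-shot prompt for the Emoji Movie task.

    fewshot_examples is a list of (emoji_string, movie_title) pairs;
    the Q/A wording mirrors the example above.
    """
    lines = []
    for emojis, title in fewshot_examples:
        lines.append(f"Q: What movie does this emoji describe? {emojis}")
        lines.append(f"A: {title}")
    lines.append(f"Q: What movie does this emoji describe? {emoji_query}")
    lines.append("A:")
    return "\n".join(lines)
```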
To conduct this test, I ran five different OpenAI models in Libretto Experiments: three versions of GPT 3.5 Turbo (0613, 1106, and 0125) and two versions of GPT 4 Turbo (1106 and 0125). I started with only OpenAI models, under the theory that they are the most likely to show correlated behavior. Libretto Experiments generated 34 different variations of the prompt, each using a different selection of few-shot examples, and ran the same 34 variations against every model. That meant running the 100 Emoji Movie test cases against each prompt variation on all five models, for a total of:
5 models x 34 prompt variations x 100 test cases = 17,000 calls to OpenAI
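If you wanted to reproduce a run like this by hand rather than through Libretto, the core loop is just scoring every (model, prompt variation) pair against the test set. Here's a hedged sketch using the OpenAI Python client; zero temperature and exact-match grading are simplifications on my part, and a real harness would also need retries and rate limiting:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

MODELS = [
    "gpt-3.5-turbo-0613",
    "gpt-3.5-turbo-1106",
    "gpt-3.5-turbo-0125",
    "gpt-4-1106-preview",  # GPT 4 Turbo 1106
    "gpt-4-0125-preview",  # GPT 4 Turbo 0125
]

def accuracy(model, prompts_and_answers):
    """Fraction of test cases answered correctly for one prompt variation.

    prompts_and_answers holds (fully built prompt, expected answer) pairs,
    e.g. produced with a helper like build_prompt() above. Exact string
    matching is a simplification; real graders are usually fuzzier.
    """
    correct = 0
    for prompt, expected in prompts_and_answers:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        answer = resp.choices[0].message.content.strip().lower()
        correct += int(answer == expected.strip().lower())
    return correct / len(prompts_and_answers)
```

Collecting these scores for every combination leaves you with, per model, a vector of 34 accuracies (one per few-shot variation), which is what the comparisons below are built on.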
The experiments finished in Libretto in a matter of minutes, and then I set about analyzing the results.
The first two models I looked at happened to be GPT 3.5 Turbo versions 0613 and 0125. Here's a scatterplot of the test results for 3.5-0613 charted against the test results for 3.5-0125, where each dot represents one variation of the prompt, i.e., one particular set of few-shot examples:
This chart tells a pretty clear story: there's a real correlation between how prompts perform on these two versions of GPT 3.5 Turbo, and if you choose the best-performing set of few-shot examples for one of the models, you're likely to end up with a set that also works reasonably well for the other. This result made a lot of sense to me: different versions of GPT 3.5 Turbo are built by the same team, presumably on pretty similar training data, so it stands to reason that they would behave similarly. So far, so good.
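Quantifying that correlation takes one line once you have each model's 34 per-variation accuracy scores lined up by variation. Here's a sketch using Pearson correlation, which is just one reasonable choice of statistic:

```python
import numpy as np

def variation_correlation(acc_model_a, acc_model_b):
    """Pearson correlation between two models' per-variation accuracies.

    Each argument is a list of accuracy scores, one per few-shot
    variation, aligned so that index i refers to the same set of
    examples on both models.
    """
    return float(np.corrcoef(acc_model_a, acc_model_b)[0, 1])
```

The same idea extends to every pair of models at once, e.g. `pandas.DataFrame(results).corr()` on a dict mapping model names to their 34-score vectors, which is essentially the pairwise view the rest of this post relies on.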
However, I next looked at the comparison of GPT 3.5 Turbo versions 0613 and 1106, and the picture was completely different:
Here we have two versions of GPT 3.5 Turbo with essentially no correlation at all. What's more, GPT 3.5 Turbo 1106 was released in between 0613 and 0125, which means that, somehow, two models released further apart in time are much more correlated than two consecutive releases. Puzzling!
As I skimmed through the comparisons for the other models, I found that very low correlation between models was the norm:
In fact, the first two models I’d looked at, GPT 3.5 Turbo 0613 and 0125, were by far the most correlated. Most of the other pairs of models had very little to no correlation at all:
So, what can we conclude from this?
Like it or not, testing which few-shot examples work is probably going to be something prompt engineers do on a model-by-model basis. If you’ve tuned your few-shot examples for model A, it’s very likely that those are not the optimal few-shot examples for model B, even if they are different versions of the same model from the same vendor.
This is sort of a bummer of a result, as tuning few-shot examples is fairly laborious. Luckily, tools like Libretto, with our new Experiments feature, let you run tests like this across hundreds of test cases and as many models as you'd like with the click of a single button. If you'd like a demo of Libretto or early access to the product, or if you want us to run more or different experiments, sign up for our beta. Thanks for reading, and see you next time!
Want more battle-tested insights into prompt engineering? Check out Libretto's blog, Notes from the Prompting Lab. Or go ahead and sign up for Libretto's Beta: