Published: October 30, 2024
Building features with large language models (LLMs) is quite different from conventional software engineering. Developers need to learn prompt engineering to handle non-deterministic results, pre-process input, and post-process results.
One of the challenges you've shared with us is that testing the output from LLMs, to determine its validity and quality, is time-consuming. Developers often resort to batch-generating output from different inputs, then validating it manually using human judgment.
A more scalable approach to evaluating the results of different models and prompts is the LLM as a judge technique. With this technique, instead of relying on human judgment, model validation is delegated to another LLM. The second LLM should be a larger, cloud-based LLM, which is likely to have better reasoning capabilities.
In this document, we use summarization to demonstrate how you can approach comparing different models and, as a bonus, show the improvement in quality from Gemma to Gemma 2.
Choose models for comparison and prep data
We evaluated three models' capabilities in summarization. We compared the results of two of Google's open models that can run client-side, Gemma and Gemma 2, both in their 2 billion parameter size. As a contrast, we also evaluated a larger, more capable cloud-based model: Gemini 1.5 Flash.
We used a dataset of 2225 BBC articles that cover areas such as business, entertainment, politics, sport, and tech, and we generated a summary of each article with each of the selected models. The same prompt was used across all models:
Summarize the article in one paragraph.
We stored the original articles and generated summaries in a database so they could be easily accessed at each step.
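As a rough illustration of this step, the following sketch stores articles and per-model summaries in SQLite. The table layout, the way the prompt is joined to the article text, and the `generate(model, prompt)` callable are all assumptions made for the example; `generate` stands in for whichever inference API you use and is not part of any specific library.

```python
import sqlite3
from typing import Callable, Iterable

# The prompt quoted in this article; appending the article body is an assumption.
PROMPT = "Summarize the article in one paragraph."


def init_db(conn: sqlite3.Connection) -> None:
    """Create hypothetical tables for articles and their generated summaries."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "id INTEGER PRIMARY KEY, body TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS summaries ("
        "article_id INTEGER NOT NULL REFERENCES articles(id), "
        "model TEXT NOT NULL, summary TEXT NOT NULL, "
        "PRIMARY KEY (article_id, model))"
    )


def summarize_all(
    conn: sqlite3.Connection,
    models: Iterable[str],
    generate: Callable[[str, str], str],
) -> None:
    """Generate one summary per (article, model) pair using the same prompt."""
    model_names = list(models)
    articles = conn.execute("SELECT id, body FROM articles").fetchall()
    for article_id, body in articles:
        for model in model_names:
            summary = generate(model, f"{PROMPT}\n\n{body}")
            conn.execute(
                "INSERT OR REPLACE INTO summaries (article_id, model, summary) "
                "VALUES (?, ?, ?)",
                (article_id, model, summary),
            )
    conn.commit()
```

Keying summaries on the article and model pair keeps reruns idempotent, so a failed batch can be restarted without creating duplicates.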
Select a judge to analyze and score summaries
To analyze the summary quality, we used Gemini 1.5 Flash to judge the summaries created by Gemma 2B and Gemma 2 2B. Our specific approach is based on alignment, which is part of DeepEval's summarization metric.
Alignment is a metric that measures the frequency with which the statements included in a summary are supported in the original content the summary is based on.
We broke the evaluation process into two steps. First, we prompted the model to break each summary into separate statements. Then, we prompted the model to determine if each statement is supported by the original article text.
Extract statements from summaries
We asked Gemini 1.5 Flash to break up longer text into separate statements. For example:
Everton defender David Weir has played down talk of European football, despite his team lying in second place in the Premiership after beating Liverpool.
Gemini 1.5 Flash split this sentence into the following statements:
- "David Weir plays defender for Everton."
- "Everton is currently in second place in the Premiership."
- "Everton beat Liverpool in a recent match."
- "David Weir has minimized discussion about Everton playing in European football."
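The extraction step could be sketched as follows. The prompt wording is hypothetical, not the prompt used for this evaluation, and `judge` is an assumed callable that sends a prompt to the judge model and returns its text reply. The parser tolerates extra text around the JSON array, since model replies are not guaranteed to be clean JSON.

```python
import json
from typing import Callable

# Hypothetical prompt wording, written for this example only.
EXTRACT_PROMPT = (
    "Break the following text into short, self-contained factual statements. "
    "Reply with a JSON array of strings and nothing else.\n\nText:\n{text}"
)


def parse_statements(reply: str) -> list[str]:
    """Pull a JSON array of strings out of the judge's reply."""
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in judge reply")
    items = json.loads(reply[start : end + 1])
    # Keep only non-empty strings and trim surrounding whitespace.
    return [s.strip() for s in items if isinstance(s, str) and s.strip()]


def extract_statements(summary: str, judge: Callable[[str], str]) -> list[str]:
    """Ask the judge model to split a summary into separate statements."""
    return parse_statements(judge(EXTRACT_PROMPT.format(text=summary)))
```

Passing the judge in as a callable keeps the parsing logic testable without network access.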
Validate statements
We then asked Gemini 1.5 Flash to compare each of the split-up statements against the original text. The model classified each statement's validity as:
- Yes: The statement is supported by the original text.
- No: The statement contradicts the original text.
- Idk: It's not possible to verify if the statement is supported or if it contradicts the original text.
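One way to handle these three labels in code is to normalize the judge's reply into a fixed set of verdicts. In this sketch the prompt wording and the `judge` callable are assumptions, and treating any unrecognized reply as Idk is a design choice for the example rather than something the evaluation described here is known to do.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    YES = "yes"
    NO = "no"
    IDK = "idk"


# Hypothetical prompt wording, written for this example only.
VERIFY_PROMPT = (
    "Original text:\n{article}\n\nStatement:\n{statement}\n\n"
    "Is the statement supported by the original text? "
    "Answer with exactly one word: yes, no, or idk."
)


def parse_verdict(reply: str) -> Verdict:
    """Map a free-text reply to a Verdict, defaulting to IDK when unclear."""
    token = reply.strip().strip(".\"'").lower()
    try:
        return Verdict(token)
    except ValueError:
        return Verdict.IDK


def verify_statements(
    article: str, statements: list[str], judge: Callable[[str], str]
) -> list[Verdict]:
    """Ask the judge about each statement separately, preserving order."""
    return [
        parse_verdict(judge(VERIFY_PROMPT.format(article=article, statement=s)))
        for s in statements
    ]
```

Checking one statement per call costs more requests than batching, but it keeps each verdict unambiguous and easy to parse.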
Analysis of the results
This process resulted in two metrics that can be used to compare the models:
- Alignment: How often the model produced summaries whose statements are supported by the original text.
- Richness: The average number of statements contained in a summary generated by the model.
Alignment
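We calculated alignment by counting the number of summaries that have at least one statement marked as "No," dividing it by the total number of summaries, and subtracting the result from 1. A higher score therefore means fewer summaries contain a statement that contradicts the original text.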
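The arithmetic can be sketched as below. The sketch makes two assumptions: that the reported alignment percentage is the complement of the share of summaries with at least one "No" (which matches higher scores being described as better), and that "Idk" verdicts do not count against a summary. The input format, one list of lowercase verdict strings per summary, is also hypothetical.

```python
def flagged_fraction(verdicts_per_summary: list[list[str]]) -> float:
    """Share of summaries that contain at least one statement marked "no"."""
    if not verdicts_per_summary:
        raise ValueError("need at least one summary")
    flagged = sum(1 for verdicts in verdicts_per_summary if "no" in verdicts)
    return flagged / len(verdicts_per_summary)


def alignment(verdicts_per_summary: list[list[str]]) -> float:
    """Assumed alignment score: share of summaries with no "no" verdicts."""
    return 1.0 - flagged_fraction(verdicts_per_summary)
```

For example, if one of four summaries has a statement marked "no", `flagged_fraction` returns 0.25 and `alignment` returns 0.75.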
The Gemini 1.5 Flash model has the highest alignment scores, exceeding 92%. This means it's very good at sticking to the facts and avoids making things up.
Gemma 2 2B has a respectable score of 78.64%, indicating a good level of accuracy. Meanwhile, the previous version, Gemma 2B, has a lower alignment score, which means it's more prone to including information not supported by the original text.
Richness
We calculated model richness by averaging the number of statements generated by the model for each summary.
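In code, this is a plain average of statement counts. The sketch assumes the extracted statements are available as one list of strings per summary; the function name is illustrative.

```python
from statistics import fmean


def richness(statements_per_summary: list[list[str]]) -> float:
    """Average number of extracted statements per summary for one model."""
    if not statements_per_summary:
        raise ValueError("need at least one summary")
    return fmean(len(statements) for statements in statements_per_summary)
```

Because richness only counts statements, it is best read together with alignment: a summary can be rich in statements that the original text does not support.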
Gemma 2 2B has the highest richness score at 9.1, indicating that its summaries include more details and key points. The Gemini 1.5 Flash model also has high richness scores, exceeding 8.4. Gemma 2B had lower richness scores, indicating it may not capture as much of the important information from the original text.
Conclusion
We determined that smaller models capable of running client-side, such as Gemma 2 2B, can generate high-quality output. Cloud-based models, such as Gemini 1.5 Flash, excel at producing summaries that are aligned with the original article while packing in a considerable amount of information. However, that difference should be weighed alongside application performance, privacy and security needs, and other questions you may ask when determining if you should build client-side AI.
There is a clear evolution in the capabilities of the Gemma model family, as Gemma 2 2B is capable of generating richer and more aligned summaries than Gemma 2B.
Evaluate your use cases
This document only scratched the surface of what's possible with the LLM as a judge technique. Even with summarization, you could look at more metrics and the results may differ. For example, you could evaluate coverage by using a prompt to identify key points from an article, then use a different prompt to validate if those key points are covered by each summary.
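A coverage score along those lines might look like the following sketch. It assumes the key points have already been extracted from the article, and that `is_covered(key_point, summary)` is a hypothetical callable wrapping a judge prompt that answers whether the summary covers the key point. Neither is part of the evaluation described in this document.

```python
from typing import Callable


def coverage(
    key_points: list[str],
    summary: str,
    is_covered: Callable[[str, str], bool],
) -> float:
    """Share of an article's key points that the summary covers."""
    if not key_points:
        raise ValueError("need at least one key point")
    covered = sum(1 for point in key_points if is_covered(point, summary))
    return covered / len(key_points)
```

Averaging this score over all summaries from one model would give a per-model coverage figure comparable to the alignment and richness metrics.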
Other use cases, such as writing text, rewriting text, or retrieval augmented generation (RAG), may produce different results for the same metrics or may call for other metrics altogether.
When implementing this approach, think about how a human would evaluate the output to determine which metrics are best for your use cases. It's also worth looking into existing frameworks, like DeepEval, that may already have a set of metrics that are appropriate for your use case.
Have you implemented LLM as a judge to evaluate models? Tweet us your findings at @ChromiumDev or share with Chrome for Developers on LinkedIn.