Practical prompt engineering for smaller LLMs

Maud Nalpas

A large language model's effectiveness heavily relies on the instructions we give it. Prompt engineering is the process of crafting questions in a way that gets the best output from an LLM. It's a crucial step in implementing an LLM-based feature.

Prompt engineering is an iterative process. If you've experimented with different LLMs, you've probably noticed that you needed to tweak your prompt to achieve a better result.

This is also true for models of different sizes.

Chat interfaces powered by large LLMs, such as Gemini or ChatGPT, can often produce satisfying results with minimal prompting effort. However, when working with a default, smaller LLM that is not fine tuned, you need to adapt your approach.

Smaller LLMs are less powerful and have a smaller pool of information to draw from.

What do we mean by "smaller LLMs"?

Defining LLM sizes is complicated, and they aren't always disclosed by the makers.

In this document, "smaller LLMs" means any model under 30B parameters. As of today, models with a few million to a few billion parameters can realistically be run in the browser, on consumer-grade devices.

Where are smaller LLMs used?

  • On-device/in-browser generative AI, for example if you're using Gemma 2B with MediaPipe's LLM Inference API (even suitable for CPU-only devices) or DistilBERT in the browser with Transformers.js. Only these smaller LLMs can realistically be downloaded and run on a user's device: they keep web downloads reasonable and fit within a device's memory and GPU/CPU constraints. See the sketch after this list.
  • Custom server-side generative AI. Small open-weight models like Gemma 2B, Gemma 7B or Gemma 27B are available for you to run on your own server (and optionally fine tune).
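
For example, here's a minimal sketch of running Gemma 2B in the browser with MediaPipe's LLM Inference API. The WebAssembly CDN URL and the model file path are assumptions that depend on how you host the files; check the MediaPipe documentation for your setup.

import { FilesetResolver, LlmInference } from '@mediapipe/tasks-genai';

// Load the WebAssembly assets for the GenAI tasks (CDN URL is an assumption).
const genaiFileset = await FilesetResolver.forGenAiTasks(
  'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm'
);

// Create the LLM inference task with a self-hosted Gemma 2B model file
// (the model path is a placeholder for wherever you serve it from).
const llmInference = await LlmInference.createFromOptions(genaiFileset, {
  baseOptions: { modelAssetPath: '/models/gemma-2b-it-gpu-int4.bin' },
  maxTokens: 1024,
  temperature: 0.2,
});

// Run inference on a prompt and get the generated text back.
const prompt = 'Based on a user review, provide a product rating as an integer between 1 and 5.';
const llmOutput = await llmInference.generateResponse(prompt);
console.log(llmOutput);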

Get started


Provide context and precise format instructions

To achieve optimal results with small LLMs, craft more detailed and specific prompts.

For example:

Based on a user review, provide a product rating as an integer between 1 and 5. \n
Only output the integer.

Review: "${review}"
Rating:

| Input (review) | Output (rating), larger LLM (Gemini 1.5) | Output (rating), smaller LLM (Gemma 2B) |
| --- | --- | --- |
| Absolutely love the fit! Distributes weight well and surprisingly comfortable even on all-day treks. Would recommend. | 5 | 4 out of 5 stars** |
| The straps are flimsy, and they started digging into my shoulders under heavy loads. | 1 | 2/5 |

While Gemini 1.5 provides the desired output with this simple prompt, Gemma's output isn't satisfying:

  • The format is incorrect. We requested an integer for the rating.
  • The rating doesn't seem quite accurate. The first review is enthusiastic enough to warrant a 5-star rating.

To fix this, we need to use prompt engineering techniques: one-, few-, and multi-shot prompting and chain-of-thought prompting. We must also provide clear format instructions and insist that the model use the full range of ratings.

For example:

`Analyze a product review, and then based on your analysis give me the
corresponding rating (integer). The rating should be an integer between 1 and
5. 1 is the worst rating, and 5 is the best rating. A strongly dissatisfied
review that only mentions issues should have a rating of 1 (worst). A strongly
satisfied review that only mentions positives and upsides should have a rating
of 5 (best). Be opinionated. Use the full range of possible ratings (1 to
5). \n\n
    \n\n
    Here are some examples of reviews and their corresponding analyses and
    ratings:
    \n\n
    Review: 'Stylish and functional. Not sure how it'll handle rugged outdoor
    use, but it's perfect for urban exploring.'
    Analysis: The reviewer appreciates the product's style and basic
    functionality. They express some uncertainty about its ruggedness but
    overall find it suitable for their intended use, resulting in a positive,
    but not top-tier rating.
    Rating (integer): 4
    \n\n
    Review: 'It's a solid backpack at a decent price. Does the job, but nothing
    particularly amazing about it.'
    Analysis: This reflects an average opinion. The backpack is functional and
    fulfills its essential purpose. However, the reviewer finds it unremarkable
    and lacking any standout features deserving of higher praise.
    Rating (integer): 3
    \n\n
    Review: 'The waist belt broke on my first trip! Customer service was unresponsive too. Would not recommend.'
    Analysis: A serious product defect and poor customer service experience naturally warrants the lowest possible rating. The reviewer is extremely unsatisfied with both the product and the company.
    Rating (integer): 1
    \n\n
    Review: 'Love how many pockets and compartments it has. Keeps everything organized on long trips. Durable too!'
    Analysis: The enthusiastic review highlights specific features the user loves (organization and durability), indicating great satisfaction with the product. This justifies the highest rating.
    Rating (integer): 5
    \n\n
    Review: 'The straps are a bit flimsy, and they started digging into my shoulders under heavy loads.'
    Analysis: While not a totally negative review, a significant comfort issue leads the reviewer to rate the product poorly. The straps are a key component of a backpack, and their failure to perform well under load is a major flaw.
    Rating (integer): 1
    \n\n
    Now, here is the review you need to assess:
    \n
    Review: "${review}" \n`;

| Input (review) | Output with a smaller LLM (Gemma 2B) |
| --- | --- |
| Absolutely love the fit! Distributes weight well and surprisingly comfortable even on all-day treks. Would recommend. | Analysis: This review is highly positive and emphasizes the product's fit and comfort, indicating great satisfaction with the product. This justifies a high rating. Rating (integer): 5 |
| The straps are flimsy, and they started digging into my shoulders under heavy loads. | **Analysis:** The review highlights a significant comfort issue with the straps. It also suggests that the product may not be suitable for heavy loads or prolonged outings. **Rating:** 1 (worst) |

Our more elaborate prompt, combined with few-shot prompting and chain-of-thought prompting techniques, gives more accurate ratings.
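
If you reuse this pattern, it can help to assemble the few-shot, chain-of-thought prompt programmatically instead of maintaining one long string. Here's a minimal sketch; buildRatingPrompt and the examples structure are hypothetical helpers, not part of any API.

// Hypothetical helper that assembles a few-shot, chain-of-thought prompt from
// a list of example reviews with their analyses and ratings.
const examples = [
  {
    review: "It's a solid backpack at a decent price. Does the job, but nothing particularly amazing about it.",
    analysis: 'This reflects an average opinion: the backpack is functional but unremarkable.',
    rating: 3,
  },
  // ...more examples covering the full 1 to 5 range.
];

function buildRatingPrompt(review) {
  const instructions =
    'Analyze a product review, and then based on your analysis give me the ' +
    'corresponding rating (integer). The rating should be an integer between ' +
    '1 and 5. Be opinionated. Use the full range of possible ratings (1 to 5).';
  const shots = examples
    .map((ex) => `Review: '${ex.review}'\nAnalysis: ${ex.analysis}\nRating (integer): ${ex.rating}`)
    .join('\n\n');
  return `${instructions}\n\n${shots}\n\nNow, here is the review you need to assess:\nReview: "${review}"`;
}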

Expect manual output parsing

While you should always manually clean up and interpret an LLM's output, this is particularly true for smaller LLMs, which might produce less polished results and may require chain-of-thought prompting.

In the example above, we used chain-of-thought prompting, so the output includes both an analysis and a rating, and we have to manually parse the output to extract the rating. Also note the inconsistent formatting of the output in the previous section: the model sometimes outputs Markdown, but not every time.

// Use try/catch to catch (parsing) errors.
try {
  // Parse the output for a rating.
  const int = /\d+/;
  const ratingAsString = llmOutput.match(int)[0];
  // Parse as an integer.
  const ratingAsInt = parseInt(ratingAsString);
  // `checkRating` checks that the rating is a number between 1 and 5, since
  // the regex may catch a number like "100".
  const finalRating = checkRating(ratingAsInt);
} catch (e) {
  console.error('Error', e);
}
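
The prompt asks for a rating between 1 and 5, but the model may still return an out-of-range number. checkRating isn't defined in the snippet above; a minimal sketch of such a helper, under that assumption, could look like this:

// Hypothetical validation helper: throws if the parsed rating isn't an
// integer between 1 and 5, so the surrounding try/catch handles the error.
function checkRating(rating) {
  if (!Number.isInteger(rating) || rating < 1 || rating > 5) {
    throw new Error(`Invalid rating: ${rating}`);
  }
  return rating;
}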

Mind API differences

LLM cloud APIs, like the Gemini API or the OpenAI API, which are typically the entry point to larger LLMs, offer handy prompt features. For example, Gemini 1.5 Pro offers system instructions and JSON mode.

At the moment, these features aren't always available for custom model usage, or for smaller LLMs accessed using in-browser AI APIs, such as the MediaPipe LLM Inference API or Transformers.js. While this isn't necessarily a technical limitation, in-browser AI APIs tend to be leaner.
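
For illustration, here's a sketch of what those features look like with the Gemini API's JavaScript SDK (@google/generative-ai); treat the exact option names as assumptions and check the current API reference. The point is that the API, rather than the prompt itself, carries the system instructions and the output format.

import { GoogleGenerativeAI } from '@google/generative-ai';

// API_KEY and review are placeholders.
const genAI = new GoogleGenerativeAI(API_KEY);

// System instructions and JSON mode are configured at the API level, so the
// prompt itself can stay short.
const model = genAI.getGenerativeModel({
  model: 'gemini-1.5-pro',
  systemInstruction:
    'You rate product reviews. Always answer with an integer between 1 and 5.',
  generationConfig: { responseMimeType: 'application/json' },
});

const result = await model.generateContent(`Review: "${review}"`);
console.log(result.response.text()); // Expected shape, for example: {"rating": 5}

With a smaller LLM accessed through an in-browser API, you'd instead fold the system instructions and the output format into the prompt itself, as in the earlier examples.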

Mind token limits

Because your prompt for smaller LLMs needs to include examples or more detailed instructions, it will likely be longer and take up more of your input token limit, if there is one.

Additionally, smaller models tend to have a smaller input token limit. For example, Gemini 1.5 Pro has a 1-million-token input limit, while Gemma models have an 8K context window.

Use token count functions to avoid hitting the limit.
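
With the Gemini API, for example, you can count tokens before sending the prompt. The sketch below reuses the model object from the previous snippet and assumes an 8K token budget for a Gemma-sized context window; in-browser APIs may expose their own token-counting helpers, so check your runtime's documentation.

// Count the prompt's tokens before sending it (Gemini API JavaScript SDK).
// `model` is the Gemini model from the previous sketch; `prompt` is your
// assembled prompt string.
const { totalTokens } = await model.countTokens(prompt);

const TOKEN_LIMIT = 8192; // Assumed budget for a Gemma-sized context window.
if (totalTokens > TOKEN_LIMIT) {
  // Trim the prompt, for example by dropping some of the few-shot examples.
  console.warn(`Prompt is ${totalTokens} tokens; the limit is ${TOKEN_LIMIT}.`);
}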

Adapt your time estimates

Account for prompt design and testing in your engineering time estimates.

Due to API differences and token limits, you'll likely need more time and effort to craft your prompt for a smaller LLM than a larger one. Testing and validating the LLM's output may also be higher effort.

Prompt engineering versus fine-tuning?

For web developers, prompt engineering is our preferred way of leveraging generative AI over custom training and fine-tuning. But even advanced prompt engineering may not be sufficient in some use cases, especially if you're using a smaller LLM.

Use fine-tuning when:

  • You require top-notch accuracy and performance for a specific task. Fine-tuning directly adjusts the model's internal parameters for optimal results.
  • You have well-curated data, relevant to your task, already labeled with preferred outputs. You need this data for effective fine-tuning.
  • You use the model for the same purpose repeatedly. Fine-tuning can be done once, and reused for a specific task.