Understand LLM sizes

Maud Nalpas

While the "L" in Large Language Models (LLMs) suggests massive scale, the reality is more nuanced. Some LLMs contain trillions of parameters, and others operate effectively with far fewer.

Take a look at a few real-world examples and the practical implications of different model sizes.

LLM sizes and size classes

As web developers, we tend to think of the size of a resource as its download size. A model's documented size refers to its number of parameters instead. For example, Gemma 2B signifies Gemma with 2 billion parameters.

LLMs may have hundreds of thousands, millions, billions or even trillions of parameters.

Larger LLMs have more parameters than their smaller counterparts, which allows them to capture more complex language relationships and handle nuanced prompts. They're also often trained on larger datasets.

You may have noticed that certain model sizes, like 2 billion or 7 billion, are common. For example, Gemma 2B, Gemma 7B, or Mistral 7B. Model size classes are approximate groupings. For example, Gemma 2B has approximately 2 billion parameters, but not exactly.

Model size classes offer a practical way to gauge LLM performance. Think of them like weight classes in boxing: models within the same size class are more comparable. Two 2B models should offer similar performance.

That said, for specific tasks, a smaller model can match the performance of a larger one.

Screenshot of HuggingFace model size checkboxes.
Model size classes on HuggingFace. These classes aren't industry standards; they've emerged organically.

While model sizes for most recent state-of-the-art LLMs, such as GPT-4 and Gemini Pro or Ultra, aren't always disclosed, they're believed to be in the hundreds of billions or trillions of parameters.

Model sizes can vary greatly. In this illustration, DistilBERT is a tiny dot compared to the giant Gemini Pro.

Not all models indicate the number of parameters in their name. Some models are suffixed with their version number. For example, Gemini 1.5 Pro refers to the 1.5 version of the model (following version 1).

LLM or not?

When is a model too small to be an LLM? The definition of LLM can be somewhat fluid within the AI and ML community.

Some consider only the largest models, with billions of parameters, to be true LLMs, and regard smaller models such as DistilBERT as simple NLP models. Others include smaller but still powerful models, DistilBERT among them, in the definition of LLM.

Smaller LLMs for on-device use cases

Larger LLMs require a lot of storage space and a lot of compute power for inference. They need to run on dedicated powerful servers with specific hardware (such as TPUs).

One thing we're interested in, as web developers, is whether a model is small enough to be downloaded and run on a user's device.

But, that's a hard question to answer! As of today, there's no easy way for you to know "this model can run on most mid-range devices", for a few reasons:

  • Device capabilities vary widely across memory, GPU/CPU specs, and more. A low-end Android phone and an NVIDIA® RTX laptop are wildly different. You may have some data points about what devices your users have. We don't yet have a definition for a baseline device used to access the web.
  • A model or the framework it runs in may be optimized to run on certain hardware.
  • There is no programmatic way to determine if a specific LLM can be downloaded and run on a specific device. Whether a device can download and run a model depends on how much VRAM its GPU has, among other factors (some rough signals are sketched after this list).
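
There's no definitive "can this device run model X?" check, but you can gather rough signals before offering an on-device model. Here's a minimal sketch, assuming a Chromium-based browser where the Device Memory API and WebGPU are exposed; support and reported values vary by browser.

```typescript
// Rough device signals only: neither API tells you whether a specific
// LLM will actually download and run on this device.
async function getDeviceSignals() {
  // Device Memory API (Chromium): approximate RAM in GB, capped at 8.
  const approxRamGb: number | undefined = (navigator as any).deviceMemory;
  // A WebGPU adapter suggests GPU-accelerated inference may be possible.
  const adapter = 'gpu' in navigator
    ? await (navigator as any).gpu.requestAdapter()
    : null;
  return { approxRamGb, hasWebGpu: adapter !== null };
}
```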

However, we have some empirical knowledge: today, some models with a few million to a few billion parameters can run in the browser, on consumer-grade devices.

For example:

  • Gemma 2B with the MediaPipe LLM Inference API (even suitable for CPU-only devices). Try it.
  • DistilBERT with Transformers.js.
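
As a rough illustration of what this looks like in practice, here are minimal sketches of both. They assume each library's documented entry points at the time of writing; the model asset path and CDN URL are placeholders you'd replace with your own hosting.

```typescript
// DistilBERT with Transformers.js: the model downloads on first use and
// sentiment analysis runs entirely in the browser.
import { pipeline } from '@xenova/transformers';

const classify = await pipeline(
  'sentiment-analysis',
  'Xenova/distilbert-base-uncased-finetuned-sst-2-english',
);
console.log(await classify('Running models on-device is great!'));
// e.g. [{ label: 'POSITIVE', score: 0.99 }]
```

```typescript
// Gemma 2B with the MediaPipe LLM Inference API. The model asset path is
// a placeholder for wherever you host the downloaded Gemma weights.
import { FilesetResolver, LlmInference } from '@mediapipe/tasks-genai';

const genai = await FilesetResolver.forGenAiTasks(
  'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm',
);
const llm = await LlmInference.createFromOptions(genai, {
  baseOptions: { modelAssetPath: '/models/gemma-2b-it-gpu-int4.bin' },
});
console.log(await llm.generateResponse('Explain what an LLM is in one sentence.'));
```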

This is a nascent field. You can expect the landscape to evolve:

  • With WebAssembly and WebGPU innovations, WebGPU support landing in more libraries, new libraries, and further optimizations, expect user devices to become increasingly capable of efficiently running LLMs of various sizes.
  • Expect smaller, highly performant LLMs to become increasingly common, through emerging shrinking techniques.

Considerations for smaller LLMs

When working with smaller LLMs, you should always consider performance and download size.

Performance

The capability of any model heavily depends on your use case! A smaller LLM fine-tuned to your use case may perform better than a larger generic LLM.

However, within the same model family, smaller LLMs are less capable than their larger counterparts. For the same use case, you'd typically need to do more prompt engineering work when using a smaller LLM.
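
For example (hypothetical prompts): a larger model may get by with a terse instruction, while a smaller model often needs explicit output constraints and a few-shot example to answer reliably.

```typescript
// Terse prompt that a larger model might handle as-is.
const tersePrompt = 'Classify the sentiment of: "Great battery life."';

// More prompt engineering for a smaller model: explicit output format
// plus a few-shot example.
const detailedPrompt = `You are a sentiment classifier.
Reply with exactly one word: positive, negative, or neutral.

Review: "The screen cracked after a week."
Sentiment: negative

Review: "Great battery life."
Sentiment:`;
```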

Screenshot of the HuggingFace Open LLM Leaderboard.
Gemma 2B scores lower than Gemma 7B.
Source: HuggingFace Open LLM Leaderboard, April 2024

Download size

More parameters mean a larger download size, which also impacts whether a model, even if considered small, can be reasonably downloaded for on-device use cases.

While there are techniques to calculate a model's download size based on the number of parameters, this can be complex.
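
As a back-of-the-envelope estimate only: download size is roughly the parameter count multiplied by the bytes stored per parameter, plus overhead. For example, a 2-billion-parameter model stored at 4 bits per weight is about 2 × 10⁹ × 0.5 bytes ≈ 1 GB, in the same ballpark as the Gemma 2B download shown below, while the same model at 32-bit precision would be around 8 GB. Actual sizes depend on the file format, quantization scheme, and extra data such as the tokenizer.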

As of early 2024, model download sizes are rarely documented. So, for your on-device and in-browser use cases, we recommend you look at the download size empirically, in the Network panel of Chrome DevTools or with other browser developer tools.
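
If you want to capture this from code rather than DevTools, the Resource Timing API can give a rough number. A minimal sketch, assuming the model files are fetched over HTTP by the inference library and that the file-name filter below matches how your library names them:

```typescript
// Sum the transfer size of model-looking resources fetched by the page.
// Note: transferSize is 0 for cache hits and for cross-origin responses
// that don't send a Timing-Allow-Origin header.
const modelBytes = (performance.getEntriesByType('resource') as PerformanceResourceTiming[])
  .filter((e) => /\.(bin|onnx|tflite)(\?|$)/.test(e.name))
  .reduce((total, e) => total + e.transferSize, 0);

console.log(`Approximate model download: ${(modelBytes / 1e6).toFixed(0)} MB`);
```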

Screenshot of Chrome DevTools Network panel.
In the Chrome DevTools Network panel, Gemma 2B and DistilBERT for in-browser, on-device inference in a web application. The download sizes are 1.3 GB and 67 MB, respectively.

Gemma is used with the MediaPipe LLM Inference API. DistilBERT is used with Transformers.js.

Model shrinking techniques

Multiple techniques exist to significantly reduce a model's memory requirements:

  • LoRA (Low-Rank Adaptation): A fine-tuning technique where the pre-trained weights are frozen and small, low-rank adapter matrices are trained instead. Read more on LoRA.
  • Pruning: Removing less important weights from the model to reduce its size.
  • Quantization: Reducing the precision of weights from floating-point numbers (such as 32-bit) to lower-bit representations (such as 8-bit); a minimal sketch follows this list.
  • Knowledge distillation: Training a smaller model to mimic the behavior of a larger, pre-trained model.
  • Parameter sharing: Using the same weights for multiple parts of the model, reducing the total number of unique parameters.
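
To make quantization more concrete, here's a minimal sketch of the core idea: map 32-bit float weights to 8-bit integers using a single scale factor. Real quantization schemes (per-channel scales, zero points, 4-bit formats) are more involved.

```typescript
// Symmetric int8 quantization with one scale per tensor.
function quantizeInt8(weights: Float32Array): { q: Int8Array; scale: number } {
  const maxAbs = weights.reduce((m, w) => Math.max(m, Math.abs(w)), 0);
  const scale = maxAbs / 127 || 1; // avoid division by zero for all-zero tensors
  const q = Int8Array.from(weights, (w) => Math.round(w / scale));
  return { q, scale };
}

// Each weight now takes 1 byte instead of 4; dequantize with w ≈ q * scale.
```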