Small language models (SLMs) under 3 billion parameters became the dominant deployment class for on-device, edge, mobile, and cost-sensitive enterprise workloads in 2026. The under-3B segment improved from below 50 percent on MMLU in 2023 to approximately 65 to 70 percent in 2026, which puts SLM quality near GPT-3.5 levels while running on consumer hardware. This page consolidates the leaderboard, hardware requirements, and deployment recommendations. The leaderboard also lists a few models above 3B (Phi-4-mini, Phi-3.5-mini, MiniCPM 3.0 4B, and Granite 3.1 8B) for comparison.
Key Findings
- Phi-4-mini (released by Microsoft in February 2025, ~3.8B parameters) leads the under-4B class on GSM8K at approximately 87 percent and scores approximately 67 percent on MMLU, level with Qwen3-3B. At roughly 7.6 GB in FP16, it fits comfortably on 16 GB consumer GPUs.
- Qwen3-3B (Apache 2.0) is the strongest under-3B model on most general benchmarks, and its thinking variant substantially improves reasoning on math and code workloads.
- SmolLM3 from Hugging Face (released in July 2025) is the strongest fully-open SLM, with open weights, open training data, and open training code in a 3B-parameter package.
- Llama 3.2 1B and 3B remain the most-downloaded small models on Hugging Face and allow commercial deployment under the Llama 3.2 Community License.
- The IBM Granite 3.x family includes 1B, 2B, and 3B variants with strong performance on code and enterprise tasks, plus IBM's permissive Apache 2.0 release.
Small Language Model Leaderboard (Under 3B, May 2026)
| Model | Parameters | MMLU | GSM8K | License |
|---|---|---|---|---|
| Phi-4-mini | ~3.8B | ~66.6 | ~87.2 | MIT |
| Qwen3-3B | ~3B | ~67.0 | ~85.6 | Apache 2.0 |
| Qwen3-1.7B | ~1.7B | ~62.2 | ~79.3 | Apache 2.0 |
| Qwen2.5-3B | ~3B | ~65.2 | ~80.0 | Apache 2.0 |
| Qwen2.5-1.5B | ~1.5B | ~60.6 | ~73.2 | Apache 2.0 |
| SmolLM3-3B | ~3B | ~63.1 | ~76.8 | Apache 2.0 |
| Llama 3.2 3B Instruct | ~3B | ~63.4 | ~77.7 | Llama 3.2 Community |
| Llama 3.2 1B Instruct | ~1B | ~49.3 | ~44.4 | Llama 3.2 Community |
| Granite 3.1 8B | ~8B | ~63.5 | ~73.5 | Apache 2.0 |
| Granite 3.1 2B | ~2B | ~55.4 | ~58.4 | Apache 2.0 |
| Phi-3.5-mini | ~3.8B | ~64.9 | ~84.1 | MIT |
| Gemma 2 2B | ~2B | ~52.2 | ~31.6 | Gemma Community |
| MiniCPM 3.0 4B | ~4B | ~62.5 | ~63.7 | MIT |
| StableLM 2 1.6B | ~1.6B | ~41.3 | ~17.4 | Stability AI Community |
On-Device Hardware Requirements
| Model | VRAM (FP16) | VRAM (INT4) | Tokens/sec on M4 Max |
|---|---|---|---|
| Phi-4-mini | ~7.6 GB | ~2.5 GB | ~80 tok/s |
| Qwen3-3B | ~6 GB | ~2.2 GB | ~100 tok/s |
| Qwen3-1.7B | ~3.4 GB | ~1.3 GB | ~170 tok/s |
| SmolLM3-3B | ~6 GB | ~2.2 GB | ~95 tok/s |
| Llama 3.2 3B | ~6 GB | ~2.2 GB | ~90 tok/s |
| Llama 3.2 1B | ~2 GB | ~0.8 GB | ~210 tok/s |
| Gemma 2 2B | ~4 GB | ~1.5 GB | ~140 tok/s |
| Granite 3.1 2B | ~4 GB | ~1.5 GB | ~145 tok/s |
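The FP16 column above is consistent with a simple rule of thumb: two bytes per parameter. The sketch below reproduces it as a weights-only estimate. It is an illustration under stated assumptions, not the method used to build the table: the function name `weight_memory_gb` is hypothetical, 1 GB is taken as 10^9 bytes, and KV cache, activations, and runtime overhead are ignored. The INT4 column in the table sits above the pure 4-bit figure, plausibly because practical 4-bit formats store scaling metadata and keep some tensors at higher precision, so treat the 4-bit result here as a floor.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes).

    Assumption: ignores KV cache, activations, and runtime overhead.
    """
    return params_billion * bits_per_weight / 8


# Parameter counts taken from the leaderboard table on this page.
for name, params_b in [("Phi-4-mini", 3.8), ("Qwen3-1.7B", 1.7), ("Llama 3.2 1B", 1.0)]:
    fp16 = weight_memory_gb(params_b, 16)
    int4_floor = weight_memory_gb(params_b, 4)
    print(f"{name}: FP16 ~{fp16:.1f} GB, 4-bit weights-only floor ~{int4_floor:.2f} GB")
```

For deployment planning, leave headroom above these figures for the context window, since KV cache grows with sequence length.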
Use Case Recommendations
| Use Case | Recommended Model |
|---|---|
| General quality on consumer GPU | Phi-4-mini or Qwen3-3B |
| Mobile / on-device (under 2 GB) | Qwen3-1.7B or Llama 3.2 1B (INT4) |
| Reasoning on small footprint | Qwen3-1.7B Thinking |
| Permissive commercial Apache 2.0 | Qwen3-3B, SmolLM3-3B, Granite 3.1 2B |
| Most permissive MIT | Phi-4-mini, Phi-3.5-mini, MiniCPM 3.0 |
| Enterprise code tasks | Granite 3.x family |
| Multilingual small footprint | Qwen3-3B or Granite 3.1 2B |
| Fully open research | SmolLM3-3B |
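The recommendations above can be approximated by filtering on a memory budget and a license allow-list, then ranking by benchmark score. The following is a minimal sketch of that idea, assuming MMLU as the only ranking signal. The `Model` class and `shortlist` function are hypothetical names introduced for illustration; the numbers are copied from the leaderboard and hardware tables on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str
    mmlu: float
    int4_gb: float
    license: str


# Values copied from the leaderboard and hardware tables on this page.
MODELS = [
    Model("Phi-4-mini", 66.6, 2.5, "MIT"),
    Model("Qwen3-3B", 67.0, 2.2, "Apache 2.0"),
    Model("Qwen3-1.7B", 62.2, 1.3, "Apache 2.0"),
    Model("SmolLM3-3B", 63.1, 2.2, "Apache 2.0"),
    Model("Llama 3.2 3B", 63.4, 2.2, "Llama 3.2 Community"),
    Model("Llama 3.2 1B", 49.3, 0.8, "Llama 3.2 Community"),
    Model("Gemma 2 2B", 52.2, 1.5, "Gemma Community"),
    Model("Granite 3.1 2B", 55.4, 1.5, "Apache 2.0"),
]


def shortlist(max_int4_gb: float, allowed_licenses: Optional[set] = None) -> list:
    """Return models that fit the INT4 budget and license filter, best MMLU first."""
    fits = [
        m for m in MODELS
        if m.int4_gb <= max_int4_gb
        and (allowed_licenses is None or m.license in allowed_licenses)
    ]
    return sorted(fits, key=lambda m: m.mmlu, reverse=True)


mobile = shortlist(2.0)                  # on-device budget under 2 GB
apache = shortlist(2.5, {"Apache 2.0"})  # Apache 2.0 only
print([m.name for m in mobile])
print([m.name for m in apache])
```

A real selection would also weigh task-specific benchmarks, language coverage, and latency, which a single MMLU ranking does not capture.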
Strategic Context
Three patterns shape the 2026 SLM landscape. First, the quality compression is real: on MMLU, a 3B model in 2026 (Qwen3-3B at ~67) roughly matches a 70B model from 2023 (Llama 2 70B at ~69). Second, on-device deployment is now production-viable: Apple Intelligence, Google AI Core, and Qualcomm AI Hub all ship SLMs that run locally on consumer devices. Third, the SLM ecosystem is bifurcating into general-purpose models (Qwen3, Phi-4, Llama 3.2) and specialised models (math, code, multilingual, and function-calling SLMs).
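To make the compression claim concrete, the arithmetic can be worked through with the figures quoted in the paragraph above. This is only a back-of-the-envelope sketch: the parameter counts are the rounded nominal sizes, and MMLU is a single benchmark, so the ratio says nothing about other capabilities.

```python
# Figures quoted in the paragraph above (rounded nominal values).
small = {"name": "Qwen3-3B", "params_b": 3.0, "mmlu": 67.0}
large = {"name": "Llama 2 70B", "params_b": 70.0, "mmlu": 69.0}

param_ratio = large["params_b"] / small["params_b"]
mmlu_gap = large["mmlu"] - small["mmlu"]

print(
    f"{small['name']} has ~{param_ratio:.0f}x fewer parameters than {large['name']} "
    f"with an MMLU gap of {mmlu_gap:.1f} points"
)
```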
Brand Visibility Implications
SLM selection is a high-volume engineering and procurement decision, driven by growth in mobile, edge, and on-device deployment. AI assistant queries such as "best small language model", "on-device LLM", and "Phi-4 vs Qwen3" feed directly into production decisions. Brands selling on-device AI tools, mobile SDKs, embedded AI, and edge inference infrastructure therefore depend heavily on AI-mediated discovery in this category.
Methodology
Benchmark data is compiled from primary model card disclosures, the Hugging Face Open LLM Leaderboard, and on-device measurements. Tokens-per-second figures were measured on an Apple M4 Max with Q4_K_M GGUF quantization via llama.cpp. The page is updated quarterly.
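A tokens-per-second figure of the kind reported above can be obtained by timing generation over several runs after a warm-up pass. The harness below is a generic sketch, not the measurement script behind this page. The `generate` callable is a hypothetical stand-in that returns the number of tokens it produced; in practice it would wrap a call into a llama.cpp binding with the Q4_K_M GGUF model loaded. `fake_generate` exists only so the example runs without a model. The page does not state whether prompt processing time is included, so this version simply times the whole call.

```python
import time
from typing import Callable


def tokens_per_second(
    generate: Callable[[str, int], int],
    prompt: str,
    max_tokens: int = 128,
    warmup: int = 1,
    runs: int = 3,
) -> float:
    """Average generation throughput over `runs` timed calls.

    `generate(prompt, max_tokens)` must return the number of tokens produced.
    """
    # Warm-up calls are not timed; they absorb one-off costs such as model loading.
    for _ in range(warmup):
        generate(prompt, max_tokens)

    total_tokens = 0
    total_seconds = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        total_tokens += generate(prompt, max_tokens)
        total_seconds += time.perf_counter() - start
    return total_tokens / total_seconds


def fake_generate(prompt: str, max_tokens: int) -> int:
    """Hypothetical stand-in for a real model call: sleeps, then reports max_tokens."""
    time.sleep(0.01)
    return max_tokens


rate = tokens_per_second(fake_generate, "Hello", max_tokens=64)
print(f"{rate:.0f} tok/s (synthetic)")
```

Reporting the prompt length, generation length, and number of runs alongside the result makes figures from different machines easier to compare.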
How Presenc AI Helps
Presenc AI monitors brand visibility on small language model queries across ChatGPT, Claude, Gemini, and Perplexity. For on-device AI tool vendors, mobile SDK providers, embedded AI brands, and edge inference infrastructure firms, the platform identifies the prompts driving procurement-research traffic and the gaps where new content unlocks share of voice.