Vision-language models (VLMs) accept images, video frames, charts, screenshots, and documents alongside text. The 2026 open-weight VLM landscape is dominated by Qwen2.5-VL, InternVL3, Llama 4 multimodal, Pixtral, Phi-4 multimodal, Molmo, and NVLM. Quality on MMMU, OCRBench, ChartQA, and DocVQA is competitive with GPT-4o, Claude 4 Opus, and Gemini 3 multimodal at meaningfully lower deployment cost. This page consolidates the leaderboard, the benchmarks, and the use case guidance.
Key Findings
- Qwen2.5-VL-72B posts the highest OCRBench score among open-weight VLMs (approximately 888) alongside approximately 70.2 percent on MMMU, as of May 2026. On MMMU it trails Llama 4 Maverick (approximately 73.4) and InternVL3-78B (approximately 72.2).
- InternVL3-78B from Shanghai AI Lab is the strongest MIT-licensed VLM with approximately 72.2 percent on MMMU.
- The Llama 4 multimodal models (Maverick and Scout) integrate native image understanding into the Llama 4 family under the Llama 4 Community License. Maverick has the highest open-weight MMMU score in this comparison (approximately 73.4 percent).
- Pixtral 12B from Mistral and Phi-4-Multimodal-Instruct from Microsoft are the notable smaller VLMs in the 12B and 5.6B parameter classes respectively, although InternVL3-8B and Qwen2.5-VL-7B score higher on both MMMU and OCRBench.
- Allen AI Molmo and NVIDIA NVLM are the strongest fully-open VLMs (open weights, open data, open training code). They matter for research reproducibility but trail the top closed-data open-weight models on MMMU (approximately 54.1 for Molmo 72B and 59.7 for NVLM 1.0 72B, versus roughly 70 to 73).
Open-Weight VLM Comparison (May 2026)
| Model | Parameters | MMMU (%) | OCRBench (out of 1,000) | License |
|---|---|---|---|---|
| InternVL3-78B | ~78B | ~72.2 | ~865 | MIT |
| Qwen2.5-VL-72B | ~72B | ~70.2 | ~888 | Tongyi Qianwen |
| Llama 4 Maverick | ~400B MoE (17B active) | ~73.4 | ~870 | Llama 4 Community |
| Llama 4 Scout | ~109B MoE (17B active) | ~69.4 | ~847 | Llama 4 Community |
| InternVL3-8B | ~8B | ~62.1 | ~830 | MIT |
| Qwen2.5-VL-7B | ~7B | ~60.7 | ~864 | Apache 2.0 |
| Pixtral 12B | ~12B | ~52.5 | ~685 | Apache 2.0 |
| Phi-4-Multimodal-Instruct | ~5.6B | ~57.4 | ~742 | MIT |
| Molmo 72B | ~72B | ~54.1 | ~705 | Apache 2.0 (open weights and data) |
| Molmo 7B-D | ~7B | ~50.6 | ~688 | Apache 2.0 |
| NVLM 1.0 72B | ~72B | ~59.7 | ~853 | CC-BY-NC + Research |
| LLaVA-OneVision-72B | ~72B | ~57.4 | ~660 | Apache 2.0 |
| MiniCPM-V 2.6 | ~8B | ~49.8 | ~852 | Apache 2.0 |
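The table above can be turned into a simple shortlist filter. The sketch below is illustrative only: it copies a subset of rows from the table, treats MIT and Apache 2.0 as the "permissive" set (an assumption, not a legal judgement), and uses hypothetical names such as `VLM`, `PERMISSIVE`, and `shortlist`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VLM:
    name: str
    params_b: float   # approximate total parameters, in billions
    mmmu: float       # approximate MMMU score (%)
    ocrbench: int     # approximate OCRBench score
    license: str


# Subset of rows copied from the comparison table (approximate values).
MODELS = [
    VLM("InternVL3-78B", 78, 72.2, 865, "MIT"),
    VLM("Qwen2.5-VL-72B", 72, 70.2, 888, "Tongyi Qianwen"),
    VLM("Llama 4 Maverick", 400, 73.4, 870, "Llama 4 Community"),
    VLM("InternVL3-8B", 8, 62.1, 830, "MIT"),
    VLM("Qwen2.5-VL-7B", 7, 60.7, 864, "Apache 2.0"),
    VLM("Pixtral 12B", 12, 52.5, 685, "Apache 2.0"),
    VLM("Phi-4-Multimodal-Instruct", 5.6, 57.4, 742, "MIT"),
]

# Assumption: these two licenses count as "permissive" for this sketch.
PERMISSIVE = {"MIT", "Apache 2.0"}


def shortlist(models, max_params_b, metric="mmmu", permissive_only=True):
    """Return models within the parameter budget, best score first."""
    candidates = [
        m for m in models
        if m.params_b <= max_params_b
        and (not permissive_only or m.license in PERMISSIVE)
    ]
    return sorted(candidates, key=lambda m: getattr(m, metric), reverse=True)


if __name__ == "__main__":
    for m in shortlist(MODELS, max_params_b=13, metric="ocrbench"):
        print(m.name, m.ocrbench, m.license)
```

Ranking on a single metric is a deliberate simplification; a real selection would also weigh latency, context length, and license terms reviewed by counsel.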
Closed vs Open VLM Comparison
| Model | MMMU (%) | Notes |
|---|---|---|
| GPT-5.5 multimodal | ~83.4 | Frontier closed |
| Claude 4.7 Opus multimodal | ~80.2 | Frontier closed |
| Gemini 3.1 Pro multimodal | ~80.1 | Frontier closed |
| Llama 4 Maverick | ~73.4 | Highest open-weight MMMU |
| InternVL3-78B | ~72.2 | Top MIT-licensed open weight |
| Qwen2.5-VL-72B | ~70.2 | Highest open-weight OCRBench |
Use Case Recommendations
| Use Case | Recommended Model |
|---|---|
| General-purpose document VQA | Qwen2.5-VL-72B or InternVL3-78B |
| OCR-heavy workloads | Qwen2.5-VL-72B (best OCRBench) |
| Chart and graph understanding | InternVL3-78B or Qwen2.5-VL |
| Multimodal RAG (image-aware retrieval) | ColPali (retrieval) + Qwen2.5-VL or InternVL3 (generation) |
| Small / edge VLM | Phi-4-Multimodal-Instruct (5.6B) or Qwen2.5-VL-3B |
| Fully open research | Molmo (Apache 2.0 with data and code) |
| Permissive commercial | InternVL3 (MIT), Qwen2.5-VL-7B (Apache), Pixtral 12B (Apache) |
| Agentic / tool-use multimodal | Llama 4 Maverick or Qwen2.5-VL with tool wrappers |
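The multimodal RAG row above pairs a visual page retriever with a VLM generator. A minimal sketch of that composition follows, assuming ColPali-style late-interaction scoring (each query vector is matched to its best page vector and the matches are summed). The embeddings are toy vectors, and `generate` is a hypothetical callable standing in for a call to Qwen2.5-VL or InternVL3; neither reflects a real library API.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """Late-interaction score: best page match per query vector, summed."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def retrieve(query_vecs: List[Vector],
             pages: Dict[str, List[Vector]],
             k: int = 2) -> List[str]:
    """Return the ids of the k highest-scoring pages."""
    ranked = sorted(pages,
                    key=lambda pid: maxsim_score(query_vecs, pages[pid]),
                    reverse=True)
    return ranked[:k]


def answer(question: str,
           query_vecs: List[Vector],
           pages: Dict[str, List[Vector]],
           generate: Callable[[str, List[str]], str],
           k: int = 2) -> str:
    """Retrieve the top-k pages, then hand them to the generator VLM."""
    top_pages = retrieve(query_vecs, pages, k)
    return generate(question, top_pages)


if __name__ == "__main__":
    # Toy multi-vector page embeddings (hypothetical data).
    pages = {
        "p1": [[1.0, 0.0], [0.0, 1.0]],
        "p2": [[0.5, 0.5]],
        "p3": [[0.0, 0.2]],
    }
    query = [[1.0, 0.0], [0.0, 1.0]]
    stub = lambda q, ids: f"answer to {q!r} grounded in pages {ids}"
    print(answer("What is Q3 revenue?", query, pages, stub))
```

In a real deployment the page and query vectors would come from the retrieval model and the generator would receive the page images themselves, not just their ids.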
Deployment Patterns
Three production patterns recur:
- Hybrid pipelines: a small VLM (Qwen2.5-VL-7B or Phi-4-Multimodal) handles routine document VQA, and a larger model (Qwen2.5-VL-72B or InternVL3-78B) handles edge cases.
- Fine-tuned domain VLMs: medical VLMs (LLaVA-Med, MedVLM-R1), legal VLMs, and financial chart-reading VLMs are emerging on top of open base VLMs.
- Multimodal RAG: ColPali for visual page retrieval, followed by Qwen2.5-VL or InternVL3 for grounded generation, outperforms text-only RAG on visually rich documents.
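The hybrid pipeline pattern can be sketched as a confidence-gated router. This is a minimal illustration under stated assumptions: the `Answer` type, the `route` function, the 0.8 threshold, and the idea that each model returns a usable confidence estimate are all hypothetical; the source does not specify how edge cases are detected.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Answer:
    text: str
    confidence: float  # 0.0 to 1.0; how this is estimated is an assumption
    model: str


# A model is any callable taking (image_bytes, question) and returning an Answer.
VLMFn = Callable[[bytes, str], Answer]


def route(image: bytes, question: str,
          small: VLMFn, large: VLMFn,
          threshold: float = 0.8) -> Answer:
    """Try the small model first; escalate to the large one if unsure."""
    first = small(image, question)
    if first.confidence >= threshold:
        return first
    return large(image, question)


if __name__ == "__main__":
    # Stub models standing in for real inference calls (hypothetical).
    def small_stub(image: bytes, question: str) -> Answer:
        conf = 0.95 if "total" in question else 0.40
        return Answer("small-model answer", conf, "small")

    def large_stub(image: bytes, question: str) -> Answer:
        return Answer("large-model answer", 0.90, "large")

    print(route(b"", "What is the invoice total?", small_stub, large_stub).model)
    print(route(b"", "Explain the footnote on page 3.", small_stub, large_stub).model)
```

Other escalation signals, such as document type, page count, or a failed schema validation on the small model's output, could replace the confidence score without changing the structure.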
Brand Visibility Implications
VLM selection is a fast-growing procurement category as enterprises move from text-only AI to multimodal pipelines. AI assistant queries such as "best open vision model", "Qwen2.5-VL vs InternVL", and "multimodal RAG" feed directly into production decisions. Brands selling multimodal AI platforms, image understanding APIs, and visual document AI are therefore heavily exposed to AI-mediated discovery in this category.
Methodology
Benchmark data is compiled from the OpenCompass Open VLM Leaderboard, the MMMU and OCRBench leaderboards, and primary model card disclosures through 23 May 2026. The page is updated quarterly.
How Presenc AI Helps
Presenc AI monitors brand visibility on vision-language model queries across ChatGPT, Claude, Gemini, and Perplexity. For multimodal AI platforms, image understanding API vendors, and visual document AI brands, the platform identifies the prompts driving procurement-research traffic and the gaps where new content unlocks share of voice.