Best Open-Weight Vision-Language Models 2026

Open-weight VLM leaderboard 2026: Qwen2.5-VL, InternVL3, Llama 4 multimodal, Pixtral 12B, Phi-4-VL, Molmo, NVLM. MMMU, OCRBench, VQAv2 benchmarks plus deployment guidance.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: May 2026

Vision-language models (VLMs) accept images, video frames, charts, screenshots, and documents alongside text. The 2026 open-weight VLM landscape is dominated by Qwen2.5-VL, InternVL3, Llama 4 multimodal, Pixtral, Phi-4 multimodal, Molmo, and NVLM. Quality on MMMU, OCRBench, ChartQA, and DocVQA is competitive with GPT-4o, Claude 4 Opus, and Gemini 3 multimodal at meaningfully lower deployment cost. This page consolidates the leaderboard, the benchmarks, and the use case guidance.

Key Findings

  1. Qwen2.5-VL-72B leads open-weight VLMs on OCR with approximately 888 on OCRBench, the highest OCRBench score among open weights as of May 2026, alongside approximately 70.2 percent on MMMU.
  2. InternVL3-78B from Shanghai AI Lab is the strongest MIT-licensed VLM with approximately 72.2 percent on MMMU.
  3. Llama 4 multimodal (Maverick and Scout) integrates native image understanding into the Llama 4 family, scoring approximately 73.4 and 69.4 percent on MMMU respectively under the Llama 4 Community License.
  4. Pixtral 12B from Mistral and Phi-4-Multimodal-Instruct from Microsoft are the strongest smaller VLMs in the 12B and 5B parameter classes respectively.
  5. Allen AI Molmo and NVIDIA NVLM are the strongest fully-open VLMs (open weights, open data, open training code), important for research reproducibility but roughly 10 to 19 MMMU points behind the top closed-data open-weight models.

Open-Weight VLM Comparison (May 2026)

| Model | Parameters | MMMU (%) | OCRBench | License |
| --- | --- | --- | --- | --- |
| InternVL3-78B | ~78B | ~72.2 | ~865 | MIT |
| Qwen2.5-VL-72B | ~72B | ~70.2 | ~888 | Tongyi Qianwen |
| Llama 4 Maverick | ~400B MoE (17B active) | ~73.4 | ~870 | Llama 4 Community |
| Llama 4 Scout | ~109B MoE (17B active) | ~69.4 | ~847 | Llama 4 Community |
| InternVL3-8B | ~8B | ~62.1 | ~830 | MIT |
| Qwen2.5-VL-7B | ~7B | ~60.7 | ~864 | Apache 2.0 |
| Pixtral 12B | ~12B | ~52.5 | ~685 | Apache 2.0 |
| Phi-4-Multimodal-Instruct | ~5.6B | ~57.4 | ~742 | MIT |
| Molmo 72B | ~72B | ~54.1 | ~705 | Apache 2.0 (with weight + data) |
| Molmo 7B-D | ~7B | ~50.6 | ~688 | Apache 2.0 |
| NVLM 1.0 72B | ~72B | ~59.7 | ~853 | CC-BY-NC + Research |
| Llava-OneVision-72B | ~72B | ~57.4 | ~660 | Apache 2.0 |
| MiniCPM-V 2.6 | ~8B | ~49.8 | ~852 | Apache 2.0 |

Closed vs Open VLM Comparison

| Model | MMMU (%) | Notes |
| --- | --- | --- |
| GPT-5.5 multimodal | ~83.4 | Frontier closed |
| Claude 4.7 Opus multimodal | ~80.2 | Frontier closed |
| Gemini 3.1 Pro multimodal | ~80.1 | Frontier closed |
| Llama 4 Maverick | ~73.4 | Best open weight |
| InternVL3-78B | ~72.2 | Top MIT open weight |
| Qwen2.5-VL-72B | ~70.2 | Top OCRBench open weight |
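
The size of the closed-versus-open gap can be read directly off the table above. The following is a minimal sketch of that arithmetic, using only the approximate MMMU scores listed on this page; the variable names are illustrative.

```python
# Approximate MMMU scores copied from the closed vs open comparison table.
closed = {
    "GPT-5.5 multimodal": 83.4,
    "Claude 4.7 Opus multimodal": 80.2,
    "Gemini 3.1 Pro multimodal": 80.1,
}
open_weight = {
    "Llama 4 Maverick": 73.4,
    "InternVL3-78B": 72.2,
    "Qwen2.5-VL-72B": 70.2,
}

# Gap in MMMU points for every (closed, open) pair, rounded to one decimal.
gaps = {
    (closed_name, open_name): round(closed_score - open_score, 1)
    for closed_name, closed_score in closed.items()
    for open_name, open_score in open_weight.items()
}

smallest_gap = min(gaps.values())  # Gemini 3.1 Pro vs Llama 4 Maverick
largest_gap = max(gaps.values())   # GPT-5.5 vs Qwen2.5-VL-72B

print(f"MMMU gap ranges from {smallest_gap} to {largest_gap} points")
```

Across these six models the gap spans roughly 7 to 13 MMMU points, depending on which closed and open models are paired.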

Use Case Recommendations

| Use Case | Recommended Model |
| --- | --- |
| General-purpose document VQA | Qwen2.5-VL-72B or InternVL3-78B |
| OCR-heavy workloads | Qwen2.5-VL-72B (best OCRBench) |
| Chart and graph understanding | InternVL3-78B or Qwen2.5-VL |
| Multimodal RAG (image-aware retrieval) | ColPali (retrieval) + Qwen2.5-VL or InternVL3 (generation) |
| Small / edge VLM | Phi-4-Multimodal-Instruct (5.6B) or Qwen2.5-VL-3B |
| Fully open research | Molmo (Apache 2.0 with data and code) |
| Permissive commercial | InternVL3 (MIT), Qwen2.5-VL-7B (Apache), Pixtral 12B (Apache) |
| Agentic / tool-use multimodal | Llama 4 Maverick or Qwen2.5-VL with tool wrappers |
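
For the permissive-commercial row, one way to build a shortlist is to filter the leaderboard by license and then sort by the metric that matters for the workload. The sketch below assumes that "permissive" means MIT or Apache 2.0 as listed on this page; the `Vlm` type and the helper names are hypothetical, and actual license suitability should be checked against each model card.

```python
from typing import Callable, NamedTuple


class Vlm(NamedTuple):
    name: str
    mmmu: float      # approximate MMMU score (percent)
    ocrbench: int    # approximate OCRBench score
    license: str


# Rows copied from the open-weight comparison table on this page.
MODELS = [
    Vlm("InternVL3-78B", 72.2, 865, "MIT"),
    Vlm("Qwen2.5-VL-72B", 70.2, 888, "Tongyi Qianwen"),
    Vlm("Llama 4 Maverick", 73.4, 870, "Llama 4 Community"),
    Vlm("Llama 4 Scout", 69.4, 847, "Llama 4 Community"),
    Vlm("InternVL3-8B", 62.1, 830, "MIT"),
    Vlm("Qwen2.5-VL-7B", 60.7, 864, "Apache 2.0"),
    Vlm("Pixtral 12B", 52.5, 685, "Apache 2.0"),
    Vlm("Phi-4-Multimodal-Instruct", 57.4, 742, "MIT"),
    Vlm("Molmo 72B", 54.1, 705, "Apache 2.0 (with weight + data)"),
    Vlm("Molmo 7B-D", 50.6, 688, "Apache 2.0"),
    Vlm("NVLM 1.0 72B", 59.7, 853, "CC-BY-NC + Research"),
    Vlm("Llava-OneVision-72B", 57.4, 660, "Apache 2.0"),
    Vlm("MiniCPM-V 2.6", 49.8, 852, "Apache 2.0"),
]


def is_permissive(license_name: str) -> bool:
    """Assumption: treat MIT and Apache 2.0 as the permissive licenses."""
    return license_name.startswith(("MIT", "Apache 2.0"))


def shortlist(models: list[Vlm], key: Callable[[Vlm], float], limit: int = 3) -> list[Vlm]:
    """Keep permissively licensed models and return the top `limit` by `key`."""
    allowed = [m for m in models if is_permissive(m.license)]
    return sorted(allowed, key=key, reverse=True)[:limit]


top_mmmu = [m.name for m in shortlist(MODELS, key=lambda m: m.mmmu)]
top_ocr = [m.name for m in shortlist(MODELS, key=lambda m: m.ocrbench)]
print(top_mmmu)
print(top_ocr)
```

Sorting by OCRBench instead of MMMU changes the shortlist, which is why the recommendation differs between reasoning-heavy and OCR-heavy workloads.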

Deployment Patterns

Three production patterns recur. First, hybrid pipelines: a small VLM (Qwen2.5-VL-7B or Phi-4-Multimodal) handles routine document VQA and a larger model (Qwen2.5-VL-72B or InternVL3-78B) handles edge cases. Second, fine-tuned domain VLMs: medical VLMs (LLaVA-Med, MedVLM-R1), legal VLMs, and financial chart-reading VLMs are emerging on top of open base VLMs. Third, multimodal RAG: ColPali for visual page retrieval followed by Qwen2.5-VL or InternVL3 for grounded generation outperforms text-only RAG on visually rich documents.
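
The hybrid and multimodal RAG patterns can be sketched as plain control flow. This is a minimal illustration, not a reference implementation: the `VlmAnswer` type, the confidence score, the 0.8 threshold, and the `score` and `generate` callables are all hypothetical stand-ins for whatever model servers and retriever a deployment actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class VlmAnswer:
    text: str
    confidence: float  # hypothetical self-reported score in [0, 1]


# Hypothetical model interface: (page image bytes, question) -> answer.
VlmFn = Callable[[bytes, str], VlmAnswer]


def answer_hybrid(image: bytes, question: str, small: VlmFn, large: VlmFn,
                  threshold: float = 0.8) -> tuple[str, VlmAnswer]:
    """Hybrid pipeline: try the small model, escalate on low confidence."""
    first = small(image, question)
    if first.confidence >= threshold:
        return "small", first
    return "large", large(image, question)


def answer_with_retrieval(question: str, pages: Sequence[bytes],
                          score: Callable[[str, bytes], float],
                          generate: VlmFn, top_k: int = 1) -> list[VlmAnswer]:
    """Multimodal RAG: rank page images with a retriever, then generate."""
    ranked = sorted(pages, key=lambda page: score(question, page), reverse=True)
    return [generate(page, question) for page in ranked[:top_k]]


# Stub models so the sketch runs without loading any weights.
def stub_small(image: bytes, question: str) -> VlmAnswer:
    routine = len(image) < 10  # pretend short inputs are routine documents
    return VlmAnswer("small-model answer", 0.95 if routine else 0.40)


def stub_large(image: bytes, question: str) -> VlmAnswer:
    return VlmAnswer("large-model answer", 0.90)


routine = answer_hybrid(b"invoice", "What is the total?", stub_small, stub_large)
edge = answer_hybrid(b"dense-scanned-form", "What is the total?", stub_small, stub_large)
print(routine[0], edge[0])
```

In practice the escalation signal might come from token log-probabilities, a validation rule on the extracted fields, or a separate classifier; the threshold here is only a placeholder.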

Brand Visibility Implications

VLM selection is a fast-growing procurement category as enterprises move from text-only AI to multimodal pipelines. AI assistant queries about "best open vision model", "Qwen2.5-VL vs InternVL", "multimodal RAG", and similar terms feed directly into production decisions. Brands selling multimodal AI platforms, image understanding APIs, and visual document AI are therefore heavily exposed to AI-mediated discovery in this category.

Methodology

Benchmark data is compiled from the OpenCompass Open VLM Leaderboard, MMMU, OCRBench, and primary model card disclosures through 23 May 2026. The page is updated quarterly.

How Presenc AI Helps

Presenc AI monitors brand visibility on vision-language model queries across ChatGPT, Claude, Gemini, and Perplexity. For multimodal AI platforms, image understanding API vendors, and visual document AI brands, the platform identifies the prompts driving procurement-research traffic and the gaps where new content unlocks share of voice.

Frequently Asked Questions

Which open-weight VLM is the best in 2026?
Llama 4 Maverick leads MMMU at approximately 73.4 percent. InternVL3-78B is the strongest MIT-licensed alternative at approximately 72.2 percent. Qwen2.5-VL-72B is the strongest on OCR at approximately 888 OCRBench but uses the Tongyi Qianwen license.

How do open-weight VLMs compare with closed models?
On most benchmarks open-weight VLMs lag the closed frontier by approximately 7 to 13 MMMU points. On OCR and document VQA the gap is narrower; Qwen2.5-VL-72B is competitive with GPT-5.5 on OCRBench. On general multimodal reasoning, GPT-5.5 and Claude 4.7 Opus retain a clear lead.

Which open-weight VLM is best for OCR and document work?
Qwen2.5-VL-72B has the strongest open-weight OCR score (~888 OCRBench). For smaller deployments, Qwen2.5-VL-7B retains most of the OCR quality (~864 OCRBench) at much lower cost. For visual document retrieval, ColPali combined with a generative VLM is the dominant 2026 pattern.

What is Molmo?
Molmo from Allen AI is a fully-open VLM with Apache 2.0 weights, data, and training code. It is materially behind the top closed-data open-weight models on benchmarks but important for research reproducibility because the training data and recipe are public. Most production deployments use Qwen2.5-VL or InternVL3 over Molmo for quality reasons.

Can open-weight VLMs run on a single GPU?
Yes for the smaller models. Qwen2.5-VL-7B fits comfortably on a single 24GB GPU such as the RTX 4090. Phi-4-Multimodal-Instruct at 5.6B runs on a 16GB GPU. The 72B models need multi-GPU serving or quantization. Qwen2.5-VL-72B quantized with AWQ or GPTQ fits on a single H100 80GB.
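
The single-GPU guidance follows from a rough weight-memory estimate: parameter count times bytes per parameter. The sketch below uses the approximate parameter counts from this page and counts weights only; it ignores the KV cache, activations, image token overhead, and quantization metadata, so real usage is higher. The function name and the 4-bit assumption for AWQ/GPTQ are illustrative.

```python
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB (decimal): params x bits / 8."""
    return params_billion * bits_per_param / 8


# (model, approximate parameters in billions, bits per parameter, GPU memory in GB)
scenarios = [
    ("Qwen2.5-VL-7B, 16-bit", 7.0, 16, 24),
    ("Phi-4-Multimodal-Instruct, 16-bit", 5.6, 16, 16),
    ("Qwen2.5-VL-72B, 16-bit", 72.0, 16, 80),
    ("Qwen2.5-VL-72B, 4-bit (assumed AWQ/GPTQ width)", 72.0, 4, 80),
]

for label, params, bits, gpu_gb in scenarios:
    needed = weight_gb(params, bits)
    verdict = "fits" if needed < gpu_gb else "does not fit"
    print(f"{label}: ~{needed:.1f} GB of weights, {verdict} in {gpu_gb} GB")
```

The remaining headroom is what the KV cache and image tokens have to share, so long documents or high-resolution inputs can still exhaust memory on a card where the weights alone fit.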
