What is the best open-weight VLM in 2026?

It depends on the metric and the licence. Llama 4 Maverick leads on MMMU at approximately 73.4 percent. InternVL3-78B is the strongest MIT-licensed alternative at approximately 72.2 percent. Qwen2.5-VL-72B is the strongest on OCR at approximately 888 on OCRBench but is released under the Tongyi Qianwen licence.

Are open-weight VLMs competitive with GPT-5.5?

Not yet on general reasoning. On MMMU, the leading open-weight VLMs trail the closed frontier by approximately 7 to 13 points. On OCR and document VQA the gap is narrower; Qwen2.5-VL-72B is competitive with GPT-5.5 on OCRBench. On general multimodal reasoning, GPT-5.5 and Claude 4.7 Opus retain a clear lead.

Which VLM is best for OCR and document understanding?

Qwen2.5-VL-72B has the strongest open-weight OCR score (~888 on OCRBench). For smaller deployments, Qwen2.5-VL-7B retains most of the OCR quality (~864 on OCRBench) at much lower cost. For visual document retrieval, ColPali combined with a generative VLM is the dominant 2026 pattern.
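As a rough illustration of the retrieval half of that pattern: ColPali-style retrievers represent each page as many patch vectors and score a query by late interaction, matching every query-token vector to its best page patch and summing those maxima. The sketch below implements that scoring in plain Python on toy data. The names `maxsim_score` and `rank_pages`, the page ids, and the two-dimensional vectors are illustrative assumptions, not ColPali's actual API or real embeddings.

```python
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """Late-interaction score: for each query vector take the best-matching
    page patch, then sum those maxima over the query."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(query_vecs: List[Vector], pages: Dict[str, List[Vector]]) -> List[str]:
    """Return page ids ordered from highest to lowest late-interaction score."""
    return sorted(
        pages,
        key=lambda page_id: maxsim_score(query_vecs, pages[page_id]),
        reverse=True,
    )


# Toy embeddings (hypothetical): two query-token vectors, two patches per page.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "invoice_p1": [[0.9, 0.1], [0.1, 0.8]],
    "cover_page": [[0.2, 0.1], [0.1, 0.3]],
}

ranking = rank_pages(query, pages)
print(ranking)
```

The top-ranked pages would then be passed as images to the generative VLM, which answers the question grounded in those pages.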

What is Molmo and why does it matter?

Molmo from Allen AI is a fully-open VLM with Apache 2.0 weights, data, and training code. It is materially behind the top closed-data open-weight models on benchmarks but important for research reproducibility because the training data and recipe are public. Most production deployments use Qwen2.5-VL or InternVL3 over Molmo for quality reasons.

Can I run a VLM on a single GPU?

Yes, for the smaller models. Qwen2.5-VL-7B fits comfortably on a single 24GB GPU such as an RTX 4090, with further headroom on a 48GB L40. Phi-4-Multimodal-Instruct at 5.6B runs on a 16GB GPU. The 72B models need multiple GPUs or quantization: an AWQ- or GPTQ-quantized Qwen2.5-VL-72B (typically 4-bit) fits on a single H100 80GB.
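These sizing claims follow from a back-of-envelope rule: weight memory is roughly the parameter count times the bytes per parameter. The sketch below applies that rule. It counts weights only and ignores the KV cache, activations, image tokens, and quantization overhead such as scales, so real requirements are higher. The function name and the 16-bit and 4-bit precision choices are assumptions made for illustration.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight-only memory in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


# Parameter counts are the approximate figures quoted on this page.
estimates = {
    "Qwen2.5-VL-7B @ 16-bit": weight_memory_gb(7, 16),
    "Phi-4-Multimodal-Instruct (5.6B) @ 16-bit": weight_memory_gb(5.6, 16),
    "Qwen2.5-VL-72B @ 16-bit": weight_memory_gb(72, 16),
    "Qwen2.5-VL-72B @ 4-bit": weight_memory_gb(72, 4),
}

for name, gb in estimates.items():
    print(f"{name}: ~{gb:.1f} GB of weights")
```

Under these assumptions the 7B model needs about 14 GB for weights, the 72B model about 144 GB at 16-bit, and about 36 GB at 4-bit, which is why quantization brings a 72B model within reach of a single 80GB GPU.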

Best Open-Weight Vision-Language Models 2026

Vision-language models (VLMs) accept images, video frames, charts, screenshots, and documents alongside text. The 2026 open-weight VLM landscape is dominated by Qwen2.5-VL, InternVL3, Llama 4 multimodal, Pixtral, Phi-4 multimodal, Molmo, and NVLM. Quality on MMMU, OCRBench, ChartQA, and DocVQA is competitive with earlier closed models such as GPT-4o, Claude 4 Opus, and Gemini 3 multimodal at meaningfully lower deployment cost, although the current closed frontier still leads on MMMU. This page consolidates the leaderboard, the benchmarks, and the use-case guidance.

Key Findings

Qwen2.5-VL-72B leads open-weight VLMs on OCR with approximately 888 on OCRBench, the strongest open-weight score as of May 2026, alongside approximately 70.2 percent on MMMU.
InternVL3-78B from Shanghai AI Lab is the strongest MIT-licensed VLM with approximately 72.2 percent on MMMU.
Llama 4 multimodal (Maverick and Scout) integrates native image understanding into the Llama 4 family under the Llama 4 Community License; Maverick posts the highest open-weight MMMU score at approximately 73.4 percent.
Pixtral 12B from Mistral and Phi-4-Multimodal-Instruct from Microsoft are the strongest smaller VLMs in the 12B and 5B parameter classes respectively.
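Allen AI Molmo is the strongest fully open VLM (open weights, open data, open training code) and matters for research reproducibility, but it is materially behind the top closed-data open-weight models (Molmo 72B scores approximately 54.1 percent on MMMU). NVIDIA NVLM 1.0 72B is also research-oriented: it scores approximately 59.7 percent on MMMU, but its CC-BY-NC licence limits it to non-commercial use.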

Open-Weight VLM Comparison (May 2026)

Model	Parameters	MMMU (%)	OCRBench (score out of 1000)	License
InternVL3-78B	~78B	~72.2	~865	MIT
Qwen2.5-VL-72B	~72B	~70.2	~888	Tongyi Qianwen
Llama 4 Maverick	~400B MoE (17B active)	~73.4	~870	Llama 4 Community
Llama 4 Scout	~109B MoE (17B active)	~69.4	~847	Llama 4 Community
InternVL3-8B	~8B	~62.1	~830	MIT
Qwen2.5-VL-7B	~7B	~60.7	~864	Apache 2.0
Pixtral 12B	~12B	~52.5	~685	Apache 2.0
Phi-4-Multimodal-Instruct	~5.6B	~57.4	~742	MIT
Molmo 72B	~72B	~54.1	~705	Apache 2.0 (weights and data)
Molmo 7B-D	~7B	~50.6	~688	Apache 2.0
NVLM 1.0 72B	~72B	~59.7	~853	CC-BY-NC + Research
Llava-OneVision-72B	~72B	~57.4	~660	Apache 2.0
MiniCPM-V 2.6	~8B	~49.8	~852	Apache 2.0
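
One way to use the table is to filter by licence first and rank by benchmark second. The sketch below encodes a few rows from the table (values copied as listed, all approximate) and picks the best-scoring model whose licence is on an allow-list. The `Vlm` dataclass, the `best_model` helper, and the allow-list are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass(frozen=True)
class Vlm:
    name: str
    mmmu: float      # approximate MMMU accuracy in percent, from the table
    ocrbench: int    # approximate OCRBench score, from the table
    licence: str


# A subset of the table rows; figures are the approximate values listed above.
MODELS: List[Vlm] = [
    Vlm("InternVL3-78B", 72.2, 865, "MIT"),
    Vlm("Qwen2.5-VL-72B", 70.2, 888, "Tongyi Qianwen"),
    Vlm("Llama 4 Maverick", 73.4, 870, "Llama 4 Community"),
    Vlm("Qwen2.5-VL-7B", 60.7, 864, "Apache 2.0"),
    Vlm("Pixtral 12B", 52.5, 685, "Apache 2.0"),
    Vlm("Phi-4-Multimodal-Instruct", 57.4, 742, "MIT"),
]


def best_model(
    models: List[Vlm],
    allowed_licences: Set[str],
    metric: Callable[[Vlm], float],
) -> Optional[Vlm]:
    """Return the highest-scoring model under an allowed licence, or None."""
    candidates = [m for m in models if m.licence in allowed_licences]
    return max(candidates, key=metric, default=None)


permissive = {"MIT", "Apache 2.0"}  # illustrative allow-list
top_mmmu = best_model(MODELS, permissive, lambda m: m.mmmu)
top_ocr = best_model(MODELS, permissive, lambda m: m.ocrbench)
print(top_mmmu.name, top_ocr.name)
```

With this allow-list both picks are InternVL3-78B; the overall leaders on MMMU (Llama 4 Maverick) and OCRBench (Qwen2.5-VL-72B) drop out because of their licences.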

Closed vs Open VLM Comparison

Model	MMMU	Notes
GPT-5.5 multimodal	~83.4	Frontier closed
Claude 4.7 Opus multimodal	~80.2	Frontier closed
Gemini 3.1 Pro multimodal	~80.1	Frontier closed
Llama 4 Maverick	~73.4	Best open weight
InternVL3-78B	~72.2	Top MIT open weight
Qwen2.5-VL-72B	~70.2	Top OCRBench open weight
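
The size of the gap to the closed frontier can be read straight off this table. A small sketch, using the MMMU values exactly as listed (all approximate), computes each open-weight model's deficit against each closed model. The dictionary and function names are illustrative.

```python
from typing import Dict, Tuple

# Approximate MMMU scores (percent) as listed in the table above.
CLOSED: Dict[str, float] = {
    "GPT-5.5 multimodal": 83.4,
    "Claude 4.7 Opus multimodal": 80.2,
    "Gemini 3.1 Pro multimodal": 80.1,
}
OPEN: Dict[str, float] = {
    "Llama 4 Maverick": 73.4,
    "InternVL3-78B": 72.2,
    "Qwen2.5-VL-72B": 70.2,
}


def mmmu_gaps(
    closed: Dict[str, float], open_models: Dict[str, float]
) -> Dict[Tuple[str, str], float]:
    """Map (closed model, open model) to the closed model's lead in points."""
    return {
        (c_name, o_name): round(c_score - o_score, 1)
        for c_name, c_score in closed.items()
        for o_name, o_score in open_models.items()
    }


gaps = mmmu_gaps(CLOSED, OPEN)
smallest_gap = min(gaps.values())
largest_gap = max(gaps.values())
print(f"gap range: {smallest_gap} to {largest_gap} MMMU points")
```

With these figures the gaps span roughly 6.7 to 13.2 points: the smallest is Gemini 3.1 Pro over Llama 4 Maverick and the largest is GPT-5.5 over Qwen2.5-VL-72B.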

Use Case Recommendations

Use Case	Recommended Model
General-purpose document VQA	Qwen2.5-VL-72B or InternVL3-78B
OCR-heavy workloads	Qwen2.5-VL-72B (best OCRBench)
Chart and graph understanding	InternVL3-78B or Qwen2.5-VL
Multimodal RAG (image-aware retrieval)	ColPali (retrieval) + Qwen2.5-VL or InternVL3 (generation)
Small / edge VLM	Phi-4-Multimodal-Instruct (5.6B) or Qwen2.5-VL-3B
Fully open research	Molmo (Apache 2.0 with data and code)
Permissive commercial	InternVL3 (MIT), Qwen2.5-VL-7B (Apache), Pixtral 12B (Apache)
Agentic / tool-use multimodal	Llama 4 Maverick or Qwen2.5-VL with tool wrappers

Deployment Patterns

Three production patterns are common. First, hybrid pipelines: a small VLM (Qwen2.5-VL-7B or Phi-4-Multimodal) handles routine document VQA and a larger model (Qwen2.5-VL-72B or InternVL3-78B) handles edge cases. Second, fine-tuned domain VLMs: medical VLMs (LLaVA-Med, MedVLM-R1), legal VLMs, and financial chart-reading VLMs are emerging on top of open base VLMs. Third, multimodal RAG: ColPali for visual page retrieval followed by Qwen2.5-VL or InternVL3 for grounded generation outperforms text-only RAG on visually rich documents.
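
The first pattern can be sketched as a confidence-gated router: try the small model, and escalate to the large one when the small model's answer looks unreliable. Everything below is a hypothetical stand-in. `Answer`, `route`, the 0.8 threshold, and the stub models are illustrative, and a real deployment would need its own confidence signal (for example a verifier model or a log-probability heuristic), which this page does not prescribe.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Answer:
    text: str
    confidence: float  # assumed to lie in [0, 1]; how it is produced is up to you
    model: str


# A VLM here is any callable taking image bytes and a question.
VlmFn = Callable[[bytes, str], Answer]


def route(
    image: bytes,
    question: str,
    small: VlmFn,
    large: VlmFn,
    threshold: float = 0.8,  # illustrative cut-off, tune on held-out data
) -> Answer:
    """Answer with the small model; escalate when its confidence is too low."""
    first = small(image, question)
    if first.confidence >= threshold:
        return first
    return large(image, question)


# Stub models so the sketch runs without any GPU or model weights.
def small_stub(image: bytes, question: str) -> Answer:
    confident = "total" in question.lower()  # pretend field lookups are easy
    return Answer("small-model answer", 0.95 if confident else 0.4, "small")


def large_stub(image: bytes, question: str) -> Answer:
    return Answer("large-model answer", 0.9, "large")


easy = route(b"", "What is the invoice total?", small_stub, large_stub)
hard = route(b"", "Explain the trend in figure 3.", small_stub, large_stub)
print(easy.model, hard.model)
```

The design choice is that the large model only runs on the fraction of requests the small model cannot handle, so average cost tracks the small model while hard cases still get the stronger one.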

Brand Visibility Implications

VLM selection is a fast-growing procurement category as enterprises move from text-only AI to multimodal pipelines. AI assistant queries such as "best open vision model", "Qwen2.5-VL vs InternVL", and "multimodal RAG" feed directly into production decisions. Brands selling multimodal AI platforms, image understanding APIs, and visual document AI are therefore increasingly discovered through AI assistants in this category.

Methodology

Benchmark data is compiled from the OpenCompass Open VLM Leaderboard, MMMU, OCRBench, and primary model card disclosures through 23 May 2026. All scores are approximate, and the page is updated quarterly.

How Presenc AI Helps

Presenc AI monitors brand visibility on vision-language model queries across ChatGPT, Claude, Gemini, and Perplexity. For multimodal AI platforms, image understanding API vendors, and visual document AI brands, the platform identifies the prompts driving procurement-research traffic and the gaps where new content unlocks share of voice.