Document AI is the highest-volume non-text-generation AI workload in the enterprise in 2026. Approximately 78 percent of surveyed enterprises run at least one production OCR or document-AI pipeline. Open-weight models have matured rapidly: GOT-OCR2, Qwen2.5-VL, ColPali, DocLayout-YOLO, and Florence-2 now cover most production document AI use cases. This page consolidates the leaderboard, the benchmarks, and the deployment guidance.
Key Findings
- GOT-OCR2 (General OCR Theory) from StepFun is the leading open-weight pure-OCR model, with strong performance on plain text, formatted text, math formulas, and tables in a 580M-parameter package.
- The Qwen2.5-VL family is the dominant general-purpose document-AI model family, with the 7B variant covering OCR, layout understanding, visual question answering, and chart extraction in a single model.
- ColPali changed document retrieval: instead of running OCR and then embedding the extracted text, ColPali embeds each page image directly as a set of patch-level vectors and scores them against the query with late interaction. It outperforms OCR-then-embed pipelines on visually rich documents.
- Layout-specific models (DocLayout-YOLO, LayoutLMv3, Florence-2 region detection) remain useful for structured extraction pipelines where running an end-to-end vision-language model for OCR is too expensive.
- The proprietary baselines (Google Document AI, Amazon Textract, Microsoft Azure Document Intelligence) retain a lead on enterprise integrations for OCR and extraction, but the open-weight models have closed the raw quality gap on most public benchmarks.
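The late-interaction scoring behind the ColPali finding can be sketched in a few lines. This is a minimal illustration of MaxSim-style scoring, not ColPali's actual implementation: the function names, page ids, and the 2-dimensional toy embeddings are all assumptions made for the example, and a real system would obtain the vectors from the model.

```python
from math import sqrt

def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_tokens, page_patches):
    """Late interaction: each query token keeps its best-matching patch; the maxima are summed."""
    q = [normalize(v) for v in query_tokens]
    p = [normalize(v) for v in page_patches]
    return sum(max(dot(qt, pt) for pt in p) for qt in q)

def rank_pages(query_tokens, pages):
    """Return page ids sorted by descending MaxSim score."""
    scores = {pid: maxsim_score(query_tokens, patches) for pid, patches in pages.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy embeddings invented for illustration only (hypothetical page ids).
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "invoice_p1": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    "chart_p7": [[1.0, 0.0], [1.0, 0.1]],
}
ranking = rank_pages(query, pages)
```

Because every query token is matched against every patch independently, a page can score well when different regions of the image answer different parts of the query, and no OCR text is needed along the way.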
OCR and Document AI Model Comparison (May 2026)
| Model | Parameters | Primary Capability | License |
|---|---|---|---|
| GOT-OCR2 | ~580M | Plain text, formatted text, tables, formulas, music notation | Apache 2.0 |
| Qwen2.5-VL-7B | ~7B | OCR + layout + chart + VQA | Apache 2.0 |
| Qwen2.5-VL-72B | ~72B | OCR + complex doc understanding | Tongyi Qianwen |
| InternVL3-8B | ~8B | OCR + multilingual document VQA | MIT |
| InternVL3-78B | ~78B | OCR + complex multilingual docs | MIT |
| ColPali v1.3 | ~3B | Document page retrieval (no OCR) | MIT |
| ColQwen2 v1.0 | ~2B | Document page retrieval based on Qwen2-VL | Apache 2.0 |
| Florence-2-Large | ~0.8B | OCR + region detection + captioning | MIT |
| Nougat | ~0.35B | Scientific document OCR (LaTeX preservation) | CC-BY-NC |
| DocLayout-YOLO | ~50M | Layout detection only | AGPL 3.0 |
| Marker | Varies | PDF to Markdown pipeline | GPL 3.0 + Commercial |
| MinerU | Varies | PDF extraction pipeline (uses LayoutLMv3 + others) | AGPL 3.0 |
| Surya | Varies | Layout, OCR, reading order | GPL 3.0 + Commercial |
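The License column is often the first filter in procurement. As a rough sketch, the licenses listed in the table can be bucketed programmatically. The four bucket names and the string-matching rules below are assumptions for illustration, and dual-licensed tools (GPL 3.0 + Commercial) are filed under copyleft here even though a commercial option exists.

```python
# License strings copied from the comparison table; the bucket rules are an assumption.
MODELS = {
    "GOT-OCR2": "Apache 2.0",
    "Qwen2.5-VL-7B": "Apache 2.0",
    "Qwen2.5-VL-72B": "Tongyi Qianwen",
    "InternVL3-8B": "MIT",
    "InternVL3-78B": "MIT",
    "ColPali v1.3": "MIT",
    "ColQwen2 v1.0": "Apache 2.0",
    "Florence-2-Large": "MIT",
    "Nougat": "CC-BY-NC",
    "DocLayout-YOLO": "AGPL 3.0",
    "Marker": "GPL 3.0 + Commercial",
    "MinerU": "AGPL 3.0",
    "Surya": "GPL 3.0 + Commercial",
}

def license_class(license_name):
    """Map a license string from the table to a coarse bucket."""
    if license_name in ("Apache 2.0", "MIT"):
        return "permissive"
    if "GPL" in license_name:  # matches both GPL and AGPL
        return "copyleft"
    if "NC" in license_name:   # non-commercial Creative Commons variants
        return "non-commercial"
    return "custom"            # vendor-specific terms that need a manual read

def group_by_class(models):
    """Return {bucket: sorted list of model names}."""
    groups = {}
    for name, lic in models.items():
        groups.setdefault(license_class(lic), []).append(name)
    return {bucket: sorted(names) for bucket, names in groups.items()}

groups = group_by_class(MODELS)
```

Anything landing in the custom bucket, such as the Tongyi Qianwen license, is a prompt to read the license text rather than a verdict.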
Use Case Recommendations
| Use Case | Recommended Model | Reason |
|---|---|---|
| General OCR (plain text from images) | GOT-OCR2 | Best quality-per-parameter; Apache 2.0 |
| Document VQA and complex docs | Qwen2.5-VL-7B / InternVL3-8B | Strong VQA + OCR in one model |
| Document retrieval (RAG over docs) | ColPali v1.3 or ColQwen2 | Late-interaction patch embeddings outperform OCR-then-embed on visually rich documents |
| Scientific papers (LaTeX preservation) | Nougat (research only) or GOT-OCR2 + post-process | Math and notation preservation |
| PDF to Markdown | Marker, Surya, MinerU | Production-ready pipelines |
| Layout-only extraction | DocLayout-YOLO + Florence-2 | Lightweight, fast region detection |
| High-volume forms processing | Florence-2 + downstream extraction | Strong region detection + extraction |
| Mixed language documents | InternVL3-8B | Strongest multilingual document VQA |
Quality Benchmarks
| Benchmark | Leading Open-Weight Model | Score |
|---|---|---|
| DocVQA | Qwen2.5-VL-72B | ~96.4 |
| ChartQA | InternVL3-78B | ~89.3 |
| OCRBench | Qwen2.5-VL-72B | ~888 / 1000 |
| InfoVQA | Qwen2.5-VL-72B | ~84.5 |
| TextVQA | InternVL3-78B | ~86.7 |
| ViDoRe (visual doc retrieval) | ColPali v1.3 | ~82.4 |
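The scores above sit on different scales: OCRBench is reported out of 1000, while the other rows are assumed here to be on a 0 to 100 scale. A small sketch of putting them on a common 0 to 1 scale and tallying which model leads most benchmarks follows. The scale column is an assumption for every row except OCRBench, and rescaling does not make different metrics comparable with one another.

```python
# (benchmark, leading model, approximate score, assumed scale maximum)
# Scores are copied from the table; only the OCRBench scale is stated in the source.
RESULTS = [
    ("DocVQA", "Qwen2.5-VL-72B", 96.4, 100),
    ("ChartQA", "InternVL3-78B", 89.3, 100),
    ("OCRBench", "Qwen2.5-VL-72B", 888, 1000),
    ("InfoVQA", "Qwen2.5-VL-72B", 84.5, 100),
    ("TextVQA", "InternVL3-78B", 86.7, 100),
    ("ViDoRe", "ColPali v1.3", 82.4, 100),
]

def normalized(results):
    """Return {benchmark: score divided by its scale maximum, rounded to 3 places}."""
    return {bench: round(score / scale, 3) for bench, _, score, scale in results}

def leads_by_model(results):
    """Count how many benchmarks each model leads."""
    counts = {}
    for _, model, _, _ in results:
        counts[model] = counts.get(model, 0) + 1
    return counts

scores = normalized(RESULTS)
leads = leads_by_model(RESULTS)
```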
Production Patterns
The dominant 2026 production patterns are dedicated OCR (GOT-OCR2) for high-volume text extraction at low cost, a general VLM (Qwen2.5-VL-7B) for mixed OCR plus VQA workloads, and ColPali for RAG over visually rich documents. The PDF-to-Markdown pipelines (Marker, Surya, MinerU) layer multiple specialised models for general-purpose document conversion and are widely used for ingesting documents into RAG systems. Approximately 42 percent of surveyed enterprise document AI deployments use at least one open-weight model in 2026, up from approximately 18 percent in 2024.
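These patterns amount to a routing decision at ingestion time. The sketch below shows one way to encode it. The `Workload` fields, the `choose_pipeline` name, and the priority order of the checks are all hypothetical choices for illustration, since the source names the patterns but does not prescribe how to choose between them.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    """Hypothetical description of a document workload."""
    needs_qa: bool = False                 # questions asked about the document, not just text out
    visually_rich_retrieval: bool = False  # RAG over charts, scans, slide decks
    pdf_to_markdown: bool = False          # bulk conversion for ingestion

def choose_pipeline(workload):
    """Map a workload to one of the production patterns named in the text.

    The order of the checks is an assumption; adjust it to local priorities.
    """
    if workload.visually_rich_retrieval:
        return "ColPali"
    if workload.pdf_to_markdown:
        return "Marker / Surya / MinerU"
    if workload.needs_qa:
        return "Qwen2.5-VL-7B"
    # Default: high-volume plain text extraction at low cost.
    return "GOT-OCR2"

default_choice = choose_pipeline(Workload())
qa_choice = choose_pipeline(Workload(needs_qa=True))
```

Keeping the decision in one small function makes it easy to log which pattern handled each document and to revisit the ordering as costs change.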
Brand Visibility Implications
Document AI is one of the largest enterprise AI procurement categories, and AI assistants increasingly handle queries such as "best OCR model 2026", "open-source document AI", and "ColPali vs LayoutLM". Brands selling document AI products, OCR APIs, PDF processing, and intelligent document processing face a large AI-mediated discovery surface in this category.
Methodology
Benchmark data are compiled from the OCRBench leaderboard, the ViDoRe leaderboard, and primary model card disclosures through 23 May 2026. Deployment share figures come from cross-industry survey data. The page is updated quarterly.
How Presenc AI Helps
Presenc AI monitors brand visibility on document AI and OCR queries across ChatGPT, Claude, Gemini, and Perplexity. For document AI vendors, OCR API brands, and intelligent document processing companies, the platform identifies the prompts driving procurement-research traffic and the gaps where new content unlocks share of voice.