
Best Open-Weight OCR and Document AI Models 2026

Open-weight OCR and document AI leaderboard 2026: GOT-OCR2, Qwen2.5-VL OCR, ColPali, DocLayout-YOLO, Florence-2, Nougat, Marker. Benchmarks, latency, license, deployment patterns.

By Ramanath, CTO & Co-Founder at Presenc AI · Last updated: May 2026

Document AI is the highest-volume enterprise AI workload outside text generation in 2026. Approximately 78 percent of surveyed enterprises run at least one production OCR or document-AI pipeline. Open-weight models have matured rapidly: GOT-OCR2, Qwen2.5-VL, ColPali, DocLayout-YOLO, and Florence-2 now cover most production document AI use cases. This page consolidates the leaderboard, the benchmarks, and the deployment guidance.

Key Findings

  1. GOT-OCR2 (General OCR Theory) from StepFun is the leading open-weight pure-OCR model, with strong performance on plain text, formatted text, math formulas, and tables in a 580M-parameter package.
  2. The Qwen2.5-VL family is the dominant general-purpose document-AI model line, with the 7B variant covering OCR, layout understanding, visual question answering, and chart extraction in a single model.
  3. ColPali changed document retrieval: instead of running OCR and then embedding the extracted text, it embeds page images directly as late-interaction patch embeddings, and it outperforms OCR-then-embed pipelines on visually heavy documents.
  4. Layout-specific models (DocLayout-YOLO, LayoutLMv3, Florence-2 region detection) remain useful for structured extraction pipelines where the end-to-end LLM-OCR approach is too expensive.
  5. The proprietary baselines (Google Document AI, Amazon Textract, Microsoft Azure Document Intelligence) retain their lead on enterprise OCR and extraction integrations, but the open-weight models have closed the raw quality gap on most public benchmarks.
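
To make the late-interaction idea in finding 3 concrete, here is a minimal sketch of ColBERT-style MaxSim scoring, the scoring scheme ColPali builds on: each query-token embedding is matched to its best page-patch embedding, and the maxima are summed. The function names and the toy 2-D vectors are illustrative assumptions, not ColPali's API or real model output.

```python
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], page_patches: List[Vector]) -> float:
    """Late-interaction score: for each query token, take the similarity of
    its best-matching page patch, then sum those maxima over all tokens."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


def rank_pages(query_tokens: List[Vector], pages: Dict[str, List[Vector]]) -> List[str]:
    """Return page ids sorted from highest to lowest late-interaction score."""
    scores = {pid: maxsim_score(query_tokens, patches) for pid, patches in pages.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy 2-D embeddings (hypothetical values, not real model output).
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_with_chart": [[0.9, 0.1], [0.2, 0.8], [0.0, 0.0]],
    "page_text_only": [[0.5, 0.5], [0.4, 0.3]],
}
ranking = rank_pages(query, pages)
```

Because every patch keeps its own vector, a chart or figure region can match a query token directly, which is the information an OCR-then-embed pipeline would have discarded before embedding.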

OCR and Document AI Model Comparison (May 2026)

| Model | Parameters | Primary Capability | License |
| --- | --- | --- | --- |
| GOT-OCR2 | ~580M | Plain text, formatted text, tables, formulas, music notation | Apache 2.0 |
| Qwen2.5-VL-7B | ~7B | OCR + layout + chart + VQA | Apache 2.0 |
| Qwen2.5-VL-72B | ~72B | OCR + complex doc understanding | Tongyi Qianwen |
| InternVL3-8B | ~8B | OCR + multilingual document VQA | MIT |
| InternVL3-78B | ~78B | OCR + complex multilingual docs | MIT |
| ColPali v1.3 | ~3B | Document page retrieval (no OCR) | MIT |
| ColQwen2 v1.0 | ~3B | Document page retrieval based on Qwen2-VL | Apache 2.0 |
| Florence-2-Large | ~0.8B | OCR + region detection + captioning | MIT |
| Nougat | ~0.35B | Scientific document OCR (LaTeX preservation) | CC-BY-NC |
| DocLayout-YOLO | ~50M | Layout detection only | AGPL 3.0 |
| Marker | varies | PDF to Markdown pipeline | GPL 3.0 + Commercial |
| MinerU | varies | PDF extraction pipeline (uses LayoutLMv3 + others) | AGPL 3.0 |
| Surya | varies | Layout, OCR, reading order | GPL 3.0 + Commercial |
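
The parameter counts above translate roughly into serving memory. As a back-of-the-envelope sketch, weight memory is the parameter count multiplied by the bytes stored per parameter. This counts weights only and ignores activations, KV cache, and image-token overhead, so real footprints are higher. The precision table and the function name are assumptions made for illustration.

```python
# Approximate parameter counts (in billions) taken from the comparison table.
PARAMS_BILLIONS = {
    "DocLayout-YOLO": 0.05,
    "Nougat": 0.35,
    "GOT-OCR2": 0.58,
    "Florence-2-Large": 0.8,
    "Qwen2.5-VL-7B": 7.0,
    "InternVL3-8B": 8.0,
    "Qwen2.5-VL-72B": 72.0,
    "InternVL3-78B": 78.0,
}

# Bytes stored per parameter at common precisions (assumed, weights only).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str = "fp16") -> float:
    """Estimate weight memory in decimal gigabytes.

    One billion parameters at one byte each is one gigabyte, so the estimate
    is simply billions of parameters times bytes per parameter.
    """
    return params_billions * BYTES_PER_PARAM[precision]


# Weight-only estimates at fp16 for every model in the table above.
fp16_estimates = {
    name: weight_memory_gb(size, "fp16") for name, size in PARAMS_BILLIONS.items()
}
```

Under these assumptions the 7B model needs about 14 GB for fp16 weights, while GOT-OCR2 needs a little over 1 GB, which is one reason dedicated OCR models are attractive for high-volume extraction.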

Use Case Recommendations

| Use Case | Recommended Model | Reason |
| --- | --- | --- |
| General OCR (plain text from images) | GOT-OCR2 | Best quality-per-parameter; Apache 2.0 |
| Document VQA and complex docs | Qwen2.5-VL-7B / InternVL3-8B | Strong VQA + OCR in one model |
| Document retrieval (RAG over docs) | ColPali v1.3 or ColQwen2 | Late-interaction patch embeddings outperform OCR-then-embed |
| Scientific papers (LaTeX preservation) | Nougat (research only) or GOT-OCR2 + post-process | Math and notation preservation |
| PDF to Markdown | Marker, Surya, MinerU | Production-ready pipelines |
| Layout-only extraction | DocLayout-YOLO + Florence-2 | Lightweight, fast region detection |
| High-volume forms processing | Florence-2 + downstream extraction | Strong region detection + extraction |
| Mixed language documents | InternVL3-8B | Strongest multilingual document VQA |
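
The recommendations above can be encoded as a simple lookup, for example as the first step of a routing layer in front of several models. This is a sketch only: the use-case keys, the `recommend` function, and the `commercial` flag are hypothetical names introduced here, and the model lists simply restate the table.

```python
from typing import Dict, List

# Use case -> recommended models, restating the table above.
# A list holds either alternatives or components that are used together.
RECOMMENDATIONS: Dict[str, List[str]] = {
    "general_ocr": ["GOT-OCR2"],
    "document_vqa": ["Qwen2.5-VL-7B", "InternVL3-8B"],
    "document_retrieval": ["ColPali v1.3", "ColQwen2"],
    "scientific_papers": ["Nougat", "GOT-OCR2"],
    "pdf_to_markdown": ["Marker", "Surya", "MinerU"],
    "layout_only": ["DocLayout-YOLO", "Florence-2"],
    "forms_processing": ["Florence-2"],
    "mixed_language": ["InternVL3-8B"],
}

# The table marks Nougat as research only (CC-BY-NC weights).
RESEARCH_ONLY = {"Nougat"}


def recommend(use_case: str, commercial: bool = True) -> List[str]:
    """Return the recommended models for a use case.

    When commercial is True, models marked research-only are filtered out.
    """
    if use_case not in RECOMMENDATIONS:
        known = ", ".join(sorted(RECOMMENDATIONS))
        raise ValueError(f"Unknown use case {use_case!r}; expected one of: {known}")
    models = RECOMMENDATIONS[use_case]
    if commercial:
        models = [m for m in models if m not in RESEARCH_ONLY]
    return list(models)


commercial_choice = recommend("scientific_papers")
research_choice = recommend("scientific_papers", commercial=False)
```

Keeping the licence caveat in the lookup itself means a commercial deployment falls back to GOT-OCR2 with post-processing for scientific papers instead of silently selecting a non-commercial model.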

Quality Benchmarks

| Benchmark | Leading Open-Weight Model | Score |
| --- | --- | --- |
| DocVQA | Qwen2.5-VL-72B | ~96.4 |
| ChartQA | InternVL3-78B | ~89.3 |
| OCRBench | Qwen2.5-VL-72B | ~888 / 1000 |
| InfoVQA | Qwen2.5-VL-72B | ~84.5 |
| TextVQA | InternVL3-78B | ~86.7 |
| ViDoRe (visual doc retrieval) | ColPali v1.3 | ~82.4 |
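
The table does not state which metric each score uses. DocVQA and InfoVQA are conventionally reported as ANLS (Average Normalized Levenshtein Similarity), so, assuming that convention applies here, the sketch below shows how such a score is computed. Each prediction earns one minus its normalized edit distance to the closest reference answer, or zero once that distance reaches the usual 0.5 threshold. The example answers are made up for illustration.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def answer_similarity(prediction: str, reference: str, threshold: float = 0.5) -> float:
    """1 - normalized edit distance, or 0.0 if the distance reaches the threshold."""
    pred, ref = prediction.strip().lower(), reference.strip().lower()
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 1.0
    distance = levenshtein(pred, ref) / longest
    return 1.0 - distance if distance < threshold else 0.0


def anls(predictions: List[str], references: List[List[str]]) -> float:
    """Average, over questions, of the best similarity to any reference answer."""
    scores = [max(answer_similarity(p, r) for r in refs)
              for p, refs in zip(predictions, references)]
    return sum(scores) / len(scores)


# Hypothetical predictions and reference answers.
predictions = ["$1,250.00", "acme corp", "total"]
references = [["$1,250.00"], ["Acme Corporation", "Acme Corp."], ["42"]]
score = anls(predictions, references)
```

The threshold is what lets the metric forgive small OCR slips, such as a missing full stop, while giving no credit to answers that are simply wrong.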

Production Patterns

Three production patterns dominate in 2026: a dedicated OCR model (GOT-OCR2) for high-volume, low-cost text extraction; a general VLM (Qwen2.5-VL-7B) for mixed OCR and VQA workloads; and ColPali for RAG over visually rich documents. The PDF-to-Markdown pipelines (Marker, Surya, MinerU) layer multiple specialised models for general-purpose document conversion and are widely used to ingest documents into RAG systems. Approximately 42 percent of surveyed enterprise document AI deployments use at least one open-weight model in 2026, up from approximately 18 percent in 2024.
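
The layering used by the PDF-to-Markdown pipelines can be sketched as three stages: layout detection produces labelled regions, OCR fills in their text, and a reading-order step sequences them before rendering. The code below is a deliberately naive stand-in under stated assumptions. The `Region` type, the band-based ordering heuristic, and the single-column page are hypothetical, and none of this reflects the actual APIs of Marker, Surya, or MinerU, which use learned models for these steps.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    kind: str   # label a layout detector might assign, e.g. "title" or "text"
    x: float    # left edge of the bounding box, in pixels
    y: float    # top edge of the bounding box, in pixels
    text: str   # OCR output for this region


def reading_order(regions: List[Region], row_tolerance: float = 10.0) -> List[Region]:
    """Naive ordering: top-to-bottom by horizontal band, then left-to-right.

    Regions whose top edges fall in the same band of height row_tolerance
    are treated as one row. Multi-column pages need a smarter model.
    """
    return sorted(regions, key=lambda r: (int(r.y // row_tolerance), r.x))


def to_markdown(regions: List[Region]) -> str:
    """Render ordered regions as Markdown, promoting titles to headings."""
    blocks = []
    for region in reading_order(regions):
        if region.kind == "title":
            blocks.append(f"# {region.text}")
        else:
            blocks.append(region.text)
    return "\n\n".join(blocks)


# Hypothetical detector and OCR output, deliberately out of order.
detected = [
    Region("text", 50, 120, "Total due: 42 EUR"),
    Region("title", 50, 20, "Invoice"),
    Region("text", 50, 80, "Customer: Example Ltd"),
]
markdown = to_markdown(detected)
```

Separating the stages this way is what lets a pipeline swap in a stronger layout detector or OCR model without touching the rendering code.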

Brand Visibility Implications

Document AI is one of the largest enterprise AI procurement categories, and AI assistants increasingly handle queries such as "best OCR model 2026", "open-source document AI", and "ColPali vs LayoutLM". Brands selling document AI products, OCR APIs, PDF processing, and intelligent document processing are therefore highly exposed to AI-mediated discovery in this category.

Methodology

Benchmark data was compiled from the OCRBench leaderboard, the ViDoRe leaderboard, and primary model card disclosures through 23 May 2026. Deployment share figures come from cross-industry survey data. The page is updated quarterly.

How Presenc AI Helps

Presenc AI monitors brand visibility on document AI and OCR queries across ChatGPT, Claude, Gemini, and Perplexity. For document AI vendors, OCR API brands, and intelligent document processing companies, the platform identifies the prompts driving procurement-research traffic and the gaps where new content unlocks share of voice.

Frequently Asked Questions

The best open-weight OCR model depends on the workload. GOT-OCR2 is the leading dedicated OCR model, at approximately 580M parameters and under the Apache 2.0 licence. For mixed OCR plus document understanding, Qwen2.5-VL-7B (Apache 2.0) is the strongest general-purpose open-weight choice. For visual document retrieval, ColPali v1.3 is the dominant new approach.
ColPali is a document retrieval model that embeds document pages directly as late-interaction multi-vector patch embeddings, bypassing OCR entirely. It outperforms traditional OCR-then-embed pipelines on visually heavy documents by retaining the layout, chart, and image information that OCR discards. The v1.3 release is MIT licensed.
Open-weight models have closed the gap with the proprietary services on raw quality for most benchmarks. On enterprise integration (forms processing templates, custom extractor training UI, audit logging) the proprietary alternatives still lead. The dominant 2026 pattern is open-weight models for high-volume document AI plus proprietary platforms for specialised enterprise workflows.
Among PDF-to-Markdown pipelines, Marker, Surya, and MinerU all layer multiple specialised models (layout detection, OCR, reading order, equation handling). Marker has the largest user base and is the easiest to deploy. MinerU has the strongest table extraction. Surya leads on multilingual layout. All three are GPL or AGPL with separate commercial options.
For table extraction, use GOT-OCR2 for direct table OCR, Qwen2.5-VL-72B for complex tables that require calculation, and MinerU for tables in PDFs at scale. The proprietary baselines (Amazon Textract, Azure Document Intelligence) still lead on specific table-heavy forms processing workloads.
