Edit in GitHubLog an issue

PDF to Markdown API

The PDF to Markdown API (included with the PDF Services API) is a cloud-based web service that automatically converts PDF documents – native or scanned – into well-formatted Markdown text. This service preserves the document's structure and formatting while converting it into a format that's widely used for LLM flows, content authoring and documentation.

Structured Information Output Format

The output of a PDF to Markdown operation includes:

  • A primary .md file containing the converted Markdown content

Output Structure

The following is a summary of key elements in the converted Markdown:

Elements

Ordered list of semantic elements converted from the PDF document, preserving the natural reading order and document structure. The conversion handles:

  • Text content with proper Markdown syntax
  • Document hierarchy and structure
  • Inline formatting and emphasis
  • Links and references
  • Images and figures
  • Tables and complex layouts

Content Types

The API processes various content types as follows:

Text Elements
  • Headings: Converted to appropriate Markdown heading levels (H1-H6)
  • Paragraphs: Preserved with proper spacing and formatting
  • Lists: Both ordered and unordered lists with proper nesting
  • Text Emphasis: Bold, italic, and other text formatting
  • Links: Preserved with proper Markdown link syntax
Images and Figures
  • Provided as base64-embedded images in the Markdown output
  • Referenced correctly in the Markdown output
  • Original quality preserved
  • Proper alt text and captions maintained
Tables
  • Converted to Markdown table syntax
  • Column alignment preserved
  • Cell content formatting maintained
  • Complex table structures supported

Element Types and Paths

The API recognizes and converts the following structural elements:

CategoryElement TypeDescription
Aside
Aside
Content which is not part of regular content flow
Figure
Figure
Non-reflowable constructs like graphs, images, flowcharts
Footnote
Footnote
Footnote
Headings
H, H1, H2, etc
Heading levels
List
L, Li, Lbl, Lbody
List and list item elements
Paragraph
P, ParagraphSpan
Paragraphs and paragraph segments
Reference
Reference
Links
Section
Sect
Logical section of the document
StyleSpan
StyleSpan
Styling variations within text
Table
Table, TD, TH, TR
Table elements
Title
Title
Document title

Reading Order

The reading order in the output Markdown maintains:

  • Natural document flow
  • Proper content hierarchy
  • Column-based layouts
  • Page transitions
  • Inline elements and references

Use Cases

The PDF to Markdown API is particularly valuable for:

  • LLM-friendly content ingestion and prompt creation
  • Training/Fine-tuning LLM with PDFs
  • Content migration from PDF to documentation platforms
  • Legacy document conversion
  • Content repurposing for modern documentation systems
  • Integration with Markdown-based workflows
  • Automated document processing pipelines
  • Searchable internal knowledge repositories

API Limitations

File Constraints

  • File Size: Maximum of 100MB per file
  • Page Count:
    • Non-scanned PDFs: Up to 400 pages
    • Scanned PDFs: Up to 150 pages
  • Page Dimensions: Between 6" and 17.5" in either dimension

Processing Limits

  • Rate Limits: Maximum 25 requests per minute
  • Language Support: Optimized for English, supports other Latin-based languages
  • OCR Quality: Dependent on scan quality (minimum 200 DPI recommended)

Document Requirements

  • Files must be unprotected or allow content copying
  • No support for:
    • Hidden objects (JavaScript, OCG)
    • XFA and fillable forms
    • Complex annotations
    • CAD drawings or vector art
    • Password-protected content

REST API

See our public API Reference for PDF to Markdown API.

  • Privacy
  • Terms of Use
  • Do not sell or share my personal information
  • AdChoices
Copyright © 2025 Adobe. All rights reserved.