PDF to Markdown API

The PDF to Markdown API (included with the PDF Services API) is a cloud-based web service that automatically converts PDF documents – native or scanned – into well-formatted Markdown text. This service preserves the document's structure and formatting while converting it into a format that's widely used for LLM flows, content authoring and documentation.

Structured Information Output Format

The output of a PDF to Markdown operation includes:

A primary .md file containing the converted Markdown content

Output Structure

The following is a summary of key elements in the converted Markdown:

Elements

Ordered list of semantic elements converted from the PDF document, preserving the natural reading order and document structure. The conversion handles:

Text content with proper Markdown syntax
Document hierarchy and structure
Inline formatting and emphasis
Links and references
Images and figures
Tables and complex layouts

Content Types

The API processes various content types as follows:

Text Elements

Headings: Converted to appropriate Markdown heading levels (H1-H6)
Paragraphs: Preserved with proper spacing and formatting
Lists: Both ordered and unordered lists with proper nesting
Text Emphasis: Bold, italic, and other text formatting
Links: Preserved with proper Markdown link syntax

Images and Figures

Provided as base64-embedded images in the Markdown output
Referenced correctly in the Markdown output
Original quality preserved
Proper alt text and captions maintained

Tables

Converted to Markdown table syntax
Column alignment preserved
Cell content formatting maintained
Complex table structures supported

Element Types and Paths

The API recognizes and converts the following structural elements:

Category	Element Type	Description
Aside	Aside	Content which is not part of regular content flow
Figure	Figure	Non-reflowable constructs like graphs, images, flowcharts
Footnote	Footnote	Footnote
Headings	H, H1, H2, etc	Heading levels
List	L, Li, Lbl, Lbody	List and list item elements
Paragraph	P, ParagraphSpan	Paragraphs and paragraph segments
Reference	Reference	Links
Section	Sect	Logical section of the document
StyleSpan	StyleSpan	Styling variations within text
Table	Table, TD, TH, TR	Table elements
Title	Title	Document title

Reading Order

The reading order in the output Markdown maintains:

Natural document flow
Proper content hierarchy
Column-based layouts
Page transitions
Inline elements and references

Use Cases

The PDF to Markdown API is particularly valuable for:

LLM-friendly content ingestion and prompt creation
Training/Fine-tuning LLM with PDFs
Content migration from PDF to documentation platforms
Legacy document conversion
Content repurposing for modern documentation systems
Integration with Markdown-based workflows
Automated document processing pipelines
Searchable internal knowledge repositories

API Limitations

File Constraints

File Size: Maximum of 100MB per file
Page Count:
- Non-scanned PDFs: Up to 400 pages
- Scanned PDFs: Up to 150 pages
Page Dimensions: Between 6" and 17.5" in either dimension

Processing Limits

Rate Limits: Maximum 25 requests per minute
Language Support: Optimized for English, supports other Latin-based languages
OCR Quality: Dependent on scan quality (minimum 200 DPI recommended)

Document Requirements

Files must be unprotected or allow content copying
No support for:
- Hidden objects (JavaScript, OCG)
- XFA and fillable forms
- Complex annotations
- CAD drawings or vector art
- Password-protected content