Extract Text from PDF

Extract every line of text from PDF documents for use in other applications. Export as plain text for simple copying, Markdown with page structure preserved, or detailed JSON containing word positions, fonts, and form field values. Scanned, image-only pages are automatically detected and run through on-device OCR, so you get text even from documents that have none embedded. Perfect for content migration, text analysis, accessibility, and data extraction workflows.

This tool processes your files entirely in your browser using WebAssembly; nothing is uploaded to servers.

What does this tool do?

The Extract Text tool pulls text content from PDF documents in three useful formats. Plain text output provides clean, copy-pasteable text without formatting. Markdown format preserves page structure with headings and basic formatting cues. JSON format provides detailed extraction including word-level bounding boxes, font information, font sizes, and AcroForm field values for advanced processing and data extraction needs. When a page contains no embedded text (a scan or photo), the tool falls back to optical character recognition automatically — all in your browser.

How it works

The tool uses MuPDF's text extraction to analyze each page's content streams and identify text elements. It maps glyphs to Unicode characters, records their positions, and identifies font properties. The JSON output contains per-page arrays of word-level text blocks with bounding box coordinates, so you can locate any piece of text precisely. Form fields are detected through AcroForm analysis, and each widget's name, type, and current value is included in the output. For image-only pages, the tool rasterizes the page, cleans it up on the GPU (grayscale conversion, adaptive thresholding, deskewing), and recognizes the text with an on-device OCR engine selected to match your hardware: lightweight Tesseract on modest devices and a WebGPU-accelerated neural model on capable ones.
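To make the cleanup step more concrete, here is a minimal CPU-side sketch of grayscale conversion followed by mean-based adaptive thresholding. It is not the tool's GPU code: the function names, the window radius, and the offset are assumptions chosen for illustration, and deskewing is left out.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def to_grayscale(rgb_rows: List[List[Pixel]]) -> List[List[int]]:
    """Convert RGB pixels to 0-255 luminance using ITU-R BT.601 weights."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_rows
    ]


def adaptive_threshold(
    gray: List[List[int]], radius: int = 1, offset: int = 5
) -> List[List[int]]:
    """Mark a pixel as ink (0) when it is darker than its local mean minus
    `offset`; everything else becomes paper (255)."""
    height, width = len(gray), len(gray[0])
    out = [[255] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Neighborhood around (x, y), clipped at the image border.
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            window = [gray[j][i] for j in ys for i in xs]
            local_mean = sum(window) / len(window)
            if gray[y][x] < local_mean - offset:
                out[y][x] = 0
    return out


# A single dark pixel on a light background survives as ink.
page = [[200, 200, 200], [200, 50, 200], [200, 200, 200]]
binary = adaptive_threshold(page)
```

Comparing each pixel with its own neighborhood, rather than with one global cutoff, is what lets this kind of thresholding cope with unevenly lit scans.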

Features

Three output formats: plain text, Markdown, structured JSON
JSON includes per-word bounding boxes, font names, and font sizes
Captures AcroForm field names, types, and current values
Page-level breakdown in all formats
Live preview before downloading
Handles multi-column and complex layouts
Preserves text reading order

How to use

1. Upload your PDF

Drag a PDF onto the drop zone. Text-based PDFs give the most accurate results; scanned, image-only pages are detected and run through OCR automatically. The tool analyzes the document and prepares extraction.

2. Choose output format

Select Plain text for copy-paste simplicity, Markdown for structured text with page headings, or JSON for programmatic processing with position data.

3. Review the preview

The preview panel shows a sample of the extracted text. Check that the text was captured correctly before continuing.

4. Download the text file

Click Extract to generate the output file, then download the .txt, .md, or .json file, depending on your selected format.

Common use cases

Migrate content to web platforms

Extract text from PDF reports, whitepapers, or documentation for republishing on websites or content management systems.

Data extraction from forms

Use JSON output to programmatically extract values from filled PDF forms for database import or record processing.
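As a rough illustration of this workflow, the sketch below flattens form fields from the JSON output into a single record. The key names (`pages`, `fields`, `name`, `type`, `value`) and the placement of fields under each page are assumptions made for the example; check them against a real export before relying on them.

```python
import json
from typing import Any, Dict

# Hypothetical sample; the key names are assumptions, not a documented schema.
SAMPLE_JSON = """
{
  "pages": [
    {
      "page": 1,
      "fields": [
        {"name": "full_name", "type": "text", "value": "Ada Lovelace"},
        {"name": "newsletter", "type": "checkbox", "value": true}
      ]
    },
    {"page": 2, "fields": []}
  ]
}
"""


def collect_form_values(document: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten the form fields of every page into a name -> value mapping."""
    values: Dict[str, Any] = {}
    for page in document.get("pages", []):
        for field in page.get("fields", []):
            values[field["name"]] = field.get("value")
    return values


record = collect_form_values(json.loads(SAMPLE_JSON))
print(record)
```

A flat mapping like `record` can then be passed to a database insert or written out with `csv.DictWriter`.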

Text analysis and NLP processing

Extract text for natural language processing, sentiment analysis, keyword extraction, or machine learning training data preparation.
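For instance, a first pass at keyword extraction over the plain text export might look like the sketch below. The stop-word list is a tiny placeholder and the tokenizer is deliberately naive, so treat it as a starting point and not a full NLP pipeline.

```python
import re
from collections import Counter
from typing import List, Tuple

# Placeholder stop-word list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for"}


def top_keywords(text: str, limit: int = 3) -> List[Tuple[str, int]]:
    """Count lowercase word occurrences, skipping stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(limit)


extracted = "The invoice total is due. The invoice is final."
keywords = top_keywords(extracted, limit=1)
```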

Accessibility text alternatives

Extract text from PDFs to create accessible HTML versions or screen-reader-friendly content for users with disabilities.

Tips & best practices

Plain text works best for simple documents with single-column layouts
Markdown format helps preserve document structure when converting to web content
JSON output is ideal for developers building automated text extraction pipelines
Scanned, image-only pages are OCR'd automatically — no separate step needed
OCR accuracy is best on clean, high-resolution scans; very low-resolution or skewed pages may need a better source

Frequently asked questions

Will it work on scanned PDFs?

Yes. When a page has no embedded text, the tool automatically runs optical character recognition (OCR) on it, right in your browser. It rasterizes the page, cleans it up, and recognizes the text using an on-device engine matched to your hardware. Quality depends on the scan: clean, high-resolution pages OCR best.
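The per-page decision described above can be pictured as a small routing function. This is only a sketch of the logic: the function names, the engine labels, and the use of a single WebGPU flag as the capability test are all assumptions, since the tool's real hardware checks are not described here.

```python
from typing import Optional


def needs_ocr(embedded_text: Optional[str]) -> bool:
    """Treat a page as image-only when it has no non-whitespace embedded text."""
    return not (embedded_text or "").strip()


def pick_engine(has_webgpu: bool) -> str:
    """Return a placeholder engine label based on a hypothetical capability flag."""
    return "neural-webgpu" if has_webgpu else "tesseract"


def plan_page(embedded_text: Optional[str], has_webgpu: bool) -> str:
    """Decide how a single page is processed."""
    if not needs_ocr(embedded_text):
        return "embedded-text"
    return pick_engine(has_webgpu)
```

With this routing, a page that already has embedded text never reaches the OCR engine, which is why text-based PDFs are both faster and more accurate.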

Why is my output empty or garbled?

Some PDFs encode text as custom glyph indices without a ToUnicode character map, so the glyphs cannot be mapped back to readable characters. This is common in PDFs with embedded subset fonts that lack proper encoding tables.
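If you process many files, a quick heuristic can flag output that is likely garbled. The sketch below measures the share of replacement, private-use, unassigned, and control characters in the extracted text; the function name and the idea of picking a cutoff are assumptions. It will not catch text that maps to valid but wrong letters.

```python
import unicodedata

# Unicode categories: Co = private use, Cn = unassigned, Cc = control.
SUSPICIOUS_CATEGORIES = {"Co", "Cn", "Cc"}


def suspicious_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look like failed mappings."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(
        1
        for c in chars
        if c == "\ufffd" or unicodedata.category(c) in SUSPICIOUS_CATEGORIES
    )
    return bad / len(chars)
```

A file whose ratio is well above zero is a reasonable candidate for manual review or for re-exporting from the original source.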

What's included in the JSON output?

The JSON contains per-page data including raw text content, an array of word-level blocks with bounding box coordinates (x, y, width, height), font name and size for each block, and any AcroForm fields with their names, types, and values.
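As an illustration of working with the position data, the sketch below picks out the words that fall inside a rectangle on one page. The key names (`words`, `text`, `x`, `y`, `width`, `height`) are assumptions based on the description above, and the sample page is made up.

```python
from typing import Any, Dict, List


def words_in_region(
    page: Dict[str, Any], left: float, top: float, right: float, bottom: float
) -> List[str]:
    """Return the text of words whose boxes lie fully inside the rectangle."""
    hits: List[str] = []
    for word in page.get("words", []):
        x, y = word["x"], word["y"]
        w, h = word["width"], word["height"]
        if x >= left and y >= top and x + w <= right and y + h <= bottom:
            hits.append(word["text"])
    return hits


# Hypothetical page data shaped like the description above.
sample_page = {
    "words": [
        {"text": "Invoice", "x": 72, "y": 40, "width": 60, "height": 12},
        {"text": "Total", "x": 72, "y": 700, "width": 40, "height": 12},
        {"text": "$120.00", "x": 400, "y": 700, "width": 50, "height": 12},
    ]
}
footer_words = words_in_region(sample_page, 0, 690, 612, 720)
```

Region queries like this are a common way to read a value that always appears in the same place on a templated document.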

Does extraction preserve formatting?

Plain text removes all formatting. Markdown preserves page structure and basic hierarchy. JSON preserves position data, allowing you to reconstruct layout programmatically. Complex formatting like tables may require manual cleanup.
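A simple form of layout reconstruction is to regroup words into lines using their coordinates. The sketch below assumes each word carries `text`, `x`, and `y` keys and that `y` grows downward from the top of the page; both are assumptions to verify against your own export, and the tolerance value is arbitrary.

```python
from typing import Any, Dict, List


def group_into_lines(words: List[Dict[str, Any]], tolerance: float = 3.0) -> List[str]:
    """Cluster words whose y values are within `tolerance` of the line's first
    word, then order each line from left to right."""
    lines: List[List[Dict[str, Any]]] = []
    for word in sorted(words, key=lambda w: (w["y"], w["x"])):
        if lines and abs(word["y"] - lines[-1][0]["y"]) <= tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x"]))
        for line in lines
    ]


# Hypothetical word boxes with slightly uneven baselines.
sample_words = [
    {"text": "world", "x": 110, "y": 100.5},
    {"text": "Hello", "x": 72, "y": 100},
    {"text": "Second", "x": 72, "y": 120},
    {"text": "line", "x": 120, "y": 119},
]
rebuilt = group_into_lines(sample_words)
```

The same idea extends to tables by also clustering on `x` to find column boundaries, though irregular tables usually still need manual review.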