Invoice Reader — PDF to JSON, Markdown, or Text
Turn a PDF invoice into structured data you can import into accounting software, spreadsheets, or your own apps. The Invoice Reader extracts vendor and customer details, invoice number and dates, line items with quantities and amounts, and tax totals — then exports everything as JSON, Markdown, or plain text. Indian GST invoices are supported: GSTIN, HSN/SAC codes, and CGST, SGST, or IGST breakdowns are detected when present. Scanned or image-only invoices are recognized with built-in OCR, and your file never leaves your device.
What does this tool do?
The Invoice Reader analyzes the text and layout of a PDF invoice and reconstructs it as structured data. It scans header regions for labeled fields like invoice number, date, and due date; detects the line-item table by matching column headers such as Description, Qty, Rate, Amount, and HSN; and reads the totals block for subtotal, tax lines, and grand total. For Indian tax invoices it recognizes GSTINs, HSN/SAC columns, and separate CGST, SGST, or IGST amounts. Fillable PDF invoices are handled too: AcroForm field values are merged with the extracted text. When the layout cannot be parsed confidently, the tool falls back to raw page text so you still get usable output.
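To make the GSTIN recognition step concrete, here is a minimal Python sketch of how a GSTIN-shaped token could be found in header text. It is an illustration, not the tool's actual detector: the pattern checks only the published 15-character structure (state code, PAN, entity character, the letter Z, check character) and does not verify the checksum. The function name `find_gstins` and the sample header are hypothetical.

```python
import re

# Structural GSTIN pattern: 2-digit state code, 10-character PAN
# (5 letters, 4 digits, 1 letter), 1 entity character, a literal "Z",
# and 1 check character. This validates shape only, not the checksum.
GSTIN_RE = re.compile(r"\b(\d{2})([A-Z]{5}\d{4}[A-Z])[1-9A-Z]Z[0-9A-Z]\b")


def find_gstins(text: str) -> list[dict[str, str]]:
    """Return every GSTIN-shaped token in text with its state code and PAN."""
    return [
        {"gstin": m.group(0), "state_code": m.group(1), "pan": m.group(2)}
        for m in GSTIN_RE.finditer(text)
    ]


# Hypothetical header line, as it might appear in extracted page text.
header = "ACME TRADERS  GSTIN: 27AAPFU0939F1ZV  Invoice No: INV-0042"
print(find_gstins(header))
```

Because the PAN is embedded in characters 3 to 12 of the GSTIN, a single match yields both identifiers.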
How it works
The PDF is opened in MuPDF and text is extracted with position data for every word. Pages with no embedded text are rasterized and run through on-device OCR automatically. The parser scans the top portion of the first page for label:value pairs and GSTIN/PAN identifiers, identifies the vendor from the largest text in the header area, and locates the customer block near Bill To labels. Line items are reconstructed using bounding-box table detection: column headers are matched against invoice keywords, and each row's words are assigned to columns by horizontal position. Totals are read from the bottom region of the last page. The structured result is serialized to JSON (full schema), Markdown (human-readable tables), or plain text (flattened summary). All processing runs locally in your browser.
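The column-assignment step described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the tool's implementation: words are assumed to arrive already grouped into rows, each word carries only its left and right x-coordinates, and a word goes to the header whose horizontal centre is nearest. The `Word` class and `assign_columns` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x0: float  # left edge of the word's bounding box
    x1: float  # right edge of the word's bounding box


def assign_columns(header: list[Word], row: list[Word]) -> dict[str, str]:
    """Map each word in a row to the header whose centre is horizontally nearest."""
    centres = [(h.text, (h.x0 + h.x1) / 2) for h in header]
    cells: dict[str, list[str]] = {name: [] for name, _ in centres}
    # Process words left to right so multi-word cells keep reading order.
    for word in sorted(row, key=lambda w: w.x0):
        mid = (word.x0 + word.x1) / 2
        name = min(centres, key=lambda c: abs(c[1] - mid))[0]
        cells[name].append(word.text)
    return {name: " ".join(parts) for name, parts in cells.items()}


# Hypothetical coordinates for one header row and one line item.
header = [Word("Description", 40, 110), Word("Qty", 300, 320),
          Word("Rate", 380, 410), Word("Amount", 470, 520)]
row = [Word("Steel", 40, 70), Word("bolts", 74, 102), Word("10", 305, 318),
       Word("12.50", 378, 410), Word("125.00", 480, 520)]
print(assign_columns(header, row))
```

Nearest-centre matching is the simplest rule that works for evenly spaced columns. A long left-aligned description can drift toward the next column, which is one reason a real parser may need column boundaries rather than centres.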
Features
- Structured JSON output with vendor, customer, line items, and totals
- Markdown and plain text export formats
- Indian GST field detection: GSTIN, HSN/SAC, CGST, SGST, IGST
- Automatic OCR for scanned or image-only invoices
- AcroForm field values merged for fillable PDF invoices
- Live preview before download
- 100% in-browser — your invoice never leaves your device
- Falls back to raw page text when layout parsing is uncertain
How to use
1. Upload your invoice
Drag a PDF invoice onto the drop zone. Text-based invoices are read directly; scanned pages are OCR'd automatically.
2. Choose output format
Select JSON for programmatic use, Markdown for readable tables, or plain text for a simple summary.
3. Review the preview
The preview shows a sample of the extracted data. Check that invoice number, line items, and totals look correct.
4. Download the file
Click Read Invoice to generate the output. Download the .json, .md, or .txt file for import into your workflow.
Common use cases
Accounting and bookkeeping import
Extract invoice fields as JSON for import into accounting software, ERP systems, or custom expense-tracking apps.
GST compliance and record keeping
Pull GSTIN, HSN codes, and tax breakdowns from Indian tax invoices for reconciliation and audit trails.
Batch processing preparation
Convert invoices to a consistent JSON schema before feeding them into automated workflows or data pipelines.
Scanned invoice digitization
OCR scanned or photographed invoices and get structured line items and totals instead of raw text dumps.
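For the GST reconciliation use case, a downstream check on the extracted totals might look like the Python sketch below. It is only an illustration: the function name `reconcile` and its parameters are hypothetical and do not reflect the tool's JSON schema, the amounts are assumed to be plain numeric strings already, and the default tolerance of 0.01 is an arbitrary choice that would need widening for invoices with a round-off line.

```python
from decimal import Decimal


def reconcile(subtotal: str, grand_total: str, cgst: str = "0",
              sgst: str = "0", igst: str = "0",
              tolerance: str = "0.01") -> list[str]:
    """Return a list of problems found; an empty list means the totals agree."""
    sub, total = Decimal(subtotal), Decimal(grand_total)
    c, s, i = Decimal(cgst), Decimal(sgst), Decimal(igst)
    tol = Decimal(tolerance)
    problems = []
    # An invoice normally carries either CGST + SGST or IGST, not both.
    if i and (c or s):
        problems.append("IGST appears together with CGST/SGST")
    # CGST and SGST are levied at equal rates, so the amounts should match.
    if abs(c - s) > tol:
        problems.append("CGST and SGST differ")
    if abs(sub + c + s + i - total) > tol:
        problems.append("subtotal plus tax does not equal grand total")
    return problems


print(reconcile("1000.00", "1180.00", cgst="90.00", sgst="90.00"))
print(reconcile("1000.00", "1180.00", igst="170.00"))
```

The first call returns an empty list; the second reports that subtotal plus tax does not equal the grand total, because 1000.00 + 170.00 falls 10.00 short of 1180.00.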
Tips & best practices
- JSON is the most complete format: it includes all detected fields plus the parse confidence and parse method
- Higher-quality scans produce more accurate OCR and cleaner line-item tables
- Unusual invoice layouts may fall back to raw text; try the Extract Text tool for full block-level JSON if needed
- Amounts are kept as text to preserve exact formatting and currency symbols
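Because amounts are kept as text, anything that does arithmetic on them has to convert them first. The Python sketch below shows one way to do that with `Decimal`, which avoids floating-point rounding. It assumes a dot as the decimal separator and commas as grouping separators (including Indian-style grouping such as 1,18,000), and it ignores signs; the helper name `parse_amount` is hypothetical.

```python
import re
from decimal import Decimal

# First run of digits, optional grouping commas, optional decimal part.
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_amount(text: str) -> Decimal:
    """Extract the first numeric amount from text such as '₹1,18,000.50'."""
    match = _NUMBER.search(text)
    if match is None:
        raise ValueError(f"no numeric amount in {text!r}")
    return Decimal(match.group(0).replace(",", ""))


for raw in ["₹1,18,000.50", "Rs. 250.00", "$1,234.56"]:
    print(raw, "->", parse_amount(raw))
```

Searching for the number, rather than stripping everything that is not a digit, keeps prefixes like "Rs." from leaving a stray dot in the value.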