Document Intake (dev notes)
Project overview
This project ingests PDF files of scanned forms from an input folder, extracts structured data using OCR, and stores the results in a MongoDB database.
Goals
- Build a reliable pipeline to convert scanned PDF pages into images and run OCR.
- Extract form fields and normalize values (dates, numbers, names, checkboxes).
- Validate and transform extracted data into a consistent schema for MongoDB.
- Make processing resumable and observable (logs, metrics, retry).
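The normalization goal could start from a few small helpers like the ones below. This is only a sketch: the accepted date formats, the set of checkbox marks, and the function names are assumptions, not decisions recorded in these notes.

```python
import re
from datetime import datetime

# Assumed input formats; the real forms may use others.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")

# Assumed strings that count as a ticked checkbox.
CHECKED_MARKS = {"x", "yes", "true", "1"}


def normalize_date(raw):
    """Return an ISO 8601 date string, or None if no assumed format matches."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_number(raw):
    """Parse a decimal number, tolerating spaces and thousands commas."""
    cleaned = re.sub(r"[,\s]", "", raw)
    if not re.fullmatch(r"-?\d+(\.\d+)?", cleaned):
        return None
    return float(cleaned)


def normalize_name(raw):
    """Collapse runs of whitespace that OCR tends to introduce."""
    return " ".join(raw.split())


def normalize_checkbox(raw):
    """Return True when the OCR text looks like a ticked box."""
    return raw.strip().lower() in CHECKED_MARKS
```

Returning None for unparseable values keeps the decision about rejecting or flagging a field in the validation step rather than in the parser.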
Architecture / Flow
- Watch input folder for new PDF files (or run batch processor).
- Convert each PDF page to image(s) using pdf2image (requires poppler).
- Run OCR on images using paddleocr to get text, segmentation, and confidence.
- Parse OCR results to locate fields, using heuristics and templates.
- Validate/normalize extracted values.
- Insert or upsert documents into MongoDB via pymongo.
- Move processed PDFs to an archive or error folder.
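The batch variant of this flow might be organized roughly as follows. The sketch covers only the outer loop and the archive/error move; handle_pdf is a hypothetical callable standing in for the convert, OCR, parse, and store steps, and the function name and return shape are assumptions.

```python
import shutil
from pathlib import Path


def process_folder(input_dir, archive_dir, error_dir, handle_pdf):
    """Run handle_pdf on every PDF in input_dir, then file the PDF away.

    handle_pdf is a hypothetical callable(Path) that would convert pages,
    run OCR, and store the result; it is expected to raise on failure.
    """
    input_dir = Path(input_dir)
    archive_dir = Path(archive_dir)
    error_dir = Path(error_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    error_dir.mkdir(parents=True, exist_ok=True)

    summary = {"processed": [], "error": []}
    for pdf_path in sorted(input_dir.glob("*.pdf")):
        try:
            handle_pdf(pdf_path)
        except Exception:
            # Any failure sends the file to the error folder for inspection.
            target_dir, outcome = error_dir, "error"
        else:
            target_dir, outcome = archive_dir, "processed"
        shutil.move(str(pdf_path), str(target_dir / pdf_path.name))
        summary[outcome].append(pdf_path.name)
    return summary
```

Because every file leaves the input folder once it has been handled, re-running the function only picks up files that have not been processed yet, which is one simple way to make the batch resumable.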
Tech stack / dependencies
- Python 3.14 (project venv: .venv)
- OCR: paddleocr (and underlying paddlepaddle)
- PDF → image: pdf2image (requires poppler installed on host)
- MongoDB client: pymongo
- Image handling: Pillow
- Optional: python-dotenv for config, rich for nicer logs
System-level requirements:
- poppler (for pdf2image)
- Optional GPU drivers if using GPU-enabled paddlepaddle builds
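Since poppler is a host-level dependency that pip cannot install, a quick startup check may save some debugging. The sketch below looks for poppler's pdftoppm and pdfinfo command-line tools on PATH; the function name is an assumption.

```python
import shutil


def check_poppler():
    """Report whether poppler's command-line tools are found on PATH."""
    tools = ("pdftoppm", "pdfinfo")
    return {tool: shutil.which(tool) is not None for tool in tools}


if __name__ == "__main__":
    for tool, found in check_poppler().items():
        print(f"{tool}: {'found' if found else 'MISSING'}")
```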
Environment / setup (dev)
- Create and activate virtualenv (already done in this workspace):
python3 -m venv .venv
source .venv/bin/activate
- Install Python dependencies (we will add requirements.txt soon):
pip install -r requirements.txt
- Configure a .env file for MongoDB connection string and paths (example):
MONGO_URI=mongodb://localhost:27017
INPUT_DIR=./input
ARCHIVE_DIR=./archive
ERROR_DIR=./error
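One possible way to read these values in code is sketched below. It uses only the standard library and falls back to the example values above when a variable is unset; the Settings class and load_settings name are assumptions. If python-dotenv is adopted, calling its load_dotenv() beforehand would copy the .env entries into os.environ.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Settings:
    mongo_uri: str
    input_dir: Path
    archive_dir: Path
    error_dir: Path


def load_settings(env=None):
    """Build Settings from a mapping of variables (os.environ by default)."""
    env = os.environ if env is None else env
    return Settings(
        mongo_uri=env.get("MONGO_URI", "mongodb://localhost:27017"),
        input_dir=Path(env.get("INPUT_DIR", "./input")),
        archive_dir=Path(env.get("ARCHIVE_DIR", "./archive")),
        error_dir=Path(env.get("ERROR_DIR", "./error")),
    )
```

Passing the mapping as a parameter keeps the loader easy to exercise in unit tests without touching the real environment.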
Data model (draft)
documents collection:
- _id: UUID
- filename: original PDF filename
- pages: list of page objects; each page has page_number, ocr_text, fields
- extracted_fields: dict of normalized field names -> values
- status: pending|processed|error
- processed_at, created_at
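As an illustration of the draft schema, a document could be assembled and upserted along these lines. Several details are assumptions rather than decisions from these notes: the UUID is stored as a string, extracted_fields is built by merging per-page fields (later pages win), and the upsert is keyed on filename. The collection argument is expected to be a pymongo collection, whose update_one method accepts upsert=True.

```python
import uuid
from datetime import datetime, timezone


def build_document(filename, pages):
    """Assemble a new document in the draft schema with status 'pending'."""
    merged = {}
    for page in pages:
        # Assumption: later pages override earlier ones on name clashes.
        merged.update(page.get("fields", {}))
    return {
        "_id": str(uuid.uuid4()),  # assumption: UUID stored as a string
        "filename": filename,
        "pages": pages,
        "extracted_fields": merged,
        "status": "pending",
        "created_at": datetime.now(timezone.utc),
        "processed_at": None,
    }


def upsert_document(collection, doc):
    """Upsert by filename so re-running a PDF does not create a duplicate."""
    insert_only = ("_id", "created_at")
    changes = {
        key: value
        for key, value in doc.items()
        if key not in insert_only and key != "filename"
    }
    collection.update_one(
        {"filename": doc["filename"]},
        {
            "$set": changes,
            "$setOnInsert": {key: doc[key] for key in insert_only},
        },
        upsert=True,
    )
```

Keeping _id and created_at in $setOnInsert means a re-run refreshes the extracted data while the original identifier and creation time are preserved.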
Next steps / short-term TODOs
- Create requirements.txt with pinned packages and document system deps.
- Add README.md with quickstart run instructions.
- Implement a small prototype script processor.py that converts a PDF to images and runs OCR on a single page.
- Add basic unit tests and a sample PDF in samples/ for development.
Notes / caveats
- paddleocr may require selecting CPU vs GPU builds of paddlepaddle; document the preferred build and install instructions in requirements.txt or a separate notes section.
- OCR of scanned forms may require pre-processing (deskewing, denoising) for acceptable accuracy.
Created: initial development notes and plan.