Document Intake (dev notes)

Project overview

This project ingests PDF files of scanned forms from an input folder, extracts structured data using OCR, and stores the results in a MongoDB database.

Goals

Build a reliable pipeline to convert scanned PDF pages into images and run OCR.
Extract form fields and normalize values (dates, numbers, names, checkboxes).
Validate and transform extracted data into a consistent schema for MongoDB.
Make processing resumable and observable (logs, metrics, retry).
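
The normalization goal above could look roughly like the sketch below. It is only an illustration: the function names, the accepted date layouts (day-first for slashed and dotted dates), and the checkbox tokens are assumptions, not decisions made in these notes.

```python
import re
from datetime import date, datetime
from typing import Optional

# Hypothetical list of date layouts; adjust to whatever the real forms use.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def normalize_date(raw: str) -> Optional[date]:
    """Try each known layout in order; return None if none of them match."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def normalize_number(raw: str) -> Optional[float]:
    """Parse a number, assuming ',' is a thousands separator and '.' the decimal point."""
    cleaned = re.sub(r"[,\s]", "", raw)
    try:
        return float(cleaned)
    except ValueError:
        return None


def normalize_checkbox(raw: str) -> Optional[bool]:
    """Map common OCR renderings of a checkbox to True/False; None means 'unclear'."""
    token = raw.strip().lower()
    if token in {"x", "yes", "y", "true", "checked"}:
        return True
    if token in {"", "no", "n", "false", "unchecked"}:
        return False
    return None
```

Returning None instead of raising keeps unparseable values visible to the validation step, which can then decide whether the document goes to the error folder.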

Architecture / Flow

Watch input folder for new PDF files (or run batch processor).
Convert each PDF page to image(s) using pdf2image (requires poppler).
Run OCR on images using paddleocr to get recognized text, text-region bounding boxes, and confidence scores.
Parse OCR results to locate fields, using heuristics and templates.
Validate/normalize extracted values.
Insert or upsert documents into MongoDB via pymongo.
Move processed PDFs to an archive or error folder.
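
The batch variant of this flow, including the final move to the archive or error folder, might be sketched as follows. The `run_batch` name and the `process` callable are hypothetical; `process` stands in for the convert, OCR, parse, validate, and store steps, which are not implemented here.

```python
import logging
import shutil
from pathlib import Path
from typing import Callable, Dict

log = logging.getLogger("intake")


def run_batch(
    input_dir: Path,
    archive_dir: Path,
    error_dir: Path,
    process: Callable[[Path], None],
) -> Dict[str, int]:
    """Run `process` on every PDF in input_dir, then file it by outcome."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    error_dir.mkdir(parents=True, exist_ok=True)
    counts = {"processed": 0, "error": 0}

    # Note: "*.pdf" is case-sensitive on most Linux filesystems.
    for pdf_path in sorted(input_dir.glob("*.pdf")):
        try:
            process(pdf_path)  # placeholder for convert -> OCR -> parse -> store
        except Exception:
            log.exception("Failed to process %s", pdf_path.name)
            target_dir, outcome = error_dir, "error"
        else:
            target_dir, outcome = archive_dir, "processed"

        # Name collisions in the target folder are not handled in this sketch.
        shutil.move(str(pdf_path), str(target_dir / pdf_path.name))
        counts[outcome] += 1

    return counts
```

Because each file leaves the input folder as soon as it has been handled, rerunning the batch after a crash only picks up the files that were not yet processed.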

Tech stack / dependencies

Python 3.14 (project venv: .venv)
OCR: paddleocr (and underlying paddlepaddle)
PDF → image: pdf2image (requires poppler installed on host)
MongoDB client: pymongo
Image handling: Pillow
Optional: python-dotenv for config, rich for nicer logs

System-level requirements:

poppler (for pdf2image)
Optional GPU drivers if using GPU-enabled paddlepaddle builds

Environment / setup (dev)

Create and activate virtualenv (already done in this workspace):

python3 -m venv .venv
source .venv/bin/activate

Install Python dependencies (requirements.txt does not exist yet; see the TODOs below):

pip install -r requirements.txt

Configure a .env file for MongoDB connection string and paths (example):

MONGO_URI=mongodb://localhost:27017
INPUT_DIR=./input
ARCHIVE_DIR=./archive
ERROR_DIR=./error
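
One possible way to read these values in code is sketched below. The `Settings` class and `load_settings` function are hypothetical names, and the fallback defaults simply mirror the example values above.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    mongo_uri: str
    input_dir: Path
    archive_dir: Path
    error_dir: Path


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, falling back to the example defaults."""
    return Settings(
        mongo_uri=env.get("MONGO_URI", "mongodb://localhost:27017"),
        input_dir=Path(env.get("INPUT_DIR", "./input")),
        archive_dir=Path(env.get("ARCHIVE_DIR", "./archive")),
        error_dir=Path(env.get("ERROR_DIR", "./error")),
    )
```

If python-dotenv is installed, calling `load_dotenv()` from the `dotenv` package before `load_settings()` copies the `.env` entries into `os.environ`; passing a plain dict instead keeps the function easy to test.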

Data model (draft)

documents collection:
- _id: UUID
- filename: original PDF filename
- pages: list of page objects; each page has page_number, ocr_text, fields
- extracted_fields: dict of normalized field names -> values
- status: pending|processed|error
- processed_at, created_at: timestamps
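
As a rough illustration of this draft, the helpers below build such a document and the arguments for an upsert. All function names are hypothetical. Two further assumptions: the UUID is stored as a string (which sidesteps pymongo's UUID representation setting), and the upsert is keyed on filename.

```python
import uuid
from datetime import datetime, timezone
from typing import Any, Dict, List, Tuple


def new_document(filename: str) -> Dict[str, Any]:
    """Create a pending document following the draft data model."""
    return {
        "_id": str(uuid.uuid4()),  # assumption: UUID stored as a string
        "filename": filename,
        "pages": [],
        "extracted_fields": {},
        "status": "pending",
        "created_at": datetime.now(timezone.utc),
        "processed_at": None,
    }


def mark_processed(
    doc: Dict[str, Any], pages: List[Dict[str, Any]], fields: Dict[str, Any]
) -> Dict[str, Any]:
    """Return a copy of doc with OCR results attached and status 'processed'."""
    return {
        **doc,
        "pages": pages,
        "extracted_fields": fields,
        "status": "processed",
        "processed_at": datetime.now(timezone.utc),
    }


def upsert_args(doc: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Build (filter, update) so _id and created_at are only written on insert."""
    insert_only = ("_id", "created_at")
    skip = insert_only + ("filename",)  # filename comes from the filter
    return (
        {"filename": doc["filename"]},
        {
            "$set": {k: v for k, v in doc.items() if k not in skip},
            "$setOnInsert": {k: doc[k] for k in insert_only},
        },
    )
```

With pymongo, the pair would be passed as `collection.update_one(flt, upd, upsert=True)`, so reprocessing the same file updates the existing record instead of creating a duplicate.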

Next steps / short-term TODOs

Create requirements.txt with pinned packages and document system deps.
Add README.md with quickstart run instructions.
Implement a small prototype script processor.py that converts a PDF to images and runs OCR on a single page.
Add basic unit tests and a sample PDF in samples/ for development.

Notes / caveats

paddleocr depends on paddlepaddle, which ships separate CPU and GPU builds. Document the preferred build and its install instructions in requirements.txt or a separate notes section.
OCR of scanned forms may require pre-processing (deskewing, denoising) for acceptable accuracy.

Created: initial development notes and plan.
