docintake-gm/DEVELOPMENT.md
2026-01-01 21:57:33 -08:00

Document Intake (dev notes)

Project overview

This project ingests PDF files of scanned forms from an input folder, extracts structured data using OCR, and stores the results in a MongoDB database.

Goals

  • Build a reliable pipeline to convert scanned PDF pages into images and run OCR.
  • Extract form fields and normalize values (dates, numbers, names, checkboxes).
  • Validate and transform extracted data into a consistent schema for MongoDB.
  • Make processing resumable and observable (logs, metrics, retry).
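To make the normalization goal concrete, here is a minimal sketch using only the standard library. The helper names, the accepted date formats, and the tokens treated as a ticked checkbox are all assumptions for illustration; the real forms will probably need different lists.

```python
from __future__ import annotations

from datetime import datetime
from decimal import Decimal, InvalidOperation

# Assumed input formats; adjust to what the scanned forms actually contain.
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d")
# Assumed OCR renderings of a ticked checkbox.
CHECKED_TOKENS = {"x", "[x]", "yes", "y"}


def normalize_date(raw: str) -> str | None:
    """Return an ISO 8601 date string, or None if no known format matches."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_number(raw: str) -> Decimal | None:
    """Drop currency signs, spaces, and thousands separators, then parse."""
    cleaned = raw.strip().replace(",", "").replace("$", "").replace(" ", "")
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    # Reject "NaN" and "Infinity", which Decimal would otherwise accept.
    return value if value.is_finite() else None


def normalize_checkbox(raw: str) -> bool:
    """Treat a small set of tokens as checked; everything else is unchecked."""
    return raw.strip().lower() in CHECKED_TOKENS


def normalize_name(raw: str) -> str:
    """Collapse runs of whitespace and apply title case."""
    return " ".join(raw.split()).title()
```

Returning None for unparseable values, instead of raising, lets the validation step decide whether a missing field is an error for a given form.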

Architecture / Flow

  1. Watch input folder for new PDF files (or run batch processor).
  2. Convert each PDF page to image(s) using pdf2image (requires poppler).
  3. Run OCR on images using paddleocr to get recognized text, text-region bounding boxes, and confidence scores.
  4. Parse OCR results to locate fields, using heuristics and templates.
  5. Validate/normalize extracted values.
  6. Insert or upsert documents into MongoDB via pymongo.
  7. Move processed PDFs to an archive or error folder.
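The batch variant of this flow could be sketched roughly as follows. The function name `run_batch` and the `process_pdf` callable are hypothetical: `process_pdf` stands in for steps 2 to 6 and is expected to raise on failure. Only steps 1 and 7 are implemented here, with the standard library.

```python
from __future__ import annotations

import logging
import shutil
from pathlib import Path
from typing import Callable

log = logging.getLogger("docintake")


def run_batch(
    input_dir: Path,
    archive_dir: Path,
    error_dir: Path,
    process_pdf: Callable[[Path], None],
) -> dict[str, int]:
    """Process every PDF in input_dir once, then move it out of the way.

    process_pdf is a hypothetical callable covering convert, OCR, parse,
    validate, and store. Any exception routes the file to error_dir.
    """
    archive_dir.mkdir(parents=True, exist_ok=True)
    error_dir.mkdir(parents=True, exist_ok=True)
    counts = {"processed": 0, "error": 0}

    for pdf in sorted(input_dir.glob("*.pdf")):
        try:
            process_pdf(pdf)
        except Exception:
            # Log the traceback but keep going with the remaining files.
            log.exception("Failed to process %s", pdf.name)
            target, key = error_dir, "error"
        else:
            target, key = archive_dir, "processed"
        shutil.move(str(pdf), str(target / pdf.name))
        counts[key] += 1
    return counts
```

Because every file leaves the input folder after one attempt, rerunning the batch after a crash only picks up files that were not yet handled.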

Tech stack / dependencies

  • Python 3.14 (project venv: .venv)
  • OCR: paddleocr (and underlying paddlepaddle)
  • PDF → image: pdf2image (requires poppler installed on host)
  • MongoDB client: pymongo
  • Image handling: Pillow
  • Optional: python-dotenv for config, rich for nicer logs

System-level requirements:

  • poppler (for pdf2image)
  • Optional GPU drivers if using GPU-enabled paddlepaddle builds
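A quick preflight check for the poppler requirement might look like the sketch below. pdf2image shells out to poppler's command-line tools, so looking them up on PATH catches a missing install before any PDF is touched. The function name is an assumption.

```python
from __future__ import annotations

import shutil

# Poppler utilities that pdf2image invokes for conversion and page counts.
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def missing_poppler_tools(path: str | None = None) -> list[str]:
    """Return the poppler executables that cannot be found.

    path overrides the search path; None means use the PATH variable.
    """
    return [tool for tool in POPPLER_TOOLS if shutil.which(tool, path=path) is None]
```

An empty list means both tools were found; anything else can be reported to the user at startup.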

Environment / setup (dev)

  1. Create and activate virtualenv (already done in this workspace):
       python3 -m venv .venv
       source .venv/bin/activate
  2. Install Python dependencies (we will add requirements.txt soon):
       pip install -r requirements.txt
  3. Configure a .env file for MongoDB connection string and paths (example):
       MONGO_URI=mongodb://localhost:27017
       INPUT_DIR=./input
       ARCHIVE_DIR=./archive
       ERROR_DIR=./error
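One way to read these settings from Python is sketched below. The `Settings` class and `load_settings` function are hypothetical names, and the defaults simply mirror the example values above. python-dotenv is treated as optional, in line with the dependency list.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping

try:
    # Optional dependency; without it only real environment variables are read.
    from dotenv import load_dotenv
except ImportError:
    load_dotenv = None


@dataclass(frozen=True)
class Settings:
    mongo_uri: str
    input_dir: Path
    archive_dir: Path
    error_dir: Path


def load_settings(env: Mapping[str, str] | None = None) -> Settings:
    """Build Settings from env, or from the process environment if env is None."""
    if env is None:
        if load_dotenv is not None:
            load_dotenv()  # loads variables from a .env file if one is found
        env = os.environ
    return Settings(
        mongo_uri=env.get("MONGO_URI", "mongodb://localhost:27017"),
        input_dir=Path(env.get("INPUT_DIR", "./input")),
        archive_dir=Path(env.get("ARCHIVE_DIR", "./archive")),
        error_dir=Path(env.get("ERROR_DIR", "./error")),
    )
```

Accepting an explicit mapping keeps the loader easy to unit test without touching the real environment.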

Data model (draft)

  • documents collection:
    • _id: UUID
    • filename: original PDF filename
    • pages: list of page objects; each page has page_number, ocr_text, fields
    • extracted_fields: dict of normalized field names -> values
    • status: pending|processed|error
    • processed_at, created_at: timestamps
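As a sketch of how this draft could map to code, the helpers below build a document and upsert it. Several choices here are assumptions, not part of the draft: the UUID is stored as a string (which avoids having to configure pymongo's UUID representation), timestamps are UTC, and `filename` is used as the upsert key so reprocessing a file updates its existing record.

```python
from __future__ import annotations

import uuid
from datetime import datetime, timezone
from typing import Any


def new_document(filename: str, pages: list[dict[str, Any]]) -> dict[str, Any]:
    """Build a document in the draft shape with status 'pending'."""
    return {
        "_id": str(uuid.uuid4()),  # assumption: UUID stored as a string
        "filename": filename,
        "pages": pages,
        "extracted_fields": {},
        "status": "pending",
        "created_at": datetime.now(timezone.utc),
        "processed_at": None,
    }


def upsert_document(collection: Any, doc: dict[str, Any]) -> None:
    """Upsert doc keyed on filename (an assumed idempotency key).

    collection is expected to be a pymongo Collection. _id and created_at
    are written only on first insert, so reruns keep the original values.
    """
    insert_only = {"_id": doc["_id"], "created_at": doc["created_at"]}
    mutable = {k: v for k, v in doc.items() if k not in insert_only}
    collection.update_one(
        {"filename": doc["filename"]},
        {"$set": mutable, "$setOnInsert": insert_only},
        upsert=True,
    )
```

With a real database, `collection` would come from something like `MongoClient(MONGO_URI)["docintake"]["documents"]`, where the database name `docintake` is made up for this example.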

Next steps / short-term TODOs

  • Create requirements.txt with pinned packages and document system deps.
  • Add README.md with quickstart run instructions.
  • Implement a small prototype script processor.py that converts a PDF to images and runs OCR on a single page.
  • Add basic unit tests and a sample PDF in samples/ for development.
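A first cut of the processor.py prototype could look like the sketch below. It assumes the PaddleOCR 2.x interface, where `ocr()` returns one list per image and each entry has the form `[box, (text, confidence)]`; other paddleocr releases may expose a different call and result layout, so check this against whichever version gets pinned. The function names are hypothetical.

```python
from __future__ import annotations

from typing import Any


def flatten_lines(page_result: list[Any] | None) -> list[dict[str, Any]]:
    """Convert one page of PaddleOCR 2.x-style output into plain dicts.

    Assumes each entry looks like [box, (text, confidence)]. A page with
    no detected text may come back as None, which yields an empty list.
    """
    lines = []
    for box, (text, confidence) in page_result or []:
        lines.append({"text": text, "confidence": float(confidence), "box": box})
    return lines


def ocr_first_page(pdf_path: str, dpi: int = 300) -> list[dict[str, Any]]:
    """Render page 1 of a PDF and return its OCR lines."""
    # Imported lazily so this module can be imported without the heavy deps.
    import numpy as np
    from paddleocr import PaddleOCR
    from pdf2image import convert_from_path

    # convert_from_path returns PIL images; only the first page is rendered.
    images = convert_from_path(pdf_path, dpi=dpi, first_page=1, last_page=1)
    ocr = PaddleOCR(lang="en")  # assumption: English-language forms
    result = ocr.ocr(np.array(images[0]))
    return flatten_lines(result[0])
```

Calling `ocr_first_page("samples/form.pdf")` (a made-up sample path) and printing each line's `text` and `confidence` is enough to judge whether the scans need pre-processing.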

Notes / caveats

  • paddleocr depends on paddlepaddle, which ships separate CPU and GPU builds. Document the preferred build and its install instructions in requirements.txt or a separate notes section.
  • OCR of scanned forms may require pre-processing (deskewing, denoising) for acceptable accuracy.

Created: initial development notes and plan.