Initial commit

2026-01-01 21:57:33 -08:00
commit d246d2a0d7
6 changed files with 285 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,22 @@
+# Virtual environment
+.venv/
+
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# Distribution / packaging
+build/
+dist/
+*.egg-info/
+
+# Editor dirs
+.vscode/
+.idea/
+
+# macOS
+.DS_Store
+
+# Logs
+*.log
--- a/DEVELOPMENT.md
+++ b/DEVELOPMENT.md
@@ -0,0 +1,86 @@
+# Document Intake (dev notes)
+
+## Project overview
+
+This project ingests PDF files of scanned forms from an input folder, extracts structured data using OCR, and stores the results in a MongoDB database.
+
+## Goals
+
+- Build a reliable pipeline to convert scanned PDF pages into images and run OCR.
+- Extract form fields and normalize values (dates, numbers, names, checkboxes).
+- Validate and transform extracted data into a consistent schema for MongoDB.
+- Make processing resumable and observable (logs, metrics, retry).
+
+## Architecture / Flow
+
+1. Watch input folder for new PDF files (or run batch processor).
+2. Convert each PDF page to image(s) using `pdf2image` (requires `poppler`).
+3. Run OCR on images using `paddleocr` to get text, bounding boxes, and confidence scores.
+4. Parse OCR results to locate fields, using heuristics and templates.
+5. Validate/normalize extracted values.
+6. Insert or upsert documents into MongoDB via `pymongo`.
+7. Move processed PDFs to an archive or error folder.
+
+## Tech stack / dependencies
+
+- Python 3.14 (project venv: `.venv`)
+- OCR: `paddleocr` (and underlying `paddlepaddle`)
+- PDF → image: `pdf2image` (requires `poppler` installed on host)
+- MongoDB client: `pymongo`
+- Image handling: `Pillow`
+- Optional: `python-dotenv` for config, `rich` for nicer logs
+
+System-level requirements:
+
+- `poppler` (for `pdf2image`)
+- Optional GPU drivers if using GPU-enabled `paddlepaddle` builds
+
+## Environment / setup (dev)
+
+1. Create and activate virtualenv (already done in this workspace):
+
+```bash
+python3 -m venv .venv
+source .venv/bin/activate
+```
+
+2. Install Python dependencies from `requirements.txt`:
+
+```bash
+pip install -r requirements.txt
+```
+
+3. Configure a `.env` file for MongoDB connection string and paths (example):
+
+```
+MONGO_URI=mongodb://localhost:27017
+INPUT_DIR=./input
+ARCHIVE_DIR=./archive
+ERROR_DIR=./error
+```
+
+## Data model (draft)
+
+- `documents` collection:
+  - `_id`: UUID
+  - `filename`: original PDF filename
+  - `pages`: list of page objects; each page has `page_number`, `ocr_text`, `fields`
+  - `extracted_fields`: dict of normalized field names -> values
+  - `status`: `pending|processed|error`
+  - `processed_at`, `created_at`
+
+## Next steps / short-term TODOs
+
+- Create `requirements.txt` with pinned packages and document system deps.
+- Add `README.md` with quickstart run instructions.
+- Implement a small prototype script `processor.py` that converts a PDF to images and runs OCR on a single page.
+- Add basic unit tests and a sample PDF in `samples/` for development.
+
+## Notes / caveats
+
+- `paddleocr` may require selecting CPU vs GPU builds of `paddlepaddle` — document preferred build and install instructions in `requirements.txt` or a separate notes section.
+- OCR of scanned forms may require pre-processing (deskewing, denoising) for acceptable accuracy.
+
+---
+
+Created: initial development notes and plan.
--- a/README.md
+++ b/README.md
@@ -0,0 +1,56 @@
+# Document Intake — Quickstart
+
+Prerequisite: you already created the project virtual environment `.venv` in the repository root.
+
+## macOS system deps
+
+Install `poppler` (needed by `pdf2image`):
+
+```bash
+brew install poppler
+```
+
+## Activate the virtual environment
+
+```bash
+source .venv/bin/activate
+```
+
+## Install Python dependencies
+
+```bash
+pip install -r requirements.txt
+```
+
+`requirements.txt` leaves `paddlepaddle` commented out, so install the CPU or GPU build for your platform separately by following the official install guide.
+
+## Quick verification
+
+Check that `paddleocr` and `pymongo` import successfully:
+
+```bash
+python -c "import paddleocr; import pymongo; print('imports OK')"
+```
+
+## Running a processor (prototype)
+
+The prototype script `processor.py`:
+
+- Converts the first page of a PDF to an image using `pdf2image`.
+- Runs OCR on that page with `paddleocr`.
+- Prints the raw OCR result.
+
+To run the prototype (`samples/example.pdf` is a placeholder path):
+
+```bash
+python processor.py --input samples/example.pdf
+```
+
+## Useful files
+
+- Development notes: [DEVELOPMENT.md](DEVELOPMENT.md)
+- Python dependencies: [requirements.txt](requirements.txt)
+
+---
+
+`processor.py` reads defaults from `conf.yaml` (override the path with `--conf`); `--input` takes precedence over the `input_file` and `input_dir` settings.
--- a/conf.yaml
+++ b/conf.yaml
@@ -0,0 +1,11 @@
+## Configuration for the processor prototype
+
+# Either point `input_dir` to a folder containing PDFs, or set `input_file`
+input_dir: ./input
+input_file: ""
+
+archive_dir: ./archive
+error_dir: ./error
+
+# MongoDB connection (optional for prototype)
+mongo_uri: "mongodb://localhost:27017"
--- a/processor.py
+++ b/processor.py
@@ -0,0 +1,82 @@
+#!/usr/bin/env python3
+"""Minimal processor prototype: convert a PDF page to an image and run PaddleOCR.
+
+This script is intentionally small and defensive: it checks for missing
+dependencies and prints actionable instructions.
+"""
+from __future__ import annotations
+
+import argparse
+import os
+import sys
+
+try:
+    import yaml
+    from pdf2image import convert_from_path
+    from paddleocr import PaddleOCR
+    import numpy as np
+except Exception as exc:  # pragma: no cover - runtime dependency guard
+    print("Dependency error:", exc)
+    print("Please install requirements: pip install -r requirements.txt")
+    sys.exit(1)
+
+
+def process_pdf_first_page(pdf_path: str) -> list:
+    pages = convert_from_path(pdf_path, first_page=1, last_page=1)
+    if not pages:
+        raise RuntimeError("No pages returned from pdf2image")
+    img = pages[0]
+    ocr = PaddleOCR(use_angle_cls=True, lang='en')
+    # Written against the PaddleOCR 2.x API, which accepts numpy arrays for in-memory images
+    result = ocr.ocr(np.array(img), cls=True)
+    return result
+
+
+def load_conf(path: str) -> dict:
+    with open(path, "r", encoding="utf-8") as fh:
+        return yaml.safe_load(fh) or {}
+
+
+def main() -> None:
+    p = argparse.ArgumentParser(description="Processor prototype")
+    p.add_argument("--conf", default="conf.yaml", help="Path to conf.yaml")
+    p.add_argument(
+        "--input", help="Path to input PDF or input directory (overrides conf)")
+    args = p.parse_args()
+
+    conf = {}
+    if os.path.exists(args.conf):
+        conf = load_conf(args.conf)
+
+    input_spec = args.input or conf.get("input_file") or conf.get("input_dir")
+    if not input_spec:
+        print("No input specified. Set --input or define 'input_file' / 'input_dir' in conf.yaml")
+        sys.exit(2)
+
+    if os.path.isdir(input_spec):
+        for fname in sorted(os.listdir(input_spec)):
+            if fname.lower().endswith(".pdf"):
+                input_spec = os.path.join(input_spec, fname)
+                break
+        else:
+            print("No PDF files found in directory:", input_spec)
+            sys.exit(3)
+
+    if not os.path.exists(input_spec):
+        print("Input path does not exist:", input_spec)
+        sys.exit(4)
+
+    print("Processing:", input_spec)
+    try:
+        res = process_pdf_first_page(input_spec)
+    except Exception as exc:
+        print("Processing error:", exc)
+        sys.exit(5)
+
+    print("OCR result (first page):")
+    for block in res:
+        print(block)
+
+
+if __name__ == "__main__":
+    main()
--- a/requirements.txt
+++ b/requirements.txt
@@ -0,0 +1,28 @@
+# Core OCR
+paddleocr>=2.7.0,<3.0  # processor.py uses the 2.x-style `ocr(..., cls=True)` call
+# PaddlePaddle (choose CPU or GPU build appropriate for your system)
+# For CPU-only install: `pip install paddlepaddle` or follow the official install guide
+#paddlepaddle>=2.5.0
+
+# PDF -> image
+pdf2image>=1.16.0
+Pillow>=10.0.0
+
+# Database
+pymongo>=4.4.0
+
+# Utilities
+python-dotenv>=1.0.0
+rich>=13.0.0
+tqdm>=4.65.0
+
+# Testing / linting (optional)
+pytest>=8.0.0
+mypy>=1.5.0
+
+# YAML config parsing
+PyYAML>=6.0
+
+# Notes:
+# - `pdf2image` requires the `poppler` system package (e.g. `brew install poppler` on macOS).
+# - Select the appropriate `paddlepaddle` wheel for your platform (CPU vs GPU, macOS vs Linux).
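
Step 4 of the flow in `DEVELOPMENT.md` ("Parse OCR results to locate fields") is not implemented in this commit; `processor.py` only prints the raw result. As a possible starting point, the sketch below flattens a result into `(text, confidence)` pairs. It assumes the PaddleOCR 2.x result layout (one list per page, each entry `[box, (text, confidence)]`, with `None` for a page where nothing was detected). The helper name `flatten_ocr_result`, the `min_conf` parameter, and the sample data are hypothetical and not part of the repository.

```python
from __future__ import annotations

from typing import Any


def flatten_ocr_result(
    result: list[Any] | None, min_conf: float = 0.0
) -> list[tuple[str, float]]:
    """Return (text, confidence) pairs from a PaddleOCR 2.x-style result.

    Assumed layout: result[page] is a list of [box, (text, confidence)]
    entries, or None when no text was detected on that page.
    """
    lines: list[tuple[str, float]] = []
    for page in result or []:
        if not page:  # skip pages reported as None or empty
            continue
        for _box, (text, conf) in page:
            if conf >= min_conf:
                lines.append((text, float(conf)))
    return lines


# Hand-written sample in the assumed layout (not real OCR output).
sample = [[
    [[[10, 10], [120, 10], [120, 30], [10, 30]], ("Name: Jane Doe", 0.98)],
    [[[10, 40], [90, 40], [90, 60], [10, 60]], ("D0B 01/02/1990", 0.41)],
]]

for text, conf in flatten_ocr_result(sample, min_conf=0.5):
    print(f"{conf:.2f}  {text}")
```

Dropping low-confidence lines early keeps the later field heuristics simpler, though a real pipeline would probably keep the bounding boxes as well, since templates usually locate fields by position.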