Initial commit

2026-01-01 21:57:33 -08:00
commit d246d2a0d7
6 changed files with 285 additions and 0 deletions
--- a/DEVELOPMENT.md
+++ b/DEVELOPMENT.md
@@ -0,0 +1,86 @@
+# Document Intake (dev notes)
+
+## Project overview
+
+This project ingests PDF files of scanned forms from an input folder, extracts structured data using OCR, and stores the results in a MongoDB database.
+
+## Goals
+
+- Build a reliable pipeline to convert scanned PDF pages into images and run OCR.
+- Extract form fields and normalize values (dates, numbers, names, checkboxes).
+- Validate and transform extracted data into a consistent schema for MongoDB.
+- Make processing resumable and observable (logs, metrics, retry).
+
+## Architecture / Flow
+
+1. Watch input folder for new PDF files (or run batch processor).
+2. Convert each PDF page to image(s) using `pdf2image` (requires `poppler`).
+3. Run OCR on images using `paddleocr` to get text, bounding boxes, and confidence scores.
+4. Parse OCR results to locate fields, using heuristics and templates.
+5. Validate/normalize extracted values.
+6. Insert or upsert documents into MongoDB via `pymongo`.
+7. Move processed PDFs to an archive or error folder.
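+
+A minimal sketch of step 5 (validate/normalize), using only the standard library. The accepted date formats, the set of "checked" marks, and all function names are assumptions for illustration; the real forms may need different rules.
+
+```python
+import re
+from datetime import datetime
+
+# Hypothetical input formats; adjust to the forms actually being scanned.
+DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y")
+CHECKED_MARKS = {"x", "✓", "✔", "yes", "y", "true"}
+
+
+def normalize_date(raw):
+    """Return an ISO 8601 date string, or None if no known format matches."""
+    text = raw.strip()
+    for fmt in DATE_FORMATS:
+        try:
+            return datetime.strptime(text, fmt).date().isoformat()
+        except ValueError:
+            continue
+    return None
+
+
+def normalize_number(raw):
+    """Drop currency symbols and thousands separators, then parse a float."""
+    cleaned = re.sub(r"[^0-9.\-]", "", raw)
+    try:
+        return float(cleaned)
+    except ValueError:
+        return None
+
+
+def normalize_checkbox(raw):
+    """Map common OCR renderings of a ticked box to True."""
+    return raw.strip().lower() in CHECKED_MARKS
+```
+
+Returning `None` rather than raising lets the caller decide whether an unparseable value marks the whole document as `error` or only flags that field.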
+
+## Tech stack / dependencies
+
+- Python 3.14 (project venv: `.venv`)
+- OCR: `paddleocr` (and underlying `paddlepaddle`)
+- PDF → image: `pdf2image` (requires `poppler` installed on host)
+- MongoDB client: `pymongo`
+- Image handling: `Pillow`
+- Optional: `python-dotenv` for config, `rich` for nicer logs
+
+System-level requirements:
+
+- `poppler` (for `pdf2image`)
+- Optional GPU drivers if using GPU-enabled `paddlepaddle` builds
+
+## Environment / setup (dev)
+
+1. Create and activate virtualenv (already done in this workspace):
+
+```bash
+python3 -m venv .venv
+source .venv/bin/activate
+```
+
+2. Install Python dependencies (`requirements.txt` does not exist yet; see the TODOs below):
+
+```bash
+pip install -r requirements.txt
+```
+
+3. Configure a `.env` file for MongoDB connection string and paths (example):
+
+```
+MONGO_URI=mongodb://localhost:27017
+INPUT_DIR=./input
+ARCHIVE_DIR=./archive
+ERROR_DIR=./error
+```
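+
+One way the settings above could be read, sketched with the standard library only. `python-dotenv` (listed as optional) would replace the hand-written parser; `load_env_file`, `get_settings`, and the default values are hypothetical names and choices, not existing project code.
+
+```python
+import os
+from pathlib import Path
+
+
+def load_env_file(path=".env"):
+    """Parse KEY=VALUE lines; blank lines and # comments are skipped."""
+    values = {}
+    env_path = Path(path)
+    if not env_path.exists():
+        return values
+    for line in env_path.read_text(encoding="utf-8").splitlines():
+        line = line.strip()
+        if not line or line.startswith("#") or "=" not in line:
+            continue
+        key, _, value = line.partition("=")
+        values[key.strip()] = value.strip()
+    return values
+
+
+def get_settings(path=".env"):
+    """Real environment variables take precedence over the file."""
+    file_values = load_env_file(path)
+
+    def pick(name, default):
+        return os.environ.get(name, file_values.get(name, default))
+
+    return {
+        "mongo_uri": pick("MONGO_URI", "mongodb://localhost:27017"),
+        "input_dir": Path(pick("INPUT_DIR", "./input")),
+        "archive_dir": Path(pick("ARCHIVE_DIR", "./archive")),
+        "error_dir": Path(pick("ERROR_DIR", "./error")),
+    }
+```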
+
+## Data model (draft)
+
+- `documents` collection:
+  - `_id`: UUID
+  - `filename`: original PDF filename
+  - `pages`: list of page objects; each page has `page_number`, `ocr_text`, `fields`
+  - `extracted_fields`: dict of normalized field names -> values
+  - `status`: one of `pending`, `processed`, `error`
+  - `processed_at`, `created_at`: timestamps
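+
+A rough sketch of how a record in this shape might be built and upserted. Storing `_id` as a UUID string, keying the upsert on `filename`, and the helper names are all assumptions; `collection` only needs to behave like a `pymongo` `Collection` (its `update_one(filter, update, upsert=True)` method).
+
+```python
+import uuid
+from datetime import datetime, timezone
+
+
+def new_document(filename):
+    """Build a `documents` record in its initial `pending` state."""
+    return {
+        "_id": str(uuid.uuid4()),  # assumption: UUID stored as a string
+        "filename": filename,
+        "pages": [],
+        "extracted_fields": {},
+        "status": "pending",
+        "created_at": datetime.now(timezone.utc),
+        "processed_at": None,
+    }
+
+
+def add_page(doc, page_number, ocr_text, fields):
+    """Append one page and merge its fields into the document-level dict."""
+    doc["pages"].append(
+        {"page_number": page_number, "ocr_text": ocr_text, "fields": fields}
+    )
+    doc["extracted_fields"].update(fields)
+
+
+def mark_processed(doc):
+    doc["status"] = "processed"
+    doc["processed_at"] = datetime.now(timezone.utc)
+
+
+def upsert_document(collection, doc):
+    """Upsert keyed on filename so re-running a file does not duplicate it."""
+    on_insert = {"_id": doc["_id"], "created_at": doc["created_at"]}
+    updates = {k: v for k, v in doc.items() if k not in on_insert}
+    return collection.update_one(
+        {"filename": doc["filename"]},
+        {"$set": updates, "$setOnInsert": on_insert},
+        upsert=True,
+    )
+```
+
+Keeping `_id` and `created_at` in `$setOnInsert` means a retry refreshes the extracted data without changing the record's identity or creation time.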
+
+## Next steps / short-term TODOs
+
+- Create `requirements.txt` with pinned packages and document system deps.
+- Add `README.md` with quickstart run instructions.
+- Implement a small prototype script `processor.py` that converts a PDF to images and runs OCR on a single page.
+- Add basic unit tests and a sample PDF in `samples/` for development.
+
+## Notes / caveats
+
+- `paddleocr` depends on `paddlepaddle`, which ships separate CPU and GPU builds. Document the preferred build and its install instructions alongside `requirements.txt` or in a separate notes section.
+- OCR of scanned forms may require pre-processing (deskewing, denoising) for acceptable accuracy.
+
+---
+
+Created: initial development notes and plan.