# Document Intake (dev notes) ## Project overview This project ingests PDF files of scanned forms from an input folder, extracts structured data using OCR, and stores the results in a MongoDB database. ## Goals - Build a reliable pipeline to convert scanned PDF pages into images and run OCR. - Extract form fields and normalize values (dates, numbers, names, checkboxes). - Validate and transform extracted data into a consistent schema for MongoDB. - Make processing resumable and observable (logs, metrics, retry). ## Architecture / Flow 1. Watch input folder for new PDF files (or run batch processor). 2. Convert each PDF page to image(s) using `pdf2image` (requires `poppler`). 3. Run OCR on images using `paddleocr` to get text, segmentation, and confidence. 4. Parse OCR results to locate fields, using heuristics and templates. 5. Validate/normalize extracted values. 6. Insert or upsert documents into MongoDB via `pymongo`. 7. Move processed PDFs to an archive or error folder. ## Tech stack / dependencies - Python 3.14 (project venv: `.venv`) - OCR: `paddleocr` (and underlying `paddlepaddle`) - PDF → image: `pdf2image` (requires `poppler` installed on host) - MongoDB client: `pymongo` - Image handling: `Pillow` - Optional: `python-dotenv` for config, `rich` for nicer logs System-level requirements: - `poppler` (for `pdf2image`) - Optional GPU drivers if using GPU-enabled `paddlepaddle` builds ## Environment / setup (dev) 1. Create and activate virtualenv (already done in this workspace): ```bash python3 -m venv .venv source .venv/bin/activate ``` 2. Install Python dependencies (we will add `requirements.txt` soon): ```bash pip install -r requirements.txt ``` 3. Configure a `.env` file for MongoDB connection string and paths (example): ``` MONGO_URI=mongodb://localhost:27017 INPUT_DIR=./input ARCHIVE_DIR=./archive ERROR_DIR=./error ``` ## Data model (draft) - `documents` collection: - `_id`: UUID - `filename`: original PDF filename - `pages`: list of page objects; each page has `page_number`, `ocr_text`, `fields` - `extracted_fields`: dict of normalized field names -> values - `status`: `pending|processed|error` - `processed_at`, `created_at` ## Next steps / short-term TODOs - Create `requirements.txt` with pinned packages and document system deps. - Add `README.md` with quickstart run instructions. - Implement a small prototype script `processor.py` that converts a PDF to images and runs OCR on a single page. - Add basic unit tests and a sample PDF in `samples/` for development. ## Notes / caveats - `paddleocr` may require selecting CPU vs GPU builds of `paddlepaddle` — document preferred build and install instructions in `requirements.txt` or a separate notes section. - OCR of scanned forms may require pre-processing (deskewing, denoising) for acceptable accuracy. --- Created: initial development notes and plan.