# Document Intake (dev notes)

## Project overview

This project ingests PDF files of scanned forms from an input folder, extracts structured data using OCR, and stores the results in a MongoDB database.

## Goals

- Build a reliable pipeline to convert scanned PDF pages into images and run OCR.
- Extract form fields and normalize values (dates, numbers, names, checkboxes).
- Validate and transform extracted data into a consistent schema for MongoDB.
- Make processing resumable and observable (logs, metrics, retry).
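
As a rough illustration of the normalization goal, here is a minimal sketch using only the standard library. The date layouts, the checkbox tokens, and the function names (`normalize_date`, `normalize_number`, `normalize_checkbox`) are assumptions made for this example; the real forms may need different rules, and name normalization is not covered.

```python
import re
from datetime import date, datetime
from decimal import Decimal, InvalidOperation
from typing import Optional

# Assumed date layouts; the real forms may use others.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")

# Assumed OCR renderings of ticked and empty checkboxes.
CHECKED = {"x", "yes", "y", "true", "☑", "☒", "✓"}
UNCHECKED = {"", "no", "n", "false", "☐"}


def normalize_date(raw: str) -> Optional[date]:
    """Return a date for the first matching layout, or None if none match."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def normalize_number(raw: str) -> Optional[Decimal]:
    """Parse a number written with ',' thousands separators and '.' decimals."""
    cleaned = re.sub(r"[,\s]", "", raw)
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    # Reject NaN and infinities, which Decimal would otherwise accept.
    return value if value.is_finite() else None


def normalize_checkbox(raw: str) -> Optional[bool]:
    """Map a checkbox token to True/False, or None if it is unrecognized."""
    token = raw.strip().lower()
    if token in CHECKED:
        return True
    if token in UNCHECKED:
        return False
    return None
```

Returning `None` for unparseable input, rather than raising, lets the validation step decide whether a missing value is an error for that field.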

## Architecture / Flow

1. Watch input folder for new PDF files (or run batch processor).
2. Convert each PDF page to image(s) using `pdf2image` (requires `poppler`).
3. Run OCR on images using `paddleocr` to get recognized text, bounding boxes, and confidence scores.
4. Parse OCR results to locate fields, using heuristics and templates.
5. Validate/normalize extracted values.
6. Insert or upsert documents into MongoDB via `pymongo`.
7. Move processed PDFs to an archive or error folder.
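
The flow above could be wired together roughly as follows. This is a sketch, not the planned implementation: the stage functions (`render_pages`, `run_ocr`, `parse_fields`, `store`) are hypothetical callables passed in by the caller, so the skeleton makes no assumptions about the `pdf2image`, `paddleocr`, or `pymongo` APIs.

```python
import logging
import shutil
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, List

logger = logging.getLogger("docintake")


def process_pdf(
    pdf_path: Path,
    archive_dir: Path,
    error_dir: Path,
    render_pages: Callable[[Path], Iterable[Any]],
    run_ocr: Callable[[Any], Any],
    parse_fields: Callable[[List[Any]], Dict[str, Any]],
    store: Callable[[Dict[str, Any]], None],
) -> bool:
    """Run one PDF through steps 2-7; return True if it was archived."""
    try:
        pages = [run_ocr(image) for image in render_pages(pdf_path)]  # steps 2-3
        fields = parse_fields(pages)  # steps 4-5
        store({"filename": pdf_path.name, "extracted_fields": fields})  # step 6
    except Exception:
        # Any stage failure sends the file to the error folder.
        logger.exception("failed to process %s", pdf_path.name)
        target, ok = error_dir, False
    else:
        target, ok = archive_dir, True
    target.mkdir(parents=True, exist_ok=True)
    shutil.move(str(pdf_path), str(target / pdf_path.name))  # step 7
    return ok


def run_batch(input_dir: Path, **stages: Any) -> Dict[str, bool]:
    """Process every PDF currently in input_dir (step 1, batch mode)."""
    results = {}
    # sorted() materializes the listing before any file is moved.
    for pdf_path in sorted(input_dir.glob("*.pdf")):
        results[pdf_path.name] = process_pdf(pdf_path, **stages)
    return results
```

Passing the stages in as callables keeps the skeleton testable with fakes before the OCR and database pieces exist.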

## Tech stack / dependencies

- Python 3.14 (project venv: `.venv`)
- OCR: `paddleocr` (and underlying `paddlepaddle`)
- PDF → image: `pdf2image` (requires `poppler` installed on host)
- MongoDB client: `pymongo`
- Image handling: `Pillow`
- Optional: `python-dotenv` for config, `rich` for nicer logs

System-level requirements:

- `poppler` (for `pdf2image`)
- Optional GPU drivers if using GPU-enabled `paddlepaddle` builds

## Environment / setup (dev)

1. Create and activate virtualenv (already done in this workspace):

```bash
python3 -m venv .venv
source .venv/bin/activate
```

2. Install Python dependencies (`requirements.txt` does not exist yet; creating it is on the TODO list below):

```bash
pip install -r requirements.txt
```

3. Create a `.env` file with the MongoDB connection string and folder paths, for example:

```
MONGO_URI=mongodb://localhost:27017
INPUT_DIR=./input
ARCHIVE_DIR=./archive
ERROR_DIR=./error
```
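
One possible way to read these settings in code is sketched below. The `Settings` class and its defaults are assumptions for illustration. The sketch reads from `os.environ`, so if `python-dotenv` is used, its `load_dotenv()` would need to run first to copy the `.env` values into the environment.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container mirroring the .env keys above."""

    mongo_uri: str
    input_dir: Path
    archive_dir: Path
    error_dir: Path

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "Settings":
        # Accept an explicit mapping so tests do not depend on os.environ.
        env = os.environ if env is None else env
        return cls(
            mongo_uri=env.get("MONGO_URI", "mongodb://localhost:27017"),
            input_dir=Path(env.get("INPUT_DIR", "./input")),
            archive_dir=Path(env.get("ARCHIVE_DIR", "./archive")),
            error_dir=Path(env.get("ERROR_DIR", "./error")),
        )
```

A frozen dataclass keeps the configuration read-only after startup, which avoids stages silently changing paths mid-run.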

## Data model (draft)

- `documents` collection:
  - `_id`: UUID
  - `filename`: original PDF filename
  - `pages`: list of page objects; each page has `page_number`, `ocr_text`, `fields`
  - `extracted_fields`: dict of normalized field names -> values
  - `status`: `pending|processed|error`
  - `created_at`, `processed_at`: timestamps for when the record was created and when processing finished
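
To make the draft concrete, here is a sketch of how such a document might be built in plain Python. The helper names (`new_document`, `add_page`, `mark`) are hypothetical. The sketch also assumes the UUID is stored as a string, because PyMongo 4 refuses to encode native `uuid.UUID` values unless a `uuidRepresentation` is configured on the client.

```python
import uuid
from datetime import datetime, timezone
from typing import Any, Dict, Optional

VALID_STATUSES = {"pending", "processed", "error"}


def new_document(filename: str, now: Optional[datetime] = None) -> Dict[str, Any]:
    """Create a pending record for one PDF."""
    return {
        "_id": str(uuid.uuid4()),  # assumption: UUID stored as a string
        "filename": filename,
        "pages": [],
        "extracted_fields": {},
        "status": "pending",
        "created_at": now or datetime.now(timezone.utc),
        "processed_at": None,
    }


def add_page(
    doc: Dict[str, Any], page_number: int, ocr_text: str, fields: Dict[str, Any]
) -> None:
    """Append one page object in the shape described by the draft."""
    doc["pages"].append(
        {"page_number": page_number, "ocr_text": ocr_text, "fields": fields}
    )


def mark(
    doc: Dict[str, Any], status: str, now: Optional[datetime] = None
) -> Dict[str, Any]:
    """Set the status; record processed_at when processing succeeds."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    doc["status"] = status
    if status == "processed":
        doc["processed_at"] = now or datetime.now(timezone.utc)
    return doc
```

With `pymongo`, a record like this could then be upserted with `collection.replace_one({"_id": doc["_id"]}, doc, upsert=True)`, which makes re-processing the same record idempotent.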

## Next steps / short-term TODOs

- Create `requirements.txt` with pinned packages and document system deps.
- Add `README.md` with quickstart run instructions.
- Implement a small prototype script `processor.py` that converts a PDF to images and runs OCR on a single page.
- Add basic unit tests and a sample PDF in `samples/` for development.

## Notes / caveats

- `paddleocr` depends on `paddlepaddle`, which ships separate CPU and GPU builds. Decide which build the project targets and document the install instructions in `requirements.txt` or in a separate notes section.
- OCR of scanned forms may require pre-processing (deskewing, denoising) for acceptable accuracy.
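
A minimal pre-processing pass with `Pillow` might look like the sketch below. It covers grayscale conversion, median-filter denoising, and contrast stretching only. Deskewing is deliberately left out, since it needs a skew-angle estimate that this sketch does not attempt. The function name `preprocess` and the default filter size are assumptions.

```python
from PIL import Image, ImageFilter, ImageOps


def preprocess(image: Image.Image, median_size: int = 3) -> Image.Image:
    """Return a cleaned-up grayscale copy of a scanned page.

    median_size must be odd; larger values remove more speckle noise
    but also blur thin strokes.
    """
    gray = ImageOps.grayscale(image)
    denoised = gray.filter(ImageFilter.MedianFilter(size=median_size))
    # Stretch the histogram so faint scans use the full 0-255 range.
    return ImageOps.autocontrast(denoised)
```

Whether this actually improves accuracy depends on the scans, so it is worth comparing OCR confidence with and without it on the sample PDFs.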

---

Created: initial development notes and plan.