Files
docintake-gm/DEVELOPMENT.md
2026-01-01 21:57:33 -08:00

87 lines
2.8 KiB
Markdown

# Document Intake (dev notes)
## Project overview
This project ingests PDF files of scanned forms from an input folder, extracts structured data using OCR, and stores the results in a MongoDB database.
## Goals
- Build a reliable pipeline to convert scanned PDF pages into images and run OCR.
- Extract form fields and normalize values (dates, numbers, names, checkboxes).
- Validate and transform extracted data into a consistent schema for MongoDB.
- Make processing resumable and observable (logs, metrics, retry).
## Architecture / Flow
1. Watch input folder for new PDF files (or run batch processor).
2. Convert each PDF page to image(s) using `pdf2image` (requires `poppler`).
3. Run OCR on images using `paddleocr` to get text, segmentation, and confidence.
4. Parse OCR results to locate fields, using heuristics and templates.
5. Validate/normalize extracted values.
6. Insert or upsert documents into MongoDB via `pymongo`.
7. Move processed PDFs to an archive or error folder.
## Tech stack / dependencies
- Python 3.14 (project venv: `.venv`)
- OCR: `paddleocr` (and underlying `paddlepaddle`)
- PDF → image: `pdf2image` (requires `poppler` installed on host)
- MongoDB client: `pymongo`
- Image handling: `Pillow`
- Optional: `python-dotenv` for config, `rich` for nicer logs
System-level requirements:
- `poppler` (for `pdf2image`)
- Optional GPU drivers if using GPU-enabled `paddlepaddle` builds
## Environment / setup (dev)
1. Create and activate virtualenv (already done in this workspace):
```bash
python3 -m venv .venv
source .venv/bin/activate
```
2. Install Python dependencies (we will add `requirements.txt` soon):
```bash
pip install -r requirements.txt
```
3. Configure a `.env` file for MongoDB connection string and paths (example):
```
MONGO_URI=mongodb://localhost:27017
INPUT_DIR=./input
ARCHIVE_DIR=./archive
ERROR_DIR=./error
```
## Data model (draft)
- `documents` collection:
- `_id`: UUID
- `filename`: original PDF filename
- `pages`: list of page objects; each page has `page_number`, `ocr_text`, `fields`
- `extracted_fields`: dict of normalized field names -> values
- `status`: `pending|processed|error`
- `processed_at`, `created_at`
## Next steps / short-term TODOs
- Create `requirements.txt` with pinned packages and document system deps.
- Add `README.md` with quickstart run instructions.
- Implement a small prototype script `processor.py` that converts a PDF to images and runs OCR on a single page.
- Add basic unit tests and a sample PDF in `samples/` for development.
## Notes / caveats
- `paddleocr` may require selecting CPU vs GPU builds of `paddlepaddle` — document preferred build and install instructions in `requirements.txt` or a separate notes section.
- OCR of scanned forms may require pre-processing (deskewing, denoising) for acceptable accuracy.
---
Created: initial development notes and plan.