# Document Intake (dev notes)

## Project overview

This project ingests PDF files of scanned forms from an input folder, extracts structured data using OCR, and stores the results in a MongoDB database.

## Goals

- Build a reliable pipeline to convert scanned PDF pages into images and run OCR.
- Extract form fields and normalize values (dates, numbers, names, checkboxes).
- Validate and transform extracted data into a consistent schema for MongoDB.
- Make processing resumable and observable (logs, metrics, retry).
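
The normalization goal could start from a few small, pure functions. The sketch below is illustrative only: the accepted date layouts, the day-first reading of slash dates, and the tokens treated as a ticked checkbox are all assumptions, not decisions recorded in these notes.

```python
from __future__ import annotations

import re
from datetime import date, datetime

# Hypothetical date layouts; adjust to the forms actually being scanned.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")

# Hypothetical tokens that OCR might emit for a ticked box.
CHECKED_TOKENS = {"x", "✓", "✔", "yes", "true", "1"}


def normalize_date(raw: str) -> date | None:
    """Return a date for the first matching layout, or None if none match."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def normalize_number(raw: str) -> float | None:
    """Parse a number, ignoring whitespace and comma thousands separators."""
    cleaned = re.sub(r"[\s,]", "", raw)
    try:
        return float(cleaned)
    except ValueError:
        return None


def normalize_name(raw: str) -> str:
    """Collapse repeated whitespace and apply title case."""
    return " ".join(raw.split()).title()


def normalize_checkbox(raw: str) -> bool:
    """Treat a small set of tokens as checked; everything else is unchecked."""
    return raw.strip().lower() in CHECKED_TOKENS
```

Returning `None` for unparseable values, rather than raising, lets the validation step decide whether a missing field is an error for a given form.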

## Architecture / Flow

1. Watch the input folder for new PDF files (or run a batch processor).
2. Convert each PDF page to an image using `pdf2image` (requires `poppler`).
3. Run OCR on the images using `paddleocr` to get text, text regions (segmentation), and confidence scores.
4. Parse the OCR results to locate fields, using heuristics and templates.
5. Validate and normalize the extracted values.
6. Insert or upsert documents into MongoDB via `pymongo`.
7. Move processed PDFs to the archive folder, or to the error folder on failure.
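
Steps 1 and 7 can be sketched with the standard library alone. In the sketch below, `run_batch` and `process_pdf` are hypothetical names: `process_pdf` stands in for steps 2 to 6 and is assumed to raise an exception on failure.

```python
from __future__ import annotations

import logging
import shutil
from pathlib import Path
from typing import Callable

log = logging.getLogger("intake")


def run_batch(
    input_dir: Path,
    archive_dir: Path,
    error_dir: Path,
    process_pdf: Callable[[Path], None],
) -> dict[str, int]:
    """Process every PDF in input_dir once and file it by outcome.

    process_pdf is a placeholder for steps 2-6 (convert, OCR, parse,
    validate, store); it is expected to raise on failure.
    """
    archive_dir.mkdir(parents=True, exist_ok=True)
    error_dir.mkdir(parents=True, exist_ok=True)
    counts = {"processed": 0, "error": 0}

    # Note: this glob is case-sensitive on Linux, so "*.PDF" is skipped.
    for pdf in sorted(input_dir.glob("*.pdf")):
        try:
            process_pdf(pdf)
        except Exception:
            log.exception("failed to process %s", pdf.name)
            target, outcome = error_dir, "error"
        else:
            target, outcome = archive_dir, "processed"
        # Moving the file out of input_dir is what makes a re-run safe.
        shutil.move(str(pdf), str(target / pdf.name))
        counts[outcome] += 1
    return counts
```

Because each file leaves the input folder as soon as it is handled, re-running the batch after a crash picks up only the files that were not yet processed.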

## Tech stack / dependencies

- Python 3.14 (project venv: `.venv`)
- OCR: `paddleocr` (and underlying `paddlepaddle`)
- PDF → image: `pdf2image` (requires `poppler` installed on host)
- MongoDB client: `pymongo`
- Image handling: `Pillow`
- Optional: `python-dotenv` for config, `rich` for nicer logs

System-level requirements:

- `poppler` (for `pdf2image`)
- Optional GPU drivers if using GPU-enabled `paddlepaddle` builds
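
A small preflight check can report missing dependencies before the pipeline starts. This is a sketch: `missing_dependencies` is a hypothetical helper, and the binary names assume that `poppler` is exposed through its `pdftoppm` and `pdfinfo` command-line tools, which `pdf2image` calls.

```python
from __future__ import annotations

import importlib.util
import shutil

# Import names differ from some package names: Pillow is imported as PIL,
# paddlepaddle as paddle.
REQUIRED_MODULES = ("paddleocr", "paddle", "pdf2image", "pymongo", "PIL")

# Assumed poppler command-line tools that pdf2image shells out to.
REQUIRED_BINARIES = ("pdftoppm", "pdfinfo")


def missing_dependencies(
    modules: tuple[str, ...] = REQUIRED_MODULES,
    binaries: tuple[str, ...] = REQUIRED_BINARIES,
) -> list[str]:
    """Return the names of modules and binaries that cannot be found."""
    # find_spec locates a top-level module without importing it.
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    # shutil.which searches PATH the same way a shell would.
    missing += [b for b in binaries if shutil.which(b) is None]
    return missing
```

Calling `missing_dependencies()` at startup and exiting with a clear message when the list is non-empty tends to be easier to debug than an import error halfway through a batch.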

## Environment / setup (dev)

1. Create and activate a virtualenv (already done in this workspace):

```bash
python3 -m venv .venv
source .venv/bin/activate
```

2. Install Python dependencies (`requirements.txt` has not been added yet):

```bash
pip install -r requirements.txt
```

3. Configure a `.env` file with the MongoDB connection string and paths (example):

```
MONGO_URI=mongodb://localhost:27017
INPUT_DIR=./input
ARCHIVE_DIR=./archive
ERROR_DIR=./error
```
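
One way to read these values is a small settings object built from the environment. The sketch below assumes a hypothetical `Settings` class and uses the example values above as defaults; neither is part of the project yet.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    mongo_uri: str
    input_dir: Path
    archive_dir: Path
    error_dir: Path

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Settings":
        """Build settings from a mapping, falling back to the example values."""
        return cls(
            mongo_uri=env.get("MONGO_URI", "mongodb://localhost:27017"),
            input_dir=Path(env.get("INPUT_DIR", "./input")),
            archive_dir=Path(env.get("ARCHIVE_DIR", "./archive")),
            error_dir=Path(env.get("ERROR_DIR", "./error")),
        )
```

If `python-dotenv` is installed, calling `load_dotenv()` (from the `dotenv` module) before `Settings.from_env()` copies the `.env` entries into `os.environ`. Passing a plain dict instead of `os.environ` keeps the class easy to unit test.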

## Data model (draft)

- `documents` collection:
  - `_id`: UUID
  - `filename`: original PDF filename
  - `pages`: list of page objects; each page has `page_number`, `ocr_text`, `fields`
  - `extracted_fields`: dict of normalized field names -> values
  - `status`: `pending|processed|error`
  - `processed_at`, `created_at`
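
The draft schema can be expressed as plain dictionaries before any database code exists. In this sketch the helper names are hypothetical, timestamps are assumed to be UTC, and `_id` is stored as a UUID string (an assumption made here to avoid depending on the client's UUID encoding settings). The `save` helper relies only on `replace_one(filter, replacement, upsert=True)`, which `pymongo` collections provide.

```python
from __future__ import annotations

import uuid
from datetime import datetime, timezone
from typing import Any


def new_document(filename: str) -> dict[str, Any]:
    """Return a `documents` record in its initial `pending` state."""
    return {
        # ASSUMPTION: UUID kept as a string rather than a native BSON UUID.
        "_id": str(uuid.uuid4()),
        "filename": filename,
        "pages": [],
        "extracted_fields": {},
        "status": "pending",
        "created_at": datetime.now(timezone.utc),
        "processed_at": None,
    }


def add_page(
    doc: dict[str, Any], page_number: int, ocr_text: str, fields: dict[str, Any]
) -> None:
    """Append one page object with the three keys from the draft schema."""
    doc["pages"].append(
        {"page_number": page_number, "ocr_text": ocr_text, "fields": fields}
    )


def mark_processed(doc: dict[str, Any], extracted_fields: dict[str, Any]) -> None:
    """Record the normalized fields and flip the status to `processed`."""
    doc["extracted_fields"] = extracted_fields
    doc["status"] = "processed"
    doc["processed_at"] = datetime.now(timezone.utc)


def save(collection: Any, doc: dict[str, Any]) -> None:
    """Upsert by `_id`; `collection` is expected to be a pymongo Collection."""
    collection.replace_one({"_id": doc["_id"]}, doc, upsert=True)
```

Upserting by `_id` means a retried file overwrites its earlier record instead of creating a duplicate, which fits the goal of resumable processing.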

## Next steps / short-term TODOs

- Create `requirements.txt` with pinned packages and document system deps.
- Add `README.md` with quickstart run instructions.
- Implement a small prototype script `processor.py` that converts a PDF to images and runs OCR on a single page.
- Add basic unit tests and a sample PDF in `samples/` for development.
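
A possible starting point for the `processor.py` prototype is sketched below. `render_page` uses `pdf2image.convert_from_path` with its `dpi`, `first_page`, and `last_page` parameters; the 300 DPI default is an assumption. `flatten_ocr_lines` assumes each OCR entry has the shape `[box, (text, confidence)]`, which matches the output of PaddleOCR 2.x's `ocr()` call but may differ in other releases, so check it against the installed version.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any


def render_page(pdf_path: Path, page_number: int = 1, dpi: int = 300) -> Any:
    """Render one page (1-based) of a PDF to a PIL image via pdf2image."""
    # Imported lazily so the helpers below stay usable without poppler.
    from pdf2image import convert_from_path

    images = convert_from_path(
        str(pdf_path), dpi=dpi, first_page=page_number, last_page=page_number
    )
    if not images:
        raise ValueError(f"page {page_number} not found in {pdf_path}")
    return images[0]


def flatten_ocr_lines(
    page_result: list[Any] | None, min_confidence: float = 0.5
) -> list[dict[str, Any]]:
    """Turn one page of OCR output into a list of dicts.

    ASSUMPTION: each entry looks like [box, (text, confidence)]; adapt this
    function if the installed paddleocr returns a different structure.
    """
    lines = []
    for box, (text, confidence) in page_result or []:
        # Drop low-confidence lines early so field parsing sees less noise.
        if confidence >= min_confidence:
            lines.append(
                {"text": text, "confidence": float(confidence), "box": box}
            )
    return lines
```

Keeping the flattening step separate from the OCR call means it can be unit tested with hand-written results, before a sample PDF exists in `samples/`.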

## Notes / caveats

- `paddleocr` may require choosing between the CPU and GPU builds of `paddlepaddle`; document the preferred build and its install instructions in `requirements.txt` or a separate notes section.
- OCR of scanned forms may require pre-processing (deskewing, denoising) to reach acceptable accuracy.

---

Created: initial development notes and plan.