Initial commit
DEVELOPMENT.md
# Document Intake (dev notes)

## Project overview

This project ingests PDF files of scanned forms from an input folder, extracts structured data using OCR, and stores the results in a MongoDB database.

## Goals

- Build a reliable pipeline to convert scanned PDF pages into images and run OCR.
- Extract form fields and normalize values (dates, numbers, names, checkboxes).
- Validate and transform extracted data into a consistent schema for MongoDB.
- Make processing resumable and observable (logs, metrics, retry).
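
The normalization goal can be illustrated with a few small helpers. This is a minimal sketch, not the project's implementation: the function names, the accepted date layouts, and the set of marks treated as a ticked checkbox are all assumptions to be replaced once real forms are available.

```python
import re
from datetime import date, datetime
from typing import Optional

# Hypothetical date layouts; extend this tuple to match the real forms.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")

# Hypothetical OCR outputs that count as a ticked checkbox.
CHECKED_MARKS = {"x", "yes", "y", "true", "1"}


def normalize_date(raw: str) -> Optional[date]:
    """Return a date for the first layout that matches, or None."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def normalize_number(raw: str) -> Optional[float]:
    """Parse a number written with ',' as thousands separator and '.' as decimal point."""
    text = raw.strip().replace(",", "").replace(" ", "")
    if not re.fullmatch(r"[+-]?\d+(\.\d+)?", text):
        return None
    return float(text)


def normalize_name(raw: str) -> str:
    """Trim the value and collapse runs of whitespace to single spaces."""
    return " ".join(raw.split())


def normalize_checkbox(raw: str) -> bool:
    """Treat a small set of marks as checked; everything else is unchecked."""
    return raw.strip().lower() in CHECKED_MARKS
```

Returning `None` for unparseable dates and numbers, rather than raising, lets the validation step decide whether a missing value is an error for that field. Decimal commas (as in `1.234,50`) are not handled by this sketch.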

## Architecture / Flow

1. Watch input folder for new PDF files (or run batch processor).
2. Convert each PDF page to image(s) using `pdf2image` (requires `poppler`).
3. Run OCR on images using `paddleocr` to get text, segmentation, and confidence.
4. Parse OCR results to locate fields, using heuristics and templates.
5. Validate/normalize extracted values.
6. Insert or upsert documents into MongoDB via `pymongo`.
7. Move processed PDFs to an archive or error folder.
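
The outer loop of this flow (steps 1 and 7) can be sketched with the standard library alone. The sketch below covers the batch variant only; `process_batch` and `handle_pdf` are hypothetical names, and `handle_pdf` stands in for steps 2 to 6, which are not implemented here.

```python
import shutil
from pathlib import Path
from typing import Callable, Dict, List


def process_batch(
    input_dir: Path,
    archive_dir: Path,
    error_dir: Path,
    handle_pdf: Callable[[Path], None],
) -> Dict[str, List[str]]:
    """Run handle_pdf on each PDF in input_dir, then move the file.

    Files that process cleanly go to archive_dir; files whose handler
    raises go to error_dir. handle_pdf is a placeholder for steps 2-6.
    """
    archive_dir.mkdir(parents=True, exist_ok=True)
    error_dir.mkdir(parents=True, exist_ok=True)
    outcome: Dict[str, List[str]] = {"processed": [], "failed": []}

    # sorted() materializes the listing before any file is moved.
    for pdf_path in sorted(input_dir.glob("*.pdf")):
        try:
            handle_pdf(pdf_path)
        except Exception:
            # Any failure routes the file to the error folder for inspection.
            shutil.move(str(pdf_path), str(error_dir / pdf_path.name))
            outcome["failed"].append(pdf_path.name)
        else:
            shutil.move(str(pdf_path), str(archive_dir / pdf_path.name))
            outcome["processed"].append(pdf_path.name)
    return outcome
```

Because a file leaves the input folder only after its handler has returned or raised, a run that is interrupted mid-batch can simply be restarted on the same folder. Note that `*.pdf` does not match upper-case extensions on case-sensitive filesystems.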

## Tech stack / dependencies

- Python 3.14 (project venv: `.venv`)
- OCR: `paddleocr` (and underlying `paddlepaddle`)
- PDF → image: `pdf2image` (requires `poppler` installed on host)
- MongoDB client: `pymongo`
- Image handling: `Pillow`
- Optional: `python-dotenv` for config, `rich` for nicer logs

System-level requirements:

- `poppler` (for `pdf2image`)
- Optional GPU drivers if using GPU-enabled `paddlepaddle` builds
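
Since `pdf2image` shells out to Poppler's command-line tools, a startup check can fail fast when they are missing. A minimal sketch, assuming the tools of interest are `pdftoppm` and `pdfinfo`; the helper name `missing_tools` is hypothetical.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Poppler executables that pdf2image invokes; both must be on PATH.
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def missing_tools(
    tools: Iterable[str] = POPPLER_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        raise SystemExit("Missing system tools: " + ", ".join(missing))
    print("All Poppler tools found.")
```

The `which` parameter exists only so the lookup can be replaced in tests; normal callers would use the default.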

## Environment / setup (dev)

1. Create and activate virtualenv (already done in this workspace):

```bash
python3 -m venv .venv
source .venv/bin/activate
```

2. Install Python dependencies (we will add `requirements.txt` soon):

```bash
pip install -r requirements.txt
```

3. Configure a `.env` file for MongoDB connection string and paths (example):

```
MONGO_URI=mongodb://localhost:27017
INPUT_DIR=./input
ARCHIVE_DIR=./archive
ERROR_DIR=./error
```
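
Reading these values in code might look like the sketch below. The `Settings` class and `load_settings` function are hypothetical names, and falling back to the example values as defaults is an assumption rather than a decided behavior.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    mongo_uri: str
    input_dir: Path
    archive_dir: Path
    error_dir: Path


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables.

    Missing variables fall back to the example values (an assumption).
    """
    return Settings(
        mongo_uri=env.get("MONGO_URI", "mongodb://localhost:27017"),
        input_dir=Path(env.get("INPUT_DIR", "./input")),
        archive_dir=Path(env.get("ARCHIVE_DIR", "./archive")),
        error_dir=Path(env.get("ERROR_DIR", "./error")),
    )
```

If `python-dotenv` is installed, calling `load_dotenv()` (imported from `dotenv`) before `load_settings()` copies the `.env` entries into `os.environ` first. Passing a plain dict as `env` keeps the function easy to test.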

## Data model (draft)

- `documents` collection:
  - `_id`: UUID
  - `filename`: original PDF filename
  - `pages`: list of page objects; each page has `page_number`, `ocr_text`, `fields`
  - `extracted_fields`: dict of normalized field names -> values
  - `status`: `pending|processed|error`
  - `processed_at`, `created_at`
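
As a rough illustration of the draft shape, the sketch below builds one record and the arguments for an upsert. Several choices here are assumptions, not decisions recorded in these notes: storing `_id` as a UUID string, merging per-page `fields` into `extracted_fields` with later pages winning, and using `filename` as the upsert key. The helper names are hypothetical.

```python
import uuid
from datetime import datetime, timezone
from typing import Any, Dict, List, Tuple


def new_document(filename: str, pages: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a `documents` record in the draft shape, starting as `pending`."""
    merged: Dict[str, Any] = {}
    for page in pages:
        # Assumption: later pages override earlier ones on key clashes.
        merged.update(page.get("fields", {}))
    return {
        "_id": str(uuid.uuid4()),  # assumption: UUID stored as a string
        "filename": filename,
        "pages": pages,
        "extracted_fields": merged,
        "status": "pending",
        "created_at": datetime.now(timezone.utc),
        "processed_at": None,
    }


def upsert_args(doc: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (filter, update) suitable for an update_one call with upsert=True.

    `_id` and `created_at` are written only when the record is first
    inserted, so re-processing a file keeps its identity and creation time.
    """
    on_insert = {"_id": doc["_id"], "created_at": doc["created_at"]}
    skip = set(on_insert) | {"filename"}  # filename is supplied by the filter
    to_set = {key: value for key, value in doc.items() if key not in skip}
    return {"filename": doc["filename"]}, {"$set": to_set, "$setOnInsert": on_insert}
```

With a `pymongo` collection in hand, the result would be used as `flt, update = upsert_args(doc)` followed by `collection.update_one(flt, update, upsert=True)`.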

## Next steps / short-term TODOs

- Create `requirements.txt` with pinned packages and document system deps.
- Add `README.md` with quickstart run instructions.
- Implement a small prototype script `processor.py` that converts a PDF to images and runs OCR on a single page.
- Add basic unit tests and a sample PDF in `samples/` for development.

## Notes / caveats

- `paddleocr` may require choosing between the CPU and GPU builds of `paddlepaddle`; document the preferred build and its install instructions in `requirements.txt` or a separate notes section.
- OCR of scanned forms may require pre-processing (deskewing, denoising) for acceptable accuracy.

---

Created: initial development notes and plan.