Initial commit

This commit is contained in:
2026-01-01 21:57:33 -08:00
commit d246d2a0d7
6 changed files with 285 additions and 0 deletions

22
.gitignore vendored Normal file
View File

@@ -0,0 +1,22 @@
# Virtual environment
.venv/
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# Distribution / packaging
build/
dist/
*.egg-info/
# Editor dirs
.vscode/
.idea/
# macOS
.DS_Store
# Logs
*.log

86
DEVELOPMENT.md Normal file
View File

@@ -0,0 +1,86 @@
# Document Intake (dev notes)
## Project overview
This project ingests PDF files of scanned forms from an input folder, extracts structured data using OCR, and stores the results in a MongoDB database.
## Goals
- Build a reliable pipeline to convert scanned PDF pages into images and run OCR.
- Extract form fields and normalize values (dates, numbers, names, checkboxes).
- Validate and transform extracted data into a consistent schema for MongoDB.
- Make processing resumable and observable (logs, metrics, retry).
## Architecture / Flow
1. Watch input folder for new PDF files (or run batch processor).
2. Convert each PDF page to image(s) using `pdf2image` (requires `poppler`).
3. Run OCR on images using `paddleocr` to get text, segmentation, and confidence.
4. Parse OCR results to locate fields, using heuristics and templates.
5. Validate/normalize extracted values.
6. Insert or upsert documents into MongoDB via `pymongo`.
7. Move processed PDFs to an archive or error folder.
## Tech stack / dependencies
- Python 3.14 (project venv: `.venv`)
- OCR: `paddleocr` (and underlying `paddlepaddle`)
- PDF → image: `pdf2image` (requires `poppler` installed on host)
- MongoDB client: `pymongo`
- Image handling: `Pillow`
- Optional: `python-dotenv` for config, `rich` for nicer logs
System-level requirements:
- `poppler` (for `pdf2image`)
- Optional GPU drivers if using GPU-enabled `paddlepaddle` builds
## Environment / setup (dev)
1. Create and activate virtualenv (already done in this workspace):
```bash
python3 -m venv .venv
source .venv/bin/activate
```
2. Install Python dependencies (we will add `requirements.txt` soon):
```bash
pip install -r requirements.txt
```
3. Configure a `.env` file for MongoDB connection string and paths (example):
```
MONGO_URI=mongodb://localhost:27017
INPUT_DIR=./input
ARCHIVE_DIR=./archive
ERROR_DIR=./error
```
## Data model (draft)
- `documents` collection:
- `_id`: UUID
- `filename`: original PDF filename
- `pages`: list of page objects; each page has `page_number`, `ocr_text`, `fields`
- `extracted_fields`: dict of normalized field names -> values
- `status`: `pending|processed|error`
- `processed_at`, `created_at`
## Next steps / short-term TODOs
- Create `requirements.txt` with pinned packages and document system deps.
- Add `README.md` with quickstart run instructions.
- Implement a small prototype script `processor.py` that converts a PDF to images and runs OCR on a single page.
- Add basic unit tests and a sample PDF in `samples/` for development.
## Notes / caveats
- `paddleocr` may require selecting CPU vs GPU builds of `paddlepaddle` — document preferred build and install instructions in `requirements.txt` or a separate notes section.
- OCR of scanned forms may require pre-processing (deskewing, denoising) for acceptable accuracy.
---
Created: initial development notes and plan.

56
README.md Normal file
View File

@@ -0,0 +1,56 @@
# Document Intake — Quickstart
Prerequisite: you already created the project virtual environment `.venv` in the repository root.
## macOS system deps
Install `poppler` (needed by `pdf2image`):
```bash
brew install poppler
```
## Activate the virtual environment
```bash
source .venv/bin/activate
```
## Install Python dependencies
```bash
pip install -r requirements.txt
```
If you need a specific `paddlepaddle` flavor (CPU vs GPU) follow the official install guide before or instead of the line above.
## Quick verification
Check that `paddleocr` and `pymongo` import successfully:
```bash
python -c "import paddleocr; import pymongo; print('imports OK')"
```
## Running a processor (prototype)
We will add a prototype script `processor.py` that:
- Converts pages from a PDF to images using `pdf2image`.
- Runs OCR on one page with `paddleocr`.
- Prints basic extraction results.
To run the prototype (once added):
```bash
python processor.py --input samples/example.pdf
```
## Useful files
- Development notes: [DEVELOPMENT.md](DEVELOPMENT.md)
- Python dependencies: [requirements.txt](requirements.txt)
---
If you want, I can now add a minimal `processor.py` prototype and a `samples/` folder with a placeholder PDF. Which should I do next?

11
conf.yaml Normal file
View File

@@ -0,0 +1,11 @@
## Configuration for the processor prototype
# Either point `input_dir` to a folder containing PDFs, or set `input_file`
input_dir: ./input
input_file: ""
archive_dir: ./archive
error_dir: ./error
# MongoDB connection (optional for prototype)
mongo_uri: "mongodb://localhost:27017"

82
processor.py Normal file
View File

@@ -0,0 +1,82 @@
#!/usr/bin/env python3
"""Minimal processor prototype: convert a PDF page to an image and run PaddleOCR.
This script is intentionally small and defensive: it checks for missing
dependencies and prints actionable instructions.
"""
from __future__ import annotations
import argparse
import os
import sys
try:
import yaml
from pdf2image import convert_from_path
from paddleocr import PaddleOCR
import numpy as np
except Exception as exc: # pragma: no cover - runtime dependency guard
print("Dependency error:", exc)
print("Please install requirements: pip install -r requirements.txt")
sys.exit(1)
def process_pdf_first_page(pdf_path: str) -> list:
pages = convert_from_path(pdf_path, first_page=1, last_page=1)
if not pages:
raise RuntimeError("No pages returned from pdf2image")
img = pages[0]
ocr = PaddleOCR(use_angle_cls=True, lang='en')
# PaddleOCR accepts numpy arrays for in-memory images
result = ocr.ocr(np.array(img), cls=True)
return result
def load_conf(path: str) -> dict:
with open(path, "r") as fh:
return yaml.safe_load(fh) or {}
def main() -> None:
p = argparse.ArgumentParser(description="Processor prototype")
p.add_argument("--conf", default="conf.yaml", help="Path to conf.yaml")
p.add_argument(
"--input", help="Path to input PDF or input directory (overrides conf)")
args = p.parse_args()
conf = {}
if os.path.exists(args.conf):
conf = load_conf(args.conf)
input_spec = args.input or conf.get("input_file") or conf.get("input_dir")
if not input_spec:
print("No input specified. Set --input or define 'input_file' / 'input_dir' in conf.yaml")
sys.exit(2)
if os.path.isdir(input_spec):
for fname in sorted(os.listdir(input_spec)):
if fname.lower().endswith(".pdf"):
input_spec = os.path.join(input_spec, fname)
break
else:
print("No PDF files found in directory:", input_spec)
sys.exit(3)
if not os.path.exists(input_spec):
print("Input path does not exist:", input_spec)
sys.exit(4)
print("Processing:", input_spec)
try:
res = process_pdf_first_page(input_spec)
except Exception as exc:
print("Processing error:", exc)
sys.exit(5)
print("OCR result (first page):")
for block in res:
print(block)
if __name__ == "__main__":
main()

28
requirements.txt Normal file
View File

@@ -0,0 +1,28 @@
# Core OCR
paddleocr>=2.7.0
# PaddlePaddle (choose CPU or GPU build appropriate for your system)
# For CPU-only install: `pip install paddlepaddle` or follow the official install guide
#paddlepaddle>=2.5.0
# PDF -> image
pdf2image>=1.16.0
Pillow>=10.0.0
# Database
pymongo>=4.4.0
# Utilities
python-dotenv>=1.0.0
rich>=13.0.0
tqdm>=4.65.0
# Testing / linting (optional)
pytest>=8.0.0
mypy>=1.5.0
# YAML config parsing
PyYAML>=6.0
# Notes:
# - `pdf2image` requires the `poppler` system package (e.g. `brew install poppler` on macOS).
# - Select the appropriate `paddlepaddle` wheel for your platform (CPU vs GPU, macOS vs Linux).