Initial commit
22
.gitignore
vendored
Normal file
@@ -0,0 +1,22 @@
# Virtual environment
.venv/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# Distribution / packaging
build/
dist/
*.egg-info/

# Editor dirs
.vscode/
.idea/

# macOS
.DS_Store

# Logs
*.log
86
DEVELOPMENT.md
Normal file
@@ -0,0 +1,86 @@
# Document Intake (dev notes)

## Project overview

This project ingests PDF files of scanned forms from an input folder, extracts structured data using OCR, and stores the results in a MongoDB database.

## Goals

- Build a reliable pipeline to convert scanned PDF pages into images and run OCR.
- Extract form fields and normalize values (dates, numbers, names, checkboxes).
- Validate and transform extracted data into a consistent schema for MongoDB.
- Make processing resumable and observable (logs, metrics, retry).
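
As a rough illustration of the normalization goal, the sketch below shows one way to coerce raw OCR strings into typed values using only the standard library. The helper names, the accepted date formats, and the set of checkbox marks are assumptions made for this example; the project does not define them yet.

```python
from __future__ import annotations

import re
from datetime import date, datetime

# Assumed inputs: these formats and marks are illustrative, not project decisions.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")
CHECKED_MARKS = {"x", "✓", "✔", "yes", "true", "1"}


def normalize_date(raw: str) -> date | None:
    """Return a date parsed with the first matching format, or None."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def normalize_number(raw: str) -> float | None:
    """Parse a number after dropping whitespace and comma separators."""
    cleaned = re.sub(r"[,\s]", "", raw)
    try:
        return float(cleaned)
    except ValueError:
        return None


def normalize_checkbox(raw: str) -> bool:
    """Treat a small set of marks as 'checked'; everything else is unchecked."""
    return raw.strip().lower() in CHECKED_MARKS


def normalize_name(raw: str) -> str:
    """Collapse repeated whitespace and title-case the result."""
    return " ".join(raw.split()).title()
```

Returning `None` for unparseable dates and numbers keeps the decision about what to do with bad values (reject, flag, or retry) in the later validation step.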

## Architecture / Flow

1. Watch the input folder for new PDF files (or run a batch processor).
2. Convert each PDF page to image(s) using `pdf2image` (requires `poppler`).
3. Run OCR on images using `paddleocr` to get text, bounding boxes, and confidence scores.
4. Parse OCR results to locate fields, using heuristics and templates.
5. Validate/normalize extracted values.
6. Insert or upsert documents into MongoDB via `pymongo`.
7. Move processed PDFs to an archive or error folder.
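
Step 7 could look roughly like the following. This is a minimal sketch, not the project's implementation: the function name `move_processed` and the numeric-suffix policy for name collisions are hypothetical choices.

```python
import shutil
from pathlib import Path


def move_processed(pdf_path: Path, ok: bool, archive_dir: Path, error_dir: Path) -> Path:
    """Move a PDF to the archive folder on success, or the error folder on failure.

    Returns the final location of the file.
    """
    target_dir = archive_dir if ok else error_dir
    target_dir.mkdir(parents=True, exist_ok=True)

    # Assumed policy: never overwrite an existing file; append _1, _2, ... instead.
    target = target_dir / pdf_path.name
    counter = 1
    while target.exists():
        target = target_dir / f"{pdf_path.stem}_{counter}{pdf_path.suffix}"
        counter += 1

    shutil.move(str(pdf_path), str(target))
    return target
```

Keeping the original file name where possible makes it easier to trace an archived PDF back to its database record through the `filename` field.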

## Tech stack / dependencies

- Python 3.14 (project venv: `.venv`)
- OCR: `paddleocr` (and underlying `paddlepaddle`)
- PDF → image: `pdf2image` (requires `poppler` installed on host)
- MongoDB client: `pymongo`
- Image handling: `Pillow`
- Optional: `python-dotenv` for config, `rich` for nicer logs

System-level requirements:

- `poppler` (for `pdf2image`)
- Optional GPU drivers if using GPU-enabled `paddlepaddle` builds

## Environment / setup (dev)

1. Create and activate the virtual environment (already done in this workspace):

```bash
python3 -m venv .venv
source .venv/bin/activate
```

2. Install Python dependencies from `requirements.txt`:

```bash
pip install -r requirements.txt
```

3. Configure a `.env` file for the MongoDB connection string and paths (example):

```
MONGO_URI=mongodb://localhost:27017
INPUT_DIR=./input
ARCHIVE_DIR=./archive
ERROR_DIR=./error
```
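
One possible way to read these values at startup is sketched below using only the standard library. `Settings` and `load_settings` are hypothetical names, and the defaults simply mirror the example above. If `python-dotenv` is installed, calling its `load_dotenv()` beforehand would copy the `.env` entries into the process environment.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Runtime configuration (hypothetical container, not part of the project yet)."""

    mongo_uri: str
    input_dir: Path
    archive_dir: Path
    error_dir: Path


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, falling back to the example defaults."""
    return Settings(
        mongo_uri=env.get("MONGO_URI", "mongodb://localhost:27017"),
        input_dir=Path(env.get("INPUT_DIR", "./input")),
        archive_dir=Path(env.get("ARCHIVE_DIR", "./archive")),
        error_dir=Path(env.get("ERROR_DIR", "./error")),
    )
```

Passing the mapping as a parameter keeps the function easy to test: a plain dict can stand in for the real environment.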

## Data model (draft)

- `documents` collection:
  - `_id`: UUID
  - `filename`: original PDF filename
  - `pages`: list of page objects; each page has `page_number`, `ocr_text`, `fields`
  - `extracted_fields`: dict of normalized field names -> values
  - `status`: `pending|processed|error`
  - `processed_at`, `created_at`
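
To make the draft concrete, here is a small sketch that builds a record with these fields as a plain dict. The helper names are hypothetical, and storing `_id` as a UUID string with UTC timestamps is an assumption; the draft above does not settle either choice.

```python
import uuid
from datetime import datetime, timezone


def new_document(filename: str) -> dict:
    """Create a record in the draft `documents` shape with status 'pending'."""
    return {
        "_id": str(uuid.uuid4()),  # assumption: UUID stored as a string
        "filename": filename,
        "pages": [],
        "extracted_fields": {},
        "status": "pending",
        "created_at": datetime.now(timezone.utc),
        "processed_at": None,
    }


def add_page(doc: dict, page_number: int, ocr_text: str, fields: dict) -> None:
    """Append one page object to the document."""
    doc["pages"].append(
        {"page_number": page_number, "ocr_text": ocr_text, "fields": fields}
    )


def mark_processed(doc: dict, extracted_fields: dict) -> None:
    """Store the normalized fields and flip the status to 'processed'."""
    doc["extracted_fields"] = extracted_fields
    doc["status"] = "processed"
    doc["processed_at"] = datetime.now(timezone.utc)
```

A dict in this shape can be handed to `pymongo` as-is once the collection and write strategy (insert vs upsert) are decided.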

## Next steps / short-term TODOs

- Create `requirements.txt` with pinned packages and document system deps.
- Add `README.md` with quickstart run instructions.
- Implement a small prototype script `processor.py` that converts a PDF to images and runs OCR on a single page.
- Add basic unit tests and a sample PDF in `samples/` for development.

## Notes / caveats

- `paddleocr` may require selecting CPU vs GPU builds of `paddlepaddle`; document the preferred build and install instructions in `requirements.txt` or a separate notes section.
- OCR of scanned forms may require pre-processing (deskewing, denoising) for acceptable accuracy.

---

Created: initial development notes and plan.
56
README.md
Normal file
@@ -0,0 +1,56 @@
# Document Intake — Quickstart

Prerequisite: you have already created the project virtual environment `.venv` in the repository root.

## macOS system deps

Install `poppler` (needed by `pdf2image`):

```bash
brew install poppler
```

## Activate the virtual environment

```bash
source .venv/bin/activate
```

## Install Python dependencies

```bash
pip install -r requirements.txt
```

`requirements.txt` leaves `paddlepaddle` commented out, so install the build that matches your system (CPU or GPU) separately, following the official install guide.

## Quick verification

Check that `paddleocr` and `pymongo` import successfully:

```bash
python -c "import paddleocr; import pymongo; print('imports OK')"
```
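
Because `pdf2image` shells out to Poppler's command-line tools, a similar check can confirm that they are on `PATH`. The snippet below is a minimal sketch; the function name `missing_poppler_tools` is hypothetical, and the tool list assumes the default `pdftoppm`/`pdfinfo` pair.

```python
from __future__ import annotations

import shutil

# Poppler executables that pdf2image relies on by default (assumed list).
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def missing_poppler_tools(tools: tuple[str, ...] = POPPLER_TOOLS) -> list[str]:
    """Return the names of the given executables that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_poppler_tools()
    if missing:
        print("Missing Poppler tools:", ", ".join(missing))
    else:
        print("poppler OK")
```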

## Running a processor (prototype)

The prototype script `processor.py`:

- Converts the first page of a PDF to an image using `pdf2image`.
- Runs OCR on that page with `paddleocr`.
- Prints the raw OCR results.

To run the prototype, pass the path to a PDF:

```bash
python processor.py --input samples/example.pdf
```

## Useful files

- Development notes: [DEVELOPMENT.md](DEVELOPMENT.md)
- Python dependencies: [requirements.txt](requirements.txt)
11
conf.yaml
Normal file
@@ -0,0 +1,11 @@
## Configuration for the processor prototype

# Either point `input_dir` to a folder containing PDFs, or set `input_file`
input_dir: ./input
input_file: ""

archive_dir: ./archive
error_dir: ./error

# MongoDB connection (optional for prototype)
mongo_uri: "mongodb://localhost:27017"
82
processor.py
Normal file
@@ -0,0 +1,82 @@
#!/usr/bin/env python3
"""Minimal processor prototype: convert a PDF page to an image and run PaddleOCR.

This script is intentionally small and defensive: it checks for missing
dependencies and prints actionable instructions.
"""
from __future__ import annotations

import argparse
import os
import sys

try:
    import yaml
    from pdf2image import convert_from_path
    from paddleocr import PaddleOCR
    import numpy as np
except Exception as exc:  # pragma: no cover - runtime dependency guard
    print("Dependency error:", exc)
    print("Please install requirements: pip install -r requirements.txt")
    sys.exit(1)


def process_pdf_first_page(pdf_path: str) -> list:
    """Render the first page of the PDF and return the raw PaddleOCR result."""
    pages = convert_from_path(pdf_path, first_page=1, last_page=1)
    if not pages:
        raise RuntimeError("No pages returned from pdf2image")
    img = pages[0]
    ocr = PaddleOCR(use_angle_cls=True, lang='en')
    # PaddleOCR accepts numpy arrays for in-memory images
    result = ocr.ocr(np.array(img), cls=True)
    return result


def load_conf(path: str) -> dict:
    """Load the YAML config; an empty file yields an empty dict."""
    with open(path, "r") as fh:
        return yaml.safe_load(fh) or {}


def main() -> None:
    p = argparse.ArgumentParser(description="Processor prototype")
    p.add_argument("--conf", default="conf.yaml", help="Path to conf.yaml")
    p.add_argument(
        "--input", help="Path to input PDF or input directory (overrides conf)")
    args = p.parse_args()

    conf = {}
    if os.path.exists(args.conf):
        conf = load_conf(args.conf)

    # Precedence: --input, then conf 'input_file', then conf 'input_dir'
    input_spec = args.input or conf.get("input_file") or conf.get("input_dir")
    if not input_spec:
        print("No input specified. Set --input or define 'input_file' / 'input_dir' in conf.yaml")
        sys.exit(2)

    if os.path.isdir(input_spec):
        # Pick the first PDF (alphabetically) in the directory
        for fname in sorted(os.listdir(input_spec)):
            if fname.lower().endswith(".pdf"):
                input_spec = os.path.join(input_spec, fname)
                break
        else:
            # for/else: reached only when the loop finished without finding a PDF
            print("No PDF files found in directory:", input_spec)
            sys.exit(3)

    if not os.path.exists(input_spec):
        print("Input path does not exist:", input_spec)
        sys.exit(4)

    print("Processing:", input_spec)
    try:
        res = process_pdf_first_page(input_spec)
    except Exception as exc:
        print("Processing error:", exc)
        sys.exit(5)

    print("OCR result (first page):")
    for block in res:
        print(block)


if __name__ == "__main__":
    main()
28
requirements.txt
Normal file
@@ -0,0 +1,28 @@
# Core OCR
paddleocr>=2.7.0
# PaddlePaddle (choose CPU or GPU build appropriate for your system)
# For CPU-only install: `pip install paddlepaddle` or follow the official install guide
#paddlepaddle>=2.5.0

# PDF -> image
pdf2image>=1.16.0
Pillow>=10.0.0

# Database
pymongo>=4.4.0

# Utilities
python-dotenv>=1.0.0
rich>=13.0.0
tqdm>=4.65.0

# Testing / linting (optional)
pytest>=8.0.0
mypy>=1.5.0

# YAML config parsing
PyYAML>=6.0

# Notes:
# - `pdf2image` requires the `poppler` system package (e.g. `brew install poppler` on macOS).
# - Select the appropriate `paddlepaddle` wheel for your platform (CPU vs GPU, macOS vs Linux).