Initial commit
.gitignore (vendored, normal file, 22 lines)
@@ -0,0 +1,22 @@
# Virtual environment
.venv/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# Distribution / packaging
build/
dist/
*.egg-info/

# Editor dirs
.vscode/
.idea/

# macOS
.DS_Store

# Logs
*.log
DEVELOPMENT.md (normal file, 86 lines)
@@ -0,0 +1,86 @@
# Document Intake (dev notes)

## Project overview

This project ingests PDF files of scanned forms from an input folder, extracts structured data using OCR, and stores the results in a MongoDB database.

## Goals

- Build a reliable pipeline to convert scanned PDF pages into images and run OCR.
- Extract form fields and normalize values (dates, numbers, names, checkboxes).
- Validate and transform extracted data into a consistent schema for MongoDB.
- Make processing resumable and observable (logs, metrics, retry).
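
The normalization goal could start from something like the sketch below. It uses only the standard library; the accepted date layouts, the checkbox marks, and all function names are assumptions for illustration, not decisions recorded in these notes.

```python
from __future__ import annotations

import re
from datetime import date, datetime

# Hypothetical date layouts; replace with the ones the real forms use.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")

# Hypothetical OCR renderings of a ticked checkbox.
CHECKED_MARKS = {"x", "✓", "✔", "yes", "[x]"}


def normalize_date(raw: str) -> date | None:
    """Return the first successful parse of raw, or None if no layout matches."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def normalize_number(raw: str) -> float | None:
    """Strip everything except digits, '.', and '-', then parse as float.

    Assumes '.' is the decimal separator and ',' is a thousands separator.
    """
    cleaned = re.sub(r"[^\d.\-]", "", raw)
    try:
        return float(cleaned)
    except ValueError:
        return None


def normalize_name(raw: str) -> str:
    """Collapse runs of whitespace and title-case the result."""
    return " ".join(raw.split()).title()


def normalize_checkbox(raw: str) -> bool:
    """Treat a small set of marks as checked; anything else is unchecked."""
    return raw.strip().lower() in CHECKED_MARKS
```

Returning `None` for unparseable values, rather than raising, lets the validation step decide whether a missing field is an error for a given form.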

## Architecture / Flow

1. Watch input folder for new PDF files (or run batch processor).
2. Convert each PDF page to image(s) using `pdf2image` (requires `poppler`).
3. Run OCR on images using `paddleocr` to get text, segmentation, and confidence.
4. Parse OCR results to locate fields, using heuristics and templates.
5. Validate/normalize extracted values.
6. Insert or upsert documents into MongoDB via `pymongo`.
7. Move processed PDFs to an archive or error folder.
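
As a rough sketch of the batch variant of steps 1 and 7, the loop below scans a folder once and moves each PDF to an archive or error folder depending on the outcome. The `process` argument is a hypothetical stand-in for steps 2 to 6, and `run_batch` is an illustrative name, not part of the project yet.

```python
from __future__ import annotations

import shutil
from pathlib import Path
from typing import Callable


def run_batch(
    input_dir: Path,
    archive_dir: Path,
    error_dir: Path,
    process: Callable[[Path], None],
) -> dict[str, int]:
    """Process every PDF in input_dir once and move it out of the way.

    `process` is a hypothetical callable covering conversion, OCR,
    parsing, validation, and the MongoDB write.
    """
    archive_dir.mkdir(parents=True, exist_ok=True)
    error_dir.mkdir(parents=True, exist_ok=True)
    counts = {"processed": 0, "error": 0}

    # sorted() materializes the listing before any file is moved.
    # Note: "*.pdf" is case-sensitive on most Linux filesystems.
    for pdf in sorted(input_dir.glob("*.pdf")):
        try:
            process(pdf)
        except Exception as exc:  # keep the batch alive on a bad file
            print(f"failed: {pdf.name}: {exc}")
            shutil.move(str(pdf), str(error_dir / pdf.name))
            counts["error"] += 1
        else:
            shutil.move(str(pdf), str(archive_dir / pdf.name))
            counts["processed"] += 1
    return counts
```

Because handled files leave the input folder, rerunning the batch after a crash only picks up what is still pending, which is one simple way to approach the resumability goal.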

## Tech stack / dependencies

- Python 3.14 (project venv: `.venv`)
- OCR: `paddleocr` (and underlying `paddlepaddle`)
- PDF → image: `pdf2image` (requires `poppler` installed on host)
- MongoDB client: `pymongo`
- Image handling: `Pillow`
- Optional: `python-dotenv` for config, `rich` for nicer logs

System-level requirements:

- `poppler` (for `pdf2image`)
- Optional GPU drivers if using GPU-enabled `paddlepaddle` builds

## Environment / setup (dev)

1. Create and activate virtualenv (already done in this workspace):

```bash
python3 -m venv .venv
source .venv/bin/activate
```

2. Install Python dependencies from `requirements.txt`:

```bash
pip install -r requirements.txt
```

3. Configure a `.env` file for MongoDB connection string and paths (example):

```
MONGO_URI=mongodb://localhost:27017
INPUT_DIR=./input
ARCHIVE_DIR=./archive
ERROR_DIR=./error
```
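
One possible way to read these values in Python is sketched below. It only reads environment variables and falls back to the example values above; the `Settings` class and `load_settings` function are hypothetical names. If `python-dotenv` is installed, calling `load_dotenv()` first should copy the `.env` entries into the process environment.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the four values in the example .env."""

    mongo_uri: str
    input_dir: Path
    archive_dir: Path
    error_dir: Path


def load_settings(env: Mapping[str, str] | None = None) -> Settings:
    """Build Settings from env (default: os.environ), with the example defaults."""
    env = os.environ if env is None else env
    return Settings(
        mongo_uri=env.get("MONGO_URI", "mongodb://localhost:27017"),
        input_dir=Path(env.get("INPUT_DIR", "./input")),
        archive_dir=Path(env.get("ARCHIVE_DIR", "./archive")),
        error_dir=Path(env.get("ERROR_DIR", "./error")),
    )
```

Passing a plain dict as `env` keeps the function easy to test without touching the real environment.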

## Data model (draft)

- `documents` collection:
  - `_id`: UUID
  - `filename`: original PDF filename
  - `pages`: list of page objects; each page has `page_number`, `ocr_text`, `fields`
  - `extracted_fields`: dict of normalized field names -> values
  - `status`: `pending|processed|error`
  - `processed_at`, `created_at`
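
To make the draft concrete, here is a minimal sketch of building and storing one such document. Storing `_id` as a UUID string and using UTC timestamps are assumptions, as are the helper names; `upsert_document` expects a `pymongo` collection and relies on its `replace_one(..., upsert=True)` method.

```python
from __future__ import annotations

import uuid
from datetime import datetime, timezone
from typing import Any


def new_document(filename: str) -> dict[str, Any]:
    """Create a pending document following the draft data model."""
    return {
        "_id": str(uuid.uuid4()),  # assumption: UUID stored as a string
        "filename": filename,
        "pages": [],
        "extracted_fields": {},
        "status": "pending",
        "created_at": datetime.now(timezone.utc),
        "processed_at": None,
    }


def add_page(doc: dict[str, Any], page_number: int, ocr_text: str,
             fields: dict[str, Any]) -> None:
    """Append one page object with the three keys named in the draft."""
    doc["pages"].append(
        {"page_number": page_number, "ocr_text": ocr_text, "fields": fields}
    )


def mark_processed(doc: dict[str, Any], extracted_fields: dict[str, Any]) -> None:
    """Record the normalized fields and flip the status to processed."""
    doc["extracted_fields"] = extracted_fields
    doc["status"] = "processed"
    doc["processed_at"] = datetime.now(timezone.utc)


def upsert_document(collection: Any, doc: dict[str, Any]) -> None:
    """Insert or replace the document by _id (collection: pymongo Collection)."""
    collection.replace_one({"_id": doc["_id"]}, doc, upsert=True)
```

Replacing the whole document by `_id` keeps reprocessing idempotent: running the same file again overwrites the earlier result instead of duplicating it, provided the same `_id` is reused.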

## Next steps / short-term TODOs

- Create `requirements.txt` with pinned packages and document system deps.
- Add `README.md` with quickstart run instructions.
- Implement a small prototype script `processor.py` that converts a PDF to images and runs OCR on a single page.
- Add basic unit tests and a sample PDF in `samples/` for development.

## Notes / caveats

- `paddleocr` may require selecting CPU vs GPU builds of `paddlepaddle`: document preferred build and install instructions in `requirements.txt` or a separate notes section.
- OCR of scanned forms may require pre-processing (deskewing, denoising) for acceptable accuracy.

---

Created: initial development notes and plan.
README.md (normal file, 56 lines)
@@ -0,0 +1,56 @@
# Document Intake: Quickstart

Prerequisite: you already created the project virtual environment `.venv` in the repository root.

## macOS system deps

Install `poppler` (needed by `pdf2image`):

```bash
brew install poppler
```

## Activate the virtual environment

```bash
source .venv/bin/activate
```

## Install Python dependencies

```bash
pip install -r requirements.txt
```

If you need a specific `paddlepaddle` flavor (CPU vs GPU) follow the official install guide before or instead of the line above.

## Quick verification

Check that `paddleocr` and `pymongo` import successfully:

```bash
python -c "import paddleocr; import pymongo; print('imports OK')"
```
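
A slightly broader check could also look for the `poppler` command-line tools, since `pdf2image` shells out to `pdftoppm`. The sketch below only tests whether modules and binaries can be found, without importing anything heavy; the function names and the module list are assumptions.

```python
from __future__ import annotations

import importlib.util
import shutil

# Assumed top-level module names for the packages in requirements.txt.
REQUIRED_MODULES = ("paddleocr", "pymongo", "pdf2image", "yaml")
# pdftoppm ships with poppler and is what pdf2image calls.
REQUIRED_BINARIES = ("pdftoppm",)


def missing_modules(names: tuple[str, ...] = REQUIRED_MODULES) -> list[str]:
    """Return the module names that cannot be located on sys.path."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def missing_binaries(names: tuple[str, ...] = REQUIRED_BINARIES) -> list[str]:
    """Return the executables that are not on PATH."""
    return [n for n in names if shutil.which(n) is None]


def report() -> str:
    """Summarize what is missing, or confirm that everything was found."""
    problems = [f"module: {m}" for m in missing_modules()]
    problems += [f"binary: {b}" for b in missing_binaries()]
    return "environment OK" if not problems else "missing " + ", ".join(problems)
```

Running `print(report())` inside the activated virtual environment gives a one-line summary.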

## Running a processor (prototype)

The prototype script `processor.py`:

- Converts the first page of a PDF to an image using `pdf2image`.
- Runs OCR on that page with `paddleocr`.
- Prints the raw OCR results.

To run the prototype:

```bash
python processor.py --input samples/example.pdf
```
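
The raw output is nested. Assuming the PaddleOCR 2.x result layout (one entry per image, each entry either `None` or a list of `[box, (text, confidence)]` items), a helper along these lines could flatten it into text lines. `flatten_ocr_result` is a hypothetical name, and the layout should be checked against the installed `paddleocr` version.

```python
from __future__ import annotations

from typing import Any


def flatten_ocr_result(
    result: list[Any] | None, min_confidence: float = 0.0
) -> list[tuple[str, float]]:
    """Return (text, confidence) pairs from a PaddleOCR 2.x style result.

    Assumed layout: result[i] is None (nothing detected) or a list of
    [box, (text, confidence)] items for image i.
    """
    lines: list[tuple[str, float]] = []
    for page in result or []:
        if not page:
            continue  # page with no detected text
        for _box, (text, confidence) in page:
            if confidence >= min_confidence:
                lines.append((text, float(confidence)))
    return lines
```

Filtering on `min_confidence` is one cheap way to drop noisy detections before field parsing.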

## Useful files

- Development notes: [DEVELOPMENT.md](DEVELOPMENT.md)
- Python dependencies: [requirements.txt](requirements.txt)
conf.yaml (normal file, 11 lines)
@@ -0,0 +1,11 @@
## Configuration for the processor prototype

# Either point `input_dir` to a folder containing PDFs, or set `input_file`
input_dir: ./input
input_file: ""

archive_dir: ./archive
error_dir: ./error

# MongoDB connection (optional for prototype)
mongo_uri: "mongodb://localhost:27017"
processor.py (normal file, 82 lines)
@@ -0,0 +1,82 @@
#!/usr/bin/env python3
"""Minimal processor prototype: convert a PDF page to an image and run PaddleOCR.

This script is intentionally small and defensive: it checks for missing
dependencies and prints actionable instructions.
"""
from __future__ import annotations

import argparse
import os
import sys

try:
    import yaml
    from pdf2image import convert_from_path
    from paddleocr import PaddleOCR
    import numpy as np
except Exception as exc:  # pragma: no cover - runtime dependency guard
    print("Dependency error:", exc)
    print("Please install requirements: pip install -r requirements.txt")
    sys.exit(1)


def process_pdf_first_page(pdf_path: str) -> list:
    pages = convert_from_path(pdf_path, first_page=1, last_page=1)
    if not pages:
        raise RuntimeError("No pages returned from pdf2image")
    img = pages[0]
    ocr = PaddleOCR(use_angle_cls=True, lang='en')
    # PaddleOCR accepts numpy arrays for in-memory images
    result = ocr.ocr(np.array(img), cls=True)
    return result


def load_conf(path: str) -> dict:
    with open(path, "r") as fh:
        return yaml.safe_load(fh) or {}


def main() -> None:
    p = argparse.ArgumentParser(description="Processor prototype")
    p.add_argument("--conf", default="conf.yaml", help="Path to conf.yaml")
    p.add_argument(
        "--input", help="Path to input PDF or input directory (overrides conf)")
    args = p.parse_args()

    conf = {}
    if os.path.exists(args.conf):
        conf = load_conf(args.conf)

    input_spec = args.input or conf.get("input_file") or conf.get("input_dir")
    if not input_spec:
        print("No input specified. Set --input or define 'input_file' / 'input_dir' in conf.yaml")
        sys.exit(2)

    if os.path.isdir(input_spec):
        for fname in sorted(os.listdir(input_spec)):
            if fname.lower().endswith(".pdf"):
                input_spec = os.path.join(input_spec, fname)
                break
        else:
            print("No PDF files found in directory:", input_spec)
            sys.exit(3)

    if not os.path.exists(input_spec):
        print("Input path does not exist:", input_spec)
        sys.exit(4)

    print("Processing:", input_spec)
    try:
        res = process_pdf_first_page(input_spec)
    except Exception as exc:
        print("Processing error:", exc)
        sys.exit(5)

    print("OCR result (first page):")
    for block in res:
        print(block)


if __name__ == "__main__":
    main()
requirements.txt (normal file, 28 lines)
@@ -0,0 +1,28 @@
# Core OCR
paddleocr>=2.7.0
# PaddlePaddle (choose CPU or GPU build appropriate for your system)
# For CPU-only install: `pip install paddlepaddle` or follow the official install guide
#paddlepaddle>=2.5.0

# PDF -> image
pdf2image>=1.16.0
Pillow>=10.0.0

# Database
pymongo>=4.4.0

# Utilities
python-dotenv>=1.0.0
rich>=13.0.0
tqdm>=4.65.0

# Testing / linting (optional)
pytest>=8.0.0
mypy>=1.5.0

# YAML config parsing
PyYAML>=6.0

# Notes:
# - `pdf2image` requires the `poppler` system package (e.g. `brew install poppler` on macOS).
# - Select the appropriate `paddlepaddle` wheel for your platform (CPU vs GPU, macOS vs Linux).