docintake-gm/README.md
2026-01-01 21:57:33 -08:00


Document Intake — Quickstart

Prerequisite: a project virtual environment named .venv already exists in the repository root. If it does not, create one with python3 -m venv .venv.

macOS system deps

Install poppler (needed by pdf2image):

brew install poppler

Activate the virtual environment

source .venv/bin/activate

Install Python dependencies

pip install -r requirements.txt

If you need a specific paddlepaddle build (CPU or GPU), follow the official PaddlePaddle install guide before, or instead of, running the command above.

Quick verification

Check that paddleocr and pymongo import successfully:

python -c "import paddleocr; import pymongo; print('imports OK')"
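The one-liner above does not cover pdf2image or the poppler binaries it calls. A slightly broader check could look like the sketch below. The function name check_environment is hypothetical, and the sketch assumes pdf2image is listed in requirements.txt. It only looks modules up with importlib instead of importing them, so it runs quickly but will not catch a package that is installed and still fails at import time.

```python
import importlib.util
import shutil


def check_environment(modules, executables):
    """Report which modules can be located and which executables are on PATH.

    Hypothetical helper for illustration; not part of the repository.
    """
    return {
        "modules": {m: importlib.util.find_spec(m) is not None for m in modules},
        "executables": {e: shutil.which(e) is not None for e in executables},
    }


if __name__ == "__main__":
    report = check_environment(
        # Python packages this README relies on (pdf2image is an assumption
        # about the contents of requirements.txt).
        modules=["paddleocr", "pymongo", "pdf2image"],
        # Command-line tools shipped with poppler and used by pdf2image.
        executables=["pdftoppm", "pdfinfo"],
    )
    for kind, results in report.items():
        for name, found in results.items():
            print(f"{kind}: {name}: {'OK' if found else 'MISSING'}")
```

If pdftoppm or pdfinfo is reported as MISSING, re-run the brew install poppler step and make sure the Homebrew bin directory is on PATH.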

Running a processor (prototype)

We will add a prototype script processor.py that:

  • Converts pages from a PDF to images using pdf2image.
  • Runs OCR on one page with paddleocr.
  • Prints basic extraction results.

To run the prototype (once added):

python processor.py --input samples/example.pdf
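Until processor.py exists, the three steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function names, the --page flag, and the choice of lang="en" are assumptions. The structure of the value returned by paddleocr differs between releases, so the sketch prints it as-is without parsing it. Heavy imports are deferred into the functions that need them so that argument parsing works even before the dependencies are installed.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line arguments (--page is a hypothetical extra flag)."""
    parser = argparse.ArgumentParser(description="Prototype: OCR one page of a PDF.")
    parser.add_argument("--input", required=True, help="Path to the input PDF")
    parser.add_argument("--page", type=int, default=1, help="1-based page number to OCR")
    return parser.parse_args(argv)


def render_page(pdf_path, page):
    """Convert a single PDF page to a PIL image using pdf2image (needs poppler)."""
    from pdf2image import convert_from_path

    pages = convert_from_path(pdf_path, first_page=page, last_page=page)
    if not pages:
        raise ValueError(f"Page {page} not found in {pdf_path}")
    return pages[0]


def run_ocr(image):
    """Run paddleocr on one PIL image and return its raw result."""
    import numpy as np
    from paddleocr import PaddleOCR

    ocr = PaddleOCR(lang="en")  # assumption: English documents
    # The result layout depends on the installed paddleocr version.
    return ocr.ocr(np.array(image.convert("RGB")))


def main(argv=None):
    args = parse_args(argv)
    image = render_page(args.input, args.page)
    print(f"Rendered page {args.page}: {image.size[0]}x{image.size[1]} px")
    print(run_ocr(image))


if __name__ == "__main__":
    main()
```

Rendering only the requested page keeps memory use low on large PDFs, since convert_from_path would otherwise rasterize every page.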

Useful files

