Document Intake — Quickstart
Prerequisite: you have already created the project virtual environment .venv in the repository root.
macOS system deps
Install poppler (needed by pdf2image):
brew install poppler
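To confirm the install worked, a small check along these lines may help. It is only a sketch: it assumes the Poppler command-line tools pdftoppm and pdfinfo (the ones pdf2image calls) should be on your PATH, and the function names are illustrative, not part of this project.

```python
import shutil


def find_poppler_tools(tools=("pdftoppm", "pdfinfo")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


def missing_poppler_tools(tools=("pdftoppm", "pdfinfo")):
    """Return the names of the tools that could not be found on PATH."""
    return [name for name, path in find_poppler_tools(tools).items() if path is None]


missing = missing_poppler_tools()
if missing:
    print("Missing Poppler tools: " + ", ".join(missing))
else:
    print("Poppler tools found")
```

If anything is reported missing, re-run the brew command above and open a new shell so PATH is refreshed.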
Activate the virtual environment
source .venv/bin/activate
Install Python dependencies
pip install -r requirements.txt
If you need a specific paddlepaddle build (CPU or GPU), follow the official PaddlePaddle install guide before, or instead of, running the command above.
Quick verification
Check that paddleocr and pymongo import successfully:
python -c "import paddleocr; import pymongo; print('imports OK')"
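If the one-liner fails, it only reports the first broken import. A slightly longer check, sketched below, reports each module separately. The helper name check_imports is hypothetical, and adding pdf2image to the list is an assumption based on the Poppler note above.

```python
import importlib


def check_imports(module_names):
    """Try to import each module; map its name to None on success or an error string."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:  # some packages fail with errors other than ImportError
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


# Report the status of each dependency on its own line.
for module, error in check_imports(["paddleocr", "pymongo", "pdf2image"]).items():
    print(f"{module}: {'OK' if error is None else error}")
```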
Running a processor (prototype)
We will add a prototype script processor.py that:
- Converts pages from a PDF to images using pdf2image.
- Runs OCR on one page with paddleocr.
- Prints basic extraction results.
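As a rough idea of what those three steps could look like, here is a minimal sketch. It is not the final processor.py: the --page option, the function names, and the lang="en" setting are assumptions for illustration. The structure of the value returned by paddleocr differs between releases, so the sketch prints it as-is rather than assuming a layout.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options; --page is a hypothetical extra option."""
    parser = argparse.ArgumentParser(description="Prototype PDF OCR processor")
    parser.add_argument("--input", required=True, help="Path to the input PDF")
    parser.add_argument("--page", type=int, default=1, help="1-based page to OCR")
    return parser.parse_args(argv)


def pdf_page_to_image(pdf_path, page):
    """Render a single PDF page to a PIL image (needs Poppler installed)."""
    from pdf2image import convert_from_path  # imported lazily: heavy optional dependency

    pages = convert_from_path(pdf_path, first_page=page, last_page=page)
    if not pages:
        raise ValueError(f"Page {page} not found in {pdf_path}")
    return pages[0]


def run_ocr(image):
    """Run OCR on one PIL image and return the raw paddleocr result."""
    import numpy as np
    from paddleocr import PaddleOCR  # imported lazily: slow to load

    engine = PaddleOCR(lang="en")  # assumption: English documents
    return engine.ocr(np.array(image))


def main(argv=None):
    """Convert one page to an image, OCR it, and print the raw result."""
    args = parse_args(argv)
    image = pdf_page_to_image(args.input, args.page)
    print(run_ocr(image))
```

In a real script you would call main() under the usual if __name__ == "__main__": guard. The heavy imports sit inside the functions so that argument errors are reported before paddleocr is loaded.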
To run the prototype (once added):
python processor.py --input samples/example.pdf
Useful files
- Development notes: DEVELOPMENT.md
- Python dependencies: requirements.txt