# Document Intake — Quickstart
Prerequisite: the project virtual environment `.venv` already exists in the repository root.
## macOS system deps
Install `poppler` (needed by `pdf2image`):
```bash
brew install poppler
```
## Activate the virtual environment
```bash
source .venv/bin/activate
```
## Install Python dependencies
```bash
pip install -r requirements.txt
```
If you need a specific `paddlepaddle` build (CPU or GPU), follow the official PaddlePaddle install guide before, or instead of, running the command above.
## Quick verification
Check that `paddleocr` and `pymongo` import successfully:
```bash
python -c "import paddleocr; import pymongo; print('imports OK')"
```
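If the one-liner fails, a slightly longer check can show which piece is missing. The following is a minimal sketch and not part of the repository: the function names `missing_modules`, `missing_binaries`, and `report` are hypothetical. It also checks for `pdftoppm` and `pdfinfo`, the poppler executables that `pdf2image` calls. Note that it only looks the modules up without importing them, so it complements the import test above rather than replacing it.

```python
import importlib.util
import shutil


def missing_modules(names):
    """Return the top-level module names that cannot be located (no import is performed)."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_binaries(names):
    """Return the executables that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


def report():
    """Print what is missing and return True when nothing is."""
    # Modules named in this README; pdf2image is included because the
    # prototype depends on it.
    modules = missing_modules(["paddleocr", "pymongo", "pdf2image"])
    # pdftoppm and pdfinfo ship with poppler and are invoked by pdf2image.
    binaries = missing_binaries(["pdftoppm", "pdfinfo"])
    for name in modules:
        print(f"missing Python module: {name}")
    for name in binaries:
        print(f"missing executable: {name}")
    ok = not modules and not binaries
    if ok:
        print("environment OK")
    return ok
```

Calling `report()` from an activated `.venv` prints one line per missing module or executable, which narrows a failure down to either the `pip install` step or the `brew install poppler` step.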
## Running a processor (prototype)
We will add a prototype script `processor.py` that:
- Converts pages from a PDF to images using `pdf2image`.
- Runs OCR on one page with `paddleocr`.
- Prints basic extraction results.
To run the prototype (once added):
```bash
python processor.py --input samples/example.pdf
```
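Until the script exists, the three steps listed above could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's implementation: the helper names, the `--page` option, and `lang="en"` are hypothetical, and `extract_lines` assumes the PaddleOCR 2.x result layout (one entry per image, each a list of `[box, (text, confidence)]` items). Newer PaddleOCR releases return results in a different shape, so that function may need adjusting.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; --input matches the command shown above."""
    parser = argparse.ArgumentParser(description="OCR one page of a PDF.")
    parser.add_argument("--input", required=True, help="path to the PDF file")
    # --page is a hypothetical extra option, not part of the README's command.
    parser.add_argument("--page", type=int, default=1, help="1-based page number")
    return parser.parse_args(argv)


def render_page(pdf_path, page):
    """Convert a single PDF page to a PIL image with pdf2image (needs poppler)."""
    from pdf2image import convert_from_path  # imported lazily, only when used

    images = convert_from_path(pdf_path, first_page=page, last_page=page)
    if not images:
        raise ValueError(f"page {page} not found in {pdf_path}")
    return images[0]


def run_ocr(image):
    """Run PaddleOCR on one image and return its raw result."""
    import numpy as np
    from paddleocr import PaddleOCR

    engine = PaddleOCR(lang="en")  # language choice is an assumption
    return engine.ocr(np.array(image))


def extract_lines(ocr_result):
    """Flatten a result into (text, confidence) pairs.

    Assumes the PaddleOCR 2.x layout: one entry per image, each entry a list
    of [box, (text, confidence)] items, or None when no text was detected.
    """
    lines = []
    for page in ocr_result or []:
        for _box, (text, confidence) in page or []:
            lines.append((text, float(confidence)))
    return lines


def main(argv=None):
    """Render one page, run OCR on it, and print each recognized line."""
    args = parse_args(argv)
    image = render_page(args.input, args.page)
    for text, confidence in extract_lines(run_ocr(image)):
        print(f"{confidence:.2f}  {text}")
    return 0
```

To use this as `processor.py`, end the file with the usual `if __name__ == "__main__":` guard calling `main()`. The third-party imports sit inside the functions so that argument parsing and `extract_lines` can be exercised without `pdf2image` or `paddleocr` installed.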
## Useful files
- Development notes: [DEVELOPMENT.md](DEVELOPMENT.md)
- Python dependencies: [requirements.txt](requirements.txt)