docintake-gm/README.md

# Document Intake — Quickstart

Prerequisite: a project virtual environment named `.venv` already exists in the repository root.

## macOS system dependencies

Install `poppler` (needed by `pdf2image`):

```bash
brew install poppler
```
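
To confirm that Poppler is visible after installation, a small check like the one below may help. It is a sketch that assumes `pdf2image` finds Poppler through the `pdftoppm` and `pdfinfo` executables on `PATH`; the helper name `poppler_tools_on_path` is made up for this example.

```python
import shutil


def poppler_tools_on_path(tools=("pdftoppm", "pdfinfo")):
    """Map each Poppler tool name to its resolved path, or None if missing."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in poppler_tools_on_path().items():
        print(f"{tool}: {path or 'NOT FOUND'}")
```

If a tool is reported as not found, re-running the `brew install` step or checking your shell's `PATH` is a reasonable first step.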

## Activate the virtual environment

```bash
source .venv/bin/activate
```
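
If you are unsure whether activation worked, the following sketch checks it from inside Python. It assumes the environment was created with the standard `venv` module; the function name `in_virtualenv` is illustrative only.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter:", sys.executable)
```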

## Install Python dependencies

```bash
pip install -r requirements.txt
```

If you need a specific `paddlepaddle` build (CPU or GPU), follow the official install guide before, or instead of, running the command above.
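
To see which `paddlepaddle` build ended up installed, a check along these lines can be used. This is a sketch: it relies on `paddle.device.is_compiled_with_cuda()`, treats any non-CUDA build as "cpu", and the helper name `paddle_build` is hypothetical.

```python
def paddle_build():
    """Return 'gpu', 'cpu', or None when paddlepaddle is not installed."""
    try:
        import paddle  # imported lazily so the check works without paddle
    except ImportError:
        return None
    # Assumption: a CUDA-enabled build is what this project means by "GPU".
    return "gpu" if paddle.device.is_compiled_with_cuda() else "cpu"


if __name__ == "__main__":
    print("paddlepaddle build:", paddle_build() or "not installed")
```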

## Quick verification

Check that `paddleocr` and `pymongo` import successfully:

```bash
python -c "import paddleocr; import pymongo; print('imports OK')"
```
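
The one-liner stops at the first failing import. If you would rather see the status of every package, a small script such as the one below may be more informative. It is a sketch only; the function name `check_imports` and the choice to include `pdf2image` in the list are assumptions for this example.

```python
import importlib


def check_imports(modules):
    """Try importing each module; map its name to 'OK' or the error text."""
    report = {}
    for name in modules:
        try:
            importlib.import_module(name)
            report[name] = "OK"
        except Exception as exc:  # heavy packages can fail with more than ImportError
            report[name] = f"{type(exc).__name__}: {exc}"
    return report


if __name__ == "__main__":
    packages = ["paddleocr", "pymongo", "pdf2image"]
    for name, status in check_imports(packages).items():
        print(f"{name}: {status}")
```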

## Running a processor (prototype)

We will add a prototype script `processor.py` that:

- Converts pages from a PDF to images using `pdf2image`.
- Runs OCR on one page with `paddleocr`.
- Prints basic extraction results.
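
The steps above could be sketched roughly as follows. This is not the final `processor.py`, only a minimal illustration under several assumptions: it uses the PaddleOCR 2.x call style (`PaddleOCR(...).ocr(img, cls=True)`), which newer releases may have changed; it assumes the 2.x result layout described in the comments; and the `--page` flag and all function names are hypothetical.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="OCR one page of a PDF.")
    parser.add_argument("--input", help="path to the PDF file")
    # Hypothetical flag: which page to OCR (1-based).
    parser.add_argument("--page", type=int, default=1, help="1-based page number")
    return parser.parse_args(argv)


def pdf_page_to_image(pdf_path, page, dpi=200):
    """Render a single page to a PIL image via pdf2image (needs Poppler)."""
    from pdf2image import convert_from_path

    images = convert_from_path(pdf_path, dpi=dpi, first_page=page, last_page=page)
    if not images:
        raise ValueError(f"page {page} not found in {pdf_path}")
    return images[0]


def run_ocr(image):
    """Run PaddleOCR on one PIL image (assumes the PaddleOCR 2.x interface)."""
    import numpy as np
    from paddleocr import PaddleOCR

    # Assumption: PaddleOCR expects OpenCV-style BGR arrays, so flip RGB to BGR.
    bgr = np.ascontiguousarray(np.array(image.convert("RGB"))[:, :, ::-1])
    ocr = PaddleOCR(use_angle_cls=True, lang="en")
    return ocr.ocr(bgr, cls=True)


def extract_lines(result):
    """Flatten a 2.x-style result into (text, confidence) pairs.

    Assumed layout: one list per image, each entry [box, (text, confidence)];
    an image with no detected text may be None.
    """
    lines = []
    for page in result or []:
        for _box, (text, confidence) in page or []:
            lines.append((text, float(confidence)))
    return lines


def main(argv=None):
    args = parse_args(argv)
    if args.input is None:
        print("usage: python processor.py --input <file.pdf> [--page N]")
        return
    image = pdf_page_to_image(args.input, args.page)
    for text, confidence in extract_lines(run_ocr(image)):
        print(f"{confidence:.2f}\t{text}")


if __name__ == "__main__":
    main()
```

The heavy imports live inside the functions that need them, so argument parsing and result flattening can be exercised without `paddleocr` or Poppler installed.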

To run the prototype (once added):

```bash
python processor.py --input samples/example.pdf
```

## Useful files

- Development notes: [DEVELOPMENT.md](DEVELOPMENT.md)
- Python dependencies: [requirements.txt](requirements.txt)
