# Document Intake — Quickstart

Prerequisite: you have already created the project virtual environment `.venv` in the repository root.

## macOS system dependencies

Install `poppler` (needed by `pdf2image`):

```bash
brew install poppler
```

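To confirm the install from Python, a minimal sketch can look for the Poppler command-line tools on `PATH`. It assumes that `pdf2image` relies on the `pdftoppm` and `pdfinfo` binaries that Poppler ships; the helper name `missing_poppler_tools` is made up for this example.

```python
import shutil

# Poppler binaries that pdf2image is assumed to call under the hood.
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def missing_poppler_tools(tools=POPPLER_TOOLS, which=shutil.which):
    """Return the names of the tools that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


missing = missing_poppler_tools()
if missing:
    print("Missing Poppler tools:", ", ".join(missing))
else:
    print("Poppler tools found on PATH")
```

If any tool is reported missing, re-running the `brew install poppler` step or checking your shell `PATH` is the likely fix.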
## Activate the virtual environment

```bash
source .venv/bin/activate
```

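To double-check that activation worked, a short sketch can ask the interpreter whether it is running inside a virtual environment. It relies on `sys.prefix` differing from `sys.base_prefix` inside a `venv`; the function name `in_virtualenv` is chosen only for illustration.

```python
import sys


def in_virtualenv():
    """True when the interpreter runs inside a venv-style environment."""
    # In a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter:", sys.executable)
```

When the environment is active, the printed interpreter path should sit under `.venv/`.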
## Install Python dependencies

```bash
pip install -r requirements.txt
```

If you need a specific `paddlepaddle` flavor (CPU vs. GPU), follow the official install guide before, or instead of, running the command above.

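To see which packages and which `paddlepaddle` flavor ended up installed, a small sketch can query the installed distribution metadata. The distribution names below are assumptions based on the packages this quickstart mentions (in particular, the GPU flavor is assumed to be published as `paddlepaddle-gpu`), and `installed_versions` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names are assumptions; adjust them to match requirements.txt.
PACKAGES = ("paddlepaddle", "paddlepaddle-gpu", "paddleocr", "pdf2image", "pymongo")


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'not installed'}")
```

Normally only one of `paddlepaddle` and `paddlepaddle-gpu` should show a version.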
## Quick verification

Check that `paddleocr` and `pymongo` import successfully:

```bash
python -c "import paddleocr; import pymongo; print('imports OK')"
```

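If the one-liner fails, it does not say which import broke. A slightly longer sketch can report each module separately; `check_imports` is a hypothetical helper, and `pdf2image` is added to the list only because this quickstart depends on it.

```python
import importlib


def check_imports(module_names):
    """Try to import each module; map its name to None or an error message."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:  # ImportError, or a failure raised during import
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


for name, error in check_imports(["paddleocr", "pymongo", "pdf2image"]).items():
    print(f"{name}: {'OK' if error is None else error}")
```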
## Running a processor (prototype)

We will add a prototype script, `processor.py`, that:

- Converts pages from a PDF to images using `pdf2image`.
- Runs OCR on one page with `paddleocr`.
- Prints basic extraction results.

To run the prototype (once added):

```bash
python processor.py --input samples/example.pdf
```

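Until the real script exists, the steps listed above could be sketched roughly as follows. This is not the final `processor.py`: the `--page` option and the names `parse_args`, `extract_lines`, and `run` are invented for illustration. It also assumes a PaddleOCR release whose `ocr()` method accepts an image array and returns one list per page of `[box, (text, confidence)]` entries; other releases may return a different structure.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; --page is a hypothetical extra option."""
    parser = argparse.ArgumentParser(description="OCR one page of a PDF.")
    parser.add_argument("--input", required=True, help="path to the PDF file")
    parser.add_argument("--page", type=int, default=1, help="1-based page number")
    return parser.parse_args(argv)


def extract_lines(page_result):
    """Flatten one page of OCR output into (text, confidence) pairs.

    Assumes each detected line looks like [box, (text, confidence)] and
    that a page without any text is reported as None.
    """
    if not page_result:
        return []
    return [(text, float(conf)) for _box, (text, conf) in page_result]


def run(pdf_path, page=1):
    """Convert one PDF page to an image, OCR it, and return the text lines."""
    # Imported lazily so the helpers above work without the heavy dependencies.
    import numpy as np
    from paddleocr import PaddleOCR
    from pdf2image import convert_from_path

    images = convert_from_path(pdf_path, first_page=page, last_page=page)
    if not images:
        return []
    engine = PaddleOCR(lang="en")
    # The PIL image is passed as an array; the result shape is an assumption.
    result = engine.ocr(np.asarray(images[0]))
    return extract_lines(result[0] if result else None)


def main(argv=None):
    """Print each recognized line with its confidence score."""
    args = parse_args(argv)
    for text, conf in run(args.input, args.page):
        print(f"{conf:.2f}  {text}")
```

Saved as `processor.py` with the usual `if __name__ == "__main__": main()` guard at the bottom, a script along these lines would respond to the command shown above.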
## Useful files

- Development notes: [DEVELOPMENT.md](DEVELOPMENT.md)
- Python dependencies: [requirements.txt](requirements.txt)
