Initial commit
56
README.md
Normal file
@@ -0,0 +1,56 @@
# Document Intake — Quickstart
Prerequisite: you have already created the project virtual environment `.venv` in the repository root.
## macOS system deps
Install `poppler` (needed by `pdf2image`):
```bash
brew install poppler
```
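`pdf2image` calls poppler's command-line tools rather than bundling them, so one way to confirm the install is to check that those tools are on `PATH`. The helper below is a hypothetical sketch (`missing_poppler_tools` is not part of this repository) and assumes the default `pdf2image` settings, which use `pdftoppm` and `pdfinfo`.

```python
import shutil

# Poppler binaries that pdf2image invokes with its default settings
# (assumption: pdftoppm is used rather than pdftocairo).
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def missing_poppler_tools(which=shutil.which):
    """Return the poppler tools that cannot be found on PATH."""
    return [tool for tool in POPPLER_TOOLS if which(tool) is None]


missing = missing_poppler_tools()
if missing:
    print("Missing poppler tools:", ", ".join(missing))
else:
    print("poppler tools found")
```

The `which` parameter exists only so the lookup can be swapped out when testing the helper.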
## Activate the virtual environment
```bash
source .venv/bin/activate
```
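If it is unclear whether activation worked, the interpreter itself can report it. This is a minimal sketch, assuming the environment was created with the standard `venv` module; `in_virtualenv` is an illustrative name, not an existing project function.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the interpreter it was created from.
    return sys.prefix != sys.base_prefix


print("virtualenv active:", in_virtualenv())
print("interpreter:", sys.executable)
```

When the environment is active, the printed interpreter path should sit under `.venv/`.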
## Install Python dependencies
```bash
pip install -r requirements.txt
```
If you need a specific `paddlepaddle` build (CPU or GPU), follow the official PaddlePaddle install guide before running the command above, or instead of it.
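To see which build ended up in the environment, the installed distribution names can be inspected. The sketch below assumes the PyPI names `paddlepaddle` (CPU) and `paddlepaddle-gpu` (GPU); `installed_paddle_flavor` is a hypothetical helper, and it only inspects package metadata without importing `paddle`.

```python
from importlib import metadata

# PyPI distribution names mapped to the flavor each one provides
# (assumption: these are the only two names of interest).
PADDLE_DISTRIBUTIONS = {"paddlepaddle-gpu": "gpu", "paddlepaddle": "cpu"}


def _is_installed(dist_name):
    """Return True when a distribution with this name is installed."""
    try:
        metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return False
    return True


def installed_paddle_flavor(is_installed=_is_installed):
    """Return "gpu", "cpu", or None when no paddlepaddle build is installed."""
    for dist_name, flavor in PADDLE_DISTRIBUTIONS.items():
        if is_installed(dist_name):
            return flavor
    return None


print("paddlepaddle flavor:", installed_paddle_flavor())
```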
## Quick verification
Check that `paddleocr` and `pymongo` import successfully:
```bash
python -c "import paddleocr; import pymongo; print('imports OK')"
```
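When that one-liner fails, it only reports the first import that broke. A variant that lists every missing module at once could look like the sketch below; `missing_modules` is a hypothetical helper, and including `pdf2image` in the list is an assumption based on the poppler note above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# pdf2image is listed as an assumption: this README names it as the
# PDF-to-image step, so it is expected to be in requirements.txt.
missing = missing_modules(["paddleocr", "pymongo", "pdf2image"])
print("missing:", ", ".join(missing) if missing else "none")
```

`find_spec` locates a module without importing it, so this check is quick but will not catch import-time failures such as a broken native library; the `python -c` command remains the stronger test.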
## Running a processor (prototype)
We will add a prototype script `processor.py` that:
- Converts pages from a PDF to images using `pdf2image`.
- Runs OCR on one page with `paddleocr`.
- Prints basic extraction results.
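
The three steps above could be sketched roughly as follows. This is not the actual `processor.py`, which does not exist yet; all function names are hypothetical. The OCR call assumes the PaddleOCR 2.x interface, where `ocr()` accepts `cls=True` and returns one list of `[box, (text, score)]` entries per image; newer releases changed this interface, so check the installed version before relying on it.

```python
def pdf_page_to_image(pdf_path, page_number=1, dpi=200):
    """Render one PDF page (1-based) to a PIL image using pdf2image."""
    from pdf2image import convert_from_path  # requires poppler on PATH

    pages = convert_from_path(
        str(pdf_path), dpi=dpi, first_page=page_number, last_page=page_number
    )
    return pages[0]


def run_ocr(image, lang="en"):
    """Run PaddleOCR on a PIL image and return the raw result.

    Assumption: PaddleOCR 2.x, where ocr() accepts cls=True.
    """
    import numpy as np
    from paddleocr import PaddleOCR

    engine = PaddleOCR(use_angle_cls=True, lang=lang)
    return engine.ocr(np.array(image), cls=True)


def extract_lines(ocr_result):
    """Flatten a 2.x-style result into (text, confidence) pairs.

    Assumed shape: one entry per image, each a list of [box, (text, score)],
    or None when no text was detected on that image.
    """
    lines = []
    for page in ocr_result or []:
        for _box, (text, score) in page or []:
            lines.append((text, float(score)))
    return lines


def process(pdf_path):
    """Convert the first page, run OCR on it, and print basic results."""
    image = pdf_page_to_image(pdf_path)
    for text, score in extract_lines(run_ocr(image)):
        print(f"{score:.2f}  {text}")
```

The heavy imports live inside the functions so that the module can be loaded, and `extract_lines` exercised, on a machine where `paddleocr` is not installed. Once a sample file exists, `process("samples/example.pdf")` would run the whole chain.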
To run the prototype (once added):
```bash
python processor.py --input samples/example.pdf
```
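The command line above implies a required `--input` option. One way the prototype might parse and sanity-check it is sketched below; only the `--input` flag comes from this README, while `parse_args`, `validate_input`, and the error messages are illustrative assumptions.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line; --input is the PDF to process."""
    parser = argparse.ArgumentParser(description="Document intake prototype")
    parser.add_argument("--input", required=True, type=Path, help="path to a PDF file")
    return parser.parse_args(argv)


def validate_input(path):
    """Return an error message for an unusable input path, or None if it looks fine."""
    if path.suffix.lower() != ".pdf":
        return f"expected a .pdf file, got: {path}"
    if not path.is_file():
        return f"file not found: {path}"
    return None


# Parse the same arguments as the command above and report the check result.
args = parse_args(["--input", "samples/example.pdf"])
print(args.input, "->", validate_input(args.input) or "ok")
```

Passing `argv=None` makes `argparse` read the real command line, which is what a script entry point would do.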
## Useful files
- Development notes: [DEVELOPMENT.md](DEVELOPMENT.md)
- Python dependencies: [requirements.txt](requirements.txt)
---
Planned next steps: add a minimal `processor.py` prototype and a `samples/` folder with a placeholder PDF.