paperless
Scan, index, and archive all of your paper documents
I hate paper. Environmental issues aside, it's a tech person's nightmare:
- There's no search feature
- It takes up physical space
- Backups mean more paper
In the past few months I've been bitten more than a few times by the problem of not having the right document around. Sometimes I recycled a document I needed (who keeps water bills for two years?) and other times I just lost it... because paper. I wrote this to make my life easier.
How it Works:
- Buy a document scanner like this one.
- Set it up to "scan to FTP" or something similar. It should be able to push scanned images to a server without you having to do anything.
- Have the target server run the paperless consumption script to OCR the PDF and index it into a local database.
- Use the web frontend to sift through the database and find what you want.
- Download the PDF you need/want via the web interface and do whatever you like with it. You can even print it and send it as if it's the original. In most cases, no one will care or notice.
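The consumption step in the list above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: `consume_once`, `archive_dir`, the `ocr` callback, and the dictionary standing in for the database are all hypothetical names introduced here.

```python
import shutil
from pathlib import Path
from typing import Callable, Dict


def consume_once(
    consumption_dir: Path,
    archive_dir: Path,
    ocr: Callable[[Path], str],
    index: Dict[str, str],
) -> int:
    """Import every file waiting in consumption_dir; return how many were handled."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    handled = 0
    for source in sorted(consumption_dir.iterdir()):
        if not source.is_file():
            continue  # skip sub-directories
        text = ocr(source)  # hypothetical OCR callback, e.g. a Tesseract wrapper
        target = archive_dir / source.name
        shutil.move(str(source), str(target))  # the file leaves the consumption directory
        index[source.name] = text  # assumption: a dict stands in for the real database
        handled += 1
    return handled
```

Passing the OCR step in as a callback keeps the sketch runnable without Tesseract installed; a real setup would plug in an actual OCR function and a proper database.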
Requirements
This is really just a simple, shiny, user-friendly wrapper around some very powerful tools:
- ImageMagick converts the images between colour and greyscale.
- Tesseract does the character recognition.
- GNU Privacy Guard
- Python 3 is the language of the project.
- Pillow converts the PDFs to images.
- PyOCR is a slick programmatic wrapper around Tesseract.
- Django is the framework this project is written against.
- Python-GNUPG
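Before installing the Python packages, it may help to confirm that the external tools are on your PATH. The following is a small sketch using only the standard library. The executable names `tesseract` and `gpg` are assumptions (some systems ship `gpg2` instead), and `find_binaries` is a hypothetical helper, not part of the project.

```python
import shutil
from typing import Dict, Optional, Sequence

# Assumed executable names: "convert" comes with ImageMagick,
# "tesseract" with Tesseract, and "gpg" with GNU Privacy Guard.
REQUIRED_BINARIES = ("convert", "tesseract", "gpg")


def find_binaries(names: Sequence[str] = REQUIRED_BINARIES) -> Dict[str, Optional[str]]:
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_binaries().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```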
Instructions
- Check out this repo to somewhere convenient and install the requirements listed here into your environment.
- Configure settings.py and make sure that CONVERT_BINARY, SCRATCH_DIR, and CONSUMPTION_DIR are set to values you'd expect:
  - CONVERT_BINARY: The path to convert, installed as part of ImageMagick.
  - SCRATCH_DIR: A place for files to be created and destroyed. The default is as good a place as any.
  - CONSUMPTION_DIR: The directory your scanner will be depositing files into. Note that the consumption script will import files from here and then delete them.
- Run python manage.py migrate. This will create your local database.
- Run python manage.py consume and enter your preferred passphrase when prompted.
- Start the webserver with python manage.py runserver and enter the same passphrase when prompted.
- Log into your new toy by visiting http://localhost:8000/.
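For the settings.py step in the instructions, the three values might look something like the fragment below. The setting names come from the instructions; every path shown is a hypothetical example and should be replaced with whatever matches your machine.

```python
# Fragment of settings.py; all paths below are illustrative assumptions.

# Path to ImageMagick's "convert" executable.
CONVERT_BINARY = "/usr/bin/convert"

# Working area where temporary files are created and destroyed.
SCRATCH_DIR = "/tmp/paperless"

# Directory the scanner uploads into. Files here are imported and then deleted,
# so do not point this at a folder holding your only copy of anything.
CONSUMPTION_DIR = "/srv/ftp/scans"
```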