Updated the requirements section

Daniel Quinn 2016-01-10 15:51:16 +00:00
parent 94ce12f9ee
commit 5b479c5058


@@ -35,13 +35,21 @@ powerful tools.
* [Tesseract](https://github.com/tesseract-ocr) does the character recognition
* [GNU Privacy Guard](https://gnupg.org)
* [Python 3](https://python.org/) is the language of the project
-* [Pillow](https://pypi.python.org/pypi/pillowfight/) converts the PDFs to
-images
+* [Pillow](https://pypi.python.org/pypi/pillowfight/) loads the image data
+as a python object to be used with PyOCR.
* [PyOCR](https://github.com/jflesch/pyocr) is a slick programmatic wrapper
around tesseract
* [Django](https://djangoproject.org/) is the framework this project is
written against.
-* [Python-GNUPG](http://pythonhosted.org/python-gnupg/)
+* [Python-GNUPG](http://pythonhosted.org/python-gnupg/) decrypts the PDFs
+on-the-fly to allow you to download unencrypted files, leaving the
+encrypted ones on-disk.
+The keen eye might have noticed that we're converting a PDF to an image to be
+read by Tesseract, and to do this we're using a chain of: scanned PDF >
+Imagemagick > Pillow > PyOCR > Tesseract > text. It's not ideal, but
+apparently, Pillow lacks the ability to read PDFs, and PyOCR requires a Pillow
+object, so we're sort of stuck.
## Instructions
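
The chain named in the added paragraph (scanned PDF > ImageMagick > Pillow > PyOCR > Tesseract > text) could look roughly like the sketch below. This is not the project's code: the function names `build_convert_command` and `ocr_pdf`, the 300 DPI density, the `page.png` file name, the `eng` language and the choice to handle only the first page are all assumptions made for illustration. It assumes ImageMagick's `convert` binary is on the `PATH`, and that Tesseract, Pillow and PyOCR are installed.

```python
import subprocess
from pathlib import Path


def build_convert_command(pdf_path, png_path, density=300):
    """Build the ImageMagick command that rasterises a PDF into a PNG.

    Hypothetical helper; the density default is an assumption.
    """
    # -density sets the rendering resolution and has to precede the input file.
    return ["convert", "-density", str(density), str(pdf_path), str(png_path)]


def ocr_pdf(pdf_path, workdir):
    """Run PDF > ImageMagick > Pillow > PyOCR > Tesseract and return the text."""
    # Imported lazily so the command builder above works without the OCR stack.
    from PIL import Image
    import pyocr
    import pyocr.builders

    png_path = Path(workdir) / "page.png"
    # The "[0]" suffix asks ImageMagick for the first page only (an assumption
    # made here to keep the sketch short).
    subprocess.run(build_convert_command(f"{pdf_path}[0]", png_path), check=True)

    tools = pyocr.get_available_tools()
    if not tools:
        raise RuntimeError("No OCR tool found; is Tesseract installed?")

    # PyOCR expects a Pillow image object, which is why Pillow sits in the chain.
    with Image.open(png_path) as image:
        return tools[0].image_to_string(
            image, lang="eng", builder=pyocr.builders.TextBuilder()
        )
```

Calling `ocr_pdf("scan.pdf", "/tmp")` would write `/tmp/page.png` and return whatever text Tesseract recognises on the first page. Keeping the command construction in its own function makes that step easy to check without any of the OCR tools installed.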