From 5b479c505808b3e72c81b169707d8e07c81b8c07 Mon Sep 17 00:00:00 2001
From: Daniel Quinn
Date: Sun, 10 Jan 2016 15:51:16 +0000
Subject: [PATCH] Updated the requirements section

---
 README.md | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index ea00d1af8..17030595b 100644
--- a/README.md
+++ b/README.md
@@ -35,13 +35,21 @@ powerful tools.
 * [Tesseract](https://github.com/tesseract-ocr) does the character recognition
 * [GNU Privacy Guard](https://gnupg.org)
 * [Python 3](https://python.org/) is the language of the project
-    * [Pillow](https://pypi.python.org/pypi/pillowfight/) converts the PDFs to
-      images
+    * [Pillow](https://pypi.python.org/pypi/pillowfight/) loads the image data
+      as a python object to be used with PyOCR.
     * [PyOCR](https://github.com/jflesch/pyocr) is a slick programmatic wrapper
       around tesseract
     * [Django](https://djangoproject.org/) is the framework this project is
       written against.
-    * [Python-GNUPG](http://pythonhosted.org/python-gnupg/)
+    * [Python-GNUPG](http://pythonhosted.org/python-gnupg/) decrypts the PDFs
+      on-the-fly to allow you to download unencrypted files, leaving the
+      encrypted ones on-disk.
+
+The keen eye might have noticed that we're converting a PDF to an image to be
+read by Tesseract, and to do this we're using a chain of: scanned PDF >
+Imagemagick > Pillow > PyOCR > Tesseract > text. It's not ideal, but
+apparently, Pillow lacks the ability to read PDFs, and PyOCR requires a Pillow
+object, so we're sort of stuck.
 
 ## Instructions
 
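The chain described in the added README paragraph (scanned PDF > Imagemagick > Pillow > PyOCR > Tesseract > text) could be sketched roughly as follows. This is an illustration only, not the project's actual code: the function names `build_convert_command` and `ocr_pdf`, the 300 DPI density, and the choice to process only the first page are all assumptions made for the example.

```python
import subprocess
from pathlib import Path


def build_convert_command(pdf_path, png_path, density=300):
    """Return an ImageMagick command that rasterises page 1 of a PDF (hypothetical helper)."""
    # "[0]" selects the first page only; -density sets the render resolution
    # and has to precede the input file to affect PDF rasterisation.
    return ["convert", "-density", str(density), f"{pdf_path}[0]", str(png_path)]


def ocr_pdf(pdf_path, workdir, lang="eng"):
    """Sketch of the chain: PDF > ImageMagick > Pillow > PyOCR > Tesseract > text."""
    # Imported lazily so this module still loads without the OCR stack installed.
    import pyocr
    import pyocr.builders
    from PIL import Image

    png_path = Path(workdir) / (Path(pdf_path).stem + ".png")
    # Step 1: ImageMagick turns the PDF page into an image Pillow can read.
    subprocess.run(build_convert_command(pdf_path, png_path), check=True)

    # Step 2: PyOCR looks for installed OCR engines.
    tools = pyocr.get_available_tools()
    if not tools:
        raise RuntimeError("No OCR tool found; is Tesseract installed?")

    # Step 3: Pillow loads the image, and PyOCR hands it to the first
    # available tool (assumed here to be Tesseract).
    with Image.open(png_path) as image:
        return tools[0].image_to_string(
            image, lang=lang, builder=pyocr.builders.TextBuilder()
        )
```

Calling `ocr_pdf("scan.pdf", "/tmp")` would need ImageMagick, Tesseract, Pillow and PyOCR installed. The command builder is kept separate so the ImageMagick step can be inspected without running anything.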