Updated the requirements section

2026-02-11 23:59:31 -06:00 · 2016-01-10 15:51:16 +00:00
parent 94ce12f9ee
commit 5b479c5058
1 changed file with 11 additions and 3 deletions
--- a/README.md
+++ b/README.md
@@ -35,13 +35,21 @@ powerful tools.
 * [Tesseract](https://github.com/tesseract-ocr) does the character recognition
 * [GNU Privacy Guard](https://gnupg.org)
 * [Python 3](https://python.org/) is the language of the project
-    * [Pillow](https://pypi.python.org/pypi/pillowfight/) converts the PDFs to
-      images
+    * [Pillow](https://pypi.python.org/pypi/pillowfight/) loads the image data
+      as a Python object to be used with PyOCR.
    * [PyOCR](https://github.com/jflesch/pyocr) is a slick programmatic wrapper
      around tesseract
    * [Django](https://djangoproject.org/) is the framework this project is 
      written against.
-    * [Python-GNUPG](http://pythonhosted.org/python-gnupg/)
+    * [Python-GNUPG](http://pythonhosted.org/python-gnupg/) decrypts the PDFs
+      on-the-fly to allow you to download unencrypted files, leaving the
+      encrypted ones on-disk.
+
+The keen eye might have noticed that we're converting a PDF to an image to be
+read by Tesseract, and to do this we're using a chain of: scanned PDF >
+ImageMagick > Pillow > PyOCR > Tesseract > text.  It's not ideal, but
+apparently, Pillow lacks the ability to read PDFs, and PyOCR requires a Pillow
+object, so we're sort of stuck.


 ## Instructions
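
The conversion chain described in the added README paragraph (scanned PDF > ImageMagick > Pillow > PyOCR > Tesseract > text) might look roughly like the sketch below. This is not the project's actual code: the function names `build_convert_command` and `ocr_first_page` are hypothetical, and it assumes ImageMagick's `convert` binary is on the `PATH`, a 300 DPI render, English text, and that only the first page is processed.

```python
import subprocess
import tempfile
from pathlib import Path


def build_convert_command(pdf_path, png_path, density=300):
    """Build an ImageMagick command that renders page 1 of a PDF to a PNG.

    Hypothetical helper; the "[0]" suffix is ImageMagick's syntax for
    selecting the first page of a multi-page input.
    """
    return [
        "convert",
        "-density", str(density),
        f"{pdf_path}[0]",
        str(png_path),
    ]


def ocr_first_page(pdf_path, lang="eng"):
    """Run the chain: PDF > ImageMagick > Pillow > PyOCR > Tesseract > text."""
    # Third-party imports are local so the helper above works without them.
    from PIL import Image
    import pyocr
    import pyocr.builders

    tools = pyocr.get_available_tools()
    if not tools:
        raise RuntimeError("No OCR tool found; is Tesseract installed?")

    with tempfile.TemporaryDirectory() as workdir:
        png_path = Path(workdir) / "page.png"
        # ImageMagick does the PDF > image step that Pillow cannot do.
        subprocess.run(build_convert_command(pdf_path, png_path), check=True)
        # Pillow loads the image into the object that PyOCR expects.
        with Image.open(png_path) as image:
            return tools[0].image_to_string(
                image,
                lang=lang,
                builder=pyocr.builders.TextBuilder(),
            )
```

Calling `ocr_first_page("scan.pdf")` would return the recognised text of the first page as a string, assuming Tesseract and its English language data are installed. Rendering through a temporary PNG keeps the sketch simple; a real implementation would likely loop over all pages.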