Improves the docs: OCRing files in languages other than English + fixes typos

2025-07-20 17:44:56 -05:00 · 2016-03-21 21:57:36 +01:00 · 8115cf8905
commit 8115cf8905
parent 840626e571
5 changed files with 22 additions and 3 deletions
--- a/README.rst
+++ b/README.rst
@@ -59,7 +59,7 @@ powerful tools.

 * `ImageMagick`_ converts the images between colour and greyscale.
 * `Tesseract`_ does the character recognition.
-* `Unpaper`_ despeckles and and deskews the scanned image.
+* `Unpaper`_ despeckles and deskews the scanned image.
 * `GNU Privacy Guard`_ is used as the encryption backend.
 * `Python 3`_ is the language of the project.

--- a/docs/consumption.rst
+++ b/docs/consumption.rst
@@ -128,7 +128,7 @@ following name/value pairs:
  don't start uploading stuff to your server.  The means of generating this
  signature is defined below.

-Specify ``enctype="multipart/form-data"``, and then POST your file with:::
+Specify ``enctype="multipart/form-data"``, and then POST your file with::

    Content-Disposition: form-data; name="document"; filename="whatever.pdf"

--- a/docs/index.rst
+++ b/docs/index.rst
@@ -33,4 +33,5 @@ Contents
   api
   utilities
   migrating
+   troubleshooting
   changelog
--- a/docs/requirements.rst
+++ b/docs/requirements.rst
@@ -8,7 +8,7 @@ should work) that has the following software installed on it:

 * `Python3`_ (with development libraries, pip and virtualenv)
 * `GNU Privacy Guard`_
-* `Tesseract`_
+* `Tesseract`_, plus its language files matching the languages of your documents.
 * `Imagemagick`_
 * `unpaper`_

--- a/docs/troubleshooting.rst
+++ b/docs/troubleshooting.rst
@@ -0,0 +1,18 @@
+.. _troubleshooting:
+
+Troubleshooting
+===============
+
+.. _troubleshooting_ocr_language_files_missing:
+
+Consumer warns ``OCR for XX failed``
+------------------------------------
+
+If you find the OCR accuracy to be too low, or the document consumer warns that ``OCR for
+XX failed, but we're going to stick with what we've got since FORGIVING_OCR is enabled``, then you
+might need to install the `Tesseract language files
+<http://packages.ubuntu.com/search?keywords=tesseract-ocr>`_ matching your documents' languages.
+
+As an example, if your documents are written in Spanish you may need to run::
+
+    apt-get install -y tesseract-ocr-spa