Merge pull request #96 from JaimeObregon/master

Improves the docs: OCRing files in languages other than English + fixes typos
2026-02-26 01:09:34 -06:00 · 2016-03-23 11:27:20 +00:00
parent 840626e571 37191f0383
commit ef54e2f94a
5 changed files with 23 additions and 3 deletions
--- a/README.rst
+++ b/README.rst
@@ -59,7 +59,7 @@ powerful tools.

 * `ImageMagick`_ converts the images between colour and greyscale.
 * `Tesseract`_ does the character recognition.
-* `Unpaper`_ despeckles and and deskews the scanned image.
+* `Unpaper`_ despeckles and deskews the scanned image.
 * `GNU Privacy Guard`_ is used as the encryption backend.
 * `Python 3`_ is the language of the project.

--- a/docs/consumption.rst
+++ b/docs/consumption.rst
@@ -128,7 +128,7 @@ following name/value pairs:
  don't start uploading stuff to your server.  The means of generating this
  signature is defined below.

-Specify ``enctype="multipart/form-data"``, and then POST your file with:::
+Specify ``enctype="multipart/form-data"``, and then POST your file with::

    Content-Disposition: form-data; name="document"; filename="whatever.pdf"

--- a/docs/index.rst
+++ b/docs/index.rst
@@ -33,4 +33,5 @@ Contents
   api
   utilities
   migrating
+   troubleshooting
   changelog
--- a/docs/requirements.rst
+++ b/docs/requirements.rst
@@ -8,7 +8,7 @@ should work) that has the following software installed on it:

 * `Python3`_ (with development libraries, pip and virtualenv)
 * `GNU Privacy Guard`_
-* `Tesseract`_
+* `Tesseract`_, plus its language files matching your document base.
 * `Imagemagick`_
 * `unpaper`_

--- a/docs/troubleshooting.rst
+++ b/docs/troubleshooting.rst
@@ -0,0 +1,19 @@
+.. _troubleshooting:
+
+Troubleshooting
+===============
+
+.. _troubleshooting_ocr_language_files_missing:
+
+Consumer warns ``OCR for XX failed``
+------------------------------------
+
+If you find the OCR accuracy to be too low, and/or the document consumer warns that ``OCR for
+XX failed, but we're going to stick with what we've got since FORGIVING_OCR is enabled``, then you
+might need to install the `Tesseract language files
+<http://packages.ubuntu.com/search?keywords=tesseract-ocr>`_ matching your documents' languages.
+
+As an example, if you are running Paperless from the Vagrant setup provided (or from any Ubuntu or Debian
+box), and your documents are written in Spanish, you may need to run::
+
+    apt-get install -y tesseract-ocr-spa