From 54443fa808b89a74ba097cf4b206252f334968ab Mon Sep 17 00:00:00 2001
From: Daniel Quinn <code@danielquinn.org>
Date: Mon, 28 Mar 2016 14:54:09 +0100
Subject: [PATCH] Documented all of the guesswork Paperless does

---
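Note on the file naming rules documented in docs/guesswork.rst: the
fall-through from "date, correspondent, title, tags" down to "title only"
can be sketched roughly as below.  This is a minimal illustration, not the
consumer's actual code; the helper names ``parse_date`` and
``guess_from_filename`` are hypothetical, and splitting on " - " is an
assumption about how the separators are recognised.

```python
import os
import re
from datetime import datetime, timezone

# Rigid date format from the docs: YYYYMMDDHHMMSSZ or YYYYMMDDZ (UTC).
DATE_RE = re.compile(r"(\d{8})(\d{6})?Z")


def parse_date(text):
    """Return an aware UTC datetime, or None if text is not a valid date."""
    m = DATE_RE.fullmatch(text)
    if m is None:
        return None
    stamp = m.group(1) + (m.group(2) or "000000")
    try:
        parsed = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    except ValueError:  # e.g. month 13
        return None
    return parsed.replace(tzinfo=timezone.utc)


def guess_from_filename(filename):
    """Hypothetical helper: try the most specific pattern first."""
    stem, _ext = os.path.splitext(filename)
    parts = stem.split(" - ")
    # Default (step 4): the whole file name is the title.
    guess = {"date": None, "correspondent": None, "title": stem, "tags": []}

    # Step 1 also carries a date, so the remaining fields shift by one.
    has_date = len(parts) == 4 and parse_date(parts[0]) is not None
    fields = parts[1:] if has_date else parts
    if has_date:
        guess["date"] = parse_date(parts[0])

    if fields is not parts or len(fields) in (2, 3):
        # Steps 1-3: Correspondent - Title [- tag,tag,tag]
        guess["correspondent"], guess["title"] = fields[0], fields[1]
    if len(fields) == 3:
        guess["tags"] = [t.strip() for t in fields[2].split(",") if t.strip()]
    return guess
```

With this sketch, ``Another Company- Letter of Reference.jpg`` falls all the
way through to "title only", because the separator is missing a space.
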
 docs/changelog.rst   |  4 +++
 docs/consumption.rst | 37 +------------------
 docs/guesswork.rst   | 85 ++++++++++++++++++++++++++++++++++++++++++++
 docs/index.rst       |  3 +-
 docs/utilities.rst   |  7 ++--
 5 files changed, 97 insertions(+), 39 deletions(-)
 create mode 100644 docs/guesswork.rst
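
Note on the four matching algorithms quoted from the admin help text
("any", "all", "literal", "regular expression"): the sketch below shows one
way they could behave.  It is an illustration under assumptions, not the
project's implementation: the function name ``text_matches`` is
hypothetical, and case-insensitive, whole-word comparison is assumed rather
than documented.

```python
import re


def text_matches(text, match, algorithm):
    """Hypothetical helper: does `match` apply to the OCR'd `text`?"""
    haystack = text.lower()
    words = match.lower().split()

    def has_word(word):
        # Whole-word search; the word boundaries are an assumption.
        return re.search(r"\b" + re.escape(word) + r"\b", haystack) is not None

    if algorithm == "any":
        # Any one of the words may appear.
        return any(has_word(w) for w in words)
    if algorithm == "all":
        # Every word must appear, in any order.
        return all(has_word(w) for w in words)
    if algorithm == "literal":
        # The text must appear exactly as entered (ignoring case here).
        return match.lower() in haystack
    if algorithm == "regex":
        return re.search(match, text, re.IGNORECASE) is not None
    raise ValueError("unknown matching algorithm: " + algorithm)


# The ``Home Utility`` example from the docs: a literal match on "bc hydro".
document_text = "Your BC Hydro bill for March is now available."
tag_applies = text_matches(document_text, "bc hydro", "literal")
```

Under these assumptions, "all" would still match a document that mentions
"hydro" and "bc" in separate places, while "literal" would not.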

diff --git a/docs/changelog.rst b/docs/changelog.rst
index c1397bb6c..7be137b05 100644
--- a/docs/changelog.rst
+++ b/docs/changelog.rst
@@ -3,6 +3,8 @@ Changelog
 
 * 0.2.0
 
+  * `#89`_ Ported the auto-tagging code to correspondents as well.  Thanks to
+    `Justin Snyman`_ for the pointers in the issue queue.
   * Added support for guessing the date from the file name along with the
     correspondent, title, and tags.  Thanks to `Tikitu de Jager`_ for his pull
     request that I took forever to merge and to `Pit`_ for his efforts on the
@@ -97,6 +99,7 @@ Changelog
 .. _zedster: https://github.com/zedster
 .. _Martin Honermeyer: https://github.com/djmaze
 .. _Tim White: https://github.com/timwhite
+.. _Justin Snyman: https://github.com/stringlytyped
 
 .. _#20: https://github.com/danielquinn/paperless/issues/20
 .. _#44: https://github.com/danielquinn/paperless/issues/44
@@ -110,4 +113,5 @@ Changelog
 .. _#67: https://github.com/danielquinn/paperless/issues/67
 .. _#68: https://github.com/danielquinn/paperless/issues/68
 .. _#71: https://github.com/danielquinn/paperless/issues/71
+.. _#89: https://github.com/danielquinn/paperless/issues/89
 .. _#94: https://github.com/danielquinn/paperless/issues/71
diff --git a/docs/consumption.rst b/docs/consumption.rst
index 2e404fddd..33b7f3969 100644
--- a/docs/consumption.rst
+++ b/docs/consumption.rst
@@ -3,7 +3,7 @@
 Consumption
 ###########
 
-Once you've got *Paperless* setup, you need to start feeding documents into it.
+Once you've got Paperless set up, you need to start feeding documents into it.
 Currently, there are three options: the consumption directory, IMAP (email), and
 HTTP POST.
 
@@ -35,41 +35,6 @@ appropriate for your use and put some documents in there.  When you're ready,
 follow the :ref:`consumer <utilities-consumer>` instructions to get it running.
 
 
-.. _consumption-directory-naming:
-
-A Note on File Naming
----------------------
-
-Any document you put into the consumption directory will be consumed, but if
-you name the file right, it'll automatically set some values in the database
-for you.  This is is the logic the consumer follows:
-
-1. Try to find the correspondent, title, and tags in the file name following
-   the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``.  Note that
-   the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or
-   ``YYYYMMDDZ``.  The ``Z`` is for "Zulu time" AKA "UTC".
-2. If that doesn't work, we skip the date and try this pattern:
-   the pattern: ``Correspondent - Title - tag,tag,tag.pdf``.
-3. If that doesn't work, we try to find the correspondent and title in the file
-   name following the pattern:  ``Correspondent - Title.pdf``.
-4. If that doesn't work, just assume that the name of the file is the title.
-
-So given the above, the following examples would work as you'd expect:
-
-* ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
-* ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
-* ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
-* ``Another Company - Letter of Reference.jpg``
-* ``Dad's Recipe for Pancakes.png``
-
-These however wouldn't work:
-
-* ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
-* ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
-* ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
-* ``Another Company- Letter of Reference.jpg``
-
-
 .. _consumption-imap:
 
 IMAP (Email)
diff --git a/docs/guesswork.rst b/docs/guesswork.rst
new file mode 100644
index 000000000..20407b265
--- /dev/null
+++ b/docs/guesswork.rst
@@ -0,0 +1,85 @@
+.. _guesswork:
+
+Guesswork
+#########
+
+During the consumption process, Paperless tries to guess some of the attributes
+of the document it's looking at.  To do this it uses two approaches:
+
+
+.. _guesswork-naming:
+
+File Naming
+===========
+
+Any document you put into the consumption directory will be consumed, but if
+you name the file right, it'll automatically set some values in the database
+for you.  This is the logic the consumer follows:
+
+1. Try to find the correspondent, title, and tags in the file name following
+   the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``.  Note that
+   the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or
+   ``YYYYMMDDZ``.  The ``Z`` refers to "Zulu time" AKA "UTC".
+2. If that doesn't work, we skip the date and try this pattern:
+   ``Correspondent - Title - tag,tag,tag.pdf``.
+3. If that doesn't work, we try to find the correspondent and title in the file
+   name following the pattern: ``Correspondent - Title.pdf``.
+4. If that doesn't work, just assume that the name of the file is the title.
+
+So given the above, the following examples would work as you'd expect:
+
+* ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
+* ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
+* ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
+* ``Another Company - Letter of Reference.jpg``
+* ``Dad's Recipe for Pancakes.png``
+
+These however wouldn't work:
+
+* ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
+* ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
+* ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
+* ``Another Company- Letter of Reference.jpg``
+
+
+.. _guesswork-content:
+
+Reading the Document Contents
+=============================
+
+After the consumer has tried to figure out what it could from the file name,
+it starts looking at the content of the document itself.  It checks the match
+text and matching algorithm of every tag and correspondent already set in your
+database to see if they apply to the text in that document.  In other words,
+if you define a tag called ``Home Utility`` with a ``match`` property of
+``bc hydro`` and a ``matching_algorithm`` of ``literal``, Paperless will
+automatically tag your newly-consumed document with your ``Home Utility`` tag
+so long as the text ``bc hydro`` appears in the body of the document somewhere.
+
+The matching logic is quite powerful and supports searching the text of your
+document with several different algorithms, so some experimentation may be
+necessary to get things Just Right.
+
+
+.. _guesswork-content-howto:
+
+How Do I Set Up These Matching Algorithms?
+------------------------------------------
+
+Setting up the algorithms is easily done through the admin interface.  When
+you create a new correspondent or tag, there are optional fields for matching
+text and matching algorithm.  From the help info there:
+
+.. note::
+
+    Which algorithm you want to use when matching text to the OCR'd PDF.  Here,
+    "any" looks for any occurrence of any word provided in the PDF, while "all"
+    requires that every word provided appear in the PDF, albeit not in the
+    order provided.  A "literal" match means that the text you enter must
+    appear in the PDF exactly as you've entered it, and "regular expression"
+    uses a regex to match the PDF.  If you don't know what a regex is, you
+    probably don't want this option.
+
+Then just save your tag/correspondent and run another document through the
+consumer.  Once complete, you should see the newly-created document,
+automatically tagged with the appropriate data.
diff --git a/docs/index.rst b/docs/index.rst
index 43f77b15a..687a7894b 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -32,6 +32,7 @@ Contents
    consumption
    api
    utilities
+   guesswork
    migrating
-   troubleshooting 
+   troubleshooting
    changelog
diff --git a/docs/utilities.rst b/docs/utilities.rst
index ce3555b73..b9e45b20c 100644
--- a/docs/utilities.rst
+++ b/docs/utilities.rst
@@ -52,9 +52,12 @@ for PDF files to parse and index.  The process is pretty straightforward:
    wait 10 seconds and try again.
 2. Parse the PDF with Tesseract
 3. Create a new record in the database with the OCR'd text
-4. Encrypt the PDF and store it in the ``media`` directory under
+4. Attempt to automatically assign document attributes by doing some guesswork.
+   Read up on the :ref:`guesswork documentation<guesswork>` for more
+   information about this process.
+5. Encrypt the PDF and store it in the ``media`` directory under
    ``documents/pdf``.
-5. Go to #1.
+6. Go to #1.
 
 
 .. _utilities-consumer-howto: