Documented all of the guesswork Paperless does

2025-06-28 15:54:41 -05:00 · 2016-03-28 14:54:09 +01:00 · 2016-03-28 14:54:09 +01:00 · 54443fa808
commit 54443fa808
parent aea9ea50e5
5 changed files with 97 additions and 39 deletions
--- a/docs/changelog.rst
+++ b/docs/changelog.rst
@ -3,6 +3,8 @@ Changelog
 * 0.2.0
  * `#89`_ Ported the auto-tagging code to correspondents as well.  Thanks to
    `Justin Snyman`_ for the pointers in the issue queue.
  * Added support for guessing the date from the file name along with the
    correspondent, title, and tags.  Thanks to `Tikitu de Jager`_ for his pull
    request that I took forever to merge and to `Pit`_ for his efforts on the
@ -97,6 +99,7 @@ Changelog
 .. _zedster: https://github.com/zedster
 .. _Martin Honermeyer: https://github.com/djmaze
 .. _Tim White: https://github.com/timwhite
 .. _Justin Snyman: https://github.com/stringlytyped
 .. _#20: https://github.com/danielquinn/paperless/issues/20
 .. _#44: https://github.com/danielquinn/paperless/issues/44
@ -110,4 +113,5 @@ Changelog
 .. _#67: https://github.com/danielquinn/paperless/issues/67
 .. _#68: https://github.com/danielquinn/paperless/issues/68
 .. _#71: https://github.com/danielquinn/paperless/issues/71
 .. _#89: https://github.com/danielquinn/paperless/issues/89
 .. _#94: https://github.com/danielquinn/paperless/issues/94
--- a/docs/consumption.rst
+++ b/docs/consumption.rst
@ -3,7 +3,7 @@
 Consumption
 ###########
-Once you've got *Paperless* setup, you need to start feeding documents into it.
+Once you've got Paperless set up, you need to start feeding documents into it.
 Currently, there are three options: the consumption directory, IMAP (email), and
 HTTP POST.
@ -35,41 +35,6 @@ appropriate for your use and put some documents in there.  When you're ready,
 follow the :ref:`consumer <utilities-consumer>` instructions to get it running.
 .. _consumption-directory-naming:
 A Note on File Naming
 ---------------------
 Any document you put into the consumption directory will be consumed, but if
 you name the file right, it'll automatically set some values in the database
 for you.  This is the logic the consumer follows:
 1. Try to find the correspondent, title, and tags in the file name following
   the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``.  Note that
   the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or
   ``YYYYMMDDZ``.  The ``Z`` is for "Zulu time" AKA "UTC".
 2. If that doesn't work, we skip the date and try this pattern:
   ``Correspondent - Title - tag,tag,tag.pdf``.
 3. If that doesn't work, we try to find the correspondent and title in the file
   name following the pattern:  ``Correspondent - Title.pdf``.
 4. If that doesn't work, just assume that the name of the file is the title.
 So given the above, the following examples would work as you'd expect:
 * ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
 * ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
 * ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
 * ``Another Company - Letter of Reference.jpg``
 * ``Dad's Recipe for Pancakes.png``
 These however wouldn't work:
 * ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
 * ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
 * ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
 * ``Another Company- Letter of Reference.jpg``
 .. _consumption-imap:
 IMAP (Email)
--- a/docs/guesswork.rst
+++ b/docs/guesswork.rst
@ -0,0 +1,85 @@
 .. _guesswork:
 Guesswork
 #########
 During the consumption process, Paperless tries to guess some of the attributes
 of the document it's looking at.  To do this it uses two approaches:
 .. _guesswork-naming:
 File Naming
 ===========
 Any document you put into the consumption directory will be consumed, but if
 you name the file right, it'll automatically set some values in the database
 for you.  This is the logic the consumer follows:
 1. Try to find the correspondent, title, and tags in the file name following
   the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``.  Note that
   the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or
   ``YYYYMMDDZ``.  The ``Z`` refers to "Zulu time", AKA "UTC".
 2. If that doesn't work, we skip the date and try this pattern:
   ``Correspondent - Title - tag,tag,tag.pdf``.
 3. If that doesn't work, we try to find the correspondent and title in the file
   name following the pattern: ``Correspondent - Title.pdf``.
 4. If that doesn't work, just assume that the name of the file is the title.
 So given the above, the following examples would work as you'd expect:
 * ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
 * ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
 * ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
 * ``Another Company - Letter of Reference.jpg``
 * ``Dad's Recipe for Pancakes.png``
 These however wouldn't work:
 * ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
 * ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
 * ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
 * ``Another Company- Letter of Reference.jpg``
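
The four steps above can be sketched in a few lines of Python.  This is only an illustration of the documented fallback order, not the consumer's actual code: the function names ``parse_date`` and ``guess_from_filename``, the returned dictionary, and the choice to split on ``" - "`` rather than use regular expressions are all assumptions made for this example.

```python
import os
from datetime import datetime, timezone

# The two rigidly defined date formats, keyed by their exact length.
DATE_FORMATS = {15: "%Y%m%d%H%M%SZ", 9: "%Y%m%dZ"}


def parse_date(text):
    """Return a UTC datetime for YYYYMMDDHHMMSSZ or YYYYMMDDZ, else None."""
    fmt = DATE_FORMATS.get(len(text))
    if fmt is None:
        return None
    try:
        return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    except ValueError:
        return None


def guess_from_filename(file_name):
    """Hypothetical helper: apply the documented patterns in order."""
    stem, _extension = os.path.splitext(file_name)
    parts = stem.split(" - ")

    # Step 4 is the default: the whole name (minus extension) is the title.
    guess = {"date": None, "correspondent": None, "title": stem, "tags": []}

    if len(parts) == 4:
        # Step 1: Date - Correspondent - Title - tag,tag,tag
        date = parse_date(parts[0])
        if date is not None:
            guess.update(date=date, correspondent=parts[1],
                         title=parts[2], tags=parts[3].split(","))
    elif len(parts) == 3:
        # Step 2: Correspondent - Title - tag,tag,tag
        guess.update(correspondent=parts[0], title=parts[1],
                     tags=parts[2].split(","))
    elif len(parts) == 2:
        # Step 3: Correspondent - Title
        guess.update(correspondent=parts[0], title=parts[1])

    return guess
```

With this sketch, ``Another Company- Letter of Reference.jpg`` falls through to the last step because the separator is ``"- "`` rather than ``" - "``, so the whole name becomes the title and no correspondent is set.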
 .. _guesswork-content:
 Reading the Document Contents
 =============================
 After the consumer has tried to figure out what it could from the file name,
 it starts looking at the content of the document itself.  It checks the match
 text and matching algorithm defined for every tag and correspondent already in
 your database to see if they apply to the text in that document.  In other words,
 if you defined a tag called ``Home Utility`` that had a ``match`` property of
 ``bc hydro`` and a ``matching_algorithm`` of ``literal``, Paperless will
 automatically tag your newly-consumed document with your ``Home Utility`` tag
 so long as the text ``bc hydro`` appears in the body of the document somewhere.
 The matching logic supports several algorithms for searching the text of your
 document, so some experimentation may be necessary to get things Just Right.
 .. _guesswork-content-howto:
 How Do I Set Up These Matching Algorithms?
 ------------------------------------------
 Setting up the algorithms is easily done through the admin interface.  When
 you create a new correspondent or tag, there are optional fields for matching
 text and matching algorithm.  From the help info there:
 .. note::
    Which algorithm you want to use when matching text to the OCR'd PDF.  Here,
    "any" looks for any occurrence of any word provided in the PDF, while "all"
    requires that every word provided appear in the PDF, albeit not in the
    order provided.  A "literal" match means that the text you enter must
    appear in the PDF exactly as you've entered it, and "regular expression"
    uses a regex to match the PDF.  If you don't know what a regex is, you
    probably don't want this option.
 Then just save your tag/correspondent and run another document through the
 consumer.  Once complete, you should see the newly-created document,
 automatically tagged with the appropriate data.
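
To make the four algorithms from the help text concrete, here is a minimal sketch of how they could behave.  It is not the code Paperless runs: the function name ``document_matches``, the algorithm keys (``"any"``, ``"all"``, ``"literal"``, ``"regex"``), the case-insensitive comparison, and the whole-word matching are assumptions chosen for illustration.

```python
import re


def _contains(phrase, text):
    """Assumption: match whole words or phrases, ignoring case."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def document_matches(match, algorithm, text):
    """Hypothetical helper: does `text` satisfy `match` under `algorithm`?"""
    words = match.split()
    if algorithm == "any":
        # At least one of the words appears somewhere in the text.
        return any(_contains(word, text) for word in words)
    if algorithm == "all":
        # Every word appears, in any order.
        return all(_contains(word, text) for word in words)
    if algorithm == "literal":
        # The text entered appears exactly as entered.
        return _contains(match, text)
    if algorithm == "regex":
        # The match text is treated as a regular expression.
        return re.search(match, text, re.IGNORECASE) is not None
    raise ValueError("Unknown matching algorithm: " + algorithm)


text = "Your BC Hydro bill for March is attached."
print(document_matches("bc hydro", "literal", text))  # literal phrase
print(document_matches("hydro bc", "all", text))      # both words, any order
print(document_matches("gas hydro", "any", text))     # one word is enough
```

Under these assumptions, a ``Home Utility`` tag with a ``match`` of ``bc hydro`` and the literal algorithm would apply to the sample text, while the same words in the opposite order would only match with "all" or "any".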
--- a/docs/index.rst
+++ b/docs/index.rst
@ -32,6 +32,7 @@ Contents
   consumption
   api
   utilities
   guesswork
   migrating
   troubleshooting
   changelog
--- a/docs/utilities.rst
+++ b/docs/utilities.rst
@ -52,9 +52,12 @@ for PDF files to parse and index.  The process is pretty straightforward:
   wait 10 seconds and try again.
 2. Parse the PDF with Tesseract
 3. Create a new record in the database with the OCR'd text
-4. Encrypt the PDF and store it in the ``media`` directory under
+4. Attempt to automatically assign document attributes by doing some guesswork.
   Read up on the :ref:`guesswork documentation<guesswork>` for more
   information about this process.
 5. Encrypt the PDF and store it in the ``media`` directory under
   ``documents/pdf``.
-5. Go to #1.
+6. Go to #1.
 .. _utilities-consumer-howto: