From 54443fa808b89a74ba097cf4b206252f334968ab Mon Sep 17 00:00:00 2001 From: Daniel Quinn Date: Mon, 28 Mar 2016 14:54:09 +0100 Subject: [PATCH] Documented all of the guesswork Paperless does --- docs/changelog.rst | 4 +++ docs/consumption.rst | 37 +------------------ docs/guesswork.rst | 85 ++++++++++++++++++++++++++++++++++++++++++++ docs/index.rst | 3 +- docs/utilities.rst | 7 ++-- 5 files changed, 97 insertions(+), 39 deletions(-) create mode 100644 docs/guesswork.rst diff --git a/docs/changelog.rst b/docs/changelog.rst index c1397bb6c..7be137b05 100644 --- a/docs/changelog.rst +++ b/docs/changelog.rst @@ -3,6 +3,8 @@ Changelog * 0.2.0 + * `#89`_ Ported the auto-tagging code to correspondents as well. Thanks to + `Justin Snyman`_ for the pointers in the issue queue. * Added support for guessing the date from the file name along with the correspondent, title, and tags. Thanks to `Tikitu de Jager`_ for his pull request that I took forever to merge and to `Pit`_ for his efforts on the @@ -97,6 +99,7 @@ Changelog .. _zedster: https://github.com/zedster .. _Martin Honermeyer: https://github.com/djmaze .. _Tim White: https://github.com/timwhite +.. _Justin Snyman: https://github.com/stringlytyped .. _#20: https://github.com/danielquinn/paperless/issues/20 .. _#44: https://github.com/danielquinn/paperless/issues/44 @@ -110,4 +113,5 @@ Changelog .. _#67: https://github.com/danielquinn/paperless/issues/67 .. _#68: https://github.com/danielquinn/paperless/issues/68 .. _#71: https://github.com/danielquinn/paperless/issues/71 +.. _#89: https://github.com/danielquinn/paperless/issues/89 .. _#94: https://github.com/danielquinn/paperless/issues/71 diff --git a/docs/consumption.rst b/docs/consumption.rst index 2e404fddd..33b7f3969 100644 --- a/docs/consumption.rst +++ b/docs/consumption.rst @@ -3,7 +3,7 @@ Consumption ########### -Once you've got *Paperless* setup, you need to start feeding documents into it. 
+Once you've got Paperless set up, you need to start feeding documents into it. Currently, there are three options: the consumption directory, IMAP (email), and HTTP POST. @@ -35,41 +35,6 @@ appropriate for your use and put some documents in there. When you're ready, follow the :ref:`consumer ` instructions to get it running. -.. _consumption-directory-naming: - -A Note on File Naming ---------------------- - -Any document you put into the consumption directory will be consumed, but if -you name the file right, it'll automatically set some values in the database -for you. This is is the logic the consumer follows: - -1. Try to find the correspondent, title, and tags in the file name following - the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``. Note that - the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or - ``YYYYMMDDZ``. The ``Z`` is for "Zulu time" AKA "UTC". -2. If that doesn't work, we skip the date and try this pattern: - the pattern: ``Correspondent - Title - tag,tag,tag.pdf``. -3. If that doesn't work, we try to find the correspondent and title in the file - name following the pattern: ``Correspondent - Title.pdf``. -4. If that doesn't work, just assume that the name of the file is the title. - -So given the above, the following examples would work as you'd expect: - -* ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf`` -* ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf`` -* ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf`` -* ``Another Company - Letter of Reference.jpg`` -* ``Dad's Recipe for Pancakes.png`` - -These however wouldn't work: - -* ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf`` -* ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf`` -* ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf`` -* ``Another Company- Letter of Reference.jpg`` - - ..
_consumption-imap: IMAP (Email) diff --git a/docs/guesswork.rst b/docs/guesswork.rst new file mode 100644 index 000000000..20407b265 --- /dev/null +++ b/docs/guesswork.rst @@ -0,0 +1,85 @@ +.. _guesswork: + +Guesswork +######### + +During the consumption process, Paperless tries to guess some of the attributes +of the document it's looking at. To do this it uses two approaches: + + +.. _guesswork-naming: + +File Naming +=========== + +Any document you put into the consumption directory will be consumed, but if +you name the file right, it'll automatically set some values in the database +for you. This is the logic the consumer follows: + +1. Try to find the correspondent, title, and tags in the file name following + the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``. Note that + the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or + ``YYYYMMDDZ``. The ``Z`` refers to "Zulu time" AKA "UTC". +2. If that doesn't work, we skip the date and try this pattern: + ``Correspondent - Title - tag,tag,tag.pdf``. +3. If that doesn't work, we try to find the correspondent and title in the file + name following the pattern: ``Correspondent - Title.pdf``. +4. If that doesn't work, just assume that the name of the file is the title. + +So given the above, the following examples would work as you'd expect: + +* ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf`` +* ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf`` +* ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf`` +* ``Another Company - Letter of Reference.jpg`` +* ``Dad's Recipe for Pancakes.png`` + +These however wouldn't work: + +* ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf`` +* ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf`` +* ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf`` +* ``Another Company- Letter of Reference.jpg`` + + +..
_guesswork-content: + +Reading the Document Contents +============================= + +After the consumer has tried to figure out what it could from the file name, +it starts looking at the content of the document itself. It will compare the +matching algorithms defined by every tag and correspondent already set in your +database to see if they apply to the text in that document. In other words, +if you defined a tag called ``Home Utility`` that had a ``match`` property of +``bc hydro`` and a ``matching_algorithm`` of ``literal``, Paperless will +automatically tag your newly-consumed document with your ``Home Utility`` tag +so long as the text ``bc hydro`` appears in the body of the document somewhere. + +The matching logic is quite powerful, and supports searching the text of your +document with different algorithms, and as such, some experimentation may be +necessary to get things Just Right. + + +.. _guesswork-content-howto: + +How Do I Set Up These Matching Algorithms? +------------------------------------------ + +Setting up the algorithms is easily done through the admin interface. When +you create a new correspondent or tag, there are optional fields for matching +text and matching algorithm. From the help info there: + +.. note:: + + Which algorithm you want to use when matching text to the OCR'd PDF. Here, + "any" looks for any occurrence of any word provided in the PDF, while "all" + requires that every word provided appear in the PDF, albeit not in the + order provided. A "literal" match means that the text you enter must + appear in the PDF exactly as you've entered it, and "regular expression" + uses a regex to match the PDF. If you don't know what a regex is, you + probably don't want this option. + +Then just save your tag/correspondent and run another document through the +consumer. Once complete, you should see the newly-created document, +automatically tagged with the appropriate data.
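The four matching algorithms described in the help text above can be sketched roughly as follows. This is a minimal illustration and not Paperless's actual code: the ``matches`` helper, the short algorithm names (``"any"``, ``"all"``, ``"literal"``, ``"regex"``), the case-insensitive comparison, and the whole-word matching are all assumptions made for the example.

```python
import re


def matches(match: str, algorithm: str, text: str) -> bool:
    """Hypothetical helper: does ``text`` satisfy ``match`` under ``algorithm``?"""
    flags = re.IGNORECASE  # assumption: matching ignores case

    if not match.strip():
        # An empty match string never matches anything in this sketch.
        return False

    # Each word is escaped and searched for as a whole word.
    words = [rf"\b{re.escape(word)}\b" for word in match.split()]

    if algorithm == "any":
        # At least one of the words appears somewhere in the text.
        return any(re.search(word, text, flags) for word in words)
    if algorithm == "all":
        # Every word appears, in any order.
        return all(re.search(word, text, flags) for word in words)
    if algorithm == "literal":
        # The whole string appears exactly as entered.
        return re.search(re.escape(match), text, flags) is not None
    if algorithm == "regex":
        # The match string is itself a regular expression.
        return re.search(match, text, flags) is not None
    raise ValueError(f"Unknown matching algorithm: {algorithm!r}")


text = "Your BC Hydro bill for March is attached."
print(matches("bc hydro", "literal", text))
print(matches("hydro bc", "all", text))
```

With the ``Home Utility`` example from above, a ``match`` of ``bc hydro`` with the ``literal`` algorithm would apply to this text, whereas ``hydro bc`` would only apply under ``all`` or ``any``, since those ignore word order.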
diff --git a/docs/index.rst b/docs/index.rst index 43f77b15a..687a7894b 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -32,6 +32,7 @@ Contents consumption api utilities + guesswork migrating - troubleshooting + troubleshooting changelog diff --git a/docs/utilities.rst b/docs/utilities.rst index ce3555b73..b9e45b20c 100644 --- a/docs/utilities.rst +++ b/docs/utilities.rst @@ -52,9 +52,12 @@ for PDF files to parse and index. The process is pretty straightforward: wait 10 seconds and try again. 2. Parse the PDF with Tesseract 3. Create a new record in the database with the OCR'd text -4. Encrypt the PDF and store it in the ``media`` directory under +4. Attempt to automatically assign document attributes by doing some guesswork. + Read up on the :ref:`guesswork documentation` for more + information about this process. +5. Encrypt the PDF and store it in the ``media`` directory under ``documents/pdf``. -5. Go to #1. +6. Go to #1. .. _utilities-consumer-howto: