mirror of
https://github.com/paperless-ngx/paperless-ngx.git
synced 2025-09-16 21:55:37 -05:00
Documented all of the guesswork Paperless does
This commit is contained in:
85
docs/guesswork.rst
Normal file
85
docs/guesswork.rst
Normal file
@@ -0,0 +1,85 @@
|
||||
.. _guesswork:
|
||||
|
||||
Guesswork
|
||||
#########
|
||||
|
||||
During the consumption process, Paperless tries to guess some of the attributes
|
||||
of the document it's looking at. To do this it uses two approaches:
|
||||
|
||||
|
||||
.. _guesswork-naming:
|
||||
|
||||
File Naming
|
||||
===========
|
||||
|
||||
Any document you put into the consumption directory will be consumed, but if
|
||||
you name the file right, it'll automatically set some values in the database
|
||||
for you. This is is the logic the consumer follows:
|
||||
|
||||
1. Try to find the correspondent, title, and tags in the file name following
|
||||
the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``. Note that
|
||||
the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or
|
||||
``YYYYMMDDZ``. The ``Z`` refers "Zulu time" AKA "UTC".
|
||||
2. If that doesn't work, we skip the date and try this pattern:
|
||||
``Correspondent - Title - tag,tag,tag.pdf``.
|
||||
3. If that doesn't work, we try to find the correspondent and title in the file
|
||||
name following the pattern: ``Correspondent - Title.pdf``.
|
||||
4. If that doesn't work, just assume that the name of the file is the title.
|
||||
|
||||
So given the above, the following examples would work as you'd expect:
|
||||
|
||||
* ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
|
||||
* ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
|
||||
* ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
|
||||
* ``Another Company - Letter of Reference.jpg``
|
||||
* ``Dad's Recipe for Pancakes.png``
|
||||
|
||||
These however wouldn't work:
|
||||
|
||||
* ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
|
||||
* ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
|
||||
* ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
|
||||
* ``Another Company- Letter of Reference.jpg``
|
||||
|
||||
|
||||
.. _guesswork-content:
|
||||
|
||||
Reading the Document Contents
|
||||
=============================
|
||||
|
||||
After the consumer has tried to figure out what it could from the file name,
|
||||
it starts looking at the content of the document itself. It will compare the
|
||||
matching algorithms defined by every tag and correspondent already set in your
|
||||
database to see if they apply to the text in that document. In other words,
|
||||
if you defined a tag called ``Home Utility`` that had a ``match`` property of
|
||||
``bc hydro`` and a ``matching_algorithm`` of ``literal``, Paperless will
|
||||
automatically tag your newly-consumed document with your ``Home Utility`` tag
|
||||
so long as the text ``bc hydro`` appears in the body of the document somewhere.
|
||||
|
||||
The matching logic is quite powerful, and supports searching the text of your
|
||||
document with different algorithms, and as such, some experimentation may be
|
||||
necessary to get things Just Right.
|
||||
|
||||
|
||||
.. _guesswork-content-howto:
|
||||
|
||||
How Do I Set Up These Matching Algorithms?
|
||||
------------------------------------------
|
||||
|
||||
Setting up of the algorithms is easily done through the admin interface. When
|
||||
you create a new correspondent or tag, there are optional fields for matching
|
||||
text and matching algorithm. From the help info there:
|
||||
|
||||
.. note::
|
||||
|
||||
Which algorithm you want to use when matching text to the OCR'd PDF. Here,
|
||||
"any" looks for any occurrence of any word provided in the PDF, while "all"
|
||||
requires that every word provided appear in the PDF, albeit not in the
|
||||
order provided. A "literal" match means that the text you enter must
|
||||
appear in the PDF exactly as you've entered it, and "regular expression"
|
||||
uses a regex to match the PDF. If you don't know what a regex is, you
|
||||
probably don't want this option.
|
||||
|
||||
Then just save your tag/correspondent and run another document through the
|
||||
consumer. Once complete, you should see the newly-created document,
|
||||
automatically tagged with the appropriate data.
|
Reference in New Issue
Block a user