mirror of
https://github.com/paperless-ngx/paperless-ngx.git
synced 2025-04-17 10:13:56 -05:00
Documented all of the guesswork Paperless does
parent
aea9ea50e5
commit
54443fa808
@ -3,6 +3,8 @@ Changelog
|
|||||||
|
|
||||||
* 0.2.0
|
* 0.2.0
|
||||||
|
|
||||||
|
* `#89`_ Ported the auto-tagging code to correspondents as well. Thanks to
|
||||||
|
`Justin Snyman`_ for the pointers in the issue queue.
|
||||||
* Added support for guessing the date from the file name along with the
|
* Added support for guessing the date from the file name along with the
|
||||||
correspondent, title, and tags. Thanks to `Tikitu de Jager`_ for his pull
|
correspondent, title, and tags. Thanks to `Tikitu de Jager`_ for his pull
|
||||||
request that I took forever to merge and to `Pit`_ for his efforts on the
|
request that I took forever to merge and to `Pit`_ for his efforts on the
|
||||||
@ -97,6 +99,7 @@ Changelog
|
|||||||
.. _zedster: https://github.com/zedster
|
.. _zedster: https://github.com/zedster
|
||||||
.. _Martin Honermeyer: https://github.com/djmaze
|
.. _Martin Honermeyer: https://github.com/djmaze
|
||||||
.. _Tim White: https://github.com/timwhite
|
.. _Tim White: https://github.com/timwhite
|
||||||
|
.. _Justin Snyman: https://github.com/stringlytyped
|
||||||
|
|
||||||
.. _#20: https://github.com/danielquinn/paperless/issues/20
|
.. _#20: https://github.com/danielquinn/paperless/issues/20
|
||||||
.. _#44: https://github.com/danielquinn/paperless/issues/44
|
.. _#44: https://github.com/danielquinn/paperless/issues/44
|
||||||
@ -110,4 +113,5 @@ Changelog
|
|||||||
.. _#67: https://github.com/danielquinn/paperless/issues/67
|
.. _#67: https://github.com/danielquinn/paperless/issues/67
|
||||||
.. _#68: https://github.com/danielquinn/paperless/issues/68
|
.. _#68: https://github.com/danielquinn/paperless/issues/68
|
||||||
.. _#71: https://github.com/danielquinn/paperless/issues/71
|
.. _#71: https://github.com/danielquinn/paperless/issues/71
|
||||||
|
.. _#89: https://github.com/danielquinn/paperless/issues/89
|
||||||
.. _#94: https://github.com/danielquinn/paperless/issues/94
|
.. _#94: https://github.com/danielquinn/paperless/issues/94
|
||||||
|
@ -3,7 +3,7 @@
|
|||||||
Consumption
|
Consumption
|
||||||
###########
|
###########
|
||||||
|
|
||||||
Once you've got *Paperless* set up, you need to start feeding documents into it.
|
Once you've got Paperless set up, you need to start feeding documents into it.
|
||||||
Currently, there are three options: the consumption directory, IMAP (email), and
|
Currently, there are three options: the consumption directory, IMAP (email), and
|
||||||
HTTP POST.
|
HTTP POST.
|
||||||
|
|
||||||
@ -35,41 +35,6 @@ appropriate for your use and put some documents in there. When you're ready,
|
|||||||
follow the :ref:`consumer <utilities-consumer>` instructions to get it running.
|
follow the :ref:`consumer <utilities-consumer>` instructions to get it running.
|
||||||
|
|
||||||
|
|
||||||
.. _consumption-directory-naming:
|
|
||||||
|
|
||||||
A Note on File Naming
|
|
||||||
---------------------
|
|
||||||
|
|
||||||
Any document you put into the consumption directory will be consumed, but if
|
|
||||||
you name the file right, it'll automatically set some values in the database
|
|
||||||
for you. This is the logic the consumer follows:
|
|
||||||
|
|
||||||
1. Try to find the correspondent, title, and tags in the file name following
|
|
||||||
the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``. Note that
|
|
||||||
the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or
|
|
||||||
``YYYYMMDDZ``. The ``Z`` is for "Zulu time" AKA "UTC".
|
|
||||||
2. If that doesn't work, we skip the date and try this pattern:
|
|
||||||
the pattern: ``Correspondent - Title - tag,tag,tag.pdf``.
|
|
||||||
3. If that doesn't work, we try to find the correspondent and title in the file
|
|
||||||
name following the pattern: ``Correspondent - Title.pdf``.
|
|
||||||
4. If that doesn't work, just assume that the name of the file is the title.
|
|
||||||
|
|
||||||
So given the above, the following examples would work as you'd expect:
|
|
||||||
|
|
||||||
* ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
|
|
||||||
* ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
|
|
||||||
* ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
|
|
||||||
* ``Another Company - Letter of Reference.jpg``
|
|
||||||
* ``Dad's Recipe for Pancakes.png``
|
|
||||||
|
|
||||||
These however wouldn't work:
|
|
||||||
|
|
||||||
* ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
|
|
||||||
* ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
|
|
||||||
* ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
|
|
||||||
* ``Another Company- Letter of Reference.jpg``
|
|
||||||
|
|
||||||
|
|
||||||
.. _consumption-imap:
|
.. _consumption-imap:
|
||||||
|
|
||||||
IMAP (Email)
|
IMAP (Email)
|
||||||
|
85
docs/guesswork.rst
Normal file
@ -0,0 +1,85 @@
.. _guesswork:

Guesswork
#########

During the consumption process, Paperless tries to guess some of the attributes
of the document it's looking at. To do this, it uses two approaches:


.. _guesswork-naming:

File Naming
===========

Any document you put into the consumption directory will be consumed, but if
you name the file right, it'll automatically set some values in the database
for you. This is the logic the consumer follows:

1. Try to find the date, correspondent, title, and tags in the file name
   following the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``.
   Note that the format of the date is **rigidly defined** as
   ``YYYYMMDDHHMMSSZ`` or ``YYYYMMDDZ``. The ``Z`` refers to "Zulu time",
   AKA "UTC".
2. If that doesn't work, we skip the date and try this pattern:
   ``Correspondent - Title - tag,tag,tag.pdf``.
3. If that doesn't work, we try to find the correspondent and title in the file
   name following the pattern: ``Correspondent - Title.pdf``.
4. If that doesn't work, just assume that the name of the file is the title.

So given the above, the following examples would work as you'd expect:

* ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
* ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
* ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
* ``Another Company - Letter of Reference.jpg``
* ``Dad's Recipe for Pancakes.png``

These, however, wouldn't work:

* ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
* ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
* ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
* ``Another Company- Letter of Reference.jpg``
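
The fallback order described above can be sketched in a few lines of Python.
This is only an illustration of the documented rules, not the consumer's
actual code: the names ``parse_date`` and ``guess_from_file_name`` are
hypothetical, and splitting on ``" - "`` is an assumption made to keep the
example short.

```python
import re
from datetime import datetime, timezone

# Hypothetical sketch of the documented naming rules; not Paperless's own code.
DATE_FORMATS = [
    (re.compile(r"\d{14}Z"), "%Y%m%d%H%M%SZ"),  # YYYYMMDDHHMMSSZ
    (re.compile(r"\d{8}Z"), "%Y%m%dZ"),         # YYYYMMDDZ
]


def parse_date(text):
    """Return a UTC datetime for the two rigid date formats, else None."""
    for pattern, fmt in DATE_FORMATS:
        if pattern.fullmatch(text):
            try:
                return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
            except ValueError:
                return None  # right shape, impossible date (e.g. month 13)
    return None


def guess_from_file_name(file_name):
    """Apply the four documented steps in order and return the guesses."""
    stem = file_name.rpartition(".")[0] or file_name  # drop the extension
    guess = {"date": None, "correspondent": None, "title": stem, "tags": []}
    parts = stem.split(" - ")
    date = parse_date(parts[0]) if len(parts) == 4 else None

    if date is not None:      # 1. Date - Correspondent - Title - tags
        guess.update(date=date, correspondent=parts[1], title=parts[2],
                     tags=parts[3].split(","))
    elif len(parts) == 3:     # 2. Correspondent - Title - tags
        guess.update(correspondent=parts[0], title=parts[1],
                     tags=parts[2].split(","))
    elif len(parts) == 2:     # 3. Correspondent - Title
        guess.update(correspondent=parts[0], title=parts[1])
    return guess              # 4. otherwise the whole name is the title
```

Under these assumptions, ``Another Company- Letter of Reference.jpg`` falls
through to the last step because there is no space before the dash, so the
whole name becomes the title and no correspondent is set.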


.. _guesswork-content:

Reading the Document Contents
=============================

After the consumer has tried to figure out what it could from the file name,
it starts looking at the content of the document itself. It checks the match
text and matching algorithm defined on every tag and correspondent already in
your database to see whether they apply to the text of that document. In other
words, if you defined a tag called ``Home Utility`` that had a ``match``
property of ``bc hydro`` and a ``matching_algorithm`` of ``literal``, Paperless
will automatically tag your newly-consumed document with your ``Home Utility``
tag so long as the text ``bc hydro`` appears in the body of the document
somewhere.

The matching logic is quite powerful and supports searching the text of your
document with different algorithms. As such, some experimentation may be
necessary to get things Just Right.


.. _guesswork-content-howto:

How Do I Set Up These Matching Algorithms?
------------------------------------------

Setting up the algorithms is easily done through the admin interface. When
you create a new correspondent or tag, there are optional fields for matching
text and matching algorithm. From the help info there:

.. note::

   Which algorithm you want to use when matching text to the OCR'd PDF. Here,
   "any" looks for any occurrence of any word provided in the PDF, while "all"
   requires that every word provided appear in the PDF, albeit not in the
   order provided. A "literal" match means that the text you enter must
   appear in the PDF exactly as you've entered it, and "regular expression"
   uses a regex to match the PDF. If you don't know what a regex is, you
   probably don't want this option.
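
To make the difference between the four algorithms concrete, here is a
minimal Python sketch of how such matching could behave. It is not the
implementation Paperless uses: the function name ``matches``, the algorithm
key ``"regex"``, and the case-insensitive comparison are all assumptions made
for illustration.

```python
import re


def _has_word(word, text):
    # Whole-word, case-insensitive search (case-insensitivity is an assumption).
    return re.search(rf"\b{re.escape(word)}\b", text, re.IGNORECASE) is not None


def matches(match, algorithm, text):
    """Return True if `match` applies to `text` under the given algorithm."""
    if not match.strip():
        return False  # an empty match text never applies
    words = match.split()
    if algorithm == "any":       # at least one of the words appears
        return any(_has_word(word, text) for word in words)
    if algorithm == "all":       # every word appears, in any order
        return all(_has_word(word, text) for word in words)
    if algorithm == "literal":   # the text appears exactly as entered
        return match.lower() in text.lower()
    if algorithm == "regex":     # the match text is treated as a pattern
        return re.search(match, text, re.IGNORECASE) is not None
    raise ValueError(f"unknown matching algorithm: {algorithm!r}")
```

With a document body of ``Your BC Hydro bill for March is ready.``, the match
text ``hydro bc`` would apply under "all" in this sketch but not under
"literal", because the words are present but not in that order.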

Then just save your tag/correspondent and run another document through the
consumer. Once complete, you should see the newly-created document,
automatically tagged with the appropriate data.
@ -32,6 +32,7 @@ Contents
|
|||||||
consumption
|
consumption
|
||||||
api
|
api
|
||||||
utilities
|
utilities
|
||||||
|
guesswork
|
||||||
migrating
|
migrating
|
||||||
troubleshooting
|
troubleshooting
|
||||||
changelog
|
changelog
|
||||||
|
@ -52,9 +52,12 @@ for PDF files to parse and index. The process is pretty straightforward:
|
|||||||
wait 10 seconds and try again.
|
wait 10 seconds and try again.
|
||||||
2. Parse the PDF with Tesseract
|
2. Parse the PDF with Tesseract
|
||||||
3. Create a new record in the database with the OCR'd text
|
3. Create a new record in the database with the OCR'd text
|
||||||
4. Encrypt the PDF and store it in the ``media`` directory under
|
4. Attempt to automatically assign document attributes by doing some guesswork.
|
||||||
|
Read up on the :ref:`guesswork documentation<guesswork>` for more
|
||||||
|
information about this process.
|
||||||
|
5. Encrypt the PDF and store it in the ``media`` directory under
|
||||||
``documents/pdf``.
|
``documents/pdf``.
|
||||||
5. Go to #1.
|
6. Go to #1.
|
||||||
|
|
||||||
|
|
||||||
.. _utilities-consumer-howto:
|
.. _utilities-consumer-howto:
|
||||||
|