***************
Advanced topics
***************

Paperless offers a couple of features that automate certain tasks and make your
life easier.

Guesswork
#########

Any document you put into the consumption directory will be consumed, but if
you name the file right, it'll automatically set some values in the database
for you. This is the logic the consumer follows:

1. Try to find the date, correspondent, title, and tags in the file name
   following the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``.
   Note that the format of the date is **rigidly defined** as
   ``YYYYMMDDHHMMSSZ`` or ``YYYYMMDDZ``. The ``Z`` refers to "Zulu time",
   also known as UTC.
   The tags are optional, so the format ``Date - Correspondent - Title.pdf``
   works as well.
2. If that doesn't work, we skip the date and try this pattern:
   ``Correspondent - Title - tag,tag,tag.pdf``.
3. If that doesn't work, we try to find the correspondent and title in the file
   name following the pattern: ``Correspondent - Title.pdf``.
4. If that doesn't work, just assume that the name of the file is the title.

So given the above, the following examples would work as you'd expect:

* ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
* ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
* ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
* ``Another Company - Letter of Reference.jpg``
* ``Dad's Recipe for Pancakes.png``

These, however, wouldn't work:

* ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
* ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
* ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
* ``Another Company- Letter of Reference.jpg``

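The fallback order described in this section can be sketched in a few lines of
Python. This is only an illustration of the rules as written here, not the
consumer's actual code: the names ``Guess``, ``parse_date`` and
``guess_from_filename`` are made up for this example, and splitting the name on
``" - "`` is an assumption about how the separator is treated.

```python
import re
from datetime import datetime, timezone
from pathlib import PurePath
from typing import NamedTuple, Optional

# Hypothetical helper names; the date layout is the one documented above.
DATE_RE = re.compile(r"^\d{8}(\d{6})?Z$")


class Guess(NamedTuple):
    created: Optional[datetime]
    correspondent: Optional[str]
    title: str
    tags: tuple


def parse_date(text):
    """Parse ``YYYYMMDDHHMMSSZ`` or ``YYYYMMDDZ`` as a UTC datetime."""
    fmt = "%Y%m%d%H%M%SZ" if len(text) == 15 else "%Y%m%dZ"
    return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)


def guess_from_filename(filename):
    """Apply rules 1 to 4 in order and return the first one that fits."""
    stem = PurePath(filename).stem  # drop the extension
    parts = stem.split(" - ")       # assumption: separator is space-hyphen-space
    # Rule 1: Date - Correspondent - Title, with optional tags.
    if len(parts) in (3, 4) and DATE_RE.match(parts[0]):
        tags = tuple(parts[3].split(",")) if len(parts) == 4 else ()
        return Guess(parse_date(parts[0]), parts[1], parts[2], tags)
    # Rule 2: Correspondent - Title - tags.
    if len(parts) == 3:
        return Guess(None, parts[0], parts[1], tuple(parts[2].split(",")))
    # Rule 3: Correspondent - Title.
    if len(parts) == 2:
        return Guess(None, parts[0], parts[1], ())
    # Rule 4: the whole name is the title.
    return Guess(None, None, stem, ())
```

In this sketch, ``Another Company- Letter of Reference.jpg`` falls through to
rule 4 because the separator is missing its leading space, so the whole name
ends up as the title.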
Do I have to be so strict about naming?
=======================================

Rather than using the strict document naming rules, you can also set the option
``PAPERLESS_FILENAME_DATE_ORDER`` in ``paperless.conf`` to any date order
that is accepted by dateparser_. Doing so will cause Paperless to prefer a date
found in the file name over a date pulled from the document's text, without
requiring the strict formatting of the document filename described above.

.. _dateparser: https://github.com/scrapinghub/dateparser/blob/v0.7.0/docs/usage.rst#settings

.. _advanced-transforming_filenames:

Transforming filenames for parsing
==================================

Some devices can't produce filenames that can be parsed by the default
parser. By configuring the option ``PAPERLESS_FILENAME_PARSE_TRANSFORMS`` in
``paperless.conf``, you can add transformations that are applied to the filename
before it's parsed.

The option contains a list of dictionaries of regular expressions (key:
``pattern``) and replacements (key: ``repl``) in JSON format, which are
applied in order by passing them to ``re.subn``. Transformation stops
after the first match, so at most one transformation is applied. The general
syntax is

.. code:: python

   [{"pattern":"pattern1", "repl":"repl1"}, {"pattern":"pattern2", "repl":"repl2"}, ..., {"pattern":"patternN", "repl":"replN"}]

The example below is for a Brother ADS-2400N, a scanner that lets you assign
different names to different hardware buttons (useful for handling
multiple entities in one instance), but insists on adding ``_<count>``
to the filename.

.. code:: python

   # Brother profile configuration, supports "Name_Date_Count" (the default
   # setting) and "Name_Count" (use "Name" as tag and "Count" as title).
   PAPERLESS_FILENAME_PARSE_TRANSFORMS=[{"pattern":"^([a-z]+)_(\\d{8})_(\\d{6})_([0-9]+)\\.", "repl":"\\2\\3Z - \\4 - \\1."}, {"pattern":"^([a-z]+)_([0-9]+)\\.", "repl":" - \\2 - \\1."}]

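To see what a transform list does to a file name, the following sketch loads
the JSON and applies each entry with ``re.subn``, stopping at the first entry
that matches. The helper ``apply_transforms``, the constant
``BROTHER_TRANSFORMS`` and the sample file names are hypothetical; only the two
patterns are taken from the Brother example.

```python
import json
import re

# The two transforms from the Brother example, as they would appear in JSON.
BROTHER_TRANSFORMS = r"""
[
  {"pattern": "^([a-z]+)_(\\d{8})_(\\d{6})_([0-9]+)\\.", "repl": "\\2\\3Z - \\4 - \\1."},
  {"pattern": "^([a-z]+)_([0-9]+)\\.", "repl": " - \\2 - \\1."}
]
"""


def apply_transforms(filename, transforms_json):
    """Apply the first transform whose pattern matches; otherwise return the name unchanged."""
    for transform in json.loads(transforms_json):
        new_name, count = re.subn(transform["pattern"], transform["repl"], filename)
        if count > 0:
            # Stop after the first match, so at most one transform is applied.
            return new_name
    return filename
```

Under these assumptions, ``invoices_20150314_000700_0001.pdf`` becomes
``20150314000700Z - 0001 - invoices.pdf``, ``invoices_0001.pdf`` becomes
`` - 0001 - invoices.pdf``, and a name that matches neither pattern is passed
through untouched.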
Matching tags, correspondents and document types
################################################

After the consumer has tried to figure out what it could from the file name,
it starts looking at the content of the document itself. It checks the match
and matching algorithm defined for every tag, correspondent and document type
already in your database to see whether they apply to the text of that
document. In other words, if you defined a tag called ``Home Utility`` with a
``match`` property of ``bc hydro`` and a ``matching_algorithm`` of ``literal``,
Paperless will automatically tag your newly consumed document with your
``Home Utility`` tag so long as the text ``bc hydro`` appears somewhere in the
body of the document.

The matching logic is quite powerful and supports searching the text of your
document with different algorithms, so some experimentation may be
necessary to get things right.

In order to have a tag, correspondent or type assigned automatically to newly
consumed documents, assign a match and matching algorithm using the web
interface. These settings define when to assign correspondents, tags and types
to documents.

The following algorithms are available:

* **Any:** Looks for any occurrence of any word provided in match in the
  document text. If you define the match as ``Bank1 Bank2``, it will match
  documents containing either of these terms.
* **All:** Requires that every word provided appears in the document text,
  though not necessarily in the order provided.
* **Literal:** Matches only if the match appears in the document text exactly
  as provided.
* **Regular expression:** Parses the match as a regular expression and tries to
  find a match within the document.
* **Fuzzy match:** Matches text that is approximately, rather than exactly,
  equal to the match. The exact behavior is not documented here; consult the
  source code for details.
* **Auto:** Tries to automatically match new documents. This does not require you
  to set a match. See the notes below.

When using the "any" or "all" matching algorithms, you can search for terms
that consist of multiple words by enclosing them in double quotes. For example,
defining a match text of ``"Bank of America" BofA`` using the "any" algorithm
will match documents that contain either "Bank of America" or "BofA", but will
not match documents containing "Bank of South America".

Then just save your tag/correspondent and run another document through the
consumer. Once complete, you should see the newly created document,
automatically tagged with the appropriate data.

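The non-automatic algorithms can be approximated as follows. This is a rough
sketch rather than Paperless's matching code: the function names are invented,
case-insensitive comparison and whole-word matching are assumptions, and the
fuzzy and Auto algorithms are left out.

```python
import re


def split_terms(match):
    """Split a match string into terms, keeping double-quoted phrases together."""
    return [quoted or bare for quoted, bare in re.findall(r'"([^"]+)"|(\S+)', match)]


def term_found(term, text):
    """True if ``term`` occurs in ``text`` as whole words (assumed case-insensitive)."""
    return re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE) is not None


def matches(algorithm, match, text):
    """Evaluate one of the 'any', 'all', 'literal' or 'regex' algorithms."""
    if algorithm == "any":
        return any(term_found(term, text) for term in split_terms(match))
    if algorithm == "all":
        return all(term_found(term, text) for term in split_terms(match))
    if algorithm == "literal":
        return term_found(match, text)
    if algorithm == "regex":
        return re.search(match, text, re.IGNORECASE) is not None
    raise ValueError(f"unsupported algorithm in this sketch: {algorithm}")
```

For instance, ``matches("any", '"Bank of America" BofA', text)`` is true for a
text mentioning "Bank of America" and false for one that only mentions
"Bank of South America", which mirrors the quoting example in this section.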
.. _advanced-automatic_matching:

Automatic matching
==================

Paperless-ng comes with a new matching algorithm called *Auto*. This matching
algorithm tries to assign tags, correspondents and document types to your
documents based on how you have assigned these on existing documents. It
uses a neural network under the hood.

If, for example, all your bank statements of your account 123 at the Bank of
America are tagged with the tag "bofa_123" and the matching algorithm of this
tag is set to *Auto*, this neural network will examine your documents and
automatically learn when to assign this tag.

There are a couple of caveats you need to keep in mind when using this feature:

* Changes to your documents are not immediately reflected by the matching
  algorithm. The neural network needs to be *trained* on your documents after
  changes. Paperless periodically (default: once each hour) checks for changes
  and does this automatically for you.
* The Auto matching algorithm only takes documents into account which are NOT
  placed in your inbox (i.e., have no inbox tags assigned to them). This ensures
  that the neural network only learns from documents which you have correctly
  tagged before.
* The matching algorithm can only work if there is a correlation between the
  tag, correspondent or document type and the document itself. Your bank
  statements usually contain your bank account number and the name of the bank,
  so this works reasonably well. However, tags such as "TODO" cannot be
  automatically assigned.
* The matching algorithm needs a reasonable number of documents to identify when
  to assign tags, correspondents, and types. If one out of a thousand documents
  has the correspondent "Very obscure web shop I bought something five years
  ago", it will probably not assign this correspondent automatically if you buy
  something from them again. The more documents, the better.

Hooking into the consumption process
####################################

Sometimes you may want to do something arbitrary whenever a document is
consumed. Rather than try to predict what you may want to do, Paperless lets
you execute scripts of your own choosing just before or after a document is
consumed using a couple of simple hooks.

Just write a script, put it somewhere that Paperless can read and execute, and
then put the path to that script in ``paperless.conf`` with the variable name
of either ``PAPERLESS_PRE_CONSUME_SCRIPT`` or
``PAPERLESS_POST_CONSUME_SCRIPT``.

.. important::

   These scripts are executed in a **blocking** process, which means that if
   a script takes a long time to run, it can significantly slow down your
   document consumption flow. If you want things to run asynchronously,
   you'll have to fork the process in your script and exit.

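If you write your hook in Python, one way to hand off slow work and return
right away is to start it as a detached child process. The sketch below is
only one possible approach; ``run_detached`` and ``slow_job.py`` are
placeholder names, and ``start_new_session`` only has an effect on POSIX
systems.

```python
import subprocess
import sys


def run_detached(command):
    """Start ``command`` in its own session and return without waiting for it."""
    return subprocess.Popen(
        command,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # POSIX only: detach from the caller's session
    )


def main(argv):
    # argv holds whatever arguments Paperless passed to the hook.
    # "slow_job.py" is a placeholder for your own long-running script.
    run_detached([sys.executable, "slow_job.py", *argv])
    return 0
```

Keep in mind that a detached job no longer holds up consumption, so it should
not rely on modifying the file before the consumer reads it.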
Pre-consumption script
======================

Executed after the consumer sees a new document in the consumption folder, but
before any processing of the document is performed. This script receives exactly
one argument:

* Document file name

A common example is a small wrapper script like this:

``/usr/local/bin/ocr-pdf``

.. code:: bash

   #!/usr/bin/env bash
   # Quote the argument so that paths containing spaces are passed intact.
   pdf2pdfocr.py -i "${1}"

``/etc/paperless.conf``

.. code:: bash

   ...
   PAPERLESS_PRE_CONSUME_SCRIPT="/usr/local/bin/ocr-pdf"
   ...

This will pass the path to the document about to be consumed to ``/usr/local/bin/ocr-pdf``,
which will in turn call `pdf2pdfocr.py`_ on your document, which will then
overwrite the file with an OCR'd version of the file and exit. At that point,
the consumption process will begin with the newly modified file.

.. _pdf2pdfocr.py: https://github.com/LeoFCardoso/pdf2pdfocr

.. _advanced-post_consume_script:

Post-consumption script
=======================

Executed after the consumer has successfully processed a document and has moved it
into paperless. It receives the following arguments:

* Document id
* Generated file name
* Source path
* Thumbnail path
* Download URL
* Thumbnail URL
* Correspondent
* Tags

The script can be in any language you like, but for a simple shell script
example, you can take a look at ``post-consumption-example.sh`` in the
``scripts`` directory in this project.

The post-consumption script cannot cancel the consumption process.

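As an alternative to the shell example, a post-consumption script written in
Python might start out like the sketch below. It assumes the eight arguments
arrive as positional command-line arguments in the order listed in this
section; the names ``ConsumedDocument``, ``parse_args`` and ``main`` are
invented here, and the tags are kept as a raw string because their format is
not specified in this section.

```python
#!/usr/bin/env python3
import sys
from typing import NamedTuple


class ConsumedDocument(NamedTuple):
    """The eight post-consumption arguments, in the documented order."""
    document_id: str
    file_name: str
    source_path: str
    thumbnail_path: str
    download_url: str
    thumbnail_url: str
    correspondent: str
    tags: str


def parse_args(argv):
    """Map the positional arguments onto named fields."""
    expected = len(ConsumedDocument._fields)
    if len(argv) != expected:
        raise ValueError(f"expected {expected} arguments, got {len(argv)}")
    return ConsumedDocument(*argv)


def main(argv):
    doc = parse_args(argv)
    # Replace this with whatever you want to happen after consumption.
    print(f"Consumed document {doc.document_id}: {doc.file_name}")
    return doc


# Only run when the hook is invoked with arguments.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1:])
```

Naming the fields up front keeps the rest of the script readable, since you
can refer to ``doc.correspondent`` instead of a positional index.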
.. _advanced-file_name_handling:

File name handling
##################

By default, paperless stores your documents in the media directory and renames them
using the identifier which it has assigned to each document. You will end up getting
files like ``0000123.pdf`` in your media directory. This isn't necessarily a bad
thing, because you normally don't have to access these files manually. However, if
you wish to name your files differently, you can do that by adjusting the
``PAPERLESS_FILENAME_FORMAT`` settings variable.

This variable allows you to configure the filename (folders are allowed!) using
placeholders. For example, setting

.. code:: bash

   PAPERLESS_FILENAME_FORMAT={created_year}/{correspondent}/{title}

will create a directory structure as follows:

.. code::

   2019/
     my_bank/
       statement-january-0000001.pdf
       statement-february-0000002.pdf
   2020/
     my_bank/
       statement-january-0000003.pdf
     shoe_store/
       my_new_shoes-0000004.pdf

Paperless appends the unique identifier of each document to the filename. This
avoids filename clashes.

.. danger::

   Do not manually move or rename your files in the media folder. Paperless
   remembers the last filename a document was stored as. If you do move or
   rename a file, paperless will report it as missing and won't be able to
   find it.

Paperless provides the following placeholders within filenames:

* ``{correspondent}``: The name of the correspondent, or "none".
* ``{title}``: The title of the document.
* ``{created}``: The full date and time the document was created.
* ``{created_year}``: Year created only.
* ``{created_month}``: Month created only (number 1-12).
* ``{created_day}``: Day created only (number 1-31).
* ``{added}``: The full date and time the document was added to paperless.
* ``{added_year}``: Year added only.
* ``{added_month}``: Month added only (number 1-12).
* ``{added_day}``: Day added only (number 1-31).
* ``{tags}``: Related to the document's tags. Its exact behavior is not
  documented here; consult the source code for details.

Paperless will convert all values for the placeholders into values which are safe
for use in filenames.

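How a format string turns into a path can be pictured with ordinary Python
string formatting. The sketch below is an illustration only, not how Paperless
implements it: ``sanitize`` and ``render_path`` are hypothetical helpers, the
sanitizing rule is a crude stand-in, only the ``created`` placeholders are
shown, and unpadded month and day numbers are an assumption.

```python
import re
from datetime import datetime


def sanitize(value):
    """Crude stand-in for the filename-safe conversion (an assumption)."""
    return re.sub(r"[^\w.-]+", "_", str(value).strip()).strip("_")


def render_path(fmt, doc_id, title, created, correspondent=None, suffix=".pdf"):
    """Fill the placeholders, then append the document id to avoid clashes."""
    default = f"{doc_id:07d}{suffix}"
    values = {
        "correspondent": sanitize(correspondent) if correspondent else "none",
        "title": sanitize(title),
        "created": sanitize(created.isoformat()),
        "created_year": str(created.year),
        "created_month": str(created.month),
        "created_day": str(created.day),
    }
    try:
        rendered = fmt.format(**values)
    except (KeyError, IndexError, ValueError):
        # Unknown or malformed placeholder: fall back to the default scheme.
        return default
    return f"{rendered}-{doc_id:07d}{suffix}"
```

With these assumptions, the format ``{created_year}/{correspondent}/{title}``
for document 1 titled ``statement-january`` from "My Bank", created in 2019,
gives ``2019/My_Bank/statement-january-0000001.pdf``, while a misspelled
placeholder such as ``{titel}`` falls back to ``0000001.pdf``.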
.. hint::

   Paperless checks the filename of a document whenever it is saved. Therefore,
   after altering this setting you need to update the filenames of your existing
   documents and move them by invoking the :ref:`document renamer <utilities-renamer>`.

.. warning::

   Make absolutely sure you get the spelling of the placeholders right, or else
   paperless will use the default naming scheme instead.

.. caution::

   As of now, you could tell paperless to store your files anywhere outside
   the media directory by setting

   .. code::

      PAPERLESS_FILENAME_FORMAT=../../my/custom/location/{title}

   However, keep in mind that inside docker, if files get stored outside of the
   predefined volumes, they will be lost after a restart of paperless.