mirror of
https://github.com/paperless-ngx/paperless-ngx.git
synced 2025-04-02 13:45:10 -05:00

+ instead of positional arguments because it's easier to use in the shell script and easier to read in the python code.
361 lines
14 KiB
ReStructuredText
361 lines
14 KiB
ReStructuredText
***************
|
|
Advanced topics
|
|
***************
|
|
|
|
Paperless offers a couple features that automate certain tasks and make your life
|
|
easier.
|
|
|
|
.. _advanced-matching:
|
|
|
|
Matching tags, correspondents, document types, and storage paths
|
|
################################################################
|
|
|
|
Paperless will compare the matching algorithms defined by every tag, correspondent,
|
|
document type, and storage path in your database to see if they apply to the text
|
|
in a document. In other words, if you define a tag called ``Home Utility``
|
|
that had a ``match`` property of ``bc hydro`` and a ``matching_algorithm`` of
|
|
``literal``, Paperless will automatically tag your newly-consumed document with
|
|
your ``Home Utility`` tag so long as the text ``bc hydro`` appears in the body
|
|
of the document somewhere.
|
|
|
|
The matching logic is quite powerful. It supports searching the text of your
|
|
document with different algorithms, and as such, some experimentation may be
|
|
necessary to get things right.
|
|
|
|
In order to have a tag, correspondent, document type, or storage path assigned
|
|
automatically to newly consumed documents, assign a match and matching algorithm
|
|
using the web interface. These settings define when to assign tags, correspondents,
|
|
document types, and storage paths to documents.
|
|
|
|
The following algorithms are available:
|
|
|
|
* **Any:** Looks for any occurrence of any word provided in match in the PDF.
|
|
If you define the match as ``Bank1 Bank2``, it will match documents containing
|
|
either of these terms.
|
|
* **All:** Requires that every word provided appears in the PDF, albeit not in the
|
|
order provided.
|
|
* **Literal:** Matches only if the match appears exactly as provided (i.e. preserve ordering) in the PDF.
|
|
* **Regular expression:** Parses the match as a regular expression and tries to
|
|
find a match within the document.
|
|
* **Fuzzy match:** I don't know. Look at the source.
|
|
* **Auto:** Tries to automatically match new documents. This does not require you
|
|
to set a match. See the notes below.
|
|
|
|
When using the *any* or *all* matching algorithms, you can search for terms
|
|
that consist of multiple words by enclosing them in double quotes. For example,
|
|
defining a match text of ``"Bank of America" BofA`` using the *any* algorithm,
|
|
will match documents that contain either "Bank of America" or "BofA", but will
|
|
not match documents containing "Bank of South America".
|
|
|
|
Then just save your tag, correspondent, document type, or storage path and run
|
|
another document through the consumer. Once complete, you should see the
|
|
newly-created document, automatically tagged with the appropriate data.
|
|
|
|
|
|
.. _advanced-automatic_matching:
|
|
|
|
Automatic matching
|
|
==================
|
|
|
|
Paperless-ngx comes with a new matching algorithm called *Auto*. This matching
|
|
algorithm tries to assign tags, correspondents, document types, and storage paths
|
|
to your documents based on how you have already assigned these on existing documents.
|
|
It uses a neural network under the hood.
|
|
|
|
If, for example, all your bank statements of your account 123 at the Bank of
|
|
America are tagged with the tag "bofa_123" and the matching algorithm of this
|
|
tag is set to *Auto*, this neural network will examine your documents and
|
|
automatically learn when to assign this tag.
|
|
|
|
Paperless tries to hide much of the involved complexity with this approach.
|
|
However, there are a couple caveats you need to keep in mind when using this
|
|
feature:
|
|
|
|
* Changes to your documents are not immediately reflected by the matching
|
|
algorithm. The neural network needs to be *trained* on your documents after
|
|
changes. Paperless periodically (default: once each hour) checks for changes
|
|
and does this automatically for you.
|
|
* The Auto matching algorithm only takes documents into account which are NOT
|
|
placed in your inbox (i.e. have any inbox tags assigned to them). This ensures
|
|
that the neural network only learns from documents which you have correctly
|
|
tagged before.
|
|
* The matching algorithm can only work if there is a correlation between the
|
|
tag, correspondent, document type, or storage path and the document itself.
|
|
Your bank statements usually contain your bank account number and the name
|
|
of the bank, so this works reasonably well, However, tags such as "TODO"
|
|
cannot be automatically assigned.
|
|
* The matching algorithm needs a reasonable number of documents to identify when
|
|
to assign tags, correspondents, storage paths, and types. If one out of a
|
|
thousand documents has the correspondent "Very obscure web shop I bought
|
|
something five years ago", it will probably not assign this correspondent
|
|
automatically if you buy something from them again. The more documents, the better.
|
|
* Paperless also needs a reasonable amount of negative examples to decide when
|
|
not to assign a certain tag, correspondent, document type, or storage path. This will
|
|
usually be the case as you start filling up paperless with documents.
|
|
Example: If all your documents are either from "Webshop" and "Bank", paperless
|
|
will assign one of these correspondents to ANY new document, if both are set
|
|
to automatic matching.
|
|
|
|
Hooking into the consumption process
|
|
####################################
|
|
|
|
Sometimes you may want to do something arbitrary whenever a document is
|
|
consumed. Rather than try to predict what you may want to do, Paperless lets
|
|
you execute scripts of your own choosing just before or after a document is
|
|
consumed using a couple simple hooks.
|
|
|
|
Just write a script, put it somewhere that Paperless can read & execute, and
|
|
then put the path to that script in ``paperless.conf`` or ``docker-compose.env`` with the variable name
|
|
of either ``PAPERLESS_PRE_CONSUME_SCRIPT`` or
|
|
``PAPERLESS_POST_CONSUME_SCRIPT``.
|
|
|
|
.. important::
|
|
|
|
These scripts are executed in a **blocking** process, which means that if
|
|
a script takes a long time to run, it can significantly slow down your
|
|
document consumption flow. If you want things to run asynchronously,
|
|
you'll have to fork the process in your script and exit.
|
|
|
|
|
|
Pre-consumption script
|
|
======================
|
|
|
|
Executed after the consumer sees a new document in the consumption folder, but
|
|
before any processing of the document is performed. This script can access the
|
|
following relevant environment variables set:
|
|
|
|
* ``DOCUMENT_SOURCE_PATH``
|
|
|
|
A simple but common example for this would be creating a simple script like
|
|
this:
|
|
|
|
``/usr/local/bin/ocr-pdf``
|
|
|
|
.. code:: bash
|
|
|
|
#!/usr/bin/env bash
|
|
pdf2pdfocr.py -i ${DOCUMENT_SOURCE_PATH}
|
|
|
|
``/etc/paperless.conf``
|
|
|
|
.. code:: bash
|
|
|
|
...
|
|
PAPERLESS_PRE_CONSUME_SCRIPT="/usr/local/bin/ocr-pdf"
|
|
...
|
|
|
|
This will pass the path to the document about to be consumed to ``/usr/local/bin/ocr-pdf``,
|
|
which will in turn call `pdf2pdfocr.py`_ on your document, which will then
|
|
overwrite the file with an OCR'd version of the file and exit. At which point,
|
|
the consumption process will begin with the newly modified file.
|
|
|
|
.. _pdf2pdfocr.py: https://github.com/LeoFCardoso/pdf2pdfocr
|
|
|
|
.. _advanced-post_consume_script:
|
|
|
|
Post-consumption script
|
|
=======================
|
|
|
|
Executed after the consumer has successfully processed a document and has moved it
|
|
into paperless. It receives the following environment variables:
|
|
|
|
* ``DOCUMENT_ID``
|
|
* ``DOCUMENT_FILE_NAME``
|
|
* ``DOCUMENT_SOURCE_PATH``
|
|
* ``DOCUMENT_THUMBNAIL_PATH``
|
|
* ``DOCUMENT_DOWNLOAD_URL``
|
|
* ``DOCUMENT_THUMBNAIL_URL``
|
|
* ``DOCUMENT_CORRESPONDENT``
|
|
* ``DOCUMENT_TAGS``
|
|
|
|
The script can be in any language, but for a simple shell script
|
|
example, you can take a look at `post-consumption-example.sh`_ in this project.
|
|
|
|
The post consumption script cannot cancel the consumption process.
|
|
|
|
Docker
|
|
------
|
|
Assumed you have ``/home/foo/paperless-ngx/scripts/post-consumption-example.sh``.
|
|
|
|
You can pass that script into the consumer container via a host mount in your ``docker-compose.yml``.
|
|
|
|
.. code:: bash
|
|
|
|
...
|
|
consumer:
|
|
...
|
|
volumes:
|
|
...
|
|
- /home/paperless-ngx/scripts:/path/in/container/scripts/
|
|
...
|
|
|
|
Example (docker-compose.yml): ``- /home/foo/paperless-ngx/scripts:/usr/src/paperless/scripts``
|
|
|
|
which in turn requires the variable ``PAPERLESS_POST_CONSUME_SCRIPT`` in ``docker-compose.env`` to point to ``/path/in/container/scripts/post-consumption-example.sh``.
|
|
|
|
Example (docker-compose.env): ``PAPERLESS_POST_CONSUME_SCRIPT=/usr/src/paperless/scripts/post-consumption-example.sh``
|
|
|
|
Troubleshooting:
|
|
|
|
- Monitor the docker-compose log ``cd ~/paperless-ngx; docker-compose logs -f``
|
|
- Check your script's permission e.g. in case of permission error ``sudo chmod 755 post-consumption-example.sh``
|
|
- Pipe your scripts's output to a log file e.g. ``echo "${DOCUMENT_ID}" | tee --append /usr/src/paperless/scripts/post-consumption-example.log``
|
|
|
|
.. _post-consumption-example.sh: https://github.com/paperless-ngx/paperless-ngx/blob/main/scripts/post-consumption-example.sh
|
|
|
|
.. _advanced-file_name_handling:
|
|
|
|
File name handling
|
|
##################
|
|
|
|
By default, paperless stores your documents in the media directory and renames them
|
|
using the identifier which it has assigned to each document. You will end up getting
|
|
files like ``0000123.pdf`` in your media directory. This isn't necessarily a bad
|
|
thing, because you normally don't have to access these files manually. However, if
|
|
you wish to name your files differently, you can do that by adjusting the
|
|
``PAPERLESS_FILENAME_FORMAT`` configuration option.
|
|
|
|
This variable allows you to configure the filename (folders are allowed) using
|
|
placeholders. For example, configuring this to
|
|
|
|
.. code:: bash
|
|
|
|
PAPERLESS_FILENAME_FORMAT={created_year}/{correspondent}/{title}
|
|
|
|
will create a directory structure as follows:
|
|
|
|
.. code::
|
|
|
|
2019/
|
|
My bank/
|
|
Statement January.pdf
|
|
Statement February.pdf
|
|
2020/
|
|
My bank/
|
|
Statement January.pdf
|
|
Letter.pdf
|
|
Letter_01.pdf
|
|
Shoe store/
|
|
My new shoes.pdf
|
|
|
|
.. danger::
|
|
|
|
Do not manually move your files in the media folder. Paperless remembers the
|
|
last filename a document was stored as. If you do rename a file, paperless will
|
|
report your files as missing and won't be able to find them.
|
|
|
|
Paperless provides the following placeholders within filenames:
|
|
|
|
* ``{asn}``: The archive serial number of the document, or "none".
|
|
* ``{correspondent}``: The name of the correspondent, or "none".
|
|
* ``{document_type}``: The name of the document type, or "none".
|
|
* ``{tag_list}``: A comma separated list of all tags assigned to the document.
|
|
* ``{title}``: The title of the document.
|
|
* ``{created}``: The full date (ISO format) the document was created.
|
|
* ``{created_year}``: Year created only.
|
|
* ``{created_month}``: Month created only (number 01-12).
|
|
* ``{created_day}``: Day created only (number 01-31).
|
|
* ``{added}``: The full date (ISO format) the document was added to paperless.
|
|
* ``{added_year}``: Year added only.
|
|
* ``{added_month}``: Month added only (number 01-12).
|
|
* ``{added_day}``: Day added only (number 01-31).
|
|
|
|
|
|
Paperless will try to conserve the information from your database as much as possible.
|
|
However, some characters that you can use in document titles and correspondent names (such
|
|
as ``: \ /`` and a couple more) are not allowed in filenames and will be replaced with dashes.
|
|
|
|
If paperless detects that two documents share the same filename, paperless will automatically
|
|
append ``_01``, ``_02``, etc to the filename. This happens if all the placeholders in a filename
|
|
evaluate to the same value.
|
|
|
|
.. hint::
|
|
You can affect how empty placeholders are treated by changing the following setting to
|
|
`true`.
|
|
|
|
.. code::
|
|
|
|
PAPERLESS_FILENAME_FORMAT_REMOVE_NONE=True
|
|
|
|
Doing this results in all empty placeholders resolving to "" instead of "none" as stated above.
|
|
Spaces before empty placeholders are removed as well, empty directories are omitted.
|
|
|
|
.. hint::
|
|
|
|
Paperless checks the filename of a document whenever it is saved. Therefore,
|
|
you need to update the filenames of your documents and move them after altering
|
|
this setting by invoking the :ref:`document renamer <utilities-renamer>`.
|
|
|
|
.. warning::
|
|
|
|
Make absolutely sure you get the spelling of the placeholders right, or else
|
|
paperless will use the default naming scheme instead.
|
|
|
|
.. caution::
|
|
|
|
As of now, you could totally tell paperless to store your files anywhere outside
|
|
the media directory by setting
|
|
|
|
.. code::
|
|
|
|
PAPERLESS_FILENAME_FORMAT=../../my/custom/location/{title}
|
|
|
|
However, keep in mind that inside docker, if files get stored outside of the
|
|
predefined volumes, they will be lost after a restart of paperless.
|
|
|
|
|
|
Storage paths
|
|
#############
|
|
|
|
One of the best things in Paperless is that you can not only access the documents via the
|
|
web interface, but also via the file system.
|
|
|
|
When as single storage layout is not sufficient for your use case, storage paths come to
|
|
the rescue. Storage paths allow you to configure more precisely where each document is stored
|
|
in the file system.
|
|
|
|
- Each storage path is a `PAPERLESS_FILENAME_FORMAT` and follows the rules described above
|
|
- Each document is assigned a storage path using the matching algorithms described above, but
|
|
can be overwritten at any time
|
|
|
|
For example, you could define the following two storage paths:
|
|
|
|
1. Normal communications are put into a folder structure sorted by `year/correspondent`
|
|
2. Communications with insurance companies are stored in a flat structure with longer file names,
|
|
but containing the full date of the correspondence.
|
|
|
|
.. code::
|
|
|
|
By Year = {created_year}/{correspondent}/{title}
|
|
Insurances = Insurances/{correspondent}/{created_year}-{created_month}-{created_day} {title}
|
|
|
|
|
|
If you then map these storage paths to the documents, you might get the following result.
|
|
For simplicity, `By Year` defines the same structure as in the previous example above.
|
|
|
|
.. code:: text
|
|
|
|
2019/ # By Year
|
|
My bank/
|
|
Statement January.pdf
|
|
Statement February.pdf
|
|
|
|
Insurances/ # Insurances
|
|
Healthcare 123/
|
|
2022-01-01 Statement January.pdf
|
|
2022-02-02 Letter.pdf
|
|
2022-02-03 Letter.pdf
|
|
Dental 456/
|
|
2021-12-01 New Conditions.pdf
|
|
|
|
|
|
.. hint::
|
|
|
|
Defining a storage path is optional. If no storage path is defined for a document, the global
|
|
`PAPERLESS_FILENAME_FORMAT` is applied.
|
|
|
|
.. caution::
|
|
|
|
If you adjust the format of an existing storage path, old documents don't get relocated automatically.
|
|
You need to run the :ref:`document renamer <utilities-renamer>` to adjust their pathes.
|