reworking the documentation.

2025-12-16 01:31:09 -06:00 · 2020-11-13 18:46:19 +01:00
parent 04335e4aac
commit f2dbb74d44
21 changed files with 1042 additions and 1427 deletions
--- a/docs/advanced_usage.rst
+++ b/docs/advanced_usage.rst
@@ -0,0 +1,244 @@
+***************
+Advanced topics
+***************
+
+Paperless offers a couple of features that automate certain tasks and make your life
+easier.
+
+Guesswork
+#########
+
+
+Any document you put into the consumption directory will be consumed, but if
+you name the file right, it'll automatically set some values in the database
+for you.  This is the logic the consumer follows:
+
+1. Try to find the correspondent, title, and tags in the file name following
+   the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``.  Note that
+   the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or
+   ``YYYYMMDDZ``.  The ``Z`` refers to "Zulu time", also known as UTC.
+   The tags are optional, so the format ``Date - Correspondent - Title.pdf``
+   works as well.
+2. If that doesn't work, we skip the date and try this pattern:
+   ``Correspondent - Title - tag,tag,tag.pdf``.
+3. If that doesn't work, we try to find the correspondent and title in the file
+   name following the pattern: ``Correspondent - Title.pdf``.
+4. If that doesn't work, just assume that the name of the file is the title.
+
+So given the above, the following examples would work as you'd expect:
+
+* ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
+* ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
+* ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
+* ``Another Company - Letter of Reference.jpg``
+* ``Dad's Recipe for Pancakes.png``
+
+These however wouldn't work:
+
+* ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
+* ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
+* ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
+* ``Another Company- Letter of Reference.jpg``
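+
+The four-step fallback above can be sketched in a few lines of Python. This is
+an illustration only: the regular expressions, the ``parse_file_name`` helper
+and the keys of the returned dictionary are assumptions made for this example
+and are not taken from the Paperless source.
+
```python
import re
from datetime import datetime, timezone

# Building blocks for the patterns (illustrative assumptions, not Paperless' own).
DATE = r"(?P<date>\d{14}Z|\d{8}Z) - "
FIELDS = r"(?P<correspondent>.+?) - (?P<title>.+?)"
TAGS = r" - (?P<tags>[A-Za-z0-9,\-]+)"
EXT = r"\.(?P<ext>\w+)$"

# Tried in order, mirroring the steps described above.
PATTERNS = [
    re.compile("^" + DATE + FIELDS + TAGS + EXT),  # step 1, with tags
    re.compile("^" + DATE + FIELDS + EXT),         # step 1, tags omitted
    re.compile("^" + FIELDS + TAGS + EXT),         # step 2
    re.compile("^" + FIELDS + EXT),                # step 3
    re.compile(r"^(?P<title>.+)" + EXT),           # step 4: name is the title
]


def parse_file_name(name):
    """Return the values that can be guessed from a file name."""
    info = {"title": name}  # fallback for names without an extension
    for pattern in PATTERNS:
        found = pattern.match(name)
        if found:
            info = found.groupdict()
            break

    created = None
    raw_date = info.get("date")
    if raw_date:
        # 15 characters means YYYYMMDDHHMMSSZ, otherwise YYYYMMDDZ.
        fmt = "%Y%m%d%H%M%SZ" if len(raw_date) == 15 else "%Y%m%dZ"
        created = datetime.strptime(raw_date, fmt).replace(tzinfo=timezone.utc)

    return {
        "created": created,
        "correspondent": info.get("correspondent"),
        "title": info["title"],
        "tags": info["tags"].split(",") if info.get("tags") else [],
    }
```
+
+With these patterns, ``Another Company- Letter of Reference.jpg`` falls through
+to the last step because the separator ``" - "`` is missing, so the whole name
+becomes the title and no correspondent is set.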
+
+Do I have to be so strict about naming?
+=======================================
+
+Rather than using the strict document naming rules, one can also set the option
+``PAPERLESS_FILENAME_DATE_ORDER`` in ``paperless.conf`` to any date order
+that is accepted by dateparser_. Doing so causes ``paperless`` to look for a
+date anywhere in the file name, interpreted in that order, and to prefer it over
+a date pulled from the document's text, without requiring the strict formatting
+of the document filename as described above.
+
+.. _dateparser: https://github.com/scrapinghub/dateparser/blob/v0.7.0/docs/usage.rst#settings
+
+Transforming filenames for parsing
+==================================
+
+Some devices can't produce filenames that can be parsed by the default
+parser. By configuring the option ``PAPERLESS_FILENAME_PARSE_TRANSFORMS`` in
+``paperless.conf`` one can add transformations that are applied to the filename
+before it's parsed.
+
+The option contains a list of dictionaries of regular expressions (key:
+``pattern``) and replacements (key: ``repl``) in JSON format, which are
+applied in order by passing them to ``re.subn``. Transformation stops
+after the first match, so at most one transformation is applied. The general
+syntax is:
+
+.. code:: python
+
+   [{"pattern":"pattern1", "repl":"repl1"}, {"pattern":"pattern2", "repl":"repl2"}, ..., {"pattern":"patternN", "repl":"replN"}]
+
+The example below is for a Brother ADS-2400N, a scanner that allows assigning
+different names to different hardware buttons (useful for handling
+multiple entities in one instance), but insists on adding ``_<count>``
+to the filename.
+
+.. code:: python
+
+   # Brother profile configuration, supports "Name_Date_Count" (the default
+   # setting) and "Name_Count" (use "Name" as tag and "Count" as title).
+   PAPERLESS_FILENAME_PARSE_TRANSFORMS=[{"pattern":"^([a-z]+)_(\\d{8})_(\\d{6})_([0-9]+)\\.", "repl":"\\2\\3Z - \\4 - \\1."}, {"pattern":"^([a-z]+)_([0-9]+)\\.", "repl":" - \\2 - \\1."}]
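+
+To see what such a configuration does to a file name, the behaviour described
+above (try each entry in order with ``re.subn``, stop after the first one that
+matches) can be sketched as follows. The ``apply_transforms`` helper and the
+sample file names are assumptions for illustration; only the JSON value is
+taken from the Brother example.
+
```python
import json
import re

# The JSON value from the Brother example, kept in a raw string so the
# backslashes reach the JSON parser unchanged.
RAW = r'''[{"pattern":"^([a-z]+)_(\\d{8})_(\\d{6})_([0-9]+)\\.", "repl":"\\2\\3Z - \\4 - \\1."}, {"pattern":"^([a-z]+)_([0-9]+)\\.", "repl":" - \\2 - \\1."}]'''
TRANSFORMS = json.loads(RAW)


def apply_transforms(file_name, transforms):
    """Apply the first transformation whose pattern matches, if any."""
    for transform in transforms:
        new_name, count = re.subn(transform["pattern"], transform["repl"], file_name)
        if count > 0:
            # Stop after the first match: at most one transformation applies.
            return new_name
    return file_name


# Hypothetical names as the scanner might produce them.
for name in ["invoices_20200113_184619_0001.pdf", "invoices_0001.pdf"]:
    print(repr(name), "->", repr(apply_transforms(name, TRANSFORMS)))
```
+
+The first name is rewritten to ``20200113184619Z - 0001 - invoices.pdf``, which
+the default parser can then split into date, correspondent and title.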
+
+
+Matching tags, correspondents and document types
+################################################
+
+After the consumer has tried to figure out what it could from the file name,
+it starts looking at the content of the document itself.  It will compare the
+matching algorithms defined by every tag and correspondent already set in your
+database to see if they apply to the text in that document.  In other words,
+if you defined a tag called ``Home Utility`` that had a ``match`` property of
+``bc hydro`` and a ``matching_algorithm`` of ``literal``, Paperless will
+automatically tag your newly-consumed document with your ``Home Utility`` tag
+so long as the text ``bc hydro`` appears in the body of the document somewhere.
+
+The matching logic is quite powerful, and supports searching the text of your
+document with different algorithms, and as such, some experimentation may be
+necessary to get things right.
+
+In order to have a tag, correspondent or type assigned automatically to newly
+consumed documents, assign a match and matching algorithm using the web
+interface. These settings define when to assign correspondents, tags and types
+to documents.
+
+The following algorithms are available:
+
+* **Any:** Looks for any occurrence of any word provided in match in the PDF.
+  If you define the match as ``Bank1 Bank2``, it will match documents containing
+  either of these terms.
+* **All:** Requires that every word provided appears in the PDF, though not
+  necessarily in the order provided.
+* **Literal:** Matches only if the match appears exactly as provided in the PDF.
+* **Regular expression:** Parses the match as a regular expression and tries to
+  find a match within the document.
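+* **Fuzzy match:** Matches text that is approximately equal to the match, so
+  small differences such as OCR errors or typos are tolerated. Look at the
+  source for the exact behavior.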
+* **Auto:** Tries to automatically match new documents. This does not require you
+  to set a match. See the notes below.
+
+When using the "any" or "all" matching algorithms, you can search for terms
+that consist of multiple words by enclosing them in double quotes. For example,
+a match text of ``"Bank of America" BofA`` with the "any" algorithm will match
+documents that contain either "Bank of America" or "BofA", but will
+not match documents containing "Bank of South America".
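+
+A minimal sketch of how the "any", "all", "literal" and "regular expression"
+algorithms could be implemented is shown here. It is not the Paperless
+implementation: the function names are made up for this example, and matching
+whole words case-insensitively is an assumption.
+
```python
import re


def split_terms(match):
    """Split a match text into terms, keeping double-quoted phrases together."""
    return [quoted or bare for quoted, bare in re.findall(r'"([^"]+)"|(\S+)', match)]


def contains(term, text):
    """Assumption: terms match as whole words, ignoring case."""
    return re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE) is not None


def matches(algorithm, match, text):
    """Return True if the document text satisfies the match under the algorithm."""
    if algorithm == "any":
        return any(contains(term, text) for term in split_terms(match))
    if algorithm == "all":
        return all(contains(term, text) for term in split_terms(match))
    if algorithm == "literal":
        return contains(match, text)
    if algorithm == "regex":
        return re.search(match, text, re.IGNORECASE) is not None
    raise ValueError(f"unsupported algorithm: {algorithm}")


rule = '"Bank of America" BofA'
print(matches("any", rule, "Statement from Bank of America"))  # True
print(matches("any", rule, "Bank of South America"))           # False
```
+
+Splitting on quotes first is what keeps "Bank of America" a single term, so a
+document that merely contains the words "Bank", "of" and "America" in other
+positions is not matched.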
+
+Then just save your tag/correspondent and run another document through the
+consumer.  Once complete, you should see the newly-created document,
+automatically tagged with the appropriate data.
+
+
+Automatic matching
+==================
+
+Paperless-ng comes with a new matching algorithm called *Auto*. This matching
+algorithm tries to assign tags, correspondents and document types to your
+documents based on how you have assigned these on existing documents. It
+uses a neural network under the hood.
+
+If, for example, all your bank statements of your account 123 at the Bank of
+America are tagged with the tag "bofa_123" and the matching algorithm of this
+tag is set to *Auto*, this neural network will examine your documents and
+automatically learn when to assign this tag.
+
+There are a couple of caveats you need to keep in mind when using this feature:
+
+* Changes to your documents are not immediately reflected by the matching
+  algorithm. The neural network needs to be *trained* on your documents after
+  changes. Paperless periodically (default: once each hour) checks for changes
+  and does this automatically for you.
+* The Auto matching algorithm only takes documents into account which are NOT
+  placed in your inbox (i.e., have no inbox tags assigned to them). This ensures
+  that the neural network only learns from documents which you have correctly
+  tagged before.
+* The matching algorithm can only work if there is a correlation between the
+  tag, correspondent or document type and the document itself. Your bank
+  statements usually contain your bank account number and the name of the bank,
+  so this works reasonably well. However, tags such as "TODO" cannot be
+  automatically assigned.
+* The matching algorithm needs a reasonable number of documents to identify when
+  to assign tags, correspondents, and types. If one out of a thousand documents
+  has the correspondent "Very obscure web shop I bought something from five
+  years ago", it will probably not assign this correspondent automatically if you buy
+  something from them again. The more documents, the better.
+
+Hooking into the consumption process
+####################################
+
+Sometimes you may want to do something arbitrary whenever a document is
+consumed.  Rather than try to predict what you may want to do, Paperless lets
+you execute scripts of your own choosing just before or after a document is
+consumed using a couple of simple hooks.
+
+Just write a script, put it somewhere that Paperless can read & execute, and
+then put the path to that script in ``paperless.conf`` with the variable name
+of either ``PAPERLESS_PRE_CONSUME_SCRIPT`` or
+``PAPERLESS_POST_CONSUME_SCRIPT``.
+
+.. TODO HYPEREF TO CONFIG
+
+.. important::
+
+    These scripts are executed in a **blocking** process, which means that if
+    a script takes a long time to run, it can significantly slow down your
+    document consumption flow.  If you want things to run asynchronously,
+    you'll have to fork the process in your script and exit.
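+
+If the hook is written in Python, one way to avoid blocking the consumer is to
+start the slow work as a detached child process and return right away. The
+sketch below assumes a POSIX system; ``run_detached`` and the path
+``/usr/local/bin/slow-job`` are hypothetical names used only for illustration.
+
```python
import subprocess
import sys


def run_detached(command):
    """Start command in its own session and return without waiting for it."""
    return subprocess.Popen(
        command,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # POSIX: detach from the caller's session
    )


# Paperless passes at least one argument to a hook, so only act when
# arguments are present.
if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical long-running job; the hook's arguments are passed through.
    run_detached(["/usr/local/bin/slow-job", *sys.argv[1:]])
```
+
+The hook script exits as soon as the child has been started, so consumption
+continues while the job runs in the background. For a pre-consumption hook this
+also means the job must not modify the file that is about to be consumed.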
+
+
+Pre-consumption script
+======================
+
+Executed after the consumer sees a new document in the consumption folder, but
+before any processing of the document is performed. This script receives exactly
+one argument:
+
+* Document file name
+
+A simple but common use is running OCR on the file before Paperless processes
+it, using a script like this:
+
+``/usr/local/bin/ocr-pdf``
+
+.. code:: bash
+
+    #!/usr/bin/env bash
+    # Quote the argument so that paths containing spaces survive intact.
+    pdf2pdfocr.py -i "${1}"
+
+``/etc/paperless.conf``
+
+.. code:: bash
+
+    ...
+    PAPERLESS_PRE_CONSUME_SCRIPT="/usr/local/bin/ocr-pdf"
+    ...
+
+This passes the path of the document about to be consumed to ``/usr/local/bin/ocr-pdf``,
+which in turn calls `pdf2pdfocr.py`_ on it. That tool overwrites the file with an
+OCR'd version and exits, after which the consumption process begins with the
+newly modified file.
+
+.. _pdf2pdfocr.py: https://github.com/LeoFCardoso/pdf2pdfocr
+
+
+.. _consumption-director-hook-variables-post:
+
+Post-consumption script
+=======================
+
+Executed after the consumer has successfully processed a document and has moved it
+into Paperless. It receives the following arguments, in this order:
+
+* Document id
+* Generated file name
+* Source path
+* Thumbnail path
+* Download URL
+* Thumbnail URL
+* Correspondent
+* Tags
+
+The script can be in any language you like, but for a simple shell script
+example, you can take a look at ``post-consumption-example.sh`` in the
+``scripts`` directory in this project.
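+
+For a script written in Python, the positional arguments listed above can be
+given names before use. This is a sketch under assumptions: the key names and
+the ``parse_arguments`` helper are invented for this example, and the values
+are kept as the plain strings Paperless passes.
+
```python
import sys

# One name per positional argument, in the order listed above.
# The names themselves are an assumption made for this example.
ARGUMENT_NAMES = [
    "document_id",
    "file_name",
    "source_path",
    "thumbnail_path",
    "download_url",
    "thumbnail_url",
    "correspondent",
    "tags",
]


def parse_arguments(argv):
    """Map the hook's positional arguments (argv[1:]) to names."""
    values = argv[1:]
    if len(values) != len(ARGUMENT_NAMES):
        raise ValueError(
            f"expected {len(ARGUMENT_NAMES)} arguments, got {len(values)}"
        )
    return dict(zip(ARGUMENT_NAMES, values))


# Only act when the script was invoked with arguments, as a hook would be.
if __name__ == "__main__" and len(sys.argv) > 1:
    document = parse_arguments(sys.argv)
    print(f"Consumed document {document['document_id']}: {document['file_name']}")
```
+
+Failing loudly on an unexpected argument count makes it easier to notice when
+the script is run by hand or when the argument list changes.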
+
+The post-consumption script cannot cancel the consumption process.