Move docs to material-mkdocs

2026-02-16 00:19:32 -06:00 · 2022-11-29 12:49:23 -08:00
parent a96f79f6a3
commit 605f885e19
58 changed files with 5198 additions and 5895 deletions
--- a/docs/advanced_usage.md
+++ b/docs/advanced_usage.md
@@ -0,0 +1,464 @@
+# Advanced Topics
+
+Paperless offers a couple features that automate certain tasks and make
+your life easier.
+
+## Matching tags, correspondents, document types, and storage paths {#matching}
+
+Paperless will compare the matching algorithms defined by every tag,
+correspondent, document type, and storage path in your database to see
+if they apply to the text in a document. In other words, if you define a
+tag called `Home Utility` that had a `match` property of `bc hydro` and
+a `matching_algorithm` of `literal`, Paperless will automatically tag
+your newly-consumed document with your `Home Utility` tag so long as the
+text `bc hydro` appears in the body of the document somewhere.
+
+The matching logic is quite powerful. It supports searching the text of
+your document with different algorithms, and as such, some
+experimentation may be necessary to get things right.
+
+In order to have a tag, correspondent, document type, or storage path
+assigned automatically to newly consumed documents, assign a match and
+matching algorithm using the web interface. These settings define when
+to assign tags, correspondents, document types, and storage paths to
+documents.
+
+The following algorithms are available:
+
+- **Any:** Looks for any occurrence of any word provided in match in
+  the PDF. If you define the match as `Bank1 Bank2`, it will match
+  documents containing either of these terms.
+- **All:** Requires that every word provided appears in the PDF,
+  albeit not in the order provided.
+- **Literal:** Matches only if the match appears exactly as provided
+  (i.e. preserve ordering) in the PDF.
+- **Regular expression:** Parses the match as a regular expression and
+  tries to find a match within the document.
+- **Fuzzy match:** I don't know. Look at the source.
+- **Auto:** Tries to automatically match new documents. This does not
+  require you to set a match. See the notes below.
+
+When using the _any_ or _all_ matching algorithms, you can search for
+terms that consist of multiple words by enclosing them in double quotes.
+For example, defining a match text of `"Bank of America" BofA` using the
+_any_ algorithm, will match documents that contain either "Bank of
+America" or "BofA", but will not match documents containing "Bank of
+South America".
+
+Then just save your tag, correspondent, document type, or storage path
+and run another document through the consumer. Once complete, you should
+see the newly-created document, automatically tagged with the
+appropriate data.
+
+### Automatic matching {#automatic_matching}
+
+Paperless-ngx comes with a new matching algorithm called _Auto_. This
+matching algorithm tries to assign tags, correspondents, document types,
+and storage paths to your documents based on how you have already
+assigned these on existing documents. It uses a neural network under the
+hood.
+
+If, for example, all your bank statements of your account 123 at the
+Bank of America are tagged with the tag "bofa*123" and the matching
+algorithm of this tag is set to \_Auto*, this neural network will examine
+your documents and automatically learn when to assign this tag.
+
+Paperless tries to hide much of the involved complexity with this
+approach. However, there are a couple caveats you need to keep in mind
+when using this feature:
+
+- Changes to your documents are not immediately reflected by the
+  matching algorithm. The neural network needs to be _trained_ on your
+  documents after changes. Paperless periodically (default: once each
+  hour) checks for changes and does this automatically for you.
+- The Auto matching algorithm only takes documents into account which
+  are NOT placed in your inbox (i.e. have any inbox tags assigned to
+  them). This ensures that the neural network only learns from
+  documents which you have correctly tagged before.
+- The matching algorithm can only work if there is a correlation
+  between the tag, correspondent, document type, or storage path and
+  the document itself. Your bank statements usually contain your bank
+  account number and the name of the bank, so this works reasonably
+  well, However, tags such as "TODO" cannot be automatically
+  assigned.
+- The matching algorithm needs a reasonable number of documents to
+  identify when to assign tags, correspondents, storage paths, and
+  types. If one out of a thousand documents has the correspondent
+  "Very obscure web shop I bought something five years ago", it will
+  probably not assign this correspondent automatically if you buy
+  something from them again. The more documents, the better.
+- Paperless also needs a reasonable amount of negative examples to
+  decide when not to assign a certain tag, correspondent, document
+  type, or storage path. This will usually be the case as you start
+  filling up paperless with documents. Example: If all your documents
+  are either from "Webshop" and "Bank", paperless will assign one
+  of these correspondents to ANY new document, if both are set to
+  automatic matching.
+
+## Hooking into the consumption process
+
+Sometimes you may want to do something arbitrary whenever a document is
+consumed. Rather than try to predict what you may want to do, Paperless
+lets you execute scripts of your own choosing just before or after a
+document is consumed using a couple simple hooks.
+
+Just write a script, put it somewhere that Paperless can read & execute,
+and then put the path to that script in `paperless.conf` or
+`docker-compose.env` with the variable name of either
+`PAPERLESS_PRE_CONSUME_SCRIPT` or `PAPERLESS_POST_CONSUME_SCRIPT`.
+
+!!! info
+
+    These scripts are executed in a **blocking** process, which means that
+    if a script takes a long time to run, it can significantly slow down
+    your document consumption flow. If you want things to run
+    asynchronously, you'll have to fork the process in your script and
+    exit.
+
+### Pre-consumption script
+
+Executed after the consumer sees a new document in the consumption
+folder, but before any processing of the document is performed. This
+script can access the following relevant environment variables set:
+
+- `DOCUMENT_SOURCE_PATH`
+
+A simple but common example for this would be creating a simple script
+like this:
+
+`/usr/local/bin/ocr-pdf`
+
+```bash
+#!/usr/bin/env bash
+pdf2pdfocr.py -i ${DOCUMENT_SOURCE_PATH}
+```
+
+`/etc/paperless.conf`
+
+```bash
+...
+PAPERLESS_PRE_CONSUME_SCRIPT="/usr/local/bin/ocr-pdf"
+...
+```
+
+This will pass the path to the document about to be consumed to
+`/usr/local/bin/ocr-pdf`, which will in turn call
+[pdf2pdfocr.py](https://github.com/LeoFCardoso/pdf2pdfocr) on your
+document, which will then overwrite the file with an OCR'd version of
+the file and exit. At which point, the consumption process will begin
+with the newly modified file.
+
+The script's stdout and stderr will be logged line by line to the
+webserver log, along with the exit code of the script.
+
+### Post-consumption script {#post_consume_script}
+
+Executed after the consumer has successfully processed a document and
+has moved it into paperless. It receives the following environment
+variables:
+
+- `DOCUMENT_ID`
+- `DOCUMENT_FILE_NAME`
+- `DOCUMENT_CREATED`
+- `DOCUMENT_MODIFIED`
+- `DOCUMENT_ADDED`
+- `DOCUMENT_SOURCE_PATH`
+- `DOCUMENT_ARCHIVE_PATH`
+- `DOCUMENT_THUMBNAIL_PATH`
+- `DOCUMENT_DOWNLOAD_URL`
+- `DOCUMENT_THUMBNAIL_URL`
+- `DOCUMENT_CORRESPONDENT`
+- `DOCUMENT_TAGS`
+- `DOCUMENT_ORIGINAL_FILENAME`
+
+The script can be in any language, but for a simple shell script
+example, you can take a look at
+[post-consumption-example.sh](https://github.com/paperless-ngx/paperless-ngx/blob/main/scripts/post-consumption-example.sh)
+in this project.
+
+The post consumption script cannot cancel the consumption process.
+
+The script's stdout and stderr will be logged line by line to the
+webserver log, along with the exit code of the script.
+
+#### Docker
+
+Assumed you have
+`/home/foo/paperless-ngx/scripts/post-consumption-example.sh`.
+
+You can pass that script into the consumer container via a host mount in
+your `docker-compose.yml`.
+
+```bash
+...
+consumer:
+  ...
+  volumes:
+    ...
+    - /home/paperless-ngx/scripts:/path/in/container/scripts/
+...
+```
+
+Example (docker-compose.yml):
+`- /home/foo/paperless-ngx/scripts:/usr/src/paperless/scripts`
+
+which in turn requires the variable `PAPERLESS_POST_CONSUME_SCRIPT` in
+`docker-compose.env` to point to
+`/path/in/container/scripts/post-consumption-example.sh`.
+
+Example (docker-compose.env):
+`PAPERLESS_POST_CONSUME_SCRIPT=/usr/src/paperless/scripts/post-consumption-example.sh`
+
+Troubleshooting:
+
+- Monitor the docker-compose log
+  `cd ~/paperless-ngx; docker-compose logs -f`
+- Check your script's permission e.g. in case of permission error
+  `sudo chmod 755 post-consumption-example.sh`
+- Pipe your scripts's output to a log file e.g.
+  `echo "${DOCUMENT_ID}" | tee --append /usr/src/paperless/scripts/post-consumption-example.log`
+
+## File name handling {#file_name_handling}
+
+By default, paperless stores your documents in the media directory and
+renames them using the identifier which it has assigned to each
+document. You will end up getting files like `0000123.pdf` in your media
+directory. This isn't necessarily a bad thing, because you normally
+don't have to access these files manually. However, if you wish to name
+your files differently, you can do that by adjusting the
+`PAPERLESS_FILENAME_FORMAT` configuration option. Paperless adds the
+correct file extension e.g. `.pdf`, `.jpg` automatically.
+
+This variable allows you to configure the filename (folders are allowed)
+using placeholders. For example, configuring this to
+
+```bash
+PAPERLESS_FILENAME_FORMAT={created_year}/{correspondent}/{title}
+```
+
+will create a directory structure as follows:
+
+```
+2019/
+  My bank/
+    Statement January.pdf
+    Statement February.pdf
+2020/
+  My bank/
+    Statement January.pdf
+    Letter.pdf
+    Letter_01.pdf
+  Shoe store/
+    My new shoes.pdf
+```
+
+!!! warning
+
+    Do not manually move your files in the media folder. Paperless remembers
+    the last filename a document was stored as. If you do rename a file,
+    paperless will report your files as missing and won't be able to find
+    them.
+
+Paperless provides the following placeholders within filenames:
+
+- `{asn}`: The archive serial number of the document, or "none".
+- `{correspondent}`: The name of the correspondent, or "none".
+- `{document_type}`: The name of the document type, or "none".
+- `{tag_list}`: A comma separated list of all tags assigned to the
+  document.
+- `{title}`: The title of the document.
+- `{created}`: The full date (ISO format) the document was created.
+- `{created_year}`: Year created only, formatted as the year with
+  century.
+- `{created_year_short}`: Year created only, formatted as the year
+  without century, zero padded.
+- `{created_month}`: Month created only (number 01-12).
+- `{created_month_name}`: Month created name, as per locale
+- `{created_month_name_short}`: Month created abbreviated name, as per
+  locale
+- `{created_day}`: Day created only (number 01-31).
+- `{added}`: The full date (ISO format) the document was added to
+  paperless.
+- `{added_year}`: Year added only.
+- `{added_year_short}`: Year added only, formatted as the year without
+  century, zero padded.
+- `{added_month}`: Month added only (number 01-12).
+- `{added_month_name}`: Month added name, as per locale
+- `{added_month_name_short}`: Month added abbreviated name, as per
+  locale
+- `{added_day}`: Day added only (number 01-31).
+
+Paperless will try to conserve the information from your database as
+much as possible. However, some characters that you can use in document
+titles and correspondent names (such as `: \ /` and a couple more) are
+not allowed in filenames and will be replaced with dashes.
+
+If paperless detects that two documents share the same filename,
+paperless will automatically append `_01`, `_02`, etc to the filename.
+This happens if all the placeholders in a filename evaluate to the same
+value.
+
+!!! tip
+
+    You can affect how empty placeholders are treated by changing the
+    following setting to [true]{.title-ref}.
+
+    ```
+    PAPERLESS_FILENAME_FORMAT_REMOVE_NONE=True
+    ```
+
+    Doing this results in all empty placeholders resolving to "" instead
+    of "none" as stated above. Spaces before empty placeholders are
+    removed as well, empty directories are omitted.
+
+!!! tip
+
+    Paperless checks the filename of a document whenever it is saved.
+    Therefore, you need to update the filenames of your documents and move
+    them after altering this setting by invoking the
+    [`document renamer <utilities-renamer>`]().
+
+!!! warning
+
+    Make absolutely sure you get the spelling of the placeholders right, or
+    else paperless will use the default naming scheme instead.
+
+!!! caution
+
+    As of now, you could totally tell paperless to store your files anywhere
+    outside the media directory by setting
+
+    ```
+    PAPERLESS_FILENAME_FORMAT=../../my/custom/location/{title}
+    ```
+
+    However, keep in mind that inside docker, if files get stored outside of
+    the predefined volumes, they will be lost after a restart of paperless.
+
+## Storage paths
+
+One of the best things in Paperless is that you can not only access the
+documents via the web interface, but also via the file system.
+
+When as single storage layout is not sufficient for your use case,
+storage paths come to the rescue. Storage paths allow you to configure
+more precisely where each document is stored in the file system.
+
+- Each storage path is a [PAPERLESS_FILENAME_FORMAT]{.title-ref} and
+  follows the rules described above
+- Each document is assigned a storage path using the matching
+  algorithms described above, but can be overwritten at any time
+
+For example, you could define the following two storage paths:
+
+1.  Normal communications are put into a folder structure sorted by
+    [year/correspondent]{.title-ref}
+2.  Communications with insurance companies are stored in a flat
+    structure with longer file names, but containing the full date of
+    the correspondence.
+
+```
+By Year = {created_year}/{correspondent}/{title}
+Insurances = Insurances/{correspondent}/{created_year}-{created_month}-{created_day} {title}
+```
+
+If you then map these storage paths to the documents, you might get the
+following result. For simplicity, [By Year]{.title-ref} defines the same
+structure as in the previous example above.
+
+```text
+2019/                                   # By Year
+   My bank/
+     Statement January.pdf
+     Statement February.pdf
+
+ Insurances/                           # Insurances
+   Healthcare 123/
+     2022-01-01 Statement January.pdf
+     2022-02-02 Letter.pdf
+     2022-02-03 Letter.pdf
+   Dental 456/
+     2021-12-01 New Conditions.pdf
+```
+
+!!! tip
+
+    Defining a storage path is optional. If no storage path is defined for a
+    document, the global [PAPERLESS_FILENAME_FORMAT]{.title-ref} is applied.
+
+!!! warning
+
+    If you adjust the format of an existing storage path, old documents
+    don't get relocated automatically. You need to run the
+    [document renamer](/administration#renamer) to
+    adjust their pathes.
+
+## Celery Monitoring {#celery-monitoring}
+
+The monitoring tool
+[Flower](https://flower.readthedocs.io/en/latest/index.html) can be used
+to view more detailed information about the health of the celery workers
+used for asynchronous tasks. This includes details on currently running,
+queued and completed tasks, timing and more. Flower can also be used
+with Prometheus, as it exports metrics. For details on its capabilities,
+refer to the Flower documentation.
+
+To configure Flower further, create a [flowerconfig.py]{.title-ref} and
+place it into the [src/paperless]{.title-ref} directory. For a Docker
+installation, you can use volumes to accomplish this:
+
+```yaml
+services:
+  # ...
+  webserver:
+    # ...
+    volumes:
+      - /path/to/my/flowerconfig.py:/usr/src/paperless/src/paperless/flowerconfig.py:ro
+```
+
+## Custom Container Initialization
+
+The Docker image includes the ability to run custom user scripts during
+startup. This could be utilized for installing additional tools or
+Python packages, for example.
+
+To utilize this, mount a folder containing your scripts to the custom
+initialization directory, [/custom-cont-init.d]{.title-ref} and place
+scripts you wish to run inside. For security, the folder and its
+contents must be owned by [root]{.title-ref}. Additionally, scripts must
+only be writable by [root]{.title-ref}.
+
+Your scripts will be run directly before the webserver completes
+startup. Scripts will be run by the [root]{.title-ref} user. This is an
+advanced functionality with which you could break functionality or lose
+data.
+
+For example, using Docker Compose:
+
+```yaml
+services:
+  # ...
+  webserver:
+    # ...
+    volumes:
+      - /path/to/my/scripts:/custom-cont-init.d:ro
+```
+
+## MySQL Caveats {#mysql-caveats}
+
+### Case Sensitivity
+
+The database interface does not provide a method to configure a MySQL
+database to be case sensitive. This would prevent a user from creating a
+tag `Name` and `NAME` as they are considered the same.
+
+Per Django documentation, to enable this requires manual intervention.
+To enable case sensetive tables, you can execute the following command
+against each table:
+
+`ALTER TABLE <table_name> CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;`
+
+You can also set the default for new tables (this does NOT affect
+existing tables) with:
+
+`ALTER DATABASE <db_name> CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;`