Merge branch 'main' into dev

2026-02-09 23:49:29 -06:00 · 2023-04-01 21:08:33 -07:00
parent 1b4020b3d7 b948750d55
commit de155a753d
6 changed files with 259 additions and 224 deletions
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -17,6 +17,8 @@ run paperless, these settings have to be defined in different places.

 ## Required services

+### Redis Broker
+
 `PAPERLESS_REDIS=<url>`

 : This is required for processing scheduled tasks such as email
@@ -33,6 +35,8 @@ matcher.

    Defaults to `redis://localhost:6379`.

+### Database
+
 `PAPERLESS_DBENGINE=<engine_name>`

 : Optional, gives the ability to choose Postgres or MariaDB for
@@ -124,6 +128,47 @@ changing to postgresql if you need to increase this.

    Defaults to unset, keeping the Django defaults.

+## Optional Services
+
+### Tika {#tika}
+
+Paperless can make use of [Tika](https://tika.apache.org/) and
+[Gotenberg](https://gotenberg.dev/) for parsing and converting
+"Office" documents (such as ".doc", ".xlsx" and ".odt").
+Tika and Gotenberg are also needed to allow parsing of E-Mails (.eml).
+
+If you wish to use this, you must provide a Tika server and a Gotenberg server,
+configure their endpoints, and enable the feature.
+
+`PAPERLESS_TIKA_ENABLED=<bool>`
+
+: Enable (or disable) the Tika parser.
+
+    Defaults to false.
+
+`PAPERLESS_TIKA_ENDPOINT=<url>`
+
+: Set the endpoint URL where Paperless can reach your Tika server.
+
+    Defaults to "<http://localhost:9998>".
+
+`PAPERLESS_TIKA_GOTENBERG_ENDPOINT=<url>`
+
+: Set the endpoint URL where Paperless can reach your Gotenberg server.
+
+    Defaults to "<http://localhost:3000>".
+
+If you run paperless on docker, you can add those services to the
+docker-compose file (see the provided
+[`docker-compose.sqlite-tika.yml`](https://github.com/paperless-ngx/paperless-ngx/blob/main/docker/compose/docker-compose.sqlite-tika.yml)
+file for reference).
+
+Add all three configuration parameters to your configuration. If using
+Docker, this may be the `environment` key of the webserver or a
+`docker-compose.env` file. Bare metal installations may have a `.conf` file
+containing the configuration parameters. Be sure to use the correct format
+and watch out for indentation if editing the YAML file.
+
 ## Paths and folders

 `PAPERLESS_CONSUMPTION_DIR=<path>`
@@ -257,8 +302,7 @@ not include a trailing slash. E.g. <https://paperless.domain.com>

 : A list of trusted origins for unsafe requests (e.g. POST). As of
 Django 4.0 this is required to access the Django admin via the web.
-See
-<https://docs.djangoproject.com/en/4.0/ref/settings/#csrf-trusted-origins>
+See the [Django project documentation on the settings](https://docs.djangoproject.com/en/4.1/ref/settings/#csrf-trusted-origins).

    Can also be set using PAPERLESS_URL (see above).

@@ -269,8 +313,8 @@ See

 : If you're planning on putting Paperless on the open internet, then
 you really should set this value to the domain name you're using.
-Failing to do so leaves you open to HTTP host header attacks:
-<https://docs.djangoproject.com/en/3.1/topics/security/#host-header-validation>
+Failing to do so leaves you open to HTTP host header attacks.
+You can read more about this in [the Django project's documentation](https://docs.djangoproject.com/en/4.1/topics/security/#host-header-validation).

    Just remember that this is a comma-separated list, so
    "example.com" is fine, as is "example.com,www.example.com", but
@@ -386,16 +430,16 @@ applications.
        If you're exposing paperless to the internet directly, do not use
        this.

-        Also see the warning [in the official documentation](https://docs.djangoproject.com/en/3.1/howto/auth-remote-user/#configuration).
+        Also see the warning [in the official documentation](https://docs.djangoproject.com/en/4.1/howto/auth-remote-user/#configuration).

    Defaults to "false" which disables this feature.

 `PAPERLESS_HTTP_REMOTE_USER_HEADER_NAME=<str>`

-: If "PAPERLESS*ENABLE_HTTP_REMOTE_USER" is enabled, this
+: If "PAPERLESS_ENABLE_HTTP_REMOTE_USER" is enabled, this
 property allows to customize the name of the HTTP header from which
 the authenticated username is extracted. Values are in terms of
-[HttpRequest.META](https://docs.djangoproject.com/en/3.1/ref/request-response/#django.http.HttpRequest.META).
+[HttpRequest.META](https://docs.djangoproject.com/en/4.1/ref/request-response/#django.http.HttpRequest.META).
 Thus, the configured value must start with `HTTP*`
 followed by the normalized actual header name.

@@ -421,21 +465,20 @@ needs.
 : Customize the language that paperless will attempt to use when
 parsing documents.

-    It should be a 3-letter language code consistent with ISO 639:
-    https://www.loc.gov/standards/iso639-2/php/code_list.php
+    It should be a 3-letter code; see the list of [languages Tesseract supports](https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html).

    Set this to the language most of your documents are written in.

    This can be a combination of multiple languages such as `deu+eng`,
-    in which case tesseract will use whatever language matches best.
-    Keep in mind that tesseract uses much more cpu time with multiple
+    in which case Tesseract will use whatever language matches best.
+    Keep in mind that Tesseract uses much more CPU time with multiple
    languages enabled.

    Defaults to "eng".

    !!! note

-        If your language contains a '-' such as chi-sim, you must use chi_sim
+        If your language contains a '-' such as chi-sim, you must use `chi_sim`.

 `PAPERLESS_OCR_MODE=<mode>`

@@ -622,76 +665,6 @@ they use underscores instead of dashes.
    {"deskew": true, "optimize": 3, "unpaper_args": "--pre-rotate 90"}
    ```

-## Tika settings {#tika}
-
-Paperless can make use of [Tika](https://tika.apache.org/) and
-[Gotenberg](https://gotenberg.dev/) for parsing and converting
-"Office" documents (such as ".doc", ".xlsx" and ".odt").
-Tika and Gotenberg are also needed to allow parsing of E-Mails (.eml).
-
-If you wish to use this, you must provide a Tika server and a Gotenberg server,
-configure their endpoints, and enable the feature.
-
-`PAPERLESS_TIKA_ENABLED=<bool>`
-
-: Enable (or disable) the Tika parser.
-
-    Defaults to false.
-
-`PAPERLESS_TIKA_ENDPOINT=<url>`
-
-: Set the endpoint URL were Paperless can reach your Tika server.
-
-    Defaults to "<http://localhost:9998>".
-
-`PAPERLESS_TIKA_GOTENBERG_ENDPOINT=<url>`
-
-: Set the endpoint URL were Paperless can reach your Gotenberg server.
-
-    Defaults to "<http://localhost:3000>".
-
-If you run paperless on docker, you can add those services to the
-docker-compose file (see the provided `docker-compose.sqlite-tika.yml`
-file for reference). The changes requires are as follows:
-
-```yaml
-services:
-  # ...
-
-  webserver:
-    # ...
-
-    environment:
-      # ...
-
-      PAPERLESS_TIKA_ENABLED: 1
-      PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
-      PAPERLESS_TIKA_ENDPOINT: http://tika:9998
-
-    # ...
-
-    gotenberg:
-      image: gotenberg/gotenberg:7.8
-      restart: unless-stopped
-      # The gotenberg chromium route is used to convert .eml files. We do not
-      # want to allow external content like tracking pixels or even javascript.
-      command:
-        - 'gotenberg'
-        - '--chromium-disable-javascript=true'
-        - '--chromium-allow-list=file:///tmp/.*'
-
-  tika:
-    image: ghcr.io/paperless-ngx/tika:latest
-    restart: unless-stopped
-```
-
-Add the configuration variables to the environment of the webserver
-(alternatively put the configuration in the `docker-compose.env` file)
-and add the additional services below the webserver service. Watch out
-for indentation.
-
-Make sure to use the correct format `PAPERLESS_TIKA_ENABLED = 1` so python_dotenv can parse the statement correctly.
-
 ## Software tweaks {#software_tweaks}

 `PAPERLESS_TASK_WORKERS=<num>`
@@ -743,17 +716,10 @@ paperless will process in parallel on a single document.
 on large documents within the default 1800 seconds. So extending
 this timeout may prove to be useful on weak hardware setups.

-`PAPERLESS_WORKER_RETRY=<num>`
-
-: If PAPERLESS_WORKER_TIMEOUT has been configured, the retry time for
-a task can also be configured. By default, this value will be set to
-10s more than the worker timeout. This value should never be set
-less than the worker timeout.
-
 `PAPERLESS_TIME_ZONE=<timezone>`

-: Set the time zone here. See
-<https://docs.djangoproject.com/en/3.1/ref/settings/#std:setting-TIME_ZONE>
+: Set the time zone here. See the
+[Django project documentation](https://docs.djangoproject.com/en/4.1/ref/settings/#std:setting-TIME_ZONE)
 for details on how to set it.

    Defaults to UTC.
@@ -815,46 +781,33 @@ should be a valid crontab(5) expression describing when to run.
        to enable compression in your proxy configuration rather than
        the webserver

-## Polling {#polling}
+`PAPERLESS_CONVERT_MEMORY_LIMIT=<num>`

-`PAPERLESS_CONSUMER_POLLING=<num>`
+: On smaller systems, or even in the case of Very Large Documents, the
+consumer may explode, complaining about how it's "unable to extend
+pixel cache". In such cases, try setting this to a reasonably low
+value, like 32. The default is to use whatever is necessary to do
+everything without writing to disk, and units are in megabytes.

-: If paperless won't find documents added to your consume folder, it
-might not be able to automatically detect filesystem changes. In
-that case, specify a polling interval in seconds here, which will
-then cause paperless to periodically check your consumption
-directory for changes. This will also disable listening for file
-system changes with `inotify`.
+    For more information on how to use this value, you should search the
+    web for "MAGICK_MEMORY_LIMIT".

-    Defaults to 0, which disables polling and uses filesystem
-    notifications.
+    Defaults to 0, which disables the limit.

-`PAPERLESS_CONSUMER_POLLING_RETRY_COUNT=<num>`
+`PAPERLESS_CONVERT_TMPDIR=<path>`

-: If consumer polling is enabled, sets the number of times paperless
-will check for a file to remain unmodified.
+: Similar to the memory limit, if you've got a small system and your
+OS mounts /tmp as tmpfs, you should set this to a path that's on a
+physical disk, like /home/your_user/tmp or something. ImageMagick
+will use this as scratch space when crunching through very large
+documents.

-    Defaults to 5.
+    For more information on how to use this value, you should search the
+    web for "MAGICK_TMPDIR".

-`PAPERLESS_CONSUMER_POLLING_DELAY=<num>`
+    Default is none, which disables the temporary directory.

-: If consumer polling is enabled, sets the delay in seconds between
-each check (above) paperless will do while waiting for a file to
-remain unmodified.
-
-    Defaults to 5.
-
-## iNotify {#inotify}
-
-`PAPERLESS_CONSUMER_INOTIFY_DELAY=<num>`
-
-: Sets the time in seconds the consumer will wait for additional
-events from inotify before the consumer will consider a file ready
-and begin consumption. Certain scanners or network setups may
-generate multiple events for a single file, leading to multiple
-consumers working on the same file. Configure this to prevent that.
-
-    Defaults to 0.5 seconds.
+## Document Consumption {#consume_config}

 `PAPERLESS_CONSUMER_DELETE_DUPLICATES=<bool>`

@@ -885,22 +838,23 @@ don't exist yet.

    Defaults to false.

-`PAPERLESS_CONSUMER_ENABLE_BARCODES=<bool>`
+`PAPERLESS_CONSUMER_IGNORE_PATTERNS=<json>`

-: Enables the scanning and page separation based on detected barcodes.
-This allows for scanning and adding multiple documents per uploaded
-file, which are separated by one or multiple barcode pages.
+: By default, paperless ignores certain files and folders in the
+consumption directory, such as system files created by the Mac OS
+or hidden folders some tools use to store data.

-    For ease of use, it is suggested to use a standardized separation
-    page, e.g. [here](https://www.alliancegroup.co.uk/patch-codes.htm).
+    This can be adjusted by configuring a custom json array with
+    patterns to exclude.

-    If no barcodes are detected in the uploaded file, no page separation
-    will happen.
+    For example, `.DS_STORE/*` will ignore any files found in a folder
+    named `.DS_STORE`, including `.DS_STORE/bar.pdf` and `foo/.DS_STORE/bar.pdf`.

-    The original document will be removed and the separated pages will
-    be saved as pdf.
+    A pattern like `._*` will ignore anything starting with `._`, including:
+    `._foo.pdf` and `._bar/foo.pdf`.

-    Defaults to false.
+    Defaults to
+    `[".DS_STORE/*", "._*", ".stfolder/*", ".stversions/*", ".localized/*", "desktop.ini", "@eaDir/*"]`.

 `PAPERLESS_CONSUMER_BARCODE_SCANNER=<string>`

@@ -912,79 +866,23 @@ file, which are separated by one or multiple barcode pages.

    zxing is not available on all platforms.

-`PAPERLESS_CONSUMER_BARCODE_TIFF_SUPPORT=<bool>`
+`PAPERLESS_PRE_CONSUME_SCRIPT=<filename>`

-: Whether TIFF image files should be scanned for barcodes. This will
-automatically convert any TIFF image(s) to pdfs for later
-processing. This only has an effect, if
-PAPERLESS_CONSUMER_ENABLE_BARCODES has been enabled.
+: After some initial validation, Paperless can optionally run an arbitrary
+script before beginning consumption. The script receives the data it
+needs through environment variables.

-    Defaults to false.
+    For more information, take a look at [Pre-consumption script](/advanced_usage#pre-consume-script).

-`PAPERLESS_CONSUMER_BARCODE_STRING=PATCHT`
-
-: Defines the string to be detected as a separator barcode. If
-paperless is used with the PATCH-T separator pages, users shouldn't
-change this.
-
-    Defaults to "PATCHT"
-
-`PAPERLESS_CONSUMER_ENABLE_ASN_BARCODE=<bool>`
-
-: Enables the detection of barcodes in the scanned document and
-setting the ASN (archive serial number) if a properly formatted
-barcode is detected.
-
-    The barcode must consist of a (configurable) prefix and the ASN
-    to be set, for instance `ASN00123`.
-
-    This option is compatible with barcode page separation, since
-    pages will be split up before reading the ASN.
-
-    If no ASN barcodes are detected in the uploaded file, no ASN will
-    be set. If a barcode with an already existing ASN is detected, no ASN
-    will be set either and a warning will be logged.
-
-    Defaults to false.
-
-`PAPERLESS_CONSUMER_ASN_BARCODE_PREFIX=ASN`
-
-: Defines the prefix that is used to identify a barcode as an ASN
-barcode.
-
-    Defaults to "ASN"
-
-`PAPERLESS_CONVERT_MEMORY_LIMIT=<num>`
-
-: On smaller systems, or even in the case of Very Large Documents, the
-consumer may explode, complaining about how it's "unable to extend
-pixel cache". In such cases, try setting this to a reasonably low
-value, like 32. The default is to use whatever is necessary to do
-everything without writing to disk, and units are in megabytes.
-
-    For more information on how to use this value, you should search the
-    web for "MAGICK_MEMORY_LIMIT".
-
-    Defaults to 0, which disables the limit.
-
-`PAPERLESS_CONVERT_TMPDIR=<path>`
-
-: Similar to the memory limit, if you've got a small system and your
-OS mounts /tmp as tmpfs, you should set this to a path that's on a
-physical disk, like /home/your_user/tmp or something. ImageMagick
-will use this as scratch space when crunching through very large
-documents.
-
-    For more information on how to use this value, you should search the
-    web for "MAGICK_TMPDIR".
-
-    Default is none, which disables the temporary directory.
+    The default is blank, which means nothing will be executed.

 `PAPERLESS_POST_CONSUME_SCRIPT=<filename>`

 : After a document is consumed, Paperless can trigger an arbitrary
-script if you like. This script will be passed a number of arguments
-for you to work with. For more information, take a look at [Post-consumption script](/advanced_usage#post-consume-script).
+script if you like. The script receives the data it
+needs through environment variables.
+
+    For more information, take a look at [Post-consumption script](/advanced_usage#post-consume-script).

    The default is blank, which means nothing will be executed.

@@ -1051,23 +949,109 @@ within your documents.
    second, and year last order. Characters D, M, or Y can be shuffled
    to meet the required order.

-`PAPERLESS_CONSUMER_IGNORE_PATTERNS=<json>`
+### Polling {#polling}

-: By default, paperless ignores certain files and folders in the
-consumption directory, such as system files created by the Mac OS
-or hidden folders some tools use to store data.
+`PAPERLESS_CONSUMER_POLLING=<num>`

-    This can be adjusted by configuring a custom json array with
-    patterns to exclude.
+: If paperless won't find documents added to your consume folder, it
+might not be able to automatically detect filesystem changes. In
+that case, specify a polling interval in seconds here, which will
+then cause paperless to periodically check your consumption
+directory for changes. This will also disable listening for file
+system changes with `inotify`.

-    For example, `.DS_STORE/*` will ignore any files found in a folder
-    named `.DS_STORE`, including `.DS_STORE/bar.pdf` and `foo/.DS_STORE/bar.pdf`
+    Defaults to 0, which disables polling and uses filesystem
+    notifications.

-    A pattern like `._*` will ignore anything starting with `._`, including:
-    `._foo.pdf` and `._bar/foo.pdf`
+`PAPERLESS_CONSUMER_POLLING_RETRY_COUNT=<num>`

-    Defaults to
-    `[".DS_STORE/*", "._*", ".stfolder/*", ".stversions/*", ".localized/*", "desktop.ini", "@eaDir/*"]`.
+: If consumer polling is enabled, sets the number of times paperless
+will check for a file to remain unmodified.
+
+    Defaults to 5.
+
+`PAPERLESS_CONSUMER_POLLING_DELAY=<num>`
+
+: If consumer polling is enabled, sets the delay in seconds between
+each check (above) paperless will do while waiting for a file to
+remain unmodified.
+
+    Defaults to 5.
+
+### iNotify {#inotify}
+
+`PAPERLESS_CONSUMER_INOTIFY_DELAY=<num>`
+
+: Sets the time in seconds the consumer will wait for additional
+events from inotify before the consumer will consider a file ready
+and begin consumption. Certain scanners or network setups may
+generate multiple events for a single file, leading to multiple
+consumers working on the same file. Configure this to prevent that.
+
+    Defaults to 0.5 seconds.
+
+## Barcodes {#barcodes}
+
+`PAPERLESS_CONSUMER_ENABLE_BARCODES=<bool>`
+
+: Enables the scanning and page separation based on detected barcodes.
+This allows for scanning and adding multiple documents per uploaded
+file, which are separated by one or multiple barcode pages.
+
+    For ease of use, it is suggested to use a standardized separation
+    page, e.g. [here](https://www.alliancegroup.co.uk/patch-codes.htm).
+
+    If no barcodes are detected in the uploaded file, no page separation
+    will happen.
+
+    The original document will be removed and the separated pages will
+    be saved as PDF.
+
+    See additional information in the [advanced usage documentation](/advanced_usage#barcodes).
+
+    Defaults to false.
+
+`PAPERLESS_CONSUMER_BARCODE_TIFF_SUPPORT=<bool>`
+
+: Whether TIFF image files should be scanned for barcodes. This will
+automatically convert any TIFF image(s) to PDFs for later
+processing. This only has an effect if
+PAPERLESS_CONSUMER_ENABLE_BARCODES has been enabled.
+
+    Defaults to false.
+
+`PAPERLESS_CONSUMER_BARCODE_STRING=<string>`
+
+: Defines the string to be detected as a separator barcode. If
+paperless is used with the PATCH-T separator pages, users shouldn't
+change this.
+
+    Defaults to "PATCHT"
+
+`PAPERLESS_CONSUMER_ENABLE_ASN_BARCODE=<bool>`
+
+: Enables the detection of barcodes in the scanned document and
+setting the ASN (archive serial number) if a properly formatted
+barcode is detected.
+
+    The barcode must consist of a (configurable) prefix and the ASN
+    to be set, for instance `ASN00123`.
+
+    This option is compatible with barcode page separation, since
+    pages will be split up before reading the ASN.
+
+    If no ASN barcodes are detected in the uploaded file, no ASN will
+    be set. If a barcode with an already existing ASN is detected, no ASN
+    will be set either and a warning will be logged.
+
+    Defaults to false.
+
+`PAPERLESS_CONSUMER_ASN_BARCODE_PREFIX=<string>`
+
+: Defines the prefix that is used to identify a barcode as an ASN
+barcode.
+
+    Defaults to "ASN"

 ## Binaries

@@ -1159,12 +1143,14 @@ actual group ID on the host system, which you can get by executing
 : Additional OCR languages to install. By default, paperless comes
 with English, German, Italian, Spanish and French. If your language
 is not in this list, install additional languages with this
-configuration option:
+configuration option ([find the right LangCodes](https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html)):

    ``` bash
    PAPERLESS_OCR_LANGUAGES=tur ces
    ```

+    Make sure it's a space-separated list when using several values.
+
    To actually use these languages, also set the default OCR language
    of paperless: