Improve documentation around barcodes, re-organize configuration and update links

2025-12-22 01:55:49 -06:00 · 2023-03-14 09:06:11 -07:00
parent 3cb26722f1
commit 8ac7d56fc5
3 changed files with 269 additions and 201 deletions
--- a/docs/advanced_usage.md
+++ b/docs/advanced_usage.md
@@ -507,3 +507,43 @@ existing tables) with:
    Using mariadb version 10.4+ is recommended. Using the `utf8mb3` character set on
    an older system may fix issues that can arise while setting up Paperless-ngx but
    `utf8mb3` can cause issues with consumption (where `utf8mb4` does not).
+
+## Barcodes {#barcodes}
+
+Paperless can use barcodes to perform some tasks automatically.
+
+At this time, the library used for barcode detection supports the following types:
+
+- EAN-13/UPC-A
+- UPC-E
+- EAN-8
+- Code 128
+- Code 93
+- Code 39
+- Codabar
+- Interleaved 2 of 5
+- QR Code
+- SQ Code
+
+You may check for updates on the [zbar library homepage](https://github.com/mchehab/zbar).
+For usage in Paperless, the type of barcode does not matter, only its contents.
+
+For how to enable barcode usage, see [the configuration](/configuration#barcodes).
+The two settings may be enabled independently, but do have interactions as explained
+below.
+
+### Document Splitting
+
+When enabled, Paperless will look for a barcode with the configured value and create a new document
+starting from the next page. The page with the barcode on it will _not_ be retained. It
+is expected to be a page existing only for triggering the split.
+
+### Archive Serial Number Assignment
+
+When enabled, the value of the barcode (as an integer) will be used to set the document's
+archive serial number, allowing quick reference back to the original, paper document.
+
+If document splitting via barcode is also enabled, documents will be split when an ASN
+barcode is located. However, unlike separator-page splitting, the page with the
+barcode _will_ be retained. This allows application of a barcode to any page, including
+one which holds data to keep in the document.
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -17,6 +17,8 @@ run paperless, these settings have to be defined in different places.

 ## Required services

+### Redis Broker
+
 `PAPERLESS_REDIS=<url>`

 : This is required for processing scheduled tasks such as email
@@ -33,6 +35,8 @@ matcher.

    Defaults to `redis://localhost:6379`.

+### Database
+
 `PAPERLESS_DBENGINE=<engine_name>`

 : Optional, gives the ability to choose Postgres or MariaDB for
@@ -94,6 +98,78 @@ changing to postgresql if you need to increase this.

    Defaults to unset, keeping the Django defaults.

+## Optional Services
+
+### Tika {#tika}
+
+Paperless can make use of [Tika](https://tika.apache.org/) and
+[Gotenberg](https://gotenberg.dev/) for parsing and converting
+"Office" documents (such as ".doc", ".xlsx" and ".odt").
+Tika and Gotenberg are also needed to allow parsing of E-Mails (.eml).
+
+If you wish to use this, you must provide a Tika server and a Gotenberg server,
+configure their endpoints, and enable the feature.
+
+`PAPERLESS_TIKA_ENABLED=<bool>`
+
+: Enable (or disable) the Tika parser.
+
+    Defaults to false.
+
+`PAPERLESS_TIKA_ENDPOINT=<url>`
+
+: Set the endpoint URL where Paperless can reach your Tika server.
+
+    Defaults to "<http://localhost:9998>".
+
+`PAPERLESS_TIKA_GOTENBERG_ENDPOINT=<url>`
+
+: Set the endpoint URL where Paperless can reach your Gotenberg server.
+
+    Defaults to "<http://localhost:3000>".
+
+If you run paperless on docker, you can add those services to the
+docker-compose file (see the provided `docker-compose.sqlite-tika.yml`
+file for reference). The changes required are as follows:
+
+```yaml
+services:
+  # ...
+
+  webserver:
+    # ...
+
+    environment:
+      # ...
+
+      PAPERLESS_TIKA_ENABLED: 1
+      PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
+      PAPERLESS_TIKA_ENDPOINT: http://tika:9998
+
+  # ...
+
+  gotenberg:
+    image: gotenberg/gotenberg:7.8
+    restart: unless-stopped
+    # The gotenberg chromium route is used to convert .eml files. We do not
+    # want to allow external content like tracking pixels or even javascript.
+    command:
+      - 'gotenberg'
+      - '--chromium-disable-javascript=true'
+      - '--chromium-allow-list=file:///tmp/.*'
+
+  tika:
+    image: ghcr.io/paperless-ngx/tika:latest
+    restart: unless-stopped
+```
+
+Add the configuration variables to the environment of the webserver
+(alternatively put the configuration in the `docker-compose.env` file)
+and add the additional services below the webserver service. Watch out
+for indentation.
+
+Make sure to use the correct format `PAPERLESS_TIKA_ENABLED = 1` so python_dotenv can parse the statement correctly.
+
 ## Paths and folders

 `PAPERLESS_CONSUMPTION_DIR=<path>`
@@ -227,8 +303,7 @@ not include a trailing slash. E.g. <https://paperless.domain.com>

 : A list of trusted origins for unsafe requests (e.g. POST). As of
 Django 4.0 this is required to access the Django admin via the web.
-See
-<https://docs.djangoproject.com/en/4.0/ref/settings/#csrf-trusted-origins>
+See the [Django project documentation on this setting](https://docs.djangoproject.com/en/4.1/ref/settings/#csrf-trusted-origins).

    Can also be set using PAPERLESS_URL (see above).

@@ -239,8 +314,8 @@ See

 : If you're planning on putting Paperless on the open internet, then
 you really should set this value to the domain name you're using.
-Failing to do so leaves you open to HTTP host header attacks:
-<https://docs.djangoproject.com/en/3.1/topics/security/#host-header-validation>
+Failing to do so leaves you open to HTTP host header attacks.
+You can read more about this in [the Django project's documentation](https://docs.djangoproject.com/en/4.1/topics/security/#host-header-validation).

    Just remember that this is a comma-separated list, so
    "example.com" is fine, as is "example.com,www.example.com", but
@@ -348,7 +423,7 @@ applications.
        If you're exposing paperless to the internet directly, do not use
        this.

-        Also see the warning [in the official documentation](https://docs.djangoproject.com/en/3.1/howto/auth-remote-user/#configuration).
+        Also see the warning [in the official documentation](https://docs.djangoproject.com/en/4.1/howto/auth-remote-user/#configuration).

    Defaults to "false" which disables this feature.

@@ -357,7 +432,7 @@ applications.
 : If "PAPERLESS*ENABLE_HTTP_REMOTE_USER" is enabled, this
 property allows to customize the name of the HTTP header from which
 the authenticated username is extracted. Values are in terms of
-[HttpRequest.META](https://docs.djangoproject.com/en/3.1/ref/request-response/#django.http.HttpRequest.META).
+[HttpRequest.META](https://docs.djangoproject.com/en/4.1/ref/request-response/#django.http.HttpRequest.META).
 Thus, the configured value must start with `HTTP*`
 followed by the normalized actual header name.

@@ -576,76 +651,6 @@ they use underscores instead of dashes.
    {"deskew": true, "optimize": 3, "unpaper_args": "--pre-rotate 90"}
    ```

-## Tika settings {#tika}
-
-Paperless can make use of [Tika](https://tika.apache.org/) and
-[Gotenberg](https://gotenberg.dev/) for parsing and converting
-"Office" documents (such as ".doc", ".xlsx" and ".odt").
-Tika and Gotenberg are also needed to allow parsing of E-Mails (.eml).
-
-If you wish to use this, you must provide a Tika server and a Gotenberg server,
-configure their endpoints, and enable the feature.
-
-`PAPERLESS_TIKA_ENABLED=<bool>`
-
-: Enable (or disable) the Tika parser.
-
-    Defaults to false.
-
-`PAPERLESS_TIKA_ENDPOINT=<url>`
-
-: Set the endpoint URL were Paperless can reach your Tika server.
-
-    Defaults to "<http://localhost:9998>".
-
-`PAPERLESS_TIKA_GOTENBERG_ENDPOINT=<url>`
-
-: Set the endpoint URL were Paperless can reach your Gotenberg server.
-
-    Defaults to "<http://localhost:3000>".
-
-If you run paperless on docker, you can add those services to the
-docker-compose file (see the provided `docker-compose.sqlite-tika.yml`
-file for reference). The changes requires are as follows:
-
-```yaml
-services:
-  # ...
-
-  webserver:
-    # ...
-
-    environment:
-      # ...
-
-      PAPERLESS_TIKA_ENABLED: 1
-      PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
-      PAPERLESS_TIKA_ENDPOINT: http://tika:9998
-
-    # ...
-
-    gotenberg:
-      image: gotenberg/gotenberg:7.8
-      restart: unless-stopped
-      # The gotenberg chromium route is used to convert .eml files. We do not
-      # want to allow external content like tracking pixels or even javascript.
-      command:
-        - 'gotenberg'
-        - '--chromium-disable-javascript=true'
-        - '--chromium-allow-list=file:///tmp/.*'
-
-  tika:
-    image: ghcr.io/paperless-ngx/tika:latest
-    restart: unless-stopped
-```
-
-Add the configuration variables to the environment of the webserver
-(alternatively put the configuration in the `docker-compose.env` file)
-and add the additional services below the webserver service. Watch out
-for indentation.
-
-Make sure to use the correct format `PAPERLESS_TIKA_ENABLED = 1` so python_dotenv can parse the statement correctly.
-
 ## Software tweaks {#software_tweaks}

 `PAPERLESS_TASK_WORKERS=<num>`
@@ -699,8 +704,8 @@ this timeout may prove to be useful on weak hardware setups.

 `PAPERLESS_TIME_ZONE=<timezone>`

-: Set the time zone here. See
-<https://docs.djangoproject.com/en/3.1/ref/settings/#std:setting-TIME_ZONE>
+: Set the time zone here. See the
+[Django project documentation](https://docs.djangoproject.com/en/4.1/ref/settings/#std:setting-TIME_ZONE)
 for details on how to set it.

    Defaults to UTC.
@@ -762,46 +767,33 @@ should be a valid crontab(5) expression describing when to run.
        to enable compression in your proxy configuration rather than
        the webserver

-## Polling {#polling}
+`PAPERLESS_CONVERT_MEMORY_LIMIT=<num>`

-`PAPERLESS_CONSUMER_POLLING=<num>`
+: On smaller systems, or even in the case of Very Large Documents, the
+consumer may explode, complaining about how it's "unable to extend
+pixel cache". In such cases, try setting this to a reasonably low
+value, like 32. The default is to use whatever is necessary to do
+everything without writing to disk, and units are in megabytes.

-: If paperless won't find documents added to your consume folder, it
-might not be able to automatically detect filesystem changes. In
-that case, specify a polling interval in seconds here, which will
-then cause paperless to periodically check your consumption
-directory for changes. This will also disable listening for file
-system changes with `inotify`.
+    For more information on how to use this value, you should search the
+    web for "MAGICK_MEMORY_LIMIT".

-    Defaults to 0, which disables polling and uses filesystem
-    notifications.
+    Defaults to 0, which disables the limit.

-`PAPERLESS_CONSUMER_POLLING_RETRY_COUNT=<num>`
+`PAPERLESS_CONVERT_TMPDIR=<path>`

-: If consumer polling is enabled, sets the number of times paperless
-will check for a file to remain unmodified.
+: Similar to the memory limit, if you've got a small system and your
+OS mounts /tmp as tmpfs, you should set this to a path that's on a
+physical disk, like /home/your_user/tmp or something. ImageMagick
+will use this as scratch space when crunching through very large
+documents.

-    Defaults to 5.
+    For more information on how to use this value, you should search the
+    web for "MAGICK_TMPDIR".

-`PAPERLESS_CONSUMER_POLLING_DELAY=<num>`
+    Default is none, which disables the temporary directory.

-: If consumer polling is enabled, sets the delay in seconds between
-each check (above) paperless will do while waiting for a file to
-remain unmodified.
-
-    Defaults to 5.
-
-## iNotify {#inotify}
-
-`PAPERLESS_CONSUMER_INOTIFY_DELAY=<num>`
-
-: Sets the time in seconds the consumer will wait for additional
-events from inotify before the consumer will consider a file ready
-and begin consumption. Certain scanners or network setups may
-generate multiple events for a single file, leading to multiple
-consumers working on the same file. Configure this to prevent that.
-
-    Defaults to 0.5 seconds.
+## Document Consumption {#consume_config}

 `PAPERLESS_CONSUMER_DELETE_DUPLICATES=<bool>`

@@ -832,96 +824,41 @@ don't exist yet.

    Defaults to false.

-`PAPERLESS_CONSUMER_ENABLE_BARCODES=<bool>`
+`PAPERLESS_CONSUMER_IGNORE_PATTERNS=<json>`

-: Enables the scanning and page separation based on detected barcodes.
-This allows for scanning and adding multiple documents per uploaded
-file, which are separated by one or multiple barcode pages.
+: By default, paperless ignores certain files and folders in the
+consumption directory, such as system files created by the Mac OS
+or hidden folders some tools use to store data.

-    For ease of use, it is suggested to use a standardized separation
-    page, e.g. [here](https://www.alliancegroup.co.uk/patch-codes.htm).
+    This can be adjusted by configuring a custom json array with
+    patterns to exclude.

-    If no barcodes are detected in the uploaded file, no page separation
-    will happen.
+    For example, `.DS_STORE/*` will ignore any files found in a folder
+    named `.DS_STORE`, including `.DS_STORE/bar.pdf` and `foo/.DS_STORE/bar.pdf`

-    The original document will be removed and the separated pages will
-    be saved as pdf.
+    A pattern like `._*` will ignore anything starting with `._`, including:
+    `._foo.pdf` and `._bar/foo.pdf`

-    Defaults to false.
+    Defaults to
+    `[".DS_STORE/*", "._*", ".stfolder/*", ".stversions/*", ".localized/*", "desktop.ini", "@eaDir/*"]`.

-`PAPERLESS_CONSUMER_BARCODE_TIFF_SUPPORT=<bool>`
+`PAPERLESS_PRE_CONSUME_SCRIPT=<filename>`

-: Whether TIFF image files should be scanned for barcodes. This will
-automatically convert any TIFF image(s) to pdfs for later
-processing. This only has an effect, if
-PAPERLESS_CONSUMER_ENABLE_BARCODES has been enabled.
+: After some initial validation, Paperless can optionally trigger an arbitrary
+script before beginning consumption. This script will be provided
+data to work with via environment variables.

-    Defaults to false.
+    For more information, take a look at [pre-consumption script](/advanced_usage#pre-consume-script).

-`PAPERLESS_CONSUMER_BARCODE_STRING=PATCHT`
-
-: Defines the string to be detected as a separator barcode. If
-paperless is used with the PATCH-T separator pages, users shouldn't
-change this.
-
-    Defaults to "PATCHT"
-
-`PAPERLESS_CONSUMER_ENABLE_ASN_BARCODE=<bool>`
-
-: Enables the detection of barcodes in the scanned document and
-setting the ASN (archive serial number) if a properly formatted
-barcode is detected.
-
-    The barcode must consist of a (configurable) prefix and the ASN
-    to be set, for instance `ASN00123`.
-
-    This option is compatible with barcode page separation, since
-    pages will be split up before reading the ASN.
-
-    If no ASN barcodes are detected in the uploaded file, no ASN will
-    be set. If a barcode with an already existing ASN is detected, no ASN
-    will be set either and a warning will be logged.
-
-    Defaults to false.
-
-`PAPERLESS_CONSUMER_ASN_BARCODE_PREFIX=ASN`
-
-: Defines the prefix that is used to identify a barcode as an ASN
-barcode.
-
-    Defaults to "ASN"
-
-`PAPERLESS_CONVERT_MEMORY_LIMIT=<num>`
-
-: On smaller systems, or even in the case of Very Large Documents, the
-consumer may explode, complaining about how it's "unable to extend
-pixel cache". In such cases, try setting this to a reasonably low
-value, like 32. The default is to use whatever is necessary to do
-everything without writing to disk, and units are in megabytes.
-
-    For more information on how to use this value, you should search the
-    web for "MAGICK_MEMORY_LIMIT".
-
-    Defaults to 0, which disables the limit.
-
-`PAPERLESS_CONVERT_TMPDIR=<path>`
-
-: Similar to the memory limit, if you've got a small system and your
-OS mounts /tmp as tmpfs, you should set this to a path that's on a
-physical disk, like /home/your_user/tmp or something. ImageMagick
-will use this as scratch space when crunching through very large
-documents.
-
-    For more information on how to use this value, you should search the
-    web for "MAGICK_TMPDIR".
-
-    Default is none, which disables the temporary directory.
+    The default is blank, which means nothing will be executed.

 `PAPERLESS_POST_CONSUME_SCRIPT=<filename>`

 : After a document is consumed, Paperless can trigger an arbitrary
-script if you like. This script will be passed a number of arguments
-for you to work with. For more information, take a look at [Post-consumption script](/advanced_usage#post-consume-script).
+script if you like. This script will be provided
+data to work with via environment variables.
+
+    For more information, take a look at [Post-consumption script](/advanced_usage#post-consume-script).

    The default is blank, which means nothing will be executed.

@@ -988,23 +925,109 @@ within your documents.
    second, and year last order. Characters D, M, or Y can be shuffled
    to meet the required order.

-`PAPERLESS_CONSUMER_IGNORE_PATTERNS=<json>`
+### Polling {#polling}

-: By default, paperless ignores certain files and folders in the
-consumption directory, such as system files created by the Mac OS
-or hidden folders some tools use to store data.
+`PAPERLESS_CONSUMER_POLLING=<num>`

-    This can be adjusted by configuring a custom json array with
-    patterns to exclude.
+: If paperless won't find documents added to your consume folder, it
+might not be able to automatically detect filesystem changes. In
+that case, specify a polling interval in seconds here, which will
+then cause paperless to periodically check your consumption
+directory for changes. This will also disable listening for file
+system changes with `inotify`.

-    For example, `.DS_STORE/*` will ignore any files found in a folder
-    named `.DS_STORE`, including `.DS_STORE/bar.pdf` and `foo/.DS_STORE/bar.pdf`
+    Defaults to 0, which disables polling and uses filesystem
+    notifications.

-    A pattern like `._*` will ignore anything starting with `._`, including:
-    `._foo.pdf` and `._bar/foo.pdf`
+`PAPERLESS_CONSUMER_POLLING_RETRY_COUNT=<num>`

-    Defaults to
-    `[".DS_STORE/*", "._*", ".stfolder/*", ".stversions/*", ".localized/*", "desktop.ini", "@eaDir/*"]`.
+: If consumer polling is enabled, sets the number of times paperless
+will check for a file to remain unmodified.
+
+    Defaults to 5.
+
+`PAPERLESS_CONSUMER_POLLING_DELAY=<num>`
+
+: If consumer polling is enabled, sets the delay in seconds between
+each check (above) paperless will do while waiting for a file to
+remain unmodified.
+
+    Defaults to 5.
+
+### iNotify {#inotify}
+
+`PAPERLESS_CONSUMER_INOTIFY_DELAY=<num>`
+
+: Sets the time in seconds the consumer will wait for additional
+events from inotify before the consumer will consider a file ready
+and begin consumption. Certain scanners or network setups may
+generate multiple events for a single file, leading to multiple
+consumers working on the same file. Configure this to prevent that.
+
+    Defaults to 0.5 seconds.
+
+## Barcodes {#barcodes}
+
+`PAPERLESS_CONSUMER_ENABLE_BARCODES=<bool>`
+
+: Enables the scanning and page separation based on detected barcodes.
+This allows for scanning and adding multiple documents per uploaded
+file, which are separated by one or multiple barcode pages.
+
+    For ease of use, it is suggested to use a standardized separation
+    page, e.g. [here](https://www.alliancegroup.co.uk/patch-codes.htm).
+
+    If no barcodes are detected in the uploaded file, no page separation
+    will happen.
+
+    The original document will be removed and the separated pages will
+    be saved as pdf.
+
+    See additional information in the [advanced usage documentation](/advanced_usage#barcodes).
+
+    Defaults to false.
+
+`PAPERLESS_CONSUMER_BARCODE_TIFF_SUPPORT=<bool>`
+
+: Whether TIFF image files should be scanned for barcodes. This will
+automatically convert any TIFF image(s) to pdfs for later
+processing. This only has an effect if
+`PAPERLESS_CONSUMER_ENABLE_BARCODES` has been enabled.
+
+    Defaults to false.
+
+`PAPERLESS_CONSUMER_BARCODE_STRING=PATCHT`
+
+: Defines the string to be detected as a separator barcode. If
+paperless is used with the PATCH-T separator pages, users shouldn't
+change this.
+
+    Defaults to "PATCHT"
+
+`PAPERLESS_CONSUMER_ENABLE_ASN_BARCODE=<bool>`
+
+: Enables the detection of barcodes in the scanned document and
+setting the ASN (archive serial number) if a properly formatted
+barcode is detected.
+
+    The barcode must consist of a (configurable) prefix and the ASN
+    to be set, for instance `ASN00123`.
+
+    This option is compatible with barcode page separation, since
+    pages will be split up before reading the ASN.
+
+    If no ASN barcodes are detected in the uploaded file, no ASN will
+    be set. If a barcode with an already existing ASN is detected, no ASN
+    will be set either and a warning will be logged.
+
+    Defaults to false.
+
+`PAPERLESS_CONSUMER_ASN_BARCODE_PREFIX=ASN`
+
+: Defines the prefix that is used to identify a barcode as an ASN
+barcode.
+
+    Defaults to "ASN"

 ## Binaries

--- a/docs/development.md
+++ b/docs/development.md
@@ -119,7 +119,9 @@ first-time setup.

 ## Back end development

-The back end is a [Django](https://www.djangoproject.com/) application. [PyCharm](https://www.jetbrains.com/de-de/pycharm/) as well as [Visual Studio Code](https://code.visualstudio.com) work well for development, but you can use whatever you want.
+The back end is a [Django](https://www.djangoproject.com/) application.
+[PyCharm](https://www.jetbrains.com/de-de/pycharm/) as well as [Visual Studio Code](https://code.visualstudio.com)
+work well for development, but you can use whatever you want.

 Configure the IDE to use the `src/`-folder as the base source folder.
 Configure the following launch configurations in your IDE:
@@ -138,7 +140,10 @@ $ python3 manage.py runserver & \
  celery --app paperless worker -l DEBUG
 ```

-You might need the front end to test your back end code. This assumes that you have AngularJS installed on your system. Go to the [Front end development](#front-end-development) section for further details. To build the front end once use this commmand:
+You might need the front end to test your back end code.
+This assumes that you have AngularJS installed on your system.
+Go to the [Front end development](#front-end-development) section for further details.
+To build the front end once, use this command:

 ```bash
 # src-ui/