Merge branch 'main' into dev

shamoon
2023-04-01 21:08:33 -07:00
6 changed files with 259 additions and 224 deletions

@@ -17,6 +17,8 @@ run paperless, these settings have to be defined in different places.
## Required services
### Redis Broker
`PAPERLESS_REDIS=<url>`
: This is required for processing scheduled tasks such as email
@@ -33,6 +35,8 @@ matcher.
Defaults to `redis://localhost:6379`.
### Database
`PAPERLESS_DBENGINE=<engine_name>`
: Optional, gives the ability to choose Postgres or MariaDB for
@@ -124,6 +128,47 @@ changing to postgresql if you need to increase this.
Defaults to unset, keeping the Django defaults.
## Optional Services
### Tika {#tika}
Paperless can make use of [Tika](https://tika.apache.org/) and
[Gotenberg](https://gotenberg.dev/) for parsing and converting
"Office" documents (such as ".doc", ".xlsx" and ".odt").
Tika and Gotenberg are also needed to allow parsing of E-Mails (.eml).
If you wish to use this, you must provide a Tika server and a Gotenberg server,
configure their endpoints, and enable the feature.
`PAPERLESS_TIKA_ENABLED=<bool>`
: Enable (or disable) the Tika parser.
Defaults to false.
`PAPERLESS_TIKA_ENDPOINT=<url>`
: Set the endpoint URL where Paperless can reach your Tika server.
Defaults to "<http://localhost:9998>".
`PAPERLESS_TIKA_GOTENBERG_ENDPOINT=<url>`
: Set the endpoint URL where Paperless can reach your Gotenberg server.
Defaults to "<http://localhost:3000>".
If you run paperless on docker, you can add those services to the
docker-compose file (see the provided
[`docker-compose.sqlite-tika.yml`](https://github.com/paperless-ngx/paperless-ngx/blob/main/docker/compose/docker-compose.sqlite-tika.yml)
file for reference).
Add all three configuration parameters to your configuration. If using
Docker, this may be the `environment` key of the webserver or a
`docker-compose.env` file. Bare metal installations may have a `.conf` file
containing the configuration parameters. Be sure to use the correct format
and watch out for indentation if editing the YAML file.
## Paths and folders
`PAPERLESS_CONSUMPTION_DIR=<path>`
@@ -257,8 +302,7 @@ not include a trailing slash. E.g. <https://paperless.domain.com>
: A list of trusted origins for unsafe requests (e.g. POST). As of
Django 4.0 this is required to access the Django admin via the web.
See
<https://docs.djangoproject.com/en/4.0/ref/settings/#csrf-trusted-origins>
See the [Django project documentation on the settings](https://docs.djangoproject.com/en/4.1/ref/settings/#csrf-trusted-origins)
Can also be set using PAPERLESS_URL (see above).
@@ -269,8 +313,8 @@ See
: If you're planning on putting Paperless on the open internet, then
you really should set this value to the domain name you're using.
Failing to do so leaves you open to HTTP host header attacks:
<https://docs.djangoproject.com/en/3.1/topics/security/#host-header-validation>
Failing to do so leaves you open to HTTP host header attacks.
You can read more about this in [the Django project's documentation](https://docs.djangoproject.com/en/4.1/topics/security/#host-header-validation)
Just remember that this is a comma-separated list, so
"example.com" is fine, as is "example.com,www.example.com", but
@@ -386,16 +430,16 @@ applications.
If you're exposing paperless to the internet directly, do not use
this.
Also see the warning [in the official documentation](https://docs.djangoproject.com/en/3.1/howto/auth-remote-user/#configuration).
Also see the warning [in the official documentation](https://docs.djangoproject.com/en/4.1/howto/auth-remote-user/#configuration).
Defaults to "false" which disables this feature.
`PAPERLESS_HTTP_REMOTE_USER_HEADER_NAME=<str>`
: If "PAPERLESS*ENABLE_HTTP_REMOTE_USER" is enabled, this
: If "PAPERLESS_ENABLE_HTTP_REMOTE_USER" is enabled, this
property lets you customize the name of the HTTP header from which
the authenticated username is extracted. Values are in terms of
[HttpRequest.META](https://docs.djangoproject.com/en/3.1/ref/request-response/#django.http.HttpRequest.META).
[HttpRequest.META](https://docs.djangoproject.com/en/4.1/ref/request-response/#django.http.HttpRequest.META).
Thus, the configured value must start with `HTTP_`
followed by the normalized actual header name.
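As a rough illustration of that normalization: Django stores request headers in `HttpRequest.META` by upper-casing the name, replacing hyphens with underscores, and adding an `HTTP_` prefix. The helper name `meta_key_for_header` and the header `X-Auth-User` below are made up for this sketch and are not part of Paperless.
```python
def meta_key_for_header(header_name: str) -> str:
    """Return the HttpRequest.META key Django uses for an HTTP header.

    Hypothetical helper for illustration only. Django treats
    Content-Length and Content-Type specially (no HTTP_ prefix);
    this sketch does not handle those two headers.
    """
    return "HTTP_" + header_name.strip().upper().replace("-", "_")


# A proxy that sends "X-Auth-User: alice" would need this configured value:
print(meta_key_for_header("X-Auth-User"))  # HTTP_X_AUTH_USER
```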
@@ -421,21 +465,20 @@ needs.
: Customize the language that paperless will attempt to use when
parsing documents.
It should be a 3-letter language code consistent with ISO 639:
https://www.loc.gov/standards/iso639-2/php/code_list.php
It should be a 3-letter code; see the list of [languages Tesseract supports](https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html).
Set this to the language most of your documents are written in.
This can be a combination of multiple languages such as `deu+eng`,
in which case tesseract will use whatever language matches best.
Keep in mind that tesseract uses much more cpu time with multiple
in which case Tesseract will use whatever language matches best.
Keep in mind that Tesseract uses much more CPU time with multiple
languages enabled.
Defaults to "eng".
!!! note
If your language contains a '-' such as chi-sim, you must use chi_sim
If your language contains a '-' such as chi-sim, you must use `chi_sim`.
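A minimal sketch of building such a value, assuming you start from a list of Tesseract language codes. The function name `ocr_language_value` is hypothetical and only illustrates the two rules above (join with `+`, use underscores instead of dashes).
```python
def ocr_language_value(*codes: str) -> str:
    """Join language codes into a value for PAPERLESS_OCR_LANGUAGE.

    Illustrative helper, not part of Paperless: dashes become
    underscores and several codes are combined with '+'.
    """
    return "+".join(code.strip().replace("-", "_") for code in codes)


print(ocr_language_value("deu", "eng"))  # deu+eng
print(ocr_language_value("chi-sim"))     # chi_sim
```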
`PAPERLESS_OCR_MODE=<mode>`
@@ -622,76 +665,6 @@ they use underscores instead of dashes.
{"deskew": true, "optimize": 3, "unpaper_args": "--pre-rotate 90"}
```
## Tika settings {#tika}
Paperless can make use of [Tika](https://tika.apache.org/) and
[Gotenberg](https://gotenberg.dev/) for parsing and converting
"Office" documents (such as ".doc", ".xlsx" and ".odt").
Tika and Gotenberg are also needed to allow parsing of E-Mails (.eml).
If you wish to use this, you must provide a Tika server and a Gotenberg server,
configure their endpoints, and enable the feature.
`PAPERLESS_TIKA_ENABLED=<bool>`
: Enable (or disable) the Tika parser.
Defaults to false.
`PAPERLESS_TIKA_ENDPOINT=<url>`
: Set the endpoint URL where Paperless can reach your Tika server.
Defaults to "<http://localhost:9998>".
`PAPERLESS_TIKA_GOTENBERG_ENDPOINT=<url>`
: Set the endpoint URL where Paperless can reach your Gotenberg server.
Defaults to "<http://localhost:3000>".
If you run paperless on docker, you can add those services to the
docker-compose file (see the provided `docker-compose.sqlite-tika.yml`
file for reference). The required changes are as follows:
```yaml
services:
# ...
webserver:
# ...
environment:
# ...
PAPERLESS_TIKA_ENABLED: 1
PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
PAPERLESS_TIKA_ENDPOINT: http://tika:9998
# ...
gotenberg:
image: gotenberg/gotenberg:7.8
restart: unless-stopped
# The gotenberg chromium route is used to convert .eml files. We do not
# want to allow external content like tracking pixels or even javascript.
command:
- 'gotenberg'
- '--chromium-disable-javascript=true'
- '--chromium-allow-list=file:///tmp/.*'
tika:
image: ghcr.io/paperless-ngx/tika:latest
restart: unless-stopped
```
Add the configuration variables to the environment of the webserver
(alternatively put the configuration in the `docker-compose.env` file)
and add the additional services below the webserver service. Watch out
for indentation.
Make sure to use the correct format `PAPERLESS_TIKA_ENABLED = 1` so python_dotenv can parse the statement correctly.
## Software tweaks {#software_tweaks}
`PAPERLESS_TASK_WORKERS=<num>`
@@ -743,17 +716,10 @@ paperless will process in parallel on a single document.
on large documents within the default 1800 seconds. So extending
this timeout may prove to be useful on weak hardware setups.
`PAPERLESS_WORKER_RETRY=<num>`
: If PAPERLESS_WORKER_TIMEOUT has been configured, the retry time for
a task can also be configured. By default, this value will be set to
10s more than the worker timeout. This value should never be set
less than the worker timeout.
`PAPERLESS_TIME_ZONE=<timezone>`
: Set the time zone here. See
<https://docs.djangoproject.com/en/3.1/ref/settings/#std:setting-TIME_ZONE>
: Set the time zone here. See
[the Django project documentation](https://docs.djangoproject.com/en/4.1/ref/settings/#std:setting-TIME_ZONE)
for details on how to set it.
Defaults to UTC.
@@ -815,46 +781,33 @@ should be a valid crontab(5) expression describing when to run.
to enable compression in your proxy configuration rather than
the webserver
## Polling {#polling}
`PAPERLESS_CONVERT_MEMORY_LIMIT=<num>`
`PAPERLESS_CONSUMER_POLLING=<num>`
: On smaller systems, or even in the case of Very Large Documents, the
consumer may explode, complaining about how it's "unable to extend
pixel cache". In such cases, try setting this to a reasonably low
value, like 32. The default is to use whatever is necessary to do
everything without writing to disk, and units are in megabytes.
: If paperless won't find documents added to your consume folder, it
might not be able to automatically detect filesystem changes. In
that case, specify a polling interval in seconds here, which will
then cause paperless to periodically check your consumption
directory for changes. This will also disable listening for file
system changes with `inotify`.
For more information on how to use this value, you should search the
web for "MAGICK_MEMORY_LIMIT".
Defaults to 0, which disables polling and uses filesystem
notifications.
Defaults to 0, which disables the limit.
`PAPERLESS_CONSUMER_POLLING_RETRY_COUNT=<num>`
`PAPERLESS_CONVERT_TMPDIR=<path>`
: If consumer polling is enabled, sets the number of times paperless
will check for a file to remain unmodified.
: Similar to the memory limit, if you've got a small system and your
OS mounts /tmp as tmpfs, you should set this to a path that's on a
physical disk, like /home/your_user/tmp or something. ImageMagick
will use this as scratch space when crunching through very large
documents.
Defaults to 5.
For more information on how to use this value, you should search the
web for "MAGICK_TMPDIR".
`PAPERLESS_CONSUMER_POLLING_DELAY=<num>`
Default is none, which disables the temporary directory.
: If consumer polling is enabled, sets the delay in seconds between
each check (above) paperless will do while waiting for a file to
remain unmodified.
Defaults to 5.
## iNotify {#inotify}
`PAPERLESS_CONSUMER_INOTIFY_DELAY=<num>`
: Sets the time in seconds the consumer will wait for additional
events from inotify before the consumer will consider a file ready
and begin consumption. Certain scanners or network setups may
generate multiple events for a single file, leading to multiple
consumers working on the same file. Configure this to prevent that.
Defaults to 0.5 seconds.
## Document Consumption {#consume_config}
`PAPERLESS_CONSUMER_DELETE_DUPLICATES=<bool>`
@@ -885,22 +838,23 @@ don't exist yet.
Defaults to false.
`PAPERLESS_CONSUMER_ENABLE_BARCODES=<bool>`
`PAPERLESS_CONSUMER_IGNORE_PATTERNS=<json>`
: Enables the scanning and page separation based on detected barcodes.
This allows for scanning and adding multiple documents per uploaded
file, which are separated by one or multiple barcode pages.
: By default, paperless ignores certain files and folders in the
consumption directory, such as system files created by macOS
or hidden folders that some tools use to store data.
For ease of use, it is suggested to use a standardized separation
page, e.g. [here](https://www.alliancegroup.co.uk/patch-codes.htm).
This can be adjusted by configuring a custom json array with
patterns to exclude.
If no barcodes are detected in the uploaded file, no page separation
will happen.
For example, `.DS_STORE/*` will ignore any files found in a folder
named `.DS_STORE`, including `.DS_STORE/bar.pdf` and `foo/.DS_STORE/bar.pdf`
The original document will be removed and the separated pages will
be saved as pdf.
A pattern like `._*` will ignore anything starting with `._`, including:
`._foo.pdf` and `._bar/foo.pdf`
Defaults to false.
Defaults to
`[".DS_STORE/*", "._*", ".stfolder/*", ".stversions/*", ".localized/*", "desktop.ini", "@eaDir/*"]`.
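The matching behaviour described above can be approximated with shell-style globbing. The following is only a sketch: it assumes patterns are tested against every trailing portion of the relative path using Python's `fnmatch` rules, which may differ from what Paperless does internally, and `is_ignored` is a hypothetical name.
```python
import json
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# The documented default, parsed the way a JSON array setting would be.
DEFAULT_PATTERNS = json.loads(
    '[".DS_STORE/*", "._*", ".stfolder/*", ".stversions/*", '
    '".localized/*", "desktop.ini", "@eaDir/*"]'
)


def is_ignored(relative_path: str, patterns=DEFAULT_PATTERNS) -> bool:
    """Return True if any pattern matches the path or one of its tails."""
    parts = PurePosixPath(relative_path).parts
    # Test every trailing portion of the path so that ".DS_STORE/*"
    # also matches "foo/.DS_STORE/bar.pdf".
    tails = ["/".join(parts[i:]) for i in range(len(parts))]
    return any(fnmatchcase(tail, pattern) for tail in tails for pattern in patterns)


for candidate in ("foo/.DS_STORE/bar.pdf", "._bar/foo.pdf", "scans/invoice.pdf"):
    print(candidate, is_ignored(candidate))
```
With the default patterns, the first two candidates are reported as ignored and the last one is not.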
`PAPERLESS_CONSUMER_BARCODE_SCANNER=<string>`
@@ -912,79 +866,23 @@ file, which are separated by one or multiple barcode pages.
zxing is not available on all platforms.
`PAPERLESS_CONSUMER_BARCODE_TIFF_SUPPORT=<bool>`
`PAPERLESS_PRE_CONSUME_SCRIPT=<filename>`
: Whether TIFF image files should be scanned for barcodes. This will
automatically convert any TIFF image(s) to pdfs for later
processing. This only has an effect, if
PAPERLESS_CONSUMER_ENABLE_BARCODES has been enabled.
: After some initial validation, Paperless can run an arbitrary
script before beginning consumption. The script receives
its input data through environment variables.
Defaults to false.
For more information, take a look at [pre-consumption script](/advanced_usage#pre-consume-script).
`PAPERLESS_CONSUMER_BARCODE_STRING=PATCHT`
: Defines the string to be detected as a separator barcode. If
paperless is used with the PATCH-T separator pages, users shouldn't
change this.
Defaults to "PATCHT"
`PAPERLESS_CONSUMER_ENABLE_ASN_BARCODE=<bool>`
: Enables the detection of barcodes in the scanned document and
setting the ASN (archive serial number) if a properly formatted
barcode is detected.
The barcode must consist of a (configurable) prefix and the ASN
to be set, for instance `ASN00123`.
This option is compatible with barcode page separation, since
pages will be split up before reading the ASN.
If no ASN barcodes are detected in the uploaded file, no ASN will
be set. If a barcode with an already existing ASN is detected, no ASN
will be set either and a warning will be logged.
Defaults to false.
`PAPERLESS_CONSUMER_ASN_BARCODE_PREFIX=ASN`
: Defines the prefix that is used to identify a barcode as an ASN
barcode.
Defaults to "ASN"
`PAPERLESS_CONVERT_MEMORY_LIMIT=<num>`
: On smaller systems, or even in the case of Very Large Documents, the
consumer may explode, complaining about how it's "unable to extend
pixel cache". In such cases, try setting this to a reasonably low
value, like 32. The default is to use whatever is necessary to do
everything without writing to disk, and units are in megabytes.
For more information on how to use this value, you should search the
web for "MAGICK_MEMORY_LIMIT".
Defaults to 0, which disables the limit.
`PAPERLESS_CONVERT_TMPDIR=<path>`
: Similar to the memory limit, if you've got a small system and your
OS mounts /tmp as tmpfs, you should set this to a path that's on a
physical disk, like /home/your_user/tmp or something. ImageMagick
will use this as scratch space when crunching through very large
documents.
For more information on how to use this value, you should search the
web for "MAGICK_TMPDIR".
Default is none, which disables the temporary directory.
The default is blank, which means nothing will be executed.
`PAPERLESS_POST_CONSUME_SCRIPT=<filename>`
: After a document is consumed, Paperless can trigger an arbitrary
script if you like. This script will be passed a number of arguments
for you to work with. For more information, take a look at [Post-consumption script](/advanced_usage#post-consume-script).
script if you like. This script receives
its input data through environment variables.
For more information, take a look at [Post-consumption script](/advanced_usage#post-consume-script).
The default is blank, which means nothing will be executed.
@@ -1051,23 +949,109 @@ within your documents.
second, and year last order. Characters D, M, or Y can be shuffled
to meet the required order.
`PAPERLESS_CONSUMER_IGNORE_PATTERNS=<json>`
### Polling {#polling}
: By default, paperless ignores certain files and folders in the
consumption directory, such as system files created by the Mac OS
or hidden folders some tools use to store data.
`PAPERLESS_CONSUMER_POLLING=<num>`
This can be adjusted by configuring a custom json array with
patterns to exclude.
: If paperless won't find documents added to your consume folder, it
might not be able to automatically detect filesystem changes. In
that case, specify a polling interval in seconds here, which will
then cause paperless to periodically check your consumption
directory for changes. This will also disable listening for file
system changes with `inotify`.
For example, `.DS_STORE/*` will ignore any files found in a folder
named `.DS_STORE`, including `.DS_STORE/bar.pdf` and `foo/.DS_STORE/bar.pdf`
Defaults to 0, which disables polling and uses filesystem
notifications.
A pattern like `._*` will ignore anything starting with `._`, including:
`._foo.pdf` and `._bar/foo.pdf`
`PAPERLESS_CONSUMER_POLLING_RETRY_COUNT=<num>`
Defaults to
`[".DS_STORE/*", "._*", ".stfolder/*", ".stversions/*", ".localized/*", "desktop.ini", "@eaDir/*"]`.
: If consumer polling is enabled, sets the number of times paperless
will check for a file to remain unmodified.
Defaults to 5.
`PAPERLESS_CONSUMER_POLLING_DELAY=<num>`
: If consumer polling is enabled, sets the delay in seconds between
each check (above) paperless will do while waiting for a file to
remain unmodified.
Defaults to 5.
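To show how the retry count and the delay interact, here is a hedged sketch of a "wait until the file stops changing" loop. It assumes a file counts as unmodified when its size and modification time are identical in two consecutive checks; the actual check in Paperless may differ, and `wait_until_stable` is a made-up name.
```python
import os
import time


def wait_until_stable(path, retry_count=5, delay=5.0, sleep=time.sleep):
    """Return True once `path` is unchanged between two consecutive checks.

    retry_count mirrors PAPERLESS_CONSUMER_POLLING_RETRY_COUNT and
    delay mirrors PAPERLESS_CONSUMER_POLLING_DELAY (illustrative only).
    """

    def snapshot():
        stat = os.stat(path)
        return (stat.st_size, stat.st_mtime_ns)

    previous = snapshot()
    for _ in range(retry_count):
        sleep(delay)
        current = snapshot()
        if current == previous:
            return True  # no change during the last delay window
        previous = current
    return False  # still changing after all retries
```
With the defaults of 5 retries and a 5 second delay, a file that keeps changing would be given up on after roughly 25 seconds in this sketch.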
### iNotify {#inotify}
`PAPERLESS_CONSUMER_INOTIFY_DELAY=<num>`
: Sets the time in seconds the consumer will wait for additional
events from inotify before the consumer will consider a file ready
and begin consumption. Certain scanners or network setups may
generate multiple events for a single file, leading to multiple
consumers working on the same file. Configure this to prevent that.
Defaults to 0.5 seconds.
## Barcodes {#barcodes}
`PAPERLESS_CONSUMER_ENABLE_BARCODES=<bool>`
: Enables the scanning and page separation based on detected barcodes.
This allows for scanning and adding multiple documents per uploaded
file, which are separated by one or multiple barcode pages.
For ease of use, it is suggested to use a standardized separation
page, e.g. [here](https://www.alliancegroup.co.uk/patch-codes.htm).
If no barcodes are detected in the uploaded file, no page separation
will happen.
The original document will be removed and the separated pages will
be saved as pdf.
See additional information in the [advanced usage documentation](/advanced_usage#barcodes).
Defaults to false.
`PAPERLESS_CONSUMER_BARCODE_TIFF_SUPPORT=<bool>`
: Whether TIFF image files should be scanned for barcodes. This will
automatically convert any TIFF image(s) to pdfs for later
processing. This only has an effect if
PAPERLESS_CONSUMER_ENABLE_BARCODES has been enabled.
Defaults to false.
`PAPERLESS_CONSUMER_BARCODE_STRING=<string>`
: Defines the string to be detected as a separator barcode. If
paperless is used with the PATCH-T separator pages, users shouldn't
change this.
Defaults to "PATCHT"
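The separation logic can be pictured as follows. This is a simplified sketch, not the Paperless implementation: it assumes barcode detection has already produced a list of decoded strings per page, and that separator pages themselves are dropped. `split_on_separator` is a hypothetical name.
```python
def split_on_separator(page_barcodes, separator="PATCHT"):
    """Group page indexes into documents, dropping separator pages.

    page_barcodes: one list of decoded barcode strings per page.
    """
    documents, current = [], []
    for index, barcodes in enumerate(page_barcodes):
        if separator in barcodes:
            # A separator page closes the current document, if any.
            if current:
                documents.append(current)
            current = []
        else:
            current.append(index)
    if current:
        documents.append(current)
    return documents


# Seven scanned pages; pages 2, 4 and 5 carry the separator barcode.
pages = [[], [], ["PATCHT"], [], ["PATCHT"], ["PATCHT"], []]
print(split_on_separator(pages))  # [[0, 1], [3], [6]]
```
If no page carries the separator string, the result is a single document containing all pages, which matches the "no page separation will happen" case.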
`PAPERLESS_CONSUMER_ENABLE_ASN_BARCODE=<bool>`
: Enables the detection of barcodes in the scanned document and
setting the ASN (archive serial number) if a properly formatted
barcode is detected.
The barcode must consist of a (configurable) prefix and the ASN
to be set, for instance `ASN00123`.
This option is compatible with barcode page separation, since
pages will be split up before reading the ASN.
If no ASN barcodes are detected in the uploaded file, no ASN will
be set. If a barcode with an already existing ASN is detected, no ASN
will be set either and a warning will be logged.
Defaults to false.
`PAPERLESS_CONSUMER_ASN_BARCODE_PREFIX=<string>`
: Defines the prefix that is used to identify a barcode as an ASN
barcode.
Defaults to "ASN"
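As an illustration of the "prefix plus ASN" format, the sketch below extracts the number from a decoded barcode string. It assumes the part after the prefix consists only of digits; `parse_asn` is a hypothetical helper, not a Paperless function.
```python
def parse_asn(barcode: str, prefix: str = "ASN"):
    """Return the ASN as an int, or None if this is not an ASN barcode."""
    if not barcode.startswith(prefix):
        return None
    digits = barcode[len(prefix):].strip()
    # Leading zeros are fine: "00123" becomes 123.
    return int(digits) if digits.isdecimal() else None


print(parse_asn("ASN00123"))  # 123
print(parse_asn("PATCHT"))    # None
```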
## Binaries
@@ -1159,12 +1143,14 @@ actual group ID on the host system, which you can get by executing
: Additional OCR languages to install. By default, paperless comes
with English, German, Italian, Spanish and French. If your language
is not in this list, install additional languages with this
configuration option:
configuration option ([find the right LangCodes](https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html)):
``` bash
PAPERLESS_OCR_LANGUAGES=tur ces
```
Make sure it's a space-separated list when using several values.
To actually use these languages, also set the default OCR language
of paperless: