# Configuration
Paperless provides a wide range of customizations. Depending on how you
run paperless, these settings have to be defined in different places.
- If you run paperless on docker, `paperless.conf` is not used.
Rather, configure paperless by copying necessary options to
`docker-compose.env`.
- If you are running paperless on anything else, paperless will search
for the configuration file in these locations and use the first one
it finds:
- The environment variable `PAPERLESS_CONFIGURATION_PATH`
- `/path/to/paperless/paperless.conf`
- `/etc/paperless.conf`
- `/usr/local/etc/paperless.conf`
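
The lookup order above can be sketched in a few lines of Python. This is only an illustration of "first match wins", not the actual implementation; the function name `find_config_file` and the `base_dir` parameter (standing in for `/path/to/paperless`) are assumptions made for this example.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_config_file(
    base_dir: Path, environ: Mapping[str, str] = os.environ
) -> Optional[Path]:
    """Return the first configuration file that exists, or None."""
    candidates = []
    # 1. An explicit path from the environment is checked first.
    env_path = environ.get("PAPERLESS_CONFIGURATION_PATH")
    if env_path:
        candidates.append(Path(env_path))
    # 2.-4. The fixed fallback locations, in the documented order.
    candidates += [
        base_dir / "paperless.conf",
        Path("/etc/paperless.conf"),
        Path("/usr/local/etc/paperless.conf"),
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

In this sketch, a `PAPERLESS_CONFIGURATION_PATH` that points to a missing file is skipped and the search continues with the next location.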
## Required services
`PAPERLESS_REDIS=<url>`
: This is required for processing scheduled tasks such as email
fetching, index optimization and for training the automatic document
matcher.
- If your Redis server needs login credentials, use
`PAPERLESS_REDIS=redis://<username>:<password>@<host>:<port>`
- With the `requirepass` option, use
`PAPERLESS_REDIS=redis://:<password>@<host>:<port>`
[More information on securing your Redis
Instance](https://redis.io/docs/getting-started/#securing-redis).
Defaults to `redis://localhost:6379`.
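
If the password contains characters such as `@`, `:` or `/`, the URL can become ambiguous. The following sketch builds the value with the credentials percent-encoded according to standard URL rules. The helper name `build_redis_url` is hypothetical, and whether your Redis client decodes percent-encoded credentials is an assumption you should verify for your setup.

```python
from urllib.parse import quote


def build_redis_url(host, port=6379, username="", password=""):
    """Assemble a redis:// URL, percent-encoding the credentials."""
    if not username and not password:
        return f"redis://{host}:{port}"
    # safe="" also encodes "/", "@" and ":" so they cannot break the URL.
    credentials = f"{quote(username, safe='')}:{quote(password, safe='')}"
    return f"redis://{credentials}@{host}:{port}"


# requirepass style: empty username, password only.
print(build_redis_url("broker", password="s3cr@t"))
```

The printed value has the form `redis://:<password>@<host>:<port>` shown above, with the `@` inside the password encoded as `%40`.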
`PAPERLESS_DBENGINE=<engine_name>`
: Optional, gives the ability to choose Postgres or MariaDB for
database engine. Available options are `postgresql` and
`mariadb`.
Default is `postgresql`.
!!! warning
Using MariaDB comes with some caveats. See [MySQL Caveats](/advanced_usage#mysql-caveats).
`PAPERLESS_DBHOST=<hostname>`
: By default, sqlite is used as the database backend. This can be
changed here.
Set PAPERLESS_DBHOST and another database will be used instead of
sqlite.
`PAPERLESS_DBPORT=<port>`
: Adjust port if necessary.
Default is 5432.
`PAPERLESS_DBNAME=<name>`
: Database name in PostgreSQL or MariaDB.
Defaults to "paperless".
`PAPERLESS_DBUSER=<name>`
: Database user in PostgreSQL or MariaDB.
Defaults to "paperless".
`PAPERLESS_DBPASS=<password>`
: Database password for PostgreSQL or MariaDB.
Defaults to "paperless".
`PAPERLESS_DBSSLMODE=<mode>`
: SSL mode to use when connecting to PostgreSQL.
See [the official documentation about
sslmode](https://www.postgresql.org/docs/current/libpq-ssl.html).
Default is `prefer`.
`PAPERLESS_DB_TIMEOUT=<float>`
: Amount of time for a database connection to wait for the database to
unlock. Mostly applicable for an sqlite based installation, consider
changing to postgresql if you need to increase this.
Defaults to unset, keeping the Django defaults.
## Paths and folders
`PAPERLESS_CONSUMPTION_DIR=<path>`
: This is where your documents should go to be consumed. Make sure that
it exists and that the user running the paperless service can
read/write its contents before you start Paperless.
Don't change this when using docker, as it only changes the path
within the container. Change the local consumption directory in the
docker-compose.yml file instead.
Defaults to "../consume/", relative to the "src" directory.
`PAPERLESS_DATA_DIR=<path>`
: This is where paperless stores all its data (search index, SQLite
database, classification model, etc).
Defaults to "../data/", relative to the "src" directory.
`PAPERLESS_TRASH_DIR=<path>`
: Instead of removing deleted documents, they are moved to this
directory.
This must be writeable by the user running paperless. When running
inside docker, ensure that this path is within a permanent volume
(such as "../media/trash") so it won't get lost on upgrades.
Defaults to empty (i.e. really delete documents).
`PAPERLESS_MEDIA_ROOT=<path>`
: This is where your documents and thumbnails are stored.
You can set this and PAPERLESS_DATA_DIR to the same folder to have
paperless store all its data within the same volume.
Defaults to "../media/", relative to the "src" directory.
`PAPERLESS_STATICDIR=<path>`
: Override the default STATIC_ROOT here. This is where all static
files created using "collectstatic" manager command are stored.
Unless you're doing something fancy, there is no need to override
this.
Defaults to "../static/", relative to the "src" directory.
`PAPERLESS_FILENAME_FORMAT=<format>`
: Changes the filenames paperless uses to store documents in the media
directory. See [File name handling](/advanced_usage#file-name-handling) for details.
Default is none, which disables this feature.
`PAPERLESS_FILENAME_FORMAT_REMOVE_NONE=<bool>`
: Tells paperless to omit placeholders in
`PAPERLESS_FILENAME_FORMAT` that would resolve to
'none' from the resulting filename. This also holds
true for directory names. See [File name handling](/advanced_usage#file-name-handling) for
details.
Defaults to `false` which disables this feature.
`PAPERLESS_LOGGING_DIR=<path>`
: This is where paperless will store log files.
Defaults to `PAPERLESS_DATA_DIR/log/`.
`PAPERLESS_NLTK_DIR=<path>`
: This is where paperless will search for the data required for NLTK
processing, if you are using it. If you are using the Docker image,
this should not be changed, as the data is included in the image
already.
Previously, the location defaulted to `PAPERLESS_DATA_DIR/nltk`.
Unless you are using this in a bare metal install or other setup,
this folder is no longer needed and can be removed manually.
Defaults to `/usr/local/share/nltk_data`.
## Logging
`PAPERLESS_LOGROTATE_MAX_SIZE=<num>`
: Maximum file size for log files before they are rotated, in bytes.
Defaults to 1 MiB.
`PAPERLESS_LOGROTATE_MAX_BACKUPS=<num>`
: Number of rotated log files to keep.
Defaults to 20.
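
To make the two rotation settings concrete, here is a minimal sketch using Python's standard `logging.handlers.RotatingFileHandler`, whose `maxBytes` and `backupCount` parameters correspond to the two values above. The helper name `make_rotating_handler` is hypothetical, and this is not necessarily how paperless wires up its own logging.

```python
import logging
from logging.handlers import RotatingFileHandler


def make_rotating_handler(log_file, max_size=1024 * 1024, max_backups=20):
    """Create a handler that rotates at max_size bytes, keeping max_backups files."""
    # delay=True postpones opening the file until the first record is written.
    return RotatingFileHandler(
        log_file, maxBytes=max_size, backupCount=max_backups, delay=True
    )


handler = make_rotating_handler("paperless.log")
logger = logging.getLogger("rotation-example")
logger.addHandler(handler)
```

With the defaults, each log file is bounded by roughly 21 MiB on disk: the active file plus 20 rotated backups of about 1 MiB each.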
## Hosting & Security {#hosting-and-security}
`PAPERLESS_SECRET_KEY=<key>`
: Paperless uses this to make session tokens. If you expose paperless
on the internet, you need to change this, since the default secret
is well known.
Use any sequence of characters. The more, the better. You don't
need to remember this. Just face-roll your keyboard.
Default is listed in the file `src/paperless/settings.py`.
`PAPERLESS_URL=<url>`
: This setting can be used to set the three options below
(ALLOWED_HOSTS, CORS_ALLOWED_HOSTS and CSRF_TRUSTED_ORIGINS). If the
other options are set the values will be combined with this one. Do
not include a trailing slash. E.g. <https://paperless.domain.com>
Defaults to empty string, leaving the other settings unaffected.
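
The "combined with" behavior can be pictured as follows. This is a sketch under assumptions: the function name `apply_paperless_url` is hypothetical, and the choice to add only the host name to the allowed hosts while adding the full origin to the CORS and CSRF lists follows the usual Django conventions rather than anything stated on this page.

```python
from urllib.parse import urlparse


def apply_paperless_url(url, allowed_hosts, cors_allowed, csrf_trusted):
    """Combine PAPERLESS_URL with explicitly configured values (sketch)."""
    if not url:
        # Empty string: leave the other settings unaffected.
        return allowed_hosts, cors_allowed, csrf_trusted
    hostname = urlparse(url).hostname
    return (
        allowed_hosts + [hostname],  # host name only, no scheme
        cors_allowed + [url],        # full origin
        csrf_trusted + [url],        # full origin
    )


hosts, cors, csrf = apply_paperless_url(
    "https://paperless.domain.com", ["localhost"], [], []
)
```

Here `hosts` keeps the manually configured `localhost` and gains `paperless.domain.com`, which is why a trailing slash in the URL is best avoided: it would end up verbatim in the origin lists.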
`PAPERLESS_CSRF_TRUSTED_ORIGINS=<comma-separated-list>`
: A list of trusted origins for unsafe requests (e.g. POST). As of
Django 4.0 this is required to access the Django admin via the web.
See
<https://docs.djangoproject.com/en/4.0/ref/settings/#csrf-trusted-origins>
Can also be set using PAPERLESS_URL (see above).
Defaults to empty string, which does not add any origins to the
trusted list.
`PAPERLESS_ALLOWED_HOSTS=<comma-separated-list>`
: If you're planning on putting Paperless on the open internet, then
you really should set this value to the domain name you're using.
Failing to do so leaves you open to HTTP host header attacks:
<https://docs.djangoproject.com/en/3.1/topics/security/#host-header-validation>
Just remember that this is a comma-separated list, so
"example.com" is fine, as is "example.com,www.example.com", but
NOT " example.com" or "example.com,"
Can also be set using PAPERLESS_URL (see above).
If manually set, please remember to include "localhost". Otherwise
docker healthcheck will fail.
Defaults to "\*", which is all hosts.
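
The formatting rule is easier to see with a small experiment. Assuming the value is split on commas without any trimming (an assumption for illustration, not a statement about the actual parser), stray spaces and trailing commas produce entries that can never match a real host name. Both function names below are hypothetical.

```python
def naive_split(value):
    """Split on commas without any cleanup."""
    return value.split(",")


def tolerant_split(value):
    """Split on commas, trimming whitespace and dropping empty entries."""
    return [item.strip() for item in value.split(",") if item.strip()]


print(naive_split("example.com,www.example.com"))  # two clean entries
print(naive_split(" example.com"))                 # entry keeps the leading space
print(naive_split("example.com,"))                 # trailing empty entry
```

Writing the list without spaces or trailing commas avoids depending on how tolerant the parser is.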
`PAPERLESS_CORS_ALLOWED_HOSTS=<comma-separated-list>`
: You need to add your servers to the list of allowed hosts that can
do CORS calls. Set this to your public domain name.
Can also be set using PAPERLESS_URL (see above).
Defaults to "<http://localhost:8000>".
`PAPERLESS_FORCE_SCRIPT_NAME=<path>`
: To host paperless under a subpath URL like example.com/paperless,
set this value to /paperless. No trailing slash!
Defaults to none, which hosts paperless at "/".
`PAPERLESS_STATIC_URL=<path>`
: Override the STATIC_URL here. Unless you're hosting Paperless under a
subpath like /paperless/, you probably don't need to change this.
If you do change it, be sure to include the trailing slash.
Defaults to "/static/".
!!! note
When hosting paperless behind a reverse proxy like Traefik or Nginx
at a subpath e.g. example.com/paperlessngx you will also need to set
`PAPERLESS_FORCE_SCRIPT_NAME` (see above).
`PAPERLESS_AUTO_LOGIN_USERNAME=<username>`
: Specify a username here so that paperless will automatically perform
login with the selected user.
!!! danger
Do not use this when exposing paperless on the internet. There are
no checks in place that would prevent you from doing this.
Defaults to none, which disables this feature.
`PAPERLESS_ADMIN_USER=<username>`
: If this environment variable is specified, Paperless automatically
creates a superuser with the provided username at start. This is
useful in cases where you cannot run the
`createsuperuser` command separately, such as Kubernetes
or AWS ECS.
Requires PAPERLESS_ADMIN_PASSWORD be set.
!!! note
This will not change an existing \[super\]user's password, nor will
it recreate a user that already exists. You can leave this
throughout the lifecycle of the containers.
`PAPERLESS_ADMIN_MAIL=<email>`
: (Optional) Specify superuser email address. Only used when
PAPERLESS_ADMIN_USER is set.
Defaults to `root@localhost`.
`PAPERLESS_ADMIN_PASSWORD=<password>`
: Only used when PAPERLESS_ADMIN_USER is set. This will
be the password of the automatically created superuser.
`PAPERLESS_COOKIE_PREFIX=<str>`
: Specify a prefix that is added to the cookies used by paperless to
identify the currently logged in user. This is useful for when
you're running two instances of paperless on the same host.
After changing this, you will have to login again.
Defaults to `""`, which does not alter the cookie names.
`PAPERLESS_ENABLE_HTTP_REMOTE_USER=<bool>`
: Allows authentication via HTTP_REMOTE_USER which is used by some SSO
applications.
!!! warning
This will allow authentication by simply adding a
`Remote-User: <username>` header to a request. Use with care! You
especially *must* ensure that any such header is not passed from
your proxy server to paperless.
If you're exposing paperless to the internet directly, do not use
this.
Also see the warning [in the official documentation](https://docs.djangoproject.com/en/3.1/howto/auth-remote-user/#configuration).
Defaults to "false" which disables this feature.
`PAPERLESS_HTTP_REMOTE_USER_HEADER_NAME=<str>`
: If `PAPERLESS_ENABLE_HTTP_REMOTE_USER` is enabled, this
property allows you to customize the name of the HTTP header from which
the authenticated username is extracted. Values are in terms of
[HttpRequest.META](https://docs.djangoproject.com/en/3.1/ref/request-response/#django.http.HttpRequest.META).
Thus, the configured value must start with `HTTP_`
followed by the normalized actual header name.
Defaults to "HTTP_REMOTE_USER".
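
Django documents the normalization for `HttpRequest.META` as: convert the header name to uppercase, replace hyphens with underscores, and add an `HTTP_` prefix. The helper below (a hypothetical name, written for illustration) applies that rule so you can derive the value to configure from the header your proxy sends.

```python
def header_to_meta_key(header_name):
    """Convert an HTTP header name to its HttpRequest.META key."""
    return "HTTP_" + header_name.upper().replace("-", "_")


print(header_to_meta_key("Remote-User"))
print(header_to_meta_key("X-Forwarded-User"))
```

The first call yields the default `HTTP_REMOTE_USER`; the second shows the value to use if your SSO proxy sends a header named `X-Forwarded-User` (an example header name, not a recommendation).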
`PAPERLESS_LOGOUT_REDIRECT_URL=<str>`
: URL to redirect the user to after a logout. This can be used
together with PAPERLESS_ENABLE_HTTP_REMOTE_USER to
redirect the user back to the SSO application's logout page.
Defaults to None, which disables this feature.
## OCR settings {#ocr}
Paperless uses [OCRmyPDF](https://ocrmypdf.readthedocs.io/en/latest/)
for performing OCR on documents and images. Paperless uses sensible
defaults for most settings, but all of them can be configured to your
needs.
`PAPERLESS_OCR_LANGUAGE=<lang>`
: Customize the language that paperless will attempt to use when
parsing documents.
It should be a 3-letter language code consistent with ISO 639:
<https://www.loc.gov/standards/iso639-2/php/code_list.php>
Set this to the language most of your documents are written in.
This can be a combination of multiple languages such as `deu+eng`,
in which case tesseract will use whatever language matches best.
Keep in mind that tesseract uses much more cpu time with multiple
languages enabled.
Defaults to "eng".
!!! note
If your language code contains a '-', such as `chi-sim`, replace it with '_' (`chi_sim`).
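
The two formatting rules for this setting (join multiple languages with `+`, use `_` instead of `-`) can be combined in a small helper. The function name `ocr_language_value` is hypothetical and only illustrates how to assemble the string.

```python
def ocr_language_value(*codes):
    """Build a PAPERLESS_OCR_LANGUAGE value from one or more language codes."""
    return "+".join(code.strip().lower().replace("-", "_") for code in codes)


print(ocr_language_value("deu", "eng"))
print(ocr_language_value("chi-sim"))
```

The results are `deu+eng` and `chi_sim`, matching the examples in the text above.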
`PAPERLESS_OCR_MODE=<mode>`
: Tell paperless when and how to perform ocr on your documents. Four
modes are available:
- `skip`: Paperless skips all pages that already contain text and
performs OCR only on pages where no text is present. This is the safest option.
- `skip_noarchive`: In addition to skip, paperless won't create
an archived version of your documents when it finds any text in
them. This is useful if you don't want to have two
almost-identical versions of your digital documents in the media
folder. This is the fastest option.
- `redo`: Paperless will OCR all pages of your documents and
attempt to replace any existing text layers with new text. This
will be useful for documents from scanners that already
performed OCR with insufficient results. It will also perform
OCR on purely digital documents.
This option may fail on some documents that have features that
cannot be removed, such as forms. In this case, the text from
the document is used instead.
- `force`: Paperless rasterizes your documents, converting any
text into images and puts the OCRed text on top. This works for
all documents, however, the resulting document may be
significantly larger and text won't appear as sharp when zoomed
in.
The default is `skip`, which only performs OCR when necessary and
always creates archived documents.
Read more about this in the [OCRmyPDF
documentation](https://ocrmypdf.readthedocs.io/en/latest/advanced.html#when-ocr-is-skipped).
`PAPERLESS_OCR_CLEAN=<mode>`
: Tells paperless to use `unpaper` to clean any input document before
sending it to tesseract. This uses more resources, but generally
results in better OCR results. The following modes are available:
- `clean`: Apply unpaper.
- `clean-final`: Apply unpaper, and use the cleaned images to
build the output file instead of the original images.
- `none`: Do not apply unpaper.
Defaults to `clean`.
!!! note
`clean-final` is incompatible with ocr mode `redo`. When both
`clean-final` and the ocr mode `redo` are configured, `clean` is used
instead.
`PAPERLESS_OCR_DESKEW=<bool>`
: Tells paperless to correct skewing (slight rotation of input images
mainly due to improper scanning).
Defaults to `true`, which enables this feature.
!!! note
Deskewing is incompatible with ocr mode `redo`. Deskewing will get
disabled automatically if `redo` is used as the ocr mode.
`PAPERLESS_OCR_ROTATE_PAGES=<bool>`
: Tells paperless to correct page rotation (90°, 180° and 270°
rotation).
If you notice that paperless is not rotating incorrectly rotated
pages (or vice versa), try adjusting the threshold up or down (see
below).
Defaults to `true`, which enables this feature.
`PAPERLESS_OCR_ROTATE_PAGES_THRESHOLD=<num>`
: Adjust the threshold for automatic page rotation by
`PAPERLESS_OCR_ROTATE_PAGES`. This is an arbitrary value reported by
tesseract. "15" is a very conservative value, whereas "2" is a
very aggressive option and will often result in correctly rotated
pages being rotated as well.
Defaults to "12".
`PAPERLESS_OCR_OUTPUT_TYPE=<type>`
: Specify the type of PDF documents that paperless should produce.
- `pdf`: Modify the PDF document as little as possible.
- `pdfa`: Convert PDF documents into PDF/A-2b documents, which is
a subset of the entire PDF specification and meant for storing
documents long term.
- `pdfa-1`, `pdfa-2`, `pdfa-3` to specify the exact version of
PDF/A you wish to use.
If not specified, `pdfa` is used. Remember that paperless also keeps
the original input file as well as the archived version.
`PAPERLESS_OCR_PAGES=<num>`
: Tells paperless to use only the specified number of pages for OCR.
Documents with fewer pages than the specified number get OCR'ed
completely.
Specifying 1 here will only use the first page.
When combined with `PAPERLESS_OCR_MODE=redo` or
`PAPERLESS_OCR_MODE=force`, paperless will not modify any text it
finds on excluded pages and copy it verbatim.
Defaults to 0, which disables this feature and always uses all
pages.
`PAPERLESS_OCR_IMAGE_DPI=<num>`
: Paperless will OCR any images you put into the system and convert
them into PDF documents. This is useful if your scanner produces
images. In order to do so, paperless needs to know the DPI of the
image. Most images from scanners will have this information embedded
and paperless will detect and use that information. In case this
fails, it uses this value as a fallback.
Set this to the DPI your scanner produces images at.
Default is none, which will automatically calculate image DPI so
that the produced PDF documents are A4 sized.
`PAPERLESS_OCR_MAX_IMAGE_PIXELS=<num>`
: Paperless will raise a warning when OCRing images which are over
this limit and will not OCR images which are more than twice this
limit. Note this does not prevent the document from being consumed,
but could result in missing text content.
If unset, will default to the value determined by
[Pillow](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.MAX_IMAGE_PIXELS).
!!! note
Increasing this limit could cause Paperless to consume additional
resources when consuming a file. Be sure you have sufficient system
resources.
!!! warning
The limit is intended to prevent malicious files from consuming
system resources and causing crashes and other errors. Only increase
this value if you are certain your documents are not malicious and
you need the text which was not OCRed.
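
The two thresholds described above (warn above the limit, skip OCR above twice the limit) can be expressed as a small decision function. This is a sketch of the documented behavior only; the function name `classify_image` and the returned labels are hypothetical.

```python
def classify_image(pixel_count, max_pixels):
    """Decide how an image is treated relative to the configured limit."""
    if pixel_count > 2 * max_pixels:
        return "skip"  # more than twice the limit: the image is not OCRed
    if pixel_count > max_pixels:
        return "warn"  # over the limit: OCR proceeds with a warning
    return "ok"


# Example: an A4 page scanned at 600 DPI is about 4960 x 7016 pixels.
print(classify_image(4960 * 7016, max_pixels=30_000_000))
```

The `max_pixels` value in the example is an arbitrary illustration, not a recommended setting.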
`PAPERLESS_OCR_USER_ARGS=<json>`
: OCRmyPDF offers many more options. Use this parameter to specify any
additional arguments you wish to pass to OCRmyPDF. Since Paperless
uses the API of OCRmyPDF, you have to specify these in a format that
can be passed to the API. See [the API reference of
OCRmyPDF](https://ocrmypdf.readthedocs.io/en/latest/api.html#reference)
for valid parameters. All command line options are supported, but
they use underscores instead of dashes.
!!! warning
Paperless has been tested to work with the OCR options provided
above. There are many options that are incompatible with each other,
so specifying invalid options may prevent paperless from consuming
any documents.
Specify arguments as a JSON dictionary. Note the lower case
booleans and the double-quoted parameter names and strings. Example:
``` json
{"deskew": true, "optimize": 3, "unpaper_args": "--pre-rotate 90"}
```
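
Before putting a value into your configuration, you can check that it is valid JSON with Python's standard `json` module. The helper name `parse_user_args` is hypothetical; the point of the sketch is that Python-style `True` or single-quoted strings are rejected by a JSON parser.

```python
import json


def parse_user_args(raw):
    """Parse a PAPERLESS_OCR_USER_ARGS value and require a JSON object."""
    args = json.loads(raw)
    if not isinstance(args, dict):
        raise ValueError("expected a JSON object (dictionary)")
    return args


good = '{"deskew": true, "optimize": 3, "unpaper_args": "--pre-rotate 90"}'
print(parse_user_args(good))

for bad in ('{"deskew": True}', "{'deskew': true}"):
    try:
        parse_user_args(bad)
    except json.JSONDecodeError as error:
        print(f"rejected {bad!r}: {error.msg}")
```

This only validates the syntax. Whether OCRmyPDF accepts the individual parameters is a separate question answered by its API reference.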
## Tika settings {#tika}
Paperless can make use of [Tika](https://tika.apache.org/) and
[Gotenberg](https://gotenberg.dev/) for parsing and converting
"Office" documents (such as ".doc", ".xlsx" and ".odt").
Tika and Gotenberg are also needed to allow parsing of E-Mails (.eml).
If you wish to use this, you must provide a Tika server and a Gotenberg server,
configure their endpoints, and enable the feature.
`PAPERLESS_TIKA_ENABLED=<bool>`
: Enable (or disable) the Tika parser.
Defaults to false.
`PAPERLESS_TIKA_ENDPOINT=<url>`
: Set the endpoint URL where Paperless can reach your Tika server.
Defaults to "<http://localhost:9998>".
`PAPERLESS_TIKA_GOTENBERG_ENDPOINT=<url>`
: Set the endpoint URL where Paperless can reach your Gotenberg server.
Defaults to "<http://localhost:3000>".
If you run paperless on docker, you can add those services to the
docker-compose file (see the provided `docker-compose.sqlite-tika.yml`
file for reference). The changes required are as follows:
```yaml
services:
  # ...

  webserver:
    # ...
    environment:
      # ...
      PAPERLESS_TIKA_ENABLED: 1
      PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
      PAPERLESS_TIKA_ENDPOINT: http://tika:9998

  # ...

  gotenberg:
    image: gotenberg/gotenberg:7.6
    restart: unless-stopped
    # The gotenberg chromium route is used to convert .eml files. We do not
    # want to allow external content like tracking pixels or even javascript.
    command:
      - 'gotenberg'
      - '--chromium-disable-javascript=true'
      - '--chromium-allow-list=file:///tmp/.*'

  tika:
    image: ghcr.io/paperless-ngx/tika:latest
    restart: unless-stopped
```
Add the configuration variables to the environment of the webserver
(alternatively put the configuration in the `docker-compose.env` file)
and add the additional services below the webserver service. Watch out
for indentation.
Make sure to use the correct format `PAPERLESS_TIKA_ENABLED = 1` so python_dotenv can parse the statement correctly.
## Software tweaks {#software_tweaks}
`PAPERLESS_TASK_WORKERS=<num>`
: Paperless does multiple things in the background: Maintain the
search index, maintain the automatic matching algorithm, check
emails, consume documents, etc. This variable specifies how many
things it will do in parallel.
Defaults to 1.
`PAPERLESS_THREADS_PER_WORKER=<num>`
: Furthermore, paperless uses multiple threads when consuming
documents to speed up OCR. This variable specifies how many pages
paperless will process in parallel on a single document.
!!! warning
Ensure that the product
`PAPERLESS_TASK_WORKERS * PAPERLESS_THREADS_PER_WORKER`
does not exceed your CPU core count or else paperless will be
extremely slow. If you want paperless to process many documents in
parallel, choose a high worker count. If you want paperless to
process very large documents faster, use a higher thread per worker
count.
The default is a balance between the two, according to your CPU core
count, slightly favoring threads per worker:
| CPU core count | Workers | Threads |
| -------------- | ------- | ------- |
| 1              | 1       | 1       |
| 2              | 2       | 1       |
| 4              | 2       | 2       |
| 6              | 2       | 3       |
| 8              | 2       | 4       |
| 12             | 3       | 4       |
| 16             | 4       | 4       |
If you only specify PAPERLESS_TASK_WORKERS, paperless will adjust
PAPERLESS_THREADS_PER_WORKER automatically.
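
If you set both values by hand, the rule from the warning above is easy to check before restarting. The following sketch uses the hypothetical helper `fits_cpu`; it only tests the product against the core count and does not reproduce how paperless picks its defaults.

```python
import os


def fits_cpu(workers, threads_per_worker, cpu_count=None):
    """Return True if workers * threads does not exceed the CPU core count."""
    cores = cpu_count or os.cpu_count() or 1
    return workers * threads_per_worker <= cores


print(fits_cpu(2, 2, cpu_count=4))  # 4 <= 4
print(fits_cpu(3, 4, cpu_count=8))  # 12 > 8
```

Omitting `cpu_count` makes the check use the core count of the machine it runs on, which may differ from what a container is allowed to use.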
`PAPERLESS_WORKER_TIMEOUT=<num>`
: Machines with few cores or weak ones might not be able to finish OCR
on large documents within the default 1800 seconds. So extending
this timeout may prove to be useful on weak hardware setups.
`PAPERLESS_WORKER_RETRY=<num>`
: If PAPERLESS_WORKER_TIMEOUT has been configured, the retry time for
a task can also be configured. By default, this value will be set to
10s more than the worker timeout. This value should never be set
less than the worker timeout.
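
The relationship between the two values can be written down directly. Both function names are hypothetical and the sketch only restates the rules above: the retry time defaults to the timeout plus 10 seconds and should not be lower than the timeout.

```python
def default_worker_retry(worker_timeout=1800):
    """Default retry time: 10 seconds more than the worker timeout."""
    return worker_timeout + 10


def check_worker_retry(worker_timeout, worker_retry):
    """Raise ValueError if the retry time is below the worker timeout."""
    if worker_retry < worker_timeout:
        raise ValueError(
            f"retry ({worker_retry}s) must not be less than "
            f"timeout ({worker_timeout}s)"
        )
    return worker_retry


print(default_worker_retry())      # with the default timeout of 1800 seconds
print(default_worker_retry(3600))  # with the timeout extended to one hour
```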
`PAPERLESS_TIME_ZONE=<timezone>`
: Set the time zone here. See
<https://docs.djangoproject.com/en/3.1/ref/settings/#std:setting-TIME_ZONE>
for details on how to set it.
Defaults to UTC.
`PAPERLESS_ENABLE_NLTK=<bool>`
: Enables or disables the advanced natural language processing
used during automatic classification. If disabled, paperless will
still perform some basic text pre-processing before matching.
See also `PAPERLESS_NLTK_DIR`.
Defaults to 1.
## Polling {#polling}
`PAPERLESS_CONSUMER_POLLING=<num>`
: If paperless does not find documents added to your consume folder, it
might not be able to automatically detect filesystem changes. In
that case, specify a polling interval in seconds here, which will
then cause paperless to periodically check your consumption
directory for changes. This will also disable listening for file
system changes with `inotify`.
Defaults to 0, which disables polling and uses filesystem
notifications.
`PAPERLESS_CONSUMER_POLLING_RETRY_COUNT=<num>`
: If consumer polling is enabled, sets the number of times paperless
will check for a file to remain unmodified.
Defaults to 5.
`PAPERLESS_CONSUMER_POLLING_DELAY=<num>`
: If consumer polling is enabled, sets the delay in seconds between
each check (above) paperless will do while waiting for a file to
remain unmodified.
Defaults to 5.
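
How the retry count and the delay interact can be illustrated with a short loop. This is a sketch under the assumption that "unmodified" means the file's modification time and size are identical on two consecutive checks; the function name `wait_until_unmodified` is hypothetical.

```python
import os
import time


def wait_until_unmodified(path, retry_count=5, delay=5.0):
    """Return True once two consecutive checks see the same mtime and size."""
    last_seen = None
    for _ in range(retry_count):
        try:
            stat = os.stat(path)
        except FileNotFoundError:
            return False  # the file disappeared while waiting
        current = (stat.st_mtime, stat.st_size)
        if current == last_seen:
            return True  # nothing changed since the previous check
        last_seen = current
        time.sleep(delay)
    return False  # still changing after retry_count checks
```

With the defaults, a file that is still being written, for example by a slow network scanner, is given up to about 25 seconds (5 checks, 5 seconds apart) to settle in this sketch.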
## iNotify {#inotify}
`PAPERLESS_CONSUMER_INOTIFY_DELAY=<num>`
: Sets the time in seconds the consumer will wait for additional
events from inotify before the consumer will consider a file ready
and begin consumption. Certain scanners or network setups may
generate multiple events for a single file, leading to multiple
consumers working on the same file. Configure this to prevent that.
Defaults to 0.5 seconds.
`PAPERLESS_CONSUMER_DELETE_DUPLICATES=<bool>`
: When the consumer detects a duplicate document, it will not touch
the original document. This default behavior can be changed here.
Defaults to false.
`PAPERLESS_CONSUMER_RECURSIVE=<bool>`
: Enable recursive watching of the consumption directory. Paperless
will then pick up files in subdirectories within your
consumption directory as well.
Defaults to false.
`PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS=<bool>`
: Set the names of subdirectories as tags for consumed files. E.g.
`<CONSUMPTION_DIR>/foo/bar/file.pdf` will add the tags "foo" and
"bar" to the consumed file. Paperless will create any tags that
don't exist yet.
This is useful for sorting documents with certain tags such as `car`
or `todo` prior to consumption. These folders won't be deleted.
PAPERLESS_CONSUMER_RECURSIVE must be enabled for this to work.
Defaults to false.
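
The mapping from subdirectories to tags can be sketched with `pathlib`. The function name `tags_from_path` is hypothetical, and POSIX-style paths are assumed for the example.

```python
from pathlib import PurePosixPath


def tags_from_path(consumption_dir, file_path):
    """Return the subdirectory names between the consumption dir and the file."""
    relative = PurePosixPath(file_path).relative_to(consumption_dir)
    # The parent of the relative path holds only the directory components.
    return list(relative.parent.parts)


print(tags_from_path("/consume", "/consume/foo/bar/file.pdf"))
print(tags_from_path("/consume", "/consume/file.pdf"))
```

A file placed directly in the consumption directory has no subdirectory components and therefore receives no tags from this setting.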
`PAPERLESS_CONSUMER_ENABLE_BARCODES=<bool>`
: Enables the scanning and page separation based on detected barcodes.
This allows for scanning and adding multiple documents per uploaded
file, which are separated by one or multiple barcode pages.
For ease of use, it is suggested to use a standardized separation
page, e.g. [here](https://www.alliancegroup.co.uk/patch-codes.htm).
If no barcodes are detected in the uploaded file, no page separation
will happen.
The original document will be removed and the separated pages will
be saved as pdf.
Defaults to false.
`PAPERLESS_CONSUMER_BARCODE_TIFF_SUPPORT=<bool>`
: Whether TIFF image files should be scanned for barcodes. This will
automatically convert any TIFF image(s) to pdfs for later
processing. This only has an effect if
PAPERLESS_CONSUMER_ENABLE_BARCODES has been enabled.
Defaults to false.
`PAPERLESS_CONSUMER_BARCODE_STRING=PATCHT`
: Defines the string to be detected as a separator barcode. If
paperless is used with the PATCH-T separator pages, users shouldn't
change this.
Defaults to "PATCHT".
`PAPERLESS_CONVERT_MEMORY_LIMIT=<num>`
: On smaller systems, or even in the case of Very Large Documents, the
consumer may explode, complaining about how it's "unable to extend
pixel cache". In such cases, try setting this to a reasonably low
value, like 32. The default is to use whatever is necessary to do
everything without writing to disk, and units are in megabytes.
For more information on how to use this value, you should search the
web for "MAGICK_MEMORY_LIMIT".
Defaults to 0, which disables the limit.
`PAPERLESS_CONVERT_TMPDIR=<path>`
: Similar to the memory limit, if you've got a small system and your
OS mounts /tmp as tmpfs, you should set this to a path that's on a
physical disk, like /home/your_user/tmp or something. ImageMagick
will use this as scratch space when crunching through very large
documents.
For more information on how to use this value, you should search the
web for "MAGICK_TMPDIR".
Default is none, which disables the temporary directory.
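For a small system, the two ImageMagick-related options might be set together. This sketch reuses the example values from the descriptions above; the directory is a placeholder and is assumed to exist and to be writable by the user running paperless:

```bash
# Cap ImageMagick's memory use at 32 megabytes
PAPERLESS_CONVERT_MEMORY_LIMIT=32
# Scratch space on a physical disk instead of a tmpfs-backed /tmp
PAPERLESS_CONVERT_TMPDIR=/home/your_user/tmp
```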
`PAPERLESS_POST_CONSUME_SCRIPT=<filename>`
: After a document is consumed, Paperless can trigger an arbitrary
script if you like. This script will be passed a number of arguments
for you to work with. For more information, take a look at [Post-consumption script](/advanced_usage#post-consume-script).
The default is blank, which means nothing will be executed.
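A minimal sketch of this setting is shown below. The path is hypothetical, and the script is assumed to be executable by the user running paperless:

```bash
# Hypothetical path; point this at your own script
PAPERLESS_POST_CONSUME_SCRIPT=/home/your_user/bin/post-consume.sh
```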
`PAPERLESS_FILENAME_DATE_ORDER=<format>`
: Paperless will check the document text for document date
information. Use this setting to enable checking the document
filename for date information. The date order can be set to any
option as specified in
<https://dateparser.readthedocs.io/en/latest/settings.html#date-order>.
The filename will be checked first, and if nothing is found, the
document text will be checked as normal.
A date in a filename must have some separators (`.`, `,`, `-`, `/`, etc) for it to be parsed.
Defaults to none, which disables this feature.
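As an illustration, the following sketch enables filename date parsing with a year-first order. `YMD` is one of the orders listed in the dateparser documentation linked above:

```bash
# Parse dates in filenames as year, month, day
PAPERLESS_FILENAME_DATE_ORDER=YMD
```

With this value, a hypothetical file named `2022-12-29_invoice.pdf` would be expected to get 29 December 2022 as its date, because the filename contains `-` separators. A name such as `20221229_invoice.pdf` has no separators, so the document text would be checked instead.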
`PAPERLESS_NUMBER_OF_SUGGESTED_DATES=<num>`
: Paperless searches an entire document for dates. The first date
found will be used as the initial value for the created date. When
this variable is greater than 0 (or left to its default value),
paperless will also suggest other dates found in the document, up to
a maximum of this setting. Note that duplicates will be removed,
which can result in fewer dates displayed in the frontend than this
setting value.
The task to find all dates can be time-consuming and increases with
a higher (maximum) number of suggested dates and slower hardware.
Defaults to 3. Set to 0 to disable this feature.
`PAPERLESS_THUMBNAIL_FONT_NAME=<filename>`
: Paperless creates thumbnails for plain text files by rendering the
content of the file on an image and uses a predefined font for that.
This font can be changed here.
Note that this won't have any effect on already generated
thumbnails.
Defaults to
`/usr/share/fonts/liberation/LiberationSerif-Regular.ttf`.
`PAPERLESS_IGNORE_DATES=<string>`
: Paperless parses a document's creation date from filename and file
content. You may specify a comma-separated list of dates that should
be ignored during this process. This is useful for special dates
(like date of birth) that appear in documents regularly but are very
unlikely to be the document's creation date.
The dates are parsed using the order specified in `PAPERLESS_DATE_ORDER`.
Defaults to an empty string, which ignores no dates.
`PAPERLESS_DATE_ORDER=<format>`
: Paperless will try to determine the document creation date from its
contents. Specify the date format Paperless should expect to see
within your documents.
This option defaults to DMY which translates to day first, month
second, and year last order. Characters D, M, or Y can be shuffled
to meet the required order.
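The sketch below shows `PAPERLESS_DATE_ORDER` together with `PAPERLESS_IGNORE_DATES`. The dates are made up, and the `/` separator is an assumption; any format the date parser understands in the configured order should work:

```bash
# Expect month-first dates such as 12/29/2022 in documents
PAPERLESS_DATE_ORDER=MDY
# Hypothetical birthdays that should never be picked as the creation date,
# written month-first to match PAPERLESS_DATE_ORDER
PAPERLESS_IGNORE_DATES=05/17/1990,11/02/1985
```

Because the ignored dates are parsed using the same order, changing `PAPERLESS_DATE_ORDER` later may require rewriting the list.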
`PAPERLESS_CONSUMER_IGNORE_PATTERNS=<json>`
: By default, paperless ignores certain files and folders in the
consumption directory, such as system files created by macOS.
This can be adjusted by configuring a custom JSON array with
patterns to exclude.
Defaults to
`[".DS_STORE/*", "._*", ".stfolder/*", ".stversions/*", ".localized/*", "desktop.ini"]`.
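A sketch of a custom list follows. It assumes that a custom value replaces the defaults rather than extending them, so the default patterns are repeated and one hypothetical extra pattern, `Thumbs.db`, is appended:

```bash
# Default patterns plus one additional, hypothetical pattern
PAPERLESS_CONSUMER_IGNORE_PATTERNS=[".DS_STORE/*", "._*", ".stfolder/*", ".stversions/*", ".localized/*", "desktop.ini", "Thumbs.db"]
```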
## Binaries
There are a few external software packages that Paperless expects to
find on your system when it starts up. Unless you've done something
creative with their installation, you probably won't need to edit any
of these. However, if you've installed these programs somewhere where
simply typing the name of the program doesn't automatically execute it
(i.e. the program isn't in your `$PATH`), then you'll need to specify
the literal path for that program.
`PAPERLESS_CONVERT_BINARY=<path>`
: Defaults to "convert".
`PAPERLESS_GS_BINARY=<path>`
: Defaults to "gs".
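If the programs live outside your `$PATH`, the settings might look like the following sketch. Both paths are hypothetical; use the locations that apply to your system:

```bash
# Hypothetical install locations outside $PATH
PAPERLESS_CONVERT_BINARY=/opt/imagemagick/bin/convert
PAPERLESS_GS_BINARY=/opt/ghostscript/bin/gs
```

Running `command -v convert` or `command -v gs` in a shell prints the full path if the program is already on your `$PATH`, in which case the defaults should be sufficient.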
## Docker-specific options {#docker}
These options don't have any effect in `paperless.conf`. These options
adjust the behavior of the docker container. Configure these in
`docker-compose.env`.
`PAPERLESS_WEBSERVER_WORKERS=<num>`
: The number of worker processes the webserver should spawn. More
worker processes usually make the front end load data much more
quickly. However, each worker process also loads the entire
application into memory separately, so increasing this value will
increase RAM usage.
Defaults to 1.
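For example, a host with memory to spare might use two workers. The value is only an illustration, not a recommendation:

```bash
# Two worker processes: a more responsive front end at the cost of more RAM
PAPERLESS_WEBSERVER_WORKERS=2
```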
`PAPERLESS_BIND_ADDR=<ip address>`
: The IP address the webserver will listen on inside the container.
There are special setups where you may need to configure this value
to restrict the IP address or interface the webserver listens on.
Defaults to `[::]`, meaning all interfaces, including IPv6.
`PAPERLESS_PORT=<port>`
: The port number the webserver will listen on inside the container.
There are special setups where you may need this to avoid collisions
with other services (like using podman with multiple containers in
one pod).
Don't change this when using Docker. To change the port on which the
webserver is reachable from outside the container, refer instead to
the "ports" key in `docker-compose.yml`.
Defaults to 8000.
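A sketch for one of the special setups mentioned above, such as several containers sharing one podman pod. Both values are hypothetical and chosen only to illustrate the syntax:

```bash
# Listen on all IPv4 interfaces only, instead of the default [::]
PAPERLESS_BIND_ADDR=0.0.0.0
# Move off the default port 8000 to avoid a clash with another service in the pod
PAPERLESS_PORT=8010
```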
`USERMAP_UID=<uid>`
: The ID of the paperless user in the container. Set this to your
actual user ID on the host system, which you can get by executing
``` shell-session
$ id -u
```
Paperless will change ownership on its folders to this user, so you
need to get this right in order to be able to write to the
consumption directory.
Defaults to 1000.
`USERMAP_GID=<gid>`
: The ID of the paperless group in the container. Set this to your
actual group ID on the host system, which you can get by executing
``` shell-session
$ id -g
```
Paperless will change ownership on its folders to this group, so you
need to get this right in order to be able to write to the
consumption directory.
Defaults to 1000.
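As a convenience, both lines for `docker-compose.env` can be generated in one step. This sketch assumes it is run on the host as the user who owns the consumption directory:

```bash
# Print the two lines for docker-compose.env, using the current host user's IDs
printf 'USERMAP_UID=%s\nUSERMAP_GID=%s\n' "$(id -u)" "$(id -g)"
```

The output can be pasted into `docker-compose.env` as is.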
`PAPERLESS_OCR_LANGUAGES=<list>`
: Additional OCR languages to install. By default, paperless comes
with English, German, Italian, Spanish and French. If your language
is not in this list, install additional languages with this
configuration option:
``` bash
PAPERLESS_OCR_LANGUAGES=tur ces
```
To actually use these languages, also set the default OCR language
of paperless:
``` bash
PAPERLESS_OCR_LANGUAGE=tur
```
Defaults to none, which does not install any additional languages.
`PAPERLESS_ENABLE_FLOWER=<defined>`
: If this environment variable is defined, the Celery monitoring tool
[Flower](https://flower.readthedocs.io/en/latest/index.html) will be
started by the container.
You can read more about this in the [advanced documentation](/advanced_usage#celery-monitoring).
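A minimal sketch for enabling Flower is shown below. Since the option is described as `<defined>`, the value `1` is an assumption; any non-empty value is presumed to have the same effect:

```bash
# Defining the variable is what matters; the value 1 is arbitrary
PAPERLESS_ENABLE_FLOWER=1
```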
## Update Checking {#update-checking}
`PAPERLESS_ENABLE_UPDATE_CHECK=<bool>`
!!! note
This setting was deprecated in favor of a frontend setting after
v1.9.2. A one-time migration is performed for users who have this
setting set. This setting is always ignored if the corresponding
frontend setting has been set.