Improve documentation around barcodes, re-organize configuration and update links

Trenton H 2023-03-14 09:06:11 -07:00
parent 3cb26722f1
commit 8ac7d56fc5
3 changed files with 269 additions and 201 deletions


@ -507,3 +507,43 @@ existing tables) with:
Using mariadb version 10.4+ is recommended. Using the `utf8mb3` character set on
an older system may fix issues that can arise while setting up Paperless-ngx but
`utf8mb3` can cause issues with consumption (where `utf8mb4` does not).
## Barcodes {#barcodes}
Paperless is able to utilize barcodes for automatically performing some tasks.
At this time, the library utilized for detection of barcodes supports the following types:
- EAN-13/UPC-A
- UPC-E
- EAN-8
- Code 128
- Code 93
- Code 39
- Codabar
- Interleaved 2 of 5
- QR Code
- SQ Code
You may check for updates on the [zbar library homepage](https://github.com/mchehab/zbar).
For usage in Paperless, the type of barcode does not matter, only the contents of it.
For how to enable barcode usage, see [the configuration](/configuration#barcodes).
The two settings may be enabled independently, but do have interactions as explained
below.
### Document Splitting
When enabled, Paperless will look for a barcode with the configured value and create a new document
starting from the next page. The page with the barcode on it will _not_ be retained. It
is expected to be a page existing only for triggering the split.
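The splitting rule above can be sketched as follows. This is a minimal illustration, not the actual Paperless implementation: `split_on_separator` is a hypothetical helper, the barcode values are assumed to have been decoded per page already, and the default separator value `PATCHT` is the one documented for `PAPERLESS_CONSUMER_BARCODE_STRING`.

```python
from typing import List, Optional


def split_on_separator(
    page_barcodes: List[Optional[str]], separator: str = "PATCHT"
) -> List[List[int]]:
    """Group page indices into documents, dropping separator pages.

    page_barcodes[i] is the decoded barcode value found on page i, or
    None if the page carries no barcode (hypothetical input format).
    """
    documents: List[List[int]] = []
    current: List[int] = []
    for index, value in enumerate(page_barcodes):
        if value == separator:
            # The separator page only triggers the split; it is not retained.
            if current:
                documents.append(current)
            current = []
        else:
            current.append(index)
    if current:
        documents.append(current)
    return documents
```

For a four-page scan with the separator on the third page, this yields two documents: pages 0 and 1, then page 3.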
### Archive Serial Number Assignment
When enabled, the value of the barcode (as an integer) will be used to set the document's
archive serial number, allowing quick reference back to the original, paper document.
If document splitting via barcode is also enabled, documents will be split when an ASN
barcode is located. However, differing from the splitting, the page with the
barcode _will_ be retained. This allows application of a barcode to any page, including
one which holds data to keep in the document.
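The difference from separator pages can be sketched like this. The helper name `split_on_asn` is hypothetical, the `ASN` prefix is the documented default of `PAPERLESS_CONSUMER_ASN_BARCODE_PREFIX`, and treating the barcode page as the first page of the new document is an assumption made for illustration.

```python
from typing import List, Optional


def split_on_asn(
    page_barcodes: List[Optional[str]], prefix: str = "ASN"
) -> List[List[int]]:
    """Start a new document at every page carrying an ASN barcode.

    Unlike a separator page, the ASN page is kept. Here it becomes the
    first page of the new document (an assumption of this sketch).
    """
    documents: List[List[int]] = []
    current: List[int] = []
    for index, value in enumerate(page_barcodes):
        is_asn = value is not None and value.startswith(prefix)
        if is_asn and current:
            # Close the running document before the ASN page starts a new one.
            documents.append(current)
            current = []
        current.append(index)
    if current:
        documents.append(current)
    return documents
```

No page is dropped here: every page index of the input appears in exactly one of the resulting documents.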


@ -17,6 +17,8 @@ run paperless, these settings have to be defined in different places.
## Required services
### Redis Broker
`PAPERLESS_REDIS=<url>`
: This is required for processing scheduled tasks such as email
@ -33,6 +35,8 @@ matcher.
Defaults to `redis://localhost:6379`.
### Database
`PAPERLESS_DBENGINE=<engine_name>`
: Optional, gives the ability to choose Postgres or MariaDB for
@ -94,6 +98,78 @@ changing to postgresql if you need to increase this.
Defaults to unset, keeping the Django defaults.
## Optional Services
### Tika {#tika}
Paperless can make use of [Tika](https://tika.apache.org/) and
[Gotenberg](https://gotenberg.dev/) for parsing and converting
"Office" documents (such as ".doc", ".xlsx" and ".odt").
Tika and Gotenberg are also needed to allow parsing of E-Mails (.eml).
If you wish to use this, you must provide a Tika server and a Gotenberg server,
configure their endpoints, and enable the feature.
`PAPERLESS_TIKA_ENABLED=<bool>`
: Enable (or disable) the Tika parser.
Defaults to false.
`PAPERLESS_TIKA_ENDPOINT=<url>`
: Set the endpoint URL where Paperless can reach your Tika server.
Defaults to "<http://localhost:9998>".
`PAPERLESS_TIKA_GOTENBERG_ENDPOINT=<url>`
: Set the endpoint URL where Paperless can reach your Gotenberg server.
Defaults to "<http://localhost:3000>".
If you run paperless on docker, you can add those services to the
docker-compose file (see the provided `docker-compose.sqlite-tika.yml`
file for reference). The changes required are as follows:
```yaml
services:
# ...
webserver:
# ...
environment:
# ...
PAPERLESS_TIKA_ENABLED: 1
PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
PAPERLESS_TIKA_ENDPOINT: http://tika:9998
# ...
gotenberg:
image: gotenberg/gotenberg:7.8
restart: unless-stopped
# The gotenberg chromium route is used to convert .eml files. We do not
# want to allow external content like tracking pixels or even javascript.
command:
- 'gotenberg'
- '--chromium-disable-javascript=true'
- '--chromium-allow-list=file:///tmp/.*'
tika:
image: ghcr.io/paperless-ngx/tika:latest
restart: unless-stopped
```
Add the configuration variables to the environment of the webserver
(alternatively put the configuration in the `docker-compose.env` file)
and add the additional services below the webserver service. Watch out
for indentation.
Make sure to use the correct format `PAPERLESS_TIKA_ENABLED = 1` so python_dotenv can parse the statement correctly.
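To illustrate how the three settings and their documented defaults fit together, here is a small sketch that reads them from the environment. `read_tika_settings` is a hypothetical helper, and the set of spellings accepted as "true" is an assumption; Paperless's own parsing may differ.

```python
import os
from typing import Dict, Mapping, Union


def read_tika_settings(
    env: Mapping[str, str] = os.environ,
) -> Dict[str, Union[bool, str]]:
    """Collect the Tika-related settings, applying the documented defaults."""
    # Assumption: these spellings count as "true" for this sketch only.
    truthy = {"1", "true", "yes", "y", "t"}
    enabled = env.get("PAPERLESS_TIKA_ENABLED", "false").strip().lower() in truthy
    return {
        "enabled": enabled,
        "tika_endpoint": env.get(
            "PAPERLESS_TIKA_ENDPOINT", "http://localhost:9998"
        ),
        "gotenberg_endpoint": env.get(
            "PAPERLESS_TIKA_GOTENBERG_ENDPOINT", "http://localhost:3000"
        ),
    }
```

With an empty environment the parser stays disabled and both endpoints fall back to `localhost`, which is why the docker-compose example above overrides them with the service names.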
## Paths and folders
`PAPERLESS_CONSUMPTION_DIR=<path>`
@ -227,8 +303,7 @@ not include a trailing slash. E.g. <https://paperless.domain.com>
: A list of trusted origins for unsafe requests (e.g. POST). As of
Django 4.0 this is required to access the Django admin via the web.
See
<https://docs.djangoproject.com/en/4.0/ref/settings/#csrf-trusted-origins>
See the [Django project documentation on the settings](https://docs.djangoproject.com/en/4.1/ref/settings/#csrf-trusted-origins).
Can also be set using PAPERLESS_URL (see above).
@ -239,8 +314,8 @@ See
: If you're planning on putting Paperless on the open internet, then
you really should set this value to the domain name you're using.
Failing to do so leaves you open to HTTP host header attacks:
<https://docs.djangoproject.com/en/3.1/topics/security/#host-header-validation>
Failing to do so leaves you open to HTTP host header attacks.
You can read more about this in [the Django project's documentation](https://docs.djangoproject.com/en/4.1/topics/security/#host-header-validation).
Just remember that this is a comma-separated list, so
"example.com" is fine, as is "example.com,www.example.com", but
@ -348,7 +423,7 @@ applications.
If you're exposing paperless to the internet directly, do not use
this.
Also see the warning [in the official documentation](https://docs.djangoproject.com/en/3.1/howto/auth-remote-user/#configuration).
Also see the warning [in the official documentation](https://docs.djangoproject.com/en/4.1/howto/auth-remote-user/#configuration).
Defaults to "false" which disables this feature.
@ -357,7 +432,7 @@ applications.
: If "PAPERLESS_ENABLE_HTTP_REMOTE_USER" is enabled, this
property allows customizing the name of the HTTP header from which
the authenticated username is extracted. Values are in terms of
[HttpRequest.META](https://docs.djangoproject.com/en/3.1/ref/request-response/#django.http.HttpRequest.META).
[HttpRequest.META](https://docs.djangoproject.com/en/4.1/ref/request-response/#django.http.HttpRequest.META).
Thus, the configured value must start with `HTTP_`
followed by the normalized actual header name.
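The normalization Django applies to request headers can be sketched as below. `header_to_meta_key` is a hypothetical helper written for illustration; it mirrors the rule described in the Django documentation for `HttpRequest.META` (uppercase, hyphens replaced by underscores, `HTTP_` prefix).

```python
def header_to_meta_key(header_name: str) -> str:
    """Convert an HTTP header name to the key Django uses in HttpRequest.META.

    Note: Content-Length and Content-Type are exceptions in Django and do
    not receive the HTTP_ prefix; they are not handled by this sketch.
    """
    return "HTTP_" + header_name.upper().replace("-", "_")
```

For example, a proxy that sends the username in an `X-Forwarded-User` header would correspond to the configured value `HTTP_X_FORWARDED_USER`.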
@ -576,76 +651,6 @@ they use underscores instead of dashes.
{"deskew": true, "optimize": 3, "unpaper_args": "--pre-rotate 90"}
```
## Tika settings {#tika}
Paperless can make use of [Tika](https://tika.apache.org/) and
[Gotenberg](https://gotenberg.dev/) for parsing and converting
"Office" documents (such as ".doc", ".xlsx" and ".odt").
Tika and Gotenberg are also needed to allow parsing of E-Mails (.eml).
If you wish to use this, you must provide a Tika server and a Gotenberg server,
configure their endpoints, and enable the feature.
`PAPERLESS_TIKA_ENABLED=<bool>`
: Enable (or disable) the Tika parser.
Defaults to false.
`PAPERLESS_TIKA_ENDPOINT=<url>`
: Set the endpoint URL where Paperless can reach your Tika server.
Defaults to "<http://localhost:9998>".
`PAPERLESS_TIKA_GOTENBERG_ENDPOINT=<url>`
: Set the endpoint URL where Paperless can reach your Gotenberg server.
Defaults to "<http://localhost:3000>".
If you run paperless on docker, you can add those services to the
docker-compose file (see the provided `docker-compose.sqlite-tika.yml`
file for reference). The changes required are as follows:
```yaml
services:
# ...
webserver:
# ...
environment:
# ...
PAPERLESS_TIKA_ENABLED: 1
PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
PAPERLESS_TIKA_ENDPOINT: http://tika:9998
# ...
gotenberg:
image: gotenberg/gotenberg:7.8
restart: unless-stopped
# The gotenberg chromium route is used to convert .eml files. We do not
# want to allow external content like tracking pixels or even javascript.
command:
- 'gotenberg'
- '--chromium-disable-javascript=true'
- '--chromium-allow-list=file:///tmp/.*'
tika:
image: ghcr.io/paperless-ngx/tika:latest
restart: unless-stopped
```
Add the configuration variables to the environment of the webserver
(alternatively put the configuration in the `docker-compose.env` file)
and add the additional services below the webserver service. Watch out
for indentation.
Make sure to use the correct format `PAPERLESS_TIKA_ENABLED = 1` so python_dotenv can parse the statement correctly.
## Software tweaks {#software_tweaks}
`PAPERLESS_TASK_WORKERS=<num>`
@ -699,8 +704,8 @@ this timeout may prove to be useful on weak hardware setups.
`PAPERLESS_TIME_ZONE=<timezone>`
: Set the time zone here. See
<https://docs.djangoproject.com/en/3.1/ref/settings/#std:setting-TIME_ZONE>
: Set the time zone here. See
the [Django project documentation](https://docs.djangoproject.com/en/4.1/ref/settings/#std:setting-TIME_ZONE)
for details on how to set it.
Defaults to UTC.
@ -762,46 +767,33 @@ should be a valid crontab(5) expression describing when to run.
to enable compression in your proxy configuration rather than
the webserver
## Polling {#polling}
`PAPERLESS_CONVERT_MEMORY_LIMIT=<num>`
`PAPERLESS_CONSUMER_POLLING=<num>`
: On smaller systems, or even in the case of Very Large Documents, the
consumer may explode, complaining about how it's "unable to extend
pixel cache". In such cases, try setting this to a reasonably low
value, like 32. The default is to use whatever is necessary to do
everything without writing to disk, and units are in megabytes.
: If paperless won't find documents added to your consume folder, it
might not be able to automatically detect filesystem changes. In
that case, specify a polling interval in seconds here, which will
then cause paperless to periodically check your consumption
directory for changes. This will also disable listening for file
system changes with `inotify`.
For more information on how to use this value, you should search the
web for "MAGICK_MEMORY_LIMIT".
Defaults to 0, which disables polling and uses filesystem
notifications.
Defaults to 0, which disables the limit.
`PAPERLESS_CONSUMER_POLLING_RETRY_COUNT=<num>`
`PAPERLESS_CONVERT_TMPDIR=<path>`
: If consumer polling is enabled, sets the number of times paperless
will check for a file to remain unmodified.
: Similar to the memory limit, if you've got a small system and your
OS mounts /tmp as tmpfs, you should set this to a path that's on a
physical disk, like /home/your_user/tmp or something. ImageMagick
will use this as scratch space when crunching through very large
documents.
Defaults to 5.
For more information on how to use this value, you should search the
web for "MAGICK_TMPDIR".
`PAPERLESS_CONSUMER_POLLING_DELAY=<num>`
Default is none, which disables the temporary directory.
: If consumer polling is enabled, sets the delay in seconds between
each check (above) paperless will do while waiting for a file to
remain unmodified.
Defaults to 5.
## iNotify {#inotify}
`PAPERLESS_CONSUMER_INOTIFY_DELAY=<num>`
: Sets the time in seconds the consumer will wait for additional
events from inotify before the consumer will consider a file ready
and begin consumption. Certain scanners or network setups may
generate multiple events for a single file, leading to multiple
consumers working on the same file. Configure this to prevent that.
Defaults to 0.5 seconds.
## Document Consumption {#consume_config}
`PAPERLESS_CONSUMER_DELETE_DUPLICATES=<bool>`
@ -832,96 +824,41 @@ don't exist yet.
Defaults to false.
`PAPERLESS_CONSUMER_ENABLE_BARCODES=<bool>`
`PAPERLESS_CONSUMER_IGNORE_PATTERNS=<json>`
: Enables the scanning and page separation based on detected barcodes.
This allows for scanning and adding multiple documents per uploaded
file, which are separated by one or multiple barcode pages.
: By default, paperless ignores certain files and folders in the
consumption directory, such as system files created by the Mac OS
or hidden folders some tools use to store data.
For ease of use, it is suggested to use a standardized separation
page, e.g. [here](https://www.alliancegroup.co.uk/patch-codes.htm).
This can be adjusted by configuring a custom json array with
patterns to exclude.
If no barcodes are detected in the uploaded file, no page separation
will happen.
For example, `.DS_STORE/*` will ignore any files found in a folder
named `.DS_STORE`, including `.DS_STORE/bar.pdf` and `foo/.DS_STORE/bar.pdf`
The original document will be removed and the separated pages will
be saved as pdf.
A pattern like `._*` will ignore anything starting with `._`, including:
`._foo.pdf` and `._bar/foo.pdf`
Defaults to false.
Defaults to
`[".DS_STORE/*", "._*", ".stfolder/*", ".stversions/*", ".localized/*", "desktop.ini", "@eaDir/*"]`.
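The matching behaviour described in the two examples can be approximated with the sketch below. It is not the matcher Paperless uses internally: `is_ignored` is a hypothetical helper that compares each pattern against every run of consecutive path components, which reproduces the documented examples.

```python
import json
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import List


def is_ignored(relative_path: str, patterns: List[str]) -> bool:
    """Return True if a pattern matches a run of consecutive path components."""
    parts = PurePosixPath(relative_path).parts
    for pattern in patterns:
        wanted = PurePosixPath(pattern).parts
        # Slide the pattern over every window of the same length in the path.
        for start in range(len(parts) - len(wanted) + 1):
            window = parts[start:start + len(wanted)]
            if all(fnmatchcase(p, w) for p, w in zip(window, wanted)):
                return True
    return False


# The documented default, parsed as a JSON array like the setting itself.
DEFAULT_PATTERNS = json.loads(
    '[".DS_STORE/*", "._*", ".stfolder/*", ".stversions/*", '
    '".localized/*", "desktop.ini", "@eaDir/*"]'
)
```

Paths are taken relative to the consumption directory, so `foo/.DS_STORE/bar.pdf` and `._bar/foo.pdf` are both ignored, while an ordinary file such as `invoices/2023.pdf` is not.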
`PAPERLESS_CONSUMER_BARCODE_TIFF_SUPPORT=<bool>`
`PAPERLESS_PRE_CONSUME_SCRIPT=<filename>`
: Whether TIFF image files should be scanned for barcodes. This will
automatically convert any TIFF image(s) to pdfs for later
processing. This only has an effect, if
PAPERLESS_CONSUMER_ENABLE_BARCODES has been enabled.
: After some initial validation, Paperless can trigger an arbitrary
script before beginning consumption, if you like. This script will be provided
data for it to work with via the environment.
Defaults to false.
For more information, take a look at [pre-consumption script](/advanced_usage#pre-consume-script).
`PAPERLESS_CONSUMER_BARCODE_STRING=PATCHT`
: Defines the string to be detected as a separator barcode. If
paperless is used with the PATCH-T separator pages, users shouldn't
change this.
Defaults to "PATCHT"
`PAPERLESS_CONSUMER_ENABLE_ASN_BARCODE=<bool>`
: Enables the detection of barcodes in the scanned document and
setting the ASN (archive serial number) if a properly formatted
barcode is detected.
The barcode must consist of a (configurable) prefix and the ASN
to be set, for instance `ASN00123`.
This option is compatible with barcode page separation, since
pages will be split up before reading the ASN.
If no ASN barcodes are detected in the uploaded file, no ASN will
be set. If a barcode with an already existing ASN is detected, no ASN
will be set either and a warning will be logged.
Defaults to false.
`PAPERLESS_CONSUMER_ASN_BARCODE_PREFIX=ASN`
: Defines the prefix that is used to identify a barcode as an ASN
barcode.
Defaults to "ASN"
`PAPERLESS_CONVERT_MEMORY_LIMIT=<num>`
: On smaller systems, or even in the case of Very Large Documents, the
consumer may explode, complaining about how it's "unable to extend
pixel cache". In such cases, try setting this to a reasonably low
value, like 32. The default is to use whatever is necessary to do
everything without writing to disk, and units are in megabytes.
For more information on how to use this value, you should search the
web for "MAGICK_MEMORY_LIMIT".
Defaults to 0, which disables the limit.
`PAPERLESS_CONVERT_TMPDIR=<path>`
: Similar to the memory limit, if you've got a small system and your
OS mounts /tmp as tmpfs, you should set this to a path that's on a
physical disk, like /home/your_user/tmp or something. ImageMagick
will use this as scratch space when crunching through very large
documents.
For more information on how to use this value, you should search the
web for "MAGICK_TMPDIR".
Default is none, which disables the temporary directory.
The default is blank, which means nothing will be executed.
`PAPERLESS_POST_CONSUME_SCRIPT=<filename>`
: After a document is consumed, Paperless can trigger an arbitrary
script if you like. This script will be passed a number of arguments
for you to work with. For more information, take a look at [Post-consumption script](/advanced_usage#post-consume-script).
script if you like. This script will be provided
data for it to work with via the environment.
For more information, take a look at [Post-consumption script](/advanced_usage#post-consume-script).
The default is blank, which means nothing will be executed.
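As a rough sketch of what "data via the environment" means for a script author, the example below reads two values from the environment and prints a summary. The variable names `DOCUMENT_ID` and `DOCUMENT_FILE_NAME` are assumptions used for illustration; the linked advanced usage page is the authoritative list of what is provided.

```python
#!/usr/bin/env python3
import os
from typing import Mapping


def summarize(env: Mapping[str, str]) -> str:
    """Build a one-line summary from values passed in the environment."""
    # Assumed variable names; check the advanced usage page for the real list.
    doc_id = env.get("DOCUMENT_ID", "<unknown>")
    file_name = env.get("DOCUMENT_FILE_NAME", "<unknown>")
    return f"Consumed document {doc_id}: {file_name}"


if __name__ == "__main__":
    # The script receives no positional arguments; everything is in os.environ.
    print(summarize(os.environ))
```

Reading with `env.get` and a fallback keeps the script from failing if a variable is missing.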
@ -988,23 +925,109 @@ within your documents.
second, and year last order. Characters D, M, or Y can be shuffled
to meet the required order.
`PAPERLESS_CONSUMER_IGNORE_PATTERNS=<json>`
### Polling {#polling}
: By default, paperless ignores certain files and folders in the
consumption directory, such as system files created by the Mac OS
or hidden folders some tools use to store data.
`PAPERLESS_CONSUMER_POLLING=<num>`
This can be adjusted by configuring a custom json array with
patterns to exclude.
: If paperless won't find documents added to your consume folder, it
might not be able to automatically detect filesystem changes. In
that case, specify a polling interval in seconds here, which will
then cause paperless to periodically check your consumption
directory for changes. This will also disable listening for file
system changes with `inotify`.
For example, `.DS_STORE/*` will ignore any files found in a folder
named `.DS_STORE`, including `.DS_STORE/bar.pdf` and `foo/.DS_STORE/bar.pdf`
Defaults to 0, which disables polling and uses filesystem
notifications.
A pattern like `._*` will ignore anything starting with `._`, including:
`._foo.pdf` and `._bar/foo.pdf`
`PAPERLESS_CONSUMER_POLLING_RETRY_COUNT=<num>`
Defaults to
`[".DS_STORE/*", "._*", ".stfolder/*", ".stversions/*", ".localized/*", "desktop.ini", "@eaDir/*"]`.
: If consumer polling is enabled, sets the number of times paperless
will check for a file to remain unmodified.
Defaults to 5.
`PAPERLESS_CONSUMER_POLLING_DELAY=<num>`
: If consumer polling is enabled, sets the delay in seconds between
each check (above) paperless will do while waiting for a file to
remain unmodified.
Defaults to 5.
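The interplay of the retry count and the delay can be sketched as a loop that waits for a file to stop changing. This is an illustration only: `wait_until_stable` is a hypothetical helper, and comparing size and modification time is an assumption about what "unmodified" means.

```python
import os
import time
from typing import Optional, Tuple


def wait_until_stable(path: str, retry_count: int = 5, delay: float = 5.0) -> bool:
    """Return True once size and mtime are unchanged between two checks."""
    previous: Optional[Tuple[int, float]] = None
    for _ in range(retry_count):
        try:
            stat = os.stat(path)
        except FileNotFoundError:
            return False  # the file vanished while we were waiting
        current = (stat.st_size, stat.st_mtime)
        if current == previous:
            return True  # nothing changed since the previous check
        previous = current
        time.sleep(delay)
    # The file kept changing for all checks; give up for now.
    return False
```

With the defaults of 5 retries and a 5 second delay, a file that is still being written is re-checked for roughly 25 seconds before this sketch gives up.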
### iNotify {#inotify}
`PAPERLESS_CONSUMER_INOTIFY_DELAY=<num>`
: Sets the time in seconds the consumer will wait for additional
events from inotify before the consumer will consider a file ready
and begin consumption. Certain scanners or network setups may
generate multiple events for a single file, leading to multiple
consumers working on the same file. Configure this to prevent that.
Defaults to 0.5 seconds.
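The effect of the delay can be sketched as a debounce over a stream of events. `debounce` is a hypothetical helper operating on recorded `(timestamp, path)` pairs rather than on live inotify events, so it only illustrates the idea of collapsing a burst into a single consumption.

```python
from typing import Dict, Iterable, List, Tuple


def debounce(
    events: Iterable[Tuple[float, str]], delay: float = 0.5
) -> List[Tuple[float, str]]:
    """Collapse bursts of events per file into (ready_time, path) entries.

    events are (timestamp_in_seconds, path) pairs in chronological order.
    A file is considered ready `delay` seconds after the last event of a burst.
    """
    ready: List[Tuple[float, str]] = []
    last_seen: Dict[str, float] = {}
    for timestamp, path in events:
        previous = last_seen.get(path)
        if previous is not None and timestamp - previous > delay:
            # The quiet period elapsed, so the earlier burst was already ready.
            ready.append((previous + delay, path))
        last_seen[path] = timestamp
    # The final burst of each file becomes ready once its quiet period ends.
    ready.extend((seen + delay, path) for path, seen in last_seen.items())
    return sorted(ready)
```

Three events for the same file within a quarter of a second of each other produce a single entry, whereas two events two seconds apart produce two.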
## Barcodes {#barcodes}
`PAPERLESS_CONSUMER_ENABLE_BARCODES=<bool>`
: Enables the scanning and page separation based on detected barcodes.
This allows for scanning and adding multiple documents per uploaded
file, which are separated by one or multiple barcode pages.
For ease of use, it is suggested to use a standardized separation
page, e.g. [here](https://www.alliancegroup.co.uk/patch-codes.htm).
If no barcodes are detected in the uploaded file, no page separation
will happen.
The original document will be removed and the separated pages will
be saved as pdf.
See additional information in the [advanced usage documentation](/advanced_usage#barcodes).
Defaults to false.
`PAPERLESS_CONSUMER_BARCODE_TIFF_SUPPORT=<bool>`
: Whether TIFF image files should be scanned for barcodes. This will
automatically convert any TIFF image(s) to pdfs for later
processing. This only has an effect, if
PAPERLESS_CONSUMER_ENABLE_BARCODES has been enabled.
Defaults to false.
`PAPERLESS_CONSUMER_BARCODE_STRING=PATCHT`
: Defines the string to be detected as a separator barcode. If
paperless is used with the PATCH-T separator pages, users shouldn't
change this.
Defaults to "PATCHT"
`PAPERLESS_CONSUMER_ENABLE_ASN_BARCODE=<bool>`
: Enables the detection of barcodes in the scanned document and
setting the ASN (archive serial number) if a properly formatted
barcode is detected.
The barcode must consist of a (configurable) prefix and the ASN
to be set, for instance `ASN00123`.
This option is compatible with barcode page separation, since
pages will be split up before reading the ASN.
If no ASN barcodes are detected in the uploaded file, no ASN will
be set. If a barcode with an already existing ASN is detected, no ASN
will be set either and a warning will be logged.
Defaults to false.
`PAPERLESS_CONSUMER_ASN_BARCODE_PREFIX=ASN`
: Defines the prefix that is used to identify a barcode as an ASN
barcode.
Defaults to "ASN"
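The ASN rules described in this section (prefix plus number, no assignment for a duplicate) can be sketched as follows. `asn_from_barcodes` is a hypothetical helper, and the set of existing ASNs is passed in directly instead of being looked up in a database.

```python
import logging
from typing import Iterable, Optional, Set

logger = logging.getLogger(__name__)


def asn_from_barcodes(
    barcodes: Iterable[str], existing_asns: Set[int], prefix: str = "ASN"
) -> Optional[int]:
    """Return the ASN to assign, or None if no usable ASN barcode is present."""
    for value in barcodes:
        if not value.startswith(prefix):
            continue  # e.g. a separator barcode such as "PATCHT"
        digits = value[len(prefix):].strip()
        if not digits.isdecimal():
            continue  # carries the prefix but is not properly formatted
        asn = int(digits)
        if asn in existing_asns:
            # Mirrors the documented behaviour: warn and set no ASN.
            logger.warning("ASN %d already exists, not setting it", asn)
            return None
        return asn
    return None
```

A barcode reading `ASN00123` therefore yields the integer 123, and a second document carrying the same barcode is left without an ASN.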
## Binaries


@ -119,7 +119,9 @@ first-time setup.
## Back end development
The back end is a [Django](https://www.djangoproject.com/) application. [PyCharm](https://www.jetbrains.com/de-de/pycharm/) as well as [Visual Studio Code](https://code.visualstudio.com) work well for development, but you can use whatever you want.
The back end is a [Django](https://www.djangoproject.com/) application.
[PyCharm](https://www.jetbrains.com/de-de/pycharm/) as well as [Visual Studio Code](https://code.visualstudio.com)
work well for development, but you can use whatever you want.
Configure the IDE to use the `src/`-folder as the base source folder.
Configure the following launch configurations in your IDE:
@ -138,7 +140,10 @@ $ python3 manage.py runserver & \
celery --app paperless worker -l DEBUG
```
You might need the front end to test your back end code. This assumes that you have AngularJS installed on your system. Go to the [Front end development](#front-end-development) section for further details. To build the front end once use this commmand:
You might need the front end to test your back end code.
This assumes that you have AngularJS installed on your system.
Go to the [Front end development](#front-end-development) section for further details.
To build the front end once use this command:
```bash
# src-ui/