mirror of
https://github.com/paperless-ngx/paperless-ngx.git
synced 2026-01-22 22:34:20 -06:00
Feature: Simplify and improve the consumer (#11753)
This commit is contained in:
@@ -501,7 +501,7 @@ The `datetime` filter formats a datetime string or datetime object using Python'
|
||||
See the [strftime format code documentation](https://docs.python.org/3.13/library/datetime.html#strftime-and-strptime-format-codes)
|
||||
for the possible codes and their meanings.
|
||||
|
||||
##### Date Localization
|
||||
##### Date Localization {#date-localization}
|
||||
|
||||
The `localize_date` filter formats a date or datetime object into a localized string using Babel internationalization.
|
||||
This takes into account the provided locale for translation. Since this must be used on a date or datetime object,
|
||||
@@ -851,8 +851,8 @@ followed by the even pages.
|
||||
|
||||
It's important that the scan files get consumed in the correct order, and one at a time.
|
||||
You therefore need to make sure that Paperless is running while you upload the files into
|
||||
the directory; and if you're using [polling](configuration.md#polling), make sure that
|
||||
`CONSUMER_POLLING` is set to a value lower than it takes for the second scan to appear,
|
||||
the directory; and if you're using polling, make sure that
|
||||
`CONSUMER_POLLING_INTERVAL` is set to a value lower than it takes for the second scan to appear,
|
||||
like 5-10 or even lower.
|
||||
|
||||
Another thing that might happen is that you start a double sided scan, but then forget
|
||||
|
||||
@@ -1175,21 +1175,45 @@ don't exist yet.
|
||||
|
||||
#### [`PAPERLESS_CONSUMER_IGNORE_PATTERNS=<json>`](#PAPERLESS_CONSUMER_IGNORE_PATTERNS) {#PAPERLESS_CONSUMER_IGNORE_PATTERNS}
|
||||
|
||||
: By default, paperless ignores certain files and folders in the
|
||||
consumption directory, such as system files created by the Mac OS
|
||||
or hidden folders some tools use to store data.
|
||||
: Additional regex patterns for files to ignore in the consumption directory. Patterns are matched against filenames only (not full paths)
|
||||
using Python's `re.match()`, which anchors at the start of the filename.
|
||||
|
||||
This can be adjusted by configuring a custom json array with
|
||||
patterns to exclude.
|
||||
See the [watchfiles documentation](https://watchfiles.helpmanual.io/api/filters/#watchfiles.BaseFilter.ignore_entity_patterns)
|
||||
|
||||
For example, `.DS_STORE/*` will ignore any files found in a folder
|
||||
named `.DS_STORE`, including `.DS_STORE/bar.pdf` and `foo/.DS_STORE/bar.pdf`
|
||||
This setting is for additional patterns beyond the built-in defaults. Common system files and directories are already ignored automatically.
|
||||
The patterns will be compiled via Python's standard `re` module.
|
||||
|
||||
A pattern like `._*` will ignore anything starting with `._`, including:
|
||||
`._foo.pdf` and `._bar/foo.pdf`
|
||||
Example custom patterns:
|
||||
|
||||
Defaults to
|
||||
`[".DS_Store", ".DS_STORE", "._*", ".stfolder/*", ".stversions/*", ".localized/*", "desktop.ini", "@eaDir/*", "Thumbs.db"]`.
|
||||
```json
|
||||
["^temp_", "\\.bak$", "^~"]
|
||||
```
|
||||
|
||||
This would ignore:
|
||||
|
||||
- Files starting with `temp_` (e.g., `temp_scan.pdf`)
|
||||
- Files ending with `.bak` (e.g., `document.pdf.bak`)
|
||||
- Files starting with `~` (e.g., `~$document.docx`)
|
||||
|
||||
Defaults to `[]` (empty list, uses only built-in defaults).
|
||||
|
||||
The default ignores are `[.DS_Store, .DS_STORE, ._*, desktop.ini, Thumbs.db]` and cannot be overridden.
|
||||
|
||||
#### [`PAPERLESS_CONSUMER_IGNORE_DIRS=<json>`](#PAPERLESS_CONSUMER_IGNORE_DIRS) {#PAPERLESS_CONSUMER_IGNORE_DIRS}
|
||||
|
||||
: Additional directory names to ignore in the consumption directory. Directories matching these names (and all their contents) will be skipped.
|
||||
|
||||
This setting is for additional directories beyond the built-in defaults. Matching is done by directory name only, not full path.
|
||||
|
||||
Example:
|
||||
|
||||
```json
|
||||
["temp", "incoming", ".hidden"]
|
||||
```
|
||||
|
||||
Defaults to `[]` (empty list, uses only built-in defaults).
|
||||
|
||||
The default ignores are `[.stfolder, .stversions, .localized, @eaDir, .Spotlight-V100, .Trashes, __MACOSX]` and cannot be overridden.
|
||||
|
||||
#### [`PAPERLESS_CONSUMER_BARCODE_SCANNER=<string>`](#PAPERLESS_CONSUMER_BARCODE_SCANNER) {#PAPERLESS_CONSUMER_BARCODE_SCANNER}
|
||||
|
||||
@@ -1288,48 +1312,24 @@ within your documents.
|
||||
|
||||
Defaults to false.
|
||||
|
||||
### Polling {#polling}
|
||||
#### [`PAPERLESS_CONSUMER_POLLING_INTERVAL=<num>`](#PAPERLESS_CONSUMER_POLLING_INTERVAL) {#PAPERLESS_CONSUMER_POLLING_INTERVAL}
|
||||
|
||||
#### [`PAPERLESS_CONSUMER_POLLING=<num>`](#PAPERLESS_CONSUMER_POLLING) {#PAPERLESS_CONSUMER_POLLING}
|
||||
: Configures how the consumer detects new files in the consumption directory.
|
||||
|
||||
: If paperless won't find documents added to your consume folder, it
|
||||
might not be able to automatically detect filesystem changes. In
|
||||
that case, specify a polling interval in seconds here, which will
|
||||
then cause paperless to periodically check your consumption
|
||||
directory for changes. This will also disable listening for file
|
||||
system changes with `inotify`.
|
||||
When set to `0` (default), paperless uses native filesystem notifications for efficient, immediate detection of new files.
|
||||
|
||||
Defaults to 0, which disables polling and uses filesystem
|
||||
notifications.
|
||||
When set to a positive number, paperless polls the consumption directory at that interval in seconds. Use polling for network filesystems (NFS, SMB/CIFS) where native notifications may not work reliably.
|
||||
|
||||
#### [`PAPERLESS_CONSUMER_POLLING_RETRY_COUNT=<num>`](#PAPERLESS_CONSUMER_POLLING_RETRY_COUNT) {#PAPERLESS_CONSUMER_POLLING_RETRY_COUNT}
|
||||
Defaults to 0.
|
||||
|
||||
: If consumer polling is enabled, sets the maximum number of times
|
||||
paperless will check for a file to remain unmodified. If a file's
|
||||
modification time and size are identical for two consecutive checks, it
|
||||
will be consumed.
|
||||
#### [`PAPERLESS_CONSUMER_STABILITY_DELAY=<num>`](#PAPERLESS_CONSUMER_STABILITY_DELAY) {#PAPERLESS_CONSUMER_STABILITY_DELAY}
|
||||
|
||||
Defaults to 5.
|
||||
: Sets the time in seconds that a file must remain unchanged (same size and modification time) before paperless will begin consuming it.
|
||||
|
||||
#### [`PAPERLESS_CONSUMER_POLLING_DELAY=<num>`](#PAPERLESS_CONSUMER_POLLING_DELAY) {#PAPERLESS_CONSUMER_POLLING_DELAY}
|
||||
Increase this value if you experience issues with files being consumed before they are fully written, particularly on slower network storage or
|
||||
with certain scanner quirks
|
||||
|
||||
: If consumer polling is enabled, sets the delay in seconds between
|
||||
each check (above) paperless will do while waiting for a file to
|
||||
remain unmodified.
|
||||
|
||||
Defaults to 5.
|
||||
|
||||
### iNotify {#inotify}
|
||||
|
||||
#### [`PAPERLESS_CONSUMER_INOTIFY_DELAY=<num>`](#PAPERLESS_CONSUMER_INOTIFY_DELAY) {#PAPERLESS_CONSUMER_INOTIFY_DELAY}
|
||||
|
||||
: Sets the time in seconds the consumer will wait for additional
|
||||
events from inotify before the consumer will consider a file ready
|
||||
and begin consumption. Certain scanners or network setups may
|
||||
generate multiple events for a single file, leading to multiple
|
||||
consumers working on the same file. Configure this to prevent that.
|
||||
|
||||
Defaults to 0.5 seconds.
|
||||
Defaults to 5.0 seconds.
|
||||
|
||||
## Workflow webhooks
|
||||
|
||||
|
||||
19
docs/migration.md
Normal file
19
docs/migration.md
Normal file
@@ -0,0 +1,19 @@
|
||||
# v3 Migration Guide
|
||||
|
||||
## Consumer Settings Changes
|
||||
|
||||
The v3 consumer command uses a [different library](https://watchfiles.helpmanual.io/) to unify
|
||||
the watching for new files in the consume directory. For the user, this removes several configuration options related to delays and retries
|
||||
and replaces with a single unified setting. It also adjusts how the consumer ignore filtering happens, replaced `fnmatch` with `regex` and
|
||||
separating the directory ignore from the file ignore.
|
||||
|
||||
### Summary
|
||||
|
||||
| Old Setting | New Setting | Notes |
|
||||
| ------------------------------ | ----------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
|
||||
| `CONSUMER_POLLING` | [`CONSUMER_POLLING_INTERVAL`](configuration.md#PAPERLESS_CONSUMER_POLLING_INTERVAL) | Renamed for clarity |
|
||||
| `CONSUMER_INOTIFY_DELAY` | [`CONSUMER_STABILITY_DELAY`](configuration.md#PAPERLESS_CONSUMER_STABILITY_DELAY) | Unified for all modes |
|
||||
| `CONSUMER_POLLING_DELAY` | _Removed_ | Use `CONSUMER_STABILITY_DELAY` |
|
||||
| `CONSUMER_POLLING_RETRY_COUNT` | _Removed_ | Automatic with stability tracking |
|
||||
| `CONSUMER_IGNORE_PATTERNS` | [`CONSUMER_IGNORE_PATTERNS`](configuration.md#PAPERLESS_CONSUMER_IGNORE_PATTERNS) | **Now regex, not fnmatch**; user patterns are added to (not replacing) default ones |
|
||||
| _New_ | [`CONSUMER_IGNORE_DIRS`](configuration.md#PAPERLESS_CONSUMER_IGNORE_DIRS) | Additional directories to ignore; user entries are added to (not replacing) defaults |
|
||||
@@ -124,8 +124,7 @@ account. The script essentially automatically performs the steps described in [D
|
||||
system notifications with `inotify`. When storing the consumption
|
||||
directory on such a file system, paperless will not pick up new
|
||||
files with the default configuration. You will need to use
|
||||
[`PAPERLESS_CONSUMER_POLLING`](configuration.md#PAPERLESS_CONSUMER_POLLING), which will disable inotify. See
|
||||
[here](configuration.md#polling).
|
||||
[`PAPERLESS_CONSUMER_POLLING_INTERVAL`](configuration.md#PAPERLESS_CONSUMER_POLLING_INTERVAL), which will disable inotify.
|
||||
|
||||
5. Run `docker compose pull`. This will pull the image from the GitHub container registry
|
||||
by default but you can change the image to pull from Docker Hub by changing the `image`
|
||||
|
||||
@@ -46,9 +46,9 @@ run:
|
||||
If you notice that the consumer will only pickup files in the
|
||||
consumption directory at startup, but won't find any other files added
|
||||
later, you will need to enable filesystem polling with the configuration
|
||||
option [`PAPERLESS_CONSUMER_POLLING`](configuration.md#PAPERLESS_CONSUMER_POLLING).
|
||||
option [`PAPERLESS_CONSUMER_POLLING_INTERVAL`](configuration.md#PAPERLESS_CONSUMER_POLLING_INTERVAL).
|
||||
|
||||
This will disable listening to filesystem changes with inotify and
|
||||
This will disable automatic listening for filesystem changes and
|
||||
paperless will manually check the consumption directory for changes
|
||||
instead.
|
||||
|
||||
@@ -234,47 +234,9 @@ FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ocrmypdf.io.yhk3zb
|
||||
|
||||
This probably indicates paperless tried to consume the same file twice.
|
||||
This can happen for a number of reasons, depending on how documents are
|
||||
placed into the consume folder. If paperless is using inotify (the
|
||||
default) to check for documents, try adjusting the
|
||||
[inotify configuration](configuration.md#inotify). If polling is enabled, try adjusting the
|
||||
[polling configuration](configuration.md#polling).
|
||||
|
||||
## Consumer fails waiting for file to remain unmodified.
|
||||
|
||||
You might find messages like these in your log files:
|
||||
|
||||
```
|
||||
[ERROR] [paperless.management.consumer] Timeout while waiting on file /usr/src/paperless/src/../consume/SCN_0001.pdf to remain unmodified.
|
||||
```
|
||||
|
||||
This indicates paperless timed out while waiting for the file to be
|
||||
completely written to the consume folder. Adjusting
|
||||
[polling configuration](configuration.md#polling) values should resolve the issue.
|
||||
|
||||
!!! note
|
||||
|
||||
The user will need to manually move the file out of the consume folder
|
||||
and back in, for the initial failing file to be consumed.
|
||||
|
||||
## Consumer fails reporting "OS reports file as busy still".
|
||||
|
||||
You might find messages like these in your log files:
|
||||
|
||||
```
|
||||
[WARNING] [paperless.management.consumer] Not consuming file /usr/src/paperless/src/../consume/SCN_0001.pdf: OS reports file as busy still
|
||||
```
|
||||
|
||||
This indicates paperless was unable to open the file, as the OS reported
|
||||
the file as still being in use. To prevent a crash, paperless did not
|
||||
try to consume the file. If paperless is using inotify (the default) to
|
||||
check for documents, try adjusting the
|
||||
[inotify configuration](configuration.md#inotify). If polling is enabled, try adjusting the
|
||||
[polling configuration](configuration.md#polling).
|
||||
|
||||
!!! note
|
||||
|
||||
The user will need to manually move the file out of the consume folder
|
||||
and back in, for the initial failing file to be consumed.
|
||||
placed into the consume folder, such as how a scanner may modify a file multiple times as it scans.
|
||||
Try adjusting the
|
||||
[file stability delay](configuration.md#PAPERLESS_CONSUMER_STABILITY_DELAY) to a larger value.
|
||||
|
||||
## Log reports "Creating PaperlessTask failed".
|
||||
|
||||
|
||||
@@ -565,7 +565,7 @@ This allows for complex logic to be used to generate the title, including [logic
|
||||
and [filters](https://jinja.palletsprojects.com/en/3.1.x/templates/#id11).
|
||||
The template is provided as a string.
|
||||
|
||||
Using Jinja2 Templates is also useful for [Date localization](advanced_usage.md#Date-Localization) in the title.
|
||||
Using Jinja2 Templates is also useful for [Date localization](advanced_usage.md#date-localization) in the title.
|
||||
|
||||
The available inputs differ depending on the type of workflow trigger.
|
||||
This is because at the time of consumption (when the text is to be set), no automatic tags etc. have been
|
||||
|
||||
Reference in New Issue
Block a user