Feature: Simplify and improve the consumer (#11753)

This commit is contained in:
Trenton H
2026-01-21 14:37:48 -08:00
committed by GitHub
parent e3c29fc626
commit 51b466a86b
13 changed files with 1625 additions and 808 deletions

View File

@@ -501,7 +501,7 @@ The `datetime` filter formats a datetime string or datetime object using Python'
See the [strftime format code documentation](https://docs.python.org/3.13/library/datetime.html#strftime-and-strptime-format-codes)
for the possible codes and their meanings.
##### Date Localization
##### Date Localization {#date-localization}
The `localize_date` filter formats a date or datetime object into a localized string using Babel internationalization.
This takes into account the provided locale for translation. Since this must be used on a date or datetime object,
@@ -851,8 +851,8 @@ followed by the even pages.
It's important that the scan files get consumed in the correct order, and one at a time.
You therefore need to make sure that Paperless is running while you upload the files into
the directory; and if you're using [polling](configuration.md#polling), make sure that
`CONSUMER_POLLING` is set to a value lower than it takes for the second scan to appear,
the directory; and if you're using polling, make sure that
`CONSUMER_POLLING_INTERVAL` is set to a value lower than it takes for the second scan to appear,
like 5-10 or even lower.
Another thing that might happen is that you start a double sided scan, but then forget

View File

@@ -1175,21 +1175,45 @@ don't exist yet.
#### [`PAPERLESS_CONSUMER_IGNORE_PATTERNS=<json>`](#PAPERLESS_CONSUMER_IGNORE_PATTERNS) {#PAPERLESS_CONSUMER_IGNORE_PATTERNS}
: By default, paperless ignores certain files and folders in the
consumption directory, such as system files created by the Mac OS
or hidden folders some tools use to store data.
: Additional regex patterns for files to ignore in the consumption directory. Patterns are matched against filenames only (not full paths)
using Python's `re.match()`, which anchors at the start of the filename.
This can be adjusted by configuring a custom json array with
patterns to exclude.
See the [watchfiles documentation](https://watchfiles.helpmanual.io/api/filters/#watchfiles.BaseFilter.ignore_entity_patterns)
For example, `.DS_STORE/*` will ignore any files found in a folder
named `.DS_STORE`, including `.DS_STORE/bar.pdf` and `foo/.DS_STORE/bar.pdf`
This setting is for additional patterns beyond the built-in defaults. Common system files and directories are already ignored automatically.
The patterns will be compiled via Python's standard `re` module.
A pattern like `._*` will ignore anything starting with `._`, including:
`._foo.pdf` and `._bar/foo.pdf`
Example custom patterns:
Defaults to
`[".DS_Store", ".DS_STORE", "._*", ".stfolder/*", ".stversions/*", ".localized/*", "desktop.ini", "@eaDir/*", "Thumbs.db"]`.
```json
["^temp_", "\\.bak$", "^~"]
```
This would ignore:
- Files starting with `temp_` (e.g., `temp_scan.pdf`)
- Files ending with `.bak` (e.g., `document.pdf.bak`)
- Files starting with `~` (e.g., `~$document.docx`)
Defaults to `[]` (empty list, uses only built-in defaults).
The default ignores are `[.DS_Store, .DS_STORE, ._*, desktop.ini, Thumbs.db]` and cannot be overridden.
#### [`PAPERLESS_CONSUMER_IGNORE_DIRS=<json>`](#PAPERLESS_CONSUMER_IGNORE_DIRS) {#PAPERLESS_CONSUMER_IGNORE_DIRS}
: Additional directory names to ignore in the consumption directory. Directories matching these names (and all their contents) will be skipped.
This setting is for additional directories beyond the built-in defaults. Matching is done by directory name only, not full path.
Example:
```json
["temp", "incoming", ".hidden"]
```
Defaults to `[]` (empty list, uses only built-in defaults).
The default ignores are `[.stfolder, .stversions, .localized, @eaDir, .Spotlight-V100, .Trashes, __MACOSX]` and cannot be overridden.
#### [`PAPERLESS_CONSUMER_BARCODE_SCANNER=<string>`](#PAPERLESS_CONSUMER_BARCODE_SCANNER) {#PAPERLESS_CONSUMER_BARCODE_SCANNER}
@@ -1288,48 +1312,24 @@ within your documents.
Defaults to false.
### Polling {#polling}
#### [`PAPERLESS_CONSUMER_POLLING_INTERVAL=<num>`](#PAPERLESS_CONSUMER_POLLING_INTERVAL) {#PAPERLESS_CONSUMER_POLLING_INTERVAL}
#### [`PAPERLESS_CONSUMER_POLLING=<num>`](#PAPERLESS_CONSUMER_POLLING) {#PAPERLESS_CONSUMER_POLLING}
: Configures how the consumer detects new files in the consumption directory.
: If paperless won't find documents added to your consume folder, it
might not be able to automatically detect filesystem changes. In
that case, specify a polling interval in seconds here, which will
then cause paperless to periodically check your consumption
directory for changes. This will also disable listening for file
system changes with `inotify`.
When set to `0` (default), paperless uses native filesystem notifications for efficient, immediate detection of new files.
Defaults to 0, which disables polling and uses filesystem
notifications.
When set to a positive number, paperless polls the consumption directory at that interval in seconds. Use polling for network filesystems (NFS, SMB/CIFS) where native notifications may not work reliably.
#### [`PAPERLESS_CONSUMER_POLLING_RETRY_COUNT=<num>`](#PAPERLESS_CONSUMER_POLLING_RETRY_COUNT) {#PAPERLESS_CONSUMER_POLLING_RETRY_COUNT}
Defaults to 0.
: If consumer polling is enabled, sets the maximum number of times
paperless will check for a file to remain unmodified. If a file's
modification time and size are identical for two consecutive checks, it
will be consumed.
#### [`PAPERLESS_CONSUMER_STABILITY_DELAY=<num>`](#PAPERLESS_CONSUMER_STABILITY_DELAY) {#PAPERLESS_CONSUMER_STABILITY_DELAY}
Defaults to 5.
: Sets the time in seconds that a file must remain unchanged (same size and modification time) before paperless will begin consuming it.
#### [`PAPERLESS_CONSUMER_POLLING_DELAY=<num>`](#PAPERLESS_CONSUMER_POLLING_DELAY) {#PAPERLESS_CONSUMER_POLLING_DELAY}
Increase this value if you experience issues with files being consumed before they are fully written, particularly on slower network storage or
with certain scanner quirks
: If consumer polling is enabled, sets the delay in seconds between
each check (above) paperless will do while waiting for a file to
remain unmodified.
Defaults to 5.
### iNotify {#inotify}
#### [`PAPERLESS_CONSUMER_INOTIFY_DELAY=<num>`](#PAPERLESS_CONSUMER_INOTIFY_DELAY) {#PAPERLESS_CONSUMER_INOTIFY_DELAY}
: Sets the time in seconds the consumer will wait for additional
events from inotify before the consumer will consider a file ready
and begin consumption. Certain scanners or network setups may
generate multiple events for a single file, leading to multiple
consumers working on the same file. Configure this to prevent that.
Defaults to 0.5 seconds.
Defaults to 5.0 seconds.
## Workflow webhooks

19
docs/migration.md Normal file
View File

@@ -0,0 +1,19 @@
# v3 Migration Guide
## Consumer Settings Changes
The v3 consumer command uses a [different library](https://watchfiles.helpmanual.io/) to unify
the watching for new files in the consume directory. For the user, this removes several configuration options related to delays and retries
and replaces with a single unified setting. It also adjusts how the consumer ignore filtering happens, replaced `fnmatch` with `regex` and
separating the directory ignore from the file ignore.
### Summary
| Old Setting | New Setting | Notes |
| ------------------------------ | ----------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
| `CONSUMER_POLLING` | [`CONSUMER_POLLING_INTERVAL`](configuration.md#PAPERLESS_CONSUMER_POLLING_INTERVAL) | Renamed for clarity |
| `CONSUMER_INOTIFY_DELAY` | [`CONSUMER_STABILITY_DELAY`](configuration.md#PAPERLESS_CONSUMER_STABILITY_DELAY) | Unified for all modes |
| `CONSUMER_POLLING_DELAY` | _Removed_ | Use `CONSUMER_STABILITY_DELAY` |
| `CONSUMER_POLLING_RETRY_COUNT` | _Removed_ | Automatic with stability tracking |
| `CONSUMER_IGNORE_PATTERNS` | [`CONSUMER_IGNORE_PATTERNS`](configuration.md#PAPERLESS_CONSUMER_IGNORE_PATTERNS) | **Now regex, not fnmatch**; user patterns are added to (not replacing) default ones |
| _New_ | [`CONSUMER_IGNORE_DIRS`](configuration.md#PAPERLESS_CONSUMER_IGNORE_DIRS) | Additional directories to ignore; user entries are added to (not replacing) defaults |

View File

@@ -124,8 +124,7 @@ account. The script essentially automatically performs the steps described in [D
system notifications with `inotify`. When storing the consumption
directory on such a file system, paperless will not pick up new
files with the default configuration. You will need to use
[`PAPERLESS_CONSUMER_POLLING`](configuration.md#PAPERLESS_CONSUMER_POLLING), which will disable inotify. See
[here](configuration.md#polling).
[`PAPERLESS_CONSUMER_POLLING_INTERVAL`](configuration.md#PAPERLESS_CONSUMER_POLLING_INTERVAL), which will disable inotify.
5. Run `docker compose pull`. This will pull the image from the GitHub container registry
by default but you can change the image to pull from Docker Hub by changing the `image`

View File

@@ -46,9 +46,9 @@ run:
If you notice that the consumer will only pickup files in the
consumption directory at startup, but won't find any other files added
later, you will need to enable filesystem polling with the configuration
option [`PAPERLESS_CONSUMER_POLLING`](configuration.md#PAPERLESS_CONSUMER_POLLING).
option [`PAPERLESS_CONSUMER_POLLING_INTERVAL`](configuration.md#PAPERLESS_CONSUMER_POLLING_INTERVAL).
This will disable listening to filesystem changes with inotify and
This will disable automatic listening for filesystem changes and
paperless will manually check the consumption directory for changes
instead.
@@ -234,47 +234,9 @@ FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ocrmypdf.io.yhk3zb
This probably indicates paperless tried to consume the same file twice.
This can happen for a number of reasons, depending on how documents are
placed into the consume folder. If paperless is using inotify (the
default) to check for documents, try adjusting the
[inotify configuration](configuration.md#inotify). If polling is enabled, try adjusting the
[polling configuration](configuration.md#polling).
## Consumer fails waiting for file to remain unmodified.
You might find messages like these in your log files:
```
[ERROR] [paperless.management.consumer] Timeout while waiting on file /usr/src/paperless/src/../consume/SCN_0001.pdf to remain unmodified.
```
This indicates paperless timed out while waiting for the file to be
completely written to the consume folder. Adjusting
[polling configuration](configuration.md#polling) values should resolve the issue.
!!! note
The user will need to manually move the file out of the consume folder
and back in, for the initial failing file to be consumed.
## Consumer fails reporting "OS reports file as busy still".
You might find messages like these in your log files:
```
[WARNING] [paperless.management.consumer] Not consuming file /usr/src/paperless/src/../consume/SCN_0001.pdf: OS reports file as busy still
```
This indicates paperless was unable to open the file, as the OS reported
the file as still being in use. To prevent a crash, paperless did not
try to consume the file. If paperless is using inotify (the default) to
check for documents, try adjusting the
[inotify configuration](configuration.md#inotify). If polling is enabled, try adjusting the
[polling configuration](configuration.md#polling).
!!! note
The user will need to manually move the file out of the consume folder
and back in, for the initial failing file to be consumed.
placed into the consume folder, such as how a scanner may modify a file multiple times as it scans.
Try adjusting the
[file stability delay](configuration.md#PAPERLESS_CONSUMER_STABILITY_DELAY) to a larger value.
## Log reports "Creating PaperlessTask failed".

View File

@@ -565,7 +565,7 @@ This allows for complex logic to be used to generate the title, including [logic
and [filters](https://jinja.palletsprojects.com/en/3.1.x/templates/#id11).
The template is provided as a string.
Using Jinja2 Templates is also useful for [Date localization](advanced_usage.md#Date-Localization) in the title.
Using Jinja2 Templates is also useful for [Date localization](advanced_usage.md#date-localization) in the title.
The available inputs differ depending on the type of workflow trigger.
This is because at the time of consumption (when the text is to be set), no automatic tags etc. have been