From aa26a44a12783c2d5221348a366013116c963e63 Mon Sep 17 00:00:00 2001
From: Trenton H <797416+stumpylog@users.noreply.github.com>
Date: Sat, 28 Jun 2025 10:26:07 -0700
Subject: [PATCH] Updated Backend Ideas List (markdown)

---
 Backend-Ideas-List.md => v3-Ideas-List.md | 59 +++++++++++++++++++++--
 1 file changed, 55 insertions(+), 4 deletions(-)
 rename Backend-Ideas-List.md => v3-Ideas-List.md (51%)

diff --git a/Backend-Ideas-List.md b/v3-Ideas-List.md
similarity index 51%
rename from Backend-Ideas-List.md
rename to v3-Ideas-List.md
index 3f19a9b..31f3aee 100644
--- a/Backend-Ideas-List.md
+++ b/v3-Ideas-List.md
@@ -6,11 +6,62 @@
 - Provides no benefit
 - Does still linger in the code base here and there
 
-## Migration to s6-overlay
+### Settings Updates
 
-- supervisord isn't meant to run as PID 1, S6 is
-- S6 startup can be separated into independent units, with dependencies between them, which could slightly improve startup time
-- Initial work done in https://github.com/paperless-ngx/paperless-ngx/tree/feature-s6-overlay
+- Remove all but Django settings from the environment
+- Separate OCR vs other settings
+- Create multiple levels of OCR settings:
+  - A default system configuration, controlled by staff/superusers
+  - A user-specific settings set
+  - The final settings used for OCR are then the combined set, with user, then default system settings
+- Allow workflows/matching to set certain settings:
+  - Document filename matches regex, disable archive generation and disable de-skew
+- When a document starts consumption, settings go through the pipeline with it, i.e. set once, not read (from DB) again
+
+### Regex Everywhere
+
+- Remove usages of `fnmatch` in favor of regex. There was a PR that implemented some sort of multiple matching, where regex could have solved it
+
+### Re-design parsing/consumption chain
+
+- Use chains/pipelines to actually break the consumption into multiple tasks
+- Results from one task move on to the next
+- An initial task takes the file, waits for it to be unmodified, then determines the next task to start.
+- Or alternatively, the initial task builds a pipeline and starts that.
+- Handles deciding if the file can be consumed, rather than when a new file is seen (see plugin ideas)
+
+### Actual Plugins
+
+- Design a system to allow plugins, while splitting apart the current code into plugins
+- I can see the following being plugins:
+  - Parsers (obviously. Includes things like AI/cloud OCR to get the content or even could talk to a remote, but local network API)
+  - Archive generation (example, use Gotenberg to convert a PDF to PDF/A instead of ocrmypdf)
+  - Thumbnail generation (maybe you want to handle PDFs differently than JPEGs?)
+  - Date parsing (handling non-Latin dates, for example)
+  - Machine learning (provides an interface which returns the proposed tags, type, etc)
+- Ideally, plugins should be registered when installed, declaring what MIME types they support
+- With the settings updates above, a workflow could also be used to set the parser based on matching certain values
+
+### Simpler consumer
+
+- Use something like [watchfiles](https://github.com/samuelcolvin/watchfiles) for a simpler loop with only itself as a dependency
+
+### Transition to Alpine container
+
+- Smaller image size
+- Faster update cadence
+
+### Ditch Celery for Huey
+
+- Celery is big and bulky, with support for memcached, SQS, etc, which we don't need
+- Huey also has nice Django integrations
+- Would need to use its signals to implement task tracking
+
+## Improved Tasks
+
+- Show scheduled tasks with next execution
+- Simple task status
+- Include more task types
 
 ## External Services
 