From aa26a44a12783c2d5221348a366013116c963e63 Mon Sep 17 00:00:00 2001
From: Trenton H <797416+stumpylog@users.noreply.github.com>
Date: Sat, 28 Jun 2025 10:26:07 -0700
Subject: [PATCH] Updated Backend Ideas List (markdown)

---
 Backend-Ideas-List.md => v3-Ideas-List.md | 59 +++++++++++++++++++++--
 1 file changed, 55 insertions(+), 4 deletions(-)
 rename Backend-Ideas-List.md => v3-Ideas-List.md (51%)

diff --git a/Backend-Ideas-List.md b/v3-Ideas-List.md
similarity index 51%
rename from Backend-Ideas-List.md
rename to v3-Ideas-List.md
index 3f19a9b..31f3aee 100644
--- a/Backend-Ideas-List.md
+++ b/v3-Ideas-List.md
@@ -6,11 +6,62 @@
 - Provides no benefit
 - Does still linger in the code base here and there
 
-## Migration to s6-overlay
+### Settings Updates
 
-- supervisord isn't meant to run as PID 1, S6 is
-- s8 startup can be separated into independent units, with dependencies between them, which could slightly improve startup time
-- Initial work done in https://github.com/paperless-ngx/paperless-ngx/tree/feature-s6-overlay
+- Remove all but Django settings from the environment
+- Separate OCR vs other settings
+- Create multiple levels of OCR settings:
+  - A default system configuration, controlled by staff/superusers
+  - A user-specific set of settings
+  - The final settings used for OCR are then the combined set, with user settings taking precedence over the default system settings
+- Allow workflows/matching to set certain settings:
+  - Example: if the document filename matches a regex, disable archive generation and de-skew
+- When a document starts consumption, its settings travel through the pipeline with it, i.e. they are set once and not read from the DB again
+
+### Regex Everywhere
+
+- Replace usages of `fnmatch` with regex. There was a PR that implemented some sort of multiple matching, which regex could have solved
+
+### Re-design parsing/consumption chain
+
+- Use chains/pipelines to actually break the consumption into multiple tasks
+- Results from one task move on to the next
+- An initial task takes the file, waits for it to be unmodified, then determines the next task to start.
+- Or alternatively, the initial task builds a pipeline and starts that.
+- This handles deciding whether the file can be consumed, rather than making that decision when a new file is first seen (see plugin ideas)
+
+### Actual Plugins
+
+- Design a system to allow plugins, while splitting apart the current code into plugins
+- I can see the following being plugins:
+  - Parsers (obviously; this includes things like AI/cloud OCR to get the content, or even talking to a remote API on the local network)
+  - Archive generation (for example, use Gotenberg to convert a PDF to PDF/A instead of ocrmypdf)
+  - Thumbnail generation (maybe you want to handle PDFs differently than JPEGs?)
+  - Date parsing (handling non-Latin dates, for example)
+  - Machine learning (provides an interface which returns the proposed tags, type, etc)
+- Ideally, plugins should be registered when installed, declaring what mime types they support
+- With the settings updates above, a workflow could also be used to set the parser based on matching certain values
+
+### Simpler consumer
+
+- Use something like [watchfiles](https://github.com/samuelcolvin/watchfiles) for a simpler loop with only itself as a dependency
+
+### Transition to Alpine container
+
+- Smaller image size
+- Faster update cadence
+
+### Ditch Celery for Huey
+
+- Celery is big and bulky, with support for Memcached, SQS, etc., which we don't need
+- Huey also has nice Django integrations
+- Would need to use its signals to implement task tracking
+
+## Improved Tasks
+
+- Show scheduled tasks with next execution
+- Simple task status
+- Include more task types
 
 ## External Services