mirror of
https://github.com/paperless-ngx/paperless-ngx.git
synced 2025-06-30 16:04:40 -05:00
Updated Backend Ideas List (markdown)
parent
eabde26154
commit
aa26a44a12
@@ -6,11 +6,62 @@
|
|||||||
- Provides no benefit
|
- Provides no benefit
|
||||||
- Does still linger in the code base here and there
|
- Does still linger in the code base here and there
|
||||||
|
|
||||||
## Migration to s6-overlay
|
### Settings Updates
|
||||||
|
|
||||||
- supervisord isn't meant to run as PID 1; s6 is
|
- Remove all but Django settings from the environment
|
||||||
- s6 startup can be separated into independent units with dependencies between them, which could slightly improve startup time
|
- Separate OCR vs other settings
|
||||||
- Initial work done in https://github.com/paperless-ngx/paperless-ngx/tree/feature-s6-overlay
|
- Create multiple levels of OCR settings:
|
||||||
|
- A default system configuration, controlled by staff/superusers
|
||||||
|
- A user specific settings set
|
||||||
|
- The final settings used for OCR are then the combined set: user settings are applied first, then the default system settings fill in anything the user did not set
|
||||||
|
- Allow workflows/matching to set certain settings:
|
||||||
|
- Example: if the document filename matches a regex, disable archive generation and disable de-skew
|
||||||
|
- When a document starts consumption, its settings go through the pipeline with it, i.e. they are set once and not read from the DB again
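
The layering described in this list could be sketched roughly as follows. This is only an illustration: the setting names, the `resolve_ocr_settings` helper, and the choice to let workflow overrides win over user settings are all assumptions, not anything that exists in the code base.

```python
from collections import ChainMap
from types import MappingProxyType

# Hypothetical system-wide defaults (would be controlled by staff/superusers).
SYSTEM_DEFAULTS = {"language": "eng", "deskew": True, "create_archive": True}


def resolve_ocr_settings(user_settings, workflow_overrides=None):
    """Combine the layers once; earlier mappings win on key conflicts.

    Assumed precedence: workflow overrides, then user settings, then defaults.
    """
    layers = ChainMap(workflow_overrides or {}, user_settings, SYSTEM_DEFAULTS)
    # Snapshot into a read-only mapping so the result can travel through the
    # consumption pipeline without being re-read or modified along the way.
    return MappingProxyType(dict(layers))


# A workflow matched the filename and turned off archive generation and de-skew.
settings = resolve_ocr_settings(
    user_settings={"language": "deu"},
    workflow_overrides={"create_archive": False, "deskew": False},
)
print(dict(settings))
```

Resolving once at the start keeps the later steps free of database reads, which is the point of the last bullet.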
|
||||||
|
|
||||||
|
### Regex Everywhere
|
||||||
|
|
||||||
|
- Remove usages of `fnmatch` in favor of regex. There was a PR that implemented some sort of multiple matching, where regex could have solved it
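
As a rough illustration of why regex covers the multiple-matching case, the sketch below compares a glob with a single regex. The filename and patterns are made up for the example; `fnmatch.translate` is the standard-library helper that converts a glob into regex source, which might ease migrating existing patterns.

```python
import fnmatch
import re

filename = "invoice_2024-03.pdf"

# A glob can only express one shape of name at a time.
glob_hit = fnmatch.fnmatch(filename, "invoice_*.pdf")

# One regex can cover several alternatives and capture parts of the name.
pattern = re.compile(
    r"(?P<kind>invoice|receipt)_(?P<year>\d{4})-(?P<month>\d{2})\.pdf",
    re.IGNORECASE,
)
match = pattern.fullmatch(filename)

# Existing globs can be converted mechanically into equivalent regex source.
legacy = re.compile(fnmatch.translate("*.pdf"))

print(glob_hit, match.groupdict(), bool(legacy.match(filename)))
```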
|
||||||
|
|
||||||
|
### Re-design parsing/consumption chain
|
||||||
|
|
||||||
|
- Use chains/pipelines to actually break the consumption into multiple tasks
|
||||||
|
- Results from one task move on to the next
|
||||||
|
- An initial task takes the file, waits until it is no longer being modified, then determines the next task to start.
|
||||||
|
- Or alternatively, the initial task builds a pipeline and starts that.
|
||||||
|
- Handles deciding if the file can be consumed, rather than when a new file is seen (see plugin ideas)
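
A minimal sketch of the "initial task builds a pipeline" variant, in plain Python with no task queue involved. Every name here (`ConsumeContext`, `build_pipeline`, the step functions) is hypothetical; in a real task queue each step would presumably be its own task, linked with the queue's chaining mechanism.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ConsumeContext:
    """State handed from one step to the next."""
    path: str
    mime_type: str
    content: str = ""
    log: list = field(default_factory=list)


Step = Callable[[ConsumeContext], ConsumeContext]


def parse(ctx):
    ctx.content = f"text of {ctx.path}"  # placeholder for real parsing
    ctx.log.append("parse")
    return ctx


def make_archive(ctx):
    ctx.log.append("archive")
    return ctx


def make_thumbnail(ctx):
    ctx.log.append("thumbnail")
    return ctx


def build_pipeline(ctx):
    """The initial task: decide which steps this file needs."""
    steps = [parse]
    if ctx.mime_type == "application/pdf":
        steps.append(make_archive)
    steps.append(make_thumbnail)
    return steps


def run_pipeline(ctx):
    # The result of each step is the input of the next one.
    for step in build_pipeline(ctx):
        ctx = step(ctx)
    return ctx


result = run_pipeline(ConsumeContext("scan.pdf", "application/pdf"))
print(result.log)
```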
|
||||||
|
|
||||||
|
### Actual Plugins
|
||||||
|
|
||||||
|
- Design a system to allow plugins, while splitting apart the current code into plugins
|
||||||
|
- I can see the following being plugins:
|
||||||
|
- Parsers (obviously). This includes things like AI/cloud OCR to get the content, or even talking to a remote API on the local network
|
||||||
|
- Archive generation (for example, use Gotenberg to convert a PDF to PDF/A instead of ocrmypdf)
|
||||||
|
- Thumbnail generation (maybe you want to handle PDFs differently than JPEGs?)
|
||||||
|
- Date parsing (handling non-Latin dates, for example)
|
||||||
|
- Machine learning (provides an interface which returns the proposed tags, type, etc)
|
||||||
|
- Ideally, plugins should be registered when installed, declaring what mime types they support
|
||||||
|
- With the settings updates above, a workflow could also be used to set the parser based on matching certain values
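
One way registration by MIME type might look is a small decorator-based registry. This is a sketch under assumptions: `register_parser`, `get_parser`, and the `preferred` argument (standing in for a parser chosen by a workflow) are invented for illustration and do not describe an existing plugin API.

```python
from collections import defaultdict

# MIME type -> parser callables, in registration order.
_PARSERS = defaultdict(list)


def register_parser(*mime_types):
    """Decorator: declare which MIME types a parser plugin supports."""
    def decorator(func):
        for mime_type in mime_types:
            _PARSERS[mime_type].append(func)
        return func
    return decorator


@register_parser("application/pdf")
def pdf_parser(path):
    return f"pdf text from {path}"


@register_parser("image/jpeg", "image/png")
def image_parser(path):
    return f"ocr text from {path}"


@register_parser("application/pdf")
def remote_ocr_parser(path):
    return f"remote ocr text from {path}"


def get_parser(mime_type, preferred=None):
    """Pick a parser; `preferred` could come from a workflow setting."""
    candidates = _PARSERS.get(mime_type, [])
    if not candidates:
        raise LookupError(f"no parser registered for {mime_type}")
    for candidate in candidates:
        if candidate.__name__ == preferred:
            return candidate
    # Fall back to the first parser registered for this MIME type.
    return candidates[0]


print(get_parser("application/pdf", preferred="remote_ocr_parser")("scan.pdf"))
```

The same registry shape could serve the other plugin kinds in the list, with one registry per kind.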
|
||||||
|
|
||||||
|
### Simpler consumer
|
||||||
|
|
||||||
|
- Use something like [watchfiles](https://github.com/samuelcolvin/watchfiles) for a simpler loop with only itself as a dependency
|
||||||
|
|
||||||
|
### Transition to Alpine container
|
||||||
|
|
||||||
|
- Smaller image size
|
||||||
|
- Faster update cadence
|
||||||
|
|
||||||
|
### Ditch Celery for Huey
|
||||||
|
|
||||||
|
- Celery is big and bulky, with support for Memcached, SQS, etc., which we don't need
|
||||||
|
- Huey also has nice Django integrations
|
||||||
|
- Would need to use its signals to implement task tracking
|
||||||
|
|
||||||
|
## Improved Tasks
|
||||||
|
|
||||||
|
- Show scheduled tasks with next execution
|
||||||
|
- Simple task status
|
||||||
|
- Include more task types
|
||||||
|
|
||||||
## External Services
|
## External Services
|
||||||
|
|