mirror of
https://github.com/paperless-ngx/paperless-ngx.git
synced 2025-06-30 16:04:40 -05:00
v3 Ideas List
Trenton H edited this page 2025-06-28 10:26:07 -07:00
Table of Contents
- Breaking Changes
- Removing GPG / Encryption
- Settings Updates
- Regex Everywhere
- Re-design parsing/consumption chain
- Actual Plugins
- Simpler consumer
- Transition to Alpine container
- Ditch celery for Huey
- Improved Tasks
- External Services
- Separate OCR from Archive
- Break apart consumer
- Settings Manager
- Django Ninja
- Vector Embeddings
- New Sanity Checker
Breaking Changes
Removing GPG / Encryption
- Encrypting documents has been unsupported since 0.9, many years ago
- Provides no benefit
- Does still linger in the code base here and there
Settings Updates
- Remove all but Django settings from the environment
- Separate OCR vs other settings
- Create multiple levels of OCR settings:
  - A default system configuration, controlled by staff/superusers
  - A user-specific set of settings
  - The final settings used for OCR are then the combined set, with user settings taking precedence over the default system settings
- Allow workflows/matching to set certain settings:
  - Example: when a document filename matches a regex, disable archive generation and disable de-skew
- When a document starts consumption, its settings go through the pipeline with it, i.e. they are set once and not read (from the DB) again
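The matching and "set once" ideas above could look roughly like the following. This is only a sketch: `OcrSettings`, its fields, and the `RULES` table are hypothetical names invented for illustration, not anything in the current code base.

```python
import re
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class OcrSettings:
    # Hypothetical fields; the real setting names are not decided here.
    generate_archive: bool = True
    deskew: bool = True


# Hypothetical rules: (filename pattern, overrides to apply on a match).
RULES = [
    (
        re.compile(r"^receipt_.*\.pdf$", re.IGNORECASE),
        {"generate_archive": False, "deskew": False},
    ),
]


def settings_for(filename: str, base: OcrSettings) -> OcrSettings:
    """Build the settings snapshot that travels with one document."""
    result = base
    for pattern, overrides in RULES:
        if pattern.search(filename):
            # replace() returns a new frozen instance; base is untouched.
            result = replace(result, **overrides)
    return result
```

Because the dataclass is frozen, later pipeline steps cannot change the snapshot by accident, which fits the goal of resolving settings once at the start of consumption.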
Regex Everywhere
- Remove usages of `fnmatch` in favor of regex. There was a PR that implemented some sort of multiple matching, where regex could have solved it
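As a rough illustration of why regex could cover the "multiple matching" case, the sketch below compares a list of glob patterns with a single regex using alternation. The patterns themselves are made-up examples, not the project's actual ignore list.

```python
import fnmatch
import re

# Hypothetical ignore patterns, expressed as several globs.
GLOBS = ["*.tmp", "~$*", ".DS_Store"]


def ignored_by_globs(name: str) -> bool:
    """Glob approach: one fnmatch call per pattern."""
    return any(fnmatch.fnmatchcase(name, glob) for glob in GLOBS)


# The same three patterns as one regex with alternation.
IGNORE_RE = re.compile(r"(?:.*\.tmp|~\$.*|\.DS_Store)")


def ignored_by_regex(name: str) -> bool:
    """Regex approach: a single user-supplied expression covers all cases."""
    return IGNORE_RE.fullmatch(name) is not None
```

With regex, a user can express "any of these" in one setting value, so no extra multiple-pattern mechanism is needed.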
Re-design parsing/consumption chain
- Use chains/pipelines to actually break the consumption into multiple tasks
- Results from one task move on to the next
- An initial task takes the file, waits for it to be unmodified, then determines the next task to start.
- Or alternatively, the initial task builds a pipeline and starts that.
- Handles deciding if the file can be consumed, rather than when a new file is seen (see plugin ideas)
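The "waits for it to be unmodified" step of the initial task might be sketched as a simple polling loop. The function name and all timing parameters below are assumptions chosen for illustration.

```python
import os
import time


def wait_until_stable(
    path: str,
    quiet_period: float = 1.0,
    poll_interval: float = 0.2,
    timeout: float = 30.0,
) -> bool:
    """Return True once size and mtime stay unchanged for quiet_period.

    Returns False if the file disappears or the timeout expires.
    """
    deadline = time.monotonic() + timeout
    last_signature = None
    stable_since = time.monotonic()
    while time.monotonic() < deadline:
        try:
            stat = os.stat(path)
        except FileNotFoundError:
            return False
        signature = (stat.st_size, stat.st_mtime_ns)
        now = time.monotonic()
        if signature != last_signature:
            # The file changed (or this is the first look): restart the clock.
            last_signature = signature
            stable_since = now
        elif now - stable_since >= quiet_period:
            return True
        time.sleep(poll_interval)
    return False
```

Once this returns True, the initial task could decide which task to start next or build the pipeline, as described in the list above.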
Actual Plugins
- Design a system to allow plugins, while splitting apart the current code into plugins
- I can see the following being plugins:
  - Parsers (obviously). This includes things like AI/cloud OCR to get the content, or a parser that talks to a remote API on the local network
  - Archive generation (for example, use Gotenberg to convert a PDF to PDF/A instead of ocrmypdf)
  - Thumbnail generation (maybe you want to handle PDFs differently than JPEGs?)
  - Date parsing (handling non-Latin dates, for example)
  - Machine learning (provides an interface which returns the proposed tags, type, etc)
- Ideally, plugins should be registered when installed, declaring what mime types they support
- With the settings updates above, a workflow could also be used to set the parser based on matching certain values
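One possible shape for "plugins register themselves and declare their mime types" is a small decorator-based registry. Everything named here (`register_parser`, `parsers_for`, `parse_text`) is hypothetical; a real design might instead discover installed plugins through package entry points.

```python
from typing import Callable, Dict, List

# A parser takes the raw file bytes and returns the extracted text.
ParserFn = Callable[[bytes], str]

# Hypothetical registry: mime type -> parsers that declared support for it.
_REGISTRY: Dict[str, List[ParserFn]] = {}


def register_parser(*mime_types: str) -> Callable[[ParserFn], ParserFn]:
    """Decorator that records which mime types a parser supports."""

    def decorator(fn: ParserFn) -> ParserFn:
        for mime_type in mime_types:
            _REGISTRY.setdefault(mime_type, []).append(fn)
        return fn

    return decorator


@register_parser("text/plain")
def parse_text(data: bytes) -> str:
    """Trivial example plugin: decode plain text."""
    return data.decode("utf-8", errors="replace")


def parsers_for(mime_type: str) -> List[ParserFn]:
    """Return a copy of the parsers registered for a mime type."""
    return list(_REGISTRY.get(mime_type, []))
```

When several parsers support the same mime type, a workflow match (as mentioned above) could be what selects one of them.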
Simpler consumer
- Use something like watchfiles for a simpler loop with only itself as a dependency
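A minimal loop built on watchfiles could look like the sketch below. The filtering rules and function names are assumptions; only `watch` and `Change` come from the watchfiles package, which is imported lazily so the helper can be used without it installed.

```python
from pathlib import Path

# Hypothetical suffixes for files that are still being written or uploaded.
IGNORED_SUFFIXES = {".tmp", ".part"}


def should_consume(path: str) -> bool:
    """Skip hidden files and files with a temporary suffix."""
    candidate = Path(path)
    if candidate.name.startswith("."):
        return False
    return candidate.suffix.lower() not in IGNORED_SUFFIXES


def run_consumer(directory: str) -> None:
    """Block forever, reporting files that look ready for consumption."""
    # Third-party dependency: pip install watchfiles
    from watchfiles import Change, watch

    for changes in watch(directory):
        # Each iteration yields a set of (Change, path) tuples.
        for change, path in changes:
            if change in (Change.added, Change.modified) and should_consume(path):
                print(f"would queue {path} for consumption")
```

Calling `run_consumer("/path/to/consume")` (path is a placeholder) starts the loop; in a real consumer the `print` would be replaced by queuing a task.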
Transition to Alpine container
- Smaller image size
- Faster update cadence
Ditch celery for Huey
- Celery is big and bulky, with support for memcached, sqs, etc, which we don't need
- Huey also has nice Django integrations
- Would need to use its signals to implement task tracking
Improved Tasks
- Show scheduled tasks with next execution
- Simple task status
- Include more task types
External Services
External OCR
- External OCR services, using an API, could provide more recent tesseract and ghostscript versions, potentially fixing issues faster than Debian updates (thinking Alpine based image)
- The service would be streamed the document and would eventually return the content and an optional archive file
- Is time consuming, so might need celery/huey/task queue there? And a database?
- fastapi could easily set this up, if there is no need for a database.
- Could use Redis/Valkey streams to manage state and show progress without a database
External Machine Learning
- Again, define an API that the service provides so it could be swapped out
- Provided the content, suggests the tags, correspondents, etc
- External allows it to be hosted on a larger resourced machine
- Needs a task queue for scheduled training?
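To make the "define an API so it could be swapped out" idea concrete, here is a sketch of the interface on the Python side. `Suggestions`, `SuggestionService`, and `KeywordSuggester` are hypothetical names; the keyword matcher is only a stand-in for a real model or remote service.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Protocol


@dataclass
class Suggestions:
    tags: List[str] = field(default_factory=list)
    correspondent: Optional[str] = None
    document_type: Optional[str] = None


class SuggestionService(Protocol):
    """Anything with this method can act as the suggestion backend."""

    def suggest(self, content: str) -> Suggestions: ...


class KeywordSuggester:
    """Stand-in backend: tags a document when a keyword appears in it."""

    def __init__(self, keyword_to_tag: Dict[str, str]) -> None:
        self._keyword_to_tag = keyword_to_tag

    def suggest(self, content: str) -> Suggestions:
        lowered = content.lower()
        tags = sorted(
            {tag for keyword, tag in self._keyword_to_tag.items() if keyword in lowered}
        )
        return Suggestions(tags=tags)


def classify(service: SuggestionService, content: str) -> Suggestions:
    """Caller code depends only on the protocol, not on a concrete backend."""
    return service.suggest(content)
```

A client that calls a remote service over HTTP could implement the same `suggest` method, so the caller would not need to change.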
Separate OCR from Archive
- Extracting the content of an image or PDF document should be separated from the generation of an archive file
- Just too many interactions between them, leading to odd combinations
Break apart consumer
- The consumer does so much stuff, break it apart into smaller, more discrete steps
- Make each step well defined with possible status/states to report over the websocket and/or notifications
- Make it a chain of tasks, passing a package through which accumulates data, etc, before being saved
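The chain described above might be sketched as plain functions passing a package along. The step names, the `Package` fields, and the `report` callback (standing in for websocket or notification updates) are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Package:
    """Accumulates data as it moves through the chain."""

    filename: str
    data: Dict[str, str] = field(default_factory=dict)
    completed_steps: List[str] = field(default_factory=list)


Step = Callable[[Package], Package]


def detect_mime(pkg: Package) -> Package:
    # Placeholder logic: a real step would inspect the file contents.
    is_pdf = pkg.filename.lower().endswith(".pdf")
    pkg.data["mime_type"] = "application/pdf" if is_pdf else "application/octet-stream"
    return pkg


def extract_text(pkg: Package) -> Package:
    # Placeholder logic: a real step would call the selected parser.
    pkg.data["content"] = f"<text of {pkg.filename}>"
    return pkg


def run_chain(
    pkg: Package,
    steps: List[Step],
    report: Callable[[str], None] = lambda status: None,
) -> Package:
    """Run each step in order, reporting a status before and after it."""
    for step in steps:
        report(f"{step.__name__}: started")
        pkg = step(pkg)
        pkg.completed_steps.append(step.__name__)
        report(f"{step.__name__}: finished")
    return pkg
```

In a task queue, each step would be its own task and the package would need to be serializable, but the data flow would be the same.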
Settings Manager
- Allow multiple levels of settings to be defined
  - From matching, apply certain settings
  - From the user (if known), apply their settings
  - From the system wide settings
  - From environment variable settings
  - Then defaults
- Settings at lower levels have lower priority, so a matched setting is never overridden by a lower level
- Settings travel through the new consumer with the document
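The priority order above maps naturally onto `collections.ChainMap`, where the first mapping containing a key wins. The setting names and values below are invented examples.

```python
from collections import ChainMap
from typing import Dict

# Hypothetical values for each level, lowest priority first.
defaults = {"ocr_language": "eng", "deskew": True, "generate_archive": True}
environment = {"ocr_language": "deu"}
system = {"deskew": False}
user = {"ocr_language": "fra"}
matched = {"generate_archive": False}


def resolve(
    matched: Dict[str, object],
    user: Dict[str, object],
    system: Dict[str, object],
    environment: Dict[str, object],
    defaults: Dict[str, object],
) -> Dict[str, object]:
    """Flatten the levels into one dict; earlier arguments take priority."""
    return dict(ChainMap(matched, user, system, environment, defaults))


resolved = resolve(matched, user, system, environment, defaults)
```

Flattening to a plain dict at the start of consumption gives the fixed snapshot that can then travel through the consumer with the document.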
Django Ninja
- Really like the OpenAPI spec it generates
- async support for databases
- Strongly typed and validated with Pydantic
Blockers
- Would need to implement Token based authentication
  - Could track, with some resolution, when a token was last used. Might be nice to display and allow removing old tokens which haven't been used
  - Could implement expiration too
- Async pagination isn't working quite yet
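Tracking token usage "with some resolution" could mean only writing the timestamp when the stored value is old enough, which avoids a database write on every request. The sketch below uses a plain dataclass rather than a Django model; `ApiToken`, `RESOLUTION`, and the helper names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Assumed resolution: record usage at most once per hour.
RESOLUTION = timedelta(hours=1)


@dataclass
class ApiToken:
    key: str
    last_used: Optional[datetime] = None
    expires_at: Optional[datetime] = None


def touch(token: ApiToken, now: datetime) -> bool:
    """Update last_used only if it is older than RESOLUTION.

    Returns True when a write happened, False when it was skipped.
    """
    if token.last_used is None or now - token.last_used >= RESOLUTION:
        token.last_used = now
        return True
    return False


def is_expired(token: ApiToken, now: datetime) -> bool:
    """Tokens without an expiry never expire."""
    return token.expires_at is not None and now >= token.expires_at


def stale_tokens(tokens: List[ApiToken], now: datetime, max_idle: timedelta) -> List[ApiToken]:
    """Tokens never used, or unused for longer than max_idle."""
    return [t for t in tokens if t.last_used is None or now - t.last_used > max_idle]
```

The list returned by `stale_tokens` is what a UI could display to let users remove old tokens.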
Vector Embeddings
- This would require either a new database or everyone to use the same database (à la Immich with pgvecto.rs)
- Would enable semantic search, document similarity search
- Maybe replacing Whoosh entirely?
New Sanity Checker
- Sanity checker messages are attached to a document
- Can be dismissed (but still viewed)
- Visible in the UI somehow