mirror of
https://github.com/paperless-ngx/paperless-ngx.git
synced 2025-03-31 13:35:08 -05:00
Backend Ideas List
Trenton H edited this page 2025-01-14 14:46:45 -08:00
Breaking Changes
Removing GPG / Encryption
- Encrypting documents has been unsupported since 0.9, many years ago
- Provides no benefit
- Does still linger in the code base here and there
Migration to s6-overlay
- supervisord isn't meant to run as PID 1; s6 is
- s6 startup can be separated into independent units, with dependencies between them, which could slightly improve startup time
- Initial work done in https://github.com/paperless-ngx/paperless-ngx/tree/feature-s6-overlay
External Services
External OCR
- An external OCR service, reached through an API, could ship more recent Tesseract and Ghostscript versions, potentially fixing issues faster than Debian updates do (thinking of an Alpine-based image)
- The document would be streamed to the service, which would eventually return the content and an optional archive file
- OCR is time-consuming, so the service might need a task queue (Celery, Huey, or similar), and possibly a database?
- FastAPI could easily set this up, if there is no need for a database.
- Could use Redis/Valkey streams to manage state and show progress without a database
External Machine Learning
- Again, define an API that the service provides so it could be swapped out
- Given the content, the service suggests tags, correspondents, etc
- Running it externally allows it to be hosted on a machine with more resources
- Needs a task queue for scheduled training?
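
A minimal sketch of what "define an API so it could be swapped out" might look like on the Python side. Every name here (`Suggestions`, `SuggestionService`, `KeywordService`, `classify`) is hypothetical and not part of paperless-ngx; the keyword matcher is only a toy stand-in for a real model or remote service.

```python
from dataclasses import dataclass, field
from typing import List, Protocol


@dataclass
class Suggestions:
    """Hypothetical response shape: names suggested for a document."""
    tags: List[str] = field(default_factory=list)
    correspondents: List[str] = field(default_factory=list)


class SuggestionService(Protocol):
    """Any object with this method could act as the backend."""

    def suggest(self, content: str) -> Suggestions: ...


class KeywordService:
    """Toy stand-in for a real model: matches hard-coded keywords."""

    def suggest(self, content: str) -> Suggestions:
        text = content.lower()
        tags = ["invoice"] if "invoice" in text else []
        correspondents = ["ACME"] if "acme" in text else []
        return Suggestions(tags=tags, correspondents=correspondents)


def classify(service: SuggestionService, content: str) -> Suggestions:
    # The caller depends only on the protocol, not on a concrete service,
    # so a remote HTTP client could replace KeywordService unchanged.
    return service.suggest(content)


result = classify(KeywordService(), "Invoice 42 from ACME Corp")
```

A client that calls a remote machine would only need to provide the same `suggest` method.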
Separate OCR from Archive
- Extracting the content of an image or PDF document should be separated from generating an archive file
- Just too many interactions between them, leading to odd combinations
Break apart consumer
- The consumer does so much stuff, break it apart into smaller, more discrete steps
- Make each step well defined with possible status/states to report over the websocket and/or notifications
- Make it a chain of tasks, passing a package through which accumulates data, etc, before being saved
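
The chain-of-tasks idea above could be sketched roughly as follows. This is an illustration only: `Package`, `detect_mime`, `extract_title` and `run_chain` are made-up names, and the status strings stand in for whatever would be reported over the websocket.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Package:
    """Hypothetical container that accumulates data between steps."""
    filename: str
    data: Dict[str, str] = field(default_factory=dict)
    statuses: List[str] = field(default_factory=list)


Step = Callable[[Package], Package]


def detect_mime(pkg: Package) -> Package:
    # Placeholder logic: a real step would inspect the file contents.
    is_pdf = pkg.filename.endswith(".pdf")
    pkg.data["mime"] = "application/pdf" if is_pdf else "unknown"
    return pkg


def extract_title(pkg: Package) -> Package:
    # Use the filename without its extension as the title.
    pkg.data["title"] = pkg.filename.rsplit(".", 1)[0]
    return pkg


def run_chain(pkg: Package, steps: List[Step]) -> Package:
    # Each step reports a well-defined state before and after it runs.
    for step in steps:
        pkg.statuses.append(f"{step.__name__}: started")
        pkg = step(pkg)
        pkg.statuses.append(f"{step.__name__}: done")
    return pkg


pkg = run_chain(Package("scan.pdf"), [detect_mime, extract_title])
```

Nothing is saved until the chain finishes, so a failing step leaves only the in-memory package to discard.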
Settings Manager
- Allow multiple levels of settings to be defined
- From matching, apply certain settings
- From the user (if known), apply their settings
- From the system wide settings
- From environment variable settings
- Then defaults
- Settings at lower levels have lower priority, so a matched setting is never overridden
- Settings travel through the new consumer with the document
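
One possible way to model the priority order above is Python's standard `collections.ChainMap`, which returns the value from the first mapping that contains a key. The setting names and values below are invented for illustration.

```python
from collections import ChainMap

# Hypothetical setting names; lowest priority first.
defaults = {"ocr_language": "eng", "output_type": "pdfa", "ocr_pages": 0}
environment = {"ocr_language": "deu"}
system_wide = {"output_type": "pdf"}
user = {"ocr_language": "fra"}
matched = {"ocr_pages": 1}

# ChainMap searches left to right, so the highest priority level goes first.
settings = ChainMap(matched, user, system_wide, environment, defaults)

resolved = dict(settings)
```

Here `resolved` takes `ocr_pages` from the match, `ocr_language` from the user, and `output_type` from the system-wide level. A flattened `dict` like this is easy to pass along with the document.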
Django Ninja
- Really like the OpenAPI spec it generates
- async support for databases
- Strongly typed and validated with Pydantic
Blockers
- Would need to implement token-based authentication
- Could track, at some coarse resolution, when a token was last used. It might be nice to display this and allow removing old tokens which haven't been used
- Could implement expiration too
- Async pagination isn't working quite yet
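
The token ideas above (coarse last-used tracking, expiration, finding unused tokens) could look something like this framework-independent sketch. `ApiToken`, `issue`, `authenticate`, `is_stale` and the chosen durations are all assumptions, not Django Ninja or paperless-ngx APIs.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed resolution: record usage at most once per hour, so that
# authentication does not trigger a write on every request.
RESOLUTION = timedelta(hours=1)


@dataclass
class ApiToken:
    key: str
    expires_at: datetime
    last_used: Optional[datetime] = None


def issue(now: datetime, lifetime: timedelta = timedelta(days=90)) -> ApiToken:
    return ApiToken(key=secrets.token_hex(20), expires_at=now + lifetime)


def authenticate(token: ApiToken, now: datetime) -> bool:
    if now >= token.expires_at:
        return False
    if token.last_used is None or now - token.last_used >= RESOLUTION:
        token.last_used = now
    return True


def is_stale(token: ApiToken, now: datetime,
             max_idle: timedelta = timedelta(days=30)) -> bool:
    # Candidates for removal: never used, or idle for longer than max_idle.
    return token.last_used is None or now - token.last_used > max_idle


start = datetime(2025, 1, 1, tzinfo=timezone.utc)
token = issue(start)
accepted = authenticate(token, start)
```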
Vector Embeddings
- This would require either a new database or everyone to use the same database (à la Immich with pgvecto.rs)
- Would enable semantic search, document similarity search
- Maybe replacing Whoosh entirely?
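
To illustrate what document similarity search over embeddings means, here is a small pure-Python sketch using cosine similarity. The three-dimensional vectors are invented; a real setup would get embeddings from a model and most likely delegate the search to the database.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query: List[float], embeddings: Dict[int, List[float]],
                 top_k: int = 2) -> List[int]:
    # Rank document ids by similarity to the query vector, best first.
    scored = sorted(
        embeddings.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:top_k]]


# Made-up embeddings keyed by document id.
embeddings = {
    1: [0.9, 0.1, 0.0],
    2: [0.0, 1.0, 0.0],
    3: [0.8, 0.2, 0.1],
}
ranked = most_similar([1.0, 0.0, 0.0], embeddings)
```

Semantic search would embed the query text with the same model, then rank documents the same way.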
New Sanity Checker
- Sanity checker messages are attached to a document
- Can be dismissed (but still viewed)
- Visible in the UI somehow
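
A rough data-model sketch for the points above: messages belong to a document and can be dismissed without being deleted. `SanityMessage` and `SanityLog` are hypothetical names, and a real implementation would presumably be a Django model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SanityMessage:
    document_id: int
    text: str
    dismissed: bool = False


@dataclass
class SanityLog:
    messages: List[SanityMessage] = field(default_factory=list)

    def add(self, document_id: int, text: str) -> SanityMessage:
        message = SanityMessage(document_id, text)
        self.messages.append(message)
        return message

    def for_document(self, document_id: int,
                     include_dismissed: bool = False) -> List[SanityMessage]:
        # Dismissed messages are kept and can still be listed on request.
        return [
            m for m in self.messages
            if m.document_id == document_id
            and (include_dismissed or not m.dismissed)
        ]


log = SanityLog()
missing = log.add(7, "Archive file is missing")
log.add(7, "Checksum mismatch")
missing.dismissed = True

active = log.for_document(7)
everything = log.for_document(7, include_dismissed=True)
```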