paperless-ngx

mirror of https://github.com/paperless-ngx/paperless-ngx.git synced 2025-06-30 16:04:40 -05:00

Table of Contents

Breaking Changes

Removing GPG / Encryption
Settings Updates
Regex Everywhere
Re-design parsing/consumption chain
Actual Plugins
Simpler consumer
Transition to Alpine container
Ditch celery for Huey

Improved Tasks
External Services

External OCR
External Machine Learning

Separate OCR from Archive
Break apart consumer
Settings Manager
Django Ninja

Blockers

Vector Embeddings
New Sanity Checker

Breaking Changes

Removing GPG / Encryption

Encrypting documents has been unsupported since 0.9, released many years ago
It provides no benefit
Remnants of it still linger in the code base here and there

Settings Updates

Remove all but Django settings from the environment
Separate OCR vs other settings
Create multiple levels of OCR settings:
- A default system configuration, controlled by staff/superusers
- A user specific settings set
- The final settings used for OCR are then the combined set: user settings first, falling back to the default system settings
Allow workflows/matching to set certain settings:
- Document filename matches regex, disable archive generation and disable de-skew
When a document starts consumption, its settings go through the pipeline with it, i.e. they are set once and not read (from the DB) again
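
A minimal sketch of the "set once" idea, assuming hypothetical setting names (`language`, `deskew`, `generate_archive`) and a made-up filename rule; none of these come from the code base. The settings are resolved into a frozen object when consumption starts, so later steps cannot change them or need to re-read them:

```python
import re
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class OcrSettings:
    """Hypothetical OCR settings; the field names are illustrative only."""

    language: str = "eng"
    deskew: bool = True
    generate_archive: bool = True


def resolve(system: dict, user: dict) -> OcrSettings:
    # User values win over the system configuration set by staff/superusers.
    return OcrSettings(**{**system, **user})


def apply_workflow(settings: OcrSettings, filename: str) -> OcrSettings:
    # Example rule (assumed): receipts skip archive generation and de-skew.
    if re.search(r"^receipt_.*\.pdf$", filename):
        return replace(settings, generate_archive=False, deskew=False)
    return settings


# Resolved once at the start of consumption, then passed along unchanged.
settings = apply_workflow(
    resolve(system={"language": "deu"}, user={"deskew": False}),
    "receipt_2024.pdf",
)
```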

Regex Everywhere

Replace usages of fnmatch with regex. There was a PR that implemented some sort of multiple matching, which a regex could have solved directly
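
As a rough illustration (the patterns are made up), the stdlib can translate an existing glob into a regex for migration, and a single regex with alternation covers what would otherwise need several fnmatch patterns:

```python
import fnmatch
import re

# An existing glob pattern can be migrated mechanically with the stdlib helper.
legacy = re.compile(fnmatch.translate("*.pdf"))

# One regex replaces multiple glob patterns ("*.pdf", "*.png", "*.jpg", ...).
multi = re.compile(r".*\.(pdf|png|jpe?g)$", re.IGNORECASE)


def is_match(pattern: re.Pattern, filename: str) -> bool:
    """Return True if the whole filename matches the pattern."""
    return pattern.fullmatch(filename) is not None
```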

Re-design parsing/consumption chain

Use chains/pipelines to actually break the consumption into multiple tasks
Results from one task move on to the next
An initial task takes the file, waits for it to be unmodified, then determines the next task to start.
Or alternatively, the initial task builds a pipeline and starts that.
The initial task decides whether the file can be consumed, rather than that decision being made when a new file is first seen (see plugin ideas)
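
A sketch of what such an initial task might do, assuming "unmodified" means size and mtime are identical across two polls. The step names returned by `build_pipeline` are placeholders, not existing tasks:

```python
import time
from pathlib import Path


def wait_until_stable(path: Path, interval: float = 0.5, timeout: float = 30.0) -> bool:
    """Return True once size and mtime are identical across two consecutive polls."""
    deadline = time.monotonic() + timeout
    previous = None
    while time.monotonic() < deadline:
        try:
            stat = path.stat()
        except FileNotFoundError:
            return False  # the file vanished while waiting
        current = (stat.st_size, stat.st_mtime_ns)
        if current == previous:
            return True
        previous = current
        time.sleep(interval)
    return False  # still changing when the timeout expired


def build_pipeline(path: Path) -> list[str]:
    """Pick the follow-up steps for a file; the names are placeholders."""
    if path.suffix.lower() == ".pdf":
        return ["parse_pdf", "generate_archive", "thumbnail", "save"]
    return ["parse_image", "thumbnail", "save"]
```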

Actual Plugins

Design a system to allow plugins, while splitting apart the current code into plugins
I can see the following being plugins:
- Parsers (obviously). This includes things like AI/cloud OCR to get the content, or even a parser that talks to a remote API on the local network
- Archive generation (example, use Gotenberg to convert a PDF to PDF/A instead of ocrmypdf)
- Thumbnail generation (maybe you want to handle PDFs differently than JPEGs?)
- Date parsing (handling non-latin dates, for example)
- Machine learning (provides an interface which returns the proposed tags, type, etc)
Ideally, plugins should be registered when installed, declaring what mime types they support
With the settings updates above, a workflow could also be used to set the parser based on matching certain values
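
One possible shape for registration, as a sketch only: the `register` decorator, the `Parser` protocol and `PlainTextParser` are all hypothetical names, and a real system might discover plugins at install time instead of using a decorator:

```python
from collections import defaultdict
from typing import Protocol


class Parser(Protocol):
    """Hypothetical parser interface: declares mime types, returns content."""

    mime_types: tuple[str, ...]

    def parse(self, data: bytes) -> str: ...


_registry: dict[str, list[type]] = defaultdict(list)


def register(plugin_cls: type) -> type:
    """Class decorator: record the plugin under every mime type it declares."""
    for mime in plugin_cls.mime_types:
        _registry[mime].append(plugin_cls)
    return plugin_cls


@register
class PlainTextParser:
    mime_types = ("text/plain",)

    def parse(self, data: bytes) -> str:
        return data.decode("utf-8")


def parsers_for(mime_type: str) -> list:
    """Instantiate every registered parser that supports the mime type."""
    return [cls() for cls in _registry.get(mime_type, [])]
```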

Simpler consumer

Use something like watchfiles for a simpler loop with only itself as a dependency
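
A rough sketch of such a loop, assuming the `watch()` generator from watchfiles, which yields sets of `(Change, path)` tuples. The suffix filter and the `added_files` helper are made up for illustration:

```python
from pathlib import Path


def added_files(change_sets, suffixes=(".pdf", ".png", ".jpg")):
    """Yield paths of newly added files from watchfiles-style change sets."""
    for changes in change_sets:
        # Each batch is a set of (Change, path) tuples; sort for stable order.
        for change, raw_path in sorted(changes, key=lambda item: item[1]):
            if change.name == "added" and Path(raw_path).suffix.lower() in suffixes:
                yield Path(raw_path)


def run(directory: str) -> None:
    # Imported lazily so the helper above stays usable without the package.
    from watchfiles import watch

    # watch() blocks and yields one batch of changes at a time.
    for path in added_files(watch(directory)):
        print(f"would queue {path} for consumption")
```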

Transition to Alpine container

Smaller image size
Faster update cadence

Ditch celery for Huey

Celery is big and bulky, with support for memcached, SQS, etc., which we don't need
Huey also has nice Django integrations
Would need to use its signals to implement task tracking
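
A sketch of signal-based tracking, assuming Huey's documented handler signature `(signal, task, exc=None)` and its `Huey.signal()` decorator. The in-memory `task_states` store is a stand-in; a real version would presumably write to a database model:

```python
from datetime import datetime, timezone

# Hypothetical in-memory store standing in for a task-tracking model.
task_states: dict[str, dict] = {}


def track_task(signal, task, exc=None):
    """Record the most recent signal seen for each task id."""
    task_states[task.id] = {
        "name": task.name,
        "state": signal,  # signals are plain strings such as "complete"
        "error": repr(exc) if exc is not None else None,
        "updated": datetime.now(timezone.utc),
    }


def register(huey) -> None:
    # Calling signal() with no arguments subscribes the handler to all signals.
    huey.signal()(track_task)
```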

Improved Tasks

Show scheduled tasks with next execution
Simple task status
Include more task types

External Services

External OCR

External OCR services, using an API, could provide more recent tesseract and ghostscript versions, potentially fixing issues faster than Debian updates (thinking Alpine based image)
The service would be streamed the document and would eventually return the content and an optional archive file
OCR is time consuming, so it might need celery/huey/a task queue there? And a database?
FastAPI could easily handle this, if there is no need for a database.
Could use Redis/Valkey streams to manage state and show progress without a database

External Machine Learning

Again, define an API that the service provides so it could be swapped out
Provided the content, suggests the tags, correspondents, etc
Running it externally allows it to be hosted on a machine with more resources
Needs a task queue for scheduled training?
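
To make the "swappable" part concrete, here is a sketch of what the interface could look like. `Suggestions`, `SuggestionService` and the toy `KeywordService` are hypothetical names; a remote implementation would make an HTTP call inside `suggest` instead:

```python
from dataclasses import dataclass, field
from typing import Optional, Protocol


@dataclass
class Suggestions:
    tags: list[str] = field(default_factory=list)
    correspondent: Optional[str] = None
    document_type: Optional[str] = None


class SuggestionService(Protocol):
    """Anything with this method can be plugged in, local or remote."""

    def suggest(self, content: str) -> Suggestions: ...


class KeywordService:
    """Toy stand-in for a real model: maps keywords to tags."""

    def __init__(self, keywords: dict[str, str]):
        self.keywords = keywords

    def suggest(self, content: str) -> Suggestions:
        lowered = content.lower()
        tags = sorted({tag for word, tag in self.keywords.items() if word in lowered})
        return Suggestions(tags=tags)


def classify(service: SuggestionService, content: str) -> Suggestions:
    # The caller only depends on the interface, not on the implementation.
    return service.suggest(content)
```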

Separate OCR from Archive

Extracting the content of an image or PDF document should be separated from generating an archive file
Just too many interactions between them, leading to odd combinations

Break apart consumer

The consumer does so much stuff, break it apart into smaller, more discrete steps
Make each step well defined with possible status/states to report over the websocket and/or notifications
Make it a chain of tasks, passing a package through which accumulates data, etc, before being saved
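
A minimal sketch of the chain, with made-up step names and a `report` callback standing in for the websocket/notification layer. Each step mutates the package, and every state change is reported:

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Package:
    """Travels through the chain and accumulates data before saving."""

    filename: str
    data: dict = field(default_factory=dict)


def detect_mime(pkg: Package) -> None:
    is_pdf = pkg.filename.lower().endswith(".pdf")
    pkg.data["mime_type"] = "application/pdf" if is_pdf else "application/octet-stream"


def extract_text(pkg: Package) -> None:
    # Placeholder for the real parsing step.
    pkg.data["content"] = f"<text of {pkg.filename}>"


def run_chain(
    pkg: Package,
    steps: list[Callable[[Package], None]],
    report: Callable[[str, str], None],
) -> Package:
    for step in steps:
        report(step.__name__, "started")
        try:
            step(pkg)
        except Exception:
            report(step.__name__, "failed")
            raise
        report(step.__name__, "done")
    return pkg
```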

Settings Manager

Allow multiple levels of settings to be defined
- From matching, apply certain settings
- From the user (if known), apply their settings
- From the system wide settings
- From environment variable settings
- Then defaults
Settings at lower levels have lower priority, so a matched setting is never overridden
Settings travel through the new consumer with the document
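
The lookup order maps fairly directly onto the stdlib `ChainMap`, which searches its maps left to right. The setting names and values below are invented for illustration:

```python
from collections import ChainMap

# Lowest to highest priority; all keys and values are made up.
defaults = {"deskew": True, "generate_archive": True, "language": "eng"}
environment = {"language": "deu"}
system = {"generate_archive": True}
user = {"deskew": False}
matched = {"generate_archive": False}

# ChainMap returns the first hit, so earlier maps have higher priority.
effective = dict(ChainMap(matched, user, system, environment, defaults))
```

Snapshotting the result into a plain `dict` gives the consumer a fixed set of values to carry along with the document.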

Django Ninja

Really like the OpenAPI spec it generates
Async support for databases
Strongly typed and validated with Pydantic

Blockers

Would need to implement Token based authentication
- Could track, with some resolution, when a token was last used. Might be nice to display and allow removing old tokens which haven't been used
- Could implement expiration too
Async pagination isn't working quite yet
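
A framework-independent sketch of the token ideas, assuming only a hash of the token is stored and `last_used` is updated at most once per hour (the "some resolution" part). `ApiToken`, `issue` and `authenticate` are hypothetical names, not Django Ninja APIs:

```python
import hashlib
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ApiToken:
    digest: str  # only the SHA-256 of the token is kept
    expires_at: Optional[datetime] = None
    last_used: Optional[datetime] = None


def issue(lifetime: Optional[timedelta] = None, now: Optional[datetime] = None):
    """Create a token; the raw value is shown to the user once."""
    now = now or datetime.now(timezone.utc)
    raw = secrets.token_urlsafe(32)
    expires = now + lifetime if lifetime is not None else None
    return raw, ApiToken(hashlib.sha256(raw.encode()).hexdigest(), expires)


def authenticate(
    raw: str,
    token: ApiToken,
    now: Optional[datetime] = None,
    resolution: timedelta = timedelta(hours=1),
) -> bool:
    now = now or datetime.now(timezone.utc)
    digest = hashlib.sha256(raw.encode()).hexdigest()
    if not secrets.compare_digest(digest, token.digest):
        return False
    if token.expires_at is not None and now >= token.expires_at:
        return False
    # Coarse tracking: skip the write unless the stored value is stale.
    if token.last_used is None or now - token.last_used >= resolution:
        token.last_used = now
    return True
```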

Vector Embeddings

This would require either a new database or everyone to use the same database (à la Immich with pgvecto.rs)
Would enable semantic search, document similarity search
- Maybe replacing Whoosh entirely?
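
For a sense of what similarity search boils down to, here is a pure-Python sketch using cosine similarity over toy vectors. A real setup would get embeddings from a model and let the database do the ranking; `most_similar` is a hypothetical helper:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query: list[float], embeddings: dict[int, list[float]], top_k: int = 2) -> list[int]:
    """Return the ids of the top_k documents closest to the query vector."""
    ranked = sorted(
        embeddings,
        key=lambda doc_id: cosine(query, embeddings[doc_id]),
        reverse=True,
    )
    return ranked[:top_k]
```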

New Sanity Checker

Sanity checker messages are attached to a document
Can be dismissed (but still viewed)
Visible in the UI somehow
