v3 Ideas List
Trenton H edited this page 2025-06-28 10:26:07 -07:00

Breaking Changes

Removing GPG / Encryption

  • Encrypting documents has been unsupported since 0.9, many years ago
  • Provides no benefit
  • Still lingers in the code base here and there

Settings Updates

  • Remove all but Django settings from the environment
  • Separate OCR vs other settings
  • Create multiple levels of OCR settings:
    • A default system configuration, controlled by staff/superusers
    • A user-specific set of settings
    • The final settings used for OCR are then the combined set: user settings take priority, then the default system settings
  • Allow workflows/matching to set certain settings:
    • Example: if the document filename matches a regex, disable archive generation and de-skew
  • When a document starts consumption, settings go through the pipeline with it, i.e. set once and not read (from the DB) again
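
A minimal sketch of the "set once" idea above, assuming a frozen dataclass is used as the snapshot that travels with the document. The names `OcrSettings`, `FilenameRule`, and `resolve_settings`, and the three example fields, are hypothetical and not part of the current code base.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrSettings:
    """Hypothetical immutable snapshot of the OCR settings for one document."""

    language: str = "eng"
    create_archive: bool = True
    deskew: bool = True


@dataclass(frozen=True)
class FilenameRule:
    """Hypothetical workflow rule: if the filename matches, apply overrides."""

    pattern: str
    overrides: dict


def resolve_settings(filename, system, user, rules):
    # System defaults first, then user settings on top of them.
    merged = {**system, **user}
    # A matching workflow rule wins over both.
    for rule in rules:
        if re.search(rule.pattern, filename):
            merged.update(rule.overrides)
    # Built once at the start of consumption; frozen, so later steps
    # cannot change it and never need to re-read the database.
    return OcrSettings(**merged)
```

For example, a rule such as `FilenameRule(r"^scan_.*\.pdf$", {"create_archive": False, "deskew": False})` would turn off archive generation and de-skew for matching files only.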

Regex Everywhere

  • Remove usages of fnmatch in favor of regex. There was a PR that implemented some sort of multiple matching, where regex could have solved it
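
As a rough illustration of how existing glob patterns could be carried over, the standard library's `fnmatch.translate` turns a glob into a regex string, and several patterns can be joined with alternation to get "match any of these" behaviour. The helper names `glob_to_regex` and `any_of` are made up for this sketch.

```python
import fnmatch
import re


def glob_to_regex(pattern):
    """Compile one fnmatch-style glob as a case-insensitive regex."""
    return re.compile(fnmatch.translate(pattern), re.IGNORECASE)


def any_of(patterns):
    """Compile several globs into one regex that matches if any glob matches."""
    combined = "|".join(f"(?:{fnmatch.translate(p)})" for p in patterns)
    return re.compile(combined, re.IGNORECASE)


# Multiple matching in a single expression, written directly as a regex.
pdf_or_png = re.compile(r".*\.(pdf|png)$", re.IGNORECASE)
```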

Re-design parsing/consumption chain

  • Use chains/pipelines to actually break the consumption into multiple tasks
  • Results from one task move on to the next
  • An initial task takes the file, waits for it to be unmodified, then determines the next task to start.
  • Or alternatively, the initial task builds a pipeline and starts that.
  • The pipeline handles deciding whether the file can be consumed, rather than deciding this when a new file is first seen (see plugin ideas)
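
One possible shape for the "waits for it to be unmodified" step, assuming that "unmodified" means size and modification time have not changed for a short quiet period. The function name and the default timings are assumptions for illustration only.

```python
import os
import time


def wait_until_stable(path, quiet_period=1.0, poll_interval=0.2, timeout=30.0):
    """Return True once the file has not changed for quiet_period seconds.

    Returns False if the file disappears or the timeout is reached first.
    """
    deadline = time.monotonic() + timeout
    last_seen = None
    stable_since = None
    while time.monotonic() < deadline:
        try:
            st = os.stat(path)
        except FileNotFoundError:
            return False
        current = (st.st_size, st.st_mtime_ns)
        now = time.monotonic()
        if current != last_seen:
            # The file changed (or this is the first look): restart the clock.
            last_seen = current
            stable_since = now
        elif now - stable_since >= quiet_period:
            return True
        time.sleep(poll_interval)
    return False
```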

Actual Plugins

  • Design a system to allow plugins, while splitting apart the current code into plugins
  • I can see the following being plugins:
    • Parsers (obviously). Includes things like AI/cloud OCR to get the content, or even a parser that talks to a remote API on the local network
    • Archive generation (for example, use Gotenberg to convert a PDF to PDF/A instead of ocrmypdf)
    • Thumbnail generation (maybe you want to handle PDFs differently than JPEGs?)
    • Date parsing (handling non-Latin dates, for example)
    • Machine learning (provides an interface which returns the proposed tags, type, etc)
  • Ideally, plugins should be registered when installed, declaring what mime types they support
  • With the settings updates above, a workflow could also be used to set the parser based on matching certain values
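
A small sketch of what "registered, declaring what mime types they support" could look like, using an in-process registry and a decorator. All names here (`PARSERS`, `register_parser`, the two parser classes) are hypothetical; registration at install time would more likely go through Python packaging entry points, which this sketch does not show.

```python
from typing import Dict, List

# Hypothetical registry: mime type -> parser classes that claim to handle it.
PARSERS: Dict[str, List[type]] = {}


def register_parser(*mime_types):
    """Class decorator that records which mime types a parser supports."""

    def decorator(cls):
        for mime_type in mime_types:
            PARSERS.setdefault(mime_type, []).append(cls)
        return cls

    return decorator


@register_parser("application/pdf", "image/png")
class LocalOcrParser:
    def parse(self, path):
        # Placeholder: a real parser would run OCR here.
        return f"<ocr text of {path}>"


@register_parser("text/plain")
class PlainTextParser:
    def parse(self, path):
        with open(path, encoding="utf-8") as handle:
            return handle.read()


def parsers_for(mime_type):
    """Return the registered parser classes for a mime type, possibly empty."""
    return PARSERS.get(mime_type, [])
```

A workflow could then pick one of the candidates returned by `parsers_for` based on its matching rules.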

Simpler consumer

  • Use something like watchfiles for a simpler loop with only itself as a dependency
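
A rough sketch of how small such a loop could be, assuming the `watchfiles` package and its blocking `watch()` generator, which yields sets of `(Change, path)` tuples. The helper names and the `handle` callback are hypothetical, and the import is done lazily so the filtering helper can be exercised without the package installed.

```python
from pathlib import Path


def new_files(changes):
    """Return the unique paths that were added or modified in one batch."""
    wanted = ("added", "modified")
    return sorted({Path(p) for change, p in changes if change.name in wanted})


def watch_consume_dir(directory, handle):
    """Block forever, calling handle(path) for each new or changed file."""
    # Lazy import: watchfiles is a third-party dependency.
    from watchfiles import watch

    for changes in watch(directory):
        for path in new_files(changes):
            handle(path)
```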

Transition to Alpine container

  • Smaller image size
  • Faster update cadence

Ditch Celery for Huey

  • Celery is big and bulky, with support for Memcached, SQS, etc, which we don't need
  • Huey also has nice Django integrations
  • Would need to use its signals to implement task tracking
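
A sketch of signal-based task tracking, assuming Huey's documented handler shape `(signal, task, exc=None)` and that calling `huey.signal()` with no arguments subscribes a handler to every signal. The in-memory `TRACKED` dict and the `TaskRecord` type are stand-ins; a real implementation would presumably write to a Django model instead.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TaskRecord:
    """Hypothetical tracking record for one task."""

    name: str
    state: str
    error: Optional[str] = None


# Stand-in for a database table, keyed by task id.
TRACKED: Dict[str, TaskRecord] = {}


def track_task(signal, task, exc=None):
    """Record the latest signal seen for a task (e.g. executing, complete, error)."""
    TRACKED[task.id] = TaskRecord(
        name=task.name,
        state=signal,
        error=repr(exc) if exc is not None else None,
    )


def install(huey):
    """Attach the handler to a Huey instance for all signals."""
    huey.signal()(track_task)
```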

Improved Tasks

  • Show scheduled tasks with next execution
  • Simple task status
  • Include more task types

External Services

External OCR

  • External OCR services, using an API, could provide more recent Tesseract and Ghostscript versions, potentially fixing issues faster than Debian updates (thinking of an Alpine-based image)
  • The service would be streamed the document and eventually return the content and an optional archive file
  • OCR is time consuming, so it might need Celery/Huey or some other task queue there? And a database?
  • FastAPI could set this up easily, if there is no need for a database.
  • Could use Redis/Valkey streams to manage state and show progress without a database

External Machine Learning

  • Again, define an API that the service provides so it could be swapped out
  • Given the content, it suggests the tags, correspondents, etc
  • Being external allows it to be hosted on a machine with more resources
  • Needs a task queue for scheduled training?
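
To make the "swappable API" idea concrete, here is a minimal sketch using `typing.Protocol` as the interface. `Suggestions`, `SuggestionService`, and the keyword-based stand-in are all hypothetical; a real service would sit behind HTTP and use a trained model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Protocol


@dataclass
class Suggestions:
    """Hypothetical response shape for a suggestion service."""

    tags: List[str] = field(default_factory=list)
    correspondent: Optional[str] = None
    document_type: Optional[str] = None


class SuggestionService(Protocol):
    """Anything with this method can be plugged in, local or remote."""

    def suggest(self, content: str) -> Suggestions: ...


class KeywordSuggester:
    """Trivial stand-in implementation: tag by keyword, no learning at all."""

    def __init__(self, keyword_tags: Dict[str, str]):
        self.keyword_tags = keyword_tags

    def suggest(self, content: str) -> Suggestions:
        lowered = content.lower()
        tags = sorted(
            tag for keyword, tag in self.keyword_tags.items() if keyword in lowered
        )
        return Suggestions(tags=tags)


def suggest_for(service: SuggestionService, content: str) -> Suggestions:
    # The caller only depends on the protocol, not on a concrete service.
    return service.suggest(content)
```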

Separate OCR from Archive

  • Extracting the content of an image or PDF document should be separated from generating an archive file
  • Just too many interactions between them, leading to odd combinations

Break apart consumer

  • The consumer does too much; break it apart into smaller, more discrete steps
  • Make each step well defined with possible status/states to report over the websocket and/or notifications
  • Make it a chain of tasks, passing a package through which accumulates data, etc, before being saved
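
The chain described above might look roughly like this, with plain functions standing in for queued tasks and a `report` callback standing in for websocket or notification updates. `Package`, the two example steps, and `run_pipeline` are invented names for this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Package:
    """Hypothetical package that accumulates data as it moves through steps."""

    path: str
    data: Dict[str, object] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)


def detect_mime(pkg):
    # Placeholder detection based on the extension only.
    is_pdf = pkg.path.lower().endswith(".pdf")
    pkg.data["mime_type"] = "application/pdf" if is_pdf else "application/octet-stream"
    return pkg


def extract_text(pkg):
    # Placeholder: a real step would call the selected parser.
    pkg.data["content"] = f"<text of {pkg.path}>"
    return pkg


def run_pipeline(pkg, steps: List[Callable], report=print):
    """Run each step in order, reporting a status before and after it."""
    for step in steps:
        report(f"{step.__name__}: started")
        pkg = step(pkg)
        pkg.history.append(step.__name__)
        report(f"{step.__name__}: done")
    return pkg
```

Nothing is saved until the last step, so a failure part way through leaves no partial document behind.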

Settings Manager

  • Allow multiple levels of settings to be defined
    • From matching, apply certain settings
    • From the user (if known), apply their settings
    • From the system wide settings
    • From environment variable settings
    • Then defaults
  • Settings at lower levels have lower priority, so a matched setting is never overridden
  • Settings travel through the new consumer with the document
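
The priority order above maps fairly directly onto `collections.ChainMap`, which looks keys up in the first mapping that contains them. The setting names and the `build_settings` helper below are illustrative assumptions.

```python
from collections import ChainMap

# Hypothetical built-in defaults, the lowest-priority level.
DEFAULTS = {"language": "eng", "deskew": True, "create_archive": True}


def build_settings(matched, user, system, env):
    """Combine the levels; earlier arguments take priority over later ones."""
    return ChainMap(matched, user, system, env, DEFAULTS)


settings = build_settings(
    matched={"deskew": False},
    user={"language": "deu", "deskew": True},
    system={"language": "fra"},
    env={"create_archive": False},
)
# The matched value for "deskew" wins even though the user set it too.
effective = dict(settings)
```

Calling `dict()` on the result flattens it into a plain snapshot that can travel with the document.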

Django Ninja

  • Really like the OpenAPI spec it generates
  • async support for databases
  • Strongly typed and validated with Pydantic

Blockers

  • Would need to implement token-based authentication
    • Could track, with some resolution, when a token was last used. Might be nice to display this and allow removing old tokens which haven't been used
    • Could implement expiration too
  • Async pagination isn't working quite yet
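
For the token idea above, "with some resolution" could mean only writing the last-used timestamp when the stored value is older than some threshold, which avoids a database write on every request. The `ApiToken` class and the one hour resolution are assumptions for this sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumed resolution: last_used is accurate to within one hour.
LAST_USED_RESOLUTION = timedelta(hours=1)


@dataclass
class ApiToken:
    """Hypothetical token record with optional expiry."""

    key: str
    expires_at: Optional[datetime] = None
    last_used: Optional[datetime] = None

    def is_expired(self, now):
        return self.expires_at is not None and now >= self.expires_at

    def touch(self, now):
        """Update last_used if it is stale; return True when a save is needed."""
        if self.last_used is None or now - self.last_used >= LAST_USED_RESOLUTION:
            self.last_used = now
            return True
        return False
```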

Vector Embeddings

  • This would require either a new database or everyone to use the same database (à la Immich with pgvecto.rs)
  • Would enable semantic search, document similarity search
    • Maybe replacing Whoosh entirely?
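
As a toy illustration of document similarity search, the sketch below ranks documents by cosine similarity between embedding vectors. It is brute force over an in-memory dict with made-up two-dimensional vectors; a vector database would do the same comparison with an index, and the embeddings would come from a model not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query, embeddings, top_k=3):
    """Return the ids of the top_k documents closest to the query vector."""
    scored = sorted(
        embeddings.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:top_k]]
```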

New Sanity Checker

  • Sanity checker messages are attached to a document
  • Can be dismissed (but still viewed)
  • Visible in the UI somehow