Merge branch 'dev' into feature-remote-ocr-2

Fix consumer declaration file extensions
2025-08-12 00:19:48 +00:00 · 2025-08-11 10:48:36 -07:00 · 2025-08-10 05:32:52 -07:00 · 2025-08-06 16:04:25 -04:00 · 2025-08-01 23:55:49 -04:00 · 2025-07-09 11:05:37 -07:00
25 changed files with 962 additions and 744 deletions
--- a/docs/administration.md
+++ b/docs/administration.md
@@ -179,14 +179,10 @@ following:

 ### Database Upgrades

-Paperless-ngx is compatible with Django-supported versions of PostgreSQL and MariaDB and it is generally
+In general, paperless does not require a specific version of PostgreSQL or MariaDB and it is
 safe to update them to newer versions. However, you should always take a backup and follow
 the instructions from your database's documentation for how to upgrade between major versions.

-!!! note
-
-    As of Paperless-ngx v2.18, the minimum supported version of PostgreSQL is 13.
-
 For PostgreSQL, refer to [Upgrading a PostgreSQL Cluster](https://www.postgresql.org/docs/current/upgrading.html).

 For MariaDB, refer to [Upgrading MariaDB](https://mariadb.com/kb/en/upgrading/)
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1800,3 +1800,23 @@ password. All of these options come from their similarly-named [Django settings]
 #### [`PAPERLESS_EMAIL_USE_SSL=<bool>`](#PAPERLESS_EMAIL_USE_SSL) {#PAPERLESS_EMAIL_USE_SSL}

 : Defaults to false.
+
+## Remote OCR
+
+#### [`PAPERLESS_REMOTE_OCR_ENGINE=<str>`](#PAPERLESS_REMOTE_OCR_ENGINE) {#PAPERLESS_REMOTE_OCR_ENGINE}
+
+: The remote OCR engine to use. Currently only Azure AI is supported as "azureai".
+
+    Defaults to None, which disables remote OCR.
+
+#### [`PAPERLESS_REMOTE_OCR_API_KEY=<str>`](#PAPERLESS_REMOTE_OCR_API_KEY) {#PAPERLESS_REMOTE_OCR_API_KEY}
+
+: The API key to use for the remote OCR engine.
+
+    Defaults to None.
+
+#### [`PAPERLESS_REMOTE_OCR_ENDPOINT=<str>`](#PAPERLESS_REMOTE_OCR_ENDPOINT) {#PAPERLESS_REMOTE_OCR_ENDPOINT}
+
+: The endpoint to use for the remote OCR engine. This is required for Azure AI.
+
+    Defaults to None.
--- a/docs/index.md
+++ b/docs/index.md
@@ -25,9 +25,10 @@ physical documents into a searchable online archive so you can keep, well, _less
 ## Features

 -   **Organize and index** your scanned documents with tags, correspondents, types, and more.
-   _Your_ data is stored locally on _your_ server and is never transmitted or shared in any way.
+-   _Your_ data is stored locally on _your_ server and is never transmitted or shared in any way, unless you explicitly choose to do so.
 -   Performs **OCR** on your documents, adding searchable and selectable text, even to documents scanned with only images.
-   Utilizes the open-source Tesseract engine to recognize more than 100 languages.
+    -   Utilizes the open-source Tesseract engine to recognize more than 100 languages.
+    -   _New!_ Supports remote OCR with Azure AI (opt-in).
 -   Documents are saved as PDF/A format which is designed for long term storage, alongside the unaltered originals.
 -   Uses machine-learning to automatically add tags, correspondents and document types to your documents.
 -   Supports PDF documents, images, plain text files, Office documents (Word, Excel, PowerPoint, and LibreOffice equivalents)[^1] and more.
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -850,6 +850,18 @@ how regularly you intend to scan documents and use paperless.
    performed the task associated with the document, move it to the
    inbox.

+## Remote OCR
+
+!!! important
+
+    This feature is disabled by default and will always remain strictly "opt-in".
+
+Paperless-ngx supports performing OCR on documents using remote services. At the moment, this is limited to
+[Microsoft's Azure "Document Intelligence" service](https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence).
+This is of course a paid service (with a free tier) which requires an Azure account and subscription. Azure AI is not affiliated with
+Paperless-ngx in any way. When enabled, Paperless-ngx will automatically send appropriate documents to Azure for OCR processing, bypassing
+the local OCR engine. See the [configuration](configuration.md#PAPERLESS_REMOTE_OCR_ENGINE) options for more details.
+
 ## Architecture

 Paperless-ngx consists of the following components:
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -15,6 +15,7 @@ classifiers = [
 # This will allow testing to not install a webserver, mysql, etc

 dependencies = [
+  "azure-ai-documentintelligence>=1.0.2",
  "bleach~=6.2.0",
  "celery[redis]~=5.5.1",
  "channels~=4.2",
@@ -23,22 +24,22 @@ dependencies = [
  "dateparser~=1.2",
  # WARNING: django does not use semver.
  #          Only patch versions are guaranteed to not introduce breaking changes.
-  "django~=5.2.5",
+  "django~=5.1.7",
  "django-allauth[socialaccount,mfa]~=65.4.0",
-  "django-auditlog~=3.2.1",
+  "django-auditlog~=3.1.2",
  "django-cachalot~=2.8.0",
  "django-celery-results~=2.6.0",
  "django-compression-middleware~=0.5.0",
  "django-cors-headers~=4.7.0",
  "django-extensions~=4.1",
  "django-filter~=25.1",
-  "django-guardian~=3.0.3",
-  "django-multiselectfield~=1.0.1",
+  "django-guardian~=2.4.0",
+  "django-multiselectfield~=0.1.13",
  "django-soft-delete~=1.0.18",
-  "djangorestframework~=3.16",
-  "djangorestframework-guardian~=0.4.0",
+  "djangorestframework~=3.15",
+  "djangorestframework-guardian~=0.3.0",
  "drf-spectacular~=0.28",
-  "drf-spectacular-sidecar~=2025.8.1",
+  "drf-spectacular-sidecar~=2025.4.1",
  "drf-writable-nested~=0.7.1",
  "filelock~=3.18.0",
  "flower~=2.0.1",
@@ -103,7 +104,7 @@ testing = [
  "imagehash",
  "pytest~=8.4.1",
  "pytest-cov~=6.2.1",
-  "pytest-django~=4.11.1",
+  "pytest-django~=4.10.0",
  "pytest-env",
  "pytest-httpx",
  "pytest-mock",
@@ -233,6 +234,7 @@ testpaths = [
  "src/paperless_tesseract/tests/",
  "src/paperless_tika/tests",
  "src/paperless_text/tests/",
+  "src/paperless_remote/tests/",
 ]
 addopts = [
  "--pythonwarnings=all",
--- a/src/documents/management/commands/document_fuzzy_match.py
+++ b/src/documents/management/commands/document_fuzzy_match.py
@@ -125,14 +125,14 @@ class Command(MultiProcessMixin, ProgressBarMixin, BaseCommand):
                messages.append(
                    self.style.NOTICE(
                        f"Document {result.doc_one_pk} fuzzy match"
-                        f" to {result.doc_two_pk} (confidence {result.ratio:.3f})\n",
+                        f" to {result.doc_two_pk} (confidence {result.ratio:.3f})",
                    ),
                )
                maybe_delete_ids.append(result.doc_two_pk)

        if len(messages) == 0:
            messages.append(
-                self.style.SUCCESS("No matches found\n"),
+                self.style.SUCCESS("No matches found"),
            )
        self.stdout.writelines(
            messages,
--- a/src/documents/serialisers.py
+++ b/src/documents/serialisers.py
@@ -2089,24 +2089,6 @@ class WorkflowTriggerSerializer(serializers.ModelSerializer):

        return attrs

-    @staticmethod
-    def normalize_workflow_trigger_sources(trigger):
-        """
-        Convert sources to strings to handle django-multiselectfield v1.0 changes
-        """
-        if trigger and "sources" in trigger:
-            trigger["sources"] = [
-                str(s.value if hasattr(s, "value") else s) for s in trigger["sources"]
-            ]
-
-    def create(self, validated_data):
-        WorkflowTriggerSerializer.normalize_workflow_trigger_sources(validated_data)
-        return super().create(validated_data)
-
-    def update(self, instance, validated_data):
-        WorkflowTriggerSerializer.normalize_workflow_trigger_sources(validated_data)
-        return super().update(instance, validated_data)
-

 class WorkflowActionEmailSerializer(serializers.ModelSerializer):
    id = serializers.IntegerField(allow_null=True, required=False)
@@ -2271,8 +2253,6 @@ class WorkflowSerializer(serializers.ModelSerializer):
        if triggers is not None and triggers is not serializers.empty:
            for trigger in triggers:
                filter_has_tags = trigger.pop("filter_has_tags", None)
-                # Convert sources to strings to handle django-multiselectfield v1.0 changes
-                WorkflowTriggerSerializer.normalize_workflow_trigger_sources(trigger)
                trigger_instance, _ = WorkflowTrigger.objects.update_or_create(
                    id=trigger.get("id"),
                    defaults=trigger,
--- a/src/documents/tests/test_management_exporter.py
+++ b/src/documents/tests/test_management_exporter.py
@@ -123,7 +123,7 @@ class TestExportImport(

        self.trigger = WorkflowTrigger.objects.create(
            type=WorkflowTrigger.WorkflowTriggerType.CONSUMPTION,
-            sources=[str(WorkflowTrigger.DocumentSourceChoices.CONSUME_FOLDER.value)],
+            sources=[1],
            filter_filename="*",
        )
        self.action = WorkflowAction.objects.create(assign_title="new title")
--- a/src/documents/tests/test_management_fuzzy.py
+++ b/src/documents/tests/test_management_fuzzy.py
@@ -87,7 +87,7 @@ class TestFuzzyMatchCommand(TestCase):
            filename="other_test.pdf",
        )
        stdout, _ = self.call_command()
-        self.assertIn("No matches found", stdout)
+        self.assertEqual(stdout, "No matches found\n")

    def test_with_matches(self):
        """
@@ -116,7 +116,7 @@ class TestFuzzyMatchCommand(TestCase):
            filename="other_test.pdf",
        )
        stdout, _ = self.call_command("--processes", "1")
-        self.assertRegex(stdout, self.MSG_REGEX)
+        self.assertRegex(stdout, self.MSG_REGEX + "\n")

    def test_with_3_matches(self):
        """
@@ -152,10 +152,11 @@ class TestFuzzyMatchCommand(TestCase):
            filename="final_test.pdf",
        )
        stdout, _ = self.call_command()
-        lines = [x.strip() for x in stdout.splitlines() if x.strip()]
+        lines = [x.strip() for x in stdout.split("\n") if len(x.strip())]
        self.assertEqual(len(lines), 3)
-        for line in lines:
-            self.assertRegex(line, self.MSG_REGEX)
+        self.assertRegex(lines[0], self.MSG_REGEX)
+        self.assertRegex(lines[1], self.MSG_REGEX)
+        self.assertRegex(lines[2], self.MSG_REGEX)

    def test_document_deletion(self):
        """
@@ -196,12 +197,14 @@ class TestFuzzyMatchCommand(TestCase):

        stdout, _ = self.call_command("--delete")

-        self.assertIn(
+        lines = [x.strip() for x in stdout.split("\n") if len(x.strip())]
+        self.assertEqual(len(lines), 3)
+        self.assertEqual(
+            lines[0],
            "The command is configured to delete documents.  Use with caution",
-            stdout,
        )
-        self.assertRegex(stdout, self.MSG_REGEX)
-        self.assertIn("Deleting 1 documents based on ratio matches", stdout)
+        self.assertRegex(lines[1], self.MSG_REGEX)
+        self.assertEqual(lines[2], "Deleting 1 documents based on ratio matches")

        self.assertEqual(Document.objects.count(), 2)
        self.assertIsNotNone(Document.objects.get(pk=1))
--- a/src/documents/tests/test_migration_workflows.py
+++ b/src/documents/tests/test_migration_workflows.py
@@ -104,7 +104,7 @@ class TestReverseMigrateWorkflow(TestMigrations):

        trigger = WorkflowTrigger.objects.create(
            type=0,
-            sources=[str(DocumentSource.ConsumeFolder)],
+            sources=[DocumentSource.ConsumeFolder],
            filter_path="*/path/*",
            filter_filename="*file*",
        )
--- a/src/paperless/auth.py
+++ b/src/paperless/auth.py
@@ -54,7 +54,7 @@ class HttpRemoteUserMiddleware(PersistentRemoteUserMiddleware):

    header = settings.HTTP_REMOTE_USER_HEADER_NAME

-    def __call__(self, request: HttpRequest) -> None:
+    def process_request(self, request: HttpRequest) -> None:
        # If remote user auth is enabled only for the frontend, not the API,
        # then we don't want to authenticate the user for API requests.
        if (
@@ -62,8 +62,8 @@ class HttpRemoteUserMiddleware(PersistentRemoteUserMiddleware):
            and "paperless.auth.PaperlessRemoteUserAuthentication"
            not in settings.REST_FRAMEWORK["DEFAULT_AUTHENTICATION_CLASSES"]
        ):
-            return self.get_response(request)
-        return super().__call__(request)
+            return
+        return super().process_request(request)


 class PaperlessRemoteUserAuthentication(authentication.RemoteUserAuthentication):
--- a/src/paperless/checks.py
+++ b/src/paperless/checks.py
@@ -214,3 +214,31 @@ def audit_log_check(app_configs, **kwargs):
        )

    return result
+
+
+@register()
+def check_postgres_version(app_configs, **kwargs):
+    """
+    Django 5.2 removed PostgreSQL 13 support and thus it will be removed in
+    a future Paperless-ngx version. This check can be removed eventually.
+    See https://docs.djangoproject.com/en/5.2/releases/5.2/#dropped-support-for-postgresql-13
+    """
+    db_conn = connections["default"]
+    result = []
+    if db_conn.vendor == "postgresql":
+        try:
+            with db_conn.cursor() as cursor:
+                cursor.execute("SHOW server_version;")
+                version = cursor.fetchone()[0]
+                if version.startswith("13"):
+                    return [
+                        Warning(
+                            "PostgreSQL 13 is deprecated and will not be supported in a future Paperless-ngx release.",
+                            hint="Upgrade to PostgreSQL 14 or newer.",
+                        ),
+                    ]
+        except Exception:  # pragma: no cover
+            # Don't block checks on version query failure
+            pass
+
+    return result
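The check above detects PostgreSQL 13 with `version.startswith("13")`. As a hedged alternative sketch (not part of this change), the major version could instead be parsed as an integer, which avoids matching a hypothetical future major such as `130` and tolerates distribution suffixes that some packaged builds append to `server_version`. The helper names `postgres_major_version` and `is_postgres_13` are hypothetical, and the sample strings are illustrative only.

```python
import re
from typing import Optional


def postgres_major_version(server_version: str) -> Optional[int]:
    """Return the leading integer of a `SHOW server_version` string, or None."""
    match = re.match(r"\s*(\d+)", server_version)
    return int(match.group(1)) if match else None


def is_postgres_13(server_version: str) -> bool:
    # Hypothetical stand-in for `version.startswith("13")` in the check above.
    return postgres_major_version(server_version) == 13


# Illustrative inputs; real servers may report other formats.
for sample in ["13.11", "14.10", "13.11 (Debian 13.11-1.pgdg120+1)", "unknown"]:
    print(sample, "->", is_postgres_13(sample))
```

An unparseable string yields `None` and therefore no warning, which mirrors the "don't block checks on version query failure" intent of the original.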
--- a/src/paperless/settings.py
+++ b/src/paperless/settings.py
@@ -324,6 +324,7 @@ INSTALLED_APPS = [
    "paperless_tesseract.apps.PaperlessTesseractConfig",
    "paperless_text.apps.PaperlessTextConfig",
    "paperless_mail.apps.PaperlessMailConfig",
+    "paperless_remote.apps.PaperlessRemoteParserConfig",
    "django.contrib.admin",
    "rest_framework",
    "rest_framework.authtoken",
@@ -1443,3 +1444,10 @@ WEBHOOKS_ALLOW_INTERNAL_REQUESTS = __get_boolean(
    "PAPERLESS_WEBHOOKS_ALLOW_INTERNAL_REQUESTS",
    "true",
 )
+
+###############################################################################
+# Remote Parser                                                               #
+###############################################################################
+REMOTE_OCR_ENGINE = os.getenv("PAPERLESS_REMOTE_OCR_ENGINE")
+REMOTE_OCR_API_KEY = os.getenv("PAPERLESS_REMOTE_OCR_API_KEY")
+REMOTE_OCR_ENDPOINT = os.getenv("PAPERLESS_REMOTE_OCR_ENDPOINT")
--- a/src/paperless/tests/test_checks.py
+++ b/src/paperless/tests/test_checks.py
@@ -9,6 +9,7 @@ from documents.tests.utils import DirectoriesMixin
 from documents.tests.utils import FileSystemAssertsMixin
 from paperless.checks import audit_log_check
 from paperless.checks import binaries_check
+from paperless.checks import check_postgres_version
 from paperless.checks import debug_mode_check
 from paperless.checks import paths_check
 from paperless.checks import settings_values_check
@@ -262,3 +263,39 @@ class TestAuditLogChecks(TestCase):
                    ("auditlog table was found but audit log is disabled."),
                    msg.msg,
                )
+
+
+class TestPostgresVersionCheck(TestCase):
+    @mock.patch("paperless.checks.connections")
+    def test_postgres_13_warns(self, mock_connections):
+        mock_connection = mock.MagicMock()
+        mock_connection.vendor = "postgresql"
+        mock_cursor = mock.MagicMock()
+        mock_cursor.__enter__.return_value.fetchone.return_value = ["13.11"]
+        mock_connection.cursor.return_value = mock_cursor
+        mock_connections.__getitem__.return_value = mock_connection
+
+        warnings = check_postgres_version(None)
+        self.assertEqual(len(warnings), 1)
+        self.assertIn("PostgreSQL 13 is deprecated", warnings[0].msg)
+
+    @mock.patch("paperless.checks.connections")
+    def test_postgres_14_passes(self, mock_connections):
+        mock_connection = mock.MagicMock()
+        mock_connection.vendor = "postgresql"
+        mock_cursor = mock.MagicMock()
+        mock_cursor.__enter__.return_value.fetchone.return_value = ["14.10"]
+        mock_connection.cursor.return_value = mock_cursor
+        mock_connections.__getitem__.return_value = mock_connection
+
+        warnings = check_postgres_version(None)
+        self.assertEqual(warnings, [])
+
+    @mock.patch("paperless.checks.connections")
+    def test_non_postgres_skipped(self, mock_connections):
+        mock_connection = mock.MagicMock()
+        mock_connection.vendor = "sqlite"
+        mock_connections.__getitem__.return_value = mock_connection
+
+        warnings = check_postgres_version(None)
+        self.assertEqual(warnings, [])
--- a/src/paperless/tests/test_remote_user.py
+++ b/src/paperless/tests/test_remote_user.py
@@ -1,7 +1,6 @@
 import os
 from unittest import mock

-from django.conf import settings
 from django.contrib.auth.models import User
 from django.test import override_settings
 from rest_framework import status
@@ -92,7 +91,6 @@ class TestRemoteUser(DirectoriesMixin, APITestCase):

    @override_settings(
        REST_FRAMEWORK={
-            **settings.REST_FRAMEWORK,
            "DEFAULT_AUTHENTICATION_CLASSES": [
                "rest_framework.authentication.BasicAuthentication",
                "rest_framework.authentication.TokenAuthentication",
--- a/src/paperless_remote/__init__.py
+++ b/src/paperless_remote/__init__.py
@@ -0,0 +1,4 @@
+# this is here so that django finds the checks.
+from paperless_remote.checks import check_remote_parser_configured
+
+__all__ = ["check_remote_parser_configured"]
--- a/src/paperless_remote/apps.py
+++ b/src/paperless_remote/apps.py
@@ -0,0 +1,14 @@
+from django.apps import AppConfig
+
+from paperless_remote.signals import remote_consumer_declaration
+
+
+class PaperlessRemoteParserConfig(AppConfig):
+    name = "paperless_remote"
+
+    def ready(self):
+        from documents.signals import document_consumer_declaration
+
+        document_consumer_declaration.connect(remote_consumer_declaration)
+
+        AppConfig.ready(self)
--- a/src/paperless_remote/checks.py
+++ b/src/paperless_remote/checks.py
@@ -0,0 +1,15 @@
+from django.conf import settings
+from django.core.checks import Error
+from django.core.checks import register
+
+
+@register()
+def check_remote_parser_configured(app_configs, **kwargs):
+    if settings.REMOTE_OCR_ENGINE == "azureai" and not settings.REMOTE_OCR_ENDPOINT:
+        return [
+            Error(
+                "Azure AI remote parser requires endpoint to be configured.",
+            ),
+        ]
+
+    return []
--- a/src/paperless_remote/parsers.py
+++ b/src/paperless_remote/parsers.py
@@ -0,0 +1,113 @@
+from pathlib import Path
+
+from django.conf import settings
+
+from paperless_tesseract.parsers import RasterisedDocumentParser
+
+
+class RemoteEngineConfig:
+    def __init__(
+        self,
+        engine: str,
+        api_key: str | None = None,
+        endpoint: str | None = None,
+    ):
+        self.engine = engine
+        self.api_key = api_key
+        self.endpoint = endpoint
+
+    def engine_is_valid(self):
+        valid = self.engine in ["azureai"] and self.api_key is not None
+        if self.engine == "azureai":
+            valid = valid and self.endpoint is not None
+        return valid
+
+
+class RemoteDocumentParser(RasterisedDocumentParser):
+    """
+    This parser uses a remote OCR engine to parse documents. Currently, it supports Azure AI Vision
+    as this is the only service that provides a remote OCR API with text-embedded PDF output.
+    """
+
+    logging_name = "paperless.parsing.remote"
+
+    def get_settings(self) -> RemoteEngineConfig:
+        """
+        Returns the configuration for the remote OCR engine, loaded from Django settings.
+        """
+        return RemoteEngineConfig(
+            engine=settings.REMOTE_OCR_ENGINE,
+            api_key=settings.REMOTE_OCR_API_KEY,
+            endpoint=settings.REMOTE_OCR_ENDPOINT,
+        )
+
+    def supported_mime_types(self):
+        if self.settings.engine_is_valid():
+            return {
+                "application/pdf": ".pdf",
+                "image/png": ".png",
+                "image/jpeg": ".jpg",
+                "image/tiff": ".tiff",
+                "image/bmp": ".bmp",
+                "image/gif": ".gif",
+                "image/webp": ".webp",
+            }
+        else:
+            return {}
+
+    def azure_ai_vision_parse(
+        self,
+        file: Path,
+    ) -> str | None:
+        """
+        Uses Azure AI Vision to parse the document and return the text content.
+        It requests a searchable PDF output with embedded text.
+        The PDF is saved to the archive_path attribute.
+        Returns the text content extracted from the document.
+        If the parsing fails, it returns None.
+        """
+        from azure.ai.documentintelligence import DocumentIntelligenceClient
+        from azure.ai.documentintelligence.models import AnalyzeDocumentRequest
+        from azure.ai.documentintelligence.models import AnalyzeOutputOption
+        from azure.ai.documentintelligence.models import DocumentContentFormat
+        from azure.core.credentials import AzureKeyCredential
+
+        client = DocumentIntelligenceClient(
+            endpoint=self.settings.endpoint,
+            credential=AzureKeyCredential(self.settings.api_key),
+        )
+
+        with file.open("rb") as f:
+            analyze_request = AnalyzeDocumentRequest(bytes_source=f.read())
+            poller = client.begin_analyze_document(
+                model_id="prebuilt-read",
+                body=analyze_request,
+                output_content_format=DocumentContentFormat.TEXT,
+                output=[AnalyzeOutputOption.PDF],  # request searchable PDF output
+                content_type="application/json",
+            )
+
+        poller.wait()
+        result_id = poller.details["operation_id"]
+        result = poller.result()
+
+        # Download the PDF with embedded text
+        self.archive_path = Path(self.tempdir) / "archive.pdf"
+        with self.archive_path.open("wb") as f:
+            for chunk in client.get_analyze_result_pdf(
+                model_id="prebuilt-read",
+                result_id=result_id,
+            ):
+                f.write(chunk)
+
+        return result.content
+
+    def parse(self, document_path: Path, mime_type, file_name=None):
+        if not self.settings.engine_is_valid():
+            self.log.warning(
+                "No valid remote parser engine is configured, content will be empty.",
+            )
+            self.text = ""
+            return
+        elif self.settings.engine == "azureai":
+            self.text = self.azure_ai_vision_parse(document_path)
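The validity rule in `RemoteEngineConfig.engine_is_valid` can be exercised on its own. The following is a minimal sketch that copies the class from the diff above so it runs without Django; the key and endpoint values are placeholders, not real credentials.

```python
class RemoteEngineConfig:
    """Standalone copy of the config holder from parsers.py, for illustration."""

    def __init__(self, engine, api_key=None, endpoint=None):
        self.engine = engine
        self.api_key = api_key
        self.endpoint = endpoint

    def engine_is_valid(self):
        # A known engine name and an API key are always required.
        valid = self.engine in ["azureai"] and self.api_key is not None
        # Azure additionally requires an endpoint.
        if self.engine == "azureai":
            valid = valid and self.endpoint is not None
        return valid


# Placeholder values for illustration only.
cases = {
    "unset": RemoteEngineConfig(None),
    "no endpoint": RemoteEngineConfig("azureai", api_key="somekey"),
    "no key": RemoteEngineConfig("azureai", endpoint="https://example.invalid"),
    "complete": RemoteEngineConfig("azureai", "somekey", "https://example.invalid"),
}
for name, config in cases.items():
    print(name, "->", config.engine_is_valid())
```

Note that the "no key" case is invalid here, while the startup check in `checks.py` only reports a missing endpoint. Under that configuration the parser would simply advertise no MIME types rather than raise a check error.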
--- a/src/paperless_remote/signals.py
+++ b/src/paperless_remote/signals.py
@@ -0,0 +1,18 @@
+def get_parser(*args, **kwargs):
+    from paperless_remote.parsers import RemoteDocumentParser
+
+    return RemoteDocumentParser(*args, **kwargs)
+
+
+def get_supported_mime_types():
+    from paperless_remote.parsers import RemoteDocumentParser
+
+    return RemoteDocumentParser(None).supported_mime_types()
+
+
+def remote_consumer_declaration(sender, **kwargs):
+    return {
+        "parser": get_parser,
+        "weight": 5,
+        "mime_types": get_supported_mime_types(),
+    }
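The declaration above only has an effect once the consumer chooses among all registered parsers. That selection code is not part of this diff, so the sketch below assumes the consumer picks the highest-weight declaration whose `mime_types` contains the document's MIME type. The function name `pick_parser`, the string stand-ins for parser factories, and the weight of 0 for the local parser are all hypothetical.

```python
def pick_parser(declarations, mime_type):
    """Return the highest-weight parser that lists mime_type, or None."""
    candidates = [d for d in declarations if mime_type in d["mime_types"]]
    if not candidates:
        return None
    return max(candidates, key=lambda d: d["weight"])["parser"]


# Hypothetical declarations; strings stand in for the parser factories.
local = {"parser": "local-ocr", "weight": 0, "mime_types": {"application/pdf": ".pdf"}}
remote_on = {"parser": "remote-ocr", "weight": 5, "mime_types": {"application/pdf": ".pdf"}}
# With no valid engine configured, supported_mime_types() returns {}.
remote_off = {"parser": "remote-ocr", "weight": 5, "mime_types": {}}

print(pick_parser([local, remote_on], "application/pdf"))
print(pick_parser([local, remote_off], "application/pdf"))
```

Under this assumption, returning an empty `mime_types` mapping when the engine is not configured is what keeps the feature opt-in: the remote parser never becomes a candidate, so the local parser is used.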
--- a/src/paperless_remote/tests/__init__.py
+++ b/src/paperless_remote/tests/__init__.py
--- a/src/paperless_remote/tests/samples/simple-digital.pdf
+++ b/src/paperless_remote/tests/samples/simple-digital.pdf
--- a/src/paperless_remote/tests/test_checks.py
+++ b/src/paperless_remote/tests/test_checks.py
@@ -0,0 +1,29 @@
+from django.test import TestCase
+from django.test import override_settings
+
+from paperless_remote import check_remote_parser_configured
+
+
+class TestChecks(TestCase):
+    @override_settings(REMOTE_OCR_ENGINE=None)
+    def test_no_engine(self):
+        msgs = check_remote_parser_configured(None)
+        self.assertEqual(len(msgs), 0)
+
+    @override_settings(REMOTE_OCR_ENGINE="azureai")
+    @override_settings(REMOTE_OCR_API_KEY="somekey")
+    @override_settings(REMOTE_OCR_ENDPOINT=None)
+    def test_azure_no_endpoint(self):
+        msgs = check_remote_parser_configured(None)
+        self.assertEqual(len(msgs), 1)
+        self.assertTrue(
+            msgs[0].msg.startswith(
+                "Azure AI remote parser requires endpoint to be configured.",
+            ),
+        )
+
+    @override_settings(REMOTE_OCR_ENGINE="something")
+    @override_settings(REMOTE_OCR_API_KEY="somekey")
+    def test_valid_configuration(self):
+        msgs = check_remote_parser_configured(None)
+        self.assertEqual(len(msgs), 0)
--- a/src/paperless_remote/tests/test_parser.py
+++ b/src/paperless_remote/tests/test_parser.py
@@ -0,0 +1,101 @@
+import uuid
+from pathlib import Path
+from unittest import mock
+
+from django.test import TestCase
+from django.test import override_settings
+
+from documents.tests.utils import DirectoriesMixin
+from documents.tests.utils import FileSystemAssertsMixin
+from paperless_remote.parsers import RemoteDocumentParser
+from paperless_remote.signals import get_parser
+
+
+class TestParser(DirectoriesMixin, FileSystemAssertsMixin, TestCase):
+    SAMPLE_FILES = Path(__file__).resolve().parent / "samples"
+
+    def assertContainsStrings(self, content, strings):
+        # Asserts that all strings appear in content, in the given order.
+        indices = []
+        for s in strings:
+            if s in content:
+                indices.append(content.index(s))
+            else:
+                self.fail(f"'{s}' is not in '{content}'")
+        self.assertListEqual(indices, sorted(indices))
+
+    @mock.patch("paperless_tesseract.parsers.run_subprocess")
+    @mock.patch("azure.ai.documentintelligence.DocumentIntelligenceClient")
+    def test_get_text_with_azure(self, mock_client_cls, mock_subprocess):
+        # Arrange mock Azure client
+        mock_client = mock.Mock()
+        mock_client_cls.return_value = mock_client
+
+        # Simulate poller result and its `.details`
+        mock_poller = mock.Mock()
+        mock_poller.wait.return_value = None
+        mock_poller.details = {"operation_id": "fake-op-id"}
+        mock_client.begin_analyze_document.return_value = mock_poller
+        mock_poller.result.return_value.content = "This is a test document."
+
+        # Return dummy PDF bytes
+        mock_client.get_analyze_result_pdf.return_value = [
+            b"%PDF-",
+            b"1.7 ",
+            b"FAKEPDF",
+        ]
+
+        # Simulate pdftotext by writing dummy text to sidecar file
+        def fake_run(cmd, *args, **kwargs):
+            with Path(cmd[-1]).open("w", encoding="utf-8") as f:
+                f.write("This is a test document.")
+
+        mock_subprocess.side_effect = fake_run
+
+        with override_settings(
+            REMOTE_OCR_ENGINE="azureai",
+            REMOTE_OCR_API_KEY="somekey",
+            REMOTE_OCR_ENDPOINT="https://endpoint.cognitiveservices.azure.com",
+        ):
+            parser = get_parser(uuid.uuid4())
+            parser.parse(
+                self.SAMPLE_FILES / "simple-digital.pdf",
+                "application/pdf",
+            )
+
+            self.assertContainsStrings(
+                parser.text.strip(),
+                ["This is a test document."],
+            )
+
+    @override_settings(
+        REMOTE_OCR_ENGINE="azureai",
+        REMOTE_OCR_API_KEY="key",
+        REMOTE_OCR_ENDPOINT="https://endpoint.cognitiveservices.azure.com",
+    )
+    def test_supported_mime_types_valid_config(self):
+        parser = RemoteDocumentParser(uuid.uuid4())
+        expected_types = {
+            "application/pdf": ".pdf",
+            "image/png": ".png",
+            "image/jpeg": ".jpg",
+            "image/tiff": ".tiff",
+            "image/bmp": ".bmp",
+            "image/gif": ".gif",
+            "image/webp": ".webp",
+        }
+        self.assertEqual(parser.supported_mime_types(), expected_types)
+
+    def test_supported_mime_types_invalid_config(self):
+        parser = get_parser(uuid.uuid4())
+        self.assertEqual(parser.supported_mime_types(), {})
+
+    @override_settings(
+        REMOTE_OCR_ENGINE=None,
+        REMOTE_OCR_API_KEY=None,
+        REMOTE_OCR_ENDPOINT=None,
+    )
+    def test_parse_with_invalid_config(self):
+        parser = get_parser(uuid.uuid4())
+        parser.parse(self.SAMPLE_FILES / "simple-digital.pdf", "application/pdf")
+        self.assertEqual(parser.text, "")
--- a/uv.lock
+++ b/uv.lock
Author	SHA1	Message	Date
shamoon	dd0ffaf312	Merge branch 'dev' into feature-remote-ocr-2	2025-08-11 10:48:36 -07:00
shamoon	264504affc	Fix consumer declaration file extensions	2025-08-10 05:32:52 -07:00
shamoon	4feedf2add	Merge branch 'dev' into feature-remote-ocr-2	2025-08-06 16:04:25 -04:00
shamoon	2f76cf9831	Merge branch 'dev' into feature-remote-ocr-2	2025-08-01 23:55:49 -04:00
shamoon	1002d37f6b	Update test_parser.py	2025-07-09 11:05:37 -07:00
shamoon	d260a94740	Update parsers.py	2025-07-09 11:02:57 -07:00
shamoon	88c69b83ea	Update index.md	2025-07-09 11:00:12 -07:00
shamoon	2557ee2014	Update docs to mention remote OCR with Azure AI	2025-07-09 09:53:30 -07:00
shamoon	3c75deed80	Add paperless_remote tests to testpaths	2025-07-08 14:19:45 -07:00
shamoon	d05343c927	Test fixes / coverage	2025-07-08 14:19:45 -07:00
shamoon	e7972b7eaf	Coverage	2025-07-08 14:19:45 -07:00
shamoon	75a091cc0d	Fix test	2025-07-08 14:19:44 -07:00
shamoon	dca74803fd	Use output_content_format poller.result to get clean content	2025-07-08 14:19:44 -07:00
shamoon	3cf3d868d0	Some docs	2025-07-08 14:19:43 -07:00
shamoon	bf4fc6604a	Test	2025-07-08 14:19:43 -07:00
shamoon	e8c1eb86fa	This actually works [ci skip]	2025-07-08 14:19:43 -07:00
shamoon	c3dad3cf69	Basic parse	2025-07-08 14:19:42 -07:00
shamoon	811bd66088	Ok, restart implementing this with just azure [ci skip]	2025-07-08 14:19:42 -07:00