Merge branch 'dev' into feature-remote-ocr-2

2025-12-31 13:58:04 -06:00 · 2025-12-30 13:08:19 -08:00 · 2025-12-11 12:58:14 -08:00 · 2025-12-07 20:37:56 -08:00 · 2025-11-22 13:18:50 -08:00 · 2025-11-19 23:49:11 -08:00
21 changed files with 444 additions and 115 deletions
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1804,3 +1804,23 @@ password. All of these options come from their similarly-named [Django settings]
 #### [`PAPERLESS_EMAIL_USE_SSL=<bool>`](#PAPERLESS_EMAIL_USE_SSL) {#PAPERLESS_EMAIL_USE_SSL}
 : Defaults to false.
 ## Remote OCR
 #### [`PAPERLESS_REMOTE_OCR_ENGINE=<str>`](#PAPERLESS_REMOTE_OCR_ENGINE) {#PAPERLESS_REMOTE_OCR_ENGINE}
 : The remote OCR engine to use. Currently only Azure AI is supported as "azureai".
    Defaults to None, which disables remote OCR.
 #### [`PAPERLESS_REMOTE_OCR_API_KEY=<str>`](#PAPERLESS_REMOTE_OCR_API_KEY) {#PAPERLESS_REMOTE_OCR_API_KEY}
 : The API key to use for the remote OCR engine.
    Defaults to None.
 #### [`PAPERLESS_REMOTE_OCR_ENDPOINT=<str>`](#PAPERLESS_REMOTE_OCR_ENDPOINT) {#PAPERLESS_REMOTE_OCR_ENDPOINT}
 : The endpoint to use for the remote OCR engine. This is required for Azure AI.
    Defaults to None.
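Note on the three options documented above: the sketch below shows one way they could be read and cross-checked. It mirrors the rule stated in the docs (Azure AI needs both an API key and an endpoint), but the function names `load_remote_ocr_settings` and `remote_ocr_problems` are hypothetical and not part of this change.

```python
import os


def load_remote_ocr_settings(env=None):
    """Read the three remote OCR variables from an environment mapping."""
    env = os.environ if env is None else env
    return {
        "engine": env.get("PAPERLESS_REMOTE_OCR_ENGINE"),
        "api_key": env.get("PAPERLESS_REMOTE_OCR_API_KEY"),
        "endpoint": env.get("PAPERLESS_REMOTE_OCR_ENDPOINT"),
    }


def remote_ocr_problems(cfg):
    """Return a list of configuration problems (empty when nothing is wrong)."""
    if cfg["engine"] != "azureai":
        # Unset (None) means remote OCR is disabled, so nothing to validate.
        return []
    missing = [name for name in ("api_key", "endpoint") if not cfg[name]]
    return [f"azureai requires {name}" for name in missing]


# Placeholder values, taken from the test fixtures in this change.
example_env = {
    "PAPERLESS_REMOTE_OCR_ENGINE": "azureai",
    "PAPERLESS_REMOTE_OCR_API_KEY": "somekey",
    "PAPERLESS_REMOTE_OCR_ENDPOINT": "https://endpoint.cognitiveservices.azure.com",
}
problems = remote_ocr_problems(load_remote_ocr_settings(example_env))
```

With all three values set, `problems` is empty; dropping the endpoint yields a single entry naming it.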
--- a/docs/index.md
+++ b/docs/index.md
@@ -25,9 +25,10 @@ physical documents into a searchable online archive so you can keep, well, _less
 ## Features
 -   **Organize and index** your scanned documents with tags, correspondents, types, and more.
-   _Your_ data is stored locally on _your_ server and is never transmitted or shared in any way.
+-   _Your_ data is stored locally on _your_ server and is never transmitted or shared in any way, unless you explicitly choose to do so.
 -   Performs **OCR** on your documents, adding searchable and selectable text, even to documents scanned with only images.
-   Utilizes the open-source Tesseract engine to recognize more than 100 languages.
+    -   Utilizes the open-source Tesseract engine to recognize more than 100 languages.
    -   _New!_ Supports remote OCR with Azure AI (opt-in).
 -   Documents are saved as PDF/A format which is designed for long term storage, alongside the unaltered originals.
 -   Uses machine-learning to automatically add tags, correspondents and document types to your documents.
 -   Supports PDF documents, images, plain text files, Office documents (Word, Excel, PowerPoint, and LibreOffice equivalents)[^1] and more.
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -901,6 +901,21 @@ how regularly you intend to scan documents and use paperless.
    performed the task associated with the document, move it to the
    inbox.
 ## Remote OCR
 !!! important
    This feature is disabled by default and will always remain strictly "opt-in".
 Paperless-ngx supports performing OCR on documents using remote services. At the moment, this is limited to
 [Microsoft's Azure "Document Intelligence" service](https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence).
 This is of course a paid service (with a free tier) which requires an Azure account and subscription. Azure AI is not affiliated with
 Paperless-ngx in any way. When enabled, Paperless-ngx will automatically send appropriate documents to Azure for OCR processing, bypassing
 the local OCR engine. See the [configuration](configuration.md#PAPERLESS_REMOTE_OCR_ENGINE) options for more details.
 Additionally, when using a commercial service with this feature, consider both potential costs as well as any associated file size
 or page limitations (e.g. with a free tier).
 ## Architecture
 Paperless-ngx consists of the following components:
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -16,6 +16,7 @@ classifiers = [
 # This will allow testing to not install a webserver, mysql, etc
 dependencies = [
  "azure-ai-documentintelligence>=1.0.2",
  "babel>=2.17",
  "bleach~=6.3.0",
  "celery[redis]~=5.5.1",
@@ -253,6 +254,7 @@ testpaths = [
  "src/paperless_tesseract/tests/",
  "src/paperless_tika/tests",
  "src/paperless_text/tests/",
  "src/paperless_remote/tests/",
 ]
 addopts = [
  "--pythonwarnings=all",
--- a/src/documents/barcodes.py
+++ b/src/documents/barcodes.py
@@ -186,7 +186,11 @@ class BarcodePlugin(ConsumeTaskPlugin):
        # Update/overwrite an ASN if possible
        # After splitting, as otherwise each split document gets the same ASN
-        if self.settings.barcode_enable_asn and (located_asn := self.asn) is not None:
+        if (
+            self.settings.barcode_enable_asn
+            and not self.metadata.skip_asn
+            and (located_asn := self.asn) is not None
+        ):
            logger.info(f"Found ASN in barcode: {located_asn}")
            self.metadata.asn = located_asn
--- a/src/documents/bulk_edit.py
+++ b/src/documents/bulk_edit.py
@@ -7,6 +7,7 @@ from pathlib import Path
 from typing import TYPE_CHECKING
 from typing import Literal
+from celery import chain
 from celery import chord
 from celery import group
 from celery import shared_task
@@ -37,42 +38,6 @@ if TYPE_CHECKING:
 logger: logging.Logger = logging.getLogger("paperless.bulk_edit")
-@shared_task(bind=True)
-def restore_archive_serial_numbers_task(
-    self,
-    backup: dict[int, int],
-    *args,
-    **kwargs,
-) -> None:
-    restore_archive_serial_numbers(backup)
-def release_archive_serial_numbers(doc_ids: list[int]) -> dict[int, int]:
-    """
-    Clears ASNs on documents that are about to be replaced so new documents
-    can be assigned ASNs without uniqueness collisions. Returns a backup map
-    of doc_id -> previous ASN for potential restoration.
-    """
-    qs = Document.objects.filter(
-        id__in=doc_ids,
-        archive_serial_number__isnull=False,
-    ).only("pk", "archive_serial_number")
-    backup = dict(qs.values_list("pk", "archive_serial_number"))
-    qs.update(archive_serial_number=None)
-    logger.info(f"Released archive serial numbers for documents {list(backup.keys())}")
-    return backup
-def restore_archive_serial_numbers(backup: dict[int, int]) -> None:
-    """
-    Restores ASNs using the provided backup map, intended for
-    rollback when replacement consumption fails.
-    """
-    for doc_id, asn in backup.items():
-        Document.objects.filter(pk=doc_id).update(archive_serial_number=asn)
-    logger.info(f"Restored archive serial numbers for documents {list(backup.keys())}")
 def set_correspondent(
    doc_ids: list[int],
    correspondent: Correspondent,
@@ -421,7 +386,6 @@ def merge(
    merged_pdf = pikepdf.new()
    version: str = merged_pdf.pdf_version
-    handoff_asn: int | None = None
    # use doc_ids to preserve order
    for doc_id in doc_ids:
        doc = qs.get(id=doc_id)
@@ -437,8 +401,6 @@ def merge(
                version = max(version, pdf.pdf_version)
                merged_pdf.pages.extend(pdf.pages)
            affected_docs.append(doc.id)
-            if handoff_asn is None and doc.archive_serial_number is not None:
-                handoff_asn = doc.archive_serial_number
        except Exception as e:
            logger.exception(
                f"Error merging document {doc.id}, it will not be included in the merge: {e}",
@@ -464,8 +426,6 @@ def merge(
                DocumentMetadataOverrides.from_document(metadata_document)
            )
            overrides.title = metadata_document.title + " (merged)"
-            if metadata_document.archive_serial_number is not None:
-                handoff_asn = metadata_document.archive_serial_number
        else:
            overrides = DocumentMetadataOverrides()
    else:
@@ -473,9 +433,8 @@ def merge(
    if user is not None:
        overrides.owner_id = user.id
-
-    if delete_originals and handoff_asn is not None:
-        overrides.asn = handoff_asn
+    # Avoid copying or detecting ASN from merged PDFs to prevent collision
+    overrides.skip_asn = True
    logger.info("Adding merged document to the task queue.")
@@ -488,20 +447,12 @@ def merge(
    )
    if delete_originals:
        backup = release_archive_serial_numbers(affected_docs)
        logger.info(
            "Queueing removal of original documents after consumption of merged document",
        )
-        try:
-            consume_task.apply_async(
-                link=[delete.si(affected_docs)],
-                link_error=[restore_archive_serial_numbers_task.s(backup)],
-            )
-        except Exception:
-            restore_archive_serial_numbers(backup)
-            raise
-        else:
-            consume_task.delay()
+        chain(consume_task, delete.si(affected_docs)).delay()
+    else:
+        consume_task.delay()
    return "OK"
@@ -557,20 +508,10 @@ def split(
                )
            if delete_originals:
-                backup = release_archive_serial_numbers([doc.id])
                logger.info(
                    "Queueing removal of original document after consumption of the split documents",
                )
-                try:
-                    chord(
-                        header=consume_tasks,
-                        body=delete.si([doc.id]),
-                    ).apply_async(
-                        link_error=[restore_archive_serial_numbers_task.s(backup)],
-                    )
-                except Exception:
-                    restore_archive_serial_numbers(backup)
-                    raise
+                chord(header=consume_tasks, body=delete.si([doc.id])).delay()
            else:
                group(consume_tasks).delay()
@@ -673,8 +614,7 @@ def edit_pdf(
            )
            if user is not None:
                overrides.owner_id = user.id
-            if delete_original and len(pdf_docs) == 1:
-                overrides.asn = doc.archive_serial_number
+
            for idx, pdf in enumerate(pdf_docs, start=1):
                filepath: Path = (
                    Path(tempfile.mkdtemp(dir=settings.SCRATCH_DIR))
@@ -693,17 +633,7 @@ def edit_pdf(
                )
            if delete_original:
-                backup = release_archive_serial_numbers([doc.id])
-                try:
-                    chord(
-                        header=consume_tasks,
-                        body=delete.si([doc.id]),
-                    ).apply_async(
-                        link_error=[restore_archive_serial_numbers_task.s(backup)],
-                    )
-                except Exception:
-                    restore_archive_serial_numbers(backup)
-                    raise
+                chord(header=consume_tasks, body=delete.si([doc.id])).delay()
            else:
                group(consume_tasks).delay()
--- a/src/documents/consumer.py
+++ b/src/documents/consumer.py
@@ -696,7 +696,7 @@ class ConsumerPlugin(
                pk=self.metadata.storage_path_id,
            )
-        if self.metadata.asn is not None:
+        if self.metadata.asn is not None and not self.metadata.skip_asn:
            document.archive_serial_number = self.metadata.asn
        if self.metadata.owner_id:
@@ -812,8 +812,8 @@ class ConsumerPreflightPlugin(
        """
        Check that if override_asn is given, it is unique and within a valid range
        """
-        if self.metadata.asn is None:
-            # if ASN is None
+        if self.metadata.skip_asn or self.metadata.asn is None:
+            # if skip is set or ASN is None
            return
        # Validate the range is above zero and less than uint32_t max
        # otherwise, Whoosh can't handle it in the index
--- a/src/documents/data_models.py
+++ b/src/documents/data_models.py
@@ -30,6 +30,7 @@ class DocumentMetadataOverrides:
    change_users: list[int] | None = None
    change_groups: list[int] | None = None
    custom_fields: dict | None = None
    skip_asn: bool = False
    def update(self, other: "DocumentMetadataOverrides") -> "DocumentMetadataOverrides":
        """
@@ -49,6 +50,8 @@ class DocumentMetadataOverrides:
            self.storage_path_id = other.storage_path_id
        if other.owner_id is not None:
            self.owner_id = other.owner_id
        if other.skip_asn:
            self.skip_asn = True
        # merge
        if self.tag_ids is None:
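Note on `skip_asn`: the flag is merged as a one-way switch (an update can set it but never clear it), and the consumer then ignores any ASN while it is set. The sketch below illustrates that behaviour with a hypothetical two-field stand-in, `Overrides`, and a hypothetical helper, `asn_to_store`; the real `DocumentMetadataOverrides` has many more fields, and the ASN copy in `update` is assumed here for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Overrides:
    """Hypothetical stand-in for DocumentMetadataOverrides (two fields only)."""

    asn: int | None = None
    skip_asn: bool = False

    def update(self, other: Overrides) -> Overrides:
        # Assumed for illustration: a non-None ASN on `other` wins.
        if other.asn is not None:
            self.asn = other.asn
        # Sticky flag: once any source asks to skip, it stays set.
        if other.skip_asn:
            self.skip_asn = True
        return self


def asn_to_store(meta: Overrides) -> int | None:
    """Mirror the consumer guard: keep the ASN only when it is not skipped."""
    if meta.asn is not None and not meta.skip_asn:
        return meta.asn
    return None


merged = Overrides(asn=101).update(Overrides(skip_asn=True))
stored = asn_to_store(merged)
```

Here `merged.skip_asn` is `True` and `stored` is `None`, so the merged document would be created without an ASN even though one was present in the overrides.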
--- a/src/documents/tests/test_bulk_edit.py
+++ b/src/documents/tests/test_bulk_edit.py
@@ -602,21 +602,23 @@ class TestPDFActions(DirectoriesMixin, TestCase):
            expected_filename,
        )
        self.assertEqual(consume_file_args[1].title, None)
-        # No metadata_document_id, delete_originals False, so ASN should be None
-        self.assertIsNone(consume_file_args[1].asn)
+        self.assertTrue(consume_file_args[1].skip_asn)
        # With metadata_document_id overrides
        result = bulk_edit.merge(doc_ids, metadata_document_id=metadata_document_id)
        consume_file_args, _ = mock_consume_file.call_args
        self.assertEqual(consume_file_args[1].title, "B (merged)")
        self.assertEqual(consume_file_args[1].created, self.doc2.created)
        self.assertTrue(consume_file_args[1].skip_asn)
        self.assertEqual(result, "OK")
    @mock.patch("documents.bulk_edit.delete.si")
    @mock.patch("documents.tasks.consume_file.s")
    @mock.patch("documents.bulk_edit.chain")
    def test_merge_and_delete_originals(
        self,
        mock_chain,
        mock_consume_file,
        mock_delete_documents,
    ):
@@ -630,12 +632,6 @@ class TestPDFActions(DirectoriesMixin, TestCase):
            - Document deletion task should be called
        """
        doc_ids = [self.doc1.id, self.doc2.id, self.doc3.id]
        self.doc1.archive_serial_number = 101
        self.doc2.archive_serial_number = 102
        self.doc3.archive_serial_number = 103
        self.doc1.save()
        self.doc2.save()
        self.doc3.save()
        result = bulk_edit.merge(doc_ids, delete_originals=True)
        self.assertEqual(result, "OK")
@@ -646,8 +642,7 @@ class TestPDFActions(DirectoriesMixin, TestCase):
        mock_consume_file.assert_called()
        mock_delete_documents.assert_called()
-        consume_sig = mock_consume_file.return_value
-        consume_sig.apply_async.assert_called_once()
+        mock_chain.assert_called_once()
        consume_file_args, _ = mock_consume_file.call_args
        self.assertEqual(
@@ -655,7 +650,7 @@ class TestPDFActions(DirectoriesMixin, TestCase):
            expected_filename,
        )
        self.assertEqual(consume_file_args[1].title, None)
-        self.assertEqual(consume_file_args[1].asn, 101)
+        self.assertTrue(consume_file_args[1].skip_asn)
        delete_documents_args, _ = mock_delete_documents.call_args
        self.assertEqual(
@@ -663,13 +658,6 @@ class TestPDFActions(DirectoriesMixin, TestCase):
            doc_ids,
        )
        self.doc1.refresh_from_db()
        self.doc2.refresh_from_db()
        self.doc3.refresh_from_db()
        self.assertIsNone(self.doc1.archive_serial_number)
        self.assertIsNone(self.doc2.archive_serial_number)
        self.assertIsNone(self.doc3.archive_serial_number)
    @mock.patch("documents.tasks.consume_file.s")
    def test_merge_with_archive_fallback(self, mock_consume_file):
        """
@@ -738,7 +726,6 @@ class TestPDFActions(DirectoriesMixin, TestCase):
        self.assertEqual(mock_consume_file.call_count, 2)
        consume_file_args, _ = mock_consume_file.call_args
        self.assertEqual(consume_file_args[1].title, "B (split 2)")
        self.assertIsNone(consume_file_args[1].asn)
        self.assertEqual(result, "OK")
@@ -763,8 +750,6 @@ class TestPDFActions(DirectoriesMixin, TestCase):
        """
        doc_ids = [self.doc2.id]
        pages = [[1, 2], [3]]
        self.doc2.archive_serial_number = 200
        self.doc2.save()
        result = bulk_edit.split(doc_ids, pages, delete_originals=True)
        self.assertEqual(result, "OK")
@@ -782,9 +767,6 @@ class TestPDFActions(DirectoriesMixin, TestCase):
            doc_ids,
        )
        self.doc2.refresh_from_db()
        self.assertIsNone(self.doc2.archive_serial_number)
    @mock.patch("documents.tasks.consume_file.delay")
    @mock.patch("pikepdf.Pdf.save")
    def test_split_with_errors(self, mock_save_pdf, mock_consume_file):
@@ -985,16 +967,10 @@ class TestPDFActions(DirectoriesMixin, TestCase):
        mock_chord.return_value.delay.return_value = None
        doc_ids = [self.doc2.id]
        operations = [{"page": 1}, {"page": 2}]
        self.doc2.archive_serial_number = 250
        self.doc2.save()
        result = bulk_edit.edit_pdf(doc_ids, operations, delete_original=True)
        self.assertEqual(result, "OK")
        mock_chord.assert_called_once()
        consume_file_args, _ = mock_consume_file.call_args
        self.assertEqual(consume_file_args[1].asn, 250)
        self.doc2.refresh_from_db()
        self.assertIsNone(self.doc2.archive_serial_number)
    @mock.patch("documents.tasks.update_document_content_maybe_archive_file.delay")
    def test_edit_pdf_with_update_document(self, mock_update_document):
--- a/src/documents/tests/test_consumer.py
+++ b/src/documents/tests/test_consumer.py
@@ -412,6 +412,14 @@ class TestConsumer(
        self.assertEqual(document.archive_serial_number, 123)
        self._assert_first_last_send_progress()
    def testMetadataOverridesSkipAsnPropagation(self):
        overrides = DocumentMetadataOverrides()
        incoming = DocumentMetadataOverrides(skip_asn=True)
        overrides.update(incoming)
        self.assertTrue(overrides.skip_asn)
    def testOverrideTitlePlaceholders(self):
        c = Correspondent.objects.create(name="Correspondent Name")
        dt = DocumentType.objects.create(name="DocType Name")
--- a/src/paperless/settings.py
+++ b/src/paperless/settings.py
@@ -322,6 +322,7 @@ INSTALLED_APPS = [
    "paperless_tesseract.apps.PaperlessTesseractConfig",
    "paperless_text.apps.PaperlessTextConfig",
    "paperless_mail.apps.PaperlessMailConfig",
    "paperless_remote.apps.PaperlessRemoteParserConfig",
    "django.contrib.admin",
    "rest_framework",
    "rest_framework.authtoken",
@@ -1402,3 +1403,10 @@ WEBHOOKS_ALLOW_INTERNAL_REQUESTS = __get_boolean(
    "PAPERLESS_WEBHOOKS_ALLOW_INTERNAL_REQUESTS",
    "true",
 )
 ###############################################################################
 # Remote Parser                                                               #
 ###############################################################################
 REMOTE_OCR_ENGINE = os.getenv("PAPERLESS_REMOTE_OCR_ENGINE")
 REMOTE_OCR_API_KEY = os.getenv("PAPERLESS_REMOTE_OCR_API_KEY")
 REMOTE_OCR_ENDPOINT = os.getenv("PAPERLESS_REMOTE_OCR_ENDPOINT")
--- a/src/paperless_remote/__init__.py
+++ b/src/paperless_remote/__init__.py
@@ -0,0 +1,4 @@
 # this is here so that django finds the checks.
 from paperless_remote.checks import check_remote_parser_configured
 __all__ = ["check_remote_parser_configured"]
--- a/src/paperless_remote/apps.py
+++ b/src/paperless_remote/apps.py
@@ -0,0 +1,14 @@
 from django.apps import AppConfig
 from paperless_remote.signals import remote_consumer_declaration
 class PaperlessRemoteParserConfig(AppConfig):
    name = "paperless_remote"
    def ready(self):
        from documents.signals import document_consumer_declaration
        document_consumer_declaration.connect(remote_consumer_declaration)
        AppConfig.ready(self)
--- a/src/paperless_remote/checks.py
+++ b/src/paperless_remote/checks.py
@@ -0,0 +1,17 @@
 from django.conf import settings
 from django.core.checks import Error
 from django.core.checks import register
@register()
 def check_remote_parser_configured(app_configs, **kwargs):
    if settings.REMOTE_OCR_ENGINE == "azureai" and not (
        settings.REMOTE_OCR_ENDPOINT and settings.REMOTE_OCR_API_KEY
    ):
        return [
            Error(
                "Azure AI remote parser requires endpoint and API key to be configured.",
            ),
        ]
    return []
--- a/src/paperless_remote/parsers.py
+++ b/src/paperless_remote/parsers.py
@@ -0,0 +1,118 @@
 from pathlib import Path
 from django.conf import settings
 from paperless_tesseract.parsers import RasterisedDocumentParser
 class RemoteEngineConfig:
    def __init__(
        self,
        engine: str,
        api_key: str | None = None,
        endpoint: str | None = None,
    ):
        self.engine = engine
        self.api_key = api_key
        self.endpoint = endpoint
    def engine_is_valid(self):
        valid = self.engine in ["azureai"] and self.api_key is not None
        if self.engine == "azureai":
            valid = valid and self.endpoint is not None
        return valid
 class RemoteDocumentParser(RasterisedDocumentParser):
    """
    This parser uses a remote OCR engine to parse documents. Currently, it supports Azure AI Vision
    as this is the only service that provides a remote OCR API with text-embedded PDF output.
    """
    logging_name = "paperless.parsing.remote"
    def get_settings(self) -> RemoteEngineConfig:
        """
        Returns the configuration for the remote OCR engine, loaded from Django settings.
        """
        return RemoteEngineConfig(
            engine=settings.REMOTE_OCR_ENGINE,
            api_key=settings.REMOTE_OCR_API_KEY,
            endpoint=settings.REMOTE_OCR_ENDPOINT,
        )
    def supported_mime_types(self):
        if self.settings.engine_is_valid():
            return {
                "application/pdf": ".pdf",
                "image/png": ".png",
                "image/jpeg": ".jpg",
                "image/tiff": ".tiff",
                "image/bmp": ".bmp",
                "image/gif": ".gif",
                "image/webp": ".webp",
            }
        else:
            return {}
    def azure_ai_vision_parse(
        self,
        file: Path,
    ) -> str | None:
        """
        Uses Azure AI Vision to parse the document and return the text content.
        It requests a searchable PDF output with embedded text.
        The PDF is saved to the archive_path attribute.
        Returns the text content extracted from the document.
        If the parsing fails, it returns None.
        """
        from azure.ai.documentintelligence import DocumentIntelligenceClient
        from azure.ai.documentintelligence.models import AnalyzeDocumentRequest
        from azure.ai.documentintelligence.models import AnalyzeOutputOption
        from azure.ai.documentintelligence.models import DocumentContentFormat
        from azure.core.credentials import AzureKeyCredential
        client = DocumentIntelligenceClient(
            endpoint=self.settings.endpoint,
            credential=AzureKeyCredential(self.settings.api_key),
        )
        try:
            with file.open("rb") as f:
                analyze_request = AnalyzeDocumentRequest(bytes_source=f.read())
                poller = client.begin_analyze_document(
                    model_id="prebuilt-read",
                    body=analyze_request,
                    output_content_format=DocumentContentFormat.TEXT,
                    output=[AnalyzeOutputOption.PDF],  # request searchable PDF output
                    content_type="application/json",
                )
            poller.wait()
            result_id = poller.details["operation_id"]
            result = poller.result()
            # Download the PDF with embedded text
            self.archive_path = self.tempdir / "archive.pdf"
            with self.archive_path.open("wb") as f:
                for chunk in client.get_analyze_result_pdf(
                    model_id="prebuilt-read",
                    result_id=result_id,
                ):
                    f.write(chunk)
            return result.content
        except Exception as e:
            self.log.error(f"Azure AI Vision parsing failed: {e}")
        finally:
            client.close()
        return None
    def parse(self, document_path: Path, mime_type, file_name=None):
        if not self.settings.engine_is_valid():
            self.log.warning(
                "No valid remote parser engine is configured, content will be empty.",
            )
            self.text = ""
        elif self.settings.engine == "azureai":
            self.text = self.azure_ai_vision_parse(document_path)
--- a/src/paperless_remote/signals.py
+++ b/src/paperless_remote/signals.py
@@ -0,0 +1,18 @@
 def get_parser(*args, **kwargs):
    from paperless_remote.parsers import RemoteDocumentParser
    return RemoteDocumentParser(*args, **kwargs)
 def get_supported_mime_types():
    from paperless_remote.parsers import RemoteDocumentParser
    return RemoteDocumentParser(None).supported_mime_types()
 def remote_consumer_declaration(sender, **kwargs):
    return {
        "parser": get_parser,
        "weight": 5,
        "mime_types": get_supported_mime_types(),
    }
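Note on the declaration above: because `mime_types` comes from `supported_mime_types()`, the remote parser advertises nothing when it is not configured, which is what keeps the feature opt-in. The sketch below illustrates the effect under two assumptions that are not shown in this diff: that the consumer picks the declaration with the highest `weight` for a given MIME type, and that the local Tesseract parser declares a lower weight (0 is used here). `pick_parser` and the dictionaries are hypothetical.

```python
def pick_parser(declarations, mime_type):
    """Return the name of the highest-weight parser supporting mime_type."""
    candidates = [d for d in declarations if mime_type in d["mime_types"]]
    if not candidates:
        return None
    # Assumption: the highest weight wins among parsers that support the type.
    return max(candidates, key=lambda d: d["weight"])["parser"]


# Hypothetical declarations; parser names are plain strings for illustration.
local = {"parser": "tesseract", "weight": 0, "mime_types": {"application/pdf": ".pdf"}}
remote_on = {"parser": "remote", "weight": 5, "mime_types": {"application/pdf": ".pdf"}}
remote_off = {"parser": "remote", "weight": 5, "mime_types": {}}

configured = pick_parser([local, remote_on], "application/pdf")
unconfigured = pick_parser([local, remote_off], "application/pdf")
```

Under these assumptions `configured` is `"remote"` and `unconfigured` is `"tesseract"`, so an unconfigured remote engine never displaces local OCR.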
--- a/src/paperless_remote/tests/__init__.py
+++ b/src/paperless_remote/tests/__init__.py
--- a/src/paperless_remote/tests/samples/simple-digital.pdf
+++ b/src/paperless_remote/tests/samples/simple-digital.pdf
--- a/src/paperless_remote/tests/test_checks.py
+++ b/src/paperless_remote/tests/test_checks.py
@@ -0,0 +1,24 @@
 from unittest import TestCase
 from django.test import override_settings
 from paperless_remote import check_remote_parser_configured
 class TestChecks(TestCase):
    @override_settings(REMOTE_OCR_ENGINE=None)
    def test_no_engine(self):
        msgs = check_remote_parser_configured(None)
        self.assertEqual(len(msgs), 0)
    @override_settings(REMOTE_OCR_ENGINE="azureai")
    @override_settings(REMOTE_OCR_API_KEY="somekey")
    @override_settings(REMOTE_OCR_ENDPOINT=None)
    def test_azure_no_endpoint(self):
        msgs = check_remote_parser_configured(None)
        self.assertEqual(len(msgs), 1)
        self.assertTrue(
            msgs[0].msg.startswith(
                "Azure AI remote parser requires endpoint and API key to be configured.",
            ),
        )
--- a/src/paperless_remote/tests/test_parser.py
+++ b/src/paperless_remote/tests/test_parser.py
@@ -0,0 +1,128 @@
 import uuid
 from pathlib import Path
 from unittest import mock
 from django.test import TestCase
 from django.test import override_settings
 from documents.tests.utils import DirectoriesMixin
 from documents.tests.utils import FileSystemAssertsMixin
 from paperless_remote.parsers import RemoteDocumentParser
 from paperless_remote.signals import get_parser
 class TestParser(DirectoriesMixin, FileSystemAssertsMixin, TestCase):
    SAMPLE_FILES = Path(__file__).resolve().parent / "samples"
    def assertContainsStrings(self, content: str, strings: list[str]):
        # Asserts that all strings appear in content, in the given order.
        indices = []
        for s in strings:
            if s in content:
                indices.append(content.index(s))
            else:
                self.fail(f"'{s}' is not in '{content}'")
        self.assertListEqual(indices, sorted(indices))
    @mock.patch("paperless_tesseract.parsers.run_subprocess")
    @mock.patch("azure.ai.documentintelligence.DocumentIntelligenceClient")
    def test_get_text_with_azure(self, mock_client_cls, mock_subprocess):
        # Arrange mock Azure client
        mock_client = mock.Mock()
        mock_client_cls.return_value = mock_client
        # Simulate poller result and its `.details`
        mock_poller = mock.Mock()
        mock_poller.wait.return_value = None
        mock_poller.details = {"operation_id": "fake-op-id"}
        mock_client.begin_analyze_document.return_value = mock_poller
        mock_poller.result.return_value.content = "This is a test document."
        # Return dummy PDF bytes
        mock_client.get_analyze_result_pdf.return_value = [
            b"%PDF-",
            b"1.7 ",
            b"FAKEPDF",
        ]
        # Simulate pdftotext by writing dummy text to sidecar file
        def fake_run(cmd, *args, **kwargs):
            with Path(cmd[-1]).open("w", encoding="utf-8") as f:
                f.write("This is a test document.")
        mock_subprocess.side_effect = fake_run
        with override_settings(
            REMOTE_OCR_ENGINE="azureai",
            REMOTE_OCR_API_KEY="somekey",
            REMOTE_OCR_ENDPOINT="https://endpoint.cognitiveservices.azure.com",
        ):
            parser = get_parser(uuid.uuid4())
            parser.parse(
                self.SAMPLE_FILES / "simple-digital.pdf",
                "application/pdf",
            )
            self.assertContainsStrings(
                parser.text.strip(),
                ["This is a test document."],
            )
    @mock.patch("azure.ai.documentintelligence.DocumentIntelligenceClient")
    def test_get_text_with_azure_error_logged_and_returns_none(self, mock_client_cls):
        mock_client = mock.Mock()
        mock_client.begin_analyze_document.side_effect = RuntimeError("fail")
        mock_client_cls.return_value = mock_client
        with override_settings(
            REMOTE_OCR_ENGINE="azureai",
            REMOTE_OCR_API_KEY="somekey",
            REMOTE_OCR_ENDPOINT="https://endpoint.cognitiveservices.azure.com",
        ):
            parser = get_parser(uuid.uuid4())
            with mock.patch.object(parser.log, "error") as mock_log_error:
                parser.parse(
                    self.SAMPLE_FILES / "simple-digital.pdf",
                    "application/pdf",
                )
        self.assertIsNone(parser.text)
        mock_client.begin_analyze_document.assert_called_once()
        mock_client.close.assert_called_once()
        mock_log_error.assert_called_once()
        self.assertIn(
            "Azure AI Vision parsing failed",
            mock_log_error.call_args[0][0],
        )
    @override_settings(
        REMOTE_OCR_ENGINE="azureai",
        REMOTE_OCR_API_KEY="key",
        REMOTE_OCR_ENDPOINT="https://endpoint.cognitiveservices.azure.com",
    )
    def test_supported_mime_types_valid_config(self):
        parser = RemoteDocumentParser(uuid.uuid4())
        expected_types = {
            "application/pdf": ".pdf",
            "image/png": ".png",
            "image/jpeg": ".jpg",
            "image/tiff": ".tiff",
            "image/bmp": ".bmp",
            "image/gif": ".gif",
            "image/webp": ".webp",
        }
        self.assertEqual(parser.supported_mime_types(), expected_types)
    def test_supported_mime_types_invalid_config(self):
        parser = get_parser(uuid.uuid4())
        self.assertEqual(parser.supported_mime_types(), {})
    @override_settings(
        REMOTE_OCR_ENGINE=None,
        REMOTE_OCR_API_KEY=None,
        REMOTE_OCR_ENDPOINT=None,
    )
    def test_parse_with_invalid_config(self):
        parser = get_parser(uuid.uuid4())
        parser.parse(self.SAMPLE_FILES / "simple-digital.pdf", "application/pdf")
        self.assertEqual(parser.text, "")
--- a/uv.lock
+++ b/uv.lock
@@ -95,6 +95,34 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/02/ff/1175b0b7371e46244032d43a56862d0af455823b5280a50c63d99cc50f18/automat-25.4.16-py3-none-any.whl", hash = "sha256:04e9bce696a8d5671ee698005af6e5a9fa15354140a87f4870744604dcdd3ba1", size = 42842, upload-time = "2025-04-16T20:12:14.447Z" },
 ]
 [[package]]
 name = "azure-ai-documentintelligence"
 version = "1.0.2"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
    { name = "azure-core", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
    { name = "isodate", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
    { name = "typing-extensions", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
 ]
 sdist = { url = "https://files.pythonhosted.org/packages/44/7b/8115cd713e2caa5e44def85f2b7ebd02a74ae74d7113ba20bdd41fd6dd80/azure_ai_documentintelligence-1.0.2.tar.gz", hash = "sha256:4d75a2513f2839365ebabc0e0e1772f5601b3a8c9a71e75da12440da13b63484", size = 170940 }
 wheels = [
    { url = "https://files.pythonhosted.org/packages/d9/75/c9ec040f23082f54ffb1977ff8f364c2d21c79a640a13d1c1809e7fd6b1a/azure_ai_documentintelligence-1.0.2-py3-none-any.whl", hash = "sha256:e1fb446abbdeccc9759d897898a0fe13141ed29f9ad11fc705f951925822ed59", size = 106005 },
 ]
 [[package]]
 name = "azure-core"
 version = "1.33.0"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
    { name = "requests", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
    { name = "six", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
    { name = "typing-extensions", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
 ]
 sdist = { url = "https://files.pythonhosted.org/packages/75/aa/7c9db8edd626f1a7d99d09ef7926f6f4fb34d5f9fa00dc394afdfe8e2a80/azure_core-1.33.0.tar.gz", hash = "sha256:f367aa07b5e3005fec2c1e184b882b0b039910733907d001c20fb08ebb8c0eb9", size = 295633 }
 wheels = [
    { url = "https://files.pythonhosted.org/packages/07/b7/76b7e144aa53bd206bf1ce34fa75350472c3f69bf30e5c8c18bc9881035d/azure_core-1.33.0-py3-none-any.whl", hash = "sha256:9b5b6d0223a1d38c37500e6971118c1e0f13f54951e6893968b38910bc9cda8f", size = 207071 },
 ]
 [[package]]
 name = "babel"
 version = "2.17.0"
@@ -1451,6 +1479,15 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/c7/fc/4e5a141c3f7c7bed550ac1f69e599e92b6be449dd4677ec09f325cad0955/inotifyrecursive-0.3.5-py3-none-any.whl", hash = "sha256:7e5f4a2e1dc2bef0efa3b5f6b339c41fb4599055a2b54909d020e9e932cc8d2f", size = 8009, upload-time = "2020-11-20T12:38:46.981Z" },
 ]
 [[package]]
 name = "isodate"
 version = "0.7.2"
 source = { registry = "https://pypi.org/simple" }
 sdist = { url = "https://files.pythonhosted.org/packages/54/4d/e940025e2ce31a8ce1202635910747e5a87cc3a6a6bb2d00973375014749/isodate-0.7.2.tar.gz", hash = "sha256:4cd1aa0f43ca76f4a6c6c0292a85f40b35ec2e43e315b59f06e6d32171a953e6", size = 29705 }
 wheels = [
    { url = "https://files.pythonhosted.org/packages/15/aa/0aca39a37d3c7eb941ba736ede56d689e7be91cab5d9ca846bde3999eba6/isodate-0.7.2-py3-none-any.whl", hash = "sha256:28009937d8031054830160fce6d409ed342816b543597cece116d966c6d99e15", size = 22320 },
 ]
 [[package]]
 name = "jinja2"
 version = "3.1.6"
@@ -2118,6 +2155,7 @@ name = "paperless-ngx"
 version = "2.20.3"
 source = { virtual = "." }
 dependencies = [
    { name = "azure-ai-documentintelligence", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
    { name = "babel", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
    { name = "bleach", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
    { name = "celery", extra = ["redis"], marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
@@ -2255,6 +2293,7 @@ typing = [
 [package.metadata]
 requires-dist = [
    { name = "azure-ai-documentintelligence", specifier = ">=1.0.2" },
    { name = "babel", specifier = ">=2.17" },
    { name = "bleach", specifier = "~=6.3.0" },
    { name = "celery", extras = ["redis"], specifier = "~=5.5.1" },
Author	SHA1	Message	Date
shamoon	14cfc96a5e	Merge branch 'dev' into feature-remote-ocr-2	2025-12-30 13:08:19 -08:00
shamoon	2cd96610f6	Merge branch 'dev' into feature-remote-ocr-2	2025-12-11 12:58:14 -08:00
shamoon	9eb81d5458	Merge branch 'dev' into feature-remote-ocr-2	2025-12-07 20:37:56 -08:00
shamoon	6a5ea49715	Merge branch 'dev' into feature-remote-ocr-2	2025-11-22 13:18:50 -08:00
shamoon	7d2fe630a5	Merge branch 'dev' into feature-remote-ocr-2	2025-11-19 23:49:11 -08:00
shamoon	c29dd5485b	Merge branch 'dev' into feature-remote-ocr-2	2025-11-18 12:08:38 -08:00
shamoon	cef100a955	Wrap in try/catch	2025-11-18 12:07:16 -08:00
shamoon	4f53d1b6ee	Merge branch 'dev' into feature-remote-ocr-2	2025-11-17 20:54:37 -08:00
shamoon	23cea77548	Merge branch 'dev' into feature-remote-ocr-2	2025-11-17 18:49:01 -08:00
shamoon	4900af93c6	Merge branch 'dev' into feature-remote-ocr-2	2025-11-15 13:49:39 -08:00
shamoon	ef834ae808	Merge branch 'dev' into feature-remote-ocr-2	2025-11-13 15:45:08 -08:00
shamoon	0537e87cb5	Merge branch 'dev' into feature-remote-ocr-2	2025-11-06 11:46:02 -08:00
shamoon	b4da5c3cd1	Merge branch 'dev' into feature-remote-ocr-2	2025-11-04 16:24:26 -08:00
shamoon	251b0fb3d6	Merge branch 'dev' into feature-remote-ocr-2	2025-11-04 08:24:02 -08:00
shamoon	32bdf11f7f	Merge branch 'dev' into feature-remote-ocr-2	2025-11-02 08:14:04 -08:00
shamoon	0627ca69f5	Merge branch 'dev' into feature-remote-ocr-2	2025-10-29 11:13:53 -07:00
shamoon	f5525bbdff	Merge branch 'dev' into feature-remote-ocr-2	2025-10-27 21:22:42 -07:00
shamoon	a21a2a41a8	Merge branch 'dev' into feature-remote-ocr-2	2025-10-26 07:41:51 -07:00
shamoon	cc73ed8b86	Merge branch 'dev' into feature-remote-ocr-2	2025-10-24 16:48:07 -07:00
shamoon	0c706b2316	Merge branch 'dev' into feature-remote-ocr-2	2025-10-23 16:38:35 -07:00
shamoon	85b7b6874d	Merge branch 'dev' into feature-remote-ocr-2	2025-10-22 21:53:07 -07:00
shamoon	56b26185fa	Merge branch 'dev' into feature-remote-ocr-2	2025-10-21 08:23:20 -07:00
shamoon	6537fade7b	Merge branch 'dev' into feature-remote-ocr-2	2025-10-15 16:04:02 -07:00
shamoon	9f8090816f	Merge branch 'dev' into feature-remote-ocr-2	2025-10-09 12:54:58 -07:00
shamoon	1de7c52478	Merge branch 'dev' into feature-remote-ocr-2	2025-10-01 19:24:38 -07:00
shamoon	9aaaa6f069	Merge branch 'dev' into feature-remote-ocr-2	2025-09-30 09:14:56 -07:00
shamoon	c3a20b7797	Merge branch 'dev' into feature-remote-ocr-2	2025-09-28 15:06:37 -07:00
shamoon	476556379b	Merge branch 'dev' into feature-remote-ocr-2	2025-09-24 13:46:49 -07:00
shamoon	e5cafff043	Merge branch 'dev' into feature-remote-ocr-2	2025-09-22 13:42:55 -07:00
shamoon	8e0d574e99	Merge branch 'dev' into feature-remote-ocr-2	2025-09-21 16:18:13 -07:00
shamoon	8a5820328e	Sonar suggestions	2025-09-17 19:18:47 -07:00
shamoon	809d62a2f4	Merge branch 'dev' into feature-remote-ocr-2	2025-09-17 16:51:23 -07:00
shamoon	0d87f94b9b	Merge branch 'dev' into feature-remote-ocr-2	2025-09-14 14:01:35 -07:00
shamoon	315b90f8e5	Add typing to assertContainsStrings test util	2025-09-11 13:56:14 -07:00
shamoon	47b2d2964b	Use regular testcase instead of django, config check test	2025-09-11 13:52:10 -07:00
shamoon	e05639ae4e	tempdir already a path	2025-09-11 13:49:30 -07:00
shamoon	f400a8cb2f	Close client	2025-09-11 13:49:06 -07:00
shamoon	26abcf5612	Also ensure API key is set	2025-09-11 13:48:06 -07:00
shamoon	afde52430d	Merge branch 'dev' into feature-remote-ocr-2	2025-09-11 13:25:53 -07:00
shamoon	716f2da652	Merge branch 'dev' into feature-remote-ocr-2	2025-09-08 11:36:49 -07:00
shamoon	c54073b7c2	Merge branch 'dev' into feature-remote-ocr-2	2025-09-04 09:16:59 -07:00
shamoon	247e6f39dc	Merge branch 'dev' into feature-remote-ocr-2	2025-09-01 20:10:40 -07:00
shamoon	1e6dfc4481	Merge branch 'dev' into feature-remote-ocr-2	2025-08-26 13:30:39 -07:00
shamoon	7cc0750066	Add note on costs and limitations for Azure OCR	2025-08-24 05:47:07 -07:00
shamoon	bd6585d3b4	Merge branch 'dev' into feature-remote-ocr-2	2025-08-22 08:54:26 -07:00
shamoon	717e828a1d	Merge branch 'dev' into feature-remote-ocr-2	2025-08-17 21:25:14 -07:00
shamoon	07381d48e6	Merge branch 'dev' into feature-remote-ocr-2	2025-08-17 07:49:58 -07:00
shamoon	dd0ffaf312	Merge branch 'dev' into feature-remote-ocr-2	2025-08-11 10:48:36 -07:00
shamoon	264504affc	Fix consumer declaration file extensions	2025-08-10 05:32:52 -07:00
shamoon	4feedf2add	Merge branch 'dev' into feature-remote-ocr-2	2025-08-06 16:04:25 -04:00
shamoon	2f76cf9831	Merge branch 'dev' into feature-remote-ocr-2	2025-08-01 23:55:49 -04:00
shamoon	1002d37f6b	Update test_parser.py	2025-07-09 11:05:37 -07:00
shamoon	d260a94740	Update parsers.py	2025-07-09 11:02:57 -07:00
shamoon	88c69b83ea	Update index.md	2025-07-09 11:00:12 -07:00
shamoon	2557ee2014	Update docs to mention remote OCR with Azure AI	2025-07-09 09:53:30 -07:00
shamoon	3c75deed80	Add paperless_remote tests to testpaths	2025-07-08 14:19:45 -07:00
shamoon	d05343c927	Test fixes / coverage	2025-07-08 14:19:45 -07:00
shamoon	e7972b7eaf	Coverage	2025-07-08 14:19:45 -07:00
shamoon	75a091cc0d	Fix test	2025-07-08 14:19:44 -07:00
shamoon	dca74803fd	Use output_content_format poller.result to get clean content	2025-07-08 14:19:44 -07:00
shamoon	3cf3d868d0	Some docs	2025-07-08 14:19:43 -07:00
shamoon	bf4fc6604a	Test	2025-07-08 14:19:43 -07:00
shamoon	e8c1eb86fa	This actually works [ci skip]	2025-07-08 14:19:43 -07:00
shamoon	c3dad3cf69	Basic parse	2025-07-08 14:19:42 -07:00
shamoon	811bd66088	Ok, restart implementing this with just azure [ci skip]	2025-07-08 14:19:42 -07:00