Mirror of https://github.com/paperless-ngx/paperless-ngx.git (synced 2025-12-31 13:58:04 -06:00)

Compare commits: 3 commits, feature-re ... feature-be

- 016bccdcdf
- 92deebddd4
- c7efcee3d6
@@ -1804,23 +1804,3 @@ password. All of these options come from their similarly-named [Django settings]

 #### [`PAPERLESS_EMAIL_USE_SSL=<bool>`](#PAPERLESS_EMAIL_USE_SSL) {#PAPERLESS_EMAIL_USE_SSL}

 : Defaults to false.
-
-## Remote OCR
-
-#### [`PAPERLESS_REMOTE_OCR_ENGINE=<str>`](#PAPERLESS_REMOTE_OCR_ENGINE) {#PAPERLESS_REMOTE_OCR_ENGINE}
-
-: The remote OCR engine to use. Currently only Azure AI is supported as "azureai".
-
-    Defaults to None, which disables remote OCR.
-
-#### [`PAPERLESS_REMOTE_OCR_API_KEY=<str>`](#PAPERLESS_REMOTE_OCR_API_KEY) {#PAPERLESS_REMOTE_OCR_API_KEY}
-
-: The API key to use for the remote OCR engine.
-
-    Defaults to None.
-
-#### [`PAPERLESS_REMOTE_OCR_ENDPOINT=<str>`](#PAPERLESS_REMOTE_OCR_ENDPOINT) {#PAPERLESS_REMOTE_OCR_ENDPOINT}
-
-: The endpoint to use for the remote OCR engine. This is required for Azure AI.
-
-    Defaults to None.
@@ -25,10 +25,9 @@ physical documents into a searchable online archive so you can keep, well, _less

 ## Features

 - **Organize and index** your scanned documents with tags, correspondents, types, and more.
-- _Your_ data is stored locally on _your_ server and is never transmitted or shared in any way, unless you explicitly choose to do so.
+- _Your_ data is stored locally on _your_ server and is never transmitted or shared in any way.
 - Performs **OCR** on your documents, adding searchable and selectable text, even to documents scanned with only images.
 - Utilizes the open-source Tesseract engine to recognize more than 100 languages.
-- _New!_ Supports remote OCR with Azure AI (opt-in).
 - Documents are saved as PDF/A format which is designed for long term storage, alongside the unaltered originals.
 - Uses machine-learning to automatically add tags, correspondents and document types to your documents.
 - Supports PDF documents, images, plain text files, Office documents (Word, Excel, PowerPoint, and LibreOffice equivalents)[^1] and more.
@@ -901,21 +901,6 @@ how regularly you intend to scan documents and use paperless.

 performed the task associated with the document, move it to the
 inbox.

-## Remote OCR
-
-!!! important
-
-    This feature is disabled by default and will always remain strictly "opt-in".
-
-Paperless-ngx supports performing OCR on documents using remote services. At the moment, this is limited to
-[Microsoft's Azure "Document Intelligence" service](https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence).
-This is of course a paid service (with a free tier) which requires an Azure account and subscription. Azure AI is not affiliated with
-Paperless-ngx in any way. When enabled, Paperless-ngx will automatically send appropriate documents to Azure for OCR processing, bypassing
-the local OCR engine. See the [configuration](configuration.md#PAPERLESS_REMOTE_OCR_ENGINE) options for more details.
-
-Additionally, when using a commercial service with this feature, consider both potential costs as well as any associated file size
-or page limitations (e.g. with a free tier).
-
 ## Architecture

 Paperless-ngx consists of the following components:
@@ -16,7 +16,6 @@ classifiers = [
 # This will allow testing to not install a webserver, mysql, etc

 dependencies = [
-  "azure-ai-documentintelligence>=1.0.2",
   "babel>=2.17",
   "bleach~=6.3.0",
   "celery[redis]~=5.5.1",
@@ -254,7 +253,6 @@ testpaths = [
   "src/paperless_tesseract/tests/",
   "src/paperless_tika/tests",
   "src/paperless_text/tests/",
-  "src/paperless_remote/tests/",
 ]
 addopts = [
   "--pythonwarnings=all",
@@ -186,11 +186,7 @@ class BarcodePlugin(ConsumeTaskPlugin):

         # Update/overwrite an ASN if possible
         # After splitting, as otherwise each split document gets the same ASN
-        if (
-            self.settings.barcode_enable_asn
-            and not self.metadata.skip_asn
-            and (located_asn := self.asn) is not None
-        ):
+        if self.settings.barcode_enable_asn and (located_asn := self.asn) is not None:
             logger.info(f"Found ASN in barcode: {located_asn}")
             self.metadata.asn = located_asn

@@ -7,7 +7,6 @@ from pathlib import Path
 from typing import TYPE_CHECKING
 from typing import Literal

-from celery import chain
 from celery import chord
 from celery import group
 from celery import shared_task
@@ -38,6 +37,42 @@ if TYPE_CHECKING:
 logger: logging.Logger = logging.getLogger("paperless.bulk_edit")


+@shared_task(bind=True)
+def restore_archive_serial_numbers_task(
+    self,
+    backup: dict[int, int],
+    *args,
+    **kwargs,
+) -> None:
+    restore_archive_serial_numbers(backup)
+
+
+def release_archive_serial_numbers(doc_ids: list[int]) -> dict[int, int]:
+    """
+    Clears ASNs on documents that are about to be replaced so new documents
+    can be assigned ASNs without uniqueness collisions. Returns a backup map
+    of doc_id -> previous ASN for potential restoration.
+    """
+    qs = Document.objects.filter(
+        id__in=doc_ids,
+        archive_serial_number__isnull=False,
+    ).only("pk", "archive_serial_number")
+    backup = dict(qs.values_list("pk", "archive_serial_number"))
+    qs.update(archive_serial_number=None)
+    logger.info(f"Released archive serial numbers for documents {list(backup.keys())}")
+    return backup
+
+
+def restore_archive_serial_numbers(backup: dict[int, int]) -> None:
+    """
+    Restores ASNs using the provided backup map, intended for
+    rollback when replacement consumption fails.
+    """
+    for doc_id, asn in backup.items():
+        Document.objects.filter(pk=doc_id).update(archive_serial_number=asn)
+    logger.info(f"Restored archive serial numbers for documents {list(backup.keys())}")
+
+
 def set_correspondent(
     doc_ids: list[int],
     correspondent: Correspondent,
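A minimal, self-contained sketch of the round trip the two helpers above implement, using a plain dict in place of the Django `Document` queryset; the function and variable names below are illustrative only, not part of the change.

```python
# Illustrative stand-in for release/restore_archive_serial_numbers above.
# "documents" maps document pk -> ASN (or None); the real helpers run the
# same round trip against Document.objects instead of a dict.

def release_asns(documents: dict[int, int | None], doc_ids: list[int]) -> dict[int, int]:
    backup = {pk: documents[pk] for pk in doc_ids if documents.get(pk) is not None}
    for pk in backup:
        documents[pk] = None  # free the ASN so the replacement document can take it
    return backup

def restore_asns(documents: dict[int, int | None], backup: dict[int, int]) -> None:
    for pk, asn in backup.items():
        documents[pk] = asn  # roll back if the replacement was never consumed

docs = {1: 101, 2: None, 3: 103}
backup = release_asns(docs, [1, 2, 3])
assert docs == {1: None, 2: None, 3: None} and backup == {1: 101, 3: 103}
restore_asns(docs, backup)
assert docs == {1: 101, 2: None, 3: 103}
```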
@@ -386,6 +421,7 @@ def merge(

     merged_pdf = pikepdf.new()
     version: str = merged_pdf.pdf_version
+    handoff_asn: int | None = None
     # use doc_ids to preserve order
     for doc_id in doc_ids:
         doc = qs.get(id=doc_id)
@@ -401,6 +437,8 @@ def merge(
                 version = max(version, pdf.pdf_version)
                 merged_pdf.pages.extend(pdf.pages)
                 affected_docs.append(doc.id)
+                if handoff_asn is None and doc.archive_serial_number is not None:
+                    handoff_asn = doc.archive_serial_number
         except Exception as e:
             logger.exception(
                 f"Error merging document {doc.id}, it will not be included in the merge: {e}",
@@ -426,6 +464,8 @@ def merge(
                 DocumentMetadataOverrides.from_document(metadata_document)
             )
             overrides.title = metadata_document.title + " (merged)"
+            if metadata_document.archive_serial_number is not None:
+                handoff_asn = metadata_document.archive_serial_number
         else:
             overrides = DocumentMetadataOverrides()
     else:
@@ -433,8 +473,9 @@ def merge(

     if user is not None:
         overrides.owner_id = user.id
-    # Avoid copying or detecting ASN from merged PDFs to prevent collision
-    overrides.skip_asn = True
+    if delete_originals and handoff_asn is not None:
+        overrides.asn = handoff_asn

     logger.info("Adding merged document to the task queue.")

@@ -447,12 +488,20 @@ def merge(
     )

     if delete_originals:
+        backup = release_archive_serial_numbers(affected_docs)
         logger.info(
             "Queueing removal of original documents after consumption of merged document",
         )
-        chain(consume_task, delete.si(affected_docs)).delay()
-    else:
-        consume_task.delay()
+        try:
+            consume_task.apply_async(
+                link=[delete.si(affected_docs)],
+                link_error=[restore_archive_serial_numbers_task.s(backup)],
+            )
+        except Exception:
+            restore_archive_serial_numbers(backup)
+            raise
+    else:
+        consume_task.delay()

     return "OK"

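The hunk above replaces the old `chain(consume_task, delete.si(...)).delay()` with `apply_async(link=..., link_error=...)`: the `link` signature (deleting the originals) runs only if consumption succeeds, the `link_error` signature (restoring the released ASNs) runs if it fails, and the surrounding `try/except` additionally restores the ASNs if the dispatch itself raises. A plain-Python analogue of that callback split, with hypothetical stand-in callables rather than real Celery signatures:

```python
# Purely illustrative analogue of apply_async(link=..., link_error=...);
# not the Celery API. on_success plays the role of the `link` signature
# (delete originals), on_error the `link_error` signature (restore ASNs).

def dispatch_with_callbacks(task, on_success, on_error):
    try:
        result = task()
    except Exception:
        on_error()  # task failed -> roll back, like restore_archive_serial_numbers_task
        raise
    on_success()    # task succeeded -> continue, like delete.si(affected_docs)
    return result

dispatch_with_callbacks(
    task=lambda: "merged document consumed",
    on_success=lambda: print("delete original documents"),
    on_error=lambda: print("restore archive serial numbers"),
)
```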
@@ -508,10 +557,20 @@ def split(
     )

     if delete_originals:
+        backup = release_archive_serial_numbers([doc.id])
         logger.info(
             "Queueing removal of original document after consumption of the split documents",
         )
-        chord(header=consume_tasks, body=delete.si([doc.id])).delay()
+        try:
+            chord(
+                header=consume_tasks,
+                body=delete.si([doc.id]),
+            ).apply_async(
+                link_error=[restore_archive_serial_numbers_task.s(backup)],
+            )
+        except Exception:
+            restore_archive_serial_numbers(backup)
+            raise
     else:
         group(consume_tasks).delay()

@@ -614,7 +673,8 @@ def edit_pdf(
     )
     if user is not None:
         overrides.owner_id = user.id
+    if delete_original and len(pdf_docs) == 1:
+        overrides.asn = doc.archive_serial_number
     for idx, pdf in enumerate(pdf_docs, start=1):
         filepath: Path = (
             Path(tempfile.mkdtemp(dir=settings.SCRATCH_DIR))
@@ -633,7 +693,17 @@ def edit_pdf(
     )

     if delete_original:
-        chord(header=consume_tasks, body=delete.si([doc.id])).delay()
+        backup = release_archive_serial_numbers([doc.id])
+        try:
+            chord(
+                header=consume_tasks,
+                body=delete.si([doc.id]),
+            ).apply_async(
+                link_error=[restore_archive_serial_numbers_task.s(backup)],
+            )
+        except Exception:
+            restore_archive_serial_numbers(backup)
+            raise
     else:
         group(consume_tasks).delay()

@@ -696,7 +696,7 @@ class ConsumerPlugin(
                 pk=self.metadata.storage_path_id,
             )

-        if self.metadata.asn is not None and not self.metadata.skip_asn:
+        if self.metadata.asn is not None:
             document.archive_serial_number = self.metadata.asn

         if self.metadata.owner_id:
@@ -812,8 +812,8 @@ class ConsumerPreflightPlugin(
         """
         Check that if override_asn is given, it is unique and within a valid range
         """
-        if self.metadata.skip_asn or self.metadata.asn is None:
-            # if skip is set or ASN is None
+        if self.metadata.asn is None:
+            # if ASN is None
             return
         # Validate the range is above zero and less than uint32_t max
         # otherwise, Whoosh can't handle it in the index
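The comments above describe the range the preflight check enforces: an ASN must be above zero and small enough for Whoosh to index (a uint32). A hedged sketch of that kind of check; the exact bound and exception type used by the real plugin may differ:

```python
# Illustrative only: the shape of the ASN range check described by the
# comments above. The real preflight plugin raises its own error type.
ASN_MAX = 0xFFFFFFFF  # uint32_t max; larger values cannot go into the Whoosh index

def validate_asn_range(asn: int) -> None:
    if not (0 < asn <= ASN_MAX):
        raise ValueError(f"Archive serial number {asn} is out of range 1..{ASN_MAX}")

validate_asn_range(123)    # passes
try:
    validate_asn_range(0)  # rejected: must be above zero
except ValueError as err:
    print(err)
```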
@@ -30,7 +30,6 @@ class DocumentMetadataOverrides:
     change_users: list[int] | None = None
     change_groups: list[int] | None = None
     custom_fields: dict | None = None
-    skip_asn: bool = False

     def update(self, other: "DocumentMetadataOverrides") -> "DocumentMetadataOverrides":
         """
@@ -50,8 +49,6 @@ class DocumentMetadataOverrides:
             self.storage_path_id = other.storage_path_id
         if other.owner_id is not None:
             self.owner_id = other.owner_id
-        if other.skip_asn:
-            self.skip_asn = True

         # merge
         if self.tag_ids is None:
@@ -602,23 +602,21 @@ class TestPDFActions(DirectoriesMixin, TestCase):
             expected_filename,
         )
         self.assertEqual(consume_file_args[1].title, None)
-        self.assertTrue(consume_file_args[1].skip_asn)
+        # No metadata_document_id, delete_originals False, so ASN should be None
+        self.assertIsNone(consume_file_args[1].asn)

         # With metadata_document_id overrides
         result = bulk_edit.merge(doc_ids, metadata_document_id=metadata_document_id)
         consume_file_args, _ = mock_consume_file.call_args
         self.assertEqual(consume_file_args[1].title, "B (merged)")
         self.assertEqual(consume_file_args[1].created, self.doc2.created)
-        self.assertTrue(consume_file_args[1].skip_asn)

         self.assertEqual(result, "OK")

     @mock.patch("documents.bulk_edit.delete.si")
     @mock.patch("documents.tasks.consume_file.s")
-    @mock.patch("documents.bulk_edit.chain")
     def test_merge_and_delete_originals(
         self,
-        mock_chain,
         mock_consume_file,
         mock_delete_documents,
     ):
@@ -632,6 +630,12 @@ class TestPDFActions(DirectoriesMixin, TestCase):
         - Document deletion task should be called
         """
         doc_ids = [self.doc1.id, self.doc2.id, self.doc3.id]
+        self.doc1.archive_serial_number = 101
+        self.doc2.archive_serial_number = 102
+        self.doc3.archive_serial_number = 103
+        self.doc1.save()
+        self.doc2.save()
+        self.doc3.save()

         result = bulk_edit.merge(doc_ids, delete_originals=True)
         self.assertEqual(result, "OK")
@@ -642,7 +646,8 @@ class TestPDFActions(DirectoriesMixin, TestCase):

         mock_consume_file.assert_called()
         mock_delete_documents.assert_called()
-        mock_chain.assert_called_once()
+        consume_sig = mock_consume_file.return_value
+        consume_sig.apply_async.assert_called_once()

         consume_file_args, _ = mock_consume_file.call_args
         self.assertEqual(
@@ -650,7 +655,7 @@ class TestPDFActions(DirectoriesMixin, TestCase):
             expected_filename,
         )
         self.assertEqual(consume_file_args[1].title, None)
-        self.assertTrue(consume_file_args[1].skip_asn)
+        self.assertEqual(consume_file_args[1].asn, 101)

         delete_documents_args, _ = mock_delete_documents.call_args
         self.assertEqual(
@@ -658,6 +663,13 @@ class TestPDFActions(DirectoriesMixin, TestCase):
             doc_ids,
         )

+        self.doc1.refresh_from_db()
+        self.doc2.refresh_from_db()
+        self.doc3.refresh_from_db()
+        self.assertIsNone(self.doc1.archive_serial_number)
+        self.assertIsNone(self.doc2.archive_serial_number)
+        self.assertIsNone(self.doc3.archive_serial_number)
+
     @mock.patch("documents.tasks.consume_file.s")
     def test_merge_with_archive_fallback(self, mock_consume_file):
         """
@@ -726,6 +738,7 @@ class TestPDFActions(DirectoriesMixin, TestCase):
         self.assertEqual(mock_consume_file.call_count, 2)
         consume_file_args, _ = mock_consume_file.call_args
         self.assertEqual(consume_file_args[1].title, "B (split 2)")
+        self.assertIsNone(consume_file_args[1].asn)

         self.assertEqual(result, "OK")

@@ -750,6 +763,8 @@ class TestPDFActions(DirectoriesMixin, TestCase):
         """
         doc_ids = [self.doc2.id]
         pages = [[1, 2], [3]]
+        self.doc2.archive_serial_number = 200
+        self.doc2.save()

         result = bulk_edit.split(doc_ids, pages, delete_originals=True)
         self.assertEqual(result, "OK")
@@ -767,6 +782,9 @@ class TestPDFActions(DirectoriesMixin, TestCase):
             doc_ids,
         )

+        self.doc2.refresh_from_db()
+        self.assertIsNone(self.doc2.archive_serial_number)
+
     @mock.patch("documents.tasks.consume_file.delay")
     @mock.patch("pikepdf.Pdf.save")
     def test_split_with_errors(self, mock_save_pdf, mock_consume_file):
@@ -967,10 +985,16 @@ class TestPDFActions(DirectoriesMixin, TestCase):
         mock_chord.return_value.delay.return_value = None
         doc_ids = [self.doc2.id]
         operations = [{"page": 1}, {"page": 2}]
+        self.doc2.archive_serial_number = 250
+        self.doc2.save()

         result = bulk_edit.edit_pdf(doc_ids, operations, delete_original=True)
         self.assertEqual(result, "OK")
         mock_chord.assert_called_once()
+        consume_file_args, _ = mock_consume_file.call_args
+        self.assertEqual(consume_file_args[1].asn, 250)
+        self.doc2.refresh_from_db()
+        self.assertIsNone(self.doc2.archive_serial_number)

     @mock.patch("documents.tasks.update_document_content_maybe_archive_file.delay")
     def test_edit_pdf_with_update_document(self, mock_update_document):
@@ -412,14 +412,6 @@ class TestConsumer(
         self.assertEqual(document.archive_serial_number, 123)
         self._assert_first_last_send_progress()

-    def testMetadataOverridesSkipAsnPropagation(self):
-        overrides = DocumentMetadataOverrides()
-        incoming = DocumentMetadataOverrides(skip_asn=True)
-
-        overrides.update(incoming)
-
-        self.assertTrue(overrides.skip_asn)
-
     def testOverrideTitlePlaceholders(self):
         c = Correspondent.objects.create(name="Correspondent Name")
         dt = DocumentType.objects.create(name="DocType Name")
@@ -322,7 +322,6 @@ INSTALLED_APPS = [
     "paperless_tesseract.apps.PaperlessTesseractConfig",
     "paperless_text.apps.PaperlessTextConfig",
     "paperless_mail.apps.PaperlessMailConfig",
-    "paperless_remote.apps.PaperlessRemoteParserConfig",
     "django.contrib.admin",
     "rest_framework",
     "rest_framework.authtoken",
@@ -1403,10 +1402,3 @@ WEBHOOKS_ALLOW_INTERNAL_REQUESTS = __get_boolean(
     "PAPERLESS_WEBHOOKS_ALLOW_INTERNAL_REQUESTS",
     "true",
 )
-
-###############################################################################
-# Remote Parser                                                               #
-###############################################################################
-REMOTE_OCR_ENGINE = os.getenv("PAPERLESS_REMOTE_OCR_ENGINE")
-REMOTE_OCR_API_KEY = os.getenv("PAPERLESS_REMOTE_OCR_API_KEY")
-REMOTE_OCR_ENDPOINT = os.getenv("PAPERLESS_REMOTE_OCR_ENDPOINT")
@@ -1,4 +0,0 @@
-# this is here so that django finds the checks.
-from paperless_remote.checks import check_remote_parser_configured
-
-__all__ = ["check_remote_parser_configured"]
@@ -1,14 +0,0 @@
-from django.apps import AppConfig
-
-from paperless_remote.signals import remote_consumer_declaration
-
-
-class PaperlessRemoteParserConfig(AppConfig):
-    name = "paperless_remote"
-
-    def ready(self):
-        from documents.signals import document_consumer_declaration
-
-        document_consumer_declaration.connect(remote_consumer_declaration)
-
-        AppConfig.ready(self)
@@ -1,17 +0,0 @@
-from django.conf import settings
-from django.core.checks import Error
-from django.core.checks import register
-
-
-@register()
-def check_remote_parser_configured(app_configs, **kwargs):
-    if settings.REMOTE_OCR_ENGINE == "azureai" and not (
-        settings.REMOTE_OCR_ENDPOINT and settings.REMOTE_OCR_API_KEY
-    ):
-        return [
-            Error(
-                "Azure AI remote parser requires endpoint and API key to be configured.",
-            ),
-        ]
-
-    return []
@@ -1,118 +0,0 @@
-from pathlib import Path
-
-from django.conf import settings
-
-from paperless_tesseract.parsers import RasterisedDocumentParser
-
-
-class RemoteEngineConfig:
-    def __init__(
-        self,
-        engine: str,
-        api_key: str | None = None,
-        endpoint: str | None = None,
-    ):
-        self.engine = engine
-        self.api_key = api_key
-        self.endpoint = endpoint
-
-    def engine_is_valid(self):
-        valid = self.engine in ["azureai"] and self.api_key is not None
-        if self.engine == "azureai":
-            valid = valid and self.endpoint is not None
-        return valid
-
-
-class RemoteDocumentParser(RasterisedDocumentParser):
-    """
-    This parser uses a remote OCR engine to parse documents. Currently, it supports Azure AI Vision
-    as this is the only service that provides a remote OCR API with text-embedded PDF output.
-    """
-
-    logging_name = "paperless.parsing.remote"
-
-    def get_settings(self) -> RemoteEngineConfig:
-        """
-        Returns the configuration for the remote OCR engine, loaded from Django settings.
-        """
-        return RemoteEngineConfig(
-            engine=settings.REMOTE_OCR_ENGINE,
-            api_key=settings.REMOTE_OCR_API_KEY,
-            endpoint=settings.REMOTE_OCR_ENDPOINT,
-        )
-
-    def supported_mime_types(self):
-        if self.settings.engine_is_valid():
-            return {
-                "application/pdf": ".pdf",
-                "image/png": ".png",
-                "image/jpeg": ".jpg",
-                "image/tiff": ".tiff",
-                "image/bmp": ".bmp",
-                "image/gif": ".gif",
-                "image/webp": ".webp",
-            }
-        else:
-            return {}
-
-    def azure_ai_vision_parse(
-        self,
-        file: Path,
-    ) -> str | None:
-        """
-        Uses Azure AI Vision to parse the document and return the text content.
-        It requests a searchable PDF output with embedded text.
-        The PDF is saved to the archive_path attribute.
-        Returns the text content extracted from the document.
-        If the parsing fails, it returns None.
-        """
-        from azure.ai.documentintelligence import DocumentIntelligenceClient
-        from azure.ai.documentintelligence.models import AnalyzeDocumentRequest
-        from azure.ai.documentintelligence.models import AnalyzeOutputOption
-        from azure.ai.documentintelligence.models import DocumentContentFormat
-        from azure.core.credentials import AzureKeyCredential
-
-        client = DocumentIntelligenceClient(
-            endpoint=self.settings.endpoint,
-            credential=AzureKeyCredential(self.settings.api_key),
-        )
-
-        try:
-            with file.open("rb") as f:
-                analyze_request = AnalyzeDocumentRequest(bytes_source=f.read())
-                poller = client.begin_analyze_document(
-                    model_id="prebuilt-read",
-                    body=analyze_request,
-                    output_content_format=DocumentContentFormat.TEXT,
-                    output=[AnalyzeOutputOption.PDF],  # request searchable PDF output
-                    content_type="application/json",
-                )
-
-            poller.wait()
-            result_id = poller.details["operation_id"]
-            result = poller.result()
-
-            # Download the PDF with embedded text
-            self.archive_path = self.tempdir / "archive.pdf"
-            with self.archive_path.open("wb") as f:
-                for chunk in client.get_analyze_result_pdf(
-                    model_id="prebuilt-read",
-                    result_id=result_id,
-                ):
-                    f.write(chunk)
-            return result.content
-        except Exception as e:
-            self.log.error(f"Azure AI Vision parsing failed: {e}")
-        finally:
-            client.close()
-
-        return None
-
-    def parse(self, document_path: Path, mime_type, file_name=None):
-        if not self.settings.engine_is_valid():
-            self.log.warning(
-                "No valid remote parser engine is configured, content will be empty.",
-            )
-            self.text = ""
-        elif self.settings.engine == "azureai":
-            self.text = self.azure_ai_vision_parse(document_path)
@@ -1,18 +0,0 @@
-def get_parser(*args, **kwargs):
-    from paperless_remote.parsers import RemoteDocumentParser
-
-    return RemoteDocumentParser(*args, **kwargs)
-
-
-def get_supported_mime_types():
-    from paperless_remote.parsers import RemoteDocumentParser
-
-    return RemoteDocumentParser(None).supported_mime_types()
-
-
-def remote_consumer_declaration(sender, **kwargs):
-    return {
-        "parser": get_parser,
-        "weight": 5,
-        "mime_types": get_supported_mime_types(),
-    }
Binary file not shown.
@@ -1,24 +0,0 @@
-from unittest import TestCase
-
-from django.test import override_settings
-
-from paperless_remote import check_remote_parser_configured
-
-
-class TestChecks(TestCase):
-    @override_settings(REMOTE_OCR_ENGINE=None)
-    def test_no_engine(self):
-        msgs = check_remote_parser_configured(None)
-        self.assertEqual(len(msgs), 0)
-
-    @override_settings(REMOTE_OCR_ENGINE="azureai")
-    @override_settings(REMOTE_OCR_API_KEY="somekey")
-    @override_settings(REMOTE_OCR_ENDPOINT=None)
-    def test_azure_no_endpoint(self):
-        msgs = check_remote_parser_configured(None)
-        self.assertEqual(len(msgs), 1)
-        self.assertTrue(
-            msgs[0].msg.startswith(
-                "Azure AI remote parser requires endpoint and API key to be configured.",
-            ),
-        )
@@ -1,128 +0,0 @@
-import uuid
-from pathlib import Path
-from unittest import mock
-
-from django.test import TestCase
-from django.test import override_settings
-
-from documents.tests.utils import DirectoriesMixin
-from documents.tests.utils import FileSystemAssertsMixin
-from paperless_remote.parsers import RemoteDocumentParser
-from paperless_remote.signals import get_parser
-
-
-class TestParser(DirectoriesMixin, FileSystemAssertsMixin, TestCase):
-    SAMPLE_FILES = Path(__file__).resolve().parent / "samples"
-
-    def assertContainsStrings(self, content: str, strings: list[str]):
-        # Asserts that all strings appear in content, in the given order.
-        indices = []
-        for s in strings:
-            if s in content:
-                indices.append(content.index(s))
-            else:
-                self.fail(f"'{s}' is not in '{content}'")
-        self.assertListEqual(indices, sorted(indices))
-
-    @mock.patch("paperless_tesseract.parsers.run_subprocess")
-    @mock.patch("azure.ai.documentintelligence.DocumentIntelligenceClient")
-    def test_get_text_with_azure(self, mock_client_cls, mock_subprocess):
-        # Arrange mock Azure client
-        mock_client = mock.Mock()
-        mock_client_cls.return_value = mock_client
-
-        # Simulate poller result and its `.details`
-        mock_poller = mock.Mock()
-        mock_poller.wait.return_value = None
-        mock_poller.details = {"operation_id": "fake-op-id"}
-        mock_client.begin_analyze_document.return_value = mock_poller
-        mock_poller.result.return_value.content = "This is a test document."
-
-        # Return dummy PDF bytes
-        mock_client.get_analyze_result_pdf.return_value = [
-            b"%PDF-",
-            b"1.7 ",
-            b"FAKEPDF",
-        ]
-
-        # Simulate pdftotext by writing dummy text to sidecar file
-        def fake_run(cmd, *args, **kwargs):
-            with Path(cmd[-1]).open("w", encoding="utf-8") as f:
-                f.write("This is a test document.")
-
-        mock_subprocess.side_effect = fake_run
-
-        with override_settings(
-            REMOTE_OCR_ENGINE="azureai",
-            REMOTE_OCR_API_KEY="somekey",
-            REMOTE_OCR_ENDPOINT="https://endpoint.cognitiveservices.azure.com",
-        ):
-            parser = get_parser(uuid.uuid4())
-            parser.parse(
-                self.SAMPLE_FILES / "simple-digital.pdf",
-                "application/pdf",
-            )
-
-            self.assertContainsStrings(
-                parser.text.strip(),
-                ["This is a test document."],
-            )
-
-    @mock.patch("azure.ai.documentintelligence.DocumentIntelligenceClient")
-    def test_get_text_with_azure_error_logged_and_returns_none(self, mock_client_cls):
-        mock_client = mock.Mock()
-        mock_client.begin_analyze_document.side_effect = RuntimeError("fail")
-        mock_client_cls.return_value = mock_client
-
-        with override_settings(
-            REMOTE_OCR_ENGINE="azureai",
-            REMOTE_OCR_API_KEY="somekey",
-            REMOTE_OCR_ENDPOINT="https://endpoint.cognitiveservices.azure.com",
-        ):
-            parser = get_parser(uuid.uuid4())
-            with mock.patch.object(parser.log, "error") as mock_log_error:
-                parser.parse(
-                    self.SAMPLE_FILES / "simple-digital.pdf",
-                    "application/pdf",
-                )
-
-            self.assertIsNone(parser.text)
-            mock_client.begin_analyze_document.assert_called_once()
-            mock_client.close.assert_called_once()
-            mock_log_error.assert_called_once()
-            self.assertIn(
-                "Azure AI Vision parsing failed",
-                mock_log_error.call_args[0][0],
-            )
-
-    @override_settings(
-        REMOTE_OCR_ENGINE="azureai",
-        REMOTE_OCR_API_KEY="key",
-        REMOTE_OCR_ENDPOINT="https://endpoint.cognitiveservices.azure.com",
-    )
-    def test_supported_mime_types_valid_config(self):
-        parser = RemoteDocumentParser(uuid.uuid4())
-        expected_types = {
-            "application/pdf": ".pdf",
-            "image/png": ".png",
-            "image/jpeg": ".jpg",
-            "image/tiff": ".tiff",
-            "image/bmp": ".bmp",
-            "image/gif": ".gif",
-            "image/webp": ".webp",
-        }
-        self.assertEqual(parser.supported_mime_types(), expected_types)
-
-    def test_supported_mime_types_invalid_config(self):
-        parser = get_parser(uuid.uuid4())
-        self.assertEqual(parser.supported_mime_types(), {})
-
-    @override_settings(
-        REMOTE_OCR_ENGINE=None,
-        REMOTE_OCR_API_KEY=None,
-        REMOTE_OCR_ENDPOINT=None,
-    )
-    def test_parse_with_invalid_config(self):
-        parser = get_parser(uuid.uuid4())
-        parser.parse(self.SAMPLE_FILES / "simple-digital.pdf", "application/pdf")
-        self.assertEqual(parser.text, "")
uv.lock (generated; 39 lines changed)
@@ -95,34 +95,6 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/02/ff/1175b0b7371e46244032d43a56862d0af455823b5280a50c63d99cc50f18/automat-25.4.16-py3-none-any.whl", hash = "sha256:04e9bce696a8d5671ee698005af6e5a9fa15354140a87f4870744604dcdd3ba1", size = 42842, upload-time = "2025-04-16T20:12:14.447Z" },
 ]

-[[package]]
-name = "azure-ai-documentintelligence"
-version = "1.0.2"
-source = { registry = "https://pypi.org/simple" }
-dependencies = [
-    { name = "azure-core", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
-    { name = "isodate", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
-    { name = "typing-extensions", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
-]
-sdist = { url = "https://files.pythonhosted.org/packages/44/7b/8115cd713e2caa5e44def85f2b7ebd02a74ae74d7113ba20bdd41fd6dd80/azure_ai_documentintelligence-1.0.2.tar.gz", hash = "sha256:4d75a2513f2839365ebabc0e0e1772f5601b3a8c9a71e75da12440da13b63484", size = 170940 }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/d9/75/c9ec040f23082f54ffb1977ff8f364c2d21c79a640a13d1c1809e7fd6b1a/azure_ai_documentintelligence-1.0.2-py3-none-any.whl", hash = "sha256:e1fb446abbdeccc9759d897898a0fe13141ed29f9ad11fc705f951925822ed59", size = 106005 },
-]
-
-[[package]]
-name = "azure-core"
-version = "1.33.0"
-source = { registry = "https://pypi.org/simple" }
-dependencies = [
-    { name = "requests", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
-    { name = "six", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
-    { name = "typing-extensions", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
-]
-sdist = { url = "https://files.pythonhosted.org/packages/75/aa/7c9db8edd626f1a7d99d09ef7926f6f4fb34d5f9fa00dc394afdfe8e2a80/azure_core-1.33.0.tar.gz", hash = "sha256:f367aa07b5e3005fec2c1e184b882b0b039910733907d001c20fb08ebb8c0eb9", size = 295633 }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/07/b7/76b7e144aa53bd206bf1ce34fa75350472c3f69bf30e5c8c18bc9881035d/azure_core-1.33.0-py3-none-any.whl", hash = "sha256:9b5b6d0223a1d38c37500e6971118c1e0f13f54951e6893968b38910bc9cda8f", size = 207071 },
-]
-
 [[package]]
 name = "babel"
 version = "2.17.0"
@@ -1479,15 +1451,6 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/c7/fc/4e5a141c3f7c7bed550ac1f69e599e92b6be449dd4677ec09f325cad0955/inotifyrecursive-0.3.5-py3-none-any.whl", hash = "sha256:7e5f4a2e1dc2bef0efa3b5f6b339c41fb4599055a2b54909d020e9e932cc8d2f", size = 8009, upload-time = "2020-11-20T12:38:46.981Z" },
 ]

-[[package]]
-name = "isodate"
-version = "0.7.2"
-source = { registry = "https://pypi.org/simple" }
-sdist = { url = "https://files.pythonhosted.org/packages/54/4d/e940025e2ce31a8ce1202635910747e5a87cc3a6a6bb2d00973375014749/isodate-0.7.2.tar.gz", hash = "sha256:4cd1aa0f43ca76f4a6c6c0292a85f40b35ec2e43e315b59f06e6d32171a953e6", size = 29705 }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/15/aa/0aca39a37d3c7eb941ba736ede56d689e7be91cab5d9ca846bde3999eba6/isodate-0.7.2-py3-none-any.whl", hash = "sha256:28009937d8031054830160fce6d409ed342816b543597cece116d966c6d99e15", size = 22320 },
-]
-
 [[package]]
 name = "jinja2"
 version = "3.1.6"
@@ -2155,7 +2118,6 @@ name = "paperless-ngx"
 version = "2.20.3"
 source = { virtual = "." }
 dependencies = [
-    { name = "azure-ai-documentintelligence", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
     { name = "babel", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
     { name = "bleach", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
     { name = "celery", extra = ["redis"], marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
@@ -2293,7 +2255,6 @@ typing = [

 [package.metadata]
 requires-dist = [
-    { name = "azure-ai-documentintelligence", specifier = ">=1.0.2" },
     { name = "babel", specifier = ">=2.17" },
     { name = "bleach", specifier = "~=6.3.0" },
     { name = "celery", extras = ["redis"], specifier = "~=5.5.1" },