Mirror of https://github.com/paperless-ngx/paperless-ngx.git, synced 2025-12-16 01:31:09 -06:00

Compare commits: `dependabot...feature-re` (64 commits)
| SHA1 |
|---|
| 2cd96610f6 |
| 9eb81d5458 |
| 6a5ea49715 |
| 7d2fe630a5 |
| c29dd5485b |
| cef100a955 |
| 4f53d1b6ee |
| 23cea77548 |
| 4900af93c6 |
| ef834ae808 |
| 0537e87cb5 |
| b4da5c3cd1 |
| 251b0fb3d6 |
| 32bdf11f7f |
| 0627ca69f5 |
| f5525bbdff |
| a21a2a41a8 |
| cc73ed8b86 |
| 0c706b2316 |
| 85b7b6874d |
| 56b26185fa |
| 6537fade7b |
| 9f8090816f |
| 1de7c52478 |
| 9aaaa6f069 |
| c3a20b7797 |
| 476556379b |
| e5cafff043 |
| 8e0d574e99 |
| 8a5820328e |
| 809d62a2f4 |
| 0d87f94b9b |
| 315b90f8e5 |
| 47b2d2964b |
| e05639ae4e |
| f400a8cb2f |
| 26abcf5612 |
| afde52430d |
| 716f2da652 |
| c54073b7c2 |
| 247e6f39dc |
| 1e6dfc4481 |
| 7cc0750066 |
| bd6585d3b4 |
| 717e828a1d |
| 07381d48e6 |
| dd0ffaf312 |
| 264504affc |
| 4feedf2add |
| 2f76cf9831 |
| 1002d37f6b |
| d260a94740 |
| 88c69b83ea |
| 2557ee2014 |
| 3c75deed80 |
| d05343c927 |
| e7972b7eaf |
| 75a091cc0d |
| dca74803fd |
| 3cf3d868d0 |
| bf4fc6604a |
| e8c1eb86fa |
| c3dad3cf69 |
| 811bd66088 |
@@ -1054,22 +1054,12 @@ should be a valid crontab(5) expression describing when to run.

#### [`PAPERLESS_SANITY_TASK_CRON=<cron expression>`](#PAPERLESS_SANITY_TASK_CRON) {#PAPERLESS_SANITY_TASK_CRON}

: Configures the scheduled sanity checker frequency. The value should be a
valid crontab(5) expression describing when to run.
: Configures the scheduled sanity checker frequency.

: If set to the string "disable", the sanity checker will not run automatically.

Defaults to `30 0 * * sun` or Sunday at 30 minutes past midnight.

#### [`PAPERLESS_WORKFLOW_SCHEDULED_TASK_CRON=<cron expression>`](#PAPERLESS_WORKFLOW_SCHEDULED_TASK_CRON) {#PAPERLESS_WORKFLOW_SCHEDULED_TASK_CRON}

: Configures the scheduled workflow check frequency. The value should be a
valid crontab(5) expression describing when to run.

: If set to the string "disable", scheduled workflows will not run.

Defaults to `5 */1 * * *` or every hour at 5 minutes past the hour.
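Both schedules use standard crontab(5) syntax. For orientation, the sketch below expresses the two defaults as Celery beat schedules (Celery is already a project dependency; the mapping is illustrative, not the project's actual scheduler wiring):

```python
# Illustrative mapping of the crontab(5) defaults above onto Celery schedules.
from celery.schedules import crontab

sanity = crontab(minute="30", hour="0", day_of_week="sun")  # "30 0 * * sun"
workflow_check = crontab(minute="5", hour="*/1")            # "5 */1 * * *"
```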
#### [`PAPERLESS_ENABLE_COMPRESSION=<bool>`](#PAPERLESS_ENABLE_COMPRESSION) {#PAPERLESS_ENABLE_COMPRESSION}

: Enables compression of the responses from the webserver.
@@ -1281,6 +1271,30 @@ within your documents.

Defaults to false.

## Workflow webhooks

#### [`PAPERLESS_WEBHOOKS_ALLOWED_SCHEMES=<str>`](#PAPERLESS_WEBHOOKS_ALLOWED_SCHEMES) {#PAPERLESS_WEBHOOKS_ALLOWED_SCHEMES}

: A comma-separated list of allowed schemes for webhooks. This setting
controls which URL schemes are permitted for webhook URLs.

Defaults to `http,https`.

#### [`PAPERLESS_WEBHOOKS_ALLOWED_PORTS=<str>`](#PAPERLESS_WEBHOOKS_ALLOWED_PORTS) {#PAPERLESS_WEBHOOKS_ALLOWED_PORTS}

: A comma-separated list of allowed ports for webhooks. This setting
controls which ports are permitted for webhook URLs. For example, if you
set this to `80,443`, webhooks will only be sent to URLs that use these
ports.

Defaults to an empty list, which allows all ports.

#### [`PAPERLESS_WEBHOOKS_ALLOW_INTERNAL_REQUESTS=<bool>`](#PAPERLESS_WEBHOOKS_ALLOW_INTERNAL_REQUESTS) {#PAPERLESS_WEBHOOKS_ALLOW_INTERNAL_REQUESTS}

: If set to false, webhooks cannot be sent to internal URLs (e.g., localhost).

Defaults to true, which allows internal requests.
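Taken together, these three settings gate where a webhook may be delivered. The sketch below is a simplified, hypothetical illustration of that gating (names and defaults are ours; the real checks appear in the `send_webhook` changes further down):

```python
# Hypothetical sketch of the scheme/port/internal-address gating; not the
# actual paperless-ngx implementation.
import ipaddress
import socket
from urllib.parse import urlparse

def webhook_url_allowed(
    url: str,
    allowed_schemes=("http", "https"),  # PAPERLESS_WEBHOOKS_ALLOWED_SCHEMES
    allowed_ports=(),                   # PAPERLESS_WEBHOOKS_ALLOWED_PORTS (empty = all)
    allow_internal=True,                # PAPERLESS_WEBHOOKS_ALLOW_INTERNAL_REQUESTS
) -> bool:
    p = urlparse(url)
    if p.scheme not in allowed_schemes:
        return False
    port = p.port or (443 if p.scheme == "https" else 80)
    if allowed_ports and port not in allowed_ports:
        return False
    if not allow_internal:
        # Resolve and reject non-public destinations (loopback, RFC 1918, ...)
        ip = ipaddress.ip_address(socket.getaddrinfo(p.hostname, None)[0][4][0])
        if ip.is_private or ip.is_loopback or ip.is_link_local:
            return False
    return True
```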
### Polling {#polling}

#### [`PAPERLESS_CONSUMER_POLLING=<num>`](#PAPERLESS_CONSUMER_POLLING) {#PAPERLESS_CONSUMER_POLLING}

@@ -1324,30 +1338,6 @@ consumers working on the same file. Configure this to prevent that.

Defaults to 0.5 seconds.

## Workflow webhooks

#### [`PAPERLESS_WEBHOOKS_ALLOWED_SCHEMES=<str>`](#PAPERLESS_WEBHOOKS_ALLOWED_SCHEMES) {#PAPERLESS_WEBHOOKS_ALLOWED_SCHEMES}

: A comma-separated list of allowed schemes for webhooks. This setting
controls which URL schemes are permitted for webhook URLs.

Defaults to `http,https`.

#### [`PAPERLESS_WEBHOOKS_ALLOWED_PORTS=<str>`](#PAPERLESS_WEBHOOKS_ALLOWED_PORTS) {#PAPERLESS_WEBHOOKS_ALLOWED_PORTS}

: A comma-separated list of allowed ports for webhooks. This setting
controls which ports are permitted for webhook URLs. For example, if you
set this to `80,443`, webhooks will only be sent to URLs that use these
ports.

Defaults to an empty list, which allows all ports.

#### [`PAPERLESS_WEBHOOKS_ALLOW_INTERNAL_REQUESTS=<bool>`](#PAPERLESS_WEBHOOKS_ALLOW_INTERNAL_REQUESTS) {#PAPERLESS_WEBHOOKS_ALLOW_INTERNAL_REQUESTS}

: If set to false, webhooks cannot be sent to internal URLs (e.g., localhost).

Defaults to true, which allows internal requests.

## Incoming Mail {#incoming_mail}

### Email OAuth {#email_oauth}
@@ -1804,3 +1794,23 @@ password. All of these options come from their similarly-named [Django settings]

#### [`PAPERLESS_EMAIL_USE_SSL=<bool>`](#PAPERLESS_EMAIL_USE_SSL) {#PAPERLESS_EMAIL_USE_SSL}

: Defaults to false.

## Remote OCR

#### [`PAPERLESS_REMOTE_OCR_ENGINE=<str>`](#PAPERLESS_REMOTE_OCR_ENGINE) {#PAPERLESS_REMOTE_OCR_ENGINE}

: The remote OCR engine to use. Currently only Azure AI is supported as "azureai".

Defaults to None, which disables remote OCR.

#### [`PAPERLESS_REMOTE_OCR_API_KEY=<str>`](#PAPERLESS_REMOTE_OCR_API_KEY) {#PAPERLESS_REMOTE_OCR_API_KEY}

: The API key to use for the remote OCR engine.

Defaults to None.

#### [`PAPERLESS_REMOTE_OCR_ENDPOINT=<str>`](#PAPERLESS_REMOTE_OCR_ENDPOINT) {#PAPERLESS_REMOTE_OCR_ENDPOINT}

: The endpoint to use for the remote OCR engine. This is required for Azure AI.

Defaults to None.
@@ -25,9 +25,10 @@ physical documents into a searchable online archive so you can keep, well, _less

## Features

- **Organize and index** your scanned documents with tags, correspondents, types, and more.
- _Your_ data is stored locally on _your_ server and is never transmitted or shared in any way.
- _Your_ data is stored locally on _your_ server and is never transmitted or shared in any way, unless you explicitly choose to do so.
- Performs **OCR** on your documents, adding searchable and selectable text, even to documents scanned with only images.
- Utilizes the open-source Tesseract engine to recognize more than 100 languages.
  - Utilizes the open-source Tesseract engine to recognize more than 100 languages.
  - _New!_ Supports remote OCR with Azure AI (opt-in).
- Documents are saved as PDF/A format which is designed for long term storage, alongside the unaltered originals.
- Uses machine-learning to automatically add tags, correspondents and document types to your documents.
- Supports PDF documents, images, plain text files, Office documents (Word, Excel, PowerPoint, and LibreOffice equivalents)[^1] and more.
@@ -443,10 +443,6 @@ flowchart TD
    'Updated'
    trigger(s)"}

    scheduled{"Documents
    matching
    trigger(s)"}

    A[New Document] --> consumption
    consumption --> |Yes| C[Workflow Actions Run]
    consumption --> |No| D
@@ -459,11 +455,6 @@ flowchart TD
    updated --> |Yes| J[Workflow Actions Run]
    updated --> |No| K
    J --> K[Document Saved]
    L[Scheduled Task Check<br/>hourly at :05] --> M[Get All Scheduled Triggers]
    M --> scheduled
    scheduled --> |Yes| N[Workflow Actions Run]
    scheduled --> |No| O[Document Saved]
    N --> O
```

#### Filters {#workflow-trigger-filters}
@@ -901,6 +892,21 @@ how regularly you intend to scan documents and use paperless.
performed the task associated with the document, move it to the
inbox.

## Remote OCR

!!! important

    This feature is disabled by default and will always remain strictly "opt-in".

Paperless-ngx supports performing OCR on documents using remote services. At the moment, this is limited to
[Microsoft's Azure "Document Intelligence" service](https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence).
This is of course a paid service (with a free tier) which requires an Azure account and subscription. Azure AI is not affiliated with
Paperless-ngx in any way. When enabled, Paperless-ngx will automatically send appropriate documents to Azure for OCR processing, bypassing
the local OCR engine. See the [configuration](configuration.md#PAPERLESS_REMOTE_OCR_ENGINE) options for more details.

Additionally, when using a commercial service with this feature, consider both potential costs as well as any associated file size
or page limitations (e.g. with a free tier).
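In practice, opting in means setting the three `PAPERLESS_REMOTE_OCR_*` variables from the configuration reference. A minimal sketch of the gate this enables (mirroring `RemoteEngineConfig.engine_is_valid` in the new `paperless_remote` module further down; the helper name here is ours):

```python
# Illustrative gate (helper name is ours): remote OCR only activates when
# the engine is "azureai" and both API key and endpoint are set.
def remote_ocr_enabled(engine, api_key, endpoint) -> bool:
    return engine == "azureai" and bool(api_key) and bool(endpoint)

assert not remote_ocr_enabled(None, None, None)  # default: disabled
assert remote_ocr_enabled("azureai", "somekey", "https://endpoint.example")
```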
## Architecture

Paperless-ngx consists of the following components:
@@ -16,6 +16,7 @@ classifiers = [
# This will allow testing to not install a webserver, mysql, etc

dependencies = [
    "azure-ai-documentintelligence>=1.0.2",
    "babel>=2.17",
    "bleach~=6.3.0",
    "celery[redis]~=5.5.1",
@@ -63,7 +64,6 @@ dependencies = [
    "pyzbar~=0.1.9",
    "rapidfuzz~=3.14.0",
    "redis[hiredis]~=5.2.1",
    "regex>=2025.9.18",
    "scikit-learn~=1.7.0",
    "setproctitle~=1.3.4",
    "tika-client~=0.10.0",
@@ -253,6 +253,7 @@ testpaths = [
    "src/paperless_tesseract/tests/",
    "src/paperless_tika/tests",
    "src/paperless_text/tests/",
    "src/paperless_remote/tests/",
]
addopts = [
    "--pythonwarnings=all",
@@ -20,7 +20,6 @@ from documents.models import Tag
from documents.models import Workflow
from documents.models import WorkflowTrigger
from documents.permissions import get_objects_for_user_owner_aware
from documents.regex import safe_regex_search

if TYPE_CHECKING:
    from django.db.models import QuerySet
@@ -153,7 +152,7 @@ def match_storage_paths(document: Document, classifier: DocumentClassifier, user


def matches(matching_model: MatchingModel, document: Document):
    search_flags = 0
    search_kwargs = {}

    document_content = document.content

@@ -162,18 +161,14 @@ def matches(matching_model: MatchingModel, document: Document):
        return False

    if matching_model.is_insensitive:
        search_flags = re.IGNORECASE
        search_kwargs = {"flags": re.IGNORECASE}

    if matching_model.matching_algorithm == MatchingModel.MATCH_NONE:
        return False

    elif matching_model.matching_algorithm == MatchingModel.MATCH_ALL:
        for word in _split_match(matching_model):
            search_result = re.search(
                rf"\b{word}\b",
                document_content,
                flags=search_flags,
            )
            search_result = re.search(rf"\b{word}\b", document_content, **search_kwargs)
            if not search_result:
                return False
        log_reason(
@@ -185,7 +180,7 @@ def matches(matching_model: MatchingModel, document: Document):

    elif matching_model.matching_algorithm == MatchingModel.MATCH_ANY:
        for word in _split_match(matching_model):
            if re.search(rf"\b{word}\b", document_content, flags=search_flags):
            if re.search(rf"\b{word}\b", document_content, **search_kwargs):
                log_reason(matching_model, document, f"it contains this word: {word}")
                return True
        return False
@@ -195,7 +190,7 @@ def matches(matching_model: MatchingModel, document: Document):
        re.search(
            rf"\b{re.escape(matching_model.match)}\b",
            document_content,
            flags=search_flags,
            **search_kwargs,
        ),
    )
    if result:
@@ -207,11 +202,16 @@ def matches(matching_model: MatchingModel, document: Document):
        return result

    elif matching_model.matching_algorithm == MatchingModel.MATCH_REGEX:
        match = safe_regex_search(
            matching_model.match,
            document_content,
            flags=search_flags,
        )
        try:
            match = re.search(
                re.compile(matching_model.match, **search_kwargs),
                document_content,
            )
        except re.error:
            logger.error(
                f"Error while processing regular expression {matching_model.match}",
            )
            return False
        if match:
            log_reason(
                matching_model,
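Both word-matching branches above boil down to whole-word regex searches; the MATCH_ANY case in miniature:

```python
import re

# MATCH_ANY in miniature: succeed if any configured word occurs as a
# whole word, honouring the case-insensitivity flag.
content = "Invoice from ACME Corp"
words = ["acme", "globex"]
assert any(
    re.search(rf"\b{re.escape(w)}\b", content, flags=re.IGNORECASE)
    for w in words
)
```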
@@ -1,50 +0,0 @@
from __future__ import annotations

import logging
import textwrap

import regex
from django.conf import settings

logger = logging.getLogger("paperless.regex")

REGEX_TIMEOUT_SECONDS: float = getattr(settings, "MATCH_REGEX_TIMEOUT_SECONDS", 0.1)


def validate_regex_pattern(pattern: str) -> None:
    """
    Validate user provided regex for basic compile errors.
    Raises ValueError on validation failure.
    """

    try:
        regex.compile(pattern)
    except regex.error as exc:
        raise ValueError(exc.msg) from exc


def safe_regex_search(pattern: str, text: str, *, flags: int = 0):
    """
    Run a regex search with a timeout. Returns a match object or None.
    Validation errors and timeouts are logged and treated as no match.
    """

    try:
        validate_regex_pattern(pattern)
        compiled = regex.compile(pattern, flags=flags)
    except (regex.error, ValueError) as exc:
        logger.error(
            "Error while processing regular expression %s: %s",
            textwrap.shorten(pattern, width=80, placeholder="…"),
            exc,
        )
        return None

    try:
        return compiled.search(text, timeout=REGEX_TIMEOUT_SECONDS)
    except TimeoutError:
        logger.warning(
            "Regular expression matching timed out for pattern %s",
            textwrap.shorten(pattern, width=80, placeholder="…"),
        )
        return None
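The helper's behaviour is pinned down by the timeout test further below; in short:

```python
# Illustrative (assumes the module above): the catastrophic-backtracking
# pattern from the test suite exceeds REGEX_TIMEOUT_SECONDS, logs a warning,
# and is treated as "no match".
result = safe_regex_search(r"(a+)+$", "a" * 5000 + "X")
assert result is None
```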
@@ -21,7 +21,6 @@ from django.core.validators import MaxLengthValidator
from django.core.validators import RegexValidator
from django.core.validators import integer_validator
from django.db.models import Count
from django.db.models.functions import Lower
from django.utils.crypto import get_random_string
from django.utils.dateparse import parse_datetime
from django.utils.text import slugify
@@ -39,7 +38,6 @@ from guardian.utils import get_user_obj_perms_model
from rest_framework import fields
from rest_framework import serializers
from rest_framework.fields import SerializerMethodField
from rest_framework.filters import OrderingFilter

if settings.AUDIT_LOG_ENABLED:
    from auditlog.context import set_actor
@@ -71,7 +69,6 @@ from documents.parsers import is_mime_type_supported
from documents.permissions import get_document_count_filter_for_user
from documents.permissions import get_groups_with_only_permission
from documents.permissions import set_permissions_for_object
from documents.regex import validate_regex_pattern
from documents.templating.filepath import validate_filepath_template_and_render
from documents.templating.utils import convert_format_str_to_template_format
from documents.validators import uri_validator
@@ -142,10 +139,10 @@ class MatchingModelSerializer(serializers.ModelSerializer):
            and self.initial_data["matching_algorithm"] == MatchingModel.MATCH_REGEX
        ):
            try:
                validate_regex_pattern(match)
            except ValueError as e:
                re.compile(match)
            except re.error as e:
                raise serializers.ValidationError(
                    _("Invalid regular expression: %(error)s") % {"error": str(e)},
                    _("Invalid regular expression: %(error)s") % {"error": str(e.msg)},
                )
        return match

@@ -578,29 +575,15 @@ class TagSerializer(MatchingModelSerializer, OwnedObjectSerializer):
        )
    def get_children(self, obj):
        filter_q = self.context.get("document_count_filter")
        request = self.context.get("request")
        if filter_q is None:
            request = self.context.get("request")
            user = getattr(request, "user", None) if request else None
            filter_q = get_document_count_filter_for_user(user)
            self.context["document_count_filter"] = filter_q

        children_queryset = (
        serializer = TagSerializer(
            obj.get_children_queryset()
            .select_related("owner")
            .annotate(document_count=Count("documents", filter=filter_q))
        )

        view = self.context.get("view")
        ordering = (
            OrderingFilter().get_ordering(request, children_queryset, view)
            if request and view
            else None
        )
        ordering = ordering or (Lower("name"),)
        children_queryset = children_queryset.order_by(*ordering)

        serializer = TagSerializer(
            children_queryset,
            .annotate(document_count=Count("documents", filter=filter_q)),
            many=True,
            user=self.user,
            full_perms=self.full_perms,
@@ -28,7 +28,6 @@ from documents import matching
from documents.caching import clear_document_caches
from documents.file_handling import create_source_path_directory
from documents.file_handling import delete_empty_directories
from documents.file_handling import generate_filename
from documents.file_handling import generate_unique_filename
from documents.models import CustomField
from documents.models import CustomFieldInstance
@@ -43,7 +42,6 @@ from documents.models import WorkflowAction
from documents.models import WorkflowRun
from documents.models import WorkflowTrigger
from documents.permissions import get_objects_for_user_owner_aware
from documents.templating.utils import convert_format_str_to_template_format
from documents.workflows.actions import build_workflow_action_context
from documents.workflows.actions import execute_email_action
from documents.workflows.actions import execute_webhook_action
@@ -391,19 +389,6 @@ class CannotMoveFilesException(Exception):
    pass


def _filename_template_uses_custom_fields(doc: Document) -> bool:
    template = None
    if doc.storage_path is not None:
        template = doc.storage_path.path
    elif settings.FILENAME_FORMAT is not None:
        template = convert_format_str_to_template_format(settings.FILENAME_FORMAT)

    if not template:
        return False

    return "custom_fields" in template
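The helper is a plain substring test; the storage-path template used by the new tests further down is exactly the kind of value it detects:

```python
# Illustrative: the helper flags any template that references custom fields.
template = "{{ custom_fields|get_cf_value('flavor') }}/{{ title }}"
assert "custom_fields" in template
assert "custom_fields" not in "{{created}}/{{ title }}"
```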
# should be disabled in /src/documents/management/commands/document_importer.py handle
@receiver(models.signals.post_save, sender=CustomFieldInstance, weak=False)
@receiver(models.signals.m2m_changed, sender=Document.tags.through, weak=False)
@@ -414,8 +399,6 @@ def update_filename_and_move_files(
    **kwargs,
):
    if isinstance(instance, CustomFieldInstance):
        if not _filename_template_uses_custom_fields(instance.document):
            return
        instance = instance.document

    def validate_move(instance, old_path: Path, new_path: Path):
@@ -453,47 +436,21 @@ def update_filename_and_move_files(
        old_filename = instance.filename
        old_source_path = instance.source_path

        candidate_filename = generate_filename(instance)
        candidate_source_path = (
            settings.ORIGINALS_DIR / candidate_filename
        ).resolve()
        if candidate_filename == Path(old_filename):
            new_filename = Path(old_filename)
        elif (
            candidate_source_path.exists()
            and candidate_source_path != old_source_path
        ):
            # Only fall back to unique search when there is an actual conflict
            new_filename = generate_unique_filename(instance)
        else:
            new_filename = candidate_filename

        # Need to convert to string to be able to save it to the db
        instance.filename = str(new_filename)
        instance.filename = str(generate_unique_filename(instance))
        move_original = old_filename != instance.filename

        old_archive_filename = instance.archive_filename
        old_archive_path = instance.archive_path

        if instance.has_archive_version:
            archive_candidate = generate_filename(instance, archive_filename=True)
            archive_candidate_path = (
                settings.ARCHIVE_DIR / archive_candidate
            ).resolve()
            if archive_candidate == Path(old_archive_filename):
                new_archive_filename = Path(old_archive_filename)
            elif (
                archive_candidate_path.exists()
                and archive_candidate_path != old_archive_path
            ):
                new_archive_filename = generate_unique_filename(
            # Need to convert to string to be able to save it to the db
            instance.archive_filename = str(
                generate_unique_filename(
                    instance,
                    archive_filename=True,
                )
            else:
                new_archive_filename = archive_candidate

            instance.archive_filename = str(new_archive_filename)
                ),
            )

            move_archive = old_archive_filename != instance.archive_filename
        else:
@@ -16,7 +16,6 @@ from django.utils import timezone
from documents.file_handling import create_source_path_directory
from documents.file_handling import delete_empty_directories
from documents.file_handling import generate_filename
from documents.file_handling import generate_unique_filename
from documents.models import Correspondent
from documents.models import CustomField
from documents.models import CustomFieldInstance
@@ -1633,73 +1632,6 @@ class TestFilenameGeneration(DirectoriesMixin, TestCase):
        )


class TestCustomFieldFilenameUpdates(
    DirectoriesMixin,
    FileSystemAssertsMixin,
    TestCase,
):
    def setUp(self):
        self.cf = CustomField.objects.create(
            name="flavor",
            data_type=CustomField.FieldDataType.STRING,
        )
        self.doc = Document.objects.create(
            title="document",
            mime_type="application/pdf",
            checksum="abc123",
        )
        self.cfi = CustomFieldInstance.objects.create(
            field=self.cf,
            document=self.doc,
            value_text="initial",
        )
        return super().setUp()

    @override_settings(FILENAME_FORMAT=None)
    def test_custom_field_not_in_template_skips_filename_work(self):
        storage_path = StoragePath.objects.create(path="{{created}}/{{ title }}")
        self.doc.storage_path = storage_path
        self.doc.save()
        initial_filename = generate_filename(self.doc)
        Document.objects.filter(pk=self.doc.pk).update(filename=str(initial_filename))
        self.doc.refresh_from_db()
        Path(self.doc.source_path).parent.mkdir(parents=True, exist_ok=True)
        Path(self.doc.source_path).touch()

        with mock.patch("documents.signals.handlers.generate_unique_filename") as m:
            m.side_effect = generate_unique_filename
            self.cfi.value_text = "updated"
            self.cfi.save()

        self.doc.refresh_from_db()
        self.assertEqual(Path(self.doc.filename), initial_filename)
        self.assertEqual(m.call_count, 0)

    @override_settings(FILENAME_FORMAT=None)
    def test_custom_field_in_template_triggers_filename_update(self):
        storage_path = StoragePath.objects.create(
            path="{{ custom_fields|get_cf_value('flavor') }}/{{ title }}",
        )
        self.doc.storage_path = storage_path
        self.doc.save()
        initial_filename = generate_filename(self.doc)
        Document.objects.filter(pk=self.doc.pk).update(filename=str(initial_filename))
        self.doc.refresh_from_db()
        Path(self.doc.source_path).parent.mkdir(parents=True, exist_ok=True)
        Path(self.doc.source_path).touch()

        with mock.patch("documents.signals.handlers.generate_unique_filename") as m:
            m.side_effect = generate_unique_filename
            self.cfi.value_text = "updated"
            self.cfi.save()

        self.doc.refresh_from_db()
        expected_filename = Path("updated/document.pdf")
        self.assertEqual(Path(self.doc.filename), expected_filename)
        self.assertTrue(Path(self.doc.source_path).is_file())
        self.assertLessEqual(m.call_count, 1)


class TestPathDateLocalization:
    """
    Groups all tests related to the `localize_date` function.
@@ -206,22 +206,6 @@ class TestMatching(_TestMatchingBase):
    def test_tach_invalid_regex(self):
        self._test_matching("[", "MATCH_REGEX", [], ["Don't match this"])

    def test_match_regex_timeout_returns_false(self):
        tag = Tag.objects.create(
            name="slow",
            match=r"(a+)+$",
            matching_algorithm=Tag.MATCH_REGEX,
        )
        document = Document(content=("a" * 5000) + "X")

        with self.assertLogs("paperless.regex", level="WARNING") as cm:
            self.assertFalse(matching.matches(tag, document))

        self.assertTrue(
            any("timed out" in message for message in cm.output),
            f"Expected timeout log, got {cm.output}",
        )

    def test_match_fuzzy(self):
        self._test_matching(
            "Springfield, Miss.",
@@ -17,7 +17,6 @@ from django.utils import timezone
from guardian.shortcuts import assign_perm
from guardian.shortcuts import get_groups_with_perms
from guardian.shortcuts import get_users_with_perms
from httpx import ConnectError
from httpx import HTTPError
from httpx import HTTPStatusError
from pytest_httpx import HTTPXMock
@@ -3429,7 +3428,7 @@ class TestWorkflows(
        expected_str = "Error occurred parsing webhook headers"
        self.assertIn(expected_str, cm.output[1])

    @mock.patch("httpx.Client.post")
    @mock.patch("httpx.post")
    def test_workflow_webhook_send_webhook_task(self, mock_post):
        mock_post.return_value = mock.Mock(
            status_code=200,
@@ -3450,6 +3449,8 @@ class TestWorkflows(
            content="Test message",
            headers={},
            files=None,
            follow_redirects=False,
            timeout=5,
        )

        expected_str = "Webhook sent to http://paperless-ngx.com"
@@ -3467,9 +3468,11 @@ class TestWorkflows(
            data={"message": "Test message"},
            headers={},
            files=None,
            follow_redirects=False,
            timeout=5,
        )

    @mock.patch("httpx.Client.post")
    @mock.patch("httpx.post")
    def test_workflow_webhook_send_webhook_retry(self, mock_http):
        mock_http.return_value.raise_for_status = mock.Mock(
            side_effect=HTTPStatusError(
@@ -3665,7 +3668,7 @@ class TestWebhookSecurity:
        - ValueError is raised
        """
        resolve_to("127.0.0.1")
        with pytest.raises(ConnectError):
        with pytest.raises(ValueError):
            send_webhook(
                "http://paperless-ngx.com",
                data="",
@@ -3695,8 +3698,7 @@ class TestWebhookSecurity:
        )

        req = httpx_mock.get_request()
        assert req.url.host == "52.207.186.75"
        assert req.headers["host"] == "paperless-ngx.com"
        assert req.url.host == "paperless-ngx.com"

    def test_follow_redirects_disabled(self, httpx_mock: HTTPXMock, resolve_to):
        """
@@ -10,98 +10,26 @@ from django.conf import settings
logger = logging.getLogger("paperless.workflows.webhooks")


class WebhookTransport(httpx.HTTPTransport):
    """
    Transport that resolves/validates hostnames and rewrites to a vetted IP
    while keeping Host/SNI as the original hostname.
    """

    def __init__(
        self,
        hostname: str,
        *args,
        allow_internal: bool = False,
        **kwargs,
    ) -> None:
        super().__init__(*args, **kwargs)
        self.hostname = hostname
        self.allow_internal = allow_internal

    def handle_request(self, request: httpx.Request) -> httpx.Response:
        hostname = request.url.host

        if not hostname:
            raise httpx.ConnectError("No hostname in request URL")

        try:
            addr_info = socket.getaddrinfo(hostname, None)
        except socket.gaierror as e:
            raise httpx.ConnectError(f"Could not resolve hostname: {hostname}") from e

        ips = [info[4][0] for info in addr_info if info and info[4]]
        if not ips:
            raise httpx.ConnectError(f"Could not resolve hostname: {hostname}")

        if not self.allow_internal:
            for ip_str in ips:
                if not WebhookTransport.is_public_ip(ip_str):
                    raise httpx.ConnectError(
                        f"Connection blocked: {hostname} resolves to a non-public address",
                    )

        ip_str = ips[0]
        formatted_ip = self._format_ip_for_url(ip_str)

        new_headers = httpx.Headers(request.headers)
        if "host" in new_headers:
            del new_headers["host"]
        new_headers["Host"] = hostname
        new_url = request.url.copy_with(host=formatted_ip)

        request = httpx.Request(
            method=request.method,
            url=new_url,
            headers=new_headers,
            content=request.content,
            extensions=request.extensions,
def _is_public_ip(ip: str) -> bool:
    try:
        obj = ipaddress.ip_address(ip)
        return not (
            obj.is_private
            or obj.is_loopback
            or obj.is_link_local
            or obj.is_multicast
            or obj.is_unspecified
        )
        request.extensions["sni_hostname"] = hostname
    except ValueError:  # pragma: no cover
        return False

        return super().handle_request(request)

    def _format_ip_for_url(self, ip: str) -> str:
        """
        Format IP address for use in URL (wrap IPv6 in brackets)
        """
        try:
            ip_obj = ipaddress.ip_address(ip)
            if ip_obj.version == 6:
                return f"[{ip}]"
            return ip
        except ValueError:
            return ip

    @staticmethod
    def is_public_ip(ip: str | int) -> bool:
        try:
            obj = ipaddress.ip_address(ip)
            return not (
                obj.is_private
                or obj.is_loopback
                or obj.is_link_local
                or obj.is_multicast
                or obj.is_unspecified
            )
        except ValueError:  # pragma: no cover
            return False

    @staticmethod
    def resolve_first_ip(host: str) -> str | None:
        try:
            info = socket.getaddrinfo(host, None)
            return info[0][4][0] if info else None
        except Exception:  # pragma: no cover
            return None
def _resolve_first_ip(host: str) -> str | None:
    try:
        info = socket.getaddrinfo(host, None)
        return info[0][4][0] if info else None
    except Exception:  # pragma: no cover
        return None


@shared_task(
@@ -131,10 +59,12 @@ def send_webhook(
        logger.warning("Webhook blocked: port not permitted")
        raise ValueError("Destination port not permitted.")

    transport = WebhookTransport(
        hostname=p.hostname,
        allow_internal=settings.WEBHOOKS_ALLOW_INTERNAL_REQUESTS,
    )
    ip = _resolve_first_ip(p.hostname)
    if not ip or (
        not _is_public_ip(ip) and not settings.WEBHOOKS_ALLOW_INTERNAL_REQUESTS
    ):
        logger.warning("Webhook blocked: destination not allowed")
        raise ValueError("Destination host is not allowed.")

    try:
        post_args = {
@@ -143,6 +73,8 @@ def send_webhook(
                k: v for k, v in (headers or {}).items() if k.lower() != "host"
            },
            "files": files or None,
            "timeout": 5.0,
            "follow_redirects": False,
        }
        if as_json:
            post_args["json"] = data
@@ -151,21 +83,14 @@ def send_webhook(
        else:
            post_args["content"] = data

        with httpx.Client(
            transport=transport,
            timeout=5.0,
            follow_redirects=False,
        ) as client:
            client.post(
                **post_args,
            ).raise_for_status()
            logger.info(
                f"Webhook sent to {url}",
            )
        httpx.post(
            **post_args,
        ).raise_for_status()
        logger.info(
            f"Webhook sent to {url}",
        )
    except Exception as e:
        logger.error(
            f"Failed attempt sending webhook to {url}: {e}",
        )
        raise e
    finally:
        transport.close()
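The `_is_public_ip` predicate above is the crux of the SSRF guard. A few standalone spot checks of the rule, using only the standard library:

```python
import ipaddress

def is_public(ip: str) -> bool:
    # Same rule as _is_public_ip above.
    obj = ipaddress.ip_address(ip)
    return not (
        obj.is_private
        or obj.is_loopback
        or obj.is_link_local
        or obj.is_multicast
        or obj.is_unspecified
    )

assert not is_public("127.0.0.1")    # loopback
assert not is_public("10.0.0.5")     # RFC 1918 private
assert not is_public("169.254.0.1")  # link-local
assert is_public("93.184.216.34")    # a public address
```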
@@ -2,7 +2,7 @@ msgid ""
msgstr ""
"Project-Id-Version: paperless-ngx\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-12-12 17:29+0000\n"
"POT-Creation-Date: 2025-12-10 16:39+0000\n"
"PO-Revision-Date: 2022-02-17 04:17\n"
"Last-Translator: \n"
"Language-Team: English\n"
@@ -1219,40 +1219,40 @@ msgstr ""
msgid "workflow runs"
msgstr ""

#: documents/serialisers.py:148
#: documents/serialisers.py:145
#, python-format
msgid "Invalid regular expression: %(error)s"
msgstr ""

#: documents/serialisers.py:639
#: documents/serialisers.py:622
msgid "Invalid color."
msgstr ""

#: documents/serialisers.py:1825
#: documents/serialisers.py:1808
#, python-format
msgid "File type %(type)s not supported"
msgstr ""

#: documents/serialisers.py:1869
#: documents/serialisers.py:1852
#, python-format
msgid "Custom field id must be an integer: %(id)s"
msgstr ""

#: documents/serialisers.py:1876
#: documents/serialisers.py:1859
#, python-format
msgid "Custom field with id %(id)s does not exist"
msgstr ""

#: documents/serialisers.py:1893 documents/serialisers.py:1903
#: documents/serialisers.py:1876 documents/serialisers.py:1886
msgid ""
"Custom fields must be a list of integers or an object mapping ids to values."
msgstr ""

#: documents/serialisers.py:1898
#: documents/serialisers.py:1881
msgid "Some custom fields don't exist or were specified twice."
msgstr ""

#: documents/serialisers.py:2013
#: documents/serialisers.py:1996
msgid "Invalid variable detected."
msgstr ""
@@ -322,6 +322,7 @@ INSTALLED_APPS = [
    "paperless_tesseract.apps.PaperlessTesseractConfig",
    "paperless_text.apps.PaperlessTextConfig",
    "paperless_mail.apps.PaperlessMailConfig",
    "paperless_remote.apps.PaperlessRemoteParserConfig",
    "django.contrib.admin",
    "rest_framework",
    "rest_framework.authtoken",
@@ -1401,3 +1402,10 @@ WEBHOOKS_ALLOW_INTERNAL_REQUESTS = __get_boolean(
    "PAPERLESS_WEBHOOKS_ALLOW_INTERNAL_REQUESTS",
    "true",
)

###############################################################################
# Remote Parser                                                               #
###############################################################################
REMOTE_OCR_ENGINE = os.getenv("PAPERLESS_REMOTE_OCR_ENGINE")
REMOTE_OCR_API_KEY = os.getenv("PAPERLESS_REMOTE_OCR_API_KEY")
REMOTE_OCR_ENDPOINT = os.getenv("PAPERLESS_REMOTE_OCR_ENDPOINT")
@@ -14,14 +14,13 @@ ALLOWED_SVG_TAGS: set[str] = {
    "text",
    "tspan",
    "defs",
    "lineargradient",
    "radialgradient",
    "linearGradient",
    "radialGradient",
    "stop",
    "clippath",
    "clipPath",
    "use",
    "title",
    "desc",
    "style",
}

ALLOWED_SVG_ATTRIBUTES: set[str] = {
@@ -30,7 +29,6 @@ ALLOWED_SVG_ATTRIBUTES: set[str] = {
    "style",
    "d",
    "fill",
    "fill-opacity",
    "fill-rule",
    "stroke",
    "stroke-width",
@@ -54,14 +52,14 @@ ALLOWED_SVG_ATTRIBUTES: set[str] = {
    "y1",
    "x2",
    "y2",
    "gradienttransform",
    "gradientunits",
    "gradientTransform",
    "gradientUnits",
    "offset",
    "stop-color",
    "stop-opacity",
    "clip-path",
    "viewbox",
    "preserveaspectratio",
    "viewBox",
    "preserveAspectRatio",
    "href",
    "xlink:href",
    "font-family",
@@ -70,8 +68,6 @@ ALLOWED_SVG_ATTRIBUTES: set[str] = {
    "text-anchor",
    "xmlns",
    "xmlns:xlink",
    "version",
    "type",
}
src/paperless_remote/__init__.py (new file, 4 lines)
@@ -0,0 +1,4 @@
# this is here so that django finds the checks.
from paperless_remote.checks import check_remote_parser_configured

__all__ = ["check_remote_parser_configured"]
src/paperless_remote/apps.py (new file, 14 lines)
@@ -0,0 +1,14 @@
from django.apps import AppConfig

from paperless_remote.signals import remote_consumer_declaration


class PaperlessRemoteParserConfig(AppConfig):
    name = "paperless_remote"

    def ready(self):
        from documents.signals import document_consumer_declaration

        document_consumer_declaration.connect(remote_consumer_declaration)

        AppConfig.ready(self)
src/paperless_remote/checks.py (new file, 17 lines)
@@ -0,0 +1,17 @@
from django.conf import settings
from django.core.checks import Error
from django.core.checks import register


@register()
def check_remote_parser_configured(app_configs, **kwargs):
    if settings.REMOTE_OCR_ENGINE == "azureai" and not (
        settings.REMOTE_OCR_ENDPOINT and settings.REMOTE_OCR_API_KEY
    ):
        return [
            Error(
                "Azure AI remote parser requires endpoint and API key to be configured.",
            ),
        ]

    return []
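Because this check is registered with Django's system-check framework, a half-configured Azure engine fails fast at startup. It can also be invoked directly, as the new tests further down do:

```python
# Direct invocation (mirrors test_checks.py below): returns one Error when
# REMOTE_OCR_ENGINE is "azureai" but endpoint or API key is missing,
# otherwise an empty list.
from paperless_remote.checks import check_remote_parser_configured

errors = check_remote_parser_configured(None)
```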
src/paperless_remote/parsers.py (new file, 118 lines)
@@ -0,0 +1,118 @@
from pathlib import Path

from django.conf import settings

from paperless_tesseract.parsers import RasterisedDocumentParser


class RemoteEngineConfig:
    def __init__(
        self,
        engine: str,
        api_key: str | None = None,
        endpoint: str | None = None,
    ):
        self.engine = engine
        self.api_key = api_key
        self.endpoint = endpoint

    def engine_is_valid(self):
        valid = self.engine in ["azureai"] and self.api_key is not None
        if self.engine == "azureai":
            valid = valid and self.endpoint is not None
        return valid


class RemoteDocumentParser(RasterisedDocumentParser):
    """
    This parser uses a remote OCR engine to parse documents. Currently, it supports Azure AI Vision
    as this is the only service that provides a remote OCR API with text-embedded PDF output.
    """

    logging_name = "paperless.parsing.remote"

    def get_settings(self) -> RemoteEngineConfig:
        """
        Returns the configuration for the remote OCR engine, loaded from Django settings.
        """
        return RemoteEngineConfig(
            engine=settings.REMOTE_OCR_ENGINE,
            api_key=settings.REMOTE_OCR_API_KEY,
            endpoint=settings.REMOTE_OCR_ENDPOINT,
        )

    def supported_mime_types(self):
        if self.settings.engine_is_valid():
            return {
                "application/pdf": ".pdf",
                "image/png": ".png",
                "image/jpeg": ".jpg",
                "image/tiff": ".tiff",
                "image/bmp": ".bmp",
                "image/gif": ".gif",
                "image/webp": ".webp",
            }
        else:
            return {}

    def azure_ai_vision_parse(
        self,
        file: Path,
    ) -> str | None:
        """
        Uses Azure AI Vision to parse the document and return the text content.
        It requests a searchable PDF output with embedded text.
        The PDF is saved to the archive_path attribute.
        Returns the text content extracted from the document.
        If the parsing fails, it returns None.
        """
        from azure.ai.documentintelligence import DocumentIntelligenceClient
        from azure.ai.documentintelligence.models import AnalyzeDocumentRequest
        from azure.ai.documentintelligence.models import AnalyzeOutputOption
        from azure.ai.documentintelligence.models import DocumentContentFormat
        from azure.core.credentials import AzureKeyCredential

        client = DocumentIntelligenceClient(
            endpoint=self.settings.endpoint,
            credential=AzureKeyCredential(self.settings.api_key),
        )

        try:
            with file.open("rb") as f:
                analyze_request = AnalyzeDocumentRequest(bytes_source=f.read())
                poller = client.begin_analyze_document(
                    model_id="prebuilt-read",
                    body=analyze_request,
                    output_content_format=DocumentContentFormat.TEXT,
                    output=[AnalyzeOutputOption.PDF],  # request searchable PDF output
                    content_type="application/json",
                )

            poller.wait()
            result_id = poller.details["operation_id"]
            result = poller.result()

            # Download the PDF with embedded text
            self.archive_path = self.tempdir / "archive.pdf"
            with self.archive_path.open("wb") as f:
                for chunk in client.get_analyze_result_pdf(
                    model_id="prebuilt-read",
                    result_id=result_id,
                ):
                    f.write(chunk)
            return result.content
        except Exception as e:
            self.log.error(f"Azure AI Vision parsing failed: {e}")
        finally:
            client.close()

        return None

    def parse(self, document_path: Path, mime_type, file_name=None):
        if not self.settings.engine_is_valid():
            self.log.warning(
                "No valid remote parser engine is configured, content will be empty.",
            )
            self.text = ""
        elif self.settings.engine == "azureai":
            self.text = self.azure_ai_vision_parse(document_path)
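A hedged usage sketch for the parser above (the UUID argument is the logging group id expected by the base `RasterisedDocumentParser`; the `REMOTE_OCR_*` settings must be configured as in the tests below):

```python
# Illustrative usage inside a configured paperless-ngx instance.
import uuid
from pathlib import Path

from paperless_remote.parsers import RemoteDocumentParser

parser = RemoteDocumentParser(uuid.uuid4())
print(parser.supported_mime_types())  # {} unless the REMOTE_OCR_* settings are valid
# parser.parse(Path("/path/to/scan.pdf"), "application/pdf")
# print(parser.text)
```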
src/paperless_remote/signals.py (new file, 18 lines)
@@ -0,0 +1,18 @@
def get_parser(*args, **kwargs):
    from paperless_remote.parsers import RemoteDocumentParser

    return RemoteDocumentParser(*args, **kwargs)


def get_supported_mime_types():
    from paperless_remote.parsers import RemoteDocumentParser

    return RemoteDocumentParser(None).supported_mime_types()


def remote_consumer_declaration(sender, **kwargs):
    return {
        "parser": get_parser,
        "weight": 5,
        "mime_types": get_supported_mime_types(),
    }
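The `weight` of 5 is what lets this parser take precedence over the built-in Tesseract parser (which, to our understanding, registers with weight 0) for the MIME types it reports. A sketch of weight-based selection (illustrative, not the actual paperless-ngx dispatch code):

```python
# Illustrative: pick the highest-weight declaration supporting a MIME type.
declarations = [
    {"name": "tesseract", "weight": 0, "mime_types": {"application/pdf": ".pdf"}},
    {"name": "remote", "weight": 5, "mime_types": {"application/pdf": ".pdf"}},
]
best = max(
    (d for d in declarations if "application/pdf" in d["mime_types"]),
    key=lambda d: d["weight"],
)
assert best["name"] == "remote"
```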
src/paperless_remote/tests/__init__.py (new file, 0 lines)
src/paperless_remote/tests/samples/simple-digital.pdf (new binary file, not shown)
src/paperless_remote/tests/test_checks.py (new file, 24 lines)
@@ -0,0 +1,24 @@
from unittest import TestCase

from django.test import override_settings

from paperless_remote import check_remote_parser_configured


class TestChecks(TestCase):
    @override_settings(REMOTE_OCR_ENGINE=None)
    def test_no_engine(self):
        msgs = check_remote_parser_configured(None)
        self.assertEqual(len(msgs), 0)

    @override_settings(REMOTE_OCR_ENGINE="azureai")
    @override_settings(REMOTE_OCR_API_KEY="somekey")
    @override_settings(REMOTE_OCR_ENDPOINT=None)
    def test_azure_no_endpoint(self):
        msgs = check_remote_parser_configured(None)
        self.assertEqual(len(msgs), 1)
        self.assertTrue(
            msgs[0].msg.startswith(
                "Azure AI remote parser requires endpoint and API key to be configured.",
            ),
        )
src/paperless_remote/tests/test_parser.py (new file, 128 lines)
@@ -0,0 +1,128 @@
import uuid
from pathlib import Path
from unittest import mock

from django.test import TestCase
from django.test import override_settings

from documents.tests.utils import DirectoriesMixin
from documents.tests.utils import FileSystemAssertsMixin
from paperless_remote.parsers import RemoteDocumentParser
from paperless_remote.signals import get_parser


class TestParser(DirectoriesMixin, FileSystemAssertsMixin, TestCase):
    SAMPLE_FILES = Path(__file__).resolve().parent / "samples"

    def assertContainsStrings(self, content: str, strings: list[str]):
        # Asserts that all strings appear in content, in the given order.
        indices = []
        for s in strings:
            if s in content:
                indices.append(content.index(s))
            else:
                self.fail(f"'{s}' is not in '{content}'")
        self.assertListEqual(indices, sorted(indices))

    @mock.patch("paperless_tesseract.parsers.run_subprocess")
    @mock.patch("azure.ai.documentintelligence.DocumentIntelligenceClient")
    def test_get_text_with_azure(self, mock_client_cls, mock_subprocess):
        # Arrange mock Azure client
        mock_client = mock.Mock()
        mock_client_cls.return_value = mock_client

        # Simulate poller result and its `.details`
        mock_poller = mock.Mock()
        mock_poller.wait.return_value = None
        mock_poller.details = {"operation_id": "fake-op-id"}
        mock_client.begin_analyze_document.return_value = mock_poller
        mock_poller.result.return_value.content = "This is a test document."

        # Return dummy PDF bytes
        mock_client.get_analyze_result_pdf.return_value = [
            b"%PDF-",
            b"1.7 ",
            b"FAKEPDF",
        ]

        # Simulate pdftotext by writing dummy text to sidecar file
        def fake_run(cmd, *args, **kwargs):
            with Path(cmd[-1]).open("w", encoding="utf-8") as f:
                f.write("This is a test document.")

        mock_subprocess.side_effect = fake_run

        with override_settings(
            REMOTE_OCR_ENGINE="azureai",
            REMOTE_OCR_API_KEY="somekey",
            REMOTE_OCR_ENDPOINT="https://endpoint.cognitiveservices.azure.com",
        ):
            parser = get_parser(uuid.uuid4())
            parser.parse(
                self.SAMPLE_FILES / "simple-digital.pdf",
                "application/pdf",
            )

            self.assertContainsStrings(
                parser.text.strip(),
                ["This is a test document."],
            )

    @mock.patch("azure.ai.documentintelligence.DocumentIntelligenceClient")
    def test_get_text_with_azure_error_logged_and_returns_none(self, mock_client_cls):
        mock_client = mock.Mock()
        mock_client.begin_analyze_document.side_effect = RuntimeError("fail")
        mock_client_cls.return_value = mock_client

        with override_settings(
            REMOTE_OCR_ENGINE="azureai",
            REMOTE_OCR_API_KEY="somekey",
            REMOTE_OCR_ENDPOINT="https://endpoint.cognitiveservices.azure.com",
        ):
            parser = get_parser(uuid.uuid4())
            with mock.patch.object(parser.log, "error") as mock_log_error:
                parser.parse(
                    self.SAMPLE_FILES / "simple-digital.pdf",
                    "application/pdf",
                )

            self.assertIsNone(parser.text)
            mock_client.begin_analyze_document.assert_called_once()
            mock_client.close.assert_called_once()
            mock_log_error.assert_called_once()
            self.assertIn(
                "Azure AI Vision parsing failed",
                mock_log_error.call_args[0][0],
            )

    @override_settings(
        REMOTE_OCR_ENGINE="azureai",
        REMOTE_OCR_API_KEY="key",
        REMOTE_OCR_ENDPOINT="https://endpoint.cognitiveservices.azure.com",
    )
    def test_supported_mime_types_valid_config(self):
        parser = RemoteDocumentParser(uuid.uuid4())
        expected_types = {
            "application/pdf": ".pdf",
            "image/png": ".png",
            "image/jpeg": ".jpg",
            "image/tiff": ".tiff",
            "image/bmp": ".bmp",
            "image/gif": ".gif",
            "image/webp": ".webp",
        }
        self.assertEqual(parser.supported_mime_types(), expected_types)

    def test_supported_mime_types_invalid_config(self):
        parser = get_parser(uuid.uuid4())
        self.assertEqual(parser.supported_mime_types(), {})

    @override_settings(
        REMOTE_OCR_ENGINE=None,
        REMOTE_OCR_API_KEY=None,
        REMOTE_OCR_ENDPOINT=None,
    )
    def test_parse_with_invalid_config(self):
        parser = get_parser(uuid.uuid4())
        parser.parse(self.SAMPLE_FILES / "simple-digital.pdf", "application/pdf")
        self.assertEqual(parser.text, "")
uv.lock (generated)
@@ -95,6 +95,34 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/02/ff/1175b0b7371e46244032d43a56862d0af455823b5280a50c63d99cc50f18/automat-25.4.16-py3-none-any.whl", hash = "sha256:04e9bce696a8d5671ee698005af6e5a9fa15354140a87f4870744604dcdd3ba1", size = 42842, upload-time = "2025-04-16T20:12:14.447Z" },
]

[[package]]
name = "azure-ai-documentintelligence"
version = "1.0.2"
source = { registry = "https://pypi.org/simple" }
dependencies = [
    { name = "azure-core", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
    { name = "isodate", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
    { name = "typing-extensions", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
]
sdist = { url = "https://files.pythonhosted.org/packages/44/7b/8115cd713e2caa5e44def85f2b7ebd02a74ae74d7113ba20bdd41fd6dd80/azure_ai_documentintelligence-1.0.2.tar.gz", hash = "sha256:4d75a2513f2839365ebabc0e0e1772f5601b3a8c9a71e75da12440da13b63484", size = 170940 }
wheels = [
    { url = "https://files.pythonhosted.org/packages/d9/75/c9ec040f23082f54ffb1977ff8f364c2d21c79a640a13d1c1809e7fd6b1a/azure_ai_documentintelligence-1.0.2-py3-none-any.whl", hash = "sha256:e1fb446abbdeccc9759d897898a0fe13141ed29f9ad11fc705f951925822ed59", size = 106005 },
]

[[package]]
name = "azure-core"
version = "1.33.0"
source = { registry = "https://pypi.org/simple" }
dependencies = [
    { name = "requests", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
    { name = "six", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
    { name = "typing-extensions", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
]
sdist = { url = "https://files.pythonhosted.org/packages/75/aa/7c9db8edd626f1a7d99d09ef7926f6f4fb34d5f9fa00dc394afdfe8e2a80/azure_core-1.33.0.tar.gz", hash = "sha256:f367aa07b5e3005fec2c1e184b882b0b039910733907d001c20fb08ebb8c0eb9", size = 295633 }
wheels = [
    { url = "https://files.pythonhosted.org/packages/07/b7/76b7e144aa53bd206bf1ce34fa75350472c3f69bf30e5c8c18bc9881035d/azure_core-1.33.0-py3-none-any.whl", hash = "sha256:9b5b6d0223a1d38c37500e6971118c1e0f13f54951e6893968b38910bc9cda8f", size = 207071 },
]

[[package]]
name = "babel"
version = "2.17.0"
@@ -1451,6 +1479,15 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/c7/fc/4e5a141c3f7c7bed550ac1f69e599e92b6be449dd4677ec09f325cad0955/inotifyrecursive-0.3.5-py3-none-any.whl", hash = "sha256:7e5f4a2e1dc2bef0efa3b5f6b339c41fb4599055a2b54909d020e9e932cc8d2f", size = 8009, upload-time = "2020-11-20T12:38:46.981Z" },
]

[[package]]
name = "isodate"
version = "0.7.2"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/54/4d/e940025e2ce31a8ce1202635910747e5a87cc3a6a6bb2d00973375014749/isodate-0.7.2.tar.gz", hash = "sha256:4cd1aa0f43ca76f4a6c6c0292a85f40b35ec2e43e315b59f06e6d32171a953e6", size = 29705 }
wheels = [
    { url = "https://files.pythonhosted.org/packages/15/aa/0aca39a37d3c7eb941ba736ede56d689e7be91cab5d9ca846bde3999eba6/isodate-0.7.2-py3-none-any.whl", hash = "sha256:28009937d8031054830160fce6d409ed342816b543597cece116d966c6d99e15", size = 22320 },
]

[[package]]
name = "jinja2"
version = "3.1.6"
@@ -2118,6 +2155,7 @@ name = "paperless-ngx"
version = "2.20.1"
source = { virtual = "." }
dependencies = [
    { name = "azure-ai-documentintelligence", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
    { name = "babel", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
    { name = "bleach", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
    { name = "celery", extra = ["redis"], marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
@@ -2163,7 +2201,6 @@ dependencies = [
    { name = "pyzbar", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
    { name = "rapidfuzz", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
    { name = "redis", extra = ["hiredis"], marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
    { name = "regex", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
    { name = "scikit-learn", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
    { name = "setproctitle", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
    { name = "tika-client", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
@@ -2255,6 +2292,7 @@ typing = [

[package.metadata]
requires-dist = [
    { name = "azure-ai-documentintelligence", specifier = ">=1.0.2" },
    { name = "babel", specifier = ">=2.17" },
    { name = "bleach", specifier = "~=6.3.0" },
    { name = "celery", extras = ["redis"], specifier = "~=5.5.1" },
@@ -2307,7 +2345,6 @@ requires-dist = [
    { name = "pyzbar", specifier = "~=0.1.9" },
    { name = "rapidfuzz", specifier = "~=3.14.0" },
    { name = "redis", extras = ["hiredis"], specifier = "~=5.2.1" },
    { name = "regex", specifier = ">=2025.9.18" },
    { name = "scikit-learn", specifier = "~=1.7.0" },
    { name = "setproctitle", specifier = "~=1.3.4" },
    { name = "tika-client", specifier = "~=0.10.0" },
@@ -4064,11 +4101,11 @@ wheels = [

[[package]]
name = "types-python-dateutil"
version = "2.9.0.20251115"
version = "2.9.0.20250822"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/6a/36/06d01fb52c0d57e9ad0c237654990920fa41195e4b3d640830dabf9eeb2f/types_python_dateutil-2.9.0.20251115.tar.gz", hash = "sha256:8a47f2c3920f52a994056b8786309b43143faa5a64d4cbb2722d6addabdf1a58", size = 16363, upload-time = "2025-11-15T03:00:13.717Z" }
sdist = { url = "https://files.pythonhosted.org/packages/0c/0a/775f8551665992204c756be326f3575abba58c4a3a52eef9909ef4536428/types_python_dateutil-2.9.0.20250822.tar.gz", hash = "sha256:84c92c34bd8e68b117bff742bc00b692a1e8531262d4507b33afcc9f7716cd53", size = 16084, upload-time = "2025-08-22T03:02:00.613Z" }
wheels = [
    { url = "https://files.pythonhosted.org/packages/43/0b/56961d3ba517ed0df9b3a27bfda6514f3d01b28d499d1bce9068cfe4edd1/types_python_dateutil-2.9.0.20251115-py3-none-any.whl", hash = "sha256:9cf9c1c582019753b8639a081deefd7e044b9fa36bd8217f565c6c4e36ee0624", size = 18251, upload-time = "2025-11-15T03:00:12.317Z" },
    { url = "https://files.pythonhosted.org/packages/ab/d9/a29dfa84363e88b053bf85a8b7f212a04f0d7343a4d24933baa45c06e08b/types_python_dateutil-2.9.0.20250822-py3-none-any.whl", hash = "sha256:849d52b737e10a6dc6621d2bd7940ec7c65fcb69e6aa2882acf4e56b2b508ddc", size = 17892, upload-time = "2025-08-22T03:01:59.436Z" },
]

[[package]]