Compare commits


11 Commits

Author SHA1 Message Date
dependabot[bot]
8d42ad727e Chore(deps): Bump types-python-dateutil
Bumps [types-python-dateutil](https://github.com/typeshed-internal/stub_uploader) from 2.9.0.20250822 to 2.9.0.20251115.
- [Commits](https://github.com/typeshed-internal/stub_uploader/commits)

---
updated-dependencies:
- dependency-name: types-python-dateutil
  dependency-version: 2.9.0.20251115
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-12-12 17:33:05 +00:00
GitHub Actions
4d7aa8e1a2 Auto translate strings 2025-12-12 17:30:36 +00:00
shamoon
9bdbfd362f Merge commit from fork
* Add safe regex matching with timeouts and validation

* Remove redundant length check

* Remove timeouterror workaround
2025-12-12 09:28:47 -08:00
shamoon
9ba1d93e15 Merge commit from fork
* Uses a custom transport to resolve the slim chance of a DNS rebinding affecting the webhook

* Fix WebhookTransport hostname resolution and validation

* Fix test failures

* Lint

* Keep all internal logic inside WebhookTransport

* Fix test failure

* Update handlers.py

* Update handlers.py

---------

Co-authored-by: Trenton H <797416+stumpylog@users.noreply.github.com>
2025-12-12 09:28:17 -08:00
shamoon
a9c73e2846 Update validators.py 2025-12-12 09:27:19 -08:00
GitHub Actions
332136df8b Auto translate strings 2025-12-12 16:44:49 +00:00
shamoon
3a1d33225e Fixhancement: pass ordering to tag children (#11556) 2025-12-12 16:43:16 +00:00
Jan Kleine
e770ff572e Documentation: Document missing workflows env variable and complete diagram (#11554) 2025-12-12 16:12:23 +00:00
Trenton H
402f2ead59 Fixes the workflow configuration being nested under the consumption documentation (#11588) 2025-12-12 07:51:45 -08:00
shamoon
3b4d958b97 Performance: avoid unnecessary filename operations on bulk custom field updates (#11558) 2025-12-12 07:50:51 -08:00
shamoon
3f81b432ec Fix: normalize SVG tag and attribute names, add version (#11586) 2025-12-11 19:17:55 -08:00
25 changed files with 405 additions and 520 deletions

View File

@@ -1054,12 +1054,22 @@ should be a valid crontab(5) expression describing when to run.
#### [`PAPERLESS_SANITY_TASK_CRON=<cron expression>`](#PAPERLESS_SANITY_TASK_CRON) {#PAPERLESS_SANITY_TASK_CRON}
: Configures the scheduled sanity checker frequency.
: Configures the scheduled sanity checker frequency. The value should be a
valid crontab(5) expression describing when to run.
: If set to the string "disable", the sanity checker will not run automatically.
Defaults to `30 0 * * sun` or Sunday at 30 minutes past midnight.
#### [`PAPERLESS_WORKFLOW_SCHEDULED_TASK_CRON=<cron expression>`](#PAPERLESS_WORKFLOW_SCHEDULED_TASK_CRON) {#PAPERLESS_WORKFLOW_SCHEDULED_TASK_CRON}
: Configures the scheduled workflow check frequency. The value should be a
valid crontab(5) expression describing when to run.
: If set to the string "disable", scheduled workflows will not run.
Defaults to `5 */1 * * *` or every hour at 5 minutes past the hour.
#### [`PAPERLESS_ENABLE_COMPRESSION=<bool>`](#PAPERLESS_ENABLE_COMPRESSION) {#PAPERLESS_ENABLE_COMPRESSION}
: Enables compression of the responses from the webserver.
@@ -1271,30 +1281,6 @@ within your documents.
Defaults to false.
## Workflow webhooks
#### [`PAPERLESS_WEBHOOKS_ALLOWED_SCHEMES=<str>`](#PAPERLESS_WEBHOOKS_ALLOWED_SCHEMES) {#PAPERLESS_WEBHOOKS_ALLOWED_SCHEMES}
: A comma-separated list of allowed schemes for webhooks. This setting
controls which URL schemes are permitted for webhook URLs.
Defaults to `http,https`.
#### [`PAPERLESS_WEBHOOKS_ALLOWED_PORTS=<str>`](#PAPERLESS_WEBHOOKS_ALLOWED_PORTS) {#PAPERLESS_WEBHOOKS_ALLOWED_PORTS}
: A comma-separated list of allowed ports for webhooks. This setting
controls which ports are permitted for webhook URLs. For example, if you
set this to `80,443`, webhooks will only be sent to URLs that use these
ports.
Defaults to empty list, which allows all ports.
#### [`PAPERLESS_WEBHOOKS_ALLOW_INTERNAL_REQUESTS=<bool>`](#PAPERLESS_WEBHOOKS_ALLOW_INTERNAL_REQUESTS) {#PAPERLESS_WEBHOOKS_ALLOW_INTERNAL_REQUESTS}
: If set to false, webhooks cannot be sent to internal URLs (e.g., localhost).
Defaults to true, which allows internal requests.
### Polling {#polling}
#### [`PAPERLESS_CONSUMER_POLLING=<num>`](#PAPERLESS_CONSUMER_POLLING) {#PAPERLESS_CONSUMER_POLLING}
@@ -1338,6 +1324,30 @@ consumers working on the same file. Configure this to prevent that.
Defaults to 0.5 seconds.
## Workflow webhooks
#### [`PAPERLESS_WEBHOOKS_ALLOWED_SCHEMES=<str>`](#PAPERLESS_WEBHOOKS_ALLOWED_SCHEMES) {#PAPERLESS_WEBHOOKS_ALLOWED_SCHEMES}
: A comma-separated list of allowed schemes for webhooks. This setting
controls which URL schemes are permitted for webhook URLs.
Defaults to `http,https`.
#### [`PAPERLESS_WEBHOOKS_ALLOWED_PORTS=<str>`](#PAPERLESS_WEBHOOKS_ALLOWED_PORTS) {#PAPERLESS_WEBHOOKS_ALLOWED_PORTS}
: A comma-separated list of allowed ports for webhooks. This setting
controls which ports are permitted for webhook URLs. For example, if you
set this to `80,443`, webhooks will only be sent to URLs that use these
ports.
Defaults to empty list, which allows all ports.
#### [`PAPERLESS_WEBHOOKS_ALLOW_INTERNAL_REQUESTS=<bool>`](#PAPERLESS_WEBHOOKS_ALLOW_INTERNAL_REQUESTS) {#PAPERLESS_WEBHOOKS_ALLOW_INTERNAL_REQUESTS}
: If set to false, webhooks cannot be sent to internal URLs (e.g., localhost).
Defaults to true, which allows internal requests.
## Incoming Mail {#incoming_mail}
### Email OAuth {#email_oauth}
@@ -1794,23 +1804,3 @@ password. All of these options come from their similarly-named [Django settings]
#### [`PAPERLESS_EMAIL_USE_SSL=<bool>`](#PAPERLESS_EMAIL_USE_SSL) {#PAPERLESS_EMAIL_USE_SSL}
: Defaults to false.
## Remote OCR
#### [`PAPERLESS_REMOTE_OCR_ENGINE=<str>`](#PAPERLESS_REMOTE_OCR_ENGINE) {#PAPERLESS_REMOTE_OCR_ENGINE}
: The remote OCR engine to use. Currently only Azure AI is supported as "azureai".
Defaults to None, which disables remote OCR.
#### [`PAPERLESS_REMOTE_OCR_API_KEY=<str>`](#PAPERLESS_REMOTE_OCR_API_KEY) {#PAPERLESS_REMOTE_OCR_API_KEY}
: The API key to use for the remote OCR engine.
Defaults to None.
#### [`PAPERLESS_REMOTE_OCR_ENDPOINT=<str>`](#PAPERLESS_REMOTE_OCR_ENDPOINT) {#PAPERLESS_REMOTE_OCR_ENDPOINT}
: The endpoint to use for the remote OCR engine. This is required for Azure AI.
Defaults to None.
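A hedged sketch of how the settings documented in this diff could be interpreted; the function names and env parsing here are illustrative assumptions, not the actual Paperless-ngx implementation:

```python
# Illustrative only: mirrors the documented semantics of the cron and webhook
# settings above; names and parsing are assumptions, not Paperless-ngx code.
import os
from urllib.parse import urlparse

def cron_or_none(env_var: str, default: str) -> str | None:
    """Return the crontab expression, or None when set to "disable"."""
    value = os.getenv(env_var, default)
    return None if value.strip().lower() == "disable" else value

def webhook_url_permitted(url: str) -> bool:
    schemes = set(os.getenv("PAPERLESS_WEBHOOKS_ALLOWED_SCHEMES", "http,https").split(","))
    ports_raw = os.getenv("PAPERLESS_WEBHOOKS_ALLOWED_PORTS", "")
    ports = {int(p.strip()) for p in ports_raw.split(",") if p.strip()}
    parsed = urlparse(url)
    if parsed.scheme not in schemes:
        return False
    # An empty port list means all ports are allowed.
    return not ports or parsed.port is None or parsed.port in ports

print(cron_or_none("PAPERLESS_WORKFLOW_SCHEDULED_TASK_CRON", "5 */1 * * *"))
print(webhook_url_permitted("https://example.com:8443/hook"))
```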

View File

@@ -25,10 +25,9 @@ physical documents into a searchable online archive so you can keep, well, _less
## Features
- **Organize and index** your scanned documents with tags, correspondents, types, and more.
- _Your_ data is stored locally on _your_ server and is never transmitted or shared in any way, unless you explicitly choose to do so.
- _Your_ data is stored locally on _your_ server and is never transmitted or shared in any way.
- Performs **OCR** on your documents, adding searchable and selectable text, even to documents scanned with only images.
- Utilizes the open-source Tesseract engine to recognize more than 100 languages.
- _New!_ Supports remote OCR with Azure AI (opt-in).
- Utilizes the open-source Tesseract engine to recognize more than 100 languages.
- Documents are saved as PDF/A format which is designed for long term storage, alongside the unaltered originals.
- Uses machine-learning to automatically add tags, correspondents and document types to your documents.
- Supports PDF documents, images, plain text files, Office documents (Word, Excel, PowerPoint, and LibreOffice equivalents)[^1] and more.

View File

@@ -443,6 +443,10 @@ flowchart TD
'Updated'
trigger(s)"}
scheduled{"Documents
matching
trigger(s)"}
A[New Document] --> consumption
consumption --> |Yes| C[Workflow Actions Run]
consumption --> |No| D
@@ -455,6 +459,11 @@ flowchart TD
updated --> |Yes| J[Workflow Actions Run]
updated --> |No| K
J --> K[Document Saved]
L[Scheduled Task Check<br/>hourly at :05] --> M[Get All Scheduled Triggers]
M --> scheduled
scheduled --> |Yes| N[Workflow Actions Run]
scheduled --> |No| O[Document Saved]
N --> O
```
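For reference, the new "hourly at :05" scheduled check in the diagram matches the `5 */1 * * *` default documented earlier. A hedged sketch of that cadence as a Celery crontab (Celery drives Paperless-ngx's periodic tasks, though the exact registration is not shown here):

```python
# Hedged sketch: the "hourly at :05" scheduled task check expressed as a
# Celery crontab; how Paperless-ngx actually registers it is assumed.
from celery.schedules import crontab

scheduled_workflow_check = crontab(minute="5", hour="*/1")  # 5 */1 * * *
```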
#### Filters {#workflow-trigger-filters}
@@ -892,21 +901,6 @@ how regularly you intend to scan documents and use paperless.
performed the task associated with the document, move it to the
inbox.
## Remote OCR
!!! important
This feature is disabled by default and will always remain strictly "opt-in".
Paperless-ngx supports performing OCR on documents using remote services. At the moment, this is limited to
[Microsoft's Azure "Document Intelligence" service](https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence).
This is of course a paid service (with a free tier) which requires an Azure account and subscription. Azure AI is not affiliated with
Paperless-ngx in any way. When enabled, Paperless-ngx will automatically send appropriate documents to Azure for OCR processing, bypassing
the local OCR engine. See the [configuration](configuration.md#PAPERLESS_REMOTE_OCR_ENGINE) options for more details.
Additionally, when using a commercial service with this feature, consider both potential costs as well as any associated file size
or page limitations (e.g. with a free tier).
## Architecture
Paperless-ngx consists of the following components:

View File

@@ -16,7 +16,6 @@ classifiers = [
# This will allow testing to not install a webserver, mysql, etc
dependencies = [
"azure-ai-documentintelligence>=1.0.2",
"babel>=2.17",
"bleach~=6.3.0",
"celery[redis]~=5.5.1",
@@ -64,6 +63,7 @@ dependencies = [
"pyzbar~=0.1.9",
"rapidfuzz~=3.14.0",
"redis[hiredis]~=5.2.1",
"regex>=2025.9.18",
"scikit-learn~=1.7.0",
"setproctitle~=1.3.4",
"tika-client~=0.10.0",
@@ -253,7 +253,6 @@ testpaths = [
"src/paperless_tesseract/tests/",
"src/paperless_tika/tests",
"src/paperless_text/tests/",
"src/paperless_remote/tests/",
]
addopts = [
"--pythonwarnings=all",

View File

@@ -20,6 +20,7 @@ from documents.models import Tag
from documents.models import Workflow
from documents.models import WorkflowTrigger
from documents.permissions import get_objects_for_user_owner_aware
from documents.regex import safe_regex_search
if TYPE_CHECKING:
from django.db.models import QuerySet
@@ -152,7 +153,7 @@ def match_storage_paths(document: Document, classifier: DocumentClassifier, user
def matches(matching_model: MatchingModel, document: Document):
search_kwargs = {}
search_flags = 0
document_content = document.content
@@ -161,14 +162,18 @@ def matches(matching_model: MatchingModel, document: Document):
return False
if matching_model.is_insensitive:
search_kwargs = {"flags": re.IGNORECASE}
search_flags = re.IGNORECASE
if matching_model.matching_algorithm == MatchingModel.MATCH_NONE:
return False
elif matching_model.matching_algorithm == MatchingModel.MATCH_ALL:
for word in _split_match(matching_model):
search_result = re.search(rf"\b{word}\b", document_content, **search_kwargs)
search_result = re.search(
rf"\b{word}\b",
document_content,
flags=search_flags,
)
if not search_result:
return False
log_reason(
@@ -180,7 +185,7 @@ def matches(matching_model: MatchingModel, document: Document):
elif matching_model.matching_algorithm == MatchingModel.MATCH_ANY:
for word in _split_match(matching_model):
if re.search(rf"\b{word}\b", document_content, **search_kwargs):
if re.search(rf"\b{word}\b", document_content, flags=search_flags):
log_reason(matching_model, document, f"it contains this word: {word}")
return True
return False
@@ -190,7 +195,7 @@ def matches(matching_model: MatchingModel, document: Document):
re.search(
rf"\b{re.escape(matching_model.match)}\b",
document_content,
**search_kwargs,
flags=search_flags,
),
)
if result:
@@ -202,16 +207,11 @@ def matches(matching_model: MatchingModel, document: Document):
return result
elif matching_model.matching_algorithm == MatchingModel.MATCH_REGEX:
try:
match = re.search(
re.compile(matching_model.match, **search_kwargs),
document_content,
)
except re.error:
logger.error(
f"Error while processing regular expression {matching_model.match}",
)
return False
match = safe_regex_search(
matching_model.match,
document_content,
flags=search_flags,
)
if match:
log_reason(
matching_model,

src/documents/regex.py (new file, 50 lines)
View File

@@ -0,0 +1,50 @@
from __future__ import annotations
import logging
import textwrap
import regex
from django.conf import settings
logger = logging.getLogger("paperless.regex")
REGEX_TIMEOUT_SECONDS: float = getattr(settings, "MATCH_REGEX_TIMEOUT_SECONDS", 0.1)
def validate_regex_pattern(pattern: str) -> None:
"""
Validate user provided regex for basic compile errors.
Raises ValueError on validation failure.
"""
try:
regex.compile(pattern)
except regex.error as exc:
raise ValueError(exc.msg) from exc
def safe_regex_search(pattern: str, text: str, *, flags: int = 0):
"""
Run a regex search with a timeout. Returns a match object or None.
Validation errors and timeouts are logged and treated as no match.
"""
try:
validate_regex_pattern(pattern)
compiled = regex.compile(pattern, flags=flags)
except (regex.error, ValueError) as exc:
logger.error(
"Error while processing regular expression %s: %s",
textwrap.shorten(pattern, width=80, placeholder=""),
exc,
)
return None
try:
return compiled.search(text, timeout=REGEX_TIMEOUT_SECONDS)
except TimeoutError:
logger.warning(
"Regular expression matching timed out for pattern %s",
textwrap.shorten(pattern, width=80, placeholder=""),
)
return None
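A short usage sketch for the helper above (assumes Django settings are configured, since the timeout is read from them):

```python
# Usage sketch for documents.regex.safe_regex_search, per the file above.
from documents.regex import safe_regex_search

# Normal case: behaves like a plain regex search.
m = safe_regex_search(r"\binvoice\b", "Please see the attached invoice.")
print(m.group(0) if m else None)  # -> "invoice"

# Catastrophic backtracking is cut off by the timeout and logged as a
# warning, returning None instead of stalling the worker (the same case
# exercised by the new matching test later in this diff).
print(safe_regex_search(r"(a+)+$", "a" * 5000 + "X"))  # -> None
```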

View File

@@ -21,6 +21,7 @@ from django.core.validators import MaxLengthValidator
from django.core.validators import RegexValidator
from django.core.validators import integer_validator
from django.db.models import Count
from django.db.models.functions import Lower
from django.utils.crypto import get_random_string
from django.utils.dateparse import parse_datetime
from django.utils.text import slugify
@@ -38,6 +39,7 @@ from guardian.utils import get_user_obj_perms_model
from rest_framework import fields
from rest_framework import serializers
from rest_framework.fields import SerializerMethodField
from rest_framework.filters import OrderingFilter
if settings.AUDIT_LOG_ENABLED:
from auditlog.context import set_actor
@@ -69,6 +71,7 @@ from documents.parsers import is_mime_type_supported
from documents.permissions import get_document_count_filter_for_user
from documents.permissions import get_groups_with_only_permission
from documents.permissions import set_permissions_for_object
from documents.regex import validate_regex_pattern
from documents.templating.filepath import validate_filepath_template_and_render
from documents.templating.utils import convert_format_str_to_template_format
from documents.validators import uri_validator
@@ -139,10 +142,10 @@ class MatchingModelSerializer(serializers.ModelSerializer):
and self.initial_data["matching_algorithm"] == MatchingModel.MATCH_REGEX
):
try:
re.compile(match)
except re.error as e:
validate_regex_pattern(match)
except ValueError as e:
raise serializers.ValidationError(
_("Invalid regular expression: %(error)s") % {"error": str(e.msg)},
_("Invalid regular expression: %(error)s") % {"error": str(e)},
)
return match
@@ -575,15 +578,29 @@ class TagSerializer(MatchingModelSerializer, OwnedObjectSerializer):
)
def get_children(self, obj):
filter_q = self.context.get("document_count_filter")
request = self.context.get("request")
if filter_q is None:
request = self.context.get("request")
user = getattr(request, "user", None) if request else None
filter_q = get_document_count_filter_for_user(user)
self.context["document_count_filter"] = filter_q
serializer = TagSerializer(
children_queryset = (
obj.get_children_queryset()
.select_related("owner")
.annotate(document_count=Count("documents", filter=filter_q)),
.annotate(document_count=Count("documents", filter=filter_q))
)
view = self.context.get("view")
ordering = (
OrderingFilter().get_ordering(request, children_queryset, view)
if request and view
else None
)
ordering = ordering or (Lower("name"),)
children_queryset = children_queryset.order_by(*ordering)
serializer = TagSerializer(
children_queryset,
many=True,
user=self.user,
full_perms=self.full_perms,
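A hedged client-side illustration of the child-ordering change (#11556); the URL, token, and response shape are assumptions based on the standard Paperless-ngx REST API:

```python
# Illustrative request: the `ordering` parameter used for the tag list is now
# also applied to each tag's nested `children`, falling back to
# case-insensitive name order. Endpoint and token are placeholders.
import httpx

resp = httpx.get(
    "http://localhost:8000/api/tags/",
    params={"ordering": "-document_count"},
    headers={"Authorization": "Token <api-token>"},
)
for tag in resp.json()["results"]:
    children = tag.get("children") or []
    print(tag["name"], [child["name"] for child in children])
```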

View File

@@ -28,6 +28,7 @@ from documents import matching
from documents.caching import clear_document_caches
from documents.file_handling import create_source_path_directory
from documents.file_handling import delete_empty_directories
from documents.file_handling import generate_filename
from documents.file_handling import generate_unique_filename
from documents.models import CustomField
from documents.models import CustomFieldInstance
@@ -42,6 +43,7 @@ from documents.models import WorkflowAction
from documents.models import WorkflowRun
from documents.models import WorkflowTrigger
from documents.permissions import get_objects_for_user_owner_aware
from documents.templating.utils import convert_format_str_to_template_format
from documents.workflows.actions import build_workflow_action_context
from documents.workflows.actions import execute_email_action
from documents.workflows.actions import execute_webhook_action
@@ -389,6 +391,19 @@ class CannotMoveFilesException(Exception):
pass
def _filename_template_uses_custom_fields(doc: Document) -> bool:
template = None
if doc.storage_path is not None:
template = doc.storage_path.path
elif settings.FILENAME_FORMAT is not None:
template = convert_format_str_to_template_format(settings.FILENAME_FORMAT)
if not template:
return False
return "custom_fields" in template
# should be disabled in /src/documents/management/commands/document_importer.py handle
@receiver(models.signals.post_save, sender=CustomFieldInstance, weak=False)
@receiver(models.signals.m2m_changed, sender=Document.tags.through, weak=False)
@@ -399,6 +414,8 @@ def update_filename_and_move_files(
**kwargs,
):
if isinstance(instance, CustomFieldInstance):
if not _filename_template_uses_custom_fields(instance.document):
return
instance = instance.document
def validate_move(instance, old_path: Path, new_path: Path):
@@ -436,21 +453,47 @@ def update_filename_and_move_files(
old_filename = instance.filename
old_source_path = instance.source_path
candidate_filename = generate_filename(instance)
candidate_source_path = (
settings.ORIGINALS_DIR / candidate_filename
).resolve()
if candidate_filename == Path(old_filename):
new_filename = Path(old_filename)
elif (
candidate_source_path.exists()
and candidate_source_path != old_source_path
):
# Only fall back to unique search when there is an actual conflict
new_filename = generate_unique_filename(instance)
else:
new_filename = candidate_filename
# Need to convert to string to be able to save it to the db
instance.filename = str(generate_unique_filename(instance))
instance.filename = str(new_filename)
move_original = old_filename != instance.filename
old_archive_filename = instance.archive_filename
old_archive_path = instance.archive_path
if instance.has_archive_version:
# Need to convert to string to be able to save it to the db
instance.archive_filename = str(
generate_unique_filename(
archive_candidate = generate_filename(instance, archive_filename=True)
archive_candidate_path = (
settings.ARCHIVE_DIR / archive_candidate
).resolve()
if archive_candidate == Path(old_archive_filename):
new_archive_filename = Path(old_archive_filename)
elif (
archive_candidate_path.exists()
and archive_candidate_path != old_archive_path
):
new_archive_filename = generate_unique_filename(
instance,
archive_filename=True,
),
)
)
else:
new_archive_filename = archive_candidate
instance.archive_filename = str(new_archive_filename)
move_archive = old_archive_filename != instance.archive_filename
else:
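The essence of the fast path added above, as a minimal standalone sketch (the real check also consults the document's storage path and `FILENAME_FORMAT`):

```python
# Minimal sketch of the new fast path: a custom-field save only triggers the
# rename machinery when the active filename template references custom fields.
def template_uses_custom_fields(template: str | None) -> bool:
    return bool(template) and "custom_fields" in template

# Templates taken from the tests later in this diff:
assert template_uses_custom_fields("{{ custom_fields|get_cf_value('flavor') }}/{{ title }}")
assert not template_uses_custom_fields("{{created}}/{{ title }}")
```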

View File

@@ -16,6 +16,7 @@ from django.utils import timezone
from documents.file_handling import create_source_path_directory
from documents.file_handling import delete_empty_directories
from documents.file_handling import generate_filename
from documents.file_handling import generate_unique_filename
from documents.models import Correspondent
from documents.models import CustomField
from documents.models import CustomFieldInstance
@@ -1632,6 +1633,73 @@ class TestFilenameGeneration(DirectoriesMixin, TestCase):
)
class TestCustomFieldFilenameUpdates(
DirectoriesMixin,
FileSystemAssertsMixin,
TestCase,
):
def setUp(self):
self.cf = CustomField.objects.create(
name="flavor",
data_type=CustomField.FieldDataType.STRING,
)
self.doc = Document.objects.create(
title="document",
mime_type="application/pdf",
checksum="abc123",
)
self.cfi = CustomFieldInstance.objects.create(
field=self.cf,
document=self.doc,
value_text="initial",
)
return super().setUp()
@override_settings(FILENAME_FORMAT=None)
def test_custom_field_not_in_template_skips_filename_work(self):
storage_path = StoragePath.objects.create(path="{{created}}/{{ title }}")
self.doc.storage_path = storage_path
self.doc.save()
initial_filename = generate_filename(self.doc)
Document.objects.filter(pk=self.doc.pk).update(filename=str(initial_filename))
self.doc.refresh_from_db()
Path(self.doc.source_path).parent.mkdir(parents=True, exist_ok=True)
Path(self.doc.source_path).touch()
with mock.patch("documents.signals.handlers.generate_unique_filename") as m:
m.side_effect = generate_unique_filename
self.cfi.value_text = "updated"
self.cfi.save()
self.doc.refresh_from_db()
self.assertEqual(Path(self.doc.filename), initial_filename)
self.assertEqual(m.call_count, 0)
@override_settings(FILENAME_FORMAT=None)
def test_custom_field_in_template_triggers_filename_update(self):
storage_path = StoragePath.objects.create(
path="{{ custom_fields|get_cf_value('flavor') }}/{{ title }}",
)
self.doc.storage_path = storage_path
self.doc.save()
initial_filename = generate_filename(self.doc)
Document.objects.filter(pk=self.doc.pk).update(filename=str(initial_filename))
self.doc.refresh_from_db()
Path(self.doc.source_path).parent.mkdir(parents=True, exist_ok=True)
Path(self.doc.source_path).touch()
with mock.patch("documents.signals.handlers.generate_unique_filename") as m:
m.side_effect = generate_unique_filename
self.cfi.value_text = "updated"
self.cfi.save()
self.doc.refresh_from_db()
expected_filename = Path("updated/document.pdf")
self.assertEqual(Path(self.doc.filename), expected_filename)
self.assertTrue(Path(self.doc.source_path).is_file())
self.assertLessEqual(m.call_count, 1)
class TestPathDateLocalization:
"""
Groups all tests related to the `localize_date` function.

View File

@@ -206,6 +206,22 @@ class TestMatching(_TestMatchingBase):
def test_tach_invalid_regex(self):
self._test_matching("[", "MATCH_REGEX", [], ["Don't match this"])
def test_match_regex_timeout_returns_false(self):
tag = Tag.objects.create(
name="slow",
match=r"(a+)+$",
matching_algorithm=Tag.MATCH_REGEX,
)
document = Document(content=("a" * 5000) + "X")
with self.assertLogs("paperless.regex", level="WARNING") as cm:
self.assertFalse(matching.matches(tag, document))
self.assertTrue(
any("timed out" in message for message in cm.output),
f"Expected timeout log, got {cm.output}",
)
def test_match_fuzzy(self):
self._test_matching(
"Springfield, Miss.",

View File

@@ -17,6 +17,7 @@ from django.utils import timezone
from guardian.shortcuts import assign_perm
from guardian.shortcuts import get_groups_with_perms
from guardian.shortcuts import get_users_with_perms
from httpx import ConnectError
from httpx import HTTPError
from httpx import HTTPStatusError
from pytest_httpx import HTTPXMock
@@ -3428,7 +3429,7 @@ class TestWorkflows(
expected_str = "Error occurred parsing webhook headers"
self.assertIn(expected_str, cm.output[1])
@mock.patch("httpx.post")
@mock.patch("httpx.Client.post")
def test_workflow_webhook_send_webhook_task(self, mock_post):
mock_post.return_value = mock.Mock(
status_code=200,
@@ -3449,8 +3450,6 @@ class TestWorkflows(
content="Test message",
headers={},
files=None,
follow_redirects=False,
timeout=5,
)
expected_str = "Webhook sent to http://paperless-ngx.com"
@@ -3468,11 +3467,9 @@ class TestWorkflows(
data={"message": "Test message"},
headers={},
files=None,
follow_redirects=False,
timeout=5,
)
@mock.patch("httpx.post")
@mock.patch("httpx.Client.post")
def test_workflow_webhook_send_webhook_retry(self, mock_http):
mock_http.return_value.raise_for_status = mock.Mock(
side_effect=HTTPStatusError(
@@ -3668,7 +3665,7 @@ class TestWebhookSecurity:
- ValueError is raised
"""
resolve_to("127.0.0.1")
with pytest.raises(ValueError):
with pytest.raises(ConnectError):
send_webhook(
"http://paperless-ngx.com",
data="",
@@ -3698,7 +3695,8 @@ class TestWebhookSecurity:
)
req = httpx_mock.get_request()
assert req.url.host == "paperless-ngx.com"
assert req.url.host == "52.207.186.75"
assert req.headers["host"] == "paperless-ngx.com"
def test_follow_redirects_disabled(self, httpx_mock: HTTPXMock, resolve_to):
"""

View File

@@ -10,26 +10,98 @@ from django.conf import settings
logger = logging.getLogger("paperless.workflows.webhooks")
def _is_public_ip(ip: str) -> bool:
try:
obj = ipaddress.ip_address(ip)
return not (
obj.is_private
or obj.is_loopback
or obj.is_link_local
or obj.is_multicast
or obj.is_unspecified
class WebhookTransport(httpx.HTTPTransport):
"""
Transport that resolves/validates hostnames and rewrites to a vetted IP
while keeping Host/SNI as the original hostname.
"""
def __init__(
self,
hostname: str,
*args,
allow_internal: bool = False,
**kwargs,
) -> None:
super().__init__(*args, **kwargs)
self.hostname = hostname
self.allow_internal = allow_internal
def handle_request(self, request: httpx.Request) -> httpx.Response:
hostname = request.url.host
if not hostname:
raise httpx.ConnectError("No hostname in request URL")
try:
addr_info = socket.getaddrinfo(hostname, None)
except socket.gaierror as e:
raise httpx.ConnectError(f"Could not resolve hostname: {hostname}") from e
ips = [info[4][0] for info in addr_info if info and info[4]]
if not ips:
raise httpx.ConnectError(f"Could not resolve hostname: {hostname}")
if not self.allow_internal:
for ip_str in ips:
if not WebhookTransport.is_public_ip(ip_str):
raise httpx.ConnectError(
f"Connection blocked: {hostname} resolves to a non-public address",
)
ip_str = ips[0]
formatted_ip = self._format_ip_for_url(ip_str)
new_headers = httpx.Headers(request.headers)
if "host" in new_headers:
del new_headers["host"]
new_headers["Host"] = hostname
new_url = request.url.copy_with(host=formatted_ip)
request = httpx.Request(
method=request.method,
url=new_url,
headers=new_headers,
content=request.content,
extensions=request.extensions,
)
except ValueError: # pragma: no cover
return False
request.extensions["sni_hostname"] = hostname
return super().handle_request(request)
def _resolve_first_ip(host: str) -> str | None:
try:
info = socket.getaddrinfo(host, None)
return info[0][4][0] if info else None
except Exception: # pragma: no cover
return None
def _format_ip_for_url(self, ip: str) -> str:
"""
Format IP address for use in URL (wrap IPv6 in brackets)
"""
try:
ip_obj = ipaddress.ip_address(ip)
if ip_obj.version == 6:
return f"[{ip}]"
return ip
except ValueError:
return ip
@staticmethod
def is_public_ip(ip: str | int) -> bool:
try:
obj = ipaddress.ip_address(ip)
return not (
obj.is_private
or obj.is_loopback
or obj.is_link_local
or obj.is_multicast
or obj.is_unspecified
)
except ValueError: # pragma: no cover
return False
@staticmethod
def resolve_first_ip(host: str) -> str | None:
try:
info = socket.getaddrinfo(host, None)
return info[0][4][0] if info else None
except Exception: # pragma: no cover
return None
@shared_task(
@@ -59,12 +131,10 @@ def send_webhook(
logger.warning("Webhook blocked: port not permitted")
raise ValueError("Destination port not permitted.")
ip = _resolve_first_ip(p.hostname)
if not ip or (
not _is_public_ip(ip) and not settings.WEBHOOKS_ALLOW_INTERNAL_REQUESTS
):
logger.warning("Webhook blocked: destination not allowed")
raise ValueError("Destination host is not allowed.")
transport = WebhookTransport(
hostname=p.hostname,
allow_internal=settings.WEBHOOKS_ALLOW_INTERNAL_REQUESTS,
)
try:
post_args = {
@@ -73,8 +143,6 @@ def send_webhook(
k: v for k, v in (headers or {}).items() if k.lower() != "host"
},
"files": files or None,
"timeout": 5.0,
"follow_redirects": False,
}
if as_json:
post_args["json"] = data
@@ -83,14 +151,21 @@ def send_webhook(
else:
post_args["content"] = data
httpx.post(
**post_args,
).raise_for_status()
logger.info(
f"Webhook sent to {url}",
)
with httpx.Client(
transport=transport,
timeout=5.0,
follow_redirects=False,
) as client:
client.post(
**post_args,
).raise_for_status()
logger.info(
f"Webhook sent to {url}",
)
except Exception as e:
logger.error(
f"Failed attempt sending webhook to {url}: {e}",
)
raise e
finally:
transport.close()
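A hedged sketch of exercising the new transport directly; in the real code path it is constructed inside `send_webhook`, so this standalone usage is illustrative:

```python
# Illustrative only: WebhookTransport resolves the hostname, refuses
# non-public addresses unless allow_internal=True, rewrites the URL to the
# vetted IP, and pins Host/SNI to the original hostname.
import httpx
from documents.workflows.webhooks import WebhookTransport

transport = WebhookTransport(hostname="example.com", allow_internal=False)
try:
    with httpx.Client(transport=transport, timeout=5.0, follow_redirects=False) as client:
        client.post("https://example.com/hook", json={"ping": True}).raise_for_status()
except httpx.ConnectError as e:
    # Raised when the name does not resolve or resolves to an internal address.
    print(f"blocked: {e}")
```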

View File

@@ -2,7 +2,7 @@ msgid ""
msgstr ""
"Project-Id-Version: paperless-ngx\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2025-12-10 16:39+0000\n"
"POT-Creation-Date: 2025-12-12 17:29+0000\n"
"PO-Revision-Date: 2022-02-17 04:17\n"
"Last-Translator: \n"
"Language-Team: English\n"
@@ -1219,40 +1219,40 @@ msgstr ""
msgid "workflow runs"
msgstr ""
#: documents/serialisers.py:145
#: documents/serialisers.py:148
#, python-format
msgid "Invalid regular expression: %(error)s"
msgstr ""
#: documents/serialisers.py:622
#: documents/serialisers.py:639
msgid "Invalid color."
msgstr ""
#: documents/serialisers.py:1808
#: documents/serialisers.py:1825
#, python-format
msgid "File type %(type)s not supported"
msgstr ""
#: documents/serialisers.py:1852
#: documents/serialisers.py:1869
#, python-format
msgid "Custom field id must be an integer: %(id)s"
msgstr ""
#: documents/serialisers.py:1859
#: documents/serialisers.py:1876
#, python-format
msgid "Custom field with id %(id)s does not exist"
msgstr ""
#: documents/serialisers.py:1876 documents/serialisers.py:1886
#: documents/serialisers.py:1893 documents/serialisers.py:1903
msgid ""
"Custom fields must be a list of integers or an object mapping ids to values."
msgstr ""
#: documents/serialisers.py:1881
#: documents/serialisers.py:1898
msgid "Some custom fields don't exist or were specified twice."
msgstr ""
#: documents/serialisers.py:1996
#: documents/serialisers.py:2013
msgid "Invalid variable detected."
msgstr ""

View File

@@ -322,7 +322,6 @@ INSTALLED_APPS = [
"paperless_tesseract.apps.PaperlessTesseractConfig",
"paperless_text.apps.PaperlessTextConfig",
"paperless_mail.apps.PaperlessMailConfig",
"paperless_remote.apps.PaperlessRemoteParserConfig",
"django.contrib.admin",
"rest_framework",
"rest_framework.authtoken",
@@ -1402,10 +1401,3 @@ WEBHOOKS_ALLOW_INTERNAL_REQUESTS = __get_boolean(
"PAPERLESS_WEBHOOKS_ALLOW_INTERNAL_REQUESTS",
"true",
)
###############################################################################
# Remote Parser #
###############################################################################
REMOTE_OCR_ENGINE = os.getenv("PAPERLESS_REMOTE_OCR_ENGINE")
REMOTE_OCR_API_KEY = os.getenv("PAPERLESS_REMOTE_OCR_API_KEY")
REMOTE_OCR_ENDPOINT = os.getenv("PAPERLESS_REMOTE_OCR_ENDPOINT")

View File

@@ -14,13 +14,14 @@ ALLOWED_SVG_TAGS: set[str] = {
"text",
"tspan",
"defs",
"linearGradient",
"radialGradient",
"lineargradient",
"radialgradient",
"stop",
"clipPath",
"clippath",
"use",
"title",
"desc",
"style",
}
ALLOWED_SVG_ATTRIBUTES: set[str] = {
@@ -29,6 +30,7 @@ ALLOWED_SVG_ATTRIBUTES: set[str] = {
"style",
"d",
"fill",
"fill-opacity",
"fill-rule",
"stroke",
"stroke-width",
@@ -52,14 +54,14 @@ ALLOWED_SVG_ATTRIBUTES: set[str] = {
"y1",
"x2",
"y2",
"gradientTransform",
"gradientUnits",
"gradienttransform",
"gradientunits",
"offset",
"stop-color",
"stop-opacity",
"clip-path",
"viewBox",
"preserveAspectRatio",
"viewbox",
"preserveaspectratio",
"href",
"xlink:href",
"font-family",
@@ -68,6 +70,8 @@ ALLOWED_SVG_ATTRIBUTES: set[str] = {
"text-anchor",
"xmlns",
"xmlns:xlink",
"version",
"type",
}
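Why the allowlists above were lowercased: sanitizers typically normalize tag and attribute names to lowercase before comparison, so mixed-case entries like `linearGradient` could never match. A minimal sketch of that comparison, with the normalization assumed:

```python
# Sketch of case-insensitive allowlist matching; entries must therefore be
# stored lowercase ("lineargradient", not "linearGradient").
ALLOWED_SVG_TAGS = {"lineargradient", "radialgradient", "clippath", "stop"}

def tag_allowed(tag_name: str) -> bool:
    return tag_name.lower() in ALLOWED_SVG_TAGS

assert tag_allowed("linearGradient")
assert not tag_allowed("script")
```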

View File

@@ -1,4 +0,0 @@
# this is here so that django finds the checks.
from paperless_remote.checks import check_remote_parser_configured
__all__ = ["check_remote_parser_configured"]

View File

@@ -1,14 +0,0 @@
from django.apps import AppConfig
from paperless_remote.signals import remote_consumer_declaration
class PaperlessRemoteParserConfig(AppConfig):
name = "paperless_remote"
def ready(self):
from documents.signals import document_consumer_declaration
document_consumer_declaration.connect(remote_consumer_declaration)
AppConfig.ready(self)

View File

@@ -1,17 +0,0 @@
from django.conf import settings
from django.core.checks import Error
from django.core.checks import register
@register()
def check_remote_parser_configured(app_configs, **kwargs):
if settings.REMOTE_OCR_ENGINE == "azureai" and not (
settings.REMOTE_OCR_ENDPOINT and settings.REMOTE_OCR_API_KEY
):
return [
Error(
"Azure AI remote parser requires endpoint and API key to be configured.",
),
]
return []

View File

@@ -1,118 +0,0 @@
from pathlib import Path
from django.conf import settings
from paperless_tesseract.parsers import RasterisedDocumentParser
class RemoteEngineConfig:
def __init__(
self,
engine: str,
api_key: str | None = None,
endpoint: str | None = None,
):
self.engine = engine
self.api_key = api_key
self.endpoint = endpoint
def engine_is_valid(self):
valid = self.engine in ["azureai"] and self.api_key is not None
if self.engine == "azureai":
valid = valid and self.endpoint is not None
return valid
class RemoteDocumentParser(RasterisedDocumentParser):
"""
This parser uses a remote OCR engine to parse documents. Currently, it supports Azure AI Vision
as this is the only service that provides a remote OCR API with text-embedded PDF output.
"""
logging_name = "paperless.parsing.remote"
def get_settings(self) -> RemoteEngineConfig:
"""
Returns the configuration for the remote OCR engine, loaded from Django settings.
"""
return RemoteEngineConfig(
engine=settings.REMOTE_OCR_ENGINE,
api_key=settings.REMOTE_OCR_API_KEY,
endpoint=settings.REMOTE_OCR_ENDPOINT,
)
def supported_mime_types(self):
if self.settings.engine_is_valid():
return {
"application/pdf": ".pdf",
"image/png": ".png",
"image/jpeg": ".jpg",
"image/tiff": ".tiff",
"image/bmp": ".bmp",
"image/gif": ".gif",
"image/webp": ".webp",
}
else:
return {}
def azure_ai_vision_parse(
self,
file: Path,
) -> str | None:
"""
Uses Azure AI Vision to parse the document and return the text content.
It requests a searchable PDF output with embedded text.
The PDF is saved to the archive_path attribute.
Returns the text content extracted from the document.
If the parsing fails, it returns None.
"""
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest
from azure.ai.documentintelligence.models import AnalyzeOutputOption
from azure.ai.documentintelligence.models import DocumentContentFormat
from azure.core.credentials import AzureKeyCredential
client = DocumentIntelligenceClient(
endpoint=self.settings.endpoint,
credential=AzureKeyCredential(self.settings.api_key),
)
try:
with file.open("rb") as f:
analyze_request = AnalyzeDocumentRequest(bytes_source=f.read())
poller = client.begin_analyze_document(
model_id="prebuilt-read",
body=analyze_request,
output_content_format=DocumentContentFormat.TEXT,
output=[AnalyzeOutputOption.PDF], # request searchable PDF output
content_type="application/json",
)
poller.wait()
result_id = poller.details["operation_id"]
result = poller.result()
# Download the PDF with embedded text
self.archive_path = self.tempdir / "archive.pdf"
with self.archive_path.open("wb") as f:
for chunk in client.get_analyze_result_pdf(
model_id="prebuilt-read",
result_id=result_id,
):
f.write(chunk)
return result.content
except Exception as e:
self.log.error(f"Azure AI Vision parsing failed: {e}")
finally:
client.close()
return None
def parse(self, document_path: Path, mime_type, file_name=None):
if not self.settings.engine_is_valid():
self.log.warning(
"No valid remote parser engine is configured, content will be empty.",
)
self.text = ""
elif self.settings.engine == "azureai":
self.text = self.azure_ai_vision_parse(document_path)

View File

@@ -1,18 +0,0 @@
def get_parser(*args, **kwargs):
from paperless_remote.parsers import RemoteDocumentParser
return RemoteDocumentParser(*args, **kwargs)
def get_supported_mime_types():
from paperless_remote.parsers import RemoteDocumentParser
return RemoteDocumentParser(None).supported_mime_types()
def remote_consumer_declaration(sender, **kwargs):
return {
"parser": get_parser,
"weight": 5,
"mime_types": get_supported_mime_types(),
}

View File

@@ -1,24 +0,0 @@
from unittest import TestCase
from django.test import override_settings
from paperless_remote import check_remote_parser_configured
class TestChecks(TestCase):
@override_settings(REMOTE_OCR_ENGINE=None)
def test_no_engine(self):
msgs = check_remote_parser_configured(None)
self.assertEqual(len(msgs), 0)
@override_settings(REMOTE_OCR_ENGINE="azureai")
@override_settings(REMOTE_OCR_API_KEY="somekey")
@override_settings(REMOTE_OCR_ENDPOINT=None)
def test_azure_no_endpoint(self):
msgs = check_remote_parser_configured(None)
self.assertEqual(len(msgs), 1)
self.assertTrue(
msgs[0].msg.startswith(
"Azure AI remote parser requires endpoint and API key to be configured.",
),
)

View File

@@ -1,128 +0,0 @@
import uuid
from pathlib import Path
from unittest import mock
from django.test import TestCase
from django.test import override_settings
from documents.tests.utils import DirectoriesMixin
from documents.tests.utils import FileSystemAssertsMixin
from paperless_remote.parsers import RemoteDocumentParser
from paperless_remote.signals import get_parser
class TestParser(DirectoriesMixin, FileSystemAssertsMixin, TestCase):
SAMPLE_FILES = Path(__file__).resolve().parent / "samples"
def assertContainsStrings(self, content: str, strings: list[str]):
# Asserts that all strings appear in content, in the given order.
indices = []
for s in strings:
if s in content:
indices.append(content.index(s))
else:
self.fail(f"'{s}' is not in '{content}'")
self.assertListEqual(indices, sorted(indices))
@mock.patch("paperless_tesseract.parsers.run_subprocess")
@mock.patch("azure.ai.documentintelligence.DocumentIntelligenceClient")
def test_get_text_with_azure(self, mock_client_cls, mock_subprocess):
# Arrange mock Azure client
mock_client = mock.Mock()
mock_client_cls.return_value = mock_client
# Simulate poller result and its `.details`
mock_poller = mock.Mock()
mock_poller.wait.return_value = None
mock_poller.details = {"operation_id": "fake-op-id"}
mock_client.begin_analyze_document.return_value = mock_poller
mock_poller.result.return_value.content = "This is a test document."
# Return dummy PDF bytes
mock_client.get_analyze_result_pdf.return_value = [
b"%PDF-",
b"1.7 ",
b"FAKEPDF",
]
# Simulate pdftotext by writing dummy text to sidecar file
def fake_run(cmd, *args, **kwargs):
with Path(cmd[-1]).open("w", encoding="utf-8") as f:
f.write("This is a test document.")
mock_subprocess.side_effect = fake_run
with override_settings(
REMOTE_OCR_ENGINE="azureai",
REMOTE_OCR_API_KEY="somekey",
REMOTE_OCR_ENDPOINT="https://endpoint.cognitiveservices.azure.com",
):
parser = get_parser(uuid.uuid4())
parser.parse(
self.SAMPLE_FILES / "simple-digital.pdf",
"application/pdf",
)
self.assertContainsStrings(
parser.text.strip(),
["This is a test document."],
)
@mock.patch("azure.ai.documentintelligence.DocumentIntelligenceClient")
def test_get_text_with_azure_error_logged_and_returns_none(self, mock_client_cls):
mock_client = mock.Mock()
mock_client.begin_analyze_document.side_effect = RuntimeError("fail")
mock_client_cls.return_value = mock_client
with override_settings(
REMOTE_OCR_ENGINE="azureai",
REMOTE_OCR_API_KEY="somekey",
REMOTE_OCR_ENDPOINT="https://endpoint.cognitiveservices.azure.com",
):
parser = get_parser(uuid.uuid4())
with mock.patch.object(parser.log, "error") as mock_log_error:
parser.parse(
self.SAMPLE_FILES / "simple-digital.pdf",
"application/pdf",
)
self.assertIsNone(parser.text)
mock_client.begin_analyze_document.assert_called_once()
mock_client.close.assert_called_once()
mock_log_error.assert_called_once()
self.assertIn(
"Azure AI Vision parsing failed",
mock_log_error.call_args[0][0],
)
@override_settings(
REMOTE_OCR_ENGINE="azureai",
REMOTE_OCR_API_KEY="key",
REMOTE_OCR_ENDPOINT="https://endpoint.cognitiveservices.azure.com",
)
def test_supported_mime_types_valid_config(self):
parser = RemoteDocumentParser(uuid.uuid4())
expected_types = {
"application/pdf": ".pdf",
"image/png": ".png",
"image/jpeg": ".jpg",
"image/tiff": ".tiff",
"image/bmp": ".bmp",
"image/gif": ".gif",
"image/webp": ".webp",
}
self.assertEqual(parser.supported_mime_types(), expected_types)
def test_supported_mime_types_invalid_config(self):
parser = get_parser(uuid.uuid4())
self.assertEqual(parser.supported_mime_types(), {})
@override_settings(
REMOTE_OCR_ENGINE=None,
REMOTE_OCR_API_KEY=None,
REMOTE_OCR_ENDPOINT=None,
)
def test_parse_with_invalid_config(self):
parser = get_parser(uuid.uuid4())
parser.parse(self.SAMPLE_FILES / "simple-digital.pdf", "application/pdf")
self.assertEqual(parser.text, "")

uv.lock generated (47 lines)
View File

@@ -95,34 +95,6 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/02/ff/1175b0b7371e46244032d43a56862d0af455823b5280a50c63d99cc50f18/automat-25.4.16-py3-none-any.whl", hash = "sha256:04e9bce696a8d5671ee698005af6e5a9fa15354140a87f4870744604dcdd3ba1", size = 42842, upload-time = "2025-04-16T20:12:14.447Z" },
]
[[package]]
name = "azure-ai-documentintelligence"
version = "1.0.2"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "azure-core", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "isodate", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "typing-extensions", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
]
sdist = { url = "https://files.pythonhosted.org/packages/44/7b/8115cd713e2caa5e44def85f2b7ebd02a74ae74d7113ba20bdd41fd6dd80/azure_ai_documentintelligence-1.0.2.tar.gz", hash = "sha256:4d75a2513f2839365ebabc0e0e1772f5601b3a8c9a71e75da12440da13b63484", size = 170940 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/d9/75/c9ec040f23082f54ffb1977ff8f364c2d21c79a640a13d1c1809e7fd6b1a/azure_ai_documentintelligence-1.0.2-py3-none-any.whl", hash = "sha256:e1fb446abbdeccc9759d897898a0fe13141ed29f9ad11fc705f951925822ed59", size = 106005 },
]
[[package]]
name = "azure-core"
version = "1.33.0"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "requests", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "six", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "typing-extensions", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
]
sdist = { url = "https://files.pythonhosted.org/packages/75/aa/7c9db8edd626f1a7d99d09ef7926f6f4fb34d5f9fa00dc394afdfe8e2a80/azure_core-1.33.0.tar.gz", hash = "sha256:f367aa07b5e3005fec2c1e184b882b0b039910733907d001c20fb08ebb8c0eb9", size = 295633 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/07/b7/76b7e144aa53bd206bf1ce34fa75350472c3f69bf30e5c8c18bc9881035d/azure_core-1.33.0-py3-none-any.whl", hash = "sha256:9b5b6d0223a1d38c37500e6971118c1e0f13f54951e6893968b38910bc9cda8f", size = 207071 },
]
[[package]]
name = "babel"
version = "2.17.0"
@@ -1479,15 +1451,6 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/c7/fc/4e5a141c3f7c7bed550ac1f69e599e92b6be449dd4677ec09f325cad0955/inotifyrecursive-0.3.5-py3-none-any.whl", hash = "sha256:7e5f4a2e1dc2bef0efa3b5f6b339c41fb4599055a2b54909d020e9e932cc8d2f", size = 8009, upload-time = "2020-11-20T12:38:46.981Z" },
]
[[package]]
name = "isodate"
version = "0.7.2"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/54/4d/e940025e2ce31a8ce1202635910747e5a87cc3a6a6bb2d00973375014749/isodate-0.7.2.tar.gz", hash = "sha256:4cd1aa0f43ca76f4a6c6c0292a85f40b35ec2e43e315b59f06e6d32171a953e6", size = 29705 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/15/aa/0aca39a37d3c7eb941ba736ede56d689e7be91cab5d9ca846bde3999eba6/isodate-0.7.2-py3-none-any.whl", hash = "sha256:28009937d8031054830160fce6d409ed342816b543597cece116d966c6d99e15", size = 22320 },
]
[[package]]
name = "jinja2"
version = "3.1.6"
@@ -2155,7 +2118,6 @@ name = "paperless-ngx"
version = "2.20.1"
source = { virtual = "." }
dependencies = [
{ name = "azure-ai-documentintelligence", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "babel", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "bleach", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "celery", extra = ["redis"], marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
@@ -2201,6 +2163,7 @@ dependencies = [
{ name = "pyzbar", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "rapidfuzz", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "redis", extra = ["hiredis"], marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "regex", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "scikit-learn", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "setproctitle", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "tika-client", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
@@ -2292,7 +2255,6 @@ typing = [
[package.metadata]
requires-dist = [
{ name = "azure-ai-documentintelligence", specifier = ">=1.0.2" },
{ name = "babel", specifier = ">=2.17" },
{ name = "bleach", specifier = "~=6.3.0" },
{ name = "celery", extras = ["redis"], specifier = "~=5.5.1" },
@@ -2345,6 +2307,7 @@ requires-dist = [
{ name = "pyzbar", specifier = "~=0.1.9" },
{ name = "rapidfuzz", specifier = "~=3.14.0" },
{ name = "redis", extras = ["hiredis"], specifier = "~=5.2.1" },
{ name = "regex", specifier = ">=2025.9.18" },
{ name = "scikit-learn", specifier = "~=1.7.0" },
{ name = "setproctitle", specifier = "~=1.3.4" },
{ name = "tika-client", specifier = "~=0.10.0" },
@@ -4101,11 +4064,11 @@ wheels = [
[[package]]
name = "types-python-dateutil"
version = "2.9.0.20250822"
version = "2.9.0.20251115"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/0c/0a/775f8551665992204c756be326f3575abba58c4a3a52eef9909ef4536428/types_python_dateutil-2.9.0.20250822.tar.gz", hash = "sha256:84c92c34bd8e68b117bff742bc00b692a1e8531262d4507b33afcc9f7716cd53", size = 16084, upload-time = "2025-08-22T03:02:00.613Z" }
sdist = { url = "https://files.pythonhosted.org/packages/6a/36/06d01fb52c0d57e9ad0c237654990920fa41195e4b3d640830dabf9eeb2f/types_python_dateutil-2.9.0.20251115.tar.gz", hash = "sha256:8a47f2c3920f52a994056b8786309b43143faa5a64d4cbb2722d6addabdf1a58", size = 16363, upload-time = "2025-11-15T03:00:13.717Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/ab/d9/a29dfa84363e88b053bf85a8b7f212a04f0d7343a4d24933baa45c06e08b/types_python_dateutil-2.9.0.20250822-py3-none-any.whl", hash = "sha256:849d52b737e10a6dc6621d2bd7940ec7c65fcb69e6aa2882acf4e56b2b508ddc", size = 17892, upload-time = "2025-08-22T03:01:59.436Z" },
{ url = "https://files.pythonhosted.org/packages/43/0b/56961d3ba517ed0df9b3a27bfda6514f3d01b28d499d1bce9068cfe4edd1/types_python_dateutil-2.9.0.20251115-py3-none-any.whl", hash = "sha256:9cf9c1c582019753b8639a081deefd7e044b9fa36bd8217f565c6c4e36ee0624", size = 18251, upload-time = "2025-11-15T03:00:12.317Z" },
]
[[package]]