Compare commits

..

28 Commits

Author SHA1 Message Date
shamoon
bd6585d3b4 Merge branch 'dev' into feature-remote-ocr-2 2025-08-22 08:54:26 -07:00
shamoon
bfd468103b Revert "Update ci.yml"
This reverts commit be0c1fd1ed.
2025-08-22 08:46:01 -07:00
shamoon
be0c1fd1ed Update ci.yml 2025-08-22 08:45:33 -07:00
shamoon
82370963da Fix: ignore incomplete tasks for system status 'last run' (#10641) 2025-08-21 21:44:41 +00:00
shamoon
0fdfa42a83 Tweak: improve dateparser auto-detection messages (#10640) 2025-08-21 21:14:25 +00:00
shamoon
0f0ba92e15 Fix: increase legibility of date filter clear button in light mode (#10649) 2025-08-21 07:25:21 -07:00
Guntbert Reiter
5f0281e427 Documentation: fix typo in troubleshooting docs (#10643) 2025-08-20 13:25:42 -07:00
shamoon
a0c7785881 Dont require_changes for codecov comment 2025-08-20 11:18:38 -07:00
shamoon
717e828a1d Merge branch 'dev' into feature-remote-ocr-2 2025-08-17 21:25:14 -07:00
shamoon
07381d48e6 Merge branch 'dev' into feature-remote-ocr-2 2025-08-17 07:49:58 -07:00
shamoon
dd0ffaf312 Merge branch 'dev' into feature-remote-ocr-2 2025-08-11 10:48:36 -07:00
shamoon
264504affc Fix consumer declaration file extensions 2025-08-10 05:32:52 -07:00
shamoon
4feedf2add Merge branch 'dev' into feature-remote-ocr-2 2025-08-06 16:04:25 -04:00
shamoon
2f76cf9831 Merge branch 'dev' into feature-remote-ocr-2 2025-08-01 23:55:49 -04:00
shamoon
1002d37f6b Update test_parser.py 2025-07-09 11:05:37 -07:00
shamoon
d260a94740 Update parsers.py 2025-07-09 11:02:57 -07:00
shamoon
88c69b83ea Update index.md 2025-07-09 11:00:12 -07:00
shamoon
2557ee2014 Update docs to mention remote OCR with Azure AI 2025-07-09 09:53:30 -07:00
shamoon
3c75deed80 Add paperless_remote tests to testpaths 2025-07-08 14:19:45 -07:00
shamoon
d05343c927 Test fixes / coverage 2025-07-08 14:19:45 -07:00
shamoon
e7972b7eaf Coverage 2025-07-08 14:19:45 -07:00
shamoon
75a091cc0d Fix test 2025-07-08 14:19:44 -07:00
shamoon
dca74803fd Use output_content_format poller.result to get clean content 2025-07-08 14:19:44 -07:00
shamoon
3cf3d868d0 Some docs 2025-07-08 14:19:43 -07:00
shamoon
bf4fc6604a Test 2025-07-08 14:19:43 -07:00
shamoon
e8c1eb86fa This actually works
[ci skip]
2025-07-08 14:19:43 -07:00
shamoon
c3dad3cf69 Basic parse 2025-07-08 14:19:42 -07:00
shamoon
811bd66088 Ok, restart implementing this with just azure
[ci skip]
2025-07-08 14:19:42 -07:00
35 changed files with 494 additions and 472 deletions

View File

@@ -10,10 +10,8 @@ component_management:
paths:
- src-ui/**
# https://docs.codecov.com/docs/pull-request-comments
# codecov will only comment if coverage changes
comment:
layout: "header, diff, components, flags, files"
require_changes: true
# https://docs.codecov.com/docs/javascript-bundle-analysis
require_bundle_changes: true
bundle_change_threshold: "50Kb"

View File

@@ -1800,3 +1800,23 @@ password. All of these options come from their similarly-named [Django settings]
#### [`PAPERLESS_EMAIL_USE_SSL=<bool>`](#PAPERLESS_EMAIL_USE_SSL) {#PAPERLESS_EMAIL_USE_SSL}
: Defaults to false.
## Remote OCR
#### [`PAPERLESS_REMOTE_OCR_ENGINE=<str>`](#PAPERLESS_REMOTE_OCR_ENGINE) {#PAPERLESS_REMOTE_OCR_ENGINE}
: The remote OCR engine to use. Currently only Azure AI is supported as "azureai".
Defaults to None, which disables remote OCR.
#### [`PAPERLESS_REMOTE_OCR_API_KEY=<str>`](#PAPERLESS_REMOTE_OCR_API_KEY) {#PAPERLESS_REMOTE_OCR_API_KEY}
: The API key to use for the remote OCR engine.
Defaults to None.
#### [`PAPERLESS_REMOTE_OCR_ENDPOINT=<str>`](#PAPERLESS_REMOTE_OCR_ENDPOINT) {#PAPERLESS_REMOTE_OCR_ENDPOINT}
: The endpoint to use for the remote OCR engine. This is required for Azure AI.
Defaults to None.

View File

@@ -25,9 +25,10 @@ physical documents into a searchable online archive so you can keep, well, _less
## Features
- **Organize and index** your scanned documents with tags, correspondents, types, and more.
- _Your_ data is stored locally on _your_ server and is never transmitted or shared in any way.
- _Your_ data is stored locally on _your_ server and is never transmitted or shared in any way, unless you explicitly choose to do so.
- Performs **OCR** on your documents, adding searchable and selectable text, even to documents scanned with only images.
- Utilizes the open-source Tesseract engine to recognize more than 100 languages.
- Utilizes the open-source Tesseract engine to recognize more than 100 languages.
- _New!_ Supports remote OCR with Azure AI (opt-in).
- Documents are saved as PDF/A format which is designed for long term storage, alongside the unaltered originals.
- Uses machine-learning to automatically add tags, correspondents and document types to your documents.
- Supports PDF documents, images, plain text files, Office documents (Word, Excel, PowerPoint, and LibreOffice equivalents)[^1] and more.

View File

@@ -33,7 +33,7 @@ warns that
`OCR for XX failed, but we're going to stick with what we've got since FORGIVING_OCR is enabled`,
then you might need to install the [Tesseract language
files](https://packages.ubuntu.com/search?keywords=tesseract-ocr)
marching your document's languages.
matching your document's languages.
As an example, if you are running Paperless-ngx from any Ubuntu or
Debian box, and your documents are written in Spanish you may need to

View File

@@ -850,6 +850,18 @@ how regularly you intend to scan documents and use paperless.
performed the task associated with the document, move it to the
inbox.
## Remote OCR
!!! important
This feature is disabled by default and will always remain strictly "opt-in".
Paperless-ngx supports performing OCR on documents using remote services. At the moment, this is limited to
[Microsoft's Azure "Document Intelligence" service](https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence).
This is of course a paid service (with a free tier) which requires an Azure account and subscription. Azure AI is not affiliated with
Paperless-ngx in any way. When enabled, Paperless-ngx will automatically send appropriate documents to Azure for OCR processing, bypassing
the local OCR engine. See the [configuration](configuration.md#PAPERLESS_REMOTE_OCR_ENGINE) options for more details.
## Architecture
Paperless-ngx consists of the following components:

View File

@@ -15,6 +15,7 @@ classifiers = [
# This will allow testing to not install a webserver, mysql, etc
dependencies = [
"azure-ai-documentintelligence>=1.0.2",
"babel>=2.17",
"bleach~=6.2.0",
"celery[redis]~=5.5.1",
@@ -239,6 +240,7 @@ testpaths = [
"src/paperless_tesseract/tests/",
"src/paperless_tika/tests",
"src/paperless_text/tests/",
"src/paperless_remote/tests/",
]
addopts = [
"--pythonwarnings=all",

View File

@@ -11,7 +11,7 @@
<div class="selected-icon">
@if (createdRelativeDate) {
<a class="text-light focus-variants" href="javascript:void(0)" (click)="clearCreatedRelativeDate()">
<i-bs width="1em" height="1em" name="check" class="variant-unfocused"></i-bs>
<i-bs width="1em" height="1em" name="check" class="variant-unfocused text-dark"></i-bs>
<i-bs width="1em" height="1em" name="x" class="variant-focused text-primary"></i-bs>
</a>
}

View File

@@ -12,8 +12,6 @@
<pngx-input-color i18n-title title="Color" formControlName="color" [error]="error?.color"></pngx-input-color>
<pngx-input-select i18n-title title="Parent" formControlName="parent" [items]="tags" [allowNull]="true" [error]="error?.parent"></pngx-input-select>
<pngx-input-check i18n-title title="Inbox tag" formControlName="is_inbox_tag" i18n-hint hint="Inbox tags are automatically assigned to all consumed documents."></pngx-input-check>
<pngx-input-select i18n-title title="Matching algorithm" [items]="getMatchingAlgorithms()" formControlName="matching_algorithm"></pngx-input-select>
@if (patternRequired) {

View File

@@ -35,16 +35,11 @@ import { TextComponent } from '../../input/text/text.component'
],
})
export class TagEditDialogComponent extends EditDialogComponent<Tag> {
tags: Tag[]
constructor() {
super()
this.service = inject(TagService)
this.userService = inject(UserService)
this.settingsService = inject(SettingsService)
this.service.listAll().subscribe((result) => {
this.tags = result.results
})
}
getCreateTitle() {
@@ -60,7 +55,6 @@ export class TagEditDialogComponent extends EditDialogComponent<Tag> {
name: new FormControl(''),
color: new FormControl(randomColor()),
is_inbox_tag: new FormControl(false),
parent: new FormControl(null),
matching_algorithm: new FormControl(DEFAULT_MATCHING_ALGORITHM),
match: new FormControl(''),
is_insensitive: new FormControl(true),

View File

@@ -7,14 +7,13 @@
<div class="input-group flex-nowrap">
<ng-select #tagSelect name="tags" [items]="tags" bindLabel="name" bindValue="id" [(ngModel)]="value"
[disabled]="disabled"
[multiple]="multiple"
[multiple]="true"
[closeOnSelect]="false"
[clearSearchOnAdd]="true"
[hideSelected]="tags.length > 0"
[addTag]="allowCreate ? createTagRef : false"
addTagText="Add tag"
i18n-addTagText
(add)="onAdd($event)"
(change)="onChange(value)">
<ng-template ng-label-tmp let-item="item">

View File

@@ -100,9 +100,6 @@ export class TagsComponent implements OnInit, ControlValueAccessor {
@Input()
horizontal: boolean = false
@Input()
multiple: boolean = true
@Output()
filterDocuments = new EventEmitter<Tag[]>()
@@ -127,40 +124,13 @@ export class TagsComponent implements OnInit, ControlValueAccessor {
let index = this.value.indexOf(tagID)
if (index > -1) {
const tag = this.getTag(tagID)
// remove tag
let oldValue = this.value
oldValue.splice(index, 1)
// remove children
oldValue = this.removeChildren(oldValue, tag)
this.value = [...oldValue]
this.onChange(this.value)
}
}
private removeChildren(tagIDs: number[], tag: Tag) {
if (tag.children?.length) {
const childIDs = tag.children.map((child) => child.id)
tagIDs = tagIDs.filter((id) => !childIDs.includes(id))
for (const child of tag.children) {
tagIDs = this.removeChildren(tagIDs, child)
}
}
return tagIDs
}
public onAdd(tag: Tag) {
if (tag.parent) {
// add all parents recursively
const parent = this.getTag(tag.parent)
this.value = [...this.value, parent.id]
this.onAdd(parent)
}
}
createTag(name: string = null, add: boolean = false) {
var modal = this.modalService.open(TagEditDialogComponent, {
backdrop: 'static',

View File

@@ -54,7 +54,61 @@
</tr>
}
@for (object of data; track object) {
<ng-container [ngTemplateOutlet]="objectRow" [ngTemplateOutletContext]="{ object: object, depth: 0 }"></ng-container>
<tr (click)="toggleSelected(object); $event.stopPropagation();" class="data-row fade" [class.show]="show">
<td>
<div class="form-check m-0 ms-2 me-n2">
<input type="checkbox" class="form-check-input" id="{{typeName}}{{object.id}}" [checked]="selectedObjects.has(object.id)" (click)="toggleSelected(object); $event.stopPropagation();">
<label class="form-check-label" for="{{typeName}}{{object.id}}"></label>
</div>
</td>
<td scope="row"><button class="btn btn-link ms-0 ps-0 text-start" (click)="userCanEdit(object) ? openEditDialog(object) : null; $event.stopPropagation()">{{ object.name }}</button> </td>
<td scope="row" class="d-none d-sm-table-cell">{{ getMatching(object) }}</td>
<td scope="row">{{ object.document_count }}</td>
@for (column of extraColumns; track column) {
<td scope="row" [ngClass]="{ 'd-none d-sm-table-cell' : column.hideOnMobile }">
@if (column.rendersHtml) {
<div [innerHtml]="column.valueFn.call(null, object) | safeHtml"></div>
} @else if (column.monospace) {
<span class="font-monospace">{{ column.valueFn.call(null, object) }}</span>
} @else {
{{ column.valueFn.call(null, object) }}
}
</td>
}
<td scope="row">
<div class="btn-toolbar gap-2">
<div class="btn-group d-block d-sm-none">
<div ngbDropdown container="body" class="d-inline-block">
<button type="button" class="btn btn-link" id="actionsMenuMobile" (click)="$event.stopPropagation()" ngbDropdownToggle>
<i-bs name="three-dots-vertical"></i-bs>
</button>
<div ngbDropdownMenu aria-labelledby="actionsMenuMobile">
<button (click)="openEditDialog(object)" *pngxIfPermissions="{ action: PermissionAction.Change, type: permissionType }" ngbDropdownItem i18n>Edit</button>
<button class="text-danger" (click)="openDeleteDialog(object)" *pngxIfPermissions="{ action: PermissionAction.Delete, type: permissionType }" ngbDropdownItem i18n>Delete</button>
@if (object.document_count > 0) {
<button (click)="filterDocuments(object)" *pngxIfPermissions="{ action: PermissionAction.View, type: PermissionType.Document }" ngbDropdownItem i18n>Filter Documents ({{ object.document_count }})</button>
}
</div>
</div>
</div>
<div class="btn-group d-none d-sm-inline-block">
<button class="btn btn-sm btn-outline-secondary" (click)="openEditDialog(object); $event.stopPropagation();" *pngxIfPermissions="{ action: PermissionAction.Change, type: permissionType }" [disabled]="!userCanEdit(object)">
<i-bs width="1em" height="1em" name="pencil"></i-bs>&nbsp;<ng-container i18n>Edit</ng-container>
</button>
<button class="btn btn-sm btn-outline-danger" (click)="openDeleteDialog(object); $event.stopPropagation();" *pngxIfPermissions="{ action: PermissionAction.Delete, type: permissionType }" [disabled]="!userCanDelete(object)">
<i-bs width="1em" height="1em" name="trash"></i-bs>&nbsp;<ng-container i18n>Delete</ng-container>
</button>
</div>
@if (object.document_count > 0) {
<div class="btn-group d-none d-sm-inline-block">
<button class="btn btn-sm btn-outline-secondary" (click)="filterDocuments(object); $event.stopPropagation();" *pngxIfPermissions="{ action: PermissionAction.View, type: PermissionType.Document }">
<i-bs width="1em" height="1em" name="filter"></i-bs>&nbsp;<ng-container i18n>Documents</ng-container><span class="badge bg-light text-secondary ms-2">{{ object.document_count }}</span>
</button>
</div>
}
</div>
</td>
</tr>
}
</tbody>
</table>
@@ -75,72 +129,3 @@
}
</div>
}
<ng-template #objectRow let-object="object" let-depth="depth">
<tr (click)="toggleSelected(object); $event.stopPropagation();" class="data-row fade" [class.show]="show">
<td>
<div class="form-check m-0 ms-2 me-n2">
<input type="checkbox" class="form-check-input" id="{{typeName}}{{object.id}}" [checked]="selectedObjects.has(object.id)" (click)="toggleSelected(object); $event.stopPropagation();">
<label class="form-check-label" for="{{typeName}}{{object.id}}"></label>
</div>
</td>
<td scope="row" class="name-cell" style="--depth: {{depth}}">
@if (depth > 0) {
<div class="indicator"></div>
}
<button class="btn btn-link ms-0 ps-0 text-start" (click)="userCanEdit(object) ? openEditDialog(object) : null; $event.stopPropagation()">{{ object.name }}</button>
</td>
<td scope="row" class="d-none d-sm-table-cell">{{ getMatching(object) }}</td>
<td scope="row">{{ getDocumentCount(object) }}</td>
@for (column of extraColumns; track column) {
<td scope="row" [ngClass]="{ 'd-none d-sm-table-cell' : column.hideOnMobile }">
@if (column.rendersHtml) {
<div [innerHtml]="column.valueFn.call(null, object) | safeHtml"></div>
} @else if (column.monospace) {
<span class="font-monospace">{{ column.valueFn.call(null, object) }}</span>
} @else {
{{ column.valueFn.call(null, object) }}
}
</td>
}
<td scope="row">
<div class="btn-toolbar gap-2">
<div class="btn-group d-block d-sm-none">
<div ngbDropdown container="body" class="d-inline-block">
<button type="button" class="btn btn-link" id="actionsMenuMobile" (click)="$event.stopPropagation()" ngbDropdownToggle>
<i-bs name="three-dots-vertical"></i-bs>
</button>
<div ngbDropdownMenu aria-labelledby="actionsMenuMobile">
<button (click)="openEditDialog(object)" *pngxIfPermissions="{ action: PermissionAction.Change, type: permissionType }" ngbDropdownItem i18n>Edit</button>
<button class="text-danger" (click)="openDeleteDialog(object)" *pngxIfPermissions="{ action: PermissionAction.Delete, type: permissionType }" ngbDropdownItem i18n>Delete</button>
@if (getDocumentCount(object) > 0) {
<button (click)="filterDocuments(object)" *pngxIfPermissions="{ action: PermissionAction.View, type: PermissionType.Document }" ngbDropdownItem i18n>Filter Documents ({{ getDocumentCount(object) }})</button>
}
</div>
</div>
</div>
<div class="btn-group d-none d-sm-inline-block">
<button class="btn btn-sm btn-outline-secondary" (click)="openEditDialog(object); $event.stopPropagation();" *pngxIfPermissions="{ action: PermissionAction.Change, type: permissionType }" [disabled]="!userCanEdit(object)">
<i-bs width="1em" height="1em" name="pencil"></i-bs>&nbsp;<ng-container i18n>Edit</ng-container>
</button>
<button class="btn btn-sm btn-outline-danger" (click)="openDeleteDialog(object); $event.stopPropagation();" *pngxIfPermissions="{ action: PermissionAction.Delete, type: permissionType }" [disabled]="!userCanDelete(object)">
<i-bs width="1em" height="1em" name="trash"></i-bs>&nbsp;<ng-container i18n>Delete</ng-container>
</button>
</div>
@if (getDocumentCount(object) > 0) {
<div class="btn-group d-none d-sm-inline-block">
<button class="btn btn-sm btn-outline-secondary" (click)="filterDocuments(object); $event.stopPropagation();" *pngxIfPermissions="{ action: PermissionAction.View, type: PermissionType.Document }">
<i-bs width="1em" height="1em" name="filter"></i-bs>&nbsp;<ng-container i18n>Documents</ng-container><span class="badge bg-light text-secondary ms-2">{{ getDocumentCount(object) }}</span>
</button>
</div>
}
</div>
</td>
</tr>
@if (object.children && object.children.length > 0) {
@for (child of object.children; track child) {
<ng-container [ngTemplateOutlet]="objectRow" [ngTemplateOutletContext]="{ object: child, depth: depth + 1 }"></ng-container>
}
}
</ng-template>

View File

@@ -10,17 +10,3 @@ tbody tr:last-child td {
.form-check {
min-height: 0;
}
td.name-cell {
padding-left: calc(calc(var(--depth) - 1) * 1.1rem);
.indicator {
display: inline-block;
width: .8rem;
height: .8rem;
border-left: 1px solid var(--bs-secondary);
border-bottom: 1px solid var(--bs-secondary);
margin-right: .25rem;
margin-left: .5rem;
}
}

View File

@@ -79,7 +79,6 @@ export abstract class ManagementListComponent<T extends MatchingModel>
@ViewChildren(SortableDirective) headers: QueryList<SortableDirective>
public data: T[] = []
private unfilteredData: T[] = []
public page = 1
@@ -133,18 +132,6 @@ export abstract class ManagementListComponent<T extends MatchingModel>
this.reloadData()
}
protected filterData(data: T[]): T[] {
return data
}
getDocumentCount(object: MatchingModel): number {
return (
object.document_count ??
this.unfilteredData.find((d) => d.id == object.id)?.document_count ??
0
)
}
reloadData(extraParams: { [key: string]: any } = null) {
this.loading = true
this.clearSelection()
@@ -161,8 +148,7 @@ export abstract class ManagementListComponent<T extends MatchingModel>
.pipe(
takeUntil(this.unsubscribeNotifier),
tap((c) => {
this.unfilteredData = c.results
this.data = this.filterData(c.results)
this.data = c.results
this.collectionSize = c.count
}),
delay(100)

View File

@@ -1,4 +1,4 @@
import { NgClass, NgTemplateOutlet, TitleCasePipe } from '@angular/common'
import { NgClass, TitleCasePipe } from '@angular/common'
import { Component, inject } from '@angular/core'
import { FormsModule, ReactiveFormsModule } from '@angular/forms'
import {
@@ -30,7 +30,6 @@ import { ManagementListComponent } from '../management-list/management-list.comp
FormsModule,
ReactiveFormsModule,
NgClass,
NgTemplateOutlet,
NgbDropdownModule,
NgbPaginationModule,
NgxBootstrapIconsModule,
@@ -60,8 +59,4 @@ export class TagListComponent extends ManagementListComponent<Tag> {
getDeleteMessage(object: Tag) {
return $localize`Do you really want to delete the tag "${object.name}"?`
}
filterData(data: Tag[]) {
return data.filter((tag) => !tag.parent)
}
}

View File

@@ -6,8 +6,4 @@ export interface Tag extends MatchingModel {
text_color?: string
is_inbox_tag?: boolean
parent?: number // Tag ID
children?: Tag[] // read-only
}

View File

@@ -25,7 +25,6 @@ from documents.models import CustomFieldInstance
from documents.models import Document
from documents.models import DocumentType
from documents.models import StoragePath
from documents.models import Tag
from documents.permissions import set_permissions_for_object
from documents.plugins.helpers import DocumentsStatusManager
from documents.tasks import bulk_update_documents
@@ -97,46 +96,31 @@ def set_document_type(doc_ids: list[int], document_type: DocumentType) -> Litera
def add_tag(doc_ids: list[int], tag: int) -> Literal["OK"]:
tag_obj = Tag.objects.get(pk=tag)
tags_to_add = [tag_obj, *tag_obj.get_all_ancestors()]
qs = Document.objects.filter(Q(id__in=doc_ids) & ~Q(tags__id=tag)).only("pk")
affected_docs = list(qs.values_list("pk", flat=True))
DocumentTagRelationship = Document.tags.through
to_create = []
affected_docs: set[int] = set()
for t in tags_to_add:
qs = Document.objects.filter(Q(id__in=doc_ids) & ~Q(tags__id=t.id)).only("pk")
doc_ids_missing_tag = list(qs.values_list("pk", flat=True))
affected_docs.update(doc_ids_missing_tag)
to_create.extend(
DocumentTagRelationship(document_id=doc, tag_id=t.id)
for doc in doc_ids_missing_tag
)
DocumentTagRelationship.objects.bulk_create(
[DocumentTagRelationship(document_id=doc, tag_id=tag) for doc in affected_docs],
)
if to_create:
DocumentTagRelationship.objects.bulk_create(to_create)
if affected_docs:
bulk_update_documents.delay(document_ids=list(affected_docs))
bulk_update_documents.delay(document_ids=affected_docs)
return "OK"
def remove_tag(doc_ids: list[int], tag: int) -> Literal["OK"]:
tag_obj = Tag.objects.get(pk=tag)
tags_to_remove = [tag_obj, *tag_obj.get_all_descendants()]
tag_ids = [t.id for t in tags_to_remove]
qs = Document.objects.filter(Q(id__in=doc_ids) & Q(tags__id=tag)).only("pk")
affected_docs = list(qs.values_list("pk", flat=True))
DocumentTagRelationship = Document.tags.through
qs = DocumentTagRelationship.objects.filter(
document_id__in=doc_ids,
tag_id__in=tag_ids,
)
affected_docs = list(qs.values_list("document_id", flat=True).distinct())
qs.delete()
if affected_docs:
bulk_update_documents.delay(document_ids=affected_docs)
DocumentTagRelationship.objects.filter(
Q(document_id__in=affected_docs) & Q(tag_id=tag),
).delete()
bulk_update_documents.delay(document_ids=affected_docs)
return "OK"
@@ -148,35 +132,23 @@ def modify_tags(
) -> Literal["OK"]:
qs = Document.objects.filter(id__in=doc_ids).only("pk")
affected_docs = list(qs.values_list("pk", flat=True))
DocumentTagRelationship = Document.tags.through
# add with all ancestors
expanded_add_tags: set[int] = set()
for tag_id in add_tags:
t = Tag.objects.get(pk=tag_id)
expanded_add_tags.update([t.id for t in [t, *t.get_all_ancestors()]])
DocumentTagRelationship.objects.filter(
document_id__in=affected_docs,
tag_id__in=remove_tags,
).delete()
# remove with all descendants
expanded_remove_tags: set[int] = set()
for tag_id in remove_tags:
t = Tag.objects.get(pk=tag_id)
expanded_remove_tags.update([t.id for t in [t, *t.get_all_descendants()]])
DocumentTagRelationship.objects.bulk_create(
[
DocumentTagRelationship(document_id=doc, tag_id=tag)
for (doc, tag) in itertools.product(affected_docs, add_tags)
],
ignore_conflicts=True,
)
if expanded_remove_tags:
DocumentTagRelationship.objects.filter(
document_id__in=affected_docs,
tag_id__in=expanded_remove_tags,
).delete()
to_create = [
DocumentTagRelationship(document_id=doc, tag_id=tag)
for (doc, tag) in itertools.product(affected_docs, expanded_add_tags)
]
if to_create:
DocumentTagRelationship.objects.bulk_create(to_create, ignore_conflicts=True)
if affected_docs:
bulk_update_documents.delay(document_ids=affected_docs)
bulk_update_documents.delay(document_ids=affected_docs)
return "OK"

View File

@@ -689,7 +689,7 @@ class ConsumerPlugin(
if self.metadata.tag_ids:
for tag_id in self.metadata.tag_ids:
document.add_nested_tags([Tag.objects.get(pk=tag_id)])
document.tags.add(Tag.objects.get(pk=tag_id))
if self.metadata.storage_path_id:
document.storage_path = StoragePath.objects.get(

View File

@@ -1,26 +0,0 @@
# Generated by Django 5.1.5 on 2025-02-10 06:02
import django.db.models.deletion
from django.db import migrations
from django.db import models
class Migration(migrations.Migration):
dependencies = [
("documents", "1068_alter_document_created"),
]
operations = [
migrations.AddField(
model_name="tag",
name="parent",
field=models.ForeignKey(
blank=True,
null=True,
on_delete=django.db.models.deletion.CASCADE,
related_name="children",
to="documents.tag",
verbose_name="parent",
),
),
]

View File

@@ -7,7 +7,6 @@ from celery import states
from django.conf import settings
from django.contrib.auth.models import Group
from django.contrib.auth.models import User
from django.core.exceptions import ValidationError
from django.core.validators import MaxValueValidator
from django.core.validators import MinValueValidator
from django.db import models
@@ -109,38 +108,10 @@ class Tag(MatchingModel):
),
)
parent = models.ForeignKey(
"self",
blank=True,
null=True,
on_delete=models.CASCADE,
related_name="children",
verbose_name=_("parent"),
)
class Meta(MatchingModel.Meta):
verbose_name = _("tag")
verbose_name_plural = _("tags")
def get_all_descendants(self):
descendants = []
for child in self.children.all():
descendants.append(child)
descendants.extend(child.get_all_descendants())
return descendants
def get_all_ancestors(self):
ancestors = []
if self.parent:
ancestors.append(self.parent)
ancestors.extend(self.parent.get_all_ancestors())
return ancestors
def clean(self):
if self.parent == self:
raise ValidationError("Cannot set itself as parent.")
return super().clean()
class DocumentType(MatchingModel):
class Meta(MatchingModel.Meta):
@@ -405,12 +376,6 @@ class Document(SoftDeleteModel, ModelWithOwner):
def created_date(self):
return self.created
def add_nested_tags(self, tags):
for tag in tags:
self.tags.add(tag)
if tag.parent:
self.add_nested_tags([tag.parent])
class SavedView(ModelWithOwner):
class DisplayMode(models.TextChoices):

View File

@@ -540,18 +540,6 @@ class TagSerializer(MatchingModelSerializer, OwnedObjectSerializer):
text_color = serializers.SerializerMethodField()
children = SerializerMethodField()
@extend_schema_field(
field=serializers.ListSerializer(
child=serializers.PrimaryKeyRelatedField(
queryset=Tag.objects.all(),
),
),
)
def get_children(self, obj):
return TagSerializer(obj.children.all(), many=True).data
class Meta:
model = Tag
fields = (
@@ -569,8 +557,6 @@ class TagSerializer(MatchingModelSerializer, OwnedObjectSerializer):
"permissions",
"user_can_change",
"set_permissions",
"parent",
"children",
)
def validate_color(self, color):
@@ -1042,23 +1028,6 @@ class DocumentSerializer(
custom_field_instance.field,
doc_id,
)
if "tags" in validated_data:
# add all parent tags
all_ancestor_tags = set(validated_data["tags"])
for tag in validated_data["tags"]:
all_ancestor_tags.update(tag.get_all_ancestors())
validated_data["tags"] = list(all_ancestor_tags)
# remove any children for parents that are being removed
tag_parents_being_removed = [
tag
for tag in instance.tags.all()
if tag not in validated_data["tags"] and tag.children.count() > 0
]
validated_data["tags"] = [
tag
for tag in validated_data["tags"]
if tag not in tag_parents_being_removed
]
if validated_data.get("remove_inbox_tags"):
tag_ids_being_added = (
[

View File

@@ -260,7 +260,7 @@ def set_tags(
extra={"group": logging_group},
)
document.add_nested_tags(relevant_tags)
document.tags.add(*relevant_tags)
def set_storage_path(
@@ -767,17 +767,14 @@ def run_workflows(
def assignment_action():
if action.assign_tags.exists():
tag_ids_to_add: set[int] = set()
for tag in action.assign_tags.all():
tag_ids_to_add.add(tag.pk)
tag_ids_to_add.update(t.pk for t in tag.get_all_ancestors())
if not use_overrides:
doc_tag_ids[:] = list(set(doc_tag_ids) | tag_ids_to_add)
doc_tag_ids.extend(action.assign_tags.values_list("pk", flat=True))
else:
if overrides.tag_ids is None:
overrides.tag_ids = []
overrides.tag_ids = list(set(overrides.tag_ids) | tag_ids_to_add)
overrides.tag_ids.extend(
action.assign_tags.values_list("pk", flat=True),
)
if action.assign_correspondent:
if not use_overrides:
@@ -920,17 +917,14 @@ def run_workflows(
else:
overrides.tag_ids = None
else:
tag_ids_to_remove: set[int] = set()
for tag in action.remove_tags.all():
tag_ids_to_remove.add(tag.pk)
tag_ids_to_remove.update(t.pk for t in tag.get_all_descendants())
if not use_overrides:
doc_tag_ids[:] = [t for t in doc_tag_ids if t not in tag_ids_to_remove]
for tag in action.remove_tags.filter(
pk__in=document.tags.values_list("pk", flat=True),
):
doc_tag_ids.remove(tag.pk)
elif overrides.tag_ids:
overrides.tag_ids = [
t for t in overrides.tag_ids if t not in tag_ids_to_remove
]
for tag in action.remove_tags.filter(pk__in=overrides.tag_ids):
overrides.tag_ids.remove(tag.pk)
if not use_overrides and (
action.remove_all_correspondents

View File

@@ -1,112 +0,0 @@
from unittest import mock
from django.contrib.auth.models import User
from rest_framework.test import APITestCase
from documents import bulk_edit
from documents.models import Document
from documents.models import Tag
from documents.models import Workflow
from documents.models import WorkflowAction
from documents.models import WorkflowTrigger
from documents.signals.handlers import run_workflows
class TestTagHierarchy(APITestCase):
def setUp(self):
self.user = User.objects.create_superuser(username="admin")
self.client.force_authenticate(user=self.user)
self.parent = Tag.objects.create(name="Parent")
self.child = Tag.objects.create(name="Child", parent=self.parent)
patcher = mock.patch("documents.bulk_edit.bulk_update_documents.delay")
self.async_task = patcher.start()
self.addCleanup(patcher.stop)
self.document = Document.objects.create(
title="doc",
content="",
checksum="1",
mime_type="application/pdf",
)
def test_api_add_child_adds_parent(self):
self.client.patch(
f"/api/documents/{self.document.pk}/",
{"tags": [self.child.pk]},
format="json",
)
self.document.refresh_from_db()
tags = set(self.document.tags.values_list("pk", flat=True))
assert tags == {self.parent.pk, self.child.pk}
def test_api_remove_parent_removes_child(self):
self.document.add_nested_tags([self.child])
self.client.patch(
f"/api/documents/{self.document.pk}/",
{"tags": []},
format="json",
)
self.document.refresh_from_db()
assert self.document.tags.count() == 0
def test_bulk_edit_respects_hierarchy(self):
bulk_edit.add_tag([self.document.pk], self.child.pk)
self.document.refresh_from_db()
tags = set(self.document.tags.values_list("pk", flat=True))
assert tags == {self.parent.pk, self.child.pk}
bulk_edit.remove_tag([self.document.pk], self.parent.pk)
self.document.refresh_from_db()
assert self.document.tags.count() == 0
bulk_edit.modify_tags([self.document.pk], [self.child.pk], [])
self.document.refresh_from_db()
tags = set(self.document.tags.values_list("pk", flat=True))
assert tags == {self.parent.pk, self.child.pk}
bulk_edit.modify_tags([self.document.pk], [], [self.parent.pk])
self.document.refresh_from_db()
assert self.document.tags.count() == 0
def test_workflow_actions(self):
workflow = Workflow.objects.create(name="wf", order=0)
trigger = WorkflowTrigger.objects.create(
type=WorkflowTrigger.WorkflowTriggerType.DOCUMENT_ADDED,
)
assign_action = WorkflowAction.objects.create()
assign_action.assign_tags.add(self.child)
workflow.triggers.add(trigger)
workflow.actions.add(assign_action)
run_workflows(trigger.type, self.document)
self.document.refresh_from_db()
tags = set(self.document.tags.values_list("pk", flat=True))
assert tags == {self.parent.pk, self.child.pk}
# removal
removal_action = WorkflowAction.objects.create(
type=WorkflowAction.WorkflowActionType.REMOVAL,
)
removal_action.remove_tags.add(self.parent)
workflow.actions.clear()
workflow.actions.add(removal_action)
run_workflows(trigger.type, self.document)
self.document.refresh_from_db()
assert self.document.tags.count() == 0
def test_tag_view_parent_update_adds_parent_to_docs(self):
orphan = Tag.objects.create(name="Orphan")
self.document.tags.add(orphan)
self.client.patch(
f"/api/tags/{orphan.pk}/",
{"parent": self.parent.pk},
format="json",
)
self.document.refresh_from_db()
tags = set(self.document.tags.values_list("pk", flat=True))
assert tags == {self.parent.pk, orphan.pk}

View File

@@ -341,39 +341,6 @@ class TagViewSet(ModelViewSet, PermissionsAwareDocumentCountMixin):
filterset_class = TagFilterSet
ordering_fields = ("color", "name", "matching_algorithm", "match", "document_count")
def perform_update(self, serializer):
old_parent = self.get_object().parent
tag = serializer.save()
new_parent = tag.parent
if old_parent != new_parent:
self._update_document_parent_tags(tag, old_parent, new_parent)
def _update_document_parent_tags(self, tag, old_parent, new_parent):
DocumentTagRelationship = Document.tags.through
doc_ids = list(Document.objects.filter(tags=tag).values_list("pk", flat=True))
affected = set()
if new_parent:
parents_to_add = [new_parent, *new_parent.get_all_ancestors()]
to_create = []
for parent in parents_to_add:
missing = Document.objects.filter(id__in=doc_ids).exclude(tags=parent)
to_create.extend(
DocumentTagRelationship(document_id=doc_id, tag_id=parent.id)
for doc_id in missing.values_list("pk", flat=True)
)
affected.update(missing.values_list("pk", flat=True))
if to_create:
DocumentTagRelationship.objects.bulk_create(
to_create,
ignore_conflicts=True,
)
if affected:
from documents.tasks import bulk_update_documents
bulk_update_documents.delay(document_ids=list(affected))
@extend_schema_view(**generate_object_with_permissions_schema(DocumentTypeSerializer))
class DocumentTypeViewSet(ModelViewSet, PermissionsAwareDocumentCountMixin):
@@ -2869,6 +2836,11 @@ class SystemStatusView(PassUserMixin):
last_trained_task = (
PaperlessTask.objects.filter(
task_name=PaperlessTask.TaskName.TRAIN_CLASSIFIER,
status__in=[
states.SUCCESS,
states.FAILURE,
states.REVOKED,
], # ignore running tasks
)
.order_by("-date_done")
.first()
@@ -2878,7 +2850,7 @@ class SystemStatusView(PassUserMixin):
if last_trained_task is None:
classifier_status = "WARNING"
classifier_error = "No classifier training tasks found"
elif last_trained_task and last_trained_task.status == states.FAILURE:
elif last_trained_task and last_trained_task.status != states.SUCCESS:
classifier_status = "ERROR"
classifier_error = last_trained_task.result
classifier_last_trained = (
@@ -2888,6 +2860,11 @@ class SystemStatusView(PassUserMixin):
last_sanity_check = (
PaperlessTask.objects.filter(
task_name=PaperlessTask.TaskName.CHECK_SANITY,
status__in=[
states.SUCCESS,
states.FAILURE,
states.REVOKED,
], # ignore running tasks
)
.order_by("-date_done")
.first()
@@ -2897,7 +2874,7 @@ class SystemStatusView(PassUserMixin):
if last_sanity_check is None:
sanity_check_status = "WARNING"
sanity_check_error = "No sanity check tasks found"
elif last_sanity_check and last_sanity_check.status == states.FAILURE:
elif last_sanity_check and last_sanity_check.status != states.SUCCESS:
sanity_check_status = "ERROR"
sanity_check_error = last_sanity_check.result
sanity_check_last_run = (

View File

@@ -324,6 +324,7 @@ INSTALLED_APPS = [
"paperless_tesseract.apps.PaperlessTesseractConfig",
"paperless_text.apps.PaperlessTextConfig",
"paperless_mail.apps.PaperlessMailConfig",
"paperless_remote.apps.PaperlessRemoteParserConfig",
"django.contrib.admin",
"rest_framework",
"rest_framework.authtoken",
@@ -1205,8 +1206,8 @@ def _ocr_to_dateparser_languages(ocr_languages: str) -> list[str]:
language_part = ocr_to_dateparser.get(ocr_lang_part)
if language_part is None:
logger.warning(
f'Skipping unknown OCR language "{ocr_language}" — no dateparser equivalent.',
logger.debug(
f'Unable to map OCR language "{ocr_lang_part}" to dateparser locale. ',
)
continue
@@ -1219,7 +1220,7 @@ def _ocr_to_dateparser_languages(ocr_languages: str) -> list[str]:
try:
loader.get_locale_map(locales=[dateparser_language])
except Exception:
logger.warning(
logger.info(
f"Language variant '{dateparser_language}' not supported by dateparser; falling back to base language '{language_part}'. You can manually set PAPERLESS_DATE_PARSER_LANGUAGES if needed.",
)
dateparser_language = language_part
@@ -1229,12 +1230,12 @@ def _ocr_to_dateparser_languages(ocr_languages: str) -> list[str]:
result.append(dateparser_language)
except Exception as e:
logger.warning(
f"Could not configure dateparser languages. Set PAPERLESS_DATE_PARSER_LANGUAGES parameter to avoid this. Detail: {e}",
f"Error auto-configuring dateparser languages. Set PAPERLESS_DATE_PARSER_LANGUAGES parameter to avoid this. Detail: {e}",
)
return []
if not result:
logger.warning(
"Could not configure any dateparser languages from OCR_LANGUAGE fallback to autodetection.",
logger.info(
"Unable to automatically determine dateparser languages from OCR_LANGUAGE, falling back to multi-language support.",
)
return result
@@ -1443,3 +1444,10 @@ WEBHOOKS_ALLOW_INTERNAL_REQUESTS = __get_boolean(
"PAPERLESS_WEBHOOKS_ALLOW_INTERNAL_REQUESTS",
"true",
)
###############################################################################
# Remote Parser #
###############################################################################
REMOTE_OCR_ENGINE = os.getenv("PAPERLESS_REMOTE_OCR_ENGINE")
REMOTE_OCR_API_KEY = os.getenv("PAPERLESS_REMOTE_OCR_API_KEY")
REMOTE_OCR_ENDPOINT = os.getenv("PAPERLESS_REMOTE_OCR_ENDPOINT")

View File

@@ -0,0 +1,4 @@
# this is here so that django finds the checks.
from paperless_remote.checks import check_remote_parser_configured
__all__ = ["check_remote_parser_configured"]

View File

@@ -0,0 +1,14 @@
from django.apps import AppConfig
from paperless_remote.signals import remote_consumer_declaration
class PaperlessRemoteParserConfig(AppConfig):
name = "paperless_remote"
def ready(self):
from documents.signals import document_consumer_declaration
document_consumer_declaration.connect(remote_consumer_declaration)
AppConfig.ready(self)

View File

@@ -0,0 +1,15 @@
from django.conf import settings
from django.core.checks import Error
from django.core.checks import register
@register()
def check_remote_parser_configured(app_configs, **kwargs):
if settings.REMOTE_OCR_ENGINE == "azureai" and not settings.REMOTE_OCR_ENDPOINT:
return [
Error(
"Azure AI remote parser requires endpoint to be configured.",
),
]
return []

View File

@@ -0,0 +1,113 @@
from pathlib import Path
from django.conf import settings
from paperless_tesseract.parsers import RasterisedDocumentParser
class RemoteEngineConfig:
def __init__(
self,
engine: str,
api_key: str | None = None,
endpoint: str | None = None,
):
self.engine = engine
self.api_key = api_key
self.endpoint = endpoint
def engine_is_valid(self):
valid = self.engine in ["azureai"] and self.api_key is not None
if self.engine == "azureai":
valid = valid and self.endpoint is not None
return valid
class RemoteDocumentParser(RasterisedDocumentParser):
"""
This parser uses a remote OCR engine to parse documents. Currently, it supports Azure AI Vision
as this is the only service that provides a remote OCR API with text-embedded PDF output.
"""
logging_name = "paperless.parsing.remote"
def get_settings(self) -> RemoteEngineConfig:
"""
Returns the configuration for the remote OCR engine, loaded from Django settings.
"""
return RemoteEngineConfig(
engine=settings.REMOTE_OCR_ENGINE,
api_key=settings.REMOTE_OCR_API_KEY,
endpoint=settings.REMOTE_OCR_ENDPOINT,
)
def supported_mime_types(self):
if self.settings.engine_is_valid():
return {
"application/pdf": ".pdf",
"image/png": ".png",
"image/jpeg": ".jpg",
"image/tiff": ".tiff",
"image/bmp": ".bmp",
"image/gif": ".gif",
"image/webp": ".webp",
}
else:
return {}
def azure_ai_vision_parse(
self,
file: Path,
) -> str | None:
"""
Uses Azure AI Vision to parse the document and return the text content.
It requests a searchable PDF output with embedded text.
The PDF is saved to the archive_path attribute.
Returns the text content extracted from the document.
If the parsing fails, it returns None.
"""
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest
from azure.ai.documentintelligence.models import AnalyzeOutputOption
from azure.ai.documentintelligence.models import DocumentContentFormat
from azure.core.credentials import AzureKeyCredential
client = DocumentIntelligenceClient(
endpoint=self.settings.endpoint,
credential=AzureKeyCredential(self.settings.api_key),
)
with file.open("rb") as f:
analyze_request = AnalyzeDocumentRequest(bytes_source=f.read())
poller = client.begin_analyze_document(
model_id="prebuilt-read",
body=analyze_request,
output_content_format=DocumentContentFormat.TEXT,
output=[AnalyzeOutputOption.PDF], # request searchable PDF output
content_type="application/json",
)
poller.wait()
result_id = poller.details["operation_id"]
result = poller.result()
# Download the PDF with embedded text
self.archive_path = Path(self.tempdir) / "archive.pdf"
with self.archive_path.open("wb") as f:
for chunk in client.get_analyze_result_pdf(
model_id="prebuilt-read",
result_id=result_id,
):
f.write(chunk)
return result.content
def parse(self, document_path: Path, mime_type, file_name=None):
if not self.settings.engine_is_valid():
self.log.warning(
"No valid remote parser engine is configured, content will be empty.",
)
self.text = ""
return
elif self.settings.engine == "azureai":
self.text = self.azure_ai_vision_parse(document_path)

View File

@@ -0,0 +1,18 @@
def get_parser(*args, **kwargs):
from paperless_remote.parsers import RemoteDocumentParser
return RemoteDocumentParser(*args, **kwargs)
def get_supported_mime_types():
from paperless_remote.parsers import RemoteDocumentParser
return RemoteDocumentParser(None).supported_mime_types()
def remote_consumer_declaration(sender, **kwargs):
return {
"parser": get_parser,
"weight": 5,
"mime_types": get_supported_mime_types(),
}

View File

Binary file not shown.

View File

@@ -0,0 +1,29 @@
from django.test import TestCase
from django.test import override_settings
from paperless_remote import check_remote_parser_configured
class TestChecks(TestCase):
@override_settings(REMOTE_OCR_ENGINE=None)
def test_no_engine(self):
msgs = check_remote_parser_configured(None)
self.assertEqual(len(msgs), 0)
@override_settings(REMOTE_OCR_ENGINE="azureai")
@override_settings(REMOTE_OCR_API_KEY="somekey")
@override_settings(REMOTE_OCR_ENDPOINT=None)
def test_azure_no_endpoint(self):
msgs = check_remote_parser_configured(None)
self.assertEqual(len(msgs), 1)
self.assertTrue(
msgs[0].msg.startswith(
"Azure AI remote parser requires endpoint to be configured.",
),
)
@override_settings(REMOTE_OCR_ENGINE="something")
@override_settings(REMOTE_OCR_API_KEY="somekey")
def test_valid_configuration(self):
msgs = check_remote_parser_configured(None)
self.assertEqual(len(msgs), 0)

View File

@@ -0,0 +1,101 @@
import uuid
from pathlib import Path
from unittest import mock
from django.test import TestCase
from django.test import override_settings
from documents.tests.utils import DirectoriesMixin
from documents.tests.utils import FileSystemAssertsMixin
from paperless_remote.parsers import RemoteDocumentParser
from paperless_remote.signals import get_parser
class TestParser(DirectoriesMixin, FileSystemAssertsMixin, TestCase):
SAMPLE_FILES = Path(__file__).resolve().parent / "samples"
def assertContainsStrings(self, content, strings):
# Asserts that all strings appear in content, in the given order.
indices = []
for s in strings:
if s in content:
indices.append(content.index(s))
else:
self.fail(f"'{s}' is not in '{content}'")
self.assertListEqual(indices, sorted(indices))
@mock.patch("paperless_tesseract.parsers.run_subprocess")
@mock.patch("azure.ai.documentintelligence.DocumentIntelligenceClient")
def test_get_text_with_azure(self, mock_client_cls, mock_subprocess):
# Arrange mock Azure client
mock_client = mock.Mock()
mock_client_cls.return_value = mock_client
# Simulate poller result and its `.details`
mock_poller = mock.Mock()
mock_poller.wait.return_value = None
mock_poller.details = {"operation_id": "fake-op-id"}
mock_client.begin_analyze_document.return_value = mock_poller
mock_poller.result.return_value.content = "This is a test document."
# Return dummy PDF bytes
mock_client.get_analyze_result_pdf.return_value = [
b"%PDF-",
b"1.7 ",
b"FAKEPDF",
]
# Simulate pdftotext by writing dummy text to sidecar file
def fake_run(cmd, *args, **kwargs):
with Path(cmd[-1]).open("w", encoding="utf-8") as f:
f.write("This is a test document.")
mock_subprocess.side_effect = fake_run
with override_settings(
REMOTE_OCR_ENGINE="azureai",
REMOTE_OCR_API_KEY="somekey",
REMOTE_OCR_ENDPOINT="https://endpoint.cognitiveservices.azure.com",
):
parser = get_parser(uuid.uuid4())
parser.parse(
self.SAMPLE_FILES / "simple-digital.pdf",
"application/pdf",
)
self.assertContainsStrings(
parser.text.strip(),
["This is a test document."],
)
@override_settings(
REMOTE_OCR_ENGINE="azureai",
REMOTE_OCR_API_KEY="key",
REMOTE_OCR_ENDPOINT="https://endpoint.cognitiveservices.azure.com",
)
def test_supported_mime_types_valid_config(self):
parser = RemoteDocumentParser(uuid.uuid4())
expected_types = {
"application/pdf": ".pdf",
"image/png": ".png",
"image/jpeg": ".jpg",
"image/tiff": ".tiff",
"image/bmp": ".bmp",
"image/gif": ".gif",
"image/webp": ".webp",
}
self.assertEqual(parser.supported_mime_types(), expected_types)
def test_supported_mime_types_invalid_config(self):
parser = get_parser(uuid.uuid4())
self.assertEqual(parser.supported_mime_types(), {})
@override_settings(
REMOTE_OCR_ENGINE=None,
REMOTE_OCR_API_KEY=None,
REMOTE_OCR_ENDPOINT=None,
)
def test_parse_with_invalid_config(self):
parser = get_parser(uuid.uuid4())
parser.parse(self.SAMPLE_FILES / "simple-digital.pdf", "application/pdf")
self.assertEqual(parser.text, "")

39
uv.lock generated
View File

@@ -95,6 +95,34 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/af/cc/55a32a2c98022d88812b5986d2a92c4ff3ee087e83b712ebc703bba452bf/Automat-24.8.1-py3-none-any.whl", hash = "sha256:bf029a7bc3da1e2c24da2343e7598affaa9f10bf0ab63ff808566ce90551e02a", size = 42585, upload-time = "2024-08-19T17:31:56.729Z" },
]
[[package]]
name = "azure-ai-documentintelligence"
version = "1.0.2"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "azure-core", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "isodate", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "typing-extensions", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
]
sdist = { url = "https://files.pythonhosted.org/packages/44/7b/8115cd713e2caa5e44def85f2b7ebd02a74ae74d7113ba20bdd41fd6dd80/azure_ai_documentintelligence-1.0.2.tar.gz", hash = "sha256:4d75a2513f2839365ebabc0e0e1772f5601b3a8c9a71e75da12440da13b63484", size = 170940 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/d9/75/c9ec040f23082f54ffb1977ff8f364c2d21c79a640a13d1c1809e7fd6b1a/azure_ai_documentintelligence-1.0.2-py3-none-any.whl", hash = "sha256:e1fb446abbdeccc9759d897898a0fe13141ed29f9ad11fc705f951925822ed59", size = 106005 },
]
[[package]]
name = "azure-core"
version = "1.33.0"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "requests", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "six", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "typing-extensions", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
]
sdist = { url = "https://files.pythonhosted.org/packages/75/aa/7c9db8edd626f1a7d99d09ef7926f6f4fb34d5f9fa00dc394afdfe8e2a80/azure_core-1.33.0.tar.gz", hash = "sha256:f367aa07b5e3005fec2c1e184b882b0b039910733907d001c20fb08ebb8c0eb9", size = 295633 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/07/b7/76b7e144aa53bd206bf1ce34fa75350472c3f69bf30e5c8c18bc9881035d/azure_core-1.33.0-py3-none-any.whl", hash = "sha256:9b5b6d0223a1d38c37500e6971118c1e0f13f54951e6893968b38910bc9cda8f", size = 207071 },
]
[[package]]
name = "babel"
version = "2.17.0"
@@ -1402,6 +1430,15 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/c7/fc/4e5a141c3f7c7bed550ac1f69e599e92b6be449dd4677ec09f325cad0955/inotifyrecursive-0.3.5-py3-none-any.whl", hash = "sha256:7e5f4a2e1dc2bef0efa3b5f6b339c41fb4599055a2b54909d020e9e932cc8d2f", size = 8009, upload-time = "2020-11-20T12:38:46.981Z" },
]
[[package]]
name = "isodate"
version = "0.7.2"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/54/4d/e940025e2ce31a8ce1202635910747e5a87cc3a6a6bb2d00973375014749/isodate-0.7.2.tar.gz", hash = "sha256:4cd1aa0f43ca76f4a6c6c0292a85f40b35ec2e43e315b59f06e6d32171a953e6", size = 29705 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/15/aa/0aca39a37d3c7eb941ba736ede56d689e7be91cab5d9ca846bde3999eba6/isodate-0.7.2-py3-none-any.whl", hash = "sha256:28009937d8031054830160fce6d409ed342816b543597cece116d966c6d99e15", size = 22320 },
]
[[package]]
name = "jinja2"
version = "3.1.6"
@@ -2010,6 +2047,7 @@ name = "paperless-ngx"
version = "2.18.1"
source = { virtual = "." }
dependencies = [
{ name = "azure-ai-documentintelligence", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "babel", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "bleach", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "celery", extra = ["redis"], marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
@@ -2144,6 +2182,7 @@ typing = [
[package.metadata]
requires-dist = [
{ name = "azure-ai-documentintelligence", specifier = ">=1.0.2" },
{ name = "babel", specifier = ">=2.17" },
{ name = "bleach", specifier = "~=6.2.0" },
{ name = "celery", extras = ["redis"], specifier = "~=5.5.1" },