Compare commits

26 Commits

Author SHA1 Message Date
shamoon
1e6dfc4481 Merge branch 'dev' into feature-remote-ocr-2 2025-08-26 13:30:39 -07:00
shamoon
0088333360 Chore: refactor document details component (#10662) 2025-08-26 13:29:38 -07:00
GitHub Actions
ed1d488d6e Auto translate strings 2025-08-26 20:29:22 +00:00
shamoon
b25b15ba32 Fixhancement: more saved view count refreshes (#10694) 2025-08-26 13:27:49 -07:00
shamoon
7cc0750066 Add note on costs and limitations for Azure OCR 2025-08-24 05:47:07 -07:00
shamoon
bd6585d3b4 Merge branch 'dev' into feature-remote-ocr-2 2025-08-22 08:54:26 -07:00
shamoon
717e828a1d Merge branch 'dev' into feature-remote-ocr-2 2025-08-17 21:25:14 -07:00
shamoon
07381d48e6 Merge branch 'dev' into feature-remote-ocr-2 2025-08-17 07:49:58 -07:00
shamoon
dd0ffaf312 Merge branch 'dev' into feature-remote-ocr-2 2025-08-11 10:48:36 -07:00
shamoon
264504affc Fix consumer declaration file extensions 2025-08-10 05:32:52 -07:00
shamoon
4feedf2add Merge branch 'dev' into feature-remote-ocr-2 2025-08-06 16:04:25 -04:00
shamoon
2f76cf9831 Merge branch 'dev' into feature-remote-ocr-2 2025-08-01 23:55:49 -04:00
shamoon
1002d37f6b Update test_parser.py 2025-07-09 11:05:37 -07:00
shamoon
d260a94740 Update parsers.py 2025-07-09 11:02:57 -07:00
shamoon
88c69b83ea Update index.md 2025-07-09 11:00:12 -07:00
shamoon
2557ee2014 Update docs to mention remote OCR with Azure AI 2025-07-09 09:53:30 -07:00
shamoon
3c75deed80 Add paperless_remote tests to testpaths 2025-07-08 14:19:45 -07:00
shamoon
d05343c927 Test fixes / coverage 2025-07-08 14:19:45 -07:00
shamoon
e7972b7eaf Coverage 2025-07-08 14:19:45 -07:00
shamoon
75a091cc0d Fix test 2025-07-08 14:19:44 -07:00
shamoon
dca74803fd Use output_content_format poller.result to get clean content 2025-07-08 14:19:44 -07:00
shamoon
3cf3d868d0 Some docs 2025-07-08 14:19:43 -07:00
shamoon
bf4fc6604a Test 2025-07-08 14:19:43 -07:00
shamoon
e8c1eb86fa This actually works [ci skip] 2025-07-08 14:19:43 -07:00
shamoon
c3dad3cf69 Basic parse 2025-07-08 14:19:42 -07:00
shamoon
811bd66088 Ok, restart implementing this with just azure [ci skip] 2025-07-08 14:19:42 -07:00
32 changed files with 689 additions and 308 deletions

@@ -1800,3 +1800,23 @@ password. All of these options come from their similarly-named [Django settings]
#### [`PAPERLESS_EMAIL_USE_SSL=<bool>`](#PAPERLESS_EMAIL_USE_SSL) {#PAPERLESS_EMAIL_USE_SSL} #### [`PAPERLESS_EMAIL_USE_SSL=<bool>`](#PAPERLESS_EMAIL_USE_SSL) {#PAPERLESS_EMAIL_USE_SSL}
: Defaults to false. : Defaults to false.
## Remote OCR
#### [`PAPERLESS_REMOTE_OCR_ENGINE=<str>`](#PAPERLESS_REMOTE_OCR_ENGINE) {#PAPERLESS_REMOTE_OCR_ENGINE}
: The remote OCR engine to use. Currently, only Azure AI is supported; set the value to `azureai` to enable it.
Defaults to None, which disables remote OCR.
#### [`PAPERLESS_REMOTE_OCR_API_KEY=<str>`](#PAPERLESS_REMOTE_OCR_API_KEY) {#PAPERLESS_REMOTE_OCR_API_KEY}
: The API key to use for the remote OCR engine.
Defaults to None.
#### [`PAPERLESS_REMOTE_OCR_ENDPOINT=<str>`](#PAPERLESS_REMOTE_OCR_ENDPOINT) {#PAPERLESS_REMOTE_OCR_ENDPOINT}
: The endpoint to use for the remote OCR engine. This is required for Azure AI.
Defaults to None.
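
As a rough illustration of how the three options above relate, here is a minimal Python sketch. The helper name `remote_ocr_settings` and its validation rules are hypothetical and are not taken from the Paperless-ngx code. In particular, treating a missing API key as an error is an assumption; the text above only states that the endpoint is required for Azure AI.

```python
import os
from typing import Mapping, Optional


def remote_ocr_settings(env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Hypothetical helper: return remote OCR settings, or None when disabled."""
    # An unset (or empty) engine means remote OCR stays disabled.
    engine = env.get("PAPERLESS_REMOTE_OCR_ENGINE") or None
    if engine is None:
        return None
    # "azureai" is the only engine value named in the documentation.
    if engine != "azureai":
        raise ValueError(f"Unsupported remote OCR engine: {engine!r}")
    endpoint = env.get("PAPERLESS_REMOTE_OCR_ENDPOINT") or None
    api_key = env.get("PAPERLESS_REMOTE_OCR_API_KEY") or None
    # The endpoint is documented as required; requiring the key is an assumption.
    if endpoint is None or api_key is None:
        raise ValueError("azureai needs PAPERLESS_REMOTE_OCR_ENDPOINT and an API key")
    return {"engine": engine, "api_key": api_key, "endpoint": endpoint}
```

Called with an empty mapping, the helper returns `None`, which mirrors the documented default of remote OCR being disabled unless an engine is chosen explicitly.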

@@ -25,9 +25,10 @@ physical documents into a searchable online archive so you can keep, well, _less
## Features ## Features
- **Organize and index** your scanned documents with tags, correspondents, types, and more. - **Organize and index** your scanned documents with tags, correspondents, types, and more.
- _Your_ data is stored locally on _your_ server and is never transmitted or shared in any way. - _Your_ data is stored locally on _your_ server and is never transmitted or shared in any way, unless you explicitly choose to do so.
- Performs **OCR** on your documents, adding searchable and selectable text, even to documents scanned with only images. - Performs **OCR** on your documents, adding searchable and selectable text, even to documents scanned with only images.
- Utilizes the open-source Tesseract engine to recognize more than 100 languages. - Utilizes the open-source Tesseract engine to recognize more than 100 languages.
- _New!_ Supports remote OCR with Azure AI (opt-in).
- Documents are saved as PDF/A format which is designed for long term storage, alongside the unaltered originals. - Documents are saved as PDF/A format which is designed for long term storage, alongside the unaltered originals.
- Uses machine-learning to automatically add tags, correspondents and document types to your documents. - Uses machine-learning to automatically add tags, correspondents and document types to your documents.
- Supports PDF documents, images, plain text files, Office documents (Word, Excel, PowerPoint, and LibreOffice equivalents)[^1] and more. - Supports PDF documents, images, plain text files, Office documents (Word, Excel, PowerPoint, and LibreOffice equivalents)[^1] and more.

@@ -850,6 +850,21 @@ how regularly you intend to scan documents and use paperless.
performed the task associated with the document, move it to the performed the task associated with the document, move it to the
inbox. inbox.
## Remote OCR
!!! important
This feature is disabled by default and will always remain strictly "opt-in".
Paperless-ngx supports performing OCR on documents using remote services. At the moment, this is limited to
[Microsoft's Azure "Document Intelligence" service](https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence).
This is a paid service (with a free tier) that requires an Azure account and subscription. Azure AI is not affiliated with
Paperless-ngx in any way. When enabled, Paperless-ngx will automatically send appropriate documents to Azure for OCR processing, bypassing
the local OCR engine. See the [configuration](configuration.md#PAPERLESS_REMOTE_OCR_ENGINE) options for more details.
Additionally, when using a commercial service with this feature, consider both the potential costs and any associated file size
or page limitations (e.g. with a free tier).
## Architecture ## Architecture
Paperless-ngx consists of the following components: Paperless-ngx consists of the following components:

@@ -15,6 +15,7 @@ classifiers = [
# This will allow testing to not install a webserver, mysql, etc # This will allow testing to not install a webserver, mysql, etc
dependencies = [ dependencies = [
"azure-ai-documentintelligence>=1.0.2",
"babel>=2.17", "babel>=2.17",
"bleach~=6.2.0", "bleach~=6.2.0",
"celery[redis]~=5.5.1", "celery[redis]~=5.5.1",
@@ -239,6 +240,7 @@ testpaths = [
"src/paperless_tesseract/tests/", "src/paperless_tesseract/tests/",
"src/paperless_tika/tests", "src/paperless_tika/tests",
"src/paperless_text/tests/", "src/paperless_text/tests/",
"src/paperless_remote/tests/",
] ]
addopts = [ addopts = [
"--pythonwarnings=all", "--pythonwarnings=all",

@@ -7671,7 +7671,7 @@
</context-group> </context-group>
<context-group purpose="location"> <context-group purpose="location">
<context context-type="sourcefile">src/app/components/document-list/document-list.component.ts</context> <context context-type="sourcefile">src/app/components/document-list/document-list.component.ts</context>
<context context-type="linenumber">311</context> <context context-type="linenumber">313</context>
</context-group> </context-group>
</trans-unit> </trans-unit>
<trans-unit id="1494518490116523821" datatype="html"> <trans-unit id="1494518490116523821" datatype="html">
@@ -7682,7 +7682,7 @@
</context-group> </context-group>
<context-group purpose="location"> <context-group purpose="location">
<context context-type="sourcefile">src/app/components/document-list/document-list.component.ts</context> <context context-type="sourcefile">src/app/components/document-list/document-list.component.ts</context>
<context context-type="linenumber">304</context> <context context-type="linenumber">306</context>
</context-group> </context-group>
</trans-unit> </trans-unit>
<trans-unit id="8461842260159597706" datatype="html"> <trans-unit id="8461842260159597706" datatype="html">
@@ -7925,49 +7925,49 @@
<source>Reset filters / selection</source> <source>Reset filters / selection</source>
<context-group purpose="location"> <context-group purpose="location">
<context context-type="sourcefile">src/app/components/document-list/document-list.component.ts</context> <context context-type="sourcefile">src/app/components/document-list/document-list.component.ts</context>
<context context-type="linenumber">292</context> <context context-type="linenumber">294</context>
</context-group> </context-group>
</trans-unit> </trans-unit>
<trans-unit id="4135055128446167640" datatype="html"> <trans-unit id="4135055128446167640" datatype="html">
<source>Open first [selected] document</source> <source>Open first [selected] document</source>
<context-group purpose="location"> <context-group purpose="location">
<context context-type="sourcefile">src/app/components/document-list/document-list.component.ts</context> <context context-type="sourcefile">src/app/components/document-list/document-list.component.ts</context>
<context context-type="linenumber">320</context> <context context-type="linenumber">322</context>
</context-group> </context-group>
</trans-unit> </trans-unit>
<trans-unit id="3629960544875360046" datatype="html"> <trans-unit id="3629960544875360046" datatype="html">
<source>Previous page</source> <source>Previous page</source>
<context-group purpose="location"> <context-group purpose="location">
<context context-type="sourcefile">src/app/components/document-list/document-list.component.ts</context> <context context-type="sourcefile">src/app/components/document-list/document-list.component.ts</context>
<context context-type="linenumber">336</context> <context context-type="linenumber">338</context>
</context-group> </context-group>
</trans-unit> </trans-unit>
<trans-unit id="3337301694210287595" datatype="html"> <trans-unit id="3337301694210287595" datatype="html">
<source>Next page</source> <source>Next page</source>
<context-group purpose="location"> <context-group purpose="location">
<context context-type="sourcefile">src/app/components/document-list/document-list.component.ts</context> <context context-type="sourcefile">src/app/components/document-list/document-list.component.ts</context>
<context context-type="linenumber">348</context> <context context-type="linenumber">350</context>
</context-group> </context-group>
</trans-unit> </trans-unit>
<trans-unit id="2155249406916744630" datatype="html"> <trans-unit id="2155249406916744630" datatype="html">
<source>View &quot;<x id="PH" equiv-text="this.list.activeSavedViewTitle"/>&quot; saved successfully.</source> <source>View &quot;<x id="PH" equiv-text="this.list.activeSavedViewTitle"/>&quot; saved successfully.</source>
<context-group purpose="location"> <context-group purpose="location">
<context context-type="sourcefile">src/app/components/document-list/document-list.component.ts</context> <context context-type="sourcefile">src/app/components/document-list/document-list.component.ts</context>
<context context-type="linenumber">381</context> <context context-type="linenumber">383</context>
</context-group> </context-group>
</trans-unit> </trans-unit>
<trans-unit id="4646273665293421938" datatype="html"> <trans-unit id="4646273665293421938" datatype="html">
<source>Failed to save view &quot;<x id="PH" equiv-text="this.list.activeSavedViewTitle"/>&quot;.</source> <source>Failed to save view &quot;<x id="PH" equiv-text="this.list.activeSavedViewTitle"/>&quot;.</source>
<context-group purpose="location"> <context-group purpose="location">
<context context-type="sourcefile">src/app/components/document-list/document-list.component.ts</context> <context context-type="sourcefile">src/app/components/document-list/document-list.component.ts</context>
<context context-type="linenumber">387</context> <context context-type="linenumber">389</context>
</context-group> </context-group>
</trans-unit> </trans-unit>
<trans-unit id="6837554170707123455" datatype="html"> <trans-unit id="6837554170707123455" datatype="html">
<source>View &quot;<x id="PH" equiv-text="savedView.name"/>&quot; created successfully.</source> <source>View &quot;<x id="PH" equiv-text="savedView.name"/>&quot; created successfully.</source>
<context-group purpose="location"> <context-group purpose="location">
<context context-type="sourcefile">src/app/components/document-list/document-list.component.ts</context> <context context-type="sourcefile">src/app/components/document-list/document-list.component.ts</context>
<context context-type="linenumber">431</context> <context context-type="linenumber">435</context>
</context-group> </context-group>
</trans-unit> </trans-unit>
<trans-unit id="739880801667335279" datatype="html"> <trans-unit id="739880801667335279" datatype="html">

@@ -106,6 +106,7 @@ describe('DashboardComponent', () => {
}), }),
dashboardViews: saved_views.filter((v) => v.show_on_dashboard), dashboardViews: saved_views.filter((v) => v.show_on_dashboard),
allViews: saved_views, allViews: saved_views,
setDocumentCount: jest.fn(),
}, },
}, },
provideHttpClient(withInterceptorsFromDi()), provideHttpClient(withInterceptorsFromDi()),

@@ -52,6 +52,7 @@ import {
} from 'src/app/services/permissions.service' } from 'src/app/services/permissions.service'
import { CustomFieldsService } from 'src/app/services/rest/custom-fields.service' import { CustomFieldsService } from 'src/app/services/rest/custom-fields.service'
import { DocumentService } from 'src/app/services/rest/document.service' import { DocumentService } from 'src/app/services/rest/document.service'
import { SavedViewService } from 'src/app/services/rest/saved-view.service'
import { SettingsService } from 'src/app/services/settings.service' import { SettingsService } from 'src/app/services/settings.service'
import { WebsocketStatusService } from 'src/app/services/websocket-status.service' import { WebsocketStatusService } from 'src/app/services/websocket-status.service'
import { WidgetFrameComponent } from '../widget-frame/widget-frame.component' import { WidgetFrameComponent } from '../widget-frame/widget-frame.component'
@@ -94,6 +95,7 @@ export class SavedViewWidgetComponent
permissionsService = inject(PermissionsService) permissionsService = inject(PermissionsService)
private settingsService = inject(SettingsService) private settingsService = inject(SettingsService)
private customFieldService = inject(CustomFieldsService) private customFieldService = inject(CustomFieldsService)
private savedViewService = inject(SavedViewService)
public DisplayMode = DisplayMode public DisplayMode = DisplayMode
public DisplayField = DisplayField public DisplayField = DisplayField
@@ -181,6 +183,7 @@ export class SavedViewWidgetComponent
this.show = true this.show = true
this.documents = result.results this.documents = result.results
this.count = result.count this.count = result.count
this.savedViewService.setDocumentCount(this.savedView, result.count)
}), }),
delay(500) delay(500)
) )

@@ -1,9 +1,4 @@
<pngx-page-header [(title)]="title"> <pngx-page-header [(title)]="title">
@if (document?.in_process) {
<span class="badge bg-danger text-dark ms-2 d-flex align-items-center">
<div class="spinner-border spinner-border-sm me-1" role="status"></div><span i18n>Processing...</span>
</span>
}
@if (archiveContentRenderType === ContentRenderType.PDF && !useNativePdfViewer) { @if (archiveContentRenderType === ContentRenderType.PDF && !useNativePdfViewer) {
@if (previewNumPages) { @if (previewNumPages) {
<div class="input-group input-group-sm d-none d-md-flex"> <div class="input-group input-group-sm d-none d-md-flex">
@@ -55,7 +50,7 @@
<div class="d-none d-sm-inline">&nbsp;<ng-container i18n>Actions</ng-container></div> <div class="d-none d-sm-inline">&nbsp;<ng-container i18n>Actions</ng-container></div>
</button> </button>
<div ngbDropdownMenu aria-labelledby="actionsDropdown" class="shadow"> <div ngbDropdownMenu aria-labelledby="actionsDropdown" class="shadow">
<button ngbDropdownItem (click)="reprocess()" [disabled]="!userCanEdit || !userIsOwner || document?.in_process"> <button ngbDropdownItem (click)="reprocess()" [disabled]="!userCanEdit || !userIsOwner">
<i-bs width="1em" height="1em" name="arrow-counterclockwise"></i-bs>&nbsp;<span i18n>Reprocess</span> <i-bs width="1em" height="1em" name="arrow-counterclockwise"></i-bs>&nbsp;<span i18n>Reprocess</span>
</button> </button>
@@ -63,7 +58,7 @@
<i-bs width="1em" height="1em" name="diagram-3"></i-bs>&nbsp;<span i18n>More like this</span> <i-bs width="1em" height="1em" name="diagram-3"></i-bs>&nbsp;<span i18n>More like this</span>
</button> </button>
<button ngbDropdownItem (click)="editPdf()" [disabled]="!userIsOwner || !userCanEdit || originalContentRenderType !== ContentRenderType.PDF || document?.in_process"> <button ngbDropdownItem (click)="editPdf()" [disabled]="!userIsOwner || !userCanEdit || originalContentRenderType !== ContentRenderType.PDF">
<i-bs name="pencil"></i-bs>&nbsp;<ng-container i18n>PDF Editor</ng-container> <i-bs name="pencil"></i-bs>&nbsp;<ng-container i18n>PDF Editor</ng-container>
</button> </button>
</div> </div>
@@ -95,6 +90,7 @@
} }
</div> </div>
</div> </div>
</pngx-page-header> </pngx-page-header>
<div class="row"> <div class="row">

@@ -21,8 +21,9 @@ import { dirtyCheck, DirtyComponent } from '@ngneat/dirty-check-forms'
import { PDFDocumentProxy, PdfViewerModule } from 'ng2-pdf-viewer' import { PDFDocumentProxy, PdfViewerModule } from 'ng2-pdf-viewer'
import { NgxBootstrapIconsModule } from 'ngx-bootstrap-icons' import { NgxBootstrapIconsModule } from 'ngx-bootstrap-icons'
import { DeviceDetectorService } from 'ngx-device-detector' import { DeviceDetectorService } from 'ngx-device-detector'
import { BehaviorSubject, Observable, Subject } from 'rxjs' import { BehaviorSubject, Observable, of, Subject } from 'rxjs'
import { import {
catchError,
debounceTime, debounceTime,
distinctUntilChanged, distinctUntilChanged,
filter, filter,
@@ -327,19 +328,163 @@ export class DocumentDetailComponent
} }
} }
private mapDocToForm(doc: Document): any {
return {
...doc,
permissions_form: { owner: doc.owner, set_permissions: doc.permissions },
}
}
private mapFormToDoc(value: any): any {
const docValues = { ...value }
docValues['owner'] = value['permissions_form']?.owner
docValues['set_permissions'] = value['permissions_form']?.set_permissions
delete docValues['permissions_form']
return docValues
}
private prepareForm(doc: Document): void {
this.documentForm.reset(this.mapDocToForm(doc), { emitEvent: false })
if (!this.userCanEditDoc(doc)) {
this.documentForm.disable({ emitEvent: false })
} else {
this.documentForm.enable({ emitEvent: false })
}
if (doc.__changedFields) {
doc.__changedFields.forEach((field) => {
if (field === 'owner' || field === 'set_permissions') {
this.documentForm.get('permissions_form')?.markAsDirty()
} else {
this.documentForm.get(field)?.markAsDirty()
}
})
}
}
private setupDirtyTracking(
currentDocument: Document,
originalDocument: Document
): void {
this.store = new BehaviorSubject({
title: originalDocument.title,
content: originalDocument.content,
created: originalDocument.created,
correspondent: originalDocument.correspondent,
document_type: originalDocument.document_type,
storage_path: originalDocument.storage_path,
archive_serial_number: originalDocument.archive_serial_number,
tags: [...originalDocument.tags],
permissions_form: {
owner: originalDocument.owner,
set_permissions: originalDocument.permissions,
},
custom_fields: [...originalDocument.custom_fields],
})
this.isDirty$ = dirtyCheck(this.documentForm, this.store.asObservable())
this.isDirty$
.pipe(
takeUntil(this.unsubscribeNotifier),
takeUntil(this.docChangeNotifier)
)
.subscribe((dirty) =>
this.openDocumentService.setDirty(
currentDocument,
dirty,
this.getChangedFields()
)
)
}
private loadDocument(documentId: number): void {
this.previewUrl = this.documentsService.getPreviewUrl(documentId)
this.http
.get(this.previewUrl, { responseType: 'text' })
.pipe(
first(),
takeUntil(this.unsubscribeNotifier),
takeUntil(this.docChangeNotifier)
)
.subscribe({
next: (res) => (this.previewText = res.toString()),
error: (err) =>
(this.previewText = $localize`An error occurred loading content: ${
err.message ?? err.toString()
}`),
})
this.thumbUrl = this.documentsService.getThumbUrl(documentId)
this.documentsService
.get(documentId)
.pipe(
catchError(() => {
// 404 is handled in the subscribe below
return of(null)
}),
first(),
takeUntil(this.unsubscribeNotifier),
takeUntil(this.docChangeNotifier)
)
.subscribe({
next: (doc) => {
if (!doc) {
this.router.navigate(['404'], { replaceUrl: true })
return
}
this.documentId = doc.id
this.suggestions = null
const openDocument = this.openDocumentService.getOpenDocument(
this.documentId
)
const useDoc = openDocument || doc
if (openDocument) {
if (
new Date(doc.modified) > new Date(openDocument.modified) &&
!this.modalService.hasOpenModals()
) {
const modal = this.modalService.open(ConfirmDialogComponent)
modal.componentInstance.title = $localize`Document changes detected`
modal.componentInstance.messageBold = $localize`The version of this document in your browser session appears older than the existing version.`
modal.componentInstance.message = $localize`Saving the document here may overwrite other changes that were made. To restore the existing version, discard your changes or close the document.`
modal.componentInstance.cancelBtnClass = 'visually-hidden'
modal.componentInstance.btnCaption = $localize`Ok`
modal.componentInstance.confirmClicked.subscribe(() =>
modal.close()
)
}
} else {
this.openDocumentService
.openDocument(doc)
.pipe(
first(),
takeUntil(this.unsubscribeNotifier),
takeUntil(this.docChangeNotifier)
)
.subscribe()
}
this.updateComponent(useDoc)
this.titleSubject
.pipe(
debounceTime(1000),
distinctUntilChanged(),
takeUntil(this.docChangeNotifier),
takeUntil(this.unsubscribeNotifier)
)
.subscribe((titleValue) => {
if (titleValue !== this.titleInput.value) return
this.title = titleValue
this.documentForm.patchValue({ title: titleValue })
})
this.setupDirtyTracking(useDoc, doc)
},
})
}
ngOnInit(): void { ngOnInit(): void {
this.setZoom(this.settings.get(SETTINGS_KEYS.PDF_VIEWER_ZOOM_SETTING)) this.setZoom(this.settings.get(SETTINGS_KEYS.PDF_VIEWER_ZOOM_SETTING))
this.documentForm.valueChanges this.documentForm.valueChanges
.pipe(takeUntil(this.unsubscribeNotifier)) .pipe(takeUntil(this.unsubscribeNotifier))
.subscribe(() => { .subscribe((values) => {
this.error = null this.error = null
const docValues = Object.assign({}, this.documentForm.value) Object.assign(this.document, this.mapFormToDoc(values))
docValues['owner'] =
this.documentForm.get('permissions_form').value['owner']
docValues['set_permissions'] =
this.documentForm.get('permissions_form').value['set_permissions']
delete docValues['permissions_form']
Object.assign(this.document, docValues)
}) })
if ( if (
@@ -391,171 +536,36 @@ export class DocumentDetailComponent
this.route.paramMap this.route.paramMap
.pipe( .pipe(
filter((paramMap) => { filter(
// only init when changing docs & section is set (paramMap) =>
return (
+paramMap.get('id') !== this.documentId && +paramMap.get('id') !== this.documentId &&
paramMap.get('section')?.length > 0 paramMap.get('section')?.length > 0
) ),
}), takeUntil(this.unsubscribeNotifier)
takeUntil(this.unsubscribeNotifier),
switchMap((paramMap) => {
const documentId = +paramMap.get('id')
this.docChangeNotifier.next(documentId)
// Dont wait to get the preview
this.previewUrl = this.documentsService.getPreviewUrl(documentId)
this.http.get(this.previewUrl, { responseType: 'text' }).subscribe({
next: (res) => {
this.previewText = res.toString()
},
error: (err) => {
this.previewText = $localize`An error occurred loading content: ${
err.message ?? err.toString()
}`
},
})
this.thumbUrl = this.documentsService.getThumbUrl(documentId)
return this.documentsService.get(documentId)
})
) )
.pipe( .subscribe((paramMap) => {
switchMap((doc) => { const documentId = +paramMap.get('id')
this.documentId = doc.id this.docChangeNotifier.next(documentId)
this.suggestions = null this.loadDocument(documentId)
const openDocument = this.openDocumentService.getOpenDocument(
this.documentId
)
if (openDocument) {
if (
new Date(doc.modified) > new Date(openDocument.modified) &&
!this.modalService.hasOpenModals()
) {
let modal = this.modalService.open(ConfirmDialogComponent)
modal.componentInstance.title = $localize`Document changes detected`
modal.componentInstance.messageBold = $localize`The version of this document in your browser session appears older than the existing version.`
modal.componentInstance.message = $localize`Saving the document here may overwrite other changes that were made. To restore the existing version, discard your changes or close the document.`
modal.componentInstance.cancelBtnClass = 'visually-hidden'
modal.componentInstance.btnCaption = $localize`Ok`
modal.componentInstance.confirmClicked.subscribe(() =>
modal.close()
)
}
// Prevent mutating stale form values into the next document: only sync if it still matches the active document.
if (
this.documentForm.dirty &&
(this.document?.id === openDocument.id || !this.document)
) {
Object.assign(openDocument, this.documentForm.value)
openDocument['owner'] =
this.documentForm.get('permissions_form').value['owner']
openDocument['permissions'] =
this.documentForm.get('permissions_form').value[
'set_permissions'
]
delete openDocument['permissions_form']
}
if (openDocument.__changedFields) {
openDocument.__changedFields.forEach((field) => {
if (field === 'owner' || field === 'set_permissions') {
this.documentForm.get('permissions_form').markAsDirty()
} else {
this.documentForm.get(field)?.markAsDirty()
}
})
}
this.updateComponent(openDocument)
} else {
this.openDocumentService.openDocument(doc)
this.updateComponent(doc)
}
this.titleSubject
.pipe(
debounceTime(1000),
distinctUntilChanged(),
takeUntil(this.docChangeNotifier),
takeUntil(this.unsubscribeNotifier)
)
.subscribe({
next: (titleValue) => {
// In the rare case when the field changed just after debounced event was fired.
// We dont want to overwrite what's actually in the text field, so just return
if (titleValue !== this.titleInput.value) return
this.title = titleValue
this.documentForm.patchValue({ title: titleValue })
},
complete: () => {
// doc changed so we manually check dirty in case title was changed
if (
this.store.getValue().title !==
this.documentForm.get('title').value
) {
this.openDocumentService.setDirty(
doc,
true,
this.getChangedFields()
)
}
},
})
// Initialize dirtyCheck
this.store = new BehaviorSubject({
title: doc.title,
content: doc.content,
created: doc.created,
correspondent: doc.correspondent,
document_type: doc.document_type,
storage_path: doc.storage_path,
archive_serial_number: doc.archive_serial_number,
tags: [...doc.tags],
permissions_form: {
owner: doc.owner,
set_permissions: doc.permissions,
},
custom_fields: [...doc.custom_fields],
})
this.isDirty$ = dirtyCheck(
this.documentForm,
this.store.asObservable()
)
return this.isDirty$.pipe(
takeUntil(this.unsubscribeNotifier),
map((dirty) => ({ doc, dirty }))
)
})
)
.subscribe({
next: ({ doc, dirty }) => {
this.openDocumentService.setDirty(doc, dirty, this.getChangedFields())
},
error: (error) => {
this.router.navigate(['404'], {
replaceUrl: true,
})
},
}) })
this.route.paramMap.subscribe((paramMap) => { this.route.paramMap
const section = paramMap.get('section') .pipe(takeUntil(this.unsubscribeNotifier))
if (section) { .subscribe((paramMap) => {
const navIDKey: string = Object.keys(DocumentDetailNavIDs).find( const section = paramMap.get('section')
(navID) => navID.toLowerCase() == section if (section) {
) const navIDKey: string = Object.keys(DocumentDetailNavIDs).find(
if (navIDKey) { (navID) => navID.toLowerCase() == section
this.activeNavID = DocumentDetailNavIDs[navIDKey] )
if (navIDKey) {
this.activeNavID = DocumentDetailNavIDs[navIDKey]
}
} else if (paramMap.get('id')) {
this.router.navigate(['documents', +paramMap.get('id'), 'details'], {
replaceUrl: true,
})
} }
} else if (paramMap.get('id')) { })
this.router.navigate(['documents', +paramMap.get('id'), 'details'], {
replaceUrl: true,
})
}
})
this.hotKeyService this.hotKeyService
.addShortcut({ .addShortcut({
@@ -682,19 +692,7 @@ export class DocumentDetailComponent
}) })
} }
this.title = this.documentTitlePipe.transform(doc.title) this.title = this.documentTitlePipe.transform(doc.title)
const docFormValues = Object.assign({}, doc) this.prepareForm(doc)
docFormValues['permissions_form'] = {
owner: doc.owner,
set_permissions: doc.permissions,
}
this.documentForm.patchValue(docFormValues, { emitEvent: false })
if (!this.userCanEdit) this.documentForm.disable()
setTimeout(() => {
// check again after a tick in case form was dirty
if (!this.userCanEdit) this.documentForm.disable()
else this.documentForm.enable()
}, 10)
} }
get customFieldFormFields(): FormArray { get customFieldFormFields(): FormArray {
@@ -797,7 +795,11 @@ export class DocumentDetailComponent
discard() { discard() {
this.documentsService this.documentsService
.get(this.documentId) .get(this.documentId)
.pipe(first()) .pipe(
first(),
takeUntil(this.unsubscribeNotifier),
takeUntil(this.docChangeNotifier)
)
.subscribe({ .subscribe({
next: (doc) => { next: (doc) => {
Object.assign(this.document, doc) Object.assign(this.document, doc)
@@ -900,9 +902,10 @@ export class DocumentDetailComponent
.patch(this.getChangedFields()) .patch(this.getChangedFields())
.pipe( .pipe(
switchMap((updateResult) => { switchMap((updateResult) => {
return this.documentListViewService return this.documentListViewService.getNext(this.documentId).pipe(
.getNext(this.documentId) map((nextDocId) => ({ nextDocId, updateResult })),
.pipe(map((nextDocId) => ({ nextDocId, updateResult }))) takeUntil(this.unsubscribeNotifier)
)
}) })
) )
.pipe( .pipe(
@@ -912,7 +915,10 @@ export class DocumentDetailComponent
return this.openDocumentService return this.openDocumentService
.closeDocument(this.document) .closeDocument(this.document)
.pipe( .pipe(
map((closeResult) => ({ updateResult, nextDocId, closeResult })) map(
(closeResult) => ({ updateResult, nextDocId, closeResult })
),
takeUntil(this.unsubscribeNotifier)
) )
} }
}) })
@@ -1238,16 +1244,19 @@ export class DocumentDetailComponent
) { ) {
doc.owner = this.store.value.permissions_form.owner doc.owner = this.store.value.permissions_form.owner
} }
return !this.document || this.userCanEditDoc(doc)
}
private userCanEditDoc(doc: Document): boolean {
return ( return (
!this.document || this.permissionsService.currentUserCan(
(this.permissionsService.currentUserCan(
PermissionAction.Change, PermissionAction.Change,
PermissionType.Document PermissionType.Document
) && ) &&
this.permissionsService.currentUserHasObjectPermissions( this.permissionsService.currentUserHasObjectPermissions(
PermissionAction.Change, PermissionAction.Change,
doc doc
)) )
) )
} }
@@ -1429,43 +1438,50 @@ export class DocumentDetailComponent
   }
   private tryRenderTiff() {
-    this.http.get(this.previewUrl, { responseType: 'arraybuffer' }).subscribe({
-      next: (res) => {
-        /* istanbul ignore next */
-        try {
-          // See UTIF.js > _imgLoaded
-          const tiffIfds: any[] = UTIF.decode(res)
-          var vsns = tiffIfds,
-            ma = 0,
-            page = vsns[0]
-          if (tiffIfds[0].subIFD) vsns = vsns.concat(tiffIfds[0].subIFD)
-          for (var i = 0; i < vsns.length; i++) {
-            var img = vsns[i]
-            if (img['t258'] == null || img['t258'].length < 3) continue
-            var ar = img['t256'] * img['t257']
-            if (ar > ma) {
-              ma = ar
-              page = img
-            }
-          }
-          UTIF.decodeImage(res, page, tiffIfds)
-          const rgba = UTIF.toRGBA8(page)
-          const { width: w, height: h } = page
-          var cnv = document.createElement('canvas')
-          cnv.width = w
-          cnv.height = h
-          var ctx = cnv.getContext('2d'),
-            imgd = ctx.createImageData(w, h)
-          for (var i = 0; i < rgba.length; i++) imgd.data[i] = rgba[i]
-          ctx.putImageData(imgd, 0, 0)
-          this.tiffURL = cnv.toDataURL()
-        } catch (err) {
-          this.tiffError = $localize`An error occurred loading tiff: ${err.toString()}`
-        }
-      },
-      error: (err) => {
-        this.tiffError = $localize`An error occurred loading tiff: ${err.toString()}`
-      },
-    })
+    this.http
+      .get(this.previewUrl, { responseType: 'arraybuffer' })
+      .pipe(
+        first(),
+        takeUntil(this.unsubscribeNotifier),
+        takeUntil(this.docChangeNotifier)
+      )
+      .subscribe({
+        next: (res) => {
+          /* istanbul ignore next */
+          try {
+            // See UTIF.js > _imgLoaded
+            const tiffIfds: any[] = UTIF.decode(res)
+            var vsns = tiffIfds,
+              ma = 0,
+              page = vsns[0]
+            if (tiffIfds[0].subIFD) vsns = vsns.concat(tiffIfds[0].subIFD)
+            for (var i = 0; i < vsns.length; i++) {
+              var img = vsns[i]
+              if (img['t258'] == null || img['t258'].length < 3) continue
+              var ar = img['t256'] * img['t257']
+              if (ar > ma) {
+                ma = ar
+                page = img
+              }
+            }
+            UTIF.decodeImage(res, page, tiffIfds)
+            const rgba = UTIF.toRGBA8(page)
+            const { width: w, height: h } = page
+            var cnv = document.createElement('canvas')
+            cnv.width = w
+            cnv.height = h
+            var ctx = cnv.getContext('2d'),
+              imgd = ctx.createImageData(w, h)
+            for (var i = 0; i < rgba.length; i++) imgd.data[i] = rgba[i]
+            ctx.putImageData(imgd, 0, 0)
+            this.tiffURL = cnv.toDataURL()
+          } catch (err) {
+            this.tiffError = $localize`An error occurred loading tiff: ${err.toString()}`
+          }
+        },
+        error: (err) => {
+          this.tiffError = $localize`An error occurred loading tiff: ${err.toString()}`
+        },
+      })
   }
 }


@@ -15,13 +15,8 @@
       }
     </div>
     <div class="col col-md-10">
       <div class="card-body">
-        @if (document?.in_process) {
-          <span class="badge bg-secondary text-light mb-2">
-            <div class="spinner-border spinner-border-sm me-1" role="status"></div><span i18n>Processing...</span>
-          </span>
-        }
         <div class="d-flex justify-content-between align-items-center">
           <h5 class="card-title w-100">
             @if (document) {


@@ -37,11 +37,6 @@
   }
   <div class="card-body bg-light p-2">
-    @if (document?.in_process) {
-      <span class="badge bg-secondary text-light mb-2">
-        <div class="spinner-border spinner-border-sm me-1" role="status"></div><span i18n>Processing...</span>
-      </span>
-    }
     <p class="card-text">
       @if (document) {
         @if (displayFields.includes(DisplayField.CORRESPONDENT) && document.correspondent) {


@@ -301,11 +301,6 @@
   }
   @if (activeDisplayFields.includes(DisplayField.TITLE) || activeDisplayFields.includes(DisplayField.TAGS)) {
     <td width="30%">
-      @if (d.in_process) {
-        <span class="badge bg-secondary text-light me-1">
-          <div class="spinner-border spinner-border-sm me-1" role="status"></div><span i18n>Processing...</span>
-        </span>
-      }
       @if (activeDisplayFields.includes(DisplayField.TITLE)) {
         <div class="d-inline-block" (mouseleave)="popupPreview.close()">
           <a routerLink="/documents/{{d.id}}" title="Edit document" i18n-title style="overflow-wrap: anywhere;">{{d.title | documentTitle}}</a>


@@ -199,6 +199,14 @@ describe('DocumentListComponent', () => {
     }
     const queryParams = { id: view.id.toString() }
     const getSavedViewSpy = jest.spyOn(savedViewService, 'getCached')
+    const setCountSpy = jest.spyOn(savedViewService, 'setDocumentCount')
+    jest.spyOn(documentService, 'listFiltered').mockReturnValue(
+      of({
+        results: docs,
+        count: 3,
+        all: docs.map((d) => d.id),
+      })
+    )
     getSavedViewSpy.mockReturnValue(of(view))
     const activateSavedViewSpy = jest.spyOn(
       documentListService,
@@ -215,6 +223,7 @@ describe('DocumentListComponent', () => {
       view,
       convertToParamMap(queryParams)
     )
+    expect(setCountSpy).toHaveBeenCalledWith(view, 3)
   })
   it('should 404 on load saved view from URL if no view', () => {
@@ -248,6 +257,34 @@ describe('DocumentListComponent', () => {
     expect(getSavedViewSpy).toHaveBeenCalledWith(view.id)
   })
+  it('should update saved view document count on load saved view from query params', () => {
+    jest.spyOn(savedViewService, 'getCached').mockReturnValue(
+      of({
+        id: 10,
+        sort_field: 'added',
+        sort_reverse: true,
+        filter_rules: [],
+      })
+    )
+    jest.spyOn(documentService, 'listFiltered').mockReturnValue(
+      of({
+        results: docs,
+        count: 3,
+        all: docs.map((d) => d.id),
+      })
+    )
+    const setCountSpy = jest.spyOn(savedViewService, 'setDocumentCount')
+    jest.spyOn(documentService, 'listFiltered').mockReturnValue(
+      of({
+        results: docs,
+        count: 3,
+        all: docs.map((d) => d.id),
+      })
+    )
+    component.loadViewConfig(10)
+    expect(setCountSpy).toHaveBeenCalledWith(expect.any(Object), 3)
+  })
   it('should support 3 different display modes', () => {
     jest.spyOn(documentListService, 'documents', 'get').mockReturnValue(docs)
     fixture.detectChanges()


@@ -264,7 +264,9 @@ export class DocumentListComponent
           view,
           convertToParamMap(this.route.snapshot.queryParams)
         )
-        this.list.reload()
+        this.list.reload(() => {
+          this.savedViewService.setDocumentCount(view, this.list.collectionSize)
+        })
         this.updateDisplayCustomFields()
         this.unmodifiedFilterRules = view.filter_rules
       })
@@ -399,7 +401,9 @@ export class DocumentListComponent
       .subscribe((view) => {
         this.unmodifiedSavedView = view
         this.list.activateSavedView(view)
-        this.list.reload()
+        this.list.reload(() => {
+          this.savedViewService.setDocumentCount(view, this.list.collectionSize)
+        })
       })
   }


@@ -159,8 +159,6 @@ export interface Document extends ObjectWithPermissions {
   page_count?: number
-  in_process?: boolean
   // Frontend only
   __changedFields?: string[]
 }


@@ -140,11 +140,15 @@ export class SavedViewService extends AbstractPaperlessService<SavedView> {
         )
         .pipe(takeUntil(this.unsubscribeNotifier))
         .subscribe((results: Results<Document>) => {
-          this.savedViewDocumentCounts.set(view.id, results.count)
+          this.setDocumentCount(view, results.count)
         })
     })
   }
+  public setDocumentCount(view: SavedView, count: number) {
+    this.savedViewDocumentCounts.set(view.id, count)
+  }
   public getDocumentCount(view: SavedView): number {
     return this.savedViewDocumentCounts.get(view.id)
   }


@@ -283,7 +283,6 @@ def rotate(doc_ids: list[int], degrees: int) -> Literal["OK"]:
 f"Attempting to rotate {len(doc_ids)} documents by {degrees} degrees.",
 )
 qs = Document.objects.filter(id__in=doc_ids)
-Document.objects.filter(pk__in=doc_ids).update(in_process=True)
 affected_docs: list[int] = []
 import pikepdf
@@ -310,9 +309,7 @@ def rotate(doc_ids: list[int], degrees: int) -> Literal["OK"]:
 f"Rotated document {doc.id} by {degrees} degrees",
 )
 affected_docs.append(doc.id)
-Document.objects.filter(pk__in=doc_ids).update(in_process=False)
 except Exception as e:
-Document.objects.filter(pk__in=doc_ids).update(in_process=False)
 logger.exception(f"Error rotating document {doc.id}: {e}")
 if len(affected_docs) > 0:
@@ -477,7 +474,6 @@ def delete_pages(doc_ids: list[int], pages: list[int]) -> Literal["OK"]:
 f"Attempting to delete pages {pages} from {len(doc_ids)} documents",
 )
 doc = Document.objects.get(id=doc_ids[0])
-Document.objects.filter(pk=doc.id).update(in_process=True)
 pages = sorted(pages) # sort pages to avoid index issues
 import pikepdf
@@ -496,7 +492,6 @@ def delete_pages(doc_ids: list[int], pages: list[int]) -> Literal["OK"]:
 update_document_content_maybe_archive_file.delay(document_id=doc.id)
 logger.info(f"Deleted pages {pages} from document {doc.id}")
 except Exception as e:
-Document.objects.filter(pk=doc.id).update(in_process=False)
 logger.exception(f"Error deleting pages from document {doc.id}: {e}")
 return "OK"
@@ -523,7 +518,6 @@ def edit_pdf(
 f"Editing PDF of document {doc_ids[0]} with {len(operations)} operations",
 )
 doc = Document.objects.get(id=doc_ids[0])
-Document.objects.filter(pk=doc.id).update(in_process=True)
 import pikepdf
 pdf_docs: list[pikepdf.Pdf] = []
@@ -593,7 +587,6 @@ def edit_pdf(
 except Exception as e:
 logger.exception(f"Error editing document {doc.id}: {e}")
-Document.objects.filter(pk=doc.id).update(in_process=False)
 raise ValueError(
 f"An error occurred while editing the document: {e}",
 ) from e


@@ -1,23 +0,0 @@
-# Generated by Django 5.2.5 on 2025-08-26 07:54
-from django.db import migrations
-from django.db import models
-class Migration(migrations.Migration):
-    dependencies = [
-        ("documents", "1068_alter_document_created"),
-    ]
-    operations = [
-        migrations.AddField(
-            model_name="document",
-            name="in_process",
-            field=models.BooleanField(
-                db_index=True,
-                default=False,
-                help_text="Whether the document is currently being processed.",
-                verbose_name="in process",
-            ),
-        ),
-    ]


@@ -289,13 +289,6 @@ class Document(SoftDeleteModel, ModelWithOwner):
         ),
     )
-    in_process = models.BooleanField(
-        _("in process"),
-        default=False,
-        db_index=True,
-        help_text=_("Whether the document is currently being processed."),
-    )
     class Meta:
         ordering = ("-created",)
         verbose_name = _("document")


@@ -935,8 +935,6 @@ class DocumentSerializer(
         required=False,
     )
-    in_process = serializers.BooleanField(read_only=True)
     def get_page_count(self, obj) -> int | None:
         return obj.page_count
@@ -1105,7 +1103,6 @@ class DocumentSerializer(
             "remove_inbox_tags",
             "page_count",
             "mime_type",
-            "in_process",
         )
         list_serializer_class = OwnedObjectListSerializer


@@ -250,7 +250,6 @@ def update_document_content_maybe_archive_file(document_id):
 it exists.
 """
 document = Document.objects.get(id=document_id)
-Document.objects.filter(pk=document_id).update(in_process=True)
 mime_type = document.mime_type
@@ -350,7 +349,6 @@ def update_document_content_maybe_archive_file(document_id):
 )
 finally:
 parser.cleanup()
-Document.objects.filter(pk=document_id).update(in_process=False)
 @shared_task


@@ -324,6 +324,7 @@ INSTALLED_APPS = [
     "paperless_tesseract.apps.PaperlessTesseractConfig",
     "paperless_text.apps.PaperlessTextConfig",
     "paperless_mail.apps.PaperlessMailConfig",
+    "paperless_remote.apps.PaperlessRemoteParserConfig",
     "django.contrib.admin",
     "rest_framework",
     "rest_framework.authtoken",
@@ -1443,3 +1444,10 @@ WEBHOOKS_ALLOW_INTERNAL_REQUESTS = __get_boolean(
     "PAPERLESS_WEBHOOKS_ALLOW_INTERNAL_REQUESTS",
     "true",
 )
+###############################################################################
+# Remote Parser                                                               #
+###############################################################################
+REMOTE_OCR_ENGINE = os.getenv("PAPERLESS_REMOTE_OCR_ENGINE")
+REMOTE_OCR_API_KEY = os.getenv("PAPERLESS_REMOTE_OCR_API_KEY")
+REMOTE_OCR_ENDPOINT = os.getenv("PAPERLESS_REMOTE_OCR_ENDPOINT")
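
Review note: the three `REMOTE_OCR_*` settings are read with `os.getenv` and no default, so any variable that is not exported ends up as `None`. A minimal sketch of that behaviour follows. The helper `load_remote_ocr_settings` and the example values are hypothetical and not part of this change; the endpoint string is just the placeholder used in the tests.

```python
import os


def load_remote_ocr_settings(env=os.environ):
    """Hypothetical helper mirroring the three os.getenv() calls in settings.py."""
    return {
        "REMOTE_OCR_ENGINE": env.get("PAPERLESS_REMOTE_OCR_ENGINE"),
        "REMOTE_OCR_API_KEY": env.get("PAPERLESS_REMOTE_OCR_API_KEY"),
        "REMOTE_OCR_ENDPOINT": env.get("PAPERLESS_REMOTE_OCR_ENDPOINT"),
    }


# Example values only (assumptions for illustration, not real credentials).
example_env = {
    "PAPERLESS_REMOTE_OCR_ENGINE": "azureai",
    "PAPERLESS_REMOTE_OCR_API_KEY": "somekey",
    "PAPERLESS_REMOTE_OCR_ENDPOINT": "https://endpoint.cognitiveservices.azure.com",
}

configured = load_remote_ocr_settings(example_env)
# With nothing exported, every setting is None and the remote parser stays inactive.
unconfigured = load_remote_ocr_settings({})
```

Passing the mapping in explicitly keeps the sketch testable without touching the real process environment.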


@@ -0,0 +1,4 @@
+# this is here so that django finds the checks.
+from paperless_remote.checks import check_remote_parser_configured
+__all__ = ["check_remote_parser_configured"]


@@ -0,0 +1,14 @@
+from django.apps import AppConfig
+from paperless_remote.signals import remote_consumer_declaration
+class PaperlessRemoteParserConfig(AppConfig):
+    name = "paperless_remote"
+    def ready(self):
+        from documents.signals import document_consumer_declaration
+        document_consumer_declaration.connect(remote_consumer_declaration)
+        AppConfig.ready(self)


@@ -0,0 +1,15 @@
+from django.conf import settings
+from django.core.checks import Error
+from django.core.checks import register
+@register()
+def check_remote_parser_configured(app_configs, **kwargs):
+    if settings.REMOTE_OCR_ENGINE == "azureai" and not settings.REMOTE_OCR_ENDPOINT:
+        return [
+            Error(
+                "Azure AI remote parser requires endpoint to be configured.",
+            ),
+        ]
+    return []


@@ -0,0 +1,113 @@
+from pathlib import Path
+from django.conf import settings
+from paperless_tesseract.parsers import RasterisedDocumentParser
+class RemoteEngineConfig:
+    def __init__(
+        self,
+        engine: str,
+        api_key: str | None = None,
+        endpoint: str | None = None,
+    ):
+        self.engine = engine
+        self.api_key = api_key
+        self.endpoint = endpoint
+    def engine_is_valid(self):
+        valid = self.engine in ["azureai"] and self.api_key is not None
+        if self.engine == "azureai":
+            valid = valid and self.endpoint is not None
+        return valid
+class RemoteDocumentParser(RasterisedDocumentParser):
+    """
+    This parser uses a remote OCR engine to parse documents. Currently, it supports Azure AI Vision
+    as this is the only service that provides a remote OCR API with text-embedded PDF output.
+    """
+    logging_name = "paperless.parsing.remote"
+    def get_settings(self) -> RemoteEngineConfig:
+        """
+        Returns the configuration for the remote OCR engine, loaded from Django settings.
+        """
+        return RemoteEngineConfig(
+            engine=settings.REMOTE_OCR_ENGINE,
+            api_key=settings.REMOTE_OCR_API_KEY,
+            endpoint=settings.REMOTE_OCR_ENDPOINT,
+        )
+    def supported_mime_types(self):
+        if self.settings.engine_is_valid():
+            return {
+                "application/pdf": ".pdf",
+                "image/png": ".png",
+                "image/jpeg": ".jpg",
+                "image/tiff": ".tiff",
+                "image/bmp": ".bmp",
+                "image/gif": ".gif",
+                "image/webp": ".webp",
+            }
+        else:
+            return {}
+    def azure_ai_vision_parse(
+        self,
+        file: Path,
+    ) -> str | None:
+        """
+        Uses Azure AI Vision to parse the document and return the text content.
+        It requests a searchable PDF output with embedded text.
+        The PDF is saved to the archive_path attribute.
+        Returns the text content extracted from the document.
+        If the parsing fails, it returns None.
+        """
+        from azure.ai.documentintelligence import DocumentIntelligenceClient
+        from azure.ai.documentintelligence.models import AnalyzeDocumentRequest
+        from azure.ai.documentintelligence.models import AnalyzeOutputOption
+        from azure.ai.documentintelligence.models import DocumentContentFormat
+        from azure.core.credentials import AzureKeyCredential
+        client = DocumentIntelligenceClient(
+            endpoint=self.settings.endpoint,
+            credential=AzureKeyCredential(self.settings.api_key),
+        )
+        with file.open("rb") as f:
+            analyze_request = AnalyzeDocumentRequest(bytes_source=f.read())
+            poller = client.begin_analyze_document(
+                model_id="prebuilt-read",
+                body=analyze_request,
+                output_content_format=DocumentContentFormat.TEXT,
+                output=[AnalyzeOutputOption.PDF],  # request searchable PDF output
+                content_type="application/json",
+            )
+        poller.wait()
+        result_id = poller.details["operation_id"]
+        result = poller.result()
+        # Download the PDF with embedded text
+        self.archive_path = Path(self.tempdir) / "archive.pdf"
+        with self.archive_path.open("wb") as f:
+            for chunk in client.get_analyze_result_pdf(
+                model_id="prebuilt-read",
+                result_id=result_id,
+            ):
+                f.write(chunk)
+        return result.content
+    def parse(self, document_path: Path, mime_type, file_name=None):
+        if not self.settings.engine_is_valid():
+            self.log.warning(
+                "No valid remote parser engine is configured, content will be empty.",
+            )
+            self.text = ""
+            return
+        elif self.settings.engine == "azureai":
+            self.text = self.azure_ai_vision_parse(document_path)
View File

@@ -0,0 +1,18 @@
+def get_parser(*args, **kwargs):
+    from paperless_remote.parsers import RemoteDocumentParser
+    return RemoteDocumentParser(*args, **kwargs)
+def get_supported_mime_types():
+    from paperless_remote.parsers import RemoteDocumentParser
+    return RemoteDocumentParser(None).supported_mime_types()
+def remote_consumer_declaration(sender, **kwargs):
+    return {
+        "parser": get_parser,
+        "weight": 5,
+        "mime_types": get_supported_mime_types(),
+    }


Binary file not shown.


@@ -0,0 +1,29 @@
+from django.test import TestCase
+from django.test import override_settings
+from paperless_remote import check_remote_parser_configured
+class TestChecks(TestCase):
+    @override_settings(REMOTE_OCR_ENGINE=None)
+    def test_no_engine(self):
+        msgs = check_remote_parser_configured(None)
+        self.assertEqual(len(msgs), 0)
+    @override_settings(REMOTE_OCR_ENGINE="azureai")
+    @override_settings(REMOTE_OCR_API_KEY="somekey")
+    @override_settings(REMOTE_OCR_ENDPOINT=None)
+    def test_azure_no_endpoint(self):
+        msgs = check_remote_parser_configured(None)
+        self.assertEqual(len(msgs), 1)
+        self.assertTrue(
+            msgs[0].msg.startswith(
+                "Azure AI remote parser requires endpoint to be configured.",
+            ),
+        )
+    @override_settings(REMOTE_OCR_ENGINE="something")
+    @override_settings(REMOTE_OCR_API_KEY="somekey")
+    def test_valid_configuration(self):
+        msgs = check_remote_parser_configured(None)
+        self.assertEqual(len(msgs), 0)


@@ -0,0 +1,101 @@
+import uuid
+from pathlib import Path
+from unittest import mock
+from django.test import TestCase
+from django.test import override_settings
+from documents.tests.utils import DirectoriesMixin
+from documents.tests.utils import FileSystemAssertsMixin
+from paperless_remote.parsers import RemoteDocumentParser
+from paperless_remote.signals import get_parser
+class TestParser(DirectoriesMixin, FileSystemAssertsMixin, TestCase):
+    SAMPLE_FILES = Path(__file__).resolve().parent / "samples"
+    def assertContainsStrings(self, content, strings):
+        # Asserts that all strings appear in content, in the given order.
+        indices = []
+        for s in strings:
+            if s in content:
+                indices.append(content.index(s))
+            else:
+                self.fail(f"'{s}' is not in '{content}'")
+        self.assertListEqual(indices, sorted(indices))
+    @mock.patch("paperless_tesseract.parsers.run_subprocess")
+    @mock.patch("azure.ai.documentintelligence.DocumentIntelligenceClient")
+    def test_get_text_with_azure(self, mock_client_cls, mock_subprocess):
+        # Arrange mock Azure client
+        mock_client = mock.Mock()
+        mock_client_cls.return_value = mock_client
+        # Simulate poller result and its `.details`
+        mock_poller = mock.Mock()
+        mock_poller.wait.return_value = None
+        mock_poller.details = {"operation_id": "fake-op-id"}
+        mock_client.begin_analyze_document.return_value = mock_poller
+        mock_poller.result.return_value.content = "This is a test document."
+        # Return dummy PDF bytes
+        mock_client.get_analyze_result_pdf.return_value = [
+            b"%PDF-",
+            b"1.7 ",
+            b"FAKEPDF",
+        ]
+        # Simulate pdftotext by writing dummy text to sidecar file
+        def fake_run(cmd, *args, **kwargs):
+            with Path(cmd[-1]).open("w", encoding="utf-8") as f:
+                f.write("This is a test document.")
+        mock_subprocess.side_effect = fake_run
+        with override_settings(
+            REMOTE_OCR_ENGINE="azureai",
+            REMOTE_OCR_API_KEY="somekey",
+            REMOTE_OCR_ENDPOINT="https://endpoint.cognitiveservices.azure.com",
+        ):
+            parser = get_parser(uuid.uuid4())
+            parser.parse(
+                self.SAMPLE_FILES / "simple-digital.pdf",
+                "application/pdf",
+            )
+            self.assertContainsStrings(
+                parser.text.strip(),
+                ["This is a test document."],
+            )
+    @override_settings(
+        REMOTE_OCR_ENGINE="azureai",
+        REMOTE_OCR_API_KEY="key",
+        REMOTE_OCR_ENDPOINT="https://endpoint.cognitiveservices.azure.com",
+    )
+    def test_supported_mime_types_valid_config(self):
+        parser = RemoteDocumentParser(uuid.uuid4())
+        expected_types = {
+            "application/pdf": ".pdf",
+            "image/png": ".png",
+            "image/jpeg": ".jpg",
+            "image/tiff": ".tiff",
+            "image/bmp": ".bmp",
+            "image/gif": ".gif",
+            "image/webp": ".webp",
+        }
+        self.assertEqual(parser.supported_mime_types(), expected_types)
+    def test_supported_mime_types_invalid_config(self):
+        parser = get_parser(uuid.uuid4())
+        self.assertEqual(parser.supported_mime_types(), {})
+    @override_settings(
+        REMOTE_OCR_ENGINE=None,
+        REMOTE_OCR_API_KEY=None,
+        REMOTE_OCR_ENDPOINT=None,
+    )
+    def test_parse_with_invalid_config(self):
+        parser = get_parser(uuid.uuid4())
+        parser.parse(self.SAMPLE_FILES / "simple-digital.pdf", "application/pdf")
+        self.assertEqual(parser.text, "")
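
Review note: `assertContainsStrings` compares the first occurrence of each string, so it can report a failure when an earlier string also appears later in the content. A standalone sketch of the same check, assuming a hypothetical function name `contains_in_order` that returns a boolean instead of failing a test:

```python
def contains_in_order(content, strings):
    """Mirror of the test helper: every string is present and the
    first-occurrence positions are already in ascending order."""
    indices = []
    for s in strings:
        if s not in content:
            return False
        indices.append(content.index(s))
    return indices == sorted(indices)


in_order = contains_in_order("alpha beta gamma", ["alpha", "gamma"])
reversed_order = contains_in_order("alpha beta gamma", ["gamma", "alpha"])
missing = contains_in_order("alpha beta gamma", ["delta"])
# "a" is followed by a "b", but the first "b" comes before "a",
# so the first-occurrence comparison rejects it.
repeated = contains_in_order("b a b", ["a", "b"])
```

For the single-string assertion used in this test file the limitation does not matter; it would only show up with repeated substrings.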

uv.lock generated

@@ -95,6 +95,34 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/af/cc/55a32a2c98022d88812b5986d2a92c4ff3ee087e83b712ebc703bba452bf/Automat-24.8.1-py3-none-any.whl", hash = "sha256:bf029a7bc3da1e2c24da2343e7598affaa9f10bf0ab63ff808566ce90551e02a", size = 42585, upload-time = "2024-08-19T17:31:56.729Z" },
 ]
+[[package]]
+name = "azure-ai-documentintelligence"
+version = "1.0.2"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "azure-core", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
+    { name = "isodate", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
+    { name = "typing-extensions", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/44/7b/8115cd713e2caa5e44def85f2b7ebd02a74ae74d7113ba20bdd41fd6dd80/azure_ai_documentintelligence-1.0.2.tar.gz", hash = "sha256:4d75a2513f2839365ebabc0e0e1772f5601b3a8c9a71e75da12440da13b63484", size = 170940 }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/d9/75/c9ec040f23082f54ffb1977ff8f364c2d21c79a640a13d1c1809e7fd6b1a/azure_ai_documentintelligence-1.0.2-py3-none-any.whl", hash = "sha256:e1fb446abbdeccc9759d897898a0fe13141ed29f9ad11fc705f951925822ed59", size = 106005 },
+]
+[[package]]
+name = "azure-core"
+version = "1.33.0"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "requests", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
+    { name = "six", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
+    { name = "typing-extensions", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/75/aa/7c9db8edd626f1a7d99d09ef7926f6f4fb34d5f9fa00dc394afdfe8e2a80/azure_core-1.33.0.tar.gz", hash = "sha256:f367aa07b5e3005fec2c1e184b882b0b039910733907d001c20fb08ebb8c0eb9", size = 295633 }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/07/b7/76b7e144aa53bd206bf1ce34fa75350472c3f69bf30e5c8c18bc9881035d/azure_core-1.33.0-py3-none-any.whl", hash = "sha256:9b5b6d0223a1d38c37500e6971118c1e0f13f54951e6893968b38910bc9cda8f", size = 207071 },
+]
 [[package]]
 name = "babel"
 version = "2.17.0"
@@ -1402,6 +1430,15 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/c7/fc/4e5a141c3f7c7bed550ac1f69e599e92b6be449dd4677ec09f325cad0955/inotifyrecursive-0.3.5-py3-none-any.whl", hash = "sha256:7e5f4a2e1dc2bef0efa3b5f6b339c41fb4599055a2b54909d020e9e932cc8d2f", size = 8009, upload-time = "2020-11-20T12:38:46.981Z" },
 ]
+[[package]]
+name = "isodate"
+version = "0.7.2"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/54/4d/e940025e2ce31a8ce1202635910747e5a87cc3a6a6bb2d00973375014749/isodate-0.7.2.tar.gz", hash = "sha256:4cd1aa0f43ca76f4a6c6c0292a85f40b35ec2e43e315b59f06e6d32171a953e6", size = 29705 }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/15/aa/0aca39a37d3c7eb941ba736ede56d689e7be91cab5d9ca846bde3999eba6/isodate-0.7.2-py3-none-any.whl", hash = "sha256:28009937d8031054830160fce6d409ed342816b543597cece116d966c6d99e15", size = 22320 },
+]
 [[package]]
 name = "jinja2"
 version = "3.1.6"
@@ -2010,6 +2047,7 @@ name = "paperless-ngx"
 version = "2.18.2"
 source = { virtual = "." }
 dependencies = [
+    { name = "azure-ai-documentintelligence", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
     { name = "babel", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
     { name = "bleach", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
     { name = "celery", extra = ["redis"], marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
@@ -2144,6 +2182,7 @@ typing = [
 [package.metadata]
 requires-dist = [
+    { name = "azure-ai-documentintelligence", specifier = ">=1.0.2" },
     { name = "babel", specifier = ">=2.17" },
     { name = "bleach", specifier = "~=6.2.0" },
     { name = "celery", extras = ["redis"], specifier = "~=5.5.1" },