Merge branch 'dev' into feature-remote-ocr-2

Revert "Update ci.yml"
This reverts commit be0c1fd1ed.
2025-09-06 21:13:43 -05:00 · 2025-08-22 08:54:26 -07:00 · 2025-08-22 08:46:01 -07:00 · 2025-08-22 08:45:33 -07:00 · 2025-08-21 21:44:41 +00:00 · 2025-08-21 21:14:25 +00:00
35 changed files with 494 additions and 472 deletions
--- a/.codecov.yml
+++ b/.codecov.yml
@@ -10,10 +10,8 @@ component_management:
      paths:
        - src-ui/**
 # https://docs.codecov.com/docs/pull-request-comments
-# codecov will only comment if coverage changes
 comment:
  layout: "header, diff, components, flags, files"
-  require_changes: true
  # https://docs.codecov.com/docs/javascript-bundle-analysis
  require_bundle_changes: true
  bundle_change_threshold: "50Kb"
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1800,3 +1800,23 @@ password. All of these options come from their similarly-named [Django settings]
 #### [`PAPERLESS_EMAIL_USE_SSL=<bool>`](#PAPERLESS_EMAIL_USE_SSL) {#PAPERLESS_EMAIL_USE_SSL}

 : Defaults to false.
+
+## Remote OCR
+
+#### [`PAPERLESS_REMOTE_OCR_ENGINE=<str>`](#PAPERLESS_REMOTE_OCR_ENGINE) {#PAPERLESS_REMOTE_OCR_ENGINE}
+
+: The remote OCR engine to use. Currently, only Azure AI is supported, selected with the value "azureai".
+
+    Defaults to None, which disables remote OCR.
+
+#### [`PAPERLESS_REMOTE_OCR_API_KEY=<str>`](#PAPERLESS_REMOTE_OCR_API_KEY) {#PAPERLESS_REMOTE_OCR_API_KEY}
+
+: The API key to use for the remote OCR engine.
+
+    Defaults to None.
+
+#### [`PAPERLESS_REMOTE_OCR_ENDPOINT=<str>`](#PAPERLESS_REMOTE_OCR_ENDPOINT) {#PAPERLESS_REMOTE_OCR_ENDPOINT}
+
+: The endpoint to use for the remote OCR engine. This is required for Azure AI.
+
+    Defaults to None.
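
The three settings documented in the hunk above work together, so a small sketch may help. This is an illustration only, not Paperless-ngx's actual settings code: the helper names `read_remote_ocr_config` and `remote_ocr_enabled` are hypothetical. The rule that "azureai" needs both an API key and an endpoint is an assumption, since the docs only state that the endpoint is required.

```python
import os


def read_remote_ocr_config(env=None):
    """Collect the three documented variables; unset or empty values become None."""
    env = os.environ if env is None else env
    return {
        "engine": env.get("PAPERLESS_REMOTE_OCR_ENGINE") or None,
        "api_key": env.get("PAPERLESS_REMOTE_OCR_API_KEY") or None,
        "endpoint": env.get("PAPERLESS_REMOTE_OCR_ENDPOINT") or None,
    }


def remote_ocr_enabled(config):
    """Hypothetical check: remote OCR is on only for a fully configured "azureai"."""
    if config["engine"] != "azureai":
        # Engine defaults to None, which the docs say disables remote OCR.
        return False
    # Assumption: both the key and the endpoint are needed for Azure AI.
    return bool(config["api_key"] and config["endpoint"])
```

With an empty environment every value is None and the check returns False, which matches the documented opt-in default.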
--- a/docs/index.md
+++ b/docs/index.md
@@ -25,9 +25,10 @@ physical documents into a searchable online archive so you can keep, well, _less
 ## Features

 -   **Organize and index** your scanned documents with tags, correspondents, types, and more.
-   _Your_ data is stored locally on _your_ server and is never transmitted or shared in any way.
+-   _Your_ data is stored locally on _your_ server and is never transmitted or shared in any way, unless you explicitly choose to do so.
 -   Performs **OCR** on your documents, adding searchable and selectable text, even to documents scanned with only images.
-   Utilizes the open-source Tesseract engine to recognize more than 100 languages.
+    -   Utilizes the open-source Tesseract engine to recognize more than 100 languages.
+    -   _New!_ Supports remote OCR with Azure AI (opt-in).
 -   Documents are saved as PDF/A format which is designed for long term storage, alongside the unaltered originals.
 -   Uses machine-learning to automatically add tags, correspondents and document types to your documents.
 -   Supports PDF documents, images, plain text files, Office documents (Word, Excel, PowerPoint, and LibreOffice equivalents)[^1] and more.
--- a/docs/troubleshooting.md
+++ b/docs/troubleshooting.md
@@ -33,7 +33,7 @@ warns that
 `OCR for XX failed, but we're going to stick with what we've got since FORGIVING_OCR is enabled`,
 then you might need to install the [Tesseract language
 files](https://packages.ubuntu.com/search?keywords=tesseract-ocr)
-marching your document's languages.
+matching your document's languages.

 As an example, if you are running Paperless-ngx from any Ubuntu or
 Debian box, and your documents are written in Spanish you may need to
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -850,6 +850,18 @@ how regularly you intend to scan documents and use paperless.
    performed the task associated with the document, move it to the
    inbox.

+## Remote OCR
+
+!!! important
+
+    This feature is disabled by default and will always remain strictly "opt-in".
+
+Paperless-ngx supports performing OCR on documents using remote services. At the moment, this is limited to
+[Microsoft's Azure "Document Intelligence" service](https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence).
+This is a paid service (with a free tier) that requires an Azure account and subscription. Azure AI is not affiliated with
+Paperless-ngx in any way. When enabled, Paperless-ngx will automatically send appropriate documents to Azure for OCR processing, bypassing
+the local OCR engine. See the [configuration](configuration.md#PAPERLESS_REMOTE_OCR_ENGINE) options for more details.
+
 ## Architecture

 Paperless-ngx consists of the following components:
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -15,6 +15,7 @@ classifiers = [
 # This will allow testing to not install a webserver, mysql, etc

 dependencies = [
+  "azure-ai-documentintelligence>=1.0.2",
  "babel>=2.17",
  "bleach~=6.2.0",
  "celery[redis]~=5.5.1",
@@ -239,6 +240,7 @@ testpaths = [
  "src/paperless_tesseract/tests/",
  "src/paperless_tika/tests",
  "src/paperless_text/tests/",
+  "src/paperless_remote/tests/",
 ]
 addopts = [
  "--pythonwarnings=all",
--- a/src-ui/src/app/components/common/dates-dropdown/dates-dropdown.component.html
+++ b/src-ui/src/app/components/common/dates-dropdown/dates-dropdown.component.html
@@ -11,7 +11,7 @@
        <div class="selected-icon">
          @if (createdRelativeDate) {
            <a class="text-light focus-variants" href="javascript:void(0)" (click)="clearCreatedRelativeDate()">
-              <i-bs width="1em" height="1em" name="check" class="variant-unfocused"></i-bs>
+              <i-bs width="1em" height="1em" name="check" class="variant-unfocused text-dark"></i-bs>
              <i-bs width="1em" height="1em" name="x" class="variant-focused text-primary"></i-bs>
            </a>
          }
--- a/src-ui/src/app/components/common/edit-dialog/tag-edit-dialog/tag-edit-dialog.component.html
+++ b/src-ui/src/app/components/common/edit-dialog/tag-edit-dialog/tag-edit-dialog.component.html
@@ -12,8 +12,6 @@

    <pngx-input-color i18n-title title="Color" formControlName="color" [error]="error?.color"></pngx-input-color>

-    <pngx-input-select i18n-title title="Parent" formControlName="parent" [items]="tags" [allowNull]="true" [error]="error?.parent"></pngx-input-select>
-
    <pngx-input-check i18n-title title="Inbox tag" formControlName="is_inbox_tag" i18n-hint hint="Inbox tags are automatically assigned to all consumed documents."></pngx-input-check>
    <pngx-input-select i18n-title title="Matching algorithm" [items]="getMatchingAlgorithms()" formControlName="matching_algorithm"></pngx-input-select>
    @if (patternRequired) {
--- a/src-ui/src/app/components/common/edit-dialog/tag-edit-dialog/tag-edit-dialog.component.ts
+++ b/src-ui/src/app/components/common/edit-dialog/tag-edit-dialog/tag-edit-dialog.component.ts
@@ -35,16 +35,11 @@ import { TextComponent } from '../../input/text/text.component'
  ],
 })
 export class TagEditDialogComponent extends EditDialogComponent<Tag> {
-  tags: Tag[]
-
  constructor() {
    super()
    this.service = inject(TagService)
    this.userService = inject(UserService)
    this.settingsService = inject(SettingsService)
-    this.service.listAll().subscribe((result) => {
-      this.tags = result.results
-    })
  }

  getCreateTitle() {
@@ -60,7 +55,6 @@ export class TagEditDialogComponent extends EditDialogComponent<Tag> {
      name: new FormControl(''),
      color: new FormControl(randomColor()),
      is_inbox_tag: new FormControl(false),
-      parent: new FormControl(null),
      matching_algorithm: new FormControl(DEFAULT_MATCHING_ALGORITHM),
      match: new FormControl(''),
      is_insensitive: new FormControl(true),
--- a/src-ui/src/app/components/common/input/tags/tags.component.html
+++ b/src-ui/src/app/components/common/input/tags/tags.component.html
@@ -7,14 +7,13 @@
      <div class="input-group flex-nowrap">
        <ng-select #tagSelect name="tags" [items]="tags" bindLabel="name" bindValue="id" [(ngModel)]="value"
          [disabled]="disabled"
-          [multiple]="multiple"
+          [multiple]="true"
          [closeOnSelect]="false"
          [clearSearchOnAdd]="true"
          [hideSelected]="tags.length > 0"
          [addTag]="allowCreate ? createTagRef : false"
          addTagText="Add tag"
          i18n-addTagText
-          (add)="onAdd($event)"
          (change)="onChange(value)">

          <ng-template ng-label-tmp let-item="item">
--- a/src-ui/src/app/components/common/input/tags/tags.component.ts
+++ b/src-ui/src/app/components/common/input/tags/tags.component.ts
@@ -100,9 +100,6 @@ export class TagsComponent implements OnInit, ControlValueAccessor {
  @Input()
  horizontal: boolean = false

-  @Input()
-  multiple: boolean = true
-
  @Output()
  filterDocuments = new EventEmitter<Tag[]>()

@@ -127,40 +124,13 @@ export class TagsComponent implements OnInit, ControlValueAccessor {

    let index = this.value.indexOf(tagID)
    if (index > -1) {
-      const tag = this.getTag(tagID)
-
-      // remove tag
      let oldValue = this.value
      oldValue.splice(index, 1)
-
-      // remove children
-      oldValue = this.removeChildren(oldValue, tag)
-
      this.value = [...oldValue]
      this.onChange(this.value)
    }
  }

-  private removeChildren(tagIDs: number[], tag: Tag) {
-    if (tag.children?.length) {
-      const childIDs = tag.children.map((child) => child.id)
-      tagIDs = tagIDs.filter((id) => !childIDs.includes(id))
-      for (const child of tag.children) {
-        tagIDs = this.removeChildren(tagIDs, child)
-      }
-    }
-    return tagIDs
-  }
-
-  public onAdd(tag: Tag) {
-    if (tag.parent) {
-      // add all parents recursively
-      const parent = this.getTag(tag.parent)
-      this.value = [...this.value, parent.id]
-      this.onAdd(parent)
-    }
-  }
-
  createTag(name: string = null, add: boolean = false) {
    var modal = this.modalService.open(TagEditDialogComponent, {
      backdrop: 'static',
--- a/src-ui/src/app/components/manage/management-list/management-list.component.html
+++ b/src-ui/src/app/components/manage/management-list/management-list.component.html
@@ -54,7 +54,61 @@
        </tr>
      }
      @for (object of data; track object) {
-        <ng-container [ngTemplateOutlet]="objectRow" [ngTemplateOutletContext]="{ object: object, depth: 0 }"></ng-container>
+        <tr (click)="toggleSelected(object); $event.stopPropagation();" class="data-row fade" [class.show]="show">
+          <td>
+            <div class="form-check m-0 ms-2 me-n2">
+              <input type="checkbox" class="form-check-input" id="{{typeName}}{{object.id}}" [checked]="selectedObjects.has(object.id)" (click)="toggleSelected(object); $event.stopPropagation();">
+              <label class="form-check-label" for="{{typeName}}{{object.id}}"></label>
+            </div>
+          </td>
+          <td scope="row"><button class="btn btn-link ms-0 ps-0 text-start" (click)="userCanEdit(object) ? openEditDialog(object) : null; $event.stopPropagation()">{{ object.name }}</button> </td>
+          <td scope="row" class="d-none d-sm-table-cell">{{ getMatching(object) }}</td>
+          <td scope="row">{{ object.document_count }}</td>
+          @for (column of extraColumns; track column) {
+            <td scope="row" [ngClass]="{ 'd-none d-sm-table-cell' : column.hideOnMobile }">
+              @if (column.rendersHtml) {
+                <div [innerHtml]="column.valueFn.call(null, object) | safeHtml"></div>
+              } @else if (column.monospace) {
+                <span class="font-monospace">{{ column.valueFn.call(null, object) }}</span>
+              } @else {
+                {{ column.valueFn.call(null, object) }}
+              }
+            </td>
+          }
+          <td scope="row">
+            <div class="btn-toolbar gap-2">
+              <div class="btn-group d-block d-sm-none">
+                <div ngbDropdown container="body" class="d-inline-block">
+                  <button type="button" class="btn btn-link" id="actionsMenuMobile" (click)="$event.stopPropagation()" ngbDropdownToggle>
+                    <i-bs name="three-dots-vertical"></i-bs>
+                  </button>
+                  <div ngbDropdownMenu aria-labelledby="actionsMenuMobile">
+                    <button (click)="openEditDialog(object)" *pngxIfPermissions="{ action: PermissionAction.Change, type: permissionType }" ngbDropdownItem i18n>Edit</button>
+                    <button class="text-danger" (click)="openDeleteDialog(object)" *pngxIfPermissions="{ action: PermissionAction.Delete, type: permissionType }" ngbDropdownItem i18n>Delete</button>
+                    @if (object.document_count > 0) {
+                      <button (click)="filterDocuments(object)" *pngxIfPermissions="{ action: PermissionAction.View, type: PermissionType.Document }" ngbDropdownItem i18n>Filter Documents ({{ object.document_count }})</button>
+                    }
+                  </div>
+                </div>
+              </div>
+              <div class="btn-group d-none d-sm-inline-block">
+                <button class="btn btn-sm btn-outline-secondary" (click)="openEditDialog(object); $event.stopPropagation();" *pngxIfPermissions="{ action: PermissionAction.Change, type: permissionType }" [disabled]="!userCanEdit(object)">
+                  <i-bs width="1em" height="1em" name="pencil"></i-bs>&nbsp;<ng-container i18n>Edit</ng-container>
+                </button>
+                <button class="btn btn-sm btn-outline-danger" (click)="openDeleteDialog(object); $event.stopPropagation();" *pngxIfPermissions="{ action: PermissionAction.Delete, type: permissionType }" [disabled]="!userCanDelete(object)">
+                  <i-bs width="1em" height="1em" name="trash"></i-bs>&nbsp;<ng-container i18n>Delete</ng-container>
+                </button>
+              </div>
+              @if (object.document_count > 0) {
+                <div class="btn-group d-none d-sm-inline-block">
+                  <button class="btn btn-sm btn-outline-secondary" (click)="filterDocuments(object); $event.stopPropagation();" *pngxIfPermissions="{ action: PermissionAction.View, type: PermissionType.Document }">
+                    <i-bs width="1em" height="1em" name="filter"></i-bs>&nbsp;<ng-container i18n>Documents</ng-container><span class="badge bg-light text-secondary ms-2">{{ object.document_count }}</span>
+                  </button>
+                </div>
+              }
+            </div>
+          </td>
+        </tr>
      }
    </tbody>
  </table>
@@ -75,72 +129,3 @@
    }
  </div>
 }
-
-<ng-template #objectRow let-object="object" let-depth="depth">
-  <tr (click)="toggleSelected(object); $event.stopPropagation();" class="data-row fade" [class.show]="show">
-    <td>
-      <div class="form-check m-0 ms-2 me-n2">
-        <input type="checkbox" class="form-check-input" id="{{typeName}}{{object.id}}" [checked]="selectedObjects.has(object.id)" (click)="toggleSelected(object); $event.stopPropagation();">
-        <label class="form-check-label" for="{{typeName}}{{object.id}}"></label>
-      </div>
-    </td>
-    <td scope="row" class="name-cell" style="--depth: {{depth}}">
-      @if (depth > 0) {
-        <div class="indicator"></div>
-      }
-      <button class="btn btn-link ms-0 ps-0 text-start" (click)="userCanEdit(object) ? openEditDialog(object) : null; $event.stopPropagation()">{{ object.name }}</button>
-    </td>
-    <td scope="row" class="d-none d-sm-table-cell">{{ getMatching(object) }}</td>
-    <td scope="row">{{ getDocumentCount(object) }}</td>
-    @for (column of extraColumns; track column) {
-      <td scope="row" [ngClass]="{ 'd-none d-sm-table-cell' : column.hideOnMobile }">
-        @if (column.rendersHtml) {
-          <div [innerHtml]="column.valueFn.call(null, object) | safeHtml"></div>
-        } @else if (column.monospace) {
-          <span class="font-monospace">{{ column.valueFn.call(null, object) }}</span>
-        } @else {
-          {{ column.valueFn.call(null, object) }}
-        }
-      </td>
-    }
-    <td scope="row">
-      <div class="btn-toolbar gap-2">
-        <div class="btn-group d-block d-sm-none">
-          <div ngbDropdown container="body" class="d-inline-block">
-            <button type="button" class="btn btn-link" id="actionsMenuMobile" (click)="$event.stopPropagation()" ngbDropdownToggle>
-              <i-bs name="three-dots-vertical"></i-bs>
-            </button>
-            <div ngbDropdownMenu aria-labelledby="actionsMenuMobile">
-              <button (click)="openEditDialog(object)" *pngxIfPermissions="{ action: PermissionAction.Change, type: permissionType }" ngbDropdownItem i18n>Edit</button>
-              <button class="text-danger" (click)="openDeleteDialog(object)" *pngxIfPermissions="{ action: PermissionAction.Delete, type: permissionType }" ngbDropdownItem i18n>Delete</button>
-              @if (getDocumentCount(object) > 0) {
-                <button (click)="filterDocuments(object)" *pngxIfPermissions="{ action: PermissionAction.View, type: PermissionType.Document }" ngbDropdownItem i18n>Filter Documents ({{ getDocumentCount(object) }})</button>
-              }
-            </div>
-          </div>
-        </div>
-        <div class="btn-group d-none d-sm-inline-block">
-          <button class="btn btn-sm btn-outline-secondary" (click)="openEditDialog(object); $event.stopPropagation();" *pngxIfPermissions="{ action: PermissionAction.Change, type: permissionType }" [disabled]="!userCanEdit(object)">
-            <i-bs width="1em" height="1em" name="pencil"></i-bs>&nbsp;<ng-container i18n>Edit</ng-container>
-          </button>
-          <button class="btn btn-sm btn-outline-danger" (click)="openDeleteDialog(object); $event.stopPropagation();" *pngxIfPermissions="{ action: PermissionAction.Delete, type: permissionType }" [disabled]="!userCanDelete(object)">
-            <i-bs width="1em" height="1em" name="trash"></i-bs>&nbsp;<ng-container i18n>Delete</ng-container>
-          </button>
-        </div>
-        @if (getDocumentCount(object) > 0) {
-          <div class="btn-group d-none d-sm-inline-block">
-            <button class="btn btn-sm btn-outline-secondary" (click)="filterDocuments(object); $event.stopPropagation();" *pngxIfPermissions="{ action: PermissionAction.View, type: PermissionType.Document }">
-              <i-bs width="1em" height="1em" name="filter"></i-bs>&nbsp;<ng-container i18n>Documents</ng-container><span class="badge bg-light text-secondary ms-2">{{ getDocumentCount(object) }}</span>
-            </button>
-          </div>
-        }
-      </div>
-    </td>
-  </tr>
-
-  @if (object.children && object.children.length > 0) {
-    @for (child of object.children; track child) {
-      <ng-container [ngTemplateOutlet]="objectRow" [ngTemplateOutletContext]="{ object: child, depth: depth + 1 }"></ng-container>
-    }
-  }
-</ng-template>
--- a/src-ui/src/app/components/manage/management-list/management-list.component.scss
+++ b/src-ui/src/app/components/manage/management-list/management-list.component.scss
@@ -10,17 +10,3 @@ tbody tr:last-child td {
 .form-check {
    min-height: 0;
 }
-
-td.name-cell {
-    padding-left: calc(calc(var(--depth) - 1) * 1.1rem);
-
-    .indicator {
-        display: inline-block;
-        width: .8rem;
-        height: .8rem;
-        border-left: 1px solid var(--bs-secondary);
-        border-bottom: 1px solid var(--bs-secondary);
-        margin-right: .25rem;
-        margin-left: .5rem;
-    }
-}
--- a/src-ui/src/app/components/manage/management-list/management-list.component.ts
+++ b/src-ui/src/app/components/manage/management-list/management-list.component.ts
@@ -79,7 +79,6 @@ export abstract class ManagementListComponent<T extends MatchingModel>
  @ViewChildren(SortableDirective) headers: QueryList<SortableDirective>

  public data: T[] = []
-  private unfilteredData: T[] = []

  public page = 1

@@ -133,18 +132,6 @@ export abstract class ManagementListComponent<T extends MatchingModel>
    this.reloadData()
  }

-  protected filterData(data: T[]): T[] {
-    return data
-  }
-
-  getDocumentCount(object: MatchingModel): number {
-    return (
-      object.document_count ??
-      this.unfilteredData.find((d) => d.id == object.id)?.document_count ??
-      0
-    )
-  }
-
  reloadData(extraParams: { [key: string]: any } = null) {
    this.loading = true
    this.clearSelection()
@@ -161,8 +148,7 @@ export abstract class ManagementListComponent<T extends MatchingModel>
      .pipe(
        takeUntil(this.unsubscribeNotifier),
        tap((c) => {
-          this.unfilteredData = c.results
-          this.data = this.filterData(c.results)
+          this.data = c.results
          this.collectionSize = c.count
        }),
        delay(100)
--- a/src-ui/src/app/components/manage/tag-list/tag-list.component.ts
+++ b/src-ui/src/app/components/manage/tag-list/tag-list.component.ts
@@ -1,4 +1,4 @@
-import { NgClass, NgTemplateOutlet, TitleCasePipe } from '@angular/common'
+import { NgClass, TitleCasePipe } from '@angular/common'
 import { Component, inject } from '@angular/core'
 import { FormsModule, ReactiveFormsModule } from '@angular/forms'
 import {
@@ -30,7 +30,6 @@ import { ManagementListComponent } from '../management-list/management-list.comp
    FormsModule,
    ReactiveFormsModule,
    NgClass,
-    NgTemplateOutlet,
    NgbDropdownModule,
    NgbPaginationModule,
    NgxBootstrapIconsModule,
@@ -60,8 +59,4 @@ export class TagListComponent extends ManagementListComponent<Tag> {
  getDeleteMessage(object: Tag) {
    return $localize`Do you really want to delete the tag "${object.name}"?`
  }
-
-  filterData(data: Tag[]) {
-    return data.filter((tag) => !tag.parent)
-  }
 }
--- a/src-ui/src/app/data/tag.ts
+++ b/src-ui/src/app/data/tag.ts
@@ -6,8 +6,4 @@ export interface Tag extends MatchingModel {
  text_color?: string

  is_inbox_tag?: boolean
-
-  parent?: number // Tag ID
-
-  children?: Tag[] // read-only
 }
--- a/src/documents/bulk_edit.py
+++ b/src/documents/bulk_edit.py
@@ -25,7 +25,6 @@ from documents.models import CustomFieldInstance
 from documents.models import Document
 from documents.models import DocumentType
 from documents.models import StoragePath
-from documents.models import Tag
 from documents.permissions import set_permissions_for_object
 from documents.plugins.helpers import DocumentsStatusManager
 from documents.tasks import bulk_update_documents
@@ -97,46 +96,31 @@ def set_document_type(doc_ids: list[int], document_type: DocumentType) -> Litera


 def add_tag(doc_ids: list[int], tag: int) -> Literal["OK"]:
-    tag_obj = Tag.objects.get(pk=tag)
-    tags_to_add = [tag_obj, *tag_obj.get_all_ancestors()]
+    qs = Document.objects.filter(Q(id__in=doc_ids) & ~Q(tags__id=tag)).only("pk")
+    affected_docs = list(qs.values_list("pk", flat=True))

    DocumentTagRelationship = Document.tags.through
-    to_create = []
-    affected_docs: set[int] = set()

-    for t in tags_to_add:
-        qs = Document.objects.filter(Q(id__in=doc_ids) & ~Q(tags__id=t.id)).only("pk")
-        doc_ids_missing_tag = list(qs.values_list("pk", flat=True))
-        affected_docs.update(doc_ids_missing_tag)
-        to_create.extend(
-            DocumentTagRelationship(document_id=doc, tag_id=t.id)
-            for doc in doc_ids_missing_tag
-        )
+    DocumentTagRelationship.objects.bulk_create(
+        [DocumentTagRelationship(document_id=doc, tag_id=tag) for doc in affected_docs],
+    )

-    if to_create:
-        DocumentTagRelationship.objects.bulk_create(to_create)
-
-    if affected_docs:
-        bulk_update_documents.delay(document_ids=list(affected_docs))
+    bulk_update_documents.delay(document_ids=affected_docs)

    return "OK"


 def remove_tag(doc_ids: list[int], tag: int) -> Literal["OK"]:
-    tag_obj = Tag.objects.get(pk=tag)
-    tags_to_remove = [tag_obj, *tag_obj.get_all_descendants()]
-    tag_ids = [t.id for t in tags_to_remove]
+    qs = Document.objects.filter(Q(id__in=doc_ids) & Q(tags__id=tag)).only("pk")
+    affected_docs = list(qs.values_list("pk", flat=True))

    DocumentTagRelationship = Document.tags.through
-    qs = DocumentTagRelationship.objects.filter(
-        document_id__in=doc_ids,
-        tag_id__in=tag_ids,
-    )
-    affected_docs = list(qs.values_list("document_id", flat=True).distinct())
-    qs.delete()

-    if affected_docs:
-        bulk_update_documents.delay(document_ids=affected_docs)
+    DocumentTagRelationship.objects.filter(
+        Q(document_id__in=affected_docs) & Q(tag_id=tag),
+    ).delete()
+
+    bulk_update_documents.delay(document_ids=affected_docs)

    return "OK"

@@ -148,35 +132,23 @@ def modify_tags(
 ) -> Literal["OK"]:
    qs = Document.objects.filter(id__in=doc_ids).only("pk")
    affected_docs = list(qs.values_list("pk", flat=True))
+
    DocumentTagRelationship = Document.tags.through

-    # add with all ancestors
-    expanded_add_tags: set[int] = set()
-    for tag_id in add_tags:
-        t = Tag.objects.get(pk=tag_id)
-        expanded_add_tags.update([t.id for t in [t, *t.get_all_ancestors()]])
+    DocumentTagRelationship.objects.filter(
+        document_id__in=affected_docs,
+        tag_id__in=remove_tags,
+    ).delete()

-    # remove with all descendants
-    expanded_remove_tags: set[int] = set()
-    for tag_id in remove_tags:
-        t = Tag.objects.get(pk=tag_id)
-        expanded_remove_tags.update([t.id for t in [t, *t.get_all_descendants()]])
+    DocumentTagRelationship.objects.bulk_create(
+        [
+            DocumentTagRelationship(document_id=doc, tag_id=tag)
+            for (doc, tag) in itertools.product(affected_docs, add_tags)
+        ],
+        ignore_conflicts=True,
+    )

-    if expanded_remove_tags:
-        DocumentTagRelationship.objects.filter(
-            document_id__in=affected_docs,
-            tag_id__in=expanded_remove_tags,
-        ).delete()
-
-    to_create = [
-        DocumentTagRelationship(document_id=doc, tag_id=tag)
-        for (doc, tag) in itertools.product(affected_docs, expanded_add_tags)
-    ]
-    if to_create:
-        DocumentTagRelationship.objects.bulk_create(to_create, ignore_conflicts=True)
-
-    if affected_docs:
-        bulk_update_documents.delay(document_ids=affected_docs)
+    bulk_update_documents.delay(document_ids=affected_docs)

    return "OK"
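
The `+` side of `modify_tags` above deletes the requested relationships first, then creates every (document, tag) pair from `itertools.product` with `ignore_conflicts=True`. A plain-Python model of that ordering, with no Django involved, might look like the sketch below. The function name `modify_tags_sketch` and the use of a set of pairs to stand in for the through table are illustrative assumptions.

```python
import itertools


def modify_tags_sketch(existing, doc_ids, add_tags, remove_tags):
    """Model the through table as a set of (document_id, tag_id) pairs."""
    docs = set(doc_ids)
    remove = set(remove_tags)
    # Step 1: drop every relationship between an affected document and a removed tag.
    kept = {(d, t) for (d, t) in existing if not (d in docs and t in remove)}
    # Step 2: add the full cross product; set union stands in for
    # ignore_conflicts=True, so pairs that already exist are skipped silently.
    return kept | set(itertools.product(doc_ids, add_tags))
```

Because removal runs before creation, a tag that appears in both lists ends up attached to every affected document.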

--- a/src/documents/consumer.py
+++ b/src/documents/consumer.py
@@ -689,7 +689,7 @@ class ConsumerPlugin(

        if self.metadata.tag_ids:
            for tag_id in self.metadata.tag_ids:
-                document.add_nested_tags([Tag.objects.get(pk=tag_id)])
+                document.tags.add(Tag.objects.get(pk=tag_id))

        if self.metadata.storage_path_id:
            document.storage_path = StoragePath.objects.get(
--- a/src/documents/migrations/1069_tag_parent.py
+++ b/src/documents/migrations/1069_tag_parent.py
@@ -1,26 +0,0 @@
-# Generated by Django 5.1.5 on 2025-02-10 06:02
-
-import django.db.models.deletion
-from django.db import migrations
-from django.db import models
-
-
-class Migration(migrations.Migration):
-    dependencies = [
-        ("documents", "1068_alter_document_created"),
-    ]
-
-    operations = [
-        migrations.AddField(
-            model_name="tag",
-            name="parent",
-            field=models.ForeignKey(
-                blank=True,
-                null=True,
-                on_delete=django.db.models.deletion.CASCADE,
-                related_name="children",
-                to="documents.tag",
-                verbose_name="parent",
-            ),
-        ),
-    ]
--- a/src/documents/models.py
+++ b/src/documents/models.py
@@ -7,7 +7,6 @@ from celery import states
 from django.conf import settings
 from django.contrib.auth.models import Group
 from django.contrib.auth.models import User
-from django.core.exceptions import ValidationError
 from django.core.validators import MaxValueValidator
 from django.core.validators import MinValueValidator
 from django.db import models
@@ -109,38 +108,10 @@ class Tag(MatchingModel):
        ),
    )

-    parent = models.ForeignKey(
-        "self",
-        blank=True,
-        null=True,
-        on_delete=models.CASCADE,
-        related_name="children",
-        verbose_name=_("parent"),
-    )
-
    class Meta(MatchingModel.Meta):
        verbose_name = _("tag")
        verbose_name_plural = _("tags")

-    def get_all_descendants(self):
-        descendants = []
-        for child in self.children.all():
-            descendants.append(child)
-            descendants.extend(child.get_all_descendants())
-        return descendants
-
-    def get_all_ancestors(self):
-        ancestors = []
-        if self.parent:
-            ancestors.append(self.parent)
-            ancestors.extend(self.parent.get_all_ancestors())
-        return ancestors
-
-    def clean(self):
-        if self.parent == self:
-            raise ValidationError("Cannot set itself as parent.")
-        return super().clean()
-

 class DocumentType(MatchingModel):
    class Meta(MatchingModel.Meta):
@@ -405,12 +376,6 @@ class Document(SoftDeleteModel, ModelWithOwner):
    def created_date(self):
        return self.created

-    def add_nested_tags(self, tags):
-        for tag in tags:
-            self.tags.add(tag)
-            if tag.parent:
-                self.add_nested_tags([tag.parent])
-

 class SavedView(ModelWithOwner):
    class DisplayMode(models.TextChoices):
--- a/src/documents/serialisers.py
+++ b/src/documents/serialisers.py
@@ -540,18 +540,6 @@ class TagSerializer(MatchingModelSerializer, OwnedObjectSerializer):

    text_color = serializers.SerializerMethodField()

-    children = SerializerMethodField()
-
-    @extend_schema_field(
-        field=serializers.ListSerializer(
-            child=serializers.PrimaryKeyRelatedField(
-                queryset=Tag.objects.all(),
-            ),
-        ),
-    )
-    def get_children(self, obj):
-        return TagSerializer(obj.children.all(), many=True).data
-
    class Meta:
        model = Tag
        fields = (
@@ -569,8 +557,6 @@ class TagSerializer(MatchingModelSerializer, OwnedObjectSerializer):
            "permissions",
            "user_can_change",
            "set_permissions",
-            "parent",
-            "children",
        )

    def validate_color(self, color):
@@ -1042,23 +1028,6 @@ class DocumentSerializer(
                            custom_field_instance.field,
                            doc_id,
                        )
-        if "tags" in validated_data:
-            # add all parent tags
-            all_ancestor_tags = set(validated_data["tags"])
-            for tag in validated_data["tags"]:
-                all_ancestor_tags.update(tag.get_all_ancestors())
-            validated_data["tags"] = list(all_ancestor_tags)
-            # remove any children for parents that are being removed
-            tag_parents_being_removed = [
-                tag
-                for tag in instance.tags.all()
-                if tag not in validated_data["tags"] and tag.children.count() > 0
-            ]
-            validated_data["tags"] = [
-                tag
-                for tag in validated_data["tags"]
-                if tag not in tag_parents_being_removed
-            ]
        if validated_data.get("remove_inbox_tags"):
            tag_ids_being_added = (
                [
--- a/src/documents/signals/handlers.py
+++ b/src/documents/signals/handlers.py
@@ -260,7 +260,7 @@ def set_tags(
            extra={"group": logging_group},
        )

-        document.add_nested_tags(relevant_tags)
+        document.tags.add(*relevant_tags)


 def set_storage_path(
@@ -767,17 +767,14 @@ def run_workflows(

    def assignment_action():
        if action.assign_tags.exists():
-            tag_ids_to_add: set[int] = set()
-            for tag in action.assign_tags.all():
-                tag_ids_to_add.add(tag.pk)
-                tag_ids_to_add.update(t.pk for t in tag.get_all_ancestors())
-
            if not use_overrides:
-                doc_tag_ids[:] = list(set(doc_tag_ids) | tag_ids_to_add)
+                doc_tag_ids.extend(action.assign_tags.values_list("pk", flat=True))
            else:
                if overrides.tag_ids is None:
                    overrides.tag_ids = []
-                overrides.tag_ids = list(set(overrides.tag_ids) | tag_ids_to_add)
+                overrides.tag_ids.extend(
+                    action.assign_tags.values_list("pk", flat=True),
+                )

        if action.assign_correspondent:
            if not use_overrides:
@@ -920,17 +917,14 @@ def run_workflows(
            else:
                overrides.tag_ids = None
        else:
-            tag_ids_to_remove: set[int] = set()
-            for tag in action.remove_tags.all():
-                tag_ids_to_remove.add(tag.pk)
-                tag_ids_to_remove.update(t.pk for t in tag.get_all_descendants())
-
            if not use_overrides:
-                doc_tag_ids[:] = [t for t in doc_tag_ids if t not in tag_ids_to_remove]
+                for tag in action.remove_tags.filter(
+                    pk__in=document.tags.values_list("pk", flat=True),
+                ):
+                    doc_tag_ids.remove(tag.pk)
            elif overrides.tag_ids:
-                overrides.tag_ids = [
-                    t for t in overrides.tag_ids if t not in tag_ids_to_remove
-                ]
+                for tag in action.remove_tags.filter(pk__in=overrides.tag_ids):
+                    overrides.tag_ids.remove(tag.pk)

        if not use_overrides and (
            action.remove_all_correspondents
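
Review note on the `run_workflows` hunks above: the tag ID handling switches from set operations to `list.extend` and `list.remove`. A minimal sketch of the difference in plain Python, with no Django involved. The helper names are hypothetical, tag hierarchy is ignored, and the `pk in result` test only stands in for the `pk__in` queryset filter in the diff:

```python
def assign_union(doc_tag_ids: list[int], assigned: list[int]) -> list[int]:
    """Removed behaviour: set union, so no duplicate IDs (order not guaranteed)."""
    return list(set(doc_tag_ids) | set(assigned))


def assign_extend(doc_tag_ids: list[int], assigned: list[int]) -> list[int]:
    """Added behaviour: plain extend, so an ID already present appears twice."""
    result = list(doc_tag_ids)
    result.extend(assigned)
    return result


def remove_present(doc_tag_ids: list[int], to_remove: list[int]) -> list[int]:
    """Added removal: only IDs that are present are passed to list.remove()."""
    result = list(doc_tag_ids)
    for pk in to_remove:
        if pk in result:  # stand-in for the pk__in filter (an assumption)
            result.remove(pk)  # removes the first occurrence only
    return result


if __name__ == "__main__":
    print(sorted(assign_union([1, 2], [2, 3])))
    print(assign_extend([1, 2], [2, 3]))
    print(remove_present([1, 2, 2, 3], [2, 9]))
```

Whether duplicate IDs matter depends on how the caller later applies the list, which this hunk does not show.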
--- a/src/documents/tests/test_tag_hierarchy.py
+++ b/src/documents/tests/test_tag_hierarchy.py
@@ -1,112 +0,0 @@
-from unittest import mock
-
-from django.contrib.auth.models import User
-from rest_framework.test import APITestCase
-
-from documents import bulk_edit
-from documents.models import Document
-from documents.models import Tag
-from documents.models import Workflow
-from documents.models import WorkflowAction
-from documents.models import WorkflowTrigger
-from documents.signals.handlers import run_workflows
-
-
-class TestTagHierarchy(APITestCase):
-    def setUp(self):
-        self.user = User.objects.create_superuser(username="admin")
-        self.client.force_authenticate(user=self.user)
-
-        self.parent = Tag.objects.create(name="Parent")
-        self.child = Tag.objects.create(name="Child", parent=self.parent)
-
-        patcher = mock.patch("documents.bulk_edit.bulk_update_documents.delay")
-        self.async_task = patcher.start()
-        self.addCleanup(patcher.stop)
-
-        self.document = Document.objects.create(
-            title="doc",
-            content="",
-            checksum="1",
-            mime_type="application/pdf",
-        )
-
-    def test_api_add_child_adds_parent(self):
-        self.client.patch(
-            f"/api/documents/{self.document.pk}/",
-            {"tags": [self.child.pk]},
-            format="json",
-        )
-        self.document.refresh_from_db()
-        tags = set(self.document.tags.values_list("pk", flat=True))
-        assert tags == {self.parent.pk, self.child.pk}
-
-    def test_api_remove_parent_removes_child(self):
-        self.document.add_nested_tags([self.child])
-        self.client.patch(
-            f"/api/documents/{self.document.pk}/",
-            {"tags": []},
-            format="json",
-        )
-        self.document.refresh_from_db()
-        assert self.document.tags.count() == 0
-
-    def test_bulk_edit_respects_hierarchy(self):
-        bulk_edit.add_tag([self.document.pk], self.child.pk)
-        self.document.refresh_from_db()
-        tags = set(self.document.tags.values_list("pk", flat=True))
-        assert tags == {self.parent.pk, self.child.pk}
-
-        bulk_edit.remove_tag([self.document.pk], self.parent.pk)
-        self.document.refresh_from_db()
-        assert self.document.tags.count() == 0
-
-        bulk_edit.modify_tags([self.document.pk], [self.child.pk], [])
-        self.document.refresh_from_db()
-        tags = set(self.document.tags.values_list("pk", flat=True))
-        assert tags == {self.parent.pk, self.child.pk}
-
-        bulk_edit.modify_tags([self.document.pk], [], [self.parent.pk])
-        self.document.refresh_from_db()
-        assert self.document.tags.count() == 0
-
-    def test_workflow_actions(self):
-        workflow = Workflow.objects.create(name="wf", order=0)
-        trigger = WorkflowTrigger.objects.create(
-            type=WorkflowTrigger.WorkflowTriggerType.DOCUMENT_ADDED,
-        )
-        assign_action = WorkflowAction.objects.create()
-        assign_action.assign_tags.add(self.child)
-        workflow.triggers.add(trigger)
-        workflow.actions.add(assign_action)
-
-        run_workflows(trigger.type, self.document)
-        self.document.refresh_from_db()
-        tags = set(self.document.tags.values_list("pk", flat=True))
-        assert tags == {self.parent.pk, self.child.pk}
-
-        # removal
-        removal_action = WorkflowAction.objects.create(
-            type=WorkflowAction.WorkflowActionType.REMOVAL,
-        )
-        removal_action.remove_tags.add(self.parent)
-        workflow.actions.clear()
-        workflow.actions.add(removal_action)
-
-        run_workflows(trigger.type, self.document)
-        self.document.refresh_from_db()
-        assert self.document.tags.count() == 0
-
-    def test_tag_view_parent_update_adds_parent_to_docs(self):
-        orphan = Tag.objects.create(name="Orphan")
-        self.document.tags.add(orphan)
-
-        self.client.patch(
-            f"/api/tags/{orphan.pk}/",
-            {"parent": self.parent.pk},
-            format="json",
-        )
-
-        self.document.refresh_from_db()
-        tags = set(self.document.tags.values_list("pk", flat=True))
-        assert tags == {self.parent.pk, orphan.pk}
--- a/src/documents/views.py
+++ b/src/documents/views.py
@@ -341,39 +341,6 @@ class TagViewSet(ModelViewSet, PermissionsAwareDocumentCountMixin):
    filterset_class = TagFilterSet
    ordering_fields = ("color", "name", "matching_algorithm", "match", "document_count")

-    def perform_update(self, serializer):
-        old_parent = self.get_object().parent
-        tag = serializer.save()
-        new_parent = tag.parent
-        if old_parent != new_parent:
-            self._update_document_parent_tags(tag, old_parent, new_parent)
-
-    def _update_document_parent_tags(self, tag, old_parent, new_parent):
-        DocumentTagRelationship = Document.tags.through
-        doc_ids = list(Document.objects.filter(tags=tag).values_list("pk", flat=True))
-        affected = set()
-
-        if new_parent:
-            parents_to_add = [new_parent, *new_parent.get_all_ancestors()]
-            to_create = []
-            for parent in parents_to_add:
-                missing = Document.objects.filter(id__in=doc_ids).exclude(tags=parent)
-                to_create.extend(
-                    DocumentTagRelationship(document_id=doc_id, tag_id=parent.id)
-                    for doc_id in missing.values_list("pk", flat=True)
-                )
-                affected.update(missing.values_list("pk", flat=True))
-            if to_create:
-                DocumentTagRelationship.objects.bulk_create(
-                    to_create,
-                    ignore_conflicts=True,
-                )
-
-        if affected:
-            from documents.tasks import bulk_update_documents
-
-            bulk_update_documents.delay(document_ids=list(affected))
-

@extend_schema_view(**generate_object_with_permissions_schema(DocumentTypeSerializer))
 class DocumentTypeViewSet(ModelViewSet, PermissionsAwareDocumentCountMixin):
@@ -2869,6 +2836,11 @@ class SystemStatusView(PassUserMixin):
        last_trained_task = (
            PaperlessTask.objects.filter(
                task_name=PaperlessTask.TaskName.TRAIN_CLASSIFIER,
+                status__in=[
+                    states.SUCCESS,
+                    states.FAILURE,
+                    states.REVOKED,
+                ],  # ignore running tasks
            )
            .order_by("-date_done")
            .first()
@@ -2878,7 +2850,7 @@ class SystemStatusView(PassUserMixin):
        if last_trained_task is None:
            classifier_status = "WARNING"
            classifier_error = "No classifier training tasks found"
-        elif last_trained_task and last_trained_task.status == states.FAILURE:
+        elif last_trained_task and last_trained_task.status != states.SUCCESS:
            classifier_status = "ERROR"
            classifier_error = last_trained_task.result
        classifier_last_trained = (
@@ -2888,6 +2860,11 @@ class SystemStatusView(PassUserMixin):
        last_sanity_check = (
            PaperlessTask.objects.filter(
                task_name=PaperlessTask.TaskName.CHECK_SANITY,
+                status__in=[
+                    states.SUCCESS,
+                    states.FAILURE,
+                    states.REVOKED,
+                ],  # ignore running tasks
            )
            .order_by("-date_done")
            .first()
@@ -2897,7 +2874,7 @@ class SystemStatusView(PassUserMixin):
        if last_sanity_check is None:
            sanity_check_status = "WARNING"
            sanity_check_error = "No sanity check tasks found"
-        elif last_sanity_check and last_sanity_check.status == states.FAILURE:
+        elif last_sanity_check and last_sanity_check.status != states.SUCCESS:
            sanity_check_status = "ERROR"
            sanity_check_error = last_sanity_check.result
        sanity_check_last_run = (
--- a/src/paperless/settings.py
+++ b/src/paperless/settings.py
@@ -324,6 +324,7 @@ INSTALLED_APPS = [
    "paperless_tesseract.apps.PaperlessTesseractConfig",
    "paperless_text.apps.PaperlessTextConfig",
    "paperless_mail.apps.PaperlessMailConfig",
+    "paperless_remote.apps.PaperlessRemoteParserConfig",
    "django.contrib.admin",
    "rest_framework",
    "rest_framework.authtoken",
@@ -1205,8 +1206,8 @@ def _ocr_to_dateparser_languages(ocr_languages: str) -> list[str]:

            language_part = ocr_to_dateparser.get(ocr_lang_part)
            if language_part is None:
-                logger.warning(
-                    f'Skipping unknown OCR language "{ocr_language}" — no dateparser equivalent.',
+                logger.debug(
+                    f'Unable to map OCR language "{ocr_lang_part}" to dateparser locale. ',
                )
                continue

@@ -1219,7 +1220,7 @@ def _ocr_to_dateparser_languages(ocr_languages: str) -> list[str]:
                try:
                    loader.get_locale_map(locales=[dateparser_language])
                except Exception:
-                    logger.warning(
+                    logger.info(
                        f"Language variant '{dateparser_language}' not supported by dateparser; falling back to base language '{language_part}'. You can manually set PAPERLESS_DATE_PARSER_LANGUAGES if needed.",
                    )
                    dateparser_language = language_part
@@ -1229,12 +1230,12 @@ def _ocr_to_dateparser_languages(ocr_languages: str) -> list[str]:
                result.append(dateparser_language)
    except Exception as e:
        logger.warning(
-            f"Could not configure dateparser languages. Set PAPERLESS_DATE_PARSER_LANGUAGES parameter to avoid this. Detail: {e}",
+            f"Error auto-configuring dateparser languages. Set PAPERLESS_DATE_PARSER_LANGUAGES parameter to avoid this. Detail: {e}",
        )
        return []
    if not result:
-        logger.warning(
-            "Could not configure any dateparser languages from OCR_LANGUAGE — fallback to autodetection.",
+        logger.info(
+            "Unable to automatically determine dateparser languages from OCR_LANGUAGE, falling back to multi-language support.",
        )
    return result

@@ -1443,3 +1444,10 @@ WEBHOOKS_ALLOW_INTERNAL_REQUESTS = __get_boolean(
    "PAPERLESS_WEBHOOKS_ALLOW_INTERNAL_REQUESTS",
    "true",
 )
+
+###############################################################################
+# Remote Parser                                                               #
+###############################################################################
+REMOTE_OCR_ENGINE = os.getenv("PAPERLESS_REMOTE_OCR_ENGINE")
+REMOTE_OCR_API_KEY = os.getenv("PAPERLESS_REMOTE_OCR_API_KEY")
+REMOTE_OCR_ENDPOINT = os.getenv("PAPERLESS_REMOTE_OCR_ENDPOINT")
--- a/src/paperless_remote/__init__.py
+++ b/src/paperless_remote/__init__.py
@@ -0,0 +1,4 @@
+# this is here so that django finds the checks.
+from paperless_remote.checks import check_remote_parser_configured
+
+__all__ = ["check_remote_parser_configured"]
--- a/src/paperless_remote/apps.py
+++ b/src/paperless_remote/apps.py
@@ -0,0 +1,14 @@
+from django.apps import AppConfig
+
+from paperless_remote.signals import remote_consumer_declaration
+
+
+class PaperlessRemoteParserConfig(AppConfig):
+    name = "paperless_remote"
+
+    def ready(self):
+        from documents.signals import document_consumer_declaration
+
+        document_consumer_declaration.connect(remote_consumer_declaration)
+
+        AppConfig.ready(self)
--- a/src/paperless_remote/checks.py
+++ b/src/paperless_remote/checks.py
@@ -0,0 +1,15 @@
+from django.conf import settings
+from django.core.checks import Error
+from django.core.checks import register
+
+
+@register()
+def check_remote_parser_configured(app_configs, **kwargs):
+    if settings.REMOTE_OCR_ENGINE == "azureai" and not settings.REMOTE_OCR_ENDPOINT:
+        return [
+            Error(
+                "Azure AI remote parser requires endpoint to be configured.",
+            ),
+        ]
+
+    return []
--- a/src/paperless_remote/parsers.py
+++ b/src/paperless_remote/parsers.py
@@ -0,0 +1,113 @@
+from pathlib import Path
+
+from django.conf import settings
+
+from paperless_tesseract.parsers import RasterisedDocumentParser
+
+
+class RemoteEngineConfig:
+    def __init__(
+        self,
+        engine: str,
+        api_key: str | None = None,
+        endpoint: str | None = None,
+    ):
+        self.engine = engine
+        self.api_key = api_key
+        self.endpoint = endpoint
+
+    def engine_is_valid(self):
+        valid = self.engine in ["azureai"] and self.api_key is not None
+        if self.engine == "azureai":
+            valid = valid and self.endpoint is not None
+        return valid
+
+
+class RemoteDocumentParser(RasterisedDocumentParser):
+    """
+    This parser uses a remote OCR engine to parse documents. Currently, it supports Azure AI Vision
+    as this is the only service that provides a remote OCR API with text-embedded PDF output.
+    """
+
+    logging_name = "paperless.parsing.remote"
+
+    def get_settings(self) -> RemoteEngineConfig:
+        """
+        Returns the configuration for the remote OCR engine, loaded from Django settings.
+        """
+        return RemoteEngineConfig(
+            engine=settings.REMOTE_OCR_ENGINE,
+            api_key=settings.REMOTE_OCR_API_KEY,
+            endpoint=settings.REMOTE_OCR_ENDPOINT,
+        )
+
+    def supported_mime_types(self):
+        if self.settings.engine_is_valid():
+            return {
+                "application/pdf": ".pdf",
+                "image/png": ".png",
+                "image/jpeg": ".jpg",
+                "image/tiff": ".tiff",
+                "image/bmp": ".bmp",
+                "image/gif": ".gif",
+                "image/webp": ".webp",
+            }
+        else:
+            return {}
+
+    def azure_ai_vision_parse(
+        self,
+        file: Path,
+    ) -> str | None:
+        """
+        Uses Azure AI Vision to parse the document and return the text content.
+        It requests a searchable PDF output with embedded text.
+        The PDF is saved to the archive_path attribute.
+        Returns the text content extracted from the document.
+        If the parsing fails, it returns None.
+        """
+        from azure.ai.documentintelligence import DocumentIntelligenceClient
+        from azure.ai.documentintelligence.models import AnalyzeDocumentRequest
+        from azure.ai.documentintelligence.models import AnalyzeOutputOption
+        from azure.ai.documentintelligence.models import DocumentContentFormat
+        from azure.core.credentials import AzureKeyCredential
+
+        client = DocumentIntelligenceClient(
+            endpoint=self.settings.endpoint,
+            credential=AzureKeyCredential(self.settings.api_key),
+        )
+
+        with file.open("rb") as f:
+            analyze_request = AnalyzeDocumentRequest(bytes_source=f.read())
+            poller = client.begin_analyze_document(
+                model_id="prebuilt-read",
+                body=analyze_request,
+                output_content_format=DocumentContentFormat.TEXT,
+                output=[AnalyzeOutputOption.PDF],  # request searchable PDF output
+                content_type="application/json",
+            )
+
+        poller.wait()
+        result_id = poller.details["operation_id"]
+        result = poller.result()
+
+        # Download the PDF with embedded text
+        self.archive_path = Path(self.tempdir) / "archive.pdf"
+        with self.archive_path.open("wb") as f:
+            for chunk in client.get_analyze_result_pdf(
+                model_id="prebuilt-read",
+                result_id=result_id,
+            ):
+                f.write(chunk)
+
+        return result.content
+
+    def parse(self, document_path: Path, mime_type, file_name=None):
+        if not self.settings.engine_is_valid():
+            self.log.warning(
+                "No valid remote parser engine is configured, content will be empty.",
+            )
+            self.text = ""
+            return
+        elif self.settings.engine == "azureai":
+            self.text = self.azure_ai_vision_parse(document_path)
--- a/src/paperless_remote/signals.py
+++ b/src/paperless_remote/signals.py
@@ -0,0 +1,18 @@
+def get_parser(*args, **kwargs):
+    from paperless_remote.parsers import RemoteDocumentParser
+
+    return RemoteDocumentParser(*args, **kwargs)
+
+
+def get_supported_mime_types():
+    from paperless_remote.parsers import RemoteDocumentParser
+
+    return RemoteDocumentParser(None).supported_mime_types()
+
+
+def remote_consumer_declaration(sender, **kwargs):
+    return {
+        "parser": get_parser,
+        "weight": 5,
+        "mime_types": get_supported_mime_types(),
+    }
--- a/src/paperless_remote/tests/__init__.py
+++ b/src/paperless_remote/tests/__init__.py
--- a/src/paperless_remote/tests/samples/simple-digital.pdf
+++ b/src/paperless_remote/tests/samples/simple-digital.pdf
--- a/src/paperless_remote/tests/test_checks.py
+++ b/src/paperless_remote/tests/test_checks.py
@@ -0,0 +1,29 @@
+from django.test import TestCase
+from django.test import override_settings
+
+from paperless_remote import check_remote_parser_configured
+
+
+class TestChecks(TestCase):
+    @override_settings(REMOTE_OCR_ENGINE=None)
+    def test_no_engine(self):
+        msgs = check_remote_parser_configured(None)
+        self.assertEqual(len(msgs), 0)
+
+    @override_settings(REMOTE_OCR_ENGINE="azureai")
+    @override_settings(REMOTE_OCR_API_KEY="somekey")
+    @override_settings(REMOTE_OCR_ENDPOINT=None)
+    def test_azure_no_endpoint(self):
+        msgs = check_remote_parser_configured(None)
+        self.assertEqual(len(msgs), 1)
+        self.assertTrue(
+            msgs[0].msg.startswith(
+                "Azure AI remote parser requires endpoint to be configured.",
+            ),
+        )
+
+    @override_settings(REMOTE_OCR_ENGINE="something")
+    @override_settings(REMOTE_OCR_API_KEY="somekey")
+    def test_valid_configuration(self):
+        msgs = check_remote_parser_configured(None)
+        self.assertEqual(len(msgs), 0)
--- a/src/paperless_remote/tests/test_parser.py
+++ b/src/paperless_remote/tests/test_parser.py
@@ -0,0 +1,101 @@
+import uuid
+from pathlib import Path
+from unittest import mock
+
+from django.test import TestCase
+from django.test import override_settings
+
+from documents.tests.utils import DirectoriesMixin
+from documents.tests.utils import FileSystemAssertsMixin
+from paperless_remote.parsers import RemoteDocumentParser
+from paperless_remote.signals import get_parser
+
+
+class TestParser(DirectoriesMixin, FileSystemAssertsMixin, TestCase):
+    SAMPLE_FILES = Path(__file__).resolve().parent / "samples"
+
+    def assertContainsStrings(self, content, strings):
+        # Asserts that all strings appear in content, in the given order.
+        indices = []
+        for s in strings:
+            if s in content:
+                indices.append(content.index(s))
+            else:
+                self.fail(f"'{s}' is not in '{content}'")
+        self.assertListEqual(indices, sorted(indices))
+
+    @mock.patch("paperless_tesseract.parsers.run_subprocess")
+    @mock.patch("azure.ai.documentintelligence.DocumentIntelligenceClient")
+    def test_get_text_with_azure(self, mock_client_cls, mock_subprocess):
+        # Arrange mock Azure client
+        mock_client = mock.Mock()
+        mock_client_cls.return_value = mock_client
+
+        # Simulate poller result and its `.details`
+        mock_poller = mock.Mock()
+        mock_poller.wait.return_value = None
+        mock_poller.details = {"operation_id": "fake-op-id"}
+        mock_client.begin_analyze_document.return_value = mock_poller
+        mock_poller.result.return_value.content = "This is a test document."
+
+        # Return dummy PDF bytes
+        mock_client.get_analyze_result_pdf.return_value = [
+            b"%PDF-",
+            b"1.7 ",
+            b"FAKEPDF",
+        ]
+
+        # Simulate pdftotext by writing dummy text to sidecar file
+        def fake_run(cmd, *args, **kwargs):
+            with Path(cmd[-1]).open("w", encoding="utf-8") as f:
+                f.write("This is a test document.")
+
+        mock_subprocess.side_effect = fake_run
+
+        with override_settings(
+            REMOTE_OCR_ENGINE="azureai",
+            REMOTE_OCR_API_KEY="somekey",
+            REMOTE_OCR_ENDPOINT="https://endpoint.cognitiveservices.azure.com",
+        ):
+            parser = get_parser(uuid.uuid4())
+            parser.parse(
+                self.SAMPLE_FILES / "simple-digital.pdf",
+                "application/pdf",
+            )
+
+            self.assertContainsStrings(
+                parser.text.strip(),
+                ["This is a test document."],
+            )
+
+    @override_settings(
+        REMOTE_OCR_ENGINE="azureai",
+        REMOTE_OCR_API_KEY="key",
+        REMOTE_OCR_ENDPOINT="https://endpoint.cognitiveservices.azure.com",
+    )
+    def test_supported_mime_types_valid_config(self):
+        parser = RemoteDocumentParser(uuid.uuid4())
+        expected_types = {
+            "application/pdf": ".pdf",
+            "image/png": ".png",
+            "image/jpeg": ".jpg",
+            "image/tiff": ".tiff",
+            "image/bmp": ".bmp",
+            "image/gif": ".gif",
+            "image/webp": ".webp",
+        }
+        self.assertEqual(parser.supported_mime_types(), expected_types)
+
+    def test_supported_mime_types_invalid_config(self):
+        parser = get_parser(uuid.uuid4())
+        self.assertEqual(parser.supported_mime_types(), {})
+
+    @override_settings(
+        REMOTE_OCR_ENGINE=None,
+        REMOTE_OCR_API_KEY=None,
+        REMOTE_OCR_ENDPOINT=None,
+    )
+    def test_parse_with_invalid_config(self):
+        parser = get_parser(uuid.uuid4())
+        parser.parse(self.SAMPLE_FILES / "simple-digital.pdf", "application/pdf")
+        self.assertEqual(parser.text, "")
--- a/uv.lock
+++ b/uv.lock
@@ -95,6 +95,34 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/af/cc/55a32a2c98022d88812b5986d2a92c4ff3ee087e83b712ebc703bba452bf/Automat-24.8.1-py3-none-any.whl", hash = "sha256:bf029a7bc3da1e2c24da2343e7598affaa9f10bf0ab63ff808566ce90551e02a", size = 42585, upload-time = "2024-08-19T17:31:56.729Z" },
 ]

+[[package]]
+name = "azure-ai-documentintelligence"
+version = "1.0.2"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "azure-core", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
+    { name = "isodate", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
+    { name = "typing-extensions", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/44/7b/8115cd713e2caa5e44def85f2b7ebd02a74ae74d7113ba20bdd41fd6dd80/azure_ai_documentintelligence-1.0.2.tar.gz", hash = "sha256:4d75a2513f2839365ebabc0e0e1772f5601b3a8c9a71e75da12440da13b63484", size = 170940 }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/d9/75/c9ec040f23082f54ffb1977ff8f364c2d21c79a640a13d1c1809e7fd6b1a/azure_ai_documentintelligence-1.0.2-py3-none-any.whl", hash = "sha256:e1fb446abbdeccc9759d897898a0fe13141ed29f9ad11fc705f951925822ed59", size = 106005 },
+]
+
+[[package]]
+name = "azure-core"
+version = "1.33.0"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "requests", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
+    { name = "six", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
+    { name = "typing-extensions", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/75/aa/7c9db8edd626f1a7d99d09ef7926f6f4fb34d5f9fa00dc394afdfe8e2a80/azure_core-1.33.0.tar.gz", hash = "sha256:f367aa07b5e3005fec2c1e184b882b0b039910733907d001c20fb08ebb8c0eb9", size = 295633 }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/07/b7/76b7e144aa53bd206bf1ce34fa75350472c3f69bf30e5c8c18bc9881035d/azure_core-1.33.0-py3-none-any.whl", hash = "sha256:9b5b6d0223a1d38c37500e6971118c1e0f13f54951e6893968b38910bc9cda8f", size = 207071 },
+]
+
 [[package]]
 name = "babel"
 version = "2.17.0"
@@ -1402,6 +1430,15 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/c7/fc/4e5a141c3f7c7bed550ac1f69e599e92b6be449dd4677ec09f325cad0955/inotifyrecursive-0.3.5-py3-none-any.whl", hash = "sha256:7e5f4a2e1dc2bef0efa3b5f6b339c41fb4599055a2b54909d020e9e932cc8d2f", size = 8009, upload-time = "2020-11-20T12:38:46.981Z" },
 ]

+[[package]]
+name = "isodate"
+version = "0.7.2"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/54/4d/e940025e2ce31a8ce1202635910747e5a87cc3a6a6bb2d00973375014749/isodate-0.7.2.tar.gz", hash = "sha256:4cd1aa0f43ca76f4a6c6c0292a85f40b35ec2e43e315b59f06e6d32171a953e6", size = 29705 }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/15/aa/0aca39a37d3c7eb941ba736ede56d689e7be91cab5d9ca846bde3999eba6/isodate-0.7.2-py3-none-any.whl", hash = "sha256:28009937d8031054830160fce6d409ed342816b543597cece116d966c6d99e15", size = 22320 },
+]
+
 [[package]]
 name = "jinja2"
 version = "3.1.6"
@@ -2010,6 +2047,7 @@ name = "paperless-ngx"
 version = "2.18.1"
 source = { virtual = "." }
 dependencies = [
+    { name = "azure-ai-documentintelligence", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
    { name = "babel", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
    { name = "bleach", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
    { name = "celery", extra = ["redis"], marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
@@ -2144,6 +2182,7 @@ typing = [

 [package.metadata]
 requires-dist = [
+    { name = "azure-ai-documentintelligence", specifier = ">=1.0.2" },
    { name = "babel", specifier = ">=2.17" },
    { name = "bleach", specifier = "~=6.2.0" },
    { name = "celery", extras = ["redis"], specifier = "~=5.5.1" },
Author	SHA1	Message	Date
shamoon	bd6585d3b4	Merge branch 'dev' into feature-remote-ocr-2	2025-08-22 08:54:26 -07:00
shamoon	bfd468103b	Revert "Update ci.yml" This reverts commit `be0c1fd1ed`.	2025-08-22 08:46:01 -07:00
shamoon	be0c1fd1ed	Update ci.yml	2025-08-22 08:45:33 -07:00
shamoon	82370963da	Fix: ignore incomplete tasks for system status 'last run' (#10641)	2025-08-21 21:44:41 +00:00
shamoon	0fdfa42a83	Tweak: improve dateparser auto-detection messages (#10640)	2025-08-21 21:14:25 +00:00
shamoon	0f0ba92e15	Fix: increase legibility of date filter clear button in light mode (#10649)	2025-08-21 07:25:21 -07:00
Guntbert Reiter	5f0281e427	Documentation: fix typo in troubleshooting docs (#10643)	2025-08-20 13:25:42 -07:00
shamoon	a0c7785881	Dont require_changes for codecov comment	2025-08-20 11:18:38 -07:00
shamoon	717e828a1d	Merge branch 'dev' into feature-remote-ocr-2	2025-08-17 21:25:14 -07:00
shamoon	07381d48e6	Merge branch 'dev' into feature-remote-ocr-2	2025-08-17 07:49:58 -07:00
shamoon	dd0ffaf312	Merge branch 'dev' into feature-remote-ocr-2	2025-08-11 10:48:36 -07:00
shamoon	264504affc	Fix consumer declaration file extensions	2025-08-10 05:32:52 -07:00
shamoon	4feedf2add	Merge branch 'dev' into feature-remote-ocr-2	2025-08-06 16:04:25 -04:00
shamoon	2f76cf9831	Merge branch 'dev' into feature-remote-ocr-2	2025-08-01 23:55:49 -04:00
shamoon	1002d37f6b	Update test_parser.py	2025-07-09 11:05:37 -07:00
shamoon	d260a94740	Update parsers.py	2025-07-09 11:02:57 -07:00
shamoon	88c69b83ea	Update index.md	2025-07-09 11:00:12 -07:00
shamoon	2557ee2014	Update docs to mention remote OCR with Azure AI	2025-07-09 09:53:30 -07:00
shamoon	3c75deed80	Add paperless_remote tests to testpaths	2025-07-08 14:19:45 -07:00
shamoon	d05343c927	Test fixes / coverage	2025-07-08 14:19:45 -07:00
shamoon	e7972b7eaf	Coverage	2025-07-08 14:19:45 -07:00
shamoon	75a091cc0d	Fix test	2025-07-08 14:19:44 -07:00
shamoon	dca74803fd	Use output_content_format poller.result to get clean content	2025-07-08 14:19:44 -07:00
shamoon	3cf3d868d0	Some docs	2025-07-08 14:19:43 -07:00
shamoon	bf4fc6604a	Test	2025-07-08 14:19:43 -07:00
shamoon	e8c1eb86fa	This actually works [ci skip]	2025-07-08 14:19:43 -07:00
shamoon	c3dad3cf69	Basic parse	2025-07-08 14:19:42 -07:00
shamoon	811bd66088	Ok, restart implementing this with just azure [ci skip]	2025-07-08 14:19:42 -07:00