reworked PDF parser that uses OCRmyPDF and produces archive files.
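
As a rough illustration of the settings documented in this change, the three OCR modes appear to line up with OCRmyPDF's `skip_text`, `redo_ocr` and `force_ocr` options. The sketch below is not the code from this commit: the helper name `build_ocr_args` and the validation are hypothetical, and only the setting names, allowed values and defaults come from the documentation in the diff.

```python
import os

# Hypothetical mapping from PAPERLESS_OCR_MODE values to OCRmyPDF option
# names. This is an assumption for illustration, not the parser's code.
MODE_FLAGS = {
    "skip": "skip_text",   # OCR only pages that have no text
    "redo": "redo_ocr",    # OCR everything, replacing existing text layers
    "force": "force_ocr",  # rasterize everything, then OCR
}

OUTPUT_TYPES = ("pdf", "pdfa")


def build_ocr_args(env=os.environ):
    """Translate the documented settings into keyword arguments.

    The returned dict is meant to be passed on to an OCR call such as
    ocrmypdf.ocr(input_path, output_path, **args); that call is not
    made here so the sketch stays self-contained.
    """
    mode = env.get("PAPERLESS_OCR_MODE", "skip")
    output_type = env.get("PAPERLESS_OCR_OUTPUT_TYPE", "pdfa")
    language = env.get("PAPERLESS_OCR_LANGUAGE", "eng")

    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown PAPERLESS_OCR_MODE: {mode!r}")
    if output_type not in OUTPUT_TYPES:
        raise ValueError(f"unknown PAPERLESS_OCR_OUTPUT_TYPE: {output_type!r}")

    return {
        "language": language,
        "output_type": output_type,
        MODE_FLAGS[mode]: True,
    }
```

With no settings present, the helper falls back to the documented defaults (`skip`, `pdfa`, `eng`); an unknown mode raises an error instead of silently running in a different mode.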

2026-01-30 23:08:59 -06:00 · 2020-11-25 14:50:43 +01:00
parent ac6c72a6c9
commit 15935ab61f
7 changed files with 374 additions and 186 deletions
--- a/docs/configuration.rst
+++ b/docs/configuration.rst
@@ -218,11 +218,37 @@ PAPERLESS_OCR_LANGUAGE=<lang>

    Defaults to "eng".

-PAPERLESS_OCR_ALWAYS=<bool>
-    By default Paperless does not OCR a document if the text can be retrieved from
-    the document directly. Set to true to always OCR documents.
+PAPERLESS_OCR_MODE=<mode>
+    Tell paperless when and how to perform OCR on your documents. Three modes
+    are available:

-    Defaults to false.
+    *   ``skip``: Paperless skips pages that already contain text and performs OCR
+        only on pages where no text is present. This is the safest and fastest option.
+    *   ``redo``: Paperless will OCR all pages of your documents and attempt to
+        replace any existing text layers with new text. This is useful for
+        documents from scanners that already performed OCR with insufficient
+        results. It will also perform OCR on purely digital documents.
+
+        This option may fail on some documents that have features that cannot
+        be removed, such as forms. In this case, the text from the document is
+        used instead.
+    *   ``force``: Paperless rasterizes your documents, converting any text
+        into images, and puts the OCRed text on top. This works for all documents;
+        however, the resulting document may be significantly larger and text
+        won't appear as sharp when zoomed in.
+    
+    The default is ``skip``, which only performs OCR when necessary.
+
+PAPERLESS_OCR_OUTPUT_TYPE=<type>
+    Specify the type of PDF documents that paperless should produce.
+    
+    *   ``pdf``: Modify the PDF document as little as possible.
+    *   ``pdfa``: Convert PDF documents into PDF/A documents, which is a
+        subset of the entire PDF specification and meant for storing
+        documents long term.
+    
+    If not specified, ``pdfa`` is used. Remember that paperless keeps
+    the original input file as well as the archived version.

 PAPERLESS_CONSUMER_POLLING=<num>
    If paperless won't find documents added to your consume folder, it might