Enable parallel OCR processing

At the moment, every page in a PDF is processed one by one using tesseract. Since the processing of a single page is independent of every other page, the work can be spread across the cores of a multi-core machine. This PR introduces a multiprocessing pool to process multiple pages simultaneously. The number of worker processes to use can be specified in the environment variable `PAPERLESS_OCR_THREADS`. If the variable is not set, this defaults to the number of cores/hyperthreads Python detects for your system.
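
A minimal sketch of the approach described above, not the code from this commit: the consumer-side change is not shown in this excerpt, so `resolve_worker_count`, `ocr_page`, and `ocr_pages` are hypothetical names, and `ocr_page` is a placeholder for the real tesseract call. It relies on the documented behaviour of `multiprocessing.Pool`, which uses `os.cpu_count()` workers when `processes` is `None`.

```python
import os
from multiprocessing import Pool


def resolve_worker_count(raw):
    """Convert the raw environment value to an int, or None if unset/empty.

    Hypothetical helper: os.environ.get() returns a string (or None),
    while Pool expects an integer or None.
    """
    if raw is None or raw.strip() == "":
        return None  # Pool(processes=None) falls back to os.cpu_count()
    return int(raw)


def ocr_page(page_path):
    """Hypothetical stand-in for running tesseract on a single page image."""
    return "text of {}".format(page_path)


def ocr_pages(page_paths, workers=None):
    """OCR every page in a pool of worker processes.

    Pool.map returns results in the same order as the input, so the
    page order of the document is preserved.
    """
    with Pool(processes=workers) as pool:
        return pool.map(ocr_page, page_paths)


if __name__ == "__main__":
    workers = resolve_worker_count(os.environ.get("PAPERLESS_OCR_THREADS"))
    pages = ["page-0001.pnm", "page-0002.pnm", "page-0003.pnm"]  # example paths
    print(ocr_pages(pages, workers))
```

Despite the variable name, the pool starts processes rather than threads, which lets CPU-bound OCR work run on several cores at once.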
2025-12-14 01:21:14 -06:00 · 2016-02-14 15:57:42 +01:00
parent 6b0a537bff
commit f5beda9c56
2 changed files with 18 additions and 5 deletions
--- a/src/paperless/settings.py
+++ b/src/paperless/settings.py
@@ -144,6 +144,9 @@ MEDIA_URL = "/media/"
 # documents.  It should be a 3-letter language code consistent with ISO 639.
 OCR_LANGUAGE = "eng"

+# The amount of threads to use for OCR
+OCR_THREADS = os.environ.get("PAPERLESS_OCR_THREADS")
+
 # If this is true, any failed attempts to OCR a PDF will result in the PDF being
 # indexed anyway, with whatever we could get.  If it's False, the file will
 # simply be left in the CONSUMPTION_DIR.