mirror of
https://github.com/paperless-ngx/paperless-ngx.git
synced 2025-07-28 18:24:38 -05:00
Enable parallel OCR processing
At the moment, every page in a PDF is processed one at a time using tesseract. Since each page is processed independently of every other page, the work can be spread across the cores of a multi-core machine. This PR introduces a multiprocessing pool to process multiple pages simultaneously. The number of threads to use can be specified in the environment variable `PAPERLESS_OCR_THREADS`. If the variable is not set, the number of cores/hyperthreads Python detects for your system is used.
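A minimal sketch of the idea described above, not the code from this PR: `ocr_page` is a hypothetical stand-in for the per-page tesseract call, and `ocr_pages` and the page file names are likewise made up for illustration. It only shows how `multiprocessing.Pool` can fan independent pages out to worker processes while keeping the results in page order.

```python
from multiprocessing import Pool


def ocr_page(page_path):
    """Hypothetical stand-in for running tesseract on a single page image."""
    # A real worker would OCR the image at page_path and return its text.
    return "text of {}".format(page_path)


def ocr_pages(page_paths, threads=None):
    """Process all pages in a pool and return the results in page order."""
    # processes=None makes Pool fall back to os.cpu_count().
    with Pool(processes=threads) as pool:
        # map() blocks until every page is done and preserves input order.
        return pool.map(ocr_page, page_paths)


if __name__ == "__main__":
    # Hypothetical page images extracted from one PDF.
    print(ocr_pages(["page-0.pnm", "page-1.pnm", "page-2.pnm"], threads=2))
```

Note that a pool of this kind uses worker processes rather than threads, so the function handed to `map()` has to be importable at module level, as `ocr_page` is here.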
@@ -144,6 +144,9 @@ MEDIA_URL = "/media/"
# documents. It should be a 3-letter language code consistent with ISO 639.
OCR_LANGUAGE = "eng"

# The amount of threads to use for OCR
OCR_THREADS = os.environ.get("PAPERLESS_OCR_THREADS")

# If this is true, any failed attempts to OCR a PDF will result in the PDF being
# indexed anyway, with whatever we could get. If it's False, the file will
# simply be left in the CONSUMPTION_DIR.
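`os.environ.get` returns either a string or `None`, while `multiprocessing.Pool` expects an integer or `None` for its `processes` argument, so the value read here presumably has to be converted somewhere before the pool is created. The sketch below shows one way that conversion could look; the helper name `ocr_threads_from_env` and the validation rule are assumptions, not part of this diff.

```python
import os


def ocr_threads_from_env(environ=os.environ):
    """Return the configured pool size as an int, or None to auto-detect."""
    raw = environ.get("PAPERLESS_OCR_THREADS")
    if not raw:
        # Unset or empty: let Pool pick os.cpu_count() on its own.
        return None
    threads = int(raw)  # raises ValueError for non-numeric input
    if threads < 1:
        raise ValueError("PAPERLESS_OCR_THREADS must be at least 1")
    return threads
```

Passing the result straight to `Pool(processes=...)` then covers both the explicit and the default case.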