Enable parallel OCR processing

At the moment, every page in a PDF is processed one by one using
tesseract. Since each page is processed independently of every other
page, the work can be spread across the cores of a multi-core machine.

This PR introduces a multiprocessing pool to process multiple pages
simultaneously. The number of threads to use can be specified in the
environment variable `PAPERLESS_OCR_THREADS`. If it is unset, this
defaults to the number of cores/hyperthreads Python detects for your
system.
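
The approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `ocr_page`, `resolve_worker_count`, and `ocr_document` are hypothetical names, and the tesseract call is replaced by a placeholder. Only the `PAPERLESS_OCR_THREADS` variable name comes from the commit itself.

```python
import os
from multiprocessing import Pool


def ocr_page(page_number):
    """Hypothetical stand-in for running tesseract on a single page."""
    return "text of page {}".format(page_number)


def resolve_worker_count(environ=os.environ):
    """Read PAPERLESS_OCR_THREADS; return None when it is unset or empty."""
    raw = environ.get("PAPERLESS_OCR_THREADS")
    # Environment values are strings, so convert before handing to Pool.
    return int(raw) if raw else None


def ocr_document(page_numbers, workers=None):
    """OCR every page in a worker pool and return the texts in page order."""
    # With processes=None, Pool sizes itself to the CPU count Python detects.
    with Pool(processes=workers) as pool:
        return pool.map(ocr_page, page_numbers)


if __name__ == "__main__":
    texts = ocr_document(range(3), resolve_worker_count())
    print(texts)
```

Because `Pool.map` returns results in the order of its input, the page texts can be joined afterwards without any extra bookkeeping.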
This commit is contained in:
Pit Kleyersburg
2016-02-14 15:57:42 +01:00
parent 6b0a537bff
commit f5beda9c56
2 changed files with 18 additions and 5 deletions

@@ -144,6 +144,9 @@ MEDIA_URL = "/media/"
# documents. It should be a 3-letter language code consistent with ISO 639.
OCR_LANGUAGE = "eng"
# The number of threads to use for OCR (None if the variable is unset)
OCR_THREADS = os.environ.get("PAPERLESS_OCR_THREADS")
# If this is true, any failed attempts to OCR a PDF will result in the PDF being
# indexed anyway, with whatever we could get. If it's False, the file will
# simply be left in the CONSUMPTION_DIR.