mirror of https://github.com/paperless-ngx/paperless-ngx.git, synced 2025-10-30 03:56:23 -05:00

	Enable parallel OCR processing
At the moment, every page in a PDF is processed one at a time using Tesseract. Since each page is processed independently of every other page, multi-core machines can work on several pages at once. This PR introduces a multiprocessing pool to process multiple pages simultaneously. The number of pool workers can be set with the environment variable `PAPERLESS_OCR_THREADS`. If the variable is unset, it defaults to the number of cores/hyperthreads Python detects on your system.
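The idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `ocr_page` and `ocr_pages` are hypothetical names, and `ocr_page` only returns a placeholder string instead of calling Tesseract.

```python
from multiprocessing import Pool


def ocr_page(page_path):
    """Hypothetical stand-in for running Tesseract on a single page image."""
    # A real implementation would invoke the OCR engine here.
    return f"text of {page_path}"


def ocr_pages(page_paths, processes=None):
    """OCR every page and return the results in page order."""
    if processes == 1:
        # A single worker gives no parallelism, so skip the pool overhead.
        return [ocr_page(path) for path in page_paths]
    # processes=None lets Pool pick the CPU count that Python detects.
    with Pool(processes=processes) as pool:
        # map() returns results in input order, even if pages finish out of order.
        return pool.map(ocr_page, page_paths)
```

Because `Pool.map` keeps the input order, the page texts can be joined afterwards without re-sorting. On platforms that start workers by spawning a fresh interpreter, a call such as `ocr_pages(paths)` should sit under an `if __name__ == "__main__":` guard.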
@@ -144,6 +144,9 @@ MEDIA_URL = "/media/"
 # documents.  It should be a 3-letter language code consistent with ISO 639.
 OCR_LANGUAGE = "eng"
 
+# The amount of threads to use for OCR
+OCR_THREADS = os.environ.get("PAPERLESS_OCR_THREADS")
+
 # If this is true, any failed attempts to OCR a PDF will result in the PDF being
 # indexed anyway, with whatever we could get.  If it's False, the file will
 # simply be left in the CONSUMPTION_DIR.
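Note that `os.environ.get` returns a string (or `None` when the variable is unset), while `multiprocessing.Pool` expects an integer or `None` for its `processes` argument. How the consumer converts the setting is not shown in this hunk; one possible approach, using a hypothetical helper named `ocr_threads_from_env`, might look like this:

```python
import os


def ocr_threads_from_env(environ=os.environ):
    """Return the worker count as an int, or None if the variable is unset.

    Hypothetical helper for illustration; it is not part of this diff.
    """
    raw = environ.get("PAPERLESS_OCR_THREADS")
    if raw is None or raw.strip() == "":
        # Unset or empty: let the pool choose its default worker count.
        return None
    threads = int(raw)  # Raises ValueError for non-numeric input.
    if threads < 1:
        raise ValueError("PAPERLESS_OCR_THREADS must be at least 1")
    return threads
```

Passing the mapping as a parameter keeps the helper easy to check without touching the real environment, for example `ocr_threads_from_env({"PAPERLESS_OCR_THREADS": "4"})`.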
	 Pit Kleyersburg