paperless-ngx

mirror of https://github.com/paperless-ngx/paperless-ngx.git synced 2025-04-19 10:19:27 -05:00

Author	SHA1	Message	Date
Pit Kleyersburg	aeab9a0e81	Detect language only on one page of PDF. Currently, the entire document is processed to detect its language. If the detected language differs from the default one, the entire document is processed again for the new language. This PR analyzes the middle page for its language, then either processes the remaining pages with the default language if it did not differ, or processes all pages for the newly guessed language. The number of processed pages drops from a worst case of `2n` to a worst case of `n+1`.	2016-02-14 17:55:13 +01:00
Daniel Quinn	7843ea5037	Added and implemented a rudimentary logger	2016-02-14 16:09:52 +00:00
Pit Kleyersburg	20b2408dbb	Ensure `OCR_THREADS` is integer, add documentation	2016-02-14 16:37:38 +01:00
Pit Kleyersburg	f5beda9c56	Enable parallel OCR processing. At the moment, every page in a PDF is processed one by one using tesseract. Since each page is processed independently of every other page, multi-core machines can be put to use. This PR introduces a multiprocessing pool to process multiple pages simultaneously. The number of threads to use can be specified in the environment variable `PAPERLESS_OCR_THREADS`, which defaults to the number of cores/hyperthreads Python detects for your system.	2016-02-14 15:57:42 +01:00
Daniel Quinn	a846b3f7b8	Adding some more debugging	2016-02-13 00:57:05 +00:00
Daniel Quinn	2421f559be	Simpler regex	2016-02-12 08:27:09 +00:00
Daniel Quinn	a022fcb8f1	Fixed the auto-naming regexes	2016-02-11 22:05:55 +00:00
Daniel Quinn	48761911b3	Image imports and consumption by mail work	2016-02-06 17:05:36 +00:00
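
The parallel OCR approach described in commits f5beda9c56 and 20b2408dbb could look roughly like the sketch below. It is not the project's actual code: `ocr_page`, `ocr_thread_count`, and `ocr_pages` are hypothetical names, the tesseract call is replaced by a placeholder, and `multiprocessing.pool.ThreadPool` is swapped in for a process pool because it offers the same `map` interface while keeping the example self-contained. Only the `PAPERLESS_OCR_THREADS` variable and its CPU-count default come from the commit message.

```python
import os
from multiprocessing.pool import ThreadPool


def ocr_thread_count(environ=os.environ):
    """Return the pool size from PAPERLESS_OCR_THREADS, else the CPU count.

    int() raises ValueError for a non-integer value, which mirrors the
    "ensure OCR_THREADS is integer" intent of commit 20b2408dbb.
    """
    raw = environ.get("PAPERLESS_OCR_THREADS")
    if raw is None:
        return os.cpu_count() or 1
    return int(raw)


def ocr_page(page):
    """Hypothetical stand-in for running tesseract on a single page."""
    return page.upper()


def ocr_pages(pages, threads):
    """OCR every page concurrently; map() keeps results in page order."""
    with ThreadPool(processes=threads) as pool:
        return pool.map(ocr_page, pages)


if __name__ == "__main__":
    print(ocr_pages(["page one", "page two", "page three"], ocr_thread_count()))
```

Because pages are independent, the only coordination needed is collecting the results in their original order, which `map` already guarantees.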

158 Commits
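
The middle-page language detection described in commit aeab9a0e81 can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `ocr_document` is a hypothetical name, and the `ocr` and `guess_language` callables are injected stand-ins for the real tesseract call and language guesser, neither of which appears in this log.

```python
def ocr_document(pages, default_lang, ocr, guess_language):
    """OCR all pages, detecting the language from the middle page only.

    ocr(page, lang) -> text and guess_language(text) -> lang code (or None)
    are hypothetical callables supplied by the caller.
    """
    if not pages:
        return []

    middle = len(pages) // 2
    middle_text = ocr(pages[middle], default_lang)
    guessed = guess_language(middle_text) or default_lang

    if guessed == default_lang:
        # Reuse the middle page's text and OCR only the other n - 1 pages,
        # for n OCR calls in total.
        return [
            middle_text if index == middle else ocr(page, default_lang)
            for index, page in enumerate(pages)
        ]

    # The language differs: redo every page with the guessed language,
    # for n + 1 OCR calls in total (the worst case named in the commit).
    return [ocr(page, guessed) for page in pages]
```

With three pages, a matching language costs three OCR calls and a mismatch costs four, compared with six when the whole document is processed twice.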