27 Commits

Author SHA1 Message Date
Tikitu de Jager
95217e8e21 Use FileInfo directly instead of via indirection 2016-03-07 21:08:07 +02:00
Tikitu de Jager
1f75af0137 Extract filename parsing into testable class 2016-03-07 21:05:04 +02:00
Pit Kleyersburg
fb36a49c26 Add unpaper as another pre-processing step 2016-03-06 15:30:37 +01:00
Daniel Quinn
495ed1c36c Added thumbnail generation to the consumer 2016-03-05 12:09:06 +00:00
Daniel Quinn
5d4587ef8b Accounted for .sender in a few places 2016-03-04 09:14:50 +00:00
Daniel Quinn
070463b85a s/Sender/Correspondent & reworked the (im|ex)porter 2016-03-03 20:52:42 +00:00
Daniel Quinn
fad466477b More verbose error logging 2016-03-03 18:18:48 +00:00
Daniel Quinn
631aa99d92 No need to pass verbosity around anymore 2016-02-28 00:39:40 +00:00
Daniel Quinn
2fe9b0cbc1 New logging appears to work 2016-02-27 20:18:50 +00:00
Daniel Quinn
1aecb1e63a Compensate for case and format of jpg vs. jpeg 2016-02-23 20:15:13 +00:00
Daniel Quinn
3a7923e32d Moved pyocr.get_available_tools() into a method 2016-02-21 02:24:05 +00:00
Daniel Quinn
422ae9303a pep8 2016-02-21 00:14:50 +00:00
Daniel Quinn
51b19f4c19 Issue #57 2016-02-20 22:30:01 +00:00
Pit Kleyersburg
c45f951ca0 Ignore error if orientation detection fails
Fixes an additional issue that came up in #48.
2016-02-19 09:52:32 +01:00
Pit Kleyersburg
c34d57a872 Detect image orientation if the OCR supports it
Fixes issue #47.
2016-02-18 09:37:13 +01:00
Daniel Quinn
1e7ece81ee Fixes #45 2016-02-17 23:07:54 +00:00
Daniel Quinn
6f95b05287 Support appropriate sorting for long documents 2016-02-17 00:10:05 +00:00
Pit Kleyersburg
46f8f492f5 Safely and non-randomly create scratch directory
Creating the scratch files in `_get_grayscale` using a random integer is
inherently unsafe and can cause collisions. It should also be
unnecessary, given that the files are cleaned up after the OCR run.

Since we don't know if OCR runs might be parallel in the future, this
commit implements thread-safe and deterministic directory-creation.

Additionally, it fixes the call to `_cleanup` by `consume`. In the
current implementation, `_cleanup` is not called if the last consumed
document failed with an `OCRError`; this commit fixes that.
2016-02-16 12:15:57 +01:00
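
A minimal sketch of the idea in commit 46f8f492f5, not the project's actual code. It assumes `tempfile.mkdtemp` as the collision-free way to create a scratch directory (the commit itself does not say which mechanism it uses) and a `try`/`finally` so cleanup also runs when the last document fails. `OCRError` here is a stand-in class, and `consume` and `process` are hypothetical names used only for illustration.

```python
import shutil
import tempfile


class OCRError(Exception):
    """Stand-in for the project's OCR failure exception (assumption)."""


def consume(documents, process):
    """Process each document in its own scratch directory, always cleaning up.

    `process` is a hypothetical callable taking (document, scratch_dir).
    """
    results = []
    for doc in documents:
        # mkdtemp creates the directory atomically with a unique name,
        # so two concurrent runs cannot end up sharing a directory.
        scratch = tempfile.mkdtemp(prefix="paperless-")
        try:
            results.append(process(doc, scratch))
        except OCRError:
            # A failed document is recorded, but must not skip the cleanup.
            results.append(None)
        finally:
            shutil.rmtree(scratch, ignore_errors=True)
    return results
```

Putting the cleanup in `finally` is what covers the case described above, where the last consumed document raises an `OCRError`.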
Daniel Quinn
a0f4f6c5f2 Fixed merge conflict and did some pep8 2016-02-14 17:13:48 +00:00
Pit Kleyersburg
aeab9a0e81 Detect language only on one page of PDF
Currently, the entire document is processed to detect the language. If
the detected language differs from the default one, the entire document
is processed again for the new language.

This PR analyzes the middle page for its language and either processes
the remaining pages with the default language if it didn't differ, or
processes all pages for the new guessed language.

The number of processed pages drops from a worst case of `2n` to a
worst case of `n+1`.
2016-02-14 17:55:13 +01:00
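
The page-count argument in commit aeab9a0e81 can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: `ocr_document`, `ocr_page` and `guess_language` are hypothetical names, and the default language code `"eng"` is assumed.

```python
def ocr_document(pages, ocr_page, guess_language, default_lang="eng"):
    """Return (language, texts) while OCR-ing at most len(pages) + 1 pages.

    `ocr_page(page, lang)` and `guess_language(text)` are hypothetical
    callables supplied by the caller.
    """
    if not pages:
        return default_lang, []

    # OCR only the middle page with the default language first.
    middle = len(pages) // 2
    middle_text = ocr_page(pages[middle], default_lang)
    guessed = guess_language(middle_text) or default_lang

    if guessed == default_lang:
        # Reuse the middle page's text and OCR only the remaining pages: n calls.
        texts = [
            middle_text if i == middle else ocr_page(page, default_lang)
            for i, page in enumerate(pages)
        ]
        return default_lang, texts

    # Language differs: redo every page with the guessed language: n + 1 calls.
    return guessed, [ocr_page(page, guessed) for page in pages]
```

With three pages, the matching-language path makes three OCR calls and the differing-language path makes four, which is the `n+1` worst case.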
Daniel Quinn
7843ea5037 Added and implemented a rudimentary logger 2016-02-14 16:09:52 +00:00
Pit Kleyersburg
20b2408dbb Ensure OCR_THREADS is integer, add documentation 2016-02-14 16:37:38 +01:00
Pit Kleyersburg
f5beda9c56 Enable parallel OCR processing
At the moment, every page in a PDF will be processed one by one using
tesseract. Since the processing of a single page is independent from every
other page, one can make use of multi-core machines.

This PR introduces a multiprocessing pool to process multiple pages
simultaneously. The number of threads to use can be specified in the
environment variable `PAPERLESS_OCR_THREADS`. This will default to the
number of cores/hyperthreads Python detects for your system.
2016-02-14 15:57:42 +01:00
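
A rough sketch of the approach in commit f5beda9c56, assuming a plain `multiprocessing.Pool`. Only the `PAPERLESS_OCR_THREADS` variable comes from the commit message; `get_ocr_threads`, `ocr_page` and `ocr_pages` are hypothetical names, and `ocr_page` is a placeholder rather than a real tesseract call.

```python
import os
from multiprocessing import Pool


def get_ocr_threads():
    """Read PAPERLESS_OCR_THREADS, falling back to the detected CPU count."""
    raw = os.getenv("PAPERLESS_OCR_THREADS")
    if raw is None:
        return os.cpu_count() or 1
    # int() raises ValueError for non-integer input; clamp to at least one worker.
    return max(1, int(raw))


def ocr_page(page):
    """Placeholder for the per-page OCR step (hypothetical)."""
    return page.upper()


def ocr_pages(pages):
    """OCR independent pages in parallel; results keep the input order."""
    with Pool(processes=get_ocr_threads()) as pool:
        return pool.map(ocr_page, pages)


if __name__ == "__main__":
    print(ocr_pages(["page one", "page two"]))
```

Because `Pool.map` returns results in input order, the pages can be reassembled without extra bookkeeping.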
Daniel Quinn
a846b3f7b8 Adding some more debugging 2016-02-13 00:57:05 +00:00
Daniel Quinn
2421f559be Simpler regex 2016-02-12 08:27:09 +00:00
Daniel Quinn
a022fcb8f1 Fixed the auto-naming regexes 2016-02-11 22:05:55 +00:00
Daniel Quinn
48761911b3 Image imports and consumption by mail work 2016-02-06 17:05:36 +00:00