I've broken out the OCR-specific code from the consumers and dumped it
all into its own app, `paperless_tesseract`. This new app should serve
as a sample of how to create one's own consumer for different file
types.
Documentation for how to do this isn't ready yet, but for the impatient:
* Create a new app
* containing a `parsers.py` for your parser modelled after
`paperless_tesseract.parsers.RasterisedDocumentParser`
* containing a `signals.py` with a handler moddelled after
`paperless_tesseract.signals.ConsumerDeclaration`
* connect the signal handler to
`documents.signals.document_consumer_declaration` in
`your_app.apps`
* Install the app into Paperless by declaring
`PAPERLESS_INSTALLED_APPS=your_app`. Additional apps should be
separated with commas.
* Restart the consumer
Creating the scratch-files in `_get_grayscale` using a random integer is
for one inherently unsafe and can cause a collision. On the other hand,
it should be unnecessary given that the files will be cleaned up after
the OCR run.
Since we don't know if OCR runs might be parallel in the future, this
commit implements thread-safe and deterministic directory-creation.
Additionally it fixes the call to `_cleanup` by `consume`. In the
current implementation `_cleanup` will not be called if the last
consumed document failed with an `OCRError`, this commit fixes this.
To detect the language currently the entire document gets processed. If
a different language has been detected than the default one, the entire
document will be processed again for the new language.
This PR analyzes the middle page for its language and either processes
the remaining pages with the default language if it didn't differ, or
processes all pages for the new guessed language.
The amount of processed pages comes down from the worst case `2n` to
worst case `n+1`.