157 Commits

Author SHA1 Message Date
Erik Arvstedt
a56a3eb86d Use os.scandir instead of os.listdir
It's simpler and better suited for use cases introduced in later commits.
2018-05-11 14:05:25 +02:00
Erik Arvstedt
2fe7df8ca0 Consume documents in order of increasing mtime
This increases overall usability, especially for multi-page scans.
Previously, the consumption order was undefined (see os.listdir())
2018-05-11 14:04:37 +02:00
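The mtime ordering described in commit 2fe7df8ca0 can be sketched roughly as follows. This is a minimal illustration, not the project's actual code; the function name `files_by_mtime` is hypothetical, and it uses `os.scandir` as the later commit a56a3eb86d does.

```python
import os


def files_by_mtime(directory):
    """Return paths of regular files in `directory`, oldest mtime first.

    Hypothetical helper: the name and signature are assumptions made for
    this sketch, not taken from the Paperless source.
    """
    with os.scandir(directory) as entries:
        # DirEntry.stat() gives us the mtime for each regular file.
        files = [(e.stat().st_mtime, e.path) for e in entries if e.is_file()]
    # Sorting the (mtime, path) tuples puts the oldest file first, so the
    # pages of a multi-page scan are consumed in the order they were written.
    return [path for _, path in sorted(files)]
```

Unlike `os.listdir()`, whose result order is arbitrary, sorting on mtime gives a defined consumption order.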
Erik Arvstedt
873c98dddb Refactor: extract fn 'make_dirs' 2018-05-11 14:04:36 +02:00
Daniel Quinn
73e62600c2 Clean up docstring to be properly rst 2018-03-03 18:43:20 +00:00
Ovv
8fefafb844 style & test 2018-03-03 18:43:20 +00:00
Ovv
d1a57b5d68 Configuration cli argument for document_consumer 2018-03-03 18:43:20 +00:00
Daniel Quinn
ea6d040809 Monitor return codes of calls to convert and unpaper
...and handle the failures nicely.  Addresses #303.
2018-02-18 16:02:27 +00:00
Daniel Quinn
fb1da4834c Style and removal of Python 2.7 stuff 2018-02-18 15:55:55 +00:00
Wolf-Bastian Pöttner
b140935843 Add support for a heuristic that extracts the document date from its text 2018-01-28 19:37:10 +01:00
Daniel Quinn
fa4924d5ba fix: allow for caps in file name suffixes #206
@schinkelg ran aground of this one and I took the opportunity to add a
test to catch this sort of thing for next time.
2017-03-28 21:14:24 +00:00
Daniel Quinn
55e81ca4bb feat: refactor for pluggable consumers
I've broken out the OCR-specific code from the consumers and dumped it
all into its own app, `paperless_tesseract`.  This new app should serve
as a sample of how to create one's own consumer for different file
types.

Documentation for how to do this isn't ready yet, but for the impatient:

* Create a new app
    * containing a `parsers.py` for your parser modelled after
      `paperless_tesseract.parsers.RasterisedDocumentParser`
    * containing a `signals.py` with a handler modelled after
      `paperless_tesseract.signals.ConsumerDeclaration`
    * connect the signal handler to
      `documents.signals.document_consumer_declaration` in
      `your_app.apps`
* Install the app into Paperless by declaring
  `PAPERLESS_INSTALLED_APPS=your_app`.  Additional apps should be
  separated with commas.
* Restart the consumer
2017-03-25 15:10:25 +00:00
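The declaration pattern outlined in commit 55e81ca4bb might look something like the sketch below. It is Django-free so that it runs on its own: the small `Signal` class is a stand-in for a Django signal such as `documents.signals.document_consumer_declaration`, and `TextParser`, the `.txt` pattern, the handler's return value, and `find_parser` are all assumptions made for illustration rather than the real Paperless API.

```python
import re


class Signal:
    """Tiny stand-in for a Django signal (assumption, for a runnable sketch)."""

    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        self._receivers.append(receiver)

    def send(self, sender, **kwargs):
        # Like Django, return a list of (receiver, response) pairs.
        return [(r, r(sender, **kwargs)) for r in self._receivers]


# Stand-in for documents.signals.document_consumer_declaration.
document_consumer_declaration = Signal()


class TextParser:
    """Hypothetical parser; in a real app this would live in parsers.py."""

    def __init__(self, path):
        self.path = path


class ConsumerDeclaration:
    """Hypothetical handler; in a real app this would live in signals.py."""

    # Case-insensitive, so upper-case suffixes such as .TXT also match.
    MATCHING_FILES = re.compile(r"^.*\.txt$", re.IGNORECASE)

    @classmethod
    def handle(cls, sender, **kwargs):
        # Answer the signal with a callable that tests a file name.
        return cls.test

    @classmethod
    def test(cls, filename):
        if cls.MATCHING_FILES.match(filename):
            return TextParser
        return None


# In a real app this connection would be made in your_app.apps.
document_consumer_declaration.connect(ConsumerDeclaration.handle)


def find_parser(filename):
    """Ask every declared consumer whether it can handle `filename`."""
    for _receiver, test in document_consumer_declaration.send(sender=None):
        parser = test(filename)
        if parser is not None:
            return parser
    return None
```

The point of the indirection is that the consumer never imports a parser directly; each installed app announces what it can handle when the signal is sent.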
Daniel Quinn
18495ce9da Fix for #154
* Added a test with a faked pyocr and tesseract
* Added a catch for pyocr's *other* TesseractError
2016-11-27 15:06:45 +00:00
Daniel Quinn
ca21929cee Moved logging logic into the consumer 2016-10-26 09:52:09 +00:00
Daniel Quinn
8e58406881 pep8 corrections 2016-10-26 09:32:59 +00:00
Aleksandr Bogdanov
63de2ca1b0 Collapsing excess whitespace after OCR 2016-10-12 01:46:34 +02:00
Daniel Quinn
1ce76a5486 Actually write the date found in the file name 2016-08-20 18:11:51 +01:00
Lenz Weber
018efc576b wait until file is completely transmitted
negation was missing for feature to be active, see #128
2016-06-26 10:18:58 +02:00
Brian Martin
b6ae129ad1 Sample Config and Bug Fix
Update sample config to reflect new setting variable.
Change consumer to handle density setting as str instead of int.
2016-05-13 23:23:58 -04:00
Brian Martin
52c5aafb3f Convert Density
Add settings variable for the convert density setting.
If no variable is set, default to 300.
2016-05-13 22:47:40 -04:00
Daniel Quinn
e96c7448bc Fix for #107 2016-04-11 23:28:12 +01:00
Daniel Quinn
90939be6af @Pitkley made a good suggestion in #98 2016-04-10 17:39:49 +01:00
Daniel Quinn
64b72d4337 Added test for duplicates 2016-04-03 18:44:00 +01:00
Daniel Quinn
bbe691f342 Merge pull request #101 from danielquinn/issue/89
Closes #89.
2016-03-28 14:25:56 +01:00
Daniel Quinn
b4e648e1e3 Test All The Things 2016-03-28 14:16:26 +01:00
Daniel Quinn
b92e007e15 Removed log components and introduced signals for tags & correspondents 2016-03-28 11:11:15 +01:00
Daniel Quinn
49b56425e8 Merge branch 'master' into issue/81 2016-03-25 20:56:30 +00:00
Daniel Quinn
b387be6f25 I didn't mean to explicitly set -limit 2016-03-25 20:33:00 +00:00
Daniel Quinn
9991f5a6b2 Introducing optional env vars for ImageMagick 2016-03-25 20:31:15 +00:00
Daniel Quinn
0aa0513004 Modifications for support for dates 2016-03-24 19:18:33 +00:00
Daniel Quinn
1170139127 Added a consume-start and consume-finish signal 2016-03-14 21:20:44 +00:00
Tikitu de Jager
95217e8e21 Use FileInfo directly instead of via indirection 2016-03-07 21:08:07 +02:00
Tikitu de Jager
1f75af0137 Extract filename parsing into testable class 2016-03-07 21:05:04 +02:00
Pit Kleyersburg
fb36a49c26 Add unpaper as another pre-processing step 2016-03-06 15:30:37 +01:00
Daniel Quinn
495ed1c36c Added thumbnail generation to the consumer 2016-03-05 12:09:06 +00:00
Daniel Quinn
5d4587ef8b Accounted for .sender in a few places 2016-03-04 09:14:50 +00:00
Daniel Quinn
070463b85a s/Sender/Correspondent & reworked the (im|ex)porter 2016-03-03 20:52:42 +00:00
Daniel Quinn
fad466477b More verbose error logging 2016-03-03 18:18:48 +00:00
Daniel Quinn
631aa99d92 No need to pass verbosity around anymore 2016-02-28 00:39:40 +00:00
Daniel Quinn
2fe9b0cbc1 New logging appears to work 2016-02-27 20:18:50 +00:00
Daniel Quinn
1aecb1e63a Compensate for case and format of jpg vs. jpeg 2016-02-23 20:15:13 +00:00
Daniel Quinn
3a7923e32d Moved pyocr.get_available_tools() into a method 2016-02-21 02:24:05 +00:00
Daniel Quinn
422ae9303a pep8 2016-02-21 00:14:50 +00:00
Daniel Quinn
51b19f4c19 Issue #57 2016-02-20 22:30:01 +00:00
Pit Kleyersburg
c45f951ca0 Ignore error if orientation detection fails
Fixes an additional issue that came up in #48.
2016-02-19 09:52:32 +01:00
Pit Kleyersburg
c34d57a872 Detect image orientation if the OCR supports it
Fixes issue #47.
2016-02-18 09:37:13 +01:00
Daniel Quinn
1e7ece81ee Fixes #45 2016-02-17 23:07:54 +00:00
Daniel Quinn
6f95b05287 Support appropriate sorting for long documents 2016-02-17 00:10:05 +00:00
Pit Kleyersburg
46f8f492f5 Safely and non-randomly create scratch directory
Creating the scratch files in `_get_grayscale` using a random integer is
inherently unsafe and can cause a collision. It should also be
unnecessary, given that the files will be cleaned up after the OCR
run.

Since we don't know if OCR runs might be parallel in the future, this
commit implements thread-safe and deterministic directory-creation.

Additionally, it fixes the call to `_cleanup` by `consume`. In the
current implementation, `_cleanup` is not called if the last consumed
document failed with an `OCRError`; this commit fixes that.
2016-02-16 12:15:57 +01:00
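One way to get the two properties commit 46f8f492f5 describes (collision-free scratch directories and cleanup that still runs after an `OCRError`) is sketched below. It uses `tempfile.mkdtemp` and `try`/`finally`, which is a swapped-in technique and not necessarily what the commit itself does; the `consume` signature, the `process` callback, and the `paperless-` prefix are assumptions.

```python
import os
import shutil
import tempfile


class OCRError(Exception):
    """Stand-in for the consumer's OCR failure exception."""


def consume(path, process):
    """Run `process(path, scratch_dir)` with a private scratch directory.

    Hypothetical signature: `process` is any callable doing the OCR work.
    """
    # mkdtemp creates the directory atomically with a unique name, so two
    # concurrent runs cannot end up sharing (or clobbering) a directory.
    scratch = tempfile.mkdtemp(prefix="paperless-")
    try:
        return process(path, scratch)
    finally:
        # Runs whether process() returned or raised, so a failing document
        # no longer leaves its scratch files behind.
        shutil.rmtree(scratch, ignore_errors=True)
```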
Daniel Quinn
a0f4f6c5f2 Fixed merge conflict and did some pep8 2016-02-14 17:13:48 +00:00
Pit Kleyersburg
aeab9a0e81 Detect language only on one page of PDF
Currently, the entire document gets processed to detect the language. If
a language other than the default is detected, the entire document is
processed again for the new language.

This PR analyzes the middle page for its language and either processes
the remaining pages with the default language if it didn't differ, or
processes all pages for the new guessed language.

The number of processed pages comes down from the worst case `2n` to
worst case `n+1`.
2016-02-14 17:55:13 +01:00
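The middle-page strategy from commit aeab9a0e81 can be illustrated with a short sketch. The function name `ocr_document`, the `ocr` and `guess_language` callables, and the default language code `"eng"` are assumptions for illustration; the real consumer's OCR and language-guessing calls are not shown here.

```python
def ocr_document(pages, ocr, guess_language, default_lang="eng"):
    """OCR all pages, detecting the language from the middle page only.

    `ocr(page, lang)` returns text and `guess_language(text)` returns a
    language code or None; both are hypothetical callables.
    """
    if not pages:
        return []

    middle = len(pages) // 2
    middle_text = ocr(pages[middle], default_lang)
    lang = guess_language(middle_text) or default_lang

    if lang == default_lang:
        # Guess matched the default: reuse the middle page's text and OCR
        # only the remaining pages, for n OCR calls in total.
        return [
            middle_text if i == middle else ocr(page, default_lang)
            for i, page in enumerate(pages)
        ]

    # Guess differed: redo every page in the new language. Together with
    # the one detection pass this is n + 1 OCR calls, rather than 2n.
    return [ocr(page, lang) for page in pages]
```

Counting calls to `ocr` for an `n`-page document gives `n` when the guess matches the default and `n + 1` when it does not.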