Erik Arvstedt
f018e8e54f
Refactor: extract fn try_consume_file
...
The main purpose of this change is to make the following commits more
readable.
2018-05-11 14:05:28 +02:00
Erik Arvstedt
a56a3eb86d
Use os.scandir instead of os.listdir
...
It's simpler and better suited for use cases introduced in later commits.
2018-05-11 14:05:25 +02:00
Erik Arvstedt
2fe7df8ca0
Consume documents in order of increasing mtime
...
This increases overall usability, especially for multi-page scans.
Previously, the consumption order was undefined (see os.listdir())
2018-05-11 14:04:37 +02:00
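A minimal sketch of ordering by modification time with `os.scandir`, as the two commits above describe. The function name `files_by_mtime` is hypothetical and this is not the code from the commit itself.

```python
import os


def files_by_mtime(directory):
    """Return the paths of regular files in `directory`, oldest first."""
    with os.scandir(directory) as entries:
        # Pair each file with its mtime so that sorting orders by time first.
        files = [(entry.stat().st_mtime, entry.path)
                 for entry in entries if entry.is_file()]
    return [path for _mtime, path in sorted(files)]
```

With this ordering, the pages of a multi-page scan that were written one after another are consumed in the order they were produced.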
Erik Arvstedt
873c98dddb
Refactor: extract fn 'make_dirs'
2018-05-11 14:04:36 +02:00
Daniel Quinn
73e62600c2
Clean up docstring to be properly rst
2018-03-03 18:43:20 +00:00
Ovv
8fefafb844
style & test
2018-03-03 18:43:20 +00:00
Ovv
d1a57b5d68
Configuration cli argument for document_consumer
2018-03-03 18:43:20 +00:00
Daniel Quinn
ea6d040809
Monitor return codes of calls to convert and unpaper
...
...and handle the failures nicely. Addresses #303.
2018-02-18 16:02:27 +00:00
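One way to monitor the return code of an external tool such as `convert` or `unpaper` is sketched below. The names `run_tool` and `ExternalToolError` are hypothetical and are not taken from the commit.

```python
import subprocess


class ExternalToolError(Exception):
    """Hypothetical exception for a tool that exited with a non-zero status."""


def run_tool(args):
    """Run an external command and raise if it reports failure."""
    result = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if result.returncode != 0:
        raise ExternalToolError(
            "{} exited with status {}".format(args[0], result.returncode)
        )
    return result.stdout
```

A caller can catch `ExternalToolError`, log it, and skip the affected document instead of continuing with missing or broken intermediate files.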
Daniel Quinn
fb1da4834c
Style and removal of Python 2.7 stuff
2018-02-18 15:55:55 +00:00
Wolf-Bastian Pöttner
b140935843
Add support for a heuristic that extracts the document date from its text
2018-01-28 19:37:10 +01:00
Daniel Quinn
fa4924d5ba
fix: allow for caps in file name suffixes #206
...
@schinkelg ran aground of this one and I took the opportunity to add a
test to catch this sort of thing for next time.
2017-03-28 21:14:24 +00:00
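A sketch of case-insensitive suffix matching of the kind this fix describes. The pattern and the list of suffixes are assumptions for illustration, not the project's actual expression.

```python
import re

# Hypothetical list of accepted suffixes; re.IGNORECASE also accepts ".PDF" or ".Jpg".
SUFFIX_PATTERN = re.compile(r"\.(pdf|jpe?g|png|gif|tiff?)$", re.IGNORECASE)


def has_supported_suffix(file_name):
    """Return True if the file name ends in a supported suffix, in any case."""
    return SUFFIX_PATTERN.search(file_name) is not None
```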
Daniel Quinn
55e81ca4bb
feat: refactor for pluggable consumers
...
I've broken out the OCR-specific code from the consumers and dumped it
all into its own app, `paperless_tesseract`. This new app should serve
as a sample of how to create one's own consumer for different file
types.
Documentation for how to do this isn't ready yet, but for the impatient:
* Create a new app
  * containing a `parsers.py` for your parser modelled after
    `paperless_tesseract.parsers.RasterisedDocumentParser`
  * containing a `signals.py` with a handler modelled after
    `paperless_tesseract.signals.ConsumerDeclaration`
  * connect the signal handler to
    `documents.signals.document_consumer_declaration` in
    `your_app.apps`
* Install the app into Paperless by declaring
  `PAPERLESS_INSTALLED_APPS=your_app`. Additional apps should be
  separated with commas.
* Restart the consumer
2017-03-25 15:10:25 +00:00
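The declaration idea in the commit above can be illustrated without Django. The sketch below is framework-free: the `DECLARATIONS` list stands in for the signal receivers, and every name in it (`TextFileParser`, `text_file_declaration`, `find_parser`) is hypothetical. It is not the interface of `paperless_tesseract` or of `documents.signals`.

```python
import re


class TextFileParser:
    """Hypothetical parser for plain-text files."""

    def __init__(self, path):
        self.path = path


def text_file_declaration(path):
    """Return a parser class if this app can handle `path`, else None."""
    if re.search(r"\.txt$", path, re.IGNORECASE):
        return TextFileParser
    return None


# Stand-in for the handlers that a real app would connect to the signal.
DECLARATIONS = [text_file_declaration]


def find_parser(path):
    """Ask each declaration in turn and return the first parser offered."""
    for declaration in DECLARATIONS:
        parser_class = declaration(path)
        if parser_class is not None:
            return parser_class
    return None
```

The point of the pattern is that the consumer does not need to know which file types exist: each installed app declares what it can parse.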
Daniel Quinn
18495ce9da
Fix for #154
...
* Added a test with a faked pyocr and tesseract
* Added a catch for pyocr's *other* TesseractError
2016-11-27 15:06:45 +00:00
Daniel Quinn
ca21929cee
Moved logging logic into the consumer
2016-10-26 09:52:09 +00:00
Daniel Quinn
8e58406881
pep8 corrections
2016-10-26 09:32:59 +00:00
Aleksandr Bogdanov
63de2ca1b0
Collapsing excess whitespace after OCR
2016-10-12 01:46:34 +02:00
Daniel Quinn
1ce76a5486
Actually write the date found in the file name
2016-08-20 18:11:51 +01:00
Lenz Weber
018efc576b
wait until file is completely transmitted
...
A negation was missing, which kept the feature from being active; see #128.
2016-06-26 10:18:58 +02:00
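A common way to wait for a complete transfer is to skip files that were modified very recently. The sketch below assumes that approach; the function name `is_fully_written` and the five-second threshold are hypothetical, and the commit does not state how the check is implemented.

```python
import os
import time


def is_fully_written(path, min_age_seconds=5.0, now=None):
    """Return True if `path` has not been modified for `min_age_seconds`."""
    now = time.time() if now is None else now
    age = now - os.path.getmtime(path)
    # A file that changed moments ago may still be uploading, so it is not ready.
    return age >= min_age_seconds
```

A consumer loop would process a file only when this returns True and retry the others on the next pass.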
Brian Martin
b6ae129ad1
Sample Config and Bug Fix
...
Update sample config to reflect new setting variable.
Change consumer to handle density setting as str instead of int.
2016-05-13 23:23:58 -04:00
Brian Martin
52c5aafb3f
Convert Density
...
Add settings variable for the convert density setting.
If no variable is set, default to 300.
2016-05-13 22:47:40 -04:00
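A sketch of a density setting that defaults to 300 and is kept as a string, as the two commits above describe. The variable name `CONVERT_DENSITY` and both function names are hypothetical; only the `-density` option of ImageMagick's `convert` is taken as given.

```python
import os


def convert_density(environ=None):
    """Return the density as a string, defaulting to "300" when unset."""
    environ = os.environ if environ is None else environ
    # Environment values are strings, and so are command-line arguments,
    # so no conversion to int is needed.
    return environ.get("CONVERT_DENSITY", "300")


def build_convert_args(source, target, environ=None):
    """Build the argument list for a call to ImageMagick's convert."""
    return ["convert", "-density", convert_density(environ), source, target]
```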
Daniel Quinn
e96c7448bc
Fix for #107
2016-04-11 23:28:12 +01:00
Daniel Quinn
90939be6af
@Pitkley made a good suggestion in #98
2016-04-10 17:39:49 +01:00
Daniel Quinn
64b72d4337
Added test for duplicates
2016-04-03 18:44:00 +01:00
Daniel Quinn
bbe691f342
Merge pull request #101 from danielquinn/issue/89
...
Closes #89.
2016-03-28 14:25:56 +01:00
Daniel Quinn
b4e648e1e3
Test All The Things
2016-03-28 14:16:26 +01:00
Daniel Quinn
b92e007e15
Removed log components and introduced signals for tags & correspondents
2016-03-28 11:11:15 +01:00
Daniel Quinn
49b56425e8
Merge branch 'master' into issue/81
2016-03-25 20:56:30 +00:00
Daniel Quinn
b387be6f25
I didn't mean to explicitly set -limit
2016-03-25 20:33:00 +00:00
Daniel Quinn
9991f5a6b2
Introducing optional env vars for ImageMagick
2016-03-25 20:31:15 +00:00
Daniel Quinn
0aa0513004
Modifications for support for dates
2016-03-24 19:18:33 +00:00
Daniel Quinn
1170139127
Added a consume-start and consume-finish signal
2016-03-14 21:20:44 +00:00
Tikitu de Jager
95217e8e21
Use FileInfo directly instead of via indirection
2016-03-07 21:08:07 +02:00
Tikitu de Jager
1f75af0137
Extract filename parsing into testable class
2016-03-07 21:05:04 +02:00
Pit Kleyersburg
fb36a49c26
Add unpaper as another pre-processing step
2016-03-06 15:30:37 +01:00
Daniel Quinn
495ed1c36c
Added thumbnail generation to the consumer
2016-03-05 12:09:06 +00:00
Daniel Quinn
5d4587ef8b
Accounted for .sender in a few places
2016-03-04 09:14:50 +00:00
Daniel Quinn
070463b85a
s/Sender/Correspondent & reworked the (im|ex)porter
2016-03-03 20:52:42 +00:00
Daniel Quinn
fad466477b
More verbose error logging
2016-03-03 18:18:48 +00:00
Daniel Quinn
631aa99d92
No need to pass verbosity around anymore
2016-02-28 00:39:40 +00:00
Daniel Quinn
2fe9b0cbc1
New logging appears to work
2016-02-27 20:18:50 +00:00
Daniel Quinn
1aecb1e63a
Compensate for case and format of jpg vs. jpeg
2016-02-23 20:15:13 +00:00
Daniel Quinn
3a7923e32d
Moved pyocr.get_available_tools() into a method
2016-02-21 02:24:05 +00:00
Daniel Quinn
422ae9303a
pep8
2016-02-21 00:14:50 +00:00
Daniel Quinn
51b19f4c19
Issue #57
2016-02-20 22:30:01 +00:00
Pit Kleyersburg
c45f951ca0
Ignore error if orientation detection fails
...
Fixes an additional issue that came up in #48.
2016-02-19 09:52:32 +01:00
Pit Kleyersburg
c34d57a872
Detect image orientation if the OCR supports it
...
Fixes issue #47.
2016-02-18 09:37:13 +01:00
Daniel Quinn
1e7ece81ee
Fixes #45
2016-02-17 23:07:54 +00:00
Daniel Quinn
6f95b05287
Support appropriate sorting for long documents
2016-02-17 00:10:05 +00:00
Pit Kleyersburg
46f8f492f5
Safely and non-randomly create scratch directory
...
Creating the scratch files in `_get_grayscale` using a random integer is
inherently unsafe because two runs can collide. It should also be
unnecessary, given that the files are cleaned up after the OCR run.
Since we don't know whether OCR runs might be parallel in the future, this
commit implements thread-safe and deterministic directory creation.
Additionally, it fixes the call to `_cleanup` by `consume`. In the
current implementation, `_cleanup` is not called if the last consumed
document failed with an `OCRError`.
2016-02-16 12:15:57 +01:00
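The two ideas in this commit, collision-free scratch directories and cleanup that also runs after a failure, can be sketched as follows. The sketch uses `tempfile.mkdtemp` as a stand-in for the commit's own directory-creation scheme, and the name `process_with_scratch` and the `paperless-` prefix are hypothetical.

```python
import shutil
import tempfile


def process_with_scratch(work, base_dir=None):
    """Run `work(scratch_dir)` in a fresh directory that is always removed."""
    # mkdtemp creates the directory atomically, so parallel runs cannot collide.
    scratch = tempfile.mkdtemp(prefix="paperless-", dir=base_dir)
    try:
        return work(scratch)
    finally:
        # Runs whether `work` returned or raised, e.g. on an OCR error.
        shutil.rmtree(scratch, ignore_errors=True)
```

Putting the cleanup in `finally` addresses the second problem the commit names: the directory is removed even when the last document fails.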
Daniel Quinn
a0f4f6c5f2
Fixed merge conflict and did some pep8
2016-02-14 17:13:48 +00:00