Daniel Quinn
46cbd10ba0
Merge pull request #399 from jat255/ENH_convert_only_one_page
...
Speed up thumbnail generation for PDFs
2018-09-09 21:12:42 +01:00
Daniel Quinn
c99f5923d5
Rename parsers to DATE_REGEX
...
In moving the `parsers` variable into the package-level, it lost the
context, so a more descriptive name was needed.
2018-09-09 21:02:30 +01:00
Daniel Quinn
2dc35cc856
Merge branch 'ENH_text_consumer' of git://github.com/jat255/paperless into jat255-ENH_text_consumer
2018-09-09 20:52:59 +01:00
Daniel Quinn
5342db6ada
Fix pycodestyle complaints
...
Apparently, pycodestyle updated itself to now check for invalid escape
sequences, and it only complains if the regex in use isn't a raw string
(r"").
2018-09-09 20:00:12 +01:00
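A minimal sketch of the raw-string point in the commit above. The pattern and the sample text are hypothetical and are not the regex used in the repository.

```python
import re

# Hypothetical pattern for illustration only.
# In a normal string literal, "\d" is an invalid escape sequence, which
# newer pycodestyle releases report as W605. The r"" prefix keeps the
# backslash literal, so the regex engine receives \d unchanged.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")

match = DATE_PATTERN.search("Invoice issued 2018-09-09 in Leeds")
found = match.group(0) if match else None
```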
Joshua Taillon
72c828170e
move date-matching regex pattern to base parser module for use by all subclasses
2018-09-05 21:13:36 -04:00
Joshua Taillon
cac63494f0
change tesseract parser to only convert first page to save (potentially) massive amounts of work
2018-09-05 15:18:35 -04:00
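One way to restrict conversion to the first page, sketched under the assumption that ImageMagick's `convert` is the tool being called. The function name, the density value, and the file names are hypothetical; only the `[0]` page-index suffix is ImageMagick's own syntax.

```python
def first_page_convert_args(pdf_path, out_path, density=300):
    """Build a convert command line that rasterises only page 0.

    Hypothetical helper: appending "[0]" to the input path asks
    ImageMagick to read just the first page instead of every page.
    """
    return ["convert", "-density", str(density), f"{pdf_path}[0]", out_path]


# The list can be handed to subprocess.run(); it is only built here.
args = first_page_convert_args("scan.pdf", "page-0000.pnm")
```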
Daniel Quinn
82f9dde055
Account for KeyError problem in #345
2018-04-28 12:20:43 +01:00
Daniel Quinn
c983e73d0f
Account for KeyError problem in #345
2018-04-28 12:19:53 +01:00
Ovv
75ac8d2796
Log detected document date with isoformat
2018-03-04 13:10:49 +01:00
Daniel Quinn
5d01410dc0
Merge pull request #302 from BastianPoe/bugfix/extend_regex_to_find_more_dates
...
Extends the regex to find dates in documents as reported by @isaacsando
2018-02-18 17:23:49 +01:00
Daniel Quinn
ea6d040809
Monitor return codes of calls to convert and unpaper
...
...and handle the failures nicely. Addresses #303.
2018-02-18 16:02:27 +00:00
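A rough sketch of what checking an external tool's return code can look like. `ParseError` and `run_checked` are hypothetical names, and the Python interpreter stands in for `convert` and `unpaper` so the example runs anywhere.

```python
import subprocess
import sys


class ParseError(Exception):
    """Hypothetical exception raised when an external tool fails."""


def run_checked(args):
    # Run the command and turn a non-zero exit status into an exception
    # the caller can handle, instead of silently carrying on.
    result = subprocess.run(args)
    if result.returncode != 0:
        raise ParseError(f"{args[0]} exited with status {result.returncode}")
    return result


# A command that succeeds, then one that exits with status 3.
run_checked([sys.executable, "-c", "pass"])
try:
    run_checked([sys.executable, "-c", "import sys; sys.exit(3)"])
    failure = None
except ParseError as error:
    failure = str(error)
```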
Daniel Quinn
8e9d5caa37
Rename .TEXT_CACHE to .text
...
Properties should use snake_case, and only constants should be ALL_CAPS.
This change also makes use of the convention of "private" properties
being prefixed with `_`.
2018-02-18 16:00:43 +00:00
Daniel Quinn
122aa2b9f1
Make isort happy
2018-02-18 16:00:03 +00:00
Daniel Quinn
fb1da4834c
Style and removal of Python 2.7 stuff
2018-02-18 15:55:55 +00:00
Wolf-Bastian Pöttner
96c7222269
Improved regular expression to only match for (unicode) characters in month names + parsed one regex match after another until one gave a parsable date
2018-02-14 21:41:04 +01:00
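The "one match after another" approach described above could look roughly like this. The pattern is a simplified, hypothetical stand-in (numeric day-month-year only) and not the project's regex.

```python
import re
from datetime import datetime

# Hypothetical, simplified candidate pattern: DD.MM.YYYY with ., / or -.
CANDIDATE = re.compile(r"\b(\d{1,2})[./-](\d{1,2})[./-](\d{4})\b")


def first_parsable_date(text):
    """Return the first regex match that is also a valid calendar date."""
    for match in CANDIDATE.finditer(text):
        day, month, year = (int(group) for group in match.groups())
        try:
            return datetime(year, month, day)
        except ValueError:
            # Looks like a date but is not one (e.g. month 99): try the next.
            continue
    return None
```

With the text `"ref 99.99.2018, dated 14.02.2018"`, the first match is rejected and the second is returned.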
Wolf-Bastian Pöttner
39f198138a
Extended exception handling
2018-02-12 22:43:16 +01:00
Wolf-Bastian Pöttner
c74bb84c83
Added log output for date detected in document
2018-02-12 22:41:19 +01:00
Wolf-Bastian Pöttner
07d06d9aee
Extends the regex to find dates in documents as reported by @isaacsando
2018-02-12 22:41:15 +01:00
Wolf-Bastian Pöttner
40f8ba23a4
Added a text cache to optimize performance of date detection
2018-02-03 00:28:52 +01:00
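A minimal sketch of a text cache of this kind, assuming the expensive step is a callable that returns the document text. The class and attribute names are hypothetical.

```python
class CachingParser:
    """Hypothetical parser that extracts text once and reuses it."""

    def __init__(self, extractor):
        self._extractor = extractor  # expensive callable, e.g. an OCR run
        self._text = None            # cached result, filled on first use

    def get_text(self):
        # Only call the extractor the first time; later callers, such as
        # date detection, get the cached string.
        if self._text is None:
            self._text = self._extractor()
        return self._text


calls = []


def slow_extract():
    calls.append(1)
    return "Dated 03.02.2018"


parser = CachingParser(slow_extract)
first = parser.get_text()
second = parser.get_text()
```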
Wolf-Bastian Pöttner
f39c7654a0
Merge branch 'master' of https://github.com/danielquinn/paperless into feature/heuristically-extract-date-from-document-text
2018-02-02 22:44:03 +01:00
Wolf-Bastian Pöttner
87e466c47c
Add support for using pre-existing text from PDFs
2018-02-02 22:37:58 +01:00
Matt
ce98019b49
Fixing error sentinel for pdftotext when the PDF has no text (scanned images). It was causing a crash previously.
2018-02-01 10:08:57 -05:00
Daniel Quinn
cd92c005e3
Add support for using pre-existing text from PDFs
2018-01-30 20:13:35 +00:00
Wolf-Bastian Pöttner
b140935843
Add support for a heuristic that extracts the document date from its text
2018-01-28 19:37:10 +01:00
Daniel Quinn
55e81ca4bb
feat: refactor for pluggable consumers
...
I've broken out the OCR-specific code from the consumers and dumped it
all into its own app, `paperless_tesseract`. This new app should serve
as a sample of how to create one's own consumer for different file
types.
Documentation for how to do this isn't ready yet, but for the impatient:
* Create a new app
  * containing a `parsers.py` for your parser modelled after
    `paperless_tesseract.parsers.RasterisedDocumentParser`
  * containing a `signals.py` with a handler modelled after
    `paperless_tesseract.signals.ConsumerDeclaration`
  * connect the signal handler to
    `documents.signals.document_consumer_declaration` in
    `your_app.apps`
* Install the app into Paperless by declaring
  `PAPERLESS_INSTALLED_APPS=your_app`. Additional apps should be
  separated with commas.
* Restart the consumer
2017-03-25 15:10:25 +00:00
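The steps above can be sketched as follows. This is an illustration under assumptions, not the project's code: the `Signal` class is a tiny stand-in for `django.dispatch.Signal` so the example runs without Django, and `PlainTextParser`, the dictionary keys, and `pick_parser` are hypothetical.

```python
class Signal:
    """Minimal stand-in for django.dispatch.Signal (assumption)."""

    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        self._receivers.append(receiver)

    def send(self, sender, **kwargs):
        # Like Django, return a list of (receiver, response) pairs.
        return [(r, r(sender=sender, **kwargs)) for r in self._receivers]


# Stand-in for documents.signals.document_consumer_declaration.
document_consumer_declaration = Signal()


class PlainTextParser:
    """Hypothetical parser, the kind of class parsers.py would hold."""

    def __init__(self, path):
        self.path = path


class ConsumerDeclaration:
    """Hypothetical handler, the kind of class signals.py would hold."""

    @classmethod
    def handle(cls, sender, **kwargs):
        # Declare which parser this app offers and how to test a file.
        return {"parser": PlainTextParser, "test": cls.test}

    @classmethod
    def test(cls, path):
        return path.lower().endswith(".txt")


# In a real app this connection would live in your_app.apps.
document_consumer_declaration.connect(ConsumerDeclaration.handle)


def pick_parser(path):
    # Ask every connected app for its declaration and use the first
    # parser whose test accepts the file.
    for _receiver, declaration in document_consumer_declaration.send(sender=None):
        if declaration["test"](path):
            return declaration["parser"]
    return None
```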