Daniel Quinn
0a4338143a
Tweak the date guesser to not allow dates prior to 1900 (#414)
2018-10-01 20:03:47 +01:00
Daniel Quinn
52bfeb2ad0
Improve the unknown language error message
2018-09-23 12:41:14 +01:00
Daniel Quinn
21e53aa55c
Merge pull request #399 from jat255/ENH_convert_only_one_page
Speed up thumbnail generation for PDFs
2018-09-09 21:12:42 +01:00
Daniel Quinn
ef7f98281d
Rename parsers to DATE_REGEX
In moving the `parsers` variable into the package-level, it lost the
context, so a more descriptive name was needed.
2018-09-09 21:02:30 +01:00
Daniel Quinn
a3158eedf9
Merge branch 'ENH_text_consumer' of git://github.com/jat255/paperless into jat255-ENH_text_consumer
2018-09-09 20:52:59 +01:00
Daniel Quinn
6b63ce9201
Fix pycodestyle complaints
Apparently, pycodestyle updated itself to now check for invalid escape
sequences, and it only complains if the regex in use isn't a raw string
(r"").
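
As an illustration of the complaint described above (pycodestyle reports it as W605, "invalid escape sequence"), here is a minimal sketch. The pattern and the `find_date` helper are hypothetical and are not the project's actual date regex; they only show why regex literals are written as raw strings.

```python
import re

# In a normal string, "\d" is an invalid escape sequence and linters flag it.
# A raw string passes the backslash through to the regex engine unchanged,
# so r"\d" is the same two characters as the escaped form "\\d".
EXAMPLE_DATE_REGEX = re.compile(r"\b(\d{1,2})[./-](\d{1,2})[./-](\d{4})\b")


def find_date(text):
    """Return (day, month, year) strings for the first match, or None."""
    match = EXAMPLE_DATE_REGEX.search(text)
    return match.groups() if match else None
```

Writing the pattern without the `r` prefix would still work at runtime in most cases, but it relies on Python leaving unknown escapes alone, which is the behaviour the linter warns about.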
2018-09-09 20:00:12 +01:00
Joshua Taillon
5326895334
move date-matching regex pattern to base parser module for use by all subclasses
2018-09-05 21:13:36 -04:00
Joshua Taillon
98a437f78a
change tesseract parser to only convert first page to save (potentially) massive amounts of work
2018-09-05 15:18:35 -04:00
Daniel Quinn
bce2d3dd22
Account for KeyError problem in #345
2018-04-28 12:20:43 +01:00
Daniel Quinn
f3f86242de
Account for KeyError problem in #345
2018-04-28 12:19:53 +01:00
Ovv
32c440cbd9
Log detected document date with isoformat
2018-03-04 13:10:49 +01:00
Daniel Quinn
7c5ca5f505
Merge pull request #302 from BastianPoe/bugfix/extend_regex_to_find_more_dates
Extends the regex to find dates in documents as reported by @isaacsando
2018-02-18 17:23:49 +01:00
Daniel Quinn
4f726e1991
Monitor return codes of calls to convert and unpaper
...and handle the failures nicely. Addresses #303.
2018-02-18 16:02:27 +00:00
Daniel Quinn
e53033d1b3
Rename .TEXT_CACHE to .text
Properties should use snake_case, and only constants should be ALL_CAPS.
This change also makes use of the convention of "private" properties
being prefixed with `_`.
2018-02-18 16:00:43 +00:00
Daniel Quinn
3302ee2a78
Make isort happy
2018-02-18 16:00:03 +00:00
Daniel Quinn
caf44146db
Style and removal of Python 2.7 stuff
2018-02-18 15:55:55 +00:00
Wolf-Bastian Pöttner
5fed7ba6d4
Improved regular expression to only match (Unicode) characters in month names, and parse one regex match after another until one gives a parsable date
2018-02-14 21:41:04 +01:00
Wolf-Bastian Pöttner
3e65054e39
Extended exception handling
2018-02-12 22:43:16 +01:00
Wolf-Bastian Pöttner
c0c20f99e9
Added log output for date detected in document
2018-02-12 22:41:19 +01:00
Wolf-Bastian Pöttner
3899763261
Extends the regex to find dates in documents as reported by @isaacsando
2018-02-12 22:41:15 +01:00
Wolf-Bastian Pöttner
acfacaac4f
Added a text cache to optimize performance of date detection
2018-02-03 00:28:52 +01:00
Wolf-Bastian Pöttner
73d261484a
Merge branch 'master' of https://github.com/danielquinn/paperless into feature/heuristically-extract-date-from-document-text
2018-02-02 22:44:03 +01:00
Wolf-Bastian Pöttner
3dc730808e
Add support for using pre-existing text from PDFs
2018-02-02 22:37:58 +01:00
Matt
bc5c45a705
Fixing error sentinel for pdftotext when the PDF has no text (scanned images). It was causing a crash previously.
2018-02-01 10:08:57 -05:00
Daniel Quinn
269c32ce6a
Add support for using pre-existing text from PDFs
2018-01-30 20:13:35 +00:00
Wolf-Bastian Pöttner
21fc51c09a
Add support for a heuristic that extracts the document date from its text
2018-01-28 19:37:10 +01:00
Daniel Quinn
d2c283582b
feat: refactor for pluggable consumers
I've broken out the OCR-specific code from the consumers and dumped it
all into its own app, `paperless_tesseract`. This new app should serve
as a sample of how to create one's own consumer for different file
types.
Documentation for how to do this isn't ready yet, but for the impatient:
* Create a new app
  * containing a `parsers.py` for your parser modelled after
    `paperless_tesseract.parsers.RasterisedDocumentParser`
  * containing a `signals.py` with a handler modelled after
    `paperless_tesseract.signals.ConsumerDeclaration`
  * connect the signal handler to
    `documents.signals.document_consumer_declaration` in
    `your_app.apps`
* Install the app into Paperless by declaring
  `PAPERLESS_INSTALLED_APPS=your_app`. Additional apps should be
  separated with commas.
* Restart the consumer
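
The steps above can be sketched roughly as follows. This is a self-contained illustration, not the project's code: the `Signal` class is a tiny stand-in for `django.dispatch.Signal`, `PlainTextParser` and `pick_parser` are hypothetical, and the handler's return shape (a test callable that yields a dict with `parser` and `weight`) is an assumption rather than something this commit message specifies.

```python
import re


class Signal:
    """Minimal stand-in for django.dispatch.Signal (illustration only)."""

    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        self._receivers.append(receiver)

    def send(self, sender, **kwargs):
        # Like Django, return a list of (receiver, response) pairs.
        return [(r, r(sender=sender, **kwargs)) for r in self._receivers]


# Stand-in for documents.signals.document_consumer_declaration
document_consumer_declaration = Signal()


# your_app/parsers.py (hypothetical parser for plain-text files)
class PlainTextParser:
    def __init__(self, path):
        self.path = path

    def get_text(self):
        with open(self.path, encoding="utf-8") as handle:
            return handle.read()


# your_app/signals.py (assumed handler shape)
class ConsumerDeclaration:
    MATCHING_FILES = re.compile(r"^.*\.(txt|md)$")

    @classmethod
    def handle(cls, sender, **kwargs):
        # Hand the consumer a callable it can use to test each file.
        return cls.test

    @classmethod
    def test(cls, doc):
        if cls.MATCHING_FILES.match(doc.lower()):
            return {"parser": PlainTextParser, "weight": 0}
        return None


# your_app/apps.py would do this inside AppConfig.ready()
document_consumer_declaration.connect(ConsumerDeclaration.handle)


def pick_parser(path):
    """Ask every declared consumer and keep the highest-weight match."""
    best = None
    for _receiver, test in document_consumer_declaration.send(sender=None):
        result = test(path)
        if result and (best is None or result["weight"] > best["weight"]):
            best = result
    return best["parser"] if best else None
```

In a real app the stand-in `Signal` would be replaced by the signal imported from `documents.signals`, and the `connect` call would live in `your_app.apps` as the list describes.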
2017-03-25 15:10:25 +00:00