27 Commits

Author SHA1 Message Date
Daniel Quinn
0a4338143a Tweak the date guesser to not allow dates prior to 1900 (#414) 2018-10-01 20:03:47 +01:00
Daniel Quinn
52bfeb2ad0 Improve the unknown language error message 2018-09-23 12:41:14 +01:00
Daniel Quinn
21e53aa55c Merge pull request #399 from jat255/ENH_convert_only_one_page
Speed up thumbnail generation for PDFs
2018-09-09 21:12:42 +01:00
Daniel Quinn
ef7f98281d Rename parsers to DATE_REGEX
In moving the `parsers` variable into the package-level, it lost the
context, so a more descriptive name was needed.
2018-09-09 21:02:30 +01:00
Daniel Quinn
a3158eedf9 Merge branch 'ENH_text_consumer' of git://github.com/jat255/paperless into jat255-ENH_text_consumer 2018-09-09 20:52:59 +01:00
Daniel Quinn
6b63ce9201 Fix pycodestyle complaints
Apparently, pycodestyle updated itself to now check for invalid escape
sequences, which only complain if the regex in use isn't a raw string
(r"").
2018-09-09 20:00:12 +01:00
Joshua Taillon
5326895334 move date-matching regex pattern to base parser module for use by all subclasses 2018-09-05 21:13:36 -04:00
Joshua Taillon
98a437f78a change tesseract parser to only convert first page to save (potentially) massive amounts of work 2018-09-05 15:18:35 -04:00
Daniel Quinn
bce2d3dd22 Account for KeyError problem in #345 2018-04-28 12:20:43 +01:00
Daniel Quinn
f3f86242de Account for KeyError problem in #345 2018-04-28 12:19:53 +01:00
Ovv
32c440cbd9 Log detected document date with isoformat 2018-03-04 13:10:49 +01:00
Daniel Quinn
7c5ca5f505 Merge pull request #302 from BastianPoe/bugfix/extend_regex_to_find_more_dates
Extends the regex to find dates in documents as reported by @isaacsando
2018-02-18 17:23:49 +01:00
Daniel Quinn
4f726e1991 Monitor return codes of calls to convert and unpaper
...and handle the failures nicely.  Addresses #303.
2018-02-18 16:02:27 +00:00
Daniel Quinn
e53033d1b3 Rename .TEXT_CACHE to .text
Properties should use snake_case, and only constants should be ALL_CAPS.
This change also makes use of the convention of "private" properties
being prefixed with `_`.
2018-02-18 16:00:43 +00:00
Daniel Quinn
3302ee2a78 Make isort happy 2018-02-18 16:00:03 +00:00
Daniel Quinn
caf44146db Style and removal of Python 2.7 stuff 2018-02-18 15:55:55 +00:00
Wolf-Bastian Pöttner
5fed7ba6d4 Improved regular expression to only match for (unicode) characters in month names + parsed one regex match after another until one gave a parsable date 2018-02-14 21:41:04 +01:00
Wolf-Bastian Pöttner
3e65054e39 Extended exception handling 2018-02-12 22:43:16 +01:00
Wolf-Bastian Pöttner
c0c20f99e9 Added log output for date detected in document 2018-02-12 22:41:19 +01:00
Wolf-Bastian Pöttner
3899763261 Extends the regex to find dates in documents as reported by @isaacsando 2018-02-12 22:41:15 +01:00
Wolf-Bastian Pöttner
acfacaac4f Added a text cache to optimize performance of date detection 2018-02-03 00:28:52 +01:00
Wolf-Bastian Pöttner
73d261484a Merge branch 'master' of https://github.com/danielquinn/paperless into feature/heuristically-extract-date-from-document-text 2018-02-02 22:44:03 +01:00
Wolf-Bastian Pöttner
3dc730808e Add support for using pre-existing text from PDFs 2018-02-02 22:37:58 +01:00
Matt
bc5c45a705 Fixing error sentinel for pdftotext when the PDF has no text (scanned images). It was causing a crash previously. 2018-02-01 10:08:57 -05:00
Daniel Quinn
269c32ce6a Add support for using pre-existing text from PDFs 2018-01-30 20:13:35 +00:00
Wolf-Bastian Pöttner
21fc51c09a Add support for a heuristic that extracts the document date from its text 2018-01-28 19:37:10 +01:00
Daniel Quinn
d2c283582b feat: refactor for pluggable consumers
I've broken out the OCR-specific code from the consumers and dumped it
all into its own app, `paperless_tesseract`.  This new app should serve
as a sample of how to create one's own consumer for different file
types.

Documentation for how to do this isn't ready yet, but for the impatient:

* Create a new app
    * containing a `parsers.py` for your parser modelled after
      `paperless_tesseract.parsers.RasterisedDocumentParser`
    * containing a `signals.py` with a handler moddelled after
      `paperless_tesseract.signals.ConsumerDeclaration`
    * connect the signal handler to
      `documents.signals.document_consumer_declaration` in
      `your_app.apps`
* Install the app into Paperless by declaring
  `PAPERLESS_INSTALLED_APPS=your_app`.  Additional apps should be
  separated with commas.
* Restart the consumer
2017-03-25 15:10:25 +00:00