195 Commits

Author SHA1 Message Date
Joshua Taillon
a2422cc529 Add option for parsing of date from filename (and associated tests) 2018-11-15 20:32:15 -05:00
Joshua Taillon
8b69aa1e52 Update date tests to be more explicit with settings and allow tests to pass if using a timezone other than UTC 2018-11-15 20:30:23 -05:00
Daniel Quinn
3952c6d921 Merge pull request #421 from ddddavidmartin/clarify_forgiving_ocr_handling
Clarify forgiving ocr handling
2018-10-08 09:35:57 +00:00
David Martin
b0afa37ec1 Mention FORGIVING_OCR config option when language detection fails.
It is not obvious that PAPERLESS_FORGIVING_OCR lets document
consumption proceed even if no language can be detected.
Mentioning it in the actual error message in the log seems like the best
way to make it clear.
2018-10-08 19:37:05 +11:00
David Martin
7022c98aab Let unpaper overwrite temporary files.
I'm not sure what the circumstances are, but it looks like unpaper can
attempt to write a temporary file that already exists [0]. This then
causes the consumption to fail. As per daedadu's comment, simply letting
unpaper overwrite files fixes this.

[0]
unpaper: error: output file '/tmp/paperless/paperless-pjkrcr4l/convert-0000.unpaper.pnm' already present.
See https://web.archive.org/web/20181008081515/https://github.com/danielquinn/paperless/issues/406#issue-360651630
2018-10-08 19:12:11 +11:00
Daniel Quinn
bc898c1992 Use optipng to optimise document thumbnails 2018-10-07 14:56:38 +01:00
Daniel Quinn
074609e1fc Consolidate get_date onto the DocumentParser parent class 2018-10-07 14:56:02 +01:00
Daniel Quinn
0a4338143a Tweak the date guesser to not allow dates prior to 1900 (#414) 2018-10-01 20:03:47 +01:00
Daniel Quinn
52bfeb2ad0 Improve the unknown language error message 2018-09-23 12:41:14 +01:00
Daniel Quinn
21e53aa55c Merge pull request #399 from jat255/ENH_convert_only_one_page
Speed up thumbnail generation for PDFs
2018-09-09 21:12:42 +01:00
Daniel Quinn
ef7f98281d Rename parsers to DATE_REGEX
In moving the `parsers` variable into the package-level, it lost the
context, so a more descriptive name was needed.
2018-09-09 21:02:30 +01:00
Daniel Quinn
a3158eedf9 Merge branch 'ENH_text_consumer' of git://github.com/jat255/paperless into jat255-ENH_text_consumer 2018-09-09 20:52:59 +01:00
Daniel Quinn
6b63ce9201 Fix pycodestyle complaints
Apparently, pycodestyle updated itself to now check for invalid escape
sequences, and it only complains if the regex in use isn't a raw string
(r"").
2018-09-09 20:00:12 +01:00
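For readers unfamiliar with the warning behind commit 6b63ce9201, here is a minimal sketch of a regex written as a raw string. The pattern and the `find_dates` helper are hypothetical illustrations, not the project's actual date regex.

```python
import re

# Hypothetical pattern for illustration only. The r"" prefix keeps
# backslash sequences such as \b and \d intact for the regex engine, so
# style checkers do not flag them as invalid string escapes.
DATE_PATTERN = re.compile(r"\b(\d{1,2})[./-](\d{1,2})[./-](\d{4})\b")


def find_dates(text):
    """Return every day/month/year candidate found in text as 'd/m/y'."""
    return ["/".join(match.groups()) for match in DATE_PATTERN.finditer(text)]
```

Without the `r` prefix, Python would first try to interpret `\d` as a string escape, which is what the style check objects to.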
Joshua Taillon
5326895334 move date-matching regex pattern to base parser module for use by all subclasses 2018-09-05 21:13:36 -04:00
Joshua Taillon
98a437f78a change tesseract parser to only convert first page to save (potentially) massive amounts of work 2018-09-05 15:18:35 -04:00
Erik Arvstedt
4fa9ff60fc Stop tests from writing to the source tree 2018-07-19 23:48:23 +02:00
Daniel Quinn
bce2d3dd22 Account for KeyError problem in #345 2018-04-28 12:20:43 +01:00
Daniel Quinn
f3f86242de Account for KeyError problem in #345 2018-04-28 12:19:53 +01:00
Ovv
32c440cbd9 Log detected document date with isoformat 2018-03-04 13:10:49 +01:00
Wolf-Bastian Pöttner
328330eb08 Increase test coverage by testing two more date detection cases 2018-02-19 21:36:48 +01:00
Daniel Quinn
fc6d2d5e0c Fix formatting 2018-02-18 18:00:34 +00:00
Daniel Quinn
9e26e7b39e Fix tests to use _text instead of TEXT_CACHE 2018-02-18 18:00:22 +00:00
Daniel Quinn
7c5ca5f505 Merge pull request #302 from BastianPoe/bugfix/extend_regex_to_find_more_dates
Extends the regex to find dates in documents as reported by @isaacsando
2018-02-18 17:23:49 +01:00
Daniel Quinn
4f726e1991 Monitor return codes of calls to convert and unpaper
...and handle the failures nicely.  Addresses #303.
2018-02-18 16:02:27 +00:00
Daniel Quinn
e53033d1b3 Rename .TEXT_CACHE to .text
Properties should use snake_case, and only constants should be ALL_CAPS.
This change also makes use of the convention of "private" properties
being prefixed with `_`.
2018-02-18 16:00:43 +00:00
Daniel Quinn
3302ee2a78 Make isort happy 2018-02-18 16:00:03 +00:00
Daniel Quinn
caf44146db Style and removal of Python 2.7 stuff 2018-02-18 15:55:55 +00:00
Wolf-Bastian Pöttner
5fed7ba6d4 Improved regular expression to only match for (unicode) characters in month names + parsed one regex match after another until one gave a parsable date 2018-02-14 21:41:04 +01:00
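The "one regex match after another until one gave a parsable date" idea in commit 5fed7ba6d4 can be sketched roughly as follows. This is an assumption-laden illustration using only the standard library: `CANDIDATE`, `FORMATS`, and `first_parsable_date` are hypothetical names, and the project's real pattern and date-parsing backend may differ. The `earliest_year` parameter mirrors the 1900 cutoff described in commit 0a4338143a.

```python
import re
from datetime import datetime

# Hypothetical candidate pattern and formats, chosen for illustration.
CANDIDATE = re.compile(r"\b\d{1,2}[./-]\d{1,2}[./-]\d{4}\b")
FORMATS = ("%d.%m.%Y", "%d/%m/%Y", "%d-%m-%Y")


def first_parsable_date(text, earliest_year=1900):
    """Try each regex match in order and return the first one that parses."""
    for match in CANDIDATE.finditer(text):
        for fmt in FORMATS:
            try:
                parsed = datetime.strptime(match.group(0), fmt)
            except ValueError:
                continue  # not a real date in this format; keep trying
            if parsed.year >= earliest_year:
                return parsed
    return None
```

A string that merely looks like a date, such as `99.99.2018`, is skipped, and the search continues with the next match.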
Wolf-Bastian Pöttner
fc81feb32e Add more (fast-running) unit tests 2018-02-14 21:41:01 +01:00
Wolf-Bastian Pöttner
3e65054e39 Extended exception handling 2018-02-12 22:43:16 +01:00
Wolf-Bastian Pöttner
c0c20f99e9 Added log output for date detected in document 2018-02-12 22:41:19 +01:00
Wolf-Bastian Pöttner
3899763261 Extends the regex to find dates in documents as reported by @isaacsando 2018-02-12 22:41:15 +01:00
Daniel Quinn
c6e671f2fa No need to extend object 2018-02-03 15:26:28 +00:00
Daniel Quinn
4c0b908a41 Rework tests to write to /tmp
Originally the test wrote scratch data inside the repo dir, which meant
manual cleanup.  Now it writes to `/tmp/paperless-tests-<random-string>`
and cleans up after itself.
2018-02-03 14:49:48 +00:00
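A scratch directory of the kind described in commit 4c0b908a41 could be set up along these lines. This is a sketch under assumptions, not the project's test code: `ScratchDirTestCase` and its test method are hypothetical, and only the `paperless-tests-` prefix comes from the commit message.

```python
import os
import shutil
import tempfile
import unittest


class ScratchDirTestCase(unittest.TestCase):
    """Hypothetical test case that keeps scratch data out of the repo."""

    def setUp(self):
        # mkdtemp appends a random suffix to the prefix and creates the
        # directory under the system temp location (often /tmp).
        self.scratch = tempfile.mkdtemp(prefix="paperless-tests-")
        # Remove the directory after each test, even if the test fails.
        self.addCleanup(shutil.rmtree, self.scratch, ignore_errors=True)

    def test_writes_stay_in_scratch(self):
        target = os.path.join(self.scratch, "sample.txt")
        with open(target, "w", encoding="utf-8") as handle:
            handle.write("scratch data")
        self.assertTrue(os.path.exists(target))
```

Registering the cleanup in `setUp` means no manual tidying is needed, which is the point the commit message makes.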
Wolf-Bastian Pöttner
acfacaac4f Added a text cache to optimize performance of date detection 2018-02-03 00:28:52 +01:00
Wolf-Bastian Pöttner
4f725cf4d2 Add test cases for date parsing 2018-02-03 00:28:49 +01:00
Wolf-Bastian Pöttner
73d261484a Merge branch 'master' of https://github.com/danielquinn/paperless into feature/heuristically-extract-date-from-document-text 2018-02-02 22:44:03 +01:00
Wolf-Bastian Pöttner
3dc730808e Add support for using pre-existing text from PDFs 2018-02-02 22:37:58 +01:00
Matt
bc5c45a705 Fixing error sentinel for pdftotext when the PDF has no text (scanned images). It was causing a crash previously. 2018-02-01 10:08:57 -05:00
Daniel Quinn
269c32ce6a Add support for using pre-existing text from PDFs 2018-01-30 20:13:35 +00:00
Wolf-Bastian Pöttner
21fc51c09a Add support for a heuristic that extracts the document date from its text 2018-01-28 19:37:10 +01:00
Daniel Quinn
67844dff0c Update test for #259 fix 2017-10-16 10:53:18 +01:00
Daniel Quinn
2820767f29 Support .jpeg as well as .jpg 2017-10-16 09:00:38 +01:00
Daniel Quinn
e7d4ca92ba fix: allow for caps in file name suffixes #206
@schinkelg ran afoul of this one and I took the opportunity to add a
test to catch this sort of thing for next time.
2017-03-28 21:14:24 +00:00
Daniel Quinn
d2c283582b feat: refactor for pluggable consumers
I've broken out the OCR-specific code from the consumers and dumped it
all into its own app, `paperless_tesseract`.  This new app should serve
as a sample of how to create one's own consumer for different file
types.

Documentation for how to do this isn't ready yet, but for the impatient:

* Create a new app
    * containing a `parsers.py` for your parser modelled after
      `paperless_tesseract.parsers.RasterisedDocumentParser`
    * containing a `signals.py` with a handler modelled after
      `paperless_tesseract.signals.ConsumerDeclaration`
    * connect the signal handler to
      `documents.signals.document_consumer_declaration` in
      `your_app.apps`
* Install the app into Paperless by declaring
  `PAPERLESS_INSTALLED_APPS=your_app`.  Additional apps should be
  separated with commas.
* Restart the consumer
2017-03-25 15:10:25 +00:00
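The steps listed in commit d2c283582b can be pictured with the following framework-free sketch. Everything here is an assumption made for illustration: the handler list stands in for the `documents.signals.document_consumer_declaration` signal, and `PlainTextParser`, the shape of the returned dictionary, and `pick_parser` are hypothetical rather than the project's actual interfaces.

```python
import re

# Stand-in for the declaration signal: a plain list of handler callables.
# The real project connects handlers through its own signal instead.
declaration_handlers = []


class PlainTextParser:
    """Hypothetical parser; the real parser base class is not shown here."""

    def __init__(self, path):
        self.path = path

    def get_text(self):
        with open(self.path, encoding="utf-8") as handle:
            return handle.read()


class ConsumerDeclaration:
    """Hypothetical handler that claims files by suffix, ignoring case."""

    MATCHING_FILES = re.compile(r"^.*\.(txt|md)$", re.IGNORECASE)

    @classmethod
    def handle(cls, path):
        if cls.MATCHING_FILES.match(path):
            return {"parser": PlainTextParser, "weight": 0}
        return None


# "Connecting" the handler, analogous to wiring it up in your_app.apps.
declaration_handlers.append(ConsumerDeclaration.handle)


def pick_parser(path):
    """Ask every registered handler and keep the highest-weight offer."""
    offers = []
    for handler in declaration_handlers:
        offer = handler(path)
        if offer is not None:
            offers.append(offer)
    if not offers:
        return None
    return max(offers, key=lambda offer: offer["weight"])["parser"]
```

The design point is that the consumer never hard-codes a parser: each app declares which files it can handle, and the consumer chooses among the offers.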