107 Commits

Author SHA1 Message Date
JensPfeifle
29b0886950 try to run convert, but fall back on gs if needed 2019-03-03 20:31:52 +01:00
JensPfeifle
ea282c22ba Add GS_BINARY to settings to avoid hardcoded call of "gs" 2019-03-03 20:31:52 +01:00
Pit
cbf008f37b Fix quoting in call to run_convert
Co-Authored-By: JensPfeifle <jens@pfeifle.tech>
2019-03-03 20:31:52 +01:00
JensPfeifle
50504c3fd8 remove unnecessary env arg in Popen 2019-03-03 20:31:52 +01:00
Jens Pfeifle
0220199766 fix parse error of some documents by using gs 2019-03-03 20:31:52 +01:00
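The "try convert, fall back on gs" behaviour in the commits above can be sketched roughly as follows. This is a minimal illustration, not the code from these commits: `run_with_fallback` and its arguments are hypothetical names, and the real command lines for `convert` and `gs` are not shown in this log, so the sketch takes arbitrary argument lists.

```python
import subprocess


def run_with_fallback(primary_cmd, fallback_cmd):
    """Run primary_cmd; if it cannot start or exits non-zero, run fallback_cmd.

    Both arguments are argument lists as accepted by subprocess.run.
    Returns "primary" or "fallback" to say which command succeeded,
    and raises RuntimeError if neither did.
    (Hypothetical helper, not part of the project.)
    """
    for label, cmd in (("primary", primary_cmd), ("fallback", fallback_cmd)):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except OSError:
            # The binary is missing or not executable: try the next command.
            continue
        if result.returncode == 0:
            return label
    raise RuntimeError("both the primary and the fallback command failed")
```

Reading the fallback binary's path from a setting (as the GS_BINARY commit does) rather than hardcoding it lets installations with a differently named or located executable still work.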
Daniel Quinn
637b0d4cc2 Drop problematic tests
Some tests had differing outcomes depending on the version of Tesseract
installed on the test system. This led to a bunch of false test
failures, which led to people (including me) just ignoring the Travis
results.

This commit removes those tests, and while it reduces our coverage, at
least the results are predictable.
2018-12-30 17:32:45 +00:00
Daniel Quinn
27af2603f5 Use modern languages for sample test files 2018-12-30 14:09:17 +00:00
Erik Arvstedt
a19f0ef97e Fix date test sample image
The previous version of `tests_date_3.png` had too much spacing
between the `0` and the `8` glyphs, which resulted in the year getting
parsed as `200 8` in Tesseract 3.05.00 (+ tessdata 3.04.00).
This caused the date parsing test to fail.
2018-12-02 15:10:21 +01:00
Daniel Quinn
d544f269e0 Conform everything to the coding standards
https://paperless.readthedocs.io/en/latest/contributing.html#additional-style-guides
2018-12-01 17:09:12 +00:00
Daniel Quinn
650db75c2b Merge branch 'ENH_filename_date_parsing' of https://github.com/jat255/paperless into jat255-ENH_filename_date_parsing 2018-12-01 16:57:16 +00:00
Daniel Quinn
c1d18c1e83 Fix language guesses in tests
It turns out that the Lorem ipsum text in the sample files was confusing the language guesser, causing it to think the file was in Catalan and not English or German.
2018-12-01 15:55:59 +00:00
Joshua Taillon
730daa3d6d Merge branch 'master' of github.com:danielquinn/paperless into ENH_filename_date_parsing 2018-11-15 23:17:59 -05:00
Joshua Taillon
e1d8744c66 Add option for parsing of date from filename (and associated tests) 2018-11-15 20:32:15 -05:00
Joshua Taillon
4409f65840 Update date tests to be more explicit with settings and allow tests to pass if using a timezone other than UTC 2018-11-15 20:30:23 -05:00
Daniel Quinn
bd95804fbf
Merge pull request #421 from ddddavidmartin/clarify_forgiving_ocr_handling
Clarify forgiving ocr handling
2018-10-08 09:35:57 +00:00
David Martin
b350ec48b7 Mention FORGIVING_OCR config option when language detection fails.
It is not obvious that PAPERLESS_FORGIVING_OCR lets document
consumption proceed even if no language can be detected.
Mentioning it in the actual error message in the log seems like the best
way to make it clear.
2018-10-08 19:37:05 +11:00
David Martin
f948ee11be Let unpaper overwrite temporary files.
I'm not sure what the circumstances are, but it looks like unpaper can
attempt to write a temporary file that already exists [0]. This then
fails the consumption. As per daedadu's comment simply letting unpaper
overwrite files fixes this.

[0]
unpaper: error: output file '/tmp/paperless/paperless-pjkrcr4l/convert-0000.unpaper.pnm' already present.
See https://web.archive.org/web/20181008081515/https://github.com/danielquinn/paperless/issues/406#issue-360651630
2018-10-08 19:12:11 +11:00
Daniel Quinn
750ab5bf85 Use optipng to optimise document thumbnails 2018-10-07 14:56:38 +01:00
Daniel Quinn
2a3f766b93 Consolidate get_date onto the DocumentParser parent class 2018-10-07 14:56:02 +01:00
Daniel Quinn
8010d72f18 Tweak the date guesser to not allow dates prior to 1900 (#414) 2018-10-01 20:03:47 +01:00
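The pre-1900 cutoff in the commit above could look something like the sketch below. The names `MIN_YEAR` and `first_plausible_date` are hypothetical, and treating the year 1900 itself as acceptable is an assumption drawn from the wording "prior to 1900"; the actual boundary check in the project may differ.

```python
from datetime import date

# Cutoff taken from the commit message; the exact comparison is an assumption.
MIN_YEAR = 1900


def first_plausible_date(candidates):
    """Return the first candidate dated 1900 or later, or None.

    `candidates` is any iterable of datetime.date objects, for example
    dates parsed from OCR text in the order they were found.
    (Hypothetical helper, not part of the project.)
    """
    for candidate in candidates:
        if candidate.year >= MIN_YEAR:
            return candidate
    return None
```

Rejecting implausibly old dates guards against OCR noise, where a misread digit can turn a recent year into one centuries earlier.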
Daniel Quinn
117d7dad04 Improve the unknown language error message 2018-09-23 12:41:14 +01:00
Daniel Quinn
46cbd10ba0
Merge pull request #399 from jat255/ENH_convert_only_one_page
Speed up thumbnail generation for PDFs
2018-09-09 21:12:42 +01:00
Daniel Quinn
c99f5923d5 Rename parsers to DATE_REGEX
In moving the `parsers` variable into the package-level, it lost the
context, so a more descriptive name was needed.
2018-09-09 21:02:30 +01:00
Daniel Quinn
2dc35cc856 Merge branch 'ENH_text_consumer' of git://github.com/jat255/paperless into jat255-ENH_text_consumer 2018-09-09 20:52:59 +01:00
Daniel Quinn
5342db6ada Fix pycodestyle complaints
Apparently, pycodestyle updated itself to now check for invalid escape
sequences, which it only complains about if the regex in use isn't a raw
string (r"").
2018-09-09 20:00:12 +01:00
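To illustrate the raw-string point from the commit above: pycodestyle reports sequences such as `\d` in an ordinary string literal as invalid escapes (warning W605), while the same pattern written as a raw string passes. The pattern below is a simplified stand-in, not the project's actual date regex, and `find_date_parts` is a hypothetical helper.

```python
import re

# Raw string: the backslashes reach the regex engine untouched, so
# pycodestyle has no invalid escape sequences to complain about.
# Writing the same pattern as "\b(\d{1,2})..." would trigger the warning.
DATE_PATTERN = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")


def find_date_parts(text):
    """Return (day, month, year) strings for the first DD.MM.YYYY match, or None."""
    match = DATE_PATTERN.search(text)
    return match.groups() if match else None
```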
Joshua Taillon
72c828170e move date-matching regex pattern to base parser module for use by all subclasses 2018-09-05 21:13:36 -04:00
Joshua Taillon
cac63494f0 change tesseract parser to only convert first page to save (potentially) massive amounts of work 2018-09-05 15:18:35 -04:00
Erik Arvstedt
be2cbebaf7 Stop tests from writing to the source tree 2018-07-19 23:48:23 +02:00
Daniel Quinn
82f9dde055 Account for KeyError problem in #345 2018-04-28 12:20:43 +01:00
Daniel Quinn
c983e73d0f Account for KeyError problem in #345 2018-04-28 12:19:53 +01:00
Ovv
75ac8d2796 Log detected document date with isoformat 2018-03-04 13:10:49 +01:00
Wolf-Bastian Pöttner
fba58f3bdd Increase test coverage by testing two more date detection cases 2018-02-19 21:36:48 +01:00
Daniel Quinn
6662ca3467 Fix formatting 2018-02-18 18:00:34 +00:00
Daniel Quinn
6f1ed89e26 Fix tests to use _text instead of TEXT_CACHE 2018-02-18 18:00:22 +00:00
Daniel Quinn
5d01410dc0
Merge pull request #302 from BastianPoe/bugfix/extend_regex_to_find_more_dates
Extends the regex to find dates in documents as reported by @isaacsando
2018-02-18 17:23:49 +01:00
Daniel Quinn
ea6d040809 Monitor return codes of calls to convert and unpaper
...and handle the failures nicely.  Addresses #303.
2018-02-18 16:02:27 +00:00
Daniel Quinn
8e9d5caa37 Rename .TEXT_CACHE to .text
Properties should use snake_case, and only constants should be ALL_CAPS.
This change also makes use of the convention of "private" properties
being prefixed with `_`.
2018-02-18 16:00:43 +00:00
Daniel Quinn
122aa2b9f1 Make isort happy 2018-02-18 16:00:03 +00:00
Daniel Quinn
fb1da4834c Style and removal of Python 2.7 stuff 2018-02-18 15:55:55 +00:00
Wolf-Bastian Pöttner
96c7222269 Improved regular expression to only match for (unicode) characters in month names + parsed one regex match after another until one gave a parsable date 2018-02-14 21:41:04 +01:00
Wolf-Bastian Pöttner
1737e27b34 Add more (fast-running) unit tests 2018-02-14 21:41:01 +01:00
Wolf-Bastian Pöttner
39f198138a Extended exception handling 2018-02-12 22:43:16 +01:00
Wolf-Bastian Pöttner
c74bb84c83 Added log output for date detected in document 2018-02-12 22:41:19 +01:00
Wolf-Bastian Pöttner
07d06d9aee Extends the regex to find dates in documents as reported by @isaacsando 2018-02-12 22:41:15 +01:00
Daniel Quinn
73163d893f No need to extend object 2018-02-03 15:26:28 +00:00
Daniel Quinn
c90ed2da1d Rework tests to write to /tmp
Originally the test wrote scratch data inside the repo dir, which meant
manual cleanup.  Now it writes to `/tmp/paperless-tests-<random-string>`
and cleans up after itself.
2018-02-03 14:49:48 +00:00
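The scratch-directory approach described in the commit above can be sketched with the standard library as follows. The `scratch_dir` context manager is a hypothetical name, and the directory prefix is taken from the commit message; how the project's tests actually create and remove the directory is not shown in this log.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def scratch_dir(prefix="paperless-tests-"):
    """Create a uniquely named directory under the system temp dir.

    The directory is removed again when the block exits, even if the
    test body raises. (Hypothetical helper, not part of the project.)
    """
    path = tempfile.mkdtemp(prefix=prefix)
    try:
        yield path
    finally:
        shutil.rmtree(path, ignore_errors=True)
```

A test would wrap its file-writing code in `with scratch_dir() as path:` and write only below `path`, which keeps the source tree clean without manual cleanup.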
Wolf-Bastian Pöttner
40f8ba23a4 Added a text cache to optimize performance of date detection 2018-02-03 00:28:52 +01:00
Wolf-Bastian Pöttner
bef2d94374 Add test cases for date parsing 2018-02-03 00:28:49 +01:00
Wolf-Bastian Pöttner
f39c7654a0 Merge branch 'master' of https://github.com/danielquinn/paperless into feature/heuristically-extract-date-from-document-text 2018-02-02 22:44:03 +01:00
Wolf-Bastian Pöttner
87e466c47c Add support for using pre-existing text from PDFs 2018-02-02 22:37:58 +01:00