70 Commits

Author SHA1 Message Date
Jonas Winkler
bd04c966c5 first version of the new consumer. 2020-11-16 18:26:54 +01:00
Jonas Winkler
eb6805e37e code style fixes 2020-11-12 21:09:45 +01:00
Jonas Winkler
340f9f141f fixed most of the tests 2020-11-02 19:42:23 +01:00
Jonas Winkler
d42979842e made unpaper and convert a little bit nicer to interact with 2020-11-02 19:31:04 +01:00
Jonas Winkler
a89773ad71 removed unused code, small fixes 2020-11-02 18:20:04 +01:00
Jonas Winkler
def3a85858 reworked most of the tesseract parser, better logging 2020-11-02 15:40:44 +01:00
Jonas Winkler
972a6a2333 bugfix 2020-11-02 01:26:42 +01:00
Jonas Winkler
6adc870a20 silenced unpaper, optipng for cleaner output
moved parser settings to settings
removed forgiving ocr (now default) since tesseract is plenty accurate even without defining the correct language.
2020-11-01 23:23:42 +01:00
Jonas Winkler
0f4094f3ca better thumbnail generation for smaller files 2020-10-26 01:05:23 +01:00
Johannes Wienke
ebcfcea05b Handle dateparser ValueErrors
When parsing dates from the document text or filenames, correctly handle
ValueErrors indicating broken dates. Newly added tests ensure that this handling
works properly.
2020-03-08 18:44:15 +01:00
Johannes Wienke
6531a67940 Remove duplicated date parsing test
The exact same tests existed twice in the file.
2020-03-08 18:26:29 +01:00
Stéphane Brunner
3fab354a6e Strip the thumbnails 2019-03-17 16:37:47 +01:00
jenspfeifle
5c40da1a48 make pycodestyle happy 2019-03-03 20:41:17 +01:00
JensPfeifle
078d66b077 try to run convert, but fall back on gs if needed 2019-03-03 20:31:52 +01:00
JensPfeifle
4c64ea0404 Add GS_BINARY to settings to avoid hardcoded call of "gs" 2019-03-03 20:31:52 +01:00
Pit
99718bcf17 Fix quoting in call to run_convert
Co-Authored-By: JensPfeifle <jens@pfeifle.tech>
2019-03-03 20:31:52 +01:00
JensPfeifle
3dfd0253ed remove unnecessary env arg in Popen 2019-03-03 20:31:52 +01:00
Jens Pfeifle
6ab21afeb6 fix parse error of some documents by using gs 2019-03-03 20:31:52 +01:00
Daniel Quinn
e395b0e081 Drop problematic tests
Some tests had differing outcomes depending on the version of Tesseract
installed on the test system.  This led to a bunch of false test
failures, which led to people (including me) just ignoring the Travis
results.

This commit removes those tests, and while it reduces our coverage, at
least the results are predictable.
2018-12-30 17:32:45 +00:00
Daniel Quinn
86b0d08377 Use modern languages for sample test files 2018-12-30 14:09:17 +00:00
Erik Arvstedt
f38ac7f62b Fix date test sample image
The previous version of `tests_date_3.png` had too much spacing
between the `0` and the `8` glyphs, which resulted in the year getting
parsed as `200 8` in Tesseract 3.05.00 (+ tessdata 3.04.00).
This caused the date parsing test to fail.
2018-12-02 15:10:21 +01:00
Daniel Quinn
0d59844567 Conform everything to the coding standards
https://paperless.readthedocs.io/en/latest/contributing.html#additional-style-guides
2018-12-01 17:09:12 +00:00
Daniel Quinn
4e186ede0e Merge branch 'ENH_filename_date_parsing' of https://github.com/jat255/paperless into jat255-ENH_filename_date_parsing 2018-12-01 16:57:16 +00:00
Daniel Quinn
9c6b8629a3 Fix language guesses in tests
It turns out that the Lorem ipsum text in the sample files was confusing the language guesser, causing it to think the file was in Catalan and not English or German.
2018-12-01 15:55:59 +00:00
Joshua Taillon
b0326b5a19 Merge branch 'master' of github.com:danielquinn/paperless into ENH_filename_date_parsing 2018-11-15 23:17:59 -05:00
Joshua Taillon
a2422cc529 Add option for parsing of date from filename (and associated tests) 2018-11-15 20:32:15 -05:00
Joshua Taillon
8b69aa1e52 Update date tests to be more explicit with settings and allow tests to pass if using a timezone other than UTC 2018-11-15 20:30:23 -05:00
Daniel Quinn
3952c6d921 Merge pull request #421 from ddddavidmartin/clarify_forgiving_ocr_handling
Clarify forgiving ocr handling
2018-10-08 09:35:57 +00:00
David Martin
b0afa37ec1 Mention FORGIVING_OCR config option when language detection fails.
It is not obvious that the PAPERLESS_FORGIVING_OCR option allows
document consumption to proceed even if no language can be detected.
Mentioning it in the actual error message in the log seems like the best
way to make it clear.
2018-10-08 19:37:05 +11:00
David Martin
7022c98aab Let unpaper overwrite temporary files.
I'm not sure what the circumstances are, but it looks like unpaper can
attempt to write a temporary file that already exists [0]. This then
fails the consumption. As per daedadu's comment simply letting unpaper
overwrite files fixes this.

[0]
unpaper: error: output file '/tmp/paperless/paperless-pjkrcr4l/convert-0000.unpaper.pnm' already present.
See https://web.archive.org/web/20181008081515/https://github.com/danielquinn/paperless/issues/406#issue-360651630
2018-10-08 19:12:11 +11:00
Daniel Quinn
bc898c1992 Use optipng to optimise document thumbnails 2018-10-07 14:56:38 +01:00
Daniel Quinn
074609e1fc Consolidate get_date onto the DocumentParser parent class 2018-10-07 14:56:02 +01:00
Daniel Quinn
0a4338143a Tweak the date guesser to not allow dates prior to 1900 (#414) 2018-10-01 20:03:47 +01:00
Daniel Quinn
52bfeb2ad0 Improve the unknown language error message 2018-09-23 12:41:14 +01:00
Daniel Quinn
21e53aa55c Merge pull request #399 from jat255/ENH_convert_only_one_page
Speed up thumbnail generation for PDFs
2018-09-09 21:12:42 +01:00
Daniel Quinn
ef7f98281d Rename parsers to DATE_REGEX
In moving the `parsers` variable to the package level, it lost its
context, so a more descriptive name was needed.
2018-09-09 21:02:30 +01:00
Daniel Quinn
a3158eedf9 Merge branch 'ENH_text_consumer' of git://github.com/jat255/paperless into jat255-ENH_text_consumer 2018-09-09 20:52:59 +01:00
Daniel Quinn
6b63ce9201 Fix pycodestyle complaints
Apparently, pycodestyle updated itself to now check for invalid escape
sequences, which it only complains about if the regex in use isn't a raw string
(r"").
2018-09-09 20:00:12 +01:00
Joshua Taillon
5326895334 move date-matching regex pattern to base parser module for use by all subclasses 2018-09-05 21:13:36 -04:00
Joshua Taillon
98a437f78a change tesseract parser to only convert first page to save (potentially) massive amounts of work 2018-09-05 15:18:35 -04:00
Erik Arvstedt
4fa9ff60fc Stop tests from writing to the source tree 2018-07-19 23:48:23 +02:00
Daniel Quinn
bce2d3dd22 Account for KeyError problem in #345 2018-04-28 12:20:43 +01:00
Daniel Quinn
f3f86242de Account for KeyError problem in #345 2018-04-28 12:19:53 +01:00
Ovv
32c440cbd9 Log detected document date with isoformat 2018-03-04 13:10:49 +01:00
Wolf-Bastian Pöttner
328330eb08 Increase test coverage by testing two more date detection cases 2018-02-19 21:36:48 +01:00
Daniel Quinn
fc6d2d5e0c Fix formatting 2018-02-18 18:00:34 +00:00
Daniel Quinn
9e26e7b39e Fix tests to use _text instead of TEXT_CACHE 2018-02-18 18:00:22 +00:00
Daniel Quinn
7c5ca5f505 Merge pull request #302 from BastianPoe/bugfix/extend_regex_to_find_more_dates
Extends the regex to find dates in documents as reported by @isaacsando
2018-02-18 17:23:49 +01:00
Daniel Quinn
4f726e1991 Monitor return codes of calls to convert and unpaper
...and handle the failures nicely.  Addresses #303.
2018-02-18 16:02:27 +00:00
Daniel Quinn
e53033d1b3 Rename .TEXT_CACHE to .text
Properties should use snake_case, and only constants should be ALL_CAPS.
This change also makes use of the convention of "private" properties
being prefixed with `_`.
2018-02-18 16:00:43 +00:00