Daniel Quinn
e395b0e081
Drop problematic tests
...
Some tests had differing outcomes depending on the version of Tesseract
installed on the test system. This led to a bunch of false test
failures, which led to people (including me) just ignoring the Travis
results.
This commit removes those tests, and while it reduces our coverage, at
least the results are predictable.
2018-12-30 17:32:45 +00:00
Daniel Quinn
86b0d08377
Use modern languages for sample test files
2018-12-30 14:09:17 +00:00
Erik Arvstedt
f38ac7f62b
Fix date test sample image
...
The previous version of `tests_date_3.png` had too much spacing
between the `0` and the `8` glyphs, which resulted in the year getting
parsed as `200 8` in Tesseract 3.05.00 (+ tessdata 3.04.00).
This caused the date parsing test to fail.
2018-12-02 15:10:21 +01:00
Daniel Quinn
0d59844567
Conform everything to the coding standards
...
https://paperless.readthedocs.io/en/latest/contributing.html#additional-style-guides
2018-12-01 17:09:12 +00:00
Daniel Quinn
4e186ede0e
Merge branch 'ENH_filename_date_parsing' of https://github.com/jat255/paperless into jat255-ENH_filename_date_parsing
2018-12-01 16:57:16 +00:00
Daniel Quinn
9c6b8629a3
Fix language guesses in tests
...
It turns out that the Lorem ipsum text in the sample files was confusing the language guesser, causing it to think the file was in Catalan and not English or German.
2018-12-01 15:55:59 +00:00
Joshua Taillon
b0326b5a19
Merge branch 'master' of github.com:danielquinn/paperless into ENH_filename_date_parsing
2018-11-15 23:17:59 -05:00
Joshua Taillon
a2422cc529
Add option for parsing of date from filename (and associated tests)
2018-11-15 20:32:15 -05:00
Joshua Taillon
8b69aa1e52
Update date tests to be more explicit with settings and allow tests to pass if using a timezone other than UTC
2018-11-15 20:30:23 -05:00
Daniel Quinn
3952c6d921
Merge pull request #421 from ddddavidmartin/clarify_forgiving_ocr_handling
...
Clarify forgiving ocr handling
2018-10-08 09:35:57 +00:00
David Martin
b0afa37ec1
Mention FORGIVING_OCR config option when language detection fails.
...
It is not obvious that PAPERLESS_FORGIVING_OCR lets document
consumption proceed even if no language can be detected.
Mentioning it in the actual error message in the log seems like the best
way to make it clear.
2018-10-08 19:37:05 +11:00
David Martin
7022c98aab
Let unpaper overwrite temporary files.
...
I'm not sure what the circumstances are, but it looks like unpaper can
attempt to write a temporary file that already exists [0]. This then
fails the consumption. As per daedadu's comment, simply letting unpaper
overwrite files fixes this.
[0]
unpaper: error: output file '/tmp/paperless/paperless-pjkrcr4l/convert-0000.unpaper.pnm' already present.
See https://web.archive.org/web/20181008081515/https://github.com/danielquinn/paperless/issues/406#issue-360651630
2018-10-08 19:12:11 +11:00
Daniel Quinn
bc898c1992
Use optipng to optimise document thumbnails
2018-10-07 14:56:38 +01:00
Daniel Quinn
074609e1fc
Consolidate get_date onto the DocumentParser parent class
2018-10-07 14:56:02 +01:00
Daniel Quinn
0a4338143a
Tweak the date guesser to not allow dates prior to 1900 ( #414 )
2018-10-01 20:03:47 +01:00
Daniel Quinn
52bfeb2ad0
Improve the unknown language error message
2018-09-23 12:41:14 +01:00
Daniel Quinn
21e53aa55c
Merge pull request #399 from jat255/ENH_convert_only_one_page
...
Speed up thumbnail generation for PDFs
2018-09-09 21:12:42 +01:00
Daniel Quinn
ef7f98281d
Rename `parsers` to `DATE_REGEX`
...
In moving the `parsers` variable to the package level, it lost its
context, so a more descriptive name was needed.
2018-09-09 21:02:30 +01:00
Daniel Quinn
a3158eedf9
Merge branch 'ENH_text_consumer' of git://github.com/jat255/paperless into jat255-ENH_text_consumer
2018-09-09 20:52:59 +01:00
Daniel Quinn
6b63ce9201
Fix pycodestyle complaints
...
Apparently, pycodestyle now checks for invalid escape sequences, and it
only complains about them if the regex in use isn't a raw string
(r"").
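A minimal sketch of the raw-string distinction, for illustration only. The pattern and the names `YEAR_RAW`, `YEAR_ESCAPED`, and `find_year` are hypothetical and are not taken from the project's code.

```python
import re

# Hypothetical pattern for illustration; not the regex used in the project.
# Without the r prefix, "\d" is an invalid escape sequence in a normal
# string literal, which pycodestyle reports as W605.
YEAR_RAW = re.compile(r"\b(\d{4})\b")

# The equivalent non-raw literal has to double every backslash.
YEAR_ESCAPED = re.compile("\\b(\\d{4})\\b")


def find_year(text):
    """Return the first standalone four-digit run in text, or None."""
    match = YEAR_RAW.search(text)
    return match.group(1) if match else None
```

Both literals compile to the same pattern; the raw form is simply easier to read and keeps the linter quiet.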
2018-09-09 20:00:12 +01:00
Joshua Taillon
5326895334
move date-matching regex pattern to base parser module for use by all subclasses
2018-09-05 21:13:36 -04:00
Joshua Taillon
98a437f78a
change tesseract parser to only convert first page to save (potentially) massive amounts of work
2018-09-05 15:18:35 -04:00
Erik Arvstedt
4fa9ff60fc
Stop tests from writing to the source tree
2018-07-19 23:48:23 +02:00
Daniel Quinn
bce2d3dd22
Account for KeyError problem in #345
2018-04-28 12:20:43 +01:00
Daniel Quinn
f3f86242de
Account for KeyError problem in #345
2018-04-28 12:19:53 +01:00
Ovv
32c440cbd9
Log detected document date with isoformat
2018-03-04 13:10:49 +01:00
Wolf-Bastian Pöttner
328330eb08
Increase test coverage by testing two more date detection cases
2018-02-19 21:36:48 +01:00
Daniel Quinn
fc6d2d5e0c
Fix formatting
2018-02-18 18:00:34 +00:00
Daniel Quinn
9e26e7b39e
Fix tests to use _text instead of TEXT_CACHE
2018-02-18 18:00:22 +00:00
Daniel Quinn
7c5ca5f505
Merge pull request #302 from BastianPoe/bugfix/extend_regex_to_find_more_dates
...
Extends the regex to find dates in documents as reported by @isaacsando
2018-02-18 17:23:49 +01:00
Daniel Quinn
4f726e1991
Monitor return codes of calls to `convert` and `unpaper`
...
...and handle the failures nicely. Addresses #303 .
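One way such a return-code check could look, as a sketch under stated assumptions rather than the code from this commit. `ParseError` and `run_checked` are hypothetical names, and the Python interpreter stands in for the external tool so the example runs anywhere.

```python
import subprocess
import sys


class ParseError(Exception):
    """Hypothetical exception type used only in this sketch."""


def run_checked(command):
    """Run command and raise ParseError if it exits with a non-zero code."""
    completed = subprocess.run(
        command, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    if completed.returncode != 0:
        raise ParseError(
            "{} exited with code {}".format(command[0], completed.returncode)
        )
    return completed.stdout


if __name__ == "__main__":
    # Stand-in for an external tool: a child Python process that exits cleanly.
    run_checked([sys.executable, "-c", "raise SystemExit(0)"])
```

Raising a dedicated exception lets the caller log the failure and skip the document instead of continuing with missing output files.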
2018-02-18 16:02:27 +00:00
Daniel Quinn
e53033d1b3
Rename .TEXT_CACHE to .text
...
Properties should use snake_case, and only constants should be ALL_CAPS.
This change also makes use of the convention of "private" properties
being prefixed with `_`.
2018-02-18 16:00:43 +00:00
Daniel Quinn
3302ee2a78
Make isort happy
2018-02-18 16:00:03 +00:00
Daniel Quinn
caf44146db
Style and removal of Python 2.7 stuff
2018-02-18 15:55:55 +00:00
Wolf-Bastian Pöttner
5fed7ba6d4
Improved regular expression to only match for (unicode) characters in month names + parsed one regex match after another until one gave a parsable date
2018-02-14 21:41:04 +01:00
Wolf-Bastian Pöttner
fc81feb32e
Add more (fast-running) unit tests
2018-02-14 21:41:01 +01:00
Wolf-Bastian Pöttner
3e65054e39
Extended exception handling
2018-02-12 22:43:16 +01:00
Wolf-Bastian Pöttner
c0c20f99e9
Added log output for date detected in document
2018-02-12 22:41:19 +01:00
Wolf-Bastian Pöttner
3899763261
Extends the regex to find dates in documents as reported by @isaacsando
2018-02-12 22:41:15 +01:00
Daniel Quinn
c6e671f2fa
No need to extend object
2018-02-03 15:26:28 +00:00
Daniel Quinn
4c0b908a41
Rework tests to write to /tmp
...
Originally the test wrote scratch data inside the repo dir, which meant
manual cleanup. Now it writes to `/tmp/paperless-tests-<random-string>`
and cleans up after itself.
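A sketch of the scratch-directory pattern described above, assuming `unittest` and the standard `tempfile` module. The class name `ScratchDirTestCase` is hypothetical and not taken from the project.

```python
import os
import shutil
import tempfile
import unittest


class ScratchDirTestCase(unittest.TestCase):
    """Hypothetical test case; the name is not taken from the project."""

    def setUp(self):
        # mkdtemp creates a directory with a random suffix under the
        # system temp directory.
        self.scratch = tempfile.mkdtemp(prefix="paperless-tests-")
        # Registered cleanups run even if the test itself fails.
        self.addCleanup(shutil.rmtree, self.scratch, ignore_errors=True)

    def test_writes_outside_the_repo(self):
        path = os.path.join(self.scratch, "sample.txt")
        with open(path, "w") as handle:
            handle.write("scratch data")
        self.assertTrue(os.path.exists(path))
```

Using `addCleanup` rather than `tearDown` means the directory is removed even when `setUp` fails partway through.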
2018-02-03 14:49:48 +00:00
Wolf-Bastian Pöttner
acfacaac4f
Added a text cache to optimize performance of date detection
2018-02-03 00:28:52 +01:00
Wolf-Bastian Pöttner
4f725cf4d2
Add test cases for date parsing
2018-02-03 00:28:49 +01:00
Wolf-Bastian Pöttner
73d261484a
Merge branch 'master' of https://github.com/danielquinn/paperless into feature/heuristically-extract-date-from-document-text
2018-02-02 22:44:03 +01:00
Wolf-Bastian Pöttner
3dc730808e
Add support for using pre-existing text from PDFs
2018-02-02 22:37:58 +01:00
Matt
bc5c45a705
Fixing error sentinel for pdftotext when the PDF has no text (scanned images). It was causing a crash previously.
2018-02-01 10:08:57 -05:00
Daniel Quinn
269c32ce6a
Add support for using pre-existing text from PDFs
2018-01-30 20:13:35 +00:00
Wolf-Bastian Pöttner
21fc51c09a
Add support for a heuristic that extracts the document date from its text
2018-01-28 19:37:10 +01:00
Daniel Quinn
67844dff0c
Update test for #259 fix
2017-10-16 10:53:18 +01:00
Daniel Quinn
2820767f29
Support .jpeg as well as .jpg
2017-10-16 09:00:38 +01:00