jonaswinkler
0fb294d556
testing the new noarchive option.
2020-12-01 14:30:13 +01:00
jonaswinkler
1f90d50833
some more tests.
2020-12-01 14:15:43 +01:00
jonaswinkler
1df64e3129
Merge branch 'dev' into feature-ocrmypdf
2020-11-30 16:48:09 +01:00
jonaswinkler
7658c07b4d
added file type checks to the parsers to prevent temporary files from being consumed. Also: parsers announce file types they wish to use as default for each mime type.
2020-11-30 00:40:04 +01:00
jonaswinkler
20cc7e3dc0
more tests!
2020-11-29 19:58:48 +01:00
jonaswinkler
388f6cfbe6
reorganised settings documentation and added OCR_USER_ARGS
2020-11-29 12:38:32 +01:00
jonaswinkler
a19a336567
fixed checking the installed languages.
2020-11-29 12:31:42 +01:00
jonaswinkler
99e6906b51
test case fixes.
2020-11-27 14:06:37 +01:00
Jonas Winkler
f901def797
more tests of the new parser
2020-11-26 00:08:23 +01:00
Jonas Winkler
c00c63c639
fixed the test cases
2020-11-25 19:51:09 +01:00
Jonas Winkler
e55d1ff9cc
OMP_THREAD_LIMIT
2020-11-25 19:37:59 +01:00
Jonas Winkler
3b655c95d9
added image DPI detection to the tesseract parser.
2020-11-25 19:37:48 +01:00
Jonas Winkler
9bfa088eb5
reworked the interface of the parsers.
2020-11-25 19:36:39 +01:00
Jonas Winkler
b02f29ce9d
Merge branch 'dev' into feature-ocrmypdf
2020-11-25 16:58:20 +01:00
Jonas Winkler
bd8a2eaf1e
codestyle
2020-11-25 16:05:52 +01:00
Jonas Winkler
f5656222e2
removed obsolete tests.
2020-11-25 14:51:32 +01:00
Jonas Winkler
15935ab61f
reworked PDF parser that uses OCRmyPDF and produces archive files.
2020-11-25 14:50:43 +01:00
Jonas Winkler
7a6dcf8520
default language check
2020-11-25 10:52:38 +01:00
Jonas Winkler
ae198f0767
new setting: PAPERLESS_OCR_PAGES
2020-11-22 12:54:08 +01:00
Jonas Winkler
a532200d10
code cleanup
2020-11-21 15:34:00 +01:00
Jonas Winkler
afc3753e58
code cleanup
2020-11-21 14:03:45 +01:00
Jonas Winkler
f976a0b4ba
mime type handling
2020-11-20 13:31:03 +01:00
Jonas Winkler
cbee56ae8c
testing the tesseract parser
2020-11-19 20:31:08 +01:00
Jonas Winkler
680ab3d56b
updated logging, logging for the mail consumer to see what's happening
2020-11-18 13:23:30 +01:00
Jonas Winkler
9a48d6c577
Changed the way parsers are discovered. This also prepares for upcoming changes regarding content types and file types: parsers should declare what they support, and actual file extensions should not be hardcoded everywhere.
2020-11-16 23:53:12 +01:00
Jonas Winkler
bd04c966c5
first version of the new consumer.
2020-11-16 18:26:54 +01:00
Jonas Winkler
eb6805e37e
code style fixes
2020-11-12 21:09:45 +01:00
Jonas Winkler
340f9f141f
fixed most of the tests
2020-11-02 19:42:23 +01:00
Jonas Winkler
d42979842e
made unpaper and convert a little bit nicer to interact with
2020-11-02 19:31:04 +01:00
Jonas Winkler
a89773ad71
removed unused code, small fixes
2020-11-02 18:20:04 +01:00
Jonas Winkler
def3a85858
reworked most of the tesseract parser, better logging
2020-11-02 15:40:44 +01:00
Jonas Winkler
972a6a2333
bugfix
2020-11-02 01:26:42 +01:00
Jonas Winkler
6adc870a20
silenced unpaper, optipng for cleaner output
...
moved parser settings to settings
removed forgiving ocr (now default) since tesseract is plenty accurate even without defining the correct language.
2020-11-01 23:23:42 +01:00
Jonas Winkler
0f4094f3ca
better thumbnail generation for smaller files
2020-10-26 01:05:23 +01:00
Johannes Wienke
ebcfcea05b
Handle dateparser ValueErrors
...
When parsing dates from the document text or filenames, correctly handle
ValueErrors indicating broken dates. Newly added tests ensure that this handling
works properly.
2020-03-08 18:44:15 +01:00
Johannes Wienke
6531a67940
Remove duplicated date parsing test
...
The exact same tests existed twice in the file.
2020-03-08 18:26:29 +01:00
Stéphane Brunner
3fab354a6e
Strip the thumbnails
2019-03-17 16:37:47 +01:00
jenspfeifle
5c40da1a48
make pycodestyle happy
2019-03-03 20:41:17 +01:00
JensPfeifle
078d66b077
try to run convert, but fall back on gs if needed
2019-03-03 20:31:52 +01:00
JensPfeifle
4c64ea0404
Add GS_BINARY to settings to avoid hardcoded call of "gs"
2019-03-03 20:31:52 +01:00
Pit
99718bcf17
Fix quoting in call to run_convert
...
Co-Authored-By: JensPfeifle <jens@pfeifle.tech>
2019-03-03 20:31:52 +01:00
JensPfeifle
3dfd0253ed
remove unnecessary env arg in Popen
2019-03-03 20:31:52 +01:00
Jens Pfeifle
6ab21afeb6
fix parse error of some documents by using gs
2019-03-03 20:31:52 +01:00
Daniel Quinn
e395b0e081
Drop problematic tests
...
Some tests had differing outcomes depending on the version of Tesseract
installed on the test system. This led to a bunch of false test
failures, which led to people (including me) just ignoring the Travis
results.
This commit removes those tests, and while it reduces our coverage, at
least the results are predictable.
2018-12-30 17:32:45 +00:00
Daniel Quinn
86b0d08377
Use modern languages for sample test files
2018-12-30 14:09:17 +00:00
Erik Arvstedt
f38ac7f62b
Fix date test sample image
...
The previous version of `tests_date_3.png` had too much spacing
between the `0` and the `8` glyphs, which resulted in the year getting
parsed as `200 8` in Tesseract 3.05.00 (+ tessdata 3.04.00).
This caused the date parsing test to fail.
2018-12-02 15:10:21 +01:00
Daniel Quinn
0d59844567
Conform everything to the coding standards
...
https://paperless.readthedocs.io/en/latest/contributing.html#additional-style-guides
2018-12-01 17:09:12 +00:00
Daniel Quinn
4e186ede0e
Merge branch 'ENH_filename_date_parsing' of https://github.com/jat255/paperless into jat255-ENH_filename_date_parsing
2018-12-01 16:57:16 +00:00
Daniel Quinn
9c6b8629a3
Fix language guesses in tests
...
It turns out that the Lorem ipsum text in the sample files was confusing the language guesser, causing it to think the file was in Catalan and not English or German.
2018-12-01 15:55:59 +00:00
Joshua Taillon
b0326b5a19
Merge branch 'master' of github.com:danielquinn/paperless into ENH_filename_date_parsing
2018-11-15 23:17:59 -05:00