107 Commits

Author SHA1 Message Date
jonaswinkler
bee7a06e41 fix bugs and test cases 2021-01-02 15:37:27 +01:00
jonaswinkler
755f950cd2 supply file_name for tika parser 2021-01-01 22:19:43 +01:00
jonaswinkler
f1e9b414f9 remove duplicate code 2021-01-01 21:50:45 +01:00
jonaswinkler
4b7138f477 fixes #218 2020-12-30 15:12:16 +01:00
jonaswinkler
d329b371ef removed unused code 2020-12-20 14:00:24 +01:00
jonaswinkler
a3334293af more tests 2020-12-19 15:54:13 +01:00
jonaswinkler
45d31f9735 fixes bauerj/paperless_app#23 and most of all other scanner apps out there. 2020-12-12 18:25:15 +01:00
jonaswinkler
0c6c4a62d8 moved metadata extraction to the parsers 2020-12-10 14:57:53 +01:00
jonaswinkler
905c090908 fixes for the parser. 2020-12-04 16:44:34 +01:00
jonaswinkler
884eec9b61 disabled thumbnail trimming. 2020-12-04 12:44:02 +01:00
jonaswinkler
e2a375c9aa catch encrypted pdf documents 2020-12-03 01:02:37 +01:00
jonaswinkler
1d073d2cfd a couple fixes and more supported image files 2020-12-02 17:39:49 +01:00
jonaswinkler
0fb294d556 testing the new noarchive option. 2020-12-01 14:30:13 +01:00
jonaswinkler
1f90d50833 some more tests. 2020-12-01 14:15:43 +01:00
jonaswinkler
1df64e3129 Merge branch 'dev' into feature-ocrmypdf 2020-11-30 16:48:09 +01:00
jonaswinkler
7658c07b4d added file type checks to the parsers to prevent temporary files from being consumed. Also: parsers announce file types they wish to use as default for each mime type. 2020-11-30 00:40:04 +01:00
jonaswinkler
20cc7e3dc0 more tests! 2020-11-29 19:58:48 +01:00
jonaswinkler
388f6cfbe6 reorganised settings documentation and added OCR_USER_ARGS 2020-11-29 12:38:32 +01:00
jonaswinkler
a19a336567 fixed checking the installed languages. 2020-11-29 12:31:42 +01:00
jonaswinkler
99e6906b51 test case fixes. 2020-11-27 14:06:37 +01:00
Jonas Winkler
f901def797 more tests of the new parser 2020-11-26 00:08:23 +01:00
Jonas Winkler
c00c63c639 fixed the test cases 2020-11-25 19:51:09 +01:00
Jonas Winkler
e55d1ff9cc OMP_THREAD_LIMIT 2020-11-25 19:37:59 +01:00
Jonas Winkler
3b655c95d9 added image DPI detection to the tesseract parser. 2020-11-25 19:37:48 +01:00
Jonas Winkler
9bfa088eb5 reworked the interface of the parsers. 2020-11-25 19:36:39 +01:00
Jonas Winkler
b02f29ce9d Merge branch 'dev' into feature-ocrmypdf 2020-11-25 16:58:20 +01:00
Jonas Winkler
bd8a2eaf1e codestyle 2020-11-25 16:05:52 +01:00
Jonas Winkler
f5656222e2 removed obsolete tests. 2020-11-25 14:51:32 +01:00
Jonas Winkler
15935ab61f reworked PDF parser that uses OCRmyPDF and produces archive files. 2020-11-25 14:50:43 +01:00
Jonas Winkler
7a6dcf8520 default language check 2020-11-25 10:52:38 +01:00
Jonas Winkler
ae198f0767 new setting: PAPERLESS_OCR_PAGES 2020-11-22 12:54:08 +01:00
Jonas Winkler
a532200d10 code cleanup 2020-11-21 15:34:00 +01:00
Jonas Winkler
afc3753e58 code cleanup 2020-11-21 14:03:45 +01:00
Jonas Winkler
f976a0b4ba mime type handling 2020-11-20 13:31:03 +01:00
Jonas Winkler
cbee56ae8c testing the tesseract parser 2020-11-19 20:31:08 +01:00
Jonas Winkler
680ab3d56b updated logging, logging for the mail consumer to see what's happening 2020-11-18 13:23:30 +01:00
Jonas Winkler
9a48d6c577 Changed the way parsers are discovered. This also prepares for upcoming changes regarding content types and file types: parsers should declare what they support, and actual file extensions should not be hardcoded everywhere. 2020-11-16 23:53:12 +01:00
Jonas Winkler
bd04c966c5 first version of the new consumer. 2020-11-16 18:26:54 +01:00
Jonas Winkler
eb6805e37e code style fixes 2020-11-12 21:09:45 +01:00
Jonas Winkler
340f9f141f fixed most of the tests 2020-11-02 19:42:23 +01:00
Jonas Winkler
d42979842e made unpaper and convert a little bit nicer to interact with 2020-11-02 19:31:04 +01:00
Jonas Winkler
a89773ad71 removed unused code, small fixes 2020-11-02 18:20:04 +01:00
Jonas Winkler
def3a85858 reworked most of the tesseract parser, better logging 2020-11-02 15:40:44 +01:00
Jonas Winkler
972a6a2333 bugfix 2020-11-02 01:26:42 +01:00
Jonas Winkler
6adc870a20 silenced unpaper, optipng for cleaner output
moved parser settings to settings
removed forgiving ocr (now default) since tesseract is plenty accurate even without defining the correct language.
2020-11-01 23:23:42 +01:00
Jonas Winkler
0f4094f3ca better thumbnail generation for smaller files 2020-10-26 01:05:23 +01:00
Johannes Wienke
ebcfcea05b Handle dateparser ValueErrors
When parsing dates from the document text or filenames, correctly handle value
errors indicating broken dates. Newly added tests ensure that this handling
works properly.
2020-03-08 18:44:15 +01:00
Johannes Wienke
6531a67940 Remove duplicated date parsing test
The exact same tests existed twice in the file.
2020-03-08 18:26:29 +01:00
Stéphane Brunner
3fab354a6e Strip the thumbnails 2019-03-17 16:37:47 +01:00
jenspfeifle
5c40da1a48 make pycodestyle happy 2019-03-03 20:41:17 +01:00