107 Commits

Author SHA1 Message Date
jonaswinkler
89d6e422f5 fix bugs and test cases 2021-01-02 15:37:27 +01:00
jonaswinkler
40ef375c15 supply file_name for tika parser 2021-01-01 22:19:43 +01:00
jonaswinkler
c05bfb894a remove duplicate code 2021-01-01 21:50:45 +01:00
jonaswinkler
713985f259 fixes #218 2020-12-30 15:12:16 +01:00
jonaswinkler
ee31fdc650 removed unused code 2020-12-20 14:00:24 +01:00
jonaswinkler
1b1b57eb6a more tests 2020-12-19 15:54:13 +01:00
jonaswinkler
a0631413d6 fixes bauerj/paperless_app#23 and most of all other scanner apps out there. 2020-12-12 18:25:15 +01:00
jonaswinkler
2f7bb01f34 moved metadata extraction to the parsers 2020-12-10 14:57:53 +01:00
jonaswinkler
dab4b1253a fixes for the parser. 2020-12-04 16:44:34 +01:00
jonaswinkler
991a46c4f0 disabled thumbnail trimming. 2020-12-04 12:44:02 +01:00
jonaswinkler
6a04e95f69 catch encrypted pdf documents 2020-12-03 01:02:37 +01:00
jonaswinkler
e3ce573fbb a couple fixes and more supported image files 2020-12-02 17:39:49 +01:00
jonaswinkler
12fa844c7f testing the new noarchive option. 2020-12-01 14:30:13 +01:00
jonaswinkler
fd3df1ec58 some more tests. 2020-12-01 14:15:43 +01:00
jonaswinkler
aaa6599283 Merge branch 'dev' into feature-ocrmypdf 2020-11-30 16:48:09 +01:00
jonaswinkler
f51207fc32 added file type checks to the parsers to prevent temporary files from being consumed. Also: parsers announce file types they wish to use as default for each mime type. 2020-11-30 00:40:04 +01:00
jonaswinkler
ac1b701000 more tests! 2020-11-29 19:58:48 +01:00
jonaswinkler
fca98b411e reorganised settings documentation and added OCR_USER_ARGS 2020-11-29 12:38:32 +01:00
jonaswinkler
0565118a01 fixed checking the installed languages. 2020-11-29 12:31:42 +01:00
jonaswinkler
06cfc3113a test case fixes. 2020-11-27 14:06:37 +01:00
Jonas Winkler
e87575240d more tests of the new parser 2020-11-26 00:08:23 +01:00
Jonas Winkler
f51d2be303 fixed the test cases 2020-11-25 19:51:09 +01:00
Jonas Winkler
a60a4babf6 OMP_THREAD_LIMIT 2020-11-25 19:37:59 +01:00
Jonas Winkler
a03315102a added image DPI detection to the tesseract parser. 2020-11-25 19:37:48 +01:00
Jonas Winkler
df801d17e1 reworked the interface of the parsers. 2020-11-25 19:36:39 +01:00
Jonas Winkler
b269af7572 Merge branch 'dev' into feature-ocrmypdf 2020-11-25 16:58:20 +01:00
Jonas Winkler
d92214d412 codestyle 2020-11-25 16:05:52 +01:00
Jonas Winkler
56ce267f89 removed obsolete tests. 2020-11-25 14:51:32 +01:00
Jonas Winkler
2d559d330d reworked PDF parser that uses OCRmyPDF and produces archive files. 2020-11-25 14:50:43 +01:00
Jonas Winkler
dd83364326 default language check 2020-11-25 10:52:38 +01:00
Jonas Winkler
fec9e54049 new setting: PAPERLESS_OCR_PAGES 2020-11-22 12:54:08 +01:00
Jonas Winkler
450fb877f6 code cleanup 2020-11-21 15:34:00 +01:00
Jonas Winkler
b44f8383e4 code cleanup 2020-11-21 14:03:45 +01:00
Jonas Winkler
41650f20f4 mime type handling 2020-11-20 13:31:03 +01:00
Jonas Winkler
1655d85a53 testing the tesseract parser 2020-11-19 20:31:08 +01:00
Jonas Winkler
8908bc259e updated logging, logging for the mail consumer to see whats happening 2020-11-18 13:23:30 +01:00
Jonas Winkler
d2e22e3f27 Changed the way parsers are discovered. This also prepares for upcoming changes regarding content types and file types: parsers should declare what they support, and actual file extensions should not be hardcoded everywhere. 2020-11-16 23:53:12 +01:00
Jonas Winkler
8dca459573 first version of the new consumer. 2020-11-16 18:26:54 +01:00
Jonas Winkler
2e04ba1c04 code style fixes 2020-11-12 21:09:45 +01:00
Jonas Winkler
f182709fdd fixed most of the tests 2020-11-02 19:42:23 +01:00
Jonas Winkler
3a08a2d206 made unpaper and convert a little bit nicer to interact with 2020-11-02 19:31:04 +01:00
Jonas Winkler
7d282a4e4e removed unused code, small fixes 2020-11-02 18:20:04 +01:00
Jonas Winkler
d15405ef56 reworked most of the tesseract parser, better logging 2020-11-02 15:40:44 +01:00
Jonas Winkler
06ad212320 bugfix 2020-11-02 01:26:42 +01:00
Jonas Winkler
9f55fb668d silenced unpaper, optipng for cleaner output
moved parser settings to settings
removed forgiving ocr (now default) since tesseract is plenty accurate even without defining the correct language.
2020-11-01 23:23:42 +01:00
Jonas Winkler
743ce1dc14 better thumbnail generation for smaller files 2020-10-26 01:05:23 +01:00
Johannes Wienke
a311cd498c Handle dateparser ValueErrors
When parsing dates from the document text or filenames, correctly handle values
errors indicating broken dates. Newly added tests ensure that this handling
works properly.
2020-03-08 18:44:15 +01:00
Johannes Wienke
a3aab0cb48 Remove duplicated date parsing test
The exact same tests existed twice in the file.
2020-03-08 18:26:29 +01:00
Stéphane Brunner
daca77cc1b Strip the thumbnails 2019-03-17 16:37:47 +01:00
jenspfeifle
336f747f16 make pycodestyle happy 2019-03-03 20:41:17 +01:00