138 Commits

Author SHA1 Message Date
Martin Müller
1e288100a9 Remove unneeded exception handler from has_alpha() 2022-02-21 22:58:19 +01:00
Martin Müller
2a47b3f1a1 Fix code style (line too long) 2022-02-21 22:34:34 +01:00
Martin Müller
41494ee689 Remove alpha layer from PNG files for img2pdf
Fixes issue #1254
2022-02-21 22:06:43 +01:00
jonaswinkler
23c6f849d6 fix bug with DPI calculation 2021-08-18 18:33:33 +02:00
jonaswinkler
1f707e86cc fix logging getting spammed with pdfminer warnings on JPG files 2021-06-13 12:09:16 +02:00
jonaswinkler
814d90745b Workaround for all PDFminer.six issues. 2021-05-15 12:15:32 +02:00
jonaswinkler
0e596bd1fc also apply \0 removal to sidecar contents 2021-03-22 23:08:34 +01:00
jonaswinkler
fda2bfbea7 better exception logging 2021-03-22 23:00:15 +01:00
jonaswinkler
d26c46e034 fixes #794 2021-03-22 22:46:35 +01:00
jonaswinkler
40ce38254b fixes #631 2021-03-14 14:42:48 +01:00
jonaswinkler
265432f2a5 fix up the ocrmypdf parameter construction for clean-final and redo 2021-02-21 23:39:19 +01:00
jonaswinkler
a13e9f23b1 use archived file for thumbnail, if available 2021-02-21 23:30:14 +01:00
jonaswinkler
14e2ad7bc4 more parameter checking 2021-02-21 22:19:24 +01:00
jonaswinkler
6da237dd9e pycodestyle 2021-02-21 00:21:43 +01:00
jonaswinkler
ce121a261d completely reworked the OCRmyPDF parser. 2021-02-21 00:16:57 +01:00
jonaswinkler
56bd966c02 local import of ocrmypdf so that the webserver does not load that 2021-02-15 12:18:10 +01:00
jonaswinkler
8d6071e977 fix a bug with thumbnail generation when TIKA was enabled 2021-02-09 22:12:43 +01:00
jonaswinkler
431d4fd8e4 rework most of the logging 2021-02-05 01:10:29 +01:00
jonaswinkler
d17de45791 fix typo 2021-02-03 14:51:04 +01:00
jonaswinkler
bdc247ce49 code style 2021-02-02 23:58:25 +01:00
jonaswinkler
b0ed06003b better error messages 2021-01-27 17:56:06 +01:00
jonaswinkler
40ef375c15 supply file_name for tika parser 2021-01-01 22:19:43 +01:00
jonaswinkler
c05bfb894a remove duplicate code 2021-01-01 21:50:45 +01:00
jonaswinkler
713985f259 fixes #218 2020-12-30 15:12:16 +01:00
jonaswinkler
a0631413d6 fixes bauerj/paperless_app#23 and most of all other scanner apps out there. 2020-12-12 18:25:15 +01:00
jonaswinkler
2f7bb01f34 moved metadata extraction to the parsers 2020-12-10 14:57:53 +01:00
jonaswinkler
dab4b1253a fixes for the parser. 2020-12-04 16:44:34 +01:00
jonaswinkler
991a46c4f0 disabled thumbnail trimming. 2020-12-04 12:44:02 +01:00
jonaswinkler
6a04e95f69 catch encrypted pdf documents 2020-12-03 01:02:37 +01:00
jonaswinkler
e3ce573fbb a couple fixes and more supported image files 2020-12-02 17:39:49 +01:00
jonaswinkler
fd3df1ec58 some more tests. 2020-12-01 14:15:43 +01:00
jonaswinkler
fca98b411e reorganised settings documentation and added OCR_USER_ARGS 2020-11-29 12:38:32 +01:00
Jonas Winkler
e87575240d more tests of the new parser 2020-11-26 00:08:23 +01:00
Jonas Winkler
a60a4babf6 OMP_THREAD_LIMIT 2020-11-25 19:37:59 +01:00
Jonas Winkler
a03315102a added image DPI detection to the tesseract parser. 2020-11-25 19:37:48 +01:00
Jonas Winkler
df801d17e1 reworked the interface of the parsers. 2020-11-25 19:36:39 +01:00
Jonas Winkler
2d559d330d reworked PDF parser that uses OCRmyPDF and produces archive files. 2020-11-25 14:50:43 +01:00
Jonas Winkler
fec9e54049 new setting: PAPERLESS_OCR_PAGES 2020-11-22 12:54:08 +01:00
Jonas Winkler
450fb877f6 code cleanup 2020-11-21 15:34:00 +01:00
Jonas Winkler
b44f8383e4 code cleanup 2020-11-21 14:03:45 +01:00
Jonas Winkler
8908bc259e updated logging, logging for the mail consumer to see what's happening 2020-11-18 13:23:30 +01:00
Jonas Winkler
8dca459573 first version of the new consumer. 2020-11-16 18:26:54 +01:00
Jonas Winkler
2e04ba1c04 code style fixes 2020-11-12 21:09:45 +01:00
Jonas Winkler
3a08a2d206 made unpaper and convert a little bit nicer to interact with 2020-11-02 19:31:04 +01:00
Jonas Winkler
7d282a4e4e removed unused code, small fixes 2020-11-02 18:20:04 +01:00
Jonas Winkler
d15405ef56 reworked most of the tesseract parser, better logging 2020-11-02 15:40:44 +01:00
Jonas Winkler
06ad212320 bugfix 2020-11-02 01:26:42 +01:00
Jonas Winkler
9f55fb668d silenced unpaper, optipng for cleaner output
moved parser settings to settings
removed forgiving ocr (now default) since tesseract is plenty accurate even without defining the correct language.
2020-11-01 23:23:42 +01:00
Jonas Winkler
743ce1dc14 better thumbnail generation for smaller files 2020-10-26 01:05:23 +01:00
Stéphane Brunner
daca77cc1b Strip the thumbnails 2019-03-17 16:37:47 +01:00