jonaswinkler
ee31fdc650
removed unused code
2020-12-20 14:00:24 +01:00
jonaswinkler
1b1b57eb6a
more tests
2020-12-19 15:54:13 +01:00
jonaswinkler
a0631413d6
fixes bauerj/paperless_app#23 and most other scanner apps out there.
2020-12-12 18:25:15 +01:00
jonaswinkler
2f7bb01f34
moved metadata extraction to the parsers
2020-12-10 14:57:53 +01:00
jonaswinkler
dab4b1253a
fixes for the parser.
2020-12-04 16:44:34 +01:00
jonaswinkler
991a46c4f0
disabled thumbnail trimming.
2020-12-04 12:44:02 +01:00
jonaswinkler
6a04e95f69
catch encrypted pdf documents
2020-12-03 01:02:37 +01:00
jonaswinkler
e3ce573fbb
a couple fixes and more supported image files
2020-12-02 17:39:49 +01:00
jonaswinkler
12fa844c7f
testing the new noarchive option.
2020-12-01 14:30:13 +01:00
jonaswinkler
fd3df1ec58
some more tests.
2020-12-01 14:15:43 +01:00
jonaswinkler
aaa6599283
Merge branch 'dev' into feature-ocrmypdf
2020-11-30 16:48:09 +01:00
jonaswinkler
f51207fc32
added file type checks to the parsers to prevent temporary files from being consumed. Also: parsers announce file types they wish to use as default for each mime type.
2020-11-30 00:40:04 +01:00
jonaswinkler
ac1b701000
more tests!
2020-11-29 19:58:48 +01:00
jonaswinkler
fca98b411e
reorganised settings documentation and added OCR_USER_ARGS
2020-11-29 12:38:32 +01:00
jonaswinkler
0565118a01
fixed checking the installed languages.
2020-11-29 12:31:42 +01:00
jonaswinkler
06cfc3113a
test case fixes.
2020-11-27 14:06:37 +01:00
Jonas Winkler
e87575240d
more tests of the new parser
2020-11-26 00:08:23 +01:00
Jonas Winkler
f51d2be303
fixed the test cases
2020-11-25 19:51:09 +01:00
Jonas Winkler
a60a4babf6
OMP_THREAD_LIMIT
2020-11-25 19:37:59 +01:00
Jonas Winkler
a03315102a
added image DPI detection to the tesseract parser.
2020-11-25 19:37:48 +01:00
Jonas Winkler
df801d17e1
reworked the interface of the parsers.
2020-11-25 19:36:39 +01:00
Jonas Winkler
b269af7572
Merge branch 'dev' into feature-ocrmypdf
2020-11-25 16:58:20 +01:00
Jonas Winkler
d92214d412
codestyle
2020-11-25 16:05:52 +01:00
Jonas Winkler
56ce267f89
removed obsolete tests.
2020-11-25 14:51:32 +01:00
Jonas Winkler
2d559d330d
reworked PDF parser that uses OCRmyPDF and produces archive files.
2020-11-25 14:50:43 +01:00
Jonas Winkler
dd83364326
default language check
2020-11-25 10:52:38 +01:00
Jonas Winkler
fec9e54049
new setting: PAPERLESS_OCR_PAGES
2020-11-22 12:54:08 +01:00
Jonas Winkler
450fb877f6
code cleanup
2020-11-21 15:34:00 +01:00
Jonas Winkler
b44f8383e4
code cleanup
2020-11-21 14:03:45 +01:00
Jonas Winkler
41650f20f4
mime type handling
2020-11-20 13:31:03 +01:00
Jonas Winkler
1655d85a53
testing the tesseract parser
2020-11-19 20:31:08 +01:00
Jonas Winkler
8908bc259e
updated logging, logging for the mail consumer to see what's happening
2020-11-18 13:23:30 +01:00
Jonas Winkler
d2e22e3f27
Changed the way parsers are discovered. This also prepares for upcoming changes regarding content types and file types: parsers should declare what they support, and actual file extensions should not be hardcoded everywhere.
2020-11-16 23:53:12 +01:00
Jonas Winkler
8dca459573
first version of the new consumer.
2020-11-16 18:26:54 +01:00
Jonas Winkler
2e04ba1c04
code style fixes
2020-11-12 21:09:45 +01:00
Jonas Winkler
f182709fdd
fixed most of the tests
2020-11-02 19:42:23 +01:00
Jonas Winkler
3a08a2d206
made unpaper and convert a little bit nicer to interact with
2020-11-02 19:31:04 +01:00
Jonas Winkler
7d282a4e4e
removed unused code, small fixes
2020-11-02 18:20:04 +01:00
Jonas Winkler
d15405ef56
reworked most of the tesseract parser, better logging
2020-11-02 15:40:44 +01:00
Jonas Winkler
06ad212320
bugfix
2020-11-02 01:26:42 +01:00
Jonas Winkler
9f55fb668d
silenced unpaper, optipng for cleaner output
moved parser settings to settings
removed forgiving ocr (now default) since tesseract is plenty accurate even without defining the correct language.
2020-11-01 23:23:42 +01:00
Jonas Winkler
743ce1dc14
better thumbnail generation for smaller files
2020-10-26 01:05:23 +01:00
Johannes Wienke
a311cd498c
Handle dateparser ValueErrors
When parsing dates from the document text or filenames, correctly handle value
errors indicating broken dates. Newly added tests ensure that this handling
works properly.
2020-03-08 18:44:15 +01:00
Johannes Wienke
a3aab0cb48
Remove duplicated date parsing test
The exact same tests existed twice in the file.
2020-03-08 18:26:29 +01:00
Stéphane Brunner
daca77cc1b
Strip the thumbnails
2019-03-17 16:37:47 +01:00
jenspfeifle
336f747f16
make pycodestyle happy
2019-03-03 20:41:17 +01:00
JensPfeifle
29b0886950
try to run convert, but fall back on gs if needed
2019-03-03 20:31:52 +01:00
JensPfeifle
ea282c22ba
Add GS_BINARY to settings to avoid hardcoded call of "gs"
2019-03-03 20:31:52 +01:00
Pit
cbf008f37b
Fix quoting in call to run_convert
Co-Authored-By: JensPfeifle <jens@pfeifle.tech>
2019-03-03 20:31:52 +01:00
JensPfeifle
50504c3fd8
remove unnecessary env arg in Popen
2019-03-03 20:31:52 +01:00