jonaswinkler | 3cfd97aa08 | pycodestyle | 2021-02-21 00:21:43 +01:00
jonaswinkler | 26c65b29d5 | tests | 2021-02-21 00:18:34 +01:00
jonaswinkler | e3dd1863a9 | completely reworked the OCRmyPDF parser. | 2021-02-21 00:16:57 +01:00
jonaswinkler | 99cb371483 | add some test files | 2021-02-21 00:13:08 +01:00
jonaswinkler | 94cc9876d9 | local import of ocrmypdf so that the webserver does not load that | 2021-02-15 12:18:10 +01:00
jonaswinkler | b04d91d68c | fix a bug with thumbnail generation when TIKA was enabled | 2021-02-09 22:12:43 +01:00
jonaswinkler | e5a7dc0cc7 | rework most of the logging | 2021-02-05 01:10:29 +01:00
jonaswinkler | 95f5c9f3a6 | lazy loading for parsers | 2021-02-04 13:17:24 +01:00
jonaswinkler | 701897dc3c | fix typo | 2021-02-03 14:51:04 +01:00
jonaswinkler | eeff7b3bdb | code style | 2021-02-02 23:58:25 +01:00
jonaswinkler | 14c61d72f3 | better error messages | 2021-01-27 17:56:06 +01:00
jonaswinkler | bee7a06e41 | fix bugs and test cases | 2021-01-02 15:37:27 +01:00
jonaswinkler | 755f950cd2 | supply file_name for tika parser | 2021-01-01 22:19:43 +01:00
jonaswinkler | f1e9b414f9 | remove duplicate code | 2021-01-01 21:50:45 +01:00
jonaswinkler | 4b7138f477 | fixes #218 | 2020-12-30 15:12:16 +01:00
jonaswinkler | d329b371ef | removed unused code | 2020-12-20 14:00:24 +01:00
jonaswinkler | a3334293af | more tests | 2020-12-19 15:54:13 +01:00
jonaswinkler | 45d31f9735 | fixes bauerj/paperless_app#23 and most of all other scanner apps out there. | 2020-12-12 18:25:15 +01:00
jonaswinkler | 0c6c4a62d8 | moved metadata extraction to the parsers | 2020-12-10 14:57:53 +01:00
jonaswinkler | 905c090908 | fixes for the parser. | 2020-12-04 16:44:34 +01:00
jonaswinkler | 884eec9b61 | disabled thumbnail trimming. | 2020-12-04 12:44:02 +01:00
jonaswinkler | e2a375c9aa | catch encrypted pdf documents | 2020-12-03 01:02:37 +01:00
jonaswinkler | 1d073d2cfd | a couple fixes and more supported image files | 2020-12-02 17:39:49 +01:00
jonaswinkler | 0fb294d556 | testing the new noarchive option. | 2020-12-01 14:30:13 +01:00
jonaswinkler | 1f90d50833 | some more tests. | 2020-12-01 14:15:43 +01:00
jonaswinkler | 1df64e3129 | Merge branch 'dev' into feature-ocrmypdf | 2020-11-30 16:48:09 +01:00
jonaswinkler | 7658c07b4d | added file type checks to the parsers to prevent temporary files from being consumed. Also: parsers announce file types they wish to use as default for each mime type. | 2020-11-30 00:40:04 +01:00
jonaswinkler | 20cc7e3dc0 | more tests! | 2020-11-29 19:58:48 +01:00
jonaswinkler | 388f6cfbe6 | reorganised settings documentation and added OCR_USER_ARGS | 2020-11-29 12:38:32 +01:00
jonaswinkler | a19a336567 | fixed checking the installed languages. | 2020-11-29 12:31:42 +01:00
jonaswinkler | 99e6906b51 | test case fixes. | 2020-11-27 14:06:37 +01:00
Jonas Winkler | f901def797 | more tests of the new parser | 2020-11-26 00:08:23 +01:00
Jonas Winkler | c00c63c639 | fixed the test cases | 2020-11-25 19:51:09 +01:00
Jonas Winkler | e55d1ff9cc | OMP_THREAD_LIMIT | 2020-11-25 19:37:59 +01:00
Jonas Winkler | 3b655c95d9 | added image DPI detection to the tesseract parser. | 2020-11-25 19:37:48 +01:00
Jonas Winkler | 9bfa088eb5 | reworked the interface of the parsers. | 2020-11-25 19:36:39 +01:00
Jonas Winkler | b02f29ce9d | Merge branch 'dev' into feature-ocrmypdf | 2020-11-25 16:58:20 +01:00
Jonas Winkler | bd8a2eaf1e | codestyle | 2020-11-25 16:05:52 +01:00
Jonas Winkler | f5656222e2 | removed obsolete tests. | 2020-11-25 14:51:32 +01:00
Jonas Winkler | 15935ab61f | reworked PDF parser that uses OCRmyPDF and produces archive files. | 2020-11-25 14:50:43 +01:00
Jonas Winkler | 7a6dcf8520 | default language check | 2020-11-25 10:52:38 +01:00
Jonas Winkler | ae198f0767 | new setting: PAPERLESS_OCR_PAGES | 2020-11-22 12:54:08 +01:00
Jonas Winkler | a532200d10 | code cleanup | 2020-11-21 15:34:00 +01:00
Jonas Winkler | afc3753e58 | code cleanup | 2020-11-21 14:03:45 +01:00
Jonas Winkler | f976a0b4ba | mime type handling | 2020-11-20 13:31:03 +01:00
Jonas Winkler | cbee56ae8c | testing the tesseract parser | 2020-11-19 20:31:08 +01:00
Jonas Winkler | 680ab3d56b | updated logging, logging for the mail consumer to see whats happening | 2020-11-18 13:23:30 +01:00
Jonas Winkler | 9a48d6c577 | Changed the way parsers are discovered. This also prepares for upcoming changes regarding content types and file types: parsers should declare what they support, and actual file extensions should not be hardcoded everywhere. | 2020-11-16 23:53:12 +01:00
Jonas Winkler | bd04c966c5 | first version of the new consumer. | 2020-11-16 18:26:54 +01:00
Jonas Winkler | eb6805e37e | code style fixes | 2020-11-12 21:09:45 +01:00