jonaswinkler
40ef375c15
supply file_name for tika parser
2021-01-01 22:19:43 +01:00
jonaswinkler
c05bfb894a
remove duplicate code
2021-01-01 21:50:45 +01:00
jonaswinkler
713985f259
fixes #218
2020-12-30 15:12:16 +01:00
jonaswinkler
a0631413d6
fixes bauerj/paperless_app#23 and most other scanner apps out there.
2020-12-12 18:25:15 +01:00
jonaswinkler
2f7bb01f34
moved metadata extraction to the parsers
2020-12-10 14:57:53 +01:00
jonaswinkler
dab4b1253a
fixes for the parser.
2020-12-04 16:44:34 +01:00
jonaswinkler
991a46c4f0
disabled thumbnail trimming.
2020-12-04 12:44:02 +01:00
jonaswinkler
6a04e95f69
catch encrypted pdf documents
2020-12-03 01:02:37 +01:00
jonaswinkler
e3ce573fbb
a couple fixes and more supported image files
2020-12-02 17:39:49 +01:00
jonaswinkler
fd3df1ec58
some more tests.
2020-12-01 14:15:43 +01:00
jonaswinkler
fca98b411e
reorganised settings documentation and added OCR_USER_ARGS
2020-11-29 12:38:32 +01:00
Jonas Winkler
e87575240d
more tests of the new parser
2020-11-26 00:08:23 +01:00
Jonas Winkler
a60a4babf6
OMP_THREAD_LIMIT
2020-11-25 19:37:59 +01:00
Jonas Winkler
a03315102a
added image DPI detection to the tesseract parser.
2020-11-25 19:37:48 +01:00
Jonas Winkler
df801d17e1
reworked the interface of the parsers.
2020-11-25 19:36:39 +01:00
Jonas Winkler
2d559d330d
reworked PDF parser that uses OCRmyPDF and produces archive files.
2020-11-25 14:50:43 +01:00
Jonas Winkler
fec9e54049
new setting: PAPERLESS_OCR_PAGES
2020-11-22 12:54:08 +01:00
Jonas Winkler
450fb877f6
code cleanup
2020-11-21 15:34:00 +01:00
Jonas Winkler
b44f8383e4
code cleanup
2020-11-21 14:03:45 +01:00
Jonas Winkler
8908bc259e
updated logging, logging for the mail consumer to see what's happening
2020-11-18 13:23:30 +01:00
Jonas Winkler
8dca459573
first version of the new consumer.
2020-11-16 18:26:54 +01:00
Jonas Winkler
2e04ba1c04
code style fixes
2020-11-12 21:09:45 +01:00
Jonas Winkler
3a08a2d206
made unpaper and convert a little bit nicer to interact with
2020-11-02 19:31:04 +01:00
Jonas Winkler
7d282a4e4e
removed unused code, small fixes
2020-11-02 18:20:04 +01:00
Jonas Winkler
d15405ef56
reworked most of the tesseract parser, better logging
2020-11-02 15:40:44 +01:00
Jonas Winkler
06ad212320
bugfix
2020-11-02 01:26:42 +01:00
Jonas Winkler
9f55fb668d
silenced unpaper, optipng for cleaner output
moved parser settings to settings
removed forgiving ocr (now default) since tesseract is plenty accurate even without defining the correct language.
2020-11-01 23:23:42 +01:00
Jonas Winkler
743ce1dc14
better thumbnail generation for smaller files
2020-10-26 01:05:23 +01:00
Stéphane Brunner
daca77cc1b
Strip the thumbnails
2019-03-17 16:37:47 +01:00
jenspfeifle
336f747f16
make pycodestyle happy
2019-03-03 20:41:17 +01:00
JensPfeifle
29b0886950
try to run convert, but fall back on gs if needed
2019-03-03 20:31:52 +01:00
JensPfeifle
ea282c22ba
Add GS_BINARY to settings to avoid hardcoded call of "gs"
2019-03-03 20:31:52 +01:00
Pit
cbf008f37b
Fix quoting in call to run_convert
Co-Authored-By: JensPfeifle <jens@pfeifle.tech>
2019-03-03 20:31:52 +01:00
JensPfeifle
50504c3fd8
remove unnecessary env arg in Popen
2019-03-03 20:31:52 +01:00
Jens Pfeifle
0220199766
fix parse error of some documents by using gs
2019-03-03 20:31:52 +01:00
Daniel Quinn
bd95804fbf
Merge pull request #421 from ddddavidmartin/clarify_forgiving_ocr_handling
Clarify forgiving ocr handling
2018-10-08 09:35:57 +00:00
David Martin
b350ec48b7
Mention FORGIVING_OCR config option when language detection fails.
It is not obvious that PAPERLESS_FORGIVING_OCR lets document
consumption proceed even if no language can be detected.
Mentioning it in the actual error message in the log seems like the best
way to make it clear.
2018-10-08 19:37:05 +11:00
David Martin
f948ee11be
Let unpaper overwrite temporary files.
I'm not sure what the circumstances are, but it looks like unpaper can
attempt to write a temporary file that already exists [0]. This then
fails the consumption. As per daedadu's comment, simply letting unpaper
overwrite files fixes this.
[0]
unpaper: error: output file '/tmp/paperless/paperless-pjkrcr4l/convert-0000.unpaper.pnm' already present.
See https://web.archive.org/web/20181008081515/https://github.com/danielquinn/paperless/issues/406#issue-360651630
2018-10-08 19:12:11 +11:00
Daniel Quinn
750ab5bf85
Use optipng to optimise document thumbnails
2018-10-07 14:56:38 +01:00
Daniel Quinn
2a3f766b93
Consolidate get_date onto the DocumentParser parent class
2018-10-07 14:56:02 +01:00
Daniel Quinn
8010d72f18
Tweak the date guesser to not allow dates prior to 1900 ( #414 )
2018-10-01 20:03:47 +01:00
Daniel Quinn
117d7dad04
Improve the unknown language error message
2018-09-23 12:41:14 +01:00
Daniel Quinn
46cbd10ba0
Merge pull request #399 from jat255/ENH_convert_only_one_page
Speed up thumbnail generation for PDFs
2018-09-09 21:12:42 +01:00
Daniel Quinn
c99f5923d5
Rename parsers to DATE_REGEX
In moving the `parsers` variable into the package-level, it lost the
context, so a more descriptive name was needed.
2018-09-09 21:02:30 +01:00
Daniel Quinn
2dc35cc856
Merge branch 'ENH_text_consumer' of git://github.com/jat255/paperless into jat255-ENH_text_consumer
2018-09-09 20:52:59 +01:00
Daniel Quinn
5342db6ada
Fix pycodestyle complaints
Apparently, pycodestyle updated itself to now check for invalid escape
sequences, which it only complains about if the regex in use isn't a raw
string (r"").
2018-09-09 20:00:12 +01:00
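For readers unfamiliar with the complaint in the commit above, here is a minimal sketch of the raw-string fix. The pattern and the names `DATE_PATTERN` and `find_dates` are hypothetical, chosen only for illustration; they are not the project's actual regex or code.

```python
import re

# Hypothetical date pattern, for illustration only.
# Written as a raw string (r"..."), so "\b" and "\d" reach the regex
# engine unchanged. In a plain string, "\d" is an invalid escape
# sequence (which style checkers flag) and "\b" silently becomes a
# backspace character.
DATE_PATTERN = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")


def find_dates(text):
    """Return a list of (day, month, year) string tuples found in text."""
    return DATE_PATTERN.findall(text)
```

The raw-string prefix changes only how the literal is read by Python, not the pattern itself: `r"\d"` and `"\\d"` are the same two-character string.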
Joshua Taillon
72c828170e
move date-matching regex pattern to base parser module for use by all subclasses
2018-09-05 21:13:36 -04:00
Joshua Taillon
cac63494f0
change tesseract parser to only convert first page to save (potentially) massive amounts of work
2018-09-05 15:18:35 -04:00
Daniel Quinn
82f9dde055
Account for KeyError problem in #345
2018-04-28 12:20:43 +01:00
Daniel Quinn
c983e73d0f
Account for KeyError problem in #345
2018-04-28 12:19:53 +01:00