32 Commits

Author SHA1 Message Date
Jens van Almsick
ad6ef7314b fix: csv recognition by consumer
paperless-ngx detects the file format via the mime-type based on the response of python-magic which rely on the response of the file command.
In version 5.39 (which is shipped with debian bullseye and I think many more non-rolling distributions) of the file command a *.csv will be detected as "application/csv" instead of "text/csv" as in newer versions.
2022-10-02 16:09:07 -07:00
Trenton Holmes
6844f8f2bf Minor tweaks to getting the document thumbnail path. Adds text thumbnail as webp 2022-06-10 06:56:28 -07:00
Trenton Holmes
fc26fe0ac0
Updates to provide the user provided max pixel size to ocrmypdf 2022-05-22 16:56:08 -07:00
Trenton Holmes
3003bdd507 Runs pyupgrade to Python 3.8+ and adds a hook for it 2022-05-06 09:04:08 -07:00
Henning Häcker
3b4da70c85 extract OCR_MAX_IMAGE_PIXELS into settings.py 2022-03-30 09:23:45 +02:00
Henning Häcker
95199bd325 formatting according to black 2022-03-30 09:23:45 +02:00
Henning Häcker
a8887b211e implement PAPERLESS_OCR_MAX_IMAGE_PIXELS 2022-03-30 09:23:45 +02:00
Trenton Holmes
1771d18a21 Runs the pre-commit hooks over all the Python files 2022-03-11 11:34:28 -08:00
kpj
fc695896dd Format Python code with black 2022-02-27 15:26:41 +01:00
jonaswinkler
8d6071e977 fix a bug with thumbnail generation when TIKA was enabled 2021-02-09 22:12:43 +01:00
jonaswinkler
431d4fd8e4 rework most of the logging 2021-02-05 01:10:29 +01:00
jonaswinkler
44ec3a3d9c lazy loading for parsers 2021-02-04 13:17:24 +01:00
jonaswinkler
40ef375c15 supply file_name for tika parser 2021-01-01 22:19:43 +01:00
jonaswinkler
f964dd5935 added configuration option for the font #197 #207 2020-12-29 12:26:41 +01:00
jonaswinkler
ee31fdc650 removed unused code 2020-12-20 14:00:24 +01:00
jonaswinkler
b2e0a8c884 thumbnail generation 2020-12-16 14:19:11 +01:00
jonaswinkler
e47b105185 fixes #7 and some test cases. 2020-12-16 14:17:05 +01:00
jonaswinkler
7e0aa7136a more tests 2020-12-15 13:26:01 +01:00
jonaswinkler
aaa6599283 Merge branch 'dev' into feature-ocrmypdf 2020-11-30 16:48:09 +01:00
jonaswinkler
f51207fc32 added file type checks to the parsers to prevent temporary files from being consumed. Also: parsers announce file types they wish to use as default for each mime type. 2020-11-30 00:40:04 +01:00
Jonas Winkler
df801d17e1 reworked the interface of the parsers. 2020-11-25 19:36:39 +01:00
Jonas Winkler
41650f20f4 mime type handling 2020-11-20 13:31:03 +01:00
Jonas Winkler
d2e22e3f27 Changed the way parsers are discovered. This also prepares for upcoming changes regarding content types and file types: parsers should declare what they support, and actual file extensions should not be hardcoded everywhere. 2020-11-16 23:53:12 +01:00
Jonas Winkler
2e04ba1c04 code style fixes 2020-11-12 21:09:45 +01:00
Jonas Winkler
d15405ef56 reworked most of the tesseract parser, better logging 2020-11-02 15:40:44 +01:00
Daniel Quinn
750ab5bf85 Use optipng to optimise document thumbnails 2018-10-07 14:56:38 +01:00
Daniel Quinn
2a3f766b93 Consolidate get_date onto the DocumentParser parent class 2018-10-07 14:56:02 +01:00
Daniel Quinn
c99f5923d5 Rename parsers to DATE_REGEX
In moving the `parsers` variable into the package-level, it lost the
context, so a more descriptive name was needed.
2018-09-09 21:02:30 +01:00
Daniel Quinn
ef302abed7 Fix pycodestyle complaints 2018-09-09 20:55:37 +01:00
Joshua Taillon
72c828170e move date-matching regex pattern to base parser module for use by all subclasses 2018-09-05 21:13:36 -04:00
Joshua Taillon
4849249d86 explicitly add txt, md, and csv types for consumer and viewer; fix thumbnail generation 2018-09-03 23:46:13 -04:00
Joshua Taillon
d6fedbec52 first stab at text consumer 2018-08-30 23:32:41 -04:00