Trenton H
dc642152d1
Standardizes the imports across all the files and modules (#4248)
2023-09-23 20:17:01 -07:00
Trenton Holmes
2f12206911
Changes the error mode to replace instead of ignore, to better highlight where a problem happened
2023-05-13 09:29:18 -07:00
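
A minimal sketch of the difference between the two error modes named in the commit above, using only Python's built-in `bytes.decode`. The sample bytes are made up for illustration and are not taken from the repository; the parser's actual file-reading code is not shown in this log.

```python
# Illustrative input: \xe9 is "é" in Latin-1 but is not valid UTF-8 here.
raw = b"caf\xe9 menu"

# errors="ignore" silently drops the bad byte, hiding where the problem was.
ignored = raw.decode("utf-8", errors="ignore")

# errors="replace" substitutes U+FFFD, so the damaged spot stays visible.
replaced = raw.decode("utf-8", errors="replace")

print(repr(ignored))
print(repr(replaced))
```

With "replace", the position of the invalid byte survives in the extracted text, which is the benefit the commit message describes.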
Trenton H
6722b6e31c
Adds better handling for files with invalid utf8 content
2023-05-13 09:29:18 -07:00
Trenton H
aabcc9a1c4
Upgrades black to v23, upgrades ruff
2023-04-26 09:35:27 -07:00
Trenton H
30655f1b73
Fixes ruff not running isort against the codebase
2023-04-26 09:35:27 -07:00
Trenton H
d2c02b9102
Configures ruff as the one stop linter and resolves warnings it raised
2023-04-01 17:03:52 -07:00
Trenton Holmes
acfa7d633d
Creates a mix-in for asserting file system states
2023-02-20 10:25:21 -08:00
Jens van Almsick
d89443b31d
fix: csv recognition by consumer
...
paperless-ngx detects the file format via the MIME type reported by python-magic, which relies on the response of the file command.
In version 5.39 of the file command (which ships with Debian bullseye and, I think, many other non-rolling distributions), a *.csv file is detected as "application/csv" instead of "text/csv" as in newer versions.
2022-10-02 16:09:07 -07:00
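
One way to tolerate both MIME types described in the commit above is to list each of them as accepted for the same extension. This is a hedged sketch only: `SUPPORTED_MIME_TYPES` and `default_extension` are hypothetical names chosen for illustration, and the actual change in the commit may look different.

```python
from typing import Optional

# Hypothetical table of MIME types a text parser accepts, mapped to the
# default file extension for each. Not copied from the repository.
SUPPORTED_MIME_TYPES = {
    "text/plain": ".txt",
    "text/csv": ".csv",
    # Older versions of the file command (e.g. 5.39) report this for *.csv.
    "application/csv": ".csv",
}


def default_extension(mime_type: str) -> Optional[str]:
    """Return the default extension for a supported MIME type, else None."""
    return SUPPORTED_MIME_TYPES.get(mime_type)
```

With such a table, a CSV file is accepted whether the detected type is "text/csv" or "application/csv", while unrelated types still return `None`.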
Trenton Holmes
ea8596b4d2
Minor tweaks to getting the document thumbnail path. Adds text thumbnail as webp
2022-06-10 06:56:28 -07:00
Trenton Holmes
95bbf47995
Updates to provide the user provided max pixel size to ocrmypdf
2022-05-22 16:56:08 -07:00
Trenton Holmes
f62193099c
Runs pyupgrade to Python 3.8+ and adds a hook for it
2022-05-06 09:04:08 -07:00
Henning Häcker
f4a0d8c040
extract OCR_MAX_IMAGE_PIXELS into settings.py
2022-03-30 09:23:45 +02:00
Henning Häcker
6bc2cb0607
formatting according to black
2022-03-30 09:23:45 +02:00
Henning Häcker
2c1f0cd3ee
implement PAPERLESS_OCR_MAX_IMAGE_PIXELS
2022-03-30 09:23:45 +02:00
Trenton Holmes
6635fa5f0d
Runs the pre-commit hooks over all the Python files
2022-03-11 11:34:28 -08:00
kpj
c56cb25b5f
Format Python code with black
2022-02-27 15:26:41 +01:00
jonaswinkler
b04d91d68c
fix a bug with thumbnail generation when TIKA was enabled
2021-02-09 22:12:43 +01:00
jonaswinkler
e5a7dc0cc7
rework most of the logging
2021-02-05 01:10:29 +01:00
jonaswinkler
95f5c9f3a6
lazy loading for parsers
2021-02-04 13:17:24 +01:00
jonaswinkler
755f950cd2
supply file_name for tika parser
2021-01-01 22:19:43 +01:00
jonaswinkler
fe73f42495
added configuration option for the font #197 #207
2020-12-29 12:26:41 +01:00
jonaswinkler
d329b371ef
removed unused code
2020-12-20 14:00:24 +01:00
jonaswinkler
e28f741fac
thumbnail generation
2020-12-16 14:19:11 +01:00
jonaswinkler
b1cee55edb
fixes #7 and some test cases.
2020-12-16 14:17:05 +01:00
jonaswinkler
f7db27de70
more tests
2020-12-15 13:26:01 +01:00
jonaswinkler
1df64e3129
Merge branch 'dev' into feature-ocrmypdf
2020-11-30 16:48:09 +01:00
jonaswinkler
7658c07b4d
added file type checks to the parsers to prevent temporary files from being consumed. Also: parsers announce file types they wish to use as default for each mime type.
2020-11-30 00:40:04 +01:00
Jonas Winkler
9bfa088eb5
reworked the interface of the parsers.
2020-11-25 19:36:39 +01:00
Jonas Winkler
f976a0b4ba
mime type handling
2020-11-20 13:31:03 +01:00
Jonas Winkler
9a48d6c577
Changed the way parsers are discovered. This also prepares for upcoming changes regarding content types and file types: parsers should declare what they support, and actual file extensions should not be hardcoded everywhere.
2020-11-16 23:53:12 +01:00
Jonas Winkler
eb6805e37e
code style fixes
2020-11-12 21:09:45 +01:00
Jonas Winkler
def3a85858
reworked most of the tesseract parser, better logging
2020-11-02 15:40:44 +01:00
Daniel Quinn
bc898c1992
Use optipng to optimise document thumbnails
2018-10-07 14:56:38 +01:00
Daniel Quinn
074609e1fc
Consolidate get_date onto the DocumentParser parent class
2018-10-07 14:56:02 +01:00
Daniel Quinn
ef7f98281d
Rename parsers to DATE_REGEX
...
In moving the `parsers` variable into the package-level, it lost the
context, so a more descriptive name was needed.
2018-09-09 21:02:30 +01:00
Daniel Quinn
69fc0d6d80
Fix pycodestyle complaints
2018-09-09 20:55:37 +01:00
Joshua Taillon
5326895334
move date-matching regex pattern to base parser module for use by all subclasses
2018-09-05 21:13:36 -04:00
Joshua Taillon
cc7a341e75
explicitly add txt, md, and csv types for consumer and viewer; fix thumbnail generation
2018-09-03 23:46:13 -04:00
Joshua Taillon
3c074d9e36
first stab at text consumer
2018-08-30 23:32:41 -04:00