Trenton H
8d60506884
Standardizes the imports across all the files and modules (#4248)
2023-09-23 20:17:01 -07:00
Trenton Holmes
3205d52331
Changes the error mode to replace instead of ignore, to better highlight where a problem happened
2023-05-13 09:29:18 -07:00
Trenton H
111960c530
Adds better handling for files with invalid utf8 content
2023-05-13 09:29:18 -07:00
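
The difference between the two error modes named in the commits above can be shown with plain Python. This is only a minimal sketch, not the project's code: the helper `decode_lossy` and the sample bytes are hypothetical and exist purely for illustration.

```python
def decode_lossy(raw: bytes, errors: str) -> str:
    """Decode bytes as UTF-8 using the given error handler."""
    return raw.decode("utf-8", errors=errors)


# Hypothetical sample: valid ASCII around one byte that is not valid UTF-8.
sample = b"Invoice \xff 2023"

# "ignore" silently drops the invalid byte, hiding where the problem was.
ignored = decode_lossy(sample, "ignore")

# "replace" substitutes U+FFFD, leaving a visible marker at that position.
replaced = decode_lossy(sample, "replace")
```

With `replace`, the position of the bad byte stays visible in the extracted text, which matches the stated goal of highlighting where a problem happened.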
Trenton H
6f163111ce
Upgrades black to v23, upgrades ruff
2023-04-26 09:35:27 -07:00
Trenton H
3bcbd05252
Fixes ruff not running isort against the codebase
2023-04-26 09:35:27 -07:00
Trenton H
ce41ac9158
Configures ruff as the one stop linter and resolves warnings it raised
2023-04-01 17:03:52 -07:00
Trenton Holmes
0df91c31f1
Creates a mix-in for asserting file system states
2023-02-20 10:25:21 -08:00
Jens van Almsick
ad6ef7314b
fix: csv recognition by consumer
paperless-ngx detects the file format via the MIME type reported by python-magic, which relies on the output of the file command.
In version 5.39 of the file command (which ships with Debian bullseye and, I think, many other non-rolling distributions), a *.csv file is detected as "application/csv" instead of "text/csv" as in newer versions.
2022-10-02 16:09:07 -07:00
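
One way to read the fix described above is that the text parser accepts both MIME types for CSV files. The sketch below assumes a simple lookup table; the names `SUPPORTED_MIME_TYPES` and `default_extension` are hypothetical and may not match how the project actually declares its supported types.

```python
from typing import Optional

# Hypothetical mapping of MIME type to default file extension.
SUPPORTED_MIME_TYPES = {
    "text/plain": ".txt",
    "text/csv": ".csv",
    # Reported for *.csv by file 5.39, per the commit message above.
    "application/csv": ".csv",
}


def default_extension(mime_type: str) -> Optional[str]:
    """Return the default extension for a MIME type, or None if unsupported."""
    return SUPPORTED_MIME_TYPES.get(mime_type)
```

Under this assumption, a CSV file is accepted whether the detected type is "text/csv" or "application/csv", while unknown types are still rejected.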
Trenton Holmes
6844f8f2bf
Minor tweaks to getting the document thumbnail path. Adds text thumbnail as webp
2022-06-10 06:56:28 -07:00
Trenton Holmes
fc26fe0ac0
Updates to provide the user provided max pixel size to ocrmypdf
2022-05-22 16:56:08 -07:00
Trenton Holmes
3003bdd507
Runs pyupgrade to Python 3.8+ and adds a hook for it
2022-05-06 09:04:08 -07:00
Henning Häcker
3b4da70c85
extract OCR_MAX_IMAGE_PIXELS into settings.py
2022-03-30 09:23:45 +02:00
Henning Häcker
95199bd325
formatting according to black
2022-03-30 09:23:45 +02:00
Henning Häcker
a8887b211e
implement PAPERLESS_OCR_MAX_IMAGE_PIXELS
2022-03-30 09:23:45 +02:00
Trenton Holmes
1771d18a21
Runs the pre-commit hooks over all the Python files
2022-03-11 11:34:28 -08:00
kpj
fc695896dd
Format Python code with black
2022-02-27 15:26:41 +01:00
jonaswinkler
8d6071e977
fix a bug with thumbnail generation when TIKA was enabled
2021-02-09 22:12:43 +01:00
jonaswinkler
431d4fd8e4
rework most of the logging
2021-02-05 01:10:29 +01:00
jonaswinkler
44ec3a3d9c
lazy loading for parsers
2021-02-04 13:17:24 +01:00
jonaswinkler
40ef375c15
supply file_name for tika parser
2021-01-01 22:19:43 +01:00
jonaswinkler
f964dd5935
added configuration option for the font #197 #207
2020-12-29 12:26:41 +01:00
jonaswinkler
ee31fdc650
removed unused code
2020-12-20 14:00:24 +01:00
jonaswinkler
b2e0a8c884
thumbnail generation
2020-12-16 14:19:11 +01:00
jonaswinkler
e47b105185
fixes #7 and some test cases.
2020-12-16 14:17:05 +01:00
jonaswinkler
7e0aa7136a
more tests
2020-12-15 13:26:01 +01:00
jonaswinkler
aaa6599283
Merge branch 'dev' into feature-ocrmypdf
2020-11-30 16:48:09 +01:00
jonaswinkler
f51207fc32
added file type checks to the parsers to prevent temporary files from being consumed. Also: parsers announce file types they wish to use as default for each mime type.
2020-11-30 00:40:04 +01:00
Jonas Winkler
df801d17e1
reworked the interface of the parsers.
2020-11-25 19:36:39 +01:00
Jonas Winkler
41650f20f4
mime type handling
2020-11-20 13:31:03 +01:00
Jonas Winkler
d2e22e3f27
Changed the way parsers are discovered. This also prepares for upcoming changes regarding content types and file types: parsers should declare what they support, and actual file extensions should not be hardcoded everywhere.
2020-11-16 23:53:12 +01:00
Jonas Winkler
2e04ba1c04
code style fixes
2020-11-12 21:09:45 +01:00
Jonas Winkler
d15405ef56
reworked most of the tesseract parser, better logging
2020-11-02 15:40:44 +01:00
Daniel Quinn
750ab5bf85
Use optipng to optimise document thumbnails
2018-10-07 14:56:38 +01:00
Daniel Quinn
2a3f766b93
Consolidate get_date onto the DocumentParser parent class
2018-10-07 14:56:02 +01:00
Daniel Quinn
c99f5923d5
Rename parsers to DATE_REGEX
In moving the `parsers` variable into the package-level, it lost the
context, so a more descriptive name was needed.
2018-09-09 21:02:30 +01:00
Daniel Quinn
ef302abed7
Fix pycodestyle complaints
2018-09-09 20:55:37 +01:00
Joshua Taillon
72c828170e
move date-matching regex pattern to base parser module for use by all subclasses
2018-09-05 21:13:36 -04:00
Joshua Taillon
4849249d86
explicitly add txt, md, and csv types for consumer and viewer; fix thumbnail generation
2018-09-03 23:46:13 -04:00
Joshua Taillon
d6fedbec52
first stab at text consumer
2018-08-30 23:32:41 -04:00