Trenton H
452c79f9a1
Improves the logging mixin and allows it to be typed better
2023-05-23 17:16:39 -07:00
Trenton Holmes
3205d52331
Changes the error mode to replace instead of ignore, to better highlight where a problem happened
2023-05-13 09:29:18 -07:00
Trenton H
111960c530
Adds better handling for files with invalid utf8 content
2023-05-13 09:29:18 -07:00
Trenton H
6f163111ce
Upgrades black to v23, upgrades ruff
2023-04-26 09:35:27 -07:00
Trenton H
3bcbd05252
Fixes ruff not running isort against the codebase
2023-04-26 09:35:27 -07:00
Trenton H
ce41ac9158
Configures ruff as the one stop linter and resolves warnings it raised
2023-04-01 17:03:52 -07:00
Trenton H
c21775980f
relock with Python 3.8.15
2023-01-06 17:59:39 -08:00
Trenton H
d19bf59f47
Cleans up and improves parser discovery testing, simplifies the determination of supported or not supported extensions and mime types
2023-01-05 08:39:48 -08:00
Trenton H
914661fdbb
Don't allow an exception when trying to parse a date to cause complete failure
2022-11-17 13:37:37 -08:00
Matthias Eck
3d0a26fdb1
fix(parsers|test_api): fix failed tests
2022-08-06 19:19:10 +02:00
Matthias Eck
a5d2ae2588
feat(parsers): add generator for date parsing
2022-08-06 13:03:20 +02:00
Trenton Holmes
e8868d7ebf
Entirely removes the optipng, updates ghostscript fall back to also use WebP. Updates the conversion to use a multiprocessing pool
2022-06-11 08:38:49 -07:00
Michael Shamoon
58f2c6a5fc
webp thumbnail support with png fallback
2022-06-10 02:28:13 -07:00
shamoon
536576518e
Merge pull request #721 from paperless-ngx/bug-fix-date-ignore
Fix Ignore Date Parsing
2022-05-10 16:45:58 -07:00
Trenton Holmes
5b96944940
Updates the ignore date parsing to utilize the settings defined date order, instead of guessing a bit
2022-05-08 16:57:35 -07:00
Trenton Holmes
8a6aaf4e2d
Adds additional testing for both date parsing and consumed document created date
2022-05-08 16:57:35 -07:00
Trenton Holmes
3003bdd507
Runs pyupgrade to Python 3.8+ and adds a hook for it
2022-05-06 09:04:08 -07:00
Fantasticle
0baacbef98
update new regex pattern for second boundary
2022-03-31 09:37:15 +02:00
fantasticle
1ecb26a3fb
Update regex date match patterns
2022-03-30 12:19:30 +02:00
Simon Siebert
54cbacf4f4
Update parsers.py and test_consumer.py
2022-03-14 19:03:09 +01:00
Trenton Holmes
1771d18a21
Runs the pre-commit hooks over all the Python files
2022-03-11 11:34:28 -08:00
kpj
fc695896dd
Format Python code with black
2022-02-27 15:26:41 +01:00
jonaswinkler
40ce38254b
fixes #631
2021-03-14 14:42:48 +01:00
jonaswinkler
416101d557
only import dateparser when required
2021-02-15 11:52:46 +01:00
jonaswinkler
8d6071e977
fix a bug with thumbnail generation when TIKA was enabled
2021-02-09 22:12:43 +01:00
jonaswinkler
431d4fd8e4
rework most of the logging
2021-02-05 01:10:29 +01:00
jonaswinkler
bdc247ce49
code style
2021-02-02 23:58:25 +01:00
jonaswinkler
2faa425caf
localization for websockets
2021-01-28 22:06:02 +01:00
jonaswinkler
868fd4155a
bug fixes, test case fixes
2021-01-26 15:19:56 +01:00
jonaswinkler
05d69c0882
Merge branch 'dev' into feature-websockets-status
2021-01-23 22:22:17 +01:00
Jonas Winkler
be94a8e49a
Merge pull request #251 from jayme-github/ignore-date
Add option to ignore certain dates in parse_date
2021-01-05 00:19:13 +01:00
jonaswinkler
9f9581e1f8
Merge branch 'dev' into feature-websockets-status
2021-01-04 22:45:56 +01:00
jonaswinkler
e97ff3d671
code style
2021-01-02 15:26:09 +01:00
jayme-github
654ee4e62e
Add option to ignore certain dates in parse_date
PAPERLESS_IGNORE_DATES allows specifying a comma-separated list of dates
to ignore during date parsing (from filename and content). This can be
used to specify dates that appear often in documents but are usually
not the document's creation date (like your date of birth).
2021-01-02 15:20:49 +01:00
jonaswinkler
40ef375c15
supply file_name for tika parser
2021-01-01 22:19:43 +01:00
jonaswinkler
c05bfb894a
remove duplicate code
2021-01-01 21:50:45 +01:00
jonaswinkler
713985f259
fixes #218
2020-12-30 15:12:16 +01:00
jonaswinkler
5894060dc5
fixes #25
2020-12-15 13:52:35 +01:00
jonaswinkler
2f7bb01f34
moved metadata extraction to the parsers
2020-12-10 14:57:53 +01:00
jonaswinkler
522ada88ea
Merge branch 'dev' into feature-websockets-status
2020-12-06 22:53:54 +01:00
jonaswinkler
4548cf08c7
fixes #78
2020-12-02 18:00:49 +01:00
jonaswinkler
834352130c
checking file types against parsers in the consumer.
2020-12-01 15:26:05 +01:00
jonaswinkler
aaa6599283
Merge branch 'dev' into feature-ocrmypdf
2020-11-30 16:48:09 +01:00
jonaswinkler
f51207fc32
added file type checks to the parsers to prevent temporary files from being consumed. Also: parsers announce file types they wish to use as default for each mime type.
2020-11-30 00:40:04 +01:00
Jonas Winkler
df801d17e1
reworked the interface of the parsers.
2020-11-25 19:36:39 +01:00
Jonas Winkler
2d559d330d
reworked PDF parser that uses OCRmyPDF and produces archive files.
2020-11-25 14:50:43 +01:00
Jonas Winkler
8069c2eb6a
add support for archive files.
2020-11-25 14:47:17 +01:00
Jonas Winkler
d252a1dcda
Merge branch 'dev' into celery-tasks
2020-11-22 22:49:37 +01:00
Jonas Winkler
b44f8383e4
code cleanup
2020-11-21 14:03:45 +01:00
Jonas Winkler
41650f20f4
mime type handling
2020-11-20 13:31:03 +01:00