74 Commits

Author SHA1 Message Date
Trenton Holmes
2f12206911 Changes the error mode to replace instead of ignore, to better highlight where a problem happened 2023-05-13 09:29:18 -07:00
Trenton H
6722b6e31c Adds better handling for files with invalid utf8 content 2023-05-13 09:29:18 -07:00
Trenton H
aabcc9a1c4 Upgrades black to v23, upgrades ruff 2023-04-26 09:35:27 -07:00
Trenton H
30655f1b73 Fixes ruff not running isort against the codebase 2023-04-26 09:35:27 -07:00
Trenton H
d2c02b9102 Configures ruff as the one stop linter and resolves warnings it raised 2023-04-01 17:03:52 -07:00
Trenton H
d58747c912 relock with Python 3.8.15 2023-01-06 17:59:39 -08:00
Trenton H
8504b6f7da Cleans up and improves parser discovery testing, simplifies the determination of supported or not supported extensions and mime types 2023-01-05 08:39:48 -08:00
Trenton H
cdfcbff529 Don't allow an exception when trying to parse a date to cause complete failure 2022-11-17 13:37:37 -08:00
Matthias Eck
05d97d2cf1 fix(parsers|test_api): fix failed tests 2022-08-06 19:19:10 +02:00
Matthias Eck
1195fb9afe feat(parsers): add generator for date parsing 2022-08-06 13:03:20 +02:00
Trenton Holmes
ef6ebf9888 Entirely removes the optipng, updates ghostscript fall back to also use WebP. Updates the conversion to use a multiprocessing pool 2022-06-11 08:38:49 -07:00
Michael Shamoon
f208f89179 webp thumbnail support with png fallback 2022-06-10 02:28:13 -07:00
shamoon
3ccf143c0b Merge pull request #721 from paperless-ngx/bug-fix-date-ignore
Fix Ignore Date Parsing
2022-05-10 16:45:58 -07:00
Trenton Holmes
304d5b0d5a Updates the ignore date parsing to utilize the settings defined date order, instead of guessing a bit 2022-05-08 16:57:35 -07:00
Trenton Holmes
a944ef1ca6 Adds additional testing for both date parsing and consumed document created date 2022-05-08 16:57:35 -07:00
Trenton Holmes
f62193099c Runs pyupgrade to Python 3.8+ and adds a hook for it 2022-05-06 09:04:08 -07:00
Fantasticle
6982641398 update new regex pattern for second boundary 2022-03-31 09:37:15 +02:00
fantasticle
95fdcab953 Update regex date match patterns 2022-03-30 12:19:30 +02:00
Simon Siebert
5aea4da8b2 Update parsers.py and test_consumer.py 2022-03-14 19:03:09 +01:00
Trenton Holmes
6635fa5f0d Runs the pre-commit hooks over all the Python files 2022-03-11 11:34:28 -08:00
kpj
c56cb25b5f Format Python code with black 2022-02-27 15:26:41 +01:00
jonaswinkler
3a67462396 fixes #631 2021-03-14 14:42:48 +01:00
jonaswinkler
f8f49bac75 only import dateparser when required 2021-02-15 11:52:46 +01:00
jonaswinkler
b04d91d68c fix a bug with thumbnail generation when TIKA was enabled 2021-02-09 22:12:43 +01:00
jonaswinkler
e5a7dc0cc7 rework most of the logging 2021-02-05 01:10:29 +01:00
jonaswinkler
eeff7b3bdb code style 2021-02-02 23:58:25 +01:00
jonaswinkler
5f7d817d69 localization for websockets 2021-01-28 22:06:02 +01:00
jonaswinkler
c0f185fe7e bug fixes, test case fixes 2021-01-26 15:19:56 +01:00
jonaswinkler
044aa55d74 Merge branch 'dev' into feature-websockets-status 2021-01-23 22:22:17 +01:00
Jonas Winkler
22f45ac619 Merge pull request #251 from jayme-github/ignore-date
Add option to ignore certain dates in parse_date
2021-01-05 00:19:13 +01:00
jonaswinkler
179b53d373 Merge branch 'dev' into feature-websockets-status 2021-01-04 22:45:56 +01:00
jonaswinkler
e2680b7113 code style 2021-01-02 15:26:09 +01:00
jayme-github
cd15490e91 Add option to ignore certain dates in parse_date
PAPERLESS_IGNORE_DATES allows specifying a comma-separated list of dates
to ignore during date parsing (from filename and content). This can be
used to specify dates that appear often in documents but are usually
not the document's creation date (like your date of birth).
2021-01-02 15:20:49 +01:00
jonaswinkler
755f950cd2 supply file_name for tika parser 2021-01-01 22:19:43 +01:00
jonaswinkler
f1e9b414f9 remove duplicate code 2021-01-01 21:50:45 +01:00
jonaswinkler
4b7138f477 fixes #218 2020-12-30 15:12:16 +01:00
jonaswinkler
cdd2c873bd fixes #25 2020-12-15 13:52:35 +01:00
jonaswinkler
0c6c4a62d8 moved metadata extraction to the parsers 2020-12-10 14:57:53 +01:00
jonaswinkler
0bfecaa0fc Merge branch 'dev' into feature-websockets-status 2020-12-06 22:53:54 +01:00
jonaswinkler
b0507ce92a fixes #78 2020-12-02 18:00:49 +01:00
jonaswinkler
e4eeb29f54 checking file types against parsers in the consumer. 2020-12-01 15:26:05 +01:00
jonaswinkler
1df64e3129 Merge branch 'dev' into feature-ocrmypdf 2020-11-30 16:48:09 +01:00
jonaswinkler
7658c07b4d added file type checks to the parsers to prevent temporary files from being consumed. Also: parsers announce file types they wish to use as default for each mime type. 2020-11-30 00:40:04 +01:00
Jonas Winkler
9bfa088eb5 reworked the interface of the parsers. 2020-11-25 19:36:39 +01:00
Jonas Winkler
15935ab61f reworked PDF parser that uses OCRmyPDF and produces archive files. 2020-11-25 14:50:43 +01:00
Jonas Winkler
17b62b61fa add support for archive files. 2020-11-25 14:47:17 +01:00
Jonas Winkler
3893a23852 Merge branch 'dev' into celery-tasks 2020-11-22 22:49:37 +01:00
Jonas Winkler
afc3753e58 code cleanup 2020-11-21 14:03:45 +01:00
Jonas Winkler
f976a0b4ba mime type handling 2020-11-20 13:31:03 +01:00
Jonas Winkler
196faa8fdc Merge branch 'dev' into celery-tasks 2020-11-19 22:10:57 +01:00