Trenton Holmes
304d5b0d5a
Updates the ignore date parsing to use the date order defined in the settings instead of guessing
2022-05-08 16:57:35 -07:00
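
Why the date order matters for this change can be shown with a small sketch. It is not the project's code: the helper `parse_with_order` and the `ORDER_TO_FORMAT` table are hypothetical, and the standard library's `strptime` stands in for whatever parser the project actually uses. The same string yields two different dates depending on the configured order, which is the ambiguity that guessing cannot resolve.

```python
from datetime import date, datetime

# Hypothetical mapping from a configured date order to a strptime format.
ORDER_TO_FORMAT = {
    "DMY": "%d/%m/%Y",
    "MDY": "%m/%d/%Y",
    "YMD": "%Y/%m/%d",
}


def parse_with_order(text: str, date_order: str) -> date:
    """Parse `text` strictly according to the given date order."""
    return datetime.strptime(text, ORDER_TO_FORMAT[date_order]).date()


# The same input is read differently under each order.
as_dmy = parse_with_order("05/08/2022", "DMY")  # day first: 5 August 2022
as_mdy = parse_with_order("05/08/2022", "MDY")  # month first: 8 May 2022
```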
Trenton Holmes
a944ef1ca6
Adds additional testing for both date parsing and consumed document created date
2022-05-08 16:57:35 -07:00
Fantasticle
6982641398
update new regex pattern for second boundary
2022-03-31 09:37:15 +02:00
fantasticle
95fdcab953
Update regex date match patterns
2022-03-30 12:19:30 +02:00
Simon Siebert
5aea4da8b2
Update parsers.py and test_consumer.py
2022-03-14 19:03:09 +01:00
Trenton Holmes
6635fa5f0d
Runs the pre-commit hooks over all the Python files
2022-03-11 11:34:28 -08:00
kpj
c56cb25b5f
Format Python code with black
2022-02-27 15:26:41 +01:00
jonaswinkler
3a67462396
fixes #631
2021-03-14 14:42:48 +01:00
jonaswinkler
f8f49bac75
only import dateparser when required
2021-02-15 11:52:46 +01:00
jonaswinkler
b04d91d68c
fix a bug with thumbnail generation when TIKA was enabled
2021-02-09 22:12:43 +01:00
jonaswinkler
e5a7dc0cc7
rework most of the logging
2021-02-05 01:10:29 +01:00
jonaswinkler
eeff7b3bdb
code style
2021-02-02 23:58:25 +01:00
jonaswinkler
5f7d817d69
localization for websockets
2021-01-28 22:06:02 +01:00
jonaswinkler
c0f185fe7e
bug fixes, test case fixes
2021-01-26 15:19:56 +01:00
jonaswinkler
044aa55d74
Merge branch 'dev' into feature-websockets-status
2021-01-23 22:22:17 +01:00
Jonas Winkler
22f45ac619
Merge pull request #251 from jayme-github/ignore-date
...
Add option to ignore certain dates in parse_date
2021-01-05 00:19:13 +01:00
jonaswinkler
179b53d373
Merge branch 'dev' into feature-websockets-status
2021-01-04 22:45:56 +01:00
jonaswinkler
e2680b7113
code style
2021-01-02 15:26:09 +01:00
jayme-github
cd15490e91
Add option to ignore certain dates in parse_date
...
PAPERLESS_IGNORE_DATES allows specifying a comma-separated list of dates
to ignore during date parsing (from filename and content). This can be
used to specify dates that appear often in documents but are usually
not the document's creation date (like your date of birth).
2021-01-02 15:20:49 +01:00
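
The behaviour described in this commit message could look roughly like the sketch below. It is an illustration under assumptions, not the actual implementation: the function names `parse_ignore_dates` and `first_allowed_date` are hypothetical, the ISO input format is assumed, and only the standard library is used.

```python
from datetime import date, datetime
from typing import Iterable, Optional, Set


def parse_ignore_dates(raw: str, fmt: str = "%Y-%m-%d") -> Set[date]:
    """Turn a comma-separated string into a set of dates, skipping blanks."""
    ignored = set()
    for item in raw.split(","):
        item = item.strip()
        if item:
            ignored.add(datetime.strptime(item, fmt).date())
    return ignored


def first_allowed_date(candidates: Iterable[date], ignored: Set[date]) -> Optional[date]:
    """Return the first candidate that is not on the ignore list, if any."""
    for candidate in candidates:
        if candidate not in ignored:
            return candidate
    return None


# Assumed example value for the setting: a date of birth and one other date.
ignored = parse_ignore_dates("1985-04-12, 2000-01-01")
found_in_document = [date(1985, 4, 12), date(2021, 1, 2)]
chosen = first_allowed_date(found_in_document, ignored)
```

Here the date of birth is found first in the document but skipped, so the later date is chosen as the creation date.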
jonaswinkler
755f950cd2
supply file_name for tika parser
2021-01-01 22:19:43 +01:00
jonaswinkler
f1e9b414f9
remove duplicate code
2021-01-01 21:50:45 +01:00
jonaswinkler
4b7138f477
fixes #218
2020-12-30 15:12:16 +01:00
jonaswinkler
cdd2c873bd
fixes #25
2020-12-15 13:52:35 +01:00
jonaswinkler
0c6c4a62d8
moved metadata extraction to the parsers
2020-12-10 14:57:53 +01:00
jonaswinkler
0bfecaa0fc
Merge branch 'dev' into feature-websockets-status
2020-12-06 22:53:54 +01:00
jonaswinkler
b0507ce92a
fixes #78
2020-12-02 18:00:49 +01:00
jonaswinkler
e4eeb29f54
checking file types against parsers in the consumer.
2020-12-01 15:26:05 +01:00
jonaswinkler
1df64e3129
Merge branch 'dev' into feature-ocrmypdf
2020-11-30 16:48:09 +01:00
jonaswinkler
7658c07b4d
added file type checks to the parsers to prevent temporary files from being consumed. Also: parsers announce file types they wish to use as default for each mime type.
2020-11-30 00:40:04 +01:00
Jonas Winkler
9bfa088eb5
reworked the interface of the parsers.
2020-11-25 19:36:39 +01:00
Jonas Winkler
15935ab61f
reworked PDF parser that uses OCRmyPDF and produces archive files.
2020-11-25 14:50:43 +01:00
Jonas Winkler
17b62b61fa
add support for archive files.
2020-11-25 14:47:17 +01:00
Jonas Winkler
3893a23852
Merge branch 'dev' into celery-tasks
2020-11-22 22:49:37 +01:00
Jonas Winkler
afc3753e58
code cleanup
2020-11-21 14:03:45 +01:00
Jonas Winkler
f976a0b4ba
mime type handling
2020-11-20 13:31:03 +01:00
Jonas Winkler
196faa8fdc
Merge branch 'dev' into celery-tasks
2020-11-19 22:10:57 +01:00
Jonas Winkler
4230a0a474
a new setting that allows you to skip thumbnail optimization.
2020-11-18 22:42:05 +01:00
Jonas Winkler
680ab3d56b
updated logging, logging for the mail consumer to see what's happening
2020-11-18 13:23:30 +01:00
Jonas Winkler
9a48d6c577
Changed the way parsers are discovered. This also prepares for upcoming changes regarding content types and file types: parsers should declare what they support, and actual file extensions should not be hardcoded everywhere.
2020-11-16 23:53:12 +01:00
Jonas Winkler
4734dec465
add some more checks.
2020-11-12 21:20:12 +01:00
Jonas Winkler
eb6805e37e
code style fixes
2020-11-12 21:09:45 +01:00
Jonas Winkler
d46203c114
backend that supports asgi and status update sockets with channels
2020-11-07 11:31:04 +01:00
Jonas Winkler
cf5e463b9b
silenced unpaper once and for all
2020-11-03 14:04:21 +01:00
Jonas Winkler
9757e261f2
A handy script to redo OCR on all documents
2020-11-03 14:04:11 +01:00
Jonas Winkler
d42979842e
made unpaper and convert a little bit nicer to interact with
2020-11-02 19:31:04 +01:00
Jonas Winkler
def3a85858
reworked most of the tesseract parser, better logging
2020-11-02 15:40:44 +01:00
Jonas Winkler
ffdb517b73
removed settings constants
2020-11-01 23:37:56 +01:00
Jonas Winkler
6adc870a20
silenced unpaper, optipng for cleaner output
...
moved parser settings to settings
removed forgiving ocr (now default) since tesseract is plenty accurate even without defining the correct language.
2020-11-01 23:23:42 +01:00
Johannes Wienke
ebcfcea05b
Handle dateparser ValueErrors
...
When parsing dates from the document text or filenames, correctly handle
ValueErrors indicating broken dates. Newly added tests ensure that this
handling works properly.
2020-03-08 18:44:15 +01:00
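
The handling this commit describes could be sketched as follows. This is a minimal illustration, not the project's code: the helper `safe_parse` is hypothetical, and the standard library's `strptime` stands in for dateparser, since it also raises `ValueError` on impossible dates such as 31 February.

```python
from datetime import datetime
from typing import Optional


def safe_parse(text: str, fmt: str = "%d.%m.%Y") -> Optional[datetime]:
    """Parse a date string, returning None instead of raising on broken dates."""
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        # A broken date (for example 31.02.2020) is skipped, not fatal.
        return None


# One impossible date followed by a valid one.
candidates = ["31.02.2020", "08.03.2020"]
parsed = [d for d in (safe_parse(c) for c in candidates) if d is not None]
```

With this approach a single malformed date in the text does not abort parsing, and later valid candidates are still considered.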
Daniel Quinn
0d59844567
Conform everything to the coding standards
...
https://paperless.readthedocs.io/en/latest/contributing.html#additional-style-guides
2018-12-01 17:09:12 +00:00