60 Commits

Author SHA1 Message Date
Trenton Holmes
304d5b0d5a Updates the ignore date parsing to use the settings-defined date order instead of guessing 2022-05-08 16:57:35 -07:00
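The commit above replaces guessing with a configured date order. A minimal sketch of why the order matters for an ambiguous date string; the `FORMATS` mapping, the order names, and `parse_with_order` are hypothetical illustrations and are not taken from the project's code:

```python
from datetime import date, datetime

# Hypothetical mapping from a date-order setting to a strptime format.
FORMATS = {
    "DMY": "%d/%m/%Y",
    "MDY": "%m/%d/%Y",
    "YMD": "%Y/%m/%d",
}


def parse_with_order(text: str, date_order: str) -> date:
    """Parse a slash-separated date using the configured order, without guessing."""
    return datetime.strptime(text, FORMATS[date_order]).date()
```

With this sketch, the same string "03/04/2022" becomes 3 April 2022 under "DMY" and 4 March 2022 under "MDY", so an ignore list is only matched reliably when it is parsed with the same order as the documents.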
Trenton Holmes
a944ef1ca6 Adds additional testing for both date parsing and consumed document created date 2022-05-08 16:57:35 -07:00
Fantasticle
6982641398 update new regex pattern for second boundary 2022-03-31 09:37:15 +02:00
fantasticle
95fdcab953 Update regex date match patterns 2022-03-30 12:19:30 +02:00
Simon Siebert
5aea4da8b2 Update parsers.py and test_consumer.py 2022-03-14 19:03:09 +01:00
Trenton Holmes
6635fa5f0d Runs the pre-commit hooks over all the Python files 2022-03-11 11:34:28 -08:00
kpj
c56cb25b5f Format Python code with black 2022-02-27 15:26:41 +01:00
jonaswinkler
3a67462396 fixes #631 2021-03-14 14:42:48 +01:00
jonaswinkler
f8f49bac75 only import dateparser when required 2021-02-15 11:52:46 +01:00
jonaswinkler
b04d91d68c fix a bug with thumbnail generation when TIKA was enabled 2021-02-09 22:12:43 +01:00
jonaswinkler
e5a7dc0cc7 rework most of the logging 2021-02-05 01:10:29 +01:00
jonaswinkler
eeff7b3bdb code style 2021-02-02 23:58:25 +01:00
jonaswinkler
5f7d817d69 localization for websockets 2021-01-28 22:06:02 +01:00
jonaswinkler
c0f185fe7e bug fixes, test case fixes 2021-01-26 15:19:56 +01:00
jonaswinkler
044aa55d74 Merge branch 'dev' into feature-websockets-status 2021-01-23 22:22:17 +01:00
Jonas Winkler
22f45ac619 Merge pull request #251 from jayme-github/ignore-date
Add option to ignore certain dates in parse_date
2021-01-05 00:19:13 +01:00
jonaswinkler
179b53d373 Merge branch 'dev' into feature-websockets-status 2021-01-04 22:45:56 +01:00
jonaswinkler
e2680b7113 code style 2021-01-02 15:26:09 +01:00
jayme-github
cd15490e91 Add option to ignore certain dates in parse_date
PAPERLESS_IGNORE_DATES allows you to specify a comma-separated list of dates
to ignore during date parsing (from filename and content). This can be
used to specify dates that appear often in documents but are usually
not the document's creation date (like your date of birth).
2021-01-02 15:20:49 +01:00
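The behaviour described in the commit message above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the ISO `YYYY-MM-DD` input format, the regular expression, and both function names are hypothetical.

```python
import re
from datetime import date, datetime
from typing import Optional, Set

# Assumption for this sketch: dates appear in ISO form (YYYY-MM-DD).
DATE_PATTERN = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def parse_ignore_dates(raw: str) -> Set[date]:
    """Turn a comma-separated string of dates into a set of date objects."""
    ignored = set()
    for item in raw.split(","):
        item = item.strip()
        if item:  # skip empty entries such as a trailing comma
            ignored.add(datetime.strptime(item, "%Y-%m-%d").date())
    return ignored


def first_date_not_ignored(text: str, ignored: Set[date]) -> Optional[date]:
    """Return the first valid date in text that is not on the ignore list."""
    for match in DATE_PATTERN.finditer(text):
        try:
            candidate = date(*(int(part) for part in match.groups()))
        except ValueError:
            # Impossible dates such as 2021-02-30 are skipped.
            continue
        if candidate not in ignored:
            return candidate
    return None
```

For example, with an ignore list of "1990-05-17", a document that mentions a date of birth before the invoice date would yield the invoice date rather than the date of birth.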
jonaswinkler
755f950cd2 supply file_name for tika parser 2021-01-01 22:19:43 +01:00
jonaswinkler
f1e9b414f9 remove duplicate code 2021-01-01 21:50:45 +01:00
jonaswinkler
4b7138f477 fixes #218 2020-12-30 15:12:16 +01:00
jonaswinkler
cdd2c873bd fixes #25 2020-12-15 13:52:35 +01:00
jonaswinkler
0c6c4a62d8 moved metadata extraction to the parsers 2020-12-10 14:57:53 +01:00
jonaswinkler
0bfecaa0fc Merge branch 'dev' into feature-websockets-status 2020-12-06 22:53:54 +01:00
jonaswinkler
b0507ce92a fixes #78 2020-12-02 18:00:49 +01:00
jonaswinkler
e4eeb29f54 checking file types against parsers in the consumer. 2020-12-01 15:26:05 +01:00
jonaswinkler
1df64e3129 Merge branch 'dev' into feature-ocrmypdf 2020-11-30 16:48:09 +01:00
jonaswinkler
7658c07b4d added file type checks to the parsers to prevent temporary files from being consumed. Also: parsers announce file types they wish to use as default for each mime type. 2020-11-30 00:40:04 +01:00
Jonas Winkler
9bfa088eb5 reworked the interface of the parsers. 2020-11-25 19:36:39 +01:00
Jonas Winkler
15935ab61f reworked PDF parser that uses OCRmyPDF and produces archive files. 2020-11-25 14:50:43 +01:00
Jonas Winkler
17b62b61fa add support for archive files. 2020-11-25 14:47:17 +01:00
Jonas Winkler
3893a23852 Merge branch 'dev' into celery-tasks 2020-11-22 22:49:37 +01:00
Jonas Winkler
afc3753e58 code cleanup 2020-11-21 14:03:45 +01:00
Jonas Winkler
f976a0b4ba mime type handling 2020-11-20 13:31:03 +01:00
Jonas Winkler
196faa8fdc Merge branch 'dev' into celery-tasks 2020-11-19 22:10:57 +01:00
Jonas Winkler
4230a0a474 a new setting that allows you to skip thumbnail optimization. 2020-11-18 22:42:05 +01:00
Jonas Winkler
680ab3d56b updated logging, logging for the mail consumer to see what's happening 2020-11-18 13:23:30 +01:00
Jonas Winkler
9a48d6c577 Changed the way parsers are discovered. This also prepares for upcoming changes regarding content types and file types: parsers should declare what they support, and actual file extensions should not be hardcoded everywhere. 2020-11-16 23:53:12 +01:00
Jonas Winkler
4734dec465 add some more checks. 2020-11-12 21:20:12 +01:00
Jonas Winkler
eb6805e37e code style fixes 2020-11-12 21:09:45 +01:00
Jonas Winkler
d46203c114 backend that supports asgi and status update sockets with channels 2020-11-07 11:31:04 +01:00
Jonas Winkler
cf5e463b9b silenced unpaper once and for all 2020-11-03 14:04:21 +01:00
Jonas Winkler
9757e261f2 A handy script to redo ocr on all documents 2020-11-03 14:04:11 +01:00
Jonas Winkler
d42979842e made unpaper and convert a little bit nicer to interact with 2020-11-02 19:31:04 +01:00
Jonas Winkler
def3a85858 reworked most of the tesseract parser, better logging 2020-11-02 15:40:44 +01:00
Jonas Winkler
ffdb517b73 removed settings constants 2020-11-01 23:37:56 +01:00
Jonas Winkler
6adc870a20 silenced unpaper, optipng for cleaner output
moved parser settings to settings
removed forgiving ocr (now default) since tesseract is plenty accurate even without defining the correct language.
2020-11-01 23:23:42 +01:00
Johannes Wienke
ebcfcea05b Handle dateparser ValueErrors
When parsing dates from the document text or filenames, correctly handle value
errors indicating broken dates. Newly added tests ensure that this handling
works properly.
2020-03-08 18:44:15 +01:00
Daniel Quinn
0d59844567 Conform everything to the coding standards
https://paperless.readthedocs.io/en/latest/contributing.html#additional-style-guides
2018-12-01 17:09:12 +00:00