59 Commits

Author SHA1 Message Date
Trenton Holmes
8a6aaf4e2d Adds additional testing for both date parsing and consumed document created date 2022-05-08 16:57:35 -07:00
Fantasticle
0baacbef98 update new regex pattern for second boundary 2022-03-31 09:37:15 +02:00
fantasticle
1ecb26a3fb
Update regex date match patterns 2022-03-30 12:19:30 +02:00
Simon Siebert
54cbacf4f4 Update parsers.py and test_consumer.py 2022-03-14 19:03:09 +01:00
Trenton Holmes
1771d18a21 Runs the pre-commit hooks over all the Python files 2022-03-11 11:34:28 -08:00
kpj
fc695896dd Format Python code with black 2022-02-27 15:26:41 +01:00
jonaswinkler
40ce38254b fixes #631 2021-03-14 14:42:48 +01:00
jonaswinkler
416101d557 only import dateparser when required 2021-02-15 11:52:46 +01:00
jonaswinkler
8d6071e977 fix a bug with thumbnail generation when TIKA was enabled 2021-02-09 22:12:43 +01:00
jonaswinkler
431d4fd8e4 rework most of the logging 2021-02-05 01:10:29 +01:00
jonaswinkler
bdc247ce49 code style 2021-02-02 23:58:25 +01:00
jonaswinkler
2faa425caf localization for websockets 2021-01-28 22:06:02 +01:00
jonaswinkler
868fd4155a bug fixes, test case fixes 2021-01-26 15:19:56 +01:00
jonaswinkler
05d69c0882 Merge branch 'dev' into feature-websockets-status 2021-01-23 22:22:17 +01:00
Jonas Winkler
be94a8e49a
Merge pull request #251 from jayme-github/ignore-date
Add option to ignore certain dates in parse_date
2021-01-05 00:19:13 +01:00
jonaswinkler
9f9581e1f8 Merge branch 'dev' into feature-websockets-status 2021-01-04 22:45:56 +01:00
jonaswinkler
e97ff3d671 code style 2021-01-02 15:26:09 +01:00
jayme-github
654ee4e62e Add option to ignore certain dates in parse_date
PAPERLESS_IGNORE_DATES allows to specify a comma separated list of dates
to ignore during date parsing (from filename and content). This can be
used so specify dates that do appear often in documents but are usually
not the documents creation date (like your date of birth).
2021-01-02 15:20:49 +01:00
jonaswinkler
40ef375c15 supply file_name for tika parser 2021-01-01 22:19:43 +01:00
jonaswinkler
c05bfb894a remove duplicate code 2021-01-01 21:50:45 +01:00
jonaswinkler
713985f259 fixes #218 2020-12-30 15:12:16 +01:00
jonaswinkler
5894060dc5 fixes #25 2020-12-15 13:52:35 +01:00
jonaswinkler
2f7bb01f34 moved metadata extraction to the parsers 2020-12-10 14:57:53 +01:00
jonaswinkler
522ada88ea Merge branch 'dev' into feature-websockets-status 2020-12-06 22:53:54 +01:00
jonaswinkler
4548cf08c7 fixes #78 2020-12-02 18:00:49 +01:00
jonaswinkler
834352130c checking file types against parsers in the consumer. 2020-12-01 15:26:05 +01:00
jonaswinkler
aaa6599283 Merge branch 'dev' into feature-ocrmypdf 2020-11-30 16:48:09 +01:00
jonaswinkler
f51207fc32 added file type checks to the parsers to prevent temporary files from being consumed. Also: parsers announce file types they wish to use as default for each mime type. 2020-11-30 00:40:04 +01:00
Jonas Winkler
df801d17e1 reworked the interface of the parsers. 2020-11-25 19:36:39 +01:00
Jonas Winkler
2d559d330d reworked PDF parser that uses OCRmyPDF and produces archive files. 2020-11-25 14:50:43 +01:00
Jonas Winkler
8069c2eb6a add support for archive files. 2020-11-25 14:47:17 +01:00
Jonas Winkler
d252a1dcda Merge branch 'dev' into celery-tasks 2020-11-22 22:49:37 +01:00
Jonas Winkler
b44f8383e4 code cleanup 2020-11-21 14:03:45 +01:00
Jonas Winkler
41650f20f4 mime type handling 2020-11-20 13:31:03 +01:00
Jonas Winkler
17430210a1 Merge branch 'dev' into celery-tasks 2020-11-19 22:10:57 +01:00
Jonas Winkler
c487e5f017 a new setting that allows you to skip thumbnail optimization. 2020-11-18 22:42:05 +01:00
Jonas Winkler
8908bc259e updated logging, logging for the mail consumer to see whats happening 2020-11-18 13:23:30 +01:00
Jonas Winkler
d2e22e3f27 Changed the way parsers are discovered. This also prepares for upcoming changes regarding content types and file types: parsers should declare what they support, and actual file extensions should not be hardcoded everywhere. 2020-11-16 23:53:12 +01:00
Jonas Winkler
0421031128 add some more checks. 2020-11-12 21:20:12 +01:00
Jonas Winkler
2e04ba1c04 code style fixes 2020-11-12 21:09:45 +01:00
Jonas Winkler
572e40ca27 backend that supports asgi and status update sockets with channels 2020-11-07 11:31:04 +01:00
Jonas Winkler
28ba634e6a silenced unpaper once and for all 2020-11-03 14:04:21 +01:00
Jonas Winkler
f4cebda085 A handy script to redo ocr on all documents, 2020-11-03 14:04:11 +01:00
Jonas Winkler
3a08a2d206 made unpaper and convert a little bit nicer to interact with 2020-11-02 19:31:04 +01:00
Jonas Winkler
d15405ef56 reworked most of the tesseract parser, better logging 2020-11-02 15:40:44 +01:00
Jonas Winkler
d6d37efa35 removed settings constants 2020-11-01 23:37:56 +01:00
Jonas Winkler
9f55fb668d silenced unpaper, optipng for cleaner output
moved parser settings to settings
removed forgiving ocr (now default) since tesseract is plenty accurate even without defining the correct language.
2020-11-01 23:23:42 +01:00
Johannes Wienke
a311cd498c Handle dateparser ValueErrors
When parsing dates from the document text or filenames, correctly handle values
errors indicating broken dates. Newly added tests ensure that this handling
works properly.
2020-03-08 18:44:15 +01:00
Daniel Quinn
d544f269e0 Conform everything to the coding standards
https://paperless.readthedocs.io/en/latest/contributing.html#additional-style-guides
2018-12-01 17:09:12 +00:00
Joshua Taillon
730daa3d6d Merge branch 'master' of github.com:danielquinn/paperless into ENH_filename_date_parsing 2018-11-15 23:17:59 -05:00