157 Commits

Author SHA1 Message Date
Jonas Winkler
9bd0bee2f6 codestyle 2020-11-25 19:51:02 +01:00
Jonas Winkler
df801d17e1 reworked the interface of the parsers. 2020-11-25 19:36:39 +01:00
Jonas Winkler
8069c2eb6a add support for archive files. 2020-11-25 14:47:17 +01:00
Jonas Winkler
9a33f191a7 added archive directory. 2020-11-25 14:45:21 +01:00
Jonas Winkler
d252a1dcda Merge branch 'dev' into celery-tasks 2020-11-22 22:49:37 +01:00
Jonas Winkler
b44f8383e4 code cleanup 2020-11-21 14:03:45 +01:00
Jonas Winkler
3d5b66c2b7 FileType does not care about the extension anymore. 2020-11-20 16:18:59 +01:00
Jonas Winkler
41650f20f4 mime type handling 2020-11-20 13:31:03 +01:00
Jonas Winkler
391020a2b0 small changes 2020-11-20 10:58:17 +01:00
Jonas Winkler
17430210a1 Merge branch 'dev' into celery-tasks 2020-11-19 22:10:57 +01:00
Jonas Winkler
727f86c369 codestyle 2020-11-18 22:41:14 +01:00
Jonas Winkler
8908bc259e updated logging, logging for the mail consumer to see whats happening 2020-11-18 13:23:30 +01:00
Jonas Winkler
c7c6be42be refactor 2020-11-17 11:49:44 +01:00
Jonas Winkler
70d8e8bc56 added more testing 2020-11-16 23:16:37 +01:00
Jonas Winkler
8dca459573 first version of the new consumer. 2020-11-16 18:26:54 +01:00
Jonas Winkler
2e04ba1c04 code style fixes 2020-11-12 21:09:45 +01:00
Jonas Winkler
734da28b69 fixed the file handling implementation. The feature is cool, but the original implementation had so many small flaws it wasn't even funny. 2020-11-11 14:21:33 +01:00
Jonas Winkler
02ef7cb038 small consumer fixes 2020-11-11 14:14:21 +01:00
Jonas Winkler
83f82f3caf added a setting: delete duplicate documents 2020-11-10 01:47:58 +01:00
Jonas Winkler
572e40ca27 backend that supports asgi and status update sockets with channels 2020-11-07 11:31:04 +01:00
Jonas Winkler
296c113b16 updated the classifier. Its now much faster and does not retrain when data hasnt changed. 2020-11-06 14:46:06 +01:00
Jonas Winkler
f4cebda085 A handy script to redo ocr on all documents, 2020-11-03 14:04:11 +01:00
Jonas Winkler
7d282a4e4e removed unused code, small fixes 2020-11-02 18:20:04 +01:00
Jonas Winkler
d15405ef56 reworked most of the tesseract parser, better logging 2020-11-02 15:40:44 +01:00
Jonas Winkler
9f29dc2863 updated consumer: now using watchdog 2020-11-01 23:07:54 +01:00
Jonas Winkler
05f20c19c3 the document classifier is now stateless 2020-10-29 14:33:42 +01:00
Jonas Winkler
11af74ba36 unified document matching, legacy and automatching work alongside now 2020-10-28 11:45:11 +01:00
Jonas Winkler
052c1680f3 added
- document index
- api access for thumbnails/downloads
- more api filters

updated
- pipfile

removed
- filename handling
- legacy thumb/download access
- obsolete admin gui settings (per page items, FY, inline view)
2020-10-25 23:03:02 +01:00
Jonas Winkler
421dab786d Merge branch 'master' into dev 2020-10-16 15:02:57 +02:00
JOKer
8698f92ac9
Merge pull request #593 from BastianPoe/feature-293
Give stored documents a structured and configurable filename
2020-05-02 08:33:49 +02:00
Johann Bauer
22c7f309a7 Warn if consume directory contains subdirectories
.
2020-01-04 01:09:54 +01:00
Wolf-Bastian Poettner
6813805712 Allows to configure directory and filename formats for documents stored in paperless
Default configuration is as before (incrementing numbers), but additional fields can be added at will
2019-12-27 14:25:38 +00:00
Jonas Winkler
ea58c66fd4 Merge branch 'master' into dev 2018-12-11 12:38:15 +01:00
Jonas Winkler
766109ae4e Merge remote-tracking branch 'upstream/master' 2018-12-11 12:06:15 +01:00
Daniel Quinn
750ab5bf85 Use optipng to optimise document thumbnails 2018-10-07 14:56:38 +01:00
Daniel Quinn
14bb52b6a4 Wrap document consumption in a transaction #262 2018-10-07 13:12:22 +01:00
Jonas Winkler
b347e3347d Restored tagging functionality 2018-09-27 20:41:16 +02:00
Jonas Winkler
11adc94e5e mode change 2018-09-06 12:00:01 +02:00
Jonas Winkler
70bd05450a removed matching model fields, automatic classifier reloading, added autmatic_classification field to matching model 2018-09-04 18:40:26 +02:00
Erik Arvstedt
742b01d1f5 Update Consumer class documentation 2018-06-17 20:17:40 +01:00
Daniel Quinn
90cd9f3eb7 Drop lines thanks to @erikarvstedt's eagle-eye 2018-06-17 17:10:45 +01:00
Daniel Quinn
c9f35a7da2
Merge branch 'master' into mcronce-disable_encryption 2018-06-17 16:32:51 +01:00
Daniel Quinn
81a8cb45d7 It's exist_ok=, not exists_ok= -- my bad. 2018-05-28 13:08:00 +01:00
Daniel Quinn
6e1f2b3f03 Drop STORAGE_TYPE in favour of just using PAPERLESS_PASSPHRASE 2018-05-28 12:58:28 +01:00
Daniel Quinn
d8740ee5ca Make the consumer aware of the different storage types 2018-05-28 12:58:28 +01:00
Erik Arvstedt
bccac5017c fixup: remove helper fn 'make_dirs' 2018-05-21 00:45:00 +02:00
Erik Arvstedt
e65e27d11f Consider mtime of ignored files, garbage-collect ignore list
1. Store the mtime of ignored files so that we can reconsider them if
they have changed.

2. Regularly reset the ignore list to files that still exist in the
consumption dir. Previously, the list could grow indefinitely.
2018-05-11 14:05:30 +02:00
Erik Arvstedt
12488c9634 Simplify ignoring docs 2018-05-11 14:05:29 +02:00
Erik Arvstedt
61cd050e24 Ensure docs have been unmodified for some time before consuming
Previously, the second mtime check for new files usually happened right
after the first one, which could have caused consumption of docs that
were still being modified.

We're now waiting for at least FILES_MIN_UNMODIFIED_DURATION (0.5s).

This also cleans up the logic by eliminating the consumer.stats attribute
and the weird double call to consumer.run().

Additionally, this a fixes memory leak in consumer.stats where paths could be
added but never removed if the corresponding files disappeared from
the consumer dir before being considered ready.
2018-05-11 14:05:29 +02:00
Erik Arvstedt
f018e8e54f Refactor: extract fn try_consume_file
The main purpose of this change is to make the following commits more
readable.
2018-05-11 14:05:28 +02:00