202 Commits

Author SHA1 Message Date
Jonas Winkler
afc3753e58 code cleanup 2020-11-21 14:03:45 +01:00
Jonas Winkler
fc0ba2098a FileType does not care about the extension anymore. 2020-11-20 16:18:59 +01:00
Jonas Winkler
f976a0b4ba mime type handling 2020-11-20 13:31:03 +01:00
Jonas Winkler
4a6b8ef138 small changes 2020-11-20 10:58:17 +01:00
Jonas Winkler
196faa8fdc Merge branch 'dev' into celery-tasks 2020-11-19 22:10:57 +01:00
Jonas Winkler
8c40c54421 codestyle 2020-11-18 22:41:14 +01:00
Jonas Winkler
680ab3d56b updated logging, logging for the mail consumer to see whats happening 2020-11-18 13:23:30 +01:00
Jonas Winkler
39ba14aac1 refactor 2020-11-17 11:49:44 +01:00
Jonas Winkler
e30f0b274b added more testing 2020-11-16 23:16:37 +01:00
Jonas Winkler
bd04c966c5 first version of the new consumer. 2020-11-16 18:26:54 +01:00
Jonas Winkler
eb6805e37e code style fixes 2020-11-12 21:09:45 +01:00
Jonas Winkler
8b8a2af053 fixed the file handling implementation. The feature is cool, but the original implementation had so many small flaws it wasn't even funny. 2020-11-11 14:21:33 +01:00
Jonas Winkler
a91e46364a small consumer fixes 2020-11-11 14:14:21 +01:00
Jonas Winkler
3048342de7 added a setting: delete duplicate documents 2020-11-10 01:47:58 +01:00
Jonas Winkler
d46203c114 backend that supports asgi and status update sockets with channels 2020-11-07 11:31:04 +01:00
Jonas Winkler
33f1c82943 updated the classifier. Its now much faster and does not retrain when data hasnt changed. 2020-11-06 14:46:06 +01:00
Jonas Winkler
9757e261f2 A handy script to redo ocr on all documents, 2020-11-03 14:04:11 +01:00
Jonas Winkler
a89773ad71 removed unused code, small fixes 2020-11-02 18:20:04 +01:00
Jonas Winkler
def3a85858 reworked most of the tesseract parser, better logging 2020-11-02 15:40:44 +01:00
Jonas Winkler
6fd73a04b8 updated consumer: now using watchdog 2020-11-01 23:07:54 +01:00
Jonas Winkler
6ce493e3a7 the document classifier is now stateless 2020-10-29 14:33:42 +01:00
Jonas Winkler
dd16b7262e unified document matching, legacy and automatching work alongside now 2020-10-28 11:45:11 +01:00
Jonas Winkler
93d963ed4e added
- document index
- api access for thumbnails/downloads
- more api filters

updated
- pipfile

removed
- filename handling
- legacy thumb/download access
- obsolete admin gui settings (per page items, FY, inline view)
2020-10-25 23:03:02 +01:00
Jonas Winkler
b71049ad16 Merge branch 'master' into dev 2020-10-16 15:02:57 +02:00
JOKer
5f8120add1 Merge pull request #593 from BastianPoe/feature-293
Give stored documents a structured and configurable filename
2020-05-02 08:33:49 +02:00
Johann Bauer
cea6dcce23 Warn if consume directory contains subdirectories
.
2020-01-04 01:09:54 +01:00
Wolf-Bastian Poettner
d1a54d6576 Allows to configure directory and filename formats for documents stored in paperless
Default configuration is as before (incrementing numbers), but additional fields can be added at will
2019-12-27 14:25:38 +00:00
Jonas Winkler
f711b146e1 Merge branch 'master' into dev 2018-12-11 12:38:15 +01:00
Jonas Winkler
8f0d53c54a Merge remote-tracking branch 'upstream/master' 2018-12-11 12:06:15 +01:00
Daniel Quinn
bc898c1992 Use optipng to optimise document thumbnails 2018-10-07 14:56:38 +01:00
Daniel Quinn
40b9e44bfe Wrap document consumption in a transaction #262 2018-10-07 13:12:22 +01:00
Jonas Winkler
001a80a528 Restored tagging functionality 2018-09-27 20:41:16 +02:00
Jonas Winkler
1c8576cfb9 mode change 2018-09-06 12:00:01 +02:00
Jonas Winkler
9d4155a907 removed matching model fields, automatic classifier reloading, added autmatic_classification field to matching model 2018-09-04 18:40:26 +02:00
Erik Arvstedt
88a05947f7 Update Consumer class documentation 2018-06-17 20:17:40 +01:00
Daniel Quinn
d566a014d2 Drop lines thanks to @erikarvstedt's eagle-eye 2018-06-17 17:10:45 +01:00
Daniel Quinn
e7fefc40fe Merge branch 'master' into mcronce-disable_encryption 2018-06-17 16:32:51 +01:00
Daniel Quinn
d1b6e9329f It's exist_ok=, not exists_ok= -- my bad. 2018-05-28 13:08:00 +01:00
Daniel Quinn
a9382ffd1a Drop STORAGE_TYPE in favour of just using PAPERLESS_PASSPHRASE 2018-05-28 12:58:28 +01:00
Daniel Quinn
92d9506a2e Make the consumer aware of the different storage types 2018-05-28 12:58:28 +01:00
Erik Arvstedt
d132e2b9f5 fixup: remove helper fn 'make_dirs' 2018-05-21 00:45:00 +02:00
Erik Arvstedt
8b37af994a Consider mtime of ignored files, garbage-collect ignore list
1. Store the mtime of ignored files so that we can reconsider them if
they have changed.

2. Regularly reset the ignore list to files that still exist in the
consumption dir. Previously, the list could grow indefinitely.
2018-05-11 14:05:30 +02:00
Erik Arvstedt
cc22204e5a Simplify ignoring docs 2018-05-11 14:05:29 +02:00
Erik Arvstedt
f56ec70aad Ensure docs have been unmodified for some time before consuming
Previously, the second mtime check for new files usually happened right
after the first one, which could have caused consumption of docs that
were still being modified.

We're now waiting for at least FILES_MIN_UNMODIFIED_DURATION (0.5s).

This also cleans up the logic by eliminating the consumer.stats attribute
and the weird double call to consumer.run().

Additionally, this a fixes memory leak in consumer.stats where paths could be
added but never removed if the corresponding files disappeared from
the consumer dir before being considered ready.
2018-05-11 14:05:29 +02:00
Erik Arvstedt
0db6ed225b Refactor: extract fn try_consume_file
The main purpose of this change is to make the following commits more
readable.
2018-05-11 14:05:28 +02:00
Erik Arvstedt
312a6a91b5 Use os.scandir instead of os.listdir
It's simpler and better suited for use cases introduced in later commits.
2018-05-11 14:05:25 +02:00
Erik Arvstedt
2c64e70754 Consume documents in order of increasing mtime
This increases overall usability, especially for multi-page scans.
Previously, the consumption order was undefined (see os.listdir())
2018-05-11 14:04:37 +02:00
Erik Arvstedt
9320230100 Refactor: extract fn 'make_dirs' 2018-05-11 14:04:36 +02:00
Daniel Quinn
13452ba33b Clean up docstring to be properly rst 2018-03-03 18:43:20 +00:00
Ovv
b10c2c770c style & test 2018-03-03 18:43:20 +00:00