139 Commits

Author SHA1 Message Date
Jonas Winkler
83f82f3caf added a setting: delete duplicate documents 2020-11-10 01:47:58 +01:00
Jonas Winkler
572e40ca27 backend that supports asgi and status update sockets with channels 2020-11-07 11:31:04 +01:00
Jonas Winkler
296c113b16 updated the classifier. It's now much faster and does not retrain when data hasn't changed. 2020-11-06 14:46:06 +01:00
Jonas Winkler
f4cebda085 A handy script to redo ocr on all documents, 2020-11-03 14:04:11 +01:00
Jonas Winkler
7d282a4e4e removed unused code, small fixes 2020-11-02 18:20:04 +01:00
Jonas Winkler
d15405ef56 reworked most of the tesseract parser, better logging 2020-11-02 15:40:44 +01:00
Jonas Winkler
9f29dc2863 updated consumer: now using watchdog 2020-11-01 23:07:54 +01:00
Jonas Winkler
05f20c19c3 the document classifier is now stateless 2020-10-29 14:33:42 +01:00
Jonas Winkler
11af74ba36 unified document matching, legacy and automatic matching now work alongside each other 2020-10-28 11:45:11 +01:00
Jonas Winkler
052c1680f3 added
- document index
- api access for thumbnails/downloads
- more api filters

updated
- pipfile

removed
- filename handling
- legacy thumb/download access
- obsolete admin gui settings (per page items, FY, inline view)
2020-10-25 23:03:02 +01:00
Jonas Winkler
421dab786d Merge branch 'master' into dev 2020-10-16 15:02:57 +02:00
JOKer
8698f92ac9
Merge pull request #593 from BastianPoe/feature-293
Give stored documents a structured and configurable filename
2020-05-02 08:33:49 +02:00
Johann Bauer
22c7f309a7 Warn if consume directory contains subdirectories
2020-01-04 01:09:54 +01:00
Wolf-Bastian Poettner
6813805712 Allow configuring directory and filename formats for documents stored in paperless
Default configuration is as before (incrementing numbers), but additional fields can be added at will
2019-12-27 14:25:38 +00:00
Jonas Winkler
ea58c66fd4 Merge branch 'master' into dev 2018-12-11 12:38:15 +01:00
Jonas Winkler
766109ae4e Merge remote-tracking branch 'upstream/master' 2018-12-11 12:06:15 +01:00
Daniel Quinn
750ab5bf85 Use optipng to optimise document thumbnails 2018-10-07 14:56:38 +01:00
Daniel Quinn
14bb52b6a4 Wrap document consumption in a transaction #262 2018-10-07 13:12:22 +01:00
Jonas Winkler
b347e3347d Restored tagging functionality 2018-09-27 20:41:16 +02:00
Jonas Winkler
11adc94e5e mode change 2018-09-06 12:00:01 +02:00
Jonas Winkler
70bd05450a removed matching model fields, automatic classifier reloading, added autmatic_classification field to matching model 2018-09-04 18:40:26 +02:00
Erik Arvstedt
742b01d1f5 Update Consumer class documentation 2018-06-17 20:17:40 +01:00
Daniel Quinn
90cd9f3eb7 Drop lines thanks to @erikarvstedt's eagle-eye 2018-06-17 17:10:45 +01:00
Daniel Quinn
c9f35a7da2
Merge branch 'master' into mcronce-disable_encryption 2018-06-17 16:32:51 +01:00
Daniel Quinn
81a8cb45d7 It's exist_ok=, not exists_ok= -- my bad. 2018-05-28 13:08:00 +01:00
Daniel Quinn
6e1f2b3f03 Drop STORAGE_TYPE in favour of just using PAPERLESS_PASSPHRASE 2018-05-28 12:58:28 +01:00
Daniel Quinn
d8740ee5ca Make the consumer aware of the different storage types 2018-05-28 12:58:28 +01:00
Erik Arvstedt
bccac5017c fixup: remove helper fn 'make_dirs' 2018-05-21 00:45:00 +02:00
Erik Arvstedt
e65e27d11f Consider mtime of ignored files, garbage-collect ignore list
1. Store the mtime of ignored files so that we can reconsider them if
they have changed.

2. Regularly reset the ignore list to files that still exist in the
consumption dir. Previously, the list could grow indefinitely.
2018-05-11 14:05:30 +02:00
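
The two points in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the class `IgnoreList` and its method names are hypothetical, and the real consumer may store this state differently.

```python
import os


class IgnoreList:
    """Hypothetical helper: remembers ignored files together with their mtime."""

    def __init__(self):
        # Maps path -> mtime recorded at the moment the file was ignored.
        self._ignored = {}

    def ignore(self, path):
        self._ignored[path] = os.stat(path).st_mtime

    def is_ignored(self, path):
        # A file stays ignored only while its mtime is unchanged, so a
        # modified file gets reconsidered on the next pass.
        recorded = self._ignored.get(path)
        if recorded is None:
            return False
        try:
            return os.stat(path).st_mtime == recorded
        except FileNotFoundError:
            return False

    def garbage_collect(self, directory):
        # Drop entries for files that no longer exist in the consumption
        # directory, so the list cannot grow indefinitely.
        existing = {entry.path for entry in os.scandir(directory)}
        self._ignored = {
            path: mtime
            for path, mtime in self._ignored.items()
            if path in existing
        }

    def __len__(self):
        return len(self._ignored)
```

Calling `garbage_collect()` once per polling cycle would keep the list bounded by the number of files currently in the directory.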
Erik Arvstedt
12488c9634 Simplify ignoring docs 2018-05-11 14:05:29 +02:00
Erik Arvstedt
61cd050e24 Ensure docs have been unmodified for some time before consuming
Previously, the second mtime check for new files usually happened right
after the first one, which could have caused consumption of docs that
were still being modified.

We're now waiting for at least FILES_MIN_UNMODIFIED_DURATION (0.5s).

This also cleans up the logic by eliminating the consumer.stats attribute
and the weird double call to consumer.run().

Additionally, this fixes a memory leak in consumer.stats where paths could be
added but never removed if the corresponding files disappeared from
the consumer dir before being considered ready.
2018-05-11 14:05:29 +02:00
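
A minimal sketch of the readiness check described above, assuming it boils down to comparing a file's mtime against the current time. Only the constant name and its 0.5s value come from the commit message; the function `is_ready` and its `now` parameter are hypothetical.

```python
import os
import time

# Name and value taken from the commit message.
FILES_MIN_UNMODIFIED_DURATION = 0.5  # seconds


def is_ready(path, now=None, min_age=FILES_MIN_UNMODIFIED_DURATION):
    """Return True if the file has been unmodified for at least min_age seconds.

    `now` can be injected to make the check testable; it defaults to the
    current wall-clock time.
    """
    if now is None:
        now = time.time()
    return now - os.stat(path).st_mtime >= min_age
```

A file that is still being written keeps getting a fresh mtime, so it fails this check until the writer has been idle for the full duration.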
Erik Arvstedt
f018e8e54f Refactor: extract fn try_consume_file
The main purpose of this change is to make the following commits more
readable.
2018-05-11 14:05:28 +02:00
Erik Arvstedt
a56a3eb86d Use os.scandir instead of os.listdir
It's simpler and better suited for use cases introduced in later commits.
2018-05-11 14:05:25 +02:00
Erik Arvstedt
2fe7df8ca0 Consume documents in order of increasing mtime
This increases overall usability, especially for multi-page scans.
Previously, the consumption order was undefined (see os.listdir())
2018-05-11 14:04:37 +02:00
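
Ordering by mtime could look something like the sketch below. It is an illustration under assumptions: the function name `files_by_mtime` is hypothetical, and it uses `os.scandir`, which the log shows was adopted in a separate commit.

```python
import os


def files_by_mtime(directory):
    """Return paths of regular files in `directory`, oldest mtime first."""
    entries = [entry for entry in os.scandir(directory) if entry.is_file()]
    # os.scandir (like os.listdir) yields entries in arbitrary order, so an
    # explicit sort is needed to get a defined consumption order.
    entries.sort(key=lambda entry: entry.stat().st_mtime)
    return [entry.path for entry in entries]
```

For a multi-page scan written page by page, this yields the pages in the order the scanner produced them.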
Erik Arvstedt
873c98dddb Refactor: extract fn 'make_dirs' 2018-05-11 14:04:36 +02:00
Daniel Quinn
73e62600c2 Clean up docstring to be properly rst 2018-03-03 18:43:20 +00:00
Ovv
8fefafb844 style & test 2018-03-03 18:43:20 +00:00
Ovv
d1a57b5d68 Configuration cli argument for document_consumer 2018-03-03 18:43:20 +00:00
Daniel Quinn
ea6d040809 Monitor return codes of calls to convert and unpaper
...and handle the failures nicely.  Addresses #303.
2018-02-18 16:02:27 +00:00
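
One way to monitor the return code of an external tool is sketched below. The names `ExternalToolError` and `run_tool` are hypothetical and not taken from the commit; the sketch only shows the general pattern of checking `returncode` and raising a dedicated error.

```python
import subprocess
import sys


class ExternalToolError(Exception):
    """Hypothetical error raised when an external command exits non-zero."""

    def __init__(self, args, returncode, stderr):
        super().__init__(
            "{} exited with code {}".format(args[0], returncode)
        )
        self.returncode = returncode
        self.stderr = stderr


def run_tool(args):
    """Run a command and raise ExternalToolError if it fails."""
    result = subprocess.run(
        args, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    if result.returncode != 0:
        raise ExternalToolError(args, result.returncode, result.stderr)
    return result


if __name__ == "__main__":
    # Demonstration with the Python interpreter standing in for a tool
    # such as convert or unpaper.
    try:
        run_tool([sys.executable, "-c", "import sys; sys.exit(3)"])
    except ExternalToolError as error:
        print("handled:", error)
```

The caller can then catch the error per document and carry on with the next file instead of crashing the whole consumer.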
Daniel Quinn
fb1da4834c Style and removal of Python 2.7 stuff 2018-02-18 15:55:55 +00:00
Wolf-Bastian Pöttner
b140935843 Add support for a heuristic that extracts the document date from its text 2018-01-28 19:37:10 +01:00
Daniel Quinn
fa4924d5ba fix: allow for caps in file name suffixes #206
@schinkelg ran afoul of this one and I took the opportunity to add a
test to catch this sort of thing for next time.
2017-03-28 21:14:24 +00:00
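
A case-insensitive suffix check might be sketched like this. The set `SUPPORTED_SUFFIXES` and the function `has_supported_suffix` are assumptions for illustration; the actual fix may instead use a case-insensitive regular expression.

```python
import os

# Assumed list of suffixes, for illustration only.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg"}


def has_supported_suffix(filename):
    """Compare the file suffix in lower case so 'scan.PDF' is accepted."""
    suffix = os.path.splitext(filename)[1]
    return suffix.lower() in SUPPORTED_SUFFIXES
```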
Daniel Quinn
55e81ca4bb feat: refactor for pluggable consumers
I've broken out the OCR-specific code from the consumers and dumped it
all into its own app, `paperless_tesseract`.  This new app should serve
as a sample of how to create one's own consumer for different file
types.

Documentation for how to do this isn't ready yet, but for the impatient:

* Create a new app
    * containing a `parsers.py` for your parser modelled after
      `paperless_tesseract.parsers.RasterisedDocumentParser`
    * containing a `signals.py` with a handler modelled after
      `paperless_tesseract.signals.ConsumerDeclaration`
    * connect the signal handler to
      `documents.signals.document_consumer_declaration` in
      `your_app.apps`
* Install the app into Paperless by declaring
  `PAPERLESS_INSTALLED_APPS=your_app`.  Additional apps should be
  separated with commas.
* Restart the consumer
2017-03-25 15:10:25 +00:00
Daniel Quinn
18495ce9da Fix for #154
* Added a test with a faked pyocr and tesseract
* Added a catch for pyocr's *other* TesseractError
2016-11-27 15:06:45 +00:00
Daniel Quinn
ca21929cee Moved logging logic into the consumer 2016-10-26 09:52:09 +00:00
Daniel Quinn
8e58406881 pep8 corrections 2016-10-26 09:32:59 +00:00
Aleksandr Bogdanov
63de2ca1b0 Collapsing excess whitespace after OCR 2016-10-12 01:46:34 +02:00
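
Collapsing whitespace in OCR output can be sketched with a single regular expression. This is an assumption about the approach, and the function name `collapse_whitespace` is hypothetical; the real commit may, for example, preserve line breaks.

```python
import re


def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()
```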
Daniel Quinn
1ce76a5486 Actually write the date found in the file name 2016-08-20 18:11:51 +01:00
Lenz Weber
018efc576b wait until file is completely transmitted
negation was missing for feature to be active, see #128
2016-06-26 10:18:58 +02:00
Brian Martin
b6ae129ad1 Sample Config and Bug Fix
Update sample config to reflect new setting variable.
Change consumer to handle density setting as str instead of int.
2016-05-13 23:23:58 -04:00