89 Commits

Author SHA1 Message Date
jonaswinkler
713985f259 fixes #218 2020-12-30 15:12:16 +01:00
jonaswinkler
5894060dc5 fixes #25 2020-12-15 13:52:35 +01:00
jonaswinkler
2f7bb01f34 moved metadata extraction to the parsers 2020-12-10 14:57:53 +01:00
jonaswinkler
522ada88ea Merge branch 'dev' into feature-websockets-status 2020-12-06 22:53:54 +01:00
jonaswinkler
4548cf08c7 fixes #78 2020-12-02 18:00:49 +01:00
jonaswinkler
834352130c checking file types against parsers in the consumer. 2020-12-01 15:26:05 +01:00
jonaswinkler
aaa6599283 Merge branch 'dev' into feature-ocrmypdf 2020-11-30 16:48:09 +01:00
jonaswinkler
f51207fc32 added file type checks to the parsers to prevent temporary files from being consumed. Also: parsers announce file types they wish to use as default for each mime type. 2020-11-30 00:40:04 +01:00
Jonas Winkler
df801d17e1 reworked the interface of the parsers. 2020-11-25 19:36:39 +01:00
Jonas Winkler
2d559d330d reworked PDF parser that uses OCRmyPDF and produces archive files. 2020-11-25 14:50:43 +01:00
Jonas Winkler
8069c2eb6a add support for archive files. 2020-11-25 14:47:17 +01:00
Jonas Winkler
d252a1dcda Merge branch 'dev' into celery-tasks 2020-11-22 22:49:37 +01:00
Jonas Winkler
b44f8383e4 code cleanup 2020-11-21 14:03:45 +01:00
Jonas Winkler
41650f20f4 mime type handling 2020-11-20 13:31:03 +01:00
Jonas Winkler
17430210a1 Merge branch 'dev' into celery-tasks 2020-11-19 22:10:57 +01:00
Jonas Winkler
c487e5f017 a new setting that allows you to skip thumbnail optimization. 2020-11-18 22:42:05 +01:00
Jonas Winkler
8908bc259e updated logging, logging for the mail consumer to see what's happening 2020-11-18 13:23:30 +01:00
Jonas Winkler
d2e22e3f27 Changed the way parsers are discovered. This also prepares for upcoming changes regarding content types and file types: parsers should declare what they support, and actual file extensions should not be hardcoded everywhere. 2020-11-16 23:53:12 +01:00
Jonas Winkler
0421031128 add some more checks. 2020-11-12 21:20:12 +01:00
Jonas Winkler
2e04ba1c04 code style fixes 2020-11-12 21:09:45 +01:00
Jonas Winkler
572e40ca27 backend that supports asgi and status update sockets with channels 2020-11-07 11:31:04 +01:00
Jonas Winkler
28ba634e6a silenced unpaper once and for all 2020-11-03 14:04:21 +01:00
Jonas Winkler
f4cebda085 A handy script to redo OCR on all documents 2020-11-03 14:04:11 +01:00
Jonas Winkler
3a08a2d206 made unpaper and convert a little bit nicer to interact with 2020-11-02 19:31:04 +01:00
Jonas Winkler
d15405ef56 reworked most of the tesseract parser, better logging 2020-11-02 15:40:44 +01:00
Jonas Winkler
d6d37efa35 removed settings constants 2020-11-01 23:37:56 +01:00
Jonas Winkler
9f55fb668d silenced unpaper, optipng for cleaner output
moved parser settings to settings
removed forgiving ocr (now default) since tesseract is plenty accurate even without defining the correct language.
2020-11-01 23:23:42 +01:00
Johannes Wienke
a311cd498c Handle dateparser ValueErrors
When parsing dates from the document text or filenames, correctly handle
ValueErrors indicating broken dates. Newly added tests ensure that this handling
works properly.
2020-03-08 18:44:15 +01:00
Daniel Quinn
d544f269e0 Conform everything to the coding standards
https://paperless.readthedocs.io/en/latest/contributing.html#additional-style-guides
2018-12-01 17:09:12 +00:00
Joshua Taillon
730daa3d6d Merge branch 'master' of github.com:danielquinn/paperless into ENH_filename_date_parsing 2018-11-15 23:17:59 -05:00
Joshua Taillon
c225281f95 Change the massive regex to match boundaries with _ or - characters (not just word breaks); add line for year first formats like YYYY-MM-DD 2018-11-15 20:38:53 -05:00
Daniel Quinn
750ab5bf85 Use optipng to optimise document thumbnails 2018-10-07 14:56:38 +01:00
Daniel Quinn
2a3f766b93 Consolidate get_date onto the DocumentParser parent class 2018-10-07 14:56:02 +01:00
Daniel Quinn
c99f5923d5 Rename parsers to DATE_REGEX
In moving the `parsers` variable into the package-level, it lost the
context, so a more descriptive name was needed.
2018-09-09 21:02:30 +01:00
Joshua Taillon
72c828170e move date-matching regex pattern to base parser module for use by all subclasses 2018-09-05 21:13:36 -04:00
Daniel Quinn
cebb8b9fa2 Use paperless- instead of paperless for tempdir name
This is purely aesthetic.
2018-02-03 14:49:17 +00:00
Daniel Quinn
46aca10a72 No need to explicitly extend object 2018-02-03 14:49:01 +00:00
Wolf-Bastian Pöttner
b140935843 Add support for a heuristic that extracts the document date from its text 2018-01-28 19:37:10 +01:00
Daniel Quinn
55e81ca4bb feat: refactor for pluggable consumers
I've broken out the OCR-specific code from the consumers and dumped it
all into its own app, `paperless_tesseract`.  This new app should serve
as a sample of how to create one's own consumer for different file
types.

Documentation for how to do this isn't ready yet, but for the impatient:

* Create a new app
    * containing a `parsers.py` for your parser modelled after
      `paperless_tesseract.parsers.RasterisedDocumentParser`
    * containing a `signals.py` with a handler modelled after
      `paperless_tesseract.signals.ConsumerDeclaration`
    * connect the signal handler to
      `documents.signals.document_consumer_declaration` in
      `your_app.apps`
* Install the app into Paperless by declaring
  `PAPERLESS_INSTALLED_APPS=your_app`.  Additional apps should be
  separated with commas.
* Restart the consumer
2017-03-25 15:10:25 +00:00
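
The steps listed in commit 55e81ca4bb can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Signal` class is a stand-in for `django.dispatch.Signal` so the example runs without Django, `TextDocumentParser`, `find_parser` and the `.txt` pattern are hypothetical, and the shape of the value returned by the handler (a test callable yielding a dict with `parser` and `weight`) is an assumption rather than something the commit message specifies.

```python
import re


class Signal:
    """Minimal stand-in for django.dispatch.Signal (assumption: only
    connect() and send() are needed for this sketch)."""

    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        self._receivers.append(receiver)

    def send(self, sender, **kwargs):
        # Like Django, return a list of (receiver, response) pairs.
        return [(r, r(sender=sender, **kwargs)) for r in self._receivers]


# Stand-in for documents.signals.document_consumer_declaration.
document_consumer_declaration = Signal()


class TextDocumentParser:
    """Hypothetical parser that would live in your_app/parsers.py."""

    def __init__(self, path):
        self.path = path

    def get_text(self):
        with open(self.path, encoding="utf-8") as handle:
            return handle.read()


class ConsumerDeclaration:
    """Hypothetical handler that would live in your_app/signals.py."""

    MATCHING_FILES = re.compile(r"^.*\.txt$", re.IGNORECASE)

    @classmethod
    def handle(cls, sender, **kwargs):
        # Assumed contract: hand back a callable that tests a file name.
        return cls.test

    @classmethod
    def test(cls, doc):
        if cls.MATCHING_FILES.match(doc):
            return {"parser": TextDocumentParser, "weight": 0}
        return None


# In a real app this connection would be made in your_app/apps.py,
# typically inside AppConfig.ready().
document_consumer_declaration.connect(ConsumerDeclaration.handle)


def find_parser(path):
    """Hypothetical helper: ask every declared consumer about `path`
    and pick the matching parser with the highest weight."""
    best = None
    for _receiver, test in document_consumer_declaration.send(sender=None):
        result = test(path)
        if result and (best is None or result["weight"] > best["weight"]):
            best = result
    return best["parser"] if best else None
```

With the handler connected, `find_parser("invoice.txt")` resolves to `TextDocumentParser`, while a file name no consumer claims resolves to `None`. In an actual installation the app would additionally be listed in `PAPERLESS_INSTALLED_APPS` and the consumer restarted, as the commit message describes.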