13 Commits

Author SHA1 Message Date
Jonas Winkler
6adc870a20 silenced unpaper, optipng for cleaner output
moved parser settings to settings
removed forgiving ocr (now default) since tesseract is plenty accurate even without defining the correct language.
2020-11-01 23:23:42 +01:00
Johannes Wienke
ebcfcea05b Handle dateparser ValueErrors
When parsing dates from the document text or filenames, correctly handle value
errors indicating broken dates. Newly added tests ensure that this handling
works properly.
2020-03-08 18:44:15 +01:00
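The handling described in this commit can be illustrated with a minimal sketch. It assumes the standard library's `datetime.strptime` as a stand-in for `dateparser`, which is what the commit actually targets; the function name `parse_candidates` and the date format are hypothetical.

```python
from datetime import datetime


def parse_candidates(candidates, fmt="%d.%m.%Y"):
    """Return the first candidate string that parses as a date, or None.

    Hypothetical helper: strptime stands in for dateparser here.
    """
    for text in candidates:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            # A broken date such as "31.02.2020" raises ValueError;
            # skip it and try the next candidate instead of crashing.
            continue
    return None
```

Catching the error per candidate means one malformed date in a document does not prevent a later, valid date from being used.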
Daniel Quinn
0d59844567 Conform everything to the coding standards
https://paperless.readthedocs.io/en/latest/contributing.html#additional-style-guides
2018-12-01 17:09:12 +00:00
Joshua Taillon
b0326b5a19 Merge branch 'master' of github.com:danielquinn/paperless into ENH_filename_date_parsing
2018-11-15 23:17:59 -05:00
Joshua Taillon
6e88634fa8 Change the massive regex to match boundaries with _ or - characters (not just word breaks); add line for year first formats like YYYY-MM-DD
2018-11-15 20:38:53 -05:00
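Why word breaks alone are not enough for filenames can be shown with a small sketch. The pattern below is hypothetical and much simpler than the project's actual regex; it only covers a year-first format and treats any non-alphanumeric character, including `_` and `-`, as a boundary.

```python
import re

# Hypothetical pattern, not the project's real one.
# \b does not fire between "_" and a digit because "_" counts as a word
# character, so explicit lookarounds on alphanumerics are used instead.
FILENAME_DATE = re.compile(
    r"(?<![A-Za-z0-9])"                      # boundary: start, "_", "-", ".", ...
    r"(\d{4})[-./](\d{1,2})[-./](\d{1,2})"   # year-first, e.g. YYYY-MM-DD
    r"(?![A-Za-z0-9])"                       # boundary on the right side
)


def find_date_parts(name):
    """Return (year, month, day) as ints from the first match, or None."""
    match = FILENAME_DATE.search(name)
    return tuple(int(group) for group in match.groups()) if match else None
```

With a name like `scan_2018-11-15_invoice.pdf`, a pattern anchored with `\b` finds nothing, while the lookaround version extracts the date parts.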
Daniel Quinn
bc898c1992 Use optipng to optimise document thumbnails
2018-10-07 14:56:38 +01:00
Daniel Quinn
074609e1fc Consolidate get_date onto the DocumentParser parent class
2018-10-07 14:56:02 +01:00
Daniel Quinn
ef7f98281d Rename parsers to DATE_REGEX
In moving the `parsers` variable into the package-level, it lost the
context, so a more descriptive name was needed.
2018-09-09 21:02:30 +01:00
Joshua Taillon
5326895334 move date-matching regex pattern to base parser module for use by all subclasses
2018-09-05 21:13:36 -04:00
Daniel Quinn
5cc10a282b Use paperless- instead of paperless for tempdir name
This is purely aesthetic.
2018-02-03 14:49:17 +00:00
Daniel Quinn
648e7b6d4f No need to explicitly extend object
2018-02-03 14:49:01 +00:00
Wolf-Bastian Pöttner
21fc51c09a Add support for a heuristic that extracts the document date from its text
2018-01-28 19:37:10 +01:00
Daniel Quinn
d2c283582b feat: refactor for pluggable consumers
I've broken out the OCR-specific code from the consumers and dumped it
all into its own app, `paperless_tesseract`.  This new app should serve
as a sample of how to create one's own consumer for different file
types.

Documentation for how to do this isn't ready yet, but for the impatient:

* Create a new app
    * containing a `parsers.py` for your parser modelled after
      `paperless_tesseract.parsers.RasterisedDocumentParser`
    * containing a `signals.py` with a handler modelled after
      `paperless_tesseract.signals.ConsumerDeclaration`
    * connect the signal handler to
      `documents.signals.document_consumer_declaration` in
      `your_app.apps`
* Install the app into Paperless by declaring
  `PAPERLESS_INSTALLED_APPS=your_app`.  Additional apps should be
  separated with commas.
* Restart the consumer
2017-03-25 15:10:25 +00:00
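The steps listed in the d2c283582b message might look roughly like this when collapsed into a single file. This is a Django-free sketch under stated assumptions: `pick_parser` stands in for the consumer sending `documents.signals.document_consumer_declaration`, and `TextFileParser`, the return shape of `handle`, and the `weight` key are hypothetical illustrations, not the confirmed Paperless API.

```python
import re


class TextFileParser:
    """Hypothetical parser; a real one would be modelled after
    paperless_tesseract.parsers.RasterisedDocumentParser."""

    def __init__(self, path):
        self.path = path

    def get_text(self):
        with open(self.path, encoding="utf-8") as handle:
            return handle.read()


class ConsumerDeclaration:
    """Hypothetical signal handler that offers TextFileParser for .txt files."""

    MATCHING_FILES = re.compile(r"^.*\.txt$", re.IGNORECASE)

    @classmethod
    def handle(cls, sender, **kwargs):
        # Assumed contract: return a callable that tests a file path.
        return cls.test

    @classmethod
    def test(cls, path):
        if cls.MATCHING_FILES.match(path):
            return {"parser": TextFileParser, "weight": 0}
        return None


def pick_parser(handlers, path):
    """Stand-in for the consumer: ask every connected handler and keep
    the matching offer with the highest weight."""
    offers = [handler(sender=None)(path) for handler in handlers]
    offers = [offer for offer in offers if offer is not None]
    if not offers:
        return None
    return max(offers, key=lambda offer: offer["weight"])["parser"]
```

In an actual app, the handler would be connected to the signal in `your_app.apps` rather than passed around in a list, and the app would be enabled through `PAPERLESS_INSTALLED_APPS=your_app` as the commit message describes.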