107 Commits

Author SHA1 Message Date
Matt
ce98019b49 Fixing error sentinel for pdftotext when the PDF has no text (scanned images). It was causing a crash previously. 2018-02-01 10:08:57 -05:00
Daniel Quinn
cd92c005e3 Add support for using pre-existing text from PDFs 2018-01-30 20:13:35 +00:00
Wolf-Bastian Pöttner
b140935843 Add support for a heuristic that extracts the document date from its text 2018-01-28 19:37:10 +01:00
Daniel Quinn
bd67b53d50 Update test for #259 fix 2017-10-16 10:53:18 +01:00
Daniel Quinn
e32ed09da3 Support .jpeg as well as .jpg 2017-10-16 09:00:38 +01:00
Daniel Quinn
fa4924d5ba fix: allow for caps in file name suffixes #206
@schinkelg ran aground of this one and I took the opportunity to add a
test to catch this sort of thing for next time.
2017-03-28 21:14:24 +00:00
Daniel Quinn
55e81ca4bb feat: refactor for pluggable consumers
I've broken out the OCR-specific code from the consumers and dumped it
all into its own app, `paperless_tesseract`.  This new app should serve
as a sample of how to create one's own consumer for different file
types.

Documentation for how to do this isn't ready yet, but for the impatient:

* Create a new app
    * containing a `parsers.py` for your parser modelled after
      `paperless_tesseract.parsers.RasterisedDocumentParser`
    * containing a `signals.py` with a handler moddelled after
      `paperless_tesseract.signals.ConsumerDeclaration`
    * connect the signal handler to
      `documents.signals.document_consumer_declaration` in
      `your_app.apps`
* Install the app into Paperless by declaring
  `PAPERLESS_INSTALLED_APPS=your_app`.  Additional apps should be
  separated with commas.
* Restart the consumer
2017-03-25 15:10:25 +00:00