68 Commits

Author SHA1 Message Date
Wolf-Bastian Pöttner
96c7222269 Improved regular expression to only match for (unicode) characters in month names + parsed one regex match after another until one gave a parsable date 2018-02-14 21:41:04 +01:00
Wolf-Bastian Pöttner
1737e27b34 Add more (fast-running) unit tests 2018-02-14 21:41:01 +01:00
Wolf-Bastian Pöttner
39f198138a Extended exception handling 2018-02-12 22:43:16 +01:00
Wolf-Bastian Pöttner
c74bb84c83 Added log output for date detected in document 2018-02-12 22:41:19 +01:00
Wolf-Bastian Pöttner
07d06d9aee Extends the regex to find dates in documents as reported by @isaacsando 2018-02-12 22:41:15 +01:00
Daniel Quinn
73163d893f No need to extend object 2018-02-03 15:26:28 +00:00
Daniel Quinn
c90ed2da1d Rework tests to write to /tmp
Originally the test wrote scratch data inside the repo dir, which meant
manual cleanup.  Now it writes to `/tmp/paperless-tests-<random-string>`
and cleans up after itself.
2018-02-03 14:49:48 +00:00
Wolf-Bastian Pöttner
40f8ba23a4 Added a text cache to optimize performance of date detection 2018-02-03 00:28:52 +01:00
Wolf-Bastian Pöttner
bef2d94374 Add test cases for date parsing 2018-02-03 00:28:49 +01:00
Wolf-Bastian Pöttner
f39c7654a0 Merge branch 'master' of https://github.com/danielquinn/paperless into feature/heuristically-extract-date-from-document-text 2018-02-02 22:44:03 +01:00
Wolf-Bastian Pöttner
87e466c47c Add support for using pre-existing text from PDFs 2018-02-02 22:37:58 +01:00
Matt
ce98019b49 Fixing error sentinel for pdftotext when the PDF has no text (scanned images). It was causing a crash previously. 2018-02-01 10:08:57 -05:00
Daniel Quinn
cd92c005e3 Add support for using pre-existing text from PDFs 2018-01-30 20:13:35 +00:00
Wolf-Bastian Pöttner
b140935843 Add support for a heuristic that extracts the document date from its text 2018-01-28 19:37:10 +01:00
Daniel Quinn
bd67b53d50 Update test for #259 fix 2017-10-16 10:53:18 +01:00
Daniel Quinn
e32ed09da3 Support .jpeg as well as .jpg 2017-10-16 09:00:38 +01:00
Daniel Quinn
fa4924d5ba fix: allow for caps in file name suffixes #206
@schinkelg ran aground of this one and I took the opportunity to add a
test to catch this sort of thing for next time.
2017-03-28 21:14:24 +00:00
Daniel Quinn
55e81ca4bb feat: refactor for pluggable consumers
I've broken out the OCR-specific code from the consumers and dumped it
all into its own app, `paperless_tesseract`.  This new app should serve
as a sample of how to create one's own consumer for different file
types.

Documentation for how to do this isn't ready yet, but for the impatient:

* Create a new app
    * containing a `parsers.py` for your parser modelled after
      `paperless_tesseract.parsers.RasterisedDocumentParser`
    * containing a `signals.py` with a handler moddelled after
      `paperless_tesseract.signals.ConsumerDeclaration`
    * connect the signal handler to
      `documents.signals.document_consumer_declaration` in
      `your_app.apps`
* Install the app into Paperless by declaring
  `PAPERLESS_INSTALLED_APPS=your_app`.  Additional apps should be
  separated with commas.
* Restart the consumer
2017-03-25 15:10:25 +00:00