paperless-ngx

mirror of https://github.com/paperless-ngx/paperless-ngx.git synced 2025-05-01 11:19:32 -05:00

Author	SHA1	Message	Date
Daniel Quinn	0a4338143a	Tweak the date guesser to not allow dates prior to 1900 (#414 )	2018-10-01 20:03:47 +01:00
Daniel Quinn	52bfeb2ad0	Improve the unknown language error message	2018-09-23 12:41:14 +01:00
Daniel Quinn	21e53aa55c	Merge pull request #399 from jat255/ENH_convert_only_one_page Speed up thumbnail generation for PDFs	2018-09-09 21:12:42 +01:00
Daniel Quinn	ef7f98281d	Rename `parsers` to `DATE_REGEX` In moving the `parsers` variable into the package-level, it lost the context, so a more descriptive name was needed.	2018-09-09 21:02:30 +01:00
Daniel Quinn	a3158eedf9	Merge branch 'ENH_text_consumer' of git://github.com/jat255/paperless into jat255-ENH_text_consumer	2018-09-09 20:52:59 +01:00
Daniel Quinn	6b63ce9201	Fix pycodestyle complaints Apparently, pycodestyle updated itself to now check for invalid escape sequences, which only complain if the regex in use isn't a raw string (r"").	2018-09-09 20:00:12 +01:00
Joshua Taillon	5326895334	move date-matching regex pattern to base parser module for use by all subclasses	2018-09-05 21:13:36 -04:00
Joshua Taillon	98a437f78a	change tesseract parser to only convert first page to save (potentially) massive amounts of work	2018-09-05 15:18:35 -04:00
Daniel Quinn	bce2d3dd22	Account for KeyError problem in #345	2018-04-28 12:20:43 +01:00
Daniel Quinn	f3f86242de	Account for KeyError problem in #345	2018-04-28 12:19:53 +01:00
Ovv	32c440cbd9	Log detected document date with isoformat	2018-03-04 13:10:49 +01:00
Daniel Quinn	7c5ca5f505	Merge pull request #302 from BastianPoe/bugfix/extend_regex_to_find_more_dates Extends the regex to find dates in documents as reported by @isaacsando	2018-02-18 17:23:49 +01:00
Daniel Quinn	4f726e1991	Monitor return codes of calls to `convert` and `unpaper` ...and handle the failures nicely. Addresses #303.	2018-02-18 16:02:27 +00:00
Daniel Quinn	e53033d1b3	Rename .TEXT_CACHE to .text Properties should use snake_case, and only constants should be ALL_CAPS. This change also makes use of the convention of "private" properties being prefixed with `_`.	2018-02-18 16:00:43 +00:00
Daniel Quinn	3302ee2a78	Make isort happy	2018-02-18 16:00:03 +00:00
Daniel Quinn	caf44146db	Style and removal of Python 2.7 stuff	2018-02-18 15:55:55 +00:00
Wolf-Bastian Pöttner	5fed7ba6d4	Improved regular expression to only match for (unicode) characters in month names + parsed one regex match after another until one gave a parsable date	2018-02-14 21:41:04 +01:00
Wolf-Bastian Pöttner	3e65054e39	Extended exception handling	2018-02-12 22:43:16 +01:00
Wolf-Bastian Pöttner	c0c20f99e9	Added log output for date detected in document	2018-02-12 22:41:19 +01:00
Wolf-Bastian Pöttner	3899763261	Extends the regex to find dates in documents as reported by @isaacsando	2018-02-12 22:41:15 +01:00
Wolf-Bastian Pöttner	acfacaac4f	Added a text cache to optimize performance of date detection	2018-02-03 00:28:52 +01:00
Wolf-Bastian Pöttner	73d261484a	Merge branch 'master' of https://github.com/danielquinn/paperless into feature/heuristically-extract-date-from-document-text	2018-02-02 22:44:03 +01:00
Wolf-Bastian Pöttner	3dc730808e	Add support for using pre-existing text from PDFs	2018-02-02 22:37:58 +01:00
Matt	bc5c45a705	Fixing error sentinel for pdftotext when the PDF has no text (scanned images). It was causing a crash previously.	2018-02-01 10:08:57 -05:00
Daniel Quinn	269c32ce6a	Add support for using pre-existing text from PDFs	2018-01-30 20:13:35 +00:00
Wolf-Bastian Pöttner	21fc51c09a	Add support for a heuristic that extracts the document date from its text	2018-01-28 19:37:10 +01:00
Daniel Quinn	d2c283582b	feat: refactor for pluggable consumers I've broken out the OCR-specific code from the consumers and dumped it all into its own app, `paperless_tesseract`. This new app should serve as a sample of how to create one's own consumer for different file types. Documentation for how to do this isn't ready yet, but for the impatient: * Create a new app * containing a `parsers.py` for your parser modelled after `paperless_tesseract.parsers.RasterisedDocumentParser` * containing a `signals.py` with a handler moddelled after `paperless_tesseract.signals.ConsumerDeclaration` * connect the signal handler to `documents.signals.document_consumer_declaration` in `your_app.apps` * Install the app into Paperless by declaring `PAPERLESS_INSTALLED_APPS=your_app`. Additional apps should be separated with commas. * Restart the consumer	2017-03-25 15:10:25 +00:00

27 Commits