paperless-ngx

mirror of https://github.com/paperless-ngx/paperless-ngx.git synced 2026-01-24 22:39:02 -06:00

Author	SHA1	Message	Date
Daniel Quinn	c1d18c1e83	Fix language guesses in tests It turns out that the Lorem ipsum text in the sample files was confusing the language guesser, causing it to think the file was in Catalan and not English or German.	2018-12-01 15:55:59 +00:00
Daniel Quinn	bd95804fbf	Merge pull request #421 from ddddavidmartin/clarify_forgiving_ocr_handling Clarify forgiving ocr handling	2018-10-08 09:35:57 +00:00
David Martin	b350ec48b7	Mention FORGIVING_OCR config option when language detection fails. It is not obvious that PAPERLESS_FORGIVING_OCR lets document consumption proceed even if no language can be detected. Mentioning it in the actual error message in the log seems like the best way to make it clear.	2018-10-08 19:37:05 +11:00
David Martin	f948ee11be	Let unpaper overwrite temporary files. I'm not sure what the circumstances are, but it looks like unpaper can attempt to write a temporary file that already exists [0]. This then fails the consumption. As per daedadu's comment simply letting unpaper overwrite files fixes this. [0] unpaper: error: output file '/tmp/paperless/paperless-pjkrcr4l/convert-0000.unpaper.pnm' already present. See https://web.archive.org/web/20181008081515/https://github.com/danielquinn/paperless/issues/406#issue-360651630	2018-10-08 19:12:11 +11:00
Daniel Quinn	750ab5bf85	Use optipng to optimise document thumbnails	2018-10-07 14:56:38 +01:00
Daniel Quinn	2a3f766b93	Consolidate get_date onto the DocumentParser parent class	2018-10-07 14:56:02 +01:00
Daniel Quinn	8010d72f18	Tweak the date guesser to not allow dates prior to 1900 (#414)	2018-10-01 20:03:47 +01:00
Daniel Quinn	117d7dad04	Improve the unknown language error message	2018-09-23 12:41:14 +01:00
Daniel Quinn	46cbd10ba0	Merge pull request #399 from jat255/ENH_convert_only_one_page Speed up thumbnail generation for PDFs	2018-09-09 21:12:42 +01:00
Daniel Quinn	c99f5923d5	Rename `parsers` to `DATE_REGEX` In moving the `parsers` variable into the package-level, it lost the context, so a more descriptive name was needed.	2018-09-09 21:02:30 +01:00
Daniel Quinn	2dc35cc856	Merge branch 'ENH_text_consumer' of git://github.com/jat255/paperless into jat255-ENH_text_consumer	2018-09-09 20:52:59 +01:00
Daniel Quinn	5342db6ada	Fix pycodestyle complaints Apparently, pycodestyle updated itself to now check for invalid escape sequences, which it only complains about if the regex in use isn't a raw string (r"").	2018-09-09 20:00:12 +01:00
Joshua Taillon	72c828170e	move date-matching regex pattern to base parser module for use by all subclasses	2018-09-05 21:13:36 -04:00
Joshua Taillon	cac63494f0	change tesseract parser to only convert first page to save (potentially) massive amounts of work	2018-09-05 15:18:35 -04:00
Erik Arvstedt	be2cbebaf7	Stop tests from writing to the source tree	2018-07-19 23:48:23 +02:00
Daniel Quinn	82f9dde055	Account for KeyError problem in #345	2018-04-28 12:20:43 +01:00
Daniel Quinn	c983e73d0f	Account for KeyError problem in #345	2018-04-28 12:19:53 +01:00
Ovv	75ac8d2796	Log detected document date with isoformat	2018-03-04 13:10:49 +01:00
Wolf-Bastian Pöttner	fba58f3bdd	Increase test coverage by testing two more date detection cases	2018-02-19 21:36:48 +01:00
Daniel Quinn	6662ca3467	Fix formatting	2018-02-18 18:00:34 +00:00
Daniel Quinn	6f1ed89e26	Fix tests to use _text instead of TEXT_CACHE	2018-02-18 18:00:22 +00:00
Daniel Quinn	5d01410dc0	Merge pull request #302 from BastianPoe/bugfix/extend_regex_to_find_more_dates Extends the regex to find dates in documents as reported by @isaacsando	2018-02-18 17:23:49 +01:00
Daniel Quinn	ea6d040809	Monitor return codes of calls to `convert` and `unpaper` ...and handle the failures nicely. Addresses #303.	2018-02-18 16:02:27 +00:00
Daniel Quinn	8e9d5caa37	Rename .TEXT_CACHE to .text Properties should use snake_case, and only constants should be ALL_CAPS. This change also makes use of the convention of "private" properties being prefixed with `_`.	2018-02-18 16:00:43 +00:00
Daniel Quinn	122aa2b9f1	Make isort happy	2018-02-18 16:00:03 +00:00
Daniel Quinn	fb1da4834c	Style and removal of Python 2.7 stuff	2018-02-18 15:55:55 +00:00
Wolf-Bastian Pöttner	96c7222269	Improved regular expression to only match for (unicode) characters in month names + parsed one regex match after another until one gave a parsable date	2018-02-14 21:41:04 +01:00
Wolf-Bastian Pöttner	1737e27b34	Add more (fast-running) unit tests	2018-02-14 21:41:01 +01:00
Wolf-Bastian Pöttner	39f198138a	Extended exception handling	2018-02-12 22:43:16 +01:00
Wolf-Bastian Pöttner	c74bb84c83	Added log output for date detected in document	2018-02-12 22:41:19 +01:00
Wolf-Bastian Pöttner	07d06d9aee	Extends the regex to find dates in documents as reported by @isaacsando	2018-02-12 22:41:15 +01:00
Daniel Quinn	73163d893f	No need to extend object	2018-02-03 15:26:28 +00:00
Daniel Quinn	c90ed2da1d	Rework tests to write to /tmp Originally the test wrote scratch data inside the repo dir, which meant manual cleanup. Now it writes to `/tmp/paperless-tests-<random-string>` and cleans up after itself.	2018-02-03 14:49:48 +00:00
Wolf-Bastian Pöttner	40f8ba23a4	Added a text cache to optimize performance of date detection	2018-02-03 00:28:52 +01:00
Wolf-Bastian Pöttner	bef2d94374	Add test cases for date parsing	2018-02-03 00:28:49 +01:00
Wolf-Bastian Pöttner	f39c7654a0	Merge branch 'master' of https://github.com/danielquinn/paperless into feature/heuristically-extract-date-from-document-text	2018-02-02 22:44:03 +01:00
Wolf-Bastian Pöttner	87e466c47c	Add support for using pre-existing text from PDFs	2018-02-02 22:37:58 +01:00
Matt	ce98019b49	Fixing error sentinel for pdftotext when the PDF has no text (scanned images). It was causing a crash previously.	2018-02-01 10:08:57 -05:00
Daniel Quinn	cd92c005e3	Add support for using pre-existing text from PDFs	2018-01-30 20:13:35 +00:00
Wolf-Bastian Pöttner	b140935843	Add support for a heuristic that extracts the document date from its text	2018-01-28 19:37:10 +01:00
Daniel Quinn	bd67b53d50	Update test for #259 fix	2017-10-16 10:53:18 +01:00
Daniel Quinn	e32ed09da3	Support .jpeg as well as .jpg	2017-10-16 09:00:38 +01:00
Daniel Quinn	fa4924d5ba	fix: allow for caps in file name suffixes #206 @schinkelg ran aground of this one and I took the opportunity to add a test to catch this sort of thing for next time.	2017-03-28 21:14:24 +00:00
Daniel Quinn	55e81ca4bb	feat: refactor for pluggable consumers I've broken out the OCR-specific code from the consumers and dumped it all into its own app, `paperless_tesseract`. This new app should serve as a sample of how to create one's own consumer for different file types. Documentation for how to do this isn't ready yet, but for the impatient: * Create a new app * containing a `parsers.py` for your parser modelled after `paperless_tesseract.parsers.RasterisedDocumentParser` * containing a `signals.py` with a handler modelled after `paperless_tesseract.signals.ConsumerDeclaration` * connect the signal handler to `documents.signals.document_consumer_declaration` in `your_app.apps` * Install the app into Paperless by declaring `PAPERLESS_INSTALLED_APPS=your_app`. Additional apps should be separated with commas. * Restart the consumer	2017-03-25 15:10:25 +00:00

44 Commits
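
Commits 96c7222269 and 8010d72f18 above describe the date guesser as trying one regex match after another until one yields a parsable date, and rejecting dates before 1900. The following is a minimal sketch of that idea only, not the project's actual code: the names `DATE_PATTERN` and `guess_date`, the numeric-only pattern, and the day-first ordering are all assumptions made for illustration.

```python
import re
from datetime import datetime

# Hypothetical pattern (not the project's DATE_REGEX): numeric dates written
# day-first with ".", "/" or "-" as the separator, e.g. 01.12.2018.
DATE_PATTERN = re.compile(r"\b(\d{1,2})[./-](\d{1,2})[./-](\d{4})\b")


def guess_date(text):
    """Return the first match that parses as a date in 1900 or later, else None."""
    for match in DATE_PATTERN.finditer(text):
        day, month, year = (int(group) for group in match.groups())
        try:
            candidate = datetime(year, month, day)
        except ValueError:
            # Not a real calendar date (e.g. 99.99.2018); try the next match.
            continue
        if candidate.year < 1900:
            # Mirrors the "no dates prior to 1900" rule from the log.
            continue
        return candidate
    return None
```

For example, `guess_date("Ref 99.99.2018, dated 01.12.2018")` skips the unparsable first match and returns the second one, while a text whose only date is `05.03.1850` gives `None`.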