paperless-ngx

mirror of https://github.com/paperless-ngx/paperless-ngx.git synced 2025-12-29 13:48:09 -06:00

Author	SHA1	Message	Date
JensPfeifle	29b0886950	try to run convert, but fall back on gs if needed	2019-03-03 20:31:52 +01:00
JensPfeifle	ea282c22ba	Add GS_BINARY to settings to avoid hardcoded call of "gs"	2019-03-03 20:31:52 +01:00
Pit	cbf008f37b	Fix quoting in call to run_convert Co-Authored-By: JensPfeifle <jens@pfeifle.tech>	2019-03-03 20:31:52 +01:00
JensPfeifle	50504c3fd8	remove unnecessary env arg in Popen	2019-03-03 20:31:52 +01:00
Jens Pfeifle	0220199766	fix parse error of some documents by using gs	2019-03-03 20:31:52 +01:00
Daniel Quinn	637b0d4cc2	Drop problematic tests Some tests had differing outcomes depending on the version of Tesseract installed on the test system. This led to a bunch of false test failures, which led to people (including me) just ignoring the Travis results. This commit removes those tests, and while it reduces our coverage, at least the results are predictable.	2018-12-30 17:32:45 +00:00
Daniel Quinn	27af2603f5	Use modern languages for sample test files	2018-12-30 14:09:17 +00:00
Erik Arvstedt	a19f0ef97e	Fix date test sample image The previous version of `tests_date_3.png` had too much spacing between the `0` and the `8` glyphs, which resulted in the year getting parsed as `200 8` in Tesseract 3.05.00 (+ tessdata 3.04.00). This caused the date parsing test to fail.	2018-12-02 15:10:21 +01:00
Daniel Quinn	d544f269e0	Conform everything to the coding standards https://paperless.readthedocs.io/en/latest/contributing.html#additional-style-guides	2018-12-01 17:09:12 +00:00
Daniel Quinn	650db75c2b	Merge branch 'ENH_filename_date_parsing' of https://github.com/jat255/paperless into jat255-ENH_filename_date_parsing	2018-12-01 16:57:16 +00:00
Daniel Quinn	c1d18c1e83	Fix language guesses in tests It turns out that the Lorem ipsum text in the sample files was confusing the language guesser, causing it to think the file was in Catalan and not English or German.	2018-12-01 15:55:59 +00:00
Joshua Taillon	730daa3d6d	Merge branch 'master' of github.com:danielquinn/paperless into ENH_filename_date_parsing	2018-11-15 23:17:59 -05:00
Joshua Taillon	e1d8744c66	Add option for parsing of date from filename (and associated tests)	2018-11-15 20:32:15 -05:00
Joshua Taillon	4409f65840	Update date tests to be more explicit with settings and allow tests to pass if using a timezone other than UTC	2018-11-15 20:30:23 -05:00
Daniel Quinn	bd95804fbf	Merge pull request #421 from ddddavidmartin/clarify_forgiving_ocr_handling Clarify forgiving ocr handling	2018-10-08 09:35:57 +00:00
David Martin	b350ec48b7	Mention FORGIVING_OCR config option when language detection fails. It is not obvious that PAPERLESS_FORGIVING_OCR lets document consumption proceed even if no language can be detected. Mentioning it in the actual error message in the log seems like the best way to make it clear.	2018-10-08 19:37:05 +11:00
David Martin	f948ee11be	Let unpaper overwrite temporary files. I'm not sure what the circumstances are, but it looks like unpaper can attempt to write a temporary file that already exists [0]. This then fails the consumption. As per daedadu's comment simply letting unpaper overwrite files fixes this. [0] unpaper: error: output file '/tmp/paperless/paperless-pjkrcr4l/convert-0000.unpaper.pnm' already present. See https://web.archive.org/web/20181008081515/https://github.com/danielquinn/paperless/issues/406#issue-360651630	2018-10-08 19:12:11 +11:00
Daniel Quinn	750ab5bf85	Use optipng to optimise document thumbnails	2018-10-07 14:56:38 +01:00
Daniel Quinn	2a3f766b93	Consolidate get_date onto the DocumentParser parent class	2018-10-07 14:56:02 +01:00
Daniel Quinn	8010d72f18	Tweak the date guesser to not allow dates prior to 1900 (#414)	2018-10-01 20:03:47 +01:00
Daniel Quinn	117d7dad04	Improve the unknown language error message	2018-09-23 12:41:14 +01:00
Daniel Quinn	46cbd10ba0	Merge pull request #399 from jat255/ENH_convert_only_one_page Speed up thumbnail generation for PDFs	2018-09-09 21:12:42 +01:00
Daniel Quinn	c99f5923d5	Rename `parsers` to `DATE_REGEX` In moving the `parsers` variable into the package-level, it lost the context, so a more descriptive name was needed.	2018-09-09 21:02:30 +01:00
Daniel Quinn	2dc35cc856	Merge branch 'ENH_text_consumer' of git://github.com/jat255/paperless into jat255-ENH_text_consumer	2018-09-09 20:52:59 +01:00
Daniel Quinn	5342db6ada	Fix pycodestyle complaints Apparently, pycodestyle updated itself to now check for invalid escape sequences, which only complain if the regex in use isn't a raw string (r"").	2018-09-09 20:00:12 +01:00
Joshua Taillon	72c828170e	move date-matching regex pattern to base parser module for use by all subclasses	2018-09-05 21:13:36 -04:00
Joshua Taillon	cac63494f0	change tesseract parser to only convert first page to save (potentially) massive amounts of work	2018-09-05 15:18:35 -04:00
Erik Arvstedt	be2cbebaf7	Stop tests from writing to the source tree	2018-07-19 23:48:23 +02:00
Daniel Quinn	82f9dde055	Account for KeyError problem in #345	2018-04-28 12:20:43 +01:00
Daniel Quinn	c983e73d0f	Account for KeyError problem in #345	2018-04-28 12:19:53 +01:00
Ovv	75ac8d2796	Log detected document date with isoformat	2018-03-04 13:10:49 +01:00
Wolf-Bastian Pöttner	fba58f3bdd	Increase test coverage by testing two more date detection cases	2018-02-19 21:36:48 +01:00
Daniel Quinn	6662ca3467	Fix formatting	2018-02-18 18:00:34 +00:00
Daniel Quinn	6f1ed89e26	Fix tests to use _text instead of TEXT_CACHE	2018-02-18 18:00:22 +00:00
Daniel Quinn	5d01410dc0	Merge pull request #302 from BastianPoe/bugfix/extend_regex_to_find_more_dates Extends the regex to find dates in documents as reported by @isaacsando	2018-02-18 17:23:49 +01:00
Daniel Quinn	ea6d040809	Monitor return codes of calls to `convert` and `unpaper` ...and handle the failures nicely. Addresses #303.	2018-02-18 16:02:27 +00:00
Daniel Quinn	8e9d5caa37	Rename .TEXT_CACHE to .text Properties should use snake_case, and only constants should be ALL_CAPS. This change also makes use of the convention of "private" properties being prefixed with `_`.	2018-02-18 16:00:43 +00:00
Daniel Quinn	122aa2b9f1	Make isort happy	2018-02-18 16:00:03 +00:00
Daniel Quinn	fb1da4834c	Style and removal of Python 2.7 stuff	2018-02-18 15:55:55 +00:00
Wolf-Bastian Pöttner	96c7222269	Improved regular expression to only match for (unicode) characters in month names + parsed one regex match after another until one gave a parsable date	2018-02-14 21:41:04 +01:00
Wolf-Bastian Pöttner	1737e27b34	Add more (fast-running) unit tests	2018-02-14 21:41:01 +01:00
Wolf-Bastian Pöttner	39f198138a	Extended exception handling	2018-02-12 22:43:16 +01:00
Wolf-Bastian Pöttner	c74bb84c83	Added log output for date detected in document	2018-02-12 22:41:19 +01:00
Wolf-Bastian Pöttner	07d06d9aee	Extends the regex to find dates in documents as reported by @isaacsando	2018-02-12 22:41:15 +01:00
Daniel Quinn	73163d893f	No need to extend object	2018-02-03 15:26:28 +00:00
Daniel Quinn	c90ed2da1d	Rework tests to write to /tmp Originally the test wrote scratch data inside the repo dir, which meant manual cleanup. Now it writes to `/tmp/paperless-tests-<random-string>` and cleans up after itself.	2018-02-03 14:49:48 +00:00
Wolf-Bastian Pöttner	40f8ba23a4	Added a text cache to optimize performance of date detection	2018-02-03 00:28:52 +01:00
Wolf-Bastian Pöttner	bef2d94374	Add test cases for date parsing	2018-02-03 00:28:49 +01:00
Wolf-Bastian Pöttner	f39c7654a0	Merge branch 'master' of https://github.com/danielquinn/paperless into feature/heuristically-extract-date-from-document-text	2018-02-02 22:44:03 +01:00
Wolf-Bastian Pöttner	87e466c47c	Add support for using pre-existing text from PDFs	2018-02-02 22:37:58 +01:00

107 Commits