paperless-ngx

mirror of https://github.com/paperless-ngx/paperless-ngx.git synced 2026-02-22 00:49:35 -06:00

Author	SHA1	Message	Date
Jonas Winkler	bd04c966c5	first version of the new consumer.	2020-11-16 18:26:54 +01:00
Jonas Winkler	eb6805e37e	code style fixes	2020-11-12 21:09:45 +01:00
Jonas Winkler	340f9f141f	fixed most of the tests	2020-11-02 19:42:23 +01:00
Jonas Winkler	d42979842e	made unpaper and convert a little bit nicer to interact with	2020-11-02 19:31:04 +01:00
Jonas Winkler	a89773ad71	removed unused code, small fixes	2020-11-02 18:20:04 +01:00
Jonas Winkler	def3a85858	reworked most of the tesseract parser, better logging	2020-11-02 15:40:44 +01:00
Jonas Winkler	972a6a2333	bugfix	2020-11-02 01:26:42 +01:00
Jonas Winkler	6adc870a20	silenced unpaper, optipng for cleaner output; moved parser settings to settings; removed forgiving ocr (now default) since tesseract is plenty accurate even without defining the correct language.	2020-11-01 23:23:42 +01:00
Jonas Winkler	0f4094f3ca	better thumbnail generation for smaller files	2020-10-26 01:05:23 +01:00
Johannes Wienke	ebcfcea05b	Handle dateparser ValueErrors When parsing dates from the document text or filenames, correctly handle value errors indicating broken dates. Newly added tests ensure that this handling works properly.	2020-03-08 18:44:15 +01:00
Johannes Wienke	6531a67940	Remove duplicated date parsing test The exact same tests existed twice in the file.	2020-03-08 18:26:29 +01:00
Stéphane Brunner	3fab354a6e	Strip the thumbnails	2019-03-17 16:37:47 +01:00
jenspfeifle	5c40da1a48	make pycodestyle happy	2019-03-03 20:41:17 +01:00
JensPfeifle	078d66b077	try to run convert, but fall back on gs if needed	2019-03-03 20:31:52 +01:00
JensPfeifle	4c64ea0404	Add GS_BINARY to settings to avoid hardcoded call of "gs"	2019-03-03 20:31:52 +01:00
Pit	99718bcf17	Fix quoting in call to run_convert Co-Authored-By: JensPfeifle <jens@pfeifle.tech>	2019-03-03 20:31:52 +01:00
JensPfeifle	3dfd0253ed	remove unnecessary env arg in Popen	2019-03-03 20:31:52 +01:00
Jens Pfeifle	6ab21afeb6	fix parse error of some documents by using gs	2019-03-03 20:31:52 +01:00
Daniel Quinn	e395b0e081	Drop problematic tests Some tests had differing outcomes depending on the version of Tesseract installed on the test system. This led to a bunch of false test failures, which led to people (including me) just ignoring the Travis results. This commit removes those tests, and while it reduces our coverage, at least the results are predictable.	2018-12-30 17:32:45 +00:00
Daniel Quinn	86b0d08377	Use modern languages for sample test files	2018-12-30 14:09:17 +00:00
Erik Arvstedt	f38ac7f62b	Fix date test sample image The previous version of `tests_date_3.png` had too much spacing between the `0` and the `8` glyphs, which resulted in the year getting parsed as `200 8` in Tesseract 3.05.00 (+ tessdata 3.04.00). This caused the date parsing test to fail.	2018-12-02 15:10:21 +01:00
Daniel Quinn	0d59844567	Conform everything to the coding standards https://paperless.readthedocs.io/en/latest/contributing.html#additional-style-guides	2018-12-01 17:09:12 +00:00
Daniel Quinn	4e186ede0e	Merge branch 'ENH_filename_date_parsing' of https://github.com/jat255/paperless into jat255-ENH_filename_date_parsing	2018-12-01 16:57:16 +00:00
Daniel Quinn	9c6b8629a3	Fix language guesses in tests It turns out that the Lorem ipsum text in the sample files was confusing the language guesser, causing it to think the file was in Catalan and not English or German.	2018-12-01 15:55:59 +00:00
Joshua Taillon	b0326b5a19	Merge branch 'master' of github.com:danielquinn/paperless into ENH_filename_date_parsing	2018-11-15 23:17:59 -05:00
Joshua Taillon	a2422cc529	Add option for parsing of date from filename (and associated tests)	2018-11-15 20:32:15 -05:00
Joshua Taillon	8b69aa1e52	Update date tests to be more explicit with settings and allow tests to pass if using a timezone other than UTC	2018-11-15 20:30:23 -05:00
Daniel Quinn	3952c6d921	Merge pull request #421 from ddddavidmartin/clarify_forgiving_ocr_handling Clarify forgiving ocr handling	2018-10-08 09:35:57 +00:00
David Martin	b0afa37ec1	Mention FORGIVING_OCR config option when language detection fails. It is not obvious that the PAPERLESS_FORGIVING_OCR allows to let document consumption happen even if no language can be detected. Mentioning it in the actual error message in the log seems like the best way to make it clear.	2018-10-08 19:37:05 +11:00
David Martin	7022c98aab	Let unpaper overwrite temporary files. I'm not sure what the circumstances are, but it looks like unpaper can attempt to write a temporary file that already exists [0]. This then fails the consumption. As per daedadu's comment simply letting unpaper overwrite files fixes this. [0] unpaper: error: output file '/tmp/paperless/paperless-pjkrcr4l/convert-0000.unpaper.pnm' already present. See https://web.archive.org/web/20181008081515/https://github.com/danielquinn/paperless/issues/406#issue-360651630	2018-10-08 19:12:11 +11:00
Daniel Quinn	bc898c1992	Use optipng to optimise document thumbnails	2018-10-07 14:56:38 +01:00
Daniel Quinn	074609e1fc	Consolidate get_date onto the DocumentParser parent class	2018-10-07 14:56:02 +01:00
Daniel Quinn	0a4338143a	Tweak the date guesser to not allow dates prior to 1900 (#414)	2018-10-01 20:03:47 +01:00
Daniel Quinn	52bfeb2ad0	Improve the unknown language error message	2018-09-23 12:41:14 +01:00
Daniel Quinn	21e53aa55c	Merge pull request #399 from jat255/ENH_convert_only_one_page Speed up thumbnail generation for PDFs	2018-09-09 21:12:42 +01:00
Daniel Quinn	ef7f98281d	Rename `parsers` to `DATE_REGEX` In moving the `parsers` variable into the package-level, it lost the context, so a more descriptive name was needed.	2018-09-09 21:02:30 +01:00
Daniel Quinn	a3158eedf9	Merge branch 'ENH_text_consumer' of git://github.com/jat255/paperless into jat255-ENH_text_consumer	2018-09-09 20:52:59 +01:00
Daniel Quinn	6b63ce9201	Fix pycodestyle complaints Apparently, pycodestyle updated itself to now check for invalid escape sequences, which only complain if the regex in use isn't a raw string (r"").	2018-09-09 20:00:12 +01:00
Joshua Taillon	5326895334	move date-matching regex pattern to base parser module for use by all subclasses	2018-09-05 21:13:36 -04:00
Joshua Taillon	98a437f78a	change tesseract parser to only convert first page to save (potentially) massive amounts of work	2018-09-05 15:18:35 -04:00
Erik Arvstedt	4fa9ff60fc	Stop tests from writing to the source tree	2018-07-19 23:48:23 +02:00
Daniel Quinn	bce2d3dd22	Account for KeyError problem in #345	2018-04-28 12:20:43 +01:00
Daniel Quinn	f3f86242de	Account for KeyError problem in #345	2018-04-28 12:19:53 +01:00
Ovv	32c440cbd9	Log detected document date with isoformat	2018-03-04 13:10:49 +01:00
Wolf-Bastian Pöttner	328330eb08	Increase testcoverage by testing two more date detection cases	2018-02-19 21:36:48 +01:00
Daniel Quinn	fc6d2d5e0c	Fix formatting	2018-02-18 18:00:34 +00:00
Daniel Quinn	9e26e7b39e	Fix tests to use _text instead of TEXT_CACHE	2018-02-18 18:00:22 +00:00
Daniel Quinn	7c5ca5f505	Merge pull request #302 from BastianPoe/bugfix/extend_regex_to_find_more_dates Extends the regex to find dates in documents as reported by @isaacsando	2018-02-18 17:23:49 +01:00
Daniel Quinn	4f726e1991	Monitor return codes of calls to `convert` and `unpaper` ...and handle the failures nicely. Addresses #303.	2018-02-18 16:02:27 +00:00
Daniel Quinn	e53033d1b3	Rename .TEXT_CACHE to .text Properties should use snake_case, and only constants should be ALL_CAPS. This change also makes use of the convention of "private" properties being prefixed with `_`.	2018-02-18 16:00:43 +00:00

70 Commits