paperless-ngx

mirror of https://github.com/paperless-ngx/paperless-ngx.git synced 2026-02-09 23:49:29 -06:00

Author	SHA1	Message	Date
jonaswinkler	fd3df1ec58	some more tests.	2020-12-01 14:15:43 +01:00
jonaswinkler	aaa6599283	Merge branch 'dev' into feature-ocrmypdf	2020-11-30 16:48:09 +01:00
jonaswinkler	f51207fc32	added file type checks to the parsers to prevent temporary files from being consumed. Also: parsers announce file types they wish to use as default for each mime type.	2020-11-30 00:40:04 +01:00
jonaswinkler	ac1b701000	more tests!	2020-11-29 19:58:48 +01:00
jonaswinkler	fca98b411e	reorganised settings documentation and added OCR_USER_ARGS	2020-11-29 12:38:32 +01:00
jonaswinkler	0565118a01	fixed checking the installed languages.	2020-11-29 12:31:42 +01:00
jonaswinkler	06cfc3113a	test case fixes.	2020-11-27 14:06:37 +01:00
Jonas Winkler	e87575240d	more tests of the new parser	2020-11-26 00:08:23 +01:00
Jonas Winkler	f51d2be303	fixed the test cases	2020-11-25 19:51:09 +01:00
Jonas Winkler	a60a4babf6	OMP_THREAD_LIMIT	2020-11-25 19:37:59 +01:00
Jonas Winkler	a03315102a	added image DPI detection to the tesseract parser.	2020-11-25 19:37:48 +01:00
Jonas Winkler	df801d17e1	reworked the interface of the parsers.	2020-11-25 19:36:39 +01:00
Jonas Winkler	b269af7572	Merge branch 'dev' into feature-ocrmypdf	2020-11-25 16:58:20 +01:00
Jonas Winkler	d92214d412	codestyle	2020-11-25 16:05:52 +01:00
Jonas Winkler	56ce267f89	removed obsolete tests.	2020-11-25 14:51:32 +01:00
Jonas Winkler	2d559d330d	reworked PDF parser that uses OCRmyPDF and produces archive files.	2020-11-25 14:50:43 +01:00
Jonas Winkler	dd83364326	default language check	2020-11-25 10:52:38 +01:00
Jonas Winkler	fec9e54049	new setting: PAPERLESS_OCR_PAGES	2020-11-22 12:54:08 +01:00
Jonas Winkler	450fb877f6	code cleanup	2020-11-21 15:34:00 +01:00
Jonas Winkler	b44f8383e4	code cleanup	2020-11-21 14:03:45 +01:00
Jonas Winkler	41650f20f4	mime type handling	2020-11-20 13:31:03 +01:00
Jonas Winkler	1655d85a53	testing the tesseract parser	2020-11-19 20:31:08 +01:00
Jonas Winkler	8908bc259e	updated logging, logging for the mail consumer to see what's happening	2020-11-18 13:23:30 +01:00
Jonas Winkler	d2e22e3f27	Changed the way parsers are discovered. This also prepares for upcoming changes regarding content types and file types: parsers should declare what they support, and actual file extensions should not be hardcoded everywhere.	2020-11-16 23:53:12 +01:00
Jonas Winkler	8dca459573	first version of the new consumer.	2020-11-16 18:26:54 +01:00
Jonas Winkler	2e04ba1c04	code style fixes	2020-11-12 21:09:45 +01:00
Jonas Winkler	f182709fdd	fixed most of the tests	2020-11-02 19:42:23 +01:00
Jonas Winkler	3a08a2d206	made unpaper and convert a little bit nicer to interact with	2020-11-02 19:31:04 +01:00
Jonas Winkler	7d282a4e4e	removed unused code, small fixes	2020-11-02 18:20:04 +01:00
Jonas Winkler	d15405ef56	reworked most of the tesseract parser, better logging	2020-11-02 15:40:44 +01:00
Jonas Winkler	06ad212320	bugfix	2020-11-02 01:26:42 +01:00
Jonas Winkler	9f55fb668d	silenced unpaper, optipng for cleaner output; moved parser settings to settings; removed forgiving ocr (now default) since tesseract is plenty accurate even without defining the correct language.	2020-11-01 23:23:42 +01:00
Jonas Winkler	743ce1dc14	better thumbnail generation for smaller files	2020-10-26 01:05:23 +01:00
Johannes Wienke	a311cd498c	Handle dateparser ValueErrors. When parsing dates from the document text or filenames, correctly handle value errors indicating broken dates. Newly added tests ensure that this handling works properly.	2020-03-08 18:44:15 +01:00
Johannes Wienke	a3aab0cb48	Remove duplicated date parsing test. The exact same tests existed twice in the file.	2020-03-08 18:26:29 +01:00
Stéphane Brunner	daca77cc1b	Strip the thumbnails	2019-03-17 16:37:47 +01:00
jenspfeifle	336f747f16	make pycodestyle happy	2019-03-03 20:41:17 +01:00
JensPfeifle	29b0886950	try to run convert, but fall back on gs if needed	2019-03-03 20:31:52 +01:00
JensPfeifle	ea282c22ba	Add GS_BINARY to settings to avoid hardcoded call of "gs"	2019-03-03 20:31:52 +01:00
Pit	cbf008f37b	Fix quoting in call to run_convert. Co-Authored-By: JensPfeifle <jens@pfeifle.tech>	2019-03-03 20:31:52 +01:00
JensPfeifle	50504c3fd8	remove unnecessary env arg in Popen	2019-03-03 20:31:52 +01:00
Jens Pfeifle	0220199766	fix parse error of some documents by using gs	2019-03-03 20:31:52 +01:00
Daniel Quinn	637b0d4cc2	Drop problematic tests. Some tests had differing outcomes depending on the version of Tesseract installed on the test system. This led to a bunch of false test failures, which led to people (including me) just ignoring the Travis results. This commit removes those tests, and while it reduces our coverage, at least the results are predictable.	2018-12-30 17:32:45 +00:00
Daniel Quinn	27af2603f5	Use modern languages for sample test files	2018-12-30 14:09:17 +00:00
Erik Arvstedt	a19f0ef97e	Fix date test sample image. The previous version of `tests_date_3.png` had too much spacing between the `0` and the `8` glyphs, which resulted in the year getting parsed as `200 8` in Tesseract 3.05.00 (+ tessdata 3.04.00). This caused the date parsing test to fail.	2018-12-02 15:10:21 +01:00
Daniel Quinn	d544f269e0	Conform everything to the coding standards https://paperless.readthedocs.io/en/latest/contributing.html#additional-style-guides	2018-12-01 17:09:12 +00:00
Daniel Quinn	650db75c2b	Merge branch 'ENH_filename_date_parsing' of https://github.com/jat255/paperless into jat255-ENH_filename_date_parsing	2018-12-01 16:57:16 +00:00
Daniel Quinn	c1d18c1e83	Fix language guesses in tests. It turns out that the Lorem ipsum text in the sample files was confusing the language guesser, causing it to think the file was in Catalan and not English or German.	2018-12-01 15:55:59 +00:00
Joshua Taillon	730daa3d6d	Merge branch 'master' of github.com:danielquinn/paperless into ENH_filename_date_parsing	2018-11-15 23:17:59 -05:00
Joshua Taillon	e1d8744c66	Add option for parsing of date from filename (and associated tests)	2018-11-15 20:32:15 -05:00

94 Commits