JensPfeifle 
							
						 
					 
					
						
						
							
						
						ea282c22ba 
					 
					
						
						
							
							Add GS_BINARY to settings to avoid hardcoded call of "gs"  
						
						
						
						
					 
					
						2019-03-03 20:31:52 +01:00 
						 
				 
			
				
					
						
							
							
								Pit 
							
						 
					 
					
						
						
							
						
						cbf008f37b 
					 
					
						
						
							
							Fix quoting in call to run_convert  
						
						
						
						
						Co-Authored-By: JensPfeifle <jens@pfeifle.tech> 
						
						
					 
					
						2019-03-03 20:31:52 +01:00 
						 
				 
			
				
					
						
							
							
								JensPfeifle 
							
						 
					 
					
						
						
							
						
						50504c3fd8 
					 
					
						
						
							
							remove unnecessary env arg in Popen  
						
						
						
						
					 
					
						2019-03-03 20:31:52 +01:00 
						 
				 
			
				
					
						
							
							
								Jens Pfeifle 
							
						 
					 
					
						
						
							
						
						0220199766 
					 
					
						
						
							
							fix parse error of some documents by using gs  
						
						
						
						
					 
					
						2019-03-03 20:31:52 +01:00 
						 
				 
			
				
					
						
							
							
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						637b0d4cc2 
					 
					
						
						
							
							Drop problematic tests  
						
						
						
						
						Some tests had differing outcomes depending on the version of Tesseract
installed on the test system.  This led to a bunch of false test
failures, which led to people (including me) just ignoring the Travis
results.
This commit removes those tests, and while it reduces our coverage, at
least the results are predictable. 
						
						
					 
					
						2018-12-30 17:32:45 +00:00 
						 
				 
			
				
					
						
							
							
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						27af2603f5 
					 
					
						
						
							
							Use modern languages for sample test files  
						
						
						
						
					 
					
						2018-12-30 14:09:17 +00:00 
						 
				 
			
				
					
						
							
							
								Erik Arvstedt 
							
						 
					 
					
						
						
							
						
						a19f0ef97e 
					 
					
						
						
							
							Fix date test sample image  
						
						
						
						
						The previous version of `tests_date_3.png` had too much spacing
between the `0` and the `8` glyphs, which resulted in the year getting
parsed as `200 8` in Tesseract 3.05.00 (+ tessdata 3.04.00).
This caused the date parsing test to fail. 
						
						
					 
					
						2018-12-02 15:10:21 +01:00 
						 
				 
			
				
					
						
							
							
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						d544f269e0 
					 
					
						
						
							
							Conform everything to the coding standards  
						
						
						
						
						https://paperless.readthedocs.io/en/latest/contributing.html#additional-style-guides  
					
						2018-12-01 17:09:12 +00:00 
						 
				 
			
				
					
						
							
							
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						650db75c2b 
					 
					
						
						
							
							Merge branch 'ENH_filename_date_parsing' of  https://github.com/jat255/paperless  into jat255-ENH_filename_date_parsing  
						
						
						
						
					 
					
						2018-12-01 16:57:16 +00:00 
						 
				 
			
				
					
						
							
							
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						c1d18c1e83 
					 
					
						
						
							
							Fix language guesses in tests  
						
						
						
						
						It turns out that the Lorem ipsum text in the sample files was confusing the language guesser, causing it to think the file was in Catalan and not English or German. 
						
						
					 
					
						2018-12-01 15:55:59 +00:00 
						 
				 
			
				
					
						
							
							
								Joshua Taillon 
							
						 
					 
					
						
						
							
						
						730daa3d6d 
					 
					
						
						
							
							Merge branch 'master' of github.com:danielquinn/paperless into ENH_filename_date_parsing  
						
						
						
						
					 
					
						2018-11-15 23:17:59 -05:00 
						 
				 
			
				
					
						
							
							
								Joshua Taillon 
							
						 
					 
					
						
						
							
						
						e1d8744c66 
					 
					
						
						
							
							Add option for parsing of date from filename (and associated tests)  
						
						
						
						
					 
					
						2018-11-15 20:32:15 -05:00 
						 
				 
			
				
					
						
							
							
								Joshua Taillon 
							
						 
					 
					
						
						
							
						
						4409f65840 
					 
					
						
						
							
							Update date tests to be more explicit with settings and allow tests to pass if using a timezone other than UTC  
						
						
						
						
					 
					
						2018-11-15 20:30:23 -05:00 
						 
				 
			
				
					
						
							
							
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						bd95804fbf 
					 
					
						
						
							
							Merge pull request  #421  from ddddavidmartin/clarify_forgiving_ocr_handling  
						
						
						
						
						Clarify forgiving ocr handling 
						
						
					 
					
						2018-10-08 09:35:57 +00:00 
						 
				 
			
				
					
						
							
							
								David Martin 
							
						 
					 
					
						
						
							
						
						b350ec48b7 
					 
					
						
						
							
							Mention FORGIVING_OCR config option when language detection fails.  
						
						
						
						
						It is not obvious that PAPERLESS_FORGIVING_OCR lets
document consumption proceed even if no language can be detected.
Mentioning it in the actual error message in the log seems like the best
way to make it clear. 
						
						
					 
					
						2018-10-08 19:37:05 +11:00 
						 
				 
			
				
					
						
							
							
								David Martin 
							
						 
					 
					
						
						
							
						
						f948ee11be 
					 
					
						
						
							
							Let unpaper overwrite temporary files.  
						
						
						
						
						I'm not sure what the circumstances are, but it looks like unpaper can
attempt to write a temporary file that already exists [0]. This then
fails the consumption. As per daedadu's comment simply letting unpaper
overwrite files fixes this.
[0]
unpaper: error: output file '/tmp/paperless/paperless-pjkrcr4l/convert-0000.unpaper.pnm' already present.
See https://web.archive.org/web/20181008081515/https://github.com/danielquinn/paperless/issues/406#issue-360651630  
						
						
					 
					
						2018-10-08 19:12:11 +11:00 
						 
				 
			
				
					
						
							
							
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						750ab5bf85 
					 
					
						
						
							
							Use optipng to optimise document thumbnails  
						
						
						
						
					 
					
						2018-10-07 14:56:38 +01:00 
						 
				 
			
				
					
						
							
							
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						2a3f766b93 
					 
					
						
						
							
							Consolidate get_date onto the DocumentParser parent class  
						
						
						
						
					 
					
						2018-10-07 14:56:02 +01:00 
						 
				 
			
				
					
						
							
							
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						8010d72f18 
					 
					
						
						
							
							Tweak the date guesser to not allow dates prior to 1900 ( #414 )  
						
						
						
						
					 
					
						2018-10-01 20:03:47 +01:00 
						 
				 
			
				
					
						
							
							
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						117d7dad04 
					 
					
						
						
							
							Improve the unknown language error message  
						
						
						
						
					 
					
						2018-09-23 12:41:14 +01:00 
						 
				 
			
				
					
						
							
							
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						46cbd10ba0 
					 
					
						
						
							
							Merge pull request  #399  from jat255/ENH_convert_only_one_page  
						
						
						
						
						Speed up thumbnail generation for PDFs 
						
						
					 
					
						2018-09-09 21:12:42 +01:00 
						 
				 
			
				
					
						
							
							
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						c99f5923d5 
					 
					
						
						
							
							Rename parsers to DATE_REGEX  
						
						
						
						
						In moving the `parsers` variable into the package-level, it lost the
context, so a more descriptive name was needed. 
						
						
					 
					
						2018-09-09 21:02:30 +01:00 
						 
				 
			
				
					
						
							
							
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						2dc35cc856 
					 
					
						
						
							
							Merge branch 'ENH_text_consumer' of git://github.com/jat255/paperless into jat255-ENH_text_consumer  
						
						
						
						
					 
					
						2018-09-09 20:52:59 +01:00 
						 
				 
			
				
					
						
							
							
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						5342db6ada 
					 
					
						
						
							
							Fix pycodestyle complaints  
						
						
						
						
						Apparently, pycodestyle updated itself to now check for invalid escape
sequences, and it only complains if the regex in use isn't a raw string
(r""). 
						
						
					 
					
						2018-09-09 20:00:12 +01:00 
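The raw-string point in the pycodestyle commit above can be illustrated with a minimal sketch. The pattern here is hypothetical and is not the project's actual date regex; it only shows that an `r""` literal passes backslash sequences such as `\d` and `\b` to the `re` module untouched, so the source contains no invalid string escapes.

```python
import re

# Hypothetical pattern for illustration only (not the project's real regex).
# The r"" prefix keeps \b and \d as literal backslash sequences for `re`.
PATTERN = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")

match = PATTERN.search("Invoice dated 03.02.2018")
day, month, year = match.groups()
print(day, month, year)

# A raw string is equivalent to doubling every backslash by hand.
same = r"\d" == "\\d"
```

Writing the same pattern without the `r` prefix would still compile today, but it relies on Python leaving unknown escapes alone, which is what the style checker flags.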
						 
				 
			
				
					
						
							
							
								Joshua Taillon 
							
						 
					 
					
						
						
							
						
						72c828170e 
					 
					
						
						
							
							move date-matching regex pattern to base parser module for use by all subclasses  
						
						
						
						
					 
					
						2018-09-05 21:13:36 -04:00 
						 
				 
			
				
					
						
							
							
								Joshua Taillon 
							
						 
					 
					
						
						
							
						
						cac63494f0 
					 
					
						
						
							
							change tesseract parser to only convert first page to save (potentially) massive amounts of work  
						
						
						
						
					 
					
						2018-09-05 15:18:35 -04:00 
						 
				 
			
				
					
						
							
							
								Erik Arvstedt 
							
						 
					 
					
						
						
							
						
						be2cbebaf7 
					 
					
						
						
							
							Stop tests from writing to the source tree  
						
						
						
						
					 
					
						2018-07-19 23:48:23 +02:00 
						 
				 
			
				
					
						
							
							
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						82f9dde055 
					 
					
						
						
							
							Account for KeyError problem in  #345  
						
						
						
						
					 
					
						2018-04-28 12:20:43 +01:00 
						 
				 
			
				
					
						
							
							
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						c983e73d0f 
					 
					
						
						
							
							Account for KeyError problem in  #345  
						
						
						
						
					 
					
						2018-04-28 12:19:53 +01:00 
						 
				 
			
				
					
						
							
							
								Ovv 
							
						 
					 
					
						
						
							
						
						75ac8d2796 
					 
					
						
						
							
							Log detected document date with isoformat  
						
						
						
						
					 
					
						2018-03-04 13:10:49 +01:00 
						 
				 
			
				
					
						
							
							
								Wolf-Bastian Pöttner 
							
						 
					 
					
						
						
							
						
						fba58f3bdd 
					 
					
						
						
							
							Increase test coverage by testing two more date detection cases  
						
						
						
						
					 
					
						2018-02-19 21:36:48 +01:00 
						 
				 
			
				
					
						
							
							
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						6662ca3467 
					 
					
						
						
							
							Fix formatting  
						
						
						
						
					 
					
						2018-02-18 18:00:34 +00:00 
						 
				 
			
				
					
						
							
							
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						6f1ed89e26 
					 
					
						
						
							
							Fix tests to use _text instead of TEXT_CACHE  
						
						
						
						
					 
					
						2018-02-18 18:00:22 +00:00 
						 
				 
			
				
					
						
							
							
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						5d01410dc0 
					 
					
						
						
							
							Merge pull request  #302  from BastianPoe/bugfix/extend_regex_to_find_more_dates  
						
						
						
						
						Extends the regex to find dates in documents as reported by @isaacsando 
						
						
					 
					
						2018-02-18 17:23:49 +01:00 
						 
				 
			
				
					
						
							
							
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						ea6d040809 
					 
					
						
						
							
							Monitor return codes of calls to convert and unpaper  
						
						
						
						
						...and handle the failures nicely.  Addresses #303 . 
						
						
					 
					
						2018-02-18 16:02:27 +00:00 
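As a rough sketch of what monitoring the return code of an external tool can look like, the snippet below waits for a child process and raises on a non-zero exit. The names `ParseError` and `run_checked` are assumptions made for this illustration, and the Python interpreter stands in for `convert` or `unpaper` so the example runs anywhere; it is not the code from this commit.

```python
import subprocess
import sys


class ParseError(Exception):
    """Hypothetical error type raised when an external tool fails."""


def run_checked(args):
    """Run an external command and raise ParseError on a non-zero exit."""
    exit_code = subprocess.Popen(args).wait()
    if exit_code != 0:
        raise ParseError("{} exited with code {}".format(args[0], exit_code))


# Stand-in for a real tool invocation: a child process that exits cleanly.
run_checked([sys.executable, "-c", "pass"])
```

Raising a dedicated exception lets the caller log the failure and skip the document instead of carrying on with missing output.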
						 
				 
			
				
					
						
							
							
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						8e9d5caa37 
					 
					
						
						
							
							Rename .TEXT_CACHE to .text  
						
						
						
						
						Properties should use snake_case, and only constants should be ALL_CAPS.
This change also makes use of the convention of "private" properties
being prefixed with `_`. 
						
						
					 
					
						2018-02-18 16:00:43 +00:00 
						 
				 
			
				
					
						
							
							
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						122aa2b9f1 
					 
					
						
						
							
							Make isort happy  
						
						
						
						
					 
					
						2018-02-18 16:00:03 +00:00 
						 
				 
			
				
					
						
							
							
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						fb1da4834c 
					 
					
						
						
							
							Style and removal of Python 2.7 stuff  
						
						
						
						
					 
					
						2018-02-18 15:55:55 +00:00 
						 
				 
			
				
					
						
							
							
								Wolf-Bastian Pöttner 
							
						 
					 
					
						
						
							
						
						96c7222269 
					 
					
						
						
							
							Improved regular expression to only match for (unicode) characters in month names + parsed one regex match after another until one gave a parsable date  
						
						
						
						
					 
					
						2018-02-14 21:41:04 +01:00 
						 
				 
			
				
					
						
							
							
								Wolf-Bastian Pöttner 
							
						 
					 
					
						
						
							
						
						1737e27b34 
					 
					
						
						
							
							Add more (fast-running) unit tests  
						
						
						
						
					 
					
						2018-02-14 21:41:01 +01:00 
						 
				 
			
				
					
						
							
							
								Wolf-Bastian Pöttner 
							
						 
					 
					
						
						
							
						
						39f198138a 
					 
					
						
						
							
							Extended exception handling  
						
						
						
						
					 
					
						2018-02-12 22:43:16 +01:00 
						 
				 
			
				
					
						
							
							
								Wolf-Bastian Pöttner 
							
						 
					 
					
						
						
							
						
						c74bb84c83 
					 
					
						
						
							
							Added log output for date detected in document  
						
						
						
						
					 
					
						2018-02-12 22:41:19 +01:00 
						 
				 
			
				
					
						
							
							
								Wolf-Bastian Pöttner 
							
						 
					 
					
						
						
							
						
						07d06d9aee 
					 
					
						
						
							
							Extends the regex to find dates in documents as reported by @isaacsando  
						
						
						
						
					 
					
						2018-02-12 22:41:15 +01:00 
						 
				 
			
				
					
						
							
							
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						73163d893f 
					 
					
						
						
							
							No need to extend object  
						
						
						
						
					 
					
						2018-02-03 15:26:28 +00:00 
						 
				 
			
				
					
						
							
							
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						c90ed2da1d 
					 
					
						
						
							
							Rework tests to write to /tmp  
						
						
						
						
						Originally the test wrote scratch data inside the repo dir, which meant
manual cleanup.  Now it writes to `/tmp/paperless-tests-<random-string>`
and cleans up after itself. 
						
						
					 
					
						2018-02-03 14:49:48 +00:00 
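The approach described in the commit above, writing scratch data to a temporary directory and cleaning up afterwards, might look roughly like this. The class name and test body are hypothetical; only the `paperless-tests-` prefix comes from the commit message, and the actual location depends on the system's temp directory.

```python
import os
import shutil
import tempfile
import unittest


class ScratchDirTestCase(unittest.TestCase):
    """Hypothetical test case that keeps scratch files out of the repo."""

    def setUp(self):
        # Creates e.g. /tmp/paperless-tests-<random-string> on most systems.
        self.scratch = tempfile.mkdtemp(prefix="paperless-tests-")

    def tearDown(self):
        # Remove the directory and everything the test wrote into it.
        shutil.rmtree(self.scratch, ignore_errors=True)

    def test_writes_stay_in_scratch(self):
        path = os.path.join(self.scratch, "sample.txt")
        with open(path, "w") as handle:
            handle.write("scratch data")
        self.assertTrue(os.path.exists(path))
```

Because `tearDown` runs even when a test fails, no manual cleanup is needed between runs.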
						 
				 
			
				
					
						
							
							
								Wolf-Bastian Pöttner 
							
						 
					 
					
						
						
							
						
						40f8ba23a4 
					 
					
						
						
							
							Added a text cache to optimize performance of date detection  
						
						
						
						
					 
					
						2018-02-03 00:28:52 +01:00 
						 
				 
			
				
					
						
							
							
								Wolf-Bastian Pöttner 
							
						 
					 
					
						
						
							
						
						bef2d94374 
					 
					
						
						
							
							Add test cases for date parsing  
						
						
						
						
					 
					
						2018-02-03 00:28:49 +01:00 
						 
				 
			
				
					
						
							
							
								Wolf-Bastian Pöttner 
							
						 
					 
					
						
						
							
						
						f39c7654a0 
					 
					
						
						
							
							Merge branch 'master' of  https://github.com/danielquinn/paperless  into feature/heuristically-extract-date-from-document-text  
						
						
						
						
					 
					
						2018-02-02 22:44:03 +01:00 
						 
				 
			
				
					
						
							
							
								Wolf-Bastian Pöttner 
							
						 
					 
					
						
						
							
						
						87e466c47c 
					 
					
						
						
							
							Add support for using pre-existing text from PDFs  
						
						
						
						
					 
					
						2018-02-02 22:37:58 +01:00 
						 
				 
			
				
					
						
							
							
								Matt 
							
						 
					 
					
						
						
							
						
						ce98019b49 
					 
					
						
						
							
							Fixing error sentinel for pdftotext when the PDF has no text (scanned images). It was causing a crash previously.  
						
						
						
						
					 
					
						2018-02-01 10:08:57 -05:00