Jonas Winkler 
							
						 
					 
					
						
						
							
						
						afc3753e58 
					 
					
						
						
							
							code cleanup  
						
						 
						
						
						
						
					 
					
						2020-11-21 14:03:45 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jonas Winkler 
							
						 
					 
					
						
						
							
						
						680ab3d56b 
					 
					
						
						
							
							updated logging, logging for the mail consumer to see what's happening  
						
						 
						
						
						
						
					 
					
						2020-11-18 13:23:30 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jonas Winkler 
							
						 
					 
					
						
						
							
						
						bd04c966c5 
					 
					
						
						
							
							first version of the new consumer.  
						
						 
						
						
						
						
					 
					
						2020-11-16 18:26:54 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jonas Winkler 
							
						 
					 
					
						
						
							
						
						eb6805e37e 
					 
					
						
						
							
							code style fixes  
						
						 
						
						
						
						
					 
					
						2020-11-12 21:09:45 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jonas Winkler 
							
						 
					 
					
						
						
							
						
						d42979842e 
					 
					
						
						
							
							made unpaper and convert a little bit nicer to interact with  
						
						 
						
						
						
						
					 
					
						2020-11-02 19:31:04 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jonas Winkler 
							
						 
					 
					
						
						
							
						
						a89773ad71 
					 
					
						
						
							
							removed unused code, small fixes  
						
						 
						
						
						
						
					 
					
						2020-11-02 18:20:04 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jonas Winkler 
							
						 
					 
					
						
						
							
						
						def3a85858 
					 
					
						
						
							
							reworked most of the tesseract parser, better logging  
						
						 
						
						
						
						
					 
					
						2020-11-02 15:40:44 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jonas Winkler 
							
						 
					 
					
						
						
							
						
						972a6a2333 
					 
					
						
						
							
							bugfix  
						
						 
						
						
						
						
					 
					
						2020-11-02 01:26:42 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jonas Winkler 
							
						 
					 
					
						
						
							
						
						6adc870a20 
					 
					
						
						
							
							silenced unpaper, optipng for cleaner output  
						
						 
						
						
						
						
						moved parser settings to settings
removed forgiving ocr (now default) since tesseract is plenty accurate even without defining the correct language. 
						
						
					 
					
						2020-11-01 23:23:42 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jonas Winkler 
							
						 
					 
					
						
						
							
						
						0f4094f3ca 
					 
					
						
						
							
							better thumbnail generation for smaller files  
						
						 
						
						
						
						
					 
					
						2020-10-26 01:05:23 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Stéphane Brunner 
							
						 
					 
					
						
						
							
						
						3fab354a6e 
					 
					
						
						
							
							Strip the thumbnails  
						
						 
						
						
						
						
					 
					
						2019-03-17 16:37:47 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								jenspfeifle 
							
						 
					 
					
						
						
							
						
						5c40da1a48 
					 
					
						
						
							
							make pycodestyle happy  
						
						 
						
						
						
						
					 
					
						2019-03-03 20:41:17 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								JensPfeifle 
							
						 
					 
					
						
						
							
						
						078d66b077 
					 
					
						
						
							
							try to run convert, but fall back on gs if needed  
						
						 
						
						
						
						
					 
					
						2019-03-03 20:31:52 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								JensPfeifle 
							
						 
					 
					
						
						
							
						
						4c64ea0404 
					 
					
						
						
							
							Add GS_BINARY to settings to avoid hardcoded call of "gs"  
						
						 
						
						
						
						
					 
					
						2019-03-03 20:31:52 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Pit 
							
						 
					 
					
						
						
							
						
						99718bcf17 
					 
					
						
						
							
							Fix quoting in call to run_convert  
						
						 
						
						
						
						
						Co-Authored-By: JensPfeifle <jens@pfeifle.tech> 
						
						
					 
					
						2019-03-03 20:31:52 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								JensPfeifle 
							
						 
					 
					
						
						
							
						
						3dfd0253ed 
					 
					
						
						
							
							remove unnecessary env arg in Popen  
						
						 
						
						
						
						
					 
					
						2019-03-03 20:31:52 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Jens Pfeifle 
							
						 
					 
					
						
						
							
						
						6ab21afeb6 
					 
					
						
						
							
							fix parse error of some documents by using gs  
						
						 
						
						
						
						
					 
					
						2019-03-03 20:31:52 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						3952c6d921 
					 
					
						
						
							
							Merge pull request  #421  from ddddavidmartin/clarify_forgiving_ocr_handling  
						
						 
						
						
						
						
						Clarify forgiving ocr handling 
						
						
					 
					
						2018-10-08 09:35:57 +00:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								David Martin 
							
						 
					 
					
						
						
							
						
						b0afa37ec1 
					 
					
						
						
							
							Mention FORGIVING_OCR config option when language detection fails.  
						
						 
						
						
						
						
						It is not obvious that PAPERLESS_FORGIVING_OCR lets document
consumption proceed even if no language can be detected.
Mentioning it in the actual error message in the log seems like the best
way to make it clear. 
						
						
					 
					
						2018-10-08 19:37:05 +11:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								David Martin 
							
						 
					 
					
						
						
							
						
						7022c98aab 
					 
					
						
						
							
							Let unpaper overwrite temporary files.  
						
						 
						
						
						
						
						I'm not sure what the circumstances are, but it looks like unpaper can
attempt to write a temporary file that already exists [0]. This then
fails the consumption. As per daedadu's comment simply letting unpaper
overwrite files fixes this.
[0]
unpaper: error: output file '/tmp/paperless/paperless-pjkrcr4l/convert-0000.unpaper.pnm' already present.
See https://web.archive.org/web/20181008081515/https://github.com/danielquinn/paperless/issues/406#issue-360651630  
						
						
					 
					
						2018-10-08 19:12:11 +11:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						bc898c1992 
					 
					
						
						
							
							Use optipng to optimise document thumbnails  
						
						 
						
						
						
						
					 
					
						2018-10-07 14:56:38 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						074609e1fc 
					 
					
						
						
							
							Consolidate get_date onto the DocumentParser parent class  
						
						 
						
						
						
						
					 
					
						2018-10-07 14:56:02 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						0a4338143a 
					 
					
						
						
							
							Tweak the date guesser to not allow dates prior to 1900 ( #414 )  
						
						 
						
						
						
						
					 
					
						2018-10-01 20:03:47 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						52bfeb2ad0 
					 
					
						
						
							
							Improve the unknown language error message  
						
						 
						
						
						
						
					 
					
						2018-09-23 12:41:14 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						21e53aa55c 
					 
					
						
						
							
							Merge pull request  #399  from jat255/ENH_convert_only_one_page  
						
						 
						
						
						
						
						Speed up thumbnail generation for PDFs 
						
						
					 
					
						2018-09-09 21:12:42 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						ef7f98281d 
					 
					
						
						
							
							Rename parsers to DATE_REGEX  
						
						 
						
						
						
						
						In moving the `parsers` variable into the package-level, it lost the
context, so a more descriptive name was needed. 
						
						
					 
					
						2018-09-09 21:02:30 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						a3158eedf9 
					 
					
						
						
							
							Merge branch 'ENH_text_consumer' of git://github.com/jat255/paperless into jat255-ENH_text_consumer  
						
						 
						
						
						
						
					 
					
						2018-09-09 20:52:59 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						6b63ce9201 
					 
					
						
						
							
							Fix pycodestyle complaints  
						
						 
						
						
						
						
						Apparently, pycodestyle updated itself to check for invalid escape
sequences, and it only complains if the regex in use isn't a raw string
(r""). 
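As a small illustration of the point above, the following sketch compiles two one-line sources and records any warnings the Python compiler raises. The helper name `escape_warnings` and the sample pattern are made up for this example; the exact warning class and wording differ between Python versions, so the sketch only counts warnings.

```python
import warnings

# Source text as it would appear in a file: pattern = "\d{4}"
PLAIN_SOURCE = 'pattern = "\\d{4}"'
# Source text as it would appear in a file: pattern = r"\d{4}"
RAW_SOURCE = 'pattern = r"\\d{4}"'


def escape_warnings(source):
    """Compile the source and return the messages of any warnings raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compile(source, "<example>", "exec")
    return [str(item.message) for item in caught]


if __name__ == "__main__":
    print("plain string:", escape_warnings(PLAIN_SOURCE))
    print("raw string:  ", escape_warnings(RAW_SOURCE))
```

The plain literal triggers a warning because `\d` is not a recognised string escape, while the raw literal passes the backslash through to the regex engine untouched.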
						
						
					 
					
						2018-09-09 20:00:12 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Joshua Taillon 
							
						 
					 
					
						
						
							
						
						5326895334 
					 
					
						
						
							
							move date-matching regex pattern to base parser module for use by all subclasses  
						
						 
						
						
						
						
					 
					
						2018-09-05 21:13:36 -04:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Joshua Taillon 
							
						 
					 
					
						
						
							
						
						98a437f78a 
					 
					
						
						
							
							change tesseract parser to only convert first page to save (potentially) massive amounts of work  
						
						 
						
						
						
						
					 
					
						2018-09-05 15:18:35 -04:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						bce2d3dd22 
					 
					
						
						
							
							Account for KeyError problem in  #345  
						
						 
						
						
						
						
					 
					
						2018-04-28 12:20:43 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						f3f86242de 
					 
					
						
						
							
							Account for KeyError problem in  #345  
						
						 
						
						
						
						
					 
					
						2018-04-28 12:19:53 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Ovv 
							
						 
					 
					
						
						
							
						
						32c440cbd9 
					 
					
						
						
							
							Log detected document date with isoformat  
						
						 
						
						
						
						
					 
					
						2018-03-04 13:10:49 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						7c5ca5f505 
					 
					
						
						
							
							Merge pull request  #302  from BastianPoe/bugfix/extend_regex_to_find_more_dates  
						
						 
						
						
						
						
						Extends the regex to find dates in documents as reported by @isaacsando 
						
						
					 
					
						2018-02-18 17:23:49 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						4f726e1991 
					 
					
						
						
							
							Monitor return codes of calls to convert and unpaper  
						
						 
						
						
						
						
						...and handle the failures nicely.  Addresses #303 . 
						
						
					 
					
						2018-02-18 16:02:27 +00:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						e53033d1b3 
					 
					
						
						
							
							Rename .TEXT_CACHE to .text  
						
						 
						
						
						
						
						Properties should use snake_case, and only constants should be ALL_CAPS.
This change also makes use of the convention of "private" properties
being prefixed with `_`. 
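The naming convention described above might look roughly like this. `CachingParser` and its `extractor` argument are hypothetical and stand in for the real parser, whose code is not shown in this log.

```python
class CachingParser:
    """Hypothetical parser that caches its extracted text on the instance."""

    # Constants are ALL_CAPS.
    DEFAULT_LANGUAGE = "eng"

    def __init__(self, extractor):
        # extractor: a zero-argument callable standing in for an expensive OCR step.
        self._extractor = extractor
        # "Private" attribute: snake_case with a leading underscore.
        self._text = None

    def get_text(self):
        # Run the extractor only on the first call, then reuse the cached result.
        if self._text is None:
            self._text = self._extractor()
        return self._text


if __name__ == "__main__":
    parser = CachingParser(lambda: "Invoice dated 2018-02-18")
    print(parser.get_text())
```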
						
						
					 
					
						2018-02-18 16:00:43 +00:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						3302ee2a78 
					 
					
						
						
							
							Make isort happy  
						
						 
						
						
						
						
					 
					
						2018-02-18 16:00:03 +00:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						caf44146db 
					 
					
						
						
							
							Style and removal of Python 2.7 stuff  
						
						 
						
						
						
						
					 
					
						2018-02-18 15:55:55 +00:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Wolf-Bastian Pöttner 
							
						 
					 
					
						
						
							
						
						5fed7ba6d4 
					 
					
						
						
							
							Improved regular expression to only match for (unicode) characters in month names + parsed one regex match after another until one gave a parsable date  
						
						 
						
						
						
						
					 
					
						2018-02-14 21:41:04 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Wolf-Bastian Pöttner 
							
						 
					 
					
						
						
							
						
						3e65054e39 
					 
					
						
						
							
							Extended exception handling  
						
						 
						
						
						
						
					 
					
						2018-02-12 22:43:16 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Wolf-Bastian Pöttner 
							
						 
					 
					
						
						
							
						
						c0c20f99e9 
					 
					
						
						
							
							Added log output for date detected in document  
						
						 
						
						
						
						
					 
					
						2018-02-12 22:41:19 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Wolf-Bastian Pöttner 
							
						 
					 
					
						
						
							
						
						3899763261 
					 
					
						
						
							
							Extends the regex to find dates in documents as reported by @isaacsando  
						
						 
						
						
						
						
					 
					
						2018-02-12 22:41:15 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Wolf-Bastian Pöttner 
							
						 
					 
					
						
						
							
						
						acfacaac4f 
					 
					
						
						
							
							Added a text cache to optimize performance of date detection  
						
						 
						
						
						
						
					 
					
						2018-02-03 00:28:52 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Wolf-Bastian Pöttner 
							
						 
					 
					
						
						
							
						
						73d261484a 
					 
					
						
						
							
							Merge branch 'master' of  https://github.com/danielquinn/paperless  into feature/heuristically-extract-date-from-document-text  
						
						 
						
						
						
						
					 
					
						2018-02-02 22:44:03 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Wolf-Bastian Pöttner 
							
						 
					 
					
						
						
							
						
						3dc730808e 
					 
					
						
						
							
							Add support for using pre-existing text from PDFs  
						
						 
						
						
						
						
					 
					
						2018-02-02 22:37:58 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Matt 
							
						 
					 
					
						
						
							
						
						bc5c45a705 
					 
					
						
						
							
							Fixing error sentinel for pdftotext when the PDF has no text (scanned images). It was causing a crash previously.  
						
						 
						
						
						
						
					 
					
						2018-02-01 10:08:57 -05:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						269c32ce6a 
					 
					
						
						
							
							Add support for using pre-existing text from PDFs  
						
						 
						
						
						
						
					 
					
						2018-01-30 20:13:35 +00:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Wolf-Bastian Pöttner 
							
						 
					 
					
						
						
							
						
						21fc51c09a 
					 
					
						
						
							
							Add support for a heuristic that extracts the document date from its text  
						
						 
						
						
						
						
					 
					
						2018-01-28 19:37:10 +01:00  
					
					
						 
						
						
							
							
							 
							
							
							
							
							 
						
					 
				 
			
				
					
						
							
							
								 
								Daniel Quinn 
							
						 
					 
					
						
						
							
						
						d2c283582b 
					 
					
						
						
							
							feat: refactor for pluggable consumers  
						
						 
						
						
						
						
						I've broken out the OCR-specific code from the consumers and dumped it
all into its own app, `paperless_tesseract`.  This new app should serve
as a sample of how to create one's own consumer for different file
types.
Documentation for how to do this isn't ready yet, but for the impatient:
* Create a new app
    * containing a `parsers.py` for your parser modelled after
      `paperless_tesseract.parsers.RasterisedDocumentParser`
    * containing a `signals.py` with a handler modelled after
      `paperless_tesseract.signals.ConsumerDeclaration`
    * connect the signal handler to
      `documents.signals.document_consumer_declaration` in
      `your_app.apps`
* Install the app into Paperless by declaring
  `PAPERLESS_INSTALLED_APPS=your_app`.  Additional apps should be
  separated with commas.
* Restart the consumer 
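The steps above can be sketched in plain Python. This is a minimal, self-contained approximation and not the project's actual code: `DeclarationSignal` is a hypothetical stand-in for the Django signal `documents.signals.document_consumer_declaration`, and `PlainTextParser`, the `weight` key, and `pick_parser` are assumptions made for illustration.

```python
import re


class DeclarationSignal:
    """Hypothetical stand-in for a Django signal (connect and send only)."""

    def __init__(self):
        self._handlers = []

    def connect(self, handler):
        self._handlers.append(handler)

    def send(self, sender=None):
        # Like Django's Signal.send, return (handler, response) pairs.
        return [(handler, handler(sender=sender)) for handler in self._handlers]


document_consumer_declaration = DeclarationSignal()


class PlainTextParser:
    """Hypothetical parser; would live in your_app/parsers.py."""

    def __init__(self, path):
        self.path = path

    def get_text(self):
        with open(self.path, encoding="utf-8") as handle:
            return handle.read()


class ConsumerDeclaration:
    """Hypothetical handler; would live in your_app/signals.py."""

    MATCHING_FILES = re.compile(r"^.*\.(txt|md)$")

    @classmethod
    def handle(cls, sender, **kwargs):
        # The handler answers the signal with a function that tests a file name.
        return cls.test

    @classmethod
    def test(cls, path):
        if cls.MATCHING_FILES.match(path.lower()):
            return {"parser": PlainTextParser, "weight": 10}
        return None


# In a Django app this connection would be made in your_app.apps.
document_consumer_declaration.connect(ConsumerDeclaration.handle)


def pick_parser(path):
    """Ask every declared consumer and return the best matching parser class."""
    candidates = []
    for _handler, test in document_consumer_declaration.send(sender=None):
        result = test(path)
        if result is not None:
            candidates.append(result)
    if not candidates:
        return None
    return max(candidates, key=lambda item: item["weight"])["parser"]


if __name__ == "__main__":
    print(pick_parser("notes.txt"))
    print(pick_parser("scan.pdf"))
```

In this sketch the consumer never needs to know about individual parsers. It sends the signal, collects one test function per installed app, and picks whichever parser claims the file.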
						
						
					 
					
						2017-03-25 15:10:25 +00:00