paperless-ngx

mirror of https://github.com/paperless-ngx/paperless-ngx.git synced 2025-05-01 11:19:32 -05:00

Author	SHA1	Message	Date
jonaswinkler	e2680b7113	code style	2021-01-02 15:26:09 +01:00
jayme-github	cd15490e91	Add option to ignore certain dates in parse_date. PAPERLESS_IGNORE_DATES lets you specify a comma-separated list of dates to ignore during date parsing (from filename and content). This can be used to specify dates that appear often in documents but are usually not the document's creation date (like your date of birth).	2021-01-02 15:20:49 +01:00
jonaswinkler	755f950cd2	supply file_name for tika parser	2021-01-01 22:19:43 +01:00
jonaswinkler	f1e9b414f9	remove duplicate code	2021-01-01 21:50:45 +01:00
jonaswinkler	4b7138f477	fixes #218	2020-12-30 15:12:16 +01:00
jonaswinkler	cdd2c873bd	fixes #25	2020-12-15 13:52:35 +01:00
jonaswinkler	0c6c4a62d8	moved metadata extraction to the parsers	2020-12-10 14:57:53 +01:00
jonaswinkler	0bfecaa0fc	Merge branch 'dev' into feature-websockets-status	2020-12-06 22:53:54 +01:00
jonaswinkler	b0507ce92a	fixes #78	2020-12-02 18:00:49 +01:00
jonaswinkler	e4eeb29f54	checking file types against parsers in the consumer.	2020-12-01 15:26:05 +01:00
jonaswinkler	1df64e3129	Merge branch 'dev' into feature-ocrmypdf	2020-11-30 16:48:09 +01:00
jonaswinkler	7658c07b4d	added file type checks to the parsers to prevent temporary files from being consumed. Also: parsers announce file types they wish to use as default for each mime type.	2020-11-30 00:40:04 +01:00
Jonas Winkler	9bfa088eb5	reworked the interface of the parsers.	2020-11-25 19:36:39 +01:00
Jonas Winkler	15935ab61f	reworked PDF parser that uses OCRmyPDF and produces archive files.	2020-11-25 14:50:43 +01:00
Jonas Winkler	17b62b61fa	add support for archive files.	2020-11-25 14:47:17 +01:00
Jonas Winkler	3893a23852	Merge branch 'dev' into celery-tasks	2020-11-22 22:49:37 +01:00
Jonas Winkler	afc3753e58	code cleanup	2020-11-21 14:03:45 +01:00
Jonas Winkler	f976a0b4ba	mime type handling	2020-11-20 13:31:03 +01:00
Jonas Winkler	196faa8fdc	Merge branch 'dev' into celery-tasks	2020-11-19 22:10:57 +01:00
Jonas Winkler	4230a0a474	a new setting that allows you to skip thumbnail optimization.	2020-11-18 22:42:05 +01:00
Jonas Winkler	680ab3d56b	updated logging, logging for the mail consumer to see what's happening	2020-11-18 13:23:30 +01:00
Jonas Winkler	9a48d6c577	Changed the way parsers are discovered. This also prepares for upcoming changes regarding content types and file types: parsers should declare what they support, and actual file extensions should not be hardcoded everywhere.	2020-11-16 23:53:12 +01:00
Jonas Winkler	4734dec465	add some more checks.	2020-11-12 21:20:12 +01:00
Jonas Winkler	eb6805e37e	code style fixes	2020-11-12 21:09:45 +01:00
Jonas Winkler	d46203c114	backend that supports asgi and status update sockets with channels	2020-11-07 11:31:04 +01:00
Jonas Winkler	cf5e463b9b	silenced unpaper once and for all	2020-11-03 14:04:21 +01:00
Jonas Winkler	9757e261f2	A handy script to redo ocr on all documents,	2020-11-03 14:04:11 +01:00
Jonas Winkler	d42979842e	made unpaper and convert a little bit nicer to interact with	2020-11-02 19:31:04 +01:00
Jonas Winkler	def3a85858	reworked most of the tesseract parser, better logging	2020-11-02 15:40:44 +01:00
Jonas Winkler	ffdb517b73	removed settings constants	2020-11-01 23:37:56 +01:00
Jonas Winkler	6adc870a20	silenced unpaper, optipng for cleaner output; moved parser settings to settings; removed forgiving ocr (now default) since tesseract is plenty accurate even without defining the correct language.	2020-11-01 23:23:42 +01:00
Johannes Wienke	ebcfcea05b	Handle dateparser ValueErrors. When parsing dates from the document text or filenames, correctly handle value errors indicating broken dates. Newly added tests ensure that this handling works properly.	2020-03-08 18:44:15 +01:00
Daniel Quinn	0d59844567	Conform everything to the coding standards https://paperless.readthedocs.io/en/latest/contributing.html#additional-style-guides	2018-12-01 17:09:12 +00:00
Joshua Taillon	b0326b5a19	Merge branch 'master' of github.com:danielquinn/paperless into ENH_filename_date_parsing	2018-11-15 23:17:59 -05:00
Joshua Taillon	6e88634fa8	Change the massive regex to match boundaries with _ or - characters (not just word breaks); add line for year first formats like YYYY-MM-DD	2018-11-15 20:38:53 -05:00
Daniel Quinn	bc898c1992	Use optipng to optimise document thumbnails	2018-10-07 14:56:38 +01:00
Daniel Quinn	074609e1fc	Consolidate get_date onto the DocumentParser parent class	2018-10-07 14:56:02 +01:00
Daniel Quinn	ef7f98281d	Rename `parsers` to `DATE_REGEX`. In moving the `parsers` variable into the package-level, it lost the context, so a more descriptive name was needed.	2018-09-09 21:02:30 +01:00
Joshua Taillon	5326895334	move date-matching regex pattern to base parser module for use by all subclasses	2018-09-05 21:13:36 -04:00
Daniel Quinn	5cc10a282b	Use `paperless-` instead of `paperless` for tempdir name. This is purely aesthetic.	2018-02-03 14:49:17 +00:00
Daniel Quinn	648e7b6d4f	No need to explicitly extend object	2018-02-03 14:49:01 +00:00
Wolf-Bastian Pöttner	21fc51c09a	Add support for a heuristic that extracts the document date from its text	2018-01-28 19:37:10 +01:00
Daniel Quinn	d2c283582b	feat: refactor for pluggable consumers. I've broken out the OCR-specific code from the consumers and dumped it all into its own app, `paperless_tesseract`. This new app should serve as a sample of how to create one's own consumer for different file types. Documentation for how to do this isn't ready yet, but for the impatient: * Create a new app * containing a `parsers.py` for your parser modelled after `paperless_tesseract.parsers.RasterisedDocumentParser` * containing a `signals.py` with a handler modelled after `paperless_tesseract.signals.ConsumerDeclaration` * connect the signal handler to `documents.signals.document_consumer_declaration` in `your_app.apps` * Install the app into Paperless by declaring `PAPERLESS_INSTALLED_APPS=your_app`. Additional apps should be separated with commas. * Restart the consumer	2017-03-25 15:10:25 +00:00
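
Commit cd15490e91 above describes PAPERLESS_IGNORE_DATES as a comma-separated list of dates that date parsing should skip. The following is a minimal sketch of that idea, not the project's actual code: the helper names `parse_ignore_dates` and `first_date` are hypothetical, and it assumes ISO `YYYY-MM-DD` dates only, whereas the commits listed here mention a much broader regex and the dateparser library.

```python
import re
from datetime import date, datetime
from typing import Optional, Set

# Assumption: only ISO-style dates are recognised in this sketch.
ISO_DATE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")


def parse_ignore_dates(raw: str) -> Set[date]:
    """Turn a comma-separated string of ISO dates into a set of dates."""
    ignored = set()
    for item in raw.split(","):
        item = item.strip()
        if item:  # tolerate empty entries such as a trailing comma
            ignored.add(datetime.strptime(item, "%Y-%m-%d").date())
    return ignored


def first_date(text: str, ignored: Set[date]) -> Optional[date]:
    """Return the first valid date in text that is not in the ignore set."""
    for match in ISO_DATE.finditer(text):
        try:
            candidate = datetime.strptime(match.group(1), "%Y-%m-%d").date()
        except ValueError:
            # Broken dates such as 2021-13-45 are skipped, not fatal.
            continue
        if candidate not in ignored:
            return candidate
    return None


# Hypothetical setting value: a date of birth that shows up in many documents.
ignored = parse_ignore_dates("1985-04-12, 2000-01-01")
result = first_date("Born 1985-04-12, invoice issued 2021-01-02", ignored)
```

With the ignore list in place, the date of birth is skipped and the invoice date is picked up instead. Skipping unparseable matches mirrors the intent of commit ebcfcea05b, which handles value errors from broken dates.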

93 Commits