jonaswinkler
40ef375c15
supply file_name for tika parser
2021-01-01 22:19:43 +01:00
jonaswinkler
c05bfb894a
remove duplicate code
2021-01-01 21:50:45 +01:00
jonaswinkler
713985f259
fixes #218
2020-12-30 15:12:16 +01:00
jonaswinkler
5894060dc5
fixes #25
2020-12-15 13:52:35 +01:00
jonaswinkler
2f7bb01f34
moved metadata extraction to the parsers
2020-12-10 14:57:53 +01:00
jonaswinkler
4548cf08c7
fixes #78
2020-12-02 18:00:49 +01:00
jonaswinkler
834352130c
checking file types against parsers in the consumer.
2020-12-01 15:26:05 +01:00
jonaswinkler
aaa6599283
Merge branch 'dev' into feature-ocrmypdf
2020-11-30 16:48:09 +01:00
jonaswinkler
f51207fc32
added file type checks to the parsers to prevent temporary files from being consumed. Also: parsers announce file types they wish to use as default for each mime type.
2020-11-30 00:40:04 +01:00
Jonas Winkler
df801d17e1
reworked the interface of the parsers.
2020-11-25 19:36:39 +01:00
Jonas Winkler
2d559d330d
reworked PDF parser that uses OCRmyPDF and produces archive files.
2020-11-25 14:50:43 +01:00
Jonas Winkler
8069c2eb6a
add support for archive files.
2020-11-25 14:47:17 +01:00
Jonas Winkler
b44f8383e4
code cleanup
2020-11-21 14:03:45 +01:00
Jonas Winkler
41650f20f4
mime type handling
2020-11-20 13:31:03 +01:00
Jonas Winkler
c487e5f017
a new setting that allows you to skip thumbnail optimization.
2020-11-18 22:42:05 +01:00
Jonas Winkler
8908bc259e
updated logging, logging for the mail consumer to see what's happening
2020-11-18 13:23:30 +01:00
Jonas Winkler
d2e22e3f27
Changed the way parsers are discovered. This also prepares for upcoming changes regarding content types and file types: parsers should declare what they support, and actual file extensions should not be hardcoded everywhere.
2020-11-16 23:53:12 +01:00
Jonas Winkler
0421031128
add some more checks.
2020-11-12 21:20:12 +01:00
Jonas Winkler
2e04ba1c04
code style fixes
2020-11-12 21:09:45 +01:00
Jonas Winkler
28ba634e6a
silenced unpaper once and for all
2020-11-03 14:04:21 +01:00
Jonas Winkler
f4cebda085
A handy script to redo OCR on all documents.
2020-11-03 14:04:11 +01:00
Jonas Winkler
3a08a2d206
made unpaper and convert a little bit nicer to interact with
2020-11-02 19:31:04 +01:00
Jonas Winkler
d15405ef56
reworked most of the tesseract parser, better logging
2020-11-02 15:40:44 +01:00
Jonas Winkler
d6d37efa35
removed settings constants
2020-11-01 23:37:56 +01:00
Jonas Winkler
9f55fb668d
silenced unpaper, optipng for cleaner output
...
moved parser settings to settings
removed forgiving ocr (now default) since tesseract is plenty accurate even without defining the correct language.
2020-11-01 23:23:42 +01:00
Johannes Wienke
a311cd498c
Handle dateparser ValueErrors
...
When parsing dates from the document text or filenames, correctly handle value
errors indicating broken dates. Newly added tests ensure that this handling
works properly.
2020-03-08 18:44:15 +01:00
Daniel Quinn
d544f269e0
Conform everything to the coding standards
...
https://paperless.readthedocs.io/en/latest/contributing.html#additional-style-guides
2018-12-01 17:09:12 +00:00
Joshua Taillon
730daa3d6d
Merge branch 'master' of github.com:danielquinn/paperless into ENH_filename_date_parsing
2018-11-15 23:17:59 -05:00
Joshua Taillon
c225281f95
Change the massive regex to match boundaries with _ or - characters (not just word breaks); add line for year first formats like YYYY-MM-DD
2018-11-15 20:38:53 -05:00
Daniel Quinn
750ab5bf85
Use optipng to optimise document thumbnails
2018-10-07 14:56:38 +01:00
Daniel Quinn
2a3f766b93
Consolidate get_date onto the DocumentParser parent class
2018-10-07 14:56:02 +01:00
Daniel Quinn
c99f5923d5
Rename `parsers` to `DATE_REGEX`
...
In moving the `parsers` variable into the package-level, it lost the
context, so a more descriptive name was needed.
2018-09-09 21:02:30 +01:00
Joshua Taillon
72c828170e
move date-matching regex pattern to base parser module for use by all subclasses
2018-09-05 21:13:36 -04:00
Daniel Quinn
cebb8b9fa2
Use `paperless-` instead of `paperless` for tempdir name
...
This is purely aesthetic.
2018-02-03 14:49:17 +00:00
Daniel Quinn
46aca10a72
No need to explicitly extend object
2018-02-03 14:49:01 +00:00
Wolf-Bastian Pöttner
b140935843
Add support for a heuristic that extracts the document date from its text
2018-01-28 19:37:10 +01:00
Daniel Quinn
55e81ca4bb
feat: refactor for pluggable consumers
...
I've broken out the OCR-specific code from the consumers and dumped it
all into its own app, `paperless_tesseract`. This new app should serve
as a sample of how to create one's own consumer for different file
types.
Documentation for how to do this isn't ready yet, but for the impatient:
* Create a new app
  * containing a `parsers.py` for your parser modelled after
    `paperless_tesseract.parsers.RasterisedDocumentParser`
  * containing a `signals.py` with a handler modelled after
    `paperless_tesseract.signals.ConsumerDeclaration`
  * connect the signal handler to
    `documents.signals.document_consumer_declaration` in
    `your_app.apps`
* Install the app into Paperless by declaring
  `PAPERLESS_INSTALLED_APPS=your_app`. Additional apps should be
  separated with commas.
* Restart the consumer
2017-03-25 15:10:25 +00:00
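The steps listed in the 2017-03-25 commit message (a parser class, a declaration handler, and a signal connection) can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual Paperless code. The `Signal` class is a tiny stand-in for `django.dispatch.Signal` so the sketch runs without Django. `TextDocumentParser`, the file pattern, the `{"parser": ..., "weight": ...}` return shape, and `find_parser` are all assumptions made for illustration. In a real app the parser would live in `parsers.py`, the handler in `signals.py`, and the `connect` call in `your_app.apps`.

```python
import re


class Signal:
    """Stand-in for django.dispatch.Signal (assumption: only connect/send are needed)."""

    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        self._receivers.append(receiver)

    def send(self, sender, **kwargs):
        # Like Django, return a list of (receiver, response) pairs.
        return [(r, r(sender, **kwargs)) for r in self._receivers]


# Stand-in for documents.signals.document_consumer_declaration.
document_consumer_declaration = Signal()


class TextDocumentParser:
    """Hypothetical parser; would live in your_app/parsers.py."""

    def __init__(self, path):
        self.path = path

    def get_text(self):
        with open(self.path, encoding="utf-8") as handle:
            return handle.read()


class ConsumerDeclaration:
    """Hypothetical handler; would live in your_app/signals.py."""

    MATCHING_FILES = re.compile(r"^.*\.(txt|md)$")

    @classmethod
    def handle(cls, sender, **kwargs):
        # The signal handler hands back a test function for the consumer to call.
        return cls.test

    @classmethod
    def test(cls, doc):
        # Declare the parser only for file names this app claims to support.
        if cls.MATCHING_FILES.match(doc.lower()):
            return {"parser": TextDocumentParser, "weight": 0}
        return None


# Would happen in your_app/apps.py when the app is ready.
document_consumer_declaration.connect(ConsumerDeclaration.handle)


def find_parser(path):
    """Hypothetical consumer-side lookup: ask every declared app, keep the best match."""
    best = None
    for _receiver, test in document_consumer_declaration.send(sender=None):
        result = test(path)
        if result and (best is None or result["weight"] > best["weight"]):
            best = result
    return best["parser"] if best else None
```

With this arrangement the consumer never hardcodes which app handles which file type; each installed app answers for itself through the signal.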