jonaswinkler
e2680b7113
code style
2021-01-02 15:26:09 +01:00
jayme-github
cd15490e91
Add option to ignore certain dates in parse_date
...
PAPERLESS_IGNORE_DATES lets you specify a comma-separated list of dates
to ignore during date parsing (from filename and content). This can be
used to specify dates that appear often in documents but are usually
not the document's creation date (like your date of birth).
2021-01-02 15:20:49 +01:00
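The behaviour described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the helper names `load_ignore_dates` and `first_allowed_date` are hypothetical, and the ISO `YYYY-MM-DD` input format is an assumption, since the commit message does not say which formats the setting accepts.

```python
from datetime import date, datetime


def load_ignore_dates(raw):
    """Parse a comma-separated list of dates into a set.

    Assumption: every entry is an ISO date (YYYY-MM-DD).
    """
    ignored = set()
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty entries such as a trailing comma
        ignored.add(datetime.strptime(item, "%Y-%m-%d").date())
    return ignored


def first_allowed_date(candidates, ignored):
    """Return the first candidate date that is not ignored, else None."""
    for candidate in candidates:
        if candidate not in ignored:
            return candidate
    return None


# Example: a date of birth is skipped in favour of the next date found.
ignored = load_ignore_dates("1985-04-12, 2000-01-01")
found = first_allowed_date([date(1985, 4, 12), date(2020, 12, 24)], ignored)
```

Keeping the ignored dates in a set makes the membership check cheap even when the candidate list from a long document is large.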
jonaswinkler
755f950cd2
supply file_name for tika parser
2021-01-01 22:19:43 +01:00
jonaswinkler
f1e9b414f9
remove duplicate code
2021-01-01 21:50:45 +01:00
jonaswinkler
4b7138f477
fixes #218
2020-12-30 15:12:16 +01:00
jonaswinkler
cdd2c873bd
fixes #25
2020-12-15 13:52:35 +01:00
jonaswinkler
0c6c4a62d8
moved metadata extraction to the parsers
2020-12-10 14:57:53 +01:00
jonaswinkler
0bfecaa0fc
Merge branch 'dev' into feature-websockets-status
2020-12-06 22:53:54 +01:00
jonaswinkler
b0507ce92a
fixes #78
2020-12-02 18:00:49 +01:00
jonaswinkler
e4eeb29f54
checking file types against parsers in the consumer.
2020-12-01 15:26:05 +01:00
jonaswinkler
1df64e3129
Merge branch 'dev' into feature-ocrmypdf
2020-11-30 16:48:09 +01:00
jonaswinkler
7658c07b4d
added file type checks to the parsers to prevent temporary files from being consumed. Also: parsers announce file types they wish to use as default for each mime type.
2020-11-30 00:40:04 +01:00
Jonas Winkler
9bfa088eb5
reworked the interface of the parsers.
2020-11-25 19:36:39 +01:00
Jonas Winkler
15935ab61f
reworked PDF parser that uses OCRmyPDF and produces archive files.
2020-11-25 14:50:43 +01:00
Jonas Winkler
17b62b61fa
add support for archive files.
2020-11-25 14:47:17 +01:00
Jonas Winkler
3893a23852
Merge branch 'dev' into celery-tasks
2020-11-22 22:49:37 +01:00
Jonas Winkler
afc3753e58
code cleanup
2020-11-21 14:03:45 +01:00
Jonas Winkler
f976a0b4ba
mime type handling
2020-11-20 13:31:03 +01:00
Jonas Winkler
196faa8fdc
Merge branch 'dev' into celery-tasks
2020-11-19 22:10:57 +01:00
Jonas Winkler
4230a0a474
a new setting that allows you to skip thumbnail optimization.
2020-11-18 22:42:05 +01:00
Jonas Winkler
680ab3d56b
updated logging, logging for the mail consumer to see what's happening
2020-11-18 13:23:30 +01:00
Jonas Winkler
9a48d6c577
Changed the way parsers are discovered. This also prepares for upcoming changes regarding content types and file types: parsers should declare what they support, and actual file extensions should not be hardcoded everywhere.
2020-11-16 23:53:12 +01:00
Jonas Winkler
4734dec465
add some more checks.
2020-11-12 21:20:12 +01:00
Jonas Winkler
eb6805e37e
code style fixes
2020-11-12 21:09:45 +01:00
Jonas Winkler
d46203c114
backend that supports asgi and status update sockets with channels
2020-11-07 11:31:04 +01:00
Jonas Winkler
cf5e463b9b
silenced unpaper once and for all
2020-11-03 14:04:21 +01:00
Jonas Winkler
9757e261f2
A handy script to redo ocr on all documents,
2020-11-03 14:04:11 +01:00
Jonas Winkler
d42979842e
made unpaper and convert a little bit nicer to interact with
2020-11-02 19:31:04 +01:00
Jonas Winkler
def3a85858
reworked most of the tesseract parser, better logging
2020-11-02 15:40:44 +01:00
Jonas Winkler
ffdb517b73
removed settings constants
2020-11-01 23:37:56 +01:00
Jonas Winkler
6adc870a20
silenced unpaper, optipng for cleaner output
...
moved parser settings to settings
removed forgiving ocr (now default) since tesseract is plenty accurate even without defining the correct language.
2020-11-01 23:23:42 +01:00
Johannes Wienke
ebcfcea05b
Handle dateparser ValueErrors
...
When parsing dates from the document text or filenames, correctly handle value
errors indicating broken dates. Newly added tests ensure that this handling
works properly.
2020-03-08 18:44:15 +01:00
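The idea behind this fix can be sketched as below. To stay dependency-free, the sketch swaps in the standard library's `datetime.strptime` for `dateparser`; the function name `parse_first_valid` and the `DD.MM.YYYY` format are assumptions made for illustration only.

```python
from datetime import datetime


def parse_first_valid(candidates, fmt="%d.%m.%Y"):
    """Return the first candidate string that parses as a real date.

    Broken dates such as "31.02.2020" raise ValueError in strptime;
    they are skipped so the search continues with the next candidate.
    """
    for text in candidates:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # broken date, try the next candidate
    return None


result = parse_first_valid(["31.02.2020", "08.03.2020"])
```

Catching the error per candidate, rather than around the whole loop, is what lets a single broken date be skipped without aborting date detection for the document.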
Daniel Quinn
0d59844567
Conform everything to the coding standards
...
https://paperless.readthedocs.io/en/latest/contributing.html#additional-style-guides
2018-12-01 17:09:12 +00:00
Joshua Taillon
b0326b5a19
Merge branch 'master' of github.com:danielquinn/paperless into ENH_filename_date_parsing
2018-11-15 23:17:59 -05:00
Joshua Taillon
6e88634fa8
Change the massive regex to match boundaries with _ or - characters (not just word breaks); add line for year first formats like YYYY-MM-DD
2018-11-15 20:38:53 -05:00
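Why plain word breaks are not enough can be shown with a small sketch. In Python's `re`, `_` counts as a word character, so `\b` does not match between `_` and a digit. The pattern below is a simplified, hypothetical stand-in for the project's much larger regex: it handles only the year-first form and treats any non-alphanumeric neighbour (including `_` and `-`) as a boundary.

```python
import re

# Naive version: relies on word breaks only.
WORD_BREAK_ONLY = re.compile(r"\b\d{4}-\d{1,2}-\d{1,2}\b")

# Hypothetical simplified version: a date must not be glued to a letter or
# digit on either side, so "_" and "-" are accepted as delimiters.
YEAR_FIRST = re.compile(r"(?<![0-9A-Za-z])(\d{4}-\d{1,2}-\d{1,2})(?![0-9A-Za-z])")


def find_date(file_name):
    """Return the first year-first date string in file_name, or None."""
    match = YEAR_FIRST.search(file_name)
    return match.group(1) if match else None


underscored = find_date("scan_2018-11-15_invoice.pdf")
hyphenated = find_date("report-2018-11-15-final.pdf")
missed = WORD_BREAK_ONLY.search("scan_2018-11-15_invoice.pdf")
```

Here `missed` is `None` because the underscore suppresses the word break, while `find_date` still recognises the date in both file names.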
Daniel Quinn
bc898c1992
Use optipng to optimise document thumbnails
2018-10-07 14:56:38 +01:00
Daniel Quinn
074609e1fc
Consolidate get_date onto the DocumentParser parent class
2018-10-07 14:56:02 +01:00
Daniel Quinn
ef7f98281d
Rename `parsers` to `DATE_REGEX`
...
In moving the `parsers` variable into the package-level, it lost the
context, so a more descriptive name was needed.
2018-09-09 21:02:30 +01:00
Joshua Taillon
5326895334
move date-matching regex pattern to base parser module for use by all subclasses
2018-09-05 21:13:36 -04:00
Daniel Quinn
5cc10a282b
Use `paperless-` instead of `paperless` for tempdir name
...
This is purely aesthetic.
2018-02-03 14:49:17 +00:00
Daniel Quinn
648e7b6d4f
No need to explicitly extend object
2018-02-03 14:49:01 +00:00
Wolf-Bastian Pöttner
21fc51c09a
Add support for a heuristic that extracts the document date from its text
2018-01-28 19:37:10 +01:00
Daniel Quinn
d2c283582b
feat: refactor for pluggable consumers
...
I've broken out the OCR-specific code from the consumers and dumped it
all into its own app, `paperless_tesseract`. This new app should serve
as a sample of how to create one's own consumer for different file
types.
Documentation for how to do this isn't ready yet, but for the impatient:
* Create a new app
  * containing a `parsers.py` for your parser modelled after
    `paperless_tesseract.parsers.RasterisedDocumentParser`
  * containing a `signals.py` with a handler modelled after
    `paperless_tesseract.signals.ConsumerDeclaration`
  * connect the signal handler to
    `documents.signals.document_consumer_declaration` in
    `your_app.apps`
* Install the app into Paperless by declaring
  `PAPERLESS_INSTALLED_APPS=your_app`. Additional apps should be
  separated with commas.
* Restart the consumer
2017-03-25 15:10:25 +00:00
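The declaration mechanism outlined in this commit can be illustrated with a Django-free toy. Everything here is an assumption for illustration: a plain list stands in for the `documents.signals.document_consumer_declaration` signal, `TextDocumentParser` and `text_consumer_declaration` are hypothetical names, and the `{"parser": ..., "weight": ...}` return shape is a guess rather than something stated in the commit message.

```python
import re

# Stand-in for the Django signal; a real app would connect its handler to
# documents.signals.document_consumer_declaration in your_app.apps.
declaration_handlers = []


class TextDocumentParser:
    """Hypothetical parser; a real one would follow RasterisedDocumentParser."""

    def __init__(self, path):
        self.path = path

    def get_text(self):
        with open(self.path, encoding="utf-8") as handle:
            return handle.read()


def text_consumer_declaration(file_name):
    """Hypothetical handler: offer a parser for .txt files, else decline."""
    if re.fullmatch(r".*\.txt", file_name.lower()):
        return {"parser": TextDocumentParser, "weight": 0}
    return None


# Equivalent of connecting the signal handler.
declaration_handlers.append(text_consumer_declaration)


def pick_parser(file_name):
    """Ask every handler and return the offered parser with the highest weight."""
    offers = [handler(file_name) for handler in declaration_handlers]
    offers = [offer for offer in offers if offer is not None]
    if not offers:
        return None
    return max(offers, key=lambda offer: offer["weight"])["parser"]


chosen = pick_parser("notes.TXT")
```

The point of the indirection is that the consumer never imports a parser directly: each installed app announces what it can handle, and the consumer picks among the offers.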