Add the new paperless_tika parser

This parser will use an external Tika and Gotenberg server to parse
"Office" documents (.doc, .xls, .odt, etc.)

Signed-off-by: Jo Vandeginste <Jo.Vandeginste@kuleuven.be>
Jo Vandeginste
2020-12-29 01:23:40 +01:00
parent 99c7ff3123
commit bf8739864d
9 changed files with 276 additions and 0 deletions

@@ -277,6 +277,35 @@ PAPERLESS_OCR_USER_ARG=<json>
{"deskew": true, "optimize": 3, "unpaper_args": "--pre-rotate 90"}
.. _configuration-tika:
Tika settings
#############
Paperless can make use of `Tika <https://tika.apache.org/>`_ and
`Gotenberg <https://thecodingmachine.github.io/gotenberg/>`_ for parsing and
converting "Office" documents (such as ".doc", ".xlsx" and ".odt"). If you
wish to use this, you must provide a Tika server and a Gotenberg server,
configure their endpoints, and enable the feature.
If you run Paperless on Docker, you can add those services to the docker-compose
file (see the examples provided).
PAPERLESS_TIKA=<bool>
Enable (or disable) the Tika parser.
Defaults to false.
TIKA_SERVER_ENDPOINT=<url>
Set the endpoint URL where Paperless can reach your Tika server.
Defaults to "http://localhost:9998".
GOTENBERG_SERVER_ENDPOINT=<url>
Set the endpoint URL where Paperless can reach your Gotenberg server.
Defaults to "http://localhost:3000".
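As a rough illustration of how the three settings above fit together, the following sketch reads them from an environment mapping and applies the documented defaults. It is not the code from this commit: the names ``TikaSettings``, ``load_tika_settings`` and ``_as_bool`` are hypothetical, and the set of strings treated as "true" is an assumption, since the documentation only says ``<bool>``.

```python
import os
from typing import Mapping, NamedTuple


class TikaSettings(NamedTuple):
    """Hypothetical container for the three documented settings."""
    enabled: bool
    tika_endpoint: str
    gotenberg_endpoint: str


def _as_bool(value: str) -> bool:
    # Assumption: these spellings count as true; the docs do not list accepted values.
    return value.strip().lower() in ("1", "true", "yes", "y", "t")


def load_tika_settings(env: Mapping[str, str] = os.environ) -> TikaSettings:
    # Fall back to the defaults stated in the documentation when a variable is unset.
    return TikaSettings(
        enabled=_as_bool(env.get("PAPERLESS_TIKA", "false")),
        tika_endpoint=env.get("TIKA_SERVER_ENDPOINT", "http://localhost:9998"),
        gotenberg_endpoint=env.get("GOTENBERG_SERVER_ENDPOINT", "http://localhost:3000"),
    )
```

Calling ``load_tika_settings({})`` yields the disabled parser with both localhost endpoints. In a docker-compose setup the endpoints would typically point at the service names instead, for example ``http://tika:9998`` (the host name here is an assumption, not taken from the provided examples).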
Software tweaks
###############