From bcd10f63ea300d8f80214c9b8a615bdd5b75385d Mon Sep 17 00:00:00 2001
From: tooomm
Date: Sun, 5 Mar 2023 16:03:42 +0100
Subject: [PATCH 1/2] better language code help

---
 docs/configuration.md | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index 2f6566170..f14ee8c46 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -383,21 +383,20 @@ needs.
 : Customize the language that paperless will attempt to use when
 parsing documents.
 
-    It should be a 3-letter language code consistent with ISO 639:
-    https://www.loc.gov/standards/iso639-2/php/code_list.php
+    It should be a 3-letter code, see the list of [languages Tesseract supports](https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html).
 
     Set this to the language most of your documents are written in.
 
     This can be a combination of multiple languages such as `deu+eng`,
-    in which case tesseract will use whatever language matches best.
-    Keep in mind that tesseract uses much more cpu time with multiple
+    in which case Tesseract will use whatever language matches best.
+    Keep in mind that Tesseract uses much more CPU time with multiple
     languages enabled.
 
     Defaults to "eng".
 
     !!! note
 
-        If your language contains a '-' such as chi-sim, you must use chi_sim
+        If your language contains a '-' such as chi-sim, you must use `chi_sim`.
 
 `PAPERLESS_OCR_MODE=`
 

From c5b701f99da3d12e2e141e882c8c9126148165fa Mon Sep 17 00:00:00 2001
From: tooomm
Date: Mon, 6 Mar 2023 23:51:07 +0100
Subject: [PATCH 2/2] add hints to ocr languages installation

---
 docs/configuration.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index f14ee8c46..eee39af5f 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1096,12 +1096,14 @@ actual group ID on the host system, which you can get by executing
 : Additional OCR languages to install. By default, paperless comes
 with English, German, Italian, Spanish and French. If your language
 is not in this list, install additional languages with this
-configuration option:
+configuration option ([find the right LangCodes](https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html)):
 
     ``` bash
     PAPERLESS_OCR_LANGUAGES=tur ces
     ```
 
+    Make sure it's a space separated list when using several values.
+
     To actually use these languages, also set the default OCR language
     of paperless: