From 19bb29d5cdf68c1b731d498a4307a60d638c4dd6 Mon Sep 17 00:00:00 2001 From: jonaswinkler Date: Tue, 1 Dec 2020 23:38:42 +0100 Subject: [PATCH] documentation --- docs/administration.rst | 30 ++++++++++++++++++++++++++++++ docs/changelog.rst | 24 ++++++++++++++++++++++++ docs/configuration.rst | 9 +++++++-- docs/faq.rst | 33 +++++++++++++++++++++++++++++++++ docs/index.rst | 3 +++ docs/usage_overview.rst | 25 +++++++++++++++++++++++++ 6 files changed, 122 insertions(+), 2 deletions(-) diff --git a/docs/administration.rst b/docs/administration.rst index 3284f7141..2acae86f0 100644 --- a/docs/administration.rst +++ b/docs/administration.rst @@ -333,6 +333,36 @@ command: The command takes no arguments and processes all your mail accounts and rules. +.. _utilities-archiver: + +Creating archived documents +=========================== + +Paperless stores archived PDF/A documents alongside your original documents. +These archived documents will also contain selectable text for image-only +originals. +These documents are derived from the originals, which are always stored +unmodified. If coming from an earlier version of paperless, your documents +won't have archived versions. + +This command creates PDF/A documents for your documents. + +.. code:: + + document_archiver --overwrite + +This command will only attempt to create archived documents when no archived +document exists yet, unless ``--overwrite`` is specified. + +.. note:: + + This command essentially performs OCR on all your documents again, + according to your settings. If you run this with ``PAPERLESS_OCR_MODE=redo``, + it will potentially run for a very long time. You can cancel the command + at any time, since this command will skip already archived versions the next time + it is run. + + .. 
_utilities-encyption: Managing encryption diff --git a/docs/changelog.rst b/docs/changelog.rst index 580dd7830..5380ce49b 100644 --- a/docs/changelog.rst +++ b/docs/changelog.rst @@ -5,6 +5,29 @@ Changelog ********* +paperless-ng 0.9.5 +################## + +* OCR + + * Paperless now uses `OCRmyPDF `_ to perform OCR on documents. + * OCRmyPDF creates archived PDF/A documents with embedded text that can be selected in the front end. + * Paperless stores archived versions of documents alongside the originals. The originals can be + accessed on the document edit page, if available. + * Many of the configuration options regarding OCR have changed. See :ref:`configuration-ocr` for details. + * Paperless no longer guesses the language of your documents. It always uses the language that you + specified with ``PAPERLESS_OCR_LANGUAGE``. Be sure to set this to the language the majority of your + documents are in. + * The management command :ref:`document_archiver ` can be used to create archived versions for already + existing documents. + +* Tags from consumption folder. + + * Thanks to `jayme-github`_, paperless now consumes files from subfolders in the consumption folder and is able to assign tags + based on the subfolders a document was found in. This can be configured with ``PAPERLESS_CONSUMER_RECURSIVE`` and + ``PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS``. + + paperless-ng 0.9.4 ################## @@ -750,6 +773,7 @@ bulk of the work on this big change. * Initial release +.. _jayme-github: http://github.com/jayme-github .. _Brian Conn: https://github.com/TheConnMan .. _Christopher Luu: https://github.com/nuudles .. _Florian Jung: https://github.com/the01 diff --git a/docs/configuration.rst b/docs/configuration.rst index 492d4e76e..1f9da8b51 100644 --- a/docs/configuration.rst +++ b/docs/configuration.rst @@ -152,6 +152,8 @@ PAPERLESS_AUTO_LOGIN_USERNAME= Defaults to none, which disables this feature. +.. 
_configuration-ocr: + OCR settings ############ @@ -184,6 +186,8 @@ PAPERLESS_OCR_MODE= where no text is present. This is the safest and fastest option. * ``skip_noarchive``: In addition to skip, paperless won't create an archived version of your documents when it finds any text in them. + This is useful if you don't want to have two almost-identical versions + of your digital documents in the media folder. * ``redo``: Paperless will OCR all pages of your documents and attempt to replace any existing text layers with new text. This will be useful for documents from scanners that already performed OCR with insufficient @@ -197,7 +201,8 @@ PAPERLESS_OCR_MODE= however, the resulting document may be significantly larger and text won't appear as sharp when zoomed in. - The default is ``skip``, which only performs OCR when necessary. + The default is ``skip``, which only performs OCR when necessary and always + creates archived documents. PAPERLESS_OCR_OUTPUT_TYPE= Specify the the type of PDF documents that paperless should produce. @@ -244,7 +249,7 @@ PAPERLESS_OCR_USER_ARG= OCRmyPDF offers many more options. Use this parameter to specify any additional arguments you wish to pass to OCRmyPDF. Since Paperless uses the API of OCRmyPDF, you have to specify these in a format that can be - passed to the API. See `https://ocrmypdf.readthedocs.io/en/latest/api.html#reference`_ + passed to the API. See `the API reference of OCRmyPDF <https://ocrmypdf.readthedocs.io/en/latest/api.html#reference>`_ for valid parameters. All command line options are supported, but they use underscores instead of dashed. diff --git a/docs/faq.rst b/docs/faq.rst index 7b5432326..74e99c6c7 100644 --- a/docs/faq.rst +++ b/docs/faq.rst @@ -3,6 +3,18 @@ Frequently asked questions ************************** +**Q:** *What's the general plan for Paperless-ng?* + +**A:** Paperless-ng is already almost feature-complete. This project will remain +as simple as it is right now. It will see improvements to features that are already there. 
+If you need advanced features such as document versions, +workflows or multi-user with customizable access to individual files, this is +not the tool for you. + +Features that *are* planned are some more quality-of-life extensions for searching +(e.g., search for similar documents, group results by correspondents with "more from this" +links), bulk editing and hierarchical tags. + **Q:** *I'm using docker. Where are my documents?* **A:** Your documents are stored inside the docker volume ``paperless_media``. @@ -21,6 +33,18 @@ is files around manually. This folder is meant to be entirely managed by docker and paperless. +**Q:** *Let's say you don't support this project anymore in a year. Can I easily move to other systems?* + +**A:** Your documents are stored as plain files inside the media folder. You can always drag those files +out of that folder to use them elsewhere. Here are a couple of notes about that. + +* Paperless never modifies your original documents. It keeps checksums of all documents and uses a + scheduled sanity checker to check that they remain the same. +* By default, paperless uses the internal ID of each document as its filename. This might not be very + convenient for export. However, you can adjust the way files are stored in paperless by + :ref:`configuring the filename format `. +* :ref:`The exporter ` is another easy way to get your files out of paperless with reasonable file names. + **Q:** *What file types does paperless-ng support?* **A:** Currently, the following files are supported: @@ -53,3 +77,12 @@ in your browser and paperless has to do much less work to serve the data. that automatically, I'm all ears. For now, you have to grab the latest release archive from the project page and build the image yourself. The release comes with the front end already compiled, so you don't have to do this on the Pi. + +**Q:** *How do I run this on my toaster?* + +**A:** I honestly don't know! 
As for all other devices that might be able +to run paperless, you're a bit on your own. If you can't run the docker image, +the documentation has instructions for bare metal installs. I'm running +paperless on an i3 processor from 2015 or so. This is also what I use to test +new releases. Apart from that, I also have a Raspberry Pi, which I +occasionally build the image on to see if it works. diff --git a/docs/index.rst b/docs/index.rst index a9142a682..a083fb3d1 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -42,6 +42,9 @@ resources in the documentation: learn about how paperless automates all tagging using machine learning. * Paperless now comes with a :ref:`proper email consumer ` that's fully tested and production ready. +* Paperless creates searchable PDF/A documents from whatever you put into + the consumption directory. This means that you can select text in + image-only documents coming from your scanner. * See :ref:`this note ` about GnuPG encryption in paperless-ng. * Paperless is now integrated with a diff --git a/docs/usage_overview.rst b/docs/usage_overview.rst index 35ca505a3..db50d5706 100644 --- a/docs/usage_overview.rst +++ b/docs/usage_overview.rst @@ -60,6 +60,31 @@ Once you've got Paperless setup, you need to start feeding documents into it. Currently, there are three options: the consumption directory, IMAP (email), and HTTP POST. +When you add documents to paperless, it performs the following operations on +your documents: + +1. OCR the document, if it has no text. Digital documents usually have text, + and this step will be skipped for those documents. +2. Paperless will create an archivable PDF/A document from your document. + If this document is coming from your scanner, it will have embedded selectable text. +3. Paperless performs automatic matching of tags, correspondents and types on the + document before storing it in the database. + +.. hint:: + + This process can be configured to fit your needs. 
If you don't want paperless + to create archived versions for digital documents, you can do so by + setting ``PAPERLESS_OCR_MODE=skip_noarchive``. Please read the + :ref:`relevant section in the documentation `. + +.. note:: + + No matter which options you choose, Paperless will always store the original + document that it found in the consumption directory or in the mail and + will never overwrite that document. Archived versions are stored alongside the + original versions. + + The consumption directory =========================