reworking the documentation.

This commit is contained in:
Jonas Winkler 2020-11-13 18:46:19 +01:00
parent 04335e4aac
commit f2dbb74d44
21 changed files with 1042 additions and 1427 deletions

354
docs/administration.rst Normal file
@ -0,0 +1,354 @@
**************
Administration
**************
Making backups
##############
.. warning::
This section is not updated yet.
So you're bored of this whole project, or you want to make a remote backup of
your files for whatever reason. This is easy to do: simply use the
:ref:`exporter <utilities-exporter>` to dump your documents and database out
into an arbitrary directory.
.. _migrating-restoring:
Restoring
=========
Restoring your data is just as easy, since nearly all of your data exists either
in the file names, or in the contents of the files themselves. You just need to
create an empty database (just follow the
:ref:`installation instructions <setup-installation>` again) and then import the
``tags.json`` file you created as part of your backup. Lastly, copy your
exported documents into the consumption directory and start up the consumer.
.. code-block:: shell-session
$ cd /path/to/project
$ rm data/db.sqlite3 # Delete the database
$ cd src
$ ./manage.py migrate # Create the database
$ ./manage.py createsuperuser
$ ./manage.py loaddata /path/to/arbitrary/place/tags.json
$ cp /path/to/exported/docs/* /path/to/consumption/dir/
$ ./manage.py document_consumer
Importing your data if you are :ref:`using Docker <setup-installation-docker>`
is almost as simple:
.. code-block:: shell-session
# Stop and remove your current containers
$ docker-compose stop
$ docker-compose rm -f
# Recreate them, add the superuser
$ docker-compose up -d
$ docker-compose run --rm webserver createsuperuser
# Load the tags
$ cat /path/to/arbitrary/place/tags.json | docker-compose run --rm webserver loaddata_stdin -
# Load your exported documents into the consumption directory
# (How you do this highly depends on how you have set this up)
$ cp /path/to/exported/docs/* /path/to/mounted/consumption/dir/
After loading the documents into the consumption directory the consumer will
immediately start consuming the documents.
.. _administration-updating:
Updating paperless
##################
.. warning::
This section is not updated yet.
For the most part, all you have to do to update Paperless is run ``git pull``
on the directory containing the project files, and then use Django's
``migrate`` command to execute any database schema updates that might have been
rolled in as part of the update:
.. code-block:: shell-session
$ cd /path/to/project
$ git pull
$ pip install -r requirements.txt
$ cd src
$ ./manage.py migrate
Note that it's possible (even likely) that while ``git pull`` may update some
files, the ``migrate`` step may not update anything. This is totally normal.
Additionally, as new features are added, the ability to control those features
is typically added by way of an environment variable set in ``paperless.conf``.
You may want to take a look at the ``paperless.conf.example`` file to see if
there's anything new in there compared to what you've got in ``/etc``.
If you are :ref:`using Docker <setup-installation-docker>` the update process
is similar:
.. code-block:: shell-session
$ cd /path/to/project
$ git pull
$ docker build -t paperless .
$ docker-compose run --rm consumer migrate
$ docker-compose up -d
If ``git pull`` doesn't report any changes, there is no need to continue with
the remaining steps.
This depends on the route you've chosen to run paperless.
a. If you are not using docker, update python requirements. Paperless uses
`Pipenv`_ for managing dependencies:
.. code:: bash
$ pip install --upgrade pipenv
$ cd /path/to/paperless
$ pipenv install
This creates a new virtual environment (or uses your existing environment)
and installs all dependencies into it. Running commands inside the environment
is done via
.. code:: bash
$ cd /path/to/paperless/src
$ pipenv run python3 manage.py my_command
You will also need to build the frontend each time a new update is pushed.
See updating paperless for more information. TODO REFERENCE
b. If you are using docker, build the docker image.
.. code:: bash
$ docker build -t jonaswinkler/paperless-ng:latest .
Copy either docker-compose.yml.example or docker-compose.yml.sqlite.example
to docker-compose.yml and adjust the consumption directory.
Management utilities
####################
Paperless comes with some management commands that perform various maintenance
tasks on your paperless instance. You can invoke these commands either by
.. code:: bash
$ cd /path/to/paperless
$ docker-compose run --rm webserver <command> <arguments>
or
.. code:: bash
$ cd /path/to/paperless/src
$ pipenv run python manage.py <command> <arguments>
depending on whether you use docker or not.
All commands have built-in help, which can be accessed by executing them with
the argument ``--help``.
Document exporter
=================
The document exporter exports all your data from paperless into a folder for
backup or migration to another DMS.
.. code::
document_exporter target
``target`` is a folder to which the data gets written. This includes documents,
thumbnails and a ``manifest.json`` file. The manifest contains all metadata from
the database (correspondents, tags, etc).
When you use the provided docker compose script, specify ``../export`` as the
target. This path inside the container is automatically mounted on your host on
the folder ``export``.
.. _utilities-importer:
Document importer
=================
The document importer takes the export produced by the `Document exporter`_ and
imports it into paperless.
The importer works just like the exporter. You point it at a directory, and
the script does the rest of the work:
.. code::
document_importer source
When you use the provided docker compose script, put the export inside the
``export`` folder in your paperless source directory. Specify ``../export``
as the ``source``.
.. _utilities-retagger:
Document retagger
=================
Say you've imported a few hundred documents and now want to introduce
a tag or set up a new correspondent, and apply its matching to all of
the currently-imported docs. This problem is common enough that
there are tools for it.
.. code::
document_retagger [-h] [-c] [-T] [-t] [-i] [--use-first] [-f]
optional arguments:
-c, --correspondent
-T, --tags
-t, --document_type
-i, --inbox-only
--use-first
-f, --overwrite
Run this after changing or adding matching rules. It'll loop over all
of the documents in your database and attempt to match documents
according to the new rules.
Specify any combination of ``-c``, ``-T`` and ``-t`` to have the
retagger perform matching of the specified metadata type. If you don't
specify any of these options, the document retagger won't do anything.
Specify ``-i`` to have the document retagger work on documents tagged
with inbox tags only. This is useful when you don't want to mess with
your already processed documents.
When multiple document types or correspondents match a single document,
the retagger won't assign these to the document. Specify ``--use-first``
to override this behaviour and just use the first correspondent or type
it finds. This option does not apply to tags, since any number of tags
can be applied to a document.
Finally, ``-f`` specifies that you wish to overwrite already assigned
correspondents, types and/or tags. The default behaviour is to not
assign correspondents and types to documents that have this data already
assigned. ``-f`` works differently for tags: By default, only additional tags get
added to documents, no tags will be removed. With ``-f``, tags that don't
match a document anymore get removed as well.
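The interplay of ``--use-first`` and ``-f`` described above can be sketched in a few lines of Python. This is only an illustration of the documented rules, not the retagger's actual code: the function names ``pick_single`` and ``update_tags`` are hypothetical, and what happens with ``-f`` when nothing matches is an assumption (here the existing value is kept).

```python
def pick_single(current, candidates, use_first=False, overwrite=False):
    """Decide which correspondent (or document type) a document ends up with.

    current    -- value already assigned to the document, or None
    candidates -- all correspondents/types whose matching rule applies
    """
    # Default behaviour: never touch documents that already have a value.
    if current is not None and not overwrite:
        return current
    # Assumption: with nothing matching, the existing value is left alone.
    if not candidates:
        return current
    # Ambiguous match: assign nothing unless --use-first was given.
    if len(candidates) > 1 and not use_first:
        return current
    return candidates[0]


def update_tags(current, matching, overwrite=False):
    """Return the tag set of a document after retagging."""
    if overwrite:
        # -f: tags that no longer match are removed as well.
        return set(matching)
    # Default: only add tags, never remove any.
    return set(current) | set(matching)
```

For example, ``pick_single(None, ["Bank A", "Bank B"])`` leaves the document without a correspondent, while passing ``use_first=True`` assigns ``"Bank A"``.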
Managing the Automatic matching algorithm
=========================================
The *Auto* matching algorithm requires a trained neural network to work.
This network needs to be updated whenever something in your data
changes. The docker image takes care of that automatically with the task
scheduler. You can manually renew the classifier by invoking the following
management command:
.. code::
document_create_classifier
This command takes no arguments.
Managing the document search index
==================================
The document search index is responsible for delivering search results for the
website. The document index is automatically updated whenever documents get
added to, changed, or removed from paperless. However, if the search returns
documents that no longer exist or fails to find anything, you may need to
recreate the index manually.
.. code::
document_index {reindex,optimize}
Specify ``reindex`` to have the index created from scratch. This may take some
time.
Specify ``optimize`` to optimize the index. This updates certain aspects of
the index and usually makes queries faster and also ensures that the
autocompletion works properly. This command is regularly invoked by the task
scheduler.
Managing filenames
==================
.. warning::
TBD
.. code::
document_renamer
.. _utilities-encyption:
Managing encryption
===================
Documents can be stored in Paperless using GnuPG encryption.
.. danger::
Encryption is deprecated since paperless-ng 1.0 and doesn't really provide any
additional security, since you have to store the passphrase in a configuration
file on the same system as the encrypted documents for paperless to work. Also,
paperless provides transparent access to your encrypted documents.
Consider running paperless on an encrypted filesystem instead, which will then
at least provide security against physical hardware theft.
.. code::
change_storage_type [--passphrase PASSPHRASE] {gpg,unencrypted} {gpg,unencrypted}
positional arguments:
{gpg,unencrypted} The state you want to change your documents from
{gpg,unencrypted} The state you want to change your documents to
optional arguments:
--passphrase PASSPHRASE
Enabling encryption
-------------------
Basic usage to enable encryption of your document store (**USE A MORE SECURE PASSPHRASE**):
(Note: If ``PAPERLESS_PASSPHRASE`` isn't set already, you need to specify it here)
.. code::
change_storage_type [--passphrase SECR3TP4SSPHRA$E] unencrypted gpg
Disabling encryption
--------------------
Basic usage to disable encryption of your document store:
(Note: Again, if ``PAPERLESS_PASSPHRASE`` isn't set already, you need to specify it here)
.. code::
change_storage_type [--passphrase SECR3TP4SSPHRA$E] gpg unencrypted
.. _Pipenv: https://pipenv.pypa.io/en/latest/

244
docs/advanced_usage.rst Normal file
View File

@ -0,0 +1,244 @@
***************
Advanced topics
***************
Paperless offers a couple features that automate certain tasks and make your life
easier.
Guesswork
#########
Any document you put into the consumption directory will be consumed, but if
you name the file right, it'll automatically set some values in the database
for you. This is the logic the consumer follows:
1. Try to find the correspondent, title, and tags in the file name following
the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``. Note that
the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or
``YYYYMMDDZ``. The ``Z`` refers to "Zulu time", AKA "UTC".
The tags are optional, so the format ``Date - Correspondent - Title.pdf``
works as well.
2. If that doesn't work, we skip the date and try this pattern:
``Correspondent - Title - tag,tag,tag.pdf``.
3. If that doesn't work, we try to find the correspondent and title in the file
name following the pattern: ``Correspondent - Title.pdf``.
4. If that doesn't work, just assume that the name of the file is the title.
So given the above, the following examples would work as you'd expect:
* ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
* ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
* ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
* ``Another Company - Letter of Reference.jpg``
* ``Dad's Recipe for Pancakes.png``
These however wouldn't work:
* ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
* ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
* ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
* ``Another Company- Letter of Reference.jpg``
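The fallback order described above can be sketched as a list of regular expressions that are tried one after another. This is a minimal illustration, not the consumer's actual parser: the function name ``guess_from_filename``, the exact expressions, and the set of characters allowed in tags are assumptions.

```python
import os
import re

# Building blocks; the date is either YYYYMMDDZ or YYYYMMDDHHMMSSZ.
DATE = r"(?P<date>\d{8}(?:\d{6})?Z)"
CORRESPONDENT = r"(?P<correspondent>.+)"
TITLE = r"(?P<title>.+)"
TAGS = r"(?P<tags>[a-z0-9,\-]+)"  # assumed tag alphabet

# Tried in order; the first pattern that matches wins.
PATTERNS = [
    re.compile(rf"^{DATE} - {CORRESPONDENT} - {TITLE} - {TAGS}$"),
    re.compile(rf"^{DATE} - {CORRESPONDENT} - {TITLE}$"),
    re.compile(rf"^{CORRESPONDENT} - {TITLE} - {TAGS}$"),
    re.compile(rf"^{CORRESPONDENT} - {TITLE}$"),
]


def guess_from_filename(filename):
    """Return the metadata that can be guessed from a file name."""
    stem, _extension = os.path.splitext(filename)
    for pattern in PATTERNS:
        found = pattern.match(stem)
        if found:
            info = found.groupdict()
            # Tags are a comma-separated list without spaces.
            info["tags"] = info["tags"].split(",") if info.get("tags") else []
            return info
    # Nothing matched: the whole file name becomes the title.
    return {"title": stem, "tags": []}
```

With this sketch, ``Another Company- Letter of Reference.jpg`` falls through to the last step because the separator ``" - "`` (space, dash, space) is missing, so the entire name ends up as the title.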
Do I have to be so strict about naming?
=======================================
Rather than using the strict document naming rules, one can also set the option
``PAPERLESS_FILENAME_DATE_ORDER`` in ``paperless.conf`` to any date order
that is accepted by dateparser_. Doing so will cause ``paperless`` to default
to any date format that is found in the title, instead of a date pulled from
the document's text, without requiring the strict formatting of the document
filename as described above.
.. _dateparser: https://github.com/scrapinghub/dateparser/blob/v0.7.0/docs/usage.rst#settings
Transforming filenames for parsing
==================================
Some devices can't produce filenames that can be parsed by the default
parser. By configuring the option ``PAPERLESS_FILENAME_PARSE_TRANSFORMS`` in
``paperless.conf`` one can add transformations that are applied to the filename
before it's parsed.
The option contains a list of dictionaries of regular expressions (key:
``pattern``) and replacements (key: ``repl``) in JSON format, which are
applied in order by passing them to ``re.subn``. Transformation stops
after the first match, so at most one transformation is applied. The general
syntax is
.. code:: python
[{"pattern":"pattern1", "repl":"repl1"}, {"pattern":"pattern2", "repl":"repl2"}, ..., {"pattern":"patternN", "repl":"replN"}]
The example below is for a Brother ADS-2400N, a scanner that lets you assign
different names to different hardware buttons (useful for handling
multiple entities in one instance), but insists on adding ``_<count>``
to the filename.
.. code:: python
# Brother profile configuration, support "Name_Date_Count" (the default
# setting) and "Name_Count" (use "Name" as tag and "Count" as title).
PAPERLESS_FILENAME_PARSE_TRANSFORMS=[{"pattern":"^([a-z]+)_(\\d{8})_(\\d{6})_([0-9]+)\\.", "repl":"\\2\\3Z - \\4 - \\1."}, {"pattern":"^([a-z]+)_([0-9]+)\\.", "repl":" - \\2 - \\1."}]
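To check what a set of transformations does before putting it into ``paperless.conf``, the behaviour described above can be reproduced with a short Python sketch. It decodes the JSON value, passes each entry to ``re.subn`` in order, and stops after the first entry that matches. The helper name ``apply_transforms`` and the sample file names are made up for illustration; Paperless itself may pass additional flags to ``re.subn``.

```python
import json
import re

# Same value as in the Brother example, split across lines for readability.
TRANSFORMS_JSON = r'''[
  {"pattern": "^([a-z]+)_(\\d{8})_(\\d{6})_([0-9]+)\\.", "repl": "\\2\\3Z - \\4 - \\1."},
  {"pattern": "^([a-z]+)_([0-9]+)\\.", "repl": " - \\2 - \\1."}
]'''

TRANSFORMS = json.loads(TRANSFORMS_JSON)


def apply_transforms(filename, transforms):
    """Apply the first matching transformation to a file name."""
    for transform in transforms:
        new_name, count = re.subn(transform["pattern"], transform["repl"], filename)
        if count > 0:
            # At most one transformation is applied.
            return new_name
    return filename


print(apply_transforms("invoices_20201113_184619_0001.pdf", TRANSFORMS))
print(apply_transforms("invoices_0001.pdf", TRANSFORMS))
```

The first call prints ``20201113184619Z - 0001 - invoices.pdf``, which the default parser can then read as date, correspondent and title; the second prints `` - 0001 - invoices.pdf``.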
Matching tags, correspondents and document types
################################################
After the consumer has tried to figure out what it could from the file name,
it starts looking at the content of the document itself. It will compare the
matching algorithms defined by every tag and correspondent already set in your
database to see if they apply to the text in that document. In other words,
if you defined a tag called ``Home Utility`` that had a ``match`` property of
``bc hydro`` and a ``matching_algorithm`` of ``literal``, Paperless will
automatically tag your newly-consumed document with your ``Home Utility`` tag
so long as the text ``bc hydro`` appears in the body of the document somewhere.
The matching logic is quite powerful, and supports searching the text of your
document with different algorithms, and as such, some experimentation may be
necessary to get things right.
In order to have a tag, correspondent or type assigned automatically to newly
consumed documents, assign a match and matching algorithm using the web
interface. These settings define when to assign correspondents, tags and types
to documents.
The following algorithms are available:
* **Any:** Looks for any occurrence of any word provided in match in the PDF.
If you define the match as ``Bank1 Bank2``, it will match documents containing
either of these terms.
* **All:** Requires that every word provided appears in the PDF, albeit not in the
order provided.
* **Literal:** Matches only if the match appears exactly as provided in the PDF.
* **Regular expression:** Parses the match as a regular expression and tries to
find a match within the document.
* **Fuzzy match:** Not documented yet; look at the source for details.
* **Auto:** Tries to automatically match new documents. This does not require you
to set a match. See the notes below.
When using the "any" or "all" matching algorithms, you can search for terms
that consist of multiple words by enclosing them in double quotes. For example,
defining a match text of ``"Bank of America" BofA`` using the "any" algorithm,
will match documents that contain either "Bank of America" or "BofA", but will
not match documents containing "Bank of South America".
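As a rough illustration of how the *Any*, *All*, *Literal* and *Regular expression* algorithms differ, here is a simplified Python sketch. It is not the code Paperless uses: the function names are hypothetical, matching is assumed to be case-insensitive and to respect word boundaries, and the *Fuzzy* and *Auto* algorithms are left out.

```python
import re
import shlex


def split_terms(match):
    """Split a match string into terms, keeping "quoted phrases" together."""
    return shlex.split(match)


def contains(term, text):
    """True if the term occurs in the text as a whole word or phrase."""
    return re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE) is not None


def matches(algorithm, match, text):
    """Check whether a document text satisfies a match definition."""
    if algorithm == "any":
        return any(contains(term, text) for term in split_terms(match))
    if algorithm == "all":
        return all(contains(term, text) for term in split_terms(match))
    if algorithm == "literal":
        return contains(match, text)
    if algorithm == "regex":
        return re.search(match, text, re.IGNORECASE) is not None
    raise ValueError("unsupported algorithm: " + algorithm)
```

With the match text ``"Bank of America" BofA`` and the ``any`` algorithm, a document containing "Bank of South America" is rejected because neither the quoted phrase nor ``BofA`` occurs in it.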
Then just save your tag/correspondent and run another document through the
consumer. Once complete, you should see the newly-created document,
automatically tagged with the appropriate data.
Automatic matching
==================
Paperless-ng comes with a new matching algorithm called *Auto*. This matching
algorithm tries to assign tags, correspondents and document types to your
documents based on how you have assigned these on existing documents. It
uses a neural network under the hood.
If, for example, all your bank statements of your account 123 at the Bank of
America are tagged with the tag "bofa_123" and the matching algorithm of this
tag is set to *Auto*, this neural network will examine your documents and
automatically learn when to assign this tag.
There are a couple caveats you need to keep in mind when using this feature:
* Changes to your documents are not immediately reflected by the matching
algorithm. The neural network needs to be *trained* on your documents after
changes. Paperless periodically (default: once each hour) checks for changes
and does this automatically for you.
* The Auto matching algorithm only takes documents into account which are NOT
placed in your inbox (i.e., documents that have no inbox tags assigned to them). This ensures
that the neural network only learns from documents which you have correctly
tagged before.
* The matching algorithm can only work if there is a correlation between the
tag, correspondent or document type and the document itself. Your bank
statements usually contain your bank account number and the name of the bank,
so this works reasonably well. However, tags such as "TODO" cannot be
automatically assigned.
* The matching algorithm needs a reasonable number of documents to identify when
to assign tags, correspondents, and types. If one out of a thousand documents
has the correspondent "Very obscure web shop I bought something five years
ago", it will probably not assign this correspondent automatically if you buy
something from them again. The more documents, the better.
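The inbox rule from the list above amounts to a simple filter on the training data. The sketch below only illustrates that rule; the document structure (a dictionary with a ``tags`` list) and the function name are assumptions and say nothing about how Paperless stores documents internally.

```python
def training_documents(documents, inbox_tags):
    """Keep only documents that carry none of the inbox tags."""
    inbox = set(inbox_tags)
    return [doc for doc in documents if not inbox & set(doc["tags"])]


documents = [
    {"title": "Bank statement March", "tags": ["bofa_123"]},
    {"title": "Unsorted scan", "tags": ["inbox"]},
]

# Only the first document would be used to train the classifier.
print(training_documents(documents, ["inbox"]))
```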
Hooking into the consumption process
####################################
Sometimes you may want to do something arbitrary whenever a document is
consumed. Rather than try to predict what you may want to do, Paperless lets
you execute scripts of your own choosing just before or after a document is
consumed using a couple simple hooks.
Just write a script, put it somewhere that Paperless can read & execute, and
then put the path to that script in ``paperless.conf`` with the variable name
of either ``PAPERLESS_PRE_CONSUME_SCRIPT`` or
``PAPERLESS_POST_CONSUME_SCRIPT``.
.. TODO HYPEREF TO CONFIG
.. important::
These scripts are executed in a **blocking** process, which means that if
a script takes a long time to run, it can significantly slow down your
document consumption flow. If you want things to run asynchronously,
you'll have to fork the process in your script and exit.
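If a hook is written in Python, one way to avoid blocking the consumer is to start the slow work as a detached child process and return immediately. The following is a minimal sketch under that assumption; the path in ``SLOW_TASK`` is a made-up placeholder, and ``start_new_session`` only has an effect on POSIX systems.

```python
import subprocess
import sys

# Hypothetical long-running program; replace with your own.
SLOW_TASK = "/usr/local/bin/slow-task"


def run_in_background(command):
    """Start a command without waiting for it to finish."""
    return subprocess.Popen(
        command,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # detach from the hook's session (POSIX only)
    )


def main(argv):
    # argv[1:] holds the arguments Paperless passed to the hook.
    run_in_background([SLOW_TASK, *argv[1:]])
    return 0

# In a real hook script, finish with: sys.exit(main(sys.argv))
```

The standard streams are redirected to ``DEVNULL`` so that the child does not keep any pipes of the hook open after the hook itself has exited.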
Pre-consumption script
======================
Executed after the consumer sees a new document in the consumption folder, but
before any processing of the document is performed. This script receives exactly
one argument:
* Document file name
A simple but common example for this would be creating a simple script like
this:
``/usr/local/bin/ocr-pdf``
.. code:: bash
#!/usr/bin/env bash
pdf2pdfocr.py -i "${1}"
``/etc/paperless.conf``
.. code:: bash
...
PAPERLESS_PRE_CONSUME_SCRIPT="/usr/local/bin/ocr-pdf"
...
This will pass the path to the document about to be consumed to ``/usr/local/bin/ocr-pdf``,
which will in turn call `pdf2pdfocr.py`_ on your document. That tool overwrites
the file with an OCR'd version and exits, at which point the consumption process
begins with the newly modified file.
.. _pdf2pdfocr.py: https://github.com/LeoFCardoso/pdf2pdfocr
.. _consumption-director-hook-variables-post:
Post-consumption script
=======================
Executed after the consumer has successfully processed a document and has moved it
into paperless. It receives the following arguments:
* Document id
* Generated file name
* Source path
* Thumbnail path
* Download URL
* Thumbnail URL
* Correspondent
* Tags
The script can be in any language you like, but for a simple shell script
example, you can take a look at ``post-consumption-example.sh`` in the
``scripts`` directory in this project.
The post consumption script cannot cancel the consumption process.
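Since the script can be written in any language, here is a minimal Python sketch of a post-consumption script. It assumes that the values listed above arrive as positional command-line arguments in exactly that order; the key names in ``ARGUMENT_NAMES`` are chosen for this example only, and the format of the tags value is not interpreted.

```python
#!/usr/bin/env python3
import sys

# Names for the positional arguments, in the order listed above.
ARGUMENT_NAMES = [
    "document_id",
    "file_name",
    "source_path",
    "thumbnail_path",
    "download_url",
    "thumbnail_url",
    "correspondent",
    "tags",
]


def parse_arguments(argv):
    """Map the positional arguments (without the script name) to names."""
    return dict(zip(ARGUMENT_NAMES, argv[1:]))


def main(argv):
    document = parse_arguments(argv)
    print("Consumed document {} from {}".format(
        document.get("document_id"), document.get("correspondent")))


if __name__ == "__main__":
    main(sys.argv)
```

Point ``PAPERLESS_POST_CONSUME_SCRIPT`` at the file and make it executable so that Paperless can run it.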

@ -1,7 +1,12 @@
.. _api:
************
The REST API
############
************
.. warning::
This section is not updated yet.
Paperless makes use of the `Django REST Framework`_ standard API interface
because of its inherent awesomeness. Conveniently, the system is also
@ -15,7 +20,7 @@ installation.
.. _api-uploading:
Uploading
---------
=========
File uploads in an API are hard and so far as I've been able to tell, there's
no standard way of accepting them, so rather than crowbar file uploads into the

@ -1,6 +1,79 @@
.. _paperless_changelog:
Changelog
#########
paperless-ng 1.0
================
* **Deprecated:** GnuPG. Don't use it. If you're still using it, be aware that it
offers no protection at all, since the passphrase is stored alongside the
encrypted documents themselves. This feature will most likely be removed in future
versions.
* **Added:** New frontend. Features:
* Single page application: It's much more responsive than the django admin pages.
* Dashboard. Shows recently scanned documents, or todos, or other documents
at wish. Allows uploading of documents. Shows basic statistics.
* Better document list with multiple display options.
* Full text search with result highlighting, auto completion and scoring based
on the query. It uses a document search index in the background.
* Saveable filters.
* Better log viewer.
* **Added:** Document types. Assign these to documents just as correspondents.
They may be used in the future to perform automatic operations on documents
depending on the type.
* **Added:** Inbox tags. Define an inbox tag and it will automatically be
assigned to any new document scanned into the system.
* **Added:** Automatic matching. A new matching algorithm that automatically
assigns tags, document types and correspondents to your documents. It uses
a neural network trained on your data.
* **Added:** Archive serial numbers. Assign these to quickly find documents stored in
physical binders.
* **Added:** Enabled the internal user management of django. This isn't really a
multi user solution, however, it allows more than one user to access the website
and set some basic permissions / renew passwords.
* **Modified [breaking]:** REST Api changes:
* New filters added, other filters removed (case sensitive filters, slug filters)
* Endpoints for thumbnails, previews and downloads replace the old ``/fetch/`` urls. Redirects are in place.
* Endpoint for document uploads replaces the old ``/push`` url. Redirects are in place.
* Foreign key relationships are now served as IDs, not as urls.
* **Modified [breaking]:** PostgreSQL:
* If ``PAPERLESS_DBHOST`` is specified in the settings, paperless uses postgresql instead of sqlite.
Username, database and password all default to ``paperless`` if not specified.
* **docker-compose.yml uses PostgreSQL by default.**
* **Modified [breaking]:** document_retagger management command rework. See TODO hyperref
* **Removed [breaking]:** Reminders.
* **Removed:** All customizations made to the django admin pages.
* **Internal changes:** Mostly code cleanup, including:
* Rework of the code of the tesseract parser. This is now a lot cleaner.
* Rework of the filename handling code. It was a mess.
* Fixed some issues with the document exporter not exporting all documents when encountering duplicate filenames.
* Consumer rework: now uses the excellent watchdog library, lots of code removed.
* Added a task scheduler that takes care of checking mail, training the classifier and maintaining the document search index.
* Updated dependencies. Now uses Pipenv all around.
* Updated Dockerfile and docker-compose. Now uses ``supervisord`` to run everything paperless-related in a single container.
* **Settings:**
* ``PAPERLESS_FORGIVING_OCR`` is now default and gone. Reason: Even if ``langdetect`` fails to detect
a language, tesseract still does a very good job at ocr'ing a document with the default language.
Certain language specifics such as umlauts may not get picked up properly.
* ``PAPERLESS_DEBUG`` defaults to ``false``.
* The presence of ``PAPERLESS_DBHOST`` now determines whether to use PostgreSQL or
sqlite.
* Many more small changes here and there. The usual stuff.
2.7.0
=====

@ -1,15 +0,0 @@
Changelog (jonaswinkler)
########################
1.0.0
=====
* First release based on paperless 2.6.0
* Added: Automatic document classification using neural networks (replaces
regex-based tagging)
* Added: Document types
* Added: Archive serial number allows easy referencing of physical document
copies
* Added: Inbox tags (added automatically to newly consumed documents)
* Added: Document viewer on document edit page
* Database backend is now configurable

@ -54,7 +54,7 @@ source_suffix = '.rst'
master_doc = 'index'
# General information about the project.
project = u'Paperless'
project = u'Paperless-ng'
copyright = u'2015, Daniel Quinn'
# The version info for the project you're documenting, acts as replacement for
@ -205,7 +205,8 @@ try:
import sphinx_rtd_theme
html_theme = "sphinx_rtd_theme"
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
except ImportError:
except ImportError as e:
print("error " + str(e))
pass
# -- Options for LaTeX output ---------------------------------------------

@ -1,255 +0,0 @@
.. _consumption:
Consumption
###########
Once you've got Paperless setup, you need to start feeding documents into it.
Currently, there are three options: the consumption directory, IMAP (email), and
HTTP POST.
.. _consumption-directory:
The Consumption Directory
=========================
The primary method of getting documents into your database is by putting them in
the consumption directory. The ``document_consumer`` script runs in an infinite
loop looking for new additions to this directory and when it finds them, it goes
about the process of parsing them with the OCR, indexing what it finds, and
encrypting the PDF (if ``PAPERLESS_PASSPHRASE`` is set), storing it in the
media directory.
Getting stuff into this directory is up to you. If you're running Paperless
on your local computer, you might just want to drag and drop files there, but if
you're running this on a server and want your scanner to automatically push
files to this directory, you'll need to setup some sort of service to accept the
files from the scanner. Typically, you're looking at an FTP server like
`Proftpd`_ or `Samba`_.
.. _Proftpd: http://www.proftpd.org/
.. _Samba: http://www.samba.org/
So where is this consumption directory? It's wherever you define it. Look for
the ``CONSUMPTION_DIR`` value in ``settings.py``. Set that to somewhere
appropriate for your use and put some documents in there. When you're ready,
follow the :ref:`consumer <utilities-consumer>` instructions to get it running.
.. _consumption-directory-hook:
Hooking into the Consumption Process
------------------------------------
Sometimes you may want to do something arbitrary whenever a document is
consumed. Rather than try to predict what you may want to do, Paperless lets
you execute scripts of your own choosing just before or after a document is
consumed using a couple simple hooks.
Just write a script, put it somewhere that Paperless can read & execute, and
then put the path to that script in ``paperless.conf`` with the variable name
of either ``PAPERLESS_PRE_CONSUME_SCRIPT`` or
``PAPERLESS_POST_CONSUME_SCRIPT``. The script will be executed before or
after the document is consumed, respectively.
.. important::
These scripts are executed in a **blocking** process, which means that if
a script takes a long time to run, it can significantly slow down your
document consumption flow. If you want things to run asynchronously,
you'll have to fork the process in your script and exit.
.. _consumption-directory-hook-variables:
What Can These Scripts Do?
..........................
It's your script, so you're only limited by your imagination and the laws of
physics. However, the following values are passed to the scripts in order:
.. _consumption-director-hook-variables-pre:
Pre-consumption script
::::::::::::::::::::::
* Document file name
A simple but common example for this would be creating a simple script like
this:
``/usr/local/bin/ocr-pdf``
.. code:: bash
#!/usr/bin/env bash
pdf2pdfocr.py -i ${1}
``/etc/paperless.conf``
.. code:: bash
...
PAPERLESS_PRE_CONSUME_SCRIPT="/usr/local/bin/ocr-pdf"
...
This will pass the path to the document about to be consumed to ``/usr/local/bin/ocr-pdf``,
which will in turn call `pdf2pdfocr.py`_ on your document, which will then
overwrite the file with an OCR'd version of the file and exit. At which point,
the consumption process will begin with the newly modified file.
.. _pdf2pdfocr.py: https://github.com/LeoFCardoso/pdf2pdfocr
.. _consumption-director-hook-variables-post:
Post-consumption script
:::::::::::::::::::::::
* Document id
* Generated file name
* Source path
* Thumbnail path
* Download URL
* Thumbnail URL
* Correspondent
* Tags
The script can be in any language you like, but for a simple shell script
example, you can take a look at ``post-consumption-example.sh`` in the
``scripts`` directory in this project.
.. _consumption-imap:
IMAP (Email)
============
Another handy way to get documents into your database is to email them to
yourself. The typical use-case would be to be out for lunch and want to send a
copy of the receipt back to your system at home. Paperless can be taught to
pull emails down from an arbitrary account and dump them into the consumption
directory where the process :ref:`above <consumption-directory>` will follow the
usual pattern on consuming the document.
Some things you need to know about this feature:
* It's disabled by default. By setting the values below it will be enabled.
* It's been tested in a limited environment, so it may not work for you (please
submit a pull request if you can!)
* It's designed to **delete mail from the server once consumed**. So don't go
pointing this to your personal email account and wonder where all your stuff
went.
* Currently, only one photo (attachment) per email will work.
So, with all that in mind, here's what you do to get it running:
1. Setup a new email account somewhere, or if you're feeling daring, create a
folder in an existing email box and note the path to that folder.
2. In ``/etc/paperless.conf`` set all of the appropriate values in
``PATHS AND FOLDERS`` and ``SECURITY``.
If you decided to use a subfolder of an existing account, then make sure you
set ``PAPERLESS_CONSUME_MAIL_INBOX`` accordingly here. You also have to set
the ``PAPERLESS_EMAIL_SECRET`` to something you can remember 'cause you'll
have to include that in every email you send.
3. Restart the :ref:`consumer <utilities-consumer>`. The consumer will check
the configured email account at startup and from then on every 10 minutes
for something new and pulls down whatever it finds.
4. Send yourself an email! Note that the subject is treated as the file name,
so if you set the subject to ``Correspondent - Title - tag,tag,tag``, you'll
get what you expect. Also, you must include the aforementioned secret
string in every email so the fetcher knows that it's safe to import.
Note that Paperless only imports emails whose subject consists entirely of safe
characters. These are alphanumeric characters and ``-_ ,.'``.
5. After a few minutes, the consumer will poll your mailbox, pull down the
message, and place the attachment in the consumption directory with the
appropriate name. A few minutes later, the consumer will import it like any
other file.
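The safe-character rule from step 4 can be checked before sending an email. The
following is a minimal sketch, not Paperless' own code: the function name
``is_safe_subject`` is made up for illustration, and it assumes that
"alphanumeric" means ASCII letters and digits.

.. code:: python

    import re

    # Characters listed as safe above: alphanumerics plus - _ space , . '
    # (assumption: "alphanumeric" is limited to ASCII letters and digits)
    SAFE_SUBJECT = re.compile(r"[A-Za-z0-9\-_ ,.']+")


    def is_safe_subject(subject):
        """Return True if the subject uses only the characters listed as safe."""
        return SAFE_SUBJECT.fullmatch(subject) is not None


    for subject in (
        "Some Company - Invoice 2016-01-01 - money,invoices",
        "Some Company: Invoice #42",
    ):
        print(subject, "->", is_safe_subject(subject))

The second subject is rejected because of the colon and the hash sign.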
.. _consumption-http:
HTTP POST
=========
You can also submit a document via HTTP POST, so long as you do so after
authenticating. To push your document to Paperless, send an HTTP POST to the
server with the following name/value pairs:
* ``correspondent``: The name of the document's correspondent. Note that there
are restrictions on what characters you can use here. Specifically,
alphanumeric characters, ``-``, ``,``, ``.``, and ``'`` are ok, everything else is
out. You also can't use a dash with a space on either side (space, dash, space).
* ``title``: The title of the document. The rules for characters are the same
here as for the correspondent.
* ``document``: The file you're uploading
Specify ``enctype="multipart/form-data"``, and then POST your file with::
Content-Disposition: form-data; name="document"; filename="whatever.pdf"
An example of this in HTML is a typical form:
.. code:: html
<form method="post" enctype="multipart/form-data">
<input type="text" name="correspondent" value="My Correspondent" />
<input type="text" name="title" value="My Title" />
<input type="file" name="document" />
<input type="submit" name="go" value="Do the thing" />
</form>
But a potentially more useful way to do this would be in Python. Here we use
the requests library to handle basic authentication and to send the POST data
to the URL.
.. code:: python

    import os

    import requests
    from requests.auth import HTTPBasicAuth

    # You authenticate via BasicAuth or with a session id.
    # We use BasicAuth here
    username = "my-username"
    password = "my-super-secret-password"

    # Where you have Paperless installed and listening
    url = "http://localhost:8000/push"

    # Document metadata
    correspondent = "Test Correspondent"
    title = "Test Title"

    # The local file you want to push
    path = "/path/to/some/directory/my-document.pdf"

    with open(path, "rb") as f:
        response = requests.post(
            url=url,
            data={"title": title, "correspondent": correspondent},
            files={"document": (os.path.basename(path), f, "application/pdf")},
            auth=HTTPBasicAuth(username, password),
            allow_redirects=False,
        )

    if response.status_code == 202:
        # Everything worked out ok
        print("Upload successful")
    else:
        # If you don't get a 202, it's probably because your credentials
        # are wrong or something. This will give you a rough idea of what
        # happened.
        print("We got HTTP status code: {}".format(response.status_code))
        for k, v in response.headers.items():
            print("{}: {}".format(k, v))


@ -1,42 +0,0 @@
.. _customising:
Customising Paperless
#####################
Currently, the Paperless interface is just the default Django admin, which,
while powerful, is rather boring. If you'd like to give the site a bit of a
face-lift, or if you simply want to adjust the colours, contrast, or font size
to make things easier to read, you can do that by adding your own CSS or
Javascript quite easily.
.. _customising-overrides:
Overrides
=========
On every page load, Paperless looks in your media root directory
(the directory defined by your ``PAPERLESS_MEDIADIR`` configuration variable or
the default, ``<project root>/media/``) for two files:
* ``overrides.css``
* ``overrides.js``
If it finds either or both of those files, they'll be loaded into the page: the
CSS in the ``<head>``, and the Javascript stuffed into the last line of the
``<body>``.
.. _customising-overrides-note:
An important note about customisation
-------------------------------------
Any changes you make to the site with your CSS or Javascript are likely to
depend on the structure of the current HTML and/or the existing CSS rules. For
the most part it's safe to assume that these bits won't change, but *sometimes
they do* as features are added or bugs are fixed.
If you make a change that you think others would appreciate though, submit it
as a pull request and maybe we can find a way to work it into the project by
default!


@ -1,131 +0,0 @@
.. _guesswork:
Guesswork
#########
During the consumption process, Paperless tries to guess some of the attributes
of the document it's looking at. To do this it uses two approaches:
.. _guesswork-naming:
File Naming
===========
Any document you put into the consumption directory will be consumed, but if
you name the file right, it'll automatically set some values in the database
for you. This is the logic the consumer follows:
1. Try to find the correspondent, title, and tags in the file name following
the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``. Note that
the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or
``YYYYMMDDZ``. The ``Z`` refers to "Zulu time", AKA "UTC".
The tags are optional, so the format ``Date - Correspondent - Title.pdf``
works as well.
2. If that doesn't work, we skip the date and try this pattern:
``Correspondent - Title - tag,tag,tag.pdf``.
3. If that doesn't work, we try to find the correspondent and title in the file
name following the pattern: ``Correspondent - Title.pdf``.
4. If that doesn't work, just assume that the name of the file is the title.
So given the above, the following examples would work as you'd expect:
* ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
* ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
* ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
* ``Another Company - Letter of Reference.jpg``
* ``Dad's Recipe for Pancakes.png``
These however wouldn't work:
* ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
* ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
* ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
* ``Another Company- Letter of Reference.jpg``
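The four steps above can be sketched in a few lines of Python. This is an
illustration of the documented rules only, not the consumer's actual
implementation; the names ``parse_date`` and ``guess_from_filename`` and the
shape of the returned dictionary are assumptions.

.. code:: python

    import os
    import re
    from datetime import datetime, timezone

    # The rigid date format described above: YYYYMMDDHHMMSSZ or YYYYMMDDZ.
    DATE_RE = re.compile(r"(\d{14}|\d{8})Z")


    def parse_date(token):
        """Return a UTC datetime for a valid date token, otherwise None."""
        match = DATE_RE.fullmatch(token)
        if match is None:
            return None
        digits = match.group(1)
        fmt = "%Y%m%d%H%M%S" if len(digits) == 14 else "%Y%m%d"
        try:
            return datetime.strptime(digits, fmt).replace(tzinfo=timezone.utc)
        except ValueError:  # right shape, impossible date (e.g. month 13)
            return None


    def guess_from_filename(filename):
        """Guess date, correspondent, title and tags from a file name."""
        stem, _extension = os.path.splitext(filename)
        parts = stem.split(" - ")
        guess = {"created": None, "correspondent": None, "title": stem, "tags": []}

        # Step 1: a leading date, then correspondent, title and optional tags.
        created = parse_date(parts[0])
        if created is not None and len(parts) in (3, 4):
            guess["created"] = created
            parts = parts[1:]

        if len(parts) == 3:    # Step 2: Correspondent - Title - tag,tag,tag
            guess.update(correspondent=parts[0], title=parts[1],
                         tags=parts[2].split(","))
        elif len(parts) == 2:  # Step 3: Correspondent - Title
            guess.update(correspondent=parts[0], title=parts[1])
        # Step 4: otherwise the whole name (minus extension) stays the title.
        return guess


    print(guess_from_filename(
        "20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf"
    ))
    print(guess_from_filename("Another Company- Letter of Reference.jpg"))

Splitting on the exact sequence space, dash, space is what lets a title such as
``Invoice 2016-01-01`` keep its dashes. It is also why the last "wouldn't work"
example ends up with the whole name as its title.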
Do I have to be so strict about naming?
---------------------------------------
Rather than using the strict document naming rules, one can also set the option
``PAPERLESS_FILENAME_DATE_ORDER`` in ``paperless.conf`` to any date order
that is accepted by dateparser_. Doing so will cause ``paperless`` to prefer
any date it finds in the title over a date pulled from the document's text,
without requiring the strict formatting of the document filename as described
above.
.. _dateparser: https://github.com/scrapinghub/dateparser/blob/v0.7.0/docs/usage.rst#settings
Transforming filenames for parsing
----------------------------------
Some devices can't produce filenames that can be parsed by the default
parser. By configuring the option ``PAPERLESS_FILENAME_PARSE_TRANSFORMS`` in
``paperless.conf`` one can add transformations that are applied to the filename
before it's parsed.
The option contains a list of dictionaries of regular expressions (key:
``pattern``) and replacements (key: ``repl``) in JSON format, which are
applied in order by passing them to ``re.subn``. Transformation stops
after the first match, so at most one transformation is applied. The general
syntax is
.. code:: python
[{"pattern":"pattern1", "repl":"repl1"}, {"pattern":"pattern2", "repl":"repl2"}, ..., {"pattern":"patternN", "repl":"replN"}]
The example below is for a Brother ADS-2400N, a scanner that allows assigning
different names to different hardware buttons (useful for handling
multiple entities in one instance), but insists on adding ``_<count>``
to the filename.
.. code:: python
# Brother profile configuration, support "Name_Date_Count" (the default
# setting) and "Name_Count" (use "Name" as tag and "Count" as title).
PAPERLESS_FILENAME_PARSE_TRANSFORMS=[{"pattern":"^([a-z]+)_(\\d{8})_(\\d{6})_([0-9]+)\\.", "repl":"\\2\\3Z - \\4 - \\1."}, {"pattern":"^([a-z]+)_([0-9]+)\\.", "repl":" - \\2 - \\1."}]
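To see what such a setting does to a file name, the documented behaviour (apply
each pattern in order with ``re.subn`` and stop after the first match) can be
reproduced in a short sketch. The helper name ``apply_transforms`` and the two
sample file names are made up for illustration; this is not the consumer's own
code.

.. code:: python

    import json
    import re

    # The Brother example from above, kept as raw strings so that the JSON
    # escapes (\\d, \\2, ...) reach json.loads() unchanged.
    TRANSFORMS = (
        r'[{"pattern":"^([a-z]+)_(\\d{8})_(\\d{6})_([0-9]+)\\.", '
        r'"repl":"\\2\\3Z - \\4 - \\1."}, '
        r'{"pattern":"^([a-z]+)_([0-9]+)\\.", "repl":" - \\2 - \\1."}]'
    )


    def apply_transforms(filename, transforms_json):
        """Apply the first matching transformation, if any, to the file name."""
        for transform in json.loads(transforms_json):
            new_name, count = re.subn(transform["pattern"], transform["repl"], filename)
            if count > 0:
                return new_name  # stop after the first match
        return filename


    # Hypothetical scanner output in the "Name_Date_Count" and "Name_Count" styles
    print(apply_transforms("invoices_20150314_000700_0001.pdf", TRANSFORMS))
    print(apply_transforms("taxes_0007.pdf", TRANSFORMS))

The first name becomes ``20150314000700Z - 0001 - invoices.pdf`` and the second
becomes `` - 0007 - taxes.pdf``. A name that matches neither pattern is returned
unchanged.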
.. _guesswork-content:
Reading the Document Contents
=============================
After the consumer has tried to figure out what it could from the file name,
it starts looking at the content of the document itself. It will compare the
matching algorithms defined by every tag and correspondent already set in your
database to see if they apply to the text in that document. In other words,
if you defined a tag called ``Home Utility`` that had a ``match`` property of
``bc hydro`` and a ``matching_algorithm`` of ``literal``, Paperless will
automatically tag your newly-consumed document with your ``Home Utility`` tag
so long as the text ``bc hydro`` appears in the body of the document somewhere.
The matching logic is quite powerful, and supports searching the text of your
document with different algorithms, and as such, some experimentation may be
necessary to get things Just Right.
.. _guesswork-content-howto:
How Do I Set Up These Matching Algorithms?
------------------------------------------
Setting up the algorithms is easily done through the admin interface. When
you create a new correspondent or tag, there are optional fields for matching
text and matching algorithm. From the help info there:
.. note::
Which algorithm you want to use when matching text to the OCR'd PDF. Here,
"any" looks for any occurrence of any word provided in the PDF, while "all"
requires that every word provided appear in the PDF, albeit not in the
order provided. A "literal" match means that the text you enter must
appear in the PDF exactly as you've entered it, and "regular expression"
uses a regex to match the PDF. If you don't know what a regex is, you
probably don't want this option.
When using the "any" or "all" matching algorithms, you can search for terms
that consist of multiple words by enclosing them in double quotes. For example,
defining a match text of ``"Bank of America" BofA`` using the "any" algorithm,
will match documents that contain either "Bank of America" or "BofA", but will
not match documents containing "Bank of South America".
Then just save your tag/correspondent and run another document through the
consumer. Once complete, you should see the newly-created document,
automatically tagged with the appropriate data.
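The differences between the four algorithms are easier to see side by side. The
sketch below is an approximation under stated assumptions, not the code
Paperless runs: the function names are invented, matching is assumed to be
case-insensitive, and "any"/"all" are assumed to match whole words.

.. code:: python

    import re


    def split_terms(match):
        """Split match text into terms, keeping "double quoted" phrases together."""
        return [quoted or bare
                for quoted, bare in re.findall(r'"([^"]+)"|(\S+)', match)]


    def contains_term(text, term):
        """True if the term appears in the text as whole words (case-insensitive)."""
        pattern = r"\b" + re.escape(term) + r"\b"
        return re.search(pattern, text, flags=re.IGNORECASE) is not None


    def matches(text, match, algorithm):
        """Check document text against a match string using the given algorithm."""
        if algorithm == "literal":
            return match.lower() in text.lower()
        if algorithm == "regular expression":
            return re.search(match, text, flags=re.IGNORECASE) is not None
        terms = split_terms(match)
        if algorithm == "any":
            return any(contains_term(text, term) for term in terms)
        if algorithm == "all":
            return all(contains_term(text, term) for term in terms)
        raise ValueError("unknown matching algorithm: {}".format(algorithm))


    rule = '"Bank of America" BofA'
    print(matches("Statement from Bank of America", rule, "any"))
    print(matches("Bank of South America newsletter", rule, "any"))
    print(matches("hydro bill from bc", "bc hydro", "all"))
    print(matches("hydro bill from bc", "bc hydro", "literal"))

With the quoted phrase kept together, the "Bank of South America" text does not
match. The last two calls show that "all" ignores word order while "literal"
does not.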


@ -4,8 +4,8 @@ Paperless
=========
Paperless is a simple Django application running in two parts:
a :ref:`consumer <utilities-consumer>` (the thing that does the indexing) and
the :ref:`webserver <utilities-webserver>` (the part that lets you search &
a *Consumer* (the thing that does the indexing) and
the *Web server* (the part that lets you search &
download already-indexed documents). If you want to learn more about its
functions keep on reading after the installation section.
@ -25,26 +25,34 @@ finding stuff again. I feed documents right from the post box into the scanner
and then shred them. Perhaps you might find it useful too.
Paperless-ng
============
I wanted to make big changes to the project that will impact the way it is used
by its users greatly. Among the users who currently use paperless in production
there are probably many that don't want these changes right away. I also wanted
to have more control over what goes into the code and what does not. Therefore,
paperless-ng was created. NG stands for both Angular (the framework used for the
Frontend) and next-gen. Publishing this project under a different name also
avoids confusion between paperless and paperless-ng.
It would be great if this project could eventually merge back into the main
repository, but it needs a lot more work before that can happen.
Contents
========
.. toctree::
:maxdepth: 2
:maxdepth: 1
requirements
setup
consumption
usage_overview
advanced_usage
administration
api
utilities
guesswork
migrating
customising
extending
troubleshooting
contributing
scanners
screenshots
changelog
changelog_jonaswinkler


@ -1,109 +0,0 @@
.. _migrating:
Migrating, Updates, and Backups
===============================
As Paperless is still under active development, there's a lot that can change
as software updates roll out. You should backup often, so if anything goes
wrong during an update, you at least have a means of restoring to something
usable. Thankfully, there are automated ways of backing up, restoring, and
updating the software.
.. _migrating-backup:
Backing Up
----------
So you're bored of this whole project, or you want to make a remote backup of
your files for whatever reason. This is easy to do: simply use the
:ref:`exporter <utilities-exporter>` to dump your documents and database out
into an arbitrary directory.
.. _migrating-restoring:
Restoring
---------
Restoring your data is just as easy, since nearly all of your data exists either
in the file names, or in the contents of the files themselves. You just need to
create an empty database (just follow the
:ref:`installation instructions <setup-installation>` again) and then import the
``tags.json`` file you created as part of your backup. Lastly, copy your
exported documents into the consumption directory and start up the consumer.
.. code-block:: shell-session
$ cd /path/to/project
$ rm data/db.sqlite3 # Delete the database
$ cd src
$ ./manage.py migrate # Create the database
$ ./manage.py createsuperuser
$ ./manage.py loaddata /path/to/arbitrary/place/tags.json
$ cp /path/to/exported/docs/* /path/to/consumption/dir/
$ ./manage.py document_consumer
Importing your data if you are :ref:`using Docker <setup-installation-docker>`
is almost as simple:
.. code-block:: shell-session
# Stop and remove your current containers
$ docker-compose stop
$ docker-compose rm -f
# Recreate them, add the superuser
$ docker-compose up -d
$ docker-compose run --rm webserver createsuperuser
# Load the tags
$ cat /path/to/arbitrary/place/tags.json | docker-compose run --rm webserver loaddata_stdin -
# Load your exported documents into the consumption directory
# (How you do this highly depends on how you have set this up)
$ cp /path/to/exported/docs/* /path/to/mounted/consumption/dir/
After loading the documents into the consumption directory the consumer will
immediately start consuming the documents.
.. _migrating-updates:
Updates
-------
For the most part, all you have to do to update Paperless is run ``git pull``
on the directory containing the project files, and then use Django's
``migrate`` command to execute any database schema updates that might have been
rolled in as part of the update:
.. code-block:: shell-session
$ cd /path/to/project
$ git pull
$ pip install -r requirements.txt
$ cd src
$ ./manage.py migrate
Note that it's possible (even likely) that while ``git pull`` may update some
files, the ``migrate`` step may not update anything. This is totally normal.
Additionally, as new features are added, the ability to control those features
is typically added by way of an environment variable set in ``paperless.conf``.
You may want to take a look at the ``paperless.conf.example`` file to see if
there's anything new in there compared to what you've got in ``/etc``.
If you are :ref:`using Docker <setup-installation-docker>` the update process
is similar:
.. code-block:: shell-session
$ cd /path/to/project
$ git pull
$ docker build -t paperless .
$ docker-compose run --rm consumer migrate
$ docker-compose up -d
If ``git pull`` doesn't report any changes, there is no need to continue with
the remaining steps.


@ -1,125 +0,0 @@
.. _requirements:
Requirements
============
You need a Linux machine or Unix-like setup (theoretically an Apple machine
should work) that has the following software installed:
* `Python3`_ (with development libraries, pip and virtualenv)
* `GNU Privacy Guard`_
* `Tesseract`_, plus its language files matching your document base.
* `Imagemagick`_ version 6.7.5 or higher
* `unpaper`_
* `libpoppler-cpp-dev`_ PDF rendering library
* `optipng`_
.. _Python3: https://python.org/
.. _GNU Privacy Guard: https://gnupg.org
.. _Tesseract: https://github.com/tesseract-ocr
.. _Imagemagick: http://imagemagick.org/
.. _unpaper: https://github.com/unpaper/unpaper
.. _libpoppler-cpp-dev: https://poppler.freedesktop.org/
.. _optipng: http://optipng.sourceforge.net/
Notably, you should confirm how you access your Python3 installation. Many
Linux distributions will install Python3 in parallel to Python2, using the
names ``python3`` and ``python`` respectively. The same goes for ``pip3`` and
``pip``. Running Paperless with Python2 will likely break things, so make sure
that you're using the right version.
For the purposes of simplicity, ``python`` and ``pip`` are used everywhere to
refer to their Python3 versions.
In addition to the above, there are a number of Python requirements, all of
which are listed in a file called ``requirements.txt`` in the project root
directory.
If you're not working on a virtual environment (like Docker), you
should probably be using a virtualenv, but that's your call. The reasons why
you might choose a virtualenv or not aren't really within the scope of this
document. Needless to say if you don't know what a virtualenv is, you should
probably figure that out before continuing.
.. _requirements-apple:
Problems with Imagemagick & PDFs
--------------------------------
Some users have `run into problems`_ with getting ImageMagick to do its thing
with PDFs. Often this is the case with Apple systems using HomeBrew, but some
Linux distributions have had the problem as well. The solution appears to be to install
ghostscript as well as ImageMagick:
.. _run into problems: https://github.com/the-paperless-project/paperless/issues/25
.. code:: bash
$ brew install ghostscript
$ brew install imagemagick
$ brew install libmagic
.. _requirements-baremetal:
Python-specific Requirements: No Virtualenv
-------------------------------------------
If you don't care to use a virtual env, then installation of the Python
dependencies is easy:
.. code:: bash
$ pip install --user --requirement /path/to/paperless/requirements.txt
This will download and install all of the requirements into
``${HOME}/.local``. Remember that your distribution may be using ``pip3`` as
mentioned above.
.. _requirements-virtualenv:
Python-specific Requirements: Virtualenv
----------------------------------------
Using a virtualenv for this is pretty straightforward: create a virtualenv,
enter it, and install the requirements using the ``requirements.txt`` file:
.. code:: bash
$ virtualenv --python=/path/to/python3 /path/to/arbitrary/directory
$ . /path/to/arbitrary/directory/bin/activate
$ pip install --requirement /path/to/paperless/requirements.txt
Now you're ready to go. Just remember to enter (activate) your virtualenv
whenever you want to use Paperless.
.. _requirements-documentation:
Documentation
-------------
As generation of the documentation is not required for the use of Paperless,
dependencies for this process are not included in ``requirements.txt``. If
you'd like to generate your own docs locally, you'll need to:
.. code:: bash
$ pip install sphinx
and then cd into the ``docs`` directory and type ``make html``.
If you are using Docker, you can use the following commands to build the
documentation and run a webserver serving it on `port 8001`_:
.. code:: bash
$ pwd
/path/to/paperless
$ docker build -t paperless:docs -f docs/Dockerfile .
$ docker run --rm -it -p "8001:8000" paperless:docs
.. _port 8001: http://127.0.0.1:8001


@ -1,7 +1,8 @@
.. _scanners:
Scanner Recommendations
=======================
***********************
Scanner recommendations
***********************
As Paperless operates by watching a folder for new files, it doesn't care what
scanner you use, but sometimes finding a scanner that will write to an FTP,
@ -23,16 +24,19 @@ that works right for you based on recommentations from other Paperless users.
+---------+----------------+-----+-----+-----+----------------+
| Fujitsu | `ix500`_ | yes | | yes | `eonist`_ |
+---------+----------------+-----+-----+-----+----------------+
| Fujitsu | `S1300i`_ | yes | | yes | `jonaswinkler`_|
+---------+----------------+-----+-----+-----+----------------+
.. _ADS-1500W: https://www.brother.ca/en/p/ads1500w
.. _MFC-J6930DW: https://www.brother.ca/en/p/MFCJ6930DW
.. _MFC-J5910DW: https://www.brother.co.uk/printers/inkjet-printers/mfcj5910dw
.. _MFC-9142CDN: https://www.brother.co.uk/printers/laser-printers/mfc9140cdn
.. _ix500: http://www.fujitsu.com/us/products/computing/peripheral/scanners/scansnap/ix500/
.. _ix500: https://www.fujitsu.com/global/products/computing/peripheral/scanners/scansnap/ix500/
.. _S1300i: https://www.fujitsu.com/global/products/computing/peripheral/scanners/soho/s1300i/
.. _danielquinn: https://github.com/danielquinn
.. _ayounggun: https://github.com/ayounggun
.. _bmsleight: https://github.com/bmsleight
.. _eonist: https://github.com/eonist
.. _REOLDEV: https://github.com/REOLDEV
.. _jonaswinkler: https://github.com/jonaswinkler


@ -1,16 +0,0 @@
.. _screenshots:
Screenshots
===========
Once everything is set up, log in to Paperless using the web front-end.
.. image:: ./_static/Screenshot_first_run_login.png
Nice clean interface
.. image:: ./_static/Screenshot_first_logged.png
Some documents loaded in via FTP or using the scanner's FTP.
.. image:: ./_static/Screenshot_upload_and_scanned.png


@ -1,500 +1,187 @@
.. _setup:
*****
Setup
=====
Paperless isn't a very complicated app, but there are a few components, so some
basic documentation is in order. If you follow along in this document and
still have trouble, please open an `issue on GitHub`_ so I can fill in the
gaps.
.. _issue on GitHub: https://github.com/the-paperless-project/paperless/issues
.. _setup-download:
*****
Download
--------
########
The source is currently only available via GitHub, so grab it from there,
either by using ``git``:
by using ``git``:
.. code:: bash
$ git clone https://github.com/the-paperless-project/paperless.git
$ git clone https://github.com/jonaswinkler/paperless-ng.git
$ cd paperless
or just download the tarball and go that route:
.. code:: bash
$ cd to the directory where you want to run Paperless
$ wget https://github.com/the-paperless-project/paperless/archive/master.zip
$ unzip master.zip
$ cd paperless-master
.. _setup-installation:
Installation & Configuration
----------------------------
Installation
############
You can go multiple routes with setting up and running Paperless:
* The `bare metal route`_
* The `docker route`_
* A suggested `linux containers route`_
* The `docker route`_
* The `bare metal route`_
The recommended setup route is docker, since it takes care of all dependencies
for you.
The `docker route`_ is quick & easy.
The `bare metal route`_ is a bit more complicated to set up but makes it easier
The `bare metal route`_ is more complicated to set up but makes it easier
should you want to contribute some code back.
The `linux containers route`_ is quick, but makes a lot of assumptions about the
set-up. On the other hand, the script could be used to install on a base
Debian or Ubuntu server.
Docker Route
============
.. _docker route: setup-installation-docker_
.. _bare metal route: setup-installation-bare-metal_
.. _Docker Machine: https://docs.docker.com/machine/
1. Install `Docker`_ and `docker-compose`_. [#compose]_
.. _setup-installation-bare-metal:
.. caution::
Standard (Bare Metal)
+++++++++++++++++++++
If you want to use the included ``docker-compose.yml.example`` file, you
need to have at least Docker version **17.09.0** and docker-compose
version **1.17.0**.
1. Install the requirements as per the :ref:`requirements <requirements>` page.
2. Within the extract of master.zip go to the ``src`` directory.
3. Copy ``../paperless.conf.example`` to ``/etc/paperless.conf`` and open it in
your favourite editor. As this file contains passwords, it should only be
readable by user root and paperless! Set the values for:
See the `Docker installation guide`_ on how to install the current
version of Docker for your operating system or Linux distribution of
choice. To get an up-to-date version of docker-compose, follow the
`docker-compose installation guide`_ if your package repository doesn't
include it.
Set the values for:
.. _Docker installation guide: https://docs.docker.com/engine/installation/
.. _docker-compose installation guide: https://docs.docker.com/compose/install/
* ``PAPERLESS_CONSUMPTION_DIR``: this is where your documents will be
dumped to be consumed by Paperless.
* ``PAPERLESS_OCR_THREADS``: this is the number of threads the OCR process
will spawn to process document pages in parallel.
* ``PAPERLESS_PASSPHRASE``: this is only required if you want to use GPG to
encrypt your document files. This is the passphrase Paperless uses to
encrypt/decrypt the original documents. Don't worry about defining this
if you don't want to use encryption (the default).
2. Create a copy of ``docker-compose.yml.example`` as ``docker-compose.yml``
and a copy of ``docker-compose.env.example`` as ``docker-compose.env``.
You'll be editing both these files: taking a copy ensures that you can
``git pull`` to receive updates without risking merge conflicts with your
modified versions of the configuration files.
3. Modify ``docker-compose.yml`` to your preferences. You should change the path
to the consumption directory in this file. Find the line that specifies where
to mount the consumption directory:
Note also that if you're using the ``runserver`` as mentioned below, you
should make sure that ``PAPERLESS_DEBUG`` is set to ``"true"`` or is just
commented out, as this is the default.
.. code::
- ./consume:/usr/src/paperless/consume
Replace the part BEFORE the colon with a local directory of your choice:
4. Initialise the SQLite database with ``./manage.py migrate``.
5. Collect the static files for the webserver with ``./manage.py collectstatic``.
6. Create a user for your Paperless instance with
``./manage.py createsuperuser``. Follow the prompts to create your user.
7. Start the webserver with ``./manage.py runserver <IP>:<PORT>``.
If no specific IP or port is given, the default is ``127.0.0.1:8000`` also
known as http://localhost:8000/.
You should now be able to visit your (empty) installation at
`Paperless webserver`_ or whatever you chose before. You can log in with the
user/pass you created in step 6.
.. code::
8. In a separate window, change to the ``src`` directory in this repo again,
but this time, you should start the consumer script with
``./manage.py document_consumer``.
9. Scan something or put a file into the ``CONSUMPTION_DIR``.
10. Wait a few minutes
11. Visit the document list on your webserver, and it should be there, indexed
and downloadable.
.. caution::
This installation is not secure. Once everything is working head over to
`Making things more permanent`_
.. _Paperless webserver: http://127.0.0.1:8000
.. _Making things more permanent: setup-permanent_
.. _setup-installation-docker:
Docker Method
+++++++++++++
1. Install `Docker`_.
.. caution::
As mentioned earlier, this guide assumes that you use Docker natively
under Linux. If you are using `Docker Machine`_ under Mac OS X or
Windows, you will have to adapt IP addresses, volume-mounting, command
execution and maybe more.
2. Install `docker-compose`_. [#compose]_
.. caution::
If you want to use the included ``docker-compose.yml.example`` file, you
need to have at least Docker version **1.12.0** and docker-compose
version **1.9.0**.
See the `Docker installation guide`_ on how to install the current
version of Docker for your operating system or Linux distribution of
choice. To get an up-to-date version of docker-compose, follow the
`docker-compose installation guide`_ if your package repository doesn't
include it.
.. _Docker installation guide: https://docs.docker.com/engine/installation/
.. _docker-compose installation guide: https://docs.docker.com/compose/install/
3. Create a copy of ``docker-compose.yml.example`` as ``docker-compose.yml``
and a copy of ``docker-compose.env.example`` as ``docker-compose.env``.
You'll be editing both these files: taking a copy ensures that you can
``git pull`` to receive updates without risking merge conflicts with your
modified versions of the configuration files.
4. Modify ``docker-compose.yml`` to your preferences, following the
instructions in comments in the file. The only change that is a hard
requirement is to specify where the consumption directory should
mount.[#dockercomposeyml]_
.. caution::
If you are using NFS mounts for the consume directory you also need to
change the command to turn off inotify as it doesn't work with NFS
``command: ["document_consumer", "--no-inotify"]``
- /home/jonaswinkler/paperless-inbox:/usr/src/paperless/consume
Don't change the part after the colon or paperless won't find your documents.
5. Modify ``docker-compose.env`` and adapt the following environment variables:
4. Modify ``docker-compose.env``, following the comments in the file. The
most important change is to set ``USERMAP_UID`` and ``USERMAP_GID``
to the uid and gid of your user on the host system. This ensures that
both the docker container and you on the host machine have write access
to the consumption directory. If your UID and GID on the host system are
1000 (the default for the first normal user on most systems), it will
work out of the box without any modifications.
``PAPERLESS_PASSPHRASE``
This is the passphrase Paperless uses to encrypt/decrypt the original
document. If you aren't planning on using GPG encryption, you can just
leave this undefined.
``PAPERLESS_OCR_THREADS``
This is the number of threads the OCR process will spawn to process
document pages in parallel. If the variable is not set, Python determines
the core-count of your CPU and uses that value.
``PAPERLESS_OCR_LANGUAGES``
If you want the OCR to recognize other languages in addition to the
default English, set this parameter to a space separated list of
three-letter language codes according to `ISO 639-2/T`_. For a list of available
languages -- including their three letter codes -- see the
`Alpine packagelist`_.
``USERMAP_UID`` and ``USERMAP_GID``
If you want to mount the consumption volume (directory ``/consume`` within
the containers) to a host-directory -- which you probably want to do --
access rights might be an issue. The default user and group ``paperless``
in the containers have an id of 1000. The containers will enforce that the
owning group of the consumption directory will be ``paperless`` to be able
to delete consumed documents. If your host-system has a group with an ID
of 1000 and you don't want this group to have access rights to the
consumption directory, you can use ``USERMAP_GID`` to change the id in the
container and thus the one of the consumption directory. Furthermore, you
can change the id of the default user as well using ``USERMAP_UID``.
``PAPERLESS_USE_SSL``
If you want Paperless to use SSL for the user interface, set this variable
to ``true``. You also need to copy your certificate and key to the ``data``
directory, named ``ssl.cert`` and ``ssl.key``.
This is not an ideal solution and, if possible, a reverse proxy with nginx
is preferred.
6. Run ``docker-compose up -d``. This will create and start the necessary
5. Run ``docker-compose up -d``. This will create and start the necessary
containers.
7. To be able to log in, you will need a super user. To create it, execute the
following command:
.. code-block:: shell-session
6. To be able to log in, you will need a super user. To create it, execute the
following command:
$ docker-compose run --rm webserver createsuperuser
.. code-block:: shell-session
This will prompt you to set a username (default ``paperless``), an optional
e-mail address and finally a password.
8. The default ``docker-compose.yml`` exports the webserver on your local port
8000. If you haven't adapted this, you should now be able to visit your
`Paperless webserver`_ at ``http://127.0.0.1:8000`` (or
``https://127.0.0.1:8000`` if you enabled SSL). You can log in with the
user and password you just created.
9. Add files to the consumption directory in whichever way you prefer. The
following are two possible options:
$ docker-compose run --rm webserver createsuperuser
1. Mount the consumption directory to a local host path by modifying your
``docker-compose.yml``:
.. code-block:: diff
diff --git a/docker-compose.yml b/docker-compose.yml
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -17,9 +18,8 @@ services:
volumes:
- paperless-data:/usr/src/paperless/data
- paperless-media:/usr/src/paperless/media
- - /consume
+ - /local/path/you/choose:/consume
.. danger::
While the consumption container will ensure at startup that it can
**delete** a consumed file from a host-mounted directory, it might
not be able to **read** the document in the first place if the access
rights to the file are incorrect.
Make sure that the documents you put into the consumption directory
will either be readable by everyone (``chmod o+r file.pdf``) or
readable by the default user or group id 1000 (or the one you have
set with ``USERMAP_UID`` or ``USERMAP_GID`` respectively).
2. Use ``docker cp`` to copy your files directly into the container:
.. code-block:: shell-session
$ # Identify your containers
$ docker-compose ps
Name Command State Ports
-------------------------------------------------------------------------
paperless_consumer_1 /sbin/docker-entrypoint.sh ... Exit 0
paperless_webserver_1 /sbin/docker-entrypoint.sh ... Exit 0
$ docker cp /path/to/your/file.pdf paperless_consumer_1:/consume
``docker cp`` is a one-shot-command, just like ``cp``. This means that
every time you want to consume a new document, you will have to execute
``docker cp`` again. You can of course automate this process, but option
1 is generally the preferred one.
.. danger::
``docker cp`` will change the owning user and group of a copied file
to the acting user at the destination, which will be ``root``.
You therefore need to ensure that the documents you want to copy into
the container are readable by everyone (``chmod o+r file.pdf``)
before copying them.
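The read-access requirement described in the two warnings above can be checked before a file is handed to the consumer. The following is only a sketch: the helper name ``readable_by`` is hypothetical and not part of Paperless, the default ids of 1000 are taken from the text above, and supplementary groups and ACLs are deliberately ignored.

```python
import os
import stat


def readable_by(path, uid=1000, gid=1000):
    """Return True if the classic Unix mode bits grant read access.

    Hypothetical helper: it mirrors the owner/group/other check order of
    the kernel but ignores supplementary groups, ACLs and root privileges.
    """
    info = os.stat(path)
    mode = info.st_mode
    if info.st_uid == uid:
        # The owner is judged by the owner bits only.
        return bool(mode & stat.S_IRUSR)
    if info.st_gid == gid:
        # Members of the owning group are judged by the group bits only.
        return bool(mode & stat.S_IRGRP)
    # Everyone else falls back to the "other" bits (what chmod o+r sets).
    return bool(mode & stat.S_IROTH)
```

If the function returns ``False`` for a document, ``chmod o+r file.pdf`` as mentioned above is the simplest fix.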
This will prompt you to set a username, an optional e-mail address and
finally a password.
7. The default ``docker-compose.yml`` exports the webserver on your local port
8000. If you haven't adapted this, you should now be able to visit your
Paperless instance at ``http://127.0.0.1:8000``. You can log in with the
user and password you just created.
.. _Docker: https://www.docker.com/
.. _docker-compose: https://docs.docker.com/compose/install/
.. _ISO 639-2/T: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
.. _Alpine packagelist: https://pkgs.alpinelinux.org/packages?name=tesseract-ocr-data*&arch=x86_64
.. [#compose] You of course don't have to use docker-compose, but it
simplifies deployment immensely. If you know your way around Docker, feel
free to tinker around without using compose!
.. [#dockercomposeyml] If you're upgrading your docker-compose images from
version 1.1.0 or earlier, you might need to edit the ``docker-compose.yml``
file and change the ``image: pitkley/paperless`` directive in both the
``webserver`` and ``consumer`` sections to ``build: ./``, as per the newer
``docker-compose.yml.example`` file.
Bare Metal Route
================
.. _setup-permanent:
.. warning::
Making Things a Little more Permanent
-------------------------------------
TBD. Use Docker for now.
Once you've tested things and are happy with the work flow, you should secure
the installation and automate the process of starting the webserver and
consumer.
Migration to paperless-ng
#########################
At its core, paperless-ng is still paperless and fully compatible. However, some
things have changed under the hood, so you need to adapt your setup depending on
how you installed paperless. The important things to keep in mind are as follows.
.. _setup-permanent-webserver:
* Read the :ref:`paperless_changelog` and take note of breaking changes.
* It is recommended to use postgresql as the database now. The docker-compose
deployment will automatically create a postgresql instance and instruct
paperless to use it. This means that if you use the docker-compose script
with your current paperless media and data volumes and used the default
sqlite database, **it will not use your sqlite database and it may seem
as if your documents are gone**. You may use the provided
``docker-compose.yml.sqlite.example`` script, which does not use postgresql.
* The task scheduler of paperless, which is used to execute periodic tasks
such as email checking and maintenance, requires a `redis`_ message broker
instance. The docker-compose route takes care of that.
* The layout of the folder structure for your documents and data remains the
same.
* The frontend needs to be built from source. The docker image takes care of
that.
Using a Real Webserver
++++++++++++++++++++++
Migration to paperless-ng is then performed in a few simple steps:
The default is to use Django's development server, as that's easy and does the
job well enough on a home network. However it is heavily discouraged to use
it for more than that.
1. Do a backup for two purposes: If something goes wrong, you still have your
data. Second, if you don't like paperless-ng, you can switch back to
paperless.
If you want to do things right you should use a real webserver capable of
handling more than one thread. You will also have to let the webserver serve
the static files (CSS, JavaScript) from the directory configured in
``PAPERLESS_STATICDIR``. The default static files directory is ``../static``.
2. Replace the paperless source with paperless-ng. If you're using git, this
is done by:
For that you need to activate your virtual environment and collect the static
files with the command:
.. code:: bash
.. code:: bash
$ git remote set-url origin https://github.com/jonaswinkler/paperless-ng
$ git pull
$ cd <paperless directory>/src
$ ./manage.py collectstatic
3. If you are using docker, copy ``docker-compose.yml.example`` to
``docker-compose.yml`` and ``docker-compose.env.example`` to
``docker-compose.env``. Make adjustments to these files as necessary.
See `docker route`_ for details.
4. Update paperless. See :ref:`administration-updating` for details.
Apache
~~~~~~
5. Start paperless-ng.
This is a configuration supplied by `steckerhalter`_ on GitHub. It uses Apache
and mod_wsgi, with a Paperless installation in ``/home/paperless/``:
.. code:: bash
.. code:: apache
$ docker-compose up
This will also migrate your database as usual. Verify by inspecting the
output that the migration was successfully executed. CTRL-C will then
gracefully stop the container. After that, you can start paperless-ng as
usual with
<VirtualHost *:80>
ServerName example.com
.. code:: bash
Alias /static/ /home/paperless/paperless/static/
<Directory /home/paperless/paperless/static>
Require all granted
</Directory>
$ docker-compose up -d
WSGIScriptAlias / /home/paperless/paperless/src/paperless/wsgi.py
WSGIDaemonProcess example.com user=paperless group=paperless threads=5 python-path=/home/paperless/paperless/src:/home/paperless/.env/lib/python3.6/site-packages
WSGIProcessGroup example.com
6. Paperless installed a permanent redirect to ``admin/`` in your browser. This
redirect is still in place and prevents access to the new UI. Clear
everything related to paperless in your browser's data in order to fix
this issue.
<Directory /home/paperless/paperless/src/paperless>
<Files wsgi.py>
Require all granted
</Files>
</Directory>
</VirtualHost>
Moving data from sqlite to postgresql
=====================================
.. _steckerhalter: https://github.com/steckerhalter
.. warning::
TBD.
Nginx + Gunicorn
~~~~~~~~~~~~~~~~
If you're using Nginx, the most common setup is to combine it with a
Python-based server like Gunicorn so that Nginx is acting as a proxy. Below is
a copy of a simple Nginx configuration fragment making use of a gunicorn
instance listening on localhost port 8000.
.. code:: nginx
server {
listen 80;
index index.html index.htm index.php;
access_log /var/log/nginx/paperless_access.log;
error_log /var/log/nginx/paperless_error.log;
location /static {
autoindex on;
alias <path-to-paperless-static-directory>;
}
location / {
proxy_set_header Host $http_host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_pass http://127.0.0.1:8000;
}
}
The gunicorn server can be started with the command:
.. code-block:: shell
$ <path-to-paperless-virtual-environment>/bin/gunicorn --pythonpath=<path-to-paperless>/src paperless.wsgi -w 2
.. _setup-permanent-standard-systemd:
Standard (Bare Metal + Systemd)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you're running on a bare metal system that's using Systemd, you can use the
service unit files in the ``scripts`` directory to set this up.
1. You'll need to create a group and user called ``paperless`` (without login)
2. Set up Paperless in a place that this new user can read and write to.
3. Ensure ``/etc/paperless`` is readable by the ``paperless`` user.
4. Copy the service file from the ``scripts`` directory to
``/etc/systemd/system``.
.. code-block:: bash
$ cp /path/to/paperless/scripts/paperless-consumer.service /etc/systemd/system/
$ cp /path/to/paperless/scripts/paperless-webserver.service /etc/systemd/system/
5. Edit the service file to point the ``ExecStart`` line to the proper location
of your paperless install, referencing the appropriate Python binary. For
example:
``ExecStart=/path/to/python3 /path/to/paperless/src/manage.py document_consumer``.
6. Start and enable (so they start on boot) the services.
.. code-block:: bash
$ systemctl enable paperless-consumer
$ systemctl enable paperless-webserver
$ systemctl start paperless-consumer
$ systemctl start paperless-webserver
.. _setup-permanent-standard-upstart:
Standard (Bare Metal + Upstart)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Ubuntu 14.04 and earlier use the `Upstart`_ init system to start services
during the boot process. To configure Upstart to run Paperless automatically
after restarting your system:
1. Change to the directory where Upstart's configuration files are kept:
``cd /etc/init``
2. Create a new file: ``sudo nano paperless-server.conf``
3. In the newly-created file enter::
start on (local-filesystems and net-device-up IFACE=eth0)
stop on shutdown
respawn
respawn limit 10 5
script
exec <path to paperless virtual environment>/bin/gunicorn --pythonpath=<path to paperless>/src paperless.wsgi -w 2
end script
Note that you'll need to replace the path placeholders above with the
corresponding paths of your installation.
If you are using a network interface other than ``eth0``, you will have to
change ``IFACE=eth0``. For example, if you are connected via WiFi, you will
likely need to replace ``eth0`` above with ``wlan0``. To see all interfaces,
run ``ifconfig -a``.
Save the file.
4. Create a new file: ``sudo nano paperless-consumer.conf``
5. In the newly-created file enter::
start on (local-filesystems and net-device-up IFACE=eth0)
stop on shutdown
respawn
respawn limit 10 5
script
exec <path to paperless virtual environment>/bin/python <path to paperless>/manage.py document_consumer
end script
Replace the path placeholder and ``eth0`` with the appropriate value and save the file.
These two configuration files together will start both the Paperless webserver
and document consumer processes once the file system and the specified network
interface are available after boot. Furthermore, if either process ever exits
unexpectedly, Upstart will try to restart it a maximum of 10 times within a 5
second period.
.. _Upstart: http://upstart.ubuntu.com/
.. _setup-permanent-docker:
Docker
~~~~~~
If you're using Docker, you can set a restart-policy_ in the
``docker-compose.yml`` to have the containers automatically start with the
Docker daemon.
.. _restart-policy: https://docs.docker.com/engine/reference/commandline/run/#restart-policies-restart
.. _redis: https://redis.io/

216
docs/usage_overview.rst Normal file

@ -0,0 +1,216 @@
**************
Usage Overview
**************
Paperless is an application that manages your personal documents. With
the help of a document scanner (see :ref:`scanners`), paperless transforms
your unwieldy physical document binders into a searchable archive and
provides many utilities for finding and managing your documents.
Terms and definitions
#####################
Paperless essentially consists of two different parts for managing your
documents:
* The *consumer* watches a specified folder and adds all documents in that
folder to paperless.
* The *web server* provides a UI that you use to manage and search for your
scanned documents.
Each document has a couple of fields that you can assign to them:
* A *Document* is a piece of paper that sometimes contains valuable
information.
* The *correspondent* of a document is the person, institution or company that
a document either originates from or is sent to.
* A *tag* is a label that you can assign to documents. Think of tags as more
powerful folders: Multiple documents can be grouped together with a single
tag, however, a single document can also have multiple tags. This is not
possible with folders. The reason folders are not implemented in paperless
is simply that tags are much more versatile than folders.
* A *document type* is used to demarcate the type of a document such as letter,
bank statement, invoice, contract, etc. It is used to identify what a document
is about.
* The *date added* of a document is the date the document was scanned into
paperless. You cannot and should not change this date.
* The *date created* of a document is the date the document was initially issued.
This can be the date you bought a product, the date you signed a contract, or
the date a letter was sent to you.
* The *archive serial number* (short: ASN) of a document is the identifier of
the document in your physical document binders. See
:ref:`usage-recommended_workflow` below.
* The *content* of a document is the text that was OCR'ed from the document.
This text is fed into the search engine and is used for matching tags,
correspondents and document types.
.. TODO: hyperref
Frontend overview
#################
.. warning::
TBD. Add some fancy screenshots!
Adding documents to paperless
#############################
Once you've got Paperless set up, you need to start feeding documents into it.
Currently, there are three options: the consumption directory, IMAP (email), and
HTTP POST.
The consumption directory
=========================
The primary method of getting documents into your database is by putting them in
the consumption directory. The consumer runs in an infinite
loop looking for new additions to this directory and when it finds them, it goes
about the process of parsing them with the OCR, indexing what it finds, and storing
it in the media directory.
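As a rough illustration of this loop, the sketch below polls a directory and reports files it has not seen before. It is not Paperless's actual consumer code: the function name ``scan_once`` and the ``seen`` set are hypothetical, and the real consumer additionally runs OCR, indexes the text and stores the result.

```python
import os


def scan_once(directory, seen):
    """Return files in `directory` that no earlier call has reported.

    `seen` is a set of paths that is updated in place (hypothetical
    helper for illustration, not part of Paperless).
    """
    new_files = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Only regular files count as documents; sub-directories are skipped.
        if os.path.isfile(path) and path not in seen:
            seen.add(path)
            new_files.append(path)
    return new_files
```

A consumer built on this idea would call ``scan_once`` repeatedly, sleeping between calls, and hand each returned path to the OCR step.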
Getting stuff into this directory is up to you. If you're running Paperless
on your local computer, you might just want to drag and drop files there, but if
you're running this on a server and want your scanner to automatically push
files to this directory, you'll need to set up some sort of service to accept the
files from the scanner. Typically, you're looking at an FTP server like
`Proftpd`_ or a Windows folder share with `Samba`_.
.. _Proftpd: http://www.proftpd.org/
.. _Samba: http://www.samba.org/
.. TODO: hyperref to configuration of the location of this magic folder.
IMAP (Email)
============
Another handy way to get documents into your database is to email them to
yourself. The typical use-case would be to be out for lunch and want to send a
copy of the receipt back to your system at home. Paperless can be taught to
pull emails down from an arbitrary account and dump them into the consumption
directory where the consumer will follow the
usual pattern of consuming the document.
Some things you need to know about this feature:
* It's disabled by default. By setting the values below it will be enabled.
* It's been tested in a limited environment, so it may not work for you (please
submit a pull request if you can!)
* It's designed to **delete mail from the server once consumed**. So don't go
pointing this to your personal email account and wonder where all your stuff
went.
* Currently, only one photo (attachment) per email will work.
So, with all that in mind, here's what you do to get it running:
1. Set up a new email account somewhere, or if you're feeling daring, create a
folder in an existing email box and note the path to that folder.
2. In ``/etc/paperless.conf`` set all of the appropriate values in
``PATHS AND FOLDERS`` and ``SECURITY``.
If you decided to use a subfolder of an existing account, then make sure you
set ``PAPERLESS_CONSUME_MAIL_INBOX`` accordingly here. You also have to set
the ``PAPERLESS_EMAIL_SECRET`` to something you can remember 'cause you'll
have to include that in every email you send.
3. Restart paperless. Paperless will check
the configured email account at startup and from then on every 10 minutes
for something new and pull down whatever it finds.
4. Send yourself an email! Note that the subject is treated as the file name,
so if you set the subject to ``Correspondent - Title - tag,tag,tag``, you'll
get what you expect. Also, you must include the aforementioned secret
string in every email so the fetcher knows that it's safe to import.
Note that Paperless only imports emails whose subject consists of safe
characters. These are alpha-numeric characters and ``-_ ,.'``.
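To make the subject convention concrete, here is a small sketch that checks a subject against the safe characters listed above and splits it into its three parts. It is an illustration only, not the code Paperless uses: ``parse_subject`` is a hypothetical name, "alpha-numeric" is assumed to mean ASCII letters and digits, and the handling of the secret string is left out.

```python
import re

# Assumption: "alpha-numeric" means ASCII letters and digits here.
SAFE_SUBJECT = re.compile(r"[A-Za-z0-9\-_ ,.']+")


def parse_subject(subject):
    """Split ``Correspondent - Title - tag,tag,tag`` into its parts.

    Returns None if the subject contains unsafe characters or does not
    have exactly three " - " separated fields (hypothetical helper).
    """
    if not SAFE_SUBJECT.fullmatch(subject):
        return None
    parts = [part.strip() for part in subject.split(" - ")]
    if len(parts) != 3:
        return None
    correspondent, title, tag_field = parts
    tags = [tag.strip() for tag in tag_field.split(",") if tag.strip()]
    return {"correspondent": correspondent, "title": title, "tags": tags}
```

With this sketch, a subject such as ``Acme Bank - Statement March - bank,finance`` yields the correspondent ``Acme Bank``, the title ``Statement March`` and two tags, while a subject containing characters like ``<`` is rejected.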
REST API
========
You can also submit a document using the REST API, see the API section for details.
.. _usage-recommended_workflow:
The recommended workflow
########################
Once you have familiarized yourself with paperless and are ready to use it
for all your documents, the recommended workflow for managing your documents
is as follows. This workflow also takes into account that some documents
have to be kept in physical form, but still ensures that you get all the
advantages for these documents as well.
Preparations in paperless
=========================
* Create an inbox tag that gets assigned to all new documents.
* Create a TODO tag.
Processing of the physical documents
====================================
Keep a physical inbox. Whenever you receive a document that you need to
archive, put it into your inbox. Regularly, do the following for all documents
in your inbox:
1. For each document, decide if you need to keep the document in physical
form. This applies to certain important documents, such as contracts and
certificates.
2. If you need to keep the document, write a running number on the document
before scanning, starting at one and counting upwards. This is the archive
serial number, or ASN in short.
3. Scan the document.
4. If the document has an ASN assigned, store it in a *single* binder, sorted
by ASN. Don't order this binder in any other way.
5. If the document has no ASN, throw it away. Yay!
Over time, you will notice that your physical binder will fill up. If it is
full, label the binder with the range of ASNs in this binder (e.g., "Documents
1 to 343"), store the binder in your cellar or elsewhere, and start a new
binder.
The idea behind this process is that you will never have to use the physical
binders to find a document. If you need a specific physical document, you
may find this document by:
1. Searching in paperless for the document.
2. Identifying the ASN of the document, since it appears on the scan.
3. Grabbing the relevant document binder and getting the document. This is easy
since the documents are sorted by ASN.
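Because each binder is labelled with a contiguous ASN range, finding the right binder is a simple range lookup. The sketch below illustrates this under the assumption that the binder labels are kept as a sorted list of ``(first_asn, last_asn)`` pairs; ``find_binder`` is a hypothetical helper and not a Paperless feature.

```python
from bisect import bisect_right


def find_binder(binders, asn):
    """Return the index of the binder whose ASN range contains `asn`.

    `binders` is a list of (first_asn, last_asn) tuples sorted by
    first_asn, e.g. [(1, 343), (344, 700)]. Returns None if no binder
    holds the given ASN.
    """
    starts = [first for first, _last in binders]
    # The candidate is the last binder that starts at or before `asn`.
    index = bisect_right(starts, asn) - 1
    if index >= 0 and binders[index][0] <= asn <= binders[index][1]:
        return index
    return None
```

For example, with the binders ``[(1, 343), (344, 700)]`` a document with ASN 344 is found in the second binder (index 1).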
Processing of documents in paperless
====================================
Once you have scanned in a document, proceed in paperless as follows.
1. If the document has an ASN, assign the ASN to the document.
2. Assign a correspondent to the document (e.g., your employer, bank, etc.).
This isn't strictly necessary but helps in finding a document when you need
it.
3. Assign a document type (e.g., invoice, bank statement, etc.) to the document.
This isn't strictly necessary but helps in finding a document when you need
it.
4. Assign a proper title to the document (the name of an item you bought, the
subject of the letter, etc.).
5. Check that the date of the document is correct. Paperless tries to read
the date from the content of the document, but this fails sometimes if the
OCR is bad or multiple dates appear on the document.
6. Remove inbox tags from the documents.
Task management
===============
Some documents require attention and require you to act on the document. You
may take two different approaches to handle these documents based on how
regularly you intend to use paperless and scan documents.
* If you scan and process your documents in paperless regularly, assign a
TODO tag to all scanned documents that you need to process. Create a saved
view on the dashboard that shows all documents with this tag.
* If you do not scan documents regularly and use paperless solely for archiving,
create a physical TODO box next to your physical inbox and put documents you
need to process in the TODO box. Once you have performed the task associated
with the document, move it to the inbox.


@ -1,284 +0,0 @@
.. _utilities:
Utilities
=========
There are basically three utilities to Paperless: the webserver, consumer, and
if needed, the exporter. They're all detailed here.
.. _utilities-webserver:
The Webserver
-------------
At the heart of it, Paperless is a simple Django webservice, and the entire
interface is based on Django's standard admin interface. Once running, visiting
the URL for your service delivers the admin, through which you can get a
detailed listing of all available documents, search for specific files, and
download whatever it is you're looking for.
.. _utilities-webserver-howto:
How to Use It
.............
The webserver is started via the ``manage.py`` script:
.. code-block:: shell-session
$ /path/to/paperless/src/manage.py runserver
By default, the server runs on localhost, port 8000, but you can change this
with a few arguments, run ``manage.py --help`` for more information.
Add the option ``--noreload`` to reduce resource usage. Otherwise, the server
continuously polls all source files for changes to auto-reload them.
Note that when exiting this command your webserver will disappear.
If you want to run this full-time (which is kind of the point)
you'll need to have it start in the background -- something you'll need to
figure out for your own system. To get you started though, there are Systemd
service files in the ``scripts`` directory.
.. _utilities-consumer:
The Consumer
------------
The consumer script runs in an infinite loop, constantly looking at a directory
for documents to parse and index. The process is pretty straightforward:
1. Look in ``CONSUMPTION_DIR`` for a document. If one is found, go to #2.
If not, wait 10 seconds and try again. On Linux, new documents are detected
instantly via inotify, so there's no waiting involved.
2. Parse the document with Tesseract
3. Create a new record in the database with the OCR'd text
4. Attempt to automatically assign document attributes by doing some guesswork.
Read up on the :ref:`guesswork documentation<guesswork>` for more
information about this process.
5. Encrypt the document (if you have a passphrase set) and store it in the
``media`` directory under ``documents/originals``.
6. Go to #1.
.. _utilities-consumer-howto:
How to Use It
.............
The consumer is started via the ``manage.py`` script:
.. code-block:: shell-session
$ /path/to/paperless/src/manage.py document_consumer
This starts the service that will consume documents as they appear in
``CONSUMPTION_DIR``.
Note that this command runs continuously, so exiting it will mean your consumer
disappears. If you want to run this full-time (which is kind of the point)
you'll need to have it start in the background -- something you'll need to
figure out for your own system. To get you started though, there are Systemd
service files in the ``scripts`` directory.
Some command line arguments are available to customize the behavior of the
consumer. By default it will use ``/etc/paperless.conf`` values. Display the
help with:
.. code-block:: shell-session
$ /path/to/paperless/src/manage.py document_consumer --help
.. _utilities-exporter:
The Exporter
------------
Tired of fiddling with Paperless, or just want to do something stupid and are
afraid of accidentally damaging your files? You can export all of your
documents into neatly named, dated, and unencrypted files.
.. _utilities-exporter-howto:
How to Use It
.............
This too is done via the ``manage.py`` script:
.. code-block:: shell-session
$ /path/to/paperless/src/manage.py document_exporter /path/to/somewhere/
This will dump all of your unencrypted documents into ``/path/to/somewhere``
for you to do with as you please. The files are accompanied with a special
file, ``manifest.json`` which can be used to :ref:`import the files
<utilities-importer>` at a later date if you wish.
.. _utilities-exporter-howto-docker:
Docker
______
If you are :ref:`using Docker <setup-installation-docker>`, running the
exporter is almost as easy. To mount a volume for exports, follow the
instructions in the ``docker-compose.yml.example`` file for the ``/export``
volume (making the changes in your own ``docker-compose.yml`` file, of course).
Once you have the volume mounted, the command to run an export is:
.. code-block:: shell-session
$ docker-compose run --rm consumer document_exporter /export
If you prefer to use ``docker run`` directly, supply the necessary command-line
options:
.. code-block:: shell-session
$ # Identify your containers
$ docker-compose ps
Name Command State Ports
-------------------------------------------------------------------------
paperless_consumer_1 /sbin/docker-entrypoint.sh ... Exit 0
paperless_webserver_1 /sbin/docker-entrypoint.sh ... Exit 0
$ # Make sure to replace your passphrase and remove or adapt the id mapping
$ docker run --rm \
--volumes-from paperless_data_1 \
--volume /path/to/arbitrary/place:/export \
-e PAPERLESS_PASSPHRASE=YOUR_PASSPHRASE \
-e USERMAP_UID=1000 -e USERMAP_GID=1000 \
paperless document_exporter /export
.. _utilities-importer:
The Importer
------------
Looking to transfer Paperless data from one instance to another, or just want
to restore from a backup? This is your go-to toy.
.. _utilities-importer-howto:
How to Use It
.............
The importer works just like the exporter. You point it at a directory, and
the script does the rest of the work:
.. code-block:: shell-session
$ /path/to/paperless/src/manage.py document_importer /path/to/somewhere/
Docker
______
Assuming that you've already gone through the steps above in the
:ref:`export <utilities-exporter-howto-docker>` section, then the easiest thing
to do is just re-use the ``/export`` path you already set up:
.. code-block:: shell-session
$ docker-compose run --rm consumer document_importer /export
Similarly, if you're not using docker-compose, you can adjust the export
instructions above to do the import.
.. _utilities-retagger:
Re-running your tagging and correspondent matchers
--------------------------------------------------
Say you've imported a few hundred documents and now want to introduce
a tag or set up a new correspondent, and apply its matching to all of
the currently-imported docs. This problem is common enough that
there are tools for it.
.. _utilities-retagger-howto:
How to Do It
............
This too is done via the ``manage.py`` script:
.. code:: bash
$ /path/to/paperless/src/manage.py document_retagger
Run this after changing or adding tagging rules. It'll loop over all
of the documents in your database and attempt to match all of your
tags to them. If one matches, it'll be applied. And don't worry, you
can run this as often as you like, it won't double-tag a document.
.. code:: bash
$ /path/to/paperless/src/manage.py document_correspondents
This is the equivalent command to run after adding or changing a correspondent.
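The reason repeated runs are harmless is that tags behave like a set: adding a tag that is already present changes nothing. The sketch below illustrates this idea only. It assumes a simple case-insensitive keyword match, which is a stand-in for the matching rules Paperless actually applies, and the function name ``retag`` and the dictionary layout are hypothetical.

```python
def retag(documents, tag_rules):
    """Apply every matching tag to every document, in place.

    `documents` is a list of dicts with a "content" string and a "tags"
    set; `tag_rules` maps a tag name to a keyword. Keyword matching is an
    assumption made for this illustration.
    """
    for document in documents:
        text = document["content"].lower()
        for tag, keyword in tag_rules.items():
            if keyword.lower() in text:
                # Adding to a set is idempotent, so a second run cannot
                # attach the same tag twice.
                document["tags"].add(tag)
```

Running ``retag`` twice over the same documents leaves them in the same state as running it once.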
.. _utilities-encyption:
Enabling Encryption
-------------------
Let's say you've imported a few documents to play around with paperless and now
you are using it more seriously and want to enable encryption of your files.
.. _utilities-encryption-howto:
Basic Syntax
.............
Again we'll use the ``manage.py`` script, passing ``change_storage_type``:
.. code:: console
$ /path/to/paperless/src/manage.py change_storage_type --help
usage: manage.py change_storage_type [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH] [--traceback]
[--no-color] [--passphrase PASSPHRASE]
{gpg,unencrypted} {gpg,unencrypted}
This is how you migrate your stored documents from an encrypted state to an
unencrypted one (or vice-versa)
positional arguments:
{gpg,unencrypted} The state you want to change your documents from
{gpg,unencrypted} The state you want to change your documents to
optional arguments:
--passphrase PASSPHRASE
If PAPERLESS_PASSPHRASE isn't set already, you need to
specify it here
Enabling Encryption
...................
Basic usage to enable encryption of your document store (**USE A MORE SECURE PASSPHRASE**):
(Note: If ``PAPERLESS_PASSPHRASE`` isn't set already, you need to specify it here)
.. code:: bash
$ /path/to/paperless/src/manage.py change_storage_type [--passphrase 'SECR3TP4SSPHRA$E'] unencrypted gpg
Disabling Encryption
....................
Basic usage to disable encryption of your document store:
(Note: Again, if ``PAPERLESS_PASSPHRASE`` isn't set already, you need to specify it here)
.. code:: bash
$ /path/to/paperless/src/manage.py change_storage_type [--passphrase 'SECR3TP4SSPHRA$E'] gpg unencrypted