mirror of
https://github.com/paperless-ngx/paperless-ngx.git
synced 2025-04-02 13:45:10 -05:00
reworking the documentation.
This commit is contained in:
parent
04335e4aac
commit
f2dbb74d44
BIN
docs/_static/Screenshot_first_logged.png
vendored
Binary file not shown.
Before Width: | Height: | Size: 60 KiB |
BIN
docs/_static/Screenshot_first_run_login.png
vendored
Binary file not shown.
Before Width: | Height: | Size: 26 KiB |
BIN
docs/_static/Screenshot_upload_and_scanned.png
vendored
Binary file not shown.
Before Width: | Height: | Size: 113 KiB |
354
docs/administration.rst
Normal file
@ -0,0 +1,354 @@
|
||||
|
||||
**************
|
||||
Administration
|
||||
**************
|
||||
|
||||
|
||||
Making backups
|
||||
##############
|
||||
|
||||
.. warning::
|
||||
|
||||
This section is not updated yet.
|
||||
|
||||
So you're bored of this whole project, or you want to make a remote backup of
|
||||
your files for whatever reason. This is easy to do, simply use the
|
||||
:ref:`exporter <utilities-exporter>` to dump your documents and database out
|
||||
into an arbitrary directory.
|
||||
|
||||
|
||||
.. _migrating-restoring:
|
||||
|
||||
Restoring
|
||||
=========
|
||||
|
||||
Restoring your data is just as easy, since nearly all of your data exists either
|
||||
in the file names, or in the contents of the files themselves. You just need to
|
||||
create an empty database (just follow the
|
||||
:ref:`installation instructions <setup-installation>` again) and then import the
|
||||
``tags.json`` file you created as part of your backup. Lastly, copy your
|
||||
exported documents into the consumption directory and start up the consumer.
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ cd /path/to/project
|
||||
$ rm data/db.sqlite3 # Delete the database
|
||||
$ cd src
|
||||
$ ./manage.py migrate # Create the database
|
||||
$ ./manage.py createsuperuser
|
||||
$ ./manage.py loaddata /path/to/arbitrary/place/tags.json
|
||||
$ cp /path/to/exported/docs/* /path/to/consumption/dir/
|
||||
$ ./manage.py document_consumer
|
||||
|
||||
Importing your data if you are :ref:`using Docker <setup-installation-docker>`
|
||||
is almost as simple:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
# Stop and remove your current containers
|
||||
$ docker-compose stop
|
||||
$ docker-compose rm -f
|
||||
|
||||
# Recreate them, add the superuser
|
||||
$ docker-compose up -d
|
||||
$ docker-compose run --rm webserver createsuperuser
|
||||
|
||||
# Load the tags
|
||||
$ cat /path/to/arbitrary/place/tags.json | docker-compose run --rm webserver loaddata_stdin -
|
||||
|
||||
# Load your exported documents into the consumption directory
|
||||
# (How you do this highly depends on how you have set this up)
|
||||
$ cp /path/to/exported/docs/* /path/to/mounted/consumption/dir/
|
||||
|
||||
After loading the documents into the consumption directory the consumer will
|
||||
immediately start consuming the documents.
|
||||
|
||||
.. _administration-updating:
|
||||
|
||||
Updating paperless
|
||||
##################
|
||||
|
||||
.. warning::
|
||||
|
||||
This section is not updated yet.
|
||||
|
||||
For the most part, all you have to do to update Paperless is run ``git pull``
|
||||
on the directory containing the project files, and then use Django's
|
||||
``migrate`` command to execute any database schema updates that might have been
|
||||
rolled in as part of the update:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ cd /path/to/project
|
||||
$ git pull
|
||||
$ pip install -r requirements.txt
|
||||
$ cd src
|
||||
$ ./manage.py migrate
|
||||
|
||||
Note that it's possible (even likely) that while ``git pull`` may update some
|
||||
files, the ``migrate`` step may not update anything. This is totally normal.
|
||||
|
||||
Additionally, as new features are added, the ability to control those features
|
||||
is typically added by way of an environment variable set in ``paperless.conf``.
|
||||
You may want to take a look at the ``paperless.conf.example`` file to see if
|
||||
there's anything new in there compared to what you've got in ``/etc``.
|
||||
|
||||
If you are :ref:`using Docker <setup-installation-docker>` the update process
|
||||
is similar:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ cd /path/to/project
|
||||
$ git pull
|
||||
$ docker build -t paperless .
|
||||
$ docker-compose run --rm consumer migrate
|
||||
$ docker-compose up -d
|
||||
|
||||
If ``git pull`` doesn't report any changes, there is no need to continue with
|
||||
the remaining steps.
|
||||
|
||||
This depends on the route you've chosen to run paperless.
|
||||
|
||||
a. If you are not using docker, update python requirements. Paperless uses
|
||||
`Pipenv`_ for managing dependencies:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
$ pip install --upgrade pipenv
|
||||
$ cd /path/to/paperless
|
||||
$ pipenv install
|
||||
|
||||
This creates a new virtual environment (or uses your existing environment)
|
||||
and installs all dependencies into it. Running commands inside the environment
|
||||
is done via
|
||||
|
||||
.. code:: bash
|
||||
|
||||
$ cd /path/to/paperless/src
|
||||
$ pipenv run python3 manage.py my_command
|
||||
|
||||
You will also need to build the frontend each time a new update is pushed.
|
||||
See updating paperless for more information. TODO REFERENCE
|
||||
|
||||
b. If you are using docker, build the docker image.
|
||||
|
||||
.. code:: bash
|
||||
|
||||
$ docker build -t jonaswinkler/paperless-ng:latest .
|
||||
|
||||
Copy either docker-compose.yml.example or docker-compose.yml.sqlite.example
|
||||
to docker-compose.yml and adjust the consumption directory.
|
||||
|
||||
Management utilities
|
||||
####################
|
||||
|
||||
Paperless comes with some management commands that perform various maintenance
|
||||
tasks on your paperless instance. You can invoke these commands either by
|
||||
|
||||
.. code:: bash
|
||||
|
||||
$ cd /path/to/paperless
|
||||
$ docker-compose run --rm webserver <command> <arguments>
|
||||
|
||||
or
|
||||
|
||||
.. code:: bash
|
||||
|
||||
$ cd /path/to/paperless/src
|
||||
$ pipenv run python manage.py <command> <arguments>
|
||||
|
||||
depending on whether you use docker or not.
|
||||
|
||||
All commands have built-in help, which can be accessed by executing them with
|
||||
the argument ``--help``.
|
||||
|
||||
Document exporter
|
||||
=================
|
||||
|
||||
The document exporter exports all your data from paperless into a folder for
|
||||
backup or migration to another DMS.
|
||||
|
||||
.. code::
|
||||
|
||||
document_exporter target
|
||||
|
||||
``target`` is a folder to which the data gets written. This includes documents,
|
||||
thumbnails and a ``manifest.json`` file. The manifest contains all metadata from
|
||||
the database (correspondents, tags, etc).
|
||||
|
||||
When you use the provided docker compose script, specify ``../export`` as the
|
||||
target. This path inside the container is automatically mounted on your host on
|
||||
the folder ``export``.
|
||||
|
||||
|
||||
.. _utilities-importer:
|
||||
|
||||
Document importer
|
||||
=================
|
||||
|
||||
The document importer takes the export produced by the `Document exporter`_ and
|
||||
imports it into paperless.
|
||||
|
||||
The importer works just like the exporter. You point it at a directory, and
|
||||
the script does the rest of the work:
|
||||
|
||||
.. code::
|
||||
|
||||
document_importer source
|
||||
|
||||
When you use the provided docker compose script, put the export inside the
|
||||
``export`` folder in your paperless source directory. Specify ``../export``
|
||||
as the ``source``.
|
||||
|
||||
|
||||
.. _utilities-retagger:
|
||||
|
||||
Document retagger
|
||||
=================
|
||||
|
||||
Say you've imported a few hundred documents and now want to introduce
|
||||
a tag or set up a new correspondent, and apply its matching to all of
|
||||
the currently-imported docs. This problem is common enough that
|
||||
there are tools for it.
|
||||
|
||||
.. code::
|
||||
|
||||
document_retagger [-h] [-c] [-T] [-t] [-i] [--use-first] [-f]
|
||||
|
||||
optional arguments:
|
||||
-c, --correspondent
|
||||
-T, --tags
|
||||
-t, --document_type
|
||||
-i, --inbox-only
|
||||
--use-first
|
||||
-f, --overwrite
|
||||
|
||||
Run this after changing or adding matching rules. It'll loop over all
|
||||
of the documents in your database and attempt to match documents
|
||||
according to the new rules.
|
||||
|
||||
Specify any combination of ``-c``, ``-T`` and ``-t`` to have the
|
||||
retagger perform matching of the specified metadata type. If you don't
|
||||
specify any of these options, the document retagger won't do anything.
|
||||
|
||||
Specify ``-i`` to have the document retagger work on documents tagged
|
||||
with inbox tags only. This is useful when you don't want to mess with
|
||||
your already processed documents.
|
||||
|
||||
When multiple document types or correspondents match a single document,
|
||||
the retagger won't assign these to the document. Specify ``--use-first``
|
||||
to override this behaviour and just use the first correspondent or type
|
||||
it finds. This option does not apply to tags, since any amount of tags
|
||||
can be applied to a document.
|
||||
|
||||
Finally, ``-f`` specifies that you wish to overwrite already assigned
|
||||
correspondents, types and/or tags. The default behaviour is to not
|
||||
assign correspondents and types to documents that have this data already
|
||||
assigned. ``-f`` works differently for tags: By default, only additional tags get
|
||||
added to documents, no tags will be removed. With ``-f``, tags that don't
|
||||
match a document anymore get removed as well.
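
The effect of ``--use-first`` and ``-f`` can be sketched in a few lines of Python. This only illustrates the rules described above and is not the retagger's actual code: the helper names ``pick_single`` and ``merge_tags`` are made up for this example, and two behaviours are assumptions (an existing value is left alone when nothing matches, and every tag is treated as if it had a matching rule).

```python
def pick_single(current, candidates, use_first=False, overwrite=False):
    """Choose a correspondent or document type (hypothetical helper)."""
    if current is not None and not overwrite:
        # Default: never replace a value that is already assigned.
        return current
    if len(candidates) == 1:
        return candidates[0]
    if len(candidates) > 1 and use_first:
        # --use-first: several candidates match, so take the first one.
        return candidates[0]
    # No match, or an ambiguous match without --use-first.
    # Assumption: the existing value is left untouched.
    return current


def merge_tags(current, matching, overwrite=False):
    """Compute the new tag set of a document (hypothetical helper)."""
    if overwrite:
        # -f: tags that no longer match are removed as well.
        # Assumption: every tag has a matching rule.
        return set(matching)
    # Default: tags are only ever added, never removed.
    return set(current) | set(matching)
```

For example, ``merge_tags({"a", "b"}, {"b", "c"})`` keeps all three tags, while the same call with ``overwrite=True`` drops ``a``.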
|
||||
|
||||
|
||||
Managing the Automatic matching algorithm
|
||||
=========================================
|
||||
|
||||
The *Auto* matching algorithm requires a trained neural network to work.
|
||||
This network needs to be updated whenever something in your data
|
||||
changes. The docker image takes care of that automatically with the task
|
||||
scheduler. You can manually renew the classifier by invoking the following
|
||||
management command:
|
||||
|
||||
.. code::
|
||||
|
||||
document_create_classifier
|
||||
|
||||
This command takes no arguments.
|
||||
|
||||
|
||||
Managing the document search index
|
||||
==================================
|
||||
|
||||
The document search index is responsible for delivering search results for the
|
||||
website. The document index is automatically updated whenever documents get
|
||||
added to, changed, or removed from paperless. However, if the search yields
|
||||
non-existing documents or won't find anything, you may need to recreate the
|
||||
index manually.
|
||||
|
||||
.. code::
|
||||
|
||||
document_index {reindex,optimize}
|
||||
|
||||
Specify ``reindex`` to have the index created from scratch. This may take some
|
||||
time.
|
||||
|
||||
Specify ``optimize`` to optimize the index. This updates certain aspects of
|
||||
the index and usually makes queries faster and also ensures that the
|
||||
autocompletion works properly. This command is regularly invoked by the task
|
||||
scheduler.
|
||||
|
||||
|
||||
Managing filenames
|
||||
==================
|
||||
|
||||
.. warning::
|
||||
|
||||
TBD
|
||||
|
||||
.. code::
|
||||
|
||||
document_renamer
|
||||
|
||||
|
||||
.. _utilities-encyption:
|
||||
|
||||
Managing encryption
|
||||
===================
|
||||
|
||||
Documents can be stored in Paperless using GnuPG encryption.
|
||||
|
||||
.. danger::
|
||||
|
||||
Encryption is deprecated since paperless-ng 1.0 and doesn't really provide any
|
||||
additional security, since you have to store the passphrase in a configuration
|
||||
file on the same system as the encrypted documents for paperless to work. Also,
|
||||
paperless provides transparent access to your encrypted documents.
|
||||
|
||||
Consider running paperless on an encrypted filesystem instead, which will then
|
||||
at least provide security against physical hardware theft.
|
||||
|
||||
.. code::
|
||||
|
||||
change_storage_type [--passphrase PASSPHRASE] {gpg,unencrypted} {gpg,unencrypted}
|
||||
|
||||
positional arguments:
|
||||
{gpg,unencrypted} The state you want to change your documents from
|
||||
{gpg,unencrypted} The state you want to change your documents to
|
||||
|
||||
optional arguments:
|
||||
--passphrase PASSPHRASE
|
||||
|
||||
Enabling encryption
|
||||
-------------------
|
||||
|
||||
Basic usage to enable encryption of your document store (**USE A MORE SECURE PASSPHRASE**):
|
||||
|
||||
(Note: If ``PAPERLESS_PASSPHRASE`` isn't set already, you need to specify it here)
|
||||
|
||||
.. code::
|
||||
|
||||
change_storage_type [--passphrase SECR3TP4SSPHRA$E] unencrypted gpg
|
||||
|
||||
|
||||
Disabling encryption
|
||||
--------------------
|
||||
|
||||
Basic usage to disable encryption of your document store:
|
||||
|
||||
(Note: Again, if ``PAPERLESS_PASSPHRASE`` isn't set already, you need to specify it here)
|
||||
|
||||
.. code::
|
||||
|
||||
change_storage_type [--passphrase SECR3TP4SSPHRA$E] gpg unencrypted
|
||||
|
||||
|
||||
.. _Pipenv: https://pipenv.pypa.io/en/latest/
|
244
docs/advanced_usage.rst
Normal file
@ -0,0 +1,244 @@
|
||||
***************
|
||||
Advanced topics
|
||||
***************
|
||||
|
||||
Paperless offers a couple features that automate certain tasks and make your life
|
||||
easier.
|
||||
|
||||
Guesswork
|
||||
#########
|
||||
|
||||
|
||||
Any document you put into the consumption directory will be consumed, but if
|
||||
you name the file right, it'll automatically set some values in the database
|
||||
for you. This is the logic the consumer follows:
|
||||
|
||||
1. Try to find the correspondent, title, and tags in the file name following
|
||||
the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``. Note that
|
||||
the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or
|
||||
``YYYYMMDDZ``. The ``Z`` refers to "Zulu time", also known as UTC.
|
||||
The tags are optional, so the format ``Date - Correspondent - Title.pdf``
|
||||
works as well.
|
||||
2. If that doesn't work, we skip the date and try this pattern:
|
||||
``Correspondent - Title - tag,tag,tag.pdf``.
|
||||
3. If that doesn't work, we try to find the correspondent and title in the file
|
||||
name following the pattern: ``Correspondent - Title.pdf``.
|
||||
4. If that doesn't work, just assume that the name of the file is the title.
|
||||
|
||||
So given the above, the following examples would work as you'd expect:
|
||||
|
||||
* ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
|
||||
* ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
|
||||
* ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
|
||||
* ``Another Company - Letter of Reference.jpg``
|
||||
* ``Dad's Recipe for Pancakes.png``
|
||||
|
||||
These however wouldn't work:
|
||||
|
||||
* ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
|
||||
* ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
|
||||
* ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
|
||||
* ``Another Company- Letter of Reference.jpg``
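
The fallback order described above can be approximated with a short Python sketch. It is an illustration under assumptions, not the consumer's real parser: the names ``guess_metadata`` and ``parse_date`` are hypothetical, and splitting on the literal separator `` - `` is a simplification chosen for readability.

```python
import re
from datetime import datetime, timezone

# Dates must look like YYYYMMDDZ or YYYYMMDDHHMMSSZ.
DATE_RE = re.compile(r"\d{8}(\d{6})?Z")


def parse_date(text):
    """Turn a rigidly formatted date string into a UTC datetime."""
    fmt = "%Y%m%d%H%M%SZ" if len(text) == 15 else "%Y%m%dZ"
    return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)


def guess_metadata(filename):
    """Guess created date, correspondent, title and tags (hypothetical)."""
    stem = filename.rsplit(".", 1)[0]  # drop the file extension
    parts = stem.split(" - ")
    # Step 4 is the default: the whole name is the title.
    created, correspondent, title, tags = None, None, stem, []
    if len(parts) in (3, 4) and DATE_RE.fullmatch(parts[0]):
        # Step 1: Date - Correspondent - Title [- tag,tag,tag]
        created = parse_date(parts[0])
        correspondent, title = parts[1], parts[2]
        if len(parts) == 4:
            tags = parts[3].split(",")
    elif len(parts) == 3:
        # Step 2: Correspondent - Title - tag,tag,tag
        correspondent, title = parts[0], parts[1]
        tags = parts[2].split(",")
    elif len(parts) == 2:
        # Step 3: Correspondent - Title
        correspondent, title = parts
    return {
        "created": created,
        "correspondent": correspondent,
        "title": title,
        "tags": tags,
    }
```

With this sketch, ``Another Company- Letter of Reference.jpg`` falls through to the last step because the separator lacks its leading space, so the whole name becomes the title.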
|
||||
|
||||
Do I have to be so strict about naming?
|
||||
=======================================
|
||||
|
||||
Rather than using the strict document naming rules, one can also set the option
|
||||
``PAPERLESS_FILENAME_DATE_ORDER`` in ``paperless.conf`` to any date order
|
||||
that is accepted by dateparser_. Doing so will cause ``paperless`` to default
|
||||
to any date format that is found in the title, instead of a date pulled from
|
||||
the document's text, without requiring the strict formatting of the document
|
||||
filename as described above.
|
||||
|
||||
.. _dateparser: https://github.com/scrapinghub/dateparser/blob/v0.7.0/docs/usage.rst#settings
|
||||
|
||||
Transforming filenames for parsing
|
||||
==================================
|
||||
|
||||
Some devices can't produce filenames that can be parsed by the default
|
||||
parser. By configuring the option ``PAPERLESS_FILENAME_PARSE_TRANSFORMS`` in
|
||||
``paperless.conf`` one can add transformations that are applied to the filename
|
||||
before it's parsed.
|
||||
|
||||
The option contains a list of dictionaries of regular expressions (key:
|
||||
``pattern``) and replacements (key: ``repl``) in JSON format, which are
|
||||
applied in order by passing them to ``re.subn``. Transformation stops
|
||||
after the first match, so at most one transformation is applied. The general
|
||||
syntax is
|
||||
|
||||
.. code:: python
|
||||
|
||||
[{"pattern":"pattern1", "repl":"repl1"}, {"pattern":"pattern2", "repl":"repl2"}, ..., {"pattern":"patternN", "repl":"replN"}]
|
||||
|
||||
The example below is for a Brother ADS-2400N, a scanner that lets you assign
different names to different hardware buttons (useful for handling
|
||||
multiple entities in one instance), but insists on adding ``_<count>``
|
||||
to the filename.
|
||||
|
||||
.. code:: python
|
||||
|
||||
# Brother profile configuration, support "Name_Date_Count" (the default
|
||||
# setting) and "Name_Count" (use "Name" as tag and "Count" as title).
|
||||
PAPERLESS_FILENAME_PARSE_TRANSFORMS=[{"pattern":"^([a-z]+)_(\\d{8})_(\\d{6})_([0-9]+)\\.", "repl":"\\2\\3Z - \\4 - \\1."}, {"pattern":"^([a-z]+)_([0-9]+)\\.", "repl":" - \\2 - \\1."}]
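
To see what these transformations do to a filename, the following Python sketch applies them the way the text describes: in order, via ``re.subn``, stopping after the first pattern that matches. The function name ``apply_transforms`` and the sample filenames are made up for this example, and decoding the option with ``json.loads`` is an assumption about how the value is read.

```python
import json
import re

# The Brother example from above, as it would appear in paperless.conf.
TRANSFORMS_JSON = r'''[
  {"pattern": "^([a-z]+)_(\\d{8})_(\\d{6})_([0-9]+)\\.", "repl": "\\2\\3Z - \\4 - \\1."},
  {"pattern": "^([a-z]+)_([0-9]+)\\.", "repl": " - \\2 - \\1."}
]'''


def apply_transforms(filename, transforms):
    """Apply the first matching transformation and return the result."""
    for transform in transforms:
        new_name, count = re.subn(transform["pattern"], transform["repl"], filename)
        if count > 0:
            # At most one transformation is applied.
            return new_name
    return filename  # nothing matched, keep the name unchanged


transforms = json.loads(TRANSFORMS_JSON)
print(apply_transforms("receipts_20200101_123456_0001.pdf", transforms))
print(apply_transforms("receipts_0001.pdf", transforms))
```

The first call yields ``20200101123456Z - 0001 - receipts.pdf``, which the filename parser can then read as date, correspondent and title. The second yields `` - 0001 - receipts.pdf``.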
|
||||
|
||||
|
||||
Matching tags, correspondents and document types
|
||||
################################################
|
||||
|
||||
After the consumer has tried to figure out what it could from the file name,
|
||||
it starts looking at the content of the document itself. It will compare the
|
||||
matching algorithms defined by every tag and correspondent already set in your
|
||||
database to see if they apply to the text in that document. In other words,
|
||||
if you defined a tag called ``Home Utility`` that had a ``match`` property of
|
||||
``bc hydro`` and a ``matching_algorithm`` of ``literal``, Paperless will
|
||||
automatically tag your newly-consumed document with your ``Home Utility`` tag
|
||||
so long as the text ``bc hydro`` appears in the body of the document somewhere.
|
||||
|
||||
The matching logic is quite powerful, and supports searching the text of your
|
||||
document with different algorithms, and as such, some experimentation may be
|
||||
necessary to get things right.
|
||||
|
||||
In order to have a tag, correspondent or type assigned automatically to newly
|
||||
consumed documents, assign a match and matching algorithm using the web
|
||||
interface. These settings define when to assign correspondents, tags and types
|
||||
to documents.
|
||||
|
||||
The following algorithms are available:
|
||||
|
||||
* **Any:** Looks for any occurrence of any word provided in match in the PDF.
|
||||
If you define the match as ``Bank1 Bank2``, it will match documents containing
|
||||
either of these terms.
|
||||
* **All:** Requires that every word provided appears in the PDF, though not
necessarily in the order provided.
|
||||
* **Literal:** Matches only if the match appears exactly as provided in the PDF.
|
||||
* **Regular expression:** Parses the match as a regular expression and tries to
|
||||
find a match within the document.
|
||||
* **Fuzzy match:** Matches text that is approximately, rather than exactly, equal to the match. Consult the source for the exact scoring.
|
||||
* **Auto:** Tries to automatically match new documents. This does not require you
|
||||
to set a match. See the notes below.
|
||||
|
||||
When using the "any" or "all" matching algorithms, you can search for terms
|
||||
that consist of multiple words by enclosing them in double quotes. For example,
|
||||
defining a match text of ``"Bank of America" BofA`` using the "any" algorithm,
|
||||
will match documents that contain either "Bank of America" or "BofA", but will
|
||||
not match documents containing "Bank of South America".
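
The difference between the "any", "all", "literal" and "regular expression" algorithms, including the handling of quoted terms, can be illustrated with a small Python sketch. It is not the code Paperless runs: the function names are hypothetical, and the case-insensitive, whole-word comparison is an assumption made for this example.

```python
import re


def split_terms(match):
    """Split a match string into terms, keeping "quoted phrases" together."""
    pairs = re.findall(r'"([^"]+)"|(\S+)', match)
    return [quoted or bare for quoted, bare in pairs]


def term_found(term, text):
    # Assumption: terms are compared case-insensitively as whole words.
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def matches(algorithm, match, text):
    """Return True if ``text`` satisfies ``match`` under ``algorithm``."""
    if algorithm == "any":
        return any(term_found(term, text) for term in split_terms(match))
    if algorithm == "all":
        return all(term_found(term, text) for term in split_terms(match))
    if algorithm == "literal":
        return term_found(match, text)
    if algorithm == "regex":
        return re.search(match, text, re.IGNORECASE) is not None
    raise ValueError("unsupported algorithm: " + algorithm)
```

Called as ``matches("any", '"Bank of America" BofA', text)``, the sketch accepts a text mentioning "Bank of America" or "BofA" and rejects one that only mentions "Bank of South America".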
|
||||
|
||||
Then just save your tag/correspondent and run another document through the
|
||||
consumer. Once complete, you should see the newly-created document,
|
||||
automatically tagged with the appropriate data.
|
||||
|
||||
|
||||
Automatic matching
|
||||
==================
|
||||
|
||||
Paperless-ng comes with a new matching algorithm called *Auto*. This matching
|
||||
algorithm tries to assign tags, correspondents and document types to your
|
||||
documents based on how you have assigned these on existing documents. It
|
||||
uses a neural network under the hood.
|
||||
|
||||
If, for example, all your bank statements of your account 123 at the Bank of
|
||||
America are tagged with the tag "bofa_123" and the matching algorithm of this
|
||||
tag is set to *Auto*, this neural network will examine your documents and
|
||||
automatically learn when to assign this tag.
|
||||
|
||||
There are a couple caveats you need to keep in mind when using this feature:
|
||||
|
||||
* Changes to your documents are not immediately reflected by the matching
|
||||
algorithm. The neural network needs to be *trained* on your documents after
|
||||
changes. Paperless periodically (default: once each hour) checks for changes
|
||||
and does this automatically for you.
|
||||
* The Auto matching algorithm only takes documents into account which are NOT
|
||||
placed in your inbox (i.e., have no inbox tags assigned to them). This ensures
|
||||
that the neural network only learns from documents which you have correctly
|
||||
tagged before.
|
||||
* The matching algorithm can only work if there is a correlation between the
|
||||
tag, correspondent or document type and the document itself. Your bank
|
||||
statements usually contain your bank account number and the name of the bank,
|
||||
so this works reasonably well. However, tags such as "TODO" cannot be
|
||||
automatically assigned.
|
||||
* The matching algorithm needs a reasonable number of documents to identify when
|
||||
to assign tags, correspondents, and types. If one out of a thousand documents
|
||||
has the correspondent "Very obscure web shop I bought something five years
|
||||
ago", it will probably not assign this correspondent automatically if you buy
|
||||
something from them again. The more documents, the better.
|
||||
|
||||
Hooking into the consumption process
|
||||
####################################
|
||||
|
||||
Sometimes you may want to do something arbitrary whenever a document is
|
||||
consumed. Rather than try to predict what you may want to do, Paperless lets
|
||||
you execute scripts of your own choosing just before or after a document is
|
||||
consumed using a couple simple hooks.
|
||||
|
||||
Just write a script, put it somewhere that Paperless can read & execute, and
|
||||
then put the path to that script in ``paperless.conf`` with the variable name
|
||||
of either ``PAPERLESS_PRE_CONSUME_SCRIPT`` or
|
||||
``PAPERLESS_POST_CONSUME_SCRIPT``.
|
||||
|
||||
.. TODO HYPEREF TO CONFIG
|
||||
|
||||
.. important::
|
||||
|
||||
These scripts are executed in a **blocking** process, which means that if
|
||||
a script takes a long time to run, it can significantly slow down your
|
||||
document consumption flow. If you want things to run asynchronously,
|
||||
you'll have to fork the process in your script and exit.
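
One way to keep a slow hook from blocking consumption is to hand the work to a background process and return right away. The Python sketch below uses ``subprocess.Popen`` rather than a literal ``fork``. The helper name ``run_detached`` and the worker path ``/path/to/slow-worker.py`` are placeholders invented for this example.

```python
import subprocess
import sys


def run_detached(command):
    """Start ``command`` in the background and return immediately."""
    return subprocess.Popen(
        command,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # POSIX: detach from the caller's session
    )


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the hook's arguments on to a (hypothetical) slow worker script,
    # then fall off the end so the consumer is not kept waiting.
    run_detached([sys.executable, "/path/to/slow-worker.py", *sys.argv[1:]])
```

Because the hook exits without waiting for the child, any errors in the worker have to be logged by the worker itself.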
|
||||
|
||||
|
||||
Pre-consumption script
|
||||
======================
|
||||
|
||||
Executed after the consumer sees a new document in the consumption folder, but
|
||||
before any processing of the document is performed. This script receives exactly
|
||||
one argument:
|
||||
|
||||
* Document file name
|
||||
|
||||
A common example is a small wrapper script like this:
|
||||
|
||||
``/usr/local/bin/ocr-pdf``
|
||||
|
||||
.. code:: bash
|
||||
|
||||
#!/usr/bin/env bash
|
||||
pdf2pdfocr.py -i "${1}"
|
||||
|
||||
``/etc/paperless.conf``
|
||||
|
||||
.. code:: bash
|
||||
|
||||
...
|
||||
PAPERLESS_PRE_CONSUME_SCRIPT="/usr/local/bin/ocr-pdf"
|
||||
...
|
||||
|
||||
This passes the path of the document about to be consumed to
``/usr/local/bin/ocr-pdf``, which in turn calls `pdf2pdfocr.py`_ on it. That
tool overwrites the file with an OCR'd version and exits, at which point the
consumption process begins with the newly modified file.
|
||||
|
||||
.. _pdf2pdfocr.py: https://github.com/LeoFCardoso/pdf2pdfocr
|
||||
|
||||
|
||||
.. _consumption-director-hook-variables-post:
|
||||
|
||||
Post-consumption script
|
||||
=======================
|
||||
|
||||
Executed after the consumer has successfully processed a document and has moved it
|
||||
into paperless. It receives the following arguments:
|
||||
|
||||
* Document id
|
||||
* Generated file name
|
||||
* Source path
|
||||
* Thumbnail path
|
||||
* Download URL
|
||||
* Thumbnail URL
|
||||
* Correspondent
|
||||
* Tags
|
||||
|
||||
The script can be in any language you like, but for a simple shell script
|
||||
example, you can take a look at ``post-consumption-example.sh`` in the
|
||||
``scripts`` directory in this project.
|
||||
|
||||
The post consumption script cannot cancel the consumption process.
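
As a starting point for a script in Python, the sketch below unpacks the eight arguments in the order listed above. The class and function names are made up for this example, and the tags are kept as the raw string the script receives because their exact format is not specified here.

```python
import sys
from typing import NamedTuple


class ConsumedDocument(NamedTuple):
    """The post-consumption arguments, in the documented order."""
    document_id: str
    file_name: str
    source_path: str
    thumbnail_path: str
    download_url: str
    thumbnail_url: str
    correspondent: str
    tags: str  # kept as received; the format is not assumed


def parse_args(argv):
    """Build a ConsumedDocument from an argv-style list."""
    return ConsumedDocument(*argv[1:9])


def main(argv):
    doc = parse_args(argv)
    print("Consumed document {}: {}".format(doc.document_id, doc.file_name))
    return 0


if __name__ == "__main__" and len(sys.argv) == 9:
    # Only act when called with the eight documented arguments.
    main(sys.argv)
```

Save it somewhere Paperless can read and execute, and point ``PAPERLESS_POST_CONSUME_SCRIPT`` at it.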
|
@ -1,7 +1,12 @@
|
||||
.. _api:
|
||||
|
||||
************
|
||||
The REST API
|
||||
############
|
||||
************
|
||||
|
||||
.. warning::
|
||||
|
||||
This section is not updated yet.
|
||||
|
||||
Paperless makes use of the `Django REST Framework`_ standard API interface
|
||||
because of its inherent awesomeness. Conveniently, the system is also
|
||||
@ -15,7 +20,7 @@ installation.
|
||||
.. _api-uploading:
|
||||
|
||||
Uploading
|
||||
---------
|
||||
=========
|
||||
|
||||
File uploads in an API are hard and so far as I've been able to tell, there's
|
||||
no standard way of accepting them, so rather than crowbar file uploads into the
|
||||
|
@ -1,6 +1,79 @@
|
||||
.. _paperless_changelog:
|
||||
|
||||
Changelog
|
||||
#########
|
||||
|
||||
paperless-ng 1.0
|
||||
================
|
||||
|
||||
* **Deprecated:** GnuPG. Don't use it. If you're still using it, be aware that it
|
||||
offers no protection at all, since the passphrase is stored alongside the
encrypted documents themselves. This feature will most likely be removed in future
|
||||
versions.
|
||||
|
||||
* **Added:** New frontend. Features:
|
||||
|
||||
* Single page application: It's much more responsive than the django admin pages.
|
||||
* Dashboard. Shows recently scanned documents, or todos, or other documents
|
||||
at wish. Allows uploading of documents. Shows basic statistics.
|
||||
* Better document list with multiple display options.
|
||||
* Full text search with result highlighting, auto completion and scoring based
|
||||
on the query. It uses a document search index in the background.
|
||||
* Saveable filters.
|
||||
* Better log viewer.
|
||||
|
||||
* **Added:** Document types. Assign these to documents just as correspondents.
|
||||
They may be used in the future to perform automatic operations on documents
|
||||
depending on the type.
|
||||
* **Added:** Inbox tags. Define an inbox tag and it will automatically be
|
||||
assigned to any new document scanned into the system.
|
||||
* **Added:** Automatic matching. A new matching algorithm that automatically
|
||||
assigns tags, document types and correspondents to your documents. It uses
|
||||
a neural network trained on your data.
|
||||
* **Added:** Archive serial numbers. Assign these to quickly find documents stored in
|
||||
physical binders.
|
||||
* **Added:** Enabled the internal user management of django. This isn't really a
|
||||
multi user solution, however, it allows more than one user to access the website
|
||||
and set some basic permissions / renew passwords.
|
||||
|
||||
* **Modified [breaking]:** REST Api changes:
|
||||
|
||||
* New filters added, other filters removed (case sensitive filters, slug filters)
|
||||
* Endpoints for thumbnails, previews and downloads replace the old ``/fetch/`` urls. Redirects are in place.
|
||||
* Endpoint for document uploads replaces the old ``/push`` url. Redirects are in place.
|
||||
* Foreign key relationships are now served as IDs, not as urls.
|
||||
|
||||
* **Modified [breaking]:** PostgreSQL:
|
||||
|
||||
* If ``PAPERLESS_DBHOST`` is specified in the settings, paperless uses postgresql instead of sqlite.
|
||||
Username, database and password all default to ``paperless`` if not specified.
|
||||
* **docker-compose.yml uses PostgreSQL by default.**
|
||||
|
||||
* **Modified [breaking]:** document_retagger management command rework. See TODO hyperref
|
||||
* **Removed [breaking]:** Reminders.
|
||||
* **Removed:** All customizations made to the django admin pages.
|
||||
|
||||
* **Internal changes:** Mostly code cleanup, including:
|
||||
|
||||
* Rework of the code of the tesseract parser. This is now a lot cleaner.
|
||||
* Rework of the filename handling code. It was a mess.
|
||||
* Fixed some issues with the document exporter not exporting all documents when encountering duplicate filenames.
|
||||
* Consumer rework: now uses the excellent watchdog library, lots of code removed.
|
||||
* Added a task scheduler that takes care of checking mail, training the classifier and maintaining the document search index.
|
||||
* Updated dependencies. Now uses Pipenv all around.
|
||||
* Updated Dockerfile and docker-compose. Now uses ``supervisord`` to run everything paperless-related in a single container.
|
||||
|
||||
* **Settings:**
|
||||
|
||||
* ``PAPERLESS_FORGIVING_OCR`` is now default and gone. Reason: Even if ``langdetect`` fails to detect
|
||||
a language, tesseract still does a very good job at ocr'ing a document with the default language.
|
||||
Certain language specifics such as umlauts may not get picked up properly.
|
||||
* ``PAPERLESS_DEBUG`` defaults to ``false``.
|
||||
* The presence of ``PAPERLESS_DBHOST`` now determines whether to use PostgreSQL or
|
||||
sqlite.
|
||||
|
||||
* Many more small changes here and there. The usual stuff.
|
||||
|
||||
2.7.0
|
||||
=====
|
||||
|
||||
|
@ -1,15 +0,0 @@
|
||||
Changelog (jonaswinkler)
|
||||
########################
|
||||
|
||||
1.0.0
|
||||
=====
|
||||
|
||||
* First release based on paperless 2.6.0
|
||||
* Added: Automatic document classification using neural networks (replaces
|
||||
regex-based tagging)
|
||||
* Added: Document types
|
||||
* Added: Archive serial number allows easy referencing of physical document
|
||||
copies
|
||||
* Added: Inbox tags (added automatically to newly consumed documents)
|
||||
* Added: Document viewer on document edit page
|
||||
* Database backend is now configurable
|
@ -54,7 +54,7 @@ source_suffix = '.rst'
|
||||
master_doc = 'index'
|
||||
|
||||
# General information about the project.
|
||||
project = u'Paperless'
|
||||
project = u'Paperless-ng'
|
||||
copyright = u'2015, Daniel Quinn'
|
||||
|
||||
# The version info for the project you're documenting, acts as replacement for
|
||||
@ -205,7 +205,8 @@ try:
|
||||
import sphinx_rtd_theme
|
||||
html_theme = "sphinx_rtd_theme"
|
||||
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
|
||||
except ImportError:
|
||||
except ImportError as e:
|
||||
print("error " + str(e))
|
||||
pass
|
||||
|
||||
# -- Options for LaTeX output ---------------------------------------------
|
||||
|
@ -1,255 +0,0 @@
|
||||
.. _consumption:
|
||||
|
||||
Consumption
|
||||
###########
|
||||
|
||||
Once you've got Paperless set up, you need to start feeding documents into it.
|
||||
Currently, there are three options: the consumption directory, IMAP (email), and
|
||||
HTTP POST.
|
||||
|
||||
|
||||
.. _consumption-directory:
|
||||
|
||||
The Consumption Directory
|
||||
=========================
|
||||
|
||||
The primary method of getting documents into your database is by putting them in
|
||||
the consumption directory. The ``document_consumer`` script runs in an infinite
|
||||
loop looking for new additions to this directory and when it finds them, it goes
|
||||
about the process of parsing them with the OCR, indexing what it finds, and
|
||||
encrypting the PDF (if ``PAPERLESS_PASSPHRASE`` is set), storing it in the
|
||||
media directory.
|
||||
|
||||
Getting stuff into this directory is up to you. If you're running Paperless
|
||||
on your local computer, you might just want to drag and drop files there, but if
|
||||
you're running this on a server and want your scanner to automatically push
|
||||
files to this directory, you'll need to set up some sort of service to accept the
|
||||
files from the scanner. Typically, you're looking at an FTP server like
|
||||
`Proftpd`_ or a network file share like `Samba`_.
|
||||
|
||||
.. _Proftpd: http://www.proftpd.org/
|
||||
.. _Samba: http://www.samba.org/
|
||||
|
||||
So where is this consumption directory? It's wherever you define it. Look for
|
||||
the ``CONSUMPTION_DIR`` value in ``settings.py``. Set that to somewhere
|
||||
appropriate for your use and put some documents in there. When you're ready,
|
||||
follow the :ref:`consumer <utilities-consumer>` instructions to get it running.
|
||||
|
||||
|
||||
.. _consumption-directory-hook:
|
||||
|
||||
Hooking into the Consumption Process
|
||||
------------------------------------
|
||||
|
||||
Sometimes you may want to do something arbitrary whenever a document is
|
||||
consumed. Rather than try to predict what you may want to do, Paperless lets
|
||||
you execute scripts of your own choosing just before or after a document is
|
||||
consumed using a couple of simple hooks.
|
||||
|
||||
Just write a script, put it somewhere that Paperless can read & execute, and
|
||||
then put the path to that script in ``paperless.conf`` with the variable name
|
||||
of either ``PAPERLESS_PRE_CONSUME_SCRIPT`` or
|
||||
``PAPERLESS_POST_CONSUME_SCRIPT``. The script will be executed before or
|
||||
after the document is consumed, respectively.
|
||||
|
||||
.. important::
|
||||
|
||||
These scripts are executed in a **blocking** process, which means that if
|
||||
a script takes a long time to run, it can significantly slow down your
|
||||
document consumption flow. If you want things to run asynchronously,
|
||||
you'll have to fork the process in your script and exit.
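One way to keep a hook from blocking is to have the script start the slow work as a detached child process and return immediately. The following is only a sketch under that assumption: ``run_detached`` is a hypothetical helper name, not something Paperless provides, and it relies on nothing beyond the Python standard library.

```python
import subprocess
import sys


def run_detached(command):
    """Start `command` in its own session and return without waiting for it.

    `command` is a list such as ["/usr/local/bin/slow-job", "/path/to/file"].
    """
    return subprocess.Popen(
        command,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # POSIX only: detach from the caller's session
    )


# Hypothetical use inside a hook script: hand the first argument to a slow
# job and exit right away so consumption is not held up.
#
#     run_detached(["/usr/local/bin/slow-job", sys.argv[1]])
```

Because the hook returns before the child finishes, any failure in the child has to be logged by the child itself.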
|
||||
|
||||
|
||||
.. _consumption-directory-hook-variables:
|
||||
|
||||
What Can These Scripts Do?
|
||||
..........................
|
||||
|
||||
It's your script, so you're only limited by your imagination and the laws of
|
||||
physics. However, the following values are passed to the scripts in order:
|
||||
|
||||
|
||||
.. _consumption-director-hook-variables-pre:
|
||||
|
||||
Pre-consumption script
|
||||
::::::::::::::::::::::
|
||||
|
||||
* Document file name
|
||||
|
||||
A simple but common example for this would be creating a simple script like
|
||||
this:
|
||||
|
||||
``/usr/local/bin/ocr-pdf``
|
||||
|
||||
.. code:: bash
|
||||
|
||||
#!/usr/bin/env bash
|
||||
pdf2pdfocr.py -i "${1}"  # quote the path so file names with spaces survive
|
||||
|
||||
``/etc/paperless.conf``
|
||||
|
||||
.. code:: bash
|
||||
|
||||
...
|
||||
PAPERLESS_PRE_CONSUME_SCRIPT="/usr/local/bin/ocr-pdf"
|
||||
...
|
||||
|
||||
This will pass the path to the document about to be consumed to ``/usr/local/bin/ocr-pdf``,
|
||||
which will in turn call `pdf2pdfocr.py`_ on your document, which will then
|
||||
overwrite the file with an OCR'd version of the file and exit. At that point,
|
||||
the consumption process will begin with the newly modified file.
|
||||
|
||||
.. _pdf2pdfocr.py: https://github.com/LeoFCardoso/pdf2pdfocr
|
||||
|
||||
|
||||
.. _consumption-director-hook-variables-post:
|
||||
|
||||
Post-consumption script
|
||||
:::::::::::::::::::::::
|
||||
|
||||
* Document id
|
||||
* Generated file name
|
||||
* Source path
|
||||
* Thumbnail path
|
||||
* Download URL
|
||||
* Thumbnail URL
|
||||
* Correspondent
|
||||
* Tags
|
||||
|
||||
The script can be in any language you like, but for a simple shell script
|
||||
example, you can take a look at ``post-consumption-example.sh`` in the
|
||||
``scripts`` directory in this project.
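Since the values arrive as positional arguments in the order listed above, a post-consumption script written in Python could start by giving them names. This is a rough sketch: the field names and the ``parse_hook_args`` helper are made up for illustration, and the exact format of each value (for example how multiple tags are joined) is not specified here, so every value is kept as a raw string.

```python
import sys

# Hypothetical names for the eight positional values, in the documented order.
FIELDS = (
    "document_id",
    "file_name",
    "source_path",
    "thumbnail_path",
    "download_url",
    "thumbnail_url",
    "correspondent",
    "tags",
)


def parse_hook_args(argv):
    """Map the positional hook arguments (argv[1:]) onto FIELDS."""
    values = argv[1:]
    if len(values) != len(FIELDS):
        raise ValueError(
            "expected {} arguments, got {}".format(len(FIELDS), len(values))
        )
    return dict(zip(FIELDS, values))


# In a real script you would call:
#
#     info = parse_hook_args(sys.argv)
#     print("Consumed document", info["document_id"])
```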
|
||||
|
||||
|
||||
.. _consumption-imap:
|
||||
|
||||
IMAP (Email)
|
||||
============
|
||||
|
||||
Another handy way to get documents into your database is to email them to
|
||||
yourself. The typical use-case would be to be out for lunch and want to send a
|
||||
copy of the receipt back to your system at home. Paperless can be taught to
|
||||
pull emails down from an arbitrary account and dump them into the consumption
|
||||
directory where the process :ref:`above <consumption-directory>` will follow the
|
||||
usual pattern of consuming the document.
|
||||
|
||||
Some things you need to know about this feature:
|
||||
|
||||
* It's disabled by default. By setting the values below it will be enabled.
|
||||
* It's been tested in a limited environment, so it may not work for you (please
|
||||
submit a pull request if you can!)
|
||||
* It's designed to **delete mail from the server once consumed**. So don't go
|
||||
pointing this to your personal email account and wonder where all your stuff
|
||||
went.
|
||||
* Currently, only one photo (attachment) per email will work.
|
||||
|
||||
So, with all that in mind, here's what you do to get it running:
|
||||
|
||||
1. Set up a new email account somewhere, or if you're feeling daring, create a
|
||||
folder in an existing email box and note the path to that folder.
|
||||
2. In ``/etc/paperless.conf`` set all of the appropriate values in
|
||||
``PATHS AND FOLDERS`` and ``SECURITY``.
|
||||
If you decided to use a subfolder of an existing account, then make sure you
|
||||
set ``PAPERLESS_CONSUME_MAIL_INBOX`` accordingly here. You also have to set
|
||||
the ``PAPERLESS_EMAIL_SECRET`` to something you can remember 'cause you'll
|
||||
have to include that in every email you send.
|
||||
3. Restart the :ref:`consumer <utilities-consumer>`. The consumer will check
|
||||
the configured email account at startup and from then on every 10 minutes
|
||||
for something new and pull down whatever it finds.
|
||||
4. Send yourself an email! Note that the subject is treated as the file name,
|
||||
so if you set the subject to ``Correspondent - Title - tag,tag,tag``, you'll
|
||||
get what you expect. Also, you must include the aforementioned secret
|
||||
string in every email so the fetcher knows that it's safe to import.
|
||||
Note that Paperless only allows the email title to consist of safe characters
|
||||
to be imported. These consist of alpha-numeric characters and ``-_ ,.'``.
|
||||
5. After a few minutes, the consumer will poll your mailbox, pull down the
|
||||
message, and place the attachment in the consumption directory with the
|
||||
appropriate name. A few minutes later, the consumer will import it like any
|
||||
other file.
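If you generate these emails from a script, it may help to check the subject against the safe-character rule from step 4 before sending. The check below is an approximation written for this guide, not the consumer's own code: it assumes "alpha-numeric" means ASCII letters and digits, and ``is_safe_subject`` is a hypothetical name.

```python
import re

# ASCII letters and digits plus the characters - _ space , . '
SAFE_SUBJECT = re.compile(r"[A-Za-z0-9\-_ ,.']+")


def is_safe_subject(subject):
    """Return True if every character of `subject` is in the safe set."""
    return SAFE_SUBJECT.fullmatch(subject) is not None
```

For instance, ``Some Company - Invoice - money,invoices`` passes this check, while a subject containing ``#`` or ``/`` does not.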
|
||||
|
||||
|
||||
.. _consumption-http:
|
||||
|
||||
HTTP POST
|
||||
=========
|
||||
|
||||
You can also submit a document via HTTP POST, so long as you do so after
|
||||
authenticating. To push your document to Paperless, send an HTTP POST to the
|
||||
server with the following name/value pairs:
|
||||
|
||||
* ``correspondent``: The name of the document's correspondent. Note that there
|
||||
are restrictions on what characters you can use here. Specifically,
|
||||
alphanumeric characters, `-`, `,`, `.`, and `'` are ok, everything else is
|
||||
out. You also can't use the sequence ` - ` (space, dash, space).
|
||||
* ``title``: The title of the document. The rules for characters are the same
|
||||
here as the correspondent.
|
||||
* ``document``: The file you're uploading
|
||||
|
||||
Specify ``enctype="multipart/form-data"``, and then POST your file with::
|
||||
|
||||
Content-Disposition: form-data; name="document"; filename="whatever.pdf"
|
||||
|
||||
An example of this in HTML is a typical form:
|
||||
|
||||
.. code:: html
|
||||
|
||||
<form method="post" enctype="multipart/form-data">
|
||||
<input type="text" name="correspondent" value="My Correspondent" />
|
||||
<input type="text" name="title" value="My Title" />
|
||||
<input type="file" name="document" />
|
||||
<input type="submit" name="go" value="Do the thing" />
|
||||
</form>
|
||||
|
||||
But a potentially more useful way to do this would be in Python. Here we use
|
||||
the requests library to handle basic authentication and to send the POST data
|
||||
to the URL.
|
||||
|
||||
.. code:: python
|
||||
|
||||
    import os

    import requests
    from requests.auth import HTTPBasicAuth

    # You authenticate via BasicAuth or with a session id.
    # We use BasicAuth here
    username = "my-username"
    password = "my-super-secret-password"

    # Where you have Paperless installed and listening
    url = "http://localhost:8000/push"

    # Document metadata
    correspondent = "Test Correspondent"
    title = "Test Title"

    # The local file you want to push
    path = "/path/to/some/directory/my-document.pdf"


    with open(path, "rb") as f:

        response = requests.post(
            url=url,
            data={"title": title, "correspondent": correspondent},
            files={"document": (os.path.basename(path), f, "application/pdf")},
            auth=HTTPBasicAuth(username, password),
            allow_redirects=False
        )

        if response.status_code == 202:

            # Everything worked out ok
            print("Upload successful")

        else:

            # If you don't get a 202, it's probably because your credentials
            # are wrong or something. This will give you a rough idea of what
            # happened.

            print("We got HTTP status code: {}".format(response.status_code))
            for k, v in response.headers.items():
                print("{}: {}".format(k, v))
|
@ -1,42 +0,0 @@
|
||||
.. _customising:
|
||||
|
||||
Customising Paperless
|
||||
#####################
|
||||
|
||||
Currently, Paperless' interface is just the default Django admin, which
|
||||
while powerful, is rather boring. If you'd like to give the site a bit of a
|
||||
face-lift, or if you simply want to adjust the colours, contrast, or font size
|
||||
to make things easier to read, you can do that by adding your own CSS or
|
||||
Javascript quite easily.
|
||||
|
||||
|
||||
.. _customising-overrides:
|
||||
|
||||
Overrides
|
||||
=========
|
||||
|
||||
On every page load, Paperless looks for two files in your media root directory
|
||||
(the directory defined by your ``PAPERLESS_MEDIADIR`` configuration variable or
|
||||
the default, ``<project root>/media/``):
|
||||
|
||||
* ``overrides.css``
|
||||
* ``overrides.js``
|
||||
|
||||
If it finds either or both of those files, they'll be loaded into the page: the
|
||||
CSS in the ``<head>``, and the Javascript stuffed into the last line of the
|
||||
``<body>``.
|
||||
|
||||
|
||||
.. _customising-overrides-note:
|
||||
|
||||
An important note about customisation
|
||||
-------------------------------------
|
||||
|
||||
Any changes you make to the site with your CSS or Javascript are likely to
|
||||
depend on the structure of the current HTML and/or the existing CSS rules. For
|
||||
the most part it's safe to assume that these bits won't change, but *sometimes
|
||||
they do* as features are added or bugs are fixed.
|
||||
|
||||
If you make a change that you think others would appreciate though, submit it
|
||||
as a pull request and maybe we can find a way to work it into the project by
|
||||
default!
|
@ -1,131 +0,0 @@
|
||||
.. _guesswork:
|
||||
|
||||
Guesswork
|
||||
#########
|
||||
|
||||
During the consumption process, Paperless tries to guess some of the attributes
|
||||
of the document it's looking at. To do this it uses two approaches:
|
||||
|
||||
|
||||
.. _guesswork-naming:
|
||||
|
||||
File Naming
|
||||
===========
|
||||
|
||||
Any document you put into the consumption directory will be consumed, but if
|
||||
you name the file right, it'll automatically set some values in the database
|
||||
for you. This is the logic the consumer follows:
|
||||
|
||||
1. Try to find the correspondent, title, and tags in the file name following
|
||||
the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``. Note that
|
||||
the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or
|
||||
``YYYYMMDDZ``. The ``Z`` refers to "Zulu time" AKA "UTC".
|
||||
The tags are optional, so the format ``Date - Correspondent - Title.pdf``
|
||||
works as well.
|
||||
2. If that doesn't work, we skip the date and try this pattern:
|
||||
``Correspondent - Title - tag,tag,tag.pdf``.
|
||||
3. If that doesn't work, we try to find the correspondent and title in the file
|
||||
name following the pattern: ``Correspondent - Title.pdf``.
|
||||
4. If that doesn't work, just assume that the name of the file is the title.
|
||||
|
||||
So given the above, the following examples would work as you'd expect:
|
||||
|
||||
* ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
|
||||
* ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
|
||||
* ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
|
||||
* ``Another Company - Letter of Reference.jpg``
|
||||
* ``Dad's Recipe for Pancakes.png``
|
||||
|
||||
These however wouldn't work:
|
||||
|
||||
* ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
|
||||
* ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
|
||||
* ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
|
||||
* ``Another Company- Letter of Reference.jpg``
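The fallback order described above can be approximated in a few lines of Python. This is a sketch for illustration, not the consumer's actual parser: ``guess_attributes`` is a hypothetical name, and splitting on the literal separator `` - `` is an assumption that happens to reproduce the listed examples.

```python
import re

# Strict date formats from step 1: YYYYMMDDHHMMSSZ or YYYYMMDDZ.
DATE_RE = re.compile(r"(\d{14}|\d{8})Z")


def guess_attributes(filename):
    """Guess date, correspondent, title and tags from a file name."""
    stem = filename.rsplit(".", 1)[0]  # drop the extension
    parts = stem.split(" - ")
    guess = {"date": None, "correspondent": None, "title": stem, "tags": []}

    if len(parts) in (3, 4) and DATE_RE.fullmatch(parts[0]):
        # Step 1: a valid date leads the name; peel it off.
        guess["date"], parts = parts[0], parts[1:]
    elif len(parts) == 4:
        # Four fields without a valid date: treat the whole name as the title.
        return guess

    if len(parts) == 3:
        # Step 2 (or step 1 with tags): the last field is the tag list.
        guess["tags"] = parts[2].split(",")
    if len(parts) in (2, 3):
        # Step 3: correspondent and title.
        guess["correspondent"], guess["title"] = parts[0], parts[1]
    # Step 4: anything else keeps the whole stem as the title.
    return guess
```

Note that a name such as ``Another Company- Letter of Reference.jpg`` falls through to step 4 here, because the separator is missing a space.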
|
||||
|
||||
Do I have to be so strict about naming?
|
||||
---------------------------------------
|
||||
Rather than using the strict document naming rules, one can also set the option
|
||||
``PAPERLESS_FILENAME_DATE_ORDER`` in ``paperless.conf`` to any date order
|
||||
that is accepted by dateparser_. Doing so will cause ``paperless`` to default
|
||||
to any date format that is found in the title, instead of a date pulled from
|
||||
the document's text, without requiring the strict formatting of the document
|
||||
filename as described above.
|
||||
|
||||
.. _dateparser: https://github.com/scrapinghub/dateparser/blob/v0.7.0/docs/usage.rst#settings
|
||||
|
||||
Transforming filenames for parsing
|
||||
----------------------------------
|
||||
Some devices can't produce filenames that can be parsed by the default
|
||||
parser. By configuring the option ``PAPERLESS_FILENAME_PARSE_TRANSFORMS`` in
|
||||
``paperless.conf`` one can add transformations that are applied to the filename
|
||||
before it's parsed.
|
||||
|
||||
The option contains a list of dictionaries of regular expressions (key:
|
||||
``pattern``) and replacements (key: ``repl``) in JSON format, which are
|
||||
applied in order by passing them to ``re.subn``. Transformation stops
|
||||
after the first match, so at most one transformation is applied. The general
|
||||
syntax is
|
||||
|
||||
.. code:: python
|
||||
|
||||
[{"pattern":"pattern1", "repl":"repl1"}, {"pattern":"pattern2", "repl":"repl2"}, ..., {"pattern":"patternN", "repl":"replN"}]
|
||||
|
||||
The example below is for a Brother ADS-2400N, a scanner that allows
|
||||
assigning different names to different hardware buttons (useful for handling
|
||||
multiple entities in one instance), but insists on adding ``_<count>``
|
||||
to the filename.
|
||||
|
||||
.. code:: python
|
||||
|
||||
# Brother profile configuration, support "Name_Date_Count" (the default
|
||||
# setting) and "Name_Count" (use "Name" as tag and "Count" as title).
|
||||
PAPERLESS_FILENAME_PARSE_TRANSFORMS=[{"pattern":"^([a-z]+)_(\\d{8})_(\\d{6})_([0-9]+)\\.", "repl":"\\2\\3Z - \\4 - \\1."}, {"pattern":"^([a-z]+)_([0-9]+)\\.", "repl":" - \\2 - \\1."}]
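To see what such a configuration does to a file name, the "apply in order, stop after the first match" rule can be sketched with ``json.loads`` and ``re.subn``. The helper name ``apply_transforms`` and the sample file names are made up for this illustration; the real consumer may differ in details such as regex flags.

```python
import json
import re


def apply_transforms(filename, transforms_json):
    """Apply the first matching transformation to `filename`, if any."""
    for transform in json.loads(transforms_json):
        new_name, count = re.subn(transform["pattern"], transform["repl"], filename)
        if count > 0:
            # Stop after the first transformation that matched.
            return new_name
    return filename


# The Brother configuration from above, as the JSON text of the option value.
BROTHER = r'''[
    {"pattern": "^([a-z]+)_(\\d{8})_(\\d{6})_([0-9]+)\\.", "repl": "\\2\\3Z - \\4 - \\1."},
    {"pattern": "^([a-z]+)_([0-9]+)\\.", "repl": " - \\2 - \\1."}
]'''

print(apply_transforms("invoices_20200101_120000_0001.pdf", BROTHER))
# prints "20200101120000Z - 0001 - invoices.pdf"
```

The result follows the ``Date - Correspondent - Title`` naming pattern, so the regular file name parser can take over from there.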
|
||||
|
||||
.. _guesswork-content:
|
||||
|
||||
Reading the Document Contents
|
||||
=============================
|
||||
|
||||
After the consumer has tried to figure out what it could from the file name,
|
||||
it starts looking at the content of the document itself. It will compare the
|
||||
matching algorithms defined by every tag and correspondent already set in your
|
||||
database to see if they apply to the text in that document. In other words,
|
||||
if you defined a tag called ``Home Utility`` that had a ``match`` property of
|
||||
``bc hydro`` and a ``matching_algorithm`` of ``literal``, Paperless will
|
||||
automatically tag your newly-consumed document with your ``Home Utility`` tag
|
||||
so long as the text ``bc hydro`` appears in the body of the document somewhere.
|
||||
|
||||
The matching logic is quite powerful, and supports searching the text of your
|
||||
document with different algorithms, and as such, some experimentation may be
|
||||
necessary to get things Just Right.
|
||||
|
||||
|
||||
.. _guesswork-content-howto:
|
||||
|
||||
How Do I Set Up These Matching Algorithms?
|
||||
------------------------------------------
|
||||
|
||||
Setting up the algorithms is easily done through the admin interface. When
|
||||
you create a new correspondent or tag, there are optional fields for matching
|
||||
text and matching algorithm. From the help info there:
|
||||
|
||||
.. note::
|
||||
|
||||
Which algorithm you want to use when matching text to the OCR'd PDF. Here,
|
||||
"any" looks for any occurrence of any word provided in the PDF, while "all"
|
||||
requires that every word provided appear in the PDF, albeit not in the
|
||||
order provided. A "literal" match means that the text you enter must
|
||||
appear in the PDF exactly as you've entered it, and "regular expression"
|
||||
uses a regex to match the PDF. If you don't know what a regex is, you
|
||||
probably don't want this option.
|
||||
|
||||
When using the "any" or "all" matching algorithms, you can search for terms
|
||||
that consist of multiple words by enclosing them in double quotes. For example,
|
||||
defining a match text of ``"Bank of America" BofA`` using the "any" algorithm,
|
||||
will match documents that contain either "Bank of America" or "BofA", but will
|
||||
not match documents containing "Bank of South America".
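The four algorithms can be illustrated with a small Python function. Treat it as a sketch of the behaviour described above rather than the project's matching code: the ``matches`` helper is hypothetical, and it assumes case-insensitive comparison, whole-word matching for "any" and "all", and shell-style quoting (via ``shlex``) for multi-word terms.

```python
import re
import shlex


def matches(match_text, algorithm, document_text):
    """Return True if `document_text` satisfies `match_text` under `algorithm`."""
    text = document_text.lower()
    if algorithm == "literal":
        return match_text.lower() in text
    if algorithm == "regex":
        return re.search(match_text, document_text, re.IGNORECASE) is not None

    # "any" / "all": double-quoted phrases count as a single term.
    terms = [term.lower() for term in shlex.split(match_text)]
    hits = [
        re.search(r"\b" + re.escape(term) + r"\b", text) is not None
        for term in terms
    ]
    if algorithm == "any":
        return any(hits)
    if algorithm == "all":
        return all(hits)
    raise ValueError("unknown algorithm: {}".format(algorithm))
```

With the match text ``"Bank of America" BofA`` and the "any" algorithm, this returns ``True`` for a document mentioning BofA and ``False`` for one that only mentions "Bank of South America".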
|
||||
|
||||
Then just save your tag/correspondent and run another document through the
|
||||
consumer. Once complete, you should see the newly-created document,
|
||||
automatically tagged with the appropriate data.
|
@ -4,8 +4,8 @@ Paperless
|
||||
=========
|
||||
|
||||
Paperless is a simple Django application running in two parts:
|
||||
a :ref:`consumer <utilities-consumer>` (the thing that does the indexing) and
|
||||
the :ref:`webserver <utilities-webserver>` (the part that lets you search &
|
||||
a *Consumer* (the thing that does the indexing) and
|
||||
the *Web server* (the part that lets you search &
|
||||
download already-indexed documents). If you want to learn more about its
|
||||
functions keep on reading after the installation section.
|
||||
|
||||
@ -25,26 +25,34 @@ finding stuff again. I feed documents right from the post box into the scanner
|
||||
and then shred them. Perhaps you might find it useful too.
|
||||
|
||||
|
||||
Paperless-ng
|
||||
============
|
||||
|
||||
I wanted to make big changes to the project that will impact the way it is used
|
||||
by its users greatly. Among the users who currently use paperless in production
|
||||
there are probably many that don't want these changes right away. I also wanted
|
||||
to have more control over what goes into the code and what does not. Therefore,
|
||||
paperless-ng was created. NG stands for both Angular (the framework used for the
|
||||
Frontend) and next-gen. Publishing this project under a different name also
|
||||
avoids confusion between paperless and paperless-ng.
|
||||
|
||||
It would be great if this project could eventually merge back into the main
|
||||
repository, but it needs a lot more work before that can happen.
|
||||
|
||||
|
||||
Contents
|
||||
========
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
:maxdepth: 1
|
||||
|
||||
requirements
|
||||
setup
|
||||
consumption
|
||||
usage_overview
|
||||
advanced_usage
|
||||
administration
|
||||
api
|
||||
utilities
|
||||
guesswork
|
||||
migrating
|
||||
customising
|
||||
extending
|
||||
troubleshooting
|
||||
contributing
|
||||
scanners
|
||||
screenshots
|
||||
changelog
|
||||
changelog_jonaswinkler
|
||||
|
@ -1,109 +0,0 @@
|
||||
.. _migrating:
|
||||
|
||||
Migrating, Updates, and Backups
|
||||
===============================
|
||||
|
||||
As Paperless is still under active development, there's a lot that can change
|
||||
as software updates roll out. You should back up often, so if anything goes
|
||||
wrong during an update, you at least have a means of restoring to something
|
||||
usable. Thankfully, there are automated ways of backing up, restoring, and
|
||||
updating the software.
|
||||
|
||||
|
||||
.. _migrating-backup:
|
||||
|
||||
Backing Up
|
||||
----------
|
||||
|
||||
So you're bored of this whole project, or you want to make a remote backup of
|
||||
your files for whatever reason. This is easy to do, simply use the
|
||||
:ref:`exporter <utilities-exporter>` to dump your documents and database out
|
||||
into an arbitrary directory.
|
||||
|
||||
|
||||
.. _migrating-restoring:
|
||||
|
||||
Restoring
|
||||
---------
|
||||
|
||||
Restoring your data is just as easy, since nearly all of your data exists either
|
||||
in the file names, or in the contents of the files themselves. You just need to
|
||||
create an empty database (just follow the
|
||||
:ref:`installation instructions <setup-installation>` again) and then import the
|
||||
``tags.json`` file you created as part of your backup. Lastly, copy your
|
||||
exported documents into the consumption directory and start up the consumer.
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ cd /path/to/project
|
||||
$ rm data/db.sqlite3 # Delete the database
|
||||
$ cd src
|
||||
$ ./manage.py migrate # Create the database
|
||||
$ ./manage.py createsuperuser
|
||||
$ ./manage.py loaddata /path/to/arbitrary/place/tags.json
|
||||
$ cp /path/to/exported/docs/* /path/to/consumption/dir/
|
||||
$ ./manage.py document_consumer
|
||||
|
||||
Importing your data if you are :ref:`using Docker <setup-installation-docker>`
|
||||
is almost as simple:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
# Stop and remove your current containers
|
||||
$ docker-compose stop
|
||||
$ docker-compose rm -f
|
||||
|
||||
# Recreate them, add the superuser
|
||||
$ docker-compose up -d
|
||||
$ docker-compose run --rm webserver createsuperuser
|
||||
|
||||
# Load the tags
|
||||
$ cat /path/to/arbitrary/place/tags.json | docker-compose run --rm webserver loaddata_stdin -
|
||||
|
||||
# Load your exported documents into the consumption directory
|
||||
# (How you do this highly depends on how you have set this up)
|
||||
$ cp /path/to/exported/docs/* /path/to/mounted/consumption/dir/
|
||||
|
||||
After loading the documents into the consumption directory the consumer will
|
||||
immediately start consuming the documents.
|
||||
|
||||
|
||||
.. _migrating-updates:
|
||||
|
||||
Updates
|
||||
-------
|
||||
|
||||
For the most part, all you have to do to update Paperless is run ``git pull``
|
||||
on the directory containing the project files, and then use Django's
|
||||
``migrate`` command to execute any database schema updates that might have been
|
||||
rolled in as part of the update:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ cd /path/to/project
|
||||
$ git pull
|
||||
$ pip install -r requirements.txt
|
||||
$ cd src
|
||||
$ ./manage.py migrate
|
||||
|
||||
Note that it's possible (even likely) that while ``git pull`` may update some
|
||||
files, the ``migrate`` step may not update anything. This is totally normal.
|
||||
|
||||
Additionally, as new features are added, the ability to control those features
|
||||
is typically added by way of an environment variable set in ``paperless.conf``.
|
||||
You may want to take a look at the ``paperless.conf.example`` file to see if
|
||||
there's anything new in there compared to what you've got in ``/etc``.
|
||||
|
||||
If you are :ref:`using Docker <setup-installation-docker>` the update process
|
||||
is similar:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ cd /path/to/project
|
||||
$ git pull
|
||||
$ docker build -t paperless .
|
||||
$ docker-compose run --rm consumer migrate
|
||||
$ docker-compose up -d
|
||||
|
||||
If ``git pull`` doesn't report any changes, there is no need to continue with
|
||||
the remaining steps.
|
@ -1,125 +0,0 @@
|
||||
.. _requirements:
|
||||
|
||||
Requirements
|
||||
============
|
||||
|
||||
You need a Linux machine or Unix-like setup (theoretically an Apple machine
|
||||
should work) that has the following software installed:
|
||||
|
||||
* `Python3`_ (with development libraries, pip and virtualenv)
|
||||
* `GNU Privacy Guard`_
|
||||
* `Tesseract`_, plus its language files matching your document base.
|
||||
* `Imagemagick`_ version 6.7.5 or higher
|
||||
* `unpaper`_
|
||||
* `libpoppler-cpp-dev`_ PDF rendering library
|
||||
* `optipng`_
|
||||
|
||||
.. _Python3: https://python.org/
|
||||
.. _GNU Privacy Guard: https://gnupg.org
|
||||
.. _Tesseract: https://github.com/tesseract-ocr
|
||||
.. _Imagemagick: http://imagemagick.org/
|
||||
.. _unpaper: https://github.com/unpaper/unpaper
|
||||
.. _libpoppler-cpp-dev: https://poppler.freedesktop.org/
|
||||
.. _optipng: http://optipng.sourceforge.net/
|
||||
|
||||
Notably, you should confirm how you access your Python3 installation. Many
|
||||
Linux distributions will install Python3 in parallel to Python2, using the
|
||||
names ``python3`` and ``python`` respectively. The same goes for ``pip3`` and
|
||||
``pip``. Running Paperless with Python2 will likely break things, so make sure
|
||||
that you're using the right version.
|
||||
|
||||
For the purposes of simplicity, ``python`` and ``pip`` are used everywhere to
|
||||
refer to their Python3 versions.
|
||||
|
||||
In addition to the above, there are a number of Python requirements, all of
|
||||
which are listed in a file called ``requirements.txt`` in the project root
|
||||
directory.
|
||||
|
||||
If you're not working in a virtual environment (like Docker), you
|
||||
should probably be using a virtualenv, but that's your call. The reasons why
|
||||
you might choose a virtualenv or not aren't really within the scope of this
|
||||
document. Needless to say, if you don't know what a virtualenv is, you should
|
||||
probably figure that out before continuing.
|
||||
|
||||
|
||||
.. _requirements-apple:
|
||||
|
||||
Problems with Imagemagick & PDFs
|
||||
--------------------------------
|
||||
|
||||
Some users have `run into problems`_ with getting ImageMagick to do its thing
|
||||
with PDFs. Often this is the case with Apple systems using HomeBrew, but other
|
||||
Linuxes have been a problem as well. The solution appears to be to install
|
||||
ghostscript as well as ImageMagick:
|
||||
|
||||
.. _run into problems: https://github.com/the-paperless-project/paperless/issues/25
|
||||
|
||||
.. code:: bash
|
||||
|
||||
$ brew install ghostscript
|
||||
$ brew install imagemagick
|
||||
$ brew install libmagic
|
||||
|
||||
|
||||
.. _requirements-baremetal:
|
||||
|
||||
Python-specific Requirements: No Virtualenv
|
||||
-------------------------------------------
|
||||
|
||||
If you don't care to use a virtual env, then installation of the Python
|
||||
dependencies is easy:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
$ pip install --user --requirement /path/to/paperless/requirements.txt
|
||||
|
||||
This will download and install all of the requirements into
|
||||
``${HOME}/.local``. Remember that your distribution may be using ``pip3`` as
|
||||
mentioned above.
|
||||
|
||||
|
||||
.. _requirements-virtualenv:
|
||||
|
||||
Python-specific Requirements: Virtualenv
|
||||
----------------------------------------
|
||||
|
||||
Using a virtualenv for this is pretty straightforward: create a virtualenv,
|
||||
enter it, and install the requirements using the ``requirements.txt`` file:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
$ virtualenv --python=/path/to/python3 /path/to/arbitrary/directory
|
||||
$ . /path/to/arbitrary/directory/bin/activate
|
||||
$ pip install --requirement /path/to/paperless/requirements.txt
|
||||
|
||||
Now you're ready to go. Just remember to enter (activate) your virtualenv
|
||||
whenever you want to use Paperless.
|
||||
|
||||
|
||||
.. _requirements-documentation:
|
||||
|
||||
Documentation
|
||||
-------------
|
||||
|
||||
As generation of the documentation is not required for the use of Paperless,
|
||||
dependencies for this process are not included in ``requirements.txt``. If
|
||||
you'd like to generate your own docs locally, you'll need to:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
$ pip install sphinx
|
||||
|
||||
and then cd into the ``docs`` directory and type ``make html``.
|
||||
|
||||
If you are using Docker, you can use the following commands to build the
|
||||
documentation and run a webserver serving it on `port 8001`_:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
$ pwd
|
||||
/path/to/paperless
|
||||
|
||||
$ docker build -t paperless:docs -f docs/Dockerfile .
|
||||
$ docker run --rm -it -p "8001:8000" paperless:docs
|
||||
|
||||
.. _port 8001: http://127.0.0.1:8001
|
@ -1,7 +1,8 @@
|
||||
.. _scanners:
|
||||
|
||||
Scanner Recommendations
|
||||
=======================
|
||||
***********************
|
||||
Scanner recommendations
|
||||
***********************
|
||||
|
||||
As Paperless operates by watching a folder for new files, it doesn't care what
|
||||
scanner you use, but sometimes finding a scanner that will write to an FTP,
|
||||
@ -23,16 +24,19 @@ that works right for you based on recommentations from other Paperless users.
|
||||
+---------+----------------+-----+-----+-----+----------------+
|
||||
| Fujitsu | `ix500`_ | yes | | yes | `eonist`_ |
|
||||
+---------+----------------+-----+-----+-----+----------------+
|
||||
| Fujitsu | `S1300i`_ | yes | | yes | `jonaswinkler`_|
|
||||
+---------+----------------+-----+-----+-----+----------------+
|
||||
|
||||
.. _ADS-1500W: https://www.brother.ca/en/p/ads1500w
|
||||
.. _MFC-J6930DW: https://www.brother.ca/en/p/MFCJ6930DW
|
||||
.. _MFC-J5910DW: https://www.brother.co.uk/printers/inkjet-printers/mfcj5910dw
|
||||
.. _MFC-9142CDN: https://www.brother.co.uk/printers/laser-printers/mfc9140cdn
|
||||
.. _ix500: http://www.fujitsu.com/us/products/computing/peripheral/scanners/scansnap/ix500/
|
||||
.. _ix500: https://www.fujitsu.com/global/products/computing/peripheral/scanners/scansnap/ix500/
|
||||
.. _S1300i: https://www.fujitsu.com/global/products/computing/peripheral/scanners/soho/s1300i/
|
||||
|
||||
.. _danielquinn: https://github.com/danielquinn
|
||||
.. _ayounggun: https://github.com/ayounggun
|
||||
.. _bmsleight: https://github.com/bmsleight
|
||||
.. _eonist: https://github.com/eonist
|
||||
.. _REOLDEV: https://github.com/REOLDEV
|
||||
|
||||
.. _jonaswinkler: https://github.com/jonaswinkler
|
||||
|
@ -1,16 +0,0 @@
|
||||
.. _screenshots:
|
||||
|
||||
Screenshots
|
||||
===========
|
||||
|
||||
Once everything is set up, log in to Paperless using the web front end
|
||||
|
||||
.. image:: ./_static/Screenshot_first_run_login.png
|
||||
|
||||
Nice clean interface
|
||||
|
||||
.. image:: ./_static/Screenshot_first_logged.png
|
||||
|
||||
Some documents loaded in via ftp or using the scanners ftp.
|
||||
|
||||
.. image:: ./_static/Screenshot_upload_and_scanned.png
|
549
docs/setup.rst
@ -1,500 +1,187 @@
|
||||
.. _setup:
|
||||
|
||||
*****
|
||||
Setup
|
||||
=====
|
||||
|
||||
Paperless isn't a very complicated app, but there are a few components, so some
|
||||
basic documentation is in order. If you follow along in this document and
|
||||
still have trouble, please open an `issue on GitHub`_ so I can fill in the
|
||||
gaps.
|
||||
|
||||
.. _issue on GitHub: https://github.com/the-paperless-project/paperless/issues
|
||||
|
||||
|
||||
.. _setup-download:
|
||||
*****
|
||||
|
||||
Download
|
||||
--------
|
||||
########
|
||||
|
||||
The source is currently only available via GitHub, so grab it from there,
|
||||
either by using ``git``:
|
||||
by using ``git``:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
$ git clone https://github.com/the-paperless-project/paperless.git
|
||||
$ git clone https://github.com/jonaswinkler/paperless-ng.git
|
||||
$ cd paperless
|
||||
|
||||
or just download the tarball and go that route:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
$ cd to the directory where you want to run Paperless
|
||||
$ wget https://github.com/the-paperless-project/paperless/archive/master.zip
|
||||
$ unzip master.zip
|
||||
$ cd paperless-master
|
||||
|
||||
|
||||
.. _setup-installation:
|
||||
|
||||
Installation & Configuration
|
||||
----------------------------
|
||||
Installation
|
||||
############
|
||||
|
||||
You can go multiple routes with setting up and running Paperless:
|
||||
|
||||
* The `bare metal route`_
|
||||
* The `docker route`_
|
||||
* A suggested `linux containers route`_
|
||||
* The `docker route`_
|
||||
* The `bare metal route`_
|
||||
|
||||
The recommended setup route is docker, since it takes care of all dependencies
|
||||
for you.
|
||||
|
||||
The `docker route`_ is quick & easy.
|
||||
|
||||
The `bare metal route`_ is a bit more complicated to setup but makes it easier
|
||||
The `bare metal route`_ is more complicated to set up but makes it easier
|
||||
should you want to contribute some code back.
|
||||
|
||||
The `linux containers route`_ is quick, but makes a lot of assumptions on the
|
||||
set-up, on the other hand the script could be used to install on a base
|
||||
debian or ubuntu server.
|
||||
Docker Route
|
||||
============
|
||||
|
||||
.. _docker route: setup-installation-docker_
|
||||
.. _bare metal route: setup-installation-bare-metal_
|
||||
.. _Docker Machine: https://docs.docker.com/machine/
|
||||
1. Install `Docker`_ and `docker-compose`_. [#compose]_
|
||||
|
||||
.. _setup-installation-bare-metal:
|
||||
.. caution::
|
||||
|
||||
Standard (Bare Metal)
|
||||
+++++++++++++++++++++
|
||||
If you want to use the included ``docker-compose.yml.example`` file, you
|
||||
need to have at least Docker version **17.09.0** and docker-compose
|
||||
version **1.17.0**.
|
||||
|
||||
1. Install the requirements as per the :ref:`requirements <requirements>` page.
|
||||
2. Within the extract of master.zip go to the ``src`` directory.
|
||||
3. Copy ``../paperless.conf.example`` to ``/etc/paperless.conf`` and open it in
|
||||
your favourite editor. As this file contains passwords, it should only be
|
||||
readable by user root and paperless! Set the values for:
|
||||
See the `Docker installation guide`_ on how to install the current
|
||||
version of Docker for your operating system or Linux distribution of
|
||||
choice. To get an up-to-date version of docker-compose, follow the
|
||||
`docker-compose installation guide`_ if your package repository doesn't
|
||||
include it.
|
||||
|
||||
Set the values for:
|
||||
.. _Docker installation guide: https://docs.docker.com/engine/installation/
|
||||
.. _docker-compose installation guide: https://docs.docker.com/compose/install/
|
||||
|
||||
* ``PAPERLESS_CONSUMPTION_DIR``: this is where your documents will be
|
||||
dumped to be consumed by Paperless.
|
||||
* ``PAPERLESS_OCR_THREADS``: this is the number of threads the OCR process
|
||||
will spawn to process document pages in parallel.
|
||||
* ``PAPERLESS_PASSPHRASE``: this is only required if you want to use GPG to
|
||||
encrypt your document files. This is the passphrase Paperless uses to
|
||||
encrypt/decrypt the original documents. Don't worry about defining this
|
||||
if you don't want to use encryption (the default).
|
||||
2. Create a copy of ``docker-compose.yml.example`` as ``docker-compose.yml``
|
||||
and a copy of ``docker-compose.env.example`` as ``docker-compose.env``.
|
||||
You'll be editing both these files: taking a copy ensures that you can
|
||||
``git pull`` to receive updates without risking merge conflicts with your
|
||||
modified versions of the configuration files.
|
||||
3. Modify ``docker-compose.yml`` to your preferences. You should change the path
|
||||
to the consumption directory in this file. Find the line that specifies where
|
||||
to mount the consumption directory:
|
||||
|
||||
Note also that if you're using the ``runserver`` as mentioned below, you
|
||||
should make sure that PAPERLESS_DEBUG="true" or is just commented out as
|
||||
this is the default.
|
||||
.. code::
|
||||
|
||||
- ./consume:/usr/src/paperless/consume
|
||||
|
||||
Replace the part BEFORE the colon with a local directory of your choice:
|
||||
|
||||
4. Initialise the SQLite database with ``./manage.py migrate``.
|
||||
5. Collect the static files for the webserver with ``./manage.py collectstatic``.
|
||||
6. Create a user for your Paperless instance with
|
||||
``./manage.py createsuperuser``. Follow the prompts to create your user.
|
||||
7. Start the webserver with ``./manage.py runserver <IP>:<PORT>``.
|
||||
If no specific IP or port is given, the default is ``127.0.0.1:8000`` also
|
||||
known as http://localhost:8000/.
|
||||
You should now be able to visit your (empty) installation at
|
||||
`Paperless webserver`_ or whatever you chose before. You can login with the
|
||||
user/pass you created in #5.
|
||||
.. code::
|
||||
|
||||
8. In a separate window, change to the ``src`` directory in this repo again,
|
||||
but this time, you should start the consumer script with
|
||||
``./manage.py document_consumer``.
|
||||
9. Scan something or put a file into the ``CONSUMPTION_DIR``.
|
||||
10. Wait a few minutes
|
||||
11. Visit the document list on your webserver, and it should be there, indexed
|
||||
and downloadable.
|
||||
|
||||
.. caution::
|
||||
|
||||
This installation is not secure. Once everything is working head over to
|
||||
`Making things more permanent`_
|
||||
|
||||
.. _Paperless webserver: http://127.0.0.1:8000
|
||||
.. _Making things more permanent: setup-permanent_
|
||||
|
||||
.. _setup-installation-docker:
|
||||
|
||||
Docker Method
|
||||
+++++++++++++
|
||||
|
||||
1. Install `Docker`_.
|
||||
|
||||
.. caution::
|
||||
|
||||
As mentioned earlier, this guide assumes that you use Docker natively
|
||||
under Linux. If you are using `Docker Machine`_ under Mac OS X or
|
||||
Windows, you will have to adapt IP addresses, volume-mounting, command
|
||||
execution and maybe more.
|
||||
|
||||
2. Install `docker-compose`_. [#compose]_
|
||||
|
||||
.. caution::
|
||||
|
||||
If you want to use the included ``docker-compose.yml.example`` file, you
|
||||
need to have at least Docker version **1.12.0** and docker-compose
|
||||
version **1.9.0**.
|
||||
|
||||
See the `Docker installation guide`_ on how to install the current
|
||||
version of Docker for your operating system or Linux distribution of
|
||||
choice. To get an up-to-date version of docker-compose, follow the
|
||||
`docker-compose installation guide`_ if your package repository doesn't
|
||||
include it.
|
||||
|
||||
.. _Docker installation guide: https://docs.docker.com/engine/installation/
|
||||
.. _docker-compose installation guide: https://docs.docker.com/compose/install/
|
||||
|
||||
3. Create a copy of ``docker-compose.yml.example`` as ``docker-compose.yml``
|
||||
and a copy of ``docker-compose.env.example`` as ``docker-compose.env``.
|
||||
You'll be editing both these files: taking a copy ensures that you can
|
||||
``git pull`` to receive updates without risking merge conflicts with your
|
||||
modified versions of the configuration files.
|
||||
4. Modify ``docker-compose.yml`` to your preferences, following the
|
||||
instructions in comments in the file. The only change that is a hard
|
||||
requirement is to specify where the consumption directory should
|
||||
mount.[#dockercomposeyml]_
|
||||
|
||||
.. caution::
|
||||
|
||||
If you are using NFS mounts for the consume directory you also need to
|
||||
change the command to turn off inotify as it doesn't work with NFS
|
||||
|
||||
``command: ["document_consumer", "--no-inotify"]``
|
||||
- /home/jonaswinkler/paperless-inbox:/usr/src/paperless/consume
|
||||
|
||||
Don't change the part after the colon or paperless won't find your documents.
|
||||
|
||||
|
||||
5. Modify ``docker-compose.env`` and adapt the following environment variables:
|
||||
4. Modify ``docker-compose.env``, following the comments in the file. The
|
||||
most important change is to set ``USERMAP_UID`` and ``USERMAP_GID``
|
||||
to the uid and gid of your user on the host system. This ensures that
|
||||
both the docker container and you on the host machine have write access
|
||||
to the consumption directory. If your UID and GID on the host system are
|
||||
1000 (the default for the first normal user on most systems), it will
|
||||
work out of the box without any modifications.
|
||||
|
||||
``PAPERLESS_PASSPHRASE``
|
||||
This is the passphrase Paperless uses to encrypt/decrypt the original
|
||||
document. If you aren't planning on using GPG encryption, you can just
|
||||
leave this undefined.
|
||||
|
||||
``PAPERLESS_OCR_THREADS``
|
||||
This is the number of threads the OCR process will spawn to process
|
||||
document pages in parallel. If the variable is not set, Python determines
|
||||
the core-count of your CPU and uses that value.
|
||||
|
||||
``PAPERLESS_OCR_LANGUAGES``
|
||||
If you want the OCR to recognize other languages in addition to the
|
||||
default English, set this parameter to a space separated list of
|
||||
three-letter language-codes after `ISO 639-2/T`_. For a list of available
|
||||
languages -- including their three letter codes -- see the
|
||||
`Alpine packagelist`_.
|
||||
|
||||
``USERMAP_UID`` and ``USERMAP_GID``
|
||||
If you want to mount the consumption volume (directory ``/consume`` within
|
||||
the containers) to a host-directory -- which you probably want to do --
|
||||
access rights might be an issue. The default user and group ``paperless``
|
||||
in the containers have an id of 1000. The containers will enforce that the
|
||||
owning group of the consumption directory will be ``paperless`` to be able
|
||||
to delete consumed documents. If your host-system has a group with an ID
|
||||
of 1000 and you don't want this group to have access rights to the
|
||||
consumption directory, you can use ``USERMAP_GID`` to change the id in the
|
||||
container and thus the one of the consumption directory. Furthermore, you
|
||||
can change the id of the default user as well using ``USERMAP_UID``.
|
||||
|
||||
``PAPERLESS_USE_SSL``
|
||||
If you want Paperless to use SSL for the user interface, set this variable
|
||||
to ``true``. You also need to copy your certificate and key to the ``data``
|
||||
directory, named ``ssl.cert`` and ``ssl.key``.
|
||||
This is not an ideal solution and, if possible, a reverse proxy with nginx
|
||||
is preferred.
|
||||
|
||||
6. Run ``docker-compose up -d``. This will create and start the necessary
|
||||
5. Run ``docker-compose up -d``. This will create and start the necessary
|
||||
containers.
|
||||
7. To be able to login, you will need a super user. To create it, execute the
|
||||
following command:
|
||||
|
||||
.. code-block:: shell-session
|
||||
6. To be able to log in, you will need a super user. To create it, execute the
|
||||
following command:
|
||||
|
||||
$ docker-compose run --rm webserver createsuperuser
|
||||
.. code-block:: shell-session
|
||||
|
||||
This will prompt you to set a username (default ``paperless``), an optional
|
||||
e-mail address and finally a password.
|
||||
8. The default ``docker-compose.yml`` exports the webserver on your local port
|
||||
8000. If you haven't adapted this, you should now be able to visit your
|
||||
`Paperless webserver`_ at ``http://127.0.0.1:8000`` (or
|
||||
``https://127.0.0.1:8000`` if you enabled SSL). You can login with the
|
||||
user and password you just created.
|
||||
9. Add files to consumption directory the way you prefer to. Following are two
|
||||
possible options:
|
||||
$ docker-compose run --rm webserver createsuperuser
|
||||
|
||||
1. Mount the consumption directory to a local host path by modifying your
|
||||
``docker-compose.yml``:
|
||||
|
||||
.. code-block:: diff
|
||||
|
||||
diff --git a/docker-compose.yml b/docker-compose.yml
|
||||
--- a/docker-compose.yml
|
||||
+++ b/docker-compose.yml
|
||||
@@ -17,9 +18,8 @@ services:
|
||||
volumes:
|
||||
- paperless-data:/usr/src/paperless/data
|
||||
- paperless-media:/usr/src/paperless/media
|
||||
- - /consume
|
||||
+ - /local/path/you/choose:/consume
|
||||
|
||||
.. danger::
|
||||
|
||||
While the consumption container will ensure at startup that it can
|
||||
**delete** a consumed file from a host-mounted directory, it might
|
||||
not be able to **read** the document in the first place if the access
|
||||
rights to the file are incorrect.
|
||||
|
||||
Make sure that the documents you put into the consumption directory
|
||||
will either be readable by everyone (``chmod o+r file.pdf``) or
|
||||
readable by the default user or group id 1000 (or the one you have
|
||||
set with ``USERMAP_UID`` or ``USERMAP_GID`` respectively).
|
||||
|
||||
2. Use ``docker cp`` to copy your files directly into the container:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ # Identify your containers
|
||||
$ docker-compose ps
|
||||
Name Command State Ports
|
||||
-------------------------------------------------------------------------
|
||||
paperless_consumer_1 /sbin/docker-entrypoint.sh ... Exit 0
|
||||
paperless_webserver_1 /sbin/docker-entrypoint.sh ... Exit 0
|
||||
|
||||
$ docker cp /path/to/your/file.pdf paperless_consumer_1:/consume
|
||||
|
||||
``docker cp`` is a one-shot-command, just like ``cp``. This means that
|
||||
every time you want to consume a new document, you will have to execute
|
||||
``docker cp`` again. You can of course automate this process, but option
|
||||
1 is generally the preferred one.
|
||||
|
||||
.. danger::
|
||||
|
||||
``docker cp`` will change the owning user and group of a copied file
|
||||
to the acting user at the destination, which will be ``root``.
|
||||
|
||||
You therefore need to ensure that the documents you want to copy into
|
||||
the container are readable by everyone (``chmod o+r file.pdf``)
|
||||
before copying them.
|
||||
This will prompt you to set a username, an optional e-mail address and
|
||||
finally a password.
|
||||
|
||||
7. The default ``docker-compose.yml`` exports the webserver on your local port
|
||||
8000. If you haven't adapted this, you should now be able to visit your
|
||||
Paperless instance at ``http://127.0.0.1:8000``. You can log in with the
|
||||
user and password you just created.
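
Steps 2 and 4 above can be sketched as a small shell helper. This is only an illustration, not part of the paperless-ng repository: the function names ``copy_example`` and ``usermap_lines`` are made up for this example. It assumes you run it from the directory that holds the two ``.example`` files, and it never overwrites a working copy you have already edited.

```shell
# Copy an example file to its working name (the same name without the
# ".example" suffix) unless a working copy already exists.
copy_example() {
    example="$1"
    target="${example%.example}"
    if [ -e "$target" ]; then
        echo "keeping existing $target"
    else
        cp "$example" "$target" && echo "created $target"
    fi
}

# Print USERMAP_UID / USERMAP_GID lines for the user running this script.
usermap_lines() {
    printf 'USERMAP_UID=%s\n' "$(id -u)"
    printf 'USERMAP_GID=%s\n' "$(id -g)"
}

# Run from the directory that holds the example files; files that are
# not present are skipped.
for f in docker-compose.yml.example docker-compose.env.example; do
    if [ -f "$f" ]; then
        copy_example "$f"
    fi
done
usermap_lines
```

Compare the two printed lines with the values in ``docker-compose.env``. If both IDs are 1000, the defaults described in step 4 should already fit.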
|
||||
|
||||
.. _Docker: https://www.docker.com/
|
||||
.. _docker-compose: https://docs.docker.com/compose/install/
|
||||
.. _ISO 639-2/T: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
|
||||
.. _Alpine packagelist: https://pkgs.alpinelinux.org/packages?name=tesseract-ocr-data*&arch=x86_64
|
||||
|
||||
.. [#compose] You of course don't have to use docker-compose, but it
|
||||
simplifies deployment immensely. If you know your way around Docker, feel
|
||||
free to tinker around without using compose!
|
||||
|
||||
.. [#dockercomposeyml] If you're upgrading your docker-compose images from
|
||||
version 1.1.0 or earlier, you might need to change in the
|
||||
``docker-compose.yml`` file the ``image: pitkley/paperless`` directive in
|
||||
both the ``webserver`` and ``consumer`` sections to ``build: ./`` as per the
|
||||
newer ``docker-compose.yml.example`` file
|
||||
|
||||
Bare Metal Route
|
||||
================
|
||||
|
||||
.. _setup-permanent:
|
||||
.. warning::
|
||||
|
||||
Making Things a Little more Permanent
|
||||
-------------------------------------
|
||||
TBD. Use docker for now.
|
||||
|
||||
Once you've tested things and are happy with the work flow, you should secure
|
||||
the installation and automate the process of starting the webserver and
|
||||
consumer.
|
||||
Migration to paperless-ng
|
||||
#########################
|
||||
|
||||
At its core, paperless-ng is still paperless and fully compatible. However, some
|
||||
things have changed under the hood, so you need to adapt your setup depending on
|
||||
how you installed paperless. The important things to keep in mind are as follows.
|
||||
|
||||
.. _setup-permanent-webserver:
|
||||
* Read the :ref:`paperless_changelog` and take note of breaking changes.
|
||||
* It is recommended to use postgresql as the database now. The docker-compose
|
||||
deployment will automatically create a postgresql instance and instruct
|
||||
paperless to use it. This means that if you use the docker-compose script
|
||||
with your current paperless media and data volumes and used the default
|
||||
sqlite database, **it will not use your sqlite database and it may seem
|
||||
as if your documents are gone**. You may use the provided
|
||||
``docker-compose.yml.sqlite.example`` script, which does not use postgresql.
|
||||
* The task scheduler of paperless, which is used to execute periodic tasks
|
||||
such as email checking and maintenance, requires a `redis`_ message broker
|
||||
instance. The docker-compose route takes care of that.
|
||||
* The layout of the folder structure for your documents and data remains the
|
||||
same.
|
||||
* The frontend needs to be built from source. The docker image takes care of
|
||||
that.
|
||||
|
||||
Using a Real Webserver
|
||||
++++++++++++++++++++++
|
||||
Migration to paperless-ng is then performed in a few simple steps:
|
||||
|
||||
The default is to use Django's development server, as that's easy and does the
|
||||
job well enough on a home network. However it is heavily discouraged to use
|
||||
it for more than that.
|
||||
1. Do a backup for two purposes: If something goes wrong, you still have your
|
||||
data. Second, if you don't like paperless-ng, you can switch back to
|
||||
paperless.
|
||||
|
||||
If you want to do things right you should use a real webserver capable of
|
||||
handling more than one thread. You will also have to let the webserver serve
|
||||
the static files (CSS, JavaScript) from the directory configured in
|
||||
``PAPERLESS_STATICDIR``. The default static files directory is ``../static``.
|
||||
2. Replace the paperless source with paperless-ng. If you're using git, this
|
||||
is done by:
|
||||
|
||||
For that you need to activate your virtual environment and collect the static
|
||||
files with the command:
|
||||
.. code:: bash
|
||||
|
||||
.. code:: bash
|
||||
$ git remote set-url origin https://github.com/jonaswinkler/paperless-ng
|
||||
$ git pull
|
||||
|
||||
$ cd <paperless directory>/src
|
||||
$ ./manage.py collectstatic
|
||||
3. If you are using docker, copy ``docker-compose.yml.example`` to
|
||||
``docker-compose.yml`` and ``docker-compose.env.example`` to
|
||||
``docker-compose.env``. Make adjustments to these files as necessary.
|
||||
See `docker route`_ for details.
|
||||
|
||||
4. Update paperless. See :ref:`administration-updating` for details.
|
||||
|
||||
Apache
|
||||
~~~~~~
|
||||
5. Start paperless-ng.
|
||||
|
||||
This is a configuration supplied by `steckerhalter`_ on GitHub. It uses Apache
|
||||
and mod_wsgi, with a Paperless installation in ``/home/paperless/``:
|
||||
.. code:: bash
|
||||
|
||||
.. code:: apache
|
||||
$ docker-compose up
|
||||
|
||||
This will also migrate your database as usual. Verify by inspecting the
|
||||
output that the migration was successfully executed. CTRL-C will then
|
||||
gracefully stop the container. After that, you can start paperless-ng as
|
||||
usual with
|
||||
|
||||
<VirtualHost *:80>
|
||||
ServerName example.com
|
||||
.. code:: bash
|
||||
|
||||
Alias /static/ /home/paperless/paperless/static/
|
||||
<Directory /home/paperless/paperless/static>
|
||||
Require all granted
|
||||
</Directory>
|
||||
$ docker-compose up -d
|
||||
|
||||
WSGIScriptAlias / /home/paperless/paperless/src/paperless/wsgi.py
|
||||
WSGIDaemonProcess example.com user=paperless group=paperless threads=5 python-path=/home/paperless/paperless/src:/home/paperless/.env/lib/python3.6/site-packages
|
||||
WSGIProcessGroup example.com
|
||||
6. Paperless installed a permanent redirect to ``admin/`` in your browser. This
|
||||
redirect is still in place and prevents access to the new UI. Clear
|
||||
everything related to paperless in your browser's data in order to fix
|
||||
this issue.
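
For the backup in step 1, a plain file-level archive is one possible safety net. The sketch below is an assumption-laden example, not the documented exporter: the function name ``backup_paperless_dirs`` is hypothetical, and it assumes an installation directory that contains ``data`` and ``media`` subdirectories. If your data lives in named Docker volumes, the paths will differ. Stopping paperless before archiving is probably wise so that the database file is not copied mid-write.

```shell
# Archive the data and media directories of an installation.
#   $1: installation directory (assumed to contain data/ and media/)
#   $2: directory that receives the archive
# Prints the path of the archive on success.
backup_paperless_dirs() {
    src="$1"
    dest="$2"
    archive="$dest/paperless-backup-$(date +%Y%m%d-%H%M%S).tar.gz"
    mkdir -p "$dest" || return 1
    tar -czf "$archive" -C "$src" data media || return 1
    echo "$archive"
}
```

A call would look like ``backup_paperless_dirs /path/to/paperless /path/to/backups``, with both paths being placeholders for your own setup.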
|
||||
|
||||
<Directory /home/paperless/paperless/src/paperless>
|
||||
<Files wsgi.py>
|
||||
Require all granted
|
||||
</Files>
|
||||
</Directory>
|
||||
</VirtualHost>
|
||||
Moving data from sqlite to postgresql
|
||||
=====================================
|
||||
|
||||
.. _steckerhalter: https://github.com/steckerhalter
|
||||
.. warning::
|
||||
|
||||
TBD.
|
||||
|
||||
Nginx + Gunicorn
|
||||
~~~~~~~~~~~~~~~~
|
||||
|
||||
If you're using Nginx, the most common setup is to combine it with a
|
||||
Python-based server like Gunicorn so that Nginx is acting as a proxy. Below is
|
||||
a copy of a simple Nginx configuration fragment making use of a gunicorn
|
||||
instance listening on localhost port 8000.
|
||||
|
||||
.. code:: nginx
|
||||
|
||||
server {
|
||||
listen 80;
|
||||
|
||||
index index.html index.htm index.php;
|
||||
access_log /var/log/nginx/paperless_access.log;
|
||||
error_log /var/log/nginx/paperless_error.log;
|
||||
|
||||
location /static {
|
||||
|
||||
autoindex on;
|
||||
alias <path-to-paperless-static-directory>;
|
||||
|
||||
}
|
||||
|
||||
location / {
|
||||
|
||||
proxy_set_header Host $http_host;
|
||||
proxy_set_header X-Real-IP $remote_addr;
|
||||
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
|
||||
proxy_set_header X-Forwarded-Proto $scheme;
|
||||
|
||||
proxy_pass http://127.0.0.1:8000;
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
The gunicorn server can be started with the command:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ <path-to-paperless-virtual-environment>/bin/gunicorn --pythonpath=<path-to-paperless>/src paperless.wsgi -w 2
|
||||
|
||||
|
||||
.. _setup-permanent-standard-systemd:
|
||||
|
||||
Standard (Bare Metal + Systemd)
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
If you're running on a bare metal system that's using Systemd, you can use the
|
||||
service unit files in the ``scripts`` directory to set this up.
|
||||
|
||||
1. You'll need to create a group and user called ``paperless`` (without login)
|
||||
2. Setup Paperless to be in a place that this new user can read and write to.
|
||||
3. Ensure ``/etc/paperless`` is readable by the ``paperless`` user.
|
||||
4. Copy the service file from the ``scripts`` directory to
|
||||
``/etc/systemd/system``.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
$ cp /path/to/paperless/scripts/paperless-consumer.service /etc/systemd/system/
|
||||
$ cp /path/to/paperless/scripts/paperless-webserver.service /etc/systemd/system/
|
||||
|
||||
5. Edit the service file to point the ``ExecStart`` line to the proper location
|
||||
of your paperless install, referencing the appropriate Python binary. For
|
||||
example:
|
||||
``ExecStart=/path/to/python3 /path/to/paperless/src/manage.py document_consumer``.
|
||||
6. Start and enable (so they start on boot) the services.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
$ systemctl enable paperless-consumer
|
||||
$ systemctl enable paperless-webserver
|
||||
$ systemctl start paperless-consumer
|
||||
$ systemctl start paperless-webserver
|
||||
|
||||
|
||||
.. _setup-permanent-standard-upstart:
|
||||
|
||||
Standard (Bare Metal + Upstart)
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Ubuntu 14.04 and earlier use the `Upstart`_ init system to start services
|
||||
during the boot process. To configure Upstart to run Paperless automatically
|
||||
after restarting your system:
|
||||
|
||||
1. Change to the directory where Upstart's configuration files are kept:
|
||||
``cd /etc/init``
|
||||
2. Create a new file: ``sudo nano paperless-server.conf``
|
||||
3. In the newly-created file enter::
|
||||
|
||||
start on (local-filesystems and net-device-up IFACE=eth0)
|
||||
stop on shutdown
|
||||
|
||||
respawn
|
||||
respawn limit 10 5
|
||||
|
||||
script
|
||||
exec <path to paperless virtual environment>/bin/gunicorn --pythonpath=<path to paperless>/src paperless.wsgi -w 2
|
||||
end script
|
||||
|
||||
Note that you'll need to replace ``/srv/paperless/src/manage.py`` with the
|
||||
path to the ``manage.py`` script in your installation directory.
|
||||
|
||||
If you are using a network interface other than ``eth0``, you will have to
|
||||
change ``IFACE=eth0``. For example, if you are connected via WiFi, you will
|
||||
likely need to replace ``eth0`` above with ``wlan0``. To see all interfaces,
|
||||
run ``ifconfig -a``.
|
||||
|
||||
Save the file.
|
||||
|
||||
4. Create a new file: ``sudo nano paperless-consumer.conf``
|
||||
|
||||
5. In the newly-created file enter::
|
||||
|
||||
start on (local-filesystems and net-device-up IFACE=eth0)
|
||||
stop on shutdown
|
||||
|
||||
respawn
|
||||
respawn limit 10 5
|
||||
|
||||
script
|
||||
exec <path to paperless virtual environment>/bin/python <path to paperless>/manage.py document_consumer
|
||||
end script
|
||||
|
||||
Replace the path placeholder and ``eth0`` with the appropriate value and save the file.
|
||||
|
||||
These two configuration files together will start both the Paperless webserver
|
||||
and document consumer processes when the file system and network interface
|
||||
specified is available after boot. Furthermore, if either process ever exits
|
||||
unexpectedly, Upstart will try to restart it a maximum of 10 times within a 5
|
||||
second period.
|
||||
|
||||
.. _Upstart: http://upstart.ubuntu.com/
|
||||
|
||||
|
||||
.. _setup-permanent-docker:
|
||||
|
||||
Docker
|
||||
~~~~~~
|
||||
|
||||
If you're using Docker, you can set a restart-policy_ in the
|
||||
``docker-compose.yml`` to have the containers automatically start with the
|
||||
Docker daemon.
|
||||
|
||||
.. _restart-policy: https://docs.docker.com/engine/reference/commandline/run/#restart-policies-restart
|
||||
|
||||
.. _redis: https://redis.io/
|
||||
|
216
docs/usage_overview.rst
Normal file
@ -0,0 +1,216 @@
|
||||
**************
|
||||
Usage Overview
|
||||
**************
|
||||
|
||||
Paperless is an application that manages your personal documents. With
|
||||
the help of a document scanner (see :ref:`scanners`), paperless transforms
|
||||
your unwieldy physical document binders into a searchable archive and
|
||||
provides many utilities for finding and managing your documents.
|
||||
|
||||
|
||||
Terms and definitions
|
||||
#####################
|
||||
|
||||
Paperless essentially consists of two different parts for managing your
|
||||
documents:
|
||||
|
||||
* The *consumer* watches a specified folder and adds all documents in that
|
||||
folder to paperless.
|
||||
* The *web server* provides a UI that you use to manage and search for your
|
||||
scanned documents.
|
||||
|
||||
Each document has a couple of fields that you can assign to them:
|
||||
|
||||
* A *Document* is a piece of paper that sometimes contains valuable
|
||||
information.
|
||||
* The *correspondent* of a document is the person, institution or company that
|
||||
a document either originates from, or is sent to.
|
||||
* A *tag* is a label that you can assign to documents. Think of labels as more
|
||||
powerful folders: Multiple documents can be grouped together with a single
|
||||
tag, however, a single document can also have multiple tags. This is not
|
||||
possible with folders. The reason folders are not implemented in paperless
|
||||
is simply that tags are much more versatile than folders.
|
||||
* A *document type* is used to demarcate the type of a document such as letter,
|
||||
bank statement, invoice, contract, etc. It is used to identify what a document
|
||||
is about.
|
||||
* The *date added* of a document is the date the document was scanned into
|
||||
paperless. You cannot and should not change this date.
|
||||
* The *date created* of a document is the date the document was initially issued.
|
||||
This can be the date you bought a product, the date you signed a contract, or
|
||||
the date a letter was sent to you.
|
||||
* The *archive serial number* (short: ASN) of a document is the identifier of
|
||||
the document in your physical document binders. See
|
||||
:ref:`usage-recommended_workflow` below.
|
||||
* The *content* of a document is the text that was OCR'ed from the document.
|
||||
This text is fed into the search engine and is used for matching tags,
|
||||
correspondents and document types.
|
||||
|
||||
.. TODO: hyperref
|
||||
|
||||
Frontend overview
|
||||
#################
|
||||
|
||||
.. warning::
|
||||
|
||||
TBD. Add some fancy screenshots!
|
||||
|
||||
Adding documents to paperless
|
||||
#############################
|
||||
|
||||
Once you've got Paperless set up, you need to start feeding documents into it.
|
||||
Currently, there are three options: the consumption directory, IMAP (email), and
|
||||
HTTP POST.
|
||||
|
||||
|
||||
The consumption directory
|
||||
=========================
|
||||
|
||||
The primary method of getting documents into your database is by putting them in
|
||||
the consumption directory. The consumer runs in an infinite
|
||||
loop looking for new additions to this directory and when it finds them, it goes
|
||||
about the process of parsing them with the OCR, indexing what it finds, and storing
|
||||
it in the media directory.
|
||||
|
||||
Getting stuff into this directory is up to you. If you're running Paperless
|
||||
on your local computer, you might just want to drag and drop files there, but if
|
||||
you're running this on a server and want your scanner to automatically push
|
||||
files to this directory, you'll need to set up some sort of service to accept the
|
||||
files from the scanner. Typically, you're looking at an FTP server like
|
||||
`Proftpd`_ or a Windows folder share with `Samba`_.
|
||||
|
||||
.. _Proftpd: http://www.proftpd.org/
|
||||
.. _Samba: http://www.samba.org/
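
If you drop files in by hand, the copy can be wrapped in a small function that fails early when the file is unreadable or the target directory is missing. This is a minimal sketch under assumptions: ``submit_document`` is a hypothetical name, and the consumption directory is passed in as an argument because its location depends on your configuration.

```shell
# Copy one document into the consumption directory.
#   $1: file to submit
#   $2: consumption directory (depends on your configuration)
submit_document() {
    file="$1"
    consume_dir="$2"
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 1
    fi
    if [ ! -d "$consume_dir" ]; then
        echo "no such directory: $consume_dir" >&2
        return 1
    fi
    cp "$file" "$consume_dir/" && echo "queued $(basename "$file")"
}
```

The consumer should then pick the file up on its next pass over the directory.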
|
||||
|
||||
.. TODO: hyperref to configuration of the location of this magic folder.
|
||||
|
||||
|
||||
IMAP (Email)
|
||||
============
|
||||
|
||||
Another handy way to get documents into your database is to email them to
|
||||
yourself. The typical use-case would be to be out for lunch and want to send a
|
||||
copy of the receipt back to your system at home. Paperless can be taught to
|
||||
pull emails down from an arbitrary account and dump them into the consumption
|
||||
directory where the consumer will follow the
|
||||
usual pattern of consuming the document.
|
||||
|
||||
Some things you need to know about this feature:
|
||||
|
||||
* It's disabled by default. Setting the values below enables it.
|
||||
* It's been tested in a limited environment, so it may not work for you (please
|
||||
submit a pull request if you can!)
|
||||
* It's designed to **delete mail from the server once consumed**. So don't go
|
||||
pointing this to your personal email account and wonder where all your stuff
|
||||
went.
|
||||
* Currently, only one photo (attachment) per email will work.
|
||||
|
||||
So, with all that in mind, here's what you do to get it running:
|
||||
|
||||
1. Set up a new email account somewhere, or if you're feeling daring, create a
|
||||
folder in an existing email box and note the path to that folder.
|
||||
2. In ``/etc/paperless.conf`` set all of the appropriate values in
|
||||
``PATHS AND FOLDERS`` and ``SECURITY``.
|
||||
If you decided to use a subfolder of an existing account, then make sure you
|
||||
set ``PAPERLESS_CONSUME_MAIL_INBOX`` accordingly here. You also have to set
|
||||
the ``PAPERLESS_EMAIL_SECRET`` to something you can remember 'cause you'll
|
||||
have to include that in every email you send.
|
||||
3. Restart paperless. Paperless will check
|
||||
the configured email account at startup and from then on every 10 minutes
|
||||
for something new and pull down whatever it finds.
|
||||
4. Send yourself an email! Note that the subject is treated as the file name,
|
||||
so if you set the subject to ``Correspondent - Title - tag,tag,tag``, you'll
|
||||
get what you expect. Also, you must include the aforementioned secret
|
||||
string in every email so the fetcher knows that it's safe to import.
|
||||
Note that Paperless only imports emails whose subject consists entirely of
safe characters: alphanumeric characters and ``-_ ,.'``.
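
Before sending, you can check a subject line against that character set
yourself. This is a rough sketch, not the check Paperless actually runs: the
function name ``is_safe_subject`` is invented here, and the pattern is simply
the character list from above written as a regular expression.

.. code:: bash

  # Hypothetical check: succeeds only if the subject is non-empty and uses
  # nothing but letters, digits and the characters - _ space , . '
  is_safe_subject() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq "^[A-Za-z0-9_ ,.'-]+\$"
  }

  if is_safe_subject "Correspondent - Title - tag,tag,tag"; then
    echo "subject looks safe"
  fi

A subject such as ``Invoice #42`` would fail this check because of the ``#``.
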
|
||||
|
||||
|
||||
REST API
|
||||
========
|
||||
|
||||
You can also submit a document using the REST API; see the API section for details.
|
||||
|
||||
|
||||
.. _usage-recommended_workflow:
|
||||
|
||||
The recommended workflow
|
||||
########################
|
||||
|
||||
Once you have familiarized yourself with paperless and are ready to use it
|
||||
for all your documents, the recommended workflow for managing your documents
|
||||
is as follows. This workflow also takes into account that some documents
|
||||
have to be kept in physical form, but still ensures that you get all the
|
||||
advantages for these documents as well.
|
||||
|
||||
Preparations in paperless
|
||||
=========================
|
||||
|
||||
* Create an inbox tag that gets assigned to all new documents.
|
||||
* Create a TODO tag.
|
||||
|
||||
Processing of the physical documents
|
||||
====================================
|
||||
|
||||
Keep a physical inbox. Whenever you receive a document that you need to
|
||||
archive, put it into your inbox. Regularly, do the following for all documents
|
||||
in your inbox:
|
||||
|
||||
1. For each document, decide if you need to keep the document in physical
|
||||
form. This applies to certain important documents, such as contracts and
|
||||
certificates.
|
||||
2. If you need to keep the document, write a running number on the document
|
||||
before scanning, starting at one and counting upwards. This is the archive
|
||||
serial number, or ASN in short.
|
||||
3. Scan the document.
|
||||
4. If the document has an ASN assigned, store it in a *single* binder, sorted
|
||||
by ASN. Don't order this binder in any other way.
|
||||
5. If the document has no ASN, throw it away. Yay!
|
||||
|
||||
Over time, you will notice that your physical binder will fill up. If it is
|
||||
full, label the binder with the range of ASNs in this binder (e.g., "Documents
|
||||
1 to 343"), store the binder in your cellar or elsewhere, and start a new
|
||||
binder.
|
||||
|
||||
The idea behind this process is that you will never have to use the physical
|
||||
binders to find a document. If you need a specific physical document, you
|
||||
may find this document by:
|
||||
|
||||
1. Searching in paperless for the document.
|
||||
2. Identifying the ASN of the document, since it appears on the scan.
|
||||
3. Grabbing the relevant document binder and getting the document. This is easy since
|
||||
they are sorted by ASN.
|
||||
|
||||
Processing of documents in paperless
|
||||
====================================
|
||||
|
||||
Once you have scanned in a document, proceed in paperless as follows.
|
||||
|
||||
1. If the document has an ASN, assign the ASN to the document.
|
||||
2. Assign a correspondent to the document (e.g., your employer, bank, etc.).
|
||||
This isn't strictly necessary but helps in finding a document when you need
|
||||
it.
|
||||
3. Assign a document type (e.g., invoice, bank statement, etc.) to the document.
|
||||
This isn't strictly necessary but helps in finding a document when you need
|
||||
it.
|
||||
4. Assign a proper title to the document (the name of an item you bought, the
|
||||
subject of the letter, etc.).
|
||||
5. Check that the date of the document is correct. Paperless tries to read
|
||||
the date from the content of the document, but this fails sometimes if the
|
||||
OCR is bad or multiple dates appear on the document.
|
||||
6. Remove inbox tags from the documents.
|
||||
|
||||
|
||||
Task management
|
||||
===============
|
||||
|
||||
Some documents require attention and require you to act on the document. You
|
||||
may take two different approaches to handle these documents based on how
|
||||
regularly you intend to use paperless and scan documents.
|
||||
|
||||
* If you scan and process your documents in paperless regularly, assign a
|
||||
TODO tag to all scanned documents that you need to process. Create a saved
|
||||
view on the dashboard that shows all documents with this tag.
|
||||
* If you do not scan documents regularly and use paperless solely for archiving,
|
||||
create a physical todo box next to your physical inbox and put documents you
|
||||
need to process in the TODO box. Once you have performed the task associated with
|
||||
the document, move it to the inbox.
|
@ -1,284 +0,0 @@
|
||||
.. _utilities:
|
||||
|
||||
Utilities
|
||||
=========
|
||||
|
||||
There are basically three utilities to Paperless: the webserver, consumer, and
|
||||
if needed, the exporter. They're all detailed here.
|
||||
|
||||
|
||||
.. _utilities-webserver:
|
||||
|
||||
The Webserver
|
||||
-------------
|
||||
|
||||
At the heart of it, Paperless is a simple Django webservice, and the entire
|
||||
interface is based on Django's standard admin interface. Once running, visiting
|
||||
the URL for your service delivers the admin, through which you can get a
|
||||
detailed listing of all available documents, search for specific files, and
|
||||
download whatever it is you're looking for.
|
||||
|
||||
|
||||
.. _utilities-webserver-howto:
|
||||
|
||||
How to Use It
|
||||
.............
|
||||
|
||||
The webserver is started via the ``manage.py`` script:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ /path/to/paperless/src/manage.py runserver
|
||||
|
||||
By default, the server runs on localhost, port 8000, but you can change this
|
||||
with a few arguments. Run ``manage.py --help`` for more information.
|
||||
|
||||
Add the option ``--noreload`` to reduce resource usage. Otherwise, the server
|
||||
continuously polls all source files for changes to auto-reload them.
|
||||
|
||||
Note that when exiting this command your webserver will disappear.
|
||||
If you want to run this full-time (which is kind of the point)
|
||||
you'll need to have it start in the background -- something you'll need to
|
||||
figure out for your own system. To get you started though, there are Systemd
|
||||
service files in the ``scripts`` directory.
|
||||
|
||||
|
||||
.. _utilities-consumer:
|
||||
|
||||
The Consumer
|
||||
------------
|
||||
|
||||
The consumer script runs in an infinite loop, constantly looking at a directory
|
||||
for documents to parse and index. The process is pretty straightforward:
|
||||
|
||||
1. Look in ``CONSUMPTION_DIR`` for a document. If one is found, go to #2.
|
||||
If not, wait 10 seconds and try again. On Linux, new documents are detected
|
||||
instantly via inotify, so there's no waiting involved.
|
||||
2. Parse the document with Tesseract.
|
||||
3. Create a new record in the database with the OCR'd text.
|
||||
4. Attempt to automatically assign document attributes by doing some guesswork.
|
||||
Read up on the :ref:`guesswork documentation<guesswork>` for more
|
||||
information about this process.
|
||||
5. Encrypt the document (if you have a passphrase set) and store it in the
|
||||
``media`` directory under ``documents/originals``.
|
||||
6. Go to #1.
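
The polling part of this loop (steps 1 and 6) can be pictured with a few lines
of shell. This is purely an illustration of the idea and not how the consumer
is implemented; ``poll_once`` is a made-up name, and the parsing and storing
steps are left out.

.. code:: bash

  # Hypothetical sketch: print the first regular file found in directory $1,
  # or fail if there is none.
  poll_once() {
    for f in "$1"/*; do
      if [ -f "$f" ]; then
        printf '%s\n' "$f"
        return 0
      fi
    done
    return 1
  }

  # The surrounding loop would look roughly like this:
  # while true; do
  #   poll_once "$CONSUMPTION_DIR" || sleep 10
  # done
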
|
||||
|
||||
|
||||
.. _utilities-consumer-howto:
|
||||
|
||||
How to Use It
|
||||
.............
|
||||
|
||||
The consumer is started via the ``manage.py`` script:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ /path/to/paperless/src/manage.py document_consumer
|
||||
|
||||
This starts the service that will consume documents as they appear in
|
||||
``CONSUMPTION_DIR``.
|
||||
|
||||
Note that this command runs continuously, so exiting it will mean your consumer
|
||||
disappears. If you want to run this full-time (which is kind of the point)
|
||||
you'll need to have it start in the background -- something you'll need to
|
||||
figure out for your own system. To get you started though, there are Systemd
|
||||
service files in the ``scripts`` directory.
|
||||
|
||||
Some command line arguments are available to customize the behavior of the
|
||||
consumer. By default it will use ``/etc/paperless.conf`` values. Display the
|
||||
help with:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ /path/to/paperless/src/manage.py document_consumer --help
|
||||
|
||||
.. _utilities-exporter:
|
||||
|
||||
The Exporter
|
||||
------------
|
||||
|
||||
Tired of fiddling with Paperless, or just want to do something stupid and are
|
||||
afraid of accidentally damaging your files? You can export all of your
|
||||
documents into neatly named, dated, and unencrypted files.
|
||||
|
||||
|
||||
.. _utilities-exporter-howto:
|
||||
|
||||
How to Use It
|
||||
.............
|
||||
|
||||
This too is done via the ``manage.py`` script:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ /path/to/paperless/src/manage.py document_exporter /path/to/somewhere/
|
||||
|
||||
This will dump all of your unencrypted documents into ``/path/to/somewhere``
|
||||
for you to do with as you please. The files are accompanied by a special
|
||||
file, ``manifest.json``, which can be used to :ref:`import the files
|
||||
<utilities-importer>` at a later date if you wish.
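
If the export is meant as a backup, one option is to bundle the export
directory into a single archive once the exporter has finished. The sketch
below assumes only what is stated above, namely that the export directory
contains a ``manifest.json``. The function name ``archive_export`` and the
paths are hypothetical.

.. code:: bash

  # Hypothetical helper: pack an export directory into a gzip-compressed tar
  # archive, refusing to run if manifest.json is missing.
  archive_export() {
    # $1: directory written by document_exporter, $2: archive file to create
    [ -f "$1/manifest.json" ] || { echo "no manifest.json in $1" >&2; return 1; }
    tar -czf "$2" -C "$1" .
  }

  # Example call (placeholder paths):
  # archive_export /path/to/somewhere /path/to/backups/paperless-export.tar.gz

Write the archive somewhere outside the export directory so that it does not
end up inside itself.
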
|
||||
|
||||
|
||||
.. _utilities-exporter-howto-docker:
|
||||
|
||||
Docker
|
||||
______
|
||||
|
||||
If you are :ref:`using Docker <setup-installation-docker>`, running the
|
||||
exporter is almost as easy. To mount a volume for exports, follow the
|
||||
instructions in the ``docker-compose.yml.example`` file for the ``/export``
|
||||
volume (making the changes in your own ``docker-compose.yml`` file, of course).
|
||||
Once you have the volume mounted, the command to run an export is:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ docker-compose run --rm consumer document_exporter /export
|
||||
|
||||
If you prefer to use ``docker run`` directly, supply the necessary command-line
|
||||
options:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ # Identify your containers
|
||||
$ docker-compose ps
|
||||
Name Command State Ports
|
||||
-------------------------------------------------------------------------
|
||||
paperless_consumer_1 /sbin/docker-entrypoint.sh ... Exit 0
|
||||
paperless_webserver_1 /sbin/docker-entrypoint.sh ... Exit 0
|
||||
|
||||
$ # Make sure to replace your passphrase and remove or adapt the id mapping
|
||||
$ docker run --rm \
|
||||
--volumes-from paperless_data_1 \
|
||||
--volume /path/to/arbitrary/place:/export \
|
||||
-e PAPERLESS_PASSPHRASE=YOUR_PASSPHRASE \
|
||||
-e USERMAP_UID=1000 -e USERMAP_GID=1000 \
|
||||
paperless document_exporter /export
|
||||
|
||||
|
||||
.. _utilities-importer:
|
||||
|
||||
The Importer
|
||||
------------
|
||||
|
||||
Looking to transfer Paperless data from one instance to another, or just want
|
||||
to restore from a backup? This is your go-to toy.
|
||||
|
||||
|
||||
.. _utilities-importer-howto:
|
||||
|
||||
How to Use It
|
||||
.............
|
||||
|
||||
The importer works just like the exporter. You point it at a directory, and
|
||||
the script does the rest of the work:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ /path/to/paperless/src/manage.py document_importer /path/to/somewhere/
|
||||
|
||||
Docker
|
||||
______
|
||||
|
||||
Assuming that you've already gone through the steps above in the
|
||||
:ref:`export <utilities-exporter-howto-docker>` section, then the easiest thing
|
||||
to do is just re-use the ``/export`` path you already set up:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ docker-compose run --rm consumer document_importer /export
|
||||
|
||||
Similarly, if you're not using docker-compose, you can adjust the export
|
||||
instructions above to do the import.
|
||||
|
||||
|
||||
.. _utilities-retagger:
|
||||
|
||||
Re-running your tagging and correspondent matchers
|
||||
--------------------------------------------------
|
||||
|
||||
Say you've imported a few hundred documents and now want to introduce
|
||||
a tag or set up a new correspondent, and apply its matching to all of
|
||||
the currently-imported docs. This problem is common enough that
|
||||
there are tools for it.
|
||||
|
||||
|
||||
.. _utilities-retagger-howto:
|
||||
|
||||
How to Do It
|
||||
............
|
||||
|
||||
This too is done via the ``manage.py`` script:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
$ /path/to/paperless/src/manage.py document_retagger
|
||||
|
||||
Run this after changing or adding tagging rules. It'll loop over all
|
||||
of the documents in your database and attempt to match all of your
|
||||
tags to them. If one matches, it'll be applied. And don't worry, you
|
||||
can run this as often as you like, it won't double-tag a document.
|
||||
|
||||
.. code:: bash
|
||||
|
||||
$ /path/to/paperless/src/manage.py document_correspondents
|
||||
|
||||
This is the equivalent command to run after adding or changing a correspondent.
|
||||
|
||||
.. _utilities-encyption:
|
||||
|
||||
Enabling Encryption
|
||||
-------------------
|
||||
|
||||
Let's say you've imported a few documents to play around with paperless and now
|
||||
you are using it more seriously and want to enable encryption of your files.
|
||||
|
||||
.. _utilities-encryption-howto:
|
||||
|
||||
Basic Syntax
|
||||
.............
|
||||
|
||||
Again we'll use the ``manage.py`` script, passing ``change_storage_type``:
|
||||
|
||||
.. code:: console
|
||||
|
||||
$ /path/to/paperless/src/manage.py change_storage_type --help
|
||||
usage: manage.py change_storage_type [-h] [--version] [-v {0,1,2,3}]
|
||||
[--settings SETTINGS]
|
||||
[--pythonpath PYTHONPATH] [--traceback]
|
||||
[--no-color] [--passphrase PASSPHRASE]
|
||||
{gpg,unencrypted} {gpg,unencrypted}
|
||||
|
||||
This is how you migrate your stored documents from an encrypted state to an
|
||||
unencrypted one (or vice-versa)
|
||||
|
||||
positional arguments:
|
||||
{gpg,unencrypted} The state you want to change your documents from
|
||||
{gpg,unencrypted} The state you want to change your documents to
|
||||
|
||||
optional arguments:
|
||||
--passphrase PASSPHRASE
|
||||
If PAPERLESS_PASSPHRASE isn't set already, you need to
|
||||
specify it here
|
||||
|
||||
Enabling Encryption
|
||||
...................
|
||||
|
||||
Basic usage to enable encryption of your document store (**USE A MORE SECURE PASSPHRASE**):
|
||||
|
||||
(Note: If ``PAPERLESS_PASSPHRASE`` isn't set already, you need to specify it here)
|
||||
|
||||
.. code:: bash
|
||||
|
||||
  $ /path/to/paperless/src/manage.py change_storage_type [--passphrase 'SECR3TP4SSPHRA$E'] unencrypted gpg
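
A passphrase that contains ``$`` needs single quotes on the command line,
because otherwise the shell expands ``$E`` as a variable before ``manage.py``
ever sees it. A small sketch of the effect, assuming ``E`` is empty or unset
in your shell:

.. code:: bash

  E=""                           # assumption: E is empty (or unset)
  unquoted=SECR3TP4SSPHRA$E      # the shell substitutes $E here
  quoted='SECR3TP4SSPHRA$E'      # single quotes keep the text literal

  printf '%s\n' "$unquoted"
  printf '%s\n' "$quoted"

The first line printed is missing the trailing ``$E``; the second is the
passphrase as typed.
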
|
||||
|
||||
|
||||
Disabling Encryption
|
||||
....................
|
||||
|
||||
Basic usage to disable encryption of your document store:
|
||||
|
||||
(Note: Again, if ``PAPERLESS_PASSPHRASE`` isn't set already, you need to specify it here)
|
||||
|
||||
.. code:: bash
|
||||
|
||||
  $ /path/to/paperless/src/manage.py change_storage_type [--passphrase 'SECR3TP4SSPHRA$E'] gpg unencrypted
|