Merge branch 'master' into master

This commit is contained in:
Jonas Winkler
2020-12-19 13:49:01 +01:00
committed by GitHub
602 changed files with 35301 additions and 35494 deletions

BIN
docs/_static/recommended_workflow.png vendored Normal file

BIN
docs/_static/screenshots/dashboard.png vendored Normal file

BIN
docs/_static/screenshots/editing.png vendored Normal file

BIN
docs/_static/screenshots/logs.png vendored Normal file

BIN
docs/_static/screenshots/mobile.png vendored Normal file

BIN
docs/_static/screenshots/new-tag.png vendored Normal file

415
docs/administration.rst Normal file

@@ -0,0 +1,415 @@
**************
Administration
**************
.. _administration-backup:
Making backups
##############
Multiple options exist for making backups of your paperless instance,
depending on how you installed paperless.
Before making backups, make sure that paperless is not running.
Options available to any installation of paperless:
* Use the :ref:`document exporter <utilities-exporter>`.
The document exporter exports all your documents, thumbnails and
metadata to a specific folder. You may import your documents into a
fresh instance of paperless again or store your documents in another
DMS with this export.
Options available to docker installations:
* Back up the docker volumes. These usually reside within
``/var/lib/docker/volumes`` on the host and you need to be root in order
to access them.
Paperless uses 3 volumes:
* ``paperless_media``: This is where your documents are stored.
* ``paperless_data``: This is where auxiliary data is stored. This
folder also contains the SQLite database, if you use it.
* ``paperless_pgdata``: Exists only if you use PostgreSQL and contains
the database.
Options available to bare-metal and non-docker installations:
* Back up the entire paperless folder. This ensures that if your paperless instance
crashes at some point or your disk fails, you can simply copy the folder back
into place and it works.
When using PostgreSQL, you'll also have to back up the database.
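As a rough illustration of the folder backup described above, the following Python sketch packs a directory into a timestamped ``.tar.gz`` archive. The function name ``backup_folder`` and any paths you pass to it are hypothetical and not part of paperless; stop paperless before running something like this, and remember that a PostgreSQL database has to be backed up separately.

```python
import tarfile
import time
from pathlib import Path


def backup_folder(source: Path, target_dir: Path) -> Path:
    """Pack ``source`` into a timestamped .tar.gz archive inside ``target_dir``."""
    target_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = target_dir / f"{source.name}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Store entries relative to the folder name so the archive can be
        # unpacked at any location later on.
        tar.add(source, arcname=source.name)
    return archive
```

Calling ``backup_folder(Path("/path/to/paperless"), Path("/path/to/backups"))`` would return the path of the archive it wrote. Keep the target directory outside of the folder being archived.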
.. _migrating-restoring:
Restoring
=========
.. _administration-updating:
Updating paperless
##################
If a new release of paperless-ng is available, upgrading depends on how you
installed paperless-ng in the first place. The releases are available on the
`release page <https://github.com/jonaswinkler/paperless-ng/releases>`_.
First of all, ensure that paperless is stopped.
.. code:: shell-session
$ cd /path/to/paperless
$ docker-compose down
After that, :ref:`make a backup <administration-backup>`.
A. If you used the dockerfiles archive, simply download the files of the new release,
adjust the settings in the files (e.g., the path to your consumption directory),
and replace your existing docker-compose files. Then start paperless as usual,
which will pull the new image, and update your database, if necessary:
.. code:: shell-session
$ cd /path/to/paperless
$ docker-compose up
If you see everything working, you can start paperless-ng with "-d" to have it
run in the background.
.. hint::
The released docker-compose files specify exact versions to be pulled from the hub.
This is to ensure that if the docker-compose files should change at some point
(e.g., services updated or configured differently), you won't run into trouble due to
docker pulling the ``latest`` image and running it in an older environment.
B. If you built the image yourself, grab the new archive and replace your current
paperless folder with the new contents.
After that, make the necessary adjustments to the docker-compose.yml (e.g.,
adjust your consumption directory).
Build and start the new image with:
.. code:: shell-session
$ cd /path/to/paperless
$ docker-compose build
$ docker-compose up
If you see everything working, you can start paperless-ng with "-d" to have it
run in the background.
.. hint::
You can usually keep your ``docker-compose.env`` file, since this file will
never include mandatory configuration options. However, it is worth checking
out the new version of this file, since it might have new recommendations
on what to configure.
Updating paperless without docker
=================================
After grabbing the new release and unpacking the contents, do the following:
1. Update dependencies. New paperless versions may require additional
dependencies. The dependencies required are listed in the section about
:ref:`bare metal installations <setup-bare_metal>`.
2. Update python requirements. If you use Pipenv, this is done with the following steps.
.. code:: shell-session
$ pip install --upgrade pipenv
$ cd /path/to/paperless
$ pipenv clean
$ pipenv install
This creates a new virtual environment (or uses your existing environment)
and installs all dependencies into it.
3. Collect static files.
.. code:: shell-session
$ cd src
$ pipenv run python3 manage.py collectstatic --clear
4. Migrate the database.
.. code:: shell-session
$ cd src
$ pipenv run python3 manage.py migrate
Management utilities
####################
Paperless comes with some management commands that perform various maintenance
tasks on your paperless instance. You can invoke these commands either by
.. code:: shell-session
$ cd /path/to/paperless
$ docker-compose run --rm webserver <command> <arguments>
or
.. code:: shell-session
$ cd /path/to/paperless/src
$ pipenv run python manage.py <command> <arguments>
depending on whether you use docker or not.
All commands have built-in help, which can be accessed by executing them with
the argument ``--help``.
.. _utilities-exporter:
Document exporter
=================
The document exporter exports all your data from paperless into a folder for
backup or migration to another DMS.
.. code::
document_exporter target
``target`` is a folder to which the data gets written. This includes documents,
thumbnails and a ``manifest.json`` file. The manifest contains all metadata from
the database (correspondents, tags, etc).
When you use the provided docker compose script, specify ``../export`` as the
target. This path inside the container is automatically mounted on your host on
the folder ``export``.
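To get a quick overview of what an export contains, the sketch below counts the manifest entries per type. It assumes that ``manifest.json`` is a JSON list of objects that each carry a ``model`` key (the shape produced by Django's JSON serializer); that layout is an assumption and is not stated in this document, and ``summarize_manifest`` is a hypothetical helper name.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_manifest(export_dir: Path) -> Counter:
    """Count the entries in ``manifest.json`` per model name."""
    with open(export_dir / "manifest.json", encoding="utf-8") as handle:
        entries = json.load(handle)
    # Assumption: every entry is a dict with a "model" key. Entries without
    # one are counted as "unknown" instead of raising an error.
    return Counter(entry.get("model", "unknown") for entry in entries)
```

Running this against an export folder before importing it elsewhere is a cheap sanity check that the manifest is readable and not empty.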
.. _utilities-importer:
Document importer
=================
The document importer takes the export produced by the `Document exporter`_ and
imports it into paperless.
The importer works just like the exporter. You point it at a directory, and
the script does the rest of the work:
.. code::
document_importer source
When you use the provided docker compose script, put the export inside the
``export`` folder in your paperless source directory. Specify ``../export``
as the ``source``.
.. _utilities-retagger:
Document retagger
=================
Say you've imported a few hundred documents and now want to introduce
a tag or set up a new correspondent, and apply its matching to all of
the currently-imported docs. This problem is common enough that
there are tools for it.
.. code::
document_retagger [-h] [-c] [-T] [-t] [-i] [--use-first] [-f]
optional arguments:
-c, --correspondent
-T, --tags
-t, --document_type
-i, --inbox-only
--use-first
-f, --overwrite
Run this after changing or adding matching rules. It'll loop over all
of the documents in your database and attempt to match documents
according to the new rules.
Specify any combination of ``-c``, ``-T`` and ``-t`` to have the
retagger perform matching of the specified metadata type. If you don't
specify any of these options, the document retagger won't do anything.
Specify ``-i`` to have the document retagger work on documents tagged
with inbox tags only. This is useful when you don't want to mess with
your already processed documents.
When multiple document types or correspondents match a single document,
the retagger won't assign these to the document. Specify ``--use-first``
to override this behavior and just use the first correspondent or type
it finds. This option does not apply to tags, since any number of tags
can be applied to a document.
Finally, ``-f`` specifies that you wish to overwrite already assigned
correspondents, types and/or tags. The default behavior is to not
assign correspondents and types to documents that have this data already
assigned. ``-f`` works differently for tags: By default, only additional tags get
added to documents, no tags will be removed. With ``-f``, tags that don't
match a document anymore get removed as well.
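The difference between the default behavior and ``-f`` for tags can be modelled with plain sets. This is a simplified sketch of the semantics described above, not the retagger's actual code; ``resulting_tags`` is a hypothetical name, and the real command may treat some tags specially.

```python
def resulting_tags(current: set, matching: set, overwrite: bool = False) -> set:
    """Return the tags a document ends up with after a retagger run.

    ``current`` are the tags already assigned, ``matching`` are the tags
    whose rules match the document now.
    """
    if overwrite:
        # With -f, tags that no longer match are removed as well.
        return set(matching)
    # Default: tags are only added, never removed.
    return current | matching
```

For a document tagged ``{"bank", "2019"}`` whose rules now match ``{"bank", "invoice"}``, the default run yields all three tags, while ``overwrite=True`` leaves only ``bank`` and ``invoice``.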
Managing the Automatic matching algorithm
=========================================
The *Auto* matching algorithm requires a trained neural network to work.
This network needs to be updated whenever something in your data
changes. The docker image takes care of that automatically with the task
scheduler. You can manually renew the classifier by invoking the following
management command:
.. code::
document_create_classifier
This command takes no arguments.
.. _`administration-index`:
Managing the document search index
==================================
The document search index is responsible for delivering search results for the
website. The document index is automatically updated whenever documents get
added to, changed, or removed from paperless. However, if the search yields
non-existing documents or won't find anything, you may need to recreate the
index manually.
.. code::
document_index {reindex,optimize}
Specify ``reindex`` to have the index created from scratch. This may take some
time.
Specify ``optimize`` to optimize the index. This updates certain aspects of
the index and usually makes queries faster and also ensures that the
autocompletion works properly. This command is regularly invoked by the task
scheduler.
.. _utilities-renamer:
Managing filenames
==================
If you use paperless' feature to
:ref:`assign custom filenames to your documents <advanced-file_name_handling>`,
you can use this command to move all your files after changing
the naming scheme.
.. warning::
Since this command moves your documents around a lot, it is advised to make
a backup beforehand. The renaming logic is robust and will never overwrite
or delete a file, but you can never be too careful.
.. code::
document_renamer
The command takes no arguments and processes all your documents at once.
Fetching e-mail
===============
Paperless automatically fetches your e-mail every 10 minutes by default. If
you want to invoke the email consumer manually, call the following management
command:
.. code::
mail_fetcher
The command takes no arguments and processes all your mail accounts and rules.
.. _utilities-archiver:
Creating archived documents
===========================
Paperless stores archived PDF/A documents alongside your original documents.
These archived documents will also contain selectable text for image-only
originals.
These documents are derived from the originals, which are always stored
unmodified. If coming from an earlier version of paperless, your documents
won't have archived versions.
This command creates PDF/A documents for your documents.
.. code::
document_archiver --overwrite --document <id>
This command will only attempt to create archived documents when no archived
document exists yet, unless ``--overwrite`` is specified. If ``--document <id>``
is specified, the archiver will only process that document.
.. note::
This command essentially performs OCR on all your documents again,
according to your settings. If you run this with ``PAPERLESS_OCR_MODE=redo``,
it will potentially run for a very long time. You can cancel the command
at any time, since this command will skip already archived versions the next time
it is run.
.. note::
Some documents will cause errors and cannot be converted into PDF/A documents,
such as encrypted PDF documents. The archiver will skip over these documents
each time it sees them.
.. _utilities-encyption:
Managing encryption
===================
Documents can be stored in Paperless using GnuPG encryption.
.. danger::
Encryption is deprecated since paperless-ng 0.9 and doesn't really provide any
additional security, since you have to store the passphrase in a configuration
file on the same system as the encrypted documents for paperless to work.
Furthermore, the entire text content of the documents is stored plain in the
database, even if your documents are encrypted. Filenames are not encrypted
either.
Also, the web server provides transparent access to your encrypted documents.
Consider running paperless on an encrypted filesystem instead, which will then
at least provide security against physical hardware theft.
Enabling encryption
-------------------
Enabling encryption is no longer supported.
Disabling encryption
--------------------
Basic usage to disable encryption of your document store:
(Note: If ``PAPERLESS_PASSPHRASE`` isn't set already, you need to specify it here)
.. code::
decrypt_documents [--passphrase SECR3TP4SSPHRA$E]
.. _Pipenv: https://pipenv.pypa.io/en/latest/

342
docs/advanced_usage.rst Normal file

@@ -0,0 +1,342 @@
***************
Advanced topics
***************
Paperless offers a couple features that automate certain tasks and make your life
easier.
Guesswork
#########
Any document you put into the consumption directory will be consumed, but if
you name the file right, it'll automatically set some values in the database
for you. This is the logic the consumer follows:
1. Try to find the correspondent, title, and tags in the file name following
the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``. Note that
the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or
``YYYYMMDDZ``. The ``Z`` refers to "Zulu time", also known as UTC.
The tags are optional, so the format ``Date - Correspondent - Title.pdf``
works as well.
2. If that doesn't work, we skip the date and try this pattern:
``Correspondent - Title - tag,tag,tag.pdf``.
3. If that doesn't work, we try to find the correspondent and title in the file
name following the pattern: ``Correspondent - Title.pdf``.
4. If that doesn't work, just assume that the name of the file is the title.
So given the above, the following examples would work as you'd expect:
* ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
* ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
* ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
* ``Another Company - Letter of Reference.jpg``
* ``Dad's Recipe for Pancakes.png``
These however wouldn't work:
* ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
* ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
* ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
* ``Another Company- Letter of Reference.jpg``
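The fallback order described above can be sketched with a handful of regular expressions that are tried in turn. This is an illustration only and not the consumer's actual implementation; the pattern details (for example which characters are allowed in tags) and the name ``guess_from_filename`` are assumptions.

```python
import os
import re

_DATE = r"(?P<date>\d{8}(?:\d{6})?Z)"
_TAGS = r"(?P<tags>[A-Za-z0-9\-,]+)"

# Tried in order; the first pattern that matches wins.
_PATTERNS = [
    re.compile(rf"^{_DATE} - (?P<correspondent>.+) - (?P<title>.+) - {_TAGS}$"),
    re.compile(rf"^{_DATE} - (?P<correspondent>.+) - (?P<title>.+)$"),
    re.compile(rf"^(?P<correspondent>.+) - (?P<title>.+) - {_TAGS}$"),
    re.compile(r"^(?P<correspondent>.+) - (?P<title>.+)$"),
]


def guess_from_filename(filename: str) -> dict:
    """Extract date, correspondent, title and tags from a file name."""
    stem, _extension = os.path.splitext(filename)
    for pattern in _PATTERNS:
        match = pattern.match(stem)
        if match:
            groups = match.groupdict()
            tags = groups.get("tags")
            return {
                "date": groups.get("date"),
                "correspondent": groups["correspondent"],
                "title": groups["title"],
                "tags": tags.split(",") if tags else [],
            }
    # Nothing matched: the whole name (without extension) is the title.
    return {"date": None, "correspondent": None, "title": stem, "tags": []}
```

With these patterns, a name such as ``Another Company- Letter of Reference.jpg`` falls through to the last step because the `` - `` separator is missing, so the whole name becomes the title.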
Do I have to be so strict about naming?
=======================================
Rather than using the strict document naming rules, one can also set the option
``PAPERLESS_FILENAME_DATE_ORDER`` in ``paperless.conf`` to any date order
that is accepted by dateparser_. Doing so will cause ``paperless`` to default
to any date format that is found in the title, instead of a date pulled from
the document's text, without requiring the strict formatting of the document
filename as described above.
.. _dateparser: https://github.com/scrapinghub/dateparser/blob/v0.7.0/docs/usage.rst#settings
.. _advanced-transforming_filenames:
Transforming filenames for parsing
==================================
Some devices can't produce filenames that can be parsed by the default
parser. By configuring the option ``PAPERLESS_FILENAME_PARSE_TRANSFORMS`` in
``paperless.conf`` one can add transformations that are applied to the filename
before it's parsed.
The option contains a list of dictionaries of regular expressions (key:
``pattern``) and replacements (key: ``repl``) in JSON format, which are
applied in order by passing them to ``re.subn``. Transformation stops
after the first match, so at most one transformation is applied. The general
syntax is
.. code:: python
[{"pattern":"pattern1", "repl":"repl1"}, {"pattern":"pattern2", "repl":"repl2"}, ..., {"pattern":"patternN", "repl":"replN"}]
The example below is for a Brother ADS-2400N, a scanner that allows assigning
different names to different hardware buttons (useful for handling
multiple entities in one instance), but insists on adding ``_<count>``
to the filename.
.. code:: python
# Brother profile configuration, support "Name_Date_Count" (the default
# setting) and "Name_Count" (use "Name" as tag and "Count" as title).
PAPERLESS_FILENAME_PARSE_TRANSFORMS=[{"pattern":"^([a-z]+)_(\\d{8})_(\\d{6})_([0-9]+)\\.", "repl":"\\2\\3Z - \\4 - \\1."}, {"pattern":"^([a-z]+)_([0-9]+)\\.", "repl":" - \\2 - \\1."}]
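To see what such a configuration does to a file name, the following sketch applies the transformations in order with ``re.subn`` and stops after the first one that matches, as described above. The helper name ``transform_filename`` and the sample file names are made up for illustration; the way paperless loads the option internally is not shown here.

```python
import json
import re

# The Brother example from above, written as a raw string so that the
# JSON escapes (\\d, \\2, ...) reach json.loads unchanged.
BROTHER_TRANSFORMS = json.loads(r'''
[
  {"pattern": "^([a-z]+)_(\\d{8})_(\\d{6})_([0-9]+)\\.", "repl": "\\2\\3Z - \\4 - \\1."},
  {"pattern": "^([a-z]+)_([0-9]+)\\.", "repl": " - \\2 - \\1."}
]
''')


def transform_filename(filename: str, transforms: list) -> str:
    """Apply the first matching transformation, or return the name unchanged."""
    for transform in transforms:
        new_name, count = re.subn(transform["pattern"], transform["repl"], filename)
        if count > 0:
            # At most one transformation is applied.
            return new_name
    return filename
```

A scan named ``invoices_20201219_134901_0001.pdf`` would become ``20201219134901Z - 0001 - invoices.pdf``, which the default parser can then read as date, correspondent and title.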
.. _advanced-matching:
Matching tags, correspondents and document types
################################################
After the consumer has tried to figure out what it could from the file name,
it starts looking at the content of the document itself. It will compare the
matching algorithms defined by every tag and correspondent already set in your
database to see if they apply to the text in that document. In other words,
if you defined a tag called ``Home Utility`` that had a ``match`` property of
``bc hydro`` and a ``matching_algorithm`` of ``literal``, Paperless will
automatically tag your newly-consumed document with your ``Home Utility`` tag
so long as the text ``bc hydro`` appears in the body of the document somewhere.
The matching logic is quite powerful, and supports searching the text of your
document with different algorithms, and as such, some experimentation may be
necessary to get things right.
In order to have a tag, correspondent or type assigned automatically to newly
consumed documents, assign a match and matching algorithm using the web
interface. These settings define when to assign correspondents, tags and types
to documents.
The following algorithms are available:
* **Any:** Looks for any occurrence of any word provided in match in the PDF.
If you define the match as ``Bank1 Bank2``, it will match documents containing
either of these terms.
* **All:** Requires that every word provided appears in the PDF, albeit not in the
order provided.
* **Literal:** Matches only if the match appears exactly as provided in the PDF.
* **Regular expression:** Parses the match as a regular expression and tries to
find a match within the document.
* **Fuzzy match:** Matches the text approximately rather than exactly. The exact
behavior is not documented here; look at the source.
* **Auto:** Tries to automatically match new documents. This does not require you
to set a match. See the notes below.
When using the "any" or "all" matching algorithms, you can search for terms
that consist of multiple words by enclosing them in double quotes. For example,
defining a match text of ``"Bank of America" BofA`` using the "any" algorithm,
will match documents that contain either "Bank of America" or "BofA", but will
not match documents containing "Bank of South America".
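The "any", "all" and "literal" algorithms, including the handling of quoted multi-word terms, can be approximated as follows. This is a sketch for experimenting with match texts, not the code paperless runs; matching whole words case-insensitively is an assumption made here, and all function names are hypothetical.

```python
import re


def split_terms(match_text: str) -> list:
    """Split a match text into terms, keeping "quoted phrases" together."""
    found = re.findall(r'"([^"]+)"|(\S+)', match_text)
    return [quoted or bare for quoted, bare in found]


def _contains(term: str, text: str) -> bool:
    # Assumption: terms match as whole words, ignoring case.
    return re.search(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE) is not None


def matches_any(match_text: str, text: str) -> bool:
    return any(_contains(term, text) for term in split_terms(match_text))


def matches_all(match_text: str, text: str) -> bool:
    return all(_contains(term, text) for term in split_terms(match_text))


def matches_literal(match_text: str, text: str) -> bool:
    return _contains(match_text, text)
```

With the match text ``"Bank of America" BofA``, ``matches_any`` accepts a document containing "Bank of America" but rejects one that only contains "Bank of South America", because the quoted phrase is treated as a single term.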
Then just save your tag/correspondent and run another document through the
consumer. Once complete, you should see the newly-created document,
automatically tagged with the appropriate data.
.. _advanced-automatic_matching:
Automatic matching
==================
Paperless-ng comes with a new matching algorithm called *Auto*. This matching
algorithm tries to assign tags, correspondents and document types to your
documents based on how you have assigned these on existing documents. It
uses a neural network under the hood.
If, for example, all your bank statements of your account 123 at the Bank of
America are tagged with the tag "bofa_123" and the matching algorithm of this
tag is set to *Auto*, this neural network will examine your documents and
automatically learn when to assign this tag.
Paperless tries to hide much of the involved complexity with this approach.
However, there are a couple caveats you need to keep in mind when using this
feature:
* Changes to your documents are not immediately reflected by the matching
algorithm. The neural network needs to be *trained* on your documents after
changes. Paperless periodically (default: once each hour) checks for changes
and does this automatically for you.
* The Auto matching algorithm only takes documents into account which are NOT
placed in your inbox (i.e., documents that have no inbox tags assigned to them). This ensures
that the neural network only learns from documents which you have correctly
tagged before.
* The matching algorithm can only work if there is a correlation between the
tag, correspondent or document type and the document itself. Your bank
statements usually contain your bank account number and the name of the bank,
so this works reasonably well. However, tags such as "TODO" cannot be
automatically assigned.
* The matching algorithm needs a reasonable number of documents to identify when
to assign tags, correspondents, and types. If one out of a thousand documents
has the correspondent "Very obscure web shop I bought something five years
ago", it will probably not assign this correspondent automatically if you buy
something from them again. The more documents, the better.
* Paperless also needs a reasonable amount of negative examples to decide when
not to assign a certain tag, correspondent or type. This will usually be the
case as you start filling up paperless with documents. Example: If all your
documents are either from "Webshop" or "Bank", paperless will assign one of
these correspondents to ANY new document, if both are set to automatic matching.
Hooking into the consumption process
####################################
Sometimes you may want to do something arbitrary whenever a document is
consumed. Rather than try to predict what you may want to do, Paperless lets
you execute scripts of your own choosing just before or after a document is
consumed using a couple simple hooks.
Just write a script, put it somewhere that Paperless can read & execute, and
then put the path to that script in ``paperless.conf`` with the variable name
of either ``PAPERLESS_PRE_CONSUME_SCRIPT`` or
``PAPERLESS_POST_CONSUME_SCRIPT``.
.. important::
These scripts are executed in a **blocking** process, which means that if
a script takes a long time to run, it can significantly slow down your
document consumption flow. If you want things to run asynchronously,
you'll have to fork the process in your script and exit.
Pre-consumption script
======================
Executed after the consumer sees a new document in the consumption folder, but
before any processing of the document is performed. This script receives exactly
one argument:
* Document file name
A simple but common example for this would be creating a simple script like
this:
``/usr/local/bin/ocr-pdf``
.. code:: bash
#!/usr/bin/env bash
pdf2pdfocr.py -i "${1}"
``/etc/paperless.conf``
.. code:: bash
...
PAPERLESS_PRE_CONSUME_SCRIPT="/usr/local/bin/ocr-pdf"
...
This will pass the path to the document about to be consumed to ``/usr/local/bin/ocr-pdf``,
which will in turn call `pdf2pdfocr.py`_ on your document, which will then
overwrite the file with an OCR'd version of the file and exit. At that point,
the consumption process begins with the newly modified file.
.. _pdf2pdfocr.py: https://github.com/LeoFCardoso/pdf2pdfocr
.. _advanced-post_consume_script:
Post-consumption script
=======================
Executed after the consumer has successfully processed a document and has moved it
into paperless. It receives the following arguments:
* Document id
* Generated file name
* Source path
* Thumbnail path
* Download URL
* Thumbnail URL
* Correspondent
* Tags
The script can be in any language you like, but for a simple shell script
example, you can take a look at ``post-consumption-example.sh`` in the
``scripts`` directory in this project.
The post consumption script cannot cancel the consumption process.
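Since the script can be written in any language, here is a minimal Python sketch of a post-consumption script. It only maps the positional arguments listed above to names and prints a summary; the key names in ``ARGUMENT_NAMES`` are chosen for this example and are not defined by paperless.

```python
#!/usr/bin/env python3
import sys

# Names for the positional arguments, in the order documented above.
ARGUMENT_NAMES = [
    "document_id",
    "file_name",
    "source_path",
    "thumbnail_path",
    "download_url",
    "thumbnail_url",
    "correspondent",
    "tags",
]


def parse_arguments(argv: list) -> dict:
    """Map the arguments (argv without the script name) to names."""
    return dict(zip(ARGUMENT_NAMES, argv[1:]))


def main(argv: list) -> int:
    info = parse_arguments(argv)
    print(f"Consumed document {info.get('document_id')}: {info.get('file_name')}")
    return 0


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

Because these scripts block the consumer while they run, anything slow is better handed off to a background process.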
.. _advanced-file_name_handling:
File name handling
##################
By default, paperless stores your documents in the media directory and renames them
using the identifier which it has assigned to each document. You will end up getting
files like ``0000123.pdf`` in your media directory. This isn't necessarily a bad
thing, because you normally don't have to access these files manually. However, if
you wish to name your files differently, you can do that by adjusting the
``PAPERLESS_FILENAME_FORMAT`` configuration option.
This variable allows you to configure the filename (folders are allowed) using
placeholders. For example, configuring this to
.. code:: bash
PAPERLESS_FILENAME_FORMAT={created_year}/{correspondent}/{title}
will create a directory structure as follows:
.. code::
2019/
My bank/
Statement January.pdf
Statement February.pdf
2020/
My bank/
Statement January.pdf
Letter.pdf
Letter_01.pdf
Shoe store/
My new shoes.pdf
.. danger::
Do not manually move your files in the media folder. Paperless remembers the
last filename a document was stored as. If you do rename a file, paperless will
report your files as missing and won't be able to find them.
Paperless provides the following placeholders within filenames:
* ``{correspondent}``: The name of the correspondent, or "none".
* ``{document_type}``: The name of the document type, or "none".
* ``{tag_list}``: A comma separated list of all tags assigned to the document.
* ``{title}``: The title of the document.
* ``{created}``: The full date and time the document was created.
* ``{created_year}``: Year created only.
* ``{created_month}``: Month created only (number 1-12).
* ``{created_day}``: Day created only (number 1-31).
* ``{added}``: The full date and time the document was added to paperless.
* ``{added_year}``: Year added only.
* ``{added_month}``: Month added only (number 1-12).
* ``{added_day}``: Day added only (number 1-31).
Paperless will try to conserve the information from your database as much as possible.
However, some characters that you can use in document titles and correspondent names (such
as ``: \ /`` and a couple more) are not allowed in filenames and will be replaced with dashes.
If paperless detects that two documents share the same filename, paperless will automatically
append ``_01``, ``_02``, etc to the filename. This happens if all the placeholders in a filename
evaluate to the same value.
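The rules above (placeholder substitution, replacing forbidden characters with dashes, falling back to the default scheme, and appending ``_01``, ``_02`` on collisions) can be sketched like this. It is an illustration under assumptions, not paperless' implementation: the exact set of replaced characters, the seven-digit default name and the function names are all chosen for this example.

```python
import re


def sanitize(value: str) -> str:
    """Replace characters that are not allowed in file names with dashes."""
    return re.sub(r"[:\\/]", "-", value)


def render_filename(template: str, fields: dict, doc_id: int,
                    taken: set, extension: str = ".pdf") -> str:
    """Render ``template`` for one document and make the result unique."""
    clean = {key: sanitize(str(value)) for key, value in fields.items()}
    try:
        base = template.format(**clean)
    except (KeyError, IndexError, ValueError):
        # Unknown or misspelled placeholder: use the default naming scheme.
        base = f"{doc_id:07d}"
    candidate = base + extension
    counter = 0
    while candidate in taken:
        counter += 1
        candidate = f"{base}_{counter:02d}{extension}"
    taken.add(candidate)
    return candidate
```

Only the placeholder values are sanitized, so slashes written in the template itself still create folders.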
.. hint::
Paperless checks the filename of a document whenever it is saved. Therefore,
you need to update the filenames of your documents and move them after altering
this setting by invoking the :ref:`document renamer <utilities-renamer>`.
.. warning::
Make absolutely sure you get the spelling of the placeholders right, or else
paperless will use the default naming scheme instead.
.. caution::
As of now, you could totally tell paperless to store your files anywhere outside
the media directory by setting
.. code::
PAPERLESS_FILENAME_FORMAT=../../my/custom/location/{title}
However, keep in mind that inside docker, if files get stored outside of the
predefined volumes, they will be lost after a restart of paperless.


@@ -1,23 +1,291 @@
.. _api:
************
The REST API
############
************
Paperless makes use of the `Django REST Framework`_ standard API interface
because of its inherent awesomeness. Conveniently, the system is also
self-documenting, so to learn more about the access points, schema, what's
accepted and what isn't, you need only visit ``/api`` on your local Paperless
installation.
Paperless makes use of the `Django REST Framework`_ standard API interface.
It provides a browsable API for most of its endpoints, which you can inspect
at ``http://<paperless-host>:<port>/api/``. This also documents most of the
available filters and ordering fields.
.. _Django REST Framework: http://django-rest-framework.org/
The API provides 5 main endpoints:
.. _api-uploading:
* ``/api/documents/``: Full CRUD support, except POSTing new documents. See below.
* ``/api/correspondents/``: Full CRUD support.
* ``/api/document_types/``: Full CRUD support.
* ``/api/logs/``: Read-Only.
* ``/api/tags/``: Full CRUD support.
Uploading
---------
All of these endpoints except for the logging endpoint
allow you to fetch, edit and delete individual objects
by appending their primary key to the path, for example ``/api/documents/454/``.
File uploads in an API are hard and so far as I've been able to tell, there's
no standard way of accepting them, so rather than crowbar file uploads into the
REST API and endure that headache, I've left that process to a simple HTTP
POST, documented on the :ref:`consumption page <consumption-http>`.
The objects served by the document endpoint contain the following fields:
* ``id``: ID of the document. Read-only.
* ``title``: Title of the document.
* ``content``: Plain text content of the document.
* ``tags``: List of IDs of tags assigned to this document, or empty list.
* ``document_type``: Document type of this document, or null.
* ``correspondent``: Correspondent of this document or null.
* ``created``: The date at which this document was created.
* ``modified``: The date at which this document was last edited in paperless. Read-only.
* ``added``: The date at which this document was added to paperless. Read-only.
* ``archive_serial_number``: The identifier of this document in a physical document archive.
* ``original_file_name``: Verbose filename of the original document. Read-only.
* ``archived_file_name``: Verbose filename of the archived document. Read-only. Null if no archived document is available.
Downloading documents
#####################
In addition to that, the document endpoint offers these additional actions on
individual documents:
* ``/api/documents/<pk>/download/``: Download the document.
* ``/api/documents/<pk>/preview/``: Display the document inline,
without downloading it.
* ``/api/documents/<pk>/thumb/``: Download the PNG thumbnail of a document.
Paperless generates archived PDF/A documents from consumed files and stores both
the original files as well as the archived files. By default, the endpoints
for previews and downloads serve the archived file, if it is available.
Otherwise, the original file is served.
Some documents cannot be archived.
The endpoints correctly serve the response header fields ``Content-Disposition``
and ``Content-Type`` to indicate the filename for download and the type of content of
the document.
In order to download or preview the original document when an archived document is available,
supply the query parameter ``original=true``.
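A small helper makes it harder to get these URLs wrong. The sketch below only builds the URL strings from the endpoint paths listed above; the base URL is a placeholder for your own installation and ``document_url`` is a hypothetical name.

```python
from urllib.parse import urlencode, urljoin


def document_url(base_url: str, pk: int, action: str = "download",
                 original: bool = False) -> str:
    """Build the URL for the download, preview or thumb action of a document."""
    path = f"api/documents/{pk}/{action}/"
    url = urljoin(base_url.rstrip("/") + "/", path)
    if original:
        # Ask for the original file instead of the archived version.
        url += "?" + urlencode({"original": "true"})
    return url
```

The resulting URL can be passed to any HTTP client together with one of the authentication methods the API supports.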
.. hint::
Paperless used to provide this functionality at ``/fetch/<pk>/preview``,
``/fetch/<pk>/thumb`` and ``/fetch/<pk>/doc``. Redirects to the new URLs
are in place. However, if you use these old URLs to access documents, you
should update your app or script to use the new URLs.
Getting document metadata
#########################
The API also has an endpoint to retrieve read-only metadata about specific documents. This
information is not served along with the document objects, since it requires reading
files and would therefore slow down document lists considerably.
Access the metadata of a document with an ID ``id`` at ``/api/documents/<id>/metadata/``.
The endpoint reports the following data:
* ``original_checksum``: MD5 checksum of the original document.
* ``original_size``: Size of the original document, in bytes.
* ``original_mime_type``: Mime type of the original document.
* ``media_filename``: Current filename of the document, under which it is stored inside the media directory.
* ``has_archive_version``: True, if this document is archived, false otherwise.
* ``original_metadata``: A list of metadata associated with the original document. See below.
* ``archive_checksum``: MD5 checksum of the archived document, or null.
* ``archive_size``: Size of the archived document in bytes, or null.
* ``archive_metadata``: Metadata associated with the archived document, or null. See below.
File metadata is reported as a list of objects in the following form:
.. code:: json
[
{
"namespace": "http://ns.adobe.com/pdf/1.3/",
"prefix": "pdf",
"key": "Producer",
"value": "SparklePDF, Fancy edition"
}
]
``namespace`` and ``prefix`` can be null. The actual metadata reported depends on the file type and the metadata
available in that specific document. Paperless only reports PDF metadata at this point.
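Since file metadata arrives as a flat list of objects, a client usually has to search that list for the entry it needs. The following sketch shows one way to do that; the function name ``find_metadata`` is hypothetical, and the sample data simply reuses the example above.

```python
def find_metadata(entries, key, prefix=None):
    """Return the value of the first metadata entry matching ``key``.

    ``entries`` is the list served as ``original_metadata`` or
    ``archive_metadata``; it may be None when no archived file exists.
    ``prefix`` is optional because the API may report it as null.
    """
    for entry in entries or []:
        if entry.get("key") != key:
            continue
        if prefix is not None and entry.get("prefix") != prefix:
            continue
        return entry.get("value")
    return None


sample = [
    {
        "namespace": "http://ns.adobe.com/pdf/1.3/",
        "prefix": "pdf",
        "key": "Producer",
        "value": "SparklePDF, Fancy edition",
    }
]
print(find_metadata(sample, "Producer", prefix="pdf"))
```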
Authorization
#############
The REST api provides three different forms of authentication.
1. Basic authentication
Authorize by providing an HTTP header in the form
.. code::
Authorization: Basic <credentials>
where ``credentials`` is a base64-encoded string of ``<username>:<password>``.
2. Session authentication
When you're logged into paperless in your browser, you're automatically
logged into the API as well and don't need to provide any authorization
headers.
3. Token authentication
Paperless also offers an endpoint to acquire authentication tokens.
POST a username and password as a form or json string to ``/api/token/``
and paperless will respond with a token, if the login data is correct.
This token can be used to authenticate other requests with the
following HTTP header:
.. code::
Authorization: Token <token>
Tokens can be managed and revoked in the paperless admin.
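As a rough illustration of the two header-based schemes, the following Python sketch builds the ``Authorization`` header for basic and token authentication. The function names are made up for this example, and the credentials are placeholders.

```python
import base64


def basic_auth_header(username, password):
    """Header for basic authentication: base64 of ``<username>:<password>``."""
    credentials = f"{username}:{password}".encode("utf-8")
    encoded = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {encoded}"}


def token_auth_header(token):
    """Header for token authentication, using a token from ``/api/token/``."""
    return {"Authorization": f"Token {token}"}


print(basic_auth_header("admin", "secret"))
print(token_auth_header("0123456789abcdef"))
```

Either dictionary can be passed as request headers to whichever HTTP client you use.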
Searching for documents
#######################
Paperless-ng offers API endpoints for full text search. These are as follows:
``/api/search/``
================
Get search results based on a query.
Query parameters:
* ``query``: The query string. See
`here <https://whoosh.readthedocs.io/en/latest/querylang.html>`_
for details on the syntax.
* ``page``: Specify the page you want to retrieve. Each page
contains 10 search results and the first page is ``page=1``, which
is the default if this is omitted.
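A search request is an ordinary GET request with these parameters in the query string. A minimal sketch, assuming a hypothetical helper ``search_url`` and an instance at ``http://localhost:8000``:

```python
from urllib.parse import urlencode


def search_url(base_url, query, page=1):
    """Build the URL for a full text search request.

    The helper is illustrative only; ``query`` and ``page`` are the
    parameters documented above.
    """
    params = urlencode({"query": query, "page": page})
    return f"{base_url.rstrip('/')}/api/search/?{params}"


print(search_url("http://localhost:8000", "invoice 2020", page=2))
```

``urlencode`` takes care of escaping spaces and other special characters in the query string.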
Result list object returned by the endpoint:
.. code:: json
{
"count": 1,
"page": 1,
"page_count": 1,
"corrected_query": "",
"results": [
]
}
* ``count``: The approximate total number of results.
* ``page``: The page returned to you. This might be different from
the page you requested, if you requested a page that is beyond
the last page. In that case, the last page is returned.
* ``page_count``: The total number of pages.
* ``corrected_query``: Corrected version of the query string. Can be null.
If not null, can be used verbatim to start a new query.
* ``results``: A list of result objects on the current page.
Result object:
.. code:: json
{
"id": 1,
"highlights": [
],
"score": 6.34234,
"rank": 23,
"document": {
}
}
* ``id``: the primary key of the found document
* ``highlights``: a list of fragments containing parsable highlights for the result.
See below.
* ``score``: The score assigned to the document. A higher score indicates a
better match with the query. Search results are sorted descending by score.
* ``rank``: the position of the document within the entire search results list.
* ``document``: The full json of the document, as returned by
``/api/documents/<id>/``.
Highlights object:
Highlights are provided as a list of fragments. A fragment is a longer section of
text from the original document.
Each fragment contains a list of strings, and some of them are marked as a highlight.
.. code:: json
[
[
{"text": "This is a sample text with a "},
{"text": "highlighted", "term": 0},
{"text": " word."}
],
[
{"text": "Another", "term": 1},
{"text": " fragment with a highlight."}
]
]
When ``term`` is present within a string, the word within ``text`` should be highlighted.
The term index groups multiple matches together and words with the same index
should get identical highlighting.
A client may use this example to produce the following output:
... This is a sample text with a **highlighted** word. ... **Another** fragment with a highlight. ...
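One possible way for a client to turn the highlights structure into that output is sketched below. The function name ``render_highlights`` is an assumption for this example, and the ``**`` markers stand in for whatever highlighting the client actually applies.

```python
def render_highlights(fragments):
    """Render a highlights list as text, wrapping highlighted words in ``**``."""
    rendered = []
    for fragment in fragments:
        parts = []
        for piece in fragment:
            text = piece["text"]
            # A piece is highlighted when it carries a ``term`` index.
            parts.append(f"**{text}**" if "term" in piece else text)
        rendered.append("".join(parts))
    return "... " + " ... ".join(rendered) + " ..."


highlights = [
    [
        {"text": "This is a sample text with a "},
        {"text": "highlighted", "term": 0},
        {"text": " word."},
    ],
    [
        {"text": "Another", "term": 1},
        {"text": " fragment with a highlight."},
    ],
]
print(render_highlights(highlights))
```

A richer client could use the value of ``term`` to pick a distinct color for each group of matches.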
``/api/search/autocomplete/``
=============================
Get auto completions for a partial search term.
Query parameters:
* ``term``: The incomplete term.
* ``limit``: Number of results. Defaults to 10.
Results returned by the endpoint are ordered by importance of the term in the
document index. The first result is the term that has the highest Tf/Idf score
in the index.
.. code:: json
[
"term1",
"term3",
"term6",
"term4"
]
.. _api-file_uploads:
POSTing documents
#################
The API provides a special endpoint for file uploads:
``/api/documents/post_document/``
POST a multipart form to this endpoint, where the form field ``document`` contains
the document that you want to upload to paperless. The filename is sanitized and
then used to store the document in a temporary directory, and the consumer will
be instructed to consume the document from there.
The endpoint supports the following optional form fields:
* ``title``: Specify a title that the consumer should use for the document.
* ``correspondent``: Specify the ID of a correspondent that the consumer should use for the document.
* ``document_type``: Similar to correspondent.
* ``tags``: Similar to correspondent. Specify this multiple times to have multiple tags added
to the document.
The endpoint will immediately return "OK" if the document consumption process
was started successfully. No additional status information about the consumption
process itself is available, since that happens in a different process.
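The optional form fields can be assembled as a list of name/value pairs, which allows ``tags`` to appear more than once. The sketch below only prepares those pairs; the helper name ``build_upload_fields`` is hypothetical, and the IDs are placeholders.

```python
def build_upload_fields(title=None, correspondent=None, document_type=None, tags=()):
    """Collect the optional form fields for ``/api/documents/post_document/``.

    Returns a list of (name, value) pairs so that ``tags`` can be
    repeated, once per tag ID.
    """
    fields = []
    if title is not None:
        fields.append(("title", title))
    if correspondent is not None:
        fields.append(("correspondent", str(correspondent)))
    if document_type is not None:
        fields.append(("document_type", str(document_type)))
    for tag_id in tags:
        fields.append(("tags", str(tag_id)))
    return fields


print(build_upload_fields(title="Invoice", correspondent=4, tags=[3, 7]))
```

With an HTTP client such as ``requests``, these pairs could for instance be sent as the form data of a multipart POST, with the file itself supplied in the ``document`` field.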


@@ -1,4 +1,333 @@
.. _paperless_changelog:
*********
Changelog
*********
paperless-ng 0.9.8
##################
This release addresses two severe issues with the previous release.
* The delete buttons for document types, correspondents and tags were not working.
* The document section in the admin was causing internal server errors (500).
paperless-ng 0.9.7
##################
* Front end
* Thanks to the hard work of `Michael Shamoon`_, paperless now comes with a much more streamlined UI for
filtering documents.
* `Michael Shamoon`_ replaced the document preview with another component. This should fix compatibility with Safari browsers.
* Added buttons to the management pages to quickly show all documents with one specific tag, correspondent, or title.
* Paperless now stores your saved views on the server and associates them with your user account.
This means that you can access your views on multiple devices and have separate views for different users.
You will have to recreate your views.
* The GitHub and documentation links now open in new tabs/windows. Thanks to `rYR79435`_.
* Paperless now generates default saved view names when saving views with certain filter rules.
* Added a small version indicator to the front end.
* Other additions and changes
* The new filename format field ``{tag_list}`` inserts a list of tags into the filename, separated by comma.
* The ``document_retagger`` no longer removes inbox tags or tags without matching rules.
* The new configuration option ``PAPERLESS_COOKIE_PREFIX`` allows you to run multiple instances of paperless on different ports.
This option enables you to be logged into multiple instances by specifying different cookie names for each instance.
* Fixes
* Sometimes paperless would assign dates in the future to newly consumed documents.
* The filename format fields ``{created_month}`` and ``{created_day}`` now use a leading zero for single digit values.
* The filename format field ``{tags}`` can no longer be used without arguments.
* Paperless was not able to consume many images (especially images from mobile scanners) due to missing DPI information.
Paperless now assumes A4 paper size for PDF generation if no DPI information is present.
* Documents with empty titles could not be opened from the table view due to the link being empty.
* Fixed an issue with filenames containing special characters such as ``:`` not being accepted for upload.
* Fixed issues with thumbnail generation for plain text files.
paperless-ng 0.9.6
##################
This release focusses primarily on many small issues with the UI.
* Front end
* Paperless now has proper window titles.
* Fixed an issue with the small cards when more than 7 tags were used.
* Navigation of the "Show all" links adjusted. They navigate to the saved view now, if available in the sidebar.
* Some indication on the document lists that a filter is active was added.
* There's a new filter to filter for documents that do *not* have a certain tag.
* The file upload box now shows upload progress.
* The document edit page was reorganized.
* The document edit page shows various information about a document.
* An issue with the height of the preview was fixed.
* Table issues with too long document titles fixed.
* API
* The API now serves file names with documents.
* The API now serves various metadata about documents.
* API documentation updated.
* Other
* Fixed an issue with the docker image when a non-standard PostgreSQL port was used.
* The docker image was trying to check for installed languages before actually installing them.
* ``FILENAME_FORMAT`` placeholder for document types.
* The filename formatter is now less restrictive with file names and tries to
conserve the original correspondents, types and titles as much as possible.
* The filename formatter does not include the document ID in filenames anymore. It will
rather append ``_01``, ``_02``, etc when it detects duplicate filenames.
.. note::
The changes to the filename format will apply to newly added documents and changed documents.
If you want all files to reflect these changes, execute the ``document_renamer`` management
command.
paperless-ng 0.9.5
##################
This release concludes the big changes I wanted to get rolled into paperless. The next releases before 1.0 will
focus on fixing issues, primarily.
* OCR
* Paperless now uses `OCRmyPDF <https://github.com/jbarlow83/OCRmyPDF>`_ to perform OCR on documents.
It still uses tesseract under the hood, but the PDF parser of Paperless has changed considerably and
will behave differently for some documents.
* OCRmyPDF creates archived PDF/A documents with embedded text that can be selected in the front end.
* Paperless stores archived versions of documents alongside the originals. The originals can be
accessed on the document edit page. If available, a dropdown menu will appear next to the download button.
* Many of the configuration options regarding OCR have changed. See :ref:`configuration-ocr` for details.
* Paperless no longer guesses the language of your documents. It always uses the language that you
specified with ``PAPERLESS_OCR_LANGUAGE``. Be sure to set this to the language the majority of your
documents are in. Multiple languages can be specified, but that requires more CPU time.
* The management command :ref:`document_archiver <utilities-archiver>` can be used to create archived versions for already
existing documents.
* Tags from consumption folder.
* Thanks to `jayme-github`_, paperless now consumes files from sub folders in the consumption folder and is able to assign tags
based on the sub folders a document was found in. This can be configured with ``PAPERLESS_CONSUMER_RECURSIVE`` and
``PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS``.
* API
* The API now offers token authentication.
* The endpoint for uploading documents now supports specifying custom titles, correspondents, tags and types.
This can be used by clients to override the default behavior of paperless. See :ref:`api-file_uploads`.
* The document endpoint of API now serves documents in this form:
* correspondents, document types and tags are referenced by their ID in the fields ``correspondent``, ``document_type`` and ``tags``. The ``*_id`` versions are gone. These fields are read/write.
* paperless does not serve nested tags, correspondents or types anymore.
* Front end
* Paperless does some basic caching of correspondents, tags and types and will only request them from the server when necessary or when entirely reloading the page.
* Document list fetching is about 10%-30% faster now, especially when lots of tags/correspondents are present.
* Some minor improvements to the front end, such as document count in the document list, better highlighting of the current page, and improvements to the filter behavior.
* Fixes:
* A bug with the generation of filenames for files with unsupported types caused the exporter and
document saving to crash.
* Mail handling no longer exits entirely when encountering errors. It will skip the account/rule/message on which the error occurred.
* Assigning correspondents from mail sender names failed for very long names. Paperless no longer assigns correspondents in these cases.
paperless-ng 0.9.4
##################
* Searching:
* Paperless now supports searching by tags, types, dates and correspondents. In order to have this applied to your
existing documents, you need to run the ``document_index reindex`` management command
(see :ref:`administration-index`)
that adds the data to the search index. You only need to do this once, since the schema of the search index changed.
Paperless keeps the index updated after that whenever something changes.
* Paperless now has spelling corrections ("Did you mean") for mistyped queries.
* The documentation contains :ref:`information about the query syntax <basic-searching>`.
* Front end:
* Clickable tags, correspondents and types allow quick filtering for related documents.
* Saved views are now editable.
* Preview documents directly in the browser.
* Navigation from the dashboard to saved views.
* Fixes:
* A severe error when trying to use post consume scripts.
* An error in the consumer that caused invalid messages about missing files to show up in the log.
* The documentation now contains information about bare metal installs and a section about
how to set up the development environment.
paperless-ng 0.9.3
##################
* Setting ``PAPERLESS_AUTO_LOGIN_USERNAME`` replaces ``PAPERLESS_DISABLE_LOGIN``.
You have to specify your username.
* Added a simple sanity checker that checks your documents for missing or orphaned files,
files with wrong checksums, inaccessible files, and documents with empty content.
* It is no longer possible to encrypt your documents. For the time being, paperless will
continue to operate with already encrypted documents.
* Fixes:
* Paperless now uses inotify again, since the watchdog was causing issues which I was not
aware of.
* Issue with the automatic classifier not working with only one tag.
* A couple of issues with the search index being opened too eagerly.
* Added lots of tests for various parts of the application.
paperless-ng 0.9.2
##################
* Major changes to the front end (colors, logo, shadows, layout of the cards,
better mobile support)
* Paperless now uses mime types and libmagic detection to determine
if a file type is supported and which parser to use. Removes all
file type checks that were present in MANY different places in
paperless.
* Mail consumer now correctly consumes documents even when their
content type was not set correctly. (i.e. PDF documents with
content type ``application/octet-stream``)
* Basic sorting of mail rules added
* Much better admin for mail rule editing.
* Docker entrypoint script awaits the database server if it is
configured.
* Disabled editing of logs.
* New setting ``PAPERLESS_OCR_PAGES`` limits the tesseract parser
to the first n pages of scanned documents.
* Fixed a bug where tasks with too long task names would not show
up in the admin.
paperless-ng 0.9.1
##################
* Moved documentation of the settings to the actual documentation.
* Updated release script to force the user to choose between SQLite
and PostgreSQL. This avoids confusion when upgrading from paperless.
paperless-ng 0.9.0
##################
* **Deprecated:** GnuPG. :ref:`See this note on the state of GnuPG in paperless-ng. <utilities-encyption>`
This feature will most likely be removed in future versions.
* **Added:** New frontend. Features:
* Single page application: It's much more responsive than the django admin pages.
* Dashboard. Shows recently scanned documents, todo notes, or any other documents
at wish. Allows uploading of documents. Shows basic statistics.
* Better document list with multiple display options.
* Full text search with result highlighting, auto completion and scoring based
on the query. It uses a document search index in the background.
* Saveable filters.
* Better log viewer.
* **Added:** Document types. Assign these to documents just as correspondents.
They may be used in the future to perform automatic operations on documents
depending on the type.
* **Added:** Inbox tags. Define an inbox tag and it will automatically be
assigned to any new document scanned into the system.
* **Added:** Automatic matching. A new matching algorithm that automatically
assigns tags, document types and correspondents to your documents. It uses
a neural network trained on your data.
* **Added:** Archive serial numbers. Assign these to quickly find documents stored in
physical binders.
* **Added:** Enabled the internal user management of django. This isn't really a
multi user solution, however, it allows more than one user to access the website
and set some basic permissions / renew passwords.
* **Modified [breaking]:** All new mail consumer with customizable filters, actions and
multiple account support. Replaces the old mail consumer. The new mail consumer
needs different configuration but can be configured to act exactly like the old
consumer.
* **Modified:** Changes to the consumer:
* Now uses the excellent watchdog library that should make sure files are
discovered no matter what the platform is.
* The consumer now uses a task scheduler to run consumption processes in parallel.
This means that consuming many documents should be much faster on systems with
many cores.
* Concurrency is controlled with the new settings ``PAPERLESS_TASK_WORKERS``
and ``PAPERLESS_THREADS_PER_WORKER``. See TODO for details on concurrency.
* The consumer no longer blocks the database for extended periods of time.
* An issue with tesseract running multiple threads per page and slowing down
the consumer was fixed.
* **Modified [breaking]:** REST Api changes:
* New filters added, other filters removed (case sensitive filters, slug filters)
* Endpoints for thumbnails, previews and downloads replace the old ``/fetch/`` urls. Redirects are in place.
* Endpoint for document uploads replaces the old ``/push`` url. Redirects are in place.
* Foreign key relationships are now served as IDs, not as urls.
* **Modified [breaking]:** PostgreSQL:
* If ``PAPERLESS_DBHOST`` is specified in the settings, paperless uses PostgreSQL instead of SQLite.
Username, database and password all default to ``paperless`` if not specified.
* **Modified [breaking]:** document_retagger management command rework. See
:ref:`utilities-retagger` for details. Replaces ``document_correspondents``
management command.
* **Removed [breaking]:** Reminders.
* **Removed:** All customizations made to the django admin pages.
* **Removed [breaking]:** The docker image no longer supports SSL. If you want to expose
paperless to the internet, hide paperless behind a proxy server that handles SSL
requests.
* **Internal changes:** Mostly code cleanup, including:
* Rework of the code of the tesseract parser. This is now a lot cleaner.
* Rework of the filename handling code. It was a mess.
* Fixed some issues with the document exporter not exporting all documents when encountering duplicate filenames.
* Added a task scheduler that takes care of checking mail, training the classifier, maintaining the document search index
and consuming documents.
* Updated dependencies. Now uses Pipenv all around.
* Updated Dockerfile and docker-compose. Now uses ``supervisord`` to run everything paperless-related in a single container.
* **Settings:**
* ``PAPERLESS_FORGIVING_OCR`` is now default and gone. Reason: Even if ``langdetect`` fails to detect
a language, tesseract still does a very good job at ocr'ing a document with the default language.
Certain language specifics such as umlauts may not get picked up properly.
* ``PAPERLESS_DEBUG`` defaults to ``false``.
* The presence of ``PAPERLESS_DBHOST`` now determines whether to use PostgreSQL or
SQLite.
* ``PAPERLESS_OCR_THREADS`` is gone and replaced with ``PAPERLESS_TASK_WORKERS`` and
``PAPERLESS_THREADS_PER_WORKER``. Refer to the config example for details.
* ``PAPERLESS_OPTIMIZE_THUMBNAILS`` allows you to disable or enable thumbnail
optimization. This is useful on less powerful devices.
* Many more small changes here and there. The usual stuff.
Paperless
#########
2.7.0
@@ -6,7 +335,7 @@ Changelog
* `syntonym`_ submitted a pull request to catch IMAP connection errors `#475`_.
* `Stéphane Brunner`_ added ``psycopg2`` to the Pipfile `#489`_. He also fixed
a syntax error in ``docker-compose.yml.example`` `#488`_ and added [DjangoQL](https://github.com/ivelum/djangoql),
a syntax error in ``docker-compose.yml.example`` `#488`_ and added `DjangoQL`_,
which allows a litany of handy search functionality `#492`_.
* `CkuT`_ and `JOKer`_ hacked out a simple, but super-helpful optimisation to
how the thumbnails are served up, improving performance considerably `#481`_.
@@ -194,7 +523,7 @@ that it was more an annoyance than anything else, so this feature is now turned
off unless you explicitly set a passphrase in your config file.
Migrating from 1.x
------------------
==================
Encryption isn't gone, it's just off for new users. So long as you have
``PAPERLESS_PASSPHRASE`` set in your config or your environment, Paperless
@@ -564,6 +893,9 @@ bulk of the work on this big change.
* Initial release
.. _rYR79435: https://github.com/rYR79435
.. _Michael Shamoon: https://github.com/shamoon
.. _jayme-github: http://github.com/jayme-github
.. _Brian Conn: https://github.com/TheConnMan
.. _Christopher Luu: https://github.com/nuudles
.. _Florian Jung: https://github.com/the01
@@ -739,6 +1071,6 @@ bulk of the work on this big change.
.. _#489: https://github.com/the-paperless-project/paperless/pull/489
.. _#492: https://github.com/the-paperless-project/paperless/pull/492
.. _pipenv: https://docs.pipenv.org/
.. _a new home on Docker Hub: https://hub.docker.com/r/danielquinn/paperless/
.. _optipng: http://optipng.sourceforge.net/
.. _DjangoQL: https://github.com/ivelum/djangoql

View File

@@ -1,51 +1,21 @@
# -*- coding: utf-8 -*-
#
# Paperless documentation build configuration file, created by
# sphinx-quickstart on Mon Oct 26 18:36:52 2015.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.
import sphinx_rtd_theme
import sys
import os
__version__ = None
exec(open("../src/paperless/version.py").read())
# Believe it or not, this is the officially sanctioned way to add custom CSS.
def setup(app):
app.add_stylesheet("custom.css")
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#sys.path.insert(0, os.path.abspath('.'))
# -- General configuration ------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
#needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.intersphinx',
'sphinx.ext.todo',
'sphinx.ext.imgmath',
'sphinx.ext.viewcode',
'sphinx_rtd_theme',
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# templates_path = ['_templates']
# The suffix of source filenames.
source_suffix = '.rst'
@@ -57,7 +27,7 @@ source_suffix = '.rst'
master_doc = 'index'
# General information about the project.
project = u'Paperless'
project = u'Paperless-ng'
copyright = u'2015, Daniel Quinn'
# The version info for the project you're documenting, acts as replacement for
@@ -118,7 +88,7 @@ pygments_style = 'sphinx'
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
html_theme = 'default'
html_theme = 'sphinx_rtd_theme'
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
@@ -198,19 +168,6 @@ html_static_path = ['_static']
# Output file base name for HTML help builder.
htmlhelp_basename = 'paperless'
#
# Attempt to use the ReadTheDocs theme. If it's not installed, fallback to
# the default.
#
try:
import sphinx_rtd_theme
html_theme = "sphinx_rtd_theme"
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
except ImportError:
pass
# -- Options for LaTeX output ---------------------------------------------
latex_elements = {

426
docs/configuration.rst Normal file

@@ -0,0 +1,426 @@
.. _configuration:
*************
Configuration
*************
Paperless provides a wide range of customizations.
Depending on how you run paperless, these settings have to be defined in different
places.
* If you run paperless on docker, ``paperless.conf`` is not used. Rather, configure
paperless by copying necessary options to ``docker-compose.env``.
* If you are running paperless on anything else, paperless will search for the
configuration file in these locations and use the first one it finds:
.. code::
/path/to/paperless/paperless.conf
/etc/paperless.conf
/usr/local/etc/paperless.conf
Required services
#################
PAPERLESS_REDIS=<url>
This is required for processing scheduled tasks such as email fetching, index
optimization and for training the automatic document matcher.
Defaults to redis://localhost:6379.
PAPERLESS_DBHOST=<hostname>
By default, sqlite is used as the database backend. This can be changed here.
Set PAPERLESS_DBHOST and PostgreSQL will be used instead of SQLite.
PAPERLESS_DBPORT=<port>
Adjust port if necessary.
Default is 5432.
PAPERLESS_DBNAME=<name>
Database name in PostgreSQL.
Defaults to "paperless".
PAPERLESS_DBUSER=<name>
Database user in PostgreSQL.
Defaults to "paperless".
PAPERLESS_DBPASS=<password>
Database password for PostgreSQL.
Defaults to "paperless".
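The way these options interact can be sketched in Python. This is not the actual paperless settings code, only an illustration of the behavior described above; the function name ``database_settings`` and the shape of the returned dictionary are assumptions.

```python
def database_settings(env):
    """Pick the database backend from a mapping of environment variables.

    SQLite is used unless PAPERLESS_DBHOST is present, in which case
    PostgreSQL is used with the documented defaults.
    """
    if "PAPERLESS_DBHOST" not in env:
        return {"engine": "sqlite"}
    return {
        "engine": "postgresql",
        "host": env["PAPERLESS_DBHOST"],
        "port": int(env.get("PAPERLESS_DBPORT", 5432)),
        "name": env.get("PAPERLESS_DBNAME", "paperless"),
        "user": env.get("PAPERLESS_DBUSER", "paperless"),
        "password": env.get("PAPERLESS_DBPASS", "paperless"),
    }


print(database_settings({}))
print(database_settings({"PAPERLESS_DBHOST": "db", "PAPERLESS_DBPORT": "5433"}))
```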
Paths and folders
#################
PAPERLESS_CONSUMPTION_DIR=<path>
This is where your documents should go to be consumed. Make sure that it exists
and that the user running the paperless service can read/write its contents
before you start Paperless.
Don't change this when using docker, as it only changes the path within the
container. Change the local consumption directory in the docker-compose.yml
file instead.
Defaults to "../consume", relative to the "src" directory.
PAPERLESS_DATA_DIR=<path>
This is where paperless stores all its data (search index, SQLite database,
classification model, etc).
Defaults to "../data", relative to the "src" directory.
PAPERLESS_MEDIA_ROOT=<path>
This is where your documents and thumbnails are stored.
You can set this and PAPERLESS_DATA_DIR to the same folder to have paperless
store all its data within the same volume.
Defaults to "../media", relative to the "src" directory.
PAPERLESS_STATICDIR=<path>
Override the default STATIC_ROOT here. This is where all static files
created using the "collectstatic" management command are stored.
Unless you're doing something fancy, there is no need to override this.
Defaults to "../static", relative to the "src" directory.
PAPERLESS_FILENAME_FORMAT=<format>
Changes the filenames paperless uses to store documents in the media directory.
See :ref:`advanced-file_name_handling` for details.
Default is none, which disables this feature.
Hosting & Security
##################
PAPERLESS_SECRET_KEY=<key>
Paperless uses this to make session tokens. If you expose paperless on the
internet, you need to change this, since the default secret is well known.
Use any sequence of characters. The more, the better. You don't need to
remember this. Just face-roll your keyboard.
Default is listed in the file ``src/paperless/settings.py``.
PAPERLESS_ALLOWED_HOSTS=<comma-separated-list>
If you're planning on putting Paperless on the open internet, then you
really should set this value to the domain name you're using. Failing to do
so leaves you open to HTTP host header attacks:
https://docs.djangoproject.com/en/3.1/topics/security/#host-header-validation
Just remember that this is a comma-separated list, so "example.com" is fine,
as is "example.com,www.example.com", but NOT " example.com" or "example.com,"
Defaults to "*", which is all hosts.
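To see why stray spaces and trailing commas matter, consider what happens if the value is split on commas without any trimming. The sketch below assumes exactly that behavior; the function name ``parse_allowed_hosts`` is hypothetical.

```python
def parse_allowed_hosts(value):
    """Split the setting on commas, without trimming whitespace (assumed)."""
    return value.split(",")


print(parse_allowed_hosts("example.com,www.example.com"))
print(parse_allowed_hosts(" example.com"))
print(parse_allowed_hosts("example.com,"))
```

Under this assumption, the second call yields a host name with a leading space and the third yields an empty entry, neither of which matches a real host.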
PAPERLESS_CORS_ALLOWED_HOSTS=<comma-separated-list>
You need to add your servers to the list of allowed hosts that can do CORS
calls. Set this to your public domain name.
Defaults to "http://localhost:8000".
PAPERLESS_FORCE_SCRIPT_NAME=<path>
To host paperless under a subpath url like example.com/paperless you set
this value to /paperless. No trailing slash!
.. note::
I don't know if this works in paperless-ng. Probably not.
Defaults to none, which hosts paperless at "/".
PAPERLESS_STATIC_URL=<path>
Override the STATIC_URL here. Unless you're hosting Paperless off a
subpath like /paperless/, you probably don't need to change this.
Defaults to "/static/".
PAPERLESS_AUTO_LOGIN_USERNAME=<username>
Specify a username here so that paperless will automatically perform login
with the selected user.
.. danger::
Do not use this when exposing paperless on the internet. There are no
checks in place that would prevent you from doing this.
Defaults to none, which disables this feature.
PAPERLESS_COOKIE_PREFIX=<str>
Specify a prefix that is added to the cookies used by paperless to identify
the currently logged in user. This is useful for when you're running two
instances of paperless on the same host.
After changing this, you will have to login again.
Defaults to ``""``, which does not alter the cookie names.
.. _configuration-ocr:
OCR settings
############
Paperless uses `OCRmyPDF <https://ocrmypdf.readthedocs.io/en/latest/>`_ for
performing OCR on documents and images. Paperless uses sensible defaults for
most settings, but all of them can be configured to your needs.
PAPERLESS_OCR_LANGUAGE=<lang>
Customize the language that paperless will attempt to use when
parsing documents.
It should be a 3-letter language code consistent with ISO
639: https://www.loc.gov/standards/iso639-2/php/code_list.php
Set this to the language most of your documents are written in.
This can be a combination of multiple languages such as ``deu+eng``,
in which case tesseract will use whatever language matches best.
Keep in mind that tesseract uses much more cpu time with multiple
languages enabled.
Defaults to "eng".
PAPERLESS_OCR_MODE=<mode>
Tell paperless when and how to perform ocr on your documents. Four modes
are available:
* ``skip``: Paperless skips all pages that already contain text and will perform OCR
only on pages where no text is present. This is the safest option.
* ``skip_noarchive``: In addition to skip, paperless won't create an
archived version of your documents when it finds any text in them.
This is useful if you don't want to have two almost-identical versions
of your digital documents in the media folder. This is the fastest option.
* ``redo``: Paperless will OCR all pages of your documents and attempt to
replace any existing text layers with new text. This will be useful for
documents from scanners that already performed OCR with insufficient
results. It will also perform OCR on purely digital documents.
This option may fail on some documents that have features that cannot
be removed, such as forms. In this case, the text from the document is
used instead.
* ``force``: Paperless rasterizes your documents, converting any text
into images and puts the OCRed text on top. This works for all documents,
however, the resulting document may be significantly larger and text
won't appear as sharp when zoomed in.
The default is ``skip``, which only performs OCR when necessary and always
creates archived documents.
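The four modes can be pictured as a small lookup that turns the setting into OCRmyPDF options. This is only a sketch: the function name ``ocr_mode_arguments`` is hypothetical, and it assumes the modes correspond to OCRmyPDF's ``skip_text``, ``redo_ocr`` and ``force_ocr`` options. It is not necessarily how paperless implements them.

.. code:: python

    def ocr_mode_arguments(mode):
        """Translate a PAPERLESS_OCR_MODE value into OCRmyPDF keyword arguments."""
        # Assumption: both skip variants OCR only pages without text;
        # skip_noarchive differs in whether an archive version is kept.
        if mode in ("skip", "skip_noarchive"):
            return {"skip_text": True}
        if mode == "redo":
            return {"redo_ocr": True}
        if mode == "force":
            return {"force_ocr": True}
        raise ValueError(f"Invalid OCR mode: {mode!r}")

An unknown mode raises ``ValueError`` rather than silently falling back to a default.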
PAPERLESS_OCR_OUTPUT_TYPE=<type>
Specify the type of PDF documents that paperless should produce.
* ``pdf``: Modify the PDF document as little as possible.
* ``pdfa``: Convert PDF documents into PDF/A-2b documents, which is a
subset of the entire PDF specification and meant for storing
documents long term.
* ``pdfa-1``, ``pdfa-2``, ``pdfa-3`` to specify the exact version of
PDF/A you wish to use.
If not specified, ``pdfa`` is used. Remember that paperless also keeps
the original input file as well as the archived version.
PAPERLESS_OCR_PAGES=<num>
Tells paperless to use only the specified number of pages for OCR. Documents
with fewer pages than specified get OCR'ed completely.
Specifying 1 here will only use the first page.
When combined with ``PAPERLESS_OCR_MODE=redo`` or ``PAPERLESS_OCR_MODE=force``,
paperless will not modify any text it finds on excluded pages and copy it
verbatim.
Defaults to 0, which disables this feature and always uses all pages.
PAPERLESS_OCR_IMAGE_DPI=<num>
Paperless will OCR any images you put into the system and convert them
into PDF documents. This is useful if your scanner produces images.
In order to do so, paperless needs to know the DPI of the image.
Most images from scanners will have this information embedded and
paperless will detect and use that information. In case this fails, it
uses this value as a fallback.
Set this to the DPI your scanner produces images at.
Default is none, which causes paperless to fail if no DPI information is
present in an image.
PAPERLESS_OCR_USER_ARG=<json>
OCRmyPDF offers many more options. Use this parameter to specify any
additional arguments you wish to pass to OCRmyPDF. Since Paperless uses
the API of OCRmyPDF, you have to specify these in a format that can be
passed to the API. See `the API reference of OCRmyPDF <https://ocrmypdf.readthedocs.io/en/latest/api.html#reference>`_
for valid parameters. All command line options are supported, but they
use underscores instead of dashes.
.. caution::
Paperless has been tested to work with the OCR options provided
above. There are many options that are incompatible with each other,
so specifying invalid options may prevent paperless from consuming
any documents.
Specify arguments as a JSON dictionary. Keep note of lower case booleans
and double quoted parameter names and strings. Examples:
.. code:: json
{"deskew": true, "optimize": 3, "unpaper_args": "--pre-rotate 90"}
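To check a value before putting it into the configuration, you can parse it with Python's ``json`` module. The sketch below also shows the dash-to-underscore renaming described above; the helper ``cli_option_to_api_name`` is a hypothetical name used only for illustration.

.. code:: python

    import json

    # The example value from above, as it would appear in the setting.
    raw_value = '{"deskew": true, "optimize": 3, "unpaper_args": "--pre-rotate 90"}'

    # json.loads raises json.JSONDecodeError for invalid input, for example
    # Python-style ``True`` or single-quoted strings.
    user_args = json.loads(raw_value)


    def cli_option_to_api_name(option):
        """Turn a command line option such as --unpaper-args into unpaper_args."""
        return option.lstrip("-").replace("-", "_")

After parsing, ``user_args["deskew"]`` is the Python boolean ``True``, and ``cli_option_to_api_name("--unpaper-args")`` returns ``"unpaper_args"``.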
Software tweaks
###############
PAPERLESS_TASK_WORKERS=<num>
Paperless does multiple things in the background: Maintain the search index,
maintain the automatic matching algorithm, check emails, consume documents,
etc. This variable specifies how many things it will do in parallel.
PAPERLESS_THREADS_PER_WORKER=<num>
Furthermore, paperless uses multiple threads when consuming documents to
speed up OCR. This variable specifies how many pages paperless will process
in parallel on a single document.
.. caution::
Ensure that the product
PAPERLESS_TASK_WORKERS * PAPERLESS_THREADS_PER_WORKER
does not exceed your CPU core count or else paperless will be extremely slow.
If you want paperless to process many documents in parallel, choose a high
worker count. If you want paperless to process very large documents faster,
use a higher thread per worker count.
The default is a balance between the two, according to your CPU core count,
with a slight favor towards threads per worker, and using as many cores as
possible.
If you only specify PAPERLESS_TASK_WORKERS, paperless will adjust
PAPERLESS_THREADS_PER_WORKER automatically.
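One way to arrive at such a balance is sketched below. The function ``balanced_split`` is hypothetical and the exact formula paperless uses may differ; the sketch only shows a split whose product never exceeds the core count and that leans towards threads per worker.

.. code:: python

    from math import isqrt


    def balanced_split(cpu_cores):
        """Return (task_workers, threads_per_worker) for a given core count."""
        # The integer square root keeps the worker count at or below the thread count.
        workers = max(isqrt(cpu_cores), 1)
        # Hand the remaining parallelism to each worker as threads.
        threads = max(cpu_cores // workers, 1)
        return workers, threads

With 8 cores this gives 2 workers with 4 threads each; with 6 cores, 2 workers with 3 threads each.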
PAPERLESS_TIME_ZONE=<timezone>
Set the time zone here.
See https://docs.djangoproject.com/en/3.1/ref/settings/#std:setting-TIME_ZONE
for details on how to set it.
Defaults to UTC.
PAPERLESS_CONSUMER_POLLING=<num>
If paperless won't find documents added to your consume folder, it might
not be able to automatically detect filesystem changes. In that case,
specify a polling interval in seconds here, which will then cause paperless
to periodically check your consumption directory for changes.
Defaults to 0, which disables polling and uses filesystem notifications.
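Polling simply means listing the directory at a fixed interval and comparing the result with the previous listing. The following sketch illustrates the idea with the standard library only; ``find_new_files`` and ``watch`` are hypothetical names, not part of paperless.

.. code:: python

    import os
    import time


    def find_new_files(path, seen):
        """Return files in ``path`` that were not present in earlier calls."""
        current = {entry.path for entry in os.scandir(path) if entry.is_file()}
        new_files = sorted(current - seen)
        seen.update(current)
        return new_files


    def watch(path, interval, handle):
        """Check ``path`` every ``interval`` seconds and pass new files to ``handle``."""
        seen = set()
        while True:
            for file_path in find_new_files(path, seen):
                handle(file_path)
            time.sleep(interval)

A shorter interval picks up files sooner but causes more filesystem activity, which matters on network shares.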
PAPERLESS_CONSUMER_DELETE_DUPLICATES=<bool>
When the consumer detects a duplicate document, it will not touch the
original document. This default behavior can be changed here.
Defaults to false.
PAPERLESS_CONSUMER_RECURSIVE=<bool>
Enable recursive watching of the consumption directory. Paperless will
then pick up files in subdirectories within your consumption
directory as well.
Defaults to false.
PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS=<bool>
Set the names of subdirectories as tags for consumed files.
E.g. <CONSUMPTION_DIR>/foo/bar/file.pdf will add the tags "foo" and "bar" to
the consumed file. Paperless will create any tags that don't exist yet.
PAPERLESS_CONSUMER_RECURSIVE must be enabled for this to work.
Defaults to false.
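The mapping from subdirectories to tags can be sketched as follows. The function ``tags_from_path`` is a hypothetical helper written for this example; it only demonstrates which path components become tags.

.. code:: python

    from pathlib import PurePosixPath


    def tags_from_path(consumption_dir, file_path):
        """Return the subdirectory names between the consumption dir and the file."""
        relative = PurePosixPath(file_path).relative_to(consumption_dir)
        # Every directory component becomes a tag; the file name itself does not.
        return list(relative.parent.parts)

For ``/consume/foo/bar/file.pdf`` with the consumption directory ``/consume``, this returns ``["foo", "bar"]``. A file placed directly in the consumption directory yields an empty list.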
PAPERLESS_CONVERT_MEMORY_LIMIT=<num>
On smaller systems, or even in the case of Very Large Documents, the consumer
may explode, complaining about how it's "unable to extend pixel cache". In
such cases, try setting this to a reasonably low value, like 32. The
default is to use whatever is necessary to do everything without writing to
disk, and units are in megabytes.
For more information on how to use this value, you should search
the web for "MAGICK_MEMORY_LIMIT".
Defaults to 0, which disables the limit.
PAPERLESS_CONVERT_TMPDIR=<path>
Similar to the memory limit, if you've got a small system and your OS mounts
/tmp as tmpfs, you should set this to a path that's on a physical disk, like
/home/your_user/tmp or something. ImageMagick will use this as scratch space
when crunching through very large documents.
For more information on how to use this value, you should search
the web for "MAGICK_TMPDIR".
Default is none, which disables the temporary directory.
PAPERLESS_OPTIMIZE_THUMBNAILS=<bool>
Use optipng to optimize thumbnails. This usually reduces the size of
thumbnails by about 20%, but uses considerable compute time during
consumption.
Defaults to true.
PAPERLESS_POST_CONSUME_SCRIPT=<filename>
After a document is consumed, Paperless can trigger an arbitrary script if
you like. This script will be passed a number of arguments for you to work
with. For more information, take a look at :ref:`advanced-post_consume_script`.
The default is blank, which means nothing will be executed.
PAPERLESS_FILENAME_DATE_ORDER=<format>
Paperless will check the document text for document date information.
Use this setting to enable checking the document filename for date
information. The date order can be set to any option as specified in
https://dateparser.readthedocs.io/en/latest/settings.html#date-order.
The filename will be checked first, and if nothing is found, the document
text will be checked as normal.
Defaults to none, which disables this feature.
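The date order matters because many dates in filenames are ambiguous. The sketch below uses only the standard library to show how the same string yields different dates depending on the order. It is an illustration of the ambiguity, not the date parsing that paperless performs, and ``parse_with_order`` is a hypothetical helper.

.. code:: python

    from datetime import datetime

    # Assumed format strings for three common date orders.
    DATE_ORDER_FORMATS = {
        "DMY": "%d-%m-%Y",
        "MDY": "%m-%d-%Y",
        "YMD": "%Y-%m-%d",
    }


    def parse_with_order(text, order):
        """Parse ``text`` according to the given date order."""
        return datetime.strptime(text, DATE_ORDER_FORMATS[order]).date()


    day_first = parse_with_order("03-04-2020", "DMY")    # 3 April 2020
    month_first = parse_with_order("03-04-2020", "MDY")  # 4 March 2020

Pick the order that matches how your scanner or your own naming scheme writes dates.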
PAPERLESS_FILENAME_PARSE_TRANSFORMS
Transforms filenames before they are processed by paperless. See
:ref:`advanced-transforming_filenames` for details.
Defaults to none, which disables this feature.
Binaries
########
There are a few external software packages that Paperless expects to find on
your system when it starts up. Unless you've done something creative with
their installation, you probably won't need to edit any of these. However,
if you've installed these programs somewhere where simply typing the name of
the program doesn't automatically execute it (i.e. the program isn't in your
$PATH), then you'll need to specify the literal path for that program.
PAPERLESS_CONVERT_BINARY=<path>
Defaults to "/usr/bin/convert".
PAPERLESS_GS_BINARY=<path>
Defaults to "/usr/bin/gs".
PAPERLESS_OPTIPNG_BINARY=<path>
Defaults to "/usr/bin/optipng".


@@ -1,255 +0,0 @@
.. _consumption:
Consumption
###########
Once you've got Paperless set up, you need to start feeding documents into it.
Currently, there are three options: the consumption directory, IMAP (email), and
HTTP POST.
.. _consumption-directory:
The Consumption Directory
=========================
The primary method of getting documents into your database is by putting them in
the consumption directory. The ``document_consumer`` script runs in an infinite
loop looking for new additions to this directory and when it finds them, it goes
about the process of parsing them with the OCR, indexing what it finds, and
encrypting the PDF (if ``PAPERLESS_PASSPHRASE`` is set), storing it in the
media directory.
Getting stuff into this directory is up to you. If you're running Paperless
on your local computer, you might just want to drag and drop files there, but if
you're running this on a server and want your scanner to automatically push
files to this directory, you'll need to set up some sort of service to accept the
files from the scanner. Typically, you're looking at an FTP server like
`Proftpd`_ or `Samba`_.
.. _Proftpd: http://www.proftpd.org/
.. _Samba: http://www.samba.org/
So where is this consumption directory? It's wherever you define it. Look for
the ``CONSUMPTION_DIR`` value in ``settings.py``. Set that to somewhere
appropriate for your use and put some documents in there. When you're ready,
follow the :ref:`consumer <utilities-consumer>` instructions to get it running.
.. _consumption-directory-hook:
Hooking into the Consumption Process
------------------------------------
Sometimes you may want to do something arbitrary whenever a document is
consumed. Rather than try to predict what you may want to do, Paperless lets
you execute scripts of your own choosing just before or after a document is
consumed using a couple simple hooks.
Just write a script, put it somewhere that Paperless can read & execute, and
then put the path to that script in ``paperless.conf`` with the variable name
of either ``PAPERLESS_PRE_CONSUME_SCRIPT`` or
``PAPERLESS_POST_CONSUME_SCRIPT``. The script will be executed before or
after the document is consumed, respectively.
.. important::
These scripts are executed in a **blocking** process, which means that if
a script takes a long time to run, it can significantly slow down your
document consumption flow. If you want things to run asynchronously,
you'll have to fork the process in your script and exit.
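If the hook is written in Python, one way to avoid blocking is to start the slow work as a detached child process and return immediately. This is a sketch under assumptions: ``run_in_background`` is a hypothetical helper, and ``SLOW_COMMAND`` is a stand-in for whatever long-running program you actually want to start.

.. code:: python

    #!/usr/bin/env python3
    import subprocess
    import sys

    # Stand-in for a slow task; replace with your own command.
    SLOW_COMMAND = [sys.executable, "-c", "import time; time.sleep(1)"]


    def run_in_background(command):
        """Start ``command`` without waiting for it to finish."""
        return subprocess.Popen(
            command,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            start_new_session=True,  # POSIX only: detach from the caller's session
        )


    if __name__ == "__main__":
        # Forward the arguments passed to the hook, then exit right away.
        run_in_background(SLOW_COMMAND + sys.argv[1:])

Because the script does not wait for the child, consumption continues while the slow task runs. The trade-off is that a failure in the background task is never reported to the caller.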
.. _consumption-directory-hook-variables:
What Can These Scripts Do?
..........................
It's your script, so you're only limited by your imagination and the laws of
physics. However, the following values are passed to the scripts in order:
.. _consumption-director-hook-variables-pre:
Pre-consumption script
::::::::::::::::::::::
* Document file name
A simple but common example for this would be creating a simple script like
this:
``/usr/local/bin/ocr-pdf``
.. code:: bash
#!/usr/bin/env bash
pdf2pdfocr.py -i "${1}"
``/etc/paperless.conf``
.. code:: bash
...
PAPERLESS_PRE_CONSUME_SCRIPT="/usr/local/bin/ocr-pdf"
...
This will pass the path to the document about to be consumed to ``/usr/local/bin/ocr-pdf``,
which will in turn call `pdf2pdfocr.py`_ on your document, which will then
overwrite the file with an OCR'd version of the file and exit. At which point,
the consumption process will begin with the newly modified file.
.. _pdf2pdfocr.py: https://github.com/LeoFCardoso/pdf2pdfocr
.. _consumption-director-hook-variables-post:
Post-consumption script
:::::::::::::::::::::::
* Document id
* Generated file name
* Source path
* Thumbnail path
* Download URL
* Thumbnail URL
* Correspondent
* Tags
The script can be in any language you like, but for a simple shell script
example, you can take a look at ``post-consumption-example.sh`` in the
``scripts`` directory in this project.
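A post-consumption script written in Python could pick up the positional arguments listed above like this. The names in ``ARGUMENT_NAMES`` are labels chosen for this sketch; only their order comes from the list above.

.. code:: python

    #!/usr/bin/env python3
    import sys

    # Labels for the positional arguments, in the order they are passed.
    ARGUMENT_NAMES = [
        "document_id",
        "file_name",
        "source_path",
        "thumbnail_path",
        "download_url",
        "thumbnail_url",
        "correspondent",
        "tags",
    ]


    def parse_arguments(argv):
        """Map the positional arguments (without the script name) to labels."""
        return dict(zip(ARGUMENT_NAMES, argv[1:]))


    if __name__ == "__main__":
        for name, value in parse_arguments(sys.argv).items():
            print(f"{name}: {value}")

Missing trailing arguments are simply left out of the resulting dictionary, so the script does not fail if fewer values are passed.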
.. _consumption-imap:
IMAP (Email)
============
Another handy way to get documents into your database is to email them to
yourself. The typical use-case would be to be out for lunch and want to send a
copy of the receipt back to your system at home. Paperless can be taught to
pull emails down from an arbitrary account and dump them into the consumption
directory where the process :ref:`above <consumption-directory>` will follow the
usual pattern on consuming the document.
Some things you need to know about this feature:
* It's disabled by default. By setting the values below it will be enabled.
* It's been tested in a limited environment, so it may not work for you (please
submit a pull request if you can!)
* It's designed to **delete mail from the server once consumed**. So don't go
pointing this to your personal email account and wonder where all your stuff
went.
* Currently, only one photo (attachment) per email will work.
So, with all that in mind, here's what you do to get it running:
1. Set up a new email account somewhere, or if you're feeling daring, create a
folder in an existing email box and note the path to that folder.
2. In ``/etc/paperless.conf`` set all of the appropriate values in
``PATHS AND FOLDERS`` and ``SECURITY``.
If you decided to use a subfolder of an existing account, then make sure you
set ``PAPERLESS_CONSUME_MAIL_INBOX`` accordingly here. You also have to set
the ``PAPERLESS_EMAIL_SECRET`` to something you can remember 'cause you'll
have to include that in every email you send.
3. Restart the :ref:`consumer <utilities-consumer>`. The consumer will check
the configured email account at startup and from then on every 10 minutes
for something new and pulls down whatever it finds.
4. Send yourself an email! Note that the subject is treated as the file name,
so if you set the subject to ``Correspondent - Title - tag,tag,tag``, you'll
get what you expect. Also, you must include the aforementioned secret
string in every email so the fetcher knows that it's safe to import.
Note that Paperless only allows the email title to consist of safe characters
to be imported. These consist of alpha-numeric characters and ``-_ ,.'``.
5. After a few minutes, the consumer will poll your mailbox, pull down the
message, and place the attachment in the consumption directory with the
appropriate name. A few minutes later, the consumer will import it like any
other file.
.. _consumption-http:
HTTP POST
=========
You can also submit a document via HTTP POST, so long as you do so after
authenticating. To push your document to Paperless, send an HTTP POST to the
server with the following name/value pairs:
* ``correspondent``: The name of the document's correspondent. Note that there
are restrictions on what characters you can use here. Specifically,
alphanumeric characters, `-`, `,`, `.`, and `'` are ok, everything else is
out. You also can't use the sequence ` - ` (space, dash, space).
* ``title``: The title of the document. The rules for characters are the same
here as the correspondent.
* ``document``: The file you're uploading
Specify ``enctype="multipart/form-data"``, and then POST your file with::
Content-Disposition: form-data; name="document"; filename="whatever.pdf"
An example of this in HTML is a typical form:
.. code:: html
<form method="post" enctype="multipart/form-data">
<input type="text" name="correspondent" value="My Correspondent" />
<input type="text" name="title" value="My Title" />
<input type="file" name="document" />
<input type="submit" name="go" value="Do the thing" />
</form>
But a potentially more useful way to do this would be in Python. Here we use
the requests library to handle basic authentication and to send the POST data
to the URL.
.. code:: python
import os
import requests
from requests.auth import HTTPBasicAuth
# You authenticate via BasicAuth or with a session id.
# We use BasicAuth here
username = "my-username"
password = "my-super-secret-password"
# Where you have Paperless installed and listening
url = "http://localhost:8000/push"
# Document metadata
correspondent = "Test Correspondent"
title = "Test Title"
# The local file you want to push
path = "/path/to/some/directory/my-document.pdf"
with open(path, "rb") as f:
    response = requests.post(
        url=url,
        data={"title": title, "correspondent": correspondent},
        files={"document": (os.path.basename(path), f, "application/pdf")},
        auth=HTTPBasicAuth(username, password),
        allow_redirects=False
    )
if response.status_code == 202:
    # Everything worked out ok
    print("Upload successful")
else:
    # If you don't get a 202, it's probably because your credentials
    # are wrong or something. This will give you a rough idea of what
    # happened.
    print("We got HTTP status code: {}".format(response.status_code))
    for k, v in response.headers.items():
        print("{}: {}".format(k, v))


@@ -3,6 +3,10 @@
Contributing to Paperless
#########################
.. warning::
This section is not updated to paperless-ng yet.
Maybe you've been using Paperless for a while and want to add a feature or two,
or maybe you've come across a bug that you have some ideas how to solve. The
beauty of Free software is that you can see what's wrong and help to get it
@@ -81,7 +85,7 @@ quoted, or triple-quoted string will do:
problematic_string = 'This is a "string" with "quotes" in it'
In HTML templates, please use double-quotes for tag attributes, and single
quotes for arguments passed to Django tempalte tags:
quotes for arguments passed to Django template tags:
.. code:: html


@@ -1,42 +0,0 @@
.. _customising:
Customising Paperless
#####################
Currently, the Paperless' interface is just the default Django admin, which
while powerful, is rather boring. If you'd like to give the site a bit of a
face-lift, or if you simply want to adjust the colours, contrast, or font size
to make things easier to read, you can do that by adding your own CSS or
Javascript quite easily.
.. _customising-overrides:
Overrides
=========
On every page load, Paperless looks for two files in your media root directory
(the directory defined by your ``PAPERLESS_MEDIADIR`` configuration variable or
the default, ``<project root>/media/``) for two files:
* ``overrides.css``
* ``overrides.js``
If it finds either or both of those files, they'll be loaded into the page: the
CSS in the ``<head>``, and the Javascript stuffed into the last line of the
``<body>``.
.. _customising-overrides-note:
An important note about customisation
-------------------------------------
Any changes you make to the site with your CSS or Javascript are likely to
depend on the structure of the current HTML and/or the existing CSS rules. For
the most part it's safe to assume that these bits won't change, but *sometimes
they do* as features are added or bugs are fixed.
If you make a change that you think others would appreciate though, submit it
as a pull request and maybe we can find a way to work it into the project by
default!


@@ -1,158 +0,0 @@
#!/usr/bin/env bash
# Bash script to install paperless in lxc container
# paperless.lan
#
# Will set-up paperless, apache2 and proftpd
#
# lxc launch ubuntu: paperless
# lxc exec paperless -- sh -c "sudo apt-get update && sudo apt-get install -y wget"
# lxc exec paperless -- sh -c "wget https://raw.githubusercontent.com/the-paperless-project/paperless/master/docs/examples/lxc/lxc-install.sh && /bin/bash lxc-install.sh --email "
#
#
set +e
PASSWORD=$(< /dev/urandom tr -dc _A-Z-a-z-0-9+@%^{} | head -c20;echo;)
EMAIL=
function displayHelp() {
echo "available parameters:
-e <email> | --email <email>
-p <password> | --password <password>
"
}
POSITIONAL=()
while [[ $# -gt 0 ]]
do
key="$1"
i=$key
case $i in
-e|--email)
EMAIL="${2}"
shift
shift
;;
-p|--password)
PASSWORD="${2}"
shift
shift
;;
--default|-h|--help)
shift
displayHelp
exit 0
;;
*)
echo "argument: $i not recognized"
exit 2
;;
esac
done
set -- "${POSITIONAL[@]}" # restore positional parameters
if [ -z "$EMAIL" ]; then
echo "missing email, try running with -h "
exit 3
fi
if [[ $(/usr/bin/id -u) -ne 0 ]]; then
echo "Not running as root"
exit
fi
if [ $(grep -c paperless /etc/passwd) -eq 0 ]; then
# Add paperless user with no password
adduser --disabled-password --gecos "" paperless
fi
if [ $(grep -c ftpupload /etc/passwd) -eq 0 ]; then
# Add ftpupload
adduser --disabled-password --gecos "" ftpupload
echo "Set ftpupload password: "
#passwd ftpupload
#TODO: generate some password and allow parameter
echo "ftpupload:ftpuploadpassword" | chpasswd
fi
if [ $(id -nG paperless | grep -Fcw ftpupload) -eq 0 ]; then
# Allow paperless group to access
adduser paperless ftpupload
chmod g+w /home/ftpupload
fi
# Get apt up to date
apt-get update
# Needed for plain Paperless
apt-get -y install unpaper gnupg libpoppler-cpp-dev python3-pyocr tesseract-ocr imagemagick optipng git
# Needed for Apache
apt-get -y install apache2 libapache2-mod-wsgi-py3
if [ ! -f /etc/proftpd/proftpd.conf ]; then
# Install ftp server and make sure all uploaded files are owned by paperless
apt-get -y install proftpd
fi
if [ $(grep -c paperless /etc/proftpd/proftpd.conf) -eq 0 ]; then
cat <<EOF >> /etc/proftpd/proftpd.conf
<Directory /home/ftpupload/>
UserOwner paperless
GroupOwner paperless
</Directory>
EOF
systemctl restart proftpd
fi
#Get Paperless from git
su -c "cd /home/paperless ; git clone https://github.com/the-paperless-project/paperless" paperless
# Install Pip Requirements
apt-get -y install python3-pip python3-venv
cd /home/paperless/paperless
pip3 install -r requirements.txt
# Take paperless.conf.example and set consumption dir (ftp dir)
sed -e '/PAPERLESS_CONSUMPTION_DIR=/s/=.*/=\"\/home\/ftpupload\/\"/' \
/home/paperless/paperless/paperless.conf.example >/etc/paperless.conf
# Update /etc/paperless.conf with PAPERLESS_SECRET_KEY
SECRET=$(strings /dev/urandom | grep -o '[[:alnum:]]' | head -n 30 | tr -d '\n'; echo)
sed -i "s/#PAPERLESS_SECRET_KEY.*/PAPERLESS_SECRET_KEY=$SECRET/" /etc/paperless.conf
#Initialise the SQLite database
su -c "cd /home/paperless/paperless/src/ ; ./manage.py migrate" paperless
echo "if superuser doesn't exist, create one with login: paperless and password: ${PASSWORD}"
#Create a user for your Paperless instance
su -c "cd /home/paperless/paperless/src/ ; echo ./manage.py create_superuser_with_password --username paperless --email ${EMAIL} --password ${PASSWORD} --preserve" paperless
su -c "cd /home/paperless/paperless/src/ ; ./manage.py create_superuser_with_password --username paperless --email ${EMAIL} --password ${PASSWORD} --preserve" paperless
if [ ! -d /home/paperless/paperless/static ]; then
# 167 static files copied to '/home/paperless/paperless/static'.
su -c "cd /home/paperless/paperless/src/ ; ./manage.py collectstatic" paperless
fi
if [ ! -f /etc/apache2/sites-available/paperless.conf ]; then
# Set-up apache
cp /home/paperless/paperless/docs/examples/lxc/paperless.conf /etc/apache2/sites-available/
a2dissite 000-default.conf
a2ensite paperless.conf
systemctl reload apache2
fi
sed -e "s:home/paperless/project/virtualenv/bin/python:usr/bin/python3:" \
/home/paperless/paperless/scripts/paperless-consumer.service \
>/etc/systemd/system/paperless-consumer.service
sed -i "s:/home/paperless/project/src/manage.py:/home/paperless/paperless/src/manage.py:" \
/etc/systemd/system/paperless-consumer.service
systemctl enable paperless-consumer
systemctl start paperless-consumer
# convert-im6.q16: not authorized
# Security risk ?
# https://stackoverflow.com/questions/42928765/convertnot-authorized-aaaa-error-constitute-c-readimage-453
if [ -f /etc/ImageMagick-6/policy.xml ]; then
mv /etc/ImageMagick-6/policy.xml /etc/ImageMagick-6/policy.xmlout
fi


@@ -1,18 +0,0 @@
<VirtualHost *:80>
ServerName paperless.lan
Alias /static/ /home/paperless/paperless/static/
<Directory /home/paperless/paperless/static>
Require all granted
</Directory>
WSGIScriptAlias / /home/paperless/paperless/src/paperless/wsgi.py
WSGIDaemonProcess paperless.lan user=paperless group=paperless threads=5 python-path=/home/paperless/paperless/src
WSGIProcessGroup paperless.lan
<Directory /home/paperless/paperless/src/paperless>
<Files wsgi.py>
Require all granted
</Files>
</Directory>
</VirtualHost>


@@ -1,112 +1,197 @@
.. _extending:
Paperless development
#####################
This section describes the steps you need to take to start development on paperless-ng.
1. Check out the source from github. The repository is organized in the following way:
* ``master`` always represents the latest release and will only see changes
when a new release is made.
* ``dev`` contains the code that will be in the next release.
* ``feature-X`` contain bigger changes that will be in some release, but not
necessarily the next one.
Apart from that, the folder structure is as follows:
* ``docs/`` - Documentation.
* ``src-ui/`` - Code of the front end.
* ``src/`` - Code of the back end.
* ``scripts/`` - Various scripts that help with different parts of development.
* ``docker/`` - Files required to build the docker image.
2. Install some dependencies.
* Python 3.6.
* All dependencies listed in the :ref:`Bare metal route <setup-bare_metal>`
* redis. You can either install redis or use the included scripts/start-redis.sh
to use docker to fire up a redis instance.
Back end development
====================
The backend is a django application. I use PyCharm for development, but you can use whatever
you want.
Install the python dependencies by performing ``pipenv install --dev`` in the src/ directory.
This will also create a virtual environment, which you can enter with ``pipenv shell`` or
execute one-shot commands in with ``pipenv run``.
In ``src/paperless.conf``, enable debug mode.
Configure the IDE to use the src/ folder as the base source folder. Configure the following
launch configurations in your IDE:
* python3 manage.py runserver
* python3 manage.py qcluster
* python3 manage.py consumer
Depending on which part of paperless you're developing for, you need to have some or all of
them running.
Testing and code style:
* Run ``pytest`` in the src/ directory to execute all tests. This also generates an HTML coverage
report. When running tests, paperless.conf is loaded as well. However, the tests rely on the default
configuration. This is not ideal, but for now, make sure no settings except for DEBUG are overridden when testing.
* Run ``pycodestyle`` to test your code for issues with the configured code style settings.
.. note::
The line length rule E501 is generally useful for getting multiple source files
next to each other on the screen. However, in some cases, it's just not possible
to make some lines fit, especially complicated IF cases. Append `` # NOQA: E501``
to disable this check for certain lines.
Front end development
=====================
The front end is built using Angular. I use the ``Code - OSS`` IDE for development.
In order to get started, you need ``npm``. Install the Angular CLI with
.. code:: shell-session
$ npm install -g @angular/cli
and make sure that it's on your path. Next, in the src-ui/ directory, install the
required dependencies of the project.
.. code:: shell-session
$ npm install
You can launch a development server by running
.. code:: shell-session
$ ng serve
This will automatically update whenever you save. However, in-place compilation might fail
on syntax errors, in which case you need to restart it.
By default, the development server is available on ``http://localhost:4200/`` and is configured
to access the API at ``http://localhost:8000/api/``, which is the default of the backend.
If you enabled DEBUG on the back end, several security overrides for allowed hosts, CORS and
X-Frame-Options are in place so that the front end behaves exactly as in production. This also
relies on you being logged into the back end. Without a valid session, the front end will simply
not work.
In order to build the front end and serve it as part of django, execute
.. code:: shell-session
$ ng build --prod --output-path ../src/documents/static/frontend/
This will build the front end and put it in a location from which the Django server will serve
it as static content. This way, you can verify that authentication is working.
Making a release
================
Execute the ``make-release.sh <ver>`` script.
This will test and assemble everything and also build and tag a docker image.
Extending Paperless
===================
For the most part, Paperless is monolithic, so extending it is often best
managed by way of modifying the code directly and issuing a pull request on
`GitHub`_. However, over time the project has been evolving to be a little
more "pluggable" so that users can write their own stuff that talks to it.
Paperless does not have any fancy plugin systems and probably never will. However,
some parts of the application have been designed to allow easy integration of additional
features without any modification to the base code.
.. _GitHub: https://github.com/the-paperless-project/paperless
Making custom parsers
---------------------
Paperless uses parsers to add documents to paperless. A parser is responsible for:
.. _extending-parsers:
* Retrieve the content from the original
* Create a thumbnail
* Optional: Retrieve a created date from the original
* Optional: Create an archived document from the original
Parsers
-------
Custom parsers can be added to paperless to support more file types. In order to do that,
you need to write the parser itself and announce its existence to paperless.
You can leverage Paperless' consumption model to have it consume files *other*
than ones handled by default like ``.pdf``, ``.jpg``, and ``.tiff``. To do so,
you simply follow Django's convention of creating a new app, with a few key
requirements.
.. _extending-parsers-parserspy:
parsers.py
..........
In this file, you create a class that extends
``documents.parsers.DocumentParser`` and go about implementing the three
required methods:
* ``get_thumbnail()``: Returns the path to a file we can use as a thumbnail for
this document.
* ``get_text()``: Returns the text from the document and only the text.
* ``get_date()``: If possible, this returns the date of the document, otherwise
it should return ``None``.
.. _extending-parsers-signalspy:
signals.py
..........
At consumption time, Paperless emits a ``document_consumer_declaration``
signal which your module has to react to in order to let the consumer know
whether or not it's capable of handling a particular file. Think of it like
this:
1. Consumer finds a file in the consumption directory.
2. It asks all the available parsers: *"Hey, can you handle this file?"*
3. Each parser responds with either ``None`` meaning they can't handle the
file, or a dictionary in the following format:
The parser itself must extend ``documents.parsers.DocumentParser`` and must implement the
methods ``parse`` and ``get_thumbnail``. You can provide your own implementation to
``get_date`` if you don't want to rely on paperless' default date guessing mechanisms.
.. code:: python
{
"parser": <the class name>,
"weight": <an integer>
}
class MyCustomParser(DocumentParser):
The consumer compares the ``weight`` values from all respondents and uses the
class with the highest value to consume the document. The default parser,
``RasterisedDocumentParser`` has a weight of ``0``.
def parse(self, document_path, mime_type):
# This method does not return anything. Rather, you should assign
# whatever you got from the document to the following fields:
# The content of the document.
self.text = "content"
# Optional: path to a PDF document that you created from the original.
self.archive_path = os.path.join(self.tempdir, "archived.pdf")
.. _extending-parsers-appspy:
# Optional: "created" date of the document.
self.date = get_created_from_metadata(document_path)
apps.py
.......
def get_thumbnail(self, document_path, mime_type):
# This should return the path to a thumbnail you created for this
# document.
return os.path.join(self.tempdir, "thumb.png")
This is a standard Django file, but you'll need to add some code to it to
connect your parser to the ``document_consumer_declaration`` signal.
If you encounter any issues during parsing, raise a ``documents.parsers.ParseError``.
The ``self.tempdir`` directory is a temporary directory that is guaranteed to be empty
and removed after consumption finished. You can use that directory to store any
intermediate files and also use it to store the thumbnail / archived document.
.. _extending-parsers-finally:
Finally
.......
The last step is to update ``settings.py`` to include your new module.
Eventually, this will be dynamic, but at the moment, you have to edit the
``INSTALLED_APPS`` section manually. Simply add the path to your AppConfig to
the list like this:
After that, you need to announce your parser to paperless. You need to connect a
handler to the ``document_consumer_declaration`` signal. Have a look in the file
``src/paperless_tesseract/apps.py`` on how that's done. The handler is a method
that returns information about your parser:
.. code:: python
INSTALLED_APPS = [
...
"my_module.apps.MyModuleConfig",
...
]
def myparser_consumer_declaration(sender, **kwargs):
return {
"parser": MyCustomParser,
"weight": 0,
"mime_types": {
"application/pdf": ".pdf",
"image/jpeg": ".jpg",
}
}
Order doesn't matter, but generally it's a good idea to place your module lower
in the list so that you don't end up accidentally overriding project defaults
somewhere.
* ``parser`` is a reference to a class that extends ``DocumentParser``.
* ``weight`` is used whenever two or more parsers are able to parse a file: The parser with
the higher weight wins. This can be used to override the parsers provided by
paperless.
.. _extending-parsers-example:
An Example
..........
The core Paperless functionality is based on this design, so if you want to see
what a parser module should look like, have a look at `parsers.py`_,
`signals.py`_, and `apps.py`_ in the `paperless_tesseract`_ module.
.. _parsers.py: https://github.com/the-paperless-project/paperless/blob/master/src/paperless_tesseract/parsers.py
.. _signals.py: https://github.com/the-paperless-project/paperless/blob/master/src/paperless_tesseract/signals.py
.. _apps.py: https://github.com/the-paperless-project/paperless/blob/master/src/paperless_tesseract/apps.py
.. _paperless_tesseract: https://github.com/the-paperless-project/paperless/blob/master/src/paperless_tesseract/
* ``mime_types`` is a dictionary. The keys are the mime types your parser supports and the value
is the default file extension that paperless should use when storing files and serving them for
download. We could guess that from the file extensions, but some mime types have many extensions
associated with them and the python methods responsible for guessing the extension do not always
return the same value.
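The selection rule described above (the highest ``weight`` wins among the parsers that support a mime type) can be sketched in plain Python. This is only an illustration under stated assumptions: ``pick_parser`` and the two stand-in classes are hypothetical names, not part of paperless, and how the real consumer breaks ties between equal weights is not covered here.

```python
# Stand-in classes for illustration only; real parsers extend
# documents.parsers.DocumentParser.
class DefaultParser:
    pass


class MyCustomParser:
    pass


def pick_parser(declarations, mime_type):
    """Return the parser class with the highest weight that supports mime_type.

    Hypothetical helper: `declarations` is a list of the dictionaries returned
    by the signal handlers (or None for handlers that declined).
    """
    candidates = [
        d for d in declarations
        if d is not None and mime_type in d["mime_types"]
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda d: d["weight"])["parser"]


declarations = [
    {
        "parser": DefaultParser,
        "weight": 0,
        "mime_types": {"application/pdf": ".pdf", "image/jpeg": ".jpg"},
    },
    {
        "parser": MyCustomParser,
        "weight": 10,
        "mime_types": {"application/pdf": ".pdf"},
    },
    None,  # a handler that cannot handle anything
]

chosen_pdf = pick_parser(declarations, "application/pdf")
chosen_jpeg = pick_parser(declarations, "image/jpeg")
chosen_text = pick_parser(declarations, "text/plain")
```

In this sketch the custom parser overrides the default one for PDF files only, because it declares a higher weight for that mime type and nothing else.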

106
docs/faq.rst Normal file

@@ -0,0 +1,106 @@
**************************
Frequently asked questions
**************************
**Q:** *What's the general plan for Paperless-ng?*
**A:** Paperless-ng is already almost feature-complete. This project will remain
as simple as it is right now. It will see improvements to features that are already there.
If you need advanced features such as document versions,
workflows or multi-user with customizable access to individual files, this is
not the tool for you.
Features that *are* planned are some more quality of life extensions for the searching
(i.e., search for similar documents, group results by correspondents with "more from this"
links, etc), bulk editing and hierarchical tags.
**Q:** *I'm using docker. Where are my documents?*
**A:** Your documents are stored inside the docker volume ``paperless_media``.
Docker manages this volume automatically for you. It is a persistent storage
and will persist as long as you don't explicitly delete it. The actual location
depends on your host operating system. On Linux, chances are high that this location
is
.. code::
/var/lib/docker/volumes/paperless_media/_data
.. caution::
Do not mess with this folder. Don't change permissions and don't move
files around manually. This folder is meant to be entirely managed by docker
and paperless.
**Q:** *Let's say you don't support this project anymore in a year. Can I easily move to other systems?*
**A:** Your documents are stored as plain files inside the media folder. You can always drag those files
out of that folder to use them elsewhere. Here are a couple notes about that.
* Paperless never modifies your original documents. It keeps checksums of all documents and uses a
scheduled sanity checker to check that they remain the same.
* By default, paperless uses the internal ID of each document as its filename. This might not be very
convenient for export. However, you can adjust the way files are stored in paperless by
:ref:`configuring the filename format <advanced-file_name_handling>`.
* :ref:`The exporter <utilities-exporter>` is another easy way to get your files out of paperless with reasonable file names.
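The idea behind the checksum-based sanity check mentioned above can be sketched as follows. This is not the paperless implementation: the function names are made up for illustration, and SHA-256 is an assumption, since the hash algorithm is not stated here.

```python
import hashlib
import os
import tempfile


def file_checksum(path, chunk_size=65536):
    """Hash a file in chunks so large documents need not fit in memory."""
    digest = hashlib.sha256()  # assumption: the algorithm paperless uses may differ
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_modified(stored):
    """Return the paths whose current checksum differs from the stored one."""
    return [path for path, checksum in stored.items()
            if file_checksum(path) != checksum]


# Small demonstration with a temporary file standing in for a document.
with tempfile.TemporaryDirectory() as tmp:
    doc = os.path.join(tmp, "0000001.pdf")
    with open(doc, "wb") as handle:
        handle.write(b"original content")
    stored = {doc: file_checksum(doc)}
    unchanged_report = find_modified(stored)

    with open(doc, "wb") as handle:
        handle.write(b"tampered content")
    changed_report = find_modified(stored)
    changed_names = [os.path.basename(p) for p in changed_report]
```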
**Q:** *What file types does paperless-ng support?*
**A:** Currently, the following files are supported:
* PDF documents, PNG images, JPEG images, TIFF images and GIF images are processed with OCR and converted into PDF documents.
* Plain text documents are supported as well and are added verbatim
to paperless.
Paperless determines the type of a file by inspecting its content. The
file extensions do not matter.
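Detecting a file type from its content usually means looking at the first few bytes of the file. The sketch below illustrates the idea with a small hand-written signature table; it is an assumption for illustration only and not how paperless itself performs the detection.

```python
# Well-known leading byte sequences ("magic numbers") for some file types.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"II*\x00", "image/tiff"),  # little-endian TIFF
    (b"MM\x00*", "image/tiff"),  # big-endian TIFF
]


def sniff_mime_type(data):
    """Guess a mime type from the leading bytes; the file name is never consulted."""
    for magic, mime_type in SIGNATURES:
        if data.startswith(magic):
            return mime_type
    return None


# A PDF stays a PDF no matter what extension the file carries.
pdf_guess = sniff_mime_type(b"%PDF-1.4\n...")
png_guess = sniff_mime_type(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR")
unknown_guess = sniff_mime_type(b"hello world")
```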
**Q:** *Will paperless-ng run on Raspberry Pi?*
**A:** The short answer is yes. I've tested it on a Raspberry Pi 3 B.
The long answer is that certain parts of
Paperless will run very slowly, such as the tesseract OCR. On Raspberry Pi,
try to OCR documents before feeding them into paperless so that paperless can
reuse the text. The web interface should be a lot snappier, since it runs
in your browser and paperless has to do much less work to serve the data.
.. note::
You can adjust some of the settings so that paperless uses less processing
power. See :ref:`setup-less_powerful_devices` for details.
**Q:** *How do I install paperless-ng on Raspberry Pi?*
**A:** There is no docker image for ARM available. If you know how to build
that automatically, I'm all ears. For now, you have to grab the latest release
archive from the project page and build the image yourself. The release comes
with the front end already compiled, so you don't have to do this on the Pi.
**Q:** *How do I run this on unRaid?*
**A:** Head over to `<https://github.com/selfhosters/unRAID-CA-templates>`_,
`Uli Fahrer <https://github.com/Tooa>`_ created a container template for that.
I don't exactly know how to use that though, since I don't use unRaid.
**Q:** *How do I run this on my toaster?*
**A:** I honestly don't know! As for all other devices that might be able
to run paperless, you're a bit on your own. If you can't run the docker image,
the documentation has instructions for bare metal installs. I'm running
paperless on an i3 processor from 2015 or so. This is also what I use to test
new releases with. Apart from that, I also have a Raspberry Pi, which I
occasionally build the image on and see if it works.
**Q:** *How do I proxy this with NGINX?*
.. code::
location / {
proxy_pass http://localhost:8000/;
}
And that's about it. Paperless serves everything, including static files by itself
when running the docker image. If you want to do anything fancy, you have to
install paperless bare metal.


@@ -1,131 +0,0 @@
.. _guesswork:
Guesswork
#########
During the consumption process, Paperless tries to guess some of the attributes
of the document it's looking at. To do this it uses two approaches:
.. _guesswork-naming:
File Naming
===========
Any document you put into the consumption directory will be consumed, but if
you name the file right, it'll automatically set some values in the database
for you. This is the logic the consumer follows:
1. Try to find the correspondent, title, and tags in the file name following
the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``. Note that
the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or
``YYYYMMDDZ``. The ``Z`` refers to "Zulu time" AKA "UTC".
The tags are optional, so the format ``Date - Correspondent - Title.pdf``
works as well.
2. If that doesn't work, we skip the date and try this pattern:
``Correspondent - Title - tag,tag,tag.pdf``.
3. If that doesn't work, we try to find the correspondent and title in the file
name following the pattern: ``Correspondent - Title.pdf``.
4. If that doesn't work, just assume that the name of the file is the title.
So given the above, the following examples would work as you'd expect:
* ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
* ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
* ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
* ``Another Company - Letter of Reference.jpg``
* ``Dad's Recipe for Pancakes.png``
These however wouldn't work:
* ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
* ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
* ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
* ``Another Company- Letter of Reference.jpg``
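The fallback logic described above can be sketched with a list of regular expressions that are tried in order. This is a minimal illustration, not the actual consumer code: the pattern details (for example, which characters are allowed in tags) and the ``parse_filename`` helper are assumptions.

```python
import re
from datetime import datetime, timezone

_DATE = r"(?P<created>\d{8}(?:\d{6})?Z)"
_TAGS = r"(?P<tags>[a-z0-9\-,]+)"  # assumption: lowercase tags, comma separated
_EXT = r"\.(?P<extension>\w+)"

# Most specific pattern first, plain title last.
PATTERNS = [
    re.compile(rf"^{_DATE} - (?P<correspondent>.+) - (?P<title>.+) - {_TAGS}{_EXT}$"),
    re.compile(rf"^{_DATE} - (?P<correspondent>.+) - (?P<title>.+){_EXT}$"),
    re.compile(rf"^(?P<correspondent>.+) - (?P<title>.+) - {_TAGS}{_EXT}$"),
    re.compile(rf"^(?P<correspondent>.+) - (?P<title>.+){_EXT}$"),
    re.compile(rf"^(?P<title>.+){_EXT}$"),
]


def parse_filename(name):
    """Return the attributes guessed from a file name (hypothetical helper)."""
    info = {"created": None, "correspondent": None, "title": name, "tags": []}
    for pattern in PATTERNS:
        match = pattern.match(name)
        if not match:
            continue
        groups = match.groupdict()
        info["correspondent"] = groups.get("correspondent")
        info["title"] = groups["title"]
        if groups.get("tags"):
            info["tags"] = groups["tags"].split(",")
        raw = groups.get("created")
        if raw:
            # 15 characters means date and time, otherwise date only.
            fmt = "%Y%m%d%H%M%SZ" if len(raw) == 15 else "%Y%m%dZ"
            info["created"] = datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc)
        break
    return info


full = parse_filename(
    "20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf"
)
letter = parse_filename("Another Company - Letter of Reference.jpg")
recipe = parse_filename("Dad's Recipe for Pancakes.png")
broken = parse_filename("Another Company- Letter of Reference.jpg")
```

In this sketch the last call falls through to the title-only pattern because the separator is not exactly ``" - "``, which mirrors why the final entry in the list of non-working names above is not split into correspondent and title.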
Do I have to be so strict about naming?
---------------------------------------
Rather than using the strict document naming rules, one can also set the option
``PAPERLESS_FILENAME_DATE_ORDER`` in ``paperless.conf`` to any date order
that is accepted by dateparser_. Doing so will cause ``paperless`` to default
to any date format that is found in the title, instead of a date pulled from
the document's text, without requiring the strict formatting of the document
filename as described above.
.. _dateparser: https://github.com/scrapinghub/dateparser/blob/v0.7.0/docs/usage.rst#settings
Transforming filenames for parsing
----------------------------------
Some devices can't produce filenames that can be parsed by the default
parser. By configuring the option ``PAPERLESS_FILENAME_PARSE_TRANSFORMS`` in
``paperless.conf`` one can add transformations that are applied to the filename
before it's parsed.
The option contains a list of dictionaries of regular expressions (key:
``pattern``) and replacements (key: ``repl``) in JSON format, which are
applied in order by passing them to ``re.subn``. Transformation stops
after the first match, so at most one transformation is applied. The general
syntax is
.. code:: python
[{"pattern":"pattern1", "repl":"repl1"}, {"pattern":"pattern2", "repl":"repl2"}, ..., {"pattern":"patternN", "repl":"replN"}]
The example below is for a Brother ADS-2400N, a scanner that allows
assigning different names to different hardware buttons (useful for handling
multiple entities in one instance), but insists on adding ``_<count>``
to the filename.
.. code:: python
# Brother profile configuration, support "Name_Date_Count" (the default
# setting) and "Name_Count" (use "Name" as tag and "Count" as title).
PAPERLESS_FILENAME_PARSE_TRANSFORMS=[{"pattern":"^([a-z]+)_(\\d{8})_(\\d{6})_([0-9]+)\\.", "repl":"\\2\\3Z - \\4 - \\1."}, {"pattern":"^([a-z]+)_([0-9]+)\\.", "repl":" - \\2 - \\1."}]
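The way such a list of transformations is applied can be sketched as follows: each pattern is passed to ``re.subn`` in order, and processing stops at the first one that matches. The ``apply_transforms`` helper and the sample file names are assumptions made for illustration; the two rules are the ones from the Brother example.

```python
import json
import re

# The same two rules as in the Brother example, as a JSON string.
RAW_TRANSFORMS = (
    r'[{"pattern":"^([a-z]+)_(\\d{8})_(\\d{6})_([0-9]+)\\.", '
    r'"repl":"\\2\\3Z - \\4 - \\1."}, '
    r'{"pattern":"^([a-z]+)_([0-9]+)\\.", "repl":" - \\2 - \\1."}]'
)
TRANSFORMS = json.loads(RAW_TRANSFORMS)


def apply_transforms(filename, transforms):
    """Apply the first matching transformation, or return the name unchanged."""
    for transform in transforms:
        new_name, count = re.subn(transform["pattern"], transform["repl"], filename)
        if count > 0:
            return new_name  # stop after the first rule that matched
    return filename


with_date = apply_transforms("invoices_20200101_120000_0001.pdf", TRANSFORMS)
without_date = apply_transforms("taxes_0042.pdf", TRANSFORMS)
untouched = apply_transforms("Some Company - Letter.pdf", TRANSFORMS)
```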
.. _guesswork-content:
Reading the Document Contents
=============================
After the consumer has tried to figure out what it could from the file name,
it starts looking at the content of the document itself. It will compare the
matching algorithms defined by every tag and correspondent already set in your
database to see if they apply to the text in that document. In other words,
if you defined a tag called ``Home Utility`` that had a ``match`` property of
``bc hydro`` and a ``matching_algorithm`` of ``literal``, Paperless will
automatically tag your newly-consumed document with your ``Home Utility`` tag
so long as the text ``bc hydro`` appears in the body of the document somewhere.
The matching logic is quite powerful, and supports searching the text of your
document with different algorithms, and as such, some experimentation may be
necessary to get things Just Right.
.. _guesswork-content-howto:
How Do I Set Up These Matching Algorithms?
------------------------------------------
Setting up of the algorithms is easily done through the admin interface. When
you create a new correspondent or tag, there are optional fields for matching
text and matching algorithm. From the help info there:
.. note::
Which algorithm you want to use when matching text to the OCR'd PDF. Here,
"any" looks for any occurrence of any word provided in the PDF, while "all"
requires that every word provided appear in the PDF, albeit not in the
order provided. A "literal" match means that the text you enter must
appear in the PDF exactly as you've entered it, and "regular expression"
uses a regex to match the PDF. If you don't know what a regex is, you
probably don't want this option.
When using the "any" or "all" matching algorithms, you can search for terms
that consist of multiple words by enclosing them in double quotes. For example,
defining a match text of ``"Bank of America" BofA`` using the "any" algorithm,
will match documents that contain either "Bank of America" or "BofA", but will
not match documents containing "Bank of South America".
Then just save your tag/correspondent and run another document through the
consumer. Once complete, you should see the newly-created document,
automatically tagged with the appropriate data.
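The four matching algorithms described above can be sketched like this. It is an illustration only: the ``matches`` helper, the whole-word matching and the case-insensitive comparison are assumptions, not a description of the code paperless actually runs.

```python
import re


def split_terms(match_text):
    """Split match text into terms, keeping double-quoted phrases together."""
    pairs = re.findall(r'"([^"]+)"|(\S+)', match_text)
    return [quoted or bare for quoted, bare in pairs]


def _contains(term, text):
    # Assumption: terms match on word boundaries, ignoring case.
    return re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE) is not None


def matches(text, match_text, algorithm):
    """Decide whether a document text matches (hypothetical helper)."""
    if algorithm == "any":
        return any(_contains(term, text) for term in split_terms(match_text))
    if algorithm == "all":
        return all(_contains(term, text) for term in split_terms(match_text))
    if algorithm == "literal":
        return _contains(match_text, text)
    if algorithm == "regular expression":
        return re.search(match_text, text, re.IGNORECASE) is not None
    raise ValueError("unknown algorithm: " + algorithm)


rule = '"Bank of America" BofA'
hit_phrase = matches("Statement from Bank of America", rule, "any")
hit_short = matches("Your BofA account", rule, "any")
miss = matches("Bank of South America", rule, "any")
all_words = matches("invoices and money", "money invoices", "all")
literal = matches("Your BC Hydro bill", "bc hydro", "literal")
```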


@@ -1,17 +1,14 @@
.. _index:
*********
Paperless
=========
*********
Paperless is a simple Django application running in two parts:
a :ref:`consumer <utilities-consumer>` (the thing that does the indexing) and
the :ref:`webserver <utilities-webserver>` (the part that lets you search &
a *Consumer* (the thing that does the indexing) and
the *Web server* (the part that lets you search &
download already-indexed documents). If you want to learn more about its
functions keep on reading after the installation section.
.. _index-why-this-exists:
Why This Exists
===============
@@ -25,22 +22,54 @@ finding stuff again. I feed documents right from the post box into the scanner
and then shred them. Perhaps you might find it useful too.
Paperless-ng
============
Paperless-ng is a fork of the original paperless project. It changes many
things both on the surface and under the hood. Paperless-ng was created
because I feel that these changes are too big to be pushed into the main
repository right away.
NG stands for both Angular (the framework used for the
Frontend) and next-gen. Publishing this project under a different name also
avoids confusion between paperless and paperless-ng.
If you want to learn about what's different in paperless-ng, check out these
resources in the documentation:
* :ref:`Some screenshots <screenshots>` of the new UI are available.
* Read :ref:`this section <advanced-automatic_matching>` if you want to
learn about how paperless automates all tagging using machine learning.
* Paperless now comes with a :ref:`proper email consumer <usage-email>`
that's fully tested and production ready.
* Paperless creates searchable PDF/A documents from whatever you put into
the consumption directory. This means that you can select text in
image-only documents coming from your scanner.
* See :ref:`this note <utilities-encyption>` about GnuPG encryption in
paperless-ng.
* Paperless is now integrated with a
:ref:`task processing queue <setup-task_processor>` that tells you
at a glance when and why something is not working.
* The :ref:`changelog <paperless_changelog>` contains a detailed list of all changes
in paperless-ng.
It would be great if this project could eventually merge back into the main
repository, but it needs a lot more work before that can happen.
Contents
========
.. toctree::
:maxdepth: 2
:maxdepth: 1
requirements
setup
consumption
usage_overview
advanced_usage
administration
configuration
api
utilities
guesswork
migrating
customising
faq
extending
troubleshooting
contributing


@@ -1,109 +0,0 @@
.. _migrating:
Migrating, Updates, and Backups
===============================
As Paperless is still under active development, there's a lot that can change
as software updates roll out. You should backup often, so if anything goes
wrong during an update, you at least have a means of restoring to something
usable. Thankfully, there are automated ways of backing up, restoring, and
updating the software.
.. _migrating-backup:
Backing Up
----------
So you're bored of this whole project, or you want to make a remote backup of
your files for whatever reason. This is easy to do, simply use the
:ref:`exporter <utilities-exporter>` to dump your documents and database out
into an arbitrary directory.
.. _migrating-restoring:
Restoring
---------
Restoring your data is just as easy, since nearly all of your data exists either
in the file names, or in the contents of the files themselves. You just need to
create an empty database (just follow the
:ref:`installation instructions <setup-installation>` again) and then import the
``tags.json`` file you created as part of your backup. Lastly, copy your
exported documents into the consumption directory and start up the consumer.
.. code-block:: shell-session
$ cd /path/to/project
$ rm data/db.sqlite3 # Delete the database
$ cd src
$ ./manage.py migrate # Create the database
$ ./manage.py createsuperuser
$ ./manage.py loaddata /path/to/arbitrary/place/tags.json
$ cp /path/to/exported/docs/* /path/to/consumption/dir/
$ ./manage.py document_consumer
Importing your data if you are :ref:`using Docker <setup-installation-docker>`
is almost as simple:
.. code-block:: shell-session
# Stop and remove your current containers
$ docker-compose stop
$ docker-compose rm -f
# Recreate them, add the superuser
$ docker-compose up -d
$ docker-compose run --rm webserver createsuperuser
# Load the tags
$ cat /path/to/arbitrary/place/tags.json | docker-compose run --rm webserver loaddata_stdin -
# Load your exported documents into the consumption directory
# (How you do this highly depends on how you have set this up)
$ cp /path/to/exported/docs/* /path/to/mounted/consumption/dir/
After loading the documents into the consumption directory the consumer will
immediately start consuming the documents.
.. _migrating-updates:
Updates
-------
For the most part, all you have to do to update Paperless is run ``git pull``
on the directory containing the project files, and then use Django's
``migrate`` command to execute any database schema updates that might have been
rolled in as part of the update:
.. code-block:: shell-session
$ cd /path/to/project
$ git pull
$ pip install -r requirements.txt
$ cd src
$ ./manage.py migrate
Note that it's possible (even likely) that while ``git pull`` may update some
files, the ``migrate`` step may not update anything. This is totally normal.
Additionally, as new features are added, the ability to control those features
is typically added by way of an environment variable set in ``paperless.conf``.
You may want to take a look at the ``paperless.conf.example`` file to see if
there's anything new in there compared to what you've got in ``/etc``.
If you are :ref:`using Docker <setup-installation-docker>` the update process
is similar:
.. code-block:: shell-session
$ cd /path/to/project
$ git pull
$ docker build -t paperless .
$ docker-compose run --rm consumer migrate
$ docker-compose up -d
If ``git pull`` doesn't report any changes, there is no need to continue with
the remaining steps.


@@ -1,125 +0,0 @@
.. _requirements:
Requirements
============
You need a Linux machine or Unix-like setup (theoretically an Apple machine
should work) that has the following software installed:
* `Python3`_ (with development libraries, pip and virtualenv)
* `GNU Privacy Guard`_
* `Tesseract`_, plus its language files matching your document base.
* `Imagemagick`_ version 6.7.5 or higher
* `unpaper`_
* `libpoppler-cpp-dev`_ PDF rendering library
* `optipng`_
.. _Python3: https://python.org/
.. _GNU Privacy Guard: https://gnupg.org
.. _Tesseract: https://github.com/tesseract-ocr
.. _Imagemagick: http://imagemagick.org/
.. _unpaper: https://github.com/unpaper/unpaper
.. _libpoppler-cpp-dev: https://poppler.freedesktop.org/
.. _optipng: http://optipng.sourceforge.net/
Notably, you should confirm how you access your Python3 installation. Many
Linux distributions will install Python3 in parallel to Python2, using the
names ``python3`` and ``python`` respectively. The same goes for ``pip3`` and
``pip``. Running Paperless with Python2 will likely break things, so make sure
that you're using the right version.
For the purposes of simplicity, ``python`` and ``pip`` is used everywhere to
refer to their Python3 versions.
In addition to the above, there are a number of Python requirements, all of
which are listed in a file called ``requirements.txt`` in the project root
directory.
If you're not working on a virtual environment (like Docker), you
should probably be using a virtualenv, but that's your call. The reasons why
you might choose a virtualenv or not aren't really within the scope of this
document. Needless to say if you don't know what a virtualenv is, you should
probably figure that out before continuing.
.. _requirements-apple:
Problems with Imagemagick & PDFs
--------------------------------
Some users have `run into problems`_ with getting ImageMagick to do its thing
with PDFs. Often this is the case with Apple systems using HomeBrew, but other
Linuxes have been a problem as well. The solution appears to be to install
ghostscript as well as ImageMagick:
.. _run into problems: https://github.com/the-paperless-project/paperless/issues/25
.. code:: bash
$ brew install ghostscript
$ brew install imagemagick
$ brew install libmagic
.. _requirements-baremetal:
Python-specific Requirements: No Virtualenv
-------------------------------------------
If you don't care to use a virtual env, then installation of the Python
dependencies is easy:
.. code:: bash
$ pip install --user --requirement /path/to/paperless/requirements.txt
This will download and install all of the requirements into
``${HOME}/.local``. Remember that your distribution may be using ``pip3`` as
mentioned above.
.. _requirements-virtualenv:
Python-specific Requirements: Virtualenv
----------------------------------------
Using a virtualenv for this is pretty straightforward: create a virtualenv,
enter it, and install the requirements using the ``requirements.txt`` file:
.. code:: bash
$ virtualenv --python=/path/to/python3 /path/to/arbitrary/directory
$ . /path/to/arbitrary/directory/bin/activate
$ pip install --requirement /path/to/paperless/requirements.txt
Now you're ready to go. Just remember to enter (activate) your virtualenv
whenever you want to use Paperless.
.. _requirements-documentation:
Documentation
-------------
As generation of the documentation is not required for the use of Paperless,
dependencies for this process are not included in ``requirements.txt``. If
you'd like to generate your own docs locally, you'll need to:
.. code:: bash
$ pip install sphinx
and then cd into the ``docs`` directory and type ``make html``.
If you are using Docker, you can use the following commands to build the
documentation and run a webserver serving it on `port 8001`_:
.. code:: bash
$ pwd
/path/to/paperless
$ docker build -t paperless:docs -f docs/Dockerfile .
$ docker run --rm -it -p "8001:8000" paperless:docs
.. _port 8001: http://127.0.0.1:8001


@@ -1,12 +1,14 @@
.. _scanners:
Scanner Recommendations
=======================
***********************
Scanner recommendations
***********************
As Paperless operates by watching a folder for new files, doesn't care what
scanner you use, but sometimes finding a scanner that will write to an FTP,
NFS, or SMB server can be difficult. This page is here to help you find one
that works right for you based on recommentations from other Paperless users.
that works right for you based on recommendations from other Paperless users.
+---------+----------------+-----+-----+-----+----------------+
| Brand | Model | Supports | Recommended By |
@@ -25,6 +27,8 @@ that works right for you based on recommentations from other Paperless users.
+---------+----------------+-----+-----+-----+----------------+
| Epson | `WF-7710DWF`_ | yes | | yes | `Skylinar`_ |
+---------+----------------+-----+-----+-----+----------------+
| Fujitsu | `S1300i`_ | yes | | yes | `jonaswinkler`_|
+---------+----------------+-----+-----+-----+----------------+
.. _ADS-1500W: https://www.brother.ca/en/p/ads1500w
.. _MFC-J6930DW: https://www.brother.ca/en/p/MFCJ6930DW
@@ -32,6 +36,7 @@ that works right for you based on recommentations from other Paperless users.
.. _MFC-9142CDN: https://www.brother.co.uk/printers/laser-printers/mfc9140cdn
.. _ix500: http://www.fujitsu.com/us/products/computing/peripheral/scanners/scansnap/ix500/
.. _WF-7710DWF: https://www.epson.de/en/products/printers/inkjet-printers/for-home/workforce-wf-7710dwf
.. _S1300i: https://www.fujitsu.com/global/products/computing/peripheral/scanners/soho/s1300i/
.. _danielquinn: https://github.com/danielquinn
.. _ayounggun: https://github.com/ayounggun
@@ -39,3 +44,4 @@ that works right for you based on recommentations from other Paperless users.
.. _eonist: https://github.com/eonist
.. _REOLDEV: https://github.com/REOLDEV
.. _Skylinar: https://github.com/Skylinar
.. _jonaswinkler: https://github.com/jonaswinkler


@@ -1,16 +1,45 @@
.. _screenshots:
***********
Screenshots
===========
***********
Once everything is set-up login to paperless using the web front-end
This is what paperless-ng looks like. You shouldn't use paperless to index
research papers though, it's a horrible tool for that job.
.. image:: ./_static/Screenshot_first_run_login.png
The dashboard shows customizable views on your documents and allows document uploads:
Nice clean interface
.. image:: _static/screenshots/dashboard.png
.. image:: ./_static/Screenshot_first_logged.png
The document list provides three different styles to scroll through your documents:
Some documents loaded in via ftp or using the scanners ftp.
.. image:: _static/screenshots/documents-table.png
.. image:: _static/screenshots/documents-smallcards.png
.. image:: _static/screenshots/documents-largecards.png
Extensive filtering mechanisms:
.. image:: _static/screenshots/documents-filter.png
Side-by-side editing of documents. Optimized for 1080p.
.. image:: _static/screenshots/editing.png
Tag editing. This looks about the same for correspondents and document types.
.. image:: _static/screenshots/new-tag.png
Searching provides auto complete and highlights the results.
.. image:: _static/screenshots/search-preview.png
.. image:: _static/screenshots/search-results.png
Fancy mail filters!
.. image:: _static/screenshots/mail-rules-edited.png
Mobile support in the future? This kinda works, however some layouts are still
too wide.
.. image:: _static/screenshots/mobile.png
.. image:: ./_static/Screenshot_upload_and_scanned.png

File diff suppressed because it is too large


@@ -1,75 +1,51 @@
.. _troubleshooting:
***************
Troubleshooting
===============
***************
.. _troubleshooting-languagemissing:
No files are added by the consumer
##################################
Consumer warns ``OCR for XX failed``
------------------------------------
Check for the following issues:
If you find the OCR accuracy to be too low, and/or the document consumer warns
that ``OCR for XX failed, but we're going to stick with what we've got since
FORGIVING_OCR is enabled``, then you might need to install the
`Tesseract language files <http://packages.ubuntu.com/search?keywords=tesseract-ocr>`_
matching your document's languages.
* Ensure that the directory you're putting your documents in is the folder
paperless is watching. With docker, this setting is performed in the
``docker-compose.yml`` file. Without docker, look at the ``CONSUMPTION_DIR``
setting. Don't adjust this setting if you're using docker.
* Ensure that redis is up and running. Paperless does its task processing
asynchronously, and for documents to arrive at the task processor, it needs
redis to run.
* Ensure that the task processor is running. Docker does this automatically.
Manually invoke the task processor by executing
As an example, if you are running Paperless from any Ubuntu or Debian
box, and your documents are written in Spanish you may need to run::
.. code:: shell-session
apt-get install -y tesseract-ocr-spa
$ python3 manage.py qcluster
* Look at the output of paperless and inspect it for any errors.
* Go to the admin interface, and check if there are failed tasks. If so, the
tasks will contain an error message.
.. _troubleshooting-convertpixelcache:
Consumer fails to pickup any new files
######################################
Consumer dies with ``convert: unable to extent pixel cache``
------------------------------------------------------------
If you notice that the consumer will only pickup files in the consumption
directory at startup, but won't find any other files added later, check out
the configuration file and enable filesystem polling with the setting
``PAPERLESS_CONSUMER_POLLING``.
During the consumption process, Paperless invokes ImageMagick's ``convert``
program to translate the source document into something that the OCR engine can
understand and this can burn a Very Large amount of memory if the original
document is rather long. Similarly, if your system doesn't have a lot of
memory to begin with (e.g. a Raspberry Pi), then this can happen for even
medium-sized documents.
Operation not permitted
#######################
The solution is to tell ImageMagick *not* to Use All The RAM, as is its
default, and instead tell it to use a fixed amount. ``convert`` will then
break up the job into hundreds of individual files and use them to slowly
compile the finished image. Simply set ``PAPERLESS_CONVERT_MEMORY_LIMIT`` in
``/etc/paperless.conf`` to something like ``32000000`` and you'll limit
``convert`` to 32MB. Fiddle with this value as you like.
You might see errors such as:
**HOWEVER**: Simply setting this value may not be enough on systems where
``/tmp`` is mounted as tmpfs, as this is where ``convert`` will write its
temporary files. In these cases (most Systemd machines), you need to tell
ImageMagick to use a different space for its scratch work. You do this by
setting ``PAPERLESS_CONVERT_TMPDIR`` in ``/etc/paperless.conf`` to somewhere
that's actually on a physical disk (and writable by the user running
Paperless), like ``/var/tmp/paperless`` or ``/home/my_user/tmp`` in a pinch.
.. code::
chown: changing ownership of '../export': Operation not permitted
.. _troubleshooting-decompressionbombwarning:
The container tries to set file ownership on the listed directories. This is
required so that the user running paperless inside docker has write permissions
to these folders. This happens when pointing these directories to NFS shares,
for example.
DecompressionBombWarning and/or no text in the OCR output
---------------------------------------------------------
Some users have had issues using Paperless to consume PDFs that were created
by merging Very Large Scanned Images into one PDF. If this happens to you,
it's likely because the PDF you've created contains some very large pages
(millions of pixels) and the process of converting the PDF to an OCR-friendly
image is exploding.
Typically, this happens because the scanned images are created with a high
DPI and then rolled into the PDF with an assumed DPI of 72 (the default).
The best solution then is to specify the DPI used in the scan in the
conversion-to-PDF step. So for example, if you scanned the original image
with a DPI of 300, then merging the images into the single PDF with
``convert`` should look like this:
.. code:: bash
$ convert -density 300 *.jpg finished.pdf
For more information on this and situations like it, you should take a look
at `Issue #118`_ as that's where this tip originated.
.. _Issue #118: https://github.com/the-paperless-project/paperless/issues/118
Ensure that ``chown`` is possible on these directories.

403
docs/usage_overview.rst Normal file

@@ -0,0 +1,403 @@
**************
Usage Overview
**************
Paperless is an application that manages your personal documents. With
the help of a document scanner (see :ref:`scanners`), paperless transforms
your unwieldy physical document binders into a searchable archive and
provides many utilities for finding and managing your documents.
Terms and definitions
#####################
Paperless essentially consists of two different parts for managing your
documents:
* The *consumer* watches a specified folder and adds all documents in that
folder to paperless.
* The *web server* provides a UI that you use to manage and search for your
scanned documents.
Each document has a number of fields that you can assign to it:
* A *Document* is a piece of paper that sometimes contains valuable
information.
* The *correspondent* of a document is the person, institution or company that
a document either originates from, or is sent to.
* A *tag* is a label that you can assign to documents. Think of tags as more
powerful folders: Multiple documents can be grouped together with a single
tag; however, a single document can also have multiple tags. This is not
possible with folders. The reason folders are not implemented in paperless
is simply that tags are much more versatile than folders.
* A *document type* is used to demarcate the type of a document such as letter,
bank statement, invoice, contract, etc. It is used to identify what a document
is about.
* The *date added* of a document is the date the document was scanned into
paperless. You cannot and should not change this date.
* The *date created* of a document is the date the document was initially issued.
This can be the date you bought a product, the date you signed a contract, or
the date a letter was sent to you.
* The *archive serial number* (short: ASN) of a document is the identifier of
the document in your physical document binders. See
:ref:`usage-recommended_workflow` below.
* The *content* of a document is the text that was OCR'ed from the document.
This text is fed into the search engine and is used for matching tags,
correspondents and document types.
Frontend overview
#################
.. warning::
TBD. Add some fancy screenshots!
Adding documents to paperless
#############################
Once you've got Paperless set up, you need to start feeding documents into it.
When adding documents to paperless, it will perform the following operations on
your documents:
1. OCR the document, if it has no text. Digital documents usually have text,
and this step will be skipped for those documents.
2. Paperless will create an archivable PDF/A document from your document.
If this document is coming from your scanner, it will have embedded selectable text.
3. Paperless performs automatic matching of tags, correspondents and types on the
document before storing it in the database.
.. hint::
This process can be configured to fit your needs. If you don't want paperless
to create archived versions for digital documents, you can turn that off by
setting ``PAPERLESS_OCR_MODE=skip_noarchive``. Please read the
:ref:`relevant section in the documentation <configuration-ocr>`.
.. note::
No matter which options you choose, Paperless will always store the original
document that it found in the consumption directory or in the mail and
will never overwrite that document. Archived versions are stored alongside the
original versions.
The consumption directory
=========================
The primary method of getting documents into your database is by putting them in
the consumption directory. The consumer runs in an infinite
loop looking for new additions to this directory and when it finds them, it goes
about the process of parsing them with the OCR, indexing what it finds, and storing
it in the media directory.
Getting stuff into this directory is up to you. If you're running Paperless
on your local computer, you might just want to drag and drop files there, but if
you're running this on a server and want your scanner to automatically push
files to this directory, you'll need to set up some sort of service to accept the
files from the scanner. Typically, you're looking at an FTP server like
`Proftpd`_ or a Windows folder share with `Samba`_.
.. _Proftpd: http://www.proftpd.org/
.. _Samba: http://www.samba.org/
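
If a script or scanner hook writes into the consumption directory, it can help
to make the hand-off atomic so that the consumer never sees a half-written
file. The following is only a sketch: the helper name ``drop_into_consume``,
the ``CONSUME_DIR`` variable and the sibling ``.staging`` directory are made up
for illustration, and the staging directory is assumed to be on the same
filesystem as the consumption directory.

.. code:: bash

    # Assumed location; use the consumption directory of your own setup.
    CONSUME_DIR="${CONSUME_DIR:-/path/to/consume}"

    # Hypothetical helper: hand a finished scan to the consumer.
    drop_into_consume() {
        src="$1"
        # Sibling directory, assumed to be on the same filesystem.
        staging="${CONSUME_DIR}.staging"
        mkdir -p "$staging" || return 1
        # Copy first, then rename. A rename within one filesystem is atomic,
        # so the file appears in the consumption directory all at once.
        cp -- "$src" "$staging/" || return 1
        mv -- "$staging/$(basename -- "$src")" "$CONSUME_DIR/"
    }

Calling ``drop_into_consume /path/to/scan.pdf`` would then place the complete
file into the consumption directory in a single step.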
.. TODO: hyperref to configuration of the location of this magic folder.
Dashboard upload
================
The dashboard has a file drop field to upload documents to paperless. Simply drag a file
onto this field or select a file with the file dialog. Multiple files are supported.
Mobile upload
=============
The mobile app over at `<https://github.com/qcasey/paperless_share>`_ allows Android users
to share any documents with paperless. This can be combined with any of the mobile
scanning apps out there, such as Office Lens.
Furthermore, there is the `Paperless App <https://github.com/bauerj/paperless_app>`_ as well,
which not only has document upload, but also document editing and browsing.
.. _usage-email:
IMAP (Email)
============
You can tell paperless-ng to consume documents from your email accounts.
This is a very flexible and powerful feature if you regularly receive documents
via mail that you need to archive. The mail consumer can be configured by using the
admin interface in the following manner:
1. Define e-mail accounts.
2. Define mail rules for your account.
These rules perform the following:
1. Connect to the mail server.
2. Fetch all matching mails (as defined by folder, maximum age and the filters)
3. Check if there are any consumable attachments.
4. If so, instruct paperless to consume the attachments and optionally
use the metadata provided in the rule for the new document.
5. If documents were consumed from a mail, the rule action is performed
on that mail.
Paperless will completely ignore mails that do not match your filters. It will also
only perform the action on mails that it has consumed documents from.
The actions all ensure that the same mail is not consumed twice by different means.
These are as follows:
* **Delete:** Immediately deletes mail that paperless has consumed documents from.
Use with caution.
* **Mark as read:** Mark consumed mail as read. Paperless will not consume documents
from already read mails. If you read a mail before paperless sees it, it will be
ignored.
* **Flag:** Sets the 'important' flag on mails with consumed documents. Paperless
will not consume flagged mails.
* **Move to folder:** Moves consumed mails out of the way so that paperless won't
consume them again.
.. caution::
The mail consumer will perform these actions on all mails it has consumed
documents from. Keep in mind that the actual consumption process may fail
for some reason, leaving you with missing documents in paperless.
.. note::
With the correct set of rules, you can completely automate your email documents.
Create rules for every correspondent you receive digital documents from and
paperless will read them automatically. The default action "mark as read" is
pretty tame and will not cause any damage or data loss whatsoever.
You can also set up a special folder in your mail account for paperless and use
your favorite mail client to move to-be-consumed mails into that folder
automatically or manually and tell paperless to move them to yet another folder
after consumption. It's up to you.
.. note::
Paperless will process the rules in the order defined in the admin page.
You can define catch-all rules and have them executed last to consume
any documents not matched by previous rules. Such a rule may assign an "Unknown
mail document" tag to consumed documents so you can inspect them further.
Paperless is set up to check your mails every 10 minutes. This can be configured on the
'Scheduled tasks' page in the admin.
REST API
========
You can also submit a document using the REST API, see :ref:`api-file_uploads` for details.
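
As a rough sketch, an upload from the command line could look like the
following. The endpoint path, the ``document`` form field and the use of basic
authentication are assumptions made for this example, as are the
``PAPERLESS_URL``, ``PAPERLESS_USER`` and ``PAPERLESS_PASSWORD`` variables and
the ``upload_document`` helper. The API documentation linked above is
authoritative.

.. code:: bash

    # Assumed values; adjust the URL and credentials to your installation.
    PAPERLESS_URL="${PAPERLESS_URL:-http://localhost:8000}"

    # Hypothetical helper: send one file as multipart form data.
    upload_document() {
        file="$1"
        # --fail makes curl return a non-zero status on HTTP errors.
        curl --fail --silent --show-error \
            --user "${PAPERLESS_USER}:${PAPERLESS_PASSWORD}" \
            --form "document=@${file}" \
            "${PAPERLESS_URL}/api/documents/post_document/"
    }

A call such as ``upload_document ~/scans/invoice.pdf`` would then hand the file
to paperless for consumption, just as if it had been dropped into the
consumption directory.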
.. _basic-searching:
Best practices
##############
Paperless offers a couple tools that help you organize your document collection. However,
it is up to you to use them in a way that helps you organize documents and find specific
documents when you need them. This section offers a couple ideas for managing your collection.
Document types allow you to classify documents according to what they are. You can define
types such as "Receipt", "Invoice", or "Contract". If you used to collect all your receipts
in a single binder, you can recreate that system in paperless by defining a document type,
assigning documents to that type and then filtering by that type to only see all receipts.
Not all documents need document types. Sometimes it's hard to determine what the type of a
document is or it is hard to justify creating a document type that you only need once or twice.
This is okay. As long as the types you define help you organize your collection in the way
you want, paperless is doing its job.
Tags can be used in many different ways. Think of tags as more versatile folders or binders.
If you have a binder for documents related to university / your car or health care, you can
create these binders in paperless by creating tags and assigning them to relevant documents.
Just as with document types, you can filter the document list by tags and only see documents of
a certain topic.
With physical documents, you'll often need to decide which folder the document belongs to.
The advantage of tags over folders and binders is that a single document can have multiple
tags. A physical document cannot magically appear in two different folders, but with tags,
this is entirely possible.
.. hint::
This can be used in many different ways. One example: Imagine you're working on a particular
task, such as signing up for university. Usually you'll need to collect a bunch of different
documents that are already sorted into various folders. With the tag system of paperless,
you can create a new group of documents that are relevant to this task without destroying
the already existing organization. When you're done with the task, you could delete the
tag again, which would be equivalent to sorting the documents back into the folders they belong in.
Or keep the tag, up to you.
All of the logic above applies to correspondents as well. Attach them to documents if you
feel that they help you organize your collection.
When you've started organizing your documents, create a couple saved views for document collections
you regularly access. This is equivalent to having labeled physical binders on your desk, except
that these saved views are dynamic and simply update themselves as you add documents to the system.
Here are a couple examples of tags and types that you could use in your collection.
* An ``inbox`` tag for newly added documents that you haven't manually edited yet.
* A tag ``car`` for everything car related (repairs, registration, insurance, etc)
* A tag ``todo`` for documents that you still need to do something with, such as reply, or
perform some task online.
* A tag ``bank account x`` for all bank statements related to that account.
* A tag ``mail`` for anything that you added to paperless via its mail processing capabilities.
* A tag ``missing_metadata`` when you still need to add some metadata to a document, but can't
or don't want to do this right now.
Searching
#########
Paperless offers an extensive searching mechanism that is designed to allow you to quickly
find a document you're looking for (for example, that thing that just broke and that you bought
a couple months ago, or that contract you signed 8 years ago).
When you search paperless for a document, it tries to match this query against your documents.
Paperless will look for matching documents by inspecting their content, title, correspondent,
type and tags. Paperless returns a scored list of results, so that documents matching your query
better will appear further up in the search results.
By default, paperless returns only documents which contain all words typed in the search bar.
However, paperless also offers advanced search syntax if you want to drill down the results
further.
Matching documents with logical expressions:
.. code::
shopname AND (product1 OR product2)
Matching specific tags, correspondents or types:
.. code::
type:invoice tag:unpaid
correspondent:university certificate
Matching dates:
.. code::
created:[2005 to 2009]
added:yesterday
modified:today
Matching inexact words:
.. code::
produ*name
.. note::
Inexact terms are hard for search indexes. These queries might take a while to execute. That's why paperless offers
auto complete and query correction.
All of these constructs can be combined as you see fit.
Paperless uses Whoosh's default query language. If you want to learn more about it,
head over to `Whoosh query language <https://whoosh.readthedocs.io/en/latest/querylang.html>`_.
For details on what date parsing utilities are available, see
`Date parsing <https://whoosh.readthedocs.io/en/latest/dates.html#parsing-date-queries>`_.
.. _usage-recommended_workflow:
The recommended workflow
########################
Once you have familiarized yourself with paperless and are ready to use it
for all your documents, the recommended workflow for managing your documents
is as follows. This workflow also takes into account that some documents
have to be kept in physical form, but still ensures that you get all the
advantages for these documents as well.
The following diagram shows how easy it is to manage your documents.
.. image:: _static/recommended_workflow.png
Preparations in paperless
=========================
* Create an inbox tag that gets assigned to all new documents.
* Create a TODO tag.
Processing of the physical documents
====================================
Keep a physical inbox. Whenever you receive a document that you need to
archive, put it into your inbox. Regularly, do the following for all documents
in your inbox:
1. For each document, decide if you need to keep the document in physical
form. This applies to certain important documents, such as contracts and
certificates.
2. If you need to keep the document, write a running number on the document
before scanning, starting at one and counting upwards. This is the archive
serial number, or ASN in short.
3. Scan the document.
4. If the document has an ASN assigned, store it in a *single* binder, sorted
by ASN. Don't order this binder in any other way.
5. If the document has no ASN, throw it away. Yay!
Over time, you will notice that your physical binder will fill up. If it is
full, label the binder with the range of ASNs in this binder (e.g., "Documents
1 to 343"), store the binder in your cellar or elsewhere, and start a new
binder.
The idea behind this process is that you will never have to use the physical
binders to find a document. If you need a specific physical document, you
may find this document by:
1. Searching in paperless for the document.
2. Identifying the ASN of the document, since it appears on the scan.
3. Grabbing the relevant document binder and getting the document. This is easy since
they are sorted by ASN.
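
If you keep a small list of your binder labels, finding the right binder for an
ASN becomes a simple range lookup. The sketch below is purely illustrative: the
``BINDERS`` list, its "first last label" line format and the ``find_binder``
helper are all invented for this example and are not part of paperless.

.. code:: bash

    # Hypothetical binder index: one "first last label" line per binder.
    BINDERS='1 343 Binder A
    344 702 Binder B'

    find_binder() {
        asn="$1"
        # The subshell reads the index line by line and exits with 0 as soon
        # as the ASN falls into a binder's range, or with 1 if none matches.
        printf '%s\n' "$BINDERS" | (
            while read -r first last label; do
                if [ "$asn" -ge "$first" ] && [ "$asn" -le "$last" ]; then
                    printf '%s\n' "$label"
                    exit 0
                fi
            done
            exit 1
        )
    }

With this index, ``find_binder 350`` prints ``Binder B``, and an ASN that is
not covered by any labelled binder yields a non-zero exit status.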
Processing of documents in paperless
====================================
Once you have scanned in a document, proceed in paperless as follows.
1. If the document has an ASN, assign the ASN to the document.
2. Assign a correspondent to the document (e.g., your employer, bank, etc.).
This isn't strictly necessary but helps in finding a document when you need
it.
3. Assign a document type (e.g., invoice, bank statement, etc.) to the document.
This isn't strictly necessary but helps in finding a document when you need
it.
4. Assign a proper title to the document (the name of an item you bought, the
subject of the letter, etc)
5. Check that the date of the document is correct. Paperless tries to read
the date from the content of the document, but this fails sometimes if the
OCR is bad or multiple dates appear on the document.
6. Remove inbox tags from the documents.
.. hint::
You can set up manual matching rules for your correspondents and tags and
paperless will assign them automatically. After consuming a couple documents,
you can even ask paperless to *learn* when to assign tags and correspondents
by itself. For details on this feature, see :ref:`advanced-matching`.
Task management
===============
Some documents require attention and require you to act on the document. You
may take two different approaches to handle these documents based on how
regularly you intend to use paperless and scan documents.
* If you scan and process your documents in paperless regularly, assign a
TODO tag to all scanned documents that you need to process. Create a saved
view on the dashboard that shows all documents with this tag.
* If you do not scan documents regularly and use paperless solely for archiving,
create a physical todo box next to your physical inbox and put documents you
need to process in the TODO box. When you have performed the task associated with
the document, move it to the inbox.


@@ -1,284 +0,0 @@
.. _utilities:
Utilities
=========
There are basically three utilities to Paperless: the webserver, consumer, and
if needed, the exporter. They're all detailed here.
.. _utilities-webserver:
The Webserver
-------------
At the heart of it, Paperless is a simple Django webservice, and the entire
interface is based on Django's standard admin interface. Once running, visiting
the URL for your service delivers the admin, through which you can get a
detailed listing of all available documents, search for specific files, and
download whatever it is you're looking for.
.. _utilities-webserver-howto:
How to Use It
.............
The webserver is started via the ``manage.py`` script:
.. code-block:: shell-session
$ /path/to/paperless/src/manage.py runserver
By default, the server runs on localhost, port 8000, but you can change this
with a few arguments, run ``manage.py --help`` for more information.
Add the option ``--noreload`` to reduce resource usage. Otherwise, the server
continuously polls all source files for changes to auto-reload them.
Note that when exiting this command your webserver will disappear.
If you want to run this full-time (which is kind of the point)
you'll need to have it start in the background -- something you'll need to
figure out for your own system. To get you started though, there are Systemd
service files in the ``scripts`` directory.
.. _utilities-consumer:
The Consumer
------------
The consumer script runs in an infinite loop, constantly looking at a directory
for documents to parse and index. The process is pretty straightforward:
1. Look in ``CONSUMPTION_DIR`` for a document. If one is found, go to #2.
If not, wait 10 seconds and try again. On Linux, new documents are detected
instantly via inotify, so there's no waiting involved.
2. Parse the document with Tesseract
3. Create a new record in the database with the OCR'd text
4. Attempt to automatically assign document attributes by doing some guesswork.
Read up on the :ref:`guesswork documentation<guesswork>` for more
information about this process.
5. Encrypt the document (if you have a passphrase set) and store it in the
``media`` directory under ``documents/originals``.
6. Go to #1.
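
To make the shape of this loop concrete, here is a minimal polling sketch in
shell. It is not the actual consumer: ``consume_once`` and ``handle_document``
are hypothetical names, and ``handle_document`` merely stands in for steps 2
to 5.

.. code:: bash

    # Placeholder for steps 2-5 (parse, store, match, file away).
    handle_document() {
        echo "would process: $1"
    }

    # One polling pass over the directory (step 1).
    consume_once() {
        dir="$1"
        for f in "$dir"/*; do
            # Skip sub-directories and the literal pattern of an empty directory.
            [ -f "$f" ] || continue
            handle_document "$f"
        done
    }

    # Step 6 would repeat the pass forever, for example:
    # while true; do consume_once "$CONSUMPTION_DIR"; sleep 10; done

The endless loop is left commented out so that the snippet can be sourced and
tried safely on a test directory.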
.. _utilities-consumer-howto:
How to Use It
.............
The consumer is started via the ``manage.py`` script:
.. code-block:: shell-session
$ /path/to/paperless/src/manage.py document_consumer
This starts the service that will consume documents as they appear in
``CONSUMPTION_DIR``.
Note that this command runs continuously, so exiting it will mean your consumer
disappears. If you want to run this full-time (which is kind of the point)
you'll need to have it start in the background -- something you'll need to
figure out for your own system. To get you started though, there are Systemd
service files in the ``scripts`` directory.
Some command line arguments are available to customize the behavior of the
consumer. By default it will use ``/etc/paperless.conf`` values. Display the
help with:
.. code-block:: shell-session
$ /path/to/paperless/src/manage.py document_consumer --help
.. _utilities-exporter:
The Exporter
------------
Tired of fiddling with Paperless, or just want to do something stupid and are
afraid of accidentally damaging your files? You can export all of your
documents into neatly named, dated, and unencrypted files.
.. _utilities-exporter-howto:
How to Use It
.............
This too is done via the ``manage.py`` script:
.. code-block:: shell-session
$ /path/to/paperless/src/manage.py document_exporter /path/to/somewhere/
This will dump all of your unencrypted documents into ``/path/to/somewhere``
for you to do with as you please. The files are accompanied with a special
file, ``manifest.json`` which can be used to :ref:`import the files
<utilities-importer>` at a later date if you wish.
.. _utilities-exporter-howto-docker:
Docker
______
If you are :ref:`using Docker <setup-installation-docker>`, running the
exporter is almost as easy. To mount a volume for exports, follow the
instructions in the ``docker-compose.yml.example`` file for the ``/export``
volume (making the changes in your own ``docker-compose.yml`` file, of course).
Once you have the volume mounted, the command to run an export is:
.. code-block:: shell-session
$ docker-compose run --rm consumer document_exporter /export
If you prefer to use ``docker run`` directly, supply the necessary command-line
options yourself:
.. code-block:: shell-session
$ # Identify your containers
$ docker-compose ps
Name Command State Ports
-------------------------------------------------------------------------
paperless_consumer_1 /sbin/docker-entrypoint.sh ... Exit 0
paperless_webserver_1 /sbin/docker-entrypoint.sh ... Exit 0
$ # Make sure to replace your passphrase and remove or adapt the id mapping
$ docker run --rm \
--volumes-from paperless_data_1 \
--volume /path/to/arbitrary/place:/export \
-e PAPERLESS_PASSPHRASE=YOUR_PASSPHRASE \
-e USERMAP_UID=1000 -e USERMAP_GID=1000 \
paperless document_exporter /export
.. _utilities-importer:
The Importer
------------
Looking to transfer Paperless data from one instance to another, or just want
to restore from a backup? This is your go-to toy.
.. _utilities-importer-howto:
How to Use It
.............
The importer works just like the exporter. You point it at a directory, and
the script does the rest of the work:
.. code-block:: shell-session
$ /path/to/paperless/src/manage.py document_importer /path/to/somewhere/
Docker
______
Assuming that you've already gone through the steps above in the
:ref:`export <utilities-exporter-howto-docker>` section, then the easiest thing
to do is just re-use the ``/export`` path you already set up:
.. code-block:: shell-session
$ docker-compose run --rm consumer document_importer /export
Similarly, if you're not using docker-compose, you can adjust the export
instructions above to do the import.
.. _utilities-retagger:
Re-running your tagging and correspondent matchers
--------------------------------------------------
Say you've imported a few hundred documents and now want to introduce
a tag or set up a new correspondent, and apply its matching to all of
the currently-imported docs. This problem is common enough that
there are tools for it.
.. _utilities-retagger-howto:
How to Do It
............
This too is done via the ``manage.py`` script:
.. code:: bash
$ /path/to/paperless/src/manage.py document_retagger
Run this after changing or adding tagging rules. It'll loop over all
of the documents in your database and attempt to match all of your
tags to them. If one matches, it'll be applied. And don't worry, you
can run this as often as you like, it won't double-tag a document.
.. code:: bash
$ /path/to/paperless/src/manage.py document_correspondents
This is the analogous command to run after adding or changing a correspondent.
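
If you usually change tags and correspondents together, the two commands can
be wrapped in one small helper. This is only a convenience sketch: the
``MANAGE`` variable and the ``rematch_all`` function are names chosen for this
example.

.. code:: bash

    # Path as used in the examples above; adjust it to your installation.
    MANAGE="${MANAGE:-/path/to/paperless/src/manage.py}"

    rematch_all() {
        # Re-apply tag matching first, then correspondent matching.
        # The second command only runs if the first one succeeded.
        "$MANAGE" document_retagger && "$MANAGE" document_correspondents
    }

Running ``rematch_all`` after editing your matching rules then refreshes both
tags and correspondents in one go.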
.. _utilities-encyption:
Enabling Encryption
-------------------
Let's say you've imported a few documents to play around with paperless and now
you are using it more seriously and want to enable encryption of your files.
.. _utilities-encryption-howto:
Basic Syntax
.............
Again we'll use the ``manage.py`` script, passing ``change_storage_type``:
.. code:: console
$ /path/to/paperless/src/manage.py change_storage_type --help
usage: manage.py change_storage_type [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH] [--traceback]
[--no-color] [--passphrase PASSPHRASE]
{gpg,unencrypted} {gpg,unencrypted}
This is how you migrate your stored documents from an encrypted state to an
unencrypted one (or vice-versa)
positional arguments:
{gpg,unencrypted} The state you want to change your documents from
{gpg,unencrypted} The state you want to change your documents to
optional arguments:
--passphrase PASSPHRASE
If PAPERLESS_PASSPHRASE isn't set already, you need to
specify it here
Enabling Encryption
...................
Basic usage to enable encryption of your document store (**USE A MORE SECURE PASSPHRASE**):
(Note: If ``PAPERLESS_PASSPHRASE`` isn't set already, you need to specify it here)
.. code:: bash
$ /path/to/paperless/src/manage.py change_storage_type [--passphrase SECR3TP4SSPHRA$E] unencrypted gpg
Disabling Encryption
....................
Basic usage to disable encryption of your document store:
(Note: Again, if ``PAPERLESS_PASSPHRASE`` isn't set already, you need to specify it here)
.. code:: bash
$ /path/to/paperless/src/manage.py change_storage_type [--passphrase SECR3TP4SSPHRA$E] gpg unencrypted