Merge branch 'master' into master
BIN
docs/_static/Screenshot_first_logged.png
vendored
Before Width: | Height: | Size: 60 KiB |
BIN
docs/_static/Screenshot_first_run_login.png
vendored
Before Width: | Height: | Size: 26 KiB |
BIN
docs/_static/Screenshot_upload_and_scanned.png
vendored
Before Width: | Height: | Size: 113 KiB |
44
docs/_static/lxc-install.svg
vendored
Before Width: | Height: | Size: 1.9 MiB |
BIN
docs/_static/recommended_workflow.png
vendored
Normal file
After Width: | Height: | Size: 67 KiB |
BIN
docs/_static/screenshots/correspondents.png
vendored
Normal file
After Width: | Height: | Size: 106 KiB |
BIN
docs/_static/screenshots/dashboard.png
vendored
Normal file
After Width: | Height: | Size: 167 KiB |
BIN
docs/_static/screenshots/documents-filter.png
vendored
Normal file
After Width: | Height: | Size: 28 KiB |
BIN
docs/_static/screenshots/documents-largecards.png
vendored
Normal file
After Width: | Height: | Size: 306 KiB |
BIN
docs/_static/screenshots/documents-smallcards.png
vendored
Normal file
After Width: | Height: | Size: 410 KiB |
BIN
docs/_static/screenshots/documents-table.png
vendored
Normal file
After Width: | Height: | Size: 137 KiB |
BIN
docs/_static/screenshots/editing.png
vendored
Normal file
After Width: | Height: | Size: 293 KiB |
BIN
docs/_static/screenshots/logs.png
vendored
Normal file
After Width: | Height: | Size: 260 KiB |
BIN
docs/_static/screenshots/mail-rules-edited.png
vendored
Normal file
After Width: | Height: | Size: 96 KiB |
BIN
docs/_static/screenshots/mobile.png
vendored
Normal file
After Width: | Height: | Size: 158 KiB |
BIN
docs/_static/screenshots/new-tag.png
vendored
Normal file
After Width: | Height: | Size: 32 KiB |
BIN
docs/_static/screenshots/search-preview.png
vendored
Normal file
After Width: | Height: | Size: 61 KiB |
BIN
docs/_static/screenshots/search-results.png
vendored
Normal file
After Width: | Height: | Size: 261 KiB |
415
docs/administration.rst
Normal file
@@ -0,0 +1,415 @@
|
||||
|
||||
**************
|
||||
Administration
|
||||
**************
|
||||
|
||||
.. _administration-backup:
|
||||
|
||||
Making backups
|
||||
##############
|
||||
|
||||
Multiple options exist for making backups of your paperless instance,
|
||||
depending on how you installed paperless.
|
||||
|
||||
Before making backups, make sure that paperless is not running.
|
||||
|
||||
Options available to any installation of paperless:
|
||||
|
||||
* Use the :ref:`document exporter <utilities-exporter>`.
|
||||
The document exporter exports all your documents, thumbnails and
|
||||
metadata to a specific folder. You may import your documents into a
|
||||
fresh instance of paperless again or store your documents in another
|
||||
DMS with this export.
|
||||
|
||||
Options available to docker installations:
|
||||
|
||||
* Back up the docker volumes. These usually reside within
|
||||
``/var/lib/docker/volumes`` on the host and you need to be root in order
|
||||
to access them.
|
||||
|
||||
Paperless uses 3 volumes:
|
||||
|
||||
* ``paperless_media``: This is where your documents are stored.
|
||||
* ``paperless_data``: This is where auxiliary data is stored. This
|
||||
folder also contains the SQLite database, if you use it.
|
||||
* ``paperless_pgdata``: Exists only if you use PostgreSQL and contains
|
||||
the database.
|
||||
|
||||
Options available to bare-metal and non-docker installations:
|
||||
|
||||
* Back up the entire paperless folder. This ensures that if your paperless instance
|
||||
crashes at some point or your disk fails, you can simply copy the folder back
|
||||
into place and it works.
|
||||
|
||||
When using PostgreSQL, you'll also have to back up the database.
|
||||
|
||||
.. _migrating-restoring:
|
||||
|
||||
Restoring
|
||||
=========
|
||||
|
||||
|
||||
|
||||
|
||||
.. _administration-updating:
|
||||
|
||||
Updating paperless
|
||||
##################
|
||||
|
||||
If a new release of paperless-ng is available, upgrading depends on how you
|
||||
installed paperless-ng in the first place. The releases are available on the
|
||||
`release page <https://github.com/jonaswinkler/paperless-ng/releases>`_.
|
||||
|
||||
First of all, ensure that paperless is stopped.
|
||||
|
||||
.. code:: shell-session
|
||||
|
||||
$ cd /path/to/paperless
|
||||
$ docker-compose down
|
||||
|
||||
After that, :ref:`make a backup <administration-backup>`.
|
||||
|
||||
A. If you used the dockerfiles archive, simply download the files of the new release,
|
||||
adjust the settings in the files (e.g., the path to your consumption directory),
|
||||
and replace your existing docker-compose files. Then start paperless as usual,
|
||||
which will pull the new image, and update your database, if necessary:
|
||||
|
||||
.. code:: shell-session
|
||||
|
||||
$ cd /path/to/paperless
|
||||
$ docker-compose up
|
||||
|
||||
If you see everything working, you can start paperless-ng with "-d" to have it
|
||||
run in the background.
|
||||
|
||||
.. hint::
|
||||
|
||||
The released docker-compose files specify exact versions to be pulled from the hub.
|
||||
This is to ensure that if the docker-compose files should change at some point
|
||||
(e.g., services updated or configured differently), you won't run into trouble due to
|
||||
docker pulling the ``latest`` image and running it in an older environment.
|
||||
|
||||
B. If you built the image yourself, grab the new archive and replace your current
|
||||
paperless folder with the new contents.
|
||||
|
||||
After that, make the necessary adjustments to the docker-compose.yml (e.g.,
|
||||
adjust your consumption directory).
|
||||
|
||||
Build and start the new image with:
|
||||
|
||||
.. code:: shell-session
|
||||
|
||||
$ cd /path/to/paperless
|
||||
$ docker-compose build
|
||||
$ docker-compose up
|
||||
|
||||
If you see everything working, you can start paperless-ng with "-d" to have it
|
||||
run in the background.
|
||||
|
||||
.. hint::
|
||||
|
||||
You can usually keep your ``docker-compose.env`` file, since this file will
|
||||
never include mandatory configuration options. However, it is worth checking
|
||||
out the new version of this file, since it might have new recommendations
|
||||
on what to configure.
|
||||
|
||||
|
||||
Updating paperless without docker
|
||||
=================================
|
||||
|
||||
After grabbing the new release and unpacking the contents, do the following:
|
||||
|
||||
1. Update dependencies. New paperless versions may require additional
|
||||
dependencies. The dependencies required are listed in the section about
|
||||
:ref:`bare metal installations <setup-bare_metal>`.
|
||||
|
||||
2. Update python requirements. If you use Pipenv, this is done with the following steps.
|
||||
|
||||
.. code:: shell-session
|
||||
|
||||
$ pip install --upgrade pipenv
|
||||
$ cd /path/to/paperless
|
||||
$ pipenv clean
|
||||
$ pipenv install
|
||||
|
||||
This creates a new virtual environment (or uses your existing environment)
|
||||
and installs all dependencies into it.
|
||||
|
||||
3. Collect static files.
|
||||
|
||||
.. code:: shell-session
|
||||
|
||||
$ cd src
|
||||
$ pipenv run python3 manage.py collectstatic --clear
|
||||
|
||||
4. Migrate the database.
|
||||
|
||||
.. code:: shell-session
|
||||
|
||||
$ cd src
|
||||
$ pipenv run python3 manage.py migrate
|
||||
|
||||
|
||||
Management utilities
|
||||
####################
|
||||
|
||||
Paperless comes with some management commands that perform various maintenance
|
||||
tasks on your paperless instance. You can invoke these commands either by
|
||||
|
||||
.. code:: shell-session
|
||||
|
||||
$ cd /path/to/paperless
|
||||
$ docker-compose run --rm webserver <command> <arguments>
|
||||
|
||||
or
|
||||
|
||||
.. code:: shell-session
|
||||
|
||||
$ cd /path/to/paperless/src
|
||||
$ pipenv run python manage.py <command> <arguments>
|
||||
|
||||
depending on whether you use docker or not.
|
||||
|
||||
All commands have built-in help, which can be accessed by executing them with
|
||||
the argument ``--help``.
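As a rough illustration of the two invocation styles, the following Python sketch builds the command line for either setup. The helper name ``build_management_command`` is hypothetical and not part of paperless; only the ``docker-compose run --rm webserver`` and ``pipenv run python manage.py`` prefixes are taken from the text above.

```python
import shlex


def build_management_command(command, args=(), use_docker=True):
    """Return the argv list for a paperless management command.

    Hypothetical helper for illustration only. Run the result from
    /path/to/paperless (docker) or /path/to/paperless/src (no docker).
    """
    if use_docker:
        prefix = ["docker-compose", "run", "--rm", "webserver"]
    else:
        prefix = ["pipenv", "run", "python", "manage.py"]
    return prefix + [command] + list(args)


# Print the two variants for commands described later in this section.
print(shlex.join(build_management_command("document_exporter", ["../export"])))
print(shlex.join(build_management_command("document_index", ["reindex"],
                                          use_docker=False)))
```

The list form can be passed to ``subprocess.run`` directly, which avoids shell quoting problems with paths that contain spaces.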
|
||||
|
||||
.. _utilities-exporter:
|
||||
|
||||
Document exporter
|
||||
=================
|
||||
|
||||
The document exporter exports all your data from paperless into a folder for
|
||||
backup or migration to another DMS.
|
||||
|
||||
.. code::
|
||||
|
||||
document_exporter target
|
||||
|
||||
``target`` is a folder to which the data gets written. This includes documents,
|
||||
thumbnails and a ``manifest.json`` file. The manifest contains all metadata from
|
||||
the database (correspondents, tags, etc.).
|
||||
|
||||
When you use the provided docker compose script, specify ``../export`` as the
|
||||
target. This path inside the container is automatically mounted on your host on
|
||||
the folder ``export``.
|
||||
|
||||
|
||||
.. _utilities-importer:
|
||||
|
||||
Document importer
|
||||
=================
|
||||
|
||||
The document importer takes the export produced by the `Document exporter`_ and
|
||||
imports it into paperless.
|
||||
|
||||
The importer works just like the exporter. You point it at a directory, and
|
||||
the script does the rest of the work:
|
||||
|
||||
.. code::
|
||||
|
||||
document_importer source
|
||||
|
||||
When you use the provided docker compose script, put the export inside the
|
||||
``export`` folder in your paperless source directory. Specify ``../export``
|
||||
as the ``source``.
|
||||
|
||||
|
||||
.. _utilities-retagger:
|
||||
|
||||
Document retagger
|
||||
=================
|
||||
|
||||
Say you've imported a few hundred documents and now want to introduce
|
||||
a tag or set up a new correspondent, and apply its matching to all of
|
||||
the currently-imported docs. This problem is common enough that
|
||||
there are tools for it.
|
||||
|
||||
.. code::
|
||||
|
||||
document_retagger [-h] [-c] [-T] [-t] [-i] [--use-first] [-f]
|
||||
|
||||
optional arguments:
|
||||
-c, --correspondent
|
||||
-T, --tags
|
||||
-t, --document_type
|
||||
-i, --inbox-only
|
||||
--use-first
|
||||
-f, --overwrite
|
||||
|
||||
Run this after changing or adding matching rules. It'll loop over all
|
||||
of the documents in your database and attempt to match documents
|
||||
according to the new rules.
|
||||
|
||||
Specify any combination of ``-c``, ``-T`` and ``-t`` to have the
|
||||
retagger perform matching of the specified metadata type. If you don't
|
||||
specify any of these options, the document retagger won't do anything.
|
||||
|
||||
Specify ``-i`` to have the document retagger work on documents tagged
|
||||
with inbox tags only. This is useful when you don't want to mess with
|
||||
your already processed documents.
|
||||
|
||||
When multiple document types or correspondents match a single document,
|
||||
the retagger won't assign these to the document. Specify ``--use-first``
|
||||
to override this behavior and just use the first correspondent or type
|
||||
it finds. This option does not apply to tags, since any number of tags
|
||||
can be applied to a document.
|
||||
|
||||
Finally, ``-f`` specifies that you wish to overwrite already assigned
|
||||
correspondents, types and/or tags. The default behavior is to not
|
||||
assign correspondents and types to documents that have this data already
|
||||
assigned. ``-f`` works differently for tags: By default, only additional tags get
|
||||
added to documents, no tags will be removed. With ``-f``, tags that don't
|
||||
match a document anymore get removed as well.
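The interplay of ``--use-first`` and ``-f`` described above can be modelled with a small Python sketch. This is not the retagger's actual code: the function names are made up, and the tag handling assumes for simplicity that every tag on a document takes part in matching.

```python
def retag_single(current, matches, use_first=False, overwrite=False):
    """Model the outcome for a correspondent or document type.

    current: the value already assigned, or None.
    matches: candidates whose matching rules apply to the document.
    """
    if current is not None and not overwrite:
        return current        # existing data is kept unless -f is given
    if len(matches) == 1:
        return matches[0]
    if len(matches) > 1 and use_first:
        return matches[0]     # --use-first: take the first candidate
    return current            # no match or ambiguous: leave unchanged


def retag_tags(current, matches, overwrite=False):
    """Model the outcome for tags (simplified, see assumption above)."""
    if overwrite:
        return set(matches)   # -f: tags that no longer match are dropped
    return set(current) | set(matches)   # default: tags are only added


print(retag_single(None, ["Bank", "Webshop"]))
print(retag_single(None, ["Bank", "Webshop"], use_first=True))
print(sorted(retag_tags({"old"}, {"new"})))
```

What happens with ``-f`` when nothing matches at all is not stated above; the sketch leaves the existing value untouched in that case.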
|
||||
|
||||
|
||||
Managing the Automatic matching algorithm
|
||||
=========================================
|
||||
|
||||
The *Auto* matching algorithm requires a trained neural network to work.
|
||||
This network needs to be updated whenever something in your data
|
||||
changes. The docker image takes care of that automatically with the task
|
||||
scheduler. You can manually renew the classifier by invoking the following
|
||||
management command:
|
||||
|
||||
.. code::
|
||||
|
||||
document_create_classifier
|
||||
|
||||
This command takes no arguments.
|
||||
|
||||
.. _`administration-index`:
|
||||
|
||||
Managing the document search index
|
||||
==================================
|
||||
|
||||
The document search index is responsible for delivering search results for the
|
||||
website. The document index is automatically updated whenever documents get
|
||||
added to, changed, or removed from paperless. However, if the search yields
|
||||
non-existing documents or won't find anything, you may need to recreate the
|
||||
index manually.
|
||||
|
||||
.. code::
|
||||
|
||||
document_index {reindex,optimize}
|
||||
|
||||
Specify ``reindex`` to have the index created from scratch. This may take some
|
||||
time.
|
||||
|
||||
Specify ``optimize`` to optimize the index. This updates certain aspects of
|
||||
the index and usually makes queries faster and also ensures that the
|
||||
autocompletion works properly. This command is regularly invoked by the task
|
||||
scheduler.
|
||||
|
||||
.. _utilities-renamer:
|
||||
|
||||
Managing filenames
|
||||
==================
|
||||
|
||||
If you use paperless' feature to
|
||||
:ref:`assign custom filenames to your documents <advanced-file_name_handling>`,
|
||||
you can use this command to move all your files after changing
|
||||
the naming scheme.
|
||||
|
||||
.. warning::
|
||||
|
||||
Since this command moves your documents around a lot, it is advised to make
|
||||
a backup before. The renaming logic is robust and will never overwrite
|
||||
or delete a file, but you can't ever be careful enough.
|
||||
|
||||
.. code::
|
||||
|
||||
document_renamer
|
||||
|
||||
The command takes no arguments and processes all your documents at once.
|
||||
|
||||
|
||||
Fetching e-mail
|
||||
===============
|
||||
|
||||
Paperless automatically fetches your e-mail every 10 minutes by default. If
|
||||
you want to invoke the email consumer manually, call the following management
|
||||
command:
|
||||
|
||||
.. code::
|
||||
|
||||
mail_fetcher
|
||||
|
||||
The command takes no arguments and processes all your mail accounts and rules.
|
||||
|
||||
.. _utilities-archiver:
|
||||
|
||||
Creating archived documents
|
||||
===========================
|
||||
|
||||
Paperless stores archived PDF/A documents alongside your original documents.
|
||||
These archived documents will also contain selectable text for image-only
|
||||
originals.
|
||||
These documents are derived from the originals, which are always stored
|
||||
unmodified. If coming from an earlier version of paperless, your documents
|
||||
won't have archived versions.
|
||||
|
||||
This command creates PDF/A documents for your documents.
|
||||
|
||||
.. code::
|
||||
|
||||
document_archiver --overwrite --document <id>
|
||||
|
||||
This command will only attempt to create archived documents when no archived
|
||||
document exists yet, unless ``--overwrite`` is specified. If ``--document <id>``
|
||||
is specified, the archiver will only process that document.
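The selection rule for ``--overwrite`` and ``--document <id>`` can be summarised with the following Python sketch. The function name and the ``(id, has_archive)`` input shape are assumptions made for illustration; the archiver itself works on the database.

```python
def documents_to_archive(documents, overwrite=False, document_id=None):
    """Return the ids the archiver would process under the rules above.

    documents: iterable of (id, has_archive) pairs (hypothetical shape).
    """
    selected = []
    for doc_id, has_archive in documents:
        if document_id is not None and doc_id != document_id:
            continue          # --document <id>: restrict to one document
        if has_archive and not overwrite:
            continue          # archived version exists: skip by default
        selected.append(doc_id)
    return selected


docs = [(1, True), (2, False), (3, False)]
print(documents_to_archive(docs))
print(documents_to_archive(docs, overwrite=True, document_id=1))
```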
|
||||
|
||||
.. note::
|
||||
|
||||
This command essentially performs OCR on all your documents again,
|
||||
according to your settings. If you run this with ``PAPERLESS_OCR_MODE=redo``,
|
||||
it will potentially run for a very long time. You can cancel the command
|
||||
at any time, since this command will skip already archived versions the next time
|
||||
it is run.
|
||||
|
||||
.. note::
|
||||
|
||||
Some documents will cause errors and cannot be converted into PDF/A documents,
|
||||
such as encrypted PDF documents. The archiver will skip over these documents
|
||||
each time it sees them.
|
||||
|
||||
.. _utilities-encyption:
|
||||
|
||||
Managing encryption
|
||||
===================
|
||||
|
||||
Documents can be stored in Paperless using GnuPG encryption.
|
||||
|
||||
.. danger::
|
||||
|
||||
Encryption is deprecated since paperless-ng 0.9 and doesn't really provide any
|
||||
additional security, since you have to store the passphrase in a configuration
|
||||
file on the same system as the encrypted documents for paperless to work.
|
||||
Furthermore, the entire text content of the documents is stored plain in the
|
||||
database, even if your documents are encrypted. Filenames are not encrypted either.
|
||||
|
||||
Also, the web server provides transparent access to your encrypted documents.
|
||||
|
||||
Consider running paperless on an encrypted filesystem instead, which will then
|
||||
at least provide security against physical hardware theft.
|
||||
|
||||
|
||||
Enabling encryption
|
||||
-------------------
|
||||
|
||||
Enabling encryption is no longer supported.
|
||||
|
||||
|
||||
Disabling encryption
|
||||
--------------------
|
||||
|
||||
Basic usage to disable encryption of your document store:
|
||||
|
||||
(Note: If ``PAPERLESS_PASSPHRASE`` isn't set already, you need to specify it here)
|
||||
|
||||
.. code::
|
||||
|
||||
decrypt_documents [--passphrase SECR3TP4SSPHRA$E]
|
||||
|
||||
|
||||
.. _Pipenv: https://pipenv.pypa.io/en/latest/
|
342
docs/advanced_usage.rst
Normal file
@@ -0,0 +1,342 @@
|
||||
***************
|
||||
Advanced topics
|
||||
***************
|
||||
|
||||
Paperless offers a couple features that automate certain tasks and make your life
|
||||
easier.
|
||||
|
||||
Guesswork
|
||||
#########
|
||||
|
||||
|
||||
Any document you put into the consumption directory will be consumed, but if
|
||||
you name the file right, it'll automatically set some values in the database
|
||||
for you. This is the logic the consumer follows:
|
||||
|
||||
1. Try to find the correspondent, title, and tags in the file name following
|
||||
the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``. Note that
|
||||
the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or
|
||||
``YYYYMMDDZ``. The ``Z`` refers to "Zulu time", also known as UTC.
|
||||
The tags are optional, so the format ``Date - Correspondent - Title.pdf``
|
||||
works as well.
|
||||
2. If that doesn't work, we skip the date and try this pattern:
|
||||
``Correspondent - Title - tag,tag,tag.pdf``.
|
||||
3. If that doesn't work, we try to find the correspondent and title in the file
|
||||
name following the pattern: ``Correspondent - Title.pdf``.
|
||||
4. If that doesn't work, just assume that the name of the file is the title.
|
||||
|
||||
So given the above, the following examples would work as you'd expect:
|
||||
|
||||
* ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
|
||||
* ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
|
||||
* ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
|
||||
* ``Another Company - Letter of Reference.jpg``
|
||||
* ``Dad's Recipe for Pancakes.png``
|
||||
|
||||
These however wouldn't work:
|
||||
|
||||
* ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
|
||||
* ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
|
||||
* ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
|
||||
* ``Another Company- Letter of Reference.jpg``
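The fallback order described above can be sketched in Python as a list of regular expressions that are tried in turn. The expressions below are an approximation written for this example, not the ones paperless uses; in particular, the assumption that tags consist of lowercase letters, digits, hyphens and commas is ours.

```python
import re
from pathlib import Path

# Assumed building blocks: the rigid date format and a guessed tag syntax.
DATE = r"(?P<created>\d{8}(?:\d{6})?Z)"
TAGS = r"(?P<tags>[a-z0-9\-,]+)"

PATTERNS = [
    re.compile(rf"^{DATE} - (?P<correspondent>.+?) - (?P<title>.+) - {TAGS}$"),
    re.compile(rf"^{DATE} - (?P<correspondent>.+?) - (?P<title>.+)$"),
    re.compile(rf"^(?P<correspondent>.+?) - (?P<title>.+) - {TAGS}$"),
    re.compile(r"^(?P<correspondent>.+?) - (?P<title>.+)$"),
    re.compile(r"^(?P<title>.+)$"),
]


def guess_metadata(filename):
    """Try each pattern in order and return the first set of fields found."""
    stem = Path(filename).stem   # drop the file extension
    for pattern in PATTERNS:
        match = pattern.match(stem)
        if match:
            info = match.groupdict()
            info["tags"] = info["tags"].split(",") if info.get("tags") else []
            return info
    return {"title": stem, "tags": []}


print(guess_metadata(
    "20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf"))
print(guess_metadata("Another Company- Letter of Reference.jpg"))
```

With these patterns, a name that uses commas instead of `` - `` as a separator, or that lacks the space before the dash, falls through to a later pattern and ends up with fields that do not match what was intended.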
|
||||
|
||||
Do I have to be so strict about naming?
|
||||
=======================================
|
||||
|
||||
Rather than using the strict document naming rules, one can also set the option
|
||||
``PAPERLESS_FILENAME_DATE_ORDER`` in ``paperless.conf`` to any date order
|
||||
that is accepted by dateparser_. Doing so will cause ``paperless`` to default
|
||||
to any date format that is found in the title, instead of a date pulled from
|
||||
the document's text, without requiring the strict formatting of the document
|
||||
filename as described above.
|
||||
|
||||
.. _dateparser: https://github.com/scrapinghub/dateparser/blob/v0.7.0/docs/usage.rst#settings
|
||||
|
||||
.. _advanced-transforming_filenames:
|
||||
|
||||
Transforming filenames for parsing
|
||||
==================================
|
||||
|
||||
Some devices can't produce filenames that can be parsed by the default
|
||||
parser. By configuring the option ``PAPERLESS_FILENAME_PARSE_TRANSFORMS`` in
|
||||
``paperless.conf`` one can add transformations that are applied to the filename
|
||||
before it's parsed.
|
||||
|
||||
The option contains a list of dictionaries of regular expressions (key:
|
||||
``pattern``) and replacements (key: ``repl``) in JSON format, which are
|
||||
applied in order by passing them to ``re.subn``. Transformation stops
|
||||
after the first match, so at most one transformation is applied. The general
|
||||
syntax is
|
||||
|
||||
.. code:: python
|
||||
|
||||
[{"pattern":"pattern1", "repl":"repl1"}, {"pattern":"pattern2", "repl":"repl2"}, ..., {"pattern":"patternN", "repl":"replN"}]
|
||||
|
||||
The example below is for a Brother ADS-2400N, a scanner that allows assigning
|
||||
different names to different hardware buttons (useful for handling
|
||||
multiple entities in one instance), but insists on adding ``_<count>``
|
||||
to the filename.
|
||||
|
||||
.. code:: python
|
||||
|
||||
# Brother profile configuration, support "Name_Date_Count" (the default
|
||||
# setting) and "Name_Count" (use "Name" as tag and "Count" as title).
|
||||
PAPERLESS_FILENAME_PARSE_TRANSFORMS=[{"pattern":"^([a-z]+)_(\\d{8})_(\\d{6})_([0-9]+)\\.", "repl":"\\2\\3Z - \\4 - \\1."}, {"pattern":"^([a-z]+)_([0-9]+)\\.", "repl":" - \\2 - \\1."}]
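To see what these transformations do to a concrete filename, the following Python sketch applies the list in order with ``re.subn`` and stops after the first pattern that matches. The function name and the sample filenames are invented for this example, and whether paperless passes any regex flags is not covered here.

```python
import json
import re


def apply_transforms(filename, transforms):
    """Apply the first matching transformation and return the new name."""
    for transform in transforms:
        new_name, count = re.subn(transform["pattern"], transform["repl"],
                                  filename)
        if count > 0:
            return new_name      # stop after the first match
    return filename              # nothing matched: name stays unchanged


# The value of PAPERLESS_FILENAME_PARSE_TRANSFORMS from the example above.
TRANSFORMS = json.loads(r'''[
  {"pattern": "^([a-z]+)_(\\d{8})_(\\d{6})_([0-9]+)\\.",
   "repl": "\\2\\3Z - \\4 - \\1."},
  {"pattern": "^([a-z]+)_([0-9]+)\\.", "repl": " - \\2 - \\1."}
]''')

print(apply_transforms("invoices_20210102_030405_0001.pdf", TRANSFORMS))
print(apply_transforms("invoices_0001.pdf", TRANSFORMS))
```

Testing a pattern list this way before putting it into ``paperless.conf`` is a cheap check that the JSON escaping of backslashes is right.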
|
||||
|
||||
|
||||
.. _advanced-matching:
|
||||
|
||||
Matching tags, correspondents and document types
|
||||
################################################
|
||||
|
||||
After the consumer has tried to figure out what it could from the file name,
|
||||
it starts looking at the content of the document itself. It will compare the
|
||||
matching algorithms defined by every tag and correspondent already set in your
|
||||
database to see if they apply to the text in that document. In other words,
|
||||
if you defined a tag called ``Home Utility`` that had a ``match`` property of
|
||||
``bc hydro`` and a ``matching_algorithm`` of ``literal``, Paperless will
|
||||
automatically tag your newly-consumed document with your ``Home Utility`` tag
|
||||
so long as the text ``bc hydro`` appears in the body of the document somewhere.
|
||||
|
||||
The matching logic is quite powerful, and supports searching the text of your
|
||||
document with different algorithms, and as such, some experimentation may be
|
||||
necessary to get things right.
|
||||
|
||||
In order to have a tag, correspondent or type assigned automatically to newly
|
||||
consumed documents, assign a match and matching algorithm using the web
|
||||
interface. These settings define when to assign correspondents, tags and types
|
||||
to documents.
|
||||
|
||||
The following algorithms are available:
|
||||
|
||||
* **Any:** Looks for any occurrence of any word provided in match in the PDF.
|
||||
If you define the match as ``Bank1 Bank2``, it will match documents containing
|
||||
either of these terms.
|
||||
* **All:** Requires that every word provided appears in the PDF, albeit not in the
|
||||
order provided.
|
||||
* **Literal:** Matches only if the match appears exactly as provided in the PDF.
|
||||
* **Regular expression:** Parses the match as a regular expression and tries to
|
||||
find a match within the document.
|
||||
* **Fuzzy match:** I don't know. Look at the source.
|
||||
* **Auto:** Tries to automatically match new documents. This does not require you
|
||||
to set a match. See the notes below.
|
||||
|
||||
When using the "any" or "all" matching algorithms, you can search for terms
|
||||
that consist of multiple words by enclosing them in double quotes. For example,
|
||||
defining a match text of ``"Bank of America" BofA`` using the "any" algorithm,
|
||||
will match documents that contain either "Bank of America" or "BofA", but will
|
||||
not match documents containing "Bank of South America".
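The "any", "all" and "literal" algorithms, including the handling of quoted terms, can be approximated with the following Python sketch. It is a simplified model, not the code paperless runs: case-insensitive comparison and matching on word boundaries are assumptions made here.

```python
import re


def split_terms(match_text):
    """Split a match text into terms, keeping "quoted phrases" together."""
    pairs = re.findall(r'"([^"]+)"|(\S+)', match_text)
    return [quoted or bare for quoted, bare in pairs]


def contains(term, text):
    """Check for the term as whole words, ignoring case (assumed)."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def matches(match_text, text, algorithm="any"):
    terms = split_terms(match_text)
    if not terms:
        return False
    if algorithm == "any":
        return any(contains(term, text) for term in terms)
    if algorithm == "all":
        return all(contains(term, text) for term in terms)
    if algorithm == "literal":
        return contains(match_text, text)
    raise ValueError(f"unsupported algorithm: {algorithm}")


rule = '"Bank of America" BofA'
print(matches(rule, "Statement from Bank of America"))
print(matches(rule, "Greetings from the Bank of South America"))
```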
|
||||
|
||||
Then just save your tag/correspondent and run another document through the
|
||||
consumer. Once complete, you should see the newly-created document,
|
||||
automatically tagged with the appropriate data.
|
||||
|
||||
|
||||
.. _advanced-automatic_matching:
|
||||
|
||||
Automatic matching
|
||||
==================
|
||||
|
||||
Paperless-ng comes with a new matching algorithm called *Auto*. This matching
|
||||
algorithm tries to assign tags, correspondents and document types to your
|
||||
documents based on how you have assigned these on existing documents. It
|
||||
uses a neural network under the hood.
|
||||
|
||||
If, for example, all your bank statements of your account 123 at the Bank of
|
||||
America are tagged with the tag "bofa_123" and the matching algorithm of this
|
||||
tag is set to *Auto*, this neural network will examine your documents and
|
||||
automatically learn when to assign this tag.
|
||||
|
||||
Paperless tries to hide much of the involved complexity with this approach.
|
||||
However, there are a couple caveats you need to keep in mind when using this
|
||||
feature:
|
||||
|
||||
* Changes to your documents are not immediately reflected by the matching
|
||||
algorithm. The neural network needs to be *trained* on your documents after
|
||||
changes. Paperless periodically (default: once each hour) checks for changes
|
||||
and does this automatically for you.
|
||||
* The Auto matching algorithm only takes documents into account which are NOT
|
||||
placed in your inbox (i.e., documents that have inbox tags assigned are ignored). This ensures
|
||||
that the neural network only learns from documents which you have correctly
|
||||
tagged before.
|
||||
* The matching algorithm can only work if there is a correlation between the
|
||||
tag, correspondent or document type and the document itself. Your bank
|
||||
statements usually contain your bank account number and the name of the bank,
|
||||
so this works reasonably well. However, tags such as "TODO" cannot be
|
||||
automatically assigned.
|
||||
* The matching algorithm needs a reasonable number of documents to identify when
|
||||
to assign tags, correspondents, and types. If one out of a thousand documents
|
||||
has the correspondent "Very obscure web shop I bought something five years
|
||||
ago", it will probably not assign this correspondent automatically if you buy
|
||||
something from them again. The more documents, the better.
|
||||
* Paperless also needs a reasonable amount of negative examples to decide when
|
||||
not to assign a certain tag, correspondent or type. This will usually be the
|
||||
case as you start filling up paperless with documents. Example: If all your
|
||||
documents are either from "Webshop" or "Bank", paperless will assign one of
|
||||
these correspondents to ANY new document, if both are set to automatic matching.
|
||||
|
||||
Hooking into the consumption process
|
||||
####################################
|
||||
|
||||
Sometimes you may want to do something arbitrary whenever a document is
|
||||
consumed. Rather than try to predict what you may want to do, Paperless lets
|
||||
you execute scripts of your own choosing just before or after a document is
|
||||
consumed using a couple simple hooks.
|
||||
|
||||
Just write a script, put it somewhere that Paperless can read & execute, and
|
||||
then put the path to that script in ``paperless.conf`` with the variable name
|
||||
of either ``PAPERLESS_PRE_CONSUME_SCRIPT`` or
|
||||
``PAPERLESS_POST_CONSUME_SCRIPT``.
|
||||
|
||||
.. important::
|
||||
|
||||
These scripts are executed in a **blocking** process, which means that if
|
||||
a script takes a long time to run, it can significantly slow down your
|
||||
document consumption flow. If you want things to run asynchronously,
|
||||
you'll have to fork the process in your script and exit.
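If the hook is written in Python, one possible way to hand slow work to a background process is ``subprocess.Popen`` with ``start_new_session=True``. The sketch below is an assumption about how such a script could look, not something paperless provides; the worker command is a placeholder.

```python
import subprocess
import sys


def run_detached(argv):
    """Start argv as a background process and return its pid at once."""
    process = subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,   # POSIX: detach from the caller's session
    )
    return process.pid


# Placeholder worker: a real hook would start its own long-running task
# here and then exit, so that consumption is not held up.
pid = run_detached([sys.executable, "-c", "import time; time.sleep(0.1)"])
print(f"started background worker with pid {pid}")
```

Because the hook no longer waits for the worker, any error in the worker has to be logged by the worker itself.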
|
||||
|
||||
|
||||
Pre-consumption script
|
||||
======================
|
||||
|
||||
Executed after the consumer sees a new document in the consumption folder, but
|
||||
before any processing of the document is performed. This script receives exactly
|
||||
one argument:
|
||||
|
||||
* Document file name
|
||||
|
||||
A simple but common example for this would be creating a simple script like
|
||||
this:
|
||||
|
||||
``/usr/local/bin/ocr-pdf``
|
||||
|
||||
.. code:: bash
|
||||
|
||||
#!/usr/bin/env bash
|
||||
pdf2pdfocr.py -i "${1}"
|
||||
|
||||
``/etc/paperless.conf``
|
||||
|
||||
.. code:: bash
|
||||
|
||||
...
|
||||
PAPERLESS_PRE_CONSUME_SCRIPT="/usr/local/bin/ocr-pdf"
|
||||
...
|
||||
|
||||
This will pass the path to the document about to be consumed to ``/usr/local/bin/ocr-pdf``,
|
||||
which will in turn call `pdf2pdfocr.py`_ on your document, which will then
|
||||
overwrite the file with an OCR'd version of the file and exit. At that point,
|
||||
the consumption process will begin with the newly modified file.
|
||||
|
||||
.. _pdf2pdfocr.py: https://github.com/LeoFCardoso/pdf2pdfocr
|
||||
|
||||
.. _advanced-post_consume_script:
|
||||
|
||||
Post-consumption script
|
||||
=======================
|
||||
|
||||
Executed after the consumer has successfully processed a document and has moved it
|
||||
into paperless. It receives the following arguments:
|
||||
|
||||
* Document id
|
||||
* Generated file name
|
||||
* Source path
|
||||
* Thumbnail path
|
||||
* Download URL
|
||||
* Thumbnail URL
|
||||
* Correspondent
|
||||
* Tags
|
||||
|
||||
The script can be in any language you like, but for a simple shell script
|
||||
example, you can take a look at ``post-consumption-example.sh`` in the
|
||||
``scripts`` directory in this project.
|
||||
|
||||
The post consumption script cannot cancel the consumption process.
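For a script written in Python, the arguments listed above could be unpacked as in the following sketch. It assumes that the arguments arrive positionally in the order of the list; the names ``PostConsumeArgs`` and ``parse_post_consume_args`` and the sample values are invented for this example, and the tags value is kept as the raw string because its format is not described here.

```python
from collections import namedtuple

# Field order follows the argument list above (assumed to be positional).
PostConsumeArgs = namedtuple("PostConsumeArgs", [
    "document_id", "file_name", "source_path", "thumbnail_path",
    "download_url", "thumbnail_url", "correspondent", "tags",
])


def parse_post_consume_args(argv):
    """Turn sys.argv of the hook script into named fields."""
    values = argv[1:]            # argv[0] is the script path itself
    if len(values) != len(PostConsumeArgs._fields):
        raise ValueError(
            f"expected {len(PostConsumeArgs._fields)} arguments, "
            f"got {len(values)}")
    return PostConsumeArgs(*values)


# Made-up sample values; in a real hook, pass sys.argv instead.
sample = ["post-consume.py", "123", "0000123.pdf", "/media/0000123.pdf",
          "/media/thumb/0000123.png", "/fetch/doc/123", "/fetch/thumb/123",
          "My bank", "money,invoices"]
args = parse_post_consume_args(sample)
print(args.document_id, args.correspondent)
```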
|
||||
|
||||
.. _advanced-file_name_handling:
|
||||
|
||||
File name handling
|
||||
##################
|
||||
|
||||
By default, paperless stores your documents in the media directory and renames them
|
||||
using the identifier which it has assigned to each document. You will end up getting
|
||||
files like ``0000123.pdf`` in your media directory. This isn't necessarily a bad
|
||||
thing, because you normally don't have to access these files manually. However, if
|
||||
you wish to name your files differently, you can do that by adjusting the
|
||||
``PAPERLESS_FILENAME_FORMAT`` configuration option.
|
||||
|
||||
This variable allows you to configure the filename (folders are allowed) using
|
||||
placeholders. For example, configuring this to
|
||||
|
||||
.. code:: bash
|
||||
|
||||
PAPERLESS_FILENAME_FORMAT={created_year}/{correspondent}/{title}
|
||||
|
||||
will create a directory structure as follows:
|
||||
|
||||
.. code::
|
||||
|
||||
2019/
|
||||
My bank/
|
||||
Statement January.pdf
|
||||
Statement February.pdf
|
||||
2020/
|
||||
My bank/
|
||||
Statement January.pdf
|
||||
Letter.pdf
|
||||
Letter_01.pdf
|
||||
Shoe store/
|
||||
My new shoes.pdf
|
||||
|
||||
.. danger::
|
||||
|
||||
Do not manually move your files in the media folder. Paperless remembers the
|
||||
last filename a document was stored as. If you do rename a file, paperless will
|
||||
report your files as missing and won't be able to find them.
|
||||
|
||||
Paperless provides the following placeholders within filenames:
|
||||
|
||||
* ``{correspondent}``: The name of the correspondent, or "none".
|
||||
* ``{document_type}``: The name of the document type, or "none".
|
||||
* ``{tag_list}``: A comma separated list of all tags assigned to the document.
|
||||
* ``{title}``: The title of the document.
|
||||
* ``{created}``: The full date and time the document was created.
|
||||
* ``{created_year}``: Year created only.
|
||||
* ``{created_month}``: Month created only (number 1-12).
|
||||
* ``{created_day}``: Day created only (number 1-31).
|
||||
* ``{added}``: The full date and time the document was added to paperless.
|
||||
* ``{added_year}``: Year added only.
|
||||
* ``{added_month}``: Month added only (number 1-12).
|
||||
* ``{added_day}``: Day added only (number 1-31).
|
||||
|
||||
|
||||
Paperless will try to conserve the information from your database as much as possible.
|
||||
However, some characters that you can use in document titles and correspondent names (such
|
||||
as ``: \ /`` and a couple more) are not allowed in filenames and will be replaced with dashes.
|
||||
|
||||
If paperless detects that two documents share the same filename, paperless will automatically
|
||||
append ``_01``, ``_02``, etc. to the filename. This happens if all the placeholders in a filename
|
||||
evaluate to the same value.
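
The placeholder expansion, character replacement and duplicate handling described above can be sketched roughly as follows. This is only an illustration and not the code paperless actually runs: the helper names ``sanitize`` and ``render_filename``, the fixed ``.pdf`` suffix and the exact set of replaced characters are assumptions made for the example.

```python
import re

# Characters this sketch treats as unsafe in filenames.
# Assumption: the real list in paperless contains a few more characters.
UNSAFE = re.compile(r'[:\\/]')


def sanitize(value):
    """Replace characters that are not allowed in filenames with dashes."""
    return UNSAFE.sub("-", value)


def render_filename(fmt, fields, taken):
    """Expand placeholders, then append _01, _02, ... until the name is unused."""
    # Only the field values are sanitized, so slashes in the format
    # string itself still create folders.
    safe = {key: sanitize(str(value)) for key, value in fields.items()}
    base = fmt.format(**safe)
    candidate = base + ".pdf"
    counter = 0
    while candidate in taken:
        counter += 1
        candidate = f"{base}_{counter:02d}.pdf"
    taken.add(candidate)
    return candidate


taken = set()
fmt = "{created_year}/{correspondent}/{title}"
fields = {"created_year": 2020, "correspondent": "My bank", "title": "Letter"}
first = render_filename(fmt, fields, taken)   # 2020/My bank/Letter.pdf
second = render_filename(fmt, fields, taken)  # 2020/My bank/Letter_01.pdf
```

Two documents with identical placeholder values end up as ``Letter.pdf`` and ``Letter_01.pdf``, matching the directory listing shown earlier.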
|
||||
|
||||
.. hint::
|
||||
|
||||
Paperless checks the filename of a document whenever it is saved. Therefore,
|
||||
you need to update the filenames of your documents and move them after altering
|
||||
this setting by invoking the :ref:`document renamer <utilities-renamer>`.
|
||||
|
||||
.. warning::
|
||||
|
||||
Make absolutely sure you get the spelling of the placeholders right, or else
|
||||
paperless will use the default naming scheme instead.
|
||||
|
||||
.. caution::
|
||||
|
||||
As of now, you could totally tell paperless to store your files anywhere outside
|
||||
the media directory by setting
|
||||
|
||||
.. code::
|
||||
|
||||
PAPERLESS_FILENAME_FORMAT=../../my/custom/location/{title}
|
||||
|
||||
However, keep in mind that inside docker, if files get stored outside of the
|
||||
predefined volumes, they will be lost after a restart of paperless.
|
296
docs/api.rst
@@ -1,23 +1,291 @@
|
||||
.. _api:
|
||||
|
||||
************
|
||||
The REST API
|
||||
############
|
||||
************
|
||||
|
||||
Paperless makes use of the `Django REST Framework`_ standard API interface
|
||||
because of its inherent awesomeness. Conveniently, the system is also
|
||||
self-documenting, so to learn more about the access points, schema, what's
|
||||
accepted and what isn't, you need only visit ``/api`` on your local Paperless
|
||||
installation.
|
||||
|
||||
Paperless makes use of the `Django REST Framework`_ standard API interface.
|
||||
It provides a browsable API for most of its endpoints, which you can inspect
|
||||
at ``http://<paperless-host>:<port>/api/``. This also documents most of the
|
||||
available filters and ordering fields.
|
||||
|
||||
.. _Django REST Framework: http://django-rest-framework.org/
|
||||
|
||||
The API provides 5 main endpoints:
|
||||
|
||||
.. _api-uploading:
|
||||
* ``/api/documents/``: Full CRUD support, except POSTing new documents. See below.
|
||||
* ``/api/correspondents/``: Full CRUD support.
|
||||
* ``/api/document_types/``: Full CRUD support.
|
||||
* ``/api/logs/``: Read-Only.
|
||||
* ``/api/tags/``: Full CRUD support.
|
||||
|
||||
Uploading
|
||||
---------
|
||||
All of these endpoints except for the logging endpoint
|
||||
allow you to fetch, edit and delete individual objects
|
||||
by appending their primary key to the path, for example ``/api/documents/454/``.
|
||||
|
||||
File uploads in an API are hard and so far as I've been able to tell, there's
|
||||
no standard way of accepting them, so rather than crowbar file uploads into the
|
||||
REST API and endure that headache, I've left that process to a simple HTTP
|
||||
POST, documented on the :ref:`consumption page <consumption-http>`.
|
||||
The objects served by the document endpoint contain the following fields:
|
||||
|
||||
* ``id``: ID of the document. Read-only.
|
||||
* ``title``: Title of the document.
|
||||
* ``content``: Plain text content of the document.
|
||||
* ``tags``: List of IDs of tags assigned to this document, or empty list.
|
||||
* ``document_type``: Document type of this document, or null.
|
||||
* ``correspondent``: Correspondent of this document or null.
|
||||
* ``created``: The date at which this document was created.
|
||||
* ``modified``: The date at which this document was last edited in paperless. Read-only.
|
||||
* ``added``: The date at which this document was added to paperless. Read-only.
|
||||
* ``archive_serial_number``: The identifier of this document in a physical document archive.
|
||||
* ``original_file_name``: Verbose filename of the original document. Read-only.
|
||||
* ``archived_file_name``: Verbose filename of the archived document. Read-only. Null if no archived document is available.
|
||||
|
||||
|
||||
Downloading documents
|
||||
#####################
|
||||
|
||||
In addition to that, the document endpoint offers these additional actions on
|
||||
individual documents:
|
||||
|
||||
* ``/api/documents/<pk>/download/``: Download the document.
|
||||
* ``/api/documents/<pk>/preview/``: Display the document inline,
|
||||
without downloading it.
|
||||
* ``/api/documents/<pk>/thumb/``: Download the PNG thumbnail of a document.
|
||||
|
||||
Paperless generates archived PDF/A documents from consumed files and stores both
|
||||
the original files as well as the archived files. By default, the endpoints
|
||||
for previews and downloads serve the archived file, if it is available.
|
||||
Otherwise, the original file is served.
|
||||
Some documents cannot be archived.
|
||||
|
||||
The endpoints correctly serve the response header fields ``Content-Disposition``
|
||||
and ``Content-Type`` to indicate the filename for download and the type of content of
|
||||
the document.
|
||||
|
||||
In order to download or preview the original document when an archived document is available,
|
||||
supply the query parameter ``original=true``.
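
As a small illustration, the URLs for these actions could be assembled like this. The helper ``document_url``, the host ``http://localhost:8000`` and the document ID are assumptions made for the example; only the paths and the ``original=true`` parameter come from the text above.

```python
from urllib.parse import urlencode


def document_url(base_url, pk, action="download", original=False):
    """Build the URL of one of the per-document actions described above."""
    if action not in ("download", "preview", "thumb"):
        raise ValueError(f"unknown action: {action}")
    if original and action == "thumb":
        raise ValueError("original=true is only described for download and preview")
    url = f"{base_url.rstrip('/')}/api/documents/{pk}/{action}/"
    if original:
        url += "?" + urlencode({"original": "true"})
    return url


# The host and the document ID are placeholders for illustration.
print(document_url("http://localhost:8000", 454, "preview", original=True))
# prints http://localhost:8000/api/documents/454/preview/?original=true
```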
|
||||
|
||||
.. hint::
|
||||
|
||||
Paperless used to provide this functionality at ``/fetch/<pk>/preview``,
|
||||
``/fetch/<pk>/thumb`` and ``/fetch/<pk>/doc``. Redirects to the new URLs
|
||||
are in place. However, if you use these old URLs to access documents, you
|
||||
should update your app or script to use the new URLs.
|
||||
|
||||
|
||||
Getting document metadata
|
||||
#########################
|
||||
|
||||
The API also has an endpoint to retrieve read-only metadata about specific documents. This
|
||||
information is not served along with the document objects, since it requires reading
|
||||
files and would therefore slow down document lists considerably.
|
||||
|
||||
Access the metadata of a document with an ID ``id`` at ``/api/documents/<id>/metadata/``.
|
||||
|
||||
The endpoint reports the following data:
|
||||
|
||||
* ``original_checksum``: MD5 checksum of the original document.
|
||||
* ``original_size``: Size of the original document, in bytes.
|
||||
* ``original_mime_type``: Mime type of the original document.
|
||||
* ``media_filename``: Current filename of the document, under which it is stored inside the media directory.
|
||||
* ``has_archive_version``: True, if this document is archived, false otherwise.
|
||||
* ``original_metadata``: A list of metadata associated with the original document. See below.
|
||||
* ``archive_checksum``: MD5 checksum of the archived document, or null.
|
||||
* ``archive_size``: Size of the archived document in bytes, or null.
|
||||
* ``archive_metadata``: Metadata associated with the archived document, or null. See below.
|
||||
|
||||
File metadata is reported as a list of objects in the following form:
|
||||
|
||||
.. code:: json
|
||||
|
||||
[
|
||||
{
|
||||
"namespace": "http://ns.adobe.com/pdf/1.3/",
|
||||
"prefix": "pdf",
|
||||
"key": "Producer",
|
||||
"value": "SparklePDF, Fancy edition"
|
||||
}
|
||||
]
|
||||
|
||||
``namespace`` and ``prefix`` can be null. The actual metadata reported depends on the file type and the metadata
|
||||
available in that specific document. Paperless only reports PDF metadata at this point.
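
A client might want to turn this list into a simple lookup table. The following sketch shows one way to do that; the function name ``flatten_metadata`` and the ``prefix:key`` naming scheme are choices made for this example and not part of the API.

```python
def flatten_metadata(entries):
    """Turn the list of metadata objects into a {"prefix:key": value} dict."""
    flat = {}
    # archive_metadata can be null, so treat None like an empty list.
    for entry in entries or []:
        prefix = entry.get("prefix")
        # namespace and prefix can be null; fall back to the bare key.
        name = f"{prefix}:{entry['key']}" if prefix else entry["key"]
        flat[name] = entry["value"]
    return flat


sample = [
    {
        "namespace": "http://ns.adobe.com/pdf/1.3/",
        "prefix": "pdf",
        "key": "Producer",
        "value": "SparklePDF, Fancy edition",
    }
]
print(flatten_metadata(sample))
# prints {'pdf:Producer': 'SparklePDF, Fancy edition'}
```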
|
||||
|
||||
Authorization
|
||||
#############
|
||||
|
||||
The REST API provides three different forms of authentication.
|
||||
|
||||
1. Basic authentication
|
||||
|
||||
Authorize by providing an HTTP header in the form
|
||||
|
||||
.. code::
|
||||
|
||||
Authorization: Basic <credentials>
|
||||
|
||||
where ``credentials`` is a base64-encoded string of ``<username>:<password>``.
|
||||
|
||||
2. Session authentication
|
||||
|
||||
When you're logged into paperless in your browser, you're automatically
|
||||
logged into the API as well and don't need to provide any authorization
|
||||
headers.
|
||||
|
||||
3. Token authentication
|
||||
|
||||
Paperless also offers an endpoint to acquire authentication tokens.
|
||||
|
||||
POST a username and password as a form or json string to ``/api/token/``
|
||||
and paperless will respond with a token, if the login data is correct.
|
||||
This token can be used to authenticate other requests with the
|
||||
following HTTP header:
|
||||
|
||||
.. code::
|
||||
|
||||
Authorization: Token <token>
|
||||
|
||||
Tokens can be managed and revoked in the paperless admin.
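
The two header-based schemes can be sketched in a few lines of Python. The helper names ``basic_auth_header`` and ``token_auth_header`` and the sample credentials are made up for this example; the header formats are the ones given above.

```python
import base64


def basic_auth_header(username, password):
    """Header for basic authentication: base64 of "<username>:<password>"."""
    credentials = f"{username}:{password}".encode("utf-8")
    encoded = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {encoded}"}


def token_auth_header(token):
    """Header for token authentication, using a token from /api/token/."""
    return {"Authorization": f"Token {token}"}


# "user" and "pass" are placeholder credentials.
print(basic_auth_header("user", "pass"))
# prints {'Authorization': 'Basic dXNlcjpwYXNz'}
```

Either dictionary can be passed as request headers by whatever HTTP client you use.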
|
||||
|
||||
Searching for documents
|
||||
#######################
|
||||
|
||||
Paperless-ng offers API endpoints for full text search. These are as follows:
|
||||
|
||||
``/api/search/``
|
||||
================
|
||||
|
||||
Get search results based on a query.
|
||||
|
||||
Query parameters:
|
||||
|
||||
* ``query``: The query string. See
|
||||
`here <https://whoosh.readthedocs.io/en/latest/querylang.html>`_
|
||||
for details on the syntax.
|
||||
* ``page``: Specify the page you want to retrieve. Each page
|
||||
contains 10 search results and the first page is ``page=1``, which
|
||||
is the default if this is omitted.
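
For illustration, a search request URL with both parameters could be built as follows. The helper ``search_url`` and the host are assumptions made for the example; note that the query string has to be URL-encoded.

```python
from urllib.parse import urlencode


def search_url(base_url, query, page=1):
    """Build the URL for a full text search request."""
    params = {"query": query, "page": page}
    return f"{base_url.rstrip('/')}/api/search/?{urlencode(params)}"


# The host is a placeholder for your own paperless installation.
print(search_url("http://localhost:8000", "bank statement", page=2))
# prints http://localhost:8000/api/search/?query=bank+statement&page=2
```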
|
||||
|
||||
Result list object returned by the endpoint:
|
||||
|
||||
.. code:: json
|
||||
|
||||
{
|
||||
"count": 1,
|
||||
"page": 1,
|
||||
"page_count": 1,
|
||||
"corrected_query": "",
|
||||
"results": [
|
||||
|
||||
]
|
||||
}
|
||||
|
||||
* ``count``: The approximate total number of results.
|
||||
* ``page``: The page returned to you. This might be different from
|
||||
the page you requested, if you requested a page that is behind
|
||||
the last page. In that case, the last page is returned.
|
||||
* ``page_count``: The total number of pages.
|
||||
* ``corrected_query``: Corrected version of the query string. Can be null.
|
||||
If not null, can be used verbatim to start a new query.
|
||||
* ``results``: A list of result objects on the current page.
|
||||
|
||||
Result object:
|
||||
|
||||
.. code:: json
|
||||
|
||||
{
|
||||
"id": 1,
|
||||
"highlights": [
|
||||
|
||||
],
|
||||
"score": 6.34234,
|
||||
"rank": 23,
|
||||
"document": {
|
||||
|
||||
}
|
||||
}
|
||||
|
||||
* ``id``: the primary key of the found document
|
||||
* ``highlights``: an object containing parsable highlights for the result.
|
||||
See below.
|
||||
* ``score``: The score assigned to the document. A higher score indicates a
|
||||
better match with the query. Search results are sorted descending by score.
|
||||
* ``rank``: the position of the document within the entire search results list.
|
||||
* ``document``: The full json of the document, as returned by
|
||||
``/api/documents/<id>/``.
|
||||
|
||||
Highlights object:
|
||||
|
||||
Highlights are provided as a list of fragments. A fragment is a longer section of
|
||||
text from the original document.
|
||||
Each fragment contains a list of strings, and some of them are marked as a highlight.
|
||||
|
||||
.. code:: json
|
||||
|
||||
[
|
||||
[
|
||||
{"text": "This is a sample text with a "},
|
||||
{"text": "highlighted", "term": 0},
|
||||
{"text": " word."}
|
||||
],
|
||||
[
|
||||
{"text": "Another", "term": 1},
|
||||
{"text": " fragment with a highlight."}
|
||||
]
|
||||
]
|
||||
|
||||
|
||||
|
||||
When ``term`` is present within a string, the word within ``text`` should be highlighted.
|
||||
The term index groups multiple matches together and words with the same index
|
||||
should get identical highlighting.
|
||||
A client may use this example to produce the following output:
|
||||
|
||||
... This is a sample text with a **highlighted** word. ... **Another** fragment with a highlight. ...
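
A client could produce that output with something along these lines. This is a minimal sketch: the function name ``render_highlights`` and the use of ``**`` as the highlight marker are choices made for the example, and it ignores the term index, so all matches get the same highlighting.

```python
def render_highlights(fragments):
    """Join fragments with " ... " and wrap highlighted strings in ** markers."""
    rendered = []
    for fragment in fragments:
        parts = []
        for piece in fragment:
            text = piece["text"]
            # A piece is a highlight whenever the "term" key is present.
            parts.append(f"**{text}**" if "term" in piece else text)
        rendered.append("".join(parts))
    return "... " + " ... ".join(rendered) + " ..."


highlights = [
    [
        {"text": "This is a sample text with a "},
        {"text": "highlighted", "term": 0},
        {"text": " word."},
    ],
    [
        {"text": "Another", "term": 1},
        {"text": " fragment with a highlight."},
    ],
]
print(render_highlights(highlights))
```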
|
||||
|
||||
``/api/search/autocomplete/``
|
||||
=============================
|
||||
|
||||
Get auto completions for a partial search term.
|
||||
|
||||
Query parameters:
|
||||
|
||||
* ``term``: The incomplete term.
|
||||
* ``limit``: Number of results. Defaults to 10.
|
||||
|
||||
Results returned by the endpoint are ordered by importance of the term in the
|
||||
document index. The first result is the term that has the highest Tf/Idf score
|
||||
in the index.
|
||||
|
||||
.. code:: json
|
||||
|
||||
[
|
||||
"term1",
|
||||
"term3",
|
||||
"term6",
|
||||
"term4"
|
||||
]
|
||||
|
||||
|
||||
.. _api-file_uploads:
|
||||
|
||||
POSTing documents
|
||||
#################
|
||||
|
||||
The API provides a special endpoint for file uploads:
|
||||
|
||||
``/api/documents/post_document/``
|
||||
|
||||
POST a multipart form to this endpoint, where the form field ``document`` contains
|
||||
the document that you want to upload to paperless. The filename is sanitized and
|
||||
then used to store the document in a temporary directory, and the consumer will
|
||||
be instructed to consume the document from there.
|
||||
|
||||
The endpoint supports the following optional form fields:
|
||||
|
||||
* ``title``: Specify a title that the consumer should use for the document.
|
||||
* ``correspondent``: Specify the ID of a correspondent that the consumer should use for the document.
|
||||
* ``document_type``: Similar to correspondent.
|
||||
* ``tags``: Similar to correspondent. Specify this multiple times to have multiple tags added
|
||||
to the document.
|
||||
|
||||
The endpoint will immediately return "OK" if the document consumption process
|
||||
was started successfully. No additional status information about the consumption
|
||||
process itself is available, since that happens in a different process.
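
The multipart form described above can be assembled with the Python standard library alone. The sketch below is not taken from paperless; the function name ``build_upload_form``, the generic ``application/octet-stream`` content type for the file part and the sample values are assumptions. It shows in particular how the ``tags`` field is repeated once per tag ID.

```python
import uuid


def build_upload_form(filename, content, title=None, correspondent=None,
                      document_type=None, tags=()):
    """Return (content_type, body) for a multipart POST to post_document."""
    boundary = uuid.uuid4().hex
    fields = []
    if title is not None:
        fields.append(("title", str(title)))
    if correspondent is not None:
        fields.append(("correspondent", str(correspondent)))
    if document_type is not None:
        fields.append(("document_type", str(document_type)))
    # Repeat the "tags" field once per tag ID.
    fields.extend(("tags", str(tag)) for tag in tags)

    lines = []
    for name, value in fields:
        lines += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            value.encode("utf-8"),
        ]
    # The file itself goes into the required "document" field.
    lines += [
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="document"; filename="{filename}"'.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        content,
        f"--{boundary}--".encode(),
        b"",
    ]
    return f"multipart/form-data; boundary={boundary}", b"\r\n".join(lines)


# Placeholder file content, title and tag IDs for illustration.
content_type, body = build_upload_form(
    "scan.pdf", b"%PDF-1.4", title="Invoice", tags=[3, 7]
)
```

The resulting ``body`` would then be POSTed to ``/api/documents/post_document/`` with ``content_type`` as the ``Content-Type`` header, together with one of the authorization headers described earlier.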
|
||||
|
@@ -1,4 +1,333 @@
|
||||
|
||||
.. _paperless_changelog:
|
||||
|
||||
*********
|
||||
Changelog
|
||||
*********
|
||||
|
||||
|
||||
paperless-ng 0.9.8
|
||||
##################
|
||||
|
||||
This release addresses two severe issues with the previous release.
|
||||
|
||||
* The delete buttons for document types, correspondents and tags were not working.
|
||||
* The document section in the admin was causing internal server errors (500).
|
||||
|
||||
|
||||
paperless-ng 0.9.7
|
||||
##################
|
||||
|
||||
|
||||
* Front end
|
||||
|
||||
* Thanks to the hard work of `Michael Shamoon`_, paperless now comes with a much more streamlined UI for
|
||||
filtering documents.
|
||||
|
||||
* `Michael Shamoon`_ replaced the document preview with another component. This should fix compatibility with Safari browsers.
|
||||
|
||||
* Added buttons to the management pages to quickly show all documents with one specific tag, correspondent, or title.
|
||||
|
||||
* Paperless now stores your saved views on the server and associates them with your user account.
|
||||
This means that you can access your views on multiple devices and have separate views for different users.
|
||||
You will have to recreate your views.
|
||||
|
||||
* The GitHub and documentation links now open in new tabs/windows. Thanks to `rYR79435`_.
|
||||
|
||||
* Paperless now generates default saved view names when saving views with certain filter rules.
|
||||
|
||||
* Added a small version indicator to the front end.
|
||||
|
||||
* Other additions and changes
|
||||
|
||||
* The new filename format field ``{tag_list}`` inserts a list of tags into the filename, separated by commas.
|
||||
* The ``document_retagger`` no longer removes inbox tags or tags without matching rules.
|
||||
* The new configuration option ``PAPERLESS_COOKIE_PREFIX`` allows you to run multiple instances of paperless on different ports.
|
||||
This option enables you to be logged in to multiple instances by specifying different cookie names for each instance.
|
||||
|
||||
* Fixes
|
||||
|
||||
* Sometimes paperless would assign dates in the future to newly consumed documents.
|
||||
* The filename format fields ``{created_month}`` and ``{created_day}`` now use a leading zero for single digit values.
|
||||
* The filename format field ``{tags}`` can no longer be used without arguments.
|
||||
* Paperless was not able to consume many images (especially images from mobile scanners) due to missing DPI information.
|
||||
Paperless now assumes A4 paper size for PDF generation if no DPI information is present.
|
||||
* Documents with empty titles could not be opened from the table view due to the link being empty.
|
||||
* Fixed an issue with filenames containing special characters such as ``:`` not being accepted for upload.
|
||||
* Fixed issues with thumbnail generation for plain text files.
|
||||
|
||||
|
||||
paperless-ng 0.9.6
|
||||
##################
|
||||
|
||||
This release focusses primarily on many small issues with the UI.
|
||||
|
||||
* Front end
|
||||
|
||||
* Paperless now has proper window titles.
|
||||
* Fixed an issue with the small cards when more than 7 tags were used.
|
||||
* Navigation of the "Show all" links adjusted. They navigate to the saved view now, if available in the sidebar.
|
||||
* Some indication on the document lists that a filter is active was added.
|
||||
* There's a new filter to filter for documents that do *not* have a certain tag.
|
||||
* The file upload box now shows upload progress.
|
||||
* The document edit page was reorganized.
|
||||
* The document edit page shows various information about a document.
|
||||
* An issue with the height of the preview was fixed.
|
||||
* Table issues with too long document titles fixed.
|
||||
|
||||
* API
|
||||
|
||||
* The API now serves file names with documents.
|
||||
* The API now serves various metadata about documents.
|
||||
* API documentation updated.
|
||||
|
||||
* Other
|
||||
|
||||
* Fixed an issue with the docker image when a non-standard PostgreSQL port was used.
|
||||
* The docker image was trying to check for installed languages before actually installing them.
|
||||
* ``FILENAME_FORMAT`` placeholder for document types.
|
||||
* The filename formatter is now less restrictive with file names and tries to
|
||||
conserve the original correspondents, types and titles as much as possible.
|
||||
* The filename formatter does not include the document ID in filenames anymore. It will
|
||||
rather append ``_01``, ``_02``, etc when it detects duplicate filenames.
|
||||
|
||||
.. note::
|
||||
|
||||
The changes to the filename format will apply to newly added documents and changed documents.
|
||||
If you want all files to reflect these changes, execute the ``document_renamer`` management
|
||||
command.
|
||||
|
||||
|
||||
paperless-ng 0.9.5
|
||||
##################
|
||||
|
||||
This release concludes the big changes I wanted to get rolled into paperless. The next releases before 1.0 will
|
||||
focus on fixing issues, primarily.
|
||||
|
||||
* OCR
|
||||
|
||||
* Paperless now uses `OCRmyPDF <https://github.com/jbarlow83/OCRmyPDF>`_ to perform OCR on documents.
|
||||
It still uses tesseract under the hood, but the PDF parser of Paperless has changed considerably and
|
||||
will behave differently for some documents.
|
||||
* OCRmyPDF creates archived PDF/A documents with embedded text that can be selected in the front end.
|
||||
* Paperless stores archived versions of documents alongside the originals. The originals can be
|
||||
accessed on the document edit page. If available, a dropdown menu will appear next to the download button.
|
||||
* Many of the configuration options regarding OCR have changed. See :ref:`configuration-ocr` for details.
|
||||
* Paperless no longer guesses the language of your documents. It always uses the language that you
|
||||
specified with ``PAPERLESS_OCR_LANGUAGE``. Be sure to set this to the language the majority of your
|
||||
documents are in. Multiple languages can be specified, but that requires more CPU time.
|
||||
* The management command :ref:`document_archiver <utilities-archiver>` can be used to create archived versions for already
|
||||
existing documents.
|
||||
|
||||
* Tags from consumption folder.
|
||||
|
||||
* Thanks to `jayme-github`_, paperless now consumes files from sub folders in the consumption folder and is able to assign tags
|
||||
based on the sub folders a document was found in. This can be configured with ``PAPERLESS_CONSUMER_RECURSIVE`` and
|
||||
``PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS``.
|
||||
|
||||
* API
|
||||
|
||||
* The API now offers token authentication.
|
||||
* The endpoint for uploading documents now supports specifying custom titles, correspondents, tags and types.
|
||||
This can be used by clients to override the default behavior of paperless. See :ref:`api-file_uploads`.
|
||||
* The document endpoint of the API now serves documents in this form:
|
||||
|
||||
* correspondents, document types and tags are referenced by their ID in the fields ``correspondent``, ``document_type`` and ``tags``. The ``*_id`` versions are gone. These fields are read/write.
|
||||
* paperless does not serve nested tags, correspondents or types anymore.
|
||||
|
||||
* Front end
|
||||
|
||||
* Paperless does some basic caching of correspondents, tags and types and will only request them from the server when necessary or when entirely reloading the page.
|
||||
* Document list fetching is about 10%-30% faster now, especially when lots of tags/correspondents are present.
|
||||
* Some minor improvements to the front end, such as document count in the document list, better highlighting of the current page, and improvements to the filter behavior.
|
||||
|
||||
* Fixes:
|
||||
|
||||
* A bug with the generation of filenames for files with unsupported types caused the exporter and
|
||||
document saving to crash.
|
||||
* Mail handling no longer exits entirely when encountering errors. It will skip the account/rule/message on which the error occurred.
|
||||
* Assigning correspondents from mail sender names failed for very long names. Paperless no longer assigns correspondents in these cases.
|
||||
|
||||
paperless-ng 0.9.4
|
||||
##################
|
||||
|
||||
* Searching:
|
||||
|
||||
* Paperless now supports searching by tags, types and dates and correspondents. In order to have this applied to your
|
||||
existing documents, you need to perform a ``document_index reindex`` management command
|
||||
(see :ref:`administration-index`)
|
||||
that adds the data to the search index. You only need to do this once, since the schema of the search index changed.
|
||||
Paperless keeps the index updated after that whenever something changes.
|
||||
* Paperless now has spelling corrections ("Did you mean") for mistyped queries.
|
||||
* The documentation contains :ref:`information about the query syntax <basic-searching>`.
|
||||
|
||||
* Front end:
|
||||
|
||||
* Clickable tags, correspondents and types allow quick filtering for related documents.
|
||||
* Saved views are now editable.
|
||||
* Preview documents directly in the browser.
|
||||
* Navigation from the dashboard to saved views.
|
||||
|
||||
* Fixes:
|
||||
|
||||
* A severe error when trying to use post consume scripts.
|
||||
* An error in the consumer that caused invalid messages of missing files to show up in the log.
|
||||
|
||||
* The documentation now contains information about bare metal installs and a section about
|
||||
how to setup the development environment.
|
||||
|
||||
paperless-ng 0.9.3
|
||||
##################
|
||||
|
||||
* Setting ``PAPERLESS_AUTO_LOGIN_USERNAME`` replaces ``PAPERLESS_DISABLE_LOGIN``.
|
||||
You have to specify your username.
|
||||
* Added a simple sanity checker that checks your documents for missing or orphaned files,
|
||||
files with wrong checksums, inaccessible files, and documents with empty content.
|
||||
* It is no longer possible to encrypt your documents. For the time being, paperless will
|
||||
continue to operate with already encrypted documents.
|
||||
* Fixes:
|
||||
|
||||
* Paperless now uses inotify again, since the watchdog was causing issues which I was not
|
||||
aware of.
|
||||
* Issue with the automatic classifier not working with only one tag.
|
||||
* A couple of issues with the search index being opened too eagerly.
|
||||
|
||||
* Added lots of tests for various parts of the application.
|
||||
|
||||
paperless-ng 0.9.2
|
||||
##################
|
||||
|
||||
* Major changes to the front end (colors, logo, shadows, layout of the cards,
|
||||
better mobile support)
|
||||
|
||||
* Paperless now uses mime types and libmagic detection to determine
|
||||
if a file type is supported and which parser to use. Removes all
|
||||
file type checks that were present in MANY different places in
|
||||
paperless.
|
||||
|
||||
* Mail consumer now correctly consumes documents even when their
|
||||
content type was not set correctly. (i.e. PDF documents with
|
||||
content type ``application/octet-stream``)
|
||||
|
||||
* Basic sorting of mail rules added
|
||||
|
||||
* Much better admin for mail rule editing.
|
||||
|
||||
* Docker entrypoint script awaits the database server if it is
|
||||
configured.
|
||||
|
||||
* Disabled editing of logs.
|
||||
|
||||
* New setting ``PAPERLESS_OCR_PAGES`` limits the tesseract parser
|
||||
to the first n pages of scanned documents.
|
||||
|
||||
* Fixed a bug where tasks with too long task names would not show
|
||||
up in the admin.
|
||||
|
||||
paperless-ng 0.9.1
|
||||
##################
|
||||
|
||||
* Moved documentation of the settings to the actual documentation.
|
||||
* Updated release script to force the user to choose between SQLite
|
||||
and PostgreSQL. This avoids confusion when upgrading from paperless.
|
||||
|
||||
|
||||
paperless-ng 0.9.0
|
||||
##################
|
||||
|
||||
* **Deprecated:** GnuPG. :ref:`See this note on the state of GnuPG in paperless-ng. <utilities-encyption>`
|
||||
This feature will most likely be removed in future versions.
|
||||
|
||||
* **Added:** New frontend. Features:
|
||||
|
||||
* Single page application: It's much more responsive than the django admin pages.
|
||||
* Dashboard. Shows recently scanned documents, or todo notes, or other documents
|
||||
at wish. Allows uploading of documents. Shows basic statistics.
|
||||
* Better document list with multiple display options.
|
||||
* Full text search with result highlighting, auto completion and scoring based
|
||||
on the query. It uses a document search index in the background.
|
||||
* Saveable filters.
|
||||
* Better log viewer.
|
||||
|
||||
* **Added:** Document types. Assign these to documents just as correspondents.
|
||||
They may be used in the future to perform automatic operations on documents
|
||||
depending on the type.
|
||||
* **Added:** Inbox tags. Define an inbox tag and it will automatically be
|
||||
assigned to any new document scanned into the system.
|
||||
* **Added:** Automatic matching. A new matching algorithm that automatically
|
||||
assigns tags, document types and correspondents to your documents. It uses
|
||||
a neural network trained on your data.
|
||||
* **Added:** Archive serial numbers. Assign these to quickly find documents stored in
|
||||
physical binders.
|
||||
* **Added:** Enabled the internal user management of django. This isn't really a
|
||||
multi user solution, however, it allows more than one user to access the website
|
||||
and set some basic permissions / renew passwords.
|
||||
|
||||
* **Modified [breaking]:** All new mail consumer with customizable filters, actions and
|
||||
multiple account support. Replaces the old mail consumer. The new mail consumer
|
||||
needs different configuration but can be configured to act exactly like the old
|
||||
consumer.
|
||||
|
||||
|
||||
* **Modified:** Changes to the consumer:
|
||||
|
||||
* Now uses the excellent watchdog library that should make sure files are
|
||||
discovered no matter what the platform is.
|
||||
* The consumer now uses a task scheduler to run consumption processes in parallel.
|
||||
This means that consuming many documents should be much faster on systems with
|
||||
many cores.
|
||||
* Concurrency is controlled with the new settings ``PAPERLESS_TASK_WORKERS``
|
||||
and ``PAPERLESS_THREADS_PER_WORKER``. See TODO for details on concurrency.
|
||||
* The consumer no longer blocks the database for extended periods of time.
|
||||
* An issue with tesseract running multiple threads per page and slowing down
|
||||
the consumer was fixed.
|
||||
|
||||
* **Modified [breaking]:** REST Api changes:
|
||||
|
||||
* New filters added, other filters removed (case sensitive filters, slug filters)
|
||||
* Endpoints for thumbnails, previews and downloads replace the old ``/fetch/`` urls. Redirects are in place.
|
||||
* Endpoint for document uploads replaces the old ``/push`` url. Redirects are in place.
|
||||
* Foreign key relationships are now served as IDs, not as urls.
|
||||
|
||||
* **Modified [breaking]:** PostgreSQL:
|
||||
|
||||
* If ``PAPERLESS_DBHOST`` is specified in the settings, paperless uses PostgreSQL instead of SQLite.
|
||||
Username, database and password all default to ``paperless`` if not specified.
|
||||
|
||||
* **Modified [breaking]:** document_retagger management command rework. See
|
||||
:ref:`utilities-retagger` for details. Replaces ``document_correspondents``
|
||||
management command.
|
||||
* **Removed [breaking]:** Reminders.
|
||||
* **Removed:** All customizations made to the django admin pages.
|
||||
* **Removed [breaking]:** The docker image no longer supports SSL. If you want to expose
|
||||
paperless to the internet, hide paperless behind a proxy server that handles SSL
|
||||
requests.
|
||||
* **Internal changes:** Mostly code cleanup, including:
|
||||
|
||||
* Rework of the code of the tesseract parser. This is now a lot cleaner.
|
||||
* Rework of the filename handling code. It was a mess.
|
||||
* Fixed some issues with the document exporter not exporting all documents when encountering duplicate filenames.
|
||||
* Added a task scheduler that takes care of checking mail, training the classifier, maintaining the document search index
|
||||
and consuming documents.
|
||||
* Updated dependencies. Now uses Pipenv all around.
|
||||
* Updated Dockerfile and docker-compose. Now uses ``supervisord`` to run everything paperless-related in a single container.
|
||||
|
||||
* **Settings:**
|
||||
|
||||
* ``PAPERLESS_FORGIVING_OCR`` is now default and gone. Reason: Even if ``langdetect`` fails to detect
|
||||
a language, tesseract still does a very good job at ocr'ing a document with the default language.
|
||||
Certain language specifics such as umlauts may not get picked up properly.
|
||||
* ``PAPERLESS_DEBUG`` defaults to ``false``.
|
||||
* The presence of ``PAPERLESS_DBHOST`` now determines whether to use PostgreSQL or
|
||||
SQLite.
|
||||
* ``PAPERLESS_OCR_THREADS`` is gone and replaced with ``PAPERLESS_TASK_WORKERS`` and
|
||||
``PAPERLESS_THREADS_PER_WORKER``. Refer to the config example for details.
|
||||
* ``PAPERLESS_OPTIMIZE_THUMBNAILS`` allows you to disable or enable thumbnail
|
||||
optimization. This is useful on less powerful devices.
|
||||
|
||||
* Many more small changes here and there. The usual stuff.
|
||||
|
||||
Paperless
|
||||
#########
|
||||
|
||||
2.7.0
|
||||
@@ -6,7 +335,7 @@ Changelog
|
||||
|
||||
* `syntonym`_ submitted a pull request to catch IMAP connection errors `#475`_.
|
||||
* `Stéphane Brunner`_ added ``psycopg2`` to the Pipfile `#489`_. He also fixed
|
||||
a syntax error in ``docker-compose.yml.example`` `#488`_ and added [DjangoQL](https://github.com/ivelum/djangoql),
|
||||
a syntax error in ``docker-compose.yml.example`` `#488`_ and added `DjangoQL`_,
|
||||
which allows a litany of handy search functionality `#492`_.
|
||||
* `CkuT`_ and `JOKer`_ hacked out a simple, but super-helpful optimisation to
|
||||
how the thumbnails are served up, improving performance considerably `#481`_.
|
||||
@@ -194,7 +523,7 @@ that it was more an annoyance than anything else, so this feature is now turned
|
||||
off unless you explicitly set a passphrase in your config file.
|
||||
|
||||
Migrating from 1.x
|
||||
------------------
|
||||
==================
|
||||
|
||||
Encryption isn't gone, it's just off for new users. So long as you have
|
||||
``PAPERLESS_PASSPHRASE`` set in your config or your environment, Paperless
|
||||
@@ -564,6 +893,9 @@ bulk of the work on this big change.
|
||||
|
||||
* Initial release
|
||||
|
||||
.. _rYR79435: https://github.com/rYR79435
|
||||
.. _Michael Shamoon: https://github.com/shamoon
|
||||
.. _jayme-github: http://github.com/jayme-github
|
||||
.. _Brian Conn: https://github.com/TheConnMan
|
||||
.. _Christopher Luu: https://github.com/nuudles
|
||||
.. _Florian Jung: https://github.com/the01
|
||||
@@ -739,6 +1071,6 @@ bulk of the work on this big change.
|
||||
.. _#489: https://github.com/the-paperless-project/paperless/pull/489
|
||||
.. _#492: https://github.com/the-paperless-project/paperless/pull/492
|
||||
|
||||
.. _pipenv: https://docs.pipenv.org/
|
||||
.. _a new home on Docker Hub: https://hub.docker.com/r/danielquinn/paperless/
|
||||
.. _optipng: http://optipng.sourceforge.net/
|
||||
.. _DjangoQL: https://github.com/ivelum/djangoql
|
||||
|
53
docs/conf.py
@@ -1,51 +1,21 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
#
|
||||
# Paperless documentation build configuration file, created by
|
||||
# sphinx-quickstart on Mon Oct 26 18:36:52 2015.
|
||||
#
|
||||
# This file is execfile()d with the current directory set to its
|
||||
# containing dir.
|
||||
#
|
||||
# Note that not all possible configuration values are present in this
|
||||
# autogenerated file.
|
||||
#
|
||||
# All configuration values have a default; values that are commented out
|
||||
# serve to show the default.
|
||||
import sphinx_rtd_theme
|
||||
|
||||
import sys
|
||||
import os
|
||||
|
||||
__version__ = None
|
||||
exec(open("../src/paperless/version.py").read())
|
||||
|
||||
|
||||
# Believe it or not, this is the officially sanctioned way to add custom CSS.
|
||||
def setup(app):
|
||||
app.add_stylesheet("custom.css")
|
||||
|
||||
# If extensions (or modules to document with autodoc) are in another directory,
|
||||
# add these directories to sys.path here. If the directory is relative to the
|
||||
# documentation root, use os.path.abspath to make it absolute, like shown here.
|
||||
#sys.path.insert(0, os.path.abspath('.'))
|
||||
|
||||
# -- General configuration ------------------------------------------------
|
||||
|
||||
# If your documentation needs a minimal Sphinx version, state it here.
|
||||
#needs_sphinx = '1.0'
|
||||
|
||||
# Add any Sphinx extension module names here, as strings. They can be
|
||||
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
|
||||
# ones.
|
||||
extensions = [
|
||||
'sphinx.ext.autodoc',
|
||||
'sphinx.ext.intersphinx',
|
||||
'sphinx.ext.todo',
|
||||
'sphinx.ext.imgmath',
|
||||
'sphinx.ext.viewcode',
|
||||
'sphinx_rtd_theme',
|
||||
]
|
||||
|
||||
# Add any paths that contain templates here, relative to this directory.
|
||||
templates_path = ['_templates']
|
||||
# templates_path = ['_templates']
|
||||
|
||||
# The suffix of source filenames.
|
||||
source_suffix = '.rst'
|
||||
@@ -57,7 +27,7 @@ source_suffix = '.rst'
|
||||
master_doc = 'index'
|
||||
|
||||
# General information about the project.
|
||||
project = u'Paperless'
|
||||
project = u'Paperless-ng'
|
||||
copyright = u'2015, Daniel Quinn'
|
||||
|
||||
# The version info for the project you're documenting, acts as replacement for
|
||||
@@ -118,7 +88,7 @@ pygments_style = 'sphinx'
|
||||
|
||||
# The theme to use for HTML and HTML Help pages. See the documentation for
|
||||
# a list of builtin themes.
|
||||
html_theme = 'default'
|
||||
html_theme = 'sphinx_rtd_theme'
|
||||
|
||||
# Theme options are theme-specific and customize the look and feel of a theme
|
||||
# further. For a list of options available for each theme, see the
|
||||
@@ -198,19 +168,6 @@ html_static_path = ['_static']
|
||||
# Output file base name for HTML help builder.
|
||||
htmlhelp_basename = 'paperless'
|
||||
|
||||
|
||||
#
|
||||
# Attempt to use the ReadTheDocs theme. If it's not installed, fallback to
|
||||
# the default.
|
||||
#
|
||||
|
||||
try:
|
||||
import sphinx_rtd_theme
|
||||
html_theme = "sphinx_rtd_theme"
|
||||
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
|
||||
except ImportError:
|
||||
pass
|
||||
|
||||
# -- Options for LaTeX output ---------------------------------------------
|
||||
|
||||
latex_elements = {
|
||||
|
426
docs/configuration.rst
Normal file
@@ -0,0 +1,426 @@
|
||||
.. _configuration:
|
||||
|
||||
*************
|
||||
Configuration
|
||||
*************
|
||||
|
||||
Paperless provides a wide range of customizations.
|
||||
Depending on how you run paperless, these settings have to be defined in different
|
||||
places.
|
||||
|
||||
* If you run paperless on docker, ``paperless.conf`` is not used. Rather, configure
|
||||
paperless by copying necessary options to ``docker-compose.env``.
|
||||
* If you are running paperless on anything else, paperless will search for the
|
||||
configuration file in these locations and use the first one it finds:
|
||||
|
||||
.. code::
|
||||
|
||||
/path/to/paperless/paperless.conf
|
||||
/etc/paperless.conf
|
||||
/usr/local/etc/paperless.conf
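The "first one it finds" rule above can be sketched as follows. This is only an illustration of the search order, not the code paperless itself runs; ``find_config`` and ``SEARCH_ORDER`` are hypothetical names, and the first path is the placeholder from the list above.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_config(candidates: Iterable[str]) -> Optional[Path]:
    """Return the first candidate that exists as a regular file, or None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None


# Search order as documented above; the first entry is a placeholder path.
SEARCH_ORDER = [
    "/path/to/paperless/paperless.conf",
    "/etc/paperless.conf",
    "/usr/local/etc/paperless.conf",
]

print(find_config(SEARCH_ORDER))
```

Because the search stops at the first hit, a file in the paperless directory shadows one in ``/etc``.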
|
||||
|
||||
|
||||
Required services
|
||||
#################
|
||||
|
||||
PAPERLESS_REDIS=<url>
|
||||
This is required for processing scheduled tasks such as email fetching, index
|
||||
optimization and for training the automatic document matcher.
|
||||
|
||||
Defaults to redis://localhost:6379.
|
||||
|
||||
PAPERLESS_DBHOST=<hostname>
|
||||
By default, sqlite is used as the database backend. This can be changed here.
|
||||
Set PAPERLESS_DBHOST and PostgreSQL will be used instead of SQLite.
|
||||
|
||||
PAPERLESS_DBPORT=<port>
|
||||
Adjust port if necessary.
|
||||
|
||||
Default is 5432.
|
||||
|
||||
PAPERLESS_DBNAME=<name>
|
||||
Database name in PostgreSQL.
|
||||
|
||||
Defaults to "paperless".
|
||||
|
||||
PAPERLESS_DBUSER=<name>
|
||||
Database user in PostgreSQL.
|
||||
|
||||
Defaults to "paperless".
|
||||
|
||||
PAPERLESS_DBPASS=<password>
|
||||
Database password for PostgreSQL.
|
||||
|
||||
Defaults to "paperless".
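How the presence of ``PAPERLESS_DBHOST`` could switch the backend is sketched below. This is a hedged illustration, not the actual settings module: ``database_settings`` is a hypothetical helper and the SQLite file name is a placeholder. The engine strings are Django's standard backend names, and the defaults are the ones documented above.

```python
import os
from typing import Mapping


def database_settings(env: Mapping[str, str]) -> dict:
    """Pick a database backend from the PAPERLESS_DB* variables (illustrative)."""
    host = env.get("PAPERLESS_DBHOST")
    if not host:
        # No host configured: fall back to SQLite. The file name is a placeholder.
        return {"ENGINE": "django.db.backends.sqlite3", "NAME": "db.sqlite3"}
    return {
        "ENGINE": "django.db.backends.postgresql",
        "HOST": host,
        "PORT": int(env.get("PAPERLESS_DBPORT", "5432")),
        "NAME": env.get("PAPERLESS_DBNAME", "paperless"),
        "USER": env.get("PAPERLESS_DBUSER", "paperless"),
        "PASSWORD": env.get("PAPERLESS_DBPASS", "paperless"),
    }


print(database_settings(os.environ)["ENGINE"])
```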
|
||||
|
||||
|
||||
Paths and folders
|
||||
#################
|
||||
|
||||
PAPERLESS_CONSUMPTION_DIR=<path>
|
||||
This is where your documents should go to be consumed. Make sure that it exists
|
||||
and that the user running the paperless service can read/write its contents
|
||||
before you start Paperless.
|
||||
|
||||
Don't change this when using docker, as it only changes the path within the
|
||||
container. Change the local consumption directory in the docker-compose.yml
|
||||
file instead.
|
||||
|
||||
Defaults to "../consume", relative to the "src" directory.
|
||||
|
||||
PAPERLESS_DATA_DIR=<path>
|
||||
This is where paperless stores all its data (search index, SQLite database,
|
||||
classification model, etc).
|
||||
|
||||
Defaults to "../data", relative to the "src" directory.
|
||||
|
||||
PAPERLESS_MEDIA_ROOT=<path>
|
||||
This is where your documents and thumbnails are stored.
|
||||
|
||||
You can set this and PAPERLESS_DATA_DIR to the same folder to have paperless
|
||||
store all its data within the same volume.
|
||||
|
||||
Defaults to "../media", relative to the "src" directory.
|
||||
|
||||
PAPERLESS_STATICDIR=<path>
|
||||
Override the default STATIC_ROOT here. This is where all static files
|
||||
created using "collectstatic" manager command are stored.
|
||||
|
||||
Unless you're doing something fancy, there is no need to override this.
|
||||
|
||||
Defaults to "../static", relative to the "src" directory.
|
||||
|
||||
PAPERLESS_FILENAME_FORMAT=<format>
|
||||
Changes the filenames paperless uses to store documents in the media directory.
|
||||
See :ref:`advanced-file_name_handling` for details.
|
||||
|
||||
Default is none, which disables this feature.
|
||||
|
||||
Hosting & Security
|
||||
##################
|
||||
|
||||
PAPERLESS_SECRET_KEY=<key>
|
||||
Paperless uses this to make session tokens. If you expose paperless on the
|
||||
internet, you need to change this, since the default secret is well known.
|
||||
|
||||
Use any sequence of characters. The more, the better. You don't need to
|
||||
remember this. Just face-roll your keyboard.
|
||||
|
||||
Default is listed in the file ``src/paperless/settings.py``.
|
||||
|
||||
PAPERLESS_ALLOWED_HOSTS=<comma-separated-list>
|
||||
If you're planning on putting Paperless on the open internet, then you
|
||||
really should set this value to the domain name you're using. Failing to do
|
||||
so leaves you open to HTTP host header attacks:
|
||||
https://docs.djangoproject.com/en/3.1/topics/security/#host-header-validation
|
||||
|
||||
Just remember that this is a comma-separated list, so "example.com" is fine,
|
||||
as is "example.com,www.example.com", but NOT " example.com" or "example.com,"
|
||||
|
||||
Defaults to "*", which is all hosts.
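Why a leading space or a trailing comma is a problem becomes clearer when the value is split naively. The sketch below assumes the value is split on commas without trimming, which is an assumption about the parsing rather than a confirmed detail; ``split_hosts`` and ``problems`` are hypothetical helpers.

```python
def split_hosts(value: str) -> list:
    """Split a comma-separated value the naive way, without trimming."""
    return value.split(",")


def problems(value: str) -> list:
    """Return entries that are empty or carry surrounding whitespace."""
    return [h for h in split_hosts(value) if h == "" or h != h.strip()]


for candidate in ("example.com", "example.com,www.example.com",
                  " example.com", "example.com,"):
    print(repr(candidate), "->", problems(candidate) or "ok")
```

Under that assumption, ``" example.com"`` yields a host name with a leading space and ``"example.com,"`` yields an extra empty entry, neither of which matches a real host header.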
|
||||
|
||||
PAPERLESS_CORS_ALLOWED_HOSTS=<comma-separated-list>
|
||||
You need to add your servers to the list of allowed hosts that can do CORS
|
||||
calls. Set this to your public domain name.
|
||||
|
||||
Defaults to "http://localhost:8000".
|
||||
|
||||
PAPERLESS_FORCE_SCRIPT_NAME=<path>
|
||||
To host paperless under a subpath url like example.com/paperless you set
|
||||
this value to /paperless. No trailing slash!
|
||||
|
||||
.. note::
|
||||
|
||||
I don't know if this works in paperless-ng. Probably not.
|
||||
|
||||
Defaults to none, which hosts paperless at "/".
|
||||
|
||||
PAPERLESS_STATIC_URL=<path>
|
||||
Override the STATIC_URL here. Unless you're hosting Paperless off a
|
||||
subpath like /paperless/, you probably don't need to change this.
|
||||
|
||||
Defaults to "/static/".
|
||||
|
||||
PAPERLESS_AUTO_LOGIN_USERNAME=<username>
|
||||
Specify a username here so that paperless will automatically perform login
|
||||
with the selected user.
|
||||
|
||||
.. danger::
|
||||
|
||||
Do not use this when exposing paperless on the internet. There are no
|
||||
checks in place that would prevent you from doing this.
|
||||
|
||||
Defaults to none, which disables this feature.
|
||||
|
||||
|
||||
PAPERLESS_COOKIE_PREFIX=<str>
|
||||
Specify a prefix that is added to the cookies used by paperless to identify
|
||||
the currently logged in user. This is useful for when you're running two
|
||||
instances of paperless on the same host.
|
||||
|
||||
After changing this, you will have to login again.
|
||||
|
||||
Defaults to ``""``, which does not alter the cookie names.
|
||||
|
||||
.. _configuration-ocr:
|
||||
|
||||
OCR settings
|
||||
############
|
||||
|
||||
Paperless uses `OCRmyPDF <https://ocrmypdf.readthedocs.io/en/latest/>`_ for
|
||||
performing OCR on documents and images. Paperless uses sensible defaults for
|
||||
most settings, but all of them can be configured to your needs.
|
||||
|
||||
|
||||
PAPERLESS_OCR_LANGUAGE=<lang>
|
||||
Customize the language that paperless will attempt to use when
|
||||
parsing documents.
|
||||
|
||||
It should be a 3-letter language code consistent with ISO
|
||||
639: https://www.loc.gov/standards/iso639-2/php/code_list.php
|
||||
|
||||
Set this to the language most of your documents are written in.
|
||||
|
||||
This can be a combination of multiple languages such as ``deu+eng``,
|
||||
in which case tesseract will use whatever language matches best.
|
||||
Keep in mind that tesseract uses much more cpu time with multiple
|
||||
languages enabled.
|
||||
|
||||
Defaults to "eng".
|
||||
|
||||
PAPERLESS_OCR_MODE=<mode>
|
||||
Tell paperless when and how to perform ocr on your documents. Four modes
|
||||
are available:
|
||||
|
||||
* ``skip``: Paperless skips all pages that already contain text and performs OCR only on pages
|
||||
where no text is present. This is the safest option.
|
||||
* ``skip_noarchive``: In addition to skip, paperless won't create an
|
||||
archived version of your documents when it finds any text in them.
|
||||
This is useful if you don't want to have two almost-identical versions
|
||||
of your digital documents in the media folder. This is the fastest option.
|
||||
* ``redo``: Paperless will OCR all pages of your documents and attempt to
|
||||
replace any existing text layers with new text. This will be useful for
|
||||
documents from scanners that already performed OCR with insufficient
|
||||
results. It will also perform OCR on purely digital documents.
|
||||
|
||||
This option may fail on some documents that have features that cannot
|
||||
be removed, such as forms. In this case, the text from the document is
|
||||
used instead.
|
||||
* ``force``: Paperless rasterizes your documents, converting any text
|
||||
into images and puts the OCRed text on top. This works for all documents,
|
||||
however, the resulting document may be significantly larger and text
|
||||
won't appear as sharp when zoomed in.
|
||||
|
||||
The default is ``skip``, which only performs OCR when necessary and always
|
||||
creates archived documents.
|
||||
|
||||
PAPERLESS_OCR_OUTPUT_TYPE=<type>
|
||||
Specify the type of PDF documents that paperless should produce.
|
||||
|
||||
* ``pdf``: Modify the PDF document as little as possible.
|
||||
* ``pdfa``: Convert PDF documents into PDF/A-2b documents, which is a
|
||||
subset of the entire PDF specification and meant for storing
|
||||
documents long term.
|
||||
* ``pdfa-1``, ``pdfa-2``, ``pdfa-3`` to specify the exact version of
|
||||
PDF/A you wish to use.
|
||||
|
||||
If not specified, ``pdfa`` is used. Remember that paperless also keeps
|
||||
the original input file as well as the archived version.
|
||||
|
||||
|
||||
PAPERLESS_OCR_PAGES=<num>
|
||||
Tells paperless to use only the specified number of pages for OCR. Documents
|
||||
with fewer than the specified number of pages get OCR'ed completely.
|
||||
|
||||
Specifying 1 here will only use the first page.
|
||||
|
||||
When combined with ``PAPERLESS_OCR_MODE=redo`` or ``PAPERLESS_OCR_MODE=force``,
|
||||
paperless will not modify any text it finds on excluded pages and copy it
|
||||
verbatim.
|
||||
|
||||
Defaults to 0, which disables this feature and always uses all pages.
|
||||
|
||||
|
||||
PAPERLESS_OCR_IMAGE_DPI=<num>
|
||||
Paperless will OCR any images you put into the system and convert them
|
||||
into PDF documents. This is useful if your scanner produces images.
|
||||
In order to do so, paperless needs to know the DPI of the image.
|
||||
Most images from scanners will have this information embedded and
|
||||
paperless will detect and use that information. In case this fails, it
|
||||
uses this value as a fallback.
|
||||
|
||||
Set this to the DPI your scanner produces images at.
|
||||
|
||||
Default is none, which causes paperless to fail if no DPI information is
|
||||
present in an image.
|
||||
|
||||
|
||||
PAPERLESS_OCR_USER_ARG=<json>
|
||||
OCRmyPDF offers many more options. Use this parameter to specify any
|
||||
additional arguments you wish to pass to OCRmyPDF. Since Paperless uses
|
||||
the API of OCRmyPDF, you have to specify these in a format that can be
|
||||
passed to the API. See `the API reference of OCRmyPDF <https://ocrmypdf.readthedocs.io/en/latest/api.html#reference>`_
|
||||
for valid parameters. All command line options are supported, but they
|
||||
use underscores instead of dashes.
|
||||
|
||||
.. caution::
|
||||
|
||||
Paperless has been tested to work with the OCR options provided
|
||||
above. There are many options that are incompatible with each other,
|
||||
so specifying invalid options may prevent paperless from consuming
|
||||
any documents.
|
||||
|
||||
Specify arguments as a JSON dictionary. Keep note of lower case booleans
|
||||
and double quoted parameter names and strings. Examples:
|
||||
|
||||
.. code:: json
|
||||
|
||||
{"deskew": true, "optimize": 3, "unpaper_args": "--pre-rotate 90"}
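To check a value before putting it into the configuration, it can be parsed with Python's standard ``json`` module. This is a minimal sketch; ``parse_user_args`` is a hypothetical helper, and whether paperless validates the value in the same way is an assumption.

```python
import json
from typing import Optional


def parse_user_args(raw: str) -> Optional[dict]:
    """Return the parsed dictionary, or None if raw is not a JSON object."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, dict) else None


raw = '{"deskew": true, "optimize": 3, "unpaper_args": "--pre-rotate 90"}'
print(parse_user_args(raw))

# Python-style quoting and booleans are not valid JSON and are rejected.
print(parse_user_args("{'deskew': True}"))
```

JSON ``true`` becomes Python ``True`` after parsing, which is why the booleans must be written in lower case in the setting itself.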
|
||||
|
||||
|
||||
Software tweaks
|
||||
###############
|
||||
|
||||
PAPERLESS_TASK_WORKERS=<num>
|
||||
Paperless does multiple things in the background: Maintain the search index,
|
||||
maintain the automatic matching algorithm, check emails, consume documents,
|
||||
etc. This variable specifies how many things it will do in parallel.
|
||||
|
||||
|
||||
PAPERLESS_THREADS_PER_WORKER=<num>
|
||||
Furthermore, paperless uses multiple threads when consuming documents to
|
||||
speed up OCR. This variable specifies how many pages paperless will process
|
||||
in parallel on a single document.
|
||||
|
||||
.. caution::
|
||||
|
||||
Ensure that the product
|
||||
|
||||
PAPERLESS_TASK_WORKERS * PAPERLESS_THREADS_PER_WORKER
|
||||
|
||||
does not exceed your CPU core count or else paperless will be extremely slow.
|
||||
If you want paperless to process many documents in parallel, choose a high
|
||||
worker count. If you want paperless to process very large documents faster,
|
||||
use a higher thread per worker count.
|
||||
|
||||
The default is a balance between the two, according to your CPU core count,
|
||||
with a slight favor towards threads per worker, and using as many cores as
|
||||
possible.
|
||||
|
||||
If you only specify PAPERLESS_TASK_WORKERS, paperless will adjust
|
||||
PAPERLESS_THREADS_PER_WORKER automatically.
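The constraint on the product of the two settings can be checked with a few lines of arithmetic. The sketch below is not the formula paperless uses for its defaults; ``threads_per_worker`` and ``oversubscribed`` are hypothetical helpers that only illustrate keeping the product within the core count.

```python
import os


def threads_per_worker(task_workers: int, cpu_cores: int) -> int:
    """Largest thread count that keeps workers * threads within the core count."""
    return max(1, cpu_cores // task_workers)


def oversubscribed(task_workers: int, threads: int, cpu_cores: int) -> bool:
    """True if the product of both settings exceeds the available cores."""
    return task_workers * threads > cpu_cores


cores = os.cpu_count() or 1  # os.cpu_count() may return None
workers = 2
threads = threads_per_worker(workers, cores)
print(workers, threads, oversubscribed(workers, threads, cores))
```

On an 8-core machine, for example, 2 workers leave room for 4 threads each, while 4 workers with 4 threads each would oversubscribe the CPU.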
|
||||
|
||||
|
||||
PAPERLESS_TIME_ZONE=<timezone>
|
||||
Set the time zone here.
|
||||
See https://docs.djangoproject.com/en/3.1/ref/settings/#std:setting-TIME_ZONE
|
||||
for details on how to set it.
|
||||
|
||||
Defaults to UTC.
|
||||
|
||||
|
||||
PAPERLESS_CONSUMER_POLLING=<num>
|
||||
If paperless does not find documents added to your consume folder, it might
|
||||
not be able to automatically detect filesystem changes. In that case,
|
||||
specify a polling interval in seconds here, which will then cause paperless
|
||||
to periodically check your consumption directory for changes.
|
||||
|
||||
Defaults to 0, which disables polling and uses filesystem notifications.
|
||||
|
||||
|
||||
PAPERLESS_CONSUMER_DELETE_DUPLICATES=<bool>
|
||||
When the consumer detects a duplicate document, it will not touch the
|
||||
original document. This default behavior can be changed here.
|
||||
|
||||
Defaults to false.
|
||||
|
||||
|
||||
PAPERLESS_CONSUMER_RECURSIVE=<bool>
|
||||
Enable recursive watching of the consumption directory. Paperless will
|
||||
then pick up files from subdirectories within your consumption
|
||||
directory as well.
|
||||
|
||||
Defaults to false.
|
||||
|
||||
|
||||
PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS=<bool>
|
||||
Set the names of subdirectories as tags for consumed files.
|
||||
E.g. <CONSUMPTION_DIR>/foo/bar/file.pdf will add the tags "foo" and "bar" to
|
||||
the consumed file. Paperless will create any tags that don't exist yet.
|
||||
|
||||
PAPERLESS_CONSUMER_RECURSIVE must be enabled for this to work.
|
||||
|
||||
Defaults to false.
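The mapping from subdirectories to tags can be sketched with ``pathlib``. This is an illustration of the rule described above, not the consumer's actual code; ``tags_from_path`` is a hypothetical helper and ``/consume`` is a placeholder for the consumption directory.

```python
from pathlib import PurePosixPath


def tags_from_path(consumption_dir: str, file_path: str) -> list:
    """Return the directory names between the consumption dir and the file."""
    relative = PurePosixPath(file_path).relative_to(consumption_dir)
    # The last part is the file name itself; everything before it is a tag.
    return list(relative.parts[:-1])


print(tags_from_path("/consume", "/consume/foo/bar/file.pdf"))
print(tags_from_path("/consume", "/consume/file.pdf"))
```

A file placed directly in the consumption directory has no intermediate directories and therefore receives no tags from this rule.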
|
||||
|
||||
|
||||
PAPERLESS_CONVERT_MEMORY_LIMIT=<num>
|
||||
On smaller systems, or even in the case of Very Large Documents, the consumer
|
||||
may explode, complaining about how it's "unable to extend pixel cache". In
|
||||
such cases, try setting this to a reasonably low value, like 32. The
|
||||
default is to use whatever is necessary to do everything without writing to
|
||||
disk, and units are in megabytes.
|
||||
|
||||
For more information on how to use this value, you should search
|
||||
the web for "MAGICK_MEMORY_LIMIT".
|
||||
|
||||
Defaults to 0, which disables the limit.
|
||||
|
||||
PAPERLESS_CONVERT_TMPDIR=<path>
|
||||
Similar to the memory limit, if you've got a small system and your OS mounts
|
||||
/tmp as tmpfs, you should set this to a path that's on a physical disk, like
|
||||
/home/your_user/tmp or something. ImageMagick will use this as scratch space
|
||||
when crunching through very large documents.
|
||||
|
||||
For more information on how to use this value, you should search
|
||||
the web for "MAGICK_TMPDIR".
|
||||
|
||||
Default is none, which disables the temporary directory.
|
||||
|
||||
PAPERLESS_OPTIMIZE_THUMBNAILS=<bool>
|
||||
Use optipng to optimize thumbnails. This usually reduces the size of
|
||||
thumbnails by about 20%, but uses considerable compute time during
|
||||
consumption.
|
||||
|
||||
Defaults to true.
|
||||
|
||||
PAPERLESS_POST_CONSUME_SCRIPT=<filename>
|
||||
After a document is consumed, Paperless can trigger an arbitrary script if
|
||||
you like. This script will be passed a number of arguments for you to work
|
||||
with. For more information, take a look at :ref:`advanced-post_consume_script`.
|
||||
|
||||
The default is blank, which means nothing will be executed.
|
||||
|
||||
PAPERLESS_FILENAME_DATE_ORDER=<format>
|
||||
Paperless will check the document text for document date information.
|
||||
Use this setting to enable checking the document filename for date
|
||||
information. The date order can be set to any option as specified in
|
||||
https://dateparser.readthedocs.io/en/latest/settings.html#date-order.
|
||||
The filename will be checked first, and if nothing is found, the document
|
||||
text will be checked as normal.
|
||||
|
||||
Defaults to none, which disables this feature.
|
||||
|
||||
PAPERLESS_FILENAME_PARSE_TRANSFORMS
|
||||
Transforms filenames before they are processed by paperless. See
|
||||
:ref:`advanced-transforming_filenames` for details.
|
||||
|
||||
Defaults to none, which disables this feature.
|
||||
|
||||
Binaries
|
||||
########
|
||||
|
||||
There are a few external software packages that Paperless expects to find on
|
||||
your system when it starts up. Unless you've done something creative with
|
||||
their installation, you probably won't need to edit any of these. However,
|
||||
if you've installed these programs somewhere where simply typing the name of
|
||||
the program doesn't automatically execute it (ie. the program isn't in your
|
||||
$PATH), then you'll need to specify the literal path for that program.
|
||||
|
||||
PAPERLESS_CONVERT_BINARY=<path>
|
||||
Defaults to "/usr/bin/convert".
|
||||
|
||||
PAPERLESS_GS_BINARY=<path>
|
||||
Defaults to "/usr/bin/gs".
|
||||
|
||||
PAPERLESS_OPTIPNG_BINARY=<path>
|
||||
Defaults to "/usr/bin/optipng".
|
@@ -1,255 +0,0 @@
|
||||
.. _consumption:
|
||||
|
||||
Consumption
|
||||
###########
|
||||
|
||||
Once you've got Paperless setup, you need to start feeding documents into it.
|
||||
Currently, there are three options: the consumption directory, IMAP (email), and
|
||||
HTTP POST.
|
||||
|
||||
|
||||
.. _consumption-directory:
|
||||
|
||||
The Consumption Directory
|
||||
=========================
|
||||
|
||||
The primary method of getting documents into your database is by putting them in
|
||||
the consumption directory. The ``document_consumer`` script runs in an infinite
|
||||
loop looking for new additions to this directory and when it finds them, it goes
|
||||
about the process of parsing them with the OCR, indexing what it finds, and
|
||||
encrypting the PDF (if ``PAPERLESS_PASSPHRASE`` is set), storing it in the
|
||||
media directory.
|
||||
|
||||
Getting stuff into this directory is up to you. If you're running Paperless
|
||||
on your local computer, you might just want to drag and drop files there, but if
|
||||
you're running this on a server and want your scanner to automatically push
|
||||
files to this directory, you'll need to setup some sort of service to accept the
|
||||
files from the scanner. Typically, you're looking at an FTP server like
|
||||
`Proftpd`_ or `Samba`_.
|
||||
|
||||
.. _Proftpd: http://www.proftpd.org/
|
||||
.. _Samba: http://www.samba.org/
|
||||
|
||||
So where is this consumption directory? It's wherever you define it. Look for
|
||||
the ``CONSUMPTION_DIR`` value in ``settings.py``. Set that to somewhere
|
||||
appropriate for your use and put some documents in there. When you're ready,
|
||||
follow the :ref:`consumer <utilities-consumer>` instructions to get it running.
|
||||
|
||||
|
||||
.. _consumption-directory-hook:
|
||||
|
||||
Hooking into the Consumption Process
|
||||
------------------------------------
|
||||
|
||||
Sometimes you may want to do something arbitrary whenever a document is
|
||||
consumed. Rather than try to predict what you may want to do, Paperless lets
|
||||
you execute scripts of your own choosing just before or after a document is
|
||||
consumed using a couple simple hooks.
|
||||
|
||||
Just write a script, put it somewhere that Paperless can read & execute, and
|
||||
then put the path to that script in ``paperless.conf`` with the variable name
|
||||
of either ``PAPERLESS_PRE_CONSUME_SCRIPT`` or
|
||||
``PAPERLESS_POST_CONSUME_SCRIPT``. The script will be executed before or
|
||||
after the document is consumed, respectively.
|
||||
|
||||
.. important::
|
||||
|
||||
These scripts are executed in a **blocking** process, which means that if
|
||||
a script takes a long time to run, it can significantly slow down your
|
||||
document consumption flow. If you want things to run asynchronously,
|
||||
you'll have to fork the process in your script and exit.
|
||||
|
||||
|
||||
.. _consumption-directory-hook-variables:
|
||||
|
||||
What Can These Scripts Do?
|
||||
..........................
|
||||
|
||||
It's your script, so you're only limited by your imagination and the laws of
|
||||
physics. However, the following values are passed to the scripts in order:
|
||||
|
||||
|
||||
.. _consumption-director-hook-variables-pre:
|
||||
|
||||
Pre-consumption script
|
||||
::::::::::::::::::::::
|
||||
|
||||
* Document file name
|
||||
|
||||
A simple but common example for this would be creating a simple script like
|
||||
this:
|
||||
|
||||
``/usr/local/bin/ocr-pdf``
|
||||
|
||||
.. code:: bash
|
||||
|
||||
#!/usr/bin/env bash
|
||||
pdf2pdfocr.py -i "${1}"
|
||||
|
||||
``/etc/paperless.conf``
|
||||
|
||||
.. code:: bash
|
||||
|
||||
...
|
||||
PAPERLESS_PRE_CONSUME_SCRIPT="/usr/local/bin/ocr-pdf"
|
||||
...
|
||||
|
||||
This will pass the path to the document about to be consumed to ``/usr/local/bin/ocr-pdf``,
|
||||
which will in turn call `pdf2pdfocr.py`_ on your document, which will then
|
||||
overwrite the file with an OCR'd version of the file and exit. At which point,
|
||||
the consumption process will begin with the newly modified file.
|
||||
|
||||
.. _pdf2pdfocr.py: https://github.com/LeoFCardoso/pdf2pdfocr
|
||||
|
||||
|
||||
.. _consumption-director-hook-variables-post:
|
||||
|
||||
Post-consumption script
|
||||
:::::::::::::::::::::::
|
||||
|
||||
* Document id
|
||||
* Generated file name
|
||||
* Source path
|
||||
* Thumbnail path
|
||||
* Download URL
|
||||
* Thumbnail URL
|
||||
* Correspondent
|
||||
* Tags
|
||||
|
||||
The script can be in any language you like, but for a simple shell script
|
||||
example, you can take a look at ``post-consumption-example.sh`` in the
|
||||
``scripts`` directory in this project.
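As a rough sketch of a post-consumption script in Python, the positional arguments can be mapped to names in the order listed above. The names in ``ARGUMENT_NAMES`` are hypothetical labels chosen for this example, and the exact format of each value (for instance how tags are joined) is not specified here, so the script treats them as plain strings.

```python
import sys

# Hypothetical labels, in the order the values are passed to the script.
ARGUMENT_NAMES = [
    "document_id",
    "file_name",
    "source_path",
    "thumbnail_path",
    "download_url",
    "thumbnail_url",
    "correspondent",
    "tags",
]


def parse_arguments(argv: list) -> dict:
    """Map positional arguments (without the script name) to the labels above."""
    return dict(zip(ARGUMENT_NAMES, argv))


if __name__ == "__main__":
    info = parse_arguments(sys.argv[1:])
    print("Consumed document:", info.get("document_id"), info.get("file_name"))
```

Since the script runs in a blocking process, anything slow should be handed off to a background process rather than done inline.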
|
||||
|
||||
|
||||
.. _consumption-imap:
|
||||
|
||||
IMAP (Email)
|
||||
============
|
||||
|
||||
Another handy way to get documents into your database is to email them to
|
||||
yourself. The typical use-case would be to be out for lunch and want to send a
|
||||
copy of the receipt back to your system at home. Paperless can be taught to
|
||||
pull emails down from an arbitrary account and dump them into the consumption
|
||||
directory where the process :ref:`above <consumption-directory>` will follow the
|
||||
usual pattern on consuming the document.
|
||||
|
||||
Some things you need to know about this feature:
|
||||
|
||||
* It's disabled by default. By setting the values below it will be enabled.
|
||||
* It's been tested in a limited environment, so it may not work for you (please
|
||||
submit a pull request if you can!)
|
||||
* It's designed to **delete mail from the server once consumed**. So don't go
|
||||
pointing this to your personal email account and wonder where all your stuff
|
||||
went.
|
||||
* Currently, only one photo (attachment) per email will work.
|
||||
|
||||
So, with all that in mind, here's what you do to get it running:
|
||||
|
||||
1. Setup a new email account somewhere, or if you're feeling daring, create a
|
||||
folder in an existing email box and note the path to that folder.
|
||||
2. In ``/etc/paperless.conf`` set all of the appropriate values in
|
||||
``PATHS AND FOLDERS`` and ``SECURITY``.
|
||||
If you decided to use a subfolder of an existing account, then make sure you
|
||||
set ``PAPERLESS_CONSUME_MAIL_INBOX`` accordingly here. You also have to set
|
||||
the ``PAPERLESS_EMAIL_SECRET`` to something you can remember 'cause you'll
|
||||
have to include that in every email you send.
|
||||
3. Restart the :ref:`consumer <utilities-consumer>`. The consumer will check
|
||||
the configured email account at startup and from then on every 10 minutes
|
||||
for something new and pulls down whatever it finds.
|
||||
4. Send yourself an email! Note that the subject is treated as the file name,
|
||||
so if you set the subject to ``Correspondent - Title - tag,tag,tag``, you'll
|
||||
get what you expect. Also, you must include the aforementioned secret
|
||||
string in every email so the fetcher knows that it's safe to import.
|
||||
Note that Paperless only allows the email title to consist of safe characters
|
||||
to be imported. These consist of alpha-numeric characters and ``-_ ,.'``.
|
||||
5. After a few minutes, the consumer will poll your mailbox, pull down the
|
||||
message, and place the attachment in the consumption directory with the
|
||||
appropriate name. A few minutes later, the consumer will import it like any
|
||||
other file.
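The subject convention from step 4 can be illustrated with a small parser. This is a simplified sketch that handles only the full three-part form ``Correspondent - Title - tag,tag,tag``; ``parse_subject`` is a hypothetical helper, and reading "alpha-numeric" as ASCII letters and digits is an assumption.

```python
import re

# Safe characters as described above: alphanumerics plus - _ space , . and '
SAFE_SUBJECT = re.compile(r"^[A-Za-z0-9\-_ ,.']+$")


def parse_subject(subject: str):
    """Split 'Correspondent - Title - tag,tag' into its parts, or return None."""
    if not SAFE_SUBJECT.match(subject):
        return None
    parts = subject.split(" - ")
    if len(parts) != 3:
        return None
    correspondent, title, tags = parts
    return {
        "correspondent": correspondent,
        "title": title,
        "tags": tags.split(","),
    }


print(parse_subject("ACME - Invoice 12 - bills,2020"))
print(parse_subject("ACME - Invoice #12 - bills"))
```

The second call returns ``None`` because ``#`` is not among the safe characters.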
|
||||
|
||||
|
||||
.. _consumption-http:
|
||||
|
||||
HTTP POST
|
||||
=========
|
||||
|
||||
You can also submit a document via HTTP POST, so long as you do so after
|
||||
authenticating. To push your document to Paperless, send an HTTP POST to the
|
||||
server with the following name/value pairs:
|
||||
|
||||
* ``correspondent``: The name of the document's correspondent. Note that there
|
||||
are restrictions on what characters you can use here. Specifically,
|
||||
alphanumeric characters, `-`, `,`, `.`, and `'` are ok, everything else is
|
||||
out. You also can't use the sequence ` - ` (space, dash, space).
|
||||
* ``title``: The title of the document. The rules for characters are the same
|
||||
here as the correspondent.
|
||||
* ``document``: The file you're uploading
|
||||
|
||||
Specify ``enctype="multipart/form-data"``, and then POST your file with::
|
||||
|
||||
Content-Disposition: form-data; name="document"; filename="whatever.pdf"
|
||||
|
||||
An example of this in HTML is a typical form:
|
||||
|
||||
.. code:: html
|
||||
|
||||
<form method="post" enctype="multipart/form-data">
|
||||
<input type="text" name="correspondent" value="My Correspondent" />
|
||||
<input type="text" name="title" value="My Title" />
|
||||
<input type="file" name="document" />
|
||||
<input type="submit" name="go" value="Do the thing" />
|
||||
</form>
|
||||
|
||||
But a potentially more useful way to do this would be in Python. Here we use
|
||||
the requests library to handle basic authentication and to send the POST data
|
||||
to the URL.
|
||||
|
||||
.. code:: python
|
||||
|
||||
import os
|
||||
|
||||
from hashlib import sha256
|
||||
|
||||
import requests
|
||||
from requests.auth import HTTPBasicAuth
|
||||
|
||||
# You authenticate via BasicAuth or with a session id.
|
||||
# We use BasicAuth here
|
||||
username = "my-username"
|
||||
password = "my-super-secret-password"
|
||||
|
||||
# Where you have Paperless installed and listening
|
||||
url = "http://localhost:8000/push"
|
||||
|
||||
# Document metadata
|
||||
correspondent = "Test Correspondent"
|
||||
title = "Test Title"
|
||||
|
||||
# The local file you want to push
|
||||
path = "/path/to/some/directory/my-document.pdf"
|
||||
|
||||
|
||||
with open(path, "rb") as f:
|
||||
|
||||
response = requests.post(
|
||||
url=url,
|
||||
data={"title": title, "correspondent": correspondent},
|
||||
files={"document": (os.path.basename(path), f, "application/pdf")},
|
||||
auth=HTTPBasicAuth(username, password),
|
||||
allow_redirects=False
|
||||
)
|
||||
|
||||
if response.status_code == 202:
|
||||
|
||||
# Everything worked out ok
|
||||
print("Upload successful")
|
||||
|
||||
else:
|
||||
|
||||
# If you don't get a 202, it's probably because your credentials
|
||||
# are wrong or something. This will give you a rough idea of what
|
||||
# happened.
|
||||
|
||||
print("We got HTTP status code: {}".format(response.status_code))
|
||||
for k, v in response.headers.items():
|
||||
print("{}: {}".format(k, v))
|
@@ -3,6 +3,10 @@
|
||||
Contributing to Paperless
|
||||
#########################
|
||||
|
||||
.. warning::
|
||||
|
||||
This section is not updated to paperless-ng yet.
|
||||
|
||||
Maybe you've been using Paperless for a while and want to add a feature or two,
|
||||
or maybe you've come across a bug that you have some ideas how to solve. The
|
||||
beauty of Free software is that you can see what's wrong and help to get it
|
||||
@@ -81,7 +85,7 @@ quoted, or triple-quoted string will do:
|
||||
problematic_string = 'This is a "string" with "quotes" in it'
|
||||
|
||||
In HTML templates, please use double-quotes for tag attributes, and single
|
||||
quotes for arguments passed to Django tempalte tags:
|
||||
quotes for arguments passed to Django template tags:
|
||||
|
||||
.. code:: html
|
||||
|
||||
|
@@ -1,42 +0,0 @@
|
||||
.. _customising:
|
||||
|
||||
Customising Paperless
|
||||
#####################
|
||||
|
||||
Currently, the Paperless' interface is just the default Django admin, which
|
||||
while powerful, is rather boring. If you'd like to give the site a bit of a
|
||||
face-lift, or if you simply want to adjust the colours, contrast, or font size
|
||||
to make things easier to read, you can do that by adding your own CSS or
|
||||
Javascript quite easily.
|
||||
|
||||
|
||||
.. _customising-overrides:
|
||||
|
||||
Overrides
|
||||
=========
|
||||
|
||||
On every page load, Paperless looks for two files in your media root directory
|
||||
(the directory defined by your ``PAPERLESS_MEDIADIR`` configuration variable or
|
||||
the default, ``<project root>/media/``) for two files:
|
||||
|
||||
* ``overrides.css``
|
||||
* ``overrides.js``
|
||||
|
||||
If it finds either or both of those files, they'll be loaded into the page: the
|
||||
CSS in the ``<head>``, and the Javascript stuffed into the last line of the
|
||||
``<body>``.
|
||||
|
||||
|
||||
.. _customising-overrides-note:
|
||||
|
||||
An important note about customisation
|
||||
-------------------------------------
|
||||
|
||||
Any changes you make to the site with your CSS or Javascript are likely to
|
||||
depend on the structure of the current HTML and/or the existing CSS rules. For
|
||||
the most part it's safe to assume that these bits won't change, but *sometimes
|
||||
they do* as features are added or bugs are fixed.
|
||||
|
||||
If you make a change that you think others would appreciate though, submit it
|
||||
as a pull request and maybe we can find a way to work it into the project by
|
||||
default!
|
@@ -1,158 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
|
||||
# Bash script to install paperless in lxc container
|
||||
# paperless.lan
|
||||
#
|
||||
# Will set-up paperless, apache2 and proftpd
|
||||
#
|
||||
# lxc launch ubuntu: paperless
|
||||
# lxc exec paperless -- sh -c "sudo apt-get update && sudo apt-get install -y wget"
|
||||
# lxc exec paperless -- sh -c "wget https://raw.githubusercontent.com/the-paperless-project/paperless/master/docs/examples/lxc/lxc-install.sh && /bin/bash lxc-install.sh --email "
|
||||
#
|
||||
#
|
||||
set +e
|
||||
PASSWORD=$(< /dev/urandom tr -dc _A-Z-a-z-0-9+@%^{} | head -c20;echo;)
|
||||
EMAIL=
|
||||
|
||||
function displayHelp() {
|
||||
echo "available parameters:
|
||||
-e <email> | --email <email>
|
||||
-p <password> | --password <password>
|
||||
"
|
||||
}
|
||||
|
||||
POSITIONAL=()
|
||||
while [[ $# -gt 0 ]]
|
||||
do
|
||||
key="$1"
|
||||
i=$key
|
||||
|
||||
case $i in
|
||||
-e|--email)
|
||||
EMAIL="${2}"
|
||||
shift
|
||||
shift
|
||||
;;
|
||||
-p|--password)
|
||||
PASSWORD="${2}"
|
||||
shift
|
||||
shift
|
||||
;;
|
||||
--default|-h|--help)
|
||||
shift
|
||||
displayHelp
|
||||
exit 0
|
||||
;;
|
||||
*)
|
||||
echo "argument: $i not recognized"
|
||||
exit 2
|
||||
;;
|
||||
esac
|
||||
done
|
||||
set -- "${POSITIONAL[@]}" # restore positional parameters
|
||||
|
||||
if [ -z $EMAIL ]; then
|
||||
echo "missing email, try running with -h "
|
||||
exit 3
|
||||
fi
|
||||
if [[ $(/usr/bin/id -u) -ne 0 ]]; then
|
||||
echo "Not running as root"
|
||||
exit
|
||||
fi
|
||||
|
||||
if [ $(grep -c paperless /etc/passwd) -eq 0 ]; then
|
||||
# Add paperless user with no password
|
||||
adduser --disabled-password --gecos "" paperless
|
||||
fi
|
||||
|
||||
if [ $(grep -c ftpupload /etc/passwd) -eq 0 ]; then
|
||||
# Add ftpupload
|
||||
adduser --disabled-password --gecos "" ftpupload
|
||||
echo "Set ftpupload password: "
|
||||
#passwd ftpupload
|
||||
#TODO: generate some password and allow parameter
|
||||
echo "ftpupload:ftpuploadpassword" | chpasswd
|
||||
fi
|
||||
|
||||
if [ $(id -nG paperless | grep -Fcw ftpupload) -eq 0 ]; then
|
||||
# Allow paperless group to access
|
||||
adduser paperless ftpupload
|
||||
chmod g+w /home/ftpupload
|
||||
fi
|
||||
|
||||
# Get apt up to date
|
||||
apt-get update
|
||||
|
||||
# Needed for plain Paperless
|
||||
apt-get -y install unpaper gnupg libpoppler-cpp-dev python3-pyocr tesseract-ocr imagemagick optipng git
|
||||
|
||||
# Needed for Apache
|
||||
apt-get -y install apache2 libapache2-mod-wsgi-py3
|
||||
|
||||
if [ ! -f /etc/proftpd/proftpd.conf ]; then
|
||||
# Install ftp server and make sure all uploaded files are owned by paperless
|
||||
apt-get -y install proftpd
|
||||
fi
|
||||
if [ $(grep -c paperless /etc/proftpd/proftpd.conf) -eq 0 ]; then
|
||||
cat <<EOF >> /etc/proftpd/proftpd.conf
|
||||
<Directory /home/ftpupload/>
|
||||
UserOwner paperless
|
||||
GroupOwner paperless
|
||||
</Directory>
|
||||
EOF
|
||||
systemctl restart proftpd
|
||||
fi
|
||||
|
||||
#Get Paperless from git
|
||||
su -c "cd /home/paperless ; git clone https://github.com/the-paperless-project/paperless" paperless
|
||||
|
||||
# Install Pip Requirements
|
||||
apt-get -y install python3-pip python3-venv
|
||||
cd /home/paperless/paperless
|
||||
pip3 install -r requirements.txt
|
||||
|
||||
# Take paperless.conf.example and set consumption dir (ftp dir)
|
||||
sed -e '/PAPERLESS_CONSUMPTION_DIR=/s/=.*/=\"\/home\/ftpupload\/\"/' \
|
||||
/home/paperless/paperless/paperless.conf.example >/etc/paperless.conf
|
||||
|
||||
# Update /etc/paperless.conf with PAPERLESS_SECRET_KEY
|
||||
SECRET=$(strings /dev/urandom | grep -o '[[:alnum:]]' | head -n 30 | tr -d '\n'; echo)
|
||||
sed -i "s/#PAPERLESS_SECRET_KEY.*/PAPERLESS_SECRET_KEY=$SECRET/" /etc/paperless.conf
|
||||
|
||||
#Initialise the SQLite database
|
||||
su -c "cd /home/paperless/paperless/src/ ; ./manage.py migrate" paperless
|
||||
echo "if superuser doesn't exist, create one with login: paperless and password: ${PASSWORD}"
|
||||
#Create a user for your Paperless instance
|
||||
su -c "cd /home/paperless/paperless/src/ ; echo ./manage.py create_superuser_with_password --username paperless --email ${EMAIL} --password ${PASSWORD} --preserve" paperless
|
||||
su -c "cd /home/paperless/paperless/src/ ; ./manage.py create_superuser_with_password --username paperless --email ${EMAIL} --password ${PASSWORD} --preserve" paperless
|
||||
|
||||
if [ ! -d /home/paperless/paperless/static ]; then
|
||||
# 167 static files copied to '/home/paperless/paperless/static'.
|
||||
su -c "cd /home/paperless/paperless/src/ ; ./manage.py collectstatic" paperless
|
||||
fi
|
||||
|
||||
if [ ! -f /etc/apache2/sites-available/paperless.conf ]; then
|
||||
# Set-up apache
|
||||
cp /home/paperless/paperless/docs/examples/lxc/paperless.conf /etc/apache2/sites-available/
|
||||
a2dissite 000-default.conf
|
||||
a2ensite paperless.conf
|
||||
systemctl reload apache2
|
||||
fi
|
||||
|
||||
sed -e "s:home/paperless/project/virtualenv/bin/python:usr/bin/python3:" \
|
||||
/home/paperless/paperless/scripts/paperless-consumer.service \
|
||||
>/etc/systemd/system/paperless-consumer.service
|
||||
|
||||
sed -i "s:/home/paperless/project/src/manage.py:/home/paperless/paperless/src/manage.py:" \
|
||||
/etc/systemd/system/paperless-consumer.service
|
||||
|
||||
|
||||
systemctl enable paperless-consumer
|
||||
systemctl start paperless-consumer
|
||||
|
||||
# convert-im6.q16: not authorized
|
||||
# Security risk ?
|
||||
# https://stackoverflow.com/questions/42928765/convertnot-authorized-aaaa-error-constitute-c-readimage-453
|
||||
if [ -f /etc/ImageMagick-6/policy.xml ]; then
|
||||
mv /etc/ImageMagick-6/policy.xml /etc/ImageMagick-6/policy.xmlout
|
||||
fi
|
@@ -1,18 +0,0 @@
|
||||
<VirtualHost *:80>
|
||||
ServerName paperless.lan
|
||||
|
||||
Alias /static/ /home/paperless/paperless/static/
|
||||
<Directory /home/paperless/paperless/static>
|
||||
Require all granted
|
||||
</Directory>
|
||||
|
||||
WSGIScriptAlias / /home/paperless/paperless/src/paperless/wsgi.py
|
||||
WSGIDaemonProcess paperless.lan user=paperless group=paperless threads=5 python-path=/home/paperless/paperless/src
|
||||
WSGIProcessGroup paperless.lan
|
||||
|
||||
<Directory /home/paperless/paperless/src/paperless>
|
||||
<Files wsgi.py>
|
||||
Require all granted
|
||||
</Files>
|
||||
</Directory>
|
||||
</VirtualHost>
|
@@ -1,112 +1,197 @@
|
||||
.. _extending:
|
||||
|
||||
Paperless development
|
||||
#####################
|
||||
|
||||
This section describes the steps you need to take to start development on paperless-ng.
|
||||
|
||||
1. Check out the source from github. The repository is organized in the following way:
|
||||
|
||||
* ``master`` always represents the latest release and will only see changes
|
||||
when a new release is made.
|
||||
* ``dev`` contains the code that will be in the next release.
|
||||
* ``feature-X`` contain bigger changes that will be in some release, but not
|
||||
necessarily the next one.
|
||||
|
||||
Apart from that, the folder structure is as follows:
|
||||
|
||||
* ``docs/`` - Documentation.
|
||||
* ``src-ui/`` - Code of the front end.
|
||||
* ``src/`` - Code of the back end.
|
||||
* ``scripts/`` - Various scripts that help with different parts of development.
|
||||
* ``docker/`` - Files required to build the docker image.
|
||||
|
||||
2. Install some dependencies.
|
||||
|
||||
* Python 3.6.
|
||||
* All dependencies listed in the :ref:`Bare metal route <setup-bare_metal>`
|
||||
* redis. You can either install redis or use the included ``scripts/start-redis.sh``
|
||||
to use docker to fire up a redis instance.
|
||||
|
||||
Back end development
|
||||
====================
|
||||
|
||||
The backend is a django application. I use PyCharm for development, but you can use whatever
|
||||
you want.
|
||||
|
||||
Install the python dependencies by performing ``pipenv install --dev`` in the src/ directory.
|
||||
This will also create a virtual environment, which you can enter with ``pipenv shell`` or
|
||||
execute one-shot commands in with ``pipenv run``.
|
||||
|
||||
In ``src/paperless.conf``, enable debug mode.
|
||||
|
||||
Configure the IDE to use the src/ folder as the base source folder. Configure the following
|
||||
launch configurations in your IDE:
|
||||
|
||||
* python3 manage.py runserver
|
||||
* python3 manage.py qcluster
|
||||
* python3 manage.py consumer
|
||||
|
||||
Depending on which part of paperless you're developing for, you need to have some or all of
|
||||
them running.
|
||||
|
||||
Testing and code style:
|
||||
|
||||
* Run ``pytest`` in the src/ directory to execute all tests. This also generates an HTML coverage
|
||||
report. When running tests, ``paperless.conf`` is loaded as well. However, the tests rely on the default
|
||||
configuration. This is not ideal. But for now, make sure no settings except for DEBUG are overridden when testing.
|
||||
* Run ``pycodestyle`` to test your code for issues with the configured code style settings.
|
||||
|
||||
.. note::
|
||||
|
||||
The line length rule E501 is generally useful for getting multiple source files
|
||||
next to each other on the screen. However, in some cases, it's just not possible
|
||||
to make some lines fit, especially complicated IF cases. Append `` # NOQA: E501``
|
||||
to disable this check for certain lines.
|
||||
|
||||
Front end development
|
||||
=====================
|
||||
|
||||
The front end is built using Angular. I use the ``Code - OSS`` IDE for development.
|
||||
|
||||
In order to get started, you need ``npm``. Install the Angular CLI with
|
||||
|
||||
.. code:: shell-session
|
||||
|
||||
$ npm install -g @angular/cli
|
||||
|
||||
and make sure that it's on your path. Next, in the src-ui/ directory, install the
|
||||
required dependencies of the project.
|
||||
|
||||
.. code:: shell-session
|
||||
|
||||
$ npm install
|
||||
|
||||
You can launch a development server by running
|
||||
|
||||
.. code:: shell-session
|
||||
|
||||
$ ng serve
|
||||
|
||||
This will automatically update whenever you save. However, in-place compilation might fail
|
||||
on syntax errors, in which case you need to restart it.
|
||||
|
||||
By default, the development server is available on ``http://localhost:4200/`` and is configured
|
||||
to access the API at ``http://localhost:8000/api/``, which is the default of the backend.
|
||||
If you enabled DEBUG on the back end, several security overrides for allowed hosts, CORS and
|
||||
X-Frame-Options are in place so that the front end behaves exactly as in production. This also
|
||||
relies on you being logged into the back end. Without a valid session, the front end will simply
|
||||
not work.
|
||||
|
||||
In order to build the front end and serve it as part of django, execute
|
||||
|
||||
.. code:: shell-session
|
||||
|
||||
$ ng build --prod --output-path ../src/documents/static/frontend/
|
||||
|
||||
This will build the front end and put it in a location from which the Django server will serve
|
||||
it as static content. This way, you can verify that authentication is working.
|
||||
|
||||
Making a release
|
||||
================
|
||||
|
||||
Execute the ``make-release.sh <ver>`` script.
|
||||
|
||||
This will test and assemble everything and also build and tag a docker image.
|
||||
|
||||
|
||||
Extending Paperless
|
||||
===================
|
||||
|
||||
For the most part, Paperless is monolithic, so extending it is often best
|
||||
managed by way of modifying the code directly and issuing a pull request on
|
||||
`GitHub`_. However, over time the project has been evolving to be a little
|
||||
more "pluggable" so that users can write their own stuff that talks to it.
|
||||
Paperless does not have any fancy plugin systems and probably never will. However,
|
||||
some parts of the application have been designed to allow easy integration of additional
|
||||
features without any modification to the base code.
|
||||
|
||||
.. _GitHub: https://github.com/the-paperless-project/paperless
|
||||
Making custom parsers
|
||||
---------------------
|
||||
|
||||
Paperless uses parsers to add documents to paperless. A parser is responsible for:
|
||||
|
||||
.. _extending-parsers:
|
||||
* Retrieve the content from the original
|
||||
* Create a thumbnail
|
||||
* Optional: Retrieve a created date from the original
|
||||
* Optional: Create an archived document from the original
|
||||
|
||||
Parsers
|
||||
-------
|
||||
Custom parsers can be added to paperless to support more file types. In order to do that,
|
||||
you need to write the parser itself and announce its existence to paperless.
|
||||
|
||||
You can leverage Paperless' consumption model to have it consume files *other*
|
||||
than ones handled by default like ``.pdf``, ``.jpg``, and ``.tiff``. To do so,
|
||||
you simply follow Django's convention of creating a new app, with a few key
|
||||
requirements.
|
||||
|
||||
|
||||
.. _extending-parsers-parserspy:
|
||||
|
||||
parsers.py
|
||||
..........
|
||||
|
||||
In this file, you create a class that extends
|
||||
``documents.parsers.DocumentParser`` and go about implementing the three
|
||||
required methods:
|
||||
|
||||
* ``get_thumbnail()``: Returns the path to a file we can use as a thumbnail for
|
||||
this document.
|
||||
* ``get_text()``: Returns the text from the document and only the text.
|
||||
* ``get_date()``: If possible, this returns the date of the document, otherwise
|
||||
it should return ``None``.
|
||||
|
||||
|
||||
.. _extending-parsers-signalspy:
|
||||
|
||||
signals.py
|
||||
..........
|
||||
|
||||
At consumption time, Paperless emits a ``document_consumer_declaration``
|
||||
signal which your module has to react to in order to let the consumer know
|
||||
whether or not it's capable of handling a particular file. Think of it like
|
||||
this:
|
||||
|
||||
1. Consumer finds a file in the consumption directory.
|
||||
2. It asks all the available parsers: *"Hey, can you handle this file?"*
|
||||
3. Each parser responds with either ``None`` meaning they can't handle the
|
||||
file, or a dictionary in the following format:
|
||||
The parser itself must extend ``documents.parsers.DocumentParser`` and must implement the
|
||||
methods ``parse`` and ``get_thumbnail``. You can provide your own implementation of
|
||||
``get_date`` if you don't want to rely on paperless' default date guessing mechanisms.
|
||||
|
||||
.. code:: python
|
||||
|
||||
{
|
||||
"parser": <the class name>,
|
||||
"weight": <an integer>
|
||||
}
|
||||
class MyCustomParser(DocumentParser):
|
||||
|
||||
The consumer compares the ``weight`` values from all respondents and uses the
|
||||
class with the highest value to consume the document. The default parser,
|
||||
``RasterisedDocumentParser`` has a weight of ``0``.
|
||||
def parse(self, document_path, mime_type):
|
||||
# This method does not return anything. Rather, you should assign
|
||||
# whatever you got from the document to the following fields:
|
||||
|
||||
# The content of the document.
|
||||
self.text = "content"
|
||||
|
||||
# Optional: path to a PDF document that you created from the original.
|
||||
self.archive_path = os.path.join(self.tempdir, "archived.pdf")
|
||||
|
||||
.. _extending-parsers-appspy:
|
||||
# Optional: "created" date of the document.
|
||||
self.date = get_created_from_metadata(document_path)
|
||||
|
||||
apps.py
|
||||
.......
|
||||
def get_thumbnail(self, document_path, mime_type):
|
||||
# This should return the path to a thumbnail you created for this
|
||||
# document.
|
||||
return os.path.join(self.tempdir, "thumb.png")
|
||||
|
||||
This is a standard Django file, but you'll need to add some code to it to
|
||||
connect your parser to the ``document_consumer_declaration`` signal.
|
||||
If you encounter any issues during parsing, raise a ``documents.parsers.ParseError``.
|
||||
|
||||
The ``self.tempdir`` directory is a temporary directory that is guaranteed to be empty
|
||||
and removed after consumption has finished. You can use that directory to store any
|
||||
intermediate files and also use it to store the thumbnail / archived document.
|
||||
|
||||
.. _extending-parsers-finally:
|
||||
|
||||
Finally
|
||||
.......
|
||||
|
||||
The last step is to update ``settings.py`` to include your new module.
|
||||
Eventually, this will be dynamic, but at the moment, you have to edit the
|
||||
``INSTALLED_APPS`` section manually. Simply add the path to your AppConfig to
|
||||
the list like this:
|
||||
After that, you need to announce your parser to paperless. You need to connect a
|
||||
handler to the ``document_consumer_declaration`` signal. Have a look in the file
|
||||
``src/paperless_tesseract/apps.py`` on how that's done. The handler is a method
|
||||
that returns information about your parser:
|
||||
|
||||
.. code:: python
|
||||
|
||||
INSTALLED_APPS = [
|
||||
...
|
||||
"my_module.apps.MyModuleConfig",
|
||||
...
|
||||
]
|
||||
def myparser_consumer_declaration(sender, **kwargs):
|
||||
return {
|
||||
"parser": MyCustomParser,
|
||||
"weight": 0,
|
||||
"mime_types": {
|
||||
"application/pdf": ".pdf",
|
||||
"image/jpeg": ".jpg",
|
||||
}
|
||||
}
|
||||
|
||||
Order doesn't matter, but generally it's a good idea to place your module lower
|
||||
in the list so that you don't end up accidentally overriding project defaults
|
||||
somewhere.
|
||||
* ``parser`` is a reference to a class that extends ``DocumentParser``.
|
||||
|
||||
* ``weight`` is used whenever two or more parsers are able to parse a file: The parser with
|
||||
the higher weight wins. This can be used to override the parsers provided by
|
||||
paperless.
|
||||
|
||||
.. _extending-parsers-example:
|
||||
|
||||
An Example
|
||||
..........
|
||||
|
||||
The core Paperless functionality is based on this design, so if you want to see
|
||||
what a parser module should look like, have a look at `parsers.py`_,
|
||||
`signals.py`_, and `apps.py`_ in the `paperless_tesseract`_ module.
|
||||
|
||||
.. _parsers.py: https://github.com/the-paperless-project/paperless/blob/master/src/paperless_tesseract/parsers.py
|
||||
.. _signals.py: https://github.com/the-paperless-project/paperless/blob/master/src/paperless_tesseract/signals.py
|
||||
.. _apps.py: https://github.com/the-paperless-project/paperless/blob/master/src/paperless_tesseract/apps.py
|
||||
.. _paperless_tesseract: https://github.com/the-paperless-project/paperless/blob/master/src/paperless_tesseract/
|
||||
* ``mime_types`` is a dictionary. The keys are the mime types your parser supports and the value
|
||||
is the default file extension that paperless should use when storing files and serving them for
|
||||
download. We could guess that from the file extensions, but some mime types have many extensions
|
||||
associated with them and the python methods responsible for guessing the extension do not always
|
||||
return the same value.
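As a rough illustration of how ``weight`` and ``mime_types`` interact, the sketch below picks a parser from a list of declarations. This is not paperless' actual consumer code: ``pick_parser``, ``DefaultParser`` and ``MyCustomParser`` are hypothetical names, and keeping the first declaration when weights are equal is an assumption.

```python
class DefaultParser:
    """Stand-in for a parser shipped with paperless (hypothetical)."""


class MyCustomParser:
    """Stand-in for your own parser (hypothetical)."""


def pick_parser(declarations, mime_type):
    """Return the parser class with the highest weight that supports mime_type."""
    best = None
    for declaration in declarations:
        # A handler may decline by returning None.
        if declaration is None:
            continue
        # Skip parsers that do not announce support for this mime type.
        if mime_type not in declaration["mime_types"]:
            continue
        # Strictly greater: on a tie, the earlier declaration is kept (assumption).
        if best is None or declaration["weight"] > best["weight"]:
            best = declaration
    return best["parser"] if best is not None else None


declarations = [
    {
        "parser": DefaultParser,
        "weight": 0,
        "mime_types": {"application/pdf": ".pdf", "image/jpeg": ".jpg"},
    },
    {
        "parser": MyCustomParser,
        "weight": 10,
        "mime_types": {"application/pdf": ".pdf"},
    },
]

print(pick_parser(declarations, "application/pdf").__name__)
print(pick_parser(declarations, "image/jpeg").__name__)
```

Under these assumptions, the custom parser takes over PDF files because of its higher weight, while JPEG files still go to the default parser.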
|
||||
|
106
docs/faq.rst
Normal file
@@ -0,0 +1,106 @@
|
||||
|
||||
**************************
|
||||
Frequently asked questions
|
||||
**************************
|
||||
|
||||
**Q:** *What's the general plan for Paperless-ng?*
|
||||
|
||||
**A:** Paperless-ng is already almost feature-complete. This project will remain
|
||||
as simple as it is right now. It will see improvements to features that are already there.
|
||||
If you need advanced features such as document versions,
|
||||
workflows or multi-user with customizable access to individual files, this is
|
||||
not the tool for you.
|
||||
|
||||
Features that *are* planned are some more quality of life extensions for the searching
|
||||
(e.g., search for similar documents, group results by correspondents with "more from this"
|
||||
links, etc), bulk editing and hierarchical tags.
|
||||
|
||||
**Q:** *I'm using docker. Where are my documents?*
|
||||
|
||||
**A:** Your documents are stored inside the docker volume ``paperless_media``.
|
||||
Docker manages this volume automatically for you. It is a persistent storage
|
||||
and will persist as long as you don't explicitly delete it. The actual location
|
||||
depends on your host operating system. On Linux, chances are high that this location
|
||||
is
|
||||
|
||||
.. code::
|
||||
|
||||
/var/lib/docker/volumes/paperless_media/_data
|
||||
|
||||
.. caution::
|
||||
|
||||
Do not mess with this folder. Don't change permissions and don't move
|
||||
files around manually. This folder is meant to be entirely managed by docker
|
||||
and paperless.
|
||||
|
||||
**Q:** *Let's say you don't support this project anymore in a year. Can I easily move to other systems?*
|
||||
|
||||
**A:** Your documents are stored as plain files inside the media folder. You can always drag those files
|
||||
out of that folder to use them elsewhere. Here are a couple notes about that.
|
||||
|
||||
* Paperless never modifies your original documents. It keeps checksums of all documents and uses a
|
||||
scheduled sanity checker to check that they remain the same.
|
||||
* By default, paperless uses the internal ID of each document as its filename. This might not be very
|
||||
convenient for export. However, you can adjust the way files are stored in paperless by
|
||||
:ref:`configuring the filename format <advanced-file_name_handling>`.
|
||||
* :ref:`The exporter <utilities-exporter>` is another easy way to get your files out of paperless with reasonable file names.
|
||||
|
||||
**Q:** *What file types does paperless-ng support?*
|
||||
|
||||
**A:** Currently, the following files are supported:
|
||||
|
||||
* PDF documents, PNG images, JPEG images, TIFF images and GIF images are processed with OCR and converted into PDF documents.
|
||||
* Plain text documents are supported as well and are added verbatim
|
||||
to paperless.
|
||||
|
||||
Paperless determines the type of a file by inspecting its content. The
|
||||
file extensions do not matter.
|
||||
|
||||
**Q:** *Will paperless-ng run on Raspberry Pi?*
|
||||
|
||||
**A:** The short answer is yes. I've tested it on a Raspberry Pi 3 B.
|
||||
The long answer is that certain parts of
|
||||
Paperless will run very slowly, such as the tesseract OCR. On Raspberry Pi,
|
||||
try to OCR documents before feeding them into paperless so that paperless can
|
||||
reuse the text. The web interface should be a lot snappier, since it runs
|
||||
in your browser and paperless has to do much less work to serve the data.
|
||||
|
||||
.. note::
|
||||
|
||||
You can adjust some of the settings so that paperless uses less processing
|
||||
power. See :ref:`setup-less_powerful_devices` for details.
|
||||
|
||||
|
||||
**Q:** *How do I install paperless-ng on Raspberry Pi?*
|
||||
|
||||
**A:** There is no docker image for ARM available. If you know how to build
|
||||
that automatically, I'm all ears. For now, you have to grab the latest release
|
||||
archive from the project page and build the image yourself. The release comes
|
||||
with the front end already compiled, so you don't have to do this on the Pi.
|
||||
|
||||
**Q:** *How do I run this on unRaid?*
|
||||
|
||||
**A:** Head over to `<https://github.com/selfhosters/unRAID-CA-templates>`_,
|
||||
`Uli Fahrer <https://github.com/Tooa>`_ created a container template for that.
|
||||
I don't exactly know how to use that though, since I don't use unRaid.
|
||||
|
||||
**Q:** *How do I run this on my toaster?*
|
||||
|
||||
**A:** I honestly don't know! As for all other devices that might be able
|
||||
to run paperless, you're a bit on your own. If you can't run the docker image,
|
||||
the documentation has instructions for bare metal installs. I'm running
|
||||
paperless on an i3 processor from 2015 or so. This is also what I use to test
|
||||
new releases with. Apart from that, I also have a Raspberry Pi, which I
|
||||
occasionally build the image on and see if it works.
|
||||
|
||||
**Q:** *How do I proxy this with NGINX?*
|
||||
|
||||
.. code::
|
||||
|
||||
location / {
|
||||
proxy_pass http://localhost:8000/;
|
||||
}
|
||||
|
||||
And that's about it. Paperless serves everything, including static files by itself
|
||||
when running the docker image. If you want to do anything fancy, you have to
|
||||
install paperless bare metal.
|
@@ -1,131 +0,0 @@
|
||||
.. _guesswork:
|
||||
|
||||
Guesswork
|
||||
#########
|
||||
|
||||
During the consumption process, Paperless tries to guess some of the attributes
|
||||
of the document it's looking at. To do this it uses two approaches:
|
||||
|
||||
|
||||
.. _guesswork-naming:
|
||||
|
||||
File Naming
|
||||
===========
|
||||
|
||||
Any document you put into the consumption directory will be consumed, but if
|
||||
you name the file right, it'll automatically set some values in the database
|
||||
for you. This is the logic the consumer follows:
|
||||
|
||||
1. Try to find the correspondent, title, and tags in the file name following
|
||||
the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``. Note that
|
||||
the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or
|
||||
``YYYYMMDDZ``. The ``Z`` refers to "Zulu time" AKA "UTC".
|
||||
The tags are optional, so the format ``Date - Correspondent - Title.pdf``
|
||||
works as well.
|
||||
2. If that doesn't work, we skip the date and try this pattern:
|
||||
``Correspondent - Title - tag,tag,tag.pdf``.
|
||||
3. If that doesn't work, we try to find the correspondent and title in the file
|
||||
name following the pattern: ``Correspondent - Title.pdf``.
|
||||
4. If that doesn't work, just assume that the name of the file is the title.
|
||||
|
||||
So given the above, the following examples would work as you'd expect:
|
||||
|
||||
* ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
|
||||
* ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
|
||||
* ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
|
||||
* ``Another Company - Letter of Reference.jpg``
|
||||
* ``Dad's Recipe for Pancakes.png``
|
||||
|
||||
These however wouldn't work:
|
||||
|
||||
* ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
|
||||
* ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
|
||||
* ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
|
||||
* ``Another Company- Letter of Reference.jpg``
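The fallback order described above can be sketched in a few lines of Python. This is only an approximation for illustration, not the consumer's real implementation: ``guess_attributes`` is a hypothetical helper, and splitting on the sequence space, dash, space is an assumption based on the naming rules in this section.

```python
import os
import re

# The date is rigidly defined as YYYYMMDDHHMMSSZ or YYYYMMDDZ.
DATE_RE = re.compile(r"^(\d{14}|\d{8})Z$")


def guess_attributes(filename):
    """Guess date, correspondent, title and tags from a file name."""
    stem = os.path.splitext(filename)[0]
    parts = stem.split(" - ")
    # Step 4 (default): the whole name is the title.
    result = {"date": None, "correspondent": None, "title": stem, "tags": []}

    if len(parts) in (3, 4) and DATE_RE.match(parts[0]):
        # Step 1: Date - Correspondent - Title [- tag,tag,tag]
        result["date"] = parts[0]
        result["correspondent"] = parts[1]
        result["title"] = parts[2]
        if len(parts) == 4:
            result["tags"] = parts[3].split(",")
    elif len(parts) == 3:
        # Step 2: Correspondent - Title - tag,tag,tag
        result["correspondent"] = parts[0]
        result["title"] = parts[1]
        result["tags"] = parts[2].split(",")
    elif len(parts) == 2:
        # Step 3: Correspondent - Title
        result["correspondent"] = parts[0]
        result["title"] = parts[1]
    return result


print(guess_attributes("20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf"))
print(guess_attributes("Another Company- Letter of Reference.jpg"))
```

In this sketch, the second file name falls through to the last step because it lacks the space before the dash, so the whole name becomes the title.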
|
||||
|
||||
Do I have to be so strict about naming?
|
||||
---------------------------------------
|
||||
Rather than using the strict document naming rules, one can also set the option
|
||||
``PAPERLESS_FILENAME_DATE_ORDER`` in ``paperless.conf`` to any date order
|
||||
that is accepted by dateparser_. Doing so will cause ``paperless`` to default
|
||||
to any date format that is found in the title, instead of a date pulled from
|
||||
the document's text, without requiring the strict formatting of the document
|
||||
filename as described above.
|
||||
|
||||
.. _dateparser: https://github.com/scrapinghub/dateparser/blob/v0.7.0/docs/usage.rst#settings
|
||||
|
||||
Transforming filenames for parsing
|
||||
----------------------------------
|
||||
Some devices can't produce filenames that can be parsed by the default
|
||||
parser. By configuring the option ``PAPERLESS_FILENAME_PARSE_TRANSFORMS`` in
|
||||
``paperless.conf`` one can add transformations that are applied to the filename
|
||||
before it's parsed.
|
||||
|
||||
The option contains a list of dictionaries of regular expressions (key:
|
||||
``pattern``) and replacements (key: ``repl``) in JSON format, which are
|
||||
applied in order by passing them to ``re.subn``. Transformation stops
|
||||
after the first match, so at most one transformation is applied. The general
|
||||
syntax is
|
||||
|
||||
.. code:: python
|
||||
|
||||
[{"pattern":"pattern1", "repl":"repl1"}, {"pattern":"pattern2", "repl":"repl2"}, ..., {"pattern":"patternN", "repl":"replN"}]
|
||||
|
||||
The example below is for a Brother ADS-2400N, a scanner that allows
|
||||
assigning different names to different hardware buttons (useful for handling
|
||||
multiple entities in one instance), but insists on adding ``_<count>``
|
||||
to the filename.
|
||||
|
||||
.. code:: python
|
||||
|
||||
# Brother profile configuration, support "Name_Date_Count" (the default
|
||||
# setting) and "Name_Count" (use "Name" as tag and "Count" as title).
|
||||
PAPERLESS_FILENAME_PARSE_TRANSFORMS=[{"pattern":"^([a-z]+)_(\\d{8})_(\\d{6})_([0-9]+)\\.", "repl":"\\2\\3Z - \\4 - \\1."}, {"pattern":"^([a-z]+)_([0-9]+)\\.", "repl":" - \\2 - \\1."}]
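
The following sketch shows how such a list could be applied, assuming the behaviour described above: each entry is passed to ``re.subn`` in order and processing stops at the first pattern that matches. The ``transform_filename`` helper and the sample filenames are illustrative, not the actual paperless code.

.. code:: python

    import json
    import re

    # The Brother example from above, kept as a raw string so that the
    # JSON escapes (\\d, \\.) survive unchanged until json.loads runs.
    RAW = (
        r'[{"pattern":"^([a-z]+)_(\\d{8})_(\\d{6})_([0-9]+)\\.", '
        r'"repl":"\\2\\3Z - \\4 - \\1."}, '
        r'{"pattern":"^([a-z]+)_([0-9]+)\\.", "repl":" - \\2 - \\1."}]'
    )
    TRANSFORMS = json.loads(RAW)


    def transform_filename(filename, transforms=TRANSFORMS):
        """Apply the first matching transformation, or return the name as is."""
        for transform in transforms:
            new_name, count = re.subn(
                transform["pattern"], transform["repl"], filename
            )
            if count > 0:
                # Stop after the first match: at most one transformation.
                return new_name
        return filename


    print(transform_filename("invoices_20190115_134500_7.pdf"))
    print(transform_filename("invoices_7.pdf"))

A filename that matches neither pattern is returned unchanged and goes to the default parser as before.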
|
||||
|
||||
.. _guesswork-content:
|
||||
|
||||
Reading the Document Contents
|
||||
=============================
|
||||
|
||||
After the consumer has tried to figure out what it could from the file name,
|
||||
it starts looking at the content of the document itself. It will compare the
|
||||
matching algorithms defined by every tag and correspondent already set in your
|
||||
database to see if they apply to the text in that document. In other words,
|
||||
if you defined a tag called ``Home Utility`` that had a ``match`` property of
|
||||
``bc hydro`` and a ``matching_algorithm`` of ``literal``, Paperless will
|
||||
automatically tag your newly-consumed document with your ``Home Utility`` tag
|
||||
so long as the text ``bc hydro`` appears in the body of the document somewhere.
|
||||
|
||||
The matching logic is quite powerful, and supports searching the text of your
|
||||
document with different algorithms, and as such, some experimentation may be
|
||||
necessary to get things Just Right.
|
||||
|
||||
|
||||
.. _guesswork-content-howto:
|
||||
|
||||
How Do I Set Up These Matching Algorithms?
|
||||
------------------------------------------
|
||||
|
||||
Setting up of the algorithms is easily done through the admin interface. When
|
||||
you create a new correspondent or tag, there are optional fields for matching
|
||||
text and matching algorithm. From the help info there:
|
||||
|
||||
.. note::
|
||||
|
||||
Which algorithm you want to use when matching text to the OCR'd PDF. Here,
|
||||
"any" looks for any occurrence of any word provided in the PDF, while "all"
|
||||
requires that every word provided appear in the PDF, albeit not in the
|
||||
order provided. A "literal" match means that the text you enter must
|
||||
appear in the PDF exactly as you've entered it, and "regular expression"
|
||||
uses a regex to match the PDF. If you don't know what a regex is, you
|
||||
probably don't want this option.
|
||||
|
||||
When using the "any" or "all" matching algorithms, you can search for terms
|
||||
that consist of multiple words by enclosing them in double quotes. For example,
|
||||
defining a match text of ``"Bank of America" BofA`` using the "any" algorithm,
|
||||
will match documents that contain either "Bank of America" or "BofA", but will
|
||||
not match documents containing "Bank of South America".
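
The four algorithms can be sketched as follows. This is only an approximation of the behaviour described in the help text: the ``matches`` function, the case-insensitive comparison and the word-boundary handling are assumptions, not the real implementation.

.. code:: python

    import re
    import shlex


    def matches(match_text, algorithm, document_text):
        """Return True if document_text satisfies match_text."""
        text = document_text.lower()
        if algorithm == "literal":
            return match_text.lower() in text
        if algorithm == "regex":
            return re.search(match_text, document_text, re.IGNORECASE) is not None
        # shlex.split keeps double-quoted phrases together as one term.
        terms = shlex.split(match_text.lower())
        found = [
            re.search(r"\b" + re.escape(term) + r"\b", text) is not None
            for term in terms
        ]
        if algorithm == "any":
            return any(found)
        if algorithm == "all":
            return all(found)
        raise ValueError("unknown algorithm: " + algorithm)


    rule = '"Bank of America" BofA'
    print(matches(rule, "any", "Statement from Bank of America"))
    print(matches(rule, "any", "Bank of South America"))

With the "all" algorithm the same rule would require both the phrase and the abbreviation to appear in the document.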
|
||||
|
||||
Then just save your tag/correspondent and run another document through the
|
||||
consumer. Once complete, you should see the newly-created document,
|
||||
automatically tagged with the appropriate data.
|
@@ -1,17 +1,14 @@
|
||||
.. _index:
|
||||
|
||||
*********
|
||||
Paperless
|
||||
=========
|
||||
*********
|
||||
|
||||
Paperless is a simple Django application running in two parts:
|
||||
a :ref:`consumer <utilities-consumer>` (the thing that does the indexing) and
|
||||
the :ref:`webserver <utilities-webserver>` (the part that lets you search &
|
||||
a *Consumer* (the thing that does the indexing) and
|
||||
the *Web server* (the part that lets you search &
|
||||
download already-indexed documents). If you want to learn more about its
|
||||
functions keep on reading after the installation section.
|
||||
|
||||
|
||||
.. _index-why-this-exists:
|
||||
|
||||
Why This Exists
|
||||
===============
|
||||
|
||||
@@ -25,22 +22,54 @@ finding stuff again. I feed documents right from the post box into the scanner
|
||||
and then shred them. Perhaps you might find it useful too.
|
||||
|
||||
|
||||
Paperless-ng
|
||||
============
|
||||
|
||||
Paperless-ng is a fork of the original paperless project. It changes many
|
||||
things both on the surface and under the hood. Paperless-ng was created
|
||||
because I feel that these changes are too big to be pushed into the main
|
||||
repository right away.
|
||||
|
||||
NG stands for both Angular (the framework used for the
|
||||
Frontend) and next-gen. Publishing this project under a different name also
|
||||
avoids confusion between paperless and paperless-ng.
|
||||
|
||||
If you want to learn about what's different in paperless-ng, check out these
|
||||
resources in the documentation:
|
||||
|
||||
* :ref:`Some screenshots <screenshots>` of the new UI are available.
|
||||
* Read :ref:`this section <advanced-automatic_matching>` if you want to
|
||||
learn about how paperless automates all tagging using machine learning.
|
||||
* Paperless now comes with a :ref:`proper email consumer <usage-email>`
|
||||
that's fully tested and production ready.
|
||||
* Paperless creates searchable PDF/A documents from whatever you put into
|
||||
the consumption directory. This means that you can select text in
|
||||
image-only documents coming from your scanner.
|
||||
* See :ref:`this note <utilities-encyption>` about GnuPG encryption in
|
||||
paperless-ng.
|
||||
* Paperless is now integrated with a
|
||||
:ref:`task processing queue <setup-task_processor>` that tells you
|
||||
at a glance when and why something is not working.
|
||||
* The :ref:`changelog <paperless_changelog>` contains a detailed list of all changes
|
||||
in paperless-ng.
|
||||
|
||||
It would be great if this project could eventually merge back into the main
|
||||
repository, but it needs a lot more work before that can happen.
|
||||
|
||||
|
||||
Contents
|
||||
========
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
:maxdepth: 1
|
||||
|
||||
requirements
|
||||
setup
|
||||
consumption
|
||||
usage_overview
|
||||
advanced_usage
|
||||
administration
|
||||
configuration
|
||||
api
|
||||
utilities
|
||||
guesswork
|
||||
migrating
|
||||
customising
|
||||
faq
|
||||
extending
|
||||
troubleshooting
|
||||
contributing
|
||||
|
@@ -1,109 +0,0 @@
|
||||
.. _migrating:
|
||||
|
||||
Migrating, Updates, and Backups
|
||||
===============================
|
||||
|
||||
As Paperless is still under active development, there's a lot that can change
|
||||
as software updates roll out. You should back up often, so if anything goes
|
||||
wrong during an update, you at least have a means of restoring to something
|
||||
usable. Thankfully, there are automated ways of backing up, restoring, and
|
||||
updating the software.
|
||||
|
||||
|
||||
.. _migrating-backup:
|
||||
|
||||
Backing Up
|
||||
----------
|
||||
|
||||
So you're bored of this whole project, or you want to make a remote backup of
|
||||
your files for whatever reason. This is easy to do, simply use the
|
||||
:ref:`exporter <utilities-exporter>` to dump your documents and database out
|
||||
into an arbitrary directory.
|
||||
|
||||
|
||||
.. _migrating-restoring:
|
||||
|
||||
Restoring
|
||||
---------
|
||||
|
||||
Restoring your data is just as easy, since nearly all of your data exists either
|
||||
in the file names, or in the contents of the files themselves. You just need to
|
||||
create an empty database (just follow the
|
||||
:ref:`installation instructions <setup-installation>` again) and then import the
|
||||
``tags.json`` file you created as part of your backup. Lastly, copy your
|
||||
exported documents into the consumption directory and start up the consumer.
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ cd /path/to/project
|
||||
$ rm data/db.sqlite3 # Delete the database
|
||||
$ cd src
|
||||
$ ./manage.py migrate # Create the database
|
||||
$ ./manage.py createsuperuser
|
||||
$ ./manage.py loaddata /path/to/arbitrary/place/tags.json
|
||||
$ cp /path/to/exported/docs/* /path/to/consumption/dir/
|
||||
$ ./manage.py document_consumer
|
||||
|
||||
Importing your data if you are :ref:`using Docker <setup-installation-docker>`
|
||||
is almost as simple:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
# Stop and remove your current containers
|
||||
$ docker-compose stop
|
||||
$ docker-compose rm -f
|
||||
|
||||
# Recreate them, add the superuser
|
||||
$ docker-compose up -d
|
||||
$ docker-compose run --rm webserver createsuperuser
|
||||
|
||||
# Load the tags
|
||||
$ cat /path/to/arbitrary/place/tags.json | docker-compose run --rm webserver loaddata_stdin -
|
||||
|
||||
# Load your exported documents into the consumption directory
|
||||
# (How you do this highly depends on how you have set this up)
|
||||
$ cp /path/to/exported/docs/* /path/to/mounted/consumption/dir/
|
||||
|
||||
After loading the documents into the consumption directory the consumer will
|
||||
immediately start consuming the documents.
|
||||
|
||||
|
||||
.. _migrating-updates:
|
||||
|
||||
Updates
|
||||
-------
|
||||
|
||||
For the most part, all you have to do to update Paperless is run ``git pull``
|
||||
on the directory containing the project files, and then use Django's
|
||||
``migrate`` command to execute any database schema updates that might have been
|
||||
rolled in as part of the update:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ cd /path/to/project
|
||||
$ git pull
|
||||
$ pip install -r requirements.txt
|
||||
$ cd src
|
||||
$ ./manage.py migrate
|
||||
|
||||
Note that it's possible (even likely) that while ``git pull`` may update some
|
||||
files, the ``migrate`` step may not update anything. This is totally normal.
|
||||
|
||||
Additionally, as new features are added, the ability to control those features
|
||||
is typically added by way of an environment variable set in ``paperless.conf``.
|
||||
You may want to take a look at the ``paperless.conf.example`` file to see if
|
||||
there's anything new in there compared to what you've got in ``/etc``.
|
||||
|
||||
If you are :ref:`using Docker <setup-installation-docker>` the update process
|
||||
is similar:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ cd /path/to/project
|
||||
$ git pull
|
||||
$ docker build -t paperless .
|
||||
$ docker-compose run --rm consumer migrate
|
||||
$ docker-compose up -d
|
||||
|
||||
If ``git pull`` doesn't report any changes, there is no need to continue with
|
||||
the remaining steps.
|
@@ -1,125 +0,0 @@
|
||||
.. _requirements:
|
||||
|
||||
Requirements
|
||||
============
|
||||
|
||||
You need a Linux machine or Unix-like setup (theoretically an Apple machine
|
||||
should work) that has the following software installed:
|
||||
|
||||
* `Python3`_ (with development libraries, pip and virtualenv)
|
||||
* `GNU Privacy Guard`_
|
||||
* `Tesseract`_, plus its language files matching your document base.
|
||||
* `Imagemagick`_ version 6.7.5 or higher
|
||||
* `unpaper`_
|
||||
* `libpoppler-cpp-dev`_ PDF rendering library
|
||||
* `optipng`_
|
||||
|
||||
.. _Python3: https://python.org/
|
||||
.. _GNU Privacy Guard: https://gnupg.org
|
||||
.. _Tesseract: https://github.com/tesseract-ocr
|
||||
.. _Imagemagick: http://imagemagick.org/
|
||||
.. _unpaper: https://github.com/unpaper/unpaper
|
||||
.. _libpoppler-cpp-dev: https://poppler.freedesktop.org/
|
||||
.. _optipng: http://optipng.sourceforge.net/
|
||||
|
||||
Notably, you should confirm how you access your Python3 installation. Many
|
||||
Linux distributions will install Python3 in parallel to Python2, using the
|
||||
names ``python3`` and ``python`` respectively. The same goes for ``pip3`` and
|
||||
``pip``. Running Paperless with Python2 will likely break things, so make sure
|
||||
that you're using the right version.
|
||||
|
||||
For the purposes of simplicity, ``python`` and ``pip`` are used everywhere to
|
||||
refer to their Python3 versions.
|
||||
|
||||
In addition to the above, there are a number of Python requirements, all of
|
||||
which are listed in a file called ``requirements.txt`` in the project root
|
||||
directory.
|
||||
|
||||
If you're not working on a virtual environment (like Docker), you
|
||||
should probably be using a virtualenv, but that's your call. The reasons why
|
||||
you might choose a virtualenv or not aren't really within the scope of this
|
||||
document. Needless to say if you don't know what a virtualenv is, you should
|
||||
probably figure that out before continuing.
|
||||
|
||||
|
||||
.. _requirements-apple:
|
||||
|
||||
Problems with Imagemagick & PDFs
|
||||
--------------------------------
|
||||
|
||||
Some users have `run into problems`_ with getting ImageMagick to do its thing
|
||||
with PDFs. Often this is the case with Apple systems using HomeBrew, but other
|
||||
Linuxes have been a problem as well. The solution appears to be to install
|
||||
ghostscript as well as ImageMagick:
|
||||
|
||||
.. _run into problems: https://github.com/the-paperless-project/paperless/issues/25
|
||||
|
||||
.. code:: bash
|
||||
|
||||
$ brew install ghostscript
|
||||
$ brew install imagemagick
|
||||
$ brew install libmagic
|
||||
|
||||
|
||||
.. _requirements-baremetal:
|
||||
|
||||
Python-specific Requirements: No Virtualenv
|
||||
-------------------------------------------
|
||||
|
||||
If you don't care to use a virtual env, then installation of the Python
|
||||
dependencies is easy:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
$ pip install --user --requirement /path/to/paperless/requirements.txt
|
||||
|
||||
This will download and install all of the requirements into
|
||||
``${HOME}/.local``. Remember that your distribution may be using ``pip3`` as
|
||||
mentioned above.
|
||||
|
||||
|
||||
.. _requirements-virtualenv:
|
||||
|
||||
Python-specific Requirements: Virtualenv
|
||||
----------------------------------------
|
||||
|
||||
Using a virtualenv for this is pretty straightforward: create a virtualenv,
|
||||
enter it, and install the requirements using the ``requirements.txt`` file:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
$ virtualenv --python=/path/to/python3 /path/to/arbitrary/directory
|
||||
$ . /path/to/arbitrary/directory/bin/activate
|
||||
$ pip install --requirement /path/to/paperless/requirements.txt
|
||||
|
||||
Now you're ready to go. Just remember to enter (activate) your virtualenv
|
||||
whenever you want to use Paperless.
|
||||
|
||||
|
||||
.. _requirements-documentation:
|
||||
|
||||
Documentation
|
||||
-------------
|
||||
|
||||
As generation of the documentation is not required for the use of Paperless,
|
||||
dependencies for this process are not included in ``requirements.txt``. If
|
||||
you'd like to generate your own docs locally, you'll need to:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
$ pip install sphinx
|
||||
|
||||
and then cd into the ``docs`` directory and type ``make html``.
|
||||
|
||||
If you are using Docker, you can use the following commands to build the
|
||||
documentation and run a webserver serving it on `port 8001`_:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
$ pwd
|
||||
/path/to/paperless
|
||||
|
||||
$ docker build -t paperless:docs -f docs/Dockerfile .
|
||||
$ docker run --rm -it -p "8001:8000" paperless:docs
|
||||
|
||||
.. _port 8001: http://127.0.0.1:8001
|
@@ -1,12 +1,14 @@
|
||||
|
||||
.. _scanners:
|
||||
|
||||
Scanner Recommendations
|
||||
=======================
|
||||
***********************
|
||||
Scanner recommendations
|
||||
***********************
|
||||
|
||||
As Paperless operates by watching a folder for new files, it doesn't care what
|
||||
scanner you use, but sometimes finding a scanner that will write to an FTP,
|
||||
NFS, or SMB server can be difficult. This page is here to help you find one
|
||||
that works right for you based on recommentations from other Paperless users.
|
||||
that works right for you based on recommendations from other Paperless users.
|
||||
|
||||
+---------+----------------+-----+-----+-----+----------------+
|
||||
| Brand | Model | Supports | Recommended By |
|
||||
@@ -25,6 +27,8 @@ that works right for you based on recommentations from other Paperless users.
|
||||
+---------+----------------+-----+-----+-----+----------------+
|
||||
| Epson | `WF-7710DWF`_ | yes | | yes | `Skylinar`_ |
|
||||
+---------+----------------+-----+-----+-----+----------------+
|
||||
| Fujitsu | `S1300i`_ | yes | | yes | `jonaswinkler`_|
|
||||
+---------+----------------+-----+-----+-----+----------------+
|
||||
|
||||
.. _ADS-1500W: https://www.brother.ca/en/p/ads1500w
|
||||
.. _MFC-J6930DW: https://www.brother.ca/en/p/MFCJ6930DW
|
||||
@@ -32,6 +36,7 @@ that works right for you based on recommentations from other Paperless users.
|
||||
.. _MFC-9142CDN: https://www.brother.co.uk/printers/laser-printers/mfc9140cdn
|
||||
.. _ix500: http://www.fujitsu.com/us/products/computing/peripheral/scanners/scansnap/ix500/
|
||||
.. _WF-7710DWF: https://www.epson.de/en/products/printers/inkjet-printers/for-home/workforce-wf-7710dwf
|
||||
.. _S1300i: https://www.fujitsu.com/global/products/computing/peripheral/scanners/soho/s1300i/
|
||||
|
||||
.. _danielquinn: https://github.com/danielquinn
|
||||
.. _ayounggun: https://github.com/ayounggun
|
||||
@@ -39,3 +44,4 @@ that works right for you based on recommentations from other Paperless users.
|
||||
.. _eonist: https://github.com/eonist
|
||||
.. _REOLDEV: https://github.com/REOLDEV
|
||||
.. _Skylinar: https://github.com/Skylinar
|
||||
.. _jonaswinkler: https://github.com/jonaswinkler
|
||||
|
@@ -1,16 +1,45 @@
|
||||
.. _screenshots:
|
||||
|
||||
***********
|
||||
Screenshots
|
||||
===========
|
||||
***********
|
||||
|
||||
Once everything is set-up login to paperless using the web front-end
|
||||
This is what paperless-ng looks like. You shouldn't use paperless to index
|
||||
research papers though, it's a horrible tool for that job.
|
||||
|
||||
.. image:: ./_static/Screenshot_first_run_login.png
|
||||
The dashboard shows customizable views of your documents and allows document uploads:
|
||||
|
||||
Nice clean interface
|
||||
.. image:: _static/screenshots/dashboard.png
|
||||
|
||||
.. image:: ./_static/Screenshot_first_logged.png
|
||||
The document list provides three different styles to scroll through your documents:
|
||||
|
||||
Some documents loaded in via ftp or using the scanners ftp.
|
||||
.. image:: _static/screenshots/documents-table.png
|
||||
.. image:: _static/screenshots/documents-smallcards.png
|
||||
.. image:: _static/screenshots/documents-largecards.png
|
||||
|
||||
Extensive filtering mechanisms:
|
||||
|
||||
.. image:: _static/screenshots/documents-filter.png
|
||||
|
||||
Side-by-side editing of documents. Optimized for 1080p.
|
||||
|
||||
.. image:: _static/screenshots/editing.png
|
||||
|
||||
Tag editing. This looks about the same for correspondents and document types.
|
||||
|
||||
.. image:: _static/screenshots/new-tag.png
|
||||
|
||||
Searching provides auto complete and highlights the results.
|
||||
|
||||
.. image:: _static/screenshots/search-preview.png
|
||||
.. image:: _static/screenshots/search-results.png
|
||||
|
||||
Fancy mail filters!
|
||||
|
||||
.. image:: _static/screenshots/mail-rules-edited.png
|
||||
|
||||
Mobile support in the future? This kinda works, however some layouts are still
|
||||
too wide.
|
||||
|
||||
.. image:: _static/screenshots/mobile.png
|
||||
|
||||
.. image:: ./_static/Screenshot_upload_and_scanned.png
|
||||
|
894
docs/setup.rst
@@ -1,75 +1,51 @@
|
||||
.. _troubleshooting:
|
||||
|
||||
***************
|
||||
Troubleshooting
|
||||
===============
|
||||
***************
|
||||
|
||||
.. _troubleshooting-languagemissing:
|
||||
No files are added by the consumer
|
||||
##################################
|
||||
|
||||
Consumer warns ``OCR for XX failed``
|
||||
------------------------------------
|
||||
Check for the following issues:
|
||||
|
||||
If you find the OCR accuracy to be too low, and/or the document consumer warns
|
||||
that ``OCR for XX failed, but we're going to stick with what we've got since
|
||||
FORGIVING_OCR is enabled``, then you might need to install the
|
||||
`Tesseract language files <http://packages.ubuntu.com/search?keywords=tesseract-ocr>`_
|
||||
matching your document's languages.
|
||||
* Ensure that the directory you're putting your documents in is the folder
|
||||
paperless is watching. With docker, this setting is performed in the
|
||||
``docker-compose.yml`` file. Without docker, look at the ``CONSUMPTION_DIR``
|
||||
setting. Don't adjust this setting if you're using docker.
|
||||
* Ensure that redis is up and running. Paperless does its task processing
|
||||
asynchronously, and for documents to arrive at the task processor, it needs
|
||||
redis to run.
|
||||
* Ensure that the task processor is running. Docker does this automatically.
|
||||
Manually invoke the task processor by executing
|
||||
|
||||
As an example, if you are running Paperless from any Ubuntu or Debian
|
||||
box, and your documents are written in Spanish you may need to run::
|
||||
.. code:: shell-session
|
||||
|
||||
apt-get install -y tesseract-ocr-spa
|
||||
$ python3 manage.py qcluster
|
||||
|
||||
* Look at the output of paperless and inspect it for any errors.
|
||||
* Go to the admin interface, and check if there are failed tasks. If so, the
|
||||
tasks will contain an error message.
|
||||
|
||||
|
||||
.. _troubleshooting-convertpixelcache:
|
||||
Consumer fails to pick up any new files
|
||||
#######################################
|
||||
|
||||
Consumer dies with ``convert: unable to extent pixel cache``
|
||||
------------------------------------------------------------
|
||||
If you notice that the consumer will only pick up files in the consumption
|
||||
directory at startup, but won't find any other files added later, check out
|
||||
the configuration file and enable filesystem polling with the setting
|
||||
``PAPERLESS_CONSUMER_POLLING``.
|
||||
|
||||
During the consumption process, Paperless invokes ImageMagick's ``convert``
|
||||
program to translate the source document into something that the OCR engine can
|
||||
understand and this can burn a Very Large amount of memory if the original
|
||||
document is rather long. Similarly, if your system doesn't have a lot of
|
||||
memory to begin with (ie. a Raspberry Pi), then this can happen for even
|
||||
medium-sized documents.
|
||||
Operation not permitted
|
||||
#######################
|
||||
|
||||
The solution is to tell ImageMagick *not* to Use All The RAM, as is its
|
||||
default, and instead tell it to use a fixed amount. ``convert`` will then
|
||||
break up the job into hundreds of individual files and use them to slowly
|
||||
compile the finished image. Simply set ``PAPERLESS_CONVERT_MEMORY_LIMIT`` in
|
||||
``/etc/paperless.conf`` to something like ``32000000`` and you'll limit
|
||||
``convert`` to 32MB. Fiddle with this value as you like.
|
||||
You might see errors such as:
|
||||
|
||||
**HOWEVER**: Simply setting this value may not be enough on systems where
|
||||
``/tmp`` is mounted as tmpfs, as this is where ``convert`` will write its
|
||||
temporary files. In these cases (most Systemd machines), you need to tell
|
||||
ImageMagick to use a different space for its scratch work. You do this by
|
||||
setting ``PAPERLESS_CONVERT_TMPDIR`` in ``/etc/paperless.conf`` to somewhere
|
||||
that's actually on a physical disk (and writable by the user running
|
||||
Paperless), like ``/var/tmp/paperless`` or ``/home/my_user/tmp`` in a pinch.
|
||||
.. code::
|
||||
|
||||
chown: changing ownership of '../export': Operation not permitted
|
||||
|
||||
.. _troubleshooting-decompressionbombwarning:
|
||||
The container tries to set file ownership on the listed directories. This is
|
||||
required so that the user running paperless inside docker has write permissions
|
||||
to these folders. This happens when pointing these directories to NFS shares,
|
||||
for example.
|
||||
|
||||
DecompressionBombWarning and/or no text in the OCR output
|
||||
---------------------------------------------------------
|
||||
Some users have had issues using Paperless to consume PDFs that were created
|
||||
by merging Very Large Scanned Images into one PDF. If this happens to you,
|
||||
it's likely because the PDF you've created contains some very large pages
|
||||
(millions of pixels) and the process of converting the PDF to an OCR-friendly
|
||||
image is exploding.
|
||||
|
||||
Typically, this happens because the scanned images are created with a high
|
||||
DPI and then rolled into the PDF with an assumed DPI of 72 (the default).
|
||||
The best solution then is to specify the DPI used in the scan in the
|
||||
conversion-to-PDF step. So for example, if you scanned the original image
|
||||
with a DPI of 300, then merging the images into the single PDF with
|
||||
``convert`` should look like this:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
$ convert -density 300 *.jpg finished.pdf
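
The arithmetic behind this can be sketched in a few lines. The numbers are illustrative: an A4 page scanned at 300 DPI is roughly 2480 x 3508 pixels, and the sketch assumes the PDF is later rasterised at 300 DPI again. The ``rendered_pixels`` helper is hypothetical.

.. code:: python

    def rendered_pixels(scan_px, assumed_dpi, render_dpi):
        """Pixel size of a page after it is embedded at assumed_dpi
        and rasterised again at render_dpi."""
        width_px, height_px = scan_px
        # The PDF page size in inches follows from the assumed DPI.
        width_in = width_px / assumed_dpi
        height_in = height_px / assumed_dpi
        return round(width_in * render_dpi), round(height_in * render_dpi)


    scan = (2480, 3508)  # A4 at 300 DPI, approximately
    wrong = rendered_pixels(scan, assumed_dpi=72, render_dpi=300)
    right = rendered_pixels(scan, assumed_dpi=300, render_dpi=300)
    print(wrong, wrong[0] * wrong[1])
    print(right, right[0] * right[1])

With the wrong density the page grows by a factor of (300 / 72) squared, about 17 times as many pixels, which is what triggers the warning.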
|
||||
|
||||
For more information on this and situations like it, you should take a look
|
||||
at `Issue #118`_ as that's where this tip originated.
|
||||
|
||||
.. _Issue #118: https://github.com/the-paperless-project/paperless/issues/118
|
||||
Ensure that ``chown`` is possible on these directories.
|
||||
|
403
docs/usage_overview.rst
Normal file
@@ -0,0 +1,403 @@
|
||||
**************
|
||||
Usage Overview
|
||||
**************
|
||||
|
||||
Paperless is an application that manages your personal documents. With
|
||||
the help of a document scanner (see :ref:`scanners`), paperless transforms
|
||||
your unwieldy physical document binders into a searchable archive and
|
||||
provides many utilities for finding and managing your documents.
|
||||
|
||||
|
||||
Terms and definitions
|
||||
#####################
|
||||
|
||||
Paperless essentially consists of two different parts for managing your
|
||||
documents:
|
||||
|
||||
* The *consumer* watches a specified folder and adds all documents in that
|
||||
folder to paperless.
|
||||
* The *web server* provides a UI that you use to manage and search for your
|
||||
scanned documents.
|
||||
|
||||
Each document has a couple of fields that you can assign to them:
|
||||
|
||||
* A *Document* is a piece of paper that sometimes contains valuable
|
||||
information.
|
||||
* The *correspondent* of a document is the person, institution or company that
|
||||
a document either originates from, or is sent to.
|
||||
* A *tag* is a label that you can assign to documents. Think of labels as more
|
||||
powerful folders: Multiple documents can be grouped together with a single
|
||||
tag, however, a single document can also have multiple tags. This is not
|
||||
possible with folders. The reason folders are not implemented in paperless
|
||||
is simply that tags are much more versatile than folders.
|
||||
* A *document type* is used to demarcate the type of a document such as letter,
|
||||
bank statement, invoice, contract, etc. It is used to identify what a document
|
||||
is about.
|
||||
* The *date added* of a document is the date the document was scanned into
|
||||
paperless. You cannot and should not change this date.
|
||||
* The *date created* of a document is the date the document was initially issued.
|
||||
This can be the date you bought a product, the date you signed a contract, or
|
||||
the date a letter was sent to you.
|
||||
* The *archive serial number* (short: ASN) of a document is the identifier of
|
||||
the document in your physical document binders. See
|
||||
:ref:`usage-recommended_workflow` below.
|
||||
* The *content* of a document is the text that was OCR'ed from the document.
|
||||
This text is fed into the search engine and is used for matching tags,
|
||||
correspondents and document types.
|
||||
|
||||
|
||||
Frontend overview
|
||||
#################
|
||||
|
||||
.. warning::
|
||||
|
||||
TBD. Add some fancy screenshots!
|
||||
|
||||
Adding documents to paperless
|
||||
#############################
|
||||
|
||||
Once you've got Paperless set up, you need to start feeding documents into it.
|
||||
When adding documents to paperless, it will perform the following operations on
|
||||
your documents:
|
||||
|
||||
1. OCR the document, if it has no text. Digital documents usually have text,
|
||||
and this step will be skipped for those documents.
|
||||
2. Paperless will create an archivable PDF/A document from your document.
|
||||
If this document is coming from your scanner, it will have embedded selectable text.
|
||||
3. Paperless performs automatic matching of tags, correspondents and types on the
|
||||
document before storing it in the database.
|
||||
|
||||
.. hint::
|
||||
|
||||
This process can be configured to fit your needs. If you don't want paperless
|
||||
to create archived versions for digital documents, you can configure that by
|
||||
setting ``PAPERLESS_OCR_MODE=skip_noarchive``. Please read the
|
||||
:ref:`relevant section in the documentation <configuration-ocr>`.
|
||||
|
||||
.. note::
|
||||
|
||||
No matter which options you choose, Paperless will always store the original
|
||||
document that it found in the consumption directory or in the mail and
|
||||
will never overwrite that document. Archived versions are stored alongside the
|
||||
original versions.
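
The decision flow described above can be summarised in a small sketch. It models only two modes, a default called ``skip`` (an assumption) and ``skip_noarchive`` as mentioned in the hint; the ``plan_consumption`` function and its step names are purely illustrative.

.. code:: python

    def plan_consumption(has_text, ocr_mode="skip"):
        """Return the steps performed for one incoming document."""
        steps = ["store original"]  # the original is always kept
        if not has_text:
            # Scanned documents are OCR'ed and get a searchable archive copy.
            steps.append("run OCR")
            steps.append("create PDF/A archive")
        elif ocr_mode != "skip_noarchive":
            # Digital documents skip OCR but are still archived by default.
            steps.append("create PDF/A archive")
        steps.append("match tags, correspondents and types")
        return steps


    print(plan_consumption(has_text=False))
    print(plan_consumption(has_text=True, ocr_mode="skip_noarchive"))

See the configuration section for the full list of modes and their exact meaning.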
|
||||
|
||||
|
||||
The consumption directory
|
||||
=========================
|
||||
|
||||
The primary method of getting documents into your database is by putting them in
|
||||
the consumption directory. The consumer runs in an infinite
|
||||
loop looking for new additions to this directory and when it finds them, it goes
|
||||
about the process of parsing them with the OCR, indexing what it finds, and storing
|
||||
it in the media directory.
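
Conceptually, one pass of such a loop looks like the sketch below. It uses plain directory polling from the Python standard library; the real consumer may rely on filesystem notifications instead, and the ``find_new_files`` helper is hypothetical.

.. code:: python

    import os


    def find_new_files(directory, seen):
        """Return files that appeared since the last call and remember them."""
        current = {
            entry.name for entry in os.scandir(directory) if entry.is_file()
        }
        new = sorted(current - seen)
        seen |= current  # update the caller's set in place
        return new

A consumer would call this repeatedly, for example once every few seconds, and hand each new file to the OCR and indexing steps.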
|
||||
|
||||
Getting stuff into this directory is up to you. If you're running Paperless
|
||||
on your local computer, you might just want to drag and drop files there, but if
|
||||
you're running this on a server and want your scanner to automatically push
|
||||
files to this directory, you'll need to set up some sort of service to accept the
|
||||
files from the scanner. Typically, you're looking at an FTP server like
|
||||
`Proftpd`_ or a Windows folder share with `Samba`_.
|
||||
|
||||
.. _Proftpd: http://www.proftpd.org/
|
||||
.. _Samba: http://www.samba.org/
|
||||
|
||||
.. TODO: hyperref to configuration of the location of this magic folder.
|
||||
|
||||
Dashboard upload
|
||||
================
|
||||
|
||||
The dashboard has a file drop field to upload documents to paperless. Simply drag a file
|
||||
onto this field or select a file with the file dialog. Multiple files are supported.
|
||||
|
||||
|
||||
Mobile upload
|
||||
=============
|
||||
|
||||
The mobile app over at `<https://github.com/qcasey/paperless_share>`_ allows Android users
|
||||
to share any documents with paperless. This can be combined with any of the mobile
|
||||
scanning apps out there, such as Office Lens.
|
||||
|
||||
Furthermore, there is the `Paperless App <https://github.com/bauerj/paperless_app>`_ as well,
|
||||
which not only has document upload, but also document editing and browsing.
|
||||
|
||||
.. _usage-email:
|
||||
|
||||
IMAP (Email)
|
||||
============
|
||||
|
||||
You can tell paperless-ng to consume documents from your email accounts.
|
||||
This is a very flexible and powerful feature, if you regularly receive documents
|
||||
via mail that you need to archive. The mail consumer can be configured by using the
|
||||
admin interface in the following manner:
|
||||
|
||||
1. Define e-mail accounts.
|
||||
2. Define mail rules for your account.
|
||||
|
||||
These rules perform the following:
|
||||
|
||||
1. Connect to the mail server.
|
||||
2. Fetch all matching mails (as defined by folder, maximum age and the filters).
|
||||
3. Check if there are any consumable attachments.
|
||||
4. If so, instruct paperless to consume the attachments and optionally
|
||||
use the metadata provided in the rule for the new document.
|
||||
5. If documents were consumed from a mail, the rule action is performed
|
||||
on that mail.
|
||||
|
||||
Paperless will completely ignore mails that do not match your filters. It will also
|
||||
only perform the action on mails that it has consumed documents from.
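The behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual mail consumer: the ``Mail`` and ``Rule`` classes, the ``CONSUMABLE`` suffix list and the ``process_rule`` function are hypothetical names invented for this example, and the real filters (folder, maximum age and so on) are reduced to a single ``matches`` callable.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Mail:
    subject: str
    attachments: List[str] = field(default_factory=list)
    read: bool = False


@dataclass
class Rule:
    # Stands in for folder, maximum age and the other filters (hypothetical).
    matches: Callable[[Mail], bool]
    # Stands in for the rule action: delete, mark as read, flag or move.
    action: Callable[[Mail], None]


# Assumed set of consumable file types, for illustration only.
CONSUMABLE = (".pdf", ".png", ".jpg")


def mark_as_read(mail: Mail) -> None:
    mail.read = True


def process_rule(rule: Rule, inbox: List[Mail]) -> List[str]:
    """Return the attachments that would be handed to the consumer."""
    consumed: List[str] = []
    for mail in inbox:
        if not rule.matches(mail):
            continue  # mails that do not match the filters are ignored
        docs = [a for a in mail.attachments if a.lower().endswith(CONSUMABLE)]
        if docs:
            consumed.extend(docs)
            rule.action(mail)  # only performed if something was consumed
    return consumed
```

Note that a matching mail without consumable attachments is left untouched, which mirrors the statement that the action is only performed on mails that documents were consumed from.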
|
||||
|
||||
The actions all ensure that the same mail is not consumed twice by different means.
|
||||
These are as follows:
|
||||
|
||||
* **Delete:** Immediately deletes mail that paperless has consumed documents from.
|
||||
Use with caution.
|
||||
* **Mark as read:** Mark consumed mail as read. Paperless will not consume documents
|
||||
from already read mails. If you read a mail before paperless sees it, it will be
|
||||
ignored.
|
||||
* **Flag:** Sets the 'important' flag on mails with consumed documents. Paperless
|
||||
will not consume flagged mails.
|
||||
* **Move to folder:** Moves consumed mails out of the way so that paperless won't
|
||||
consume them again.
|
||||
|
||||
.. caution::
|
||||
|
||||
The mail consumer will perform these actions on all mails it has consumed
|
||||
documents from. Keep in mind that the actual consumption process may fail
|
||||
for some reason, leaving you with missing documents in paperless.
|
||||
|
||||
.. note::
|
||||
|
||||
With the correct set of rules, you can completely automate your email documents.
|
||||
Create rules for every correspondent you receive digital documents from and
|
||||
paperless will read them automatically. The default action "mark as read" is
|
||||
pretty tame and will not cause any damage or data loss whatsoever.
|
||||
|
||||
You can also set up a special folder in your mail account for paperless and use
|
||||
your favorite mail client to move to-be-consumed mails into that folder
|
||||
automatically or manually and tell paperless to move them to yet another folder
|
||||
after consumption. It's up to you.
|
||||
|
||||
.. note::
|
||||
|
||||
Paperless will process the rules in the order defined in the admin page.
|
||||
|
||||
You can define catch-all rules and have them executed last to consume
|
||||
any documents not matched by previous rules. Such a rule may assign an "Unknown
|
||||
mail document" tag to consumed documents so you can inspect them further.
|
||||
|
||||
Paperless is set up to check your mails every 10 minutes. This can be configured on the
|
||||
'Scheduled tasks' page in the admin.
|
||||
|
||||
|
||||
REST API
|
||||
========
|
||||
|
||||
You can also submit a document using the REST API, see :ref:`api-file_uploads` for details.
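As a rough sketch of what such an upload could look like from a script, the following builds a multipart request using only the Python standard library. The endpoint path ``/api/documents/post_document/``, the form field name ``document`` and the ``Token`` authorization scheme are assumptions made for this example; check :ref:`api-file_uploads` for the authoritative details. ``build_upload_request`` is a hypothetical helper, not part of paperless.

```python
import urllib.request
import uuid


def build_upload_request(base_url: str, filename: str, content: bytes,
                         token: str) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data upload request."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        # Assumed form field name: "document".
        (f'Content-Disposition: form-data; name="document"; '
         f'filename="{filename}"\r\n').encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        content,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        # Assumed endpoint path; verify against the API documentation.
        url=base_url.rstrip("/") + "/api/documents/post_document/",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Token {token}",  # assumed auth scheme
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned request to ``urllib.request.urlopen`` would send it to the server.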
|
||||
|
||||
.. _basic-searching:
|
||||
|
||||
|
||||
Best practices
|
||||
##############
|
||||
|
||||
Paperless offers a couple tools that help you organize your document collection. However,
|
||||
it is up to you to use them in a way that helps you organize documents and find specific
|
||||
documents when you need them. This section offers a couple ideas for managing your collection.
|
||||
|
||||
Document types allow you to classify documents according to what they are. You can define
|
||||
types such as "Receipt", "Invoice", or "Contract". If you used to collect all your receipts
|
||||
in a single binder, you can recreate that system in paperless by defining a document type,
|
||||
assigning documents to that type and then filtering by that type to only see all receipts.
|
||||
|
||||
Not all documents need document types. Sometimes it's hard to determine what the type of a
|
||||
document is or it is hard to justify creating a document type that you only need once or twice.
|
||||
This is okay. As long as the types you define help you organize your collection in the way
|
||||
you want, paperless is doing its job.
|
||||
|
||||
Tags can be used in many different ways. Think of tags as more versatile folders or binders.
|
||||
If you have a binder for documents related to university / your car or health care, you can
|
||||
create these binders in paperless by creating tags and assigning them to relevant documents.
|
||||
Just as with document types, you can filter the document list by tags and only see documents of
|
||||
a certain topic.
|
||||
|
||||
With physical documents, you'll often need to decide which folder the document belongs to.
|
||||
The advantage of tags over folders and binders is that a single document can have multiple
|
||||
tags. A physical document cannot magically appear in two different folders, but with tags,
|
||||
this is entirely possible.
|
||||
|
||||
.. hint::
|
||||
|
||||
This can be used in many different ways. One example: Imagine you're working on a particular
|
||||
task, such as signing up for university. Usually you'll need to collect a bunch of different
|
||||
documents that are already sorted into various folders. With the tag system of paperless,
|
||||
you can create a new group of documents that are relevant to this task without destroying
|
||||
the already existing organization. When you're done with the task, you could delete the
|
||||
tag again, which would be equal to sorting documents back into the folder they belong in.
|
||||
Or keep the tag, up to you.
|
||||
|
||||
All of the logic above applies to correspondents as well. Attach them to documents if you
|
||||
feel that they help you organize your collection.
|
||||
|
||||
When you've started organizing your documents, create a couple saved views for document collections
|
||||
you regularly access. This is equal to having labeled physical binders on your desk, except
|
||||
that these saved views are dynamic and simply update themselves as you add documents to the system.
|
||||
|
||||
Here are a couple examples of tags and types that you could use in your collection.
|
||||
|
||||
* An ``inbox`` tag for newly added documents that you haven't manually edited yet.
|
||||
* A tag ``car`` for everything car related (repairs, registration, insurance, etc.).
|
||||
* A tag ``todo`` for documents that you still need to do something with, such as reply, or
|
||||
perform some task online.
|
||||
* A tag ``bank account x`` for all bank statements related to that account.
|
||||
* A tag ``mail`` for anything that you added to paperless via its mail processing capabilities.
|
||||
* A tag ``missing_metadata`` when you still need to add some metadata to a document, but can't
|
||||
or don't want to do this right now.
|
||||
|
||||
Searching
|
||||
#########
|
||||
|
||||
Paperless offers an extensive searching mechanism that is designed to allow you to quickly
|
||||
find a document you're looking for (for example, that thing that just broke and you bought
|
||||
a couple months ago, that contract you signed 8 years ago).
|
||||
|
||||
When you search paperless for a document, it tries to match this query against your documents.
|
||||
Paperless will look for matching documents by inspecting their content, title, correspondent,
|
||||
type and tags. Paperless returns a scored list of results, so that documents matching your query
|
||||
better will appear further up in the search results.
|
||||
|
||||
By default, paperless returns only documents which contain all words typed in the search bar.
|
||||
However, paperless also offers advanced search syntax if you want to drill down the results
|
||||
further.
|
||||
|
||||
Matching documents with logical expressions:
|
||||
|
||||
.. code::
|
||||
|
||||
shopname AND (product1 OR product2)
|
||||
|
||||
Matching specific tags, correspondents or types:
|
||||
|
||||
.. code::
|
||||
|
||||
type:invoice tag:unpaid
|
||||
correspondent:university certificate
|
||||
|
||||
Matching dates:
|
||||
|
||||
.. code::
|
||||
|
||||
created:[2005 to 2009]
|
||||
added:yesterday
|
||||
modified:today
|
||||
|
||||
Matching inexact words:
|
||||
|
||||
.. code::
|
||||
|
||||
produ*name
|
||||
|
||||
.. note::
|
||||
|
||||
Inexact terms are hard for search indexes. These queries might take a while to execute. That's why paperless offers
|
||||
auto complete and query correction.
|
||||
|
||||
All of these constructs can be combined as you see fit.
|
||||
Paperless uses Whoosh's default query language. To learn more about it,
|
||||
head over to `Whoosh query language <https://whoosh.readthedocs.io/en/latest/querylang.html>`_.
|
||||
For details on what date parsing utilities are available, see
|
||||
`Date parsing <https://whoosh.readthedocs.io/en/latest/dates.html#parsing-date-queries>`_.
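If you generate search queries from a script, the constructs shown above can be combined with simple string handling. The helper below is a hypothetical sketch (``build_query`` is not part of paperless); it only assembles the query text and relies on the default behaviour that all words separated by spaces must match.

```python
from typing import Iterable, Optional


def build_query(words: Iterable[str],
                any_of: Iterable[str] = (),
                doc_type: Optional[str] = None,
                tag: Optional[str] = None,
                correspondent: Optional[str] = None) -> str:
    """Combine plain words, an OR group and field filters into one query."""
    parts = list(words)
    alternatives = list(any_of)
    if alternatives:
        # Logical expression, e.g. (product1 OR product2)
        parts.append("(" + " OR ".join(alternatives) + ")")
    if doc_type:
        parts.append(f"type:{doc_type}")
    if tag:
        parts.append(f"tag:{tag}")
    if correspondent:
        parts.append(f"correspondent:{correspondent}")
    return " ".join(parts)
```

For example, ``build_query(["certificate"], correspondent="university")`` produces a query that restricts the word ``certificate`` to documents from that correspondent.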
|
||||
|
||||
|
||||
.. _usage-recommended_workflow:
|
||||
|
||||
The recommended workflow
|
||||
########################
|
||||
|
||||
Once you have familiarized yourself with paperless and are ready to use it
|
||||
for all your documents, the recommended workflow for managing your documents
|
||||
is as follows. This workflow also takes into account that some documents
|
||||
have to be kept in physical form, but still ensures that you get all the
|
||||
advantages for these documents as well.
|
||||
|
||||
The following diagram shows how easy it is to manage your documents.
|
||||
|
||||
.. image:: _static/recommended_workflow.png
|
||||
|
||||
Preparations in paperless
|
||||
=========================
|
||||
|
||||
* Create an inbox tag that gets assigned to all new documents.
|
||||
* Create a TODO tag.
|
||||
|
||||
Processing of the physical documents
|
||||
====================================
|
||||
|
||||
Keep a physical inbox. Whenever you receive a document that you need to
|
||||
archive, put it into your inbox. Regularly, do the following for all documents
|
||||
in your inbox:
|
||||
|
||||
1. For each document, decide if you need to keep the document in physical
|
||||
form. This applies to certain important documents, such as contracts and
|
||||
certificates.
|
||||
2. If you need to keep the document, write a running number on the document
|
||||
before scanning, starting at one and counting upwards. This is the archive
|
||||
serial number, or ASN in short.
|
||||
3. Scan the document.
|
||||
4. If the document has an ASN assigned, store it in a *single* binder, sorted
|
||||
by ASN. Don't order this binder in any other way.
|
||||
5. If the document has no ASN, throw it away. Yay!
|
||||
|
||||
Over time, you will notice that your physical binder will fill up. If it is
|
||||
full, label the binder with the range of ASNs in this binder (e.g., "Documents
|
||||
1 to 343"), store the binder in your cellar or elsewhere, and start a new
|
||||
binder.
|
||||
|
||||
The idea behind this process is that you will never have to use the physical
|
||||
binders to find a document. If you need a specific physical document, you
|
||||
may find this document by:
|
||||
|
||||
1. Searching in paperless for the document.
|
||||
2. Identifying the ASN of the document, since it appears on the scan.
|
||||
3. Grabbing the relevant document binder and getting the document. This is easy since
|
||||
they are sorted by ASN.
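Because every binder covers a contiguous range of ASNs, finding the right binder is a simple range lookup. The following sketch illustrates the idea; the binder list and the ``find_binder`` function are made up for this example and are not part of paperless.

```python
import bisect
from typing import List, Tuple

# (first ASN, last ASN, label), sorted by first ASN. Example data only.
BINDERS: List[Tuple[int, int, str]] = [
    (1, 343, "Documents 1 to 343"),
    (344, 702, "Documents 344 to 702"),
]


def find_binder(asn: int, binders: List[Tuple[int, int, str]] = BINDERS) -> str:
    """Return the label of the binder whose ASN range contains ``asn``."""
    starts = [first for first, _, _ in binders]
    # Index of the last binder that starts at or before this ASN.
    index = bisect.bisect_right(starts, asn) - 1
    if index < 0 or asn > binders[index][1]:
        raise LookupError(f"no binder contains ASN {asn}")
    return binders[index][2]
```

Inside the binder, the documents themselves are sorted by ASN, so the same kind of lookup works by hand.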
|
||||
|
||||
Processing of documents in paperless
|
||||
====================================
|
||||
|
||||
Once you have scanned in a document, proceed in paperless as follows.
|
||||
|
||||
1. If the document has an ASN, assign the ASN to the document.
|
||||
2. Assign a correspondent to the document (e.g., your employer, bank, etc.).
|
||||
This isn't strictly necessary but helps in finding a document when you need
|
||||
it.
|
||||
3. Assign a document type (e.g., invoice, bank statement, etc.) to the document.
|
||||
This isn't strictly necessary but helps in finding a document when you need
|
||||
it.
|
||||
4. Assign a proper title to the document (the name of an item you bought, the
|
||||
subject of the letter, etc.).
|
||||
5. Check that the date of the document is correct. Paperless tries to read
|
||||
the date from the content of the document, but this fails sometimes if the
|
||||
OCR is bad or multiple dates appear on the document.
|
||||
6. Remove inbox tags from the documents.
|
||||
|
||||
.. hint::
|
||||
|
||||
You can set up manual matching rules for your correspondents and tags and
|
||||
paperless will assign them automatically. After consuming a couple documents,
|
||||
you can even ask paperless to *learn* when to assign tags and correspondents
|
||||
by itself. For details on this feature, see :ref:`advanced-matching`.
|
||||
|
||||
Task management
|
||||
===============
|
||||
|
||||
Some documents require attention and require you to act on the document. You
|
||||
may take two different approaches to handle these documents based on how
|
||||
regularly you intend to use paperless and scan documents.
|
||||
|
||||
* If you scan and process your documents in paperless regularly, assign a
|
||||
TODO tag to all scanned documents that you need to process. Create a saved
|
||||
view on the dashboard that shows all documents with this tag.
|
||||
* If you do not scan documents regularly and use paperless solely for archiving,
|
||||
create a physical TODO box next to your physical inbox and put documents you
|
||||
need to process in the TODO box. When you have performed the task associated with
|
||||
the document, move it to the inbox.
|
@@ -1,284 +0,0 @@
|
||||
.. _utilities:
|
||||
|
||||
Utilities
|
||||
=========
|
||||
|
||||
There are basically three utilities to Paperless: the webserver, consumer, and
|
||||
if needed, the exporter. They're all detailed here.
|
||||
|
||||
|
||||
.. _utilities-webserver:
|
||||
|
||||
The Webserver
|
||||
-------------
|
||||
|
||||
At the heart of it, Paperless is a simple Django webservice, and the entire
|
||||
interface is based on Django's standard admin interface. Once running, visiting
|
||||
the URL for your service delivers the admin, through which you can get a
|
||||
detailed listing of all available documents, search for specific files, and
|
||||
download whatever it is you're looking for.
|
||||
|
||||
|
||||
.. _utilities-webserver-howto:
|
||||
|
||||
How to Use It
|
||||
.............
|
||||
|
||||
The webserver is started via the ``manage.py`` script:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ /path/to/paperless/src/manage.py runserver
|
||||
|
||||
By default, the server runs on localhost, port 8000, but you can change this
|
||||
with a few arguments; run ``manage.py --help`` for more information.
|
||||
|
||||
Add the option ``--noreload`` to reduce resource usage. Otherwise, the server
|
||||
continuously polls all source files for changes to auto-reload them.
|
||||
|
||||
Note that when exiting this command your webserver will disappear.
|
||||
If you want to run this full-time (which is kind of the point)
|
||||
you'll need to have it start in the background -- something you'll need to
|
||||
figure out for your own system. To get you started though, there are Systemd
|
||||
service files in the ``scripts`` directory.
|
||||
|
||||
|
||||
.. _utilities-consumer:
|
||||
|
||||
The Consumer
|
||||
------------
|
||||
|
||||
The consumer script runs in an infinite loop, constantly looking at a directory
|
||||
for documents to parse and index. The process is pretty straightforward:
|
||||
|
||||
1. Look in ``CONSUMPTION_DIR`` for a document. If one is found, go to #2.
|
||||
If not, wait 10 seconds and try again. On Linux, new documents are detected
|
||||
instantly via inotify, so there's no waiting involved.
|
||||
2. Parse the document with Tesseract
|
||||
3. Create a new record in the database with the OCR'd text
|
||||
4. Attempt to automatically assign document attributes by doing some guesswork.
|
||||
Read up on the :ref:`guesswork documentation<guesswork>` for more
|
||||
information about this process.
|
||||
5. Encrypt the document (if you have a passphrase set) and store it in the
|
||||
``media`` directory under ``documents/originals``.
|
||||
6. Go to #1.
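The loop above can be pictured roughly as follows. This is a simplified sketch and not the actual consumer: ``consume_once`` and ``run_consumer`` are hypothetical names, the ``handle`` callable stands in for steps 2 to 5 (OCR, database record, guesswork and storage), and removing the file from the directory after handling is an assumption of this example.

```python
import time
from pathlib import Path
from typing import Callable, Optional


def consume_once(consumption_dir: Path,
                 handle: Callable[[Path], None]) -> Optional[Path]:
    """Process at most one document; return its path, or None if idle."""
    for path in sorted(consumption_dir.iterdir()):
        if path.is_file():
            handle(path)   # steps 2 to 5 would happen here
            path.unlink()  # assumption: the file leaves the directory
            return path
    return None


def run_consumer(consumption_dir: Path,
                 handle: Callable[[Path], None],
                 poll_seconds: float = 10.0) -> None:
    """Poll the directory forever, waiting only when nothing was found."""
    while True:
        if consume_once(consumption_dir, handle) is None:
            time.sleep(poll_seconds)
```

The sketch only covers the polling variant; the inotify-based detection mentioned in step 1 would replace the ``time.sleep`` call with a blocking wait for file system events.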
|
||||
|
||||
|
||||
.. _utilities-consumer-howto:
|
||||
|
||||
How to Use It
|
||||
.............
|
||||
|
||||
The consumer is started via the ``manage.py`` script:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ /path/to/paperless/src/manage.py document_consumer
|
||||
|
||||
This starts the service that will consume documents as they appear in
|
||||
``CONSUMPTION_DIR``.
|
||||
|
||||
Note that this command runs continuously, so exiting it will mean your consumer
|
||||
disappears. If you want to run this full-time (which is kind of the point)
|
||||
you'll need to have it start in the background -- something you'll need to
|
||||
figure out for your own system. To get you started though, there are Systemd
|
||||
service files in the ``scripts`` directory.
|
||||
|
||||
Some command line arguments are available to customize the behavior of the
|
||||
consumer. By default it will use ``/etc/paperless.conf`` values. Display the
|
||||
help with:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ /path/to/paperless/src/manage.py document_consumer --help
|
||||
|
||||
.. _utilities-exporter:
|
||||
|
||||
The Exporter
|
||||
------------
|
||||
|
||||
Tired of fiddling with Paperless, or just want to do something stupid and are
|
||||
afraid of accidentally damaging your files? You can export all of your
|
||||
documents into neatly named, dated, and unencrypted files.
|
||||
|
||||
|
||||
.. _utilities-exporter-howto:
|
||||
|
||||
How to Use It
|
||||
.............
|
||||
|
||||
This too is done via the ``manage.py`` script:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ /path/to/paperless/src/manage.py document_exporter /path/to/somewhere/
|
||||
|
||||
This will dump all of your unencrypted documents into ``/path/to/somewhere``
|
||||
for you to do with as you please. The files are accompanied by a special
|
||||
file, ``manifest.json`` which can be used to :ref:`import the files
|
||||
<utilities-importer>` at a later date if you wish.
|
||||
|
||||
|
||||
.. _utilities-exporter-howto-docker:
|
||||
|
||||
Docker
|
||||
______
|
||||
|
||||
If you are :ref:`using Docker <setup-installation-docker>`, running the
|
||||
exporter is almost as easy. To mount a volume for exports, follow the
|
||||
instructions in the ``docker-compose.yml.example`` file for the ``/export``
|
||||
volume (making the changes in your own ``docker-compose.yml`` file, of course).
|
||||
Once you have the volume mounted, the command to run an export is:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ docker-compose run --rm consumer document_exporter /export
|
||||
|
||||
If you prefer to use ``docker run`` directly, supply the necessary command-line
|
||||
options:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ # Identify your containers
|
||||
$ docker-compose ps
|
||||
Name Command State Ports
|
||||
-------------------------------------------------------------------------
|
||||
paperless_consumer_1 /sbin/docker-entrypoint.sh ... Exit 0
|
||||
paperless_webserver_1 /sbin/docker-entrypoint.sh ... Exit 0
|
||||
|
||||
$ # Make sure to replace your passphrase and remove or adapt the id mapping
|
||||
$ docker run --rm \
|
||||
--volumes-from paperless_data_1 \
|
||||
--volume /path/to/arbitrary/place:/export \
|
||||
-e PAPERLESS_PASSPHRASE=YOUR_PASSPHRASE \
|
||||
-e USERMAP_UID=1000 -e USERMAP_GID=1000 \
|
||||
paperless document_exporter /export
|
||||
|
||||
|
||||
.. _utilities-importer:
|
||||
|
||||
The Importer
|
||||
------------
|
||||
|
||||
Looking to transfer Paperless data from one instance to another, or just want
|
||||
to restore from a backup? This is your go-to toy.
|
||||
|
||||
|
||||
.. _utilities-importer-howto:
|
||||
|
||||
How to Use It
|
||||
.............
|
||||
|
||||
The importer works just like the exporter. You point it at a directory, and
|
||||
the script does the rest of the work:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ /path/to/paperless/src/manage.py document_importer /path/to/somewhere/
|
||||
|
||||
Docker
|
||||
______
|
||||
|
||||
Assuming that you've already gone through the steps above in the
|
||||
:ref:`export <utilities-exporter-howto-docker>` section, then the easiest thing
|
||||
to do is just re-use the ``/export`` path you already set up:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ docker-compose run --rm consumer document_importer /export
|
||||
|
||||
Similarly, if you're not using docker-compose, you can adjust the export
|
||||
instructions above to do the import.
|
||||
|
||||
|
||||
.. _utilities-retagger:
|
||||
|
||||
Re-running your tagging and correspondent matchers
|
||||
--------------------------------------------------
|
||||
|
||||
Say you've imported a few hundred documents and now want to introduce
|
||||
a tag or set up a new correspondent, and apply its matching to all of
|
||||
the currently-imported docs. This problem is common enough that
|
||||
there are tools for it.
|
||||
|
||||
|
||||
.. _utilities-retagger-howto:
|
||||
|
||||
How to Do It
|
||||
............
|
||||
|
||||
This too is done via the ``manage.py`` script:
|
||||
|
||||
.. code:: bash
|
||||
|
||||
$ /path/to/paperless/src/manage.py document_retagger
|
||||
|
||||
Run this after changing or adding tagging rules. It'll loop over all
|
||||
of the documents in your database and attempt to match all of your
|
||||
tags to them. If one matches, it'll be applied. And don't worry, you
|
||||
can run this as often as you like, it won't double-tag a document.
|
||||
|
||||
.. code:: bash
|
||||
|
||||
$ /path/to/paperless/src/manage.py document_correspondents
|
||||
|
||||
This is a similar command to run after adding or changing a correspondent.
|
||||
|
||||
.. _utilities-encyption:
|
||||
|
||||
Enabling Encryption
|
||||
-------------------
|
||||
|
||||
Let's say you've imported a few documents to play around with paperless and now
|
||||
you are using it more seriously and want to enable encryption of your files.
|
||||
|
||||
.. _utilities-encryption-howto:
|
||||
|
||||
Basic Syntax
|
||||
.............
|
||||
|
||||
Again we'll use the ``manage.py`` script, passing ``change_storage_type``:
|
||||
|
||||
.. code:: console
|
||||
|
||||
$ /path/to/paperless/src/manage.py change_storage_type --help
|
||||
usage: manage.py change_storage_type [-h] [--version] [-v {0,1,2,3}]
|
||||
[--settings SETTINGS]
|
||||
[--pythonpath PYTHONPATH] [--traceback]
|
||||
[--no-color] [--passphrase PASSPHRASE]
|
||||
{gpg,unencrypted} {gpg,unencrypted}
|
||||
|
||||
This is how you migrate your stored documents from an encrypted state to an
|
||||
unencrypted one (or vice-versa)
|
||||
|
||||
positional arguments:
|
||||
{gpg,unencrypted} The state you want to change your documents from
|
||||
{gpg,unencrypted} The state you want to change your documents to
|
||||
|
||||
optional arguments:
|
||||
--passphrase PASSPHRASE
|
||||
If PAPERLESS_PASSPHRASE isn't set already, you need to
|
||||
specify it here
|
||||
|
||||
Enabling Encryption
|
||||
...................
|
||||
|
||||
Basic usage to enable encryption of your document store (**USE A MORE SECURE PASSPHRASE**):
|
||||
|
||||
(Note: If ``PAPERLESS_PASSPHRASE`` isn't set already, you need to specify it here)
|
||||
|
||||
.. code:: bash
|
||||
|
||||
$ /path/to/paperless/src/manage.py change_storage_type [--passphrase SECR3TP4SSPHRA$E] unencrypted gpg
|
||||
|
||||
|
||||
Disabling Encryption
|
||||
....................
|
||||
|
||||
Basic usage to disable encryption of your document store:
|
||||
|
||||
(Note: Again, if ``PAPERLESS_PASSPHRASE`` isn't set already, you need to specify it here)
|
||||
|
||||
.. code:: bash
|
||||
|
||||
$ /path/to/paperless/src/manage.py change_storage_type [--passphrase SECR3TP4SSPHRA$E] gpg unencrypted
|