mirror of
https://github.com/paperless-ngx/paperless-ngx.git
synced 2025-04-02 13:45:10 -05:00

Originally we used SHARED secret both for email and for the API. That was a bad idea, and now that we're only using this value for one case, I've renamed it to reflect its actual use.
230 lines
8.3 KiB
ReStructuredText
230 lines
8.3 KiB
ReStructuredText
.. _consumption:
|
|
|
|
Consumption
|
|
###########
|
|
|
|
Once you've got Paperless setup, you need to start feeding documents into it.
|
|
Currently, there are three options: the consumption directory, IMAP (email), and
|
|
HTTP POST.
|
|
|
|
|
|
.. _consumption-directory:
|
|
|
|
The Consumption Directory
|
|
=========================
|
|
|
|
The primary method of getting documents into your database is by putting them in
|
|
the consumption directory. The ``document_consumer`` script runs in an infinite
|
|
loop looking for new additions to this directory and when it finds them, it goes
|
|
about the process of parsing them with the OCR, indexing what it finds, and
|
|
encrypting the PDF, storing it in the media directory.
|
|
|
|
Getting stuff into this directory is up to you. If you're running Paperless
|
|
on your local computer, you might just want to drag and drop files there, but if
|
|
you're running this on a server and want your scanner to automatically push
|
|
files to this directory, you'll need to setup some sort of service to accept the
|
|
files from the scanner. Typically, you're looking at an FTP server like
|
|
`Proftpd`_ or `Samba`_.
|
|
|
|
.. _Proftpd: http://www.proftpd.org/
|
|
.. _Samba: http://www.samba.org/
|
|
|
|
So where is this consumption directory? It's wherever you define it. Look for
|
|
the ``CONSUMPTION_DIR`` value in ``settings.py``. Set that to somewhere
|
|
appropriate for your use and put some documents in there. When you're ready,
|
|
follow the :ref:`consumer <utilities-consumer>` instructions to get it running.
|
|
|
|
|
|
.. _consumption-directory-hook:
|
|
|
|
Hooking into the Consumption Process
|
|
------------------------------------
|
|
|
|
Sometimes you may want to do something arbitrary whenever a document is
|
|
consumed. Rather than try to predict what you may want to do, Paperless lets
|
|
you execute scripts of your own choosing just before or after a document is
|
|
consumed using a couple simple hooks.
|
|
|
|
Just write a script, put it somewhere that Paperless can read & execute, and
|
|
then put the path to that script in ``paperless.conf`` with the variable name
|
|
of either ``PAPERLESS_PRE_CONSUME_SCRIPT`` or
|
|
``PAPERLESS_POST_CONSUME_SCRIPT``. The script will be executed before or
|
|
or after the document is consumed respectively.
|
|
|
|
.. important::
|
|
|
|
These scripts are executed in a **blocking** process, which means that if
|
|
a script takes a long time to run, it can significantly slow down your
|
|
document consumption flow. If you want things to run asynchronously,
|
|
you'll have to fork the process in your script and exit.
|
|
|
|
|
|
.. _consumption-directory-hook-variables:
|
|
|
|
What Can These Scripts Do?
|
|
..........................
|
|
|
|
It's your script, so you're only limited by your imagination and the laws of
|
|
physics. However, the following values are passed to the scripts in order:
|
|
|
|
|
|
.. _consumption-director-hook-variables-pre:
|
|
|
|
Pre-consumption script
|
|
::::::::::::::::::::::
|
|
|
|
* Document file name
|
|
|
|
|
|
.. _consumption-director-hook-variables-post:
|
|
|
|
Post-consumption script
|
|
:::::::::::::::::::::::
|
|
|
|
* Document id
|
|
* Generated file name
|
|
* Source path
|
|
* Thumbnail path
|
|
* Download URL
|
|
* Thumbnail URL
|
|
* Correspondent
|
|
* Tags
|
|
|
|
The script can be in any language you like, but for a simple shell script
|
|
example, you can take a look at ``post-consumption-example.sh`` in the
|
|
``scripts`` directory in this project.
|
|
|
|
|
|
.. _consumption-imap:
|
|
|
|
IMAP (Email)
|
|
============
|
|
|
|
Another handy way to get documents into your database is to email them to
|
|
yourself. The typical use-case would be to be out for lunch and want to send a
|
|
copy of the receipt back to your system at home. Paperless can be taught to
|
|
pull emails down from an arbitrary account and dump them into the consumption
|
|
directory where the process :ref:`above <consumption-directory>` will follow the
|
|
usual pattern on consuming the document.
|
|
|
|
Some things you need to know about this feature:
|
|
|
|
* It's disabled by default. By setting the values below it will be enabled.
|
|
* It's been tested in a limited environment, so it may not work for you (please
|
|
submit a pull request if you can!)
|
|
* It's designed to **delete mail from the server once consumed**. So don't go
|
|
pointing this to your personal email account and wonder where all your stuff
|
|
went.
|
|
* Currently, only one photo (attachment) per email will work.
|
|
|
|
So, with all that in mind, here's what you do to get it running:
|
|
|
|
1. Setup a new email account somewhere, or if you're feeling daring, create a
|
|
folder in an existing email box and note the path to that folder.
|
|
2. In ``/etc/paperless.conf`` set all of the appropriate values in
|
|
``PATHS AND FOLDERS`` and ``SECURITY``.
|
|
If you decided to use a subfolder of an existing account, then make sure you
|
|
set ``PAPERLESS_CONSUME_MAIL_INBOX`` accordingly here. You also have to set
|
|
the ``PAPERLESS_EMAIL_SECRET`` to something you can remember 'cause you'll
|
|
have to include that in every email you send.
|
|
3. Restart the :ref:`consumer <utilities-consumer>`. The consumer will check
|
|
the configured email account at startup and from then on every 10 minutes
|
|
for something new and pulls down whatever it finds.
|
|
4. Send yourself an email! Note that the subject is treated as the file name,
|
|
so if you set the subject to ``Correspondent - Title - tag,tag,tag``, you'll
|
|
get what you expect. Also, you must include the aforementioned secret
|
|
string in every email so the fetcher knows that it's safe to import.
|
|
Note that Paperless only allows the email title to consist of safe characters
|
|
to be imported. These consist of alpha-numeric characters and ``-_ ,.'``.
|
|
5. After a few minutes, the consumer will poll your mailbox, pull down the
|
|
message, and place the attachment in the consumption directory with the
|
|
appropriate name. A few minutes later, the consumer will import it like any
|
|
other file.
|
|
|
|
|
|
.. _consumption-http:
|
|
|
|
HTTP POST
|
|
=========
|
|
|
|
You can also submit a document via HTTP POST, so long as you do so after
|
|
authenticating. To push your document to Paperless, send an HTTP POST to the
|
|
server with the following name/value pairs:
|
|
|
|
* ``correspondent``: The name of the document's correspondent. Note that there
|
|
are restrictions on what characters you can use here. Specifically,
|
|
alphanumeric characters, `-`, `,`, `.`, and `'` are ok, everything else is
|
|
out. You also can't use the sequence ` - ` (space, dash, space).
|
|
* ``title``: The title of the document. The rules for characters is the same
|
|
here as the correspondent.
|
|
* ``document``: The file you're uploading
|
|
|
|
Specify ``enctype="multipart/form-data"``, and then POST your file with::
|
|
|
|
Content-Disposition: form-data; name="document"; filename="whatever.pdf"
|
|
|
|
An example of this in HTML is a typical form:
|
|
|
|
.. code:: html
|
|
|
|
<form method="post" enctype="multipart/form-data">
|
|
<input type="text" name="correspondent" value="My Correspondent" />
|
|
<input type="text" name="title" value="My Title" />
|
|
<input type="file" name="document" />
|
|
<input type="submit" name="go" value="Do the thing" />
|
|
</form>
|
|
|
|
But a potentially more useful way to do this would be in Python. Here we use
|
|
the requests library to handle basic authentication and to send the POST data
|
|
to the URL.
|
|
|
|
.. code:: python
|
|
|
|
import os
|
|
|
|
from hashlib import sha256
|
|
|
|
import requests
|
|
from requests.auth import HTTPBasicAuth
|
|
|
|
# You authenticate via BasicAuth or with a session id.
|
|
# We use BasicAuth here
|
|
username = "my-username"
|
|
password = "my-super-secret-password"
|
|
|
|
# Where you have Paperless installed and listening
|
|
url = "http://localhost:8000/push"
|
|
|
|
# Document metadata
|
|
correspondent = "Test Correspondent"
|
|
title = "Test Title"
|
|
|
|
# The local file you want to push
|
|
path = "/path/to/some/directory/my-document.pdf"
|
|
|
|
|
|
with open(path, "rb") as f:
|
|
|
|
response = requests.post(
|
|
url=url,
|
|
data={"title": title, "correspondent": correspondent},
|
|
files={"document": (os.path.basename(path), f, "application/pdf")},
|
|
auth=HTTPBasicAuth(username, password),
|
|
allow_redirects=False
|
|
)
|
|
|
|
if response.status_code == 202:
|
|
|
|
# Everything worked out ok
|
|
print("Upload successful")
|
|
|
|
else:
|
|
|
|
# If you don't get a 202, it's probably because your credentials
|
|
# are wrong or something. This will give you a rough idea of what
|
|
# happened.
|
|
|
|
print("We got HTTP status code: {}".format(response.status_code))
|
|
for k, v in response.headers.items():
|
|
print("{}: {}".format(k, v))
|