Merge branch 'master' into issue/81

Daniel Quinn 2016-03-25 20:56:30 +00:00
commit 49b56425e8
16 changed files with 598 additions and 167 deletions

@ -24,8 +24,11 @@ How it Works
1. Buy a document scanner like `this one`_. 1. Buy a document scanner like `this one`_.
2. Set it up to "scan to FTP" or something similar. It should be able to push 2. Set it up to "scan to FTP" or something similar. It should be able to push
scanned images to a server without you having to do anything. scanned images to a server without you having to do anything. If your
3. Have the target server run the *Paperless* consumption script to OCR the PDF scanner doesn't know how to automatically upload the file somewhere, you can
always do that manually. Paperless doesn't care how the documents get into
its local consumption directory.
3. Have the target server run the Paperless consumption script to OCR the PDF
and index it into a local database. and index it into a local database.
4. Use the web frontend to sift through the database and find what you want. 4. Use the web frontend to sift through the database and find what you want.
5. Download the PDF you need/want via the web interface and do whatever you 5. Download the PDF you need/want via the web interface and do whatever you
@ -56,7 +59,7 @@ powerful tools.
* `ImageMagick`_ converts the images between colour and greyscale. * `ImageMagick`_ converts the images between colour and greyscale.
* `Tesseract`_ does the character recognition. * `Tesseract`_ does the character recognition.
* `Unpaper`_ despeckles and and deskews the scanned image. * `Unpaper`_ despeckles and deskews the scanned image.
* `GNU Privacy Guard`_ is used as the encryption backend. * `GNU Privacy Guard`_ is used as the encryption backend.
* `Python 3`_ is the language of the project. * `Python 3`_ is the language of the project.

@ -11,6 +11,10 @@ services:
- data:/usr/src/paperless/data - data:/usr/src/paperless/data
- media:/usr/src/paperless/media - media:/usr/src/paperless/media
env_file: docker-compose.env env_file: docker-compose.env
# This line overrides whatever the user set in the env-file with an empty
# value. The webserver doesn't do any text recognition, so it doesn't
# have to install the unnecessary languages the user might have set
# there.
environment: environment:
- PAPERLESS_OCR_LANGUAGES= - PAPERLESS_OCR_LANGUAGES=
command: ["runserver", "0.0.0.0:8000"] command: ["runserver", "0.0.0.0:8000"]

@ -1,6 +1,15 @@
Changelog Changelog
######### #########
* 0.2.0
* Added support for guessing the date from the file name along with the
correspondent, title, and tags. Thanks to `Tikitu de Jager`_ for his pull
request that I took forever to merge and to `Pit`_ for his efforts on the
regex front.
* `#94`_: Restored support for changing the created date in the UI. Thanks
to `Martin Honermeyer`_ and `Tim White`_ for working with me on this.
* 0.1.1 * 0.1.1
* Potentially **Breaking Change**: All references to "sender" in the code * Potentially **Breaking Change**: All references to "sender" in the code
@ -86,6 +95,8 @@ Changelog
.. _Wayne Werner: https://github.com/waynew .. _Wayne Werner: https://github.com/waynew
.. _darkmatter: https://github.com/darkmatter .. _darkmatter: https://github.com/darkmatter
.. _zedster: https://github.com/zedster .. _zedster: https://github.com/zedster
.. _Martin Honermeyer: https://github.com/djmaze
.. _Tim White: https://github.com/timwhite
.. _#20: https://github.com/danielquinn/paperless/issues/20 .. _#20: https://github.com/danielquinn/paperless/issues/20
.. _#44: https://github.com/danielquinn/paperless/issues/44 .. _#44: https://github.com/danielquinn/paperless/issues/44
@ -99,3 +110,4 @@ Changelog
.. _#67: https://github.com/danielquinn/paperless/issues/67 .. _#67: https://github.com/danielquinn/paperless/issues/67
.. _#68: https://github.com/danielquinn/paperless/issues/68 .. _#68: https://github.com/danielquinn/paperless/issues/68
.. _#71: https://github.com/danielquinn/paperless/issues/71 .. _#71: https://github.com/danielquinn/paperless/issues/71
.. _#94: https://github.com/danielquinn/paperless/issues/94

@ -45,19 +45,27 @@ you name the file right, it'll automatically set some values in the database
for you. This is the logic the consumer follows: for you. This is the logic the consumer follows:
1. Try to find the correspondent, title, and tags in the file name following 1. Try to find the correspondent, title, and tags in the file name following
the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``. Note that
the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or
``YYYYMMDDZ``. The ``Z`` is for "Zulu time" AKA "UTC".
2. If that doesn't work, we skip the date and try this pattern:
the pattern: ``Correspondent - Title - tag,tag,tag.pdf``. the pattern: ``Correspondent - Title - tag,tag,tag.pdf``.
2. If that doesn't work, try to find the correspondent and title in the file 3. If that doesn't work, we try to find the correspondent and title in the file
name following the pattern: ``Correspondent - Title.pdf``. name following the pattern: ``Correspondent - Title.pdf``.
3. If that doesn't work, just assume that the name of the file is the title. 4. If that doesn't work, just assume that the name of the file is the title.
So given the above, the following examples would work as you'd expect: So given the above, the following examples would work as you'd expect:
* ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
* ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
* ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf`` * ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
* ``Another Company - Letter of Reference.jpg`` * ``Another Company - Letter of Reference.jpg``
* ``Dad's Recipe for Pancakes.png`` * ``Dad's Recipe for Pancakes.png``
These however wouldn't work: These however wouldn't work:
* ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
* ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
* ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf`` * ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
* ``Another Company- Letter of Reference.jpg`` * ``Another Company- Letter of Reference.jpg``
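The fallback order described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's actual ``FileInfo`` code: the regular expressions, the ``parse_name`` helper and the dictionary it returns are all hypothetical names made up for this example.

```python
import re
from datetime import datetime, timezone

# Hypothetical building blocks; the project's real regexes may differ.
EXTENSION = r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$"
CREATED = r"(?P<created>\d{8}(?:\d{6})?Z)"
CORRESPONDENT_TITLE = r"(?P<correspondent>.+) - (?P<title>.+)"
TAGS = r"(?P<tags>[a-z0-9\-,]+)"


def _compile(body):
    return re.compile("^" + body + EXTENSION, re.IGNORECASE)


# Tried in order, from most to least specific (steps 1 to 4 above).
PATTERNS = [
    _compile(CREATED + " - " + CORRESPONDENT_TITLE + " - " + TAGS),
    _compile(CORRESPONDENT_TITLE + " - " + TAGS),
    _compile(CORRESPONDENT_TITLE),
    _compile(r"(?P<title>.+)"),
]


def parse_name(file_name):
    """Return the attributes guessed from a bare file name, or None."""
    for pattern in PATTERNS:
        match = pattern.match(file_name)
        if not match:
            continue
        found = match.groupdict()
        created = found.get("created")
        if created:
            # 15 characters means YYYYMMDDHHMMSSZ, otherwise YYYYMMDDZ.
            fmt = "%Y%m%d%H%M%SZ" if len(created) == 15 else "%Y%m%dZ"
            created = datetime.strptime(created, fmt).replace(
                tzinfo=timezone.utc)
        tags = found.get("tags")
        extension = found["extension"].lower()
        return {
            "created": created,
            "correspondent": found.get("correspondent"),
            "title": found["title"],
            "tags": tuple(tags.split(",")) if tags else (),
            "extension": "jpg" if extension == "jpeg" else extension,
        }
    return None  # Not a supported file type
```

With this sketch, a name such as ``Another Company- Letter of Reference.jpg`` matches only the last pattern, so the whole name becomes the title and no correspondent is set, which is why it appears in the "wouldn't work" list.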
@ -128,7 +136,7 @@ following name/value pairs:
don't start uploading stuff to your server. The means of generating this don't start uploading stuff to your server. The means of generating this
signature is defined below. signature is defined below.
Specify ``enctype="multipart/form-data"``, and then POST your file with::: Specify ``enctype="multipart/form-data"``, and then POST your file with::
Content-Disposition: form-data; name="document"; filename="whatever.pdf" Content-Disposition: form-data; name="document"; filename="whatever.pdf"

@ -33,4 +33,5 @@ Contents
api api
utilities utilities
migrating migrating
troubleshooting
changelog changelog

@ -8,7 +8,7 @@ should work) that has the following software installed on it:
* `Python3`_ (with development libraries, pip and virtualenv) * `Python3`_ (with development libraries, pip and virtualenv)
* `GNU Privacy Guard`_ * `GNU Privacy Guard`_
* `Tesseract`_ * `Tesseract`_, plus its language files matching your document base.
* `Imagemagick`_ * `Imagemagick`_
* `unpaper`_ * `unpaper`_
@ -52,6 +52,7 @@ well as ImageMagick:
$ brew install ghostscript $ brew install ghostscript
$ brew install imagemagick $ brew install imagemagick
$ brew install libmagic
.. _requirements-baremetal: .. _requirements-baremetal:

@ -5,7 +5,8 @@ Setup
Paperless isn't a very complicated app, but there are a few components, so some Paperless isn't a very complicated app, but there are a few components, so some
basic documentation is in order. If you follow along in this document and basic documentation is in order. If you follow along in this document and
still have trouble, please open an `issue on GitHub`_ so I can fill in the gaps. still have trouble, please open an `issue on GitHub`_ so I can fill in the
gaps.
.. _issue on GitHub: https://github.com/danielquinn/paperless/issues .. _issue on GitHub: https://github.com/danielquinn/paperless/issues
@ -15,8 +16,8 @@ still have trouble, please open an `issue on GitHub`_ so I can fill in the gaps.
Download Download
-------- --------
The source is currently only available via GitHub, so grab it from there, either The source is currently only available via GitHub, so grab it from there,
by using ``git``: either by using ``git``:
.. code:: bash .. code:: bash
@ -42,15 +43,16 @@ route`_ is quick & easy, but means you're running a VM which comes with memory
consumption etc. We also `support Docker`_, which you can use natively under consumption etc. We also `support Docker`_, which you can use natively under
Linux and in a VM with `Docker Machine`_ (this guide was written for native Linux and in a VM with `Docker Machine`_ (this guide was written for native
Docker usage under Linux, you might have to adapt it for Docker Machine.) Docker usage under Linux, you might have to adapt it for Docker Machine.)
Alternatively the standard, `bare metal`_ approach is a little more complicated, Alternatively the standard, `bare metal`_ approach is a little more
but worth it because it makes it easier should you want to contribute some complicated, but worth it because it makes it easier should you want to
code back. contribute some code back.
.. _Vagrant route: setup-installation-vagrant_ .. _Vagrant route: setup-installation-vagrant_
.. _support Docker: setup-installation-docker_ .. _support Docker: setup-installation-docker_
.. _bare metal: setup-installation-standard_ .. _bare metal: setup-installation-standard_
.. _Docker Machine: https://docs.docker.com/machine/ .. _Docker Machine: https://docs.docker.com/machine/
.. _setup-installation-standard: .. _setup-installation-standard:
Standard (Bare Metal) Standard (Bare Metal)
@ -58,19 +60,16 @@ Standard (Bare Metal)
1. Install the requirements as per the :ref:`requirements <requirements>` page. 1. Install the requirements as per the :ref:`requirements <requirements>` page.
2. Change to the ``src`` directory in this repo. 2. Change to the ``src`` directory in this repo.
3. Edit ``paperless/settings.py`` and be sure to set the values for: 3. Copy ``paperless.conf.example`` to ``/etc/paperless.conf`` and open it in
* ``CONSUMPTION_DIR``: this is where your documents will be dumped to be your favourite editor. Set the values for:
consumed by Paperless.
* ``PASSPHRASE``: this is the passphrase Paperless uses to encrypt/decrypt * ``PAPERLESS_CONSUMPTION_DIR``: this is where your documents will be
the original document. The default value attempts to source the dumped to be consumed by Paperless.
passphrase from the environment, so if you don't set it to a static value * ``PAPERLESS_PASSPHRASE``: this is the passphrase Paperless uses to
here, you must set ``PAPERLESS_PASSPHRASE=some-secret-string`` on the encrypt/decrypt the original document.
command line whenever invoking the consumer or webserver. * ``PAPERLESS_OCR_THREADS``: this is the number of threads the OCR process
* ``OCR_THREADS``: this is the number of threads the OCR process will spawn will spawn to process document pages in parallel.
to process document pages in parallel. The default value gets sourced from
the environment-variable ``PAPERLESS_OCR_THREADS`` and expects it to be an
integer. If the variable is not set, Python determines the core-count of
your CPU and uses that value.
4. Initialise the database with ``./manage.py migrate``. 4. Initialise the database with ``./manage.py migrate``.
5. Create a user for your Paperless instance with 5. Create a user for your Paperless instance with
``./manage.py createsuperuser``. Follow the prompts to create your user. ``./manage.py createsuperuser``. Follow the prompts to create your user.
@ -79,8 +78,8 @@ Standard (Bare Metal)
You should now be able to visit your (empty) `Paperless webserver`_ at You should now be able to visit your (empty) `Paperless webserver`_ at
``127.0.0.1:8000`` (or whatever you chose). You can login with the ``127.0.0.1:8000`` (or whatever you chose). You can login with the
user/pass you created in #5. user/pass you created in #5.
7. In a separate window, change to the ``src`` directory in this repo again, but 7. In a separate window, change to the ``src`` directory in this repo again,
this time, you should start the consumer script with but this time, you should start the consumer script with
``./manage.py document_consumer``. ``./manage.py document_consumer``.
8. Scan something. Put it in the ``CONSUMPTION_DIR``. 8. Scan something. Put it in the ``CONSUMPTION_DIR``.
9. Wait a few minutes 9. Wait a few minutes
@ -100,6 +99,7 @@ Vagrant Method
provisioned... provisioned...
3. Run ``vagrant ssh`` and once inside your new vagrant box, edit 3. Run ``vagrant ssh`` and once inside your new vagrant box, edit
``/etc/paperless.conf`` and set the values for: ``/etc/paperless.conf`` and set the values for:
* ``PAPERLESS_CONSUMPTION_DIR``: this is where your documents will be * ``PAPERLESS_CONSUMPTION_DIR``: this is where your documents will be
dumped to be consumed by Paperless. dumped to be consumed by Paperless.
* ``PAPERLESS_PASSPHRASE``: this is the passphrase Paperless uses to * ``PAPERLESS_PASSPHRASE``: this is the passphrase Paperless uses to
@ -107,6 +107,7 @@ Vagrant Method
* ``PAPERLESS_SHARED_SECRET``: this is the "magic word" used when consuming * ``PAPERLESS_SHARED_SECRET``: this is the "magic word" used when consuming
documents from mail or via the API. If you don't use either, leaving it documents from mail or via the API. If you don't use either, leaving it
blank is just fine. blank is just fine.
4. Exit the vagrant box and re-enter it with ``vagrant ssh`` again. This 4. Exit the vagrant box and re-enter it with ``vagrant ssh`` again. This
updates the environment to make use of the changes you made to the config updates the environment to make use of the changes you made to the config
file. file.
@ -140,9 +141,9 @@ Docker Method
.. caution:: .. caution::
As mentioned earlier, this guide assumes that you use Docker natively As mentioned earlier, this guide assumes that you use Docker natively
under Linux. If you are using `Docker Machine`_ under Mac OS X or Windows, under Linux. If you are using `Docker Machine`_ under Mac OS X or
you will have to adapt IP addresses, volume-mounting, command execution Windows, you will have to adapt IP addresses, volume-mounting, command
and maybe more. execution and maybe more.
2. Install `docker-compose`_. [#compose]_ 2. Install `docker-compose`_. [#compose]_
@ -161,14 +162,14 @@ Docker Method
.. _Docker installation guide: https://docs.docker.com/engine/installation/ .. _Docker installation guide: https://docs.docker.com/engine/installation/
.. _docker-compose installation guide: https://docs.docker.com/compose/install/ .. _docker-compose installation guide: https://docs.docker.com/compose/install/
3. Create a copy of ``docker-compose.yml.example`` as ``docker-compose.yml`` and 3. Create a copy of ``docker-compose.yml.example`` as ``docker-compose.yml``
a copy of ``docker-compose.env.example`` as ``docker-compose.env``. You'll be and a copy of ``docker-compose.env.example`` as ``docker-compose.env``.
editing both these files: taking a copy ensures that you can ``git pull`` to You'll be editing both these files: taking a copy ensures that you can
receive updates without risking merge conflicts with your modified versions ``git pull`` to receive updates without risking merge conflicts with your
of the configuration files. modified versions of the configuration files.
4. Modify ``docker-compose.yml`` to your preferences, following the instructions 4. Modify ``docker-compose.yml`` to your preferences, following the
in comments in the file. The only change that is a hard requirement is to instructions in comments in the file. The only change that is a hard
specify where the consumption directory should mount. requirement is to specify where the consumption directory should mount.
5. Modify ``docker-compose.env`` and adapt the following environment variables: 5. Modify ``docker-compose.env`` and adapt the following environment variables:
``PAPERLESS_PASSPHRASE`` ``PAPERLESS_PASSPHRASE``
@ -181,10 +182,11 @@ Docker Method
the core-count of your CPU and uses that value. the core-count of your CPU and uses that value.
``PAPERLESS_OCR_LANGUAGES`` ``PAPERLESS_OCR_LANGUAGES``
If you want the OCR to recognize other languages in addition to the default If you want the OCR to recognize other languages in addition to the
English, set this parameter to a space separated list of three-letter default English, set this parameter to a space separated list of
language-codes after `ISO 639-2/T`_. For a list of available languages -- three-letter language-codes after `ISO 639-2/T`_. For a list of available
including their three letter codes -- see the `Debian packagelist`_. languages -- including their three letter codes -- see the
`Debian packagelist`_.
``USERMAP_UID`` and ``USERMAP_GID`` ``USERMAP_UID`` and ``USERMAP_GID``
If you want to mount the consumption volume (directory ``/consume`` within If you want to mount the consumption volume (directory ``/consume`` within
@ -192,11 +194,11 @@ Docker Method
access rights might be an issue. The default user and group ``paperless`` access rights might be an issue. The default user and group ``paperless``
in the containers have an id of 1000. The containers will enforce that the in the containers have an id of 1000. The containers will enforce that the
owning group of the consumption directory will be ``paperless`` to be able owning group of the consumption directory will be ``paperless`` to be able
to delete consumed documents. If your host-system has a group with an id of to delete consumed documents. If your host-system has a group with an ID
1000 and you don't want this group to have access rights to the consumption of 1000 and you don't want this group to have access rights to the
directory, you can use ``USERMAP_GID`` to change the id in the container consumption directory, you can use ``USERMAP_GID`` to change the id in the
and thus the one of the consumption directory. Furthermore, you can change container and thus the one of the consumption directory. Furthermore, you
the id of the default user as well using ``USERMAP_UID``. can change the id of the default user as well using ``USERMAP_UID``.
6. Run ``docker-compose up -d``. This will create and start the necessary 6. Run ``docker-compose up -d``. This will create and start the necessary
containers. containers.
@ -234,14 +236,14 @@ Docker Method
.. danger:: .. danger::
While the consumption container will ensure at startup that it can While the consumption container will ensure at startup that it can
**delete** a consumed file from a host-mounted directory, it might not **delete** a consumed file from a host-mounted directory, it might
be able to **read** the document in the first place if the access not be able to **read** the document in the first place if the access
rights to the file are incorrect. rights to the file are incorrect.
Make sure that the documents you put into the consumption directory Make sure that the documents you put into the consumption directory
will either be readable by everyone (``chmod o+r file.pdf``) or will either be readable by everyone (``chmod o+r file.pdf``) or
readable by the default user or group id 1000 (or the one you have set readable by the default user or group id 1000 (or the one you have
with ``USERMAP_UID`` or ``USERMAP_GID`` respectively). set with ``USERMAP_UID`` or ``USERMAP_GID`` respectively).
2. Use ``docker cp`` to copy your files directly into the container: 2. Use ``docker cp`` to copy your files directly into the container:
@ -258,8 +260,8 @@ Docker Method
``docker cp`` is a one-shot-command, just like ``cp``. This means that ``docker cp`` is a one-shot-command, just like ``cp``. This means that
every time you want to consume a new document, you will have to execute every time you want to consume a new document, you will have to execute
``docker cp`` again. You can of course automate this process, but option 1 ``docker cp`` again. You can of course automate this process, but option
is generally the preferred one. 1 is generally the preferred one.
.. danger:: .. danger::
@ -267,8 +269,8 @@ Docker Method
to the acting user at the destination, which will be ``root``. to the acting user at the destination, which will be ``root``.
You therefore need to ensure that the documents you want to copy into You therefore need to ensure that the documents you want to copy into
the container are readable by everyone (``chmod o+r file.pdf``) before the container are readable by everyone (``chmod o+r file.pdf``)
copying them. before copying them.
.. _Docker: https://www.docker.com/ .. _Docker: https://www.docker.com/
@ -281,17 +283,108 @@ Docker Method
free to tinker around without using compose! free to tinker around without using compose!
.. _making-things-a-little-more-permanent: .. _setup-permanent:
Making Things a Little more Permanent Making Things a Little more Permanent
------------------------------------- -------------------------------------
Once you've tested things and are happy with the work flow, you can automate the Once you've tested things and are happy with the work flow, you can automate
process of starting the webserver and consumer automatically. If you're running the process of starting the webserver and consumer automatically.
on a bare metal system that's using Systemd, you can use the service unit files
in the ``scripts`` directory to set this up. If you're on another startup
system or are using a Vagrant box, then you're currently on your own. If you are .. _setup-permanent-standard-systemd:
using Docker, you can set a restart-policy_ in the ``docker-compose.yml`` to
have the containers automatically start with the Docker daemon. Standard (Bare Metal, Systemd)
..............................
If you're running on a bare metal system that's using Systemd, you can use the
service unit files in the ``scripts`` directory to set this up. You'll need to
create a user called ``paperless`` and set up Paperless in a place that
this new user can read and write to. Then, you can just tell Systemd to enable
the two ``.service`` files::
# systemctl enable /path/to/paperless/scripts/paperless-consumer.service
# systemctl enable /path/to/paperless/scripts/paperless-webserver.service
# systemctl start paperless-consumer.service
# systemctl start paperless-webserver.service
.. _setup-permanent-standard-ubuntu14:
Ubuntu 14.04 (Bare Metal, Upstart)
..................................
Ubuntu 14.04 and earlier use the `Upstart`_ init system to start services
during the boot process. To configure Upstart to run Paperless automatically
after restarting your system:
1. Change to the directory where Upstart's configuration files are kept:
``cd /etc/init``
2. Create a new file: ``sudo nano paperless-server.conf``
3. In the newly-created file enter::
start on (local-filesystems and net-device-up IFACE=eth0)
stop on shutdown
respawn
respawn limit 10 5
script
exec /srv/paperless/src/manage.py runserver 0.0.0.0:80
end script
Note that you'll need to replace ``/srv/paperless/src/manage.py`` with the
path to the ``manage.py`` script in your installation directory.
If you are using a network interface other than ``eth0``, you will have to
change ``IFACE=eth0``. For example, if you are connected via WiFi, you will
likely need to replace ``eth0`` above with ``wlan0``. To see all interfaces,
run ``ifconfig``.
Save the file.
4. Create a new file: ``sudo nano paperless-consumer.conf``
5. In the newly-created file enter::
start on (local-filesystems and net-device-up IFACE=eth0)
stop on shutdown
respawn
respawn limit 10 5
script
exec /srv/paperless/src/manage.py document_consumer
end script
Replace ``/srv/paperless/src/manage.py`` with the same values as in step 3
above and replace ``eth0`` with the appropriate value, if necessary. Save the
file.
These two configuration files together will start both the Paperless webserver
and document consumer processes when the file system and network interface
specified are available after boot. Furthermore, if either process ever exits
unexpectedly, Upstart will try to restart it a maximum of 10 times within a 5
second period.
.. _Upstart: http://upstart.ubuntu.com/
.. _setup-permanent-vagrant:
Vagrant
.......
You're currently on your own, but the Ubuntu explanation above may be enough.
.. _setup-permanent-docker:
Docker
......
If you're using Docker, you can set a restart-policy_ in the
``docker-compose.yml`` to have the containers automatically start with the
Docker daemon.
.. _restart-policy: https://docs.docker.com/engine/reference/commandline/run/#restart-policies-restart .. _restart-policy: https://docs.docker.com/engine/reference/commandline/run/#restart-policies-restart

19
docs/troubleshooting.rst Normal file

@ -0,0 +1,19 @@
.. _troubleshooting:
Troubleshooting
===============
.. _troubleshooting_ocr_language_files_missing:
Consumer warns ``OCR for XX failed``
------------------------------------
If you find the OCR accuracy to be too low, and/or the document consumer warns that ``OCR for
XX failed, but we're going to stick with what we've got since FORGIVING_OCR is enabled``, then you
might need to install the `Tesseract language files
<http://packages.ubuntu.com/search?keywords=tesseract-ocr>`_ matching your documents' languages.
As an example, if you are running Paperless from the Vagrant setup provided (or from any Ubuntu or Debian
box), and your documents are written in Spanish you may need to run::
apt-get install -y tesseract-ocr-spa
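To check which language files are missing before installing anything, you could compare the languages you need against the output of ``tesseract --list-langs``. The helper below is only a sketch: ``missing_languages`` is a hypothetical name, and it assumes the output consists of a header line ending in a colon followed by one language code per line.

```python
def missing_languages(list_langs_output, wanted):
    """Return the wanted language codes that Tesseract did not list.

    ``list_langs_output`` is the text printed by ``tesseract --list-langs``
    (assumed format: one header line ending in ":", then one code per line).
    """
    installed = {
        line.strip()
        for line in list_langs_output.splitlines()
        if line.strip() and not line.rstrip().endswith(":")
    }
    return sorted(set(wanted) - installed)
```

Any code this returns, for example ``spa``, corresponds to a ``tesseract-ocr-spa`` style package on Debian and Ubuntu.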

@ -20,7 +20,7 @@ PAPERLESS_CONSUME_MAIL_PASS=""
# #
# The passphrase you use here will be used when storing your documents in # The passphrase you use here will be used when storing your documents in
# Paperless, but you can always export them in an unencrypted format by using # Paperless, but you can always export them in an unencrypted format by using
# document exporter. See the documentaiton for more information. # document exporter. See the documentation for more information.
# #
# One final note about the passphrase. Once you've consumed a document with # One final note about the passphrase. Once you've consumed a document with
# one passphrase, DON'T CHANGE IT. Paperless assumes this to be a constant and # one passphrase, DON'T CHANGE IT. Paperless assumes this to be a constant and
@ -31,3 +31,8 @@ PAPERLESS_PASSPHRASE="secret"
# If you intend to consume documents either via HTTP POST or by email, you must # If you intend to consume documents either via HTTP POST or by email, you must
# have a shared secret here. # have a shared secret here.
PAPERLESS_SHARED_SECRET="" PAPERLESS_SHARED_SECRET=""
# By default, Paperless will attempt to use all available CPU cores to process
# a document, but if you would like to limit that, you can set this value to
# an integer:
#PAPERLESS_OCR_THREADS=1
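The behaviour described in that comment (use the configured integer if present, otherwise every available core) can be sketched as below. This is an illustration only: the ``ocr_threads`` helper is a hypothetical name, and the real settings code may read the variable differently.

```python
import multiprocessing
import os


def ocr_threads(environ=os.environ):
    """Return the OCR thread count: the configured value or the CPU count."""
    value = environ.get("PAPERLESS_OCR_THREADS")
    if value:
        # The variable is expected to hold an integer, e.g. "1".
        return int(value)
    # Unset or empty: fall back to the number of CPU cores.
    return multiprocessing.cpu_count()
```

A non-integer value would raise ``ValueError`` in this sketch, so a typo in the config file fails loudly instead of silently using all cores.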

BIN
presentation/img/kitten.jpg Normal file

Binary file not shown.

@ -148,12 +148,12 @@
<section data-background="img/pony.png"> <section data-background="img/pony.png">
<h2>Demo!</h2> <h2>Demo!</h2>
<p>(Time to sacrifice a kitten)</p> <img src="img/kitten.jpg" style="width: 50%;" />
</section> </section>
<section> <section>
<h2>TODO</h2> <h2>TODO</h2>
<p>It works, but it could use polish</p> <p>It works, but it needs polish</p>
<ul> <ul>
<li>The UI is the Django admin</li> <li>The UI is the Django admin</li>
<li>Mail consumption is really raw</li> <li>Mail consumption is really raw</li>
@ -163,11 +163,11 @@
<aside class="notes"> <aside class="notes">
<ul> <ul>
<li> <li>
<strong>Plugin architecture</strong>: there've been requests for <strong>Plugin architecture</strong>: there've been requests
some overly custom stuff to happen before and after consumption, for some overly custom stuff to happen before and after
but in the UNIX spirit of "do one job well", I think this sort consumption, but in the UNIX spirit of "do one job well", I
of thing is better written as a plugin -- which means I need to think this sort of thing is better written as a plugin -- which
figure out a best practise for that. means I need to figure out a best practise for that.
</li> </li>
</ul> </ul>
</aside> </aside>

@ -1,4 +1,4 @@
Django==1.9.2 Django==1.9.4
Pillow==3.1.1 Pillow==3.1.1
django-crispy-forms==1.6.0 django-crispy-forms==1.6.0
django-extensions==1.6.1 django-extensions==1.6.1

@ -19,12 +19,11 @@ from PIL import Image
from django.conf import settings from django.conf import settings
from django.utils import timezone from django.utils import timezone
from django.template.defaultfilters import slugify
from pyocr.tesseract import TesseractError from pyocr.tesseract import TesseractError
from paperless.db import GnuPG from paperless.db import GnuPG
from .models import Correspondent, Tag, Document, Log from .models import Tag, Document, Log, FileInfo
from .languages import ISO639 from .languages import ISO639
from .signals import ( from .signals import (
document_consumption_started, document_consumption_finished) document_consumption_started, document_consumption_finished)
@ -56,19 +55,6 @@ class Consumer(object):
DEFAULT_OCR_LANGUAGE = settings.OCR_LANGUAGE DEFAULT_OCR_LANGUAGE = settings.OCR_LANGUAGE
REGEX_TITLE = re.compile(
r"^.*/(.*)\.(pdf|jpe?g|png|gif|tiff)$",
flags=re.IGNORECASE
)
REGEX_CORRESPONDENT_TITLE = re.compile(
r"^.*/(.+) - (.*)\.(pdf|jpe?g|png|gif|tiff)$",
flags=re.IGNORECASE
)
REGEX_CORRESPONDENT_TITLE_TAGS = re.compile(
r"^.*/(.*) - (.*) - ([a-z0-9\-,]*)\.(pdf|jpe?g|png|gif|tiff)$",
flags=re.IGNORECASE
)
def __init__(self): def __init__(self):
self.logger = logging.getLogger(__name__) self.logger = logging.getLogger(__name__)
@ -107,7 +93,7 @@ class Consumer(object):
if not os.path.isfile(doc): if not os.path.isfile(doc):
continue continue
if not re.match(self.REGEX_TITLE, doc): if not re.match(FileInfo.REGEXES["title"], doc):
continue continue
if doc in self._ignore: if doc in self._ignore:
@ -282,72 +268,20 @@ class Consumer(object):
# Strip out excess white space to allow matching to go smoother # Strip out excess white space to allow matching to go smoother
return re.sub(r"\s+", " ", r) return re.sub(r"\s+", " ", r)
def _guess_attributes_from_name(self, parseable):
"""
We use a crude naming convention to make handling the correspondent,
title, and tags easier:
"<correspondent> - <title> - <tags>.<suffix>"
"<correspondent> - <title>.<suffix>"
"<title>.<suffix>"
"""
def get_correspondent(correspondent_name):
return Correspondent.objects.get_or_create(
name=correspondent_name,
defaults={"slug": slugify(correspondent_name)}
)[0]
def get_tags(tags):
r = []
for t in tags.split(","):
r.append(
Tag.objects.get_or_create(slug=t, defaults={"name": t})[0])
return tuple(r)
def get_suffix(suffix):
suffix = suffix.lower()
if suffix == "jpeg":
return "jpg"
return suffix
# First attempt: "<correspondent> - <title> - <tags>.<suffix>"
m = re.match(self.REGEX_CORRESPONDENT_TITLE_TAGS, parseable)
if m:
return (
get_correspondent(m.group(1)),
m.group(2),
get_tags(m.group(3)),
get_suffix(m.group(4))
)
# Second attempt: "<correspondent> - <title>.<suffix>"
m = re.match(self.REGEX_CORRESPONDENT_TITLE, parseable)
if m:
return (
get_correspondent(m.group(1)),
m.group(2),
(),
get_suffix(m.group(3))
)
# That didn't work, so we assume correspondent and tags are None
m = re.match(self.REGEX_TITLE, parseable)
return None, m.group(1), (), get_suffix(m.group(2))
def _store(self, text, doc, thumbnail): def _store(self, text, doc, thumbnail):
sender, title, tags, file_type = self._guess_attributes_from_name(doc) file_info = FileInfo.from_path(doc)
relevant_tags = set(list(Tag.match_all(text)) + list(tags)) relevant_tags = set(list(Tag.match_all(text)) + list(file_info.tags))
stats = os.stat(doc) stats = os.stat(doc)
self.log("debug", "Saving record to database") self.log("debug", "Saving record to database")
document = Document.objects.create( document = Document.objects.create(
correspondent=sender, correspondent=file_info.correspondent,
title=title, title=file_info.title,
content=text, content=text,
file_type=file_type, file_type=file_info.extension,
created=timezone.make_aware( created=timezone.make_aware(
datetime.datetime.fromtimestamp(stats.st_mtime)), datetime.datetime.fromtimestamp(stats.st_mtime)),
modified=timezone.make_aware( modified=timezone.make_aware(

View File

@ -96,11 +96,16 @@ class Command(Renderable, BaseCommand):
@staticmethod @staticmethod
def _get_legacy_file_name(doc): def _get_legacy_file_name(doc):
if doc.correspondent and doc.title:
tags = ",".join([t.slug for t in doc.tags.all()]) if not doc.correspondent and not doc.title:
if tags: return os.path.basename(doc.source_path)
return "{} - {} - {}.{}".format(
doc.correspondent, doc.title, tags, doc.file_type) created = doc.created.strftime("%Y%m%d%H%M%SZ")
return "{} - {}.{}".format( tags = ",".join([t.slug for t in doc.tags.all()])
doc.correspondent, doc.title, doc.file_type)
return os.path.basename(doc.source_path) if tags:
return "{} - {} - {} - {}.{}".format(
created, doc.correspondent, doc.title, tags, doc.file_type)
return "{} - {} - {}.{}".format(
created, doc.correspondent, doc.title, doc.file_type)
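
Note on the hunk above: the exported name now leads with a `created` stamp in the same `YYYYMMDDhhmmssZ` shape that the new `FileInfo` patterns look for. The sketch below shows how such a name is assembled. It is a standalone approximation, not the project's code: `legacy_name` is a hypothetical helper, and it assumes correspondent and title are non-empty strings and tags are already slugs.

```python
from datetime import datetime


def legacy_name(created, correspondent, title, tags, file_type):
    """Build "<created> - <correspondent> - <title>[ - <tags>].<suffix>".

    Hypothetical stand-in for the exporter logic; takes plain values
    instead of a Document instance.
    """
    # Same strftime pattern as the diff: 14 digits followed by a literal "Z".
    stamp = created.strftime("%Y%m%d%H%M%SZ")
    parts = [stamp, correspondent, title]
    if tags:
        # Tags are joined with commas and only appended when present.
        parts.append(",".join(tags))
    return "{}.{}".format(" - ".join(parts), file_type)


name = legacy_name(
    datetime(2016, 3, 25, 20, 56, 30), "Bank", "Statement",
    ["finance", "2016"], "pdf")
print(name)
```

Joining a list of parts avoids keeping two nearly identical format strings in sync, at the cost of being slightly less explicit about the final layout.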

View File

@ -1,8 +1,11 @@
import dateutil.parser
import logging import logging
import os import os
import re import re
import uuid import uuid
from collections import OrderedDict
from django.conf import settings from django.conf import settings
from django.core.urlresolvers import reverse from django.core.urlresolvers import reverse
from django.db import models from django.db import models
@ -152,7 +155,7 @@ class Document(models.Model):
) )
tags = models.ManyToManyField( tags = models.ManyToManyField(
Tag, related_name="documents", blank=True) Tag, related_name="documents", blank=True)
created = models.DateTimeField(default=timezone.now, editable=False) created = models.DateTimeField(default=timezone.now)
modified = models.DateTimeField(auto_now=True, editable=False) modified = models.DateTimeField(auto_now=True, editable=False)
class Meta(object): class Meta(object):
@ -250,3 +253,136 @@ class Log(models.Model):
self.group = uuid.uuid4() self.group = uuid.uuid4()
models.Model.save(self, *args, **kwargs) models.Model.save(self, *args, **kwargs)
class FileInfo(object):
# This epic regex *almost* worked for our needs, so I'm keeping it here for
# posterity, in the hopes that we might find a way to make it work one day.
ALMOST_REGEX = re.compile(
r"^((?P<date>\d\d\d\d\d\d\d\d\d\d\d\d\d\dZ){separator})?"
r"((?P<correspondent>{non_separated_word}+){separator})??"
r"(?P<title>{non_separated_word}+)"
r"({separator}(?P<tags>[a-z,0-9-]+))?"
r"\.(?P<extension>[a-zA-Z.-]+)$".format(
separator=r"\s+-\s+",
non_separated_word=r"([\w,. ]|([^\s]-))"
)
)
REGEXES = OrderedDict([
("created-correspondent-title-tags", re.compile(
r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - "
r"(?P<correspondent>.*) - "
r"(?P<title>.*) - "
r"(?P<tags>[a-z0-9\-,]*)"
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
flags=re.IGNORECASE
)),
("created-title-tags", re.compile(
r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - "
r"(?P<title>.*) - "
r"(?P<tags>[a-z0-9\-,]*)"
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
flags=re.IGNORECASE
)),
("created-correspondent-title", re.compile(
r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - "
r"(?P<correspondent>.*) - "
r"(?P<title>.*)"
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
flags=re.IGNORECASE
)),
("created-title", re.compile(
r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - "
r"(?P<title>.*)"
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
flags=re.IGNORECASE
)),
("correspondent-title-tags", re.compile(
r"(?P<correspondent>.*) - "
r"(?P<title>.*) - "
r"(?P<tags>[a-z0-9\-,]*)"
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
flags=re.IGNORECASE
)),
("correspondent-title", re.compile(
r"(?P<correspondent>.*) - "
r"(?P<title>.*)?"
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
flags=re.IGNORECASE
)),
("title", re.compile(
r"(?P<title>.*)"
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
flags=re.IGNORECASE
))
])
def __init__(self, created=None, correspondent=None, title=None, tags=(),
extension=None):
self.created = created
self.title = title
self.extension = extension
self.correspondent = correspondent
self.tags = tags
@classmethod
def _get_created(cls, created):
return dateutil.parser.parse("{:0<14}Z".format(created[:-1]))
@classmethod
def _get_correspondent(cls, name):
if not name:
return None
return Correspondent.objects.get_or_create(name=name, defaults={
"slug": slugify(name)
})[0]
@classmethod
def _get_title(cls, title):
return title
@classmethod
def _get_tags(cls, tags):
r = []
for t in tags.split(","):
r.append(
Tag.objects.get_or_create(slug=t, defaults={"name": t})[0])
return tuple(r)
@classmethod
def _get_extension(cls, extension):
r = extension.lower()
if r == "jpeg":
return "jpg"
return r
@classmethod
def _mangle_property(cls, properties, name):
if name in properties:
properties[name] = getattr(cls, "_get_{}".format(name))(
properties[name]
)
@classmethod
def from_path(cls, path):
"""
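We use a crude naming convention to make handling the created date,
correspondent, title, and tags easier. <created> is either YYYYMMDDZ or
YYYYMMDDhhmmssZ. The patterns are tried in this order and the first
match wins:
"<created> - <correspondent> - <title> - <tags>.<suffix>"
"<created> - <title> - <tags>.<suffix>"
"<created> - <correspondent> - <title>.<suffix>"
"<created> - <title>.<suffix>"
"<correspondent> - <title> - <tags>.<suffix>"
"<correspondent> - <title>.<suffix>"
"<title>.<suffix>"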
"""
for regex in cls.REGEXES.values():
m = regex.match(os.path.basename(path))
if m:
properties = m.groupdict()
cls._mangle_property(properties, "created")
cls._mangle_property(properties, "correspondent")
cls._mangle_property(properties, "title")
cls._mangle_property(properties, "tags")
cls._mangle_property(properties, "extension")
return cls(**properties)
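
Note on `from_path` above: it walks an ordered set of patterns and stops at the first match, so the most specific patterns have to come first. The sketch below reproduces that idea without Django. It is an illustration only: `PATTERNS` is a reduced, hypothetical subset of `REGEXES`, `guess` and `parse_created` are made-up names, and `datetime.strptime` stands in for `dateutil.parser.parse` to keep the example within the standard library.

```python
import os
import re
from collections import OrderedDict
from datetime import datetime

EXT = r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$"
CREATED = r"^(?P<created>\d{8}(\d{6})?Z) - "

# Reduced, hypothetical subset of the patterns, most specific first.
PATTERNS = OrderedDict([
    ("created-title", re.compile(
        CREATED + r"(?P<title>.*)" + EXT, flags=re.IGNORECASE)),
    ("correspondent-title", re.compile(
        r"(?P<correspondent>.*) - (?P<title>.*)" + EXT,
        flags=re.IGNORECASE)),
    ("title", re.compile(r"(?P<title>.*)" + EXT, flags=re.IGNORECASE)),
])


def parse_created(created):
    # Drop the trailing "Z", then right-pad a date-only stamp with zeros
    # so that "20150102" becomes "20150102000000" (midnight).
    digits = "{:0<14}".format(created[:-1])
    return datetime.strptime(digits, "%Y%m%d%H%M%S")


def guess(path):
    """Return (pattern label, properties) for the first matching pattern."""
    name = os.path.basename(path)
    for label, regex in PATTERNS.items():
        m = regex.match(name)
        if m:
            props = m.groupdict()  # named groups only
            if "created" in props:
                props["created"] = parse_created(props["created"])
            ext = props["extension"].lower()
            props["extension"] = "jpg" if ext == "jpeg" else ext
            return label, props
    # Nothing matched, e.g. an unsupported extension.
    return None, {}


print(guess("/path/to/20150102Z - Invoice.PDF"))
print(guess("/path/to/Dr. McWheelie - Title a-dash.png"))
print(guess("/path/to/notes.txt"))
```

Unlike this sketch, `from_path` as written in the diff falls off the end of the loop and implicitly returns `None` when no pattern matches, so callers may want to guard against that.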

View File

@ -1,29 +1,36 @@
from django.test import TestCase from django.test import TestCase
from ..consumer import Consumer from ..models import Document, FileInfo
class TestAttachment(TestCase): class TestAttachment(TestCase):
TAGS = ("tag1", "tag2", "tag3") TAGS = ("tag1", "tag2", "tag3")
CONSUMER = Consumer() EXTENSIONS = (
SUFFIXES = (
"pdf", "png", "jpg", "jpeg", "gif", "pdf", "png", "jpg", "jpeg", "gif",
"PDF", "PNG", "JPG", "JPEG", "GIF", "PDF", "PNG", "JPG", "JPEG", "GIF",
"PdF", "PnG", "JpG", "JPeG", "GiF", "PdF", "PnG", "JpG", "JPeG", "GiF",
) )
def _test_guess_attributes_from_name(self, path, sender, title, tags): def _test_guess_attributes_from_name(self, path, sender, title, tags):
for suffix in self.SUFFIXES:
f = path.format(suffix) for extension in self.EXTENSIONS:
results = self.CONSUMER._guess_attributes_from_name(f)
self.assertEqual(results[0].name, sender, f) f = path.format(extension)
self.assertEqual(results[1], title, f) file_info = FileInfo.from_path(f)
self.assertEqual(tuple([t.slug for t in results[2]]), tags, f)
if suffix.lower() == "jpeg": if sender:
self.assertEqual(results[3], "jpg", f) self.assertEqual(file_info.correspondent.name, sender, f)
else: else:
self.assertEqual(results[3], suffix.lower(), f) self.assertIsNone(file_info.correspondent, f)
self.assertEqual(file_info.title, title, f)
self.assertEqual(tuple([t.slug for t in file_info.tags]), tags, f)
if extension.lower() == "jpeg":
self.assertEqual(file_info.extension, "jpg", f)
else:
self.assertEqual(file_info.extension, extension.lower(), f)
def test_guess_attributes_from_name0(self): def test_guess_attributes_from_name0(self):
self._test_guess_attributes_from_name( self._test_guess_attributes_from_name(
@ -92,3 +99,206 @@ class TestAttachment(TestCase):
"Τιτλε", "Τιτλε",
self.TAGS self.TAGS
) )
def test_guess_attributes_from_name_when_correspondent_empty(self):
self._test_guess_attributes_from_name(
'/path/to/ - weird empty correspondent but should not break.{}',
None,
'weird empty correspondent but should not break',
()
)
def test_guess_attributes_from_name_when_title_starts_with_dash(self):
self._test_guess_attributes_from_name(
'/path/to/- weird but should not break.{}',
None,
'- weird but should not break',
()
)
def test_guess_attributes_from_name_when_title_ends_with_dash(self):
self._test_guess_attributes_from_name(
'/path/to/weird but should not break -.{}',
None,
'weird but should not break -',
()
)
def test_guess_attributes_from_name_when_title_is_empty(self):
self._test_guess_attributes_from_name(
'/path/to/weird correspondent but should not break - .{}',
'weird correspondent but should not break',
'',
()
)
class Permutations(TestCase):
valid_dates = (
"20150102030405Z",
"20150102Z",
)
valid_correspondents = [
"timmy",
"Dr. McWheelie",
"Dash Gor-don",
"ο Θερμαστής",
""
]
valid_titles = ["title", "Title w Spaces", "Title a-dash", "Τίτλος", ""]
valid_tags = ["tag", "tig,tag", "tag1,tag2,tag-3"]
valid_extensions = ["pdf", "png", "jpg", "jpeg", "gif"]
def _test_guessed_attributes(self, filename, created=None,
correspondent=None, title=None,
extension=None, tags=None):
# print(filename)
info = FileInfo.from_path(filename)
# Created
if created is None:
self.assertIsNone(info.created, filename)
else:
self.assertEqual(info.created.year, int(created[:4]), filename)
self.assertEqual(info.created.month, int(created[4:6]), filename)
self.assertEqual(info.created.day, int(created[6:8]), filename)
# Correspondent
if correspondent:
self.assertEqual(info.correspondent.name, correspondent, filename)
else:
self.assertEqual(info.correspondent, None, filename)
# Title
self.assertEqual(info.title, title, filename)
# Tags
if tags is None:
self.assertEqual(info.tags, (), filename)
else:
self.assertEqual(
[t.slug for t in info.tags], tags.split(','),
filename
)
# Extension
if extension == 'jpeg':
extension = 'jpg'
self.assertEqual(info.extension, extension, filename)
def test_just_title(self):
template = '/path/to/{title}.{extension}'
for title in self.valid_titles:
for extension in self.valid_extensions:
spec = dict(title=title, extension=extension)
filename = template.format(**spec)
self._test_guessed_attributes(filename, **spec)
def test_title_and_correspondent(self):
template = '/path/to/{correspondent} - {title}.{extension}'
for correspondent in self.valid_correspondents:
for title in self.valid_titles:
for extension in self.valid_extensions:
spec = dict(correspondent=correspondent, title=title,
extension=extension)
filename = template.format(**spec)
self._test_guessed_attributes(filename, **spec)
def test_title_and_correspondent_and_tags(self):
template = '/path/to/{correspondent} - {title} - {tags}.{extension}'
for correspondent in self.valid_correspondents:
for title in self.valid_titles:
for tags in self.valid_tags:
for extension in self.valid_extensions:
spec = dict(correspondent=correspondent, title=title,
tags=tags, extension=extension)
filename = template.format(**spec)
self._test_guessed_attributes(filename, **spec)
def test_created_and_correspondent_and_title_and_tags(self):
template = ("/path/to/{created} - "
"{correspondent} - "
"{title} - "
"{tags}"
".{extension}")
for created in self.valid_dates:
for correspondent in self.valid_correspondents:
for title in self.valid_titles:
for tags in self.valid_tags:
for extension in self.valid_extensions:
spec = {
"created": created,
"correspondent": correspondent,
"title": title,
"tags": tags,
"extension": extension
}
self._test_guessed_attributes(
template.format(**spec), **spec)
def test_created_and_correspondent_and_title(self):
template = ("/path/to/{created} - "
"{correspondent} - "
"{title}"
".{extension}")
for created in self.valid_dates:
for correspondent in self.valid_correspondents:
for title in self.valid_titles:
# Skip cases where title looks like a tag as we can't
# accommodate such cases.
if title.lower() == title:
continue
for extension in self.valid_extensions:
spec = {
"created": created,
"correspondent": correspondent,
"title": title,
"extension": extension
}
self._test_guessed_attributes(
template.format(**spec), **spec)
def test_created_and_title(self):
template = ("/path/to/{created} - "
"{title}"
".{extension}")
for created in self.valid_dates:
for title in self.valid_titles:
for extension in self.valid_extensions:
spec = {
"created": created,
"title": title,
"extension": extension
}
self._test_guessed_attributes(
template.format(**spec), **spec)
def test_created_and_title_and_tags(self):
template = ("/path/to/{created} - "
"{title} - "
"{tags}"
".{extension}")
for created in self.valid_dates:
for title in self.valid_titles:
for tags in self.valid_tags:
for extension in self.valid_extensions:
spec = {
"created": created,
"title": title,
"tags": tags,
"extension": extension
}
self._test_guessed_attributes(
template.format(**spec), **spec)
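
Note on the `Permutations` tests above: the nested `for` loops enumerate every combination of the candidate values. One possible way to flatten them is `itertools.product`, sketched below. This is only a suggestion, not part of the commit; `specs` is a hypothetical helper name, and it relies on keyword arguments keeping their order (Python 3.6 and later).

```python
from itertools import product


def specs(**choices):
    """Yield one dict per combination of the given candidate values."""
    keys = list(choices)
    for values in product(*(choices[k] for k in keys)):
        yield dict(zip(keys, values))


template = "/path/to/{created} - {title}.{extension}"
for spec in specs(created=["20150102Z"],
                  title=["title", "Title a-dash"],
                  extension=["pdf", "png"]):
    # Each spec can be used both to build the file name and as the
    # expected keyword arguments, as the tests above do.
    print(template.format(**spec), spec)
```

The explicit loops in the commit remain easier to adapt when a test needs to skip particular combinations, as `test_created_and_correspondent_and_title` does.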