Merge branch 'master' into issue/81

This commit is contained in:
Daniel Quinn 2016-03-25 20:56:30 +00:00
commit 49b56425e8
16 changed files with 598 additions and 167 deletions

View File

@ -24,8 +24,11 @@ How it Works
1. Buy a document scanner like `this one`_.
2. Set it up to "scan to FTP" or something similar. It should be able to push
scanned images to a server without you having to do anything.
3. Have the target server run the *Paperless* consumption script to OCR the PDF
scanned images to a server without you having to do anything. If your
scanner doesn't know how to automatically upload the file somewhere, you can
always do that manually. Paperless doesn't care how the documents get into
its local consumption directory.
3. Have the target server run the Paperless consumption script to OCR the PDF
and index it into a local database.
4. Use the web frontend to sift through the database and find what you want.
5. Download the PDF you need/want via the web interface and do whatever you
@ -56,7 +59,7 @@ powerful tools.
* `ImageMagick`_ converts the images between colour and greyscale.
* `Tesseract`_ does the character recognition.
* `Unpaper`_ despeckles and and deskews the scanned image.
* `Unpaper`_ despeckles and deskews the scanned image.
* `GNU Privacy Guard`_ is used as the encryption backend.
* `Python 3`_ is the language of the project.

View File

@ -11,6 +11,10 @@ services:
- data:/usr/src/paperless/data
- media:/usr/src/paperless/media
env_file: docker-compose.env
# The reason the line is here is so that the webserver that doesn't do
# any text recognition and doesn't have to install unnecessary
# languages the user might have set in the env-file by overwriting the
# value with nothing.
environment:
- PAPERLESS_OCR_LANGUAGES=
command: ["runserver", "0.0.0.0:8000"]

View File

@ -1,6 +1,15 @@
Changelog
#########
* 0.2.0
* Added support for guessing the date from the file name along with the
correspondent, title, and tags. Thanks to `Tikitu de Jager`_ for his pull
request that I took forever to merge and to `Pit`_ for his efforts on the
regex front.
* `#94`_: Restored support for changing the created date in the UI. Thanks
to `Martin Honermeyer`_ and `Tim White`_ for working with me on this.
* 0.1.1
* Potentially **Breaking Change**: All references to "sender" in the code
@ -86,6 +95,8 @@ Changelog
.. _Wayne Werner: https://github.com/waynew
.. _darkmatter: https://github.com/darkmatter
.. _zedster: https://github.com/zedster
.. _Martin Honermeyer: https://github.com/djmaze
.. _Tim White: https://github.com/timwhite
.. _#20: https://github.com/danielquinn/paperless/issues/20
.. _#44: https://github.com/danielquinn/paperless/issues/44
@ -99,3 +110,4 @@ Changelog
.. _#67: https://github.com/danielquinn/paperless/issues/67
.. _#68: https://github.com/danielquinn/paperless/issues/68
.. _#71: https://github.com/danielquinn/paperless/issues/71
.. _#94: https://github.com/danielquinn/paperless/issues/71

View File

@ -45,19 +45,27 @@ you name the file right, it'll automatically set some values in the database
for you. This is is the logic the consumer follows:
1. Try to find the correspondent, title, and tags in the file name following
the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``. Note that
the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or
``YYYYMMDDZ``. The ``Z`` is for "Zulu time" AKA "UTC".
2. If that doesn't work, we skip the date and try this pattern:
the pattern: ``Correspondent - Title - tag,tag,tag.pdf``.
2. If that doesn't work, try to find the correspondent and title in the file
3. If that doesn't work, we try to find the correspondent and title in the file
name following the pattern: ``Correspondent - Title.pdf``.
3. If that doesn't work, just assume that the name of the file is the title.
4. If that doesn't work, just assume that the name of the file is the title.
So given the above, the following examples would work as you'd expect:
* ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
* ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
* ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
* ``Another Company - Letter of Reference.jpg``
* ``Dad's Recipe for Pancakes.png``
These however wouldn't work:
* ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
* ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
* ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
* ``Another Company- Letter of Reference.jpg``
@ -128,7 +136,7 @@ following name/value pairs:
don't start uploading stuff to your server. The means of generating this
signature is defined below.
Specify ``enctype="multipart/form-data"``, and then POST your file with:::
Specify ``enctype="multipart/form-data"``, and then POST your file with::
Content-Disposition: form-data; name="document"; filename="whatever.pdf"

View File

@ -33,4 +33,5 @@ Contents
api
utilities
migrating
troubleshooting
changelog

View File

@ -8,7 +8,7 @@ should work) that has the following software installed on it:
* `Python3`_ (with development libraries, pip and virtualenv)
* `GNU Privacy Guard`_
* `Tesseract`_
* `Tesseract`_, plus its language files matching your document base.
* `Imagemagick`_
* `unpaper`_
@ -52,6 +52,7 @@ well as ImageMagick:
$ brew install ghostscript
$ brew install imagemagick
$ brew install libmagic
.. _requirements-baremetal:

View File

@ -5,7 +5,8 @@ Setup
Paperless isn't a very complicated app, but there are a few components, so some
basic documentation is in order. If you go follow along in this document and
still have trouble, please open an `issue on GitHub`_ so I can fill in the gaps.
still have trouble, please open an `issue on GitHub`_ so I can fill in the
gaps.
.. _issue on GitHub: https://github.com/danielquinn/paperless/issues
@ -15,8 +16,8 @@ still have trouble, please open an `issue on GitHub`_ so I can fill in the gaps.
Download
--------
The source is currently only available via GitHub, so grab it from there, either
by using ``git``:
The source is currently only available via GitHub, so grab it from there,
either by using ``git``:
.. code:: bash
@ -42,15 +43,16 @@ route`_ is quick & easy, but means you're running a VM which comes with memory
consumption etc. We also `support Docker`_, which you can use natively under
Linux and in a VM with `Docker Machine`_ (this guide was written for native
Docker usage under Linux, you might have to adapt it for Docker Machine.)
Alternatively the standard, `bare metal`_ approach is a little more complicated,
but worth it because it makes it easier to should you want to contribute some
code back.
Alternatively the standard, `bare metal`_ approach is a little more
complicated, but worth it because it makes it easier to should you want to
contribute some code back.
.. _Vagrant route: setup-installation-vagrant_
.. _support Docker: setup-installation-docker_
.. _bare metal: setup-installation-standard_
.. _Docker Machine: https://docs.docker.com/machine/
.. _setup-installation-standard:
Standard (Bare Metal)
@ -58,19 +60,16 @@ Standard (Bare Metal)
1. Install the requirements as per the :ref:`requirements <requirements>` page.
2. Change to the ``src`` directory in this repo.
3. Edit ``paperless/settings.py`` and be sure to set the values for:
* ``CONSUMPTION_DIR``: this is where your documents will be dumped to be
consumed by Paperless.
* ``PASSPHRASE``: this is the passphrase Paperless uses to encrypt/decrypt
the original document. The default value attempts to source the
passphrase from the environment, so if you don't set it to a static value
here, you must set ``PAPERLESS_PASSPHRASE=some-secret-string`` on the
command line whenever invoking the consumer or webserver.
* ``OCR_THREADS``: this is the number of threads the OCR process will spawn
to process document pages in parallel. The default value gets sourced from
the environment-variable ``PAPERLESS_OCR_THREADS`` and expects it to be an
integer. If the variable is not set, Python determines the core-count of
your CPU and uses that value.
3. Copy ``paperless.conf.example`` to ``/etc/paperless.conf`` and open it in
your favourite editor. Set the values for:
* ``PAPERLESS_CONSUMPTION_DIR``: this is where your documents will be
dumped to be consumed by Paperless.
* ``PAPERLESS_PASSPHRASE``: this is the passphrase Paperless uses to
encrypt/decrypt the original document.
* ``PAPERLESS_OCR_THREADS``: this is the number of threads the OCR process
will spawn to process document pages in parallel.
4. Initialise the database with ``./manage.py migrate``.
5. Create a user for your Paperless instance with
``./manage.py createsuperuser``. Follow the prompts to create your user.
@ -79,8 +78,8 @@ Standard (Bare Metal)
You should now be able to visit your (empty) `Paperless webserver`_ at
``127.0.0.1:8000`` (or whatever you chose). You can login with the
user/pass you created in #5.
7. In a separate window, change to the ``src`` directory in this repo again, but
this time, you should start the consumer script with
7. In a separate window, change to the ``src`` directory in this repo again,
but this time, you should start the consumer script with
``./manage.py document_consumer``.
8. Scan something. Put it in the ``CONSUMPTION_DIR``.
9. Wait a few minutes
@ -100,6 +99,7 @@ Vagrant Method
provisioned...
3. Run ``vagrant ssh`` and once inside your new vagrant box, edit
``/etc/paperless.conf`` and set the values for:
* ``PAPERLESS_CONSUMPTION_DIR``: this is where your documents will be
dumped to be consumed by Paperless.
* ``PAPERLESS_PASSPHRASE``: this is the passphrase Paperless uses to
@ -107,6 +107,7 @@ Vagrant Method
* ``PAPERLESS_SHARED_SECRET``: this is the "magic word" used when consuming
documents from mail or via the API. If you don't use either, leaving it
blank is just fine.
4. Exit the vagrant box and re-enter it with ``vagrant ssh`` again. This
updates the environment to make use of the changes you made to the config
file.
@ -140,9 +141,9 @@ Docker Method
.. caution::
As mentioned earlier, this guide assumes that you use Docker natively
under Linux. If you are using `Docker Machine`_ under Mac OS X or Windows,
you will have to adapt IP addresses, volume-mounting, command execution
and maybe more.
under Linux. If you are using `Docker Machine`_ under Mac OS X or
Windows, you will have to adapt IP addresses, volume-mounting, command
execution and maybe more.
2. Install `docker-compose`_. [#compose]_
@ -161,14 +162,14 @@ Docker Method
.. _Docker installation guide: https://docs.docker.com/engine/installation/
.. _docker-compose installation guide: https://docs.docker.com/compose/install/
3. Create a copy of ``docker-compose.yml.example`` as ``docker-compose.yml`` and
a copy of ``docker-compose.env.example`` as ``docker-compose.env``. You'll be
editing both these files: taking a copy ensures that you can ``git pull`` to
receive updates without risking merge conflicts with your modified versions
of the configuration files.
4. Modify ``docker-compose.yml`` to your preferences, following the instructions
in comments in the file. The only change that is a hard requirement is to
specify where the consumption directory should mount.
3. Create a copy of ``docker-compose.yml.example`` as ``docker-compose.yml``
and a copy of ``docker-compose.env.example`` as ``docker-compose.env``.
You'll be editing both these files: taking a copy ensures that you can
``git pull`` to receive updates without risking merge conflicts with your
modified versions of the configuration files.
4. Modify ``docker-compose.yml`` to your preferences, following the
instructions in comments in the file. The only change that is a hard
requirement is to specify where the consumption directory should mount.
5. Modify ``docker-compose.env`` and adapt the following environment variables:
``PAPERLESS_PASSPHRASE``
@ -181,10 +182,11 @@ Docker Method
the core-count of your CPU and uses that value.
``PAPERLESS_OCR_LANGUAGES``
If you want the OCR to recognize other languages in addition to the default
English, set this parameter to a space separated list of three-letter
language-codes after `ISO 639-2/T`_. For a list of available languages --
including their three letter codes -- see the `Debian packagelist`_.
If you want the OCR to recognize other languages in addition to the
default English, set this parameter to a space separated list of
three-letter language-codes after `ISO 639-2/T`_. For a list of available
languages -- including their three letter codes -- see the
`Debian packagelist`_.
``USERMAP_UID`` and ``USERMAP_GID``
If you want to mount the consumption volume (directory ``/consume`` within
@ -192,11 +194,11 @@ Docker Method
access rights might be an issue. The default user and group ``paperless``
in the containers have an id of 1000. The containers will enforce that the
owning group of the consumption directory will be ``paperless`` to be able
to delete consumed documents. If your host-system has a group with an id of
1000 and you don't want this group to have access rights to the consumption
directory, you can use ``USERMAP_GID`` to change the id in the container
and thus the one of the consumption directory. Furthermore, you can change
the id of the default user as well using ``USERMAP_UID``.
to delete consumed documents. If your host-system has a group with an ID
of 1000 and you don't want this group to have access rights to the
consumption directory, you can use ``USERMAP_GID`` to change the id in the
container and thus the one of the consumption directory. Furthermore, you
can change the id of the default user as well using ``USERMAP_UID``.
6. Run ``docker-compose up -d``. This will create and start the necessary
containers.
@ -234,14 +236,14 @@ Docker Method
.. danger::
While the consumption container will ensure at startup that it can
**delete** a consumed file from a host-mounted directory, it might not
be able to **read** the document in the first place if the access
**delete** a consumed file from a host-mounted directory, it might
not be able to **read** the document in the first place if the access
rights to the file are incorrect.
Make sure that the documents you put into the consumption directory
will either be readable by everyone (``chmod o+r file.pdf``) or
readable by the default user or group id 1000 (or the one you have set
with ``USERMAP_UID`` or ``USERMAP_GID`` respectively).
readable by the default user or group id 1000 (or the one you have
set with ``USERMAP_UID`` or ``USERMAP_GID`` respectively).
2. Use ``docker cp`` to copy your files directly into the container:
@ -258,8 +260,8 @@ Docker Method
``docker cp`` is a one-shot-command, just like ``cp``. This means that
every time you want to consume a new document, you will have to execute
``docker cp`` again. You can of course automate this process, but option 1
is generally the preferred one.
``docker cp`` again. You can of course automate this process, but option
1 is generally the preferred one.
.. danger::
@ -267,8 +269,8 @@ Docker Method
to the acting user at the destination, which will be ``root``.
You therefore need to ensure that the documents you want to copy into
the container are readable by everyone (``chmod o+r file.pdf``) before
copying them.
the container are readable by everyone (``chmod o+r file.pdf``)
before copying them.
.. _Docker: https://www.docker.com/
@ -281,17 +283,108 @@ Docker Method
free to tinker around without using compose!
.. _making-things-a-little-more-permanent:
.. _setup-permanent:
Making Things a Little more Permanent
-------------------------------------
Once you've tested things and are happy with the work flow, you can automate the
process of starting the webserver and consumer automatically. If you're running
on a bare metal system that's using Systemd, you can use the service unit files
in the ``scripts`` directory to set this up. If you're on another startup
system or are using a Vagrant box, then you're currently on your own. If you are
using Docker, you can set a restart-policy_ in the ``docker-compose.yml`` to
have the containers automatically start with the Docker daemon.
Once you've tested things and are happy with the work flow, you can automate
the process of starting the webserver and consumer automatically.
.. _setup-permanent-standard-systemd:
Standard (Bare Metal, Systemd)
..............................
If you're running on a bare metal system that's using Systemd, you can use the
service unit files in the ``scripts`` directory to set this up. You'll need to
create a user called ``paperless`` and setup Paperless to be in a place that
this new user can read and write to. Then, you can just tell Systemd to enable
the two ``.service`` files::
# systemctl enable /path/to/paperless/scripts/paperless-consumer.service
# systemctl enable /path/to/paperless/scripts/paperless-webserver.service
# systemctl start /path/to/paperless/scripts/paperless-consumer.service
# systemctl start /path/to/paperless/scripts/paperless-webserver.service
.. _setup-permanent-standard-ubuntu14:
Ubuntu 14.04 (Bare Metal, Upstart)
..................................
Ubuntu 14.04 and earlier use the `Upstart`_ init system to start services
during the boot process. To configure Upstart to run Paperless automatically
after restarting your system:
1. Change to the directory where Upstart's configuration files are kept:
``cd /etc/init``
2. Create a new file: ``sudo nano paperless-server.conf``
3. In the newly-created file enter::
start on (local-filesystems and net-device-up IFACE=eth0)
stop on shutdown
respawn
respawn limit 10 5
script
exec /srv/paperless/src/manage.py runserver 0.0.0.0:80
end script
Note that you'll need to replace ``/srv/paperless/src/manage.py`` with the
path to the ``manage.py`` script in your installation directory.
If you are using a network interface other than ``eth0``, you will have to
change ``IFACE=eth0``. For example, if you are connected via WiFi, you will
likely need to replace ``eth0`` above with ``wlan0``. To see all interfaces,
run ``ifconfig``.
Save the file.
4. Create a new file: ``sudo nano paperless-consumer.conf``
5. In the newly-created file enter::
start on (local-filesystems and net-device-up IFACE=eth0)
stop on shutdown
respawn
respawn limit 10 5
script
exec /srv/paperless/src/manage.py document_consumer
end script
Replace ``/srv/paperless/src/manage.py`` with the same values as in step 3
above and replace ``eth0`` with the appropriate value, if necessary. Save the
file.
These two configuration files together will start both the Paperless webserver
and document consumer processes when the file system and network interface
specified is available after boot. Furthermore, if either process ever exits
unexpectedly, Upstart will try to restart it a maximum of 10 times within a 5
second period.
.. _Upstart: http://upstart.ubuntu.com/
.. _setup-permanent-vagrant:
Vagrant
.......
You're currently on your own, but the Ubuntu explanation above may be enough.
.. _setup-permanent-docker:
Docker
......
If you're using Docker, you can set a restart-policy_ in the
``docker-compose.yml`` to have the containers automatically start with the
Docker daemon.
.. _restart-policy: https://docs.docker.com/engine/reference/commandline/run/#restart-policies-restart

19
docs/troubleshooting.rst Normal file
View File

@ -0,0 +1,19 @@
.. _troubleshooting:
Troubleshooting
===============
.. _troubleshooting_ocr_language_files_missing:
Consumer warns ``OCR for XX failed``
------------------------------------
If you find the OCR accuracy to be too low, and/or the document consumer warns that ``OCR for
XX failed, but we're going to stick with what we've got since FORGIVING_OCR is enabled``, then you
might need to install the `Tesseract language files
<http://packages.ubuntu.com/search?keywords=tesseract-ocr>`_ marching your documents languages.
As an example, if you are running Paperless from the Vagrant setup provided (or from any Ubuntu or Debian
box), and your documents are written in Spanish you may need to run::
apt-get install -y tesseract-ocr-spa

View File

@ -20,7 +20,7 @@ PAPERLESS_CONSUME_MAIL_PASS=""
#
# The passphrase you use here will be used when storing your documents in
# Paperless, but you can always export them in an unencrypted format by using
# document exporter. See the documentaiton for more information.
# document exporter. See the documentation for more information.
#
# One final note about the passphrase. Once you've consumed a document with
# one passphrase, DON'T CHANGE IT. Paperless assumes this to be a constant and
@ -31,3 +31,8 @@ PAPERLESS_PASSPHRASE="secret"
# If you intend to consume documents either via HTTP POST or by email, you must
# have a shared secret here.
PAPERLESS_SHARED_SECRET=""
# By default, Paperless will attempt to use all available CPU cores to process
# a document, but if you would like to limit that, you can set this value to
# an integer:
#PAPERLESS_OCR_THREADS=1

BIN
presentation/img/kitten.jpg Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 92 KiB

View File

@ -148,12 +148,12 @@
<section data-background="img/pony.png">
<h2>Demo!</h2>
<p>(Time to sacrifice a kitten)</p>
<img src="img/kitten.jpg" style="width: 50%;" />
</section>
<section>
<h2>TODO</h2>
<p>It works, but it could use polish</p>
<p>It works, but it needs polish</p>
<ul>
<li>The UI is the Django admin</li>
<li>Mail consumption is really raw</li>
@ -163,11 +163,11 @@
<aside class="notes">
<ul>
<li>
<strong>Plugin architecture</strong>: there've been requests for
some overly custom stuff to happen before and after consumption,
but in the UNIX spirit of "do one job well", I think this sort
of thing is better written as a plugin -- which means I need to
figure out a best practise for that.
<strong>Plugin architecture</strong>: there've been requests
for some overly custom stuff to happen before and after
consumption, but in the UNIX spirit of "do one job well", I
think this sort of thing is better written as a plugin -- which
means I need to figure out a best practise for that.
</li>
</ul>
</aside>

View File

@ -1,4 +1,4 @@
Django==1.9.2
Django==1.9.4
Pillow==3.1.1
django-crispy-forms==1.6.0
django-extensions==1.6.1

View File

@ -19,12 +19,11 @@ from PIL import Image
from django.conf import settings
from django.utils import timezone
from django.template.defaultfilters import slugify
from pyocr.tesseract import TesseractError
from paperless.db import GnuPG
from .models import Correspondent, Tag, Document, Log
from .models import Tag, Document, Log, FileInfo
from .languages import ISO639
from .signals import (
document_consumption_started, document_consumption_finished)
@ -56,19 +55,6 @@ class Consumer(object):
DEFAULT_OCR_LANGUAGE = settings.OCR_LANGUAGE
REGEX_TITLE = re.compile(
r"^.*/(.*)\.(pdf|jpe?g|png|gif|tiff)$",
flags=re.IGNORECASE
)
REGEX_CORRESPONDENT_TITLE = re.compile(
r"^.*/(.+) - (.*)\.(pdf|jpe?g|png|gif|tiff)$",
flags=re.IGNORECASE
)
REGEX_CORRESPONDENT_TITLE_TAGS = re.compile(
r"^.*/(.*) - (.*) - ([a-z0-9\-,]*)\.(pdf|jpe?g|png|gif|tiff)$",
flags=re.IGNORECASE
)
def __init__(self):
self.logger = logging.getLogger(__name__)
@ -107,7 +93,7 @@ class Consumer(object):
if not os.path.isfile(doc):
continue
if not re.match(self.REGEX_TITLE, doc):
if not re.match(FileInfo.REGEXES["title"], doc):
continue
if doc in self._ignore:
@ -282,72 +268,20 @@ class Consumer(object):
# Strip out excess white space to allow matching to go smoother
return re.sub(r"\s+", " ", r)
def _guess_attributes_from_name(self, parseable):
"""
We use a crude naming convention to make handling the correspondent,
title, and tags easier:
"<correspondent> - <title> - <tags>.<suffix>"
"<correspondent> - <title>.<suffix>"
"<title>.<suffix>"
"""
def get_correspondent(correspondent_name):
return Correspondent.objects.get_or_create(
name=correspondent_name,
defaults={"slug": slugify(correspondent_name)}
)[0]
def get_tags(tags):
r = []
for t in tags.split(","):
r.append(
Tag.objects.get_or_create(slug=t, defaults={"name": t})[0])
return tuple(r)
def get_suffix(suffix):
suffix = suffix.lower()
if suffix == "jpeg":
return "jpg"
return suffix
# First attempt: "<correspondent> - <title> - <tags>.<suffix>"
m = re.match(self.REGEX_CORRESPONDENT_TITLE_TAGS, parseable)
if m:
return (
get_correspondent(m.group(1)),
m.group(2),
get_tags(m.group(3)),
get_suffix(m.group(4))
)
# Second attempt: "<correspondent> - <title>.<suffix>"
m = re.match(self.REGEX_CORRESPONDENT_TITLE, parseable)
if m:
return (
get_correspondent(m.group(1)),
m.group(2),
(),
get_suffix(m.group(3))
)
# That didn't work, so we assume correspondent and tags are None
m = re.match(self.REGEX_TITLE, parseable)
return None, m.group(1), (), get_suffix(m.group(2))
def _store(self, text, doc, thumbnail):
sender, title, tags, file_type = self._guess_attributes_from_name(doc)
relevant_tags = set(list(Tag.match_all(text)) + list(tags))
file_info = FileInfo.from_path(doc)
relevant_tags = set(list(Tag.match_all(text)) + list(file_info.tags))
stats = os.stat(doc)
self.log("debug", "Saving record to database")
document = Document.objects.create(
correspondent=sender,
title=title,
correspondent=file_info.correspondent,
title=file_info.title,
content=text,
file_type=file_type,
file_type=file_info.extension,
created=timezone.make_aware(
datetime.datetime.fromtimestamp(stats.st_mtime)),
modified=timezone.make_aware(

View File

@ -96,11 +96,16 @@ class Command(Renderable, BaseCommand):
@staticmethod
def _get_legacy_file_name(doc):
if doc.correspondent and doc.title:
tags = ",".join([t.slug for t in doc.tags.all()])
if tags:
return "{} - {} - {}.{}".format(
doc.correspondent, doc.title, tags, doc.file_type)
return "{} - {}.{}".format(
doc.correspondent, doc.title, doc.file_type)
return os.path.basename(doc.source_path)
if not doc.correspondent and not doc.title:
return os.path.basename(doc.source_path)
created = doc.created.strftime("%Y%m%d%H%M%SZ")
tags = ",".join([t.slug for t in doc.tags.all()])
if tags:
return "{} - {} - {} - {}.{}".format(
created, doc.correspondent, doc.title, tags, doc.file_type)
return "{} - {} - {}.{}".format(
created, doc.correspondent, doc.title, doc.file_type)

View File

@ -1,8 +1,11 @@
import dateutil.parser
import logging
import os
import re
import uuid
from collections import OrderedDict
from django.conf import settings
from django.core.urlresolvers import reverse
from django.db import models
@ -152,7 +155,7 @@ class Document(models.Model):
)
tags = models.ManyToManyField(
Tag, related_name="documents", blank=True)
created = models.DateTimeField(default=timezone.now, editable=False)
created = models.DateTimeField(default=timezone.now)
modified = models.DateTimeField(auto_now=True, editable=False)
class Meta(object):
@ -250,3 +253,136 @@ class Log(models.Model):
self.group = uuid.uuid4()
models.Model.save(self, *args, **kwargs)
class FileInfo(object):
# This epic regex *almost* worked for our needs, so I'm keeping it here for
# posterity, in the hopes that we might find a way to make it work one day.
ALMOST_REGEX = re.compile(
r"^((?P<date>\d\d\d\d\d\d\d\d\d\d\d\d\d\dZ){separator})?"
r"((?P<correspondent>{non_separated_word}+){separator})??"
r"(?P<title>{non_separated_word}+)"
r"({separator}(?P<tags>[a-z,0-9-]+))?"
r"\.(?P<extension>[a-zA-Z.-]+)$".format(
separator=r"\s+-\s+",
non_separated_word=r"([\w,. ]|([^\s]-))"
)
)
REGEXES = OrderedDict([
("created-correspondent-title-tags", re.compile(
r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - "
r"(?P<correspondent>.*) - "
r"(?P<title>.*) - "
r"(?P<tags>[a-z0-9\-,]*)"
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
flags=re.IGNORECASE
)),
("created-title-tags", re.compile(
r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - "
r"(?P<title>.*) - "
r"(?P<tags>[a-z0-9\-,]*)"
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
flags=re.IGNORECASE
)),
("created-correspondent-title", re.compile(
r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - "
r"(?P<correspondent>.*) - "
r"(?P<title>.*)"
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
flags=re.IGNORECASE
)),
("created-title", re.compile(
r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - "
r"(?P<title>.*)"
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
flags=re.IGNORECASE
)),
("correspondent-title-tags", re.compile(
r"(?P<correspondent>.*) - "
r"(?P<title>.*) - "
r"(?P<tags>[a-z0-9\-,]*)"
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
flags=re.IGNORECASE
)),
("correspondent-title", re.compile(
r"(?P<correspondent>.*) - "
r"(?P<title>.*)?"
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
flags=re.IGNORECASE
)),
("title", re.compile(
r"(?P<title>.*)"
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
flags=re.IGNORECASE
))
])
def __init__(self, created=None, correspondent=None, title=None, tags=(),
extension=None):
self.created = created
self.title = title
self.extension = extension
self.correspondent = correspondent
self.tags = tags
@classmethod
def _get_created(cls, created):
return dateutil.parser.parse("{:0<14}Z".format(created[:-1]))
@classmethod
def _get_correspondent(cls, name):
if not name:
return None
return Correspondent.objects.get_or_create(name=name, defaults={
"slug": slugify(name)
})[0]
@classmethod
def _get_title(cls, title):
return title
@classmethod
def _get_tags(cls, tags):
r = []
for t in tags.split(","):
r.append(
Tag.objects.get_or_create(slug=t, defaults={"name": t})[0])
return tuple(r)
@classmethod
def _get_extension(cls, extension):
r = extension.lower()
if r == "jpeg":
return "jpg"
return r
@classmethod
def _mangle_property(cls, properties, name):
if name in properties:
properties[name] = getattr(cls, "_get_{}".format(name))(
properties[name]
)
@classmethod
def from_path(cls, path):
"""
We use a crude naming convention to make handling the correspondent,
title, and tags easier:
"<correspondent> - <title> - <tags>.<suffix>"
"<correspondent> - <title>.<suffix>"
"<title>.<suffix>"
"""
for regex in cls.REGEXES.values():
m = regex.match(os.path.basename(path))
if m:
properties = m.groupdict()
cls._mangle_property(properties, "created")
cls._mangle_property(properties, "correspondent")
cls._mangle_property(properties, "title")
cls._mangle_property(properties, "tags")
cls._mangle_property(properties, "extension")
return cls(**properties)

View File

@ -1,29 +1,36 @@
from django.test import TestCase
from ..consumer import Consumer
from ..models import Document, FileInfo
class TestAttachment(TestCase):
TAGS = ("tag1", "tag2", "tag3")
CONSUMER = Consumer()
SUFFIXES = (
EXTENSIONS = (
"pdf", "png", "jpg", "jpeg", "gif",
"PDF", "PNG", "JPG", "JPEG", "GIF",
"PdF", "PnG", "JpG", "JPeG", "GiF",
)
def _test_guess_attributes_from_name(self, path, sender, title, tags):
for suffix in self.SUFFIXES:
f = path.format(suffix)
results = self.CONSUMER._guess_attributes_from_name(f)
self.assertEqual(results[0].name, sender, f)
self.assertEqual(results[1], title, f)
self.assertEqual(tuple([t.slug for t in results[2]]), tags, f)
if suffix.lower() == "jpeg":
self.assertEqual(results[3], "jpg", f)
for extension in self.EXTENSIONS:
f = path.format(extension)
file_info = FileInfo.from_path(f)
if sender:
self.assertEqual(file_info.correspondent.name, sender, f)
else:
self.assertEqual(results[3], suffix.lower(), f)
self.assertIsNone(file_info.correspondent, f)
self.assertEqual(file_info.title, title, f)
self.assertEqual(tuple([t.slug for t in file_info.tags]), tags, f)
if extension.lower() == "jpeg":
self.assertEqual(file_info.extension, "jpg", f)
else:
self.assertEqual(file_info.extension, extension.lower(), f)
def test_guess_attributes_from_name0(self):
self._test_guess_attributes_from_name(
@ -92,3 +99,206 @@ class TestAttachment(TestCase):
"Τιτλε",
self.TAGS
)
def test_guess_attributes_from_name_when_correspondent_empty(self):
self._test_guess_attributes_from_name(
'/path/to/ - weird empty correspondent but should not break.{}',
None,
'weird empty correspondent but should not break',
()
)
def test_guess_attributes_from_name_when_title_starts_with_dash(self):
self._test_guess_attributes_from_name(
'/path/to/- weird but should not break.{}',
None,
'- weird but should not break',
()
)
def test_guess_attributes_from_name_when_title_ends_with_dash(self):
self._test_guess_attributes_from_name(
'/path/to/weird but should not break -.{}',
None,
'weird but should not break -',
()
)
def test_guess_attributes_from_name_when_title_is_empty(self):
self._test_guess_attributes_from_name(
'/path/to/weird correspondent but should not break - .{}',
'weird correspondent but should not break',
'',
()
)
class Permutations(TestCase):
valid_dates = (
"20150102030405Z",
"20150102Z",
)
valid_correspondents = [
"timmy",
"Dr. McWheelie",
"Dash Gor-don",
"ο Θερμαστής",
""
]
valid_titles = ["title", "Title w Spaces", "Title a-dash", "Τίτλος", ""]
valid_tags = ["tag", "tig,tag", "tag1,tag2,tag-3"]
valid_extensions = ["pdf", "png", "jpg", "jpeg", "gif"]
def _test_guessed_attributes(self, filename, created=None,
correspondent=None, title=None,
extension=None, tags=None):
# print(filename)
info = FileInfo.from_path(filename)
# Created
if created is None:
self.assertIsNone(info.created, filename)
else:
self.assertEqual(info.created.year, int(created[:4]), filename)
self.assertEqual(info.created.month, int(created[4:6]), filename)
self.assertEqual(info.created.day, int(created[6:8]), filename)
# Correspondent
if correspondent:
self.assertEqual(info.correspondent.name, correspondent, filename)
else:
self.assertEqual(info.correspondent, None, filename)
# Title
self.assertEqual(info.title, title, filename)
# Tags
if tags is None:
self.assertEqual(info.tags, (), filename)
else:
self.assertEqual(
[t.slug for t in info.tags], tags.split(','),
filename
)
# Extension
if extension == 'jpeg':
extension = 'jpg'
self.assertEqual(info.extension, extension, filename)
def test_just_title(self):
template = '/path/to/{title}.{extension}'
for title in self.valid_titles:
for extension in self.valid_extensions:
spec = dict(title=title, extension=extension)
filename = template.format(**spec)
self._test_guessed_attributes(filename, **spec)
def test_title_and_correspondent(self):
template = '/path/to/{correspondent} - {title}.{extension}'
for correspondent in self.valid_correspondents:
for title in self.valid_titles:
for extension in self.valid_extensions:
spec = dict(correspondent=correspondent, title=title,
extension=extension)
filename = template.format(**spec)
self._test_guessed_attributes(filename, **spec)
def test_title_and_correspondent_and_tags(self):
template = '/path/to/{correspondent} - {title} - {tags}.{extension}'
for correspondent in self.valid_correspondents:
for title in self.valid_titles:
for tags in self.valid_tags:
for extension in self.valid_extensions:
spec = dict(correspondent=correspondent, title=title,
tags=tags, extension=extension)
filename = template.format(**spec)
self._test_guessed_attributes(filename, **spec)
def test_created_and_correspondent_and_title_and_tags(self):
template = ("/path/to/{created} - "
"{correspondent} - "
"{title} - "
"{tags}"
".{extension}")
for created in self.valid_dates:
for correspondent in self.valid_correspondents:
for title in self.valid_titles:
for tags in self.valid_tags:
for extension in self.valid_extensions:
spec = {
"created": created,
"correspondent": correspondent,
"title": title,
"tags": tags,
"extension": extension
}
self._test_guessed_attributes(
template.format(**spec), **spec)
def test_created_and_correspondent_and_title(self):
template = ("/path/to/{created} - "
"{correspondent} - "
"{title}"
".{extension}")
for created in self.valid_dates:
for correspondent in self.valid_correspondents:
for title in self.valid_titles:
# Skip cases where title looks like a tag as we can't
# accommodate such cases.
if title.lower() == title:
continue
for extension in self.valid_extensions:
spec = {
"created": created,
"correspondent": correspondent,
"title": title,
"extension": extension
}
self._test_guessed_attributes(
template.format(**spec), **spec)
def test_created_and_title(self):
template = ("/path/to/{created} - "
"{title}"
".{extension}")
for created in self.valid_dates:
for title in self.valid_titles:
for extension in self.valid_extensions:
spec = {
"created": created,
"title": title,
"extension": extension
}
self._test_guessed_attributes(
template.format(**spec), **spec)
def test_created_and_title_and_tags(self):
template = ("/path/to/{created} - "
"{title} - "
"{tags}"
".{extension}")
for created in self.valid_dates:
for title in self.valid_titles:
for tags in self.valid_tags:
for extension in self.valid_extensions:
spec = {
"created": created,
"title": title,
"tags": tags,
"extension": extension
}
self._test_guessed_attributes(
template.format(**spec), **spec)