Merge branch 'master' into issue/81

2025-11-09 03:46:12 -06:00 · 2016-03-25 20:56:30 +00:00
parent 1170139127 396ff98b41
commit 49b56425e8
16 changed files with 598 additions and 167 deletions
--- a/README.rst
+++ b/README.rst
@@ -24,8 +24,11 @@ How it Works
 1. Buy a document scanner like `this one`_.
 2. Set it up to "scan to FTP" or something similar. It should be able to push
-   scanned images to a server without you having to do anything.
+   scanned images to a server without you having to do anything.  If your
-3. Have the target server run the *Paperless* consumption script to OCR the PDF
+   scanner doesn't know how to automatically upload the file somewhere, you can
   always do that manually.  Paperless doesn't care how the documents get into
   its local consumption directory.
 3. Have the target server run the Paperless consumption script to OCR the PDF
   and index it into a local database.
 4. Use the web frontend to sift through the database and find what you want.
 5. Download the PDF you need/want via the web interface and do whatever you
@@ -56,7 +59,7 @@ powerful tools.
 * `ImageMagick`_ converts the images between colour and greyscale.
 * `Tesseract`_ does the character recognition.
-* `Unpaper`_ despeckles and and deskews the scanned image.
+* `Unpaper`_ despeckles and deskews the scanned image.
 * `GNU Privacy Guard`_ is used as the encryption backend.
 * `Python 3`_ is the language of the project.
--- a/docker-compose.yml.example
+++ b/docker-compose.yml.example
@@ -11,6 +11,10 @@ services:
            - data:/usr/src/paperless/data
            - media:/usr/src/paperless/media
        env_file: docker-compose.env
        # Override PAPERLESS_OCR_LANGUAGES with an empty value here: the
        # webserver doesn't do any text recognition, so it doesn't have to
        # install the languages the user might have set in the env-file.
        environment:
            - PAPERLESS_OCR_LANGUAGES=
        command: ["runserver", "0.0.0.0:8000"]
--- a/docs/changelog.rst
+++ b/docs/changelog.rst
@@ -1,6 +1,15 @@
 Changelog
 #########
 * 0.2.0
  * Added support for guessing the date from the file name along with the
    correspondent, title, and tags.  Thanks to `Tikitu de Jager`_ for his pull
    request that I took forever to merge and to `Pit`_ for his efforts on the
    regex front.
  * `#94`_: Restored support for changing the created date in the UI.  Thanks
    to `Martin Honermeyer`_ and `Tim White`_ for working with me on this.
 * 0.1.1
  * Potentially **Breaking Change**: All references to "sender" in the code
@@ -86,6 +95,8 @@ Changelog
 .. _Wayne Werner: https://github.com/waynew
 .. _darkmatter: https://github.com/darkmatter
 .. _zedster: https://github.com/zedster
 .. _Martin Honermeyer: https://github.com/djmaze
 .. _Tim White: https://github.com/timwhite
 .. _#20: https://github.com/danielquinn/paperless/issues/20
 .. _#44: https://github.com/danielquinn/paperless/issues/44
@@ -99,3 +110,4 @@ Changelog
 .. _#67: https://github.com/danielquinn/paperless/issues/67
 .. _#68: https://github.com/danielquinn/paperless/issues/68
 .. _#71: https://github.com/danielquinn/paperless/issues/71
 .. _#94: https://github.com/danielquinn/paperless/issues/94
--- a/docs/consumption.rst
+++ b/docs/consumption.rst
@@ -45,19 +45,27 @@ you name the file right, it'll automatically set some values in the database
 for you.  This is the logic the consumer follows:
 1. Try to find the correspondent, title, and tags in the file name following
   the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``.  Note that
   the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or
   ``YYYYMMDDZ``.  The ``Z`` is for "Zulu time" AKA "UTC".
 2. If that doesn't work, we skip the date and try this pattern:
   ``Correspondent - Title - tag,tag,tag.pdf``.
-2. If that doesn't work, try to find the correspondent and title in the file
+3. If that doesn't work, we try to find the correspondent and title in the file
   name following the pattern:  ``Correspondent - Title.pdf``.
-3. If that doesn't work, just assume that the name of the file is the title.
+4. If that doesn't work, just assume that the name of the file is the title.
 So given the above, the following examples would work as you'd expect:
 * ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
 * ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
 * ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
 * ``Another Company - Letter of Reference.jpg``
 * ``Dad's Recipe for Pancakes.png``
 These however wouldn't work:
 * ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
 * ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
 * ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
 * ``Another Company- Letter of Reference.jpg``
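
The fallback order described above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: the names ``PATTERNS`` and ``guess_attributes`` are hypothetical, only the four documented steps are covered, and the result is a plain dictionary rather than database objects.

```python
import re
from datetime import datetime, timezone

# Shared pieces, modelled on the documented naming convention.
EXT = r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$"
DATE = r"^(?P<created>\d{8}(\d{6})?Z) - "
TAGS = r"(?P<tags>[a-z0-9\-,]*)"

# Ordered from most specific to least specific; the first match wins.
PATTERNS = [
    re.compile(DATE + r"(?P<correspondent>.*) - (?P<title>.*) - " + TAGS + EXT, re.I),
    re.compile(r"^(?P<correspondent>.*) - (?P<title>.*) - " + TAGS + EXT, re.I),
    re.compile(r"^(?P<correspondent>.*) - (?P<title>.*)" + EXT, re.I),
    re.compile(r"^(?P<title>.*)" + EXT, re.I),
]


def guess_attributes(file_name):
    """Return a dict of guessed attributes, or None for unsupported names."""
    for pattern in PATTERNS:
        match = pattern.match(file_name)
        if not match:
            continue
        found = match.groupdict()

        created = found.get("created")
        if created:
            # Drop the trailing "Z" and pad YYYYMMDD to YYYYMMDDHHMMSS.
            padded = created[:-1].ljust(14, "0")
            created = datetime.strptime(padded, "%Y%m%d%H%M%S").replace(
                tzinfo=timezone.utc)

        tags = found.get("tags")
        extension = found["extension"].lower()
        return {
            "created": created,
            "correspondent": found.get("correspondent"),
            "title": found["title"],
            "tags": tuple(tags.split(",")) if tags else (),
            "extension": "jpg" if extension == "jpeg" else extension,
        }
    return None
```

With this sketch, a name such as ``Another Company- Letter of Reference.jpg`` is not rejected; it simply falls through to the last pattern and the whole name becomes the title.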
@@ -128,7 +136,7 @@ following name/value pairs:
  don't start uploading stuff to your server.  The means of generating this
  signature is defined below.
-Specify ``enctype="multipart/form-data"``, and then POST your file with:::
+Specify ``enctype="multipart/form-data"``, and then POST your file with::
    Content-Disposition: form-data; name="document"; filename="whatever.pdf"
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -33,4 +33,5 @@ Contents
   api
   utilities
   migrating
   troubleshooting 
   changelog
--- a/docs/requirements.rst
+++ b/docs/requirements.rst
@@ -8,7 +8,7 @@ should work) that has the following software installed on it:
 * `Python3`_ (with development libraries, pip and virtualenv)
 * `GNU Privacy Guard`_
-* `Tesseract`_
+* `Tesseract`_, plus its language files matching your document base.
 * `Imagemagick`_
 * `unpaper`_
@@ -52,6 +52,7 @@ well as ImageMagick:
    $ brew install ghostscript
    $ brew install imagemagick
    $ brew install libmagic
 .. _requirements-baremetal:
--- a/docs/setup.rst
+++ b/docs/setup.rst
@@ -5,7 +5,8 @@ Setup
 Paperless isn't a very complicated app, but there are a few components, so some
 basic documentation is in order.  If you follow along in this document and
-still have trouble, please open an `issue on GitHub`_ so I can fill in the gaps.
+still have trouble, please open an `issue on GitHub`_ so I can fill in the
 gaps.
 .. _issue on GitHub: https://github.com/danielquinn/paperless/issues
@@ -15,8 +16,8 @@ still have trouble, please open an `issue on GitHub`_ so I can fill in the gaps.
 Download
 --------
-The source is currently only available via GitHub, so grab it from there, either
+The source is currently only available via GitHub, so grab it from there,
-by using ``git``:
+either by using ``git``:
 .. code:: bash
@@ -42,15 +43,16 @@ route`_ is quick & easy, but means you're running a VM which comes with memory
 consumption etc. We also `support Docker`_, which you can use natively under
 Linux and in a VM with `Docker Machine`_ (this guide was written for native
 Docker usage under Linux, you might have to adapt it for Docker Machine.)
-Alternatively the standard, `bare metal`_ approach is a little more complicated,
+Alternatively the standard, `bare metal`_ approach is a little more
-but worth it because it makes it easier to should you want to contribute some
+complicated, but worth it because it makes things easier should you want to
-code back.
+contribute some code back.
 .. _Vagrant route: setup-installation-vagrant_
 .. _support Docker: setup-installation-docker_
 .. _bare metal: setup-installation-standard_
 .. _Docker Machine: https://docs.docker.com/machine/
 .. _setup-installation-standard:
 Standard (Bare Metal)
@@ -58,19 +60,16 @@ Standard (Bare Metal)
 1. Install the requirements as per the :ref:`requirements <requirements>` page.
 2. Change to the ``src`` directory in this repo.
-3. Edit ``paperless/settings.py`` and be sure to set the values for:
+3. Copy ``paperless.conf.example`` to ``/etc/paperless.conf`` and open it in
-    * ``CONSUMPTION_DIR``: this is where your documents will be dumped to be
+   your favourite editor.  Set the values for:
-      consumed by Paperless.
+
-    * ``PASSPHRASE``: this is the passphrase Paperless uses to encrypt/decrypt
+    * ``PAPERLESS_CONSUMPTION_DIR``: this is where your documents will be
-      the original document.  The default value attempts to source the
+      dumped to be consumed by Paperless.
-      passphrase from the environment, so if you don't set it to a static value
+    * ``PAPERLESS_PASSPHRASE``: this is the passphrase Paperless uses to
-      here, you must set ``PAPERLESS_PASSPHRASE=some-secret-string`` on the
+      encrypt/decrypt the original document.
-      command line whenever invoking the consumer or webserver.
+    * ``PAPERLESS_OCR_THREADS``: this is the number of threads the OCR process
-    * ``OCR_THREADS``: this is the number of threads the OCR process will spawn
+      will spawn to process document pages in parallel.
-      to process document pages in parallel. The default value gets sourced from
+
      the environment-variable ``PAPERLESS_OCR_THREADS`` and expects it to be an
      integer. If the variable is not set, Python determines the core-count of
      your CPU and uses that value.
 4. Initialise the database with ``./manage.py migrate``.
 5. Create a user for your Paperless instance with
   ``./manage.py createsuperuser``. Follow the prompts to create your user.
@@ -79,8 +78,8 @@ Standard (Bare Metal)
   You should now be able to visit your (empty) `Paperless webserver`_ at
   ``127.0.0.1:8000`` (or whatever you chose).  You can log in with the
   user/pass you created in #5.
-7. In a separate window, change to the ``src`` directory in this repo again, but
+7. In a separate window, change to the ``src`` directory in this repo again,
-   this time, you should start the consumer script with
+   but this time, you should start the consumer script with
   ``./manage.py document_consumer``.
 8. Scan something.  Put it in the ``CONSUMPTION_DIR``.
 9. Wait a few minutes
@@ -100,6 +99,7 @@ Vagrant Method
   provisioned...
 3. Run ``vagrant ssh`` and once inside your new vagrant box, edit
   ``/etc/paperless.conf`` and set the values for:
    * ``PAPERLESS_CONSUMPTION_DIR``: this is where your documents will be
      dumped to be consumed by Paperless.
    * ``PAPERLESS_PASSPHRASE``: this is the passphrase Paperless uses to
@@ -107,6 +107,7 @@ Vagrant Method
    * ``PAPERLESS_SHARED_SECRET``: this is the "magic word" used when consuming
      documents from mail or via the API.  If you don't use either, leaving it
      blank is just fine.
 4. Exit the vagrant box and re-enter it with ``vagrant ssh`` again.  This
   updates the environment to make use of the changes you made to the config
   file.
@@ -140,9 +141,9 @@ Docker Method
   .. caution::
      As mentioned earlier, this guide assumes that you use Docker natively
-      under Linux. If you are using `Docker Machine`_ under Mac OS X or Windows,
+      under Linux. If you are using `Docker Machine`_ under Mac OS X or
-      you will have to adapt IP addresses, volume-mounting, command execution
+      Windows, you will have to adapt IP addresses, volume-mounting, command
-      and maybe more.
+      execution and maybe more.
 2. Install `docker-compose`_. [#compose]_
@@ -161,14 +162,14 @@ Docker Method
       .. _Docker installation guide: https://docs.docker.com/engine/installation/
       .. _docker-compose installation guide: https://docs.docker.com/compose/install/
-3. Create a copy of ``docker-compose.yml.example`` as ``docker-compose.yml`` and
+3. Create a copy of ``docker-compose.yml.example`` as ``docker-compose.yml``
-   a copy of ``docker-compose.env.example`` as ``docker-compose.env``. You'll be
+   and a copy of ``docker-compose.env.example`` as ``docker-compose.env``.
-   editing both these files: taking a copy ensures that you can ``git pull`` to
+   You'll be editing both these files: taking a copy ensures that you can
-   receive updates without risking merge conflicts with your modified versions
+   ``git pull`` to receive updates without risking merge conflicts with your
-   of the configuration files.
+   modified versions of the configuration files.
-4. Modify ``docker-compose.yml`` to your preferences, following the instructions
+4. Modify ``docker-compose.yml`` to your preferences, following the
-   in comments in the file. The only change that is a hard requirement is to
+   instructions in comments in the file. The only change that is a hard
-   specify where the consumption directory should mount.
+   requirement is to specify where the consumption directory should mount.
 5. Modify ``docker-compose.env`` and adapt the following environment variables:
   ``PAPERLESS_PASSPHRASE``
@@ -181,10 +182,11 @@ Docker Method
     the core-count of your CPU and uses that value.
   ``PAPERLESS_OCR_LANGUAGES``
-     If you want the OCR to recognize other languages in addition to the default
+     If you want the OCR to recognize other languages in addition to the
-     English, set this parameter to a space separated list of three-letter
+     default English, set this parameter to a space separated list of
-     language-codes after `ISO 639-2/T`_. For a list of available languages --
+     three-letter language-codes after `ISO 639-2/T`_. For a list of available
-     including their three letter codes -- see the `Debian packagelist`_.
+     languages -- including their three letter codes -- see the
     `Debian packagelist`_.
   ``USERMAP_UID`` and ``USERMAP_GID``
     If you want to mount the consumption volume (directory ``/consume`` within
@@ -192,11 +194,11 @@ Docker Method
     access rights might be an issue. The default user and group ``paperless``
     in the containers have an id of 1000. The containers will enforce that the
     owning group of the consumption directory will be ``paperless`` to be able
-     to delete consumed documents. If your host-system has a group with an id of
+     to delete consumed documents. If your host-system has a group with an ID
-     1000 and you don't want this group to have access rights to the consumption
+     of 1000 and you don't want this group to have access rights to the
-     directory, you can use ``USERMAP_GID`` to change the id in the container
+     consumption directory, you can use ``USERMAP_GID`` to change the id in the
-     and thus the one of the consumption directory. Furthermore, you can change
+     container and thus the one of the consumption directory. Furthermore, you
-     the id of the default user as well using ``USERMAP_UID``.
+     can change the id of the default user as well using ``USERMAP_UID``.
 6. Run ``docker-compose up -d``. This will create and start the necessary
   containers.
@@ -234,14 +236,14 @@ Docker Method
      .. danger::
          While the consumption container will ensure at startup that it can
-          **delete** a consumed file from a host-mounted directory, it might not
+          **delete** a consumed file from a host-mounted directory, it might
-          be able to **read** the document in the first place if the access
+          not be able to **read** the document in the first place if the access
          rights to the file are incorrect.
          Make sure that the documents you put into the consumption directory
          will either be readable by everyone (``chmod o+r file.pdf``) or
-          readable by the default user or group id 1000 (or the one you have set
+          readable by the default user or group id 1000 (or the one you have
-          with ``USERMAP_UID`` or ``USERMAP_GID`` respectively).
+          set with ``USERMAP_UID`` or ``USERMAP_GID`` respectively).
   2. Use ``docker cp`` to copy your files directly into the container:
@@ -258,8 +260,8 @@ Docker Method
      ``docker cp`` is a one-shot-command, just like ``cp``. This means that
      every time you want to consume a new document, you will have to execute
-      ``docker cp`` again. You can of course automate this process, but option 1
+      ``docker cp`` again. You can of course automate this process, but option
-      is generally the preferred one.
+      1 is generally the preferred one.
      .. danger::
@@ -267,8 +269,8 @@ Docker Method
          to the acting user at the destination, which will be ``root``.
          You therefore need to ensure that the documents you want to copy into
-          the container are readable by everyone (``chmod o+r file.pdf``) before
+          the container are readable by everyone (``chmod o+r file.pdf``)
-          copying them.
+          before copying them.
 .. _Docker: https://www.docker.com/
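
The permission requirement in the warnings above can be checked from Python before a file is dropped into the consumption directory. This is a rough sketch using only the standard library; the helper names ``readable_by_others`` and ``make_readable_by_others`` are hypothetical and not part of Paperless.

```python
import os
import stat


def readable_by_others(path):
    """Return True if the "others" read bit is set on path."""
    return bool(os.stat(path).st_mode & stat.S_IROTH)


def make_readable_by_others(path):
    """Equivalent of ``chmod o+r``: add the read bit for "others" only."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | stat.S_IROTH)
```

This only covers the "readable by everyone" route; checking access for a specific user or group id would need a comparison against the file's owner and group instead.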
@@ -281,17 +283,108 @@ Docker Method
   free to tinker around without using compose!
-.. _making-things-a-little-more-permanent:
+.. _setup-permanent:
 Making Things a Little more Permanent
 -------------------------------------
-Once you've tested things and are happy with the work flow, you can automate the
+Once you've tested things and are happy with the work flow, you can automate
-process of starting the webserver and consumer automatically.  If you're running
+the process of starting the webserver and consumer automatically.
-on a bare metal system that's using Systemd, you can use the service unit files
+
-in the ``scripts`` directory to set this up.  If you're on another startup
+
-system or are using a Vagrant box, then you're currently on your own. If you are
+.. _setup-permanent-standard-systemd:
-using Docker, you can set a restart-policy_ in the ``docker-compose.yml`` to
+
-have the containers automatically start with the Docker daemon.
+Standard (Bare Metal, Systemd)
 ..............................
 If you're running on a bare metal system that's using Systemd, you can use the
 service unit files in the ``scripts`` directory to set this up.  You'll need to
 create a user called ``paperless`` and set up Paperless in a place that
 this new user can read and write to.  Then, you can just tell Systemd to enable
 the two ``.service`` files::
    # systemctl enable /path/to/paperless/scripts/paperless-consumer.service
    # systemctl enable /path/to/paperless/scripts/paperless-webserver.service
    # systemctl start paperless-consumer.service
    # systemctl start paperless-webserver.service
 .. _setup-permanent-standard-ubuntu14:
 Ubuntu 14.04 (Bare Metal, Upstart)
 ..................................
 Ubuntu 14.04 and earlier use the `Upstart`_ init system to start services
 during the boot process. To configure Upstart to run Paperless automatically
 after restarting your system:
 1. Change to the directory where Upstart's configuration files are kept:
   ``cd /etc/init``
 2. Create a new file: ``sudo nano paperless-server.conf``
 3. In the newly-created file enter::
    start on (local-filesystems and net-device-up IFACE=eth0)
    stop on shutdown
    respawn
    respawn limit 10 5
    script
      exec /srv/paperless/src/manage.py runserver 0.0.0.0:80
    end script
   Note that you'll need to replace ``/srv/paperless/src/manage.py`` with the
   path to the ``manage.py`` script in your installation directory.
  If you are using a network interface other than ``eth0``, you will have to
  change ``IFACE=eth0``. For example, if you are connected via WiFi, you will
  likely need to replace ``eth0`` above with ``wlan0``. To see all interfaces,
  run ``ifconfig``.
  Save the file.
 4. Create a new file: ``sudo nano paperless-consumer.conf``
 5. In the newly-created file enter::
    start on (local-filesystems and net-device-up IFACE=eth0)
    stop on shutdown
    respawn
    respawn limit 10 5
    script
      exec /srv/paperless/src/manage.py document_consumer
    end script
  Replace ``/srv/paperless/src/manage.py`` with the same values as in step 3
  above and replace ``eth0`` with the appropriate value, if necessary. Save the
  file.
 These two configuration files together will start both the Paperless webserver
 and document consumer processes once the file system and the specified network
 interface are available after boot. Furthermore, if either process ever exits
 unexpectedly, Upstart will try to restart it a maximum of 10 times within a 5
 second period.
 .. _Upstart: http://upstart.ubuntu.com/
 .. _setup-permanent-vagrant:
 Vagrant
 .......
 You're currently on your own, but the Ubuntu explanation above may be enough.
 .. _setup-permanent-docker:
 Docker
 ......
 If you're using Docker, you can set a restart-policy_ in the
 ``docker-compose.yml`` to have the containers automatically start with the
 Docker daemon.
 .. _restart-policy: https://docs.docker.com/engine/reference/commandline/run/#restart-policies-restart
--- a/docs/troubleshooting.rst
+++ b/docs/troubleshooting.rst
@@ -0,0 +1,19 @@
 .. _troubleshooting:
 Troubleshooting
 ===============
 .. _troubleshooting_ocr_language_files_missing:
 Consumer warns ``OCR for XX failed``
 ------------------------------------
 If you find the OCR accuracy to be too low, and/or the document consumer warns that ``OCR for
 XX failed, but we're going to stick with what we've got since FORGIVING_OCR is enabled``, then you
 might need to install the `Tesseract language files
 <http://packages.ubuntu.com/search?keywords=tesseract-ocr>`_ matching your documents' languages.
 As an example, if you are running Paperless from the Vagrant setup provided (or from any Ubuntu or Debian
 box), and your documents are written in Spanish, you may need to run::
    apt-get install -y tesseract-ocr-spa
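
As a rough sketch, the package names for several languages can be derived from a space separated list of three-letter codes, as used for ``PAPERLESS_OCR_LANGUAGES``. The helper ``tesseract_packages`` is hypothetical, and the ``tesseract-ocr-<code>`` naming is extrapolated from the single example above, so check the package list of your distribution before relying on it.

```python
def tesseract_packages(ocr_languages):
    """Map a space separated list of language codes to package names.

    Assumes the Debian/Ubuntu naming scheme ``tesseract-ocr-<code>``.
    """
    return [
        "tesseract-ocr-{}".format(code.lower())
        for code in ocr_languages.split()
    ]


if __name__ == "__main__":
    # Print an install command for Spanish and German language files.
    print("apt-get install -y " + " ".join(tesseract_packages("spa deu")))
```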
--- a/paperless.conf.example
+++ b/paperless.conf.example
@@ -20,7 +20,7 @@ PAPERLESS_CONSUME_MAIL_PASS=""
 #
 # The passphrase you use here will be used when storing your documents in
 # Paperless, but you can always export them in an unencrypted format by using
-# document exporter.  See the documentaiton for more information.
+# document exporter.  See the documentation for more information.
 #
 # One final note about the passphrase.  Once you've consumed a document with
 # one passphrase, DON'T CHANGE IT.  Paperless assumes this to be a constant and
@@ -31,3 +31,8 @@ PAPERLESS_PASSPHRASE="secret"
 # If you intend to consume documents either via HTTP POST or by email, you must
 # have a shared secret here.
 PAPERLESS_SHARED_SECRET=""
 # By default, Paperless will attempt to use all available CPU cores to process
 # a document, but if you would like to limit that, you can set this value to
 # an integer:
 #PAPERLESS_OCR_THREADS=1
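
The default described in the comment above can be pictured with a short sketch. It assumes the value is read from the environment as ``PAPERLESS_OCR_THREADS``; the function name ``ocr_threads`` is hypothetical, and treating an empty value like an unset one is an assumption rather than documented behaviour.

```python
import multiprocessing
import os


def ocr_threads(environ=os.environ):
    """Return the number of OCR threads to use."""
    raw = environ.get("PAPERLESS_OCR_THREADS")
    if not raw:
        # Unset (or empty, by assumption): fall back to the CPU core count.
        return multiprocessing.cpu_count()
    # The setting is expected to be an integer; int() raises ValueError
    # for anything else.
    return int(raw)
```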
--- a/presentation/img/kitten.jpg
+++ b/presentation/img/kitten.jpg
--- a/presentation/index.html
+++ b/presentation/index.html
@@ -148,12 +148,12 @@
        <section data-background="img/pony.png">
          <h2>Demo!</h2>
-          <p>(Time to sacrifice a kitten)</p>
+          <img src="img/kitten.jpg" style="width: 50%;" />
        </section>
        <section>
          <h2>TODO</h2>
-          <p>It works, but it could use polish</p>
+          <p>It works, but it needs polish</p>
          <ul>
            <li>The UI is the Django admin</li>
            <li>Mail consumption is really raw</li>
@@ -163,11 +163,11 @@
          <aside class="notes">
            <ul>
              <li>
-                <strong>Plugin architecture</strong>: there've been requests for
+                <strong>Plugin architecture</strong>: there've been requests
-                some overly custom stuff to happen before and after consumption,
+                for some overly custom stuff to happen before and after
-                but in the UNIX spirit of "do one job well", I think this sort
+                consumption, but in the UNIX spirit of "do one job well", I
-                of thing is better written as a plugin -- which means I need to
+                think this sort of thing is better written as a plugin -- which
-                figure out a best practise for that.
+                means I need to figure out a best practise for that.
              </li>
            </ul>
          </aside>
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,4 +1,4 @@
-Django==1.9.2
+Django==1.9.4
 Pillow==3.1.1
 django-crispy-forms==1.6.0
 django-extensions==1.6.1
--- a/src/documents/consumer.py
+++ b/src/documents/consumer.py
@@ -19,12 +19,11 @@ from PIL import Image
 from django.conf import settings
 from django.utils import timezone
 from django.template.defaultfilters import slugify
 from pyocr.tesseract import TesseractError
 from paperless.db import GnuPG
-from .models import Correspondent, Tag, Document, Log
+from .models import Tag, Document, Log, FileInfo
 from .languages import ISO639
 from .signals import (
    document_consumption_started, document_consumption_finished)
@@ -56,19 +55,6 @@ class Consumer(object):
    DEFAULT_OCR_LANGUAGE = settings.OCR_LANGUAGE
    REGEX_TITLE = re.compile(
        r"^.*/(.*)\.(pdf|jpe?g|png|gif|tiff)$",
        flags=re.IGNORECASE
    )
    REGEX_CORRESPONDENT_TITLE = re.compile(
        r"^.*/(.+) - (.*)\.(pdf|jpe?g|png|gif|tiff)$",
        flags=re.IGNORECASE
    )
    REGEX_CORRESPONDENT_TITLE_TAGS = re.compile(
        r"^.*/(.*) - (.*) - ([a-z0-9\-,]*)\.(pdf|jpe?g|png|gif|tiff)$",
        flags=re.IGNORECASE
    )
    def __init__(self):
        self.logger = logging.getLogger(__name__)
@@ -107,7 +93,7 @@ class Consumer(object):
            if not os.path.isfile(doc):
                continue
-            if not re.match(self.REGEX_TITLE, doc):
+            if not re.match(FileInfo.REGEXES["title"], doc):
                continue
            if doc in self._ignore:
@@ -282,72 +268,20 @@ class Consumer(object):
        # Strip out excess white space to allow matching to go smoother
        return re.sub(r"\s+", " ", r)
    def _guess_attributes_from_name(self, parseable):
        """
        We use a crude naming convention to make handling the correspondent,
        title, and tags easier:
          "<correspondent> - <title> - <tags>.<suffix>"
          "<correspondent> - <title>.<suffix>"
          "<title>.<suffix>"
        """
        def get_correspondent(correspondent_name):
            return Correspondent.objects.get_or_create(
                name=correspondent_name,
                defaults={"slug": slugify(correspondent_name)}
            )[0]
        def get_tags(tags):
            r = []
            for t in tags.split(","):
                r.append(
                    Tag.objects.get_or_create(slug=t, defaults={"name": t})[0])
            return tuple(r)
        def get_suffix(suffix):
            suffix = suffix.lower()
            if suffix == "jpeg":
                return "jpg"
            return suffix
        # First attempt: "<correspondent> - <title> - <tags>.<suffix>"
        m = re.match(self.REGEX_CORRESPONDENT_TITLE_TAGS, parseable)
        if m:
            return (
                get_correspondent(m.group(1)),
                m.group(2),
                get_tags(m.group(3)),
                get_suffix(m.group(4))
            )
        # Second attempt: "<correspondent> - <title>.<suffix>"
        m = re.match(self.REGEX_CORRESPONDENT_TITLE, parseable)
        if m:
            return (
                get_correspondent(m.group(1)),
                m.group(2),
                (),
                get_suffix(m.group(3))
            )
        # That didn't work, so we assume correspondent and tags are None
        m = re.match(self.REGEX_TITLE, parseable)
        return None, m.group(1), (), get_suffix(m.group(2))
    def _store(self, text, doc, thumbnail):
-        sender, title, tags, file_type = self._guess_attributes_from_name(doc)
+        file_info = FileInfo.from_path(doc)
-        relevant_tags = set(list(Tag.match_all(text)) + list(tags))
+        relevant_tags = set(list(Tag.match_all(text)) + list(file_info.tags))
        stats = os.stat(doc)
        self.log("debug", "Saving record to database")
        document = Document.objects.create(
-            correspondent=sender,
+            correspondent=file_info.correspondent,
-            title=title,
+            title=file_info.title,
            content=text,
-            file_type=file_type,
+            file_type=file_info.extension,
            created=timezone.make_aware(
                datetime.datetime.fromtimestamp(stats.st_mtime)),
            modified=timezone.make_aware(
--- a/src/documents/management/commands/document_exporter.py
+++ b/src/documents/management/commands/document_exporter.py
@@ -96,11 +96,16 @@ class Command(Renderable, BaseCommand):
    @staticmethod
    def _get_legacy_file_name(doc):
-        if doc.correspondent and doc.title:
+
-            tags = ",".join([t.slug for t in doc.tags.all()])
+        if not doc.correspondent and not doc.title:
-            if tags:
+            return os.path.basename(doc.source_path)
-                return "{} - {} - {}.{}".format(
+
-                    doc.correspondent, doc.title, tags, doc.file_type)
+        created = doc.created.strftime("%Y%m%d%H%M%SZ")
-            return "{} - {}.{}".format(
+        tags = ",".join([t.slug for t in doc.tags.all()])
-                doc.correspondent, doc.title, doc.file_type)
+
-        return os.path.basename(doc.source_path)
+        if tags:
            return "{} - {} - {} - {}.{}".format(
                created, doc.correspondent, doc.title, tags, doc.file_type)
        return "{} - {} - {}.{}".format(
            created, doc.correspondent, doc.title, doc.file_type)
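
To see what the new export names look like, here is a standalone sketch of the same formatting. It takes plain values instead of a ``Document`` instance, and the function name ``legacy_file_name`` and its parameters are hypothetical.

```python
from datetime import datetime, timezone


def legacy_file_name(created, correspondent, title, tag_slugs, file_type):
    """Build "<created> - <correspondent> - <title>[ - <tags>].<type>"."""
    stamp = created.strftime("%Y%m%d%H%M%SZ")
    tags = ",".join(tag_slugs)
    if tags:
        return "{} - {} - {} - {}.{}".format(
            stamp, correspondent, title, tags, file_type)
    return "{} - {} - {}.{}".format(stamp, correspondent, title, file_type)


if __name__ == "__main__":
    print(legacy_file_name(
        datetime(2015, 3, 14, 0, 7, 0, tzinfo=timezone.utc),
        "Some Company Name", "Invoice 2016-01-01",
        ["money", "invoices"], "pdf"))
```

The resulting name follows the ``Date - Correspondent - Title - tag,tag,tag.pdf`` convention, so an exported file can be fed back to the consumer with its date intact.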
--- a/src/documents/models.py
+++ b/src/documents/models.py
@@ -1,8 +1,11 @@
 import dateutil.parser
 import logging
 import os
 import re
 import uuid
 from collections import OrderedDict
 from django.conf import settings
 from django.core.urlresolvers import reverse
 from django.db import models
@@ -152,7 +155,7 @@ class Document(models.Model):
    )
    tags = models.ManyToManyField(
        Tag, related_name="documents", blank=True)
-    created = models.DateTimeField(default=timezone.now, editable=False)
+    created = models.DateTimeField(default=timezone.now)
    modified = models.DateTimeField(auto_now=True, editable=False)
    class Meta(object):
@@ -250,3 +253,136 @@ class Log(models.Model):
            self.group = uuid.uuid4()
        models.Model.save(self, *args, **kwargs)
 class FileInfo(object):
    # This epic regex *almost* worked for our needs, so I'm keeping it here for
    # posterity, in the hopes that we might find a way to make it work one day.
    ALMOST_REGEX = re.compile(
        r"^((?P<date>\d\d\d\d\d\d\d\d\d\d\d\d\d\dZ){separator})?"
        r"((?P<correspondent>{non_separated_word}+){separator})??"
        r"(?P<title>{non_separated_word}+)"
        r"({separator}(?P<tags>[a-z,0-9-]+))?"
        r"\.(?P<extension>[a-zA-Z.-]+)$".format(
            separator=r"\s+-\s+",
            non_separated_word=r"([\w,. ]|([^\s]-))"
        )
    )
    REGEXES = OrderedDict([
        ("created-correspondent-title-tags", re.compile(
            r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - "
            r"(?P<correspondent>.*) - "
            r"(?P<title>.*) - "
            r"(?P<tags>[a-z0-9\-,]*)"
            r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
            flags=re.IGNORECASE
        )),
        ("created-title-tags", re.compile(
            r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - "
            r"(?P<title>.*) - "
            r"(?P<tags>[a-z0-9\-,]*)"
            r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
            flags=re.IGNORECASE
        )),
        ("created-correspondent-title", re.compile(
            r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - "
            r"(?P<correspondent>.*) - "
            r"(?P<title>.*)"
            r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
            flags=re.IGNORECASE
        )),
        ("created-title", re.compile(
            r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - "
            r"(?P<title>.*)"
            r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
            flags=re.IGNORECASE
        )),
        ("correspondent-title-tags", re.compile(
            r"(?P<correspondent>.*) - "
            r"(?P<title>.*) - "
            r"(?P<tags>[a-z0-9\-,]*)"
            r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
            flags=re.IGNORECASE
        )),
        ("correspondent-title", re.compile(
            r"(?P<correspondent>.*) - "
            r"(?P<title>.*)?"
            r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
            flags=re.IGNORECASE
        )),
        ("title", re.compile(
            r"(?P<title>.*)"
            r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
            flags=re.IGNORECASE
        ))
    ])
    def __init__(self, created=None, correspondent=None, title=None, tags=(),
                 extension=None):
        self.created = created
        self.title = title
        self.extension = extension
        self.correspondent = correspondent
        self.tags = tags
    @classmethod
    def _get_created(cls, created):
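        # Drop the trailing "Z", right-pad date-only values with zeros to the
        # full 14 digits (YYYYMMDDHHMMSS), then restore the "Z" for parsing.
        return dateutil.parser.parse("{:0<14}Z".format(created[:-1]))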
    @classmethod
    def _get_correspondent(cls, name):
        if not name:
            return None
        return Correspondent.objects.get_or_create(name=name, defaults={
            "slug": slugify(name)
        })[0]
    @classmethod
    def _get_title(cls, title):
        return title
    @classmethod
    def _get_tags(cls, tags):
        r = []
        for t in tags.split(","):
            r.append(
                Tag.objects.get_or_create(slug=t, defaults={"name": t})[0])
        return tuple(r)
    @classmethod
    def _get_extension(cls, extension):
        r = extension.lower()
        if r == "jpeg":
            return "jpg"
        return r
    @classmethod
    def _mangle_property(cls, properties, name):
        if name in properties:
            properties[name] = getattr(cls, "_get_{}".format(name))(
                properties[name]
            )
    @classmethod
    def from_path(cls, path):
        """
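        We use a crude naming convention to make handling the creation date,
        correspondent, title, and tags easier.  The patterns in REGEXES are
        tried in this order and the first match wins:
          "<created> - <correspondent> - <title> - <tags>.<suffix>"
          "<created> - <title> - <tags>.<suffix>"
          "<created> - <correspondent> - <title>.<suffix>"
          "<created> - <title>.<suffix>"
          "<correspondent> - <title> - <tags>.<suffix>"
          "<correspondent> - <title>.<suffix>"
          "<title>.<suffix>"
        If no pattern matches, None is returned.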
        """
        for regex in cls.REGEXES.values():
            m = regex.match(os.path.basename(path))
            if m:
                properties = m.groupdict()
                cls._mangle_property(properties, "created")
                cls._mangle_property(properties, "correspondent")
                cls._mangle_property(properties, "title")
                cls._mangle_property(properties, "tags")
                cls._mangle_property(properties, "extension")
                return cls(**properties)
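The ordered fall-through in `FileInfo.from_path` can be illustrated with a minimal, standalone sketch. It is not the project's code: the names `PATTERNS`, `pad_created`, and `parse_name` are hypothetical, the tag-bearing patterns and the ORM lookups are left out, and `datetime.strptime` stands in for `dateutil.parser.parse` so that only the standard library is needed.

```python
import re
from collections import OrderedDict
from datetime import datetime

# Shared fragments (hypothetical helpers, not part of the diff above).
EXT = r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$"
CREATED = r"^(?P<created>\d{8}(\d{6})?Z) - "

# Most specific pattern first: a later pattern would also match many of the
# names an earlier one accepts, so the order decides how a name is split.
PATTERNS = OrderedDict([
    ("created-correspondent-title", re.compile(
        CREATED + r"(?P<correspondent>.*) - (?P<title>.*)" + EXT,
        flags=re.IGNORECASE)),
    ("created-title", re.compile(
        CREATED + r"(?P<title>.*)" + EXT, flags=re.IGNORECASE)),
    ("correspondent-title", re.compile(
        r"(?P<correspondent>.*) - (?P<title>.*)" + EXT,
        flags=re.IGNORECASE)),
    ("title", re.compile(r"(?P<title>.*)" + EXT, flags=re.IGNORECASE)),
])


def pad_created(created):
    # Same padding idea as _get_created: strip the trailing "Z" and
    # right-pad a date-only value with zeros to 14 digits.
    return "{:0<14}".format(created[:-1])


def parse_name(name):
    # Try each pattern in order; the first match wins.
    for label, regex in PATTERNS.items():
        m = regex.match(name)
        if m:
            props = m.groupdict()
            if "created" in props:
                props["created"] = datetime.strptime(
                    pad_created(props["created"]), "%Y%m%d%H%M%S")
            return label, props
    # Nothing matched (for example, an unsupported extension).
    return None, {}
```

With this ordering, `parse_name("20150102Z - Title.pdf")` is handled by the `created-title` pattern. If `correspondent-title` were tried first, the date string would be taken for a correspondent.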
--- a/src/documents/tests/test_consumer.py
+++ b/src/documents/tests/test_consumer.py
@@ -1,29 +1,36 @@
 from django.test import TestCase
-from ..consumer import Consumer
+from ..models import Document, FileInfo
 class TestAttachment(TestCase):
    TAGS = ("tag1", "tag2", "tag3")
-    CONSUMER = Consumer()
-    SUFFIXES = (
+    EXTENSIONS = (
        "pdf", "png", "jpg", "jpeg", "gif",
        "PDF", "PNG", "JPG", "JPEG", "GIF",
        "PdF", "PnG", "JpG", "JPeG", "GiF",
    )
    def _test_guess_attributes_from_name(self, path, sender, title, tags):
-        for suffix in self.SUFFIXES:
-            f = path.format(suffix)
-            results = self.CONSUMER._guess_attributes_from_name(f)
-            self.assertEqual(results[0].name, sender, f)
-            self.assertEqual(results[1], title, f)
-            self.assertEqual(tuple([t.slug for t in results[2]]), tags, f)
-            if suffix.lower() == "jpeg":
-                self.assertEqual(results[3], "jpg", f)
-            else:
-                self.assertEqual(results[3], suffix.lower(), f)
+
+        for extension in self.EXTENSIONS:
+
+            f = path.format(extension)
+            file_info = FileInfo.from_path(f)
+
+            if sender:
+                self.assertEqual(file_info.correspondent.name, sender, f)
+            else:
+                self.assertIsNone(file_info.correspondent, f)
            self.assertEqual(file_info.title, title, f)
            self.assertEqual(tuple([t.slug for t in file_info.tags]), tags, f)
            if extension.lower() == "jpeg":
                self.assertEqual(file_info.extension, "jpg", f)
            else:
                self.assertEqual(file_info.extension, extension.lower(), f)
    def test_guess_attributes_from_name0(self):
        self._test_guess_attributes_from_name(
@@ -92,3 +99,206 @@ class TestAttachment(TestCase):
            "Τιτλε",
            self.TAGS
        )
    def test_guess_attributes_from_name_when_correspondent_empty(self):
        self._test_guess_attributes_from_name(
            '/path/to/ - weird empty correspondent but should not break.{}',
            None,
            'weird empty correspondent but should not break',
            ()
        )
    def test_guess_attributes_from_name_when_title_starts_with_dash(self):
        self._test_guess_attributes_from_name(
            '/path/to/- weird but should not break.{}',
            None,
            '- weird but should not break',
            ()
        )
    def test_guess_attributes_from_name_when_title_ends_with_dash(self):
        self._test_guess_attributes_from_name(
            '/path/to/weird but should not break -.{}',
            None,
            'weird but should not break -',
            ()
        )
    def test_guess_attributes_from_name_when_title_is_empty(self):
        self._test_guess_attributes_from_name(
            '/path/to/weird correspondent but should not break - .{}',
            'weird correspondent but should not break',
            '',
            ()
        )
 class Permutations(TestCase):
    valid_dates = (
        "20150102030405Z",
        "20150102Z",
    )
    valid_correspondents = [
        "timmy",
        "Dr. McWheelie",
        "Dash Gor-don",
        "ο Θερμαστής",
        ""
    ]
    valid_titles = ["title", "Title w Spaces", "Title a-dash", "Τίτλος", ""]
    valid_tags = ["tag", "tig,tag", "tag1,tag2,tag-3"]
    valid_extensions = ["pdf", "png", "jpg", "jpeg", "gif"]
    def _test_guessed_attributes(self, filename, created=None,
                                 correspondent=None, title=None,
                                 extension=None, tags=None):
        # print(filename)
        info = FileInfo.from_path(filename)
        # Created
        if created is None:
            self.assertIsNone(info.created, filename)
        else:
            self.assertEqual(info.created.year, int(created[:4]), filename)
            self.assertEqual(info.created.month, int(created[4:6]), filename)
            self.assertEqual(info.created.day, int(created[6:8]), filename)
        # Correspondent
        if correspondent:
            self.assertEqual(info.correspondent.name, correspondent, filename)
        else:
            self.assertEqual(info.correspondent, None, filename)
        # Title
        self.assertEqual(info.title, title, filename)
        # Tags
        if tags is None:
            self.assertEqual(info.tags, (), filename)
        else:
            self.assertEqual(
                [t.slug for t in info.tags], tags.split(','),
                filename
            )
        # Extension
        if extension == 'jpeg':
            extension = 'jpg'
        self.assertEqual(info.extension, extension, filename)
    def test_just_title(self):
        template = '/path/to/{title}.{extension}'
        for title in self.valid_titles:
            for extension in self.valid_extensions:
                spec = dict(title=title, extension=extension)
                filename = template.format(**spec)
                self._test_guessed_attributes(filename, **spec)
    def test_title_and_correspondent(self):
        template = '/path/to/{correspondent} - {title}.{extension}'
        for correspondent in self.valid_correspondents:
            for title in self.valid_titles:
                for extension in self.valid_extensions:
                    spec = dict(correspondent=correspondent, title=title,
                                extension=extension)
                    filename = template.format(**spec)
                    self._test_guessed_attributes(filename, **spec)
    def test_title_and_correspondent_and_tags(self):
        template = '/path/to/{correspondent} - {title} - {tags}.{extension}'
        for correspondent in self.valid_correspondents:
            for title in self.valid_titles:
                for tags in self.valid_tags:
                    for extension in self.valid_extensions:
                        spec = dict(correspondent=correspondent, title=title,
                                    tags=tags, extension=extension)
                        filename = template.format(**spec)
                        self._test_guessed_attributes(filename, **spec)
    def test_created_and_correspondent_and_title_and_tags(self):
        template = ("/path/to/{created} - "
                    "{correspondent} - "
                    "{title} - "
                    "{tags}"
                    ".{extension}")
        for created in self.valid_dates:
            for correspondent in self.valid_correspondents:
                for title in self.valid_titles:
                    for tags in self.valid_tags:
                        for extension in self.valid_extensions:
                            spec = {
                                "created": created,
                                "correspondent": correspondent,
                                "title": title,
                                "tags": tags,
                                "extension": extension
                            }
                            self._test_guessed_attributes(
                                template.format(**spec), **spec)
    def test_created_and_correspondent_and_title(self):
        template = ("/path/to/{created} - "
                    "{correspondent} - "
                    "{title}"
                    ".{extension}")
        for created in self.valid_dates:
            for correspondent in self.valid_correspondents:
                for title in self.valid_titles:
                    # Skip cases where title looks like a tag as we can't
                    # accommodate such cases.
                    if title.lower() == title:
                        continue
                    for extension in self.valid_extensions:
                        spec = {
                            "created": created,
                            "correspondent": correspondent,
                            "title": title,
                            "extension": extension
                        }
                        self._test_guessed_attributes(
                            template.format(**spec), **spec)
    def test_created_and_title(self):
        template = ("/path/to/{created} - "
                    "{title}"
                    ".{extension}")
        for created in self.valid_dates:
            for title in self.valid_titles:
                for extension in self.valid_extensions:
                    spec = {
                        "created": created,
                        "title": title,
                        "extension": extension
                    }
                    self._test_guessed_attributes(
                        template.format(**spec), **spec)
    def test_created_and_title_and_tags(self):
        template = ("/path/to/{created} - "
                    "{title} - "
                    "{tags}"
                    ".{extension}")
        for created in self.valid_dates:
            for title in self.valid_titles:
                for tags in self.valid_tags:
                    for extension in self.valid_extensions:
                        spec = {
                            "created": created,
                            "title": title,
                            "tags": tags,
                            "extension": extension
                        }
                        self._test_guessed_attributes(
                            template.format(**spec), **spec)