Merge branch 'master' into issue/81

2025-07-22 17:54:40 -05:00 · 2016-03-25 20:56:30 +00:00 · 2016-03-25 20:56:30 +00:00 · 49b56425e8
commit 49b56425e8
parent 1170139127 396ff98b41
16 changed files with 598 additions and 167 deletions
--- a/README.rst
+++ b/README.rst
@ -24,8 +24,11 @@ How it Works

 1. Buy a document scanner like `this one`_.
 2. Set it up to "scan to FTP" or something similar. It should be able to push
-   scanned images to a server without you having to do anything.
-3. Have the target server run the *Paperless* consumption script to OCR the PDF
+   scanned images to a server without you having to do anything.  If your
+   scanner doesn't know how to automatically upload the file somewhere, you can
+   always do that manually.  Paperless doesn't care how the documents get into
+   its local consumption directory.
+3. Have the target server run the Paperless consumption script to OCR the PDF
   and index it into a local database.
 4. Use the web frontend to sift through the database and find what you want.
 5. Download the PDF you need/want via the web interface and do whatever you
@ -56,7 +59,7 @@ powerful tools.

 * `ImageMagick`_ converts the images between colour and greyscale.
 * `Tesseract`_ does the character recognition.
-* `Unpaper`_ despeckles and and deskews the scanned image.
+* `Unpaper`_ despeckles and deskews the scanned image.
 * `GNU Privacy Guard`_ is used as the encryption backend.
 * `Python 3`_ is the language of the project.

--- a/docker-compose.yml.example
+++ b/docker-compose.yml.example
@ -11,6 +11,10 @@ services:
            - data:/usr/src/paperless/data
            - media:/usr/src/paperless/media
        env_file: docker-compose.env
+        # The webserver doesn't do any text recognition, so the empty value
+        # below overrides whatever languages the user set in the env-file and
+        # keeps this container from installing language packs it doesn't
+        # need.
        environment:
            - PAPERLESS_OCR_LANGUAGES=
        command: ["runserver", "0.0.0.0:8000"]
--- a/docs/changelog.rst
+++ b/docs/changelog.rst
@ -1,6 +1,15 @@
 Changelog
 #########

+* 0.2.0
+
+  * Added support for guessing the date from the file name along with the
+    correspondent, title, and tags.  Thanks to `Tikitu de Jager`_ for his pull
+    request that I took forever to merge and to `Pit`_ for his efforts on the
+    regex front.
+  * `#94`_: Restored support for changing the created date in the UI.  Thanks
+    to `Martin Honermeyer`_ and `Tim White`_ for working with me on this.
+
 * 0.1.1

  * Potentially **Breaking Change**: All references to "sender" in the code
@ -86,6 +95,8 @@ Changelog
 .. _Wayne Werner: https://github.com/waynew
 .. _darkmatter: https://github.com/darkmatter
 .. _zedster: https://github.com/zedster
+.. _Martin Honermeyer: https://github.com/djmaze
+.. _Tim White: https://github.com/timwhite

 .. _#20: https://github.com/danielquinn/paperless/issues/20
 .. _#44: https://github.com/danielquinn/paperless/issues/44
@ -99,3 +110,4 @@ Changelog
 .. _#67: https://github.com/danielquinn/paperless/issues/67
 .. _#68: https://github.com/danielquinn/paperless/issues/68
 .. _#71: https://github.com/danielquinn/paperless/issues/71
+.. _#94: https://github.com/danielquinn/paperless/issues/94
--- a/docs/consumption.rst
+++ b/docs/consumption.rst
@ -45,19 +45,27 @@ you name the file right, it'll automatically set some values in the database
 for you.  This is the logic the consumer follows:

 1. Try to find the date, correspondent, title, and tags in the file name following
+   the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``.  Note that
+   the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or
+   ``YYYYMMDDZ``.  The ``Z`` is for "Zulu time" AKA "UTC".
+2. If that doesn't work, we skip the date and try
   the pattern: ``Correspondent - Title - tag,tag,tag.pdf``.
-2. If that doesn't work, try to find the correspondent and title in the file
+3. If that doesn't work, we try to find the correspondent and title in the file
   name following the pattern:  ``Correspondent - Title.pdf``.
-3. If that doesn't work, just assume that the name of the file is the title.
+4. If that doesn't work, just assume that the name of the file is the title.

 So given the above, the following examples would work as you'd expect:

+* ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
+* ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
 * ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
 * ``Another Company - Letter of Reference.jpg``
 * ``Dad's Recipe for Pancakes.png``

 These however wouldn't work:

+* ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
+* ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
 * ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
 * ``Another Company- Letter of Reference.jpg``

@ -128,7 +136,7 @@ following name/value pairs:
  don't start uploading stuff to your server.  The means of generating this
  signature is defined below.

-Specify ``enctype="multipart/form-data"``, and then POST your file with:::
+Specify ``enctype="multipart/form-data"``, and then POST your file with::

    Content-Disposition: form-data; name="document"; filename="whatever.pdf"
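
Note on the file name rules documented above: the fallback order can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code from this commit. It covers only the four patterns listed in ``docs/consumption.rst`` (the commit's ``FileInfo.REGEXES`` defines more), it uses the standard library instead of ``dateutil``, and the names ``PATTERNS``, ``parse_created`` and ``guess_attributes`` are hypothetical.

```python
import re
from collections import OrderedDict
from datetime import datetime, timezone

# Building blocks modelled on the regexes added in src/documents/models.py.
DATE = r"^(?P<created>\d{8}(\d{6})?Z) - "
EXT = r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$"
TAGS = r"(?P<tags>[a-z0-9\-,]*)"


def _compile(body):
    return re.compile(body + EXT, flags=re.IGNORECASE)


# Ordered from most to least specific: the first pattern that matches wins.
PATTERNS = OrderedDict([
    ("created-correspondent-title-tags",
     _compile(DATE + r"(?P<correspondent>.*) - (?P<title>.*) - " + TAGS)),
    ("correspondent-title-tags",
     _compile(r"^(?P<correspondent>.*) - (?P<title>.*) - " + TAGS)),
    ("correspondent-title",
     _compile(r"^(?P<correspondent>.*) - (?P<title>.*)")),
    ("title", _compile(r"^(?P<title>.*)")),
])


def parse_created(stamp):
    """Turn YYYYMMDDZ or YYYYMMDDHHMMSSZ into an aware UTC datetime."""
    digits = stamp[:-1]  # drop the trailing "Z"
    padded = digits.ljust(14, "0")  # a date-only stamp becomes midnight
    return datetime.strptime(padded, "%Y%m%d%H%M%S").replace(
        tzinfo=timezone.utc)


def guess_attributes(file_name):
    """Return the attributes guessed from a bare file name, or None."""
    for label, regex in PATTERNS.items():
        match = regex.match(file_name)
        if not match:
            continue
        found = match.groupdict()
        created = found.get("created")
        tags = found.get("tags")
        return {
            "pattern": label,
            "created": parse_created(created) if created else None,
            "correspondent": found.get("correspondent"),
            "title": found["title"],
            "tags": tuple(t for t in tags.split(",") if t) if tags else (),
            "extension": found["extension"].lower(),
        }
    return None  # unsupported extension
```

Names from the "wouldn't work" list are not rejected by this sketch: they fall through to the title-only pattern, so the whole name ends up as the title.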

--- a/docs/index.rst
+++ b/docs/index.rst
@ -33,4 +33,5 @@ Contents
   api
   utilities
   migrating
+   troubleshooting 
   changelog
--- a/docs/requirements.rst
+++ b/docs/requirements.rst
@ -8,7 +8,7 @@ should work) that has the following software installed on it:

 * `Python3`_ (with development libraries, pip and virtualenv)
 * `GNU Privacy Guard`_
-* `Tesseract`_
+* `Tesseract`_, plus its language files matching your document base.
 * `Imagemagick`_
 * `unpaper`_

@ -52,6 +52,7 @@ well as ImageMagick:

    $ brew install ghostscript
    $ brew install imagemagick
+    $ brew install libmagic


 .. _requirements-baremetal:
--- a/docs/setup.rst
+++ b/docs/setup.rst
@ -5,7 +5,8 @@ Setup

 Paperless isn't a very complicated app, but there are a few components, so some
 basic documentation is in order.  If you follow along in this document and
-still have trouble, please open an `issue on GitHub`_ so I can fill in the gaps.
+still have trouble, please open an `issue on GitHub`_ so I can fill in the
+gaps.

 .. _issue on GitHub: https://github.com/danielquinn/paperless/issues

@ -15,8 +16,8 @@ still have trouble, please open an `issue on GitHub`_ so I can fill in the gaps.
 Download
 --------

-The source is currently only available via GitHub, so grab it from there, either
-by using ``git``:
+The source is currently only available via GitHub, so grab it from there,
+either by using ``git``:

 .. code:: bash

@ -42,15 +43,16 @@ route`_ is quick & easy, but means you're running a VM which comes with memory
 consumption etc. We also `support Docker`_, which you can use natively under
 Linux and in a VM with `Docker Machine`_ (this guide was written for native
 Docker usage under Linux, you might have to adapt it for Docker Machine.)
-Alternatively the standard, `bare metal`_ approach is a little more complicated,
-but worth it because it makes it easier to should you want to contribute some
-code back.
+Alternatively, the standard `bare metal`_ approach is a little more
+complicated, but worth it because it makes things easier should you want to
+contribute some code back.

 .. _Vagrant route: setup-installation-vagrant_
 .. _support Docker: setup-installation-docker_
 .. _bare metal: setup-installation-standard_
 .. _Docker Machine: https://docs.docker.com/machine/

+
 .. _setup-installation-standard:

 Standard (Bare Metal)
@ -58,19 +60,16 @@ Standard (Bare Metal)

 1. Install the requirements as per the :ref:`requirements <requirements>` page.
 2. Change to the ``src`` directory in this repo.
-3. Edit ``paperless/settings.py`` and be sure to set the values for:
-    * ``CONSUMPTION_DIR``: this is where your documents will be dumped to be
-      consumed by Paperless.
-    * ``PASSPHRASE``: this is the passphrase Paperless uses to encrypt/decrypt
-      the original document.  The default value attempts to source the
-      passphrase from the environment, so if you don't set it to a static value
-      here, you must set ``PAPERLESS_PASSPHRASE=some-secret-string`` on the
-      command line whenever invoking the consumer or webserver.
-    * ``OCR_THREADS``: this is the number of threads the OCR process will spawn
-      to process document pages in parallel. The default value gets sourced from
-      the environment-variable ``PAPERLESS_OCR_THREADS`` and expects it to be an
-      integer. If the variable is not set, Python determines the core-count of
-      your CPU and uses that value.
+3. Copy ``paperless.conf.example`` to ``/etc/paperless.conf`` and open it in
+   your favourite editor.  Set the values for:
+
+    * ``PAPERLESS_CONSUMPTION_DIR``: this is where your documents will be
+      dumped to be consumed by Paperless.
+    * ``PAPERLESS_PASSPHRASE``: this is the passphrase Paperless uses to
+      encrypt/decrypt the original document.
+    * ``PAPERLESS_OCR_THREADS``: this is the number of threads the OCR process
+      will spawn to process document pages in parallel.
+
 4. Initialise the database with ``./manage.py migrate``.
 5. Create a user for your Paperless instance with
   ``./manage.py createsuperuser``. Follow the prompts to create your user.
@ -79,8 +78,8 @@ Standard (Bare Metal)
   You should now be able to visit your (empty) `Paperless webserver`_ at
   ``127.0.0.1:8000`` (or whatever you chose).  You can login with the
   user/pass you created in #5.
-7. In a separate window, change to the ``src`` directory in this repo again, but
-   this time, you should start the consumer script with
+7. In a separate window, change to the ``src`` directory in this repo again,
+   but this time, you should start the consumer script with
   ``./manage.py document_consumer``.
 8. Scan something.  Put it in the ``CONSUMPTION_DIR``.
 9. Wait a few minutes
@ -100,6 +99,7 @@ Vagrant Method
   provisioned...
 3. Run ``vagrant ssh`` and once inside your new vagrant box, edit
   ``/etc/paperless.conf`` and set the values for:
+
    * ``PAPERLESS_CONSUMPTION_DIR``: this is where your documents will be
      dumped to be consumed by Paperless.
    * ``PAPERLESS_PASSPHRASE``: this is the passphrase Paperless uses to
@ -107,6 +107,7 @@ Vagrant Method
    * ``PAPERLESS_SHARED_SECRET``: this is the "magic word" used when consuming
      documents from mail or via the API.  If you don't use either, leaving it
      blank is just fine.
+
 4. Exit the vagrant box and re-enter it with ``vagrant ssh`` again.  This
   updates the environment to make use of the changes you made to the config
   file.
@ -140,9 +141,9 @@ Docker Method
   .. caution::

      As mentioned earlier, this guide assumes that you use Docker natively
-      under Linux. If you are using `Docker Machine`_ under Mac OS X or Windows,
-      you will have to adapt IP addresses, volume-mounting, command execution
-      and maybe more.
+      under Linux. If you are using `Docker Machine`_ under Mac OS X or
+      Windows, you will have to adapt IP addresses, volume-mounting, command
+      execution and maybe more.

 2. Install `docker-compose`_. [#compose]_

@ -161,14 +162,14 @@ Docker Method
       .. _Docker installation guide: https://docs.docker.com/engine/installation/
       .. _docker-compose installation guide: https://docs.docker.com/compose/install/

-3. Create a copy of ``docker-compose.yml.example`` as ``docker-compose.yml`` and
-   a copy of ``docker-compose.env.example`` as ``docker-compose.env``. You'll be
-   editing both these files: taking a copy ensures that you can ``git pull`` to
-   receive updates without risking merge conflicts with your modified versions
-   of the configuration files.
-4. Modify ``docker-compose.yml`` to your preferences, following the instructions
-   in comments in the file. The only change that is a hard requirement is to
-   specify where the consumption directory should mount.
+3. Create a copy of ``docker-compose.yml.example`` as ``docker-compose.yml``
+   and a copy of ``docker-compose.env.example`` as ``docker-compose.env``.
+   You'll be editing both these files: taking a copy ensures that you can
+   ``git pull`` to receive updates without risking merge conflicts with your
+   modified versions of the configuration files.
+4. Modify ``docker-compose.yml`` to your preferences, following the
+   instructions in comments in the file. The only change that is a hard
+   requirement is to specify where the consumption directory should mount.
 5. Modify ``docker-compose.env`` and adapt the following environment variables:

   ``PAPERLESS_PASSPHRASE``
@ -181,10 +182,11 @@ Docker Method
     the core-count of your CPU and uses that value.

   ``PAPERLESS_OCR_LANGUAGES``
-     If you want the OCR to recognize other languages in addition to the default
-     English, set this parameter to a space separated list of three-letter
-     language-codes after `ISO 639-2/T`_. For a list of available languages --
-     including their three letter codes -- see the `Debian packagelist`_.
+     If you want the OCR to recognize other languages in addition to the
+     default English, set this parameter to a space separated list of
+     three-letter language-codes after `ISO 639-2/T`_. For a list of available
+     languages -- including their three letter codes -- see the
+     `Debian packagelist`_.

   ``USERMAP_UID`` and ``USERMAP_GID``
     If you want to mount the consumption volume (directory ``/consume`` within
@ -192,11 +194,11 @@ Docker Method
     access rights might be an issue. The default user and group ``paperless``
     in the containers have an id of 1000. The containers will enforce that the
     owning group of the consumption directory will be ``paperless`` to be able
-     to delete consumed documents. If your host-system has a group with an id of
-     1000 and you don't want this group to have access rights to the consumption
-     directory, you can use ``USERMAP_GID`` to change the id in the container
-     and thus the one of the consumption directory. Furthermore, you can change
-     the id of the default user as well using ``USERMAP_UID``.
+     to delete consumed documents. If your host-system has a group with an ID
+     of 1000 and you don't want this group to have access rights to the
+     consumption directory, you can use ``USERMAP_GID`` to change the id in the
+     container and thus the one of the consumption directory. Furthermore, you
+     can change the id of the default user as well using ``USERMAP_UID``.

 6. Run ``docker-compose up -d``. This will create and start the necessary
   containers.
@ -234,14 +236,14 @@ Docker Method
      .. danger::

          While the consumption container will ensure at startup that it can
-          **delete** a consumed file from a host-mounted directory, it might not
-          be able to **read** the document in the first place if the access
+          **delete** a consumed file from a host-mounted directory, it might
+          not be able to **read** the document in the first place if the access
          rights to the file are incorrect.

          Make sure that the documents you put into the consumption directory
          will either be readable by everyone (``chmod o+r file.pdf``) or
-          readable by the default user or group id 1000 (or the one you have set
-          with ``USERMAP_UID`` or ``USERMAP_GID`` respectively).
+          readable by the default user or group id 1000 (or the one you have
+          set with ``USERMAP_UID`` or ``USERMAP_GID`` respectively).

   2. Use ``docker cp`` to copy your files directly into the container:

@ -258,8 +260,8 @@ Docker Method

      ``docker cp`` is a one-shot-command, just like ``cp``. This means that
      every time you want to consume a new document, you will have to execute
-      ``docker cp`` again. You can of course automate this process, but option 1
-      is generally the preferred one.
+      ``docker cp`` again. You can of course automate this process, but option
+      1 is generally the preferred one.

      .. danger::

@ -267,8 +269,8 @@ Docker Method
          to the acting user at the destination, which will be ``root``.

          You therefore need to ensure that the documents you want to copy into
-          the container are readable by everyone (``chmod o+r file.pdf``) before
-          copying them.
+          the container are readable by everyone (``chmod o+r file.pdf``)
+          before copying them.


 .. _Docker: https://www.docker.com/
@ -281,17 +283,108 @@ Docker Method
   free to tinker around without using compose!


-.. _making-things-a-little-more-permanent:
+.. _setup-permanent:

 Making Things a Little more Permanent
 -------------------------------------

-Once you've tested things and are happy with the work flow, you can automate the
-process of starting the webserver and consumer automatically.  If you're running
-on a bare metal system that's using Systemd, you can use the service unit files
-in the ``scripts`` directory to set this up.  If you're on another startup
-system or are using a Vagrant box, then you're currently on your own. If you are
-using Docker, you can set a restart-policy_ in the ``docker-compose.yml`` to
-have the containers automatically start with the Docker daemon.
+Once you've tested things and are happy with the workflow, you can automate
+starting the webserver and consumer.
+
+
+.. _setup-permanent-standard-systemd:
+
+Standard (Bare Metal, Systemd)
+..............................
+
+If you're running on a bare metal system that's using Systemd, you can use the
+service unit files in the ``scripts`` directory to set this up.  You'll need to
+create a user called ``paperless`` and set up Paperless in a place that
+this new user can read and write to.  Then, you can just tell Systemd to enable
+and start the two ``.service`` files::
+
+    # systemctl enable /path/to/paperless/scripts/paperless-consumer.service
+    # systemctl enable /path/to/paperless/scripts/paperless-webserver.service
+    # systemctl start paperless-consumer.service
+    # systemctl start paperless-webserver.service
+
+
+.. _setup-permanent-standard-ubuntu14:
+
+Ubuntu 14.04 (Bare Metal, Upstart)
+..................................
+
+Ubuntu 14.04 and earlier use the `Upstart`_ init system to start services
+during the boot process. To configure Upstart to run Paperless automatically
+after restarting your system:
+
+1. Change to the directory where Upstart's configuration files are kept:
+   ``cd /etc/init``
+2. Create a new file: ``sudo nano paperless-server.conf``
+3. In the newly-created file enter::
+
+    start on (local-filesystems and net-device-up IFACE=eth0)
+    stop on shutdown
+
+    respawn
+    respawn limit 10 5
+
+    script
+      exec /srv/paperless/src/manage.py runserver 0.0.0.0:80
+    end script
+
+   Note that you'll need to replace ``/srv/paperless/src/manage.py`` with the
+   path to the ``manage.py`` script in your installation directory.
+
+   If you are using a network interface other than ``eth0``, you will have to
+   change ``IFACE=eth0``. For example, if you are connected via WiFi, you will
+   likely need to replace ``eth0`` above with ``wlan0``. To see all interfaces,
+   run ``ifconfig``.

+   Save the file.
+
+4. Create a new file: ``sudo nano paperless-consumer.conf``
+
+5. In the newly-created file enter::
+
+    start on (local-filesystems and net-device-up IFACE=eth0)
+    stop on shutdown
+
+    respawn
+    respawn limit 10 5
+
+    script
+      exec /srv/paperless/src/manage.py document_consumer
+    end script
+
+   Replace ``/srv/paperless/src/manage.py`` with the same values as in step 3
+   above and replace ``eth0`` with the appropriate value, if necessary. Save the
+   file.
+
+These two configuration files together will start both the Paperless webserver
+and document consumer processes once the file system and the specified network
+interface are available after boot. Furthermore, if either process ever exits
+unexpectedly, Upstart will try to restart it a maximum of 10 times within a 5
+second period.
+
+.. _Upstart: http://upstart.ubuntu.com/
+
+
+.. _setup-permanent-vagrant:
+
+Vagrant
+.......
+
+You're currently on your own, but the Ubuntu explanation above may be enough.
+
+
+.. _setup-permanent-docker:
+
+Docker
+......
+
+If you're using Docker, you can set a restart-policy_ in the
+``docker-compose.yml`` to have the containers automatically start with the
+Docker daemon.

 .. _restart-policy: https://docs.docker.com/engine/reference/commandline/run/#restart-policies-restart
--- a/docs/troubleshooting.rst
+++ b/docs/troubleshooting.rst
@ -0,0 +1,19 @@
+.. _troubleshooting:
+
+Troubleshooting
+===============
+
+.. _troubleshooting_ocr_language_files_missing:
+
+Consumer warns ``OCR for XX failed``
+------------------------------------
+
+If you find the OCR accuracy to be too low, and/or the document consumer warns that ``OCR for
+XX failed, but we're going to stick with what we've got since FORGIVING_OCR is enabled``, then you
+might need to install the `Tesseract language files
+<http://packages.ubuntu.com/search?keywords=tesseract-ocr>`_ matching your documents' languages.
+
+As an example, if you are running Paperless from the Vagrant setup provided (or from any Ubuntu or Debian
+box), and your documents are written in Spanish, you may need to run::
+
+    apt-get install -y tesseract-ocr-spa
--- a/paperless.conf.example
+++ b/paperless.conf.example
@ -20,7 +20,7 @@ PAPERLESS_CONSUME_MAIL_PASS=""
 #
 # The passphrase you use here will be used when storing your documents in
 # Paperless, but you can always export them in an unencrypted format by using
-# document exporter.  See the documentaiton for more information.
+# document exporter.  See the documentation for more information.
 #
 # One final note about the passphrase.  Once you've consumed a document with
 # one passphrase, DON'T CHANGE IT.  Paperless assumes this to be a constant and
@ -31,3 +31,8 @@ PAPERLESS_PASSPHRASE="secret"
 # If you intend to consume documents either via HTTP POST or by email, you must
 # have a shared secret here.
 PAPERLESS_SHARED_SECRET=""
+
+# By default, Paperless will attempt to use all available CPU cores to process
+# a document, but if you would like to limit that, you can set this value to
+# an integer:
+#PAPERLESS_OCR_THREADS=1
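
Note on ``PAPERLESS_OCR_THREADS``: the comment above and the setup docs describe an integer setting that falls back to the CPU core count when unset. A minimal sketch of that lookup, assuming the value arrives as an environment variable, could look like the following. The helper name ``ocr_threads`` is hypothetical and this is not the project's actual settings code.

```python
import multiprocessing
import os


def ocr_threads(environ=os.environ):
    """Return the OCR thread count, defaulting to all available cores."""
    raw = environ.get("PAPERLESS_OCR_THREADS")
    if not raw:
        # Unset or empty: use every core, as the config comment describes.
        return multiprocessing.cpu_count()
    # The setting is expected to be an integer; int() raises ValueError if not.
    return int(raw)
```

Passing the environment in as a parameter keeps the function easy to exercise without touching the real process environment.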
--- a/presentation/img/kitten.jpg
+++ b/presentation/img/kitten.jpg
--- a/presentation/index.html
+++ b/presentation/index.html
@ -148,12 +148,12 @@

        <section data-background="img/pony.png">
          <h2>Demo!</h2>
-          <p>(Time to sacrifice a kitten)</p>
+          <img src="img/kitten.jpg" style="width: 50%;" />
        </section>

        <section>
          <h2>TODO</h2>
-          <p>It works, but it could use polish</p>
+          <p>It works, but it needs polish</p>
          <ul>
            <li>The UI is the Django admin</li>
            <li>Mail consumption is really raw</li>
@ -163,11 +163,11 @@
          <aside class="notes">
            <ul>
              <li>
-                <strong>Plugin architecture</strong>: there've been requests for
-                some overly custom stuff to happen before and after consumption,
-                but in the UNIX spirit of "do one job well", I think this sort
-                of thing is better written as a plugin -- which means I need to
-                figure out a best practise for that.
+                <strong>Plugin architecture</strong>: there've been requests
+                for some overly custom stuff to happen before and after
+                consumption, but in the UNIX spirit of "do one job well", I
+                think this sort of thing is better written as a plugin -- which
+                means I need to figure out a best practise for that.
              </li>
            </ul>
          </aside>
--- a/requirements.txt
+++ b/requirements.txt
@ -1,4 +1,4 @@
-Django==1.9.2
+Django==1.9.4
 Pillow==3.1.1
 django-crispy-forms==1.6.0
 django-extensions==1.6.1
--- a/src/documents/consumer.py
+++ b/src/documents/consumer.py
@ -19,12 +19,11 @@ from PIL import Image

 from django.conf import settings
 from django.utils import timezone
-from django.template.defaultfilters import slugify
 from pyocr.tesseract import TesseractError

 from paperless.db import GnuPG

-from .models import Correspondent, Tag, Document, Log
+from .models import Tag, Document, Log, FileInfo
 from .languages import ISO639
 from .signals import (
    document_consumption_started, document_consumption_finished)
@ -56,19 +55,6 @@ class Consumer(object):

    DEFAULT_OCR_LANGUAGE = settings.OCR_LANGUAGE

-    REGEX_TITLE = re.compile(
-        r"^.*/(.*)\.(pdf|jpe?g|png|gif|tiff)$",
-        flags=re.IGNORECASE
-    )
-    REGEX_CORRESPONDENT_TITLE = re.compile(
-        r"^.*/(.+) - (.*)\.(pdf|jpe?g|png|gif|tiff)$",
-        flags=re.IGNORECASE
-    )
-    REGEX_CORRESPONDENT_TITLE_TAGS = re.compile(
-        r"^.*/(.*) - (.*) - ([a-z0-9\-,]*)\.(pdf|jpe?g|png|gif|tiff)$",
-        flags=re.IGNORECASE
-    )
-
    def __init__(self):

        self.logger = logging.getLogger(__name__)
@ -107,7 +93,7 @@ class Consumer(object):
            if not os.path.isfile(doc):
                continue

-            if not re.match(self.REGEX_TITLE, doc):
+            if not re.match(FileInfo.REGEXES["title"], doc):
                continue

            if doc in self._ignore:
@ -282,72 +268,20 @@ class Consumer(object):
        # Strip out excess white space to allow matching to go smoother
        return re.sub(r"\s+", " ", r)

-    def _guess_attributes_from_name(self, parseable):
-        """
-        We use a crude naming convention to make handling the correspondent,
-        title, and tags easier:
-          "<correspondent> - <title> - <tags>.<suffix>"
-          "<correspondent> - <title>.<suffix>"
-          "<title>.<suffix>"
-        """
-
-        def get_correspondent(correspondent_name):
-            return Correspondent.objects.get_or_create(
-                name=correspondent_name,
-                defaults={"slug": slugify(correspondent_name)}
-            )[0]
-
-        def get_tags(tags):
-            r = []
-            for t in tags.split(","):
-                r.append(
-                    Tag.objects.get_or_create(slug=t, defaults={"name": t})[0])
-            return tuple(r)
-
-        def get_suffix(suffix):
-            suffix = suffix.lower()
-            if suffix == "jpeg":
-                return "jpg"
-            return suffix
-
-        # First attempt: "<correspondent> - <title> - <tags>.<suffix>"
-        m = re.match(self.REGEX_CORRESPONDENT_TITLE_TAGS, parseable)
-        if m:
-            return (
-                get_correspondent(m.group(1)),
-                m.group(2),
-                get_tags(m.group(3)),
-                get_suffix(m.group(4))
-            )
-
-        # Second attempt: "<correspondent> - <title>.<suffix>"
-        m = re.match(self.REGEX_CORRESPONDENT_TITLE, parseable)
-        if m:
-            return (
-                get_correspondent(m.group(1)),
-                m.group(2),
-                (),
-                get_suffix(m.group(3))
-            )
-
-        # That didn't work, so we assume correspondent and tags are None
-        m = re.match(self.REGEX_TITLE, parseable)
-        return None, m.group(1), (), get_suffix(m.group(2))
-
    def _store(self, text, doc, thumbnail):

-        sender, title, tags, file_type = self._guess_attributes_from_name(doc)
-        relevant_tags = set(list(Tag.match_all(text)) + list(tags))
+        file_info = FileInfo.from_path(doc)
+        relevant_tags = set(list(Tag.match_all(text)) + list(file_info.tags))

        stats = os.stat(doc)

        self.log("debug", "Saving record to database")

        document = Document.objects.create(
-            correspondent=sender,
-            title=title,
+            correspondent=file_info.correspondent,
+            title=file_info.title,
            content=text,
-            file_type=file_type,
+            file_type=file_info.extension,
            created=timezone.make_aware(
                datetime.datetime.fromtimestamp(stats.st_mtime)),
            modified=timezone.make_aware(
--- a/src/documents/management/commands/document_exporter.py
+++ b/src/documents/management/commands/document_exporter.py
@ -96,11 +96,16 @@ class Command(Renderable, BaseCommand):

    @staticmethod
    def _get_legacy_file_name(doc):
-        if doc.correspondent and doc.title:
-            tags = ",".join([t.slug for t in doc.tags.all()])
-            if tags:
-                return "{} - {} - {}.{}".format(
-                    doc.correspondent, doc.title, tags, doc.file_type)
-            return "{} - {}.{}".format(
-                doc.correspondent, doc.title, doc.file_type)
-        return os.path.basename(doc.source_path)
+
+        if not doc.correspondent and not doc.title:
+            return os.path.basename(doc.source_path)
+
+        created = doc.created.strftime("%Y%m%d%H%M%SZ")
+        tags = ",".join([t.slug for t in doc.tags.all()])
+
+        if tags:
+            return "{} - {} - {} - {}.{}".format(
+                created, doc.correspondent, doc.title, tags, doc.file_type)
+
+        return "{} - {} - {}.{}".format(
+            created, doc.correspondent, doc.title, doc.file_type)
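
Note on the exporter change above: the new legacy file name starts with a ``%Y%m%d%H%M%SZ`` timestamp, which is the same shape the consumer's date pattern accepts, so an exported file can be re-consumed with its created date intact. The sketch below illustrates that round trip with plain values instead of a ``Document`` instance. The function name ``legacy_name`` and the ``CREATED`` regex are hypothetical, and it assumes the datetime is already in UTC, since ``strftime`` only appends a literal ``Z`` and does not convert time zones.

```python
import re
from datetime import datetime, timezone

# Same timestamp shape as the "created" group in FileInfo.REGEXES.
CREATED = re.compile(r"^\d{8}(\d{6})?Z$")


def legacy_name(created, correspondent, title, tags, file_type):
    """Build "<created> - <correspondent> - <title>[ - <tags>].<ext>"."""
    stamp = created.strftime("%Y%m%d%H%M%SZ")
    fields = [stamp, correspondent, title]
    if tags:
        # Tags are only appended when there is at least one.
        fields.append(",".join(tags))
    return "{}.{}".format(" - ".join(fields), file_type)


example = legacy_name(
    datetime(2015, 3, 14, 0, 7, 0, tzinfo=timezone.utc),
    "Some Company Name",
    "Invoice 2016-01-01",
    ["money", "invoices"],
    "pdf",
)
```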
--- a/src/documents/models.py
+++ b/src/documents/models.py
@ -1,8 +1,11 @@
+import dateutil.parser
 import logging
 import os
 import re
 import uuid

+from collections import OrderedDict
+
 from django.conf import settings
 from django.core.urlresolvers import reverse
 from django.db import models
@ -152,7 +155,7 @@ class Document(models.Model):
    )
    tags = models.ManyToManyField(
        Tag, related_name="documents", blank=True)
-    created = models.DateTimeField(default=timezone.now, editable=False)
+    created = models.DateTimeField(default=timezone.now)
    modified = models.DateTimeField(auto_now=True, editable=False)

    class Meta(object):
@ -250,3 +253,136 @@ class Log(models.Model):
            self.group = uuid.uuid4()

        models.Model.save(self, *args, **kwargs)
+
+
+class FileInfo(object):
+
+    # This epic regex *almost* worked for our needs, so I'm keeping it here for
+    # posterity, in the hopes that we might find a way to make it work one day.
+    ALMOST_REGEX = re.compile(
+        r"^((?P<date>\d\d\d\d\d\d\d\d\d\d\d\d\d\dZ){separator})?"
+        r"((?P<correspondent>{non_separated_word}+){separator})??"
+        r"(?P<title>{non_separated_word}+)"
+        r"({separator}(?P<tags>[a-z,0-9-]+))?"
+        r"\.(?P<extension>[a-zA-Z.-]+)$".format(
+            separator=r"\s+-\s+",
+            non_separated_word=r"([\w,. ]|([^\s]-))"
+        )
+    )
+
+    REGEXES = OrderedDict([
+        ("created-correspondent-title-tags", re.compile(
+            r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - "
+            r"(?P<correspondent>.*) - "
+            r"(?P<title>.*) - "
+            r"(?P<tags>[a-z0-9\-,]*)"
+            r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
+            flags=re.IGNORECASE
+        )),
+        ("created-title-tags", re.compile(
+            r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - "
+            r"(?P<title>.*) - "
+            r"(?P<tags>[a-z0-9\-,]*)"
+            r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
+            flags=re.IGNORECASE
+        )),
+        ("created-correspondent-title", re.compile(
+            r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - "
+            r"(?P<correspondent>.*) - "
+            r"(?P<title>.*)"
+            r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
+            flags=re.IGNORECASE
+        )),
+        ("created-title", re.compile(
+            r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - "
+            r"(?P<title>.*)"
+            r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
+            flags=re.IGNORECASE
+        )),
+        ("correspondent-title-tags", re.compile(
+            r"(?P<correspondent>.*) - "
+            r"(?P<title>.*) - "
+            r"(?P<tags>[a-z0-9\-,]*)"
+            r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
+            flags=re.IGNORECASE
+        )),
+        ("correspondent-title", re.compile(
+            r"(?P<correspondent>.*) - "
+            r"(?P<title>.*)?"
+            r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
+            flags=re.IGNORECASE
+        )),
+        ("title", re.compile(
+            r"(?P<title>.*)"
+            r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
+            flags=re.IGNORECASE
+        ))
+    ])
+
+    def __init__(self, created=None, correspondent=None, title=None, tags=(),
+                 extension=None):
+
+        self.created = created
+        self.title = title
+        self.extension = extension
+        self.correspondent = correspondent
+        self.tags = tags
+
+    @classmethod
+    def _get_created(cls, created):
+        return dateutil.parser.parse("{:0<14}Z".format(created[:-1]))
+
+    @classmethod
+    def _get_correspondent(cls, name):
+        if not name:
+            return None
+        return Correspondent.objects.get_or_create(name=name, defaults={
+            "slug": slugify(name)
+        })[0]
+
+    @classmethod
+    def _get_title(cls, title):
+        return title
+
+    @classmethod
+    def _get_tags(cls, tags):
+        r = []
+        for t in tags.split(","):
+            r.append(
+                Tag.objects.get_or_create(slug=t, defaults={"name": t})[0])
+        return tuple(r)
+
+    @classmethod
+    def _get_extension(cls, extension):
+        r = extension.lower()
+        if r == "jpeg":
+            return "jpg"
+        return r
+
+    @classmethod
+    def _mangle_property(cls, properties, name):
+        if name in properties:
+            properties[name] = getattr(cls, "_get_{}".format(name))(
+                properties[name]
+            )
+
+    @classmethod
+    def from_path(cls, path):
+        """
+        We use a crude naming convention to make handling the created date,
+        correspondent, title, and tags easier. Bracketed parts are optional:
+          "[<created> - ][<correspondent> - ]<title>[ - <tags>].<suffix>"
+        <created> is YYYYMMDD[hhmmss]Z. The patterns in REGEXES are tried in
+        order and the first match wins.
+        """
+
+        for regex in cls.REGEXES.values():
+            m = regex.match(os.path.basename(path))
+            if m:
+                properties = m.groupdict()
+                cls._mangle_property(properties, "created")
+                cls._mangle_property(properties, "correspondent")
+                cls._mangle_property(properties, "title")
+                cls._mangle_property(properties, "tags")
+                cls._mangle_property(properties, "extension")
+                return cls(**properties)
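
The ordered fallback in `FileInfo.from_path` can be sketched without Django, as a rough illustration rather than the project's implementation. The sketch keeps only three of the seven patterns, and `PATTERNS`, `parse_created`, and `parse_name` are hypothetical names introduced here. It also swaps `dateutil.parser.parse` for the standard library's `datetime.strptime`, and returns a plain dict instead of a `FileInfo`.

```python
import os
import re
from collections import OrderedDict
from datetime import datetime

EXT = r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$"
CREATED = r"^(?P<created>\d{8}(\d{6})?Z) - "

# Most specific pattern first: the first regex that matches wins.
PATTERNS = OrderedDict([
    ("created-title", re.compile(
        CREATED + r"(?P<title>.*)" + EXT, re.IGNORECASE)),
    ("correspondent-title", re.compile(
        r"(?P<correspondent>.*) - (?P<title>.*)" + EXT, re.IGNORECASE)),
    ("title", re.compile(r"(?P<title>.*)" + EXT, re.IGNORECASE)),
])


def parse_created(value):
    # Drop the trailing "Z", then right-pad a date-only stamp with zeros so
    # that "20150102" becomes "20150102000000".
    digits = "{:0<14}".format(value[:-1])
    return datetime.strptime(digits, "%Y%m%d%H%M%S")


def parse_name(path):
    name = os.path.basename(path)
    for label, regex in PATTERNS.items():
        m = regex.match(name)
        if m:
            props = m.groupdict()
            if "created" in props:
                props["created"] = parse_created(props["created"])
            ext = props["extension"].lower()
            props["extension"] = "jpg" if ext == "jpeg" else ext
            props["pattern"] = label  # which pattern matched, for debugging
            return props
    return None  # no pattern matched, e.g. an unsupported extension
```

For example, `parse_name("/path/to/timmy - Title w Spaces.jpeg")` falls through the first pattern and is picked up by `"correspondent-title"`, while a name such as `notes.txt` matches nothing and yields `None`.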
--- a/src/documents/tests/test_consumer.py
+++ b/src/documents/tests/test_consumer.py
@ -1,29 +1,36 @@
 from django.test import TestCase

-from ..consumer import Consumer
+from ..models import Document, FileInfo


 class TestAttachment(TestCase):

    TAGS = ("tag1", "tag2", "tag3")
-    CONSUMER = Consumer()
-    SUFFIXES = (
+    EXTENSIONS = (
        "pdf", "png", "jpg", "jpeg", "gif",
        "PDF", "PNG", "JPG", "JPEG", "GIF",
        "PdF", "PnG", "JpG", "JPeG", "GiF",
    )

    def _test_guess_attributes_from_name(self, path, sender, title, tags):
-        for suffix in self.SUFFIXES:
-            f = path.format(suffix)
-            results = self.CONSUMER._guess_attributes_from_name(f)
-            self.assertEqual(results[0].name, sender, f)
-            self.assertEqual(results[1], title, f)
-            self.assertEqual(tuple([t.slug for t in results[2]]), tags, f)
-            if suffix.lower() == "jpeg":
-                self.assertEqual(results[3], "jpg", f)
+
+        for extension in self.EXTENSIONS:
+
+            f = path.format(extension)
+            file_info = FileInfo.from_path(f)
+
+            if sender:
+                self.assertEqual(file_info.correspondent.name, sender, f)
            else:
-                self.assertEqual(results[3], suffix.lower(), f)
+                self.assertIsNone(file_info.correspondent, f)
+
+            self.assertEqual(file_info.title, title, f)
+
+            self.assertEqual(tuple([t.slug for t in file_info.tags]), tags, f)
+            if extension.lower() == "jpeg":
+                self.assertEqual(file_info.extension, "jpg", f)
+            else:
+                self.assertEqual(file_info.extension, extension.lower(), f)

    def test_guess_attributes_from_name0(self):
        self._test_guess_attributes_from_name(
@ -92,3 +99,206 @@ class TestAttachment(TestCase):
            "Τιτλε",
            self.TAGS
        )
+
+    def test_guess_attributes_from_name_when_correspondent_empty(self):
+        self._test_guess_attributes_from_name(
+            '/path/to/ - weird empty correspondent but should not break.{}',
+            None,
+            'weird empty correspondent but should not break',
+            ()
+        )
+
+    def test_guess_attributes_from_name_when_title_starts_with_dash(self):
+        self._test_guess_attributes_from_name(
+            '/path/to/- weird but should not break.{}',
+            None,
+            '- weird but should not break',
+            ()
+        )
+
+    def test_guess_attributes_from_name_when_title_ends_with_dash(self):
+        self._test_guess_attributes_from_name(
+            '/path/to/weird but should not break -.{}',
+            None,
+            'weird but should not break -',
+            ()
+        )
+
+    def test_guess_attributes_from_name_when_title_is_empty(self):
+        self._test_guess_attributes_from_name(
+            '/path/to/weird correspondent but should not break - .{}',
+            'weird correspondent but should not break',
+            '',
+            ()
+        )
+
+
+class Permutations(TestCase):
+
+    valid_dates = (
+        "20150102030405Z",
+        "20150102Z",
+    )
+    valid_correspondents = [
+        "timmy",
+        "Dr. McWheelie",
+        "Dash Gor-don",
+        "ο Θερμαστής",
+        ""
+    ]
+    valid_titles = ["title", "Title w Spaces", "Title a-dash", "Τίτλος", ""]
+    valid_tags = ["tag", "tig,tag", "tag1,tag2,tag-3"]
+    valid_extensions = ["pdf", "png", "jpg", "jpeg", "gif"]
+
+    def _test_guessed_attributes(self, filename, created=None,
+                                 correspondent=None, title=None,
+                                 extension=None, tags=None):
+
+        # print(filename)
+        info = FileInfo.from_path(filename)
+
+        # Created
+        if created is None:
+            self.assertIsNone(info.created, filename)
+        else:
+            self.assertEqual(info.created.year, int(created[:4]), filename)
+            self.assertEqual(info.created.month, int(created[4:6]), filename)
+            self.assertEqual(info.created.day, int(created[6:8]), filename)
+
+        # Correspondent
+        if correspondent:
+            self.assertEqual(info.correspondent.name, correspondent, filename)
+        else:
+            self.assertIsNone(info.correspondent, filename)
+
+        # Title
+        self.assertEqual(info.title, title, filename)
+
+        # Tags
+        if tags is None:
+            self.assertEqual(info.tags, (), filename)
+        else:
+            self.assertEqual(
+                [t.slug for t in info.tags], tags.split(','),
+                filename
+            )
+
+        # Extension
+        if extension == 'jpeg':
+            extension = 'jpg'
+        self.assertEqual(info.extension, extension, filename)
+
+    def test_just_title(self):
+        template = '/path/to/{title}.{extension}'
+        for title in self.valid_titles:
+            for extension in self.valid_extensions:
+                spec = dict(title=title, extension=extension)
+                filename = template.format(**spec)
+                self._test_guessed_attributes(filename, **spec)
+
+    def test_title_and_correspondent(self):
+        template = '/path/to/{correspondent} - {title}.{extension}'
+        for correspondent in self.valid_correspondents:
+            for title in self.valid_titles:
+                for extension in self.valid_extensions:
+                    spec = dict(correspondent=correspondent, title=title,
+                                extension=extension)
+                    filename = template.format(**spec)
+                    self._test_guessed_attributes(filename, **spec)
+
+    def test_title_and_correspondent_and_tags(self):
+        template = '/path/to/{correspondent} - {title} - {tags}.{extension}'
+        for correspondent in self.valid_correspondents:
+            for title in self.valid_titles:
+                for tags in self.valid_tags:
+                    for extension in self.valid_extensions:
+                        spec = dict(correspondent=correspondent, title=title,
+                                    tags=tags, extension=extension)
+                        filename = template.format(**spec)
+                        self._test_guessed_attributes(filename, **spec)
+
+    def test_created_and_correspondent_and_title_and_tags(self):
+
+        template = ("/path/to/{created} - "
+                    "{correspondent} - "
+                    "{title} - "
+                    "{tags}"
+                    ".{extension}")
+
+        for created in self.valid_dates:
+            for correspondent in self.valid_correspondents:
+                for title in self.valid_titles:
+                    for tags in self.valid_tags:
+                        for extension in self.valid_extensions:
+                            spec = {
+                                "created": created,
+                                "correspondent": correspondent,
+                                "title": title,
+                                "tags": tags,
+                                "extension": extension
+                            }
+                            self._test_guessed_attributes(
+                                template.format(**spec), **spec)
+
+    def test_created_and_correspondent_and_title(self):
+
+        template = ("/path/to/{created} - "
+                    "{correspondent} - "
+                    "{title}"
+                    ".{extension}")
+
+        for created in self.valid_dates:
+            for correspondent in self.valid_correspondents:
+                for title in self.valid_titles:
+
+                    # Skip all-lowercase titles: they look like tags, so the
+                    # "created-title-tags" pattern would match them first.
+                    if title.lower() == title:
+                        continue
+
+                    for extension in self.valid_extensions:
+                        spec = {
+                            "created": created,
+                            "correspondent": correspondent,
+                            "title": title,
+                            "extension": extension
+                        }
+                        self._test_guessed_attributes(
+                            template.format(**spec), **spec)
+
+    def test_created_and_title(self):
+
+        template = ("/path/to/{created} - "
+                    "{title}"
+                    ".{extension}")
+
+        for created in self.valid_dates:
+            for title in self.valid_titles:
+                for extension in self.valid_extensions:
+                    spec = {
+                        "created": created,
+                        "title": title,
+                        "extension": extension
+                    }
+                    self._test_guessed_attributes(
+                        template.format(**spec), **spec)
+
+    def test_created_and_title_and_tags(self):
+
+        template = ("/path/to/{created} - "
+                    "{title} - "
+                    "{tags}"
+                    ".{extension}")
+
+        for created in self.valid_dates:
+            for title in self.valid_titles:
+                for tags in self.valid_tags:
+                    for extension in self.valid_extensions:
+                        spec = {
+                            "created": created,
+                            "title": title,
+                            "tags": tags,
+                            "extension": extension
+                        }
+                        self._test_guessed_attributes(
+                            template.format(**spec), **spec)
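
The nested loops in `Permutations` enumerate every combination of the `valid_*` lists. As a possible alternative, and only a sketch, the same specs could be generated with `itertools.product`. The helper name `build_specs` is hypothetical and the value lists below are shortened copies of the ones in the test class.

```python
from itertools import product


def build_specs(**choices):
    # Yield one dict per combination of the given value lists, in the same
    # order nested for-loops would produce (last argument varies fastest).
    keys = list(choices)
    for values in product(*(choices[k] for k in keys)):
        yield dict(zip(keys, values))


template = "/path/to/{created} - {title}.{extension}"
specs = list(build_specs(
    created=["20150102030405Z", "20150102Z"],
    title=["title", "Title w Spaces"],
    extension=["pdf", "png"],
))
filenames = [template.format(**spec) for spec in specs]
```

Each `spec` can then be passed to a checker in the same way the tests call `self._test_guessed_attributes(template.format(**spec), **spec)`.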