diff --git a/README.rst b/README.rst index 8bdf66319..6eec07d72 100644 --- a/README.rst +++ b/README.rst @@ -24,8 +24,11 @@ How it Works 1. Buy a document scanner like `this one`_. 2. Set it up to "scan to FTP" or something similar. It should be able to push - scanned images to a server without you having to do anything. -3. Have the target server run the *Paperless* consumption script to OCR the PDF + scanned images to a server without you having to do anything. If your + scanner doesn't know how to automatically upload the file somewhere, you can + always do that manually. Paperless doesn't care how the documents get into + its local consumption directory. +3. Have the target server run the Paperless consumption script to OCR the PDF and index it into a local database. 4. Use the web frontend to sift through the database and find what you want. 5. Download the PDF you need/want via the web interface and do whatever you @@ -56,7 +59,7 @@ powerful tools. * `ImageMagick`_ converts the images between colour and greyscale. * `Tesseract`_ does the character recognition. -* `Unpaper`_ despeckles and and deskews the scanned image. +* `Unpaper`_ despeckles and deskews the scanned image. * `GNU Privacy Guard`_ is used as the encryption backend. * `Python 3`_ is the language of the project. diff --git a/docker-compose.yml.example b/docker-compose.yml.example index 2a70c9ff7..fddda8198 100644 --- a/docker-compose.yml.example +++ b/docker-compose.yml.example @@ -11,6 +11,10 @@ services: - data:/usr/src/paperless/data - media:/usr/src/paperless/media env_file: docker-compose.env + # The empty PAPERLESS_OCR_LANGUAGES below overwrites the value from the + # env-file, so that the webserver, which doesn't do any text + # recognition, doesn't have to install unnecessary languages the user + # might have set there.
environment: - PAPERLESS_OCR_LANGUAGES= command: ["runserver", "0.0.0.0:8000"] diff --git a/docs/changelog.rst b/docs/changelog.rst index f2ab6cabc..c1397bb6c 100644 --- a/docs/changelog.rst +++ b/docs/changelog.rst @@ -1,6 +1,15 @@ Changelog ######### +* 0.2.0 + + * Added support for guessing the date from the file name along with the + correspondent, title, and tags. Thanks to `Tikitu de Jager`_ for his pull + request that I took forever to merge and to `Pit`_ for his efforts on the + regex front. + * `#94`_: Restored support for changing the created date in the UI. Thanks + to `Martin Honermeyer`_ and `Tim White`_ for working with me on this. + * 0.1.1 * Potentially **Breaking Change**: All references to "sender" in the code @@ -86,6 +95,8 @@ Changelog .. _Wayne Werner: https://github.com/waynew .. _darkmatter: https://github.com/darkmatter .. _zedster: https://github.com/zedster +.. _Martin Honermeyer: https://github.com/djmaze +.. _Tim White: https://github.com/timwhite .. _#20: https://github.com/danielquinn/paperless/issues/20 .. _#44: https://github.com/danielquinn/paperless/issues/44 @@ -99,3 +110,4 @@ Changelog .. _#67: https://github.com/danielquinn/paperless/issues/67 .. _#68: https://github.com/danielquinn/paperless/issues/68 .. _#71: https://github.com/danielquinn/paperless/issues/71 +.. _#94: https://github.com/danielquinn/paperless/issues/94 diff --git a/docs/consumption.rst b/docs/consumption.rst index eadf12823..2e404fddd 100644 --- a/docs/consumption.rst +++ b/docs/consumption.rst @@ -45,19 +45,27 @@ you name the file right, it'll automatically set some values in the database for you. This is is the logic the consumer follows: 1. Try to find the correspondent, title, and tags in the file name following + the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``. Note that + the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or + ``YYYYMMDDZ``. The ``Z`` is for "Zulu time" AKA "UTC". +2. 
If that doesn't work, we skip the date and try the pattern: ``Correspondent - Title - tag,tag,tag.pdf``. -2. If that doesn't work, try to find the correspondent and title in the file +3. If that doesn't work, we try to find the correspondent and title in the file name following the pattern: ``Correspondent - Title.pdf``. -3. If that doesn't work, just assume that the name of the file is the title. +4. If that doesn't work, just assume that the name of the file is the title. So given the above, the following examples would work as you'd expect: +* ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf`` +* ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf`` * ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf`` * ``Another Company - Letter of Reference.jpg`` * ``Dad's Recipe for Pancakes.png`` These however wouldn't work: +* ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf`` +* ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf`` * ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf`` * ``Another Company- Letter of Reference.jpg`` @@ -128,7 +136,7 @@ following name/value pairs: don't start uploading stuff to your server. The means of generating this signature is defined below.
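To illustrate the date rule in the ``docs/consumption.rst`` hunk above, here is a minimal sketch of how a ``YYYYMMDDZ`` or ``YYYYMMDDHHMMSSZ`` prefix could be padded and parsed. It assumes only the standard library, whereas ``FileInfo._get_created`` in this diff uses ``dateutil``. The names ``DATE_PREFIX`` and ``parse_created`` are hypothetical and not part of Paperless.

```python
import re
from datetime import datetime, timezone

# Hypothetical helper for illustration only; not part of Paperless.
# Matches a leading "YYYYMMDDZ - " or "YYYYMMDDHHMMSSZ - " prefix.
DATE_PREFIX = re.compile(r"^(?P<digits>\d{8}(?:\d{6})?)Z - ")


def parse_created(file_name):
    """Return the UTC datetime encoded in the file name prefix, or None."""
    m = DATE_PREFIX.match(file_name)
    if m is None:
        return None
    # Right-pad the short form with zeros so both forms are 14 digits,
    # mirroring the "{:0<14}" padding used by FileInfo._get_created.
    padded = "{:0<14}".format(m.group("digits"))
    return datetime.strptime(padded, "%Y%m%d%H%M%S").replace(
        tzinfo=timezone.utc)


if __name__ == "__main__":
    name = "20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf"
    print(parse_created(name))
```

A name such as ``2015-03-14 - Some Company Name, ...`` returns ``None`` here, which corresponds to the consumer falling through to the patterns without a date.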
-Specify ``enctype="multipart/form-data"``, and then POST your file with::: +Specify ``enctype="multipart/form-data"``, and then POST your file with:: Content-Disposition: form-data; name="document"; filename="whatever.pdf" diff --git a/docs/index.rst b/docs/index.rst index 47710d376..43f77b15a 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -33,4 +33,5 @@ Contents api utilities migrating + troubleshooting changelog diff --git a/docs/requirements.rst b/docs/requirements.rst index 36bc234c0..a1567361a 100644 --- a/docs/requirements.rst +++ b/docs/requirements.rst @@ -8,7 +8,7 @@ should work) that has the following software installed on it: * `Python3`_ (with development libraries, pip and virtualenv) * `GNU Privacy Guard`_ -* `Tesseract`_ +* `Tesseract`_, plus its language files matching your document base. * `Imagemagick`_ * `unpaper`_ @@ -52,6 +52,7 @@ well as ImageMagick: $ brew install ghostscript $ brew install imagemagick + $ brew install libmagic .. _requirements-baremetal: diff --git a/docs/setup.rst b/docs/setup.rst index 9992418c1..e18d11dda 100644 --- a/docs/setup.rst +++ b/docs/setup.rst @@ -5,7 +5,8 @@ Setup Paperless isn't a very complicated app, but there are a few components, so some basic documentation is in order. If you go follow along in this document and -still have trouble, please open an `issue on GitHub`_ so I can fill in the gaps. +still have trouble, please open an `issue on GitHub`_ so I can fill in the +gaps. .. _issue on GitHub: https://github.com/danielquinn/paperless/issues @@ -15,8 +16,8 @@ still have trouble, please open an `issue on GitHub`_ so I can fill in the gaps. Download -------- -The source is currently only available via GitHub, so grab it from there, either -by using ``git``: +The source is currently only available via GitHub, so grab it from there, +either by using ``git``: .. code:: bash @@ -42,15 +43,16 @@ route`_ is quick & easy, but means you're running a VM which comes with memory consumption etc. 
We also `support Docker`_, which you can use natively under Linux and in a VM with `Docker Machine`_ (this guide was written for native Docker usage under Linux, you might have to adapt it for Docker Machine.) -Alternatively the standard, `bare metal`_ approach is a little more complicated, -but worth it because it makes it easier to should you want to contribute some -code back. +Alternatively, the standard `bare metal`_ approach is a little more +complicated, but worth it because it makes things easier should you want to +contribute some code back. .. _Vagrant route: setup-installation-vagrant_ .. _support Docker: setup-installation-docker_ .. _bare metal: setup-installation-standard_ .. _Docker Machine: https://docs.docker.com/machine/ + .. _setup-installation-standard: Standard (Bare Metal) @@ -58,19 +60,16 @@ Standard (Bare Metal) 1. Install the requirements as per the :ref:`requirements ` page. 2. Change to the ``src`` directory in this repo. -3. Edit ``paperless/settings.py`` and be sure to set the values for: - * ``CONSUMPTION_DIR``: this is where your documents will be dumped to be - consumed by Paperless. - * ``PASSPHRASE``: this is the passphrase Paperless uses to encrypt/decrypt - the original document. The default value attempts to source the - passphrase from the environment, so if you don't set it to a static value - here, you must set ``PAPERLESS_PASSPHRASE=some-secret-string`` on the - command line whenever invoking the consumer or webserver. - * ``OCR_THREADS``: this is the number of threads the OCR process will spawn - to process document pages in parallel. The default value gets sourced from - the environment-variable ``PAPERLESS_OCR_THREADS`` and expects it to be an - integer. If the variable is not set, Python determines the core-count of - your CPU and uses that value. +3. Copy ``paperless.conf.example`` to ``/etc/paperless.conf`` and open it in + your favourite editor.
Set the values for: + + * ``PAPERLESS_CONSUMPTION_DIR``: this is where your documents will be + dumped to be consumed by Paperless. + * ``PAPERLESS_PASSPHRASE``: this is the passphrase Paperless uses to + encrypt/decrypt the original document. + * ``PAPERLESS_OCR_THREADS``: this is the number of threads the OCR process + will spawn to process document pages in parallel. + 4. Initialise the database with ``./manage.py migrate``. 5. Create a user for your Paperless instance with ``./manage.py createsuperuser``. Follow the prompts to create your user. @@ -79,8 +78,8 @@ Standard (Bare Metal) You should now be able to visit your (empty) `Paperless webserver`_ at ``127.0.0.1:8000`` (or whatever you chose). You can login with the user/pass you created in #5. -7. In a separate window, change to the ``src`` directory in this repo again, but - this time, you should start the consumer script with +7. In a separate window, change to the ``src`` directory in this repo again, + but this time, you should start the consumer script with ``./manage.py document_consumer``. 8. Scan something. Put it in the ``CONSUMPTION_DIR``. 9. Wait a few minutes @@ -100,6 +99,7 @@ Vagrant Method provisioned... 3. Run ``vagrant ssh`` and once inside your new vagrant box, edit ``/etc/paperless.conf`` and set the values for: + * ``PAPERLESS_CONSUMPTION_DIR``: this is where your documents will be dumped to be consumed by Paperless. * ``PAPERLESS_PASSPHRASE``: this is the passphrase Paperless uses to @@ -107,6 +107,7 @@ Vagrant Method * ``PAPERLESS_SHARED_SECRET``: this is the "magic word" used when consuming documents from mail or via the API. If you don't use either, leaving it blank is just fine. + 4. Exit the vagrant box and re-enter it with ``vagrant ssh`` again. This updates the environment to make use of the changes you made to the config file. @@ -140,9 +141,9 @@ Docker Method .. caution:: As mentioned earlier, this guide assumes that you use Docker natively - under Linux. 
If you are using `Docker Machine`_ under Mac OS X or Windows, - you will have to adapt IP addresses, volume-mounting, command execution - and maybe more. + under Linux. If you are using `Docker Machine`_ under Mac OS X or + Windows, you will have to adapt IP addresses, volume-mounting, command + execution and maybe more. 2. Install `docker-compose`_. [#compose]_ @@ -161,14 +162,14 @@ Docker Method .. _Docker installation guide: https://docs.docker.com/engine/installation/ .. _docker-compose installation guide: https://docs.docker.com/compose/install/ -3. Create a copy of ``docker-compose.yml.example`` as ``docker-compose.yml`` and - a copy of ``docker-compose.env.example`` as ``docker-compose.env``. You'll be - editing both these files: taking a copy ensures that you can ``git pull`` to - receive updates without risking merge conflicts with your modified versions - of the configuration files. -4. Modify ``docker-compose.yml`` to your preferences, following the instructions - in comments in the file. The only change that is a hard requirement is to - specify where the consumption directory should mount. +3. Create a copy of ``docker-compose.yml.example`` as ``docker-compose.yml`` + and a copy of ``docker-compose.env.example`` as ``docker-compose.env``. + You'll be editing both these files: taking a copy ensures that you can + ``git pull`` to receive updates without risking merge conflicts with your + modified versions of the configuration files. +4. Modify ``docker-compose.yml`` to your preferences, following the + instructions in comments in the file. The only change that is a hard + requirement is to specify where the consumption directory should mount. 5. Modify ``docker-compose.env`` and adapt the following environment variables: ``PAPERLESS_PASSPHRASE`` @@ -181,10 +182,11 @@ Docker Method the core-count of your CPU and uses that value. 
``PAPERLESS_OCR_LANGUAGES`` - If you want the OCR to recognize other languages in addition to the default - English, set this parameter to a space separated list of three-letter - language-codes after `ISO 639-2/T`_. For a list of available languages -- - including their three letter codes -- see the `Debian packagelist`_. + If you want the OCR to recognize other languages in addition to the + default English, set this parameter to a space separated list of + three-letter language-codes after `ISO 639-2/T`_. For a list of available + languages -- including their three letter codes -- see the + `Debian packagelist`_. ``USERMAP_UID`` and ``USERMAP_GID`` If you want to mount the consumption volume (directory ``/consume`` within @@ -192,11 +194,11 @@ Docker Method access rights might be an issue. The default user and group ``paperless`` in the containers have an id of 1000. The containers will enforce that the owning group of the consumption directory will be ``paperless`` to be able - to delete consumed documents. If your host-system has a group with an id of - 1000 and you don't want this group to have access rights to the consumption - directory, you can use ``USERMAP_GID`` to change the id in the container - and thus the one of the consumption directory. Furthermore, you can change - the id of the default user as well using ``USERMAP_UID``. + to delete consumed documents. If your host-system has a group with an ID + of 1000 and you don't want this group to have access rights to the + consumption directory, you can use ``USERMAP_GID`` to change the id in the + container and thus the one of the consumption directory. Furthermore, you + can change the id of the default user as well using ``USERMAP_UID``. 6. Run ``docker-compose up -d``. This will create and start the necessary containers. @@ -234,14 +236,14 @@ Docker Method .. 
danger:: While the consumption container will ensure at startup that it can - **delete** a consumed file from a host-mounted directory, it might not - be able to **read** the document in the first place if the access + **delete** a consumed file from a host-mounted directory, it might + not be able to **read** the document in the first place if the access rights to the file are incorrect. Make sure that the documents you put into the consumption directory will either be readable by everyone (``chmod o+r file.pdf``) or - readable by the default user or group id 1000 (or the one you have set - with ``USERMAP_UID`` or ``USERMAP_GID`` respectively). + readable by the default user or group id 1000 (or the one you have + set with ``USERMAP_UID`` or ``USERMAP_GID`` respectively). 2. Use ``docker cp`` to copy your files directly into the container: @@ -258,8 +260,8 @@ Docker Method ``docker cp`` is a one-shot-command, just like ``cp``. This means that every time you want to consume a new document, you will have to execute - ``docker cp`` again. You can of course automate this process, but option 1 - is generally the preferred one. + ``docker cp`` again. You can of course automate this process, but option + 1 is generally the preferred one. .. danger:: @@ -267,8 +269,8 @@ Docker Method to the acting user at the destination, which will be ``root``. You therefore need to ensure that the documents you want to copy into - the container are readable by everyone (``chmod o+r file.pdf``) before - copying them. + the container are readable by everyone (``chmod o+r file.pdf``) + before copying them. .. _Docker: https://www.docker.com/ @@ -281,17 +283,108 @@ Docker Method free to tinker around without using compose! -.. _making-things-a-little-more-permanent: +.. 
_setup-permanent: Making Things a Little more Permanent ------------------------------------- -Once you've tested things and are happy with the work flow, you can automate the -process of starting the webserver and consumer automatically. If you're running -on a bare metal system that's using Systemd, you can use the service unit files -in the ``scripts`` directory to set this up. If you're on another startup -system or are using a Vagrant box, then you're currently on your own. If you are -using Docker, you can set a restart-policy_ in the ``docker-compose.yml`` to -have the containers automatically start with the Docker daemon. +Once you've tested things and are happy with the work flow, you can automate +the process of starting the webserver and consumer. + + +.. _setup-permanent-standard-systemd: + +Standard (Bare Metal, Systemd) +.............................. + +If you're running on a bare metal system that's using Systemd, you can use the +service unit files in the ``scripts`` directory to set this up. You'll need to +create a user called ``paperless`` and set up Paperless in a place that +this new user can read and write to. Then, you can just tell Systemd to enable +the two ``.service`` files:: + + # systemctl enable /path/to/paperless/scripts/paperless-consumer.service + # systemctl enable /path/to/paperless/scripts/paperless-webserver.service + # systemctl start /path/to/paperless/scripts/paperless-consumer.service + # systemctl start /path/to/paperless/scripts/paperless-webserver.service + + +.. _setup-permanent-standard-ubuntu14: + +Ubuntu 14.04 (Bare Metal, Upstart) +.................................. + +Ubuntu 14.04 and earlier use the `Upstart`_ init system to start services +during the boot process. To configure Upstart to run Paperless automatically +after restarting your system: + +1. Change to the directory where Upstart's configuration files are kept: + ``cd /etc/init`` +2. 
Create a new file: ``sudo nano paperless-server.conf`` +3. In the newly-created file enter:: + + start on (local-filesystems and net-device-up IFACE=eth0) + stop on shutdown + + respawn + respawn limit 10 5 + + script + exec /srv/paperless/src/manage.py runserver 0.0.0.0:80 + end script + + Note that you'll need to replace ``/srv/paperless/src/manage.py`` with the + path to the ``manage.py`` script in your installation directory. + + If you are using a network interface other than ``eth0``, you will have to + change ``IFACE=eth0``. For example, if you are connected via WiFi, you will + likely need to replace ``eth0`` above with ``wlan0``. To see all interfaces, + run ``ifconfig``. + + Save the file. + +4. Create a new file: ``sudo nano paperless-consumer.conf`` + +5. In the newly-created file enter:: + + start on (local-filesystems and net-device-up IFACE=eth0) + stop on shutdown + + respawn + respawn limit 10 5 + + script + exec /srv/paperless/src/manage.py document_consumer + end script + + Replace ``/srv/paperless/src/manage.py`` with the same value as in step 3 + above and replace ``eth0`` with the appropriate value, if necessary. Save the + file. + +These two configuration files together will start both the Paperless webserver +and document consumer processes when the file system and network interface +specified are available after boot. Furthermore, if either process ever exits +unexpectedly, Upstart will try to restart it a maximum of 10 times within a +5-second period. + +.. _Upstart: http://upstart.ubuntu.com/ + + +.. _setup-permanent-vagrant: + +Vagrant +....... + +You're currently on your own, but the Ubuntu explanation above may be enough. + + +.. _setup-permanent-docker: + +Docker +...... + +If you're using Docker, you can set a restart-policy_ in the +``docker-compose.yml`` to have the containers automatically start with the +Docker daemon. .. 
_restart-policy: https://docs.docker.com/engine/reference/commandline/run/#restart-policies-restart diff --git a/docs/troubleshooting.rst b/docs/troubleshooting.rst new file mode 100644 index 000000000..0fa7c1a29 --- /dev/null +++ b/docs/troubleshooting.rst @@ -0,0 +1,19 @@ +.. _troubleshooting: + +Troubleshooting +=============== + +.. _troubleshooting_ocr_language_files_missing: + +Consumer warns ``OCR for XX failed`` +------------------------------------ + +If you find the OCR accuracy to be too low, and/or the document consumer warns that ``OCR for +XX failed, but we're going to stick with what we've got since FORGIVING_OCR is enabled``, then you +might need to install the `Tesseract language files +`_ matching your documents' languages. + +As an example, if you are running Paperless from the Vagrant setup provided (or from any Ubuntu or Debian +box), and your documents are written in Spanish you may need to run:: + + apt-get install -y tesseract-ocr-spa diff --git a/paperless.conf.example b/paperless.conf.example index 3ee429ea8..d254b7320 100644 --- a/paperless.conf.example +++ b/paperless.conf.example @@ -20,7 +20,7 @@ PAPERLESS_CONSUME_MAIL_PASS="" # # The passphrase you use here will be used when storing your documents in # Paperless, but you can always export them in an unencrypted format by using -# document exporter. See the documentaiton for more information. +# document exporter. See the documentation for more information. # # One final note about the passphrase. Once you've consumed a document with # one passphrase, DON'T CHANGE IT. Paperless assumes this to be a constant and @@ -31,3 +31,8 @@ PAPERLESS_PASSPHRASE="secret" # If you intend to consume documents either via HTTP POST or by email, you must # have a shared secret here.
PAPERLESS_SHARED_SECRET="" + +# By default, Paperless will attempt to use all available CPU cores to process +# a document, but if you would like to limit that, you can set this value to +# an integer: +#PAPERLESS_OCR_THREADS=1 diff --git a/presentation/img/kitten.jpg b/presentation/img/kitten.jpg new file mode 100644 index 000000000..cb90ef944 Binary files /dev/null and b/presentation/img/kitten.jpg differ diff --git a/presentation/index.html b/presentation/index.html index 0b6921f9f..25ce83ad9 100644 --- a/presentation/index.html +++ b/presentation/index.html @@ -148,12 +148,12 @@

Demo!

-

(Time to sacrifice a kitten)

+

TODO

-

It works, but it could use polish

+

It works, but it needs polish

  • The UI is the Django admin
  • Mail consumption is really raw
  • @@ -163,11 +163,11 @@ diff --git a/requirements.txt b/requirements.txt index 527ca4142..04ec38065 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,4 +1,4 @@ -Django==1.9.2 +Django==1.9.4 Pillow==3.1.1 django-crispy-forms==1.6.0 django-extensions==1.6.1 diff --git a/src/documents/consumer.py b/src/documents/consumer.py index 244383211..08ed98fd0 100644 --- a/src/documents/consumer.py +++ b/src/documents/consumer.py @@ -19,12 +19,11 @@ from PIL import Image from django.conf import settings from django.utils import timezone -from django.template.defaultfilters import slugify from pyocr.tesseract import TesseractError from paperless.db import GnuPG -from .models import Correspondent, Tag, Document, Log +from .models import Tag, Document, Log, FileInfo from .languages import ISO639 from .signals import ( document_consumption_started, document_consumption_finished) @@ -56,19 +55,6 @@ class Consumer(object): DEFAULT_OCR_LANGUAGE = settings.OCR_LANGUAGE - REGEX_TITLE = re.compile( - r"^.*/(.*)\.(pdf|jpe?g|png|gif|tiff)$", - flags=re.IGNORECASE - ) - REGEX_CORRESPONDENT_TITLE = re.compile( - r"^.*/(.+) - (.*)\.(pdf|jpe?g|png|gif|tiff)$", - flags=re.IGNORECASE - ) - REGEX_CORRESPONDENT_TITLE_TAGS = re.compile( - r"^.*/(.*) - (.*) - ([a-z0-9\-,]*)\.(pdf|jpe?g|png|gif|tiff)$", - flags=re.IGNORECASE - ) - def __init__(self): self.logger = logging.getLogger(__name__) @@ -107,7 +93,7 @@ class Consumer(object): if not os.path.isfile(doc): continue - if not re.match(self.REGEX_TITLE, doc): + if not re.match(FileInfo.REGEXES["title"], doc): continue if doc in self._ignore: @@ -282,72 +268,20 @@ class Consumer(object): # Strip out excess white space to allow matching to go smoother return re.sub(r"\s+", " ", r) - def _guess_attributes_from_name(self, parseable): - """ - We use a crude naming convention to make handling the correspondent, - title, and tags easier: - " - - <tags>.<suffix>" - "<correspondent> - <title>.<suffix>" - "<title>.<suffix>" - """ - - def 
get_correspondent(correspondent_name): - return Correspondent.objects.get_or_create( - name=correspondent_name, - defaults={"slug": slugify(correspondent_name)} - )[0] - - def get_tags(tags): - r = [] - for t in tags.split(","): - r.append( - Tag.objects.get_or_create(slug=t, defaults={"name": t})[0]) - return tuple(r) - - def get_suffix(suffix): - suffix = suffix.lower() - if suffix == "jpeg": - return "jpg" - return suffix - - # First attempt: "<correspondent> - <title> - <tags>.<suffix>" - m = re.match(self.REGEX_CORRESPONDENT_TITLE_TAGS, parseable) - if m: - return ( - get_correspondent(m.group(1)), - m.group(2), - get_tags(m.group(3)), - get_suffix(m.group(4)) - ) - - # Second attempt: "<correspondent> - <title>.<suffix>" - m = re.match(self.REGEX_CORRESPONDENT_TITLE, parseable) - if m: - return ( - get_correspondent(m.group(1)), - m.group(2), - (), - get_suffix(m.group(3)) - ) - - # That didn't work, so we assume correspondent and tags are None - m = re.match(self.REGEX_TITLE, parseable) - return None, m.group(1), (), get_suffix(m.group(2)) - def _store(self, text, doc, thumbnail): - sender, title, tags, file_type = self._guess_attributes_from_name(doc) - relevant_tags = set(list(Tag.match_all(text)) + list(tags)) + file_info = FileInfo.from_path(doc) + relevant_tags = set(list(Tag.match_all(text)) + list(file_info.tags)) stats = os.stat(doc) self.log("debug", "Saving record to database") document = Document.objects.create( - correspondent=sender, - title=title, + correspondent=file_info.correspondent, + title=file_info.title, content=text, - file_type=file_type, + file_type=file_info.extension, created=timezone.make_aware( datetime.datetime.fromtimestamp(stats.st_mtime)), modified=timezone.make_aware( diff --git a/src/documents/management/commands/document_exporter.py b/src/documents/management/commands/document_exporter.py index 913f7ae79..1c6ac6e44 100644 --- a/src/documents/management/commands/document_exporter.py +++ 
b/src/documents/management/commands/document_exporter.py @@ -96,11 +96,16 @@ class Command(Renderable, BaseCommand): @staticmethod def _get_legacy_file_name(doc): - if doc.correspondent and doc.title: - tags = ",".join([t.slug for t in doc.tags.all()]) - if tags: - return "{} - {} - {}.{}".format( - doc.correspondent, doc.title, tags, doc.file_type) - return "{} - {}.{}".format( - doc.correspondent, doc.title, doc.file_type) - return os.path.basename(doc.source_path) + + if not doc.correspondent and not doc.title: + return os.path.basename(doc.source_path) + + created = doc.created.strftime("%Y%m%d%H%M%SZ") + tags = ",".join([t.slug for t in doc.tags.all()]) + + if tags: + return "{} - {} - {} - {}.{}".format( + created, doc.correspondent, doc.title, tags, doc.file_type) + + return "{} - {} - {}.{}".format( + created, doc.correspondent, doc.title, doc.file_type) diff --git a/src/documents/models.py b/src/documents/models.py index 0d79dba0a..cf32fabe3 100644 --- a/src/documents/models.py +++ b/src/documents/models.py @@ -1,8 +1,11 @@ +import dateutil.parser import logging import os import re import uuid +from collections import OrderedDict + from django.conf import settings from django.core.urlresolvers import reverse from django.db import models @@ -152,7 +155,7 @@ class Document(models.Model): ) tags = models.ManyToManyField( Tag, related_name="documents", blank=True) - created = models.DateTimeField(default=timezone.now, editable=False) + created = models.DateTimeField(default=timezone.now) modified = models.DateTimeField(auto_now=True, editable=False) class Meta(object): @@ -250,3 +253,136 @@ class Log(models.Model): self.group = uuid.uuid4() models.Model.save(self, *args, **kwargs) + + +class FileInfo(object): + + # This epic regex *almost* worked for our needs, so I'm keeping it here for + # posterity, in the hopes that we might find a way to make it work one day. + ALMOST_REGEX = re.compile( + r"^((?P<date>\d\d\d\d\d\d\d\d\d\d\d\d\d\dZ){separator})?" 
+ r"((?P<correspondent>{non_separated_word}+){separator})??" + r"(?P<title>{non_separated_word}+)" + r"({separator}(?P<tags>[a-z,0-9-]+))?" + r"\.(?P<extension>[a-zA-Z.-]+)$".format( + separator=r"\s+-\s+", + non_separated_word=r"([\w,. ]|([^\s]-))" + ) + ) + + REGEXES = OrderedDict([ + ("created-correspondent-title-tags", re.compile( + r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - " + r"(?P<correspondent>.*) - " + r"(?P<title>.*) - " + r"(?P<tags>[a-z0-9\-,]*)" + r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$", + flags=re.IGNORECASE + )), + ("created-title-tags", re.compile( + r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - " + r"(?P<title>.*) - " + r"(?P<tags>[a-z0-9\-,]*)" + r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$", + flags=re.IGNORECASE + )), + ("created-correspondent-title", re.compile( + r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - " + r"(?P<correspondent>.*) - " + r"(?P<title>.*)" + r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$", + flags=re.IGNORECASE + )), + ("created-title", re.compile( + r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - " + r"(?P<title>.*)" + r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$", + flags=re.IGNORECASE + )), + ("correspondent-title-tags", re.compile( + r"(?P<correspondent>.*) - " + r"(?P<title>.*) - " + r"(?P<tags>[a-z0-9\-,]*)" + r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$", + flags=re.IGNORECASE + )), + ("correspondent-title", re.compile( + r"(?P<correspondent>.*) - " + r"(?P<title>.*)?" 
+ r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$", + flags=re.IGNORECASE + )), + ("title", re.compile( + r"(?P<title>.*)" + r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$", + flags=re.IGNORECASE + )) + ]) + + def __init__(self, created=None, correspondent=None, title=None, tags=(), + extension=None): + + self.created = created + self.title = title + self.extension = extension + self.correspondent = correspondent + self.tags = tags + + @classmethod + def _get_created(cls, created): + return dateutil.parser.parse("{:0<14}Z".format(created[:-1])) + + @classmethod + def _get_correspondent(cls, name): + if not name: + return None + return Correspondent.objects.get_or_create(name=name, defaults={ + "slug": slugify(name) + })[0] + + @classmethod + def _get_title(cls, title): + return title + + @classmethod + def _get_tags(cls, tags): + r = [] + for t in tags.split(","): + r.append( + Tag.objects.get_or_create(slug=t, defaults={"name": t})[0]) + return tuple(r) + + @classmethod + def _get_extension(cls, extension): + r = extension.lower() + if r == "jpeg": + return "jpg" + return r + + @classmethod + def _mangle_property(cls, properties, name): + if name in properties: + properties[name] = getattr(cls, "_get_{}".format(name))( + properties[name] + ) + + @classmethod + def from_path(cls, path): + """ + We use a crude naming convention to make handling the correspondent, + title, and tags easier: + "<correspondent> - <title> - <tags>.<suffix>" + "<correspondent> - <title>.<suffix>" + "<title>.<suffix>" + """ + + for regex in cls.REGEXES.values(): + m = regex.match(os.path.basename(path)) + if m: + properties = m.groupdict() + cls._mangle_property(properties, "created") + cls._mangle_property(properties, "correspondent") + cls._mangle_property(properties, "title") + cls._mangle_property(properties, "tags") + cls._mangle_property(properties, "extension") + return cls(**properties) diff --git a/src/documents/tests/test_consumer.py b/src/documents/tests/test_consumer.py index 
04f92f98c..48407044d 100644 --- a/src/documents/tests/test_consumer.py +++ b/src/documents/tests/test_consumer.py @@ -1,29 +1,36 @@ from django.test import TestCase -from ..consumer import Consumer +from ..models import Document, FileInfo class TestAttachment(TestCase): TAGS = ("tag1", "tag2", "tag3") - CONSUMER = Consumer() - SUFFIXES = ( + EXTENSIONS = ( "pdf", "png", "jpg", "jpeg", "gif", "PDF", "PNG", "JPG", "JPEG", "GIF", "PdF", "PnG", "JpG", "JPeG", "GiF", ) def _test_guess_attributes_from_name(self, path, sender, title, tags): - for suffix in self.SUFFIXES: - f = path.format(suffix) - results = self.CONSUMER._guess_attributes_from_name(f) - self.assertEqual(results[0].name, sender, f) - self.assertEqual(results[1], title, f) - self.assertEqual(tuple([t.slug for t in results[2]]), tags, f) - if suffix.lower() == "jpeg": - self.assertEqual(results[3], "jpg", f) + + for extension in self.EXTENSIONS: + + f = path.format(extension) + file_info = FileInfo.from_path(f) + + if sender: + self.assertEqual(file_info.correspondent.name, sender, f) else: - self.assertEqual(results[3], suffix.lower(), f) + self.assertIsNone(file_info.correspondent, f) + + self.assertEqual(file_info.title, title, f) + + self.assertEqual(tuple([t.slug for t in file_info.tags]), tags, f) + if extension.lower() == "jpeg": + self.assertEqual(file_info.extension, "jpg", f) + else: + self.assertEqual(file_info.extension, extension.lower(), f) def test_guess_attributes_from_name0(self): self._test_guess_attributes_from_name( @@ -92,3 +99,206 @@ class TestAttachment(TestCase): "Τιτλε", self.TAGS ) + + def test_guess_attributes_from_name_when_correspondent_empty(self): + self._test_guess_attributes_from_name( + '/path/to/ - weird empty correspondent but should not break.{}', + None, + 'weird empty correspondent but should not break', + () + ) + + def test_guess_attributes_from_name_when_title_starts_with_dash(self): + self._test_guess_attributes_from_name( + '/path/to/- weird but should not 
break.{}', + None, + '- weird but should not break', + () + ) + + def test_guess_attributes_from_name_when_title_ends_with_dash(self): + self._test_guess_attributes_from_name( + '/path/to/weird but should not break -.{}', + None, + 'weird but should not break -', + () + ) + + def test_guess_attributes_from_name_when_title_is_empty(self): + self._test_guess_attributes_from_name( + '/path/to/weird correspondent but should not break - .{}', + 'weird correspondent but should not break', + '', + () + ) + + +class Permutations(TestCase): + + valid_dates = ( + "20150102030405Z", + "20150102Z", + ) + valid_correspondents = [ + "timmy", + "Dr. McWheelie", + "Dash Gor-don", + "ο Θερμαστής", + "" + ] + valid_titles = ["title", "Title w Spaces", "Title a-dash", "Τίτλος", ""] + valid_tags = ["tag", "tig,tag", "tag1,tag2,tag-3"] + valid_extensions = ["pdf", "png", "jpg", "jpeg", "gif"] + + def _test_guessed_attributes(self, filename, created=None, + correspondent=None, title=None, + extension=None, tags=None): + + # print(filename) + info = FileInfo.from_path(filename) + + # Created + if created is None: + self.assertIsNone(info.created, filename) + else: + self.assertEqual(info.created.year, int(created[:4]), filename) + self.assertEqual(info.created.month, int(created[4:6]), filename) + self.assertEqual(info.created.day, int(created[6:8]), filename) + + # Correspondent + if correspondent: + self.assertEqual(info.correspondent.name, correspondent, filename) + else: + self.assertEqual(info.correspondent, None, filename) + + # Title + self.assertEqual(info.title, title, filename) + + # Tags + if tags is None: + self.assertEqual(info.tags, (), filename) + else: + self.assertEqual( + [t.slug for t in info.tags], tags.split(','), + filename + ) + + # Extension + if extension == 'jpeg': + extension = 'jpg' + self.assertEqual(info.extension, extension, filename) + + def test_just_title(self): + template = '/path/to/{title}.{extension}' + for title in self.valid_titles: + for extension 
in self.valid_extensions: + spec = dict(title=title, extension=extension) + filename = template.format(**spec) + self._test_guessed_attributes(filename, **spec) + + def test_title_and_correspondent(self): + template = '/path/to/{correspondent} - {title}.{extension}' + for correspondent in self.valid_correspondents: + for title in self.valid_titles: + for extension in self.valid_extensions: + spec = dict(correspondent=correspondent, title=title, + extension=extension) + filename = template.format(**spec) + self._test_guessed_attributes(filename, **spec) + + def test_title_and_correspondent_and_tags(self): + template = '/path/to/{correspondent} - {title} - {tags}.{extension}' + for correspondent in self.valid_correspondents: + for title in self.valid_titles: + for tags in self.valid_tags: + for extension in self.valid_extensions: + spec = dict(correspondent=correspondent, title=title, + tags=tags, extension=extension) + filename = template.format(**spec) + self._test_guessed_attributes(filename, **spec) + + def test_created_and_correspondent_and_title_and_tags(self): + + template = ("/path/to/{created} - " + "{correspondent} - " + "{title} - " + "{tags}" + ".{extension}") + + for created in self.valid_dates: + for correspondent in self.valid_correspondents: + for title in self.valid_titles: + for tags in self.valid_tags: + for extension in self.valid_extensions: + spec = { + "created": created, + "correspondent": correspondent, + "title": title, + "tags": tags, + "extension": extension + } + self._test_guessed_attributes( + template.format(**spec), **spec) + + def test_created_and_correspondent_and_title(self): + + template = ("/path/to/{created} - " + "{correspondent} - " + "{title}" + ".{extension}") + + for created in self.valid_dates: + for correspondent in self.valid_correspondents: + for title in self.valid_titles: + + # Skip cases where title looks like a tag as we can't + # accommodate such cases. 
+ if title.lower() == title: + continue + + for extension in self.valid_extensions: + spec = { + "created": created, + "correspondent": correspondent, + "title": title, + "extension": extension + } + self._test_guessed_attributes( + template.format(**spec), **spec) + + def test_created_and_title(self): + + template = ("/path/to/{created} - " + "{title}" + ".{extension}") + + for created in self.valid_dates: + for title in self.valid_titles: + for extension in self.valid_extensions: + spec = { + "created": created, + "title": title, + "extension": extension + } + self._test_guessed_attributes( + template.format(**spec), **spec) + + def test_created_and_title_and_tags(self): + + template = ("/path/to/{created} - " + "{title} - " + "{tags}" + ".{extension}") + + for created in self.valid_dates: + for title in self.valid_titles: + for tags in self.valid_tags: + for extension in self.valid_extensions: + spec = { + "created": created, + "title": title, + "tags": tags, + "extension": extension + } + self._test_guessed_attributes( + template.format(**spec), **spec)
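The date handling exercised by the tests above can be tried outside Django with a small standalone sketch. It reuses two of the patterns from `REGEXES` ("created-title" and "title") and the same zero-padding trick as `_get_created`. Everything else is an assumption made for illustration: the names `PATTERNS`, `parse_created`, and `guess` are hypothetical, `datetime.strptime` stands in for `dateutil.parser.parse` so the snippet needs only the standard library, and no `Correspondent` or `Tag` objects are created.

```python
import re
from datetime import datetime, timezone

# Two of the patterns from REGEXES, tried in the same order as in the patch:
# the more specific "created-title" first, then the bare "title" fallback.
PATTERNS = [
    re.compile(
        r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - "
        r"(?P<title>.*)"
        r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
        flags=re.IGNORECASE,
    ),
    re.compile(
        r"(?P<title>.*)"
        r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
        flags=re.IGNORECASE,
    ),
]


def parse_created(created):
    """Turn "YYYYMMDDZ" or "YYYYMMDDHHMMSSZ" into an aware UTC datetime."""
    # Drop the trailing "Z", then right-pad with zeros to 14 digits so a
    # date-only stamp becomes midnight of that day.
    padded = "{:0<14}".format(created[:-1])
    # Assumption: strptime replaces dateutil here to avoid the dependency.
    parsed = datetime.strptime(padded, "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


def guess(name):
    """Return a dict of guessed properties, or None if nothing matches."""
    for pattern in PATTERNS:
        match = pattern.match(name)
        if not match:
            continue
        properties = match.groupdict()
        if properties.get("created"):
            properties["created"] = parse_created(properties["created"])
        # Normalise the extension the same way _get_extension does.
        extension = properties["extension"].lower()
        properties["extension"] = "jpg" if extension == "jpeg" else extension
        return properties
    return None
```

Calling `guess("20150102Z - Title a-dash.JPEG")` yields a title of `Title a-dash`, an extension of `jpg`, and a `created` value of midnight UTC on 2 January 2015. A name without a date prefix falls through to the second pattern and carries no `created` key at all.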