From 070463b85a396c4895e6473e63af70af6406b539 Mon Sep 17 00:00:00 2001 From: Daniel Quinn Date: Thu, 3 Mar 2016 20:52:42 +0000 Subject: [PATCH 1/6] s/Sender/Correspondent & reworked the (im|ex)porter --- docs/consumption.rst | 38 ++-- docs/migrating.rst | 177 +++++++----------- docs/utilities.rst | 90 ++++++++- src/documents/admin.py | 4 +- src/documents/consumer.py | 4 +- src/documents/forms.py | 10 +- src/documents/mail.py | 4 +- .../management/commands/document_exporter.py | 29 ++- .../management/commands/document_importer.py | 110 +++++++++++ .../migrations/0011_auto_20160303_1929.py | 19 ++ src/documents/models.py | 16 +- src/documents/serialisers.py | 6 +- src/documents/views.py | 15 +- src/paperless/urls.py | 4 +- 14 files changed, 342 insertions(+), 184 deletions(-) create mode 100644 src/documents/management/commands/document_importer.py create mode 100644 src/documents/migrations/0011_auto_20160303_1929.py diff --git a/docs/consumption.rst b/docs/consumption.rst index 8b9b35433..0f8ff7ca5 100644 --- a/docs/consumption.rst +++ b/docs/consumption.rst @@ -44,10 +44,10 @@ Any document you put into the consumption directory will be consumed, but if you name the file right, it'll automatically set some values in the database for you. This is is the logic the consumer follows: -1. Try to find the sender, title, and tags in the file name following the - pattern: ``Sender - Title - tag,tag,tag.pdf``. -2. If that doesn't work, try to find the sender and title in the file name - following the pattern: ``Sender - Title.pdf``. +1. Try to find the correspondent, title, and tags in the file name following + the pattern: ``Correspondent - Title - tag,tag,tag.pdf``. +2. If that doesn't work, try to find the correspondent and title in the file + name following the pattern: ``Correspondent - Title.pdf``. 3. If that doesn't work, just assume that the name of the file is the title. 
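The three-step fallback above can be sketched in plain Python. This is a minimal illustration, not the consumer's actual code: the patterns mirror the ``REGEX_*`` constants touched later in this series, but the function name ``guess_attributes`` and the plain-tuple return value are assumptions made for the example (the real consumer creates ``Correspondent`` and ``Tag`` records instead of returning strings).

```python
import re

SUFFIXES = r"(pdf|jpe?g|png|gif|tiff)"

# Patterns modelled on the consumer's constants; most specific first.
REGEX_CORRESPONDENT_TITLE_TAGS = re.compile(
    r"^.*/(.*) - (.*) - ([a-z0-9\-,]*)\." + SUFFIXES + r"$",
    flags=re.IGNORECASE,
)
REGEX_CORRESPONDENT_TITLE = re.compile(
    r"^.*/(.+) - (.*)\." + SUFFIXES + r"$", flags=re.IGNORECASE
)
REGEX_TITLE = re.compile(r"^.*/(.*)\." + SUFFIXES + r"$", flags=re.IGNORECASE)


def guess_attributes(path):
    """Hypothetical helper: return (correspondent, title, tags, suffix)."""

    # Step 1: "Correspondent - Title - tag,tag,tag.pdf"
    m = REGEX_CORRESPONDENT_TITLE_TAGS.match(path)
    if m:
        tags = tuple(t for t in m.group(3).split(",") if t)
        return m.group(1), m.group(2), tags, m.group(4).lower()

    # Step 2: "Correspondent - Title.pdf"
    m = REGEX_CORRESPONDENT_TITLE.match(path)
    if m:
        return m.group(1), m.group(2), (), m.group(3).lower()

    # Step 3: the whole file name is the title
    m = REGEX_TITLE.match(path)
    if m is None:
        raise ValueError("Unsupported file name: {}".format(path))
    return None, m.group(1), (), m.group(2).lower()
```

Trying the most specific pattern first matters: the two-part pattern would also match a three-part name, and would then fold the title into the correspondent.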
So given the above, the following examples would work as you'd expect: @@ -97,9 +97,9 @@ So, with all that in mind, here's what you do to get it running: the configured email account every 10 minutes for something new and pull down whatever it finds. 4. Send yourself an email! Note that the subject is treated as the file name, - so if you set the subject to ``Sender - Title - tag,tag,tag``, you'll get - what you expect. Also, you must include the aforementioned secret string in - every email so the fetcher knows that it's safe to import. + so if you set the subject to ``Correspondent - Title - tag,tag,tag``, you'll + get what you expect. Also, you must include the aforementioned secret + string in every email so the fetcher knows that it's safe to import. 5. After a few minutes, the consumer will poll your mailbox, pull down the message, and place the attachment in the consumption directory with the appropriate name. A few minutes later, the consumer will import it like any @@ -118,16 +118,16 @@ a real API, it's just a URL that accepts an HTTP POST. To push your document to *Paperless*, send an HTTP POST to the server with the following name/value pairs: -* ``sender``: The name of the document's sender. Note that there are - restrictions on what characters you can use here. Specifically, alphanumeric - characters, `-`, `,`, `.`, and `'` are ok, everything else it out. You also - can't use the sequence ` - ` (space, dash, space). +* ``correspondent``: The name of the document's correspondent. Note that there + are restrictions on what characters you can use here. Specifically, + alphanumeric characters, `-`, `,`, `.`, and `'` are ok, everything else it + out. You also can't use the sequence ` - ` (space, dash, space). * ``title``: The title of the document. The rules for characters is the same - here as the sender. 
-* ``signature``: For security reasons, we have the sender send a signature using - a "shared secret" method to make sure that random strangers don't start - uploading stuff to your server. The means of generating this signature is - defined below. + here as the correspondent. +* ``signature``: For security reasons, we have the correspondent send a + signature using a "shared secret" method to make sure that random strangers + don't start uploading stuff to your server. The means of generating this + signature is defined below. Specify ``enctype="multipart/form-data"``, and then POST your file with::: @@ -146,12 +146,12 @@ verification. In the case of *Paperless*, you configure the server with the secret by setting ``UPLOAD_SHARED_SECRET``. Then on your client, you generate your signature by -concatenating the sender, title, and the secret, and then using sha256 to -generate a hexdigest. +concatenating the correspondent, title, and the secret, and then using sha256 +to generate a hexdigest. If you're using Python, this is what that looks like: .. code:: python from hashlib import sha256 - signature = sha256(sender + title + secret).hexdigest() + signature = sha256(correspondent + title + secret).hexdigest() diff --git a/docs/migrating.rst b/docs/migrating.rst index 491eeace4..d659620ac 100644 --- a/docs/migrating.rst +++ b/docs/migrating.rst @@ -4,10 +4,68 @@ Migrating, Updates, and Backups =============================== As *Paperless* is still under active development, there's a lot that can change -as software updates roll out. The thing you just need to remember for all of -this is that for the most part, **the database is expendable** so long as you -have your files. This is because the file name of the exported files includes -the name of the sender, the title, and the tags (if any) on each file. +as software updates roll out. You should backup often, so if anything goes +wrong during an update, you at least have a means of restoring to something +usable. 
Thankfully, there are automated ways of backing up, restoring, and +updating the software. + + +.. _migrating-backup: + +Backing Up +---------- + +So you're bored of this whole project, or you want to make a remote backup of +the unencrypted files for whatever reason. This is easy to do, simply use the +:ref:`exporter ` to dump your documents and database out +into an arbitrary directory. + + +.. _migrating-restoring: + +Restoring +--------- + +Restoring your data is just as easy, since nearly all of your data exists either +in the file names, or in the contents of the files themselves. You just need to +create an empty database (just follow the +:ref:`installation instructions ` again) and then import the +``tags.json`` file you created as part of your backup. Lastly, copy your +exported documents into the consumption directory and start up the consumer. + +.. code-block:: shell-session + + $ cd /path/to/project + $ rm data/db.sqlite3 # Delete the database + $ cd src + $ ./manage.py migrate # Create the database + $ ./manage.py createsuperuser + $ ./manage.py loaddata /path/to/arbitrary/place/tags.json + $ cp /path/to/exported/docs/* /path/to/consumption/dir/ + $ ./manage.py document_consumer + +Importing your data if you are :ref:`using Docker ` +is almost as simple: + +.. code-block:: shell-session + + # Stop and remove your current containers + $ docker-compose stop + $ docker-compose rm -f + + # Recreate them, add the superuser + $ docker-compose up -d + $ docker-compose run --rm webserver createsuperuser + + # Load the tags + $ cat /path/to/arbitrary/place/tags.json | docker-compose run --rm webserver loaddata_stdin - + + # Load your exported documents into the consumption directory + # (How you do this highly depends on how you have set this up) + $ cp /path/to/exported/docs/* /path/to/mounted/consumption/dir/ + +After loading the documents into the consumption directory the consumer will +immediately start consuming the documents. .. 
_migrating-updates: @@ -20,7 +78,7 @@ on the directory containing the project files, and then use Django's ``migrate`` command to execute any database schema updates that might have been rolled in as part of the update: -.. code:: bash +.. code-block:: shell-session $ cd /path/to/project $ git pull @@ -43,112 +101,3 @@ requires only one additional step: If ``git pull`` doesn't report any changes, there is no need to continue with the remaining steps. - - -.. _migrating-backup: - -Backing Up ----------- - -So you're bored of this whole project, or you want to make a remote backup of -the unencrypted files for whatever reason. This is easy to do, simply use the -:ref:`exporter ` to dump your documents out into an -arbitrary directory. - -Additionally however, you'll need to back up the tags themselves. The file -names contain the tag names, but you still need to define the tags and their -matching algorithms in the database for things to work properly. We do this -with Django's ``dumpdata`` command, which produces JSON output. - -.. code:: bash - - $ cd /path/to/project - $ cd src - $ ./manage.py document_export /path/to/arbitrary/place/ - $ ./manage.py dumpdata documents.Tag > /path/to/arbitrary/place/tags.json - -If you are :ref:`using Docker `, exporting your tags -as JSON is almost as easy: - -.. code-block:: shell-session - - $ docker-compose run --rm webserver dumpdata documents.Tag > /path/to/arbitrary/place/tags.json - -To export the documents you can either use ``docker run`` directly, specifying all -the commandline options by hand, or (more simply) mount a second volume for export. - -To mount a volume for exports, follow the instructions in the -``docker-compose.yml.example`` file for the ``/export`` volume (making the changes -in your own ``docker-compose.yml`` file, of course). Once you have the -volume mounted, the command to run an export is: - -.. 
code-block:: console - - $ docker-compose run --rm consumer document_exporter /export - -If you prefer to use ``docker run`` directly, supplying the necessary commandline -options: - -.. code-block:: shell-session - - $ # Identify your containers - $ docker-compose ps - Name Command State Ports - ------------------------------------------------------------------------- - paperless_consumer_1 /sbin/docker-entrypoint.sh ... Exit 0 - paperless_webserver_1 /sbin/docker-entrypoint.sh ... Exit 0 - - $ # Make sure to replace your passphrase and remove or adapt the id mapping - $ docker run --rm \ - --volumes-from paperless_data_1 \ - --volume /path/to/arbitrary/place:/export \ - -e PAPERLESS_PASSPHRASE=YOUR_PASSPHRASE \ - -e USERMAP_UID=1000 -e USERMAP_GID=1000 \ - paperless document_exporter /export - - -.. _migrating-restoring: - -Restoring ---------- - -Restoring your data is just as easy, since nearly all of your data exists either -in the file names, or in the contents of the files themselves. You just need to -create an empty database (just follow the -:ref:`installation instructions ` again) and then import the -``tags.json`` file you created as part of your backup. Lastly, copy your -exported documents into the consumption directory and start up the consumer. - -.. code:: bash - - $ cd /path/to/project - $ rm data/db.sqlite3 # Delete the database - $ cd src - $ ./manage.py migrate # Create the database - $ ./manage.py createsuperuser - $ ./manage.py loaddata /path/to/arbitrary/place/tags.json - $ cp /path/to/exported/docs/* /path/to/consumption/dir/ - $ ./manage.py document_consumer - -Importing your data if you are :ref:`using Docker ` -is almost as simple: - -.. 
code-block:: shell-session - - $ # Stop and remove your current containers - $ docker-compose stop - $ docker-compose rm -f - - $ # Recreate them, add the superuser - $ docker-compose up -d - $ docker-compose run --rm webserver createsuperuser - - $ # Load the tags - $ cat /path/to/arbitrary/place/tags.json | docker-compose run --rm webserver loaddata_stdin - - - $ # Load your exported documents into the consumption directory - $ # (How you do this highly depends on how you have set this up) - $ cp /path/to/exported/docs/* /path/to/mounted/consumption/dir/ - -After loading the documents into the consumption directory the consumer will -immediately start consuming the documents. diff --git a/docs/utilities.rst b/docs/utilities.rst index f5b452a6f..ce3555b73 100644 --- a/docs/utilities.rst +++ b/docs/utilities.rst @@ -26,7 +26,7 @@ How to Use It The webserver is started via the ``manage.py`` script: -.. code:: bash +.. code-block:: shell-session $ /path/to/paperless/src/manage.py runserver @@ -64,7 +64,7 @@ How to Use It The consumer is started via the ``manage.py`` script: -.. code:: bash +.. code-block:: shell-session $ /path/to/paperless/src/manage.py document_consumer @@ -95,16 +95,86 @@ How to Use It This too is done via the ``manage.py`` script: -.. code:: bash +.. code-block:: shell-session - $ /path/to/paperless/src/manage.py document_exporter /path/to/somewhere + $ /path/to/paperless/src/manage.py document_exporter /path/to/somewhere/ -This will dump all of your PDFs into ``/path/to/somewhere`` for you to do with -as you please. The naming scheme on export is identical to that used for -import, so should you can now safely delete the entire project directly, -database, encrypted PDFs and all, and later create it all again simply by -running the consumer again and dumping all of these files into -``CONSUMPTION_DIR``. +This will dump all of your unencrypted PDFs into ``/path/to/somewhere`` for you +to do with as you please. 
The files are accompanied with a special file, +``manifest.json`` which can be used to +:ref:`import the files ` at a later date if you wish. + + +.. _utilities-exporter-howto-docker: + +Docker +______ + +If you are :ref:`using Docker `, running the +exporter is almost as easy. To mount a volume for exports, follow the +instructions in the ``docker-compose.yml.example`` file for the ``/export`` +volume (making the changes in your own ``docker-compose.yml`` file, of course). +Once you have the volume mounted, the command to run an export is: + +.. code-block:: shell-session + + $ docker-compose run --rm consumer document_exporter /export + +If you prefer to use ``docker run`` directly, supplying the necessary commandline +options: + +.. code-block:: shell-session + + $ # Identify your containers + $ docker-compose ps + Name Command State Ports + ------------------------------------------------------------------------- + paperless_consumer_1 /sbin/docker-entrypoint.sh ... Exit 0 + paperless_webserver_1 /sbin/docker-entrypoint.sh ... Exit 0 + + $ # Make sure to replace your passphrase and remove or adapt the id mapping + $ docker run --rm \ + --volumes-from paperless_data_1 \ + --volume /path/to/arbitrary/place:/export \ + -e PAPERLESS_PASSPHRASE=YOUR_PASSPHRASE \ + -e USERMAP_UID=1000 -e USERMAP_GID=1000 \ + paperless document_exporter /export + + +.. _utilities-importer: + +The Importer +------------ + +Looking to transfer Paperless data from one instance to another, or just want +to restore from a backup? This is your go-to toy. + + +.. _utilities-importer-howto: + +How to Use It +............. + +The importer works just like the exporter. You point it at a directory, and +the script does the rest of the work: + +..
code-block:: shell-session + + $ /path/to/paperless/src/manage.py document_importer /path/to/somewhere/ + +Docker +______ + +Assuming that you've already gone through the steps above in the +:ref:`export ` section, then the easiest thing +to do is just re-use the ``/export`` path you already setup: + +.. code-block:: shell-session + + $ docker-compose run --rm consumer document_importer /export + +Similarly, if you're not using docker-compose, you can adjust the export +instructions above to do the import. .. _utilities-retagger: diff --git a/src/documents/admin.py b/src/documents/admin.py index 118a295eb..3baad817b 100644 --- a/src/documents/admin.py +++ b/src/documents/admin.py @@ -3,7 +3,7 @@ from django.contrib.auth.models import User, Group from django.core.urlresolvers import reverse from django.templatetags.static import static -from .models import Sender, Tag, Document, Log +from .models import Correspondent, Tag, Document, Log class MonthListFilter(admin.SimpleListFilter): @@ -107,7 +107,7 @@ class LogAdmin(admin.ModelAdmin): list_filter = ("level", "component",) -admin.site.register(Sender) +admin.site.register(Correspondent) admin.site.register(Tag, TagAdmin) admin.site.register(Document, DocumentAdmin) admin.site.register(Log, LogAdmin) diff --git a/src/documents/consumer.py b/src/documents/consumer.py index 5617ed550..4233cded8 100644 --- a/src/documents/consumer.py +++ b/src/documents/consumer.py @@ -24,7 +24,7 @@ from pyocr.tesseract import TesseractError from paperless.db import GnuPG -from .models import Sender, Tag, Document, Log +from .models import Correspondent, Tag, Document, Log from .languages import ISO639 @@ -246,7 +246,7 @@ class Consumer(object): """ def get_sender(sender_name): - return Sender.objects.get_or_create( + return Correspondent.objects.get_or_create( name=sender_name, defaults={"slug": slugify(sender_name)})[0] def get_tags(tags): diff --git a/src/documents/forms.py b/src/documents/forms.py index 8eb7b8381..d8960f88b 100644 
--- a/src/documents/forms.py +++ b/src/documents/forms.py @@ -8,7 +8,7 @@ from time import mktime from django import forms from django.conf import settings -from .models import Document, Sender +from .models import Document, Correspondent from .consumer import Consumer @@ -24,7 +24,9 @@ class UploadForm(forms.Form): } sender = forms.CharField( - max_length=Sender._meta.get_field("name").max_length, required=False) + max_length=Correspondent._meta.get_field("name").max_length, + required=False + ) title = forms.CharField( max_length=Document._meta.get_field("title").max_length, required=False @@ -41,7 +43,7 @@ class UploadForm(forms.Form): sender = self.cleaned_data.get("sender") if not sender: return None - if not Sender.SAFE_REGEX.match(sender) or " - " in sender: + if not Correspondent.SAFE_REGEX.match(sender) or " - " in sender: raise forms.ValidationError("That sender name is suspicious.") return sender @@ -49,7 +51,7 @@ class UploadForm(forms.Form): title = self.cleaned_data.get("title") if not title: return None - if not Sender.SAFE_REGEX.match(title) or " - " in title: + if not Correspondent.SAFE_REGEX.match(title) or " - " in title: raise forms.ValidationError("That title is suspicious.") def clean_document(self): diff --git a/src/documents/mail.py b/src/documents/mail.py index 0bc3ce94f..5bacb5b5f 100644 --- a/src/documents/mail.py +++ b/src/documents/mail.py @@ -14,7 +14,7 @@ from dateutil import parser from django.conf import settings from .consumer import Consumer -from .models import Sender, Log +from .models import Correspondent, Log class MailFetcherError(Exception): @@ -103,7 +103,7 @@ class Message(Loggable): def check_subject(self): if self.subject is None: raise InvalidMessageError("Message does not have a subject") - if not Sender.SAFE_REGEX.match(self.subject): + if not Correspondent.SAFE_REGEX.match(self.subject): raise InvalidMessageError("Message subject is unsafe: {}".format( self.subject)) diff --git 
a/src/documents/management/commands/document_exporter.py b/src/documents/management/commands/document_exporter.py index ac448d8e8..87ed804a2 100644 --- a/src/documents/management/commands/document_exporter.py +++ b/src/documents/management/commands/document_exporter.py @@ -1,10 +1,12 @@ +import json import os import time from django.conf import settings from django.core.management.base import BaseCommand, CommandError +from django.core import serializers -from documents.models import Document +from documents.models import Document, Correspondent, Tag from paperless.db import GnuPG from ...mixins import Renderable @@ -14,21 +16,19 @@ class Command(Renderable, BaseCommand): help = """ Decrypt and rename all files in our collection into a given target - directory. Note that we don't export any of the parsed data since - that can always be re-collected via the consumer. + directory. And include a manifest file containing document data for + easy import. """.replace(" ", "") def add_arguments(self, parser): parser.add_argument("target") def __init__(self, *args, **kwargs): - self.verbosity = 0 - self.target = None BaseCommand.__init__(self, *args, **kwargs) + self.target = None def handle(self, *args, **options): - self.verbosity = options["verbosity"] self.target = options["target"] if not os.path.exists(self.target): @@ -40,9 +40,15 @@ class Command(Renderable, BaseCommand): if not settings.PASSPHRASE: settings.PASSPHRASE = input("Please enter the passphrase: ") - for document in Document.objects.all(): + documents = Document.objects.all() + document_map = {d.pk: d for d in documents} + manifest = json.loads(serializers.serialize("json", documents)) + for document_dict in manifest: + + document = document_map[document_dict["pk"]] target = os.path.join(self.target, document.file_name) + document_dict["__exported_file_name__"] = target print("Exporting: {}".format(target)) @@ -50,3 +56,12 @@ class Command(Renderable, BaseCommand): 
f.write(GnuPG.decrypted(document.source_file)) t = int(time.mktime(document.created.timetuple())) os.utime(target, times=(t, t)) + + manifest += json.loads( + serializers.serialize("json", Correspondent.objects.all())) + + manifest += json.loads(serializers.serialize( + "json", Tag.objects.all())) + + with open(os.path.join(self.target, "manifest.json"), "w") as f: + json.dump(manifest, f, indent=2) diff --git a/src/documents/management/commands/document_importer.py b/src/documents/management/commands/document_importer.py new file mode 100644 index 000000000..213c049e4 --- /dev/null +++ b/src/documents/management/commands/document_importer.py @@ -0,0 +1,110 @@ +import json +import os + +from django.conf import settings +from django.core.management.base import BaseCommand, CommandError +from django.core.management import call_command + +from documents.models import Document +from paperless.db import GnuPG + +from ...mixins import Renderable + + +class Command(Renderable, BaseCommand): + + help = """ + Using a manifest.json file, load the data from there, and import the + documents it refers to. 
+ """.replace(" ", "") + + def add_arguments(self, parser): + parser.add_argument("source") + parser.add_argument( + '--ignore-absent', + action='store_true', + default=False, + help="If the manifest refers to a document that doesn't exist, " + "ignore it and attempt to import what it can" + ) + + def __init__(self, *args, **kwargs): + BaseCommand.__init__(self, *args, **kwargs) + self.source = None + self.manifest = None + + def handle(self, *args, **options): + + self.source = options["source"] + + if not os.path.exists(self.source): + raise CommandError("That path doesn't exist") + + if not os.access(self.source, os.R_OK): + raise CommandError("That path doesn't appear to be readable") + + manifest_path = os.path.join(self.source, "manifest.json") + self._check_manifest_exists(manifest_path) + + with open(manifest_path) as f: + self.manifest = json.load(f) + + self._check_manifest() + + if not settings.PASSPHRASE: + raise CommandError( + "You need to define a passphrase before continuing. Please " + "consult the documentation for setting up Paperless." + ) + + # Fill up the database with whatever is in the manifest + call_command("loaddata", manifest_path) + + self._import_files_from_manifest() + + @staticmethod + def _check_manifest_exists(path): + if not os.path.exists(path): + raise CommandError( + "That directory doesn't appear to contain a manifest.json " + "file." + ) + + def _check_manifest(self): + + for record in self.manifest: + + if not record["model"] == "documents.document": + continue + + if "__exported_file_name__" not in record: + raise CommandError( + 'The manifest file contains a record which does not ' + 'refer to an actual document file. 
If you want to import ' + 'the rest anyway (skipping such references) call the ' + 'importer with --ignore-absent' + ) + + doc_file = record["__exported_file_name__"] + if not os.path.exists(os.path.join(self.source, doc_file)): + raise CommandError( + 'The manifest file refers to "{}" which does not ' + 'appear to be in the source directory. If you want to ' + 'import the rest anyway (skipping such references) call ' + 'the importer with --ignore-absent'.format(doc_file) + ) + + def _import_files_from_manifest(self): + + for record in self.manifest: + + if not record["model"] == "documents.document": + continue + + doc_file = record["__exported_file_name__"] + document = Document.objects.get(pk=record["pk"]) + with open(doc_file, "rb") as unencrypted: + with open(document.source_path, "wb") as encrypted: + print("Encrypting {} and saving it to {}".format( + doc_file, document.source_path)) + encrypted.write(GnuPG.encrypted(unencrypted)) diff --git a/src/documents/migrations/0011_auto_20160303_1929.py b/src/documents/migrations/0011_auto_20160303_1929.py new file mode 100644 index 000000000..a9aefddaf --- /dev/null +++ b/src/documents/migrations/0011_auto_20160303_1929.py @@ -0,0 +1,19 @@ +# -*- coding: utf-8 -*- +# Generated by Django 1.9.2 on 2016-03-03 19:29 +from __future__ import unicode_literals + +from django.db import migrations + + +class Migration(migrations.Migration): + + dependencies = [ + ('documents', '0010_log'), + ] + + operations = [ + migrations.RenameModel( + old_name='Sender', + new_name='Correspondent', + ), + ] diff --git a/src/documents/models.py b/src/documents/models.py index e5556534a..0fb6489c4 100644 --- a/src/documents/models.py +++ b/src/documents/models.py @@ -28,7 +28,7 @@ class SluggedModel(models.Model): return self.name -class Sender(SluggedModel): +class Correspondent(SluggedModel): # This regex is probably more restrictive than it needs to be, but it's # better safe than sorry. 
@@ -141,7 +141,7 @@ class Document(models.Model): TYPES = (TYPE_PDF, TYPE_PNG, TYPE_JPG, TYPE_GIF, TYPE_TIF,) sender = models.ForeignKey( - Sender, blank=True, null=True, related_name="documents") + Correspondent, blank=True, null=True, related_name="documents") title = models.CharField(max_length=128, blank=True, db_index=True) content = models.TextField(db_index=True) file_type = models.CharField( @@ -158,9 +158,9 @@ class Document(models.Model): ordering = ("sender", "title") def __str__(self): - created = self.created.strftime("%Y-%m-%d") + created = self.created.strftime("%Y%m%d%H%M%S") if self.sender and self.title: - return "{}: {}, {}".format(created, self.sender, self.title) + return "{}: {} - {}".format(created, self.sender, self.title) if self.sender or self.title: return "{}: {}".format(created, self.sender or self.title) return str(created) @@ -179,13 +179,7 @@ class Document(models.Model): @property def file_name(self): - if self.sender and self.title: - tags = ",".join([t.slug for t in self.tags.all()]) - if tags: - return "{} - {} - {}.{}".format( - self.sender, self.title, tags, self.file_type) - return "{} - {}.{}".format(self.sender, self.title, self.file_type) - return os.path.basename(self.source_path) + return slugify(str(self)) + "." 
+ self.file_type @property def download_url(self): diff --git a/src/documents/serialisers.py b/src/documents/serialisers.py index f9b29f790..340fdaa25 100644 --- a/src/documents/serialisers.py +++ b/src/documents/serialisers.py @@ -1,12 +1,12 @@ from rest_framework import serializers -from .models import Sender, Tag, Document, Log +from .models import Correspondent, Tag, Document, Log -class SenderSerializer(serializers.HyperlinkedModelSerializer): +class CorrespondentSerializer(serializers.HyperlinkedModelSerializer): class Meta(object): - model = Sender + model = Correspondent fields = ("id", "slug", "name") diff --git a/src/documents/views.py b/src/documents/views.py index 0b2b50926..ff7c4ce05 100644 --- a/src/documents/views.py +++ b/src/documents/views.py @@ -1,6 +1,5 @@ from django.contrib.auth.mixins import LoginRequiredMixin from django.http import HttpResponse -from django.template.defaultfilters import slugify from django.views.decorators.csrf import csrf_exempt from django.views.generic import FormView, DetailView, TemplateView @@ -14,9 +13,9 @@ from rest_framework.viewsets import ( from paperless.db import GnuPG from .forms import UploadForm -from .models import Sender, Tag, Document, Log +from .models import Correspondent, Tag, Document, Log from .serialisers import ( - SenderSerializer, TagSerializer, DocumentSerializer, LogSerializer) + CorrespondentSerializer, TagSerializer, DocumentSerializer, LogSerializer) class IndexView(TemplateView): @@ -52,7 +51,7 @@ class FetchView(LoginRequiredMixin, DetailView): content_type=content_types[self.object.file_type] ) response["Content-Disposition"] = 'attachment; filename="{}"'.format( - slugify(str(self.object)) + "." 
+ self.object.file_type)
+            self.object.file_name)
         return response
@@ -81,10 +80,10 @@ class StandardPagination(PageNumberPagination):
     max_page_size = 100000
 
 
-class SenderViewSet(ModelViewSet):
-    model = Sender
-    queryset = Sender.objects.all()
-    serializer_class = SenderSerializer
+class CorrespondentViewSet(ModelViewSet):
+    model = Correspondent
+    queryset = Correspondent.objects.all()
+    serializer_class = CorrespondentSerializer
     pagination_class = StandardPagination
     permission_classes = (IsAuthenticated,)
 
diff --git a/src/paperless/urls.py b/src/paperless/urls.py
index 24a495810..e81d4dcf9 100644
--- a/src/paperless/urls.py
+++ b/src/paperless/urls.py
@@ -22,11 +22,11 @@ from rest_framework.routers import DefaultRouter
 
 from documents.views import (
     IndexView, FetchView, PushView,
-    SenderViewSet, TagViewSet, DocumentViewSet, LogViewSet
+    CorrespondentViewSet, TagViewSet, DocumentViewSet, LogViewSet
 )
 
 router = DefaultRouter()
-router.register(r'senders', SenderViewSet)
+router.register(r'senders', CorrespondentViewSet)
 router.register(r'tags', TagViewSet)
 router.register(r'documents', DocumentViewSet)
 router.register(r'logs', LogViewSet)

From ba7878b9aa5b115ad91daddf387433a3948c7619 Mon Sep 17 00:00:00 2001
From: Daniel Quinn <code@danielquinn.org>
Date: Thu, 3 Mar 2016 21:25:08 +0000
Subject: [PATCH 2/6] Added some tests for the importer

---
 .../management/commands/document_importer.py | 15 ++------
 src/documents/tests/test_importer.py         | 36 +++++++++++++++++++
 2 files changed, 38 insertions(+), 13 deletions(-)
 create mode 100644 src/documents/tests/test_importer.py

diff --git a/src/documents/management/commands/document_importer.py b/src/documents/management/commands/document_importer.py
index 213c049e4..63c961815 100644
--- a/src/documents/management/commands/document_importer.py
+++ b/src/documents/management/commands/document_importer.py
@@ -20,13 +20,6 @@ class Command(Renderable, BaseCommand):
 
     def add_arguments(self, parser):
         parser.add_argument("source")
-        parser.add_argument(
-            '--ignore-absent',
-            action='store_true',
-            default=False,
-            help="If the manifest refers to a document that doesn't exist, "
-                 "ignore it and attempt to import what it can"
-        )
 
     def __init__(self, *args, **kwargs):
         BaseCommand.__init__(self, *args, **kwargs)
@@ -80,18 +73,14 @@ class Command(Renderable, BaseCommand):
             if "__exported_file_name__" not in record:
                 raise CommandError(
                     'The manifest file contains a record which does not '
-                    'refer to an actual document file. If you want to import '
-                    'the rest anyway (skipping such references) call the '
-                    'importer with --ignore-absent'
+                    'refer to an actual document file.'
                 )
 
             doc_file = record["__exported_file_name__"]
             if not os.path.exists(os.path.join(self.source, doc_file)):
                 raise CommandError(
                     'The manifest file refers to "{}" which does not '
-                    'appear to be in the source directory. If you want to '
-                    'import the rest anyway (skipping such references) call '
-                    'the importer with --ignore-absent'.format(doc_file)
+                    'appear to be in the source directory.'.format(doc_file)
                 )
 
     def _import_files_from_manifest(self):
diff --git a/src/documents/tests/test_importer.py b/src/documents/tests/test_importer.py
new file mode 100644
index 000000000..8880aba66
--- /dev/null
+++ b/src/documents/tests/test_importer.py
@@ -0,0 +1,36 @@
+from django.core.management.base import CommandError
+from django.test import TestCase
+
+from ..management.commands.document_importer import Command
+
+
+class TestImporter(TestCase):
+
+    def __init__(self, *args, **kwargs):
+        TestCase.__init__(self, *args, **kwargs)
+
+    def test_check_manifest_exists(self):
+        cmd = Command()
+        self.assertRaises(
+            CommandError, cmd._check_manifest_exists, "/tmp/manifest.json")
+
+    def test_check_manifest(self):
+
+        cmd = Command()
+        cmd.source = "/tmp"
+
+        cmd.manifest = [{"model": "documents.document"}]
+        with self.assertRaises(CommandError) as cm:
+            cmd._check_manifest()
+        self.assertTrue(
+            'The manifest file contains a record' in str(cm.exception))
+
+        cmd.manifest = [{
+            "model": "documents.document",
+            "__exported_file_name__": "noexist.pdf"
+        }]
+        # self.assertRaises(CommandError, cmd._check_manifest)
+        with self.assertRaises(CommandError) as cm:
+            cmd._check_manifest()
+        self.assertTrue(
+            'The manifest file refers to "noexist.pdf"' in str(cm.exception))

From 5d4587ef8b599fbe91c74740ded81e35d1b711f8 Mon Sep 17 00:00:00 2001
From: Daniel Quinn <code@danielquinn.org>
Date: Fri, 4 Mar 2016 09:14:50 +0000
Subject: [PATCH 3/6] Accounted for .sender in a few places

---
 src/documents/admin.py                        |  6 +--
 src/documents/consumer.py                     | 34 ++++++++--------
 src/documents/forms.py                        | 29 +++++++-------
 .../management/commands/document_exporter.py  | 39 +++++++++++++++++++
 .../migrations/0011_auto_20160303_1929.py     |  9 +++++
 src/documents/models.py                       | 13 ++++---
 src/documents/serialisers.py                  |  6 +--
 7 files changed, 95 insertions(+), 41 deletions(-)

diff --git a/src/documents/admin.py b/src/documents/admin.py
index 3baad817b..a5b523492 100644
--- a/src/documents/admin.py
+++ b/src/documents/admin.py
@@ -45,9 +45,9 @@ class DocumentAdmin(admin.ModelAdmin):
             "all": ("paperless.css",)
         }
 
-    search_fields = ("sender__name", "title", "content")
-    list_display = ("created_", "sender", "title", "tags_", "document")
-    list_filter = ("tags", "sender", MonthListFilter)
+    search_fields = ("correspondent__name", "title", "content")
+    list_display = ("created_", "correspondent", "title", "tags_", "document")
+    list_filter = ("tags", "correspondent", MonthListFilter)
     list_per_page = 25
 
     def created_(self, obj):
diff --git a/src/documents/consumer.py b/src/documents/consumer.py
index 4233cded8..eeb42cdf1 100644
--- a/src/documents/consumer.py
+++ b/src/documents/consumer.py
@@ -57,11 +57,11 @@ class Consumer(object):
         r"^.*/(.*)\.(pdf|jpe?g|png|gif|tiff)$",
         flags=re.IGNORECASE
     )
-    REGEX_SENDER_TITLE = re.compile(
+    REGEX_CORRESPONDENT_TITLE = re.compile(
         r"^.*/(.+) - (.*)\.(pdf|jpe?g|png|gif|tiff)$",
         flags=re.IGNORECASE
     )
-    REGEX_SENDER_TITLE_TAGS = re.compile(
+    REGEX_CORRESPONDENT_TITLE_TAGS = re.compile(
         r"^.*/(.*) - (.*) - ([a-z0-9\-,]*)\.(pdf|jpe?g|png|gif|tiff)$",
         flags=re.IGNORECASE
     )
@@ -238,16 +238,18 @@ class Consumer(object):
 
     def _guess_attributes_from_name(self, parseable):
         """
-        We use a crude naming convention to make handling the sender, title,
-        and tags easier:
-          "<sender> - <title> - <tags>.<suffix>"
-          "<sender> - <title>.<suffix>"
+        We use a crude naming convention to make handling the correspondent,
+        title, and tags easier:
+          "<correspondent> - <title> - <tags>.<suffix>"
+          "<correspondent> - <title>.<suffix>"
           "<title>.<suffix>"
         """
 
-        def get_sender(sender_name):
+        def get_correspondent(correspondent_name):
             return Correspondent.objects.get_or_create(
-                name=sender_name, defaults={"slug": slugify(sender_name)})[0]
+                name=correspondent_name,
+                defaults={"slug": slugify(correspondent_name)}
+            )[0]
 
         def get_tags(tags):
             r = []
@@ -262,27 +264,27 @@ class Consumer(object):
                 return "jpg"
             return suffix
 
-        # First attempt: "<sender> - <title> - <tags>.<suffix>"
-        m = re.match(self.REGEX_SENDER_TITLE_TAGS, parseable)
+        # First attempt: "<correspondent> - <title> - <tags>.<suffix>"
+        m = re.match(self.REGEX_CORRESPONDENT_TITLE_TAGS, parseable)
         if m:
             return (
-                get_sender(m.group(1)),
+                get_correspondent(m.group(1)),
                 m.group(2),
                 get_tags(m.group(3)),
                 get_suffix(m.group(4))
             )
 
-        # Second attempt: "<sender> - <title>.<suffix>"
-        m = re.match(self.REGEX_SENDER_TITLE, parseable)
+        # Second attempt: "<correspondent> - <title>.<suffix>"
+        m = re.match(self.REGEX_CORRESPONDENT_TITLE, parseable)
         if m:
             return (
-                get_sender(m.group(1)),
+                get_correspondent(m.group(1)),
                 m.group(2),
                 (),
                 get_suffix(m.group(3))
             )
 
-        # That didn't work, so we assume sender and tags are None
+        # That didn't work, so we assume correspondent and tags are None
         m = re.match(self.REGEX_TITLE, parseable)
         return None, m.group(1), (), get_suffix(m.group(2))
 
@@ -296,7 +298,7 @@ class Consumer(object):
         self.log("debug", "Saving record to database")
 
         document = Document.objects.create(
-            sender=sender,
+            correspondent=sender,
             title=title,
             content=text,
             file_type=file_type,
diff --git a/src/documents/forms.py b/src/documents/forms.py
index d8960f88b..d4c01745a 100644
--- a/src/documents/forms.py
+++ b/src/documents/forms.py
@@ -23,7 +23,7 @@ class UploadForm(forms.Form):
         "image/tiff": Document.TYPE_TIF,
     }
 
-    sender = forms.CharField(
+    correspondent = forms.CharField(
         max_length=Correspondent._meta.get_field("name").max_length,
         required=False
     )
@@ -34,18 +34,19 @@ class UploadForm(forms.Form):
     document = forms.FileField()
     signature = forms.CharField(max_length=256)
 
-    def clean_sender(self):
+    def clean_correspondent(self):
         """
         I suppose it might look cleaner to use .get_or_create() here, but that
-        would also allow someone to fill up the db with bogus senders before
-        all validation was met.
+        would also allow someone to fill up the db with bogus correspondents
+        before all validation was met.
         """
-        sender = self.cleaned_data.get("sender")
-        if not sender:
+        corresp = self.cleaned_data.get("correspondent")
+        if not corresp:
             return None
-        if not Correspondent.SAFE_REGEX.match(sender) or " - " in sender:
-            raise forms.ValidationError("That sender name is suspicious.")
-        return sender
+        if not Correspondent.SAFE_REGEX.match(corresp) or " - " in corresp:
+            raise forms.ValidationError(
+                "That correspondent name is suspicious.")
+        return corresp
 
     def clean_title(self):
         title = self.cleaned_data.get("title")
@@ -63,10 +64,10 @@ class UploadForm(forms.Form):
         return document, self.TYPE_LOOKUP[file_type]
 
     def clean(self):
-        sender = self.clened_data("sender")
+        corresp = self.clened_data("correspondent")
         title = self.cleaned_data("title")
         signature = self.cleaned_data("signature")
-        if sha256(sender + title + self.SECRET).hexdigest() == signature:
+        if sha256(corresp + title + self.SECRET).hexdigest() == signature:
             return True
         return False
 
@@ -77,13 +78,15 @@ class UploadForm(forms.Form):
         form do that as well. Think of it as a poor-man's queue server.
         """
 
-        sender = self.clened_data("sender")
+        correspondent = self.clened_data("correspondent")
         title = self.cleaned_data("title")
         document, file_type = self.cleaned_data.get("document")
 
         t = int(mktime(datetime.now()))
         file_name = os.path.join(
-            Consumer.CONSUME, "{} - {}.{}".format(sender, title, file_type))
+            Consumer.CONSUME,
+            "{} - {}.{}".format(correspondent, title, file_type)
+        )
 
         with open(file_name, "wb") as f:
             f.write(document)
diff --git a/src/documents/management/commands/document_exporter.py b/src/documents/management/commands/document_exporter.py
index 87ed804a2..913f7ae79 100644
--- a/src/documents/management/commands/document_exporter.py
+++ b/src/documents/management/commands/document_exporter.py
@@ -22,6 +22,13 @@ class Command(Renderable, BaseCommand):
 
     def add_arguments(self, parser):
         parser.add_argument("target")
+        parser.add_argument(
+            "--legacy",
+            action="store_true",
+            help="Don't try to export all of the document data, just dump the "
+                 "original document files out in a format that makes "
+                 "re-consuming them easy."
+        )
 
     def __init__(self, *args, **kwargs):
         BaseCommand.__init__(self, *args, **kwargs)
@@ -40,6 +47,13 @@ class Command(Renderable, BaseCommand):
         if not settings.PASSPHRASE:
             settings.PASSPHRASE = input("Please enter the passphrase: ")
 
+        if options["legacy"]:
+            self.dump_legacy()
+        else:
+            self.dump()
+
+    def dump(self):
+
         documents = Document.objects.all()
         document_map = {d.pk: d for d in documents}
         manifest = json.loads(serializers.serialize("json", documents))
@@ -65,3 +79,28 @@ class Command(Renderable, BaseCommand):
 
         with open(os.path.join(self.target, "manifest.json"), "w") as f:
             json.dump(manifest, f, indent=2)
+
+    def dump_legacy(self):
+
+        for document in Document.objects.all():
+
+            target = os.path.join(
+                self.target, self._get_legacy_file_name(document))
+
+            print("Exporting: {}".format(target))
+
+            with open(target, "wb") as f:
+                f.write(GnuPG.decrypted(document.source_file))
+                t = int(time.mktime(document.created.timetuple()))
+                os.utime(target, times=(t, t))
+
+    @staticmethod
+    def _get_legacy_file_name(doc):
+        if doc.correspondent and doc.title:
+            tags = ",".join([t.slug for t in doc.tags.all()])
+            if tags:
+                return "{} - {} - {}.{}".format(
+                    doc.correspondent, doc.title, tags, doc.file_type)
+            return "{} - {}.{}".format(
+                doc.correspondent, doc.title, doc.file_type)
+        return os.path.basename(doc.source_path)
diff --git a/src/documents/migrations/0011_auto_20160303_1929.py b/src/documents/migrations/0011_auto_20160303_1929.py
index a9aefddaf..af4ee4c66 100644
--- a/src/documents/migrations/0011_auto_20160303_1929.py
+++ b/src/documents/migrations/0011_auto_20160303_1929.py
@@ -16,4 +16,13 @@ class Migration(migrations.Migration):
             old_name='Sender',
             new_name='Correspondent',
         ),
+        migrations.AlterModelOptions(
+            name='document',
+            options={'ordering': ('correspondent', 'title')},
+        ),
+        migrations.RenameField(
+            model_name='document',
+            old_name='sender',
+            new_name='correspondent',
+        ),
     ]
diff --git a/src/documents/models.py b/src/documents/models.py
index 0fb6489c4..a82f7643f 100644
--- a/src/documents/models.py
+++ b/src/documents/models.py
@@ -140,7 +140,7 @@ class Document(models.Model):
     TYPE_TIF = "tiff"
     TYPES = (TYPE_PDF, TYPE_PNG, TYPE_JPG, TYPE_GIF, TYPE_TIF,)
 
-    sender = models.ForeignKey(
+    correspondent = models.ForeignKey(
         Correspondent, blank=True, null=True, related_name="documents")
     title = models.CharField(max_length=128, blank=True, db_index=True)
     content = models.TextField(db_index=True)
@@ -155,14 +155,15 @@ class Document(models.Model):
     modified = models.DateTimeField(auto_now=True, editable=False)
 
     class Meta(object):
-        ordering = ("sender", "title")
+        ordering = ("correspondent", "title")
 
     def __str__(self):
         created = self.created.strftime("%Y%m%d%H%M%S")
-        if self.sender and self.title:
-            return "{}: {} - {}".format(created, self.sender, self.title)
-        if self.sender or self.title:
-            return "{}: {}".format(created, self.sender or self.title)
+        if self.correspondent and self.title:
+            return "{}: {} - {}".format(
+                created, self.correspondent, self.title)
+        if self.correspondent or self.title:
+            return "{}: {}".format(created, self.correspondent or self.title)
         return str(created)
 
     @property
diff --git a/src/documents/serialisers.py b/src/documents/serialisers.py
index 340fdaa25..c2b2ae7fd 100644
--- a/src/documents/serialisers.py
+++ b/src/documents/serialisers.py
@@ -20,8 +20,8 @@ class TagSerializer(serializers.HyperlinkedModelSerializer):
 
 class DocumentSerializer(serializers.ModelSerializer):
 
-    sender = serializers.HyperlinkedRelatedField(
-        read_only=True, view_name="drf:sender-detail", allow_null=True)
+    correspondent = serializers.HyperlinkedRelatedField(
+        read_only=True, view_name="drf:correspondent-detail", allow_null=True)
     tags = serializers.HyperlinkedRelatedField(
         read_only=True, view_name="drf:tag-detail", many=True)
 
@@ -29,7 +29,7 @@ class DocumentSerializer(serializers.ModelSerializer):
         model = Document
         fields = (
             "id",
-            "sender",
+            "correspondent",
             "title",
             "content",
             "file_type",

From 13c2ed66e13c493c25ca460f29f43aa1f0f5815d Mon Sep 17 00:00:00 2001
From: Daniel Quinn <code@danielquinn.org>
Date: Fri, 4 Mar 2016 17:53:54 +0000
Subject: [PATCH 4/6] Better bare metal explanation

---
 docs/setup.rst | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/docs/setup.rst b/docs/setup.rst
index 077ce135c..9992418c1 100644
--- a/docs/setup.rst
+++ b/docs/setup.rst
@@ -42,12 +42,13 @@ route`_ is quick & easy, but means you're running a VM which comes with memory
 consumption etc. We also `support Docker`_, which you can use natively under
 Linux and in a VM with `Docker Machine`_ (this guide was written for native
 Docker usage under Linux, you might have to adapt it for Docker Machine.)
-Alternatively the standard, `bare metal`_ approach is a little more complicated.
+Alternatively the standard, `bare metal`_ approach is a little more complicated,
+but worth it because it makes things easier should you want to contribute some
+code back.
 
 .. _Vagrant route: setup-installation-vagrant_
 .. _support Docker: setup-installation-docker_
 .. _bare metal: setup-installation-standard_
-
 .. _Docker Machine: https://docs.docker.com/machine/
 
 .. _setup-installation-standard:

From 94a7914073f1ba449f3c23b314be87e7418e90d4 Mon Sep 17 00:00:00 2001
From: Daniel Quinn <code@danielquinn.org>
Date: Fri, 4 Mar 2016 23:20:22 +0000
Subject: [PATCH 5/6] More descriptive

---
 docs/changelog.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/changelog.rst b/docs/changelog.rst
index 772e30dc0..d135d3564 100644
--- a/docs/changelog.rst
+++ b/docs/changelog.rst
@@ -4,7 +4,7 @@ Changelog
 * 0.1.1 (master)
 
   * `#68`_: Added support for using a proper config file at
-    ``/etc/paperless.conf``.
+    ``/etc/paperless.conf`` and modified the systemd unit files to use it.
   * Refactored the Vagrant installation process to use environment variables
     rather than asking the user to modify ``settings.py``.
   * `#44`_: Harmonise environment variable names with constant names.

From d24cfbb24652972b6c72f70a3eca4b78f22817f7 Mon Sep 17 00:00:00 2001
From: Daniel Quinn <code@danielquinn.org>
Date: Fri, 4 Mar 2016 23:22:57 +0000
Subject: [PATCH 6/6] Added the bit about s/sender/correspondent/g

---
 docs/changelog.rst | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/docs/changelog.rst b/docs/changelog.rst
index d135d3564..2228c9be1 100644
--- a/docs/changelog.rst
+++ b/docs/changelog.rst
@@ -3,6 +3,10 @@ Changelog
 
 * 0.1.1 (master)
 
+  * Potentially **Breaking Change**: All references to "sender" in the code
+    have been renamed to "correspondent" to better reflect the nature of the
+    property (one could quite reasonably scan a document before sending it to
+    someone.)
   * `#68`_: Added support for using a proper config file at
     ``/etc/paperless.conf`` and modified the systemd unit files to use it.
   * Refactored the Vagrant installation process to use environment variables