Merge pull request #70 from danielquinn/feature/exporter2

New exporter + s/sender/correspondent/g
Daniel Quinn 2016-03-04 23:32:48 +00:00
commit 66df220bea
17 changed files with 468 additions and 226 deletions


@ -3,8 +3,12 @@ Changelog
* 0.1.1 (master) * 0.1.1 (master)
* Potentially **Breaking Change**: All references to "sender" in the code
have been renamed to "correspondent" to better reflect the nature of the
property (one could quite reasonably scan a document before sending it to
someone.)
* `#68`_: Added support for using a proper config file at * `#68`_: Added support for using a proper config file at
``/etc/paperless.conf``. ``/etc/paperless.conf`` and modified the systemd unit files to use it.
* Refactored the Vagrant installation process to use environment variables * Refactored the Vagrant installation process to use environment variables
rather than asking the user to modify ``settings.py``. rather than asking the user to modify ``settings.py``.
* `#44`_: Harmonise environment variable names with constant names. * `#44`_: Harmonise environment variable names with constant names.


@ -44,10 +44,10 @@ Any document you put into the consumption directory will be consumed, but if you
name the file right, it'll automatically set some values in the database for name the file right, it'll automatically set some values in the database for
you. This is the logic the consumer follows: you. This is the logic the consumer follows:
1. Try to find the sender, title, and tags in the file name following the 1. Try to find the correspondent, title, and tags in the file name following
pattern: ``Sender - Title - tag,tag,tag.pdf``. the pattern: ``Correspondent - Title - tag,tag,tag.pdf``.
2. If that doesn't work, try to find the sender and title in the file name 2. If that doesn't work, try to find the correspondent and title in the file
following the pattern: ``Sender - Title.pdf``. name following the pattern: ``Correspondent - Title.pdf``.
3. If that doesn't work, just assume that the name of the file is the title. 3. If that doesn't work, just assume that the name of the file is the title.
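The three steps above can be sketched as a small standalone function. This is only an illustration, assuming bare file names rather than full paths; the function name ``guess_attributes`` and the module-level pattern names are hypothetical, and the list of file extensions is borrowed from the consumer's regular expressions elsewhere in this commit.

```python
import re

# Extension list borrowed from the consumer's patterns; adjust as needed.
EXT = r"\.(pdf|jpe?g|png|gif|tiff)$"

# Hypothetical names: one pattern per step, tried from most to least specific.
CORRESPONDENT_TITLE_TAGS = re.compile(
    r"^(.*) - (.*) - ([a-z0-9\-,]*)" + EXT, re.IGNORECASE)
CORRESPONDENT_TITLE = re.compile(r"^(.+) - (.*)" + EXT, re.IGNORECASE)
TITLE = re.compile(r"^(.*)" + EXT, re.IGNORECASE)


def guess_attributes(file_name):
    """Return (correspondent, title, tags) guessed from a bare file name."""
    # Step 1: "Correspondent - Title - tag,tag,tag.pdf"
    m = CORRESPONDENT_TITLE_TAGS.match(file_name)
    if m:
        tags = tuple(t for t in m.group(3).split(",") if t)
        return m.group(1), m.group(2), tags

    # Step 2: "Correspondent - Title.pdf"
    m = CORRESPONDENT_TITLE.match(file_name)
    if m:
        return m.group(1), m.group(2), ()

    # Step 3: the whole name is the title
    m = TITLE.match(file_name)
    if m:
        return None, m.group(1), ()

    raise ValueError("Unsupported file name: {}".format(file_name))
```

Calling ``guess_attributes("Amazon - Invoice.pdf")`` falls through to the second step, because the name contains only one `` - `` separator.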
So given the above, the following examples would work as you'd expect: So given the above, the following examples would work as you'd expect:
@ -97,9 +97,9 @@ So, with all that in mind, here's what you do to get it running:
the configured email account every 10 minutes for something new and pull down the configured email account every 10 minutes for something new and pull down
whatever it finds. whatever it finds.
4. Send yourself an email! Note that the subject is treated as the file name, 4. Send yourself an email! Note that the subject is treated as the file name,
so if you set the subject to ``Sender - Title - tag,tag,tag``, you'll get so if you set the subject to ``Correspondent - Title - tag,tag,tag``, you'll
what you expect. Also, you must include the aforementioned secret string in get what you expect. Also, you must include the aforementioned secret
every email so the fetcher knows that it's safe to import. string in every email so the fetcher knows that it's safe to import.
5. After a few minutes, the consumer will poll your mailbox, pull down the 5. After a few minutes, the consumer will poll your mailbox, pull down the
message, and place the attachment in the consumption directory with the message, and place the attachment in the consumption directory with the
appropriate name. A few minutes later, the consumer will import it like any appropriate name. A few minutes later, the consumer will import it like any
@ -118,16 +118,16 @@ a real API, it's just a URL that accepts an HTTP POST.
To push your document to *Paperless*, send an HTTP POST to the server with the To push your document to *Paperless*, send an HTTP POST to the server with the
following name/value pairs: following name/value pairs:
* ``sender``: The name of the document's sender. Note that there are * ``correspondent``: The name of the document's correspondent. Note that there
restrictions on what characters you can use here. Specifically, alphanumeric are restrictions on what characters you can use here. Specifically,
characters, `-`, `,`, `.`, and `'` are ok, everything else is out. You also alphanumeric characters, `-`, `,`, `.`, and `'` are ok, everything else is
can't use the sequence ` - ` (space, dash, space). out. You also can't use the sequence ` - ` (space, dash, space).
* ``title``: The title of the document. The rules for characters are the same * ``title``: The title of the document. The rules for characters are the same
here as the sender. here as the correspondent.
* ``signature``: For security reasons, we have the sender send a signature using * ``signature``: For security reasons, we have the correspondent send a
a "shared secret" method to make sure that random strangers don't start signature using a "shared secret" method to make sure that random strangers
uploading stuff to your server. The means of generating this signature is don't start uploading stuff to your server. The means of generating this
defined below. signature is defined below.
Specify ``enctype="multipart/form-data"``, and then POST your file with::: Specify ``enctype="multipart/form-data"``, and then POST your file with:::
@ -146,12 +146,12 @@ verification.
In the case of *Paperless*, you configure the server with the secret by setting In the case of *Paperless*, you configure the server with the secret by setting
``UPLOAD_SHARED_SECRET``. Then on your client, you generate your signature by ``UPLOAD_SHARED_SECRET``. Then on your client, you generate your signature by
concatenating the sender, title, and the secret, and then using sha256 to concatenating the correspondent, title, and the secret, and then using sha256
generate a hexdigest. to generate a hexdigest.
If you're using Python, this is what that looks like: If you're using Python, this is what that looks like:
.. code:: python .. code:: python
from hashlib import sha256 from hashlib import sha256
signature = sha256(sender + title + secret).hexdigest() signature = sha256(correspondent + title + secret).hexdigest()
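Under Python 3, ``sha256()`` only accepts bytes, so the concatenated string has to be encoded first. The following is a minimal sketch of that, assuming UTF-8 (the documentation does not name an encoding, and the server has to use the same one); the helper name ``make_signature`` is hypothetical.

```python
from hashlib import sha256


def make_signature(correspondent, title, secret):
    """Concatenate the three values and return the SHA-256 hex digest."""
    # Assumption: UTF-8. Client and server must agree on the encoding.
    payload = (correspondent + title + secret).encode("utf-8")
    return sha256(payload).hexdigest()
```

The result is a 64-character hexadecimal string, which is what goes into the ``signature`` field of the POST.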


@ -4,10 +4,68 @@ Migrating, Updates, and Backups
=============================== ===============================
As *Paperless* is still under active development, there's a lot that can change As *Paperless* is still under active development, there's a lot that can change
as software updates roll out. The thing you just need to remember for all of as software updates roll out. You should back up often, so if anything goes
this is that for the most part, **the database is expendable** so long as you wrong during an update, you at least have a means of restoring to something
have your files. This is because the file name of the exported files includes usable. Thankfully, there are automated ways of backing up, restoring, and
the name of the sender, the title, and the tags (if any) on each file. updating the software.
.. _migrating-backup:
Backing Up
----------
So you're bored of this whole project, or you want to make a remote backup of
the unencrypted files for whatever reason. This is easy to do, simply use the
:ref:`exporter <utilities-exporter>` to dump your documents and database out
into an arbitrary directory.
.. _migrating-restoring:
Restoring
---------
Restoring your data is just as easy, since nearly all of your data exists either
in the file names, or in the contents of the files themselves. You just need to
create an empty database (just follow the
:ref:`installation instructions <setup-installation>` again) and then import the
``tags.json`` file you created as part of your backup. Lastly, copy your
exported documents into the consumption directory and start up the consumer.
.. code-block:: shell-session
$ cd /path/to/project
$ rm data/db.sqlite3 # Delete the database
$ cd src
$ ./manage.py migrate # Create the database
$ ./manage.py createsuperuser
$ ./manage.py loaddata /path/to/arbitrary/place/tags.json
$ cp /path/to/exported/docs/* /path/to/consumption/dir/
$ ./manage.py document_consumer
Importing your data if you are :ref:`using Docker <setup-installation-docker>`
is almost as simple:
.. code-block:: shell-session
# Stop and remove your current containers
$ docker-compose stop
$ docker-compose rm -f
# Recreate them, add the superuser
$ docker-compose up -d
$ docker-compose run --rm webserver createsuperuser
# Load the tags
$ cat /path/to/arbitrary/place/tags.json | docker-compose run --rm webserver loaddata_stdin -
# Load your exported documents into the consumption directory
# (How you do this highly depends on how you have set this up)
$ cp /path/to/exported/docs/* /path/to/mounted/consumption/dir/
After loading the documents into the consumption directory the consumer will
immediately start consuming the documents.
.. _migrating-updates: .. _migrating-updates:
@ -20,7 +78,7 @@ on the directory containing the project files, and then use Django's ``migrate``
command to execute any database schema updates that might have been rolled in command to execute any database schema updates that might have been rolled in
as part of the update: as part of the update:
.. code:: bash .. code-block:: shell-session
$ cd /path/to/project $ cd /path/to/project
$ git pull $ git pull
@ -43,112 +101,3 @@ requires only one additional step:
If ``git pull`` doesn't report any changes, there is no need to continue with If ``git pull`` doesn't report any changes, there is no need to continue with
the remaining steps. the remaining steps.
.. _migrating-backup:
Backing Up
----------
So you're bored of this whole project, or you want to make a remote backup of
the unencrypted files for whatever reason. This is easy to do, simply use the
:ref:`exporter <utilities-exporter>` to dump your documents out into an
arbitrary directory.
Additionally however, you'll need to back up the tags themselves. The file
names contain the tag names, but you still need to define the tags and their
matching algorithms in the database for things to work properly. We do this
with Django's ``dumpdata`` command, which produces JSON output.
.. code:: bash
$ cd /path/to/project
$ cd src
$ ./manage.py document_export /path/to/arbitrary/place/
$ ./manage.py dumpdata documents.Tag > /path/to/arbitrary/place/tags.json
If you are :ref:`using Docker <setup-installation-docker>`, exporting your tags
as JSON is almost as easy:
.. code-block:: shell-session
$ docker-compose run --rm webserver dumpdata documents.Tag > /path/to/arbitrary/place/tags.json
To export the documents you can either use ``docker run`` directly, specifying all
the commandline options by hand, or (more simply) mount a second volume for export.
To mount a volume for exports, follow the instructions in the
``docker-compose.yml.example`` file for the ``/export`` volume (making the changes
in your own ``docker-compose.yml`` file, of course). Once you have the
volume mounted, the command to run an export is:
.. code-block:: console
$ docker-compose run --rm consumer document_exporter /export
If you prefer to use ``docker run`` directly, supplying the necessary commandline
options:
.. code-block:: shell-session
$ # Identify your containers
$ docker-compose ps
Name Command State Ports
-------------------------------------------------------------------------
paperless_consumer_1 /sbin/docker-entrypoint.sh ... Exit 0
paperless_webserver_1 /sbin/docker-entrypoint.sh ... Exit 0
$ # Make sure to replace your passphrase and remove or adapt the id mapping
$ docker run --rm \
--volumes-from paperless_data_1 \
--volume /path/to/arbitrary/place:/export \
-e PAPERLESS_PASSPHRASE=YOUR_PASSPHRASE \
-e USERMAP_UID=1000 -e USERMAP_GID=1000 \
paperless document_exporter /export
.. _migrating-restoring:
Restoring
---------
Restoring your data is just as easy, since nearly all of your data exists either
in the file names, or in the contents of the files themselves. You just need to
create an empty database (just follow the
:ref:`installation instructions <setup-installation>` again) and then import the
``tags.json`` file you created as part of your backup. Lastly, copy your
exported documents into the consumption directory and start up the consumer.
.. code:: bash
$ cd /path/to/project
$ rm data/db.sqlite3 # Delete the database
$ cd src
$ ./manage.py migrate # Create the database
$ ./manage.py createsuperuser
$ ./manage.py loaddata /path/to/arbitrary/place/tags.json
$ cp /path/to/exported/docs/* /path/to/consumption/dir/
$ ./manage.py document_consumer
Importing your data if you are :ref:`using Docker <setup-installation-docker>`
is almost as simple:
.. code-block:: shell-session
$ # Stop and remove your current containers
$ docker-compose stop
$ docker-compose rm -f
$ # Recreate them, add the superuser
$ docker-compose up -d
$ docker-compose run --rm webserver createsuperuser
$ # Load the tags
$ cat /path/to/arbitrary/place/tags.json | docker-compose run --rm webserver loaddata_stdin -
$ # Load your exported documents into the consumption directory
$ # (How you do this highly depends on how you have set this up)
$ cp /path/to/exported/docs/* /path/to/mounted/consumption/dir/
After loading the documents into the consumption directory the consumer will
immediately start consuming the documents.


@ -42,12 +42,13 @@ route`_ is quick & easy, but means you're running a VM which comes with memory
consumption etc. We also `support Docker`_, which you can use natively under consumption etc. We also `support Docker`_, which you can use natively under
Linux and in a VM with `Docker Machine`_ (this guide was written for native Linux and in a VM with `Docker Machine`_ (this guide was written for native
Docker usage under Linux, you might have to adapt it for Docker Machine.) Docker usage under Linux, you might have to adapt it for Docker Machine.)
Alternatively the standard, `bare metal`_ approach is a little more complicated. Alternatively the standard, `bare metal`_ approach is a little more complicated,
but worth it because it makes it easier should you want to contribute some
code back.
.. _Vagrant route: setup-installation-vagrant_ .. _Vagrant route: setup-installation-vagrant_
.. _support Docker: setup-installation-docker_ .. _support Docker: setup-installation-docker_
.. _bare metal: setup-installation-standard_ .. _bare metal: setup-installation-standard_
.. _Docker Machine: https://docs.docker.com/machine/ .. _Docker Machine: https://docs.docker.com/machine/
.. _setup-installation-standard: .. _setup-installation-standard:


@ -26,7 +26,7 @@ How to Use It
The webserver is started via the ``manage.py`` script: The webserver is started via the ``manage.py`` script:
.. code:: bash .. code-block:: shell-session
$ /path/to/paperless/src/manage.py runserver $ /path/to/paperless/src/manage.py runserver
@ -64,7 +64,7 @@ How to Use It
The consumer is started via the ``manage.py`` script: The consumer is started via the ``manage.py`` script:
.. code:: bash .. code-block:: shell-session
$ /path/to/paperless/src/manage.py document_consumer $ /path/to/paperless/src/manage.py document_consumer
@ -95,16 +95,86 @@ How to Use It
This too is done via the ``manage.py`` script: This too is done via the ``manage.py`` script:
.. code:: bash .. code-block:: shell-session
$ /path/to/paperless/src/manage.py document_exporter /path/to/somewhere $ /path/to/paperless/src/manage.py document_exporter /path/to/somewhere/
This will dump all of your PDFs into ``/path/to/somewhere`` for you to do with This will dump all of your unencrypted PDFs into ``/path/to/somewhere`` for you
as you please. The naming scheme on export is identical to that used for to do with as you please. The files are accompanied with a special file,
import, so should you can now safely delete the entire project directly, ``manifest.json`` which can be used to
database, encrypted PDFs and all, and later create it all again simply by :ref:`import the files <utilities-importer>` at a later date if you wish.
running the consumer again and dumping all of these files into
``CONSUMPTION_DIR``.
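Judging by the exporter code in this commit, ``manifest.json`` is a JSON list of Django-serialized records, where each document record carries an extra ``__exported_file_name__`` key. The sketch below shows how such a file could be inspected. The sample records and their field values are invented for illustration, and ``exported_files`` is a hypothetical helper, not part of Paperless.

```python
import json

# Hypothetical manifest content; the field values are invented.
SAMPLE_MANIFEST = json.dumps([
    {
        "model": "documents.document",
        "pk": 1,
        "fields": {"title": "Invoice"},
        "__exported_file_name__": "/path/to/somewhere/0000001.pdf",
    },
    {
        "model": "documents.correspondent",
        "pk": 1,
        "fields": {"name": "Amazon"},
    },
])


def exported_files(manifest_text):
    """Map each document pk to the file name recorded in the manifest."""
    records = json.loads(manifest_text)
    return {
        r["pk"]: r["__exported_file_name__"]
        for r in records
        if r["model"] == "documents.document"
    }
```

Only records whose ``model`` is ``documents.document`` point at a file; correspondent and tag records are plain database rows.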
.. _utilities-exporter-howto-docker:
Docker
______
If you are :ref:`using Docker <setup-installation-docker>`, running the
exporter is almost as easy. To mount a volume for exports, follow the
instructions in the ``docker-compose.yml.example`` file for the ``/export``
volume (making the changes in your own ``docker-compose.yml`` file, of course).
Once you have the volume mounted, the command to run an export is:
.. code-block:: shell-session
$ docker-compose run --rm consumer document_exporter /export
If you prefer to use ``docker run`` directly, supplying the necessary commandline
options:
.. code-block:: shell-session
$ # Identify your containers
$ docker-compose ps
Name Command State Ports
-------------------------------------------------------------------------
paperless_consumer_1 /sbin/docker-entrypoint.sh ... Exit 0
paperless_webserver_1 /sbin/docker-entrypoint.sh ... Exit 0
$ # Make sure to replace your passphrase and remove or adapt the id mapping
$ docker run --rm \
--volumes-from paperless_data_1 \
--volume /path/to/arbitrary/place:/export \
-e PAPERLESS_PASSPHRASE=YOUR_PASSPHRASE \
-e USERMAP_UID=1000 -e USERMAP_GID=1000 \
paperless document_exporter /export
.. _utilities-importer:
The Importer
------------
Looking to transfer Paperless data from one instance to another, or just want
to restore from a backup? This is your go-to toy.
.. _utilities-importer-howto:
How to Use It
.............
The importer works just like the exporter. You point it at a directory, and
the script does the rest of the work:
.. code-block:: shell-session
$ /path/to/paperless/src/manage.py document_importer /path/to/somewhere/
Docker
______
Assuming that you've already gone through the steps above in the
:ref:`export <utilities-exporter-howto-docker>` section, then the easiest thing
to do is just re-use the ``/export`` path you already set up:
.. code-block:: shell-session
$ docker-compose run --rm consumer document_importer /export
Similarly, if you're not using docker-compose, you can adjust the export
instructions above to do the import.
.. _utilities-retagger: .. _utilities-retagger:


@ -3,7 +3,7 @@ from django.contrib.auth.models import User, Group
from django.core.urlresolvers import reverse from django.core.urlresolvers import reverse
from django.templatetags.static import static from django.templatetags.static import static
from .models import Sender, Tag, Document, Log from .models import Correspondent, Tag, Document, Log
class MonthListFilter(admin.SimpleListFilter): class MonthListFilter(admin.SimpleListFilter):
@ -45,9 +45,9 @@ class DocumentAdmin(admin.ModelAdmin):
"all": ("paperless.css",) "all": ("paperless.css",)
} }
search_fields = ("sender__name", "title", "content") search_fields = ("correspondent__name", "title", "content")
list_display = ("created_", "sender", "title", "tags_", "document") list_display = ("created_", "correspondent", "title", "tags_", "document")
list_filter = ("tags", "sender", MonthListFilter) list_filter = ("tags", "correspondent", MonthListFilter)
list_per_page = 25 list_per_page = 25
def created_(self, obj): def created_(self, obj):
@ -107,7 +107,7 @@ class LogAdmin(admin.ModelAdmin):
list_filter = ("level", "component",) list_filter = ("level", "component",)
admin.site.register(Sender) admin.site.register(Correspondent)
admin.site.register(Tag, TagAdmin) admin.site.register(Tag, TagAdmin)
admin.site.register(Document, DocumentAdmin) admin.site.register(Document, DocumentAdmin)
admin.site.register(Log, LogAdmin) admin.site.register(Log, LogAdmin)


@ -24,7 +24,7 @@ from pyocr.tesseract import TesseractError
from paperless.db import GnuPG from paperless.db import GnuPG
from .models import Sender, Tag, Document, Log from .models import Correspondent, Tag, Document, Log
from .languages import ISO639 from .languages import ISO639
@ -57,11 +57,11 @@ class Consumer(object):
r"^.*/(.*)\.(pdf|jpe?g|png|gif|tiff)$", r"^.*/(.*)\.(pdf|jpe?g|png|gif|tiff)$",
flags=re.IGNORECASE flags=re.IGNORECASE
) )
REGEX_SENDER_TITLE = re.compile( REGEX_CORRESPONDENT_TITLE = re.compile(
r"^.*/(.+) - (.*)\.(pdf|jpe?g|png|gif|tiff)$", r"^.*/(.+) - (.*)\.(pdf|jpe?g|png|gif|tiff)$",
flags=re.IGNORECASE flags=re.IGNORECASE
) )
REGEX_SENDER_TITLE_TAGS = re.compile( REGEX_CORRESPONDENT_TITLE_TAGS = re.compile(
r"^.*/(.*) - (.*) - ([a-z0-9\-,]*)\.(pdf|jpe?g|png|gif|tiff)$", r"^.*/(.*) - (.*) - ([a-z0-9\-,]*)\.(pdf|jpe?g|png|gif|tiff)$",
flags=re.IGNORECASE flags=re.IGNORECASE
) )
@ -238,16 +238,18 @@ class Consumer(object):
def _guess_attributes_from_name(self, parseable): def _guess_attributes_from_name(self, parseable):
""" """
We use a crude naming convention to make handling the sender, title, We use a crude naming convention to make handling the correspondent,
and tags easier: title, and tags easier:
"<sender> - <title> - <tags>.<suffix>" "<correspondent> - <title> - <tags>.<suffix>"
"<sender> - <title>.<suffix>" "<correspondent> - <title>.<suffix>"
"<title>.<suffix>" "<title>.<suffix>"
""" """
def get_sender(sender_name): def get_correspondent(correspondent_name):
return Sender.objects.get_or_create( return Correspondent.objects.get_or_create(
name=sender_name, defaults={"slug": slugify(sender_name)})[0] name=correspondent_name,
defaults={"slug": slugify(correspondent_name)}
)[0]
def get_tags(tags): def get_tags(tags):
r = [] r = []
@ -262,27 +264,27 @@ class Consumer(object):
return "jpg" return "jpg"
return suffix return suffix
# First attempt: "<sender> - <title> - <tags>.<suffix>" # First attempt: "<correspondent> - <title> - <tags>.<suffix>"
m = re.match(self.REGEX_SENDER_TITLE_TAGS, parseable) m = re.match(self.REGEX_CORRESPONDENT_TITLE_TAGS, parseable)
if m: if m:
return ( return (
get_sender(m.group(1)), get_correspondent(m.group(1)),
m.group(2), m.group(2),
get_tags(m.group(3)), get_tags(m.group(3)),
get_suffix(m.group(4)) get_suffix(m.group(4))
) )
# Second attempt: "<sender> - <title>.<suffix>" # Second attempt: "<correspondent> - <title>.<suffix>"
m = re.match(self.REGEX_SENDER_TITLE, parseable) m = re.match(self.REGEX_CORRESPONDENT_TITLE, parseable)
if m: if m:
return ( return (
get_sender(m.group(1)), get_correspondent(m.group(1)),
m.group(2), m.group(2),
(), (),
get_suffix(m.group(3)) get_suffix(m.group(3))
) )
# That didn't work, so we assume sender and tags are None # That didn't work, so we assume correspondent and tags are None
m = re.match(self.REGEX_TITLE, parseable) m = re.match(self.REGEX_TITLE, parseable)
return None, m.group(1), (), get_suffix(m.group(2)) return None, m.group(1), (), get_suffix(m.group(2))
@ -296,7 +298,7 @@ class Consumer(object):
self.log("debug", "Saving record to database") self.log("debug", "Saving record to database")
document = Document.objects.create( document = Document.objects.create(
sender=sender, correspondent=sender,
title=title, title=title,
content=text, content=text,
file_type=file_type, file_type=file_type,


@ -8,7 +8,7 @@ from time import mktime
from django import forms from django import forms
from django.conf import settings from django.conf import settings
from .models import Document, Sender from .models import Document, Correspondent
from .consumer import Consumer from .consumer import Consumer
@ -23,8 +23,10 @@ class UploadForm(forms.Form):
"image/tiff": Document.TYPE_TIF, "image/tiff": Document.TYPE_TIF,
} }
sender = forms.CharField( correspondent = forms.CharField(
max_length=Sender._meta.get_field("name").max_length, required=False) max_length=Correspondent._meta.get_field("name").max_length,
required=False
)
title = forms.CharField( title = forms.CharField(
max_length=Document._meta.get_field("title").max_length, max_length=Document._meta.get_field("title").max_length,
required=False required=False
@ -32,24 +34,25 @@ class UploadForm(forms.Form):
document = forms.FileField() document = forms.FileField()
signature = forms.CharField(max_length=256) signature = forms.CharField(max_length=256)
def clean_sender(self): def clean_correspondent(self):
""" """
I suppose it might look cleaner to use .get_or_create() here, but that I suppose it might look cleaner to use .get_or_create() here, but that
would also allow someone to fill up the db with bogus senders before would also allow someone to fill up the db with bogus correspondents
all validation was met. before all validation was met.
""" """
sender = self.cleaned_data.get("sender") corresp = self.cleaned_data.get("correspondent")
if not sender: if not corresp:
return None return None
if not Sender.SAFE_REGEX.match(sender) or " - " in sender: if not Correspondent.SAFE_REGEX.match(corresp) or " - " in corresp:
raise forms.ValidationError("That sender name is suspicious.") raise forms.ValidationError(
return sender "That correspondent name is suspicious.")
return corresp
def clean_title(self): def clean_title(self):
title = self.cleaned_data.get("title") title = self.cleaned_data.get("title")
if not title: if not title:
return None return None
if not Sender.SAFE_REGEX.match(title) or " - " in title: if not Correspondent.SAFE_REGEX.match(title) or " - " in title:
raise forms.ValidationError("That title is suspicious.") raise forms.ValidationError("That title is suspicious.")
def clean_document(self): def clean_document(self):
@ -61,10 +64,10 @@ class UploadForm(forms.Form):
return document, self.TYPE_LOOKUP[file_type] return document, self.TYPE_LOOKUP[file_type]
def clean(self): def clean(self):
sender = self.cleaned_data.get("sender") corresp = self.cleaned_data.get("correspondent")
title = self.cleaned_data.get("title") title = self.cleaned_data.get("title")
signature = self.cleaned_data.get("signature") signature = self.cleaned_data.get("signature")
if sha256(sender + title + self.SECRET).hexdigest() == signature: if sha256(corresp + title + self.SECRET).hexdigest() == signature:
return True return True
return False return False
@ -75,13 +78,15 @@ class UploadForm(forms.Form):
form do that as well. Think of it as a poor-man's queue server. form do that as well. Think of it as a poor-man's queue server.
""" """
sender = self.cleaned_data.get("sender") correspondent = self.cleaned_data.get("correspondent")
title = self.cleaned_data.get("title") title = self.cleaned_data.get("title")
document, file_type = self.cleaned_data.get("document") document, file_type = self.cleaned_data.get("document")
t = int(mktime(datetime.now())) t = int(mktime(datetime.now()))
file_name = os.path.join( file_name = os.path.join(
Consumer.CONSUME, "{} - {}.{}".format(sender, title, file_type)) Consumer.CONSUME,
"{} - {}.{}".format(correspondent, title, file_type)
)
with open(file_name, "wb") as f: with open(file_name, "wb") as f:
f.write(document) f.write(document)


@ -14,7 +14,7 @@ from dateutil import parser
from django.conf import settings from django.conf import settings
from .consumer import Consumer from .consumer import Consumer
from .models import Sender, Log from .models import Correspondent, Log
class MailFetcherError(Exception): class MailFetcherError(Exception):
@ -103,7 +103,7 @@ class Message(Loggable):
def check_subject(self): def check_subject(self):
if self.subject is None: if self.subject is None:
raise InvalidMessageError("Message does not have a subject") raise InvalidMessageError("Message does not have a subject")
if not Sender.SAFE_REGEX.match(self.subject): if not Correspondent.SAFE_REGEX.match(self.subject):
raise InvalidMessageError("Message subject is unsafe: {}".format( raise InvalidMessageError("Message subject is unsafe: {}".format(
self.subject)) self.subject))


@ -1,10 +1,12 @@
import json
import os import os
import time import time
from django.conf import settings from django.conf import settings
from django.core.management.base import BaseCommand, CommandError from django.core.management.base import BaseCommand, CommandError
from django.core import serializers
from documents.models import Document from documents.models import Document, Correspondent, Tag
from paperless.db import GnuPG from paperless.db import GnuPG
from ...mixins import Renderable from ...mixins import Renderable
@ -14,21 +16,26 @@ class Command(Renderable, BaseCommand):
help = """ help = """
Decrypt and rename all files in our collection into a given target Decrypt and rename all files in our collection into a given target
directory. Note that we don't export any of the parsed data since directory. And include a manifest file containing document data for
that can always be re-collected via the consumer. easy import.
""".replace(" ", "") """.replace(" ", "")
def add_arguments(self, parser): def add_arguments(self, parser):
parser.add_argument("target") parser.add_argument("target")
parser.add_argument(
"--legacy",
action="store_true",
help="Don't try to export all of the document data, just dump the "
"original document files out in a format that makes "
"re-consuming them easy."
)
def __init__(self, *args, **kwargs): def __init__(self, *args, **kwargs):
self.verbosity = 0
self.target = None
BaseCommand.__init__(self, *args, **kwargs) BaseCommand.__init__(self, *args, **kwargs)
self.target = None
def handle(self, *args, **options): def handle(self, *args, **options):
self.verbosity = options["verbosity"]
self.target = options["target"] self.target = options["target"]
if not os.path.exists(self.target): if not os.path.exists(self.target):
@ -40,9 +47,22 @@ class Command(Renderable, BaseCommand):
if not settings.PASSPHRASE: if not settings.PASSPHRASE:
settings.PASSPHRASE = input("Please enter the passphrase: ") settings.PASSPHRASE = input("Please enter the passphrase: ")
for document in Document.objects.all(): if options["legacy"]:
self.dump_legacy()
else:
self.dump()
def dump(self):
documents = Document.objects.all()
document_map = {d.pk: d for d in documents}
manifest = json.loads(serializers.serialize("json", documents))
for document_dict in manifest:
document = document_map[document_dict["pk"]]
target = os.path.join(self.target, document.file_name) target = os.path.join(self.target, document.file_name)
document_dict["__exported_file_name__"] = target
print("Exporting: {}".format(target)) print("Exporting: {}".format(target))
@ -50,3 +70,37 @@ class Command(Renderable, BaseCommand):
f.write(GnuPG.decrypted(document.source_file)) f.write(GnuPG.decrypted(document.source_file))
t = int(time.mktime(document.created.timetuple())) t = int(time.mktime(document.created.timetuple()))
os.utime(target, times=(t, t)) os.utime(target, times=(t, t))
manifest += json.loads(
serializers.serialize("json", Correspondent.objects.all()))
manifest += json.loads(serializers.serialize(
"json", Tag.objects.all()))
with open(os.path.join(self.target, "manifest.json"), "w") as f:
json.dump(manifest, f, indent=2)
def dump_legacy(self):
for document in Document.objects.all():
target = os.path.join(
self.target, self._get_legacy_file_name(document))
print("Exporting: {}".format(target))
with open(target, "wb") as f:
f.write(GnuPG.decrypted(document.source_file))
t = int(time.mktime(document.created.timetuple()))
os.utime(target, times=(t, t))
@staticmethod
def _get_legacy_file_name(doc):
if doc.correspondent and doc.title:
tags = ",".join([t.slug for t in doc.tags.all()])
if tags:
return "{} - {} - {}.{}".format(
doc.correspondent, doc.title, tags, doc.file_type)
return "{} - {}.{}".format(
doc.correspondent, doc.title, doc.file_type)
return os.path.basename(doc.source_path)


@@ -0,0 +1,99 @@
import json
import os

from django.conf import settings
from django.core.management.base import BaseCommand, CommandError
from django.core.management import call_command

from documents.models import Document
from paperless.db import GnuPG

from ...mixins import Renderable


class Command(Renderable, BaseCommand):

    help = """
        Using a manifest.json file, load the data from there, and import the
        documents it refers to.
    """.replace("    ", "")

    def add_arguments(self, parser):
        parser.add_argument("source")

    def __init__(self, *args, **kwargs):
        BaseCommand.__init__(self, *args, **kwargs)
        self.source = None
        self.manifest = None

    def handle(self, *args, **options):

        self.source = options["source"]

        if not os.path.exists(self.source):
            raise CommandError("That path doesn't exist")

        if not os.access(self.source, os.R_OK):
            raise CommandError("That path doesn't appear to be readable")

        manifest_path = os.path.join(self.source, "manifest.json")
        self._check_manifest_exists(manifest_path)

        with open(manifest_path) as f:
            self.manifest = json.load(f)

        self._check_manifest()

        if not settings.PASSPHRASE:
            raise CommandError(
                "You need to define a passphrase before continuing. Please "
                "consult the documentation for setting up Paperless."
            )

        # Fill up the database with whatever is in the manifest
        call_command("loaddata", manifest_path)

        self._import_files_from_manifest()

    @staticmethod
    def _check_manifest_exists(path):
        if not os.path.exists(path):
            raise CommandError(
                "That directory doesn't appear to contain a manifest.json "
                "file."
            )

    def _check_manifest(self):

        for record in self.manifest:

            if not record["model"] == "documents.document":
                continue

            if "__exported_file_name__" not in record:
                raise CommandError(
                    'The manifest file contains a record which does not '
                    'refer to an actual document file.'
                )

            doc_file = record["__exported_file_name__"]
            if not os.path.exists(os.path.join(self.source, doc_file)):
                raise CommandError(
                    'The manifest file refers to "{}" which does not '
                    'appear to be in the source directory.'.format(doc_file)
                )

    def _import_files_from_manifest(self):

        for record in self.manifest:

            if not record["model"] == "documents.document":
                continue

            # Resolve the exported file against the source directory, the
            # same way _check_manifest() does, rather than against the
            # current working directory.
            doc_file = os.path.join(
                self.source, record["__exported_file_name__"])
            document = Document.objects.get(pk=record["pk"])
            with open(doc_file, "rb") as unencrypted:
                with open(document.source_path, "wb") as encrypted:
                    print("Encrypting {} and saving it to {}".format(
                        doc_file, document.source_path))
                    encrypted.write(GnuPG.encrypted(unencrypted))
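
The validation performed by `_check_manifest` can be exercised outside Django. The sketch below is an assumption-laden stand-in: `check_manifest` is a hypothetical free function, and `ValueError` replaces `CommandError` so the example has no Django dependency.

```python
import os
import tempfile


def check_manifest(manifest, source):
    """Raise ValueError if a document record lacks a usable exported file.

    Hypothetical stand-in for Command._check_manifest; ValueError is used in
    place of Django's CommandError.
    """
    for record in manifest:
        # Only document records carry an exported file; skip tags and so on.
        if record["model"] != "documents.document":
            continue
        if "__exported_file_name__" not in record:
            raise ValueError("record does not refer to a document file")
        doc_file = record["__exported_file_name__"]
        if not os.path.exists(os.path.join(source, doc_file)):
            raise ValueError(
                '"{}" is not in the source directory'.format(doc_file))


with tempfile.TemporaryDirectory() as source:
    with open(os.path.join(source, "a.pdf"), "wb") as f:
        f.write(b"%PDF-")
    manifest = [
        {"model": "documents.tag", "pk": 1},
        {"model": "documents.document", "pk": 1,
         "__exported_file_name__": "a.pdf"},
    ]
    check_manifest(manifest, source)  # passes silently
    manifest[1]["__exported_file_name__"] = "missing.pdf"
    try:
        check_manifest(manifest, source)
    except ValueError as error:
        print("Rejected:", error)
```

Running the whole check before `loaddata` means a broken export is rejected before anything is written to the database.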

View File

@@ -0,0 +1,28 @@
# -*- coding: utf-8 -*-
# Generated by Django 1.9.2 on 2016-03-03 19:29
from __future__ import unicode_literals

from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        ('documents', '0010_log'),
    ]

    operations = [
        migrations.RenameModel(
            old_name='Sender',
            new_name='Correspondent',
        ),
        migrations.AlterModelOptions(
            name='document',
            options={'ordering': ('correspondent', 'title')},
        ),
        migrations.RenameField(
            model_name='document',
            old_name='sender',
            new_name='correspondent',
        ),
    ]

View File

@@ -28,7 +28,7 @@ class SluggedModel(models.Model):
         return self.name


-class Sender(SluggedModel):
+class Correspondent(SluggedModel):

     # This regex is probably more restrictive than it needs to be, but it's
     # better safe than sorry.
@@ -140,8 +140,8 @@ class Document(models.Model):
     TYPE_TIF = "tiff"
     TYPES = (TYPE_PDF, TYPE_PNG, TYPE_JPG, TYPE_GIF, TYPE_TIF,)

-    sender = models.ForeignKey(
-        Sender, blank=True, null=True, related_name="documents")
+    correspondent = models.ForeignKey(
+        Correspondent, blank=True, null=True, related_name="documents")
     title = models.CharField(max_length=128, blank=True, db_index=True)
     content = models.TextField(db_index=True)
     file_type = models.CharField(
@@ -155,14 +155,15 @@ class Document(models.Model):
     modified = models.DateTimeField(auto_now=True, editable=False)

     class Meta(object):
-        ordering = ("sender", "title")
+        ordering = ("correspondent", "title")

     def __str__(self):
-        created = self.created.strftime("%Y-%m-%d")
-        if self.sender and self.title:
-            return "{}: {}, {}".format(created, self.sender, self.title)
-        if self.sender or self.title:
-            return "{}: {}".format(created, self.sender or self.title)
+        created = self.created.strftime("%Y%m%d%H%M%S")
+        if self.correspondent and self.title:
+            return "{}: {} - {}".format(
+                created, self.correspondent, self.title)
+        if self.correspondent or self.title:
+            return "{}: {}".format(created, self.correspondent or self.title)
         return str(created)

     @property
@@ -179,13 +180,7 @@ class Document(models.Model):

     @property
     def file_name(self):
-        if self.sender and self.title:
-            tags = ",".join([t.slug for t in self.tags.all()])
-            if tags:
-                return "{} - {} - {}.{}".format(
-                    self.sender, self.title, tags, self.file_type)
-            return "{} - {}.{}".format(self.sender, self.title, self.file_type)
-        return os.path.basename(self.source_path)
+        return slugify(str(self)) + "." + self.file_type

     @property
     def download_url(self):
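
To see what the new `file_name` property produces, the `__str__` and `slugify` steps can be approximated in plain Python. This is only a sketch: `simple_slugify` is a hypothetical, simplified imitation of Django's `slugify` filter, and `file_name` here is a free function taking plain values rather than a model property.

```python
import re
import unicodedata
from datetime import datetime


def simple_slugify(value):
    """Approximate Django's slugify (an assumption, not the real filter)."""
    value = unicodedata.normalize("NFKD", value)
    value = value.encode("ascii", "ignore").decode("ascii")
    # Drop everything except word characters, whitespace and hyphens.
    value = re.sub(r"[^\w\s-]", "", value).strip().lower()
    # Collapse runs of whitespace and hyphens into a single hyphen.
    return re.sub(r"[-\s]+", "-", value)


def file_name(created, correspondent, title, file_type):
    """Mirror Document.__str__ followed by slugify(...) + "." + file_type."""
    stamp = created.strftime("%Y%m%d%H%M%S")
    if correspondent and title:
        label = "{}: {} - {}".format(stamp, correspondent, title)
    elif correspondent or title:
        label = "{}: {}".format(stamp, correspondent or title)
    else:
        label = stamp
    return simple_slugify(label) + "." + file_type


print(file_name(datetime(2016, 3, 4, 23, 32, 48), "Amazon", "Invoice #12", "pdf"))
# prints "20160304233248-amazon-invoice-12.pdf"
```

Because the timestamp now goes down to the second, two documents from the same correspondent with the same title are much less likely to collide on export than with the old `%Y-%m-%d` format.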

View File

@@ -1,12 +1,12 @@
 from rest_framework import serializers

-from .models import Sender, Tag, Document, Log
+from .models import Correspondent, Tag, Document, Log


-class SenderSerializer(serializers.HyperlinkedModelSerializer):
+class CorrespondentSerializer(serializers.HyperlinkedModelSerializer):

     class Meta(object):
-        model = Sender
+        model = Correspondent
         fields = ("id", "slug", "name")
@@ -20,8 +20,8 @@ class TagSerializer(serializers.HyperlinkedModelSerializer):

 class DocumentSerializer(serializers.ModelSerializer):

-    sender = serializers.HyperlinkedRelatedField(
-        read_only=True, view_name="drf:sender-detail", allow_null=True)
+    correspondent = serializers.HyperlinkedRelatedField(
+        read_only=True, view_name="drf:correspondent-detail", allow_null=True)
     tags = serializers.HyperlinkedRelatedField(
         read_only=True, view_name="drf:tag-detail", many=True)
@@ -29,7 +29,7 @@ class DocumentSerializer(serializers.ModelSerializer):
         model = Document
         fields = (
             "id",
-            "sender",
+            "correspondent",
             "title",
             "content",
             "file_type",

View File

@@ -0,0 +1,36 @@
from django.core.management.base import CommandError
from django.test import TestCase

from ..management.commands.document_importer import Command


class TestImporter(TestCase):

    def __init__(self, *args, **kwargs):
        TestCase.__init__(self, *args, **kwargs)

    def test_check_manifest_exists(self):
        cmd = Command()
        self.assertRaises(
            CommandError, cmd._check_manifest_exists, "/tmp/manifest.json")

    def test_check_manifest(self):

        cmd = Command()
        cmd.source = "/tmp"

        cmd.manifest = [{"model": "documents.document"}]
        with self.assertRaises(CommandError) as cm:
            cmd._check_manifest()
        self.assertTrue(
            'The manifest file contains a record' in str(cm.exception))

        cmd.manifest = [{
            "model": "documents.document",
            "__exported_file_name__": "noexist.pdf"
        }]
        # self.assertRaises(CommandError, cmd._check_manifest)
        with self.assertRaises(CommandError) as cm:
            cmd._check_manifest()
        self.assertTrue(
            'The manifest file refers to "noexist.pdf"' in str(cm.exception))

View File

@@ -1,6 +1,5 @@
 from django.contrib.auth.mixins import LoginRequiredMixin
 from django.http import HttpResponse
-from django.template.defaultfilters import slugify
 from django.views.decorators.csrf import csrf_exempt
 from django.views.generic import FormView, DetailView, TemplateView
@@ -14,9 +13,9 @@ from rest_framework.viewsets import (
 from paperless.db import GnuPG

 from .forms import UploadForm
-from .models import Sender, Tag, Document, Log
+from .models import Correspondent, Tag, Document, Log
 from .serialisers import (
-    SenderSerializer, TagSerializer, DocumentSerializer, LogSerializer)
+    CorrespondentSerializer, TagSerializer, DocumentSerializer, LogSerializer)


 class IndexView(TemplateView):
@@ -52,7 +51,7 @@ class FetchView(LoginRequiredMixin, DetailView):
             content_type=content_types[self.object.file_type]
         )
         response["Content-Disposition"] = 'attachment; filename="{}"'.format(
-            slugify(str(self.object)) + "." + self.object.file_type)
+            self.object.file_name)

         return response
@@ -81,10 +80,10 @@ class StandardPagination(PageNumberPagination):
     max_page_size = 100000


-class SenderViewSet(ModelViewSet):
-    model = Sender
-    queryset = Sender.objects.all()
-    serializer_class = SenderSerializer
+class CorrespondentViewSet(ModelViewSet):
+    model = Correspondent
+    queryset = Correspondent.objects.all()
+    serializer_class = CorrespondentSerializer
     pagination_class = StandardPagination
     permission_classes = (IsAuthenticated,)

View File

@@ -22,11 +22,11 @@ from rest_framework.routers import DefaultRouter

 from documents.views import (
     IndexView, FetchView, PushView,
-    SenderViewSet, TagViewSet, DocumentViewSet, LogViewSet
+    CorrespondentViewSet, TagViewSet, DocumentViewSet, LogViewSet
 )

 router = DefaultRouter()
-router.register(r'senders', SenderViewSet)
+router.register(r'senders', CorrespondentViewSet)
 router.register(r'tags', TagViewSet)
 router.register(r'documents', DocumentViewSet)
 router.register(r'logs', LogViewSet)