mirror of
https://github.com/paperless-ngx/paperless-ngx.git
synced 2025-04-17 10:13:56 -05:00
Merge branch 'master' into issue/81
commit
49b56425e8
@ -24,8 +24,11 @@ How it Works
|
|||||||
|
|
||||||
1. Buy a document scanner like `this one`_.
|
1. Buy a document scanner like `this one`_.
|
||||||
2. Set it up to "scan to FTP" or something similar. It should be able to push
|
2. Set it up to "scan to FTP" or something similar. It should be able to push
|
||||||
scanned images to a server without you having to do anything.
|
scanned images to a server without you having to do anything. If your
|
||||||
3. Have the target server run the *Paperless* consumption script to OCR the PDF
|
scanner doesn't know how to automatically upload the file somewhere, you can
|
||||||
|
always do that manually. Paperless doesn't care how the documents get into
|
||||||
|
its local consumption directory.
|
||||||
|
3. Have the target server run the Paperless consumption script to OCR the PDF
|
||||||
and index it into a local database.
|
and index it into a local database.
|
||||||
4. Use the web frontend to sift through the database and find what you want.
|
4. Use the web frontend to sift through the database and find what you want.
|
||||||
5. Download the PDF you need/want via the web interface and do whatever you
|
5. Download the PDF you need/want via the web interface and do whatever you
|
||||||
@ -56,7 +59,7 @@ powerful tools.
|
|||||||
|
|
||||||
* `ImageMagick`_ converts the images between colour and greyscale.
|
* `ImageMagick`_ converts the images between colour and greyscale.
|
||||||
* `Tesseract`_ does the character recognition.
|
* `Tesseract`_ does the character recognition.
|
||||||
* `Unpaper`_ despeckles and and deskews the scanned image.
|
* `Unpaper`_ despeckles and deskews the scanned image.
|
||||||
* `GNU Privacy Guard`_ is used as the encryption backend.
|
* `GNU Privacy Guard`_ is used as the encryption backend.
|
||||||
* `Python 3`_ is the language of the project.
|
* `Python 3`_ is the language of the project.
|
||||||
|
|
||||||
|
@ -11,6 +11,10 @@ services:
|
|||||||
- data:/usr/src/paperless/data
|
- data:/usr/src/paperless/data
|
||||||
- media:/usr/src/paperless/media
|
- media:/usr/src/paperless/media
|
||||||
env_file: docker-compose.env
|
env_file: docker-compose.env
|
||||||
|
# The reason the line is here is so that the webserver that doesn't do
|
||||||
|
# any text recognition and doesn't have to install unnecessary
|
||||||
|
# languages the user might have set in the env-file by overwriting the
|
||||||
|
# value with nothing.
|
||||||
environment:
|
environment:
|
||||||
- PAPERLESS_OCR_LANGUAGES=
|
- PAPERLESS_OCR_LANGUAGES=
|
||||||
command: ["runserver", "0.0.0.0:8000"]
|
command: ["runserver", "0.0.0.0:8000"]
|
||||||
|
@ -1,6 +1,15 @@
|
|||||||
Changelog
|
Changelog
|
||||||
#########
|
#########
|
||||||
|
|
||||||
|
* 0.2.0
|
||||||
|
|
||||||
|
* Added support for guessing the date from the file name along with the
|
||||||
|
correspondent, title, and tags. Thanks to `Tikitu de Jager`_ for his pull
|
||||||
|
request that I took forever to merge and to `Pit`_ for his efforts on the
|
||||||
|
regex front.
|
||||||
|
* `#94`_: Restored support for changing the created date in the UI. Thanks
|
||||||
|
to `Martin Honermeyer`_ and `Tim White`_ for working with me on this.
|
||||||
|
|
||||||
* 0.1.1
|
* 0.1.1
|
||||||
|
|
||||||
* Potentially **Breaking Change**: All references to "sender" in the code
|
* Potentially **Breaking Change**: All references to "sender" in the code
|
||||||
@ -86,6 +95,8 @@ Changelog
|
|||||||
.. _Wayne Werner: https://github.com/waynew
|
.. _Wayne Werner: https://github.com/waynew
|
||||||
.. _darkmatter: https://github.com/darkmatter
|
.. _darkmatter: https://github.com/darkmatter
|
||||||
.. _zedster: https://github.com/zedster
|
.. _zedster: https://github.com/zedster
|
||||||
|
.. _Martin Honermeyer: https://github.com/djmaze
|
||||||
|
.. _Tim White: https://github.com/timwhite
|
||||||
|
|
||||||
.. _#20: https://github.com/danielquinn/paperless/issues/20
|
.. _#20: https://github.com/danielquinn/paperless/issues/20
|
||||||
.. _#44: https://github.com/danielquinn/paperless/issues/44
|
.. _#44: https://github.com/danielquinn/paperless/issues/44
|
||||||
@ -99,3 +110,4 @@ Changelog
|
|||||||
.. _#67: https://github.com/danielquinn/paperless/issues/67
|
.. _#67: https://github.com/danielquinn/paperless/issues/67
|
||||||
.. _#68: https://github.com/danielquinn/paperless/issues/68
|
.. _#68: https://github.com/danielquinn/paperless/issues/68
|
||||||
.. _#71: https://github.com/danielquinn/paperless/issues/71
|
.. _#71: https://github.com/danielquinn/paperless/issues/71
|
||||||
|
.. _#94: https://github.com/danielquinn/paperless/issues/94
|
||||||
|
@ -45,19 +45,27 @@ you name the file right, it'll automatically set some values in the database
|
|||||||
for you. This is the logic the consumer follows:
|
for you. This is the logic the consumer follows:
|
||||||
|
|
||||||
1. Try to find the correspondent, title, and tags in the file name following
|
1. Try to find the correspondent, title, and tags in the file name following
|
||||||
|
the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``. Note that
|
||||||
|
the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or
|
||||||
|
``YYYYMMDDZ``. The ``Z`` is for "Zulu time" AKA "UTC".
|
||||||
|
2. If that doesn't work, we skip the date and try this pattern:
|
||||||
the pattern: ``Correspondent - Title - tag,tag,tag.pdf``.
|
the pattern: ``Correspondent - Title - tag,tag,tag.pdf``.
|
||||||
2. If that doesn't work, try to find the correspondent and title in the file
|
3. If that doesn't work, we try to find the correspondent and title in the file
|
||||||
name following the pattern: ``Correspondent - Title.pdf``.
|
name following the pattern: ``Correspondent - Title.pdf``.
|
||||||
3. If that doesn't work, just assume that the name of the file is the title.
|
4. If that doesn't work, just assume that the name of the file is the title.
|
||||||
|
|
||||||
So given the above, the following examples would work as you'd expect:
|
So given the above, the following examples would work as you'd expect:
|
||||||
|
|
||||||
|
* ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
|
||||||
|
* ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
|
||||||
* ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
|
* ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf``
|
||||||
* ``Another Company - Letter of Reference.jpg``
|
* ``Another Company - Letter of Reference.jpg``
|
||||||
* ``Dad's Recipe for Pancakes.png``
|
* ``Dad's Recipe for Pancakes.png``
|
||||||
|
|
||||||
These however wouldn't work:
|
These however wouldn't work:
|
||||||
|
|
||||||
|
* ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
|
||||||
|
* ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
|
||||||
* ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
|
* ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf``
|
||||||
* ``Another Company- Letter of Reference.jpg``
|
* ``Another Company- Letter of Reference.jpg``
|
||||||
|
|
||||||
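The matching order described in this hunk can be sketched in Python. This is a minimal illustration under stated assumptions, not the consumer's actual code: the names ``guess_attributes`` and ``parse_date`` are hypothetical, and splitting on `` - `` stands in for whatever regular expressions the consumer really uses.

```python
import os
import re
from datetime import datetime, timezone

# The date is rigidly YYYYMMDDHHMMSSZ or YYYYMMDDZ ("Z" meaning UTC).
DATE_RE = re.compile(r"^(\d{14}|\d{8})Z$")


def parse_date(text):
    """Return an aware UTC datetime for a valid date prefix, else None."""
    if not DATE_RE.match(text):
        return None
    fmt = "%Y%m%d%H%M%S" if len(text) == 15 else "%Y%m%d"
    try:
        return datetime.strptime(text[:-1], fmt).replace(tzinfo=timezone.utc)
    except ValueError:  # e.g. month 13
        return None


def guess_attributes(file_name):
    """Hypothetical helper: apply the four rules in order."""
    stem, _extension = os.path.splitext(file_name)
    parts = stem.split(" - ")
    # Rule 4 is the fallback: the whole name is the title.
    result = {"created": None, "correspondent": None, "title": stem, "tags": []}

    if len(parts) == 4:  # Rule 1: Date - Correspondent - Title - tags
        created = parse_date(parts[0])
        if created is not None:
            result.update(created=created, correspondent=parts[1],
                          title=parts[2], tags=parts[3].split(","))
    elif len(parts) == 3:  # Rule 2: Correspondent - Title - tags
        result.update(correspondent=parts[0], title=parts[1],
                      tags=parts[2].split(","))
    elif len(parts) == 2:  # Rule 3: Correspondent - Title
        result.update(correspondent=parts[0], title=parts[1])
    return result
```

Because the separator is a hyphen with a space on each side, a title such as ``Invoice 2016-01-01`` survives intact, while ``Another Company- Letter of Reference.jpg`` falls through to the title-only rule.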
@ -128,7 +136,7 @@ following name/value pairs:
|
|||||||
don't start uploading stuff to your server. The means of generating this
|
don't start uploading stuff to your server. The means of generating this
|
||||||
signature is defined below.
|
signature is defined below.
|
||||||
|
|
||||||
Specify ``enctype="multipart/form-data"``, and then POST your file with:::
|
Specify ``enctype="multipart/form-data"``, and then POST your file with::
|
||||||
|
|
||||||
Content-Disposition: form-data; name="document"; filename="whatever.pdf"
|
Content-Disposition: form-data; name="document"; filename="whatever.pdf"
|
||||||
|
|
||||||
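For readers assembling the request by hand, the ``document`` part shown in this hunk could be built roughly as follows. This is only a sketch: ``build_document_part`` is a hypothetical helper, the ``application/octet-stream`` content type is an assumption, and the other name/value pairs the API expects (including the signature) are deliberately left out.

```python
import uuid


def build_document_part(file_name, payload, boundary=None):
    """Hypothetical helper: wrap raw file bytes in a multipart/form-data body.

    Only the "document" field is included; the remaining form fields
    described in the API docs are not reproduced here.
    """
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="document"; filename="{file_name}"\r\n'
        "Content-Type: application/octet-stream\r\n"  # assumed content type
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    content_type = f"multipart/form-data; boundary={boundary}"
    return content_type, head + payload + tail
```

The returned content type goes into the request's ``Content-Type`` header and the bytes form the POST body.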
|
@ -33,4 +33,5 @@ Contents
|
|||||||
api
|
api
|
||||||
utilities
|
utilities
|
||||||
migrating
|
migrating
|
||||||
|
troubleshooting
|
||||||
changelog
|
changelog
|
||||||
|
@ -8,7 +8,7 @@ should work) that has the following software installed on it:
|
|||||||
|
|
||||||
* `Python3`_ (with development libraries, pip and virtualenv)
|
* `Python3`_ (with development libraries, pip and virtualenv)
|
||||||
* `GNU Privacy Guard`_
|
* `GNU Privacy Guard`_
|
||||||
* `Tesseract`_
|
* `Tesseract`_, plus its language files matching your document base.
|
||||||
* `Imagemagick`_
|
* `Imagemagick`_
|
||||||
* `unpaper`_
|
* `unpaper`_
|
||||||
|
|
||||||
@ -52,6 +52,7 @@ well as ImageMagick:
|
|||||||
|
|
||||||
$ brew install ghostscript
|
$ brew install ghostscript
|
||||||
$ brew install imagemagick
|
$ brew install imagemagick
|
||||||
|
$ brew install libmagic
|
||||||
|
|
||||||
|
|
||||||
.. _requirements-baremetal:
|
.. _requirements-baremetal:
|
||||||
|
207
docs/setup.rst
@ -5,7 +5,8 @@ Setup
|
|||||||
|
|
||||||
Paperless isn't a very complicated app, but there are a few components, so some
|
Paperless isn't a very complicated app, but there are a few components, so some
|
||||||
basic documentation is in order. If you follow along in this document and
|
basic documentation is in order. If you follow along in this document and
|
||||||
still have trouble, please open an `issue on GitHub`_ so I can fill in the gaps.
|
still have trouble, please open an `issue on GitHub`_ so I can fill in the
|
||||||
|
gaps.
|
||||||
|
|
||||||
.. _issue on GitHub: https://github.com/danielquinn/paperless/issues
|
.. _issue on GitHub: https://github.com/danielquinn/paperless/issues
|
||||||
|
|
||||||
@ -15,8 +16,8 @@ still have trouble, please open an `issue on GitHub`_ so I can fill in the gaps.
|
|||||||
Download
|
Download
|
||||||
--------
|
--------
|
||||||
|
|
||||||
The source is currently only available via GitHub, so grab it from there, either
|
The source is currently only available via GitHub, so grab it from there,
|
||||||
by using ``git``:
|
either by using ``git``:
|
||||||
|
|
||||||
.. code:: bash
|
.. code:: bash
|
||||||
|
|
||||||
@ -42,15 +43,16 @@ route`_ is quick & easy, but means you're running a VM which comes with memory
|
|||||||
consumption etc. We also `support Docker`_, which you can use natively under
|
consumption etc. We also `support Docker`_, which you can use natively under
|
||||||
Linux and in a VM with `Docker Machine`_ (this guide was written for native
|
Linux and in a VM with `Docker Machine`_ (this guide was written for native
|
||||||
Docker usage under Linux, you might have to adapt it for Docker Machine.)
|
Docker usage under Linux, you might have to adapt it for Docker Machine.)
|
||||||
Alternatively the standard, `bare metal`_ approach is a little more complicated,
|
Alternatively the standard, `bare metal`_ approach is a little more
|
||||||
but worth it because it makes it easier should you want to contribute some
|
complicated, but worth it because it makes it easier should you want to
|
||||||
code back.
|
contribute some code back.
|
||||||
|
|
||||||
.. _Vagrant route: setup-installation-vagrant_
|
.. _Vagrant route: setup-installation-vagrant_
|
||||||
.. _support Docker: setup-installation-docker_
|
.. _support Docker: setup-installation-docker_
|
||||||
.. _bare metal: setup-installation-standard_
|
.. _bare metal: setup-installation-standard_
|
||||||
.. _Docker Machine: https://docs.docker.com/machine/
|
.. _Docker Machine: https://docs.docker.com/machine/
|
||||||
|
|
||||||
|
|
||||||
.. _setup-installation-standard:
|
.. _setup-installation-standard:
|
||||||
|
|
||||||
Standard (Bare Metal)
|
Standard (Bare Metal)
|
||||||
@ -58,19 +60,16 @@ Standard (Bare Metal)
|
|||||||
|
|
||||||
1. Install the requirements as per the :ref:`requirements <requirements>` page.
|
1. Install the requirements as per the :ref:`requirements <requirements>` page.
|
||||||
2. Change to the ``src`` directory in this repo.
|
2. Change to the ``src`` directory in this repo.
|
||||||
3. Edit ``paperless/settings.py`` and be sure to set the values for:
|
3. Copy ``paperless.conf.example`` to ``/etc/paperless.conf`` and open it in
|
||||||
* ``CONSUMPTION_DIR``: this is where your documents will be dumped to be
|
your favourite editor. Set the values for:
|
||||||
consumed by Paperless.
|
|
||||||
* ``PASSPHRASE``: this is the passphrase Paperless uses to encrypt/decrypt
|
* ``PAPERLESS_CONSUMPTION_DIR``: this is where your documents will be
|
||||||
the original document. The default value attempts to source the
|
dumped to be consumed by Paperless.
|
||||||
passphrase from the environment, so if you don't set it to a static value
|
* ``PAPERLESS_PASSPHRASE``: this is the passphrase Paperless uses to
|
||||||
here, you must set ``PAPERLESS_PASSPHRASE=some-secret-string`` on the
|
encrypt/decrypt the original document.
|
||||||
command line whenever invoking the consumer or webserver.
|
* ``PAPERLESS_OCR_THREADS``: this is the number of threads the OCR process
|
||||||
* ``OCR_THREADS``: this is the number of threads the OCR process will spawn
|
will spawn to process document pages in parallel.
|
||||||
to process document pages in parallel. The default value gets sourced from
|
|
||||||
the environment-variable ``PAPERLESS_OCR_THREADS`` and expects it to be an
|
|
||||||
integer. If the variable is not set, Python determines the core-count of
|
|
||||||
your CPU and uses that value.
|
|
||||||
4. Initialise the database with ``./manage.py migrate``.
|
4. Initialise the database with ``./manage.py migrate``.
|
||||||
5. Create a user for your Paperless instance with
|
5. Create a user for your Paperless instance with
|
||||||
``./manage.py createsuperuser``. Follow the prompts to create your user.
|
``./manage.py createsuperuser``. Follow the prompts to create your user.
|
||||||
@ -79,8 +78,8 @@ Standard (Bare Metal)
|
|||||||
You should now be able to visit your (empty) `Paperless webserver`_ at
|
You should now be able to visit your (empty) `Paperless webserver`_ at
|
||||||
``127.0.0.1:8000`` (or whatever you chose). You can log in with the
|
``127.0.0.1:8000`` (or whatever you chose). You can log in with the
|
||||||
user/pass you created in #5.
|
user/pass you created in #5.
|
||||||
7. In a separate window, change to the ``src`` directory in this repo again, but
|
7. In a separate window, change to the ``src`` directory in this repo again,
|
||||||
this time, you should start the consumer script with
|
but this time, you should start the consumer script with
|
||||||
``./manage.py document_consumer``.
|
``./manage.py document_consumer``.
|
||||||
8. Scan something. Put it in the ``CONSUMPTION_DIR``.
|
8. Scan something. Put it in the ``CONSUMPTION_DIR``.
|
||||||
9. Wait a few minutes
|
9. Wait a few minutes
|
||||||
@ -100,6 +99,7 @@ Vagrant Method
|
|||||||
provisioned...
|
provisioned...
|
||||||
3. Run ``vagrant ssh`` and once inside your new vagrant box, edit
|
3. Run ``vagrant ssh`` and once inside your new vagrant box, edit
|
||||||
``/etc/paperless.conf`` and set the values for:
|
``/etc/paperless.conf`` and set the values for:
|
||||||
|
|
||||||
* ``PAPERLESS_CONSUMPTION_DIR``: this is where your documents will be
|
* ``PAPERLESS_CONSUMPTION_DIR``: this is where your documents will be
|
||||||
dumped to be consumed by Paperless.
|
dumped to be consumed by Paperless.
|
||||||
* ``PAPERLESS_PASSPHRASE``: this is the passphrase Paperless uses to
|
* ``PAPERLESS_PASSPHRASE``: this is the passphrase Paperless uses to
|
||||||
@ -107,6 +107,7 @@ Vagrant Method
|
|||||||
* ``PAPERLESS_SHARED_SECRET``: this is the "magic word" used when consuming
|
* ``PAPERLESS_SHARED_SECRET``: this is the "magic word" used when consuming
|
||||||
documents from mail or via the API. If you don't use either, leaving it
|
documents from mail or via the API. If you don't use either, leaving it
|
||||||
blank is just fine.
|
blank is just fine.
|
||||||
|
|
||||||
4. Exit the vagrant box and re-enter it with ``vagrant ssh`` again. This
|
4. Exit the vagrant box and re-enter it with ``vagrant ssh`` again. This
|
||||||
updates the environment to make use of the changes you made to the config
|
updates the environment to make use of the changes you made to the config
|
||||||
file.
|
file.
|
||||||
@ -140,9 +141,9 @@ Docker Method
|
|||||||
.. caution::
|
.. caution::
|
||||||
|
|
||||||
As mentioned earlier, this guide assumes that you use Docker natively
|
As mentioned earlier, this guide assumes that you use Docker natively
|
||||||
under Linux. If you are using `Docker Machine`_ under Mac OS X or Windows,
|
under Linux. If you are using `Docker Machine`_ under Mac OS X or
|
||||||
you will have to adapt IP addresses, volume-mounting, command execution
|
Windows, you will have to adapt IP addresses, volume-mounting, command
|
||||||
and maybe more.
|
execution and maybe more.
|
||||||
|
|
||||||
2. Install `docker-compose`_. [#compose]_
|
2. Install `docker-compose`_. [#compose]_
|
||||||
|
|
||||||
@ -161,14 +162,14 @@ Docker Method
|
|||||||
.. _Docker installation guide: https://docs.docker.com/engine/installation/
|
.. _Docker installation guide: https://docs.docker.com/engine/installation/
|
||||||
.. _docker-compose installation guide: https://docs.docker.com/compose/install/
|
.. _docker-compose installation guide: https://docs.docker.com/compose/install/
|
||||||
|
|
||||||
3. Create a copy of ``docker-compose.yml.example`` as ``docker-compose.yml`` and
|
3. Create a copy of ``docker-compose.yml.example`` as ``docker-compose.yml``
|
||||||
a copy of ``docker-compose.env.example`` as ``docker-compose.env``. You'll be
|
and a copy of ``docker-compose.env.example`` as ``docker-compose.env``.
|
||||||
editing both these files: taking a copy ensures that you can ``git pull`` to
|
You'll be editing both these files: taking a copy ensures that you can
|
||||||
receive updates without risking merge conflicts with your modified versions
|
``git pull`` to receive updates without risking merge conflicts with your
|
||||||
of the configuration files.
|
modified versions of the configuration files.
|
||||||
4. Modify ``docker-compose.yml`` to your preferences, following the instructions
|
4. Modify ``docker-compose.yml`` to your preferences, following the
|
||||||
in comments in the file. The only change that is a hard requirement is to
|
instructions in comments in the file. The only change that is a hard
|
||||||
specify where the consumption directory should mount.
|
requirement is to specify where the consumption directory should mount.
|
||||||
5. Modify ``docker-compose.env`` and adapt the following environment variables:
|
5. Modify ``docker-compose.env`` and adapt the following environment variables:
|
||||||
|
|
||||||
``PAPERLESS_PASSPHRASE``
|
``PAPERLESS_PASSPHRASE``
|
||||||
@ -181,10 +182,11 @@ Docker Method
|
|||||||
the core-count of your CPU and uses that value.
|
the core-count of your CPU and uses that value.
|
||||||
|
|
||||||
``PAPERLESS_OCR_LANGUAGES``
|
``PAPERLESS_OCR_LANGUAGES``
|
||||||
If you want the OCR to recognize other languages in addition to the default
|
If you want the OCR to recognize other languages in addition to the
|
||||||
English, set this parameter to a space separated list of three-letter
|
default English, set this parameter to a space separated list of
|
||||||
language-codes after `ISO 639-2/T`_. For a list of available languages --
|
three-letter language-codes after `ISO 639-2/T`_. For a list of available
|
||||||
including their three letter codes -- see the `Debian packagelist`_.
|
languages -- including their three letter codes -- see the
|
||||||
|
`Debian packagelist`_.
|
||||||
|
|
||||||
``USERMAP_UID`` and ``USERMAP_GID``
|
``USERMAP_UID`` and ``USERMAP_GID``
|
||||||
If you want to mount the consumption volume (directory ``/consume`` within
|
If you want to mount the consumption volume (directory ``/consume`` within
|
||||||
@ -192,11 +194,11 @@ Docker Method
|
|||||||
access rights might be an issue. The default user and group ``paperless``
|
access rights might be an issue. The default user and group ``paperless``
|
||||||
in the containers have an id of 1000. The containers will enforce that the
|
in the containers have an id of 1000. The containers will enforce that the
|
||||||
owning group of the consumption directory will be ``paperless`` to be able
|
owning group of the consumption directory will be ``paperless`` to be able
|
||||||
to delete consumed documents. If your host-system has a group with an id of
|
to delete consumed documents. If your host-system has a group with an ID
|
||||||
1000 and you don't want this group to have access rights to the consumption
|
of 1000 and you don't want this group to have access rights to the
|
||||||
directory, you can use ``USERMAP_GID`` to change the id in the container
|
consumption directory, you can use ``USERMAP_GID`` to change the id in the
|
||||||
and thus the one of the consumption directory. Furthermore, you can change
|
container and thus the one of the consumption directory. Furthermore, you
|
||||||
the id of the default user as well using ``USERMAP_UID``.
|
can change the id of the default user as well using ``USERMAP_UID``.
|
||||||
|
|
||||||
6. Run ``docker-compose up -d``. This will create and start the necessary
|
6. Run ``docker-compose up -d``. This will create and start the necessary
|
||||||
containers.
|
containers.
|
||||||
@ -234,14 +236,14 @@ Docker Method
|
|||||||
.. danger::
|
.. danger::
|
||||||
|
|
||||||
While the consumption container will ensure at startup that it can
|
While the consumption container will ensure at startup that it can
|
||||||
**delete** a consumed file from a host-mounted directory, it might not
|
**delete** a consumed file from a host-mounted directory, it might
|
||||||
be able to **read** the document in the first place if the access
|
not be able to **read** the document in the first place if the access
|
||||||
rights to the file are incorrect.
|
rights to the file are incorrect.
|
||||||
|
|
||||||
Make sure that the documents you put into the consumption directory
|
Make sure that the documents you put into the consumption directory
|
||||||
will either be readable by everyone (``chmod o+r file.pdf``) or
|
will either be readable by everyone (``chmod o+r file.pdf``) or
|
||||||
readable by the default user or group id 1000 (or the one you have set
|
readable by the default user or group id 1000 (or the one you have
|
||||||
with ``USERMAP_UID`` or ``USERMAP_GID`` respectively).
|
set with ``USERMAP_UID`` or ``USERMAP_GID`` respectively).
|
||||||
|
|
||||||
2. Use ``docker cp`` to copy your files directly into the container:
|
2. Use ``docker cp`` to copy your files directly into the container:
|
||||||
|
|
||||||
@ -258,8 +260,8 @@ Docker Method
|
|||||||
|
|
||||||
``docker cp`` is a one-shot-command, just like ``cp``. This means that
|
``docker cp`` is a one-shot-command, just like ``cp``. This means that
|
||||||
every time you want to consume a new document, you will have to execute
|
every time you want to consume a new document, you will have to execute
|
||||||
``docker cp`` again. You can of course automate this process, but option 1
|
``docker cp`` again. You can of course automate this process, but option
|
||||||
is generally the preferred one.
|
1 is generally the preferred one.
|
||||||
|
|
||||||
.. danger::
|
.. danger::
|
||||||
|
|
||||||
@ -267,8 +269,8 @@ Docker Method
|
|||||||
to the acting user at the destination, which will be ``root``.
|
to the acting user at the destination, which will be ``root``.
|
||||||
|
|
||||||
You therefore need to ensure that the documents you want to copy into
|
You therefore need to ensure that the documents you want to copy into
|
||||||
the container are readable by everyone (``chmod o+r file.pdf``) before
|
the container are readable by everyone (``chmod o+r file.pdf``)
|
||||||
copying them.
|
before copying them.
|
||||||
|
|
||||||
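Before dropping a file into the consumption directory, the readability rule from the two warnings above can be checked with a few lines of Python. This is a simplified sketch: ``readable_by_consumer`` is a hypothetical name, the default of 1000 mirrors the default ``paperless`` user and group id mentioned above, and ACLs and the finer points of POSIX permission precedence are ignored.

```python
import os
import stat


def readable_by_consumer(path, uid=1000, gid=1000):
    """Hypothetical check: can the container's user read this file?

    True if the file is world-readable (chmod o+r), or owned by `uid`
    with the owner read bit set, or owned by group `gid` with the group
    read bit set. Pass your USERMAP_UID / USERMAP_GID values if you
    changed them.
    """
    info = os.stat(path)
    mode = info.st_mode
    if mode & stat.S_IROTH:
        return True
    if info.st_uid == uid and mode & stat.S_IRUSR:
        return True
    if info.st_gid == gid and mode & stat.S_IRGRP:
        return True
    return False
```

For the ``docker cp`` route only the world-readable case applies, since the copied file ends up owned by ``root``.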
|
|
||||||
.. _Docker: https://www.docker.com/
|
.. _Docker: https://www.docker.com/
|
||||||
@ -281,17 +283,108 @@ Docker Method
|
|||||||
free to tinker around without using compose!
|
free to tinker around without using compose!
|
||||||
|
|
||||||
|
|
||||||
.. _making-things-a-little-more-permanent:
|
.. _setup-permanent:
|
||||||
|
|
||||||
Making Things a Little more Permanent
|
Making Things a Little more Permanent
|
||||||
-------------------------------------
|
-------------------------------------
|
||||||
|
|
||||||
Once you've tested things and are happy with the work flow, you can automate the
|
Once you've tested things and are happy with the work flow, you can automate
|
||||||
process of starting the webserver and consumer automatically. If you're running
|
the process of starting the webserver and consumer automatically.
|
||||||
on a bare metal system that's using Systemd, you can use the service unit files
|
|
||||||
in the ``scripts`` directory to set this up. If you're on another startup
|
|
||||||
system or are using a Vagrant box, then you're currently on your own. If you are
|
.. _setup-permanent-standard-systemd:
|
||||||
using Docker, you can set a restart-policy_ in the ``docker-compose.yml`` to
|
|
||||||
have the containers automatically start with the Docker daemon.
|
Standard (Bare Metal, Systemd)
|
||||||
|
..............................
|
||||||
|
|
||||||
|
If you're running on a bare metal system that's using Systemd, you can use the
|
||||||
|
service unit files in the ``scripts`` directory to set this up. You'll need to
|
||||||
|
create a user called ``paperless`` and set up Paperless to be in a place that
|
||||||
|
this new user can read and write to. Then, you can just tell Systemd to enable
|
||||||
|
the two ``.service`` files::
|
||||||
|
|
||||||
|
# systemctl enable /path/to/paperless/scripts/paperless-consumer.service
|
||||||
|
# systemctl enable /path/to/paperless/scripts/paperless-webserver.service
|
||||||
|
# systemctl start paperless-consumer.service
|
||||||
|
# systemctl start paperless-webserver.service
|
||||||
|
|
||||||
|
|
||||||
|
.. _setup-permanent-standard-ubuntu14:
|
||||||
|
|
||||||
|
Ubuntu 14.04 (Bare Metal, Upstart)
|
||||||
|
..................................
|
||||||
|
|
||||||
|
Ubuntu 14.04 and earlier use the `Upstart`_ init system to start services
|
||||||
|
during the boot process. To configure Upstart to run Paperless automatically
|
||||||
|
after restarting your system:
|
||||||
|
|
||||||
|
1. Change to the directory where Upstart's configuration files are kept:
|
||||||
|
``cd /etc/init``
|
||||||
|
2. Create a new file: ``sudo nano paperless-server.conf``
|
||||||
|
3. In the newly-created file enter::
|
||||||
|
|
||||||
|
start on (local-filesystems and net-device-up IFACE=eth0)
|
||||||
|
stop on shutdown
|
||||||
|
|
||||||
|
respawn
|
||||||
|
respawn limit 10 5
|
||||||
|
|
||||||
|
script
|
||||||
|
exec /srv/paperless/src/manage.py runserver 0.0.0.0:80
|
||||||
|
end script
|
||||||
|
|
||||||
|
Note that you'll need to replace ``/srv/paperless/src/manage.py`` with the
|
||||||
|
path to the ``manage.py`` script in your installation directory.
|
||||||
|
|
||||||
|
If you are using a network interface other than ``eth0``, you will have to
|
||||||
|
change ``IFACE=eth0``. For example, if you are connected via WiFi, you will
|
||||||
|
likely need to replace ``eth0`` above with ``wlan0``. To see all interfaces,
|
||||||
|
run ``ifconfig``.
|
||||||
|
|
||||||
|
Save the file.
|
||||||
|
|
||||||
|
4. Create a new file: ``sudo nano paperless-consumer.conf``
|
||||||
|
|
||||||
|
5. In the newly-created file enter::
|
||||||
|
|
||||||
|
start on (local-filesystems and net-device-up IFACE=eth0)
|
||||||
|
stop on shutdown
|
||||||
|
|
||||||
|
respawn
|
||||||
|
respawn limit 10 5
|
||||||
|
|
||||||
|
script
|
||||||
|
exec /srv/paperless/src/manage.py document_consumer
|
||||||
|
end script
|
||||||
|
|
||||||
|
Replace ``/srv/paperless/src/manage.py`` with the same values as in step 3
|
||||||
|
above and replace ``eth0`` with the appropriate value, if necessary. Save the
|
||||||
|
file.
|
||||||
|
|
||||||
|
These two configuration files together will start both the Paperless webserver
|
||||||
|
and document consumer processes when the file system and network interface
|
||||||
|
specified are available after boot. Furthermore, if either process ever exits
|
||||||
|
unexpectedly, Upstart will try to restart it a maximum of 10 times within a 5
|
||||||
|
second period.
|
||||||
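To make the ``respawn limit 10 5`` behaviour concrete, here is a small Python model of "at most 10 restarts within a 5 second period". It is an illustration only, not Upstart's implementation, and the class name ``RespawnLimiter`` is made up for this sketch.

```python
from collections import deque


class RespawnLimiter:
    """Illustrative model: allow at most `count` restarts per `interval` seconds."""

    def __init__(self, count=10, interval=5.0):
        self.count = count
        self.interval = interval
        self._stamps = deque()  # times of recent restarts, oldest first

    def allow(self, now):
        """Return True if a restart at time `now` (in seconds) is permitted."""
        # Forget restarts that have fallen out of the sliding window.
        while self._stamps and now - self._stamps[0] >= self.interval:
            self._stamps.popleft()
        if len(self._stamps) >= self.count:
            return False  # too many restarts in the window: give up
        self._stamps.append(now)
        return True
```

A process that crashes in a tight loop hits the limit quickly and is left stopped, while one that fails only occasionally keeps being restarted.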
|
|
||||||
|
.. _Upstart: http://upstart.ubuntu.com/
|
||||||
|
|
||||||
|
|
||||||
|
.. _setup-permanent-vagrant:
|
||||||
|
|
||||||
|
Vagrant
|
||||||
|
.......
|
||||||
|
|
||||||
|
You're currently on your own, but the Ubuntu explanation above may be enough.
|
||||||
|
|
||||||
|
|
||||||
|
.. _setup-permanent-docker:
|
||||||
|
|
||||||
|
Docker
|
||||||
|
......
|
||||||
|
|
||||||
|
If you're using Docker, you can set a restart-policy_ in the
|
||||||
|
``docker-compose.yml`` to have the containers automatically start with the
|
||||||
|
Docker daemon.
|
||||||
|
|
||||||
.. _restart-policy: https://docs.docker.com/engine/reference/commandline/run/#restart-policies-restart
|
.. _restart-policy: https://docs.docker.com/engine/reference/commandline/run/#restart-policies-restart
|
||||||
|
19
docs/troubleshooting.rst
Normal file
@ -0,0 +1,19 @@
|
|||||||
|
.. _troubleshooting:
|
||||||
|
|
||||||
|
Troubleshooting
|
||||||
|
===============
|
||||||
|
|
||||||
|
.. _troubleshooting_ocr_language_files_missing:
|
||||||
|
|
||||||
|
Consumer warns ``OCR for XX failed``
|
||||||
|
------------------------------------
|
||||||
|
|
||||||
|
If you find the OCR accuracy to be too low, and/or the document consumer warns that ``OCR for
|
||||||
|
XX failed, but we're going to stick with what we've got since FORGIVING_OCR is enabled``, then you
|
||||||
|
might need to install the `Tesseract language files
|
||||||
|
<http://packages.ubuntu.com/search?keywords=tesseract-ocr>`_ matching your documents' languages.
|
||||||
|
|
||||||
|
As an example, if you are running Paperless from the Vagrant setup provided (or from any Ubuntu or Debian
|
||||||
|
box), and your documents are written in Spanish you may need to run::
|
||||||
|
|
||||||
|
apt-get install -y tesseract-ocr-spa
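If you already list your languages in ``PAPERLESS_OCR_LANGUAGES`` (a space separated list of three-letter codes), the matching package names can be derived from it. The sketch below assumes the ``tesseract-ocr-<code>`` naming seen in the example above holds for every code you use, which is worth verifying against your distribution's package list. Both function names are hypothetical.

```python
def parse_ocr_languages(value):
    """Split a PAPERLESS_OCR_LANGUAGES value into individual codes.

    An empty value yields an empty list.
    """
    return value.split()


def debian_packages(codes):
    """Assumed naming scheme: one tesseract-ocr-<code> package per language."""
    return [f"tesseract-ocr-{code}" for code in codes]
```

For example, a value of ``deu spa`` maps to ``tesseract-ocr-deu`` and ``tesseract-ocr-spa``.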
|
@ -20,7 +20,7 @@ PAPERLESS_CONSUME_MAIL_PASS=""
|
|||||||
#
|
#
|
||||||
# The passphrase you use here will be used when storing your documents in
|
# The passphrase you use here will be used when storing your documents in
|
||||||
# Paperless, but you can always export them in an unencrypted format by using
|
# Paperless, but you can always export them in an unencrypted format by using
|
||||||
# document exporter. See the documentaiton for more information.
|
# document exporter. See the documentation for more information.
|
||||||
#
|
#
|
||||||
# One final note about the passphrase. Once you've consumed a document with
|
# One final note about the passphrase. Once you've consumed a document with
|
||||||
# one passphrase, DON'T CHANGE IT. Paperless assumes this to be a constant and
|
# one passphrase, DON'T CHANGE IT. Paperless assumes this to be a constant and
|
||||||
@ -31,3 +31,8 @@ PAPERLESS_PASSPHRASE="secret"
|
|||||||
# If you intend to consume documents either via HTTP POST or by email, you must
|
# If you intend to consume documents either via HTTP POST or by email, you must
|
||||||
# have a shared secret here.
|
# have a shared secret here.
|
||||||
PAPERLESS_SHARED_SECRET=""
|
PAPERLESS_SHARED_SECRET=""
|
||||||
|
|
||||||
|
# By default, Paperless will attempt to use all available CPU cores to process
|
||||||
|
# a document, but if you would like to limit that, you can set this value to
|
||||||
|
# an integer:
|
||||||
|
#PAPERLESS_OCR_THREADS=1
|
||||||
|
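The new `PAPERLESS_OCR_THREADS` comment describes a default of "all available CPU cores" with an optional integer override. A minimal sketch of that lookup follows. The helper name `ocr_threads` and the clamping to at least one thread are assumptions for illustration, not code from this commit.

```python
import os
from multiprocessing import cpu_count


def ocr_threads(environ=os.environ):
    """Return the number of OCR worker threads to use.

    Hypothetical helper: fall back to every available core when
    PAPERLESS_OCR_THREADS is unset or empty, otherwise use the integer
    supplied by the user (clamped to at least 1).
    """
    raw = environ.get("PAPERLESS_OCR_THREADS")
    if raw is None or raw.strip() == "":
        return cpu_count()
    return max(1, int(raw))
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise without touching the real environment.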
BIN
presentation/img/kitten.jpg
Normal file
Binary file not shown.
After Width: | Height: | Size: 92 KiB |
@ -148,12 +148,12 @@
|
|||||||
|
|
||||||
<section data-background="img/pony.png">
|
<section data-background="img/pony.png">
|
||||||
<h2>Demo!</h2>
|
<h2>Demo!</h2>
|
||||||
<p>(Time to sacrifice a kitten)</p>
|
<img src="img/kitten.jpg" style="width: 50%;" />
|
||||||
</section>
|
</section>
|
||||||
|
|
||||||
<section>
|
<section>
|
||||||
<h2>TODO</h2>
|
<h2>TODO</h2>
|
||||||
<p>It works, but it could use polish</p>
|
<p>It works, but it needs polish</p>
|
||||||
<ul>
|
<ul>
|
||||||
<li>The UI is the Django admin</li>
|
<li>The UI is the Django admin</li>
|
||||||
<li>Mail consumption is really raw</li>
|
<li>Mail consumption is really raw</li>
|
||||||
@ -163,11 +163,11 @@
|
|||||||
<aside class="notes">
|
<aside class="notes">
|
||||||
<ul>
|
<ul>
|
||||||
<li>
|
<li>
|
||||||
<strong>Plugin architecture</strong>: there've been requests for
|
<strong>Plugin architecture</strong>: there've been requests
|
||||||
some overly custom stuff to happen before and after consumption,
|
for some overly custom stuff to happen before and after
|
||||||
but in the UNIX spirit of "do one job well", I think this sort
|
consumption, but in the UNIX spirit of "do one job well", I
|
||||||
of thing is better written as a plugin -- which means I need to
|
think this sort of thing is better written as a plugin -- which
|
||||||
figure out a best practise for that.
|
means I need to figure out a best practise for that.
|
||||||
</li>
|
</li>
|
||||||
</ul>
|
</ul>
|
||||||
</aside>
|
</aside>
|
||||||
|
@ -1,4 +1,4 @@
|
|||||||
Django==1.9.2
|
Django==1.9.4
|
||||||
Pillow==3.1.1
|
Pillow==3.1.1
|
||||||
django-crispy-forms==1.6.0
|
django-crispy-forms==1.6.0
|
||||||
django-extensions==1.6.1
|
django-extensions==1.6.1
|
||||||
|
@ -19,12 +19,11 @@ from PIL import Image
|
|||||||
|
|
||||||
from django.conf import settings
|
from django.conf import settings
|
||||||
from django.utils import timezone
|
from django.utils import timezone
|
||||||
from django.template.defaultfilters import slugify
|
|
||||||
from pyocr.tesseract import TesseractError
|
from pyocr.tesseract import TesseractError
|
||||||
|
|
||||||
from paperless.db import GnuPG
|
from paperless.db import GnuPG
|
||||||
|
|
||||||
from .models import Correspondent, Tag, Document, Log
|
from .models import Tag, Document, Log, FileInfo
|
||||||
from .languages import ISO639
|
from .languages import ISO639
|
||||||
from .signals import (
|
from .signals import (
|
||||||
document_consumption_started, document_consumption_finished)
|
document_consumption_started, document_consumption_finished)
|
||||||
@ -56,19 +55,6 @@ class Consumer(object):
|
|||||||
|
|
||||||
DEFAULT_OCR_LANGUAGE = settings.OCR_LANGUAGE
|
DEFAULT_OCR_LANGUAGE = settings.OCR_LANGUAGE
|
||||||
|
|
||||||
REGEX_TITLE = re.compile(
|
|
||||||
r"^.*/(.*)\.(pdf|jpe?g|png|gif|tiff)$",
|
|
||||||
flags=re.IGNORECASE
|
|
||||||
)
|
|
||||||
REGEX_CORRESPONDENT_TITLE = re.compile(
|
|
||||||
r"^.*/(.+) - (.*)\.(pdf|jpe?g|png|gif|tiff)$",
|
|
||||||
flags=re.IGNORECASE
|
|
||||||
)
|
|
||||||
REGEX_CORRESPONDENT_TITLE_TAGS = re.compile(
|
|
||||||
r"^.*/(.*) - (.*) - ([a-z0-9\-,]*)\.(pdf|jpe?g|png|gif|tiff)$",
|
|
||||||
flags=re.IGNORECASE
|
|
||||||
)
|
|
||||||
|
|
||||||
def __init__(self):
|
def __init__(self):
|
||||||
|
|
||||||
self.logger = logging.getLogger(__name__)
|
self.logger = logging.getLogger(__name__)
|
||||||
@ -107,7 +93,7 @@ class Consumer(object):
|
|||||||
if not os.path.isfile(doc):
|
if not os.path.isfile(doc):
|
||||||
continue
|
continue
|
||||||
|
|
||||||
if not re.match(self.REGEX_TITLE, doc):
|
if not re.match(FileInfo.REGEXES["title"], doc):
|
||||||
continue
|
continue
|
||||||
|
|
||||||
if doc in self._ignore:
|
if doc in self._ignore:
|
||||||
@ -282,72 +268,20 @@ class Consumer(object):
|
|||||||
# Strip out excess white space to allow matching to go smoother
|
# Strip out excess white space to allow matching to go smoother
|
||||||
return re.sub(r"\s+", " ", r)
|
return re.sub(r"\s+", " ", r)
|
||||||
|
|
||||||
def _guess_attributes_from_name(self, parseable):
|
|
||||||
"""
|
|
||||||
We use a crude naming convention to make handling the correspondent,
|
|
||||||
title, and tags easier:
|
|
||||||
"<correspondent> - <title> - <tags>.<suffix>"
|
|
||||||
"<correspondent> - <title>.<suffix>"
|
|
||||||
"<title>.<suffix>"
|
|
||||||
"""
|
|
||||||
|
|
||||||
def get_correspondent(correspondent_name):
|
|
||||||
return Correspondent.objects.get_or_create(
|
|
||||||
name=correspondent_name,
|
|
||||||
defaults={"slug": slugify(correspondent_name)}
|
|
||||||
)[0]
|
|
||||||
|
|
||||||
def get_tags(tags):
|
|
||||||
r = []
|
|
||||||
for t in tags.split(","):
|
|
||||||
r.append(
|
|
||||||
Tag.objects.get_or_create(slug=t, defaults={"name": t})[0])
|
|
||||||
return tuple(r)
|
|
||||||
|
|
||||||
def get_suffix(suffix):
|
|
||||||
suffix = suffix.lower()
|
|
||||||
if suffix == "jpeg":
|
|
||||||
return "jpg"
|
|
||||||
return suffix
|
|
||||||
|
|
||||||
# First attempt: "<correspondent> - <title> - <tags>.<suffix>"
|
|
||||||
m = re.match(self.REGEX_CORRESPONDENT_TITLE_TAGS, parseable)
|
|
||||||
if m:
|
|
||||||
return (
|
|
||||||
get_correspondent(m.group(1)),
|
|
||||||
m.group(2),
|
|
||||||
get_tags(m.group(3)),
|
|
||||||
get_suffix(m.group(4))
|
|
||||||
)
|
|
||||||
|
|
||||||
# Second attempt: "<correspondent> - <title>.<suffix>"
|
|
||||||
m = re.match(self.REGEX_CORRESPONDENT_TITLE, parseable)
|
|
||||||
if m:
|
|
||||||
return (
|
|
||||||
get_correspondent(m.group(1)),
|
|
||||||
m.group(2),
|
|
||||||
(),
|
|
||||||
get_suffix(m.group(3))
|
|
||||||
)
|
|
||||||
|
|
||||||
# That didn't work, so we assume correspondent and tags are None
|
|
||||||
m = re.match(self.REGEX_TITLE, parseable)
|
|
||||||
return None, m.group(1), (), get_suffix(m.group(2))
|
|
||||||
|
|
||||||
def _store(self, text, doc, thumbnail):
|
def _store(self, text, doc, thumbnail):
|
||||||
|
|
||||||
sender, title, tags, file_type = self._guess_attributes_from_name(doc)
|
file_info = FileInfo.from_path(doc)
|
||||||
relevant_tags = set(list(Tag.match_all(text)) + list(tags))
|
relevant_tags = set(list(Tag.match_all(text)) + list(file_info.tags))
|
||||||
|
|
||||||
stats = os.stat(doc)
|
stats = os.stat(doc)
|
||||||
|
|
||||||
self.log("debug", "Saving record to database")
|
self.log("debug", "Saving record to database")
|
||||||
|
|
||||||
document = Document.objects.create(
|
document = Document.objects.create(
|
||||||
correspondent=sender,
|
correspondent=file_info.correspondent,
|
||||||
title=title,
|
title=file_info.title,
|
||||||
content=text,
|
content=text,
|
||||||
file_type=file_type,
|
file_type=file_info.extension,
|
||||||
created=timezone.make_aware(
|
created=timezone.make_aware(
|
||||||
datetime.datetime.fromtimestamp(stats.st_mtime)),
|
datetime.datetime.fromtimestamp(stats.st_mtime)),
|
||||||
modified=timezone.make_aware(
|
modified=timezone.make_aware(
|
||||||
|
@ -96,11 +96,16 @@ class Command(Renderable, BaseCommand):
|
|||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
def _get_legacy_file_name(doc):
|
def _get_legacy_file_name(doc):
|
||||||
if doc.correspondent and doc.title:
|
|
||||||
tags = ",".join([t.slug for t in doc.tags.all()])
|
if not doc.correspondent and not doc.title:
|
||||||
if tags:
|
return os.path.basename(doc.source_path)
|
||||||
return "{} - {} - {}.{}".format(
|
|
||||||
doc.correspondent, doc.title, tags, doc.file_type)
|
created = doc.created.strftime("%Y%m%d%H%M%SZ")
|
||||||
return "{} - {}.{}".format(
|
tags = ",".join([t.slug for t in doc.tags.all()])
|
||||||
doc.correspondent, doc.title, doc.file_type)
|
|
||||||
return os.path.basename(doc.source_path)
|
if tags:
|
||||||
|
return "{} - {} - {} - {}.{}".format(
|
||||||
|
created, doc.correspondent, doc.title, tags, doc.file_type)
|
||||||
|
|
||||||
|
return "{} - {} - {}.{}".format(
|
||||||
|
created, doc.correspondent, doc.title, doc.file_type)
|
||||||
|
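The exporter change above prefixes legacy file names with a UTC-style timestamp. The standalone sketch below reproduces only the formatting step on plain values. `legacy_file_name` and its arguments are hypothetical stand-ins for the `doc` attributes used in the diff, and the early return for documents lacking both correspondent and title is left out.

```python
from datetime import datetime


def legacy_file_name(created, correspondent, title, tags, file_type):
    """Build "<created> - <correspondent> - <title>[ - <tags>].<ext>".

    Illustrative stand-in for the exporter helper: takes plain values
    instead of a Document instance.
    """
    stamp = created.strftime("%Y%m%d%H%M%SZ")
    if tags:
        return "{} - {} - {} - {}.{}".format(
            stamp, correspondent, title, ",".join(tags), file_type)
    return "{} - {} - {}.{}".format(stamp, correspondent, title, file_type)


name = legacy_file_name(
    datetime(2015, 1, 2, 3, 4, 5), "timmy", "Invoice", ["tag1", "tag2"], "pdf")
```

The resulting name has the shape that the "created-correspondent-title-tags" pattern added elsewhere in this commit is written to parse.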
@ -1,8 +1,11 @@
|
|||||||
|
import dateutil.parser
|
||||||
import logging
|
import logging
|
||||||
import os
|
import os
|
||||||
import re
|
import re
|
||||||
import uuid
|
import uuid
|
||||||
|
|
||||||
|
from collections import OrderedDict
|
||||||
|
|
||||||
from django.conf import settings
|
from django.conf import settings
|
||||||
from django.core.urlresolvers import reverse
|
from django.core.urlresolvers import reverse
|
||||||
from django.db import models
|
from django.db import models
|
||||||
@ -152,7 +155,7 @@ class Document(models.Model):
|
|||||||
)
|
)
|
||||||
tags = models.ManyToManyField(
|
tags = models.ManyToManyField(
|
||||||
Tag, related_name="documents", blank=True)
|
Tag, related_name="documents", blank=True)
|
||||||
created = models.DateTimeField(default=timezone.now, editable=False)
|
created = models.DateTimeField(default=timezone.now)
|
||||||
modified = models.DateTimeField(auto_now=True, editable=False)
|
modified = models.DateTimeField(auto_now=True, editable=False)
|
||||||
|
|
||||||
class Meta(object):
|
class Meta(object):
|
||||||
@ -250,3 +253,136 @@ class Log(models.Model):
|
|||||||
self.group = uuid.uuid4()
|
self.group = uuid.uuid4()
|
||||||
|
|
||||||
models.Model.save(self, *args, **kwargs)
|
models.Model.save(self, *args, **kwargs)
|
||||||
|
|
||||||
|
|
||||||
|
class FileInfo(object):
|
||||||
|
|
||||||
|
# This epic regex *almost* worked for our needs, so I'm keeping it here for
|
||||||
|
# posterity, in the hopes that we might find a way to make it work one day.
|
||||||
|
ALMOST_REGEX = re.compile(
|
||||||
|
r"^((?P<date>\d\d\d\d\d\d\d\d\d\d\d\d\d\dZ){separator})?"
|
||||||
|
r"((?P<correspondent>{non_separated_word}+){separator})??"
|
||||||
|
r"(?P<title>{non_separated_word}+)"
|
||||||
|
r"({separator}(?P<tags>[a-z,0-9-]+))?"
|
||||||
|
r"\.(?P<extension>[a-zA-Z.-]+)$".format(
|
||||||
|
separator=r"\s+-\s+",
|
||||||
|
non_separated_word=r"([\w,. ]|([^\s]-))"
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
REGEXES = OrderedDict([
|
||||||
|
("created-correspondent-title-tags", re.compile(
|
||||||
|
r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - "
|
||||||
|
r"(?P<correspondent>.*) - "
|
||||||
|
r"(?P<title>.*) - "
|
||||||
|
r"(?P<tags>[a-z0-9\-,]*)"
|
||||||
|
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
|
||||||
|
flags=re.IGNORECASE
|
||||||
|
)),
|
||||||
|
("created-title-tags", re.compile(
|
||||||
|
r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - "
|
||||||
|
r"(?P<title>.*) - "
|
||||||
|
r"(?P<tags>[a-z0-9\-,]*)"
|
||||||
|
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
|
||||||
|
flags=re.IGNORECASE
|
||||||
|
)),
|
||||||
|
("created-correspondent-title", re.compile(
|
||||||
|
r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - "
|
||||||
|
r"(?P<correspondent>.*) - "
|
||||||
|
r"(?P<title>.*)"
|
||||||
|
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
|
||||||
|
flags=re.IGNORECASE
|
||||||
|
)),
|
||||||
|
("created-title", re.compile(
|
||||||
|
r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - "
|
||||||
|
r"(?P<title>.*)"
|
||||||
|
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
|
||||||
|
flags=re.IGNORECASE
|
||||||
|
)),
|
||||||
|
("correspondent-title-tags", re.compile(
|
||||||
|
r"(?P<correspondent>.*) - "
|
||||||
|
r"(?P<title>.*) - "
|
||||||
|
r"(?P<tags>[a-z0-9\-,]*)"
|
||||||
|
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
|
||||||
|
flags=re.IGNORECASE
|
||||||
|
)),
|
||||||
|
("correspondent-title", re.compile(
|
||||||
|
r"(?P<correspondent>.*) - "
|
||||||
|
r"(?P<title>.*)?"
|
||||||
|
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
|
||||||
|
flags=re.IGNORECASE
|
||||||
|
)),
|
||||||
|
("title", re.compile(
|
||||||
|
r"(?P<title>.*)"
|
||||||
|
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
|
||||||
|
flags=re.IGNORECASE
|
||||||
|
))
|
||||||
|
])
|
||||||
|
|
||||||
|
def __init__(self, created=None, correspondent=None, title=None, tags=(),
|
||||||
|
extension=None):
|
||||||
|
|
||||||
|
self.created = created
|
||||||
|
self.title = title
|
||||||
|
self.extension = extension
|
||||||
|
self.correspondent = correspondent
|
||||||
|
self.tags = tags
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def _get_created(cls, created):
|
||||||
|
return dateutil.parser.parse("{:0<14}Z".format(created[:-1]))
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def _get_correspondent(cls, name):
|
||||||
|
if not name:
|
||||||
|
return None
|
||||||
|
return Correspondent.objects.get_or_create(name=name, defaults={
|
||||||
|
"slug": slugify(name)
|
||||||
|
})[0]
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def _get_title(cls, title):
|
||||||
|
return title
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def _get_tags(cls, tags):
|
||||||
|
r = []
|
||||||
|
for t in tags.split(","):
|
||||||
|
r.append(
|
||||||
|
Tag.objects.get_or_create(slug=t, defaults={"name": t})[0])
|
||||||
|
return tuple(r)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def _get_extension(cls, extension):
|
||||||
|
r = extension.lower()
|
||||||
|
if r == "jpeg":
|
||||||
|
return "jpg"
|
||||||
|
return r
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def _mangle_property(cls, properties, name):
|
||||||
|
if name in properties:
|
||||||
|
properties[name] = getattr(cls, "_get_{}".format(name))(
|
||||||
|
properties[name]
|
||||||
|
)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_path(cls, path):
|
||||||
|
"""
|
||||||
|
We use a crude naming convention to make handling the correspondent,
|
||||||
|
title, and tags easier (each form may also start with "<created> - "):
|
||||||
|
"<correspondent> - <title> - <tags>.<suffix>"
|
||||||
|
"<correspondent> - <title>.<suffix>"
|
||||||
|
"<title>.<suffix>"
|
||||||
|
"""
|
||||||
|
|
||||||
|
for regex in cls.REGEXES.values():
|
||||||
|
m = regex.match(os.path.basename(path))
|
||||||
|
if m:
|
||||||
|
properties = m.groupdict()
|
||||||
|
cls._mangle_property(properties, "created")
|
||||||
|
cls._mangle_property(properties, "correspondent")
|
||||||
|
cls._mangle_property(properties, "title")
|
||||||
|
cls._mangle_property(properties, "tags")
|
||||||
|
cls._mangle_property(properties, "extension")
|
||||||
|
return cls(**properties)
|
||||||
|
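`FileInfo.from_path` tries the patterns in `REGEXES` from most to least specific and returns on the first match. The sketch below shows the same idea without Django. It uses a reduced pattern set, returns plain strings rather than `Correspondent` and `Tag` objects, and swaps `dateutil` for `datetime.strptime`. The names `PATTERNS`, `parse_created` and `parse_name` are assumptions for illustration.

```python
import re
from collections import OrderedDict
from datetime import datetime

EXT = r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$"
CREATED = r"^(?P<created>\d{8}(\d{6})?Z) - "

# Most specific pattern first: iteration order decides which one wins.
PATTERNS = OrderedDict([
    ("created-correspondent-title-tags", re.compile(
        CREATED + r"(?P<correspondent>.*) - (?P<title>.*) - "
        r"(?P<tags>[a-z0-9\-,]*)" + EXT, re.IGNORECASE)),
    ("created-title", re.compile(
        CREATED + r"(?P<title>.*)" + EXT, re.IGNORECASE)),
    ("correspondent-title", re.compile(
        r"(?P<correspondent>.*) - (?P<title>.*)" + EXT, re.IGNORECASE)),
    ("title", re.compile(r"(?P<title>.*)" + EXT, re.IGNORECASE)),
])


def parse_created(stamp):
    # Drop the trailing "Z" and right-pad date-only stamps with zeros,
    # mirroring the "{:0<14}" padding used in _get_created.
    return datetime.strptime("{:0<14}".format(stamp[:-1]), "%Y%m%d%H%M%S")


def parse_name(basename):
    """Return (pattern_name, properties) for the first matching pattern."""
    for name, regex in PATTERNS.items():
        m = regex.match(basename)
        if not m:
            continue
        props = m.groupdict()
        if "created" in props:
            props["created"] = parse_created(props["created"])
        if "tags" in props:
            props["tags"] = tuple(props["tags"].split(","))
        ext = props["extension"].lower()
        props["extension"] = "jpg" if ext == "jpeg" else ext
        return name, props
    return None, {}
```

Because the catch-all "title" pattern matches any supported file name, ordering is what keeps a name such as `timmy - title.jpeg` from being read as a single long title.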
@ -1,29 +1,36 @@
|
|||||||
from django.test import TestCase
|
from django.test import TestCase
|
||||||
|
|
||||||
from ..consumer import Consumer
|
from ..models import Document, FileInfo
|
||||||
|
|
||||||
|
|
||||||
class TestAttachment(TestCase):
|
class TestAttachment(TestCase):
|
||||||
|
|
||||||
TAGS = ("tag1", "tag2", "tag3")
|
TAGS = ("tag1", "tag2", "tag3")
|
||||||
CONSUMER = Consumer()
|
EXTENSIONS = (
|
||||||
SUFFIXES = (
|
|
||||||
"pdf", "png", "jpg", "jpeg", "gif",
|
"pdf", "png", "jpg", "jpeg", "gif",
|
||||||
"PDF", "PNG", "JPG", "JPEG", "GIF",
|
"PDF", "PNG", "JPG", "JPEG", "GIF",
|
||||||
"PdF", "PnG", "JpG", "JPeG", "GiF",
|
"PdF", "PnG", "JpG", "JPeG", "GiF",
|
||||||
)
|
)
|
||||||
|
|
||||||
def _test_guess_attributes_from_name(self, path, sender, title, tags):
|
def _test_guess_attributes_from_name(self, path, sender, title, tags):
|
||||||
for suffix in self.SUFFIXES:
|
|
||||||
f = path.format(suffix)
|
for extension in self.EXTENSIONS:
|
||||||
results = self.CONSUMER._guess_attributes_from_name(f)
|
|
||||||
self.assertEqual(results[0].name, sender, f)
|
f = path.format(extension)
|
||||||
self.assertEqual(results[1], title, f)
|
file_info = FileInfo.from_path(f)
|
||||||
self.assertEqual(tuple([t.slug for t in results[2]]), tags, f)
|
|
||||||
if suffix.lower() == "jpeg":
|
if sender:
|
||||||
self.assertEqual(results[3], "jpg", f)
|
self.assertEqual(file_info.correspondent.name, sender, f)
|
||||||
else:
|
else:
|
||||||
self.assertEqual(results[3], suffix.lower(), f)
|
self.assertIsNone(file_info.correspondent, f)
|
||||||
|
|
||||||
|
self.assertEqual(file_info.title, title, f)
|
||||||
|
|
||||||
|
self.assertEqual(tuple([t.slug for t in file_info.tags]), tags, f)
|
||||||
|
if extension.lower() == "jpeg":
|
||||||
|
self.assertEqual(file_info.extension, "jpg", f)
|
||||||
|
else:
|
||||||
|
self.assertEqual(file_info.extension, extension.lower(), f)
|
||||||
|
|
||||||
def test_guess_attributes_from_name0(self):
|
def test_guess_attributes_from_name0(self):
|
||||||
self._test_guess_attributes_from_name(
|
self._test_guess_attributes_from_name(
|
||||||
@ -92,3 +99,206 @@ class TestAttachment(TestCase):
|
|||||||
"Τιτλε",
|
"Τιτλε",
|
||||||
self.TAGS
|
self.TAGS
|
||||||
)
|
)
|
||||||
|
|
||||||
|
def test_guess_attributes_from_name_when_correspondent_empty(self):
|
||||||
|
self._test_guess_attributes_from_name(
|
||||||
|
'/path/to/ - weird empty correspondent but should not break.{}',
|
||||||
|
None,
|
||||||
|
'weird empty correspondent but should not break',
|
||||||
|
()
|
||||||
|
)
|
||||||
|
|
||||||
|
def test_guess_attributes_from_name_when_title_starts_with_dash(self):
|
||||||
|
self._test_guess_attributes_from_name(
|
||||||
|
'/path/to/- weird but should not break.{}',
|
||||||
|
None,
|
||||||
|
'- weird but should not break',
|
||||||
|
()
|
||||||
|
)
|
||||||
|
|
||||||
|
def test_guess_attributes_from_name_when_title_ends_with_dash(self):
|
||||||
|
self._test_guess_attributes_from_name(
|
||||||
|
'/path/to/weird but should not break -.{}',
|
||||||
|
None,
|
||||||
|
'weird but should not break -',
|
||||||
|
()
|
||||||
|
)
|
||||||
|
|
||||||
|
def test_guess_attributes_from_name_when_title_is_empty(self):
|
||||||
|
self._test_guess_attributes_from_name(
|
||||||
|
'/path/to/weird correspondent but should not break - .{}',
|
||||||
|
'weird correspondent but should not break',
|
||||||
|
'',
|
||||||
|
()
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
class Permutations(TestCase):
|
||||||
|
|
||||||
|
valid_dates = (
|
||||||
|
"20150102030405Z",
|
||||||
|
"20150102Z",
|
||||||
|
)
|
||||||
|
valid_correspondents = [
|
||||||
|
"timmy",
|
||||||
|
"Dr. McWheelie",
|
||||||
|
"Dash Gor-don",
|
||||||
|
"ο Θερμαστής",
|
||||||
|
""
|
||||||
|
]
|
||||||
|
valid_titles = ["title", "Title w Spaces", "Title a-dash", "Τίτλος", ""]
|
||||||
|
valid_tags = ["tag", "tig,tag", "tag1,tag2,tag-3"]
|
||||||
|
valid_extensions = ["pdf", "png", "jpg", "jpeg", "gif"]
|
||||||
|
|
||||||
|
def _test_guessed_attributes(self, filename, created=None,
|
||||||
|
correspondent=None, title=None,
|
||||||
|
extension=None, tags=None):
|
||||||
|
|
||||||
|
# print(filename)
|
||||||
|
info = FileInfo.from_path(filename)
|
||||||
|
|
||||||
|
# Created
|
||||||
|
if created is None:
|
||||||
|
self.assertIsNone(info.created, filename)
|
||||||
|
else:
|
||||||
|
self.assertEqual(info.created.year, int(created[:4]), filename)
|
||||||
|
self.assertEqual(info.created.month, int(created[4:6]), filename)
|
||||||
|
self.assertEqual(info.created.day, int(created[6:8]), filename)
|
||||||
|
|
||||||
|
# Correspondent
|
||||||
|
if correspondent:
|
||||||
|
self.assertEqual(info.correspondent.name, correspondent, filename)
|
||||||
|
else:
|
||||||
|
self.assertEqual(info.correspondent, None, filename)
|
||||||
|
|
||||||
|
# Title
|
||||||
|
self.assertEqual(info.title, title, filename)
|
||||||
|
|
||||||
|
# Tags
|
||||||
|
if tags is None:
|
||||||
|
self.assertEqual(info.tags, (), filename)
|
||||||
|
else:
|
||||||
|
self.assertEqual(
|
||||||
|
[t.slug for t in info.tags], tags.split(','),
|
||||||
|
filename
|
||||||
|
)
|
||||||
|
|
||||||
|
# Extension
|
||||||
|
if extension == 'jpeg':
|
||||||
|
extension = 'jpg'
|
||||||
|
self.assertEqual(info.extension, extension, filename)
|
||||||
|
|
||||||
|
def test_just_title(self):
|
||||||
|
template = '/path/to/{title}.{extension}'
|
||||||
|
for title in self.valid_titles:
|
||||||
|
for extension in self.valid_extensions:
|
||||||
|
spec = dict(title=title, extension=extension)
|
||||||
|
filename = template.format(**spec)
|
||||||
|
self._test_guessed_attributes(filename, **spec)
|
||||||
|
|
||||||
|
def test_title_and_correspondent(self):
|
||||||
|
template = '/path/to/{correspondent} - {title}.{extension}'
|
||||||
|
for correspondent in self.valid_correspondents:
|
||||||
|
for title in self.valid_titles:
|
||||||
|
for extension in self.valid_extensions:
|
||||||
|
spec = dict(correspondent=correspondent, title=title,
|
||||||
|
extension=extension)
|
||||||
|
filename = template.format(**spec)
|
||||||
|
self._test_guessed_attributes(filename, **spec)
|
||||||
|
|
||||||
|
def test_title_and_correspondent_and_tags(self):
|
||||||
|
template = '/path/to/{correspondent} - {title} - {tags}.{extension}'
|
||||||
|
for correspondent in self.valid_correspondents:
|
||||||
|
for title in self.valid_titles:
|
||||||
|
for tags in self.valid_tags:
|
||||||
|
for extension in self.valid_extensions:
|
||||||
|
spec = dict(correspondent=correspondent, title=title,
|
||||||
|
tags=tags, extension=extension)
|
||||||
|
filename = template.format(**spec)
|
||||||
|
self._test_guessed_attributes(filename, **spec)
|
||||||
|
|
||||||
|
def test_created_and_correspondent_and_title_and_tags(self):
|
||||||
|
|
||||||
|
template = ("/path/to/{created} - "
|
||||||
|
"{correspondent} - "
|
||||||
|
"{title} - "
|
||||||
|
"{tags}"
|
||||||
|
".{extension}")
|
||||||
|
|
||||||
|
for created in self.valid_dates:
|
||||||
|
for correspondent in self.valid_correspondents:
|
||||||
|
for title in self.valid_titles:
|
||||||
|
for tags in self.valid_tags:
|
||||||
|
for extension in self.valid_extensions:
|
||||||
|
spec = {
|
||||||
|
"created": created,
|
||||||
|
"correspondent": correspondent,
|
||||||
|
"title": title,
|
||||||
|
"tags": tags,
|
||||||
|
"extension": extension
|
||||||
|
}
|
||||||
|
self._test_guessed_attributes(
|
||||||
|
template.format(**spec), **spec)
|
||||||
|
|
||||||
|
def test_created_and_correspondent_and_title(self):
|
||||||
|
|
||||||
|
template = ("/path/to/{created} - "
|
||||||
|
"{correspondent} - "
|
||||||
|
"{title}"
|
||||||
|
".{extension}")
|
||||||
|
|
||||||
|
for created in self.valid_dates:
|
||||||
|
for correspondent in self.valid_correspondents:
|
||||||
|
for title in self.valid_titles:
|
||||||
|
|
||||||
|
# Skip cases where title looks like a tag as we can't
|
||||||
|
# accommodate such cases.
|
||||||
|
if title.lower() == title:
|
||||||
|
continue
|
||||||
|
|
||||||
|
for extension in self.valid_extensions:
|
||||||
|
spec = {
|
||||||
|
"created": created,
|
||||||
|
"correspondent": correspondent,
|
||||||
|
"title": title,
|
||||||
|
"extension": extension
|
||||||
|
}
|
||||||
|
self._test_guessed_attributes(
|
||||||
|
template.format(**spec), **spec)
|
||||||
|
|
||||||
|
def test_created_and_title(self):
|
||||||
|
|
||||||
|
template = ("/path/to/{created} - "
|
||||||
|
"{title}"
|
||||||
|
".{extension}")
|
||||||
|
|
||||||
|
for created in self.valid_dates:
|
||||||
|
for title in self.valid_titles:
|
||||||
|
for extension in self.valid_extensions:
|
||||||
|
spec = {
|
||||||
|
"created": created,
|
||||||
|
"title": title,
|
||||||
|
"extension": extension
|
||||||
|
}
|
||||||
|
self._test_guessed_attributes(
|
||||||
|
template.format(**spec), **spec)
|
||||||
|
|
||||||
|
def test_created_and_title_and_tags(self):
|
||||||
|
|
||||||
|
template = ("/path/to/{created} - "
|
||||||
|
"{title} - "
|
||||||
|
"{tags}"
|
||||||
|
".{extension}")
|
||||||
|
|
||||||
|
for created in self.valid_dates:
|
||||||
|
for title in self.valid_titles:
|
||||||
|
for tags in self.valid_tags:
|
||||||
|
for extension in self.valid_extensions:
|
||||||
|
spec = {
|
||||||
|
"created": created,
|
||||||
|
"title": title,
|
||||||
|
"tags": tags,
|
||||||
|
"extension": extension
|
||||||
|
}
|
||||||
|
self._test_guessed_attributes(
|
||||||
|
template.format(**spec), **spec)
|
||||||
|