Compare commits

114 Commits
0.3.5 ... 0.8.0

Author SHA1 Message Date
Daniel Quinn
4d2b71454d Ignore .virtualenv 2017-09-09 12:22:03 +03:00
Daniel Quinn
5cbb33b02b Add documentation for the new FORCE_SCRIPT_NAME feature 2017-09-09 12:21:31 +03:00
Daniel Quinn
2c55aad6c0 Merge pull request #255 from maphy-psd/master
add FORCE_SCRIPT_NAME to host paperless on a subpath url
2017-09-06 15:56:44 +01:00
Daniel Quinn
1e039dcb32 Bump gunicorn 2017-08-30 00:44:13 +03:00
Daniel Quinn
6ca8da4858 Patch requirements to keep up with Django versions 2017-08-30 00:27:54 +03:00
maphy-psd
82f05e27c3 fix travis ci E501
E501 line too long (85 > 79 characters)
2017-08-20 16:18:39 +02:00
maphy-psd
7a627e4ad8 white spacing and remove var's prefix 2017-08-20 14:29:51 +02:00
maphy-psd
73af9552ec getenv has "None" as default
@MasterofJOKers in PR#255
2017-08-20 14:13:23 +02:00
maphy-psd
e4854f2144 def thumbnail uses FORCE_SCRIPT_NAME
with this edit the thumbnails show up.
2017-08-19 18:37:17 +02:00
maphy-psd
6f5c1ac4e1 add FORCE_SCRIPT_NAME setting 2017-08-19 12:39:25 +02:00
maphy-psd
22acc51284 add PAPERLESS_FORCE_SCRIPT_NAME 2017-08-19 12:38:45 +02:00
Daniel Quinn
a05644fc31 Merge pull request #250 from brightdroid/master
create documents subfolder folder if they do not exist
2017-08-12 14:39:22 +01:00
Christoph Roeder
d1aa54caa9 create documents subfolder folder if they do not exist 2017-07-31 21:35:41 +02:00
Daniel Quinn
e293f70a91 Merge pull request #247 from danielquinn/issue/235
Allow correspondents to be deleted without deleting their documents
2017-07-15 19:41:33 +01:00
Daniel Quinn
347986a2b3 Allow correspondents to be deleted without deleting their documents
Fixes #235
2017-07-15 19:13:10 +01:00
Daniel Quinn
ede274386b Detect .tif files properly
Fixes #232
2017-07-15 19:02:11 +01:00
Daniel Quinn
3e083354cc Merge pull request #246 from kskyten/vb_memory
Add memory to the virtual machine
2017-07-10 15:02:45 +01:00
Kusti Skytén
b2b4f6516a Add memory to the virtual machine
Fixes #244
2017-07-10 16:55:51 +03:00
Daniel Quinn
2ae702c7bb Merge pull request #245 from tooomm/patch-1
README: unify badges (versioneye)
2017-07-09 18:25:28 +01:00
tooomm
b748420a94 unify badges (versioneye)
normal > flat style
2017-07-09 15:17:42 +02:00
Daniel Quinn
8a4546ce0d Merge pull request #242 from MasterofJOKers/setup_collectstatic
Mention "collectstatic" in the docs
2017-06-27 13:07:07 +01:00
MasterofJOKers
167412a003 Mention "collectstatic" in the docs
When using the built-in webserver in debug mode, the static files are
handled automatically. From the Django docs:

	During development, if you use django.contrib.staticfiles, this will
	be done automatically by runserver when DEBUG is set to True (see
	django.contrib.staticfiles.views.serve()).

	This method is grossly inefficient and probably insecure, so it is
	unsuitable for production.

This means, when using a real webserver, it also has to serve the static
files, i.e.  CSS and JavaScript. For that, one needs to run `./manage.py
collectstatic` first.
2017-06-26 17:08:37 +02:00
Daniel Quinn
e8d90b42a1 Merge pull request #240 from ddddavidmartin/timezone_documentation_clarification
Add link to Django documentation for time zone setting in example config.
2017-06-24 09:24:42 +01:00
David Martin
d8c7e9de5f Add link to documentation for time zone setting in example config.
It is not obvious which time zones the option in the config file
accepts. Having a link to the official django documentation makes it
clear.
2017-06-24 12:27:26 +10:00
Daniel Quinn
2ac1b78a2c Move testing ENV vars into pytest.ini 2017-06-19 10:57:30 +01:00
Daniel Quinn
e8e38befb7 Fix test for new email secret 2017-06-19 10:24:23 +01:00
Daniel Quinn
b30629dd60 Remove debugging info 2017-06-19 09:22:26 +01:00
Daniel Quinn
f66d7e1c2d Drop SHARED_SECRET in favour of EMAIL_SECRET
Originally we used SHARED secret both for email and for the API.  That
was a bad idea, and now that we're only using this value for one case,
I've renamed it to reflect its actual use.
2017-06-18 22:08:42 +01:00
Daniel Quinn
8417ac7eeb Merge pull request #237 from danielquinn/fix-http-post
Fix http post
2017-06-13 17:52:48 +01:00
Daniel Quinn
6342225b22 Merge pull request #238 from Strubbl/fix-shellcheck-issues
docker-entrypoint.sh: fix shellcheck issues
2017-06-13 17:51:49 +01:00
Sven Fischer
4460fb7004 docker-entrypoint.sh: fix shellcheck issues
issues found by shellcheck were:

```
$ shellcheck docker-entrypoint.sh

In docker-entrypoint.sh line 10:
    if [[ ${USERMAP_UID} != ${USERMAP_ORIG_UID} || ${USERMAP_GID} != ${USERMAP_ORIG_GID} ]]; then
                            ^-- SC2053: Quote the rhs of != in [[ ]] to prevent glob matching.
                                                                     ^-- SC2053: Quote the rhs of != in [[ ]] to prevent glob matching.

In docker-entrypoint.sh line 12:
        groupmod -g ${USERMAP_GID} paperless
                    ^-- SC2086: Double quote to prevent globbing and word splitting.

In docker-entrypoint.sh line 65:
        if dpkg -s "$pkg" 2>&1 > /dev/null; then
                          ^-- SC2069: The order of the 2>&1 and the redirect matters. The 2>&1 has to be last.

In docker-entrypoint.sh line 69:
        if ! apt-cache show "$pkg" 2>&1 > /dev/null; then
                                   ^-- SC2069: The order of the 2>&1 and the redirect matters. The 2>&1 has to be last.
```
2017-06-12 21:09:59 +02:00
Daniel Quinn
6f635c74fc Fix HTTP POST of documents
After tinkering with this for about 2 hours, I'm reasonably sure this
never worked.  This feature was added by me in haste and poked by the
occasional contributor, and it suffered from neglect.

* Removed the requirement for signature generation in favour of simply
  requiring BasicAuth or a valid session id.
* Fixed a number of bugs in the form itself that would have ensured that
  the form never accepted anything.
* Documented it all properly so now (hopefully) people will have less
  trouble figuring it out in the future.
2017-06-11 01:23:37 +01:00
Daniel Quinn
c82d45689c Remove unused imports & comments 2017-06-11 01:23:08 +01:00
Daniel Quinn
02e0543a02 Merge pull request #233 from lucaskolstad/django_filters_installed_app
Add django_filters to INSTALLED_APPS
2017-05-31 10:39:49 +01:00
Lucas Kolstad
fde0276d65 Add django_filters to INSTALLED_APPS 2017-05-30 15:05:34 -07:00
Daniel Quinn
3d6289e4e1 Preparing for 0.5.0
I hadn't realised that I hadn't released 0.5.0 yet, so I've amended the version numbers
2017-05-27 13:23:25 +01:00
Daniel Quinn
5e55b971a8 Update changelog for 0.5.1 2017-05-27 13:21:04 +01:00
Daniel Quinn
0a43b84a96 Merge pull request #228 from ddddavidmartin/extend_email_handling
Set email inbox in config file, fetch email at consumer startup and bring documentation up to date
2017-05-27 13:07:17 +01:00
Daniel Quinn
dc74cc2db5 Merge pull request #230 from ddddavidmartin/webserver_paperless_titles
Refer to Paperless in Django webserver titles and update Django documentation URLs
2017-05-27 13:00:46 +01:00
Daniel Quinn
fc00a09318 Merge pull request #229 from ddddavidmartin/clarify_systemd_instructions
Copy Paperless service files to systemd directory before enabling them.
2017-05-27 12:59:00 +01:00
Daniel Quinn
19cf9d0b9a Merge pull request #227 from ddddavidmartin/fix_forms_typos
Fix clened_data typos in forms.py.
2017-05-27 12:57:43 +01:00
Daniel Quinn
f81780cf88 Merge pull request #226 from ddddavidmartin/bump_pyocr_requirement_for_tesseract_4_support
Bump pyocr requirement to version 0.4.7 to support tesseract 4.0.0alpha.
2017-05-27 12:56:54 +01:00
David Martin
c3a55c91dc Update version of remaining weblinks to Django documentation.
We are using Django 1.10 as per requirements.txt and should refer to its
documentation as well.
2017-05-27 08:49:03 +10:00
David Martin
482f02fbaa Update link to Django documentation in urls.py.
As per requirements.txt we are using Django version 1.10. It makes sense
to link to the documentation for that version as well.
Also, the documentation for the previous version has a notice on the top
that informs about the version being unsafe which is a bit disconcerting
when seeing it.
2017-05-25 20:22:05 +10:00
David Martin
6bf7429ef6 Refer to Paperless instead of Django in webserver pages.
It looks better to have the page titles refer to Paperless rather than
Django. The same with the login. Setting it in urls.py is based on this
stackoverflow response [0]. The proper documentation for the admin page
is under [1].

[0] https://stackoverflow.com/a/24983231
[1] https://docs.djangoproject.com/en/1.10/ref/contrib/admin/#adminsite-attributes
2017-05-25 20:16:59 +10:00
David Martin
4198de604f Copy Paperless service files to systemd directory before enabling them.
The problem with the original instruction is that systemd creates a
symlink pointing to the service file in the paperless directory. A user
is unlikely to leave the changes in the service files committed
(especially not on a master branch checkout) and they are easily lost and
the services fail to start without obvious reason.

To avoid this we simply copy the service files to the systemd directory
directly and use the files in the repository only as an example.
2017-05-24 22:48:35 +10:00
David Martin
8c06dc2dd1 Mention safe characters for email titles in documentation.
This makes it clear that only a specific set of characters is allowed to
be used for email titles. It is worth mentioning this in the
documentation as it otherwise needs to be figured out from the Paperless
sources [0].

[0] SAFE_REGEX in src/documents/models.py
2017-05-23 11:16:38 +10:00
David Martin
13b4610c1d Clarify consumption documentation to match the current Paperless behaviour.
The configuration does not have to be hardcoded in settings.py anymore,
and instead happens in the config file. Also, we added that the emails
are checked at startup [0].

[0] see commit 3153bbd6a8
2017-05-23 11:15:33 +10:00
David Martin
0090128249 Fix clened_data typos in forms.py.
This is where linters shine. Either pylint or pyflakes discovered these
typos and even suggested the correct name.
2017-05-21 17:05:49 +10:00
David Martin
3153bbd6a8 Fetch emails right at startup instead of waiting for 10 minutes.
Especially when first setting up the configuration for consuming
documents from emails it makes sense to quickly test the changes. Having
to wait for 10 minutes is not acceptable.

There are two ways around it that come to my mind: the simple approach
is to always fetch the emails when Paperless first starts. This way the
fetching of emails can be tested straight away.
The alternative would be to have a configuration option that allows to
set the interval in which emails are checked. The user could then reduce
it to test the setup and increase it again later on. This seems
needlessly complicated though, so fetching at startup it is.
2017-05-21 14:23:46 +10:00
David Martin
7b1812a9be Capitalise Paperless in example config.
This is in line with how it is spelled in the rest of the config file.
2017-05-21 08:44:41 +10:00
David Martin
c647daace2 Connect to configured inbox instead of hardcoded one.
Now the retrieving of emails from the inbox set in the config file works
as expected.
2017-05-21 08:34:49 +10:00
David Martin
70dceb3b37 Allow to configure the email inbox via config file.
Same as all the other parameters it makes sense to set it in the config
file as well.
2017-05-20 16:48:40 +10:00
David Martin
72b1ce5fe6 Bump pyocr requirement to version 0.4.7 to support tesseract 4.0.0alpha.
The latest pyocr version now allows running it with the latest tesseract
version. Hopefully this means better OCR results.

I am not sure about whether there are binary packages for the latest
tesseract. But on my setup it was simply a case of checking out the
master branch [0] and compiling + installing from there. It seems to work
fine with paperless as well.

[0] https://github.com/tesseract-ocr/tesseract
2017-05-14 12:59:32 +10:00
Daniel Quinn
731942d855 add: migration for fuzzy matching 2017-05-11 22:09:30 -07:00
Daniel Quinn
058dad7ba7 Merge branch 'master' of github.com:danielquinn/paperless 2017-05-10 16:14:14 -07:00
Daniel Quinn
fe43e5a717 add: credit for ckut's import/export changes 2017-05-10 16:14:05 -07:00
Daniel Quinn
34bab04310 fix: formatting cleanup 2017-05-10 17:38:00 -07:00
Daniel Quinn
18f7c4f31f Merge pull request #224 from CkuT/exporter_improvements
WIP : Exporter improvements
2017-05-10 16:09:11 -07:00
Daniel Quinn
3477b96d87 Merge pull request #222 from tido-/master
little changes to reflect as much as possible
2017-05-10 15:25:35 -07:00
Tido-
ac850b64aa minor changes on documentation files 2017-05-10 22:25:59 +02:00
CkuT
279e421ad5 PEP8 2017-05-08 15:48:37 +02:00
CkuT
22c8049bed Use relatives paths instead of absolutes paths for document export/import 2017-05-08 15:23:35 +02:00
CkuT
3f1392769d Refactor to get the document time once 2017-05-08 15:02:59 +02:00
CkuT
da71eab0ae Use constants for manifest 2017-05-08 14:54:48 +02:00
CkuT
2e0e6bb8d2 Add thumbnail export 2017-05-06 15:14:36 +02:00
CkuT
1f145c6cba Fix the source file checking 2017-05-06 15:04:47 +02:00
Tido-
c4d48181ee find the error in line break 03 2017-05-04 19:39:58 +02:00
Tido-
0c4ecad4a7 find the error in line break 02 2017-05-04 19:36:55 +02:00
Tido-
d25de5592a find the error in line break 01 2017-05-04 19:35:58 +02:00
Tido-
88fc35d8ea find the error in line break 2017-05-04 19:31:17 +02:00
Tido-
02730be871 found some additional bits to yours 2017-05-03 22:20:13 +02:00
Daniel Quinn
c7876dbbe8 add: credit for #212 2017-05-03 12:01:04 -07:00
Daniel Quinn
85fcb5fedf Merge pull request #212 from Strubbl/docker-prepare-export
Docker: prepare export directory
2017-05-03 09:55:43 -07:00
Tido-
58cbfeb72a little changes to reflect as much as possible 2017-05-02 22:48:37 +02:00
Sven Fischer
b2b6cbe9c8 Docker: review refactoring for export directory preparation 2017-05-02 19:52:36 +02:00
Sven Fischer
4c05a511c2 Docker: review fix: if end-user host-mounts the export directory 2017-05-02 19:06:01 +02:00
Sven Fischer
b5bef13b46 Docker: prepare export directory 2017-05-02 13:01:09 +02:00
Daniel Quinn
bb47dc5e06 fix: spacing and typos 2017-05-01 13:25:07 -07:00
Daniel Quinn
511f154e16 Merge pull request #221 from tido-/master
adding sections, grouped what belongs together
2017-05-01 13:10:23 -07:00
Tido-
10ae2207df adding sections, grouped what belongs together 2017-05-01 21:18:34 +02:00
Daniel Quinn
71df99ffb6 add: note for new fuzzy match support 2017-04-30 19:40:58 -07:00
Daniel Quinn
5eb26102d4 Merge pull request #220 from jgysland/add-fuzzy-matching
fuzzy matching
2017-04-30 19:37:03 -07:00
jgysland
a7fa82a83f KISS fuzzy match help text 2017-04-30 16:56:50 -04:00
jgysland
6ce27d225d add fuzzy matching + tests 2017-04-29 17:13:04 -04:00
Daniel Quinn
819a0e1f57 Merge pull request #219 from Strubbl/remove-duplicate-conf-option
paperless.conf.example: remove duplicate option
2017-04-28 17:38:55 -07:00
Sven Fischer
702a60b7e7 paperless.conf.example: remove duplicate option
This commit removes the duplicated option in this config.
Please see 057d5f149f/paperless.conf.example (L113) compared with 057d5f149f/paperless.conf.example (L122)
2017-04-24 23:43:54 +02:00
Daniel Quinn
057d5f149f Merge pull request #214 from philippeowagner/master
Fixes #213 (MySQL syntax error)
2017-04-19 10:42:50 +01:00
Philippe O. Wagner
d047dafd23 Fixes #213 (MySQL syntax error) 2017-04-19 11:30:12 +02:00
Daniel Quinn
b449a7f6e2 feat: add @eonist's recommendation
Fixes #211
2017-04-08 20:12:59 +01:00
Daniel Quinn
f302874ae8 Merge pull request #207 from danielquinn/fix/travis
fix: travis doesn't like my new tests
2017-03-28 22:26:35 +01:00
Daniel Quinn
6af58203dd fix: travis doesn't like my new tests 2017-03-28 21:23:42 +00:00
Daniel Quinn
fa4924d5ba fix: allow for caps in file name suffixes #206
@schinkelg ran aground of this one and I took the opportunity to add a
test to catch this sort of thing for next time.
2017-03-28 21:14:24 +00:00
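The suffix problem described in the commit above can be illustrated with a small sketch. The pattern and the function name are hypothetical and are not taken from the Paperless source; the point is only that matching the suffix case-insensitively accepts `.PDF` as well as `.pdf`.

```python
import re

# Hypothetical suffix list for illustration; the real set of supported
# file types is defined in the Paperless source, not here.
SUPPORTED_SUFFIX = re.compile(r"\.(pdf|jpe?g|png|gif|tiff?)$", re.IGNORECASE)


def has_supported_suffix(file_name):
    """Return True if file_name ends in a supported suffix, ignoring case."""
    return SUPPORTED_SUFFIX.search(file_name) is not None
```

A test in the spirit of the one mentioned in the commit would call such a function with both lower-case and upper-case suffixes and expect the same answer for each.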
Daniel Quinn
5b88ebf0e7 Merge pull request #203 from danielquinn/feature/reminders
Feature: Reminders
2017-03-25 16:27:28 +00:00
Daniel Quinn
a0edc7d54d chore: update the changelog for reminders 2017-03-25 16:22:04 +00:00
Daniel Quinn
b876a0d0df feat: add the new reminders app 2017-03-25 16:21:46 +00:00
Daniel Quinn
27db4f7e51 refactor: code cleanup
I hate single quotes.
2017-03-25 16:20:59 +00:00
Daniel Quinn
426919fa9f refactor: break document-only stuff into the paperless app
The `SessionOrBasicAuthMixin` and `StandardPagination` classes were
living in the documents app and I needed them in the new `reminders`
app, so this commit breaks them out of `documents` and puts them in the
central `paperless` app instead.
2017-03-25 16:18:34 +00:00
Daniel Quinn
e47c152b81 feat: migration for changes in 0.3.6 2017-03-25 16:01:59 +00:00
Daniel Quinn
b7cb708053 Merge pull request #197 from danielquinn/pluggable-consumers
Pluggable consumers
2017-03-25 15:20:48 +00:00
Daniel Quinn
7611c2b3d5 fix: pep8 + travis & tox env updates 2017-03-25 15:10:51 +00:00
Daniel Quinn
5f964830aa version bump 2017-03-25 15:10:51 +00:00
Daniel Quinn
7ec4f906af feat: make the content field optional 2017-03-25 15:10:25 +00:00
Daniel Quinn
b5f6c06b8b fix: a little cleanup 2017-03-25 15:10:25 +00:00
Daniel Quinn
55e81ca4bb feat: refactor for pluggable consumers
I've broken out the OCR-specific code from the consumers and dumped it
all into its own app, `paperless_tesseract`.  This new app should serve
as a sample of how to create one's own consumer for different file
types.

Documentation for how to do this isn't ready yet, but for the impatient:

* Create a new app
    * containing a `parsers.py` for your parser modelled after
      `paperless_tesseract.parsers.RasterisedDocumentParser`
    * containing a `signals.py` with a handler modelled after
      `paperless_tesseract.signals.ConsumerDeclaration`
    * connect the signal handler to
      `documents.signals.document_consumer_declaration` in
      `your_app.apps`
* Install the app into Paperless by declaring
  `PAPERLESS_INSTALLED_APPS=your_app`.  Additional apps should be
  separated with commas.
* Restart the consumer
2017-03-25 15:10:25 +00:00
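The `PAPERLESS_INSTALLED_APPS` step in the commit message above could be read roughly as follows. This is a minimal sketch, not the project's actual settings code: the helper name `extra_apps_from_env` and the shortened base app list are hypothetical, and the only behaviour taken from the message is that additional apps are given as a comma-separated value.

```python
import os


def extra_apps_from_env(environ=None, name="PAPERLESS_INSTALLED_APPS"):
    """Return the app names listed in a comma-separated environment variable.

    Hypothetical helper for illustration; Paperless's own settings module
    may parse the value differently.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(name, "")
    # Split on commas, trim surrounding whitespace, and drop empty entries
    # so that an unset or empty variable yields no extra apps.
    return [app.strip() for app in raw.split(",") if app.strip()]


# Assumed base list, shortened for the example.
INSTALLED_APPS = ["documents", "paperless_tesseract"]
INSTALLED_APPS += extra_apps_from_env(
    {"PAPERLESS_INSTALLED_APPS": "your_app,another_app"}
)
```

Trimming whitespace and skipping empty entries is a choice made for this sketch, so that a value such as `your_app, another_app` still yields two clean app names.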
Daniel Quinn
0f7bfc547a Merge pull request #202 from danielquinn/fix/api-should-allow-writes
Fix/api should allow writes
2017-03-25 15:08:40 +00:00
Daniel Quinn
9525725c28 chore: update the changelog 2017-03-25 15:07:58 +00:00
Daniel Quinn
2a2196fa4d fix: #200 allow edits of correspondent & tags 2017-03-25 15:01:01 +00:00
Daniel Quinn
237efbcaa0 Merge branch 'master' of github.com:danielquinn/paperless 2017-03-05 12:15:22 +00:00
Daniel Quinn
351cd06ef7 Disable adding through the admin 2017-03-05 12:15:18 +00:00
Daniel Quinn
8b37160953 Merge pull request #194 from philippeowagner/master
Better alt-text for thumbnails.
2017-03-01 09:12:10 +00:00
Philippe O. Wagner
db64478d9f Better alt-text for thumbnails. 2017-03-01 00:50:53 +01:00
Daniel Quinn
8bc2dfe4c6 Django migrations doesn't account for PostgreSQL completely
This was a weird bug to run into.  Basically I changed a CharField into
a ForeignKey field and ran `makemigrations` to get the job done.
However, rather than doing a `RemoveField` and an `AddField`, migrations
created a single `AlterField` which worked just fine in SQLite, but blew
up in PostgreSQL with:

    psycopg2.ProgrammingError: operator class "varchar_pattern_ops" does
    not accept data type integer

The fix was to rewrite the single migration into the two separate steps.
2017-02-18 17:55:52 +00:00
Daniel Quinn
3a427c9130 Allow for MariaDB/MySQL
MariaDB/MySQL doesn't handle indexes on TextFields well and for some
reason, Django's migrations opts to blow up rather than handle this in a
more user-friendly way.  The fix here isn't ideal, but should be
sufficient should anyone try to use Paperless with MySQL.
2017-02-18 17:53:43 +00:00
71 changed files with 1460 additions and 663 deletions

1
.gitignore vendored

@@ -68,6 +68,7 @@ db.sqlite3
.idea
# Other stuff that doesn't belong
.virtualenv
virtualenv
.vagrant
docker-compose.yml

@@ -8,7 +8,9 @@ matrix:
env: TOXENV=py34
- python: 3.5
env: TOXENV=py35
- python: 3.5
- python: 3.6
env: TOXENV=py36
- python: 3.6
env: TOXENV=pep8
install:

@@ -35,12 +35,16 @@ RUN groupadd -g 1000 paperless \
&& useradd -u 1000 -g 1000 -d /usr/src/paperless paperless \
&& chown -Rh paperless:paperless /usr/src/paperless
# Set export directory
ENV PAPERLESS_EXPORT_DIR /export
RUN mkdir -p $PAPERLESS_EXPORT_DIR
# Setup entrypoint
COPY scripts/docker-entrypoint.sh /sbin/docker-entrypoint.sh
RUN chmod 755 /sbin/docker-entrypoint.sh
# Mount volumes
VOLUME ["/usr/src/paperless/data", "/usr/src/paperless/media", "/consume"]
VOLUME ["/usr/src/paperless/data", "/usr/src/paperless/media", "/consume", "/export"]
ENTRYPOINT ["/sbin/docker-entrypoint.sh"]
CMD ["--help"]

@@ -6,7 +6,7 @@ Paperless
|Travis|
|Dependencies|
Scan, index, and archive all of your paper documents
Index and archive all of your scanned paper documents
I hate paper. Environmental issues aside, it's a tech person's nightmare:
@@ -23,13 +23,17 @@ it... because paper. I wrote this to make my life easier.
How it Works
============
1. Buy a document scanner like `this one`_.
Paperless does not control your scanner, it only helps you deal with what your
scanner produces
1. Buy a document scanner like `this one`_ (used by me) or `this other one`_
recommended by another user.
2. Set it up to "scan to FTP" or something similar. It should be able to push
scanned images to a server without you having to do anything. If your
scanner doesn't know how to automatically upload the file somewhere, you can
always do that manually. Paperless doesn't care how the documents get into
its local consumption directory.
3. Have the target server run the Paperless consumption script to OCR the PDF
3. Have the target server run the Paperless consumption script to OCR the file
and index it into a local database.
4. Use the web frontend to sift through the database and find what you want.
5. Download the PDF you need/want via the web interface and do whatever you
@@ -47,16 +51,15 @@ Stability
=========
Paperless is still under active development (just look at the git commit
history) so don't expect it to be 100% stable. I'm using it for my own
documents, but I'm crazy like that. If you use this and it breaks something,
you get to keep all the shiny pieces.
history) so don't expect it to be 100% stable. You can backup the sqlite3
database, media directory and your configuration file to be on the safe side.
Requirements
============
This is all really a quite simple, shiny, user-friendly wrapper around some very
powerful tools.
This is all really a quite simple, shiny, user-friendly wrapper around some
very powerful tools.
* `ImageMagick`_ converts the images between colour and greyscale.
* `Tesseract`_ does the character recognition.
@@ -82,22 +85,22 @@ Similar Projects
There's another project out there called `Mayan EDMS`_ that has a surprising
amount of technical overlap with Paperless. Also based on Django and using
a consumer model with Tesseract and unpaper, Mayan EDMS is *much* more
featureful and comes with a slick UI as well. It may be that Paperless is
better suited for low-resource environments (like a Rasberry Pi), but to be
honest, this is just a guess as I haven't tested this myself. One thing's
for certain though, *Paperless* is a **much** better name.
a consumer model with Tesseract and Unpaper, Mayan EDMS is *much* more
featureful and comes with a slick UI as well, but still in Python 2. It may be
that Paperless consumes fewer resources, but to be honest, this is just a guess
as I haven't tested this myself. One thing's for certain though, *Paperless*
is a **much** better name.
Important Note
==============
Document scanners are typically used to scan sensitive documents. Things like
your social insurance number, tax records, invoices, etc. While paperless
encrypts the original PDFs via the consumption script, the OCR'd text is *not*
your social insurance number, tax records, invoices, etc. While Paperless
encrypts the original files via the consumption script, the OCR'd text is *not*
encrypted and is therefore stored in the clear (it needs to be searchable, so
if someone has ideas on how to do that on encrypted data, I'm all ears). This
means that paperless should never be run on an untrusted host. Instead, I
means that Paperless should never be run on an untrusted host. Instead, I
recommend that if you do want to use it, run it locally on a server in your own
home.
@@ -116,6 +119,7 @@ the `United Nations High Commissioner for Refugees`_. They're doing important
work and they need the money a lot more than I do.
.. _this one: http://www.brother.ca/en-CA/Scanners/11/ProductDetail/ADS1500W?ProductDetail=productdetail
.. _this other one: http://www.fujitsu.com/us/products/computing/peripheral/scanners/scansnap/ix500/
.. _ImageMagick: http://imagemagick.org/
.. _Tesseract: https://github.com/tesseract-ocr
.. _Unpaper: https://www.flameeyes.eu/projects/unpaper
@@ -136,5 +140,5 @@ work and they need the money a lot more than I do.
:target: https://gitter.im/danielquinn/paperless?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge
.. |Travis| image:: https://travis-ci.org/danielquinn/paperless.svg?branch=master
:target: https://travis-ci.org/danielquinn/paperless
.. |Dependencies| image:: https://www.versioneye.com/user/projects/57b33b81d9f1b00016faa500/badge.svg?style=flat-square
.. |Dependencies| image:: https://www.versioneye.com/user/projects/57b33b81d9f1b00016faa500/badge.svg
:target: https://www.versioneye.com/user/projects/57b33b81d9f1b00016faa500

5
Vagrantfile vendored

@@ -12,4 +12,9 @@ Vagrant.configure(VAGRANT_API_VERSION) do |config|
# Networking details
config.vm.network "private_network", ip: "172.28.128.4"
config.vm.provider "virtualbox" do |vb|
# Customize the amount of memory on the VM:
vb.memory = "1024"
end
end

@@ -1,6 +1,67 @@
Changelog
#########
* 0.7.0
* **Potentially breaking change**: As per `#235`_, Paperless will no longer
automatically delete documents attached to correspondents when those
correspondents are themselves deleted. This was Django's default
behaviour, but didn't make much sense in Paperless' case. Thanks to
`Thomas Brueggemann`_ and `David Martin`_ for their input on this one.
* Fix for `#232`_ wherein Paperless wasn't recognising ``.tif`` files
properly. Thanks to `ayounggun`_ for reporting this one and to
`Kusti Skytén`_ for posting the correct solution in the Github issue.
* 0.6.0
* Abandon the shared-secret trick we were using for the POST API in favour
of BasicAuth or Django session.
* Fix the POST API so it actually works. `#236`_
* **Breaking change**: We've dropped the use of ``PAPERLESS_SHARED_SECRET``
as it was being used both for the API (now replaced with a normal auth)
and form email polling. Now that we're only using it for email, this
variable has been renamed to ``PAPERLESS_EMAIL_SECRET``. The old value
will still work for a while, but you should change your config if you've
been using the email polling feature. Thanks to `Joshua Gilman`_ for all
the help with this feature.
* 0.5.0
* Support for fuzzy matching in the auto-tagger & auto-correspondent systems
thanks to `Jake Gysland`_'s patch `#220`_.
* Modified the Dockerfile to prepare an export directory (`#212`_). Thanks
to combined efforts from `Pit`_ and `Strubbl`_ in working out the kinks on
this one.
* Updated the import/export scripts to include support for thumbnails. Big
thanks to `CkuT`_ for finding this shortcoming and doing the work to get
it fixed in `#224`_.
* All of the following changes are thanks to `David Martin`_:
* Bumped the dependency on pyocr to 0.4.7 so new users can make use of
Tesseract 4 if they so prefer (`#226`_).
* Fixed a number of issues with the automated mail handler (`#227`_, `#228`_)
* Amended the documentation for better handling of systemd service files (`#229`_)
* Amended the Django Admin configuration to have nice headers (`#230`_)
* 0.4.1
* Fix for `#206`_ wherein the pluggable parser didn't recognise files with
all-caps suffixes like ``.PDF``
* 0.4.0
* Introducing reminders. See `#199`_ for more information, but the short
explanation is that you can now attach simple notes & times to documents
which are made available via the API. Currently, the default API
(basically just the Django admin) doesn't really make use of this, but
`Thomas Brueggemann`_ over at `Paperless Desktop`_ has said that he would
like to make use of this feature in his project.
* 0.3.6
* Fix for `#200`_ (!!) where the API wasn't configured to allow updating the
correspondent or the tags for a document.
* The ``content`` field is now optional, to allow for the edge case of a
purely graphical document.
* You can no longer add documents via the admin. This never worked in the
first place, so all I've done here is remove the link to the broken form.
* The consumer code has been heavily refactored to support a pluggable
interface. Install a paperless consumer via pip and tell paperless about
it with an environment variable, and you're good to go. Proper
documentation is on its way.
* 0.3.5
* A serious facelift for the documents listing page wherein we drop the
tabular layout in favour of a tiled interface.
@@ -161,6 +222,15 @@ Changelog
.. _Tim White: https://github.com/timwhite
.. _Florian Harr: https://github.com/evils
.. _Justin Snyman: https://github.com/stringlytyped
.. _Thomas Brueggemann: https://github.com/thomasbrueggemann
.. _Jake Gysland: https://github.com/jgysland
.. _Strubbl: https://github.com/strubbl
.. _CkuT: https://github.com/CkuT
.. _David Martin: https://github.com/ddddavidmartin
.. _Paperless Desktop: https://github.com/thomasbrueggemann/paperless-desktop
.. _Joshua Gilman: https://github.com/jmgilman
.. _ayounggun: https://github.com/ayounggun
.. _Kusti Skytén: https://github.com/kskyten
.. _#20: https://github.com/danielquinn/paperless/issues/20
.. _#44: https://github.com/danielquinn/paperless/issues/44
@@ -187,3 +257,17 @@ Changelog
.. _#171: https://github.com/danielquinn/paperless/issues/171
.. _#172: https://github.com/danielquinn/paperless/issues/172
.. _#179: https://github.com/danielquinn/paperless/pull/179
.. _#199: https://github.com/danielquinn/paperless/issues/199
.. _#200: https://github.com/danielquinn/paperless/issues/200
.. _#206: https://github.com/danielquinn/paperless/issues/206
.. _#212: https://github.com/danielquinn/paperless/pull/212
.. _#220: https://github.com/danielquinn/paperless/pull/220
.. _#224: https://github.com/danielquinn/paperless/pull/224
.. _#226: https://github.com/danielquinn/paperless/pull/226
.. _#227: https://github.com/danielquinn/paperless/pull/227
.. _#228: https://github.com/danielquinn/paperless/pull/228
.. _#229: https://github.com/danielquinn/paperless/pull/229
.. _#230: https://github.com/danielquinn/paperless/pull/230
.. _#232: https://github.com/danielquinn/paperless/issues/232
.. _#235: https://github.com/danielquinn/paperless/issues/235
.. _#236: https://github.com/danielquinn/paperless/issues/236

@@ -121,18 +121,21 @@ So, with all that in mind, here's what you do to get it running:
1. Setup a new email account somewhere, or if you're feeling daring, create a
folder in an existing email box and note the path to that folder.
2. In ``settings.py`` set all of the appropriate values in ``MAIL_CONSUMPTION``.
2. In ``/etc/paperless.conf`` set all of the appropriate values in
``PATHS AND FOLDERS`` and ``SECURITY``.
If you decided to use a subfolder of an existing account, then make sure you
set ``INBOX`` accordingly here. You also have to set the
``UPLOAD_SHARED_SECRET`` to something you can remember 'cause you'll have to
include that in every email you send.
set ``PAPERLESS_CONSUME_MAIL_INBOX`` accordingly here. You also have to set
the ``PAPERLESS_EMAIL_SECRET`` to something you can remember 'cause you'll
have to include that in every email you send.
3. Restart the :ref:`consumer <utilities-consumer>`. The consumer will check
the configured email account every 10 minutes for something new and pull down
whatever it finds.
the configured email account at startup and from then on every 10 minutes
for something new and pull down whatever it finds.
4. Send yourself an email! Note that the subject is treated as the file name,
so if you set the subject to ``Correspondent - Title - tag,tag,tag``, you'll
get what you expect. Also, you must include the aforementioned secret
string in every email so the fetcher knows that it's safe to import.
Note that Paperless only imports emails whose subject consists entirely of
safe characters: alphanumeric characters and ``-_ ,.'``.
5. After a few minutes, the consumer will poll your mailbox, pull down the
message, and place the attachment in the consumption directory with the
appropriate name. A few minutes later, the consumer will import it like any
@@ -144,46 +147,83 @@ So, with all that in mind, here's what you do to get it running:
HTTP POST
=========
You can also submit a document via HTTP POST. It doesn't do tags yet, and the
URL schema isn't concrete, but it's a start.
To push your document to Paperless, send an HTTP POST to the server with the
following name/value pairs:
You can also submit a document via HTTP POST, so long as you do so after
authenticating. To push your document to Paperless, send an HTTP POST to the
server with the following name/value pairs:
* ``correspondent``: The name of the document's correspondent. Note that there
are restrictions on what characters you can use here. Specifically,
alphanumeric characters, `-`, `,`, `.`, and `'` are ok, everything else it
alphanumeric characters, `-`, `,`, `.`, and `'` are ok, everything else is
out. You also can't use the sequence ` - ` (space, dash, space).
* ``title``: The title of the document. The rules for characters are the same
here as for the correspondent.
* ``signature``: For security reasons, we have the correspondent send a
signature using a "shared secret" method to make sure that random strangers
don't start uploading stuff to your server. The means of generating this
signature is defined below.
* ``document``: The file you're uploading
Specify ``enctype="multipart/form-data"``, and then POST your file with::
Content-Disposition: form-data; name="document"; filename="whatever.pdf"
An example of this in HTML is a typical form:
.. _consumption-http-signature:
.. code:: html
Generating the Signature
------------------------
<form method="post" enctype="multipart/form-data">
<input type="text" name="correspondent" value="My Correspondent" />
<input type="text" name="title" value="My Title" />
<input type="file" name="document" />
<input type="submit" name="go" value="Do the thing" />
</form>
Generating a signature based a shared secret is pretty simple: define a secret,
and store it on the server and the client. Then use that secret, along with
the text you want to verify to generate a string that you can use for
verification.
In the case of Paperless, you configure the server with the secret by setting
``UPLOAD_SHARED_SECRET``. Then on your client, you generate your signature by
concatenating the correspondent, title, and the secret, and then using sha256
to generate a hexdigest.
If you're using Python, this is what that looks like:
But a potentially more useful way to do this would be in Python. Here we use
the requests library to handle basic authentication and to send the POST data
to the URL.
.. code:: python
import os
from hashlib import sha256
signature = sha256(correspondent + title + secret).hexdigest()
import requests
from requests.auth import HTTPBasicAuth
# You authenticate via BasicAuth or with a session id.
# We use BasicAuth here
username = "my-username"
password = "my-super-secret-password"
# Where you have Paperless installed and listening
url = "http://localhost:8000/push"
# Document metadata
correspondent = "Test Correspondent"
title = "Test Title"
# The local file you want to push
path = "/path/to/some/directory/my-document.pdf"
with open(path, "rb") as f:
    response = requests.post(
        url=url,
        data={"title": title, "correspondent": correspondent},
        files={"document": (os.path.basename(path), f, "application/pdf")},
        auth=HTTPBasicAuth(username, password),
        allow_redirects=False
    )
    if response.status_code == 202:
        # Everything worked out ok
        print("Upload successful")
    else:
        # If you don't get a 202, it's probably because your credentials
        # are wrong or something. This will give you a rough idea of what
        # happened.
        print("We got HTTP status code: {}".format(response.status_code))
        for k, v in response.headers.items():
            print("{}: {}".format(k, v))
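
Before posting, a client could check the ``correspondent`` and ``title`` values against the character rules listed above. The following is only a sketch: ``is_safe_field`` is a hypothetical helper, and the pattern assumes ASCII letters, digits, spaces, and the punctuation named above are the allowed set. The server's own check may differ.

```python
import re

# Assumption: ASCII letters, digits, spaces and - , . ' are allowed.
# This mirrors the rules described in the text, not the server's actual code.
SAFE_FIELD = re.compile(r"[A-Za-z0-9 ,.'-]+")


def is_safe_field(value):
    """Return True if value looks acceptable as a correspondent or title."""
    if " - " in value:
        # The "space, dash, space" sequence is not allowed.
        return False
    return SAFE_FIELD.fullmatch(value) is not None


print(is_safe_field("Test Correspondent"))
print(is_safe_field("Acme - Invoices"))
```

Checking on the client avoids a round trip for values the server would reject anyway.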


@@ -3,7 +3,11 @@
Paperless
=========
Scan, index, and archive all of your paper documents. Say goodbye to paper.
Paperless is a simple Django application running in two parts:
a :ref:`consumer <utilities-consumer>` (the thing that does the indexing) and
the :ref:`webserver <utilities-webserver>` (the part that lets you search & download
already-indexed documents). If you want to learn more about its functions keep on
reading after the installation section.
.. _index-why-this-exists:
@@ -15,10 +19,11 @@ Paper is a nightmare. Environmental issues aside, there's no excuse for it in
the 21st century. It takes up space, collects dust, doesn't support any form of
a search feature, indexing is tedious, it's heavy and prone to damage & loss.
I wrote this to make "going paperless" easier. I wanted to be able to feed
documents right from the post box into the scanner and then shred them so I
never have to worry about finding stuff again. Perhaps you might find it useful
too.
I wrote this to make "going paperless" easier. I do not have to worry about
finding stuff again. I feed documents right from the post box into the scanner and
then shred them. Perhaps you might find it useful too.
Contents


@@ -4,7 +4,7 @@ Requirements
============
You need a Linux machine or Unix-like setup (theoretically an Apple machine
should work) that has the following software installed on it:
should work) that has the following software installed:
* `Python3`_ (with development libraries, pip and virtualenv)
* `GNU Privacy Guard`_
@@ -21,14 +21,14 @@ should work) that has the following software installed on it:
Notably, you should confirm how you access your Python3 installation. Many
Linux distributions will install Python3 in parallel to Python2, using the names
``python3`` and ``python`` respectively. The same goes for ``pip3`` and
``pip``. Using Python2 will likely break things, so make sure that you're using
the right version.
``pip``. Running Paperless with Python2 will likely break things, so make sure that
you're using the right version.
For the purposes of simplicity, ``python`` and ``pip`` are used everywhere to
refer to their Python 3 versions.
refer to their Python3 versions.
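
To confirm which interpreter the ``python`` command actually runs on your system, a short check like the one below can help. It is a minimal sketch, and ``is_python3`` is a hypothetical helper rather than anything shipped with Paperless.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the given interpreter version is Python 3 or later."""
    return version_info[0] >= 3


if not is_python3():
    # Exit with a hint rather than failing later in a less obvious way.
    raise SystemExit("Paperless needs Python3; try the python3/pip3 commands")

print("Running Python {}.{}".format(*sys.version_info[:2]))
```

Save it to a file and run it with the same command you intend to use for ``manage.py``.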
In addition to the above, there are a number of Python requirements, all of
which are listed in a file called ``requirements.txt`` in the project root.
which are listed in a file called ``requirements.txt`` in the project root directory.
If you're not working on a virtual environment (like Vagrant or Docker), you
should probably be using a virtualenv, but that's your call. The reasons why
@@ -67,7 +67,7 @@ dependencies is easy:
$ pip install --user --requirement /path/to/paperless/requirements.txt
This should download and install all of the requirements into
This will download and install all of the requirements into
``${HOME}/.local``. Remember that your distribution may be using ``pip3`` as
mentioned above.
@@ -86,8 +86,8 @@ enter it, and install the requirements using the ``requirements.txt`` file:
$ . /path/to/arbitrary/directory/bin/activate
$ pip install --requirement /path/to/paperless/requirements.txt
Now you're ready to go. Just remember to enter your virtualenv whenever you
want to use Paperless.
Now you're ready to go. Just remember to enter (activate) your virtualenv
whenever you want to use Paperless.
.. _requirements-documentation:
@@ -95,7 +95,7 @@ want to use Paperless.
Documentation
-------------
As generation of the documentation is not required for use of Paperless,
As generation of the documentation is not required for the use of Paperless,
dependencies for this process are not included in ``requirements.txt``. If
you'd like to generate your own docs locally, you'll need to:


@@ -4,7 +4,7 @@ Setup
=====
Paperless isn't a very complicated app, but there are a few components, so some
basic documentation is in order. If you go follow along in this document and
basic documentation is in order. If you follow along in this document and
still have trouble, please open an `issue on GitHub`_ so I can fill in the
gaps.
@@ -28,6 +28,7 @@ or just download the tarball and go that route:
.. code:: bash
$ cd to the directory where you want to run Paperless
$ wget https://github.com/danielquinn/paperless/archive/master.zip
$ unzip master.zip
$ cd paperless-master
@@ -43,7 +44,9 @@ route`_ is quick & easy, but means you're running a VM which comes with memory
consumption etc. We also `support Docker`_, which you can use natively under
Linux and in a VM with `Docker Machine`_ (this guide was written for native
Docker usage under Linux, you might have to adapt it for Docker Machine.)
Alternatively the standard, `bare metal`_ approach is a little more
There is also the virtualenv route, which is similar to `bare metal`_ except
that you have to activate the virtualenv first.
Last but not least, the standard `bare metal`_ approach is a little more
complicated, but worth it because it makes it easier should you want to
contribute some code back.
@@ -59,9 +62,11 @@ Standard (Bare Metal)
.....................
1. Install the requirements as per the :ref:`requirements <requirements>` page.
2. Change to the ``src`` directory in this repo.
3. Copy ``paperless.conf.example`` to ``/etc/paperless.conf`` and open it in
your favourite editor. Set the values for:
2. Within the extracted contents of ``master.zip``, go to the ``src`` directory.
3. Copy ``paperless.conf.example`` to ``/etc/paperless.conf`` (a virtualenv
install looks for it there too) and open it in your favourite editor.
Because this file contains passwords, it should only be readable by the
``root`` and ``paperless`` users. Set the values for:
* ``PAPERLESS_CONSUMPTION_DIR``: this is where your documents will be
dumped to be consumed by Paperless.
@@ -70,18 +75,19 @@ Standard (Bare Metal)
* ``PAPERLESS_OCR_THREADS``: this is the number of threads the OCR process
will spawn to process document pages in parallel.
4. Initialise the database with ``./manage.py migrate``.
4. Initialise the SQLite database with ``./manage.py migrate``.
5. Create a user for your Paperless instance with
``./manage.py createsuperuser``. Follow the prompts to create your user.
6. Start the webserver with ``./manage.py runserver <IP>:<PORT>``.
If no specifc IP or port are given, the default is ``127.0.0.1:8000``.
You should now be able to visit your (empty) `Paperless webserver`_ at
``127.0.0.1:8000`` (or whatever you chose). You can login with the
user/pass you created in #5.
If no specific IP or port is given, the default is ``127.0.0.1:8000``,
also known as http://localhost:8000/.
You should now be able to visit your (empty) `Paperless webserver`_ at that
address, or at whatever you chose before. You can log in with the user/pass
you created in #5.
7. In a separate window, change to the ``src`` directory in this repo again,
but this time, you should start the consumer script with
``./manage.py document_consumer``.
8. Scan something. Put it in the ``CONSUMPTION_DIR``.
8. Scan something or put a file into the ``CONSUMPTION_DIR``.
9. Wait a few minutes
10. Visit the document list on your webserver, and it should be there, indexed
and downloadable.
@@ -299,17 +305,21 @@ Standard (Bare Metal, Systemd)
If you're running on a bare metal system that's using Systemd, you can use the
service unit files in the ``scripts`` directory to set this up. You'll need to
create a user called ``paperless`` and setup Paperless to be in a place that
this new user can read and write to. Be sure to edit the service scripts to point
to the proper location of your paperless install, referencing the appropriate Python
binary. For example: ``ExecStart=/path/to/python3 /path/to/paperless/src/manage.py document_consumer``.
If you don't want to make a new user, you can change the ``Group`` and ``User`` variables
accordingly.
create a user called ``paperless`` (without login, if you have not already done so in #5)
and setup Paperless to be in a place that this new user can read and write to.
Be sure to edit the service scripts to point to the proper location of your
paperless install, referencing the appropriate Python binary. For example:
``ExecStart=/path/to/python3 /path/to/paperless/src/manage.py document_consumer``.
If you don't want to make a new user, you can change the ``Group`` and ``User``
variables accordingly.
Then, you can just tell Systemd as ``root`` (or using ``sudo``) to enable the two ``.service`` files::
Then, as ``root`` (or using ``sudo``) you can just copy the ``.service`` files
to the Systemd directory and tell it to enable the two services::
# systemctl enable /path/to/paperless/scripts/paperless-consumer.service
# systemctl enable /path/to/paperless/scripts/paperless-webserver.service
# cp /path/to/paperless/scripts/paperless-consumer.service /etc/systemd/system/
# cp /path/to/paperless/scripts/paperless-webserver.service /etc/systemd/system/
# systemctl enable paperless-consumer
# systemctl enable paperless-webserver
# systemctl start paperless-consumer
# systemctl start paperless-webserver
@@ -344,7 +354,7 @@ after restarting your system:
If you are using a network interface other than ``eth0``, you will have to
change ``IFACE=eth0``. For example, if you are connected via WiFi, you will
likely need to replace ``eth0`` above with ``wlan0``. To see all interfaces,
run ``ifconfig``.
run ``ifconfig -a``.
Save the file.
@@ -384,7 +394,10 @@ Using a Real Webserver
The default is to use Django's development server, as that's easy and does the
job well enough on a home network. However, if you want to do things right,
it's probably a good idea to use a webserver capable of handling more than one
thread.
thread. You will also have to let the webserver serve the static files (CSS,
JavaScript) from the directory configured in ``PAPERLESS_STATICDIR``. For that,
you need to run ``./manage.py collectstatic`` in the ``src`` directory. The
default static files directory is ``../static``.
Apache
~~~~~~
@@ -562,3 +575,28 @@ If you're using Docker, you can set a restart-policy_ in the
Docker daemon.
.. _restart-policy: https://docs.docker.com/engine/reference/commandline/run/#restart-policies-restart
.. _setup-subdirectory:
Hosting Paperless in a Subdirectory
-----------------------------------
Paperless was designed to run off the root of the hosting domain,
(ie: ``https://example.com/``) but with a few changes, you can configure
it to run in a subdirectory on your server
(ie: ``https://example.com/paperless/``).
Thanks to the efforts of `maphy-psd`_ on `Github`_, running Paperless in a
subdirectory is now as easy as setting a config variable. Simply set
``PAPERLESS_FORCE_SCRIPT_NAME`` in your environment or
``/etc/paperless.conf`` to the path you want Paperless hosted at, configure
Nginx/Apache for your needs and you're done. So, if you want Paperless to live
at ``https://example.com/arbitrary/path/to/paperless`` then you just set
``PAPERLESS_FORCE_SCRIPT_NAME`` to ``/arbitrary/path/to/paperless``. Note the
leading ``/`` there.
As to how to configure Nginx or Apache for this, that's on you :-)
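
The leading-slash rule is easy to get wrong, so a quick sanity check of the value can be worthwhile. The sketch below is illustrative only: ``check_script_name`` and ``thumbnail_url`` are hypothetical helpers, the "no trailing slash" rule is taken from the comment in ``paperless.conf.example``, and the ``/fetch/thumb/<id>`` path follows the admin thumbnail change in this same changeset.

```python
def check_script_name(value):
    """Validate a candidate PAPERLESS_FORCE_SCRIPT_NAME value."""
    if not value.startswith("/"):
        raise ValueError("expected a leading '/': {!r}".format(value))
    if value.endswith("/"):
        raise ValueError("expected no trailing '/': {!r}".format(value))
    return value


def thumbnail_url(script_name, document_id):
    """Prefix the thumbnail path with the script name, if one is set."""
    prefix = script_name or ""
    return "{}/fetch/thumb/{}".format(prefix, document_id)


prefix = check_script_name("/arbitrary/path/to/paperless")
print(thumbnail_url(prefix, 42))
print(thumbnail_url(None, 42))
```

With the prefix set, every generated link starts with the subdirectory; without it, links stay rooted at ``/``.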
.. _maphy-psd: https://github.com/maphy-psd
.. _Github: https://github.com/danielquinn/paperless/pull/255


@@ -1,11 +1,34 @@
# Sample paperless.conf
# Copy this file to /etc/paperless.conf and modify it to suit your needs.
# As this file contains passwords it should only be readable by the user
# running paperless.
###############################################################################
#### Paths & Folders ####
###############################################################################
# This is where your documents should go to be consumed. Make sure that it exists
# and that the user running the paperless service can read/write its contents
# before you start Paperless.
PAPERLESS_CONSUMPTION_DIR=""
# You can specify where you want the SQLite database to be stored instead of
# the default location of /data/ within the install directory.
#PAPERLESS_DBDIR=/path/to/database/file
# Override the default MEDIA_ROOT here. This is where all files are stored.
# The default location is /media/documents/ within the install folder.
#PAPERLESS_MEDIADIR=/path/to/media
# Override the default STATIC_ROOT here. This is where all static files
# created using "collectstatic" manager command are stored.
#PAPERLESS_STATICDIR=""
# These values are required if you want paperless to check a particular email
# box every 10 minutes and attempt to consume documents from there. If you
# don't define a HOST, mail checking will just be disabled.
@@ -14,6 +37,19 @@ PAPERLESS_CONSUME_MAIL_PORT=""
PAPERLESS_CONSUME_MAIL_USER=""
PAPERLESS_CONSUME_MAIL_PASS=""
# Override the default IMAP inbox here. If not set Paperless defaults to
# "INBOX".
#PAPERLESS_CONSUME_MAIL_INBOX="INBOX"
# Any email sent to the target account that does not contain this text will be
# ignored.
PAPERLESS_EMAIL_SECRET=""
###############################################################################
#### Security ####
###############################################################################
# You must have a passphrase in order for Paperless to work at all. If you set
# this to "", GNUGPG will "encrypt" your PDF by writing it out as a zero-byte
# file.
@@ -28,75 +64,13 @@ PAPERLESS_CONSUME_MAIL_PASS=""
# you've since changed it to a new one.
PAPERLESS_PASSPHRASE="secret"
# If you intend to consume documents either via HTTP POST or by email, you must
# have a shared secret here.
PAPERLESS_SHARED_SECRET=""
# After a document is consumed, Paperless can trigger an arbitrary script if
# you like. This script will be passed a number of arguments for you to work
# with. The default is blank, which means nothing will be executed. For more
# information, take a look at the docs: http://paperless.readthedocs.org/en/latest/consumption.html#hooking-into-the-consumption-process
#PAPERLESS_POST_CONSUME_SCRIPT="/path/to/an/arbitrary/script.sh"
# The secret key has a default that should be fine so long as you're hosting
# Paperless on a closed network. However, if you're putting this anywhere
# public, you should change the key to something unique and verbose.
#PAPERLESS_SECRET_KEY="change-me"
#
# The following values use sensible defaults for modern systems, but if you're
# running Paperless on a low-resource machine (like a Raspberry Pi), modifying
# some of these values may be necessary.
#
# By default, Paperless will attempt to use all available CPU cores to process
# a document, but if you would like to limit that, you can set this value to
# an integer:
#PAPERLESS_OCR_THREADS=1
# On smaller systems, or even in the case of Very Large Documents, the consumer
# may explode, complaining about how it's "unable to extent pixel cache". In
# such cases, try setting this to a reasonably low value, like 32000000. The
# default is to use whatever is necessary to do everything without writing to
# disk, and units are in megabytes.
#
# For more information on how to use this value, you should probably search
# the web for "MAGICK_MEMORY_LIMIT".
#PAPERLESS_CONVERT_MEMORY_LIMIT=0
# By default the conversion density setting for documents is 300DPI, in some
# cases it has proven useful to configure a lesser value.
# This setting has a high impact on the physical size of tmp page files,
# the speed of document conversion, and can affect the accuracy of OCR
# results. Individual results can vary and this setting should be tested
# thoroughly against the documents you are importing to see if it has any
# impacts either negative or positive. Testing on limited document sets has
# shown a setting of 200 can cut the size of tmp files by 1/3, and speed up
# conversion by up to 4x with little impact to OCR accuracy.
#PAPERLESS_CONVERT_DENSITY=300
# Similar to the memory limit, if you've got a small system and your OS mounts
# /tmp as tmpfs, you should set this to a path that's on a physical disk, like
# /home/your_user/tmp or something. ImageMagick will use this as scratch space
# when crunching through very large documents.
#
# For more information on how to use this value, you should probably search
# the web for "MAGICK_TMPDIR".
#PAPERLESS_CONVERT_TMPDIR=/var/tmp/paperless
# You can specify where you want the SQLite database to be stored instead of
# the default location
#PAPERLESS_DBDIR=/path/to/database/file
# Override the default MEDIA_ROOT here. This is where all files are stored.
#PAPERLESS_MEDIADIR=/path/to/media
# Override the default STATIC_ROOT here. This is where all static files created
# using "collectstatic" manager command are stored.
#PAPERLESS_STATICDIR=""
# The number of seconds that Paperless will wait between checking
# PAPERLESS_CONSUMPTION_DIR. If you tend to write documents to this directory
# very slowly, you may want to use a higher value than the default (10).
# PAPERLESS_CONSUMER_LOOP_TIME=10
# If you're planning on putting Paperless on the open internet, then you
# really should set this value to the domain name you're using. Failing to do
# so leaves you open to HTTP host header attacks:
@@ -106,22 +80,94 @@ PAPERLESS_SHARED_SECRET=""
# as is "example.com,www.example.com", but NOT " example.com" or "example.com,"
#PAPERLESS_ALLOWED_HOSTS="example.com,www.example.com"
# Override the default UTC time zone here
# To host paperless under a subpath url like example.com/paperless you set
# this value to /paperless. No trailing slash!
#
# https://docs.djangoproject.com/en/1.11/ref/settings/#force-script-name
#PAPERLESS_FORCE_SCRIPT_NAME=""
###############################################################################
#### Software Tweaks ####
###############################################################################
# After a document is consumed, Paperless can trigger an arbitrary script if
# you like. This script will be passed a number of arguments for you to work
# with. The default is blank, which means nothing will be executed. For more
# information, take a look at the docs:
# http://paperless.readthedocs.org/en/latest/consumption.html#hooking-into-the-consumption-process
#PAPERLESS_POST_CONSUME_SCRIPT="/path/to/an/arbitrary/script.sh"
#
# The following values use sensible defaults for modern systems, but if you're
# running Paperless on a low-resource device (like a Raspberry Pi), modifying
# some of these values may be necessary.
#
# By default, Paperless will attempt to use all available CPU cores to process
# a document, but if you would like to limit that, you can set this value to
# an integer:
#PAPERLESS_OCR_THREADS=1
# Customize the default language that tesseract will attempt to use when
# parsing documents. It should be a 3-letter language code consistent with ISO
# 639: https://www.loc.gov/standards/iso639-2/php/code_list.php
#PAPERLESS_OCR_LANGUAGE=eng
# On smaller systems, or even in the case of Very Large Documents, the consumer
# may explode, complaining about how it's "unable to extend pixel cache". In
# such cases, try setting this to a reasonably low value, like 32000000. The
# default is to use whatever is necessary to do everything without writing to
# disk, and units are in megabytes.
#
# For more information on how to use this value, you should probably search
# the web for "MAGICK_MEMORY_LIMIT".
#PAPERLESS_CONVERT_MEMORY_LIMIT=0
# Similar to the memory limit, if you've got a small system and your OS mounts
# /tmp as tmpfs, you should set this to a path that's on a physical disk, like
# /home/your_user/tmp or something. ImageMagick will use this as scratch space
# when crunching through very large documents.
#
# For more information on how to use this value, you should probably search
# the web for "MAGICK_TMPDIR".
#PAPERLESS_CONVERT_TMPDIR=/var/tmp/paperless
# By default the conversion density setting for documents is 300DPI, in some
# cases it has proven useful to configure a lesser value.
# This setting has a high impact on the physical size of tmp page files,
# the speed of document conversion, and can affect the accuracy of OCR
# results. Individual results can vary and this setting should be tested
# thoroughly against the documents you are importing to see if it has any
# impacts either negative or positive.
# Testing on limited document sets has shown a setting of 200 can cut the
# size of tmp files by 1/3, and speed up conversion by up to 4x
# with little impact to OCR accuracy.
#PAPERLESS_CONVERT_DENSITY=300
# The number of seconds that Paperless will wait between checking
# PAPERLESS_CONSUMPTION_DIR. If you tend to write documents to this directory
# rarely, you may want to use a higher value than the default (10).
#PAPERLESS_CONSUMER_LOOP_TIME=10
###############################################################################
#### Interface ####
###############################################################################
# Override the default UTC time zone here.
# See https://docs.djangoproject.com/en/1.10/ref/settings/#std:setting-TIME_ZONE
# for details on how to set it.
#PAPERLESS_TIME_ZONE=UTC
# Customize number of list items to show per page
#PAPERLESS_LIST_PER_PAGE=50
# Customize the default language that tesseract will attempt to use when parsing
# documents. It should be a 3-letter language code consistent with ISO 639.
#PAPERLESS_OCR_LANGUAGE=eng
# The number of items on each page in the web UI. This value must be a
# positive integer, but if you don't define one in paperless.conf, a default of
# 100 will be used.
#PAPERLESS_LIST_PER_PAGE=100
# The secret key has a default that should be fine so long as you're hosting
# Paperless on a closed network. However, if you're putting this anywhere
# public, you should change the key to something unique and verbose.
#PAPERLESS_SECRET_KEY="change-me"


@@ -1,4 +1,4 @@
Django==1.10.5
Django>=1.10,<1.11
Pillow>=3.1.1
django-crispy-forms>=1.6.1
django-extensions>=1.7.6
@@ -6,18 +6,21 @@ django-filter>=1.0
django-flat-responsive>=1.2.0
djangorestframework>=3.5.3
filemagic>=1.6
fuzzywuzzy[speedup]==0.15.0
langdetect>=1.0.7
pyocr>=0.4.6
pyocr>=0.4.7
python-dateutil>=2.6.0
python-dotenv>=0.6.2
python-gnupg>=0.3.9
pytz>=2016.10
gunicorn==19.6.0
gunicorn==19.7.1
# For the tests
factory-boy
pytest
pytest-django
pytest-sugar
pytest-env
pep8
flake8
tox


@@ -7,34 +7,37 @@ map_uidgid() {
USERMAP_ORIG_GID=$(id -g paperless)
USERMAP_GID=${USERMAP_GID:-${USERMAP_UID:-$USERMAP_ORIG_GID}}
USERMAP_UID=${USERMAP_UID:-$USERMAP_ORIG_UID}
if [[ ${USERMAP_UID} != ${USERMAP_ORIG_UID} || ${USERMAP_GID} != ${USERMAP_ORIG_GID} ]]; then
if [[ ${USERMAP_UID} != "${USERMAP_ORIG_UID}" || ${USERMAP_GID} != "${USERMAP_ORIG_GID}" ]]; then
echo "Mapping UID and GID for paperless:paperless to $USERMAP_UID:$USERMAP_GID"
groupmod -g ${USERMAP_GID} paperless
groupmod -g "${USERMAP_GID}" paperless
sed -i -e "s|:${USERMAP_ORIG_UID}:${USERMAP_GID}:|:${USERMAP_UID}:${USERMAP_GID}:|" /etc/passwd
fi
}
set_permissions() {
# Set permissions for consumption directory
chgrp paperless "$PAPERLESS_CONSUMPTION_DIR" || {
echo "Changing group of consumption directory:"
echo " $PAPERLESS_CONSUMPTION_DIR"
echo "failed."
echo ""
echo "Either try to set it on your host-mounted directory"
echo "directly, or make sure that the directory has \`o+x\`"
echo "permissions and the files in it at least \`o+r\`."
} >&2
chmod g+x "$PAPERLESS_CONSUMPTION_DIR" || {
echo "Changing group permissions of consumption directory:"
echo " $PAPERLESS_CONSUMPTION_DIR"
echo "failed."
echo ""
echo "Either try to set it on your host-mounted directory"
echo "directly, or make sure that the directory has \`o+x\`"
echo "permissions and the files in it at least \`o+r\`."
} >&2
# Set permissions for consumption and export directory
for dir in PAPERLESS_CONSUMPTION_DIR PAPERLESS_EXPORT_DIR; do
# Extract the name of the current directory from $dir for the error message
cur_dir_name=$(echo "$dir" | awk -F'_' '{ print tolower($2); }')
chgrp paperless "${!dir}" || {
echo "Changing group of ${cur_dir_name} directory:"
echo " ${!dir}"
echo "failed."
echo ""
echo "Either try to set it on your host-mounted directory"
echo "directly, or make sure that the directory has \`o+x\`"
echo "permissions and the files in it at least \`o+r\`."
} >&2
chmod g+x "${!dir}" || {
echo "Changing group permissions of ${cur_dir_name} directory:"
echo " ${!dir}"
echo "failed."
echo ""
echo "Either try to set it on your host-mounted directory"
echo "directly, or make sure that the directory has \`o+x\`"
echo "permissions and the files in it at least \`o+r\`."
} >&2
done
# Set permissions for application directory
chown -Rh paperless:paperless /usr/src/paperless
}
@@ -59,11 +62,11 @@ install_languages() {
# Loop over languages to be installed
for lang in "${langs[@]}"; do
pkg="tesseract-ocr-$lang"
if dpkg -s "$pkg" 2>&1 > /dev/null; then
if dpkg -s "$pkg" > /dev/null 2>&1; then
continue
fi
if ! apt-cache show "$pkg" 2>&1 > /dev/null; then
if ! apt-cache show "$pkg" > /dev/null 2>&1; then
continue
fi


@@ -62,15 +62,24 @@ class DocumentAdmin(CommonAdmin):
list_filter = ("tags", "correspondent", MonthListFilter)
ordering = ["-created", "correspondent"]
def has_add_permission(self, request):
return False
def created_(self, obj):
return obj.created.date().strftime("%Y-%m-%d")
created_.short_description = "Created"
def thumbnail(self, obj):
if settings.FORCE_SCRIPT_NAME:
src_link = "{}/fetch/thumb/{}".format(
settings.FORCE_SCRIPT_NAME, obj.id)
else:
src_link = "/fetch/thumb/{}".format(obj.id)
png_img = self._html_tag(
"img",
src="/fetch/thumb/{}".format(obj.id),
src=src_link,
width=180,
alt="thumbnail",
alt="Thumbnail of {}".format(obj.file_name),
title=obj.file_name
)
return self._html_tag("a", png_img, href=obj.download_url)


@@ -1,35 +1,21 @@
import datetime
import hashlib
import logging
import os
import re
import uuid
import shutil
import hashlib
import logging
import datetime
import tempfile
import itertools
import subprocess
from multiprocessing.pool import Pool
import pyocr
import langdetect
from PIL import Image
from django.conf import settings
from django.utils import timezone
from paperless.db import GnuPG
from pyocr.tesseract import TesseractError
from pyocr.libtesseract.tesseract_raw import \
TesseractError as OtherTesseractError
from .models import Tag, Document, FileInfo
from .models import Document, FileInfo, Tag
from .parsers import ParseError
from .signals import (
document_consumption_started,
document_consumption_finished
document_consumer_declaration,
document_consumption_finished,
document_consumption_started
)
from .languages import ISO639
class OCRError(Exception):
pass
class ConsumerError(Exception):
@@ -47,13 +33,7 @@ class Consumer(object):
"""
SCRATCH = settings.SCRATCH_DIR
CONVERT = settings.CONVERT_BINARY
UNPAPER = settings.UNPAPER_BINARY
CONSUME = settings.CONSUMPTION_DIR
THREADS = int(settings.OCR_THREADS) if settings.OCR_THREADS else None
DENSITY = settings.CONVERT_DENSITY if settings.CONVERT_DENSITY else 300
DEFAULT_OCR_LANGUAGE = settings.OCR_LANGUAGE
def __init__(self):
@@ -78,6 +58,16 @@ class Consumer(object):
raise ConsumerError(
"Consumption directory {} does not exist".format(self.CONSUME))
self.parsers = []
for response in document_consumer_declaration.send(self):
self.parsers.append(response[1])
if not self.parsers:
raise ConsumerError(
"No parsers could be found, not even the default. "
"This is a problem."
)
def log(self, level, message):
getattr(self.logger, level)(message, extra={
"group": self.logging_group
@@ -109,6 +99,13 @@ class Consumer(object):
self._ignore.append(doc)
continue
parser_class = self._get_parser_class(doc)
if not parser_class:
self.log(
"error", "No parsers could be found for {}".format(doc))
self._ignore.append(doc)
continue
self.logging_group = uuid.uuid4()
self.log("info", "Consuming {}".format(doc))
@@ -119,25 +116,26 @@ class Consumer(object):
logging_group=self.logging_group
)
tempdir = tempfile.mkdtemp(prefix="paperless", dir=self.SCRATCH)
imgs = self._get_greyscale(tempdir, doc)
thumbnail = self._get_thumbnail(tempdir, doc)
parsed_document = parser_class(doc)
thumbnail = parsed_document.get_thumbnail()
try:
document = self._store(self._get_ocr(imgs), doc, thumbnail)
except OCRError as e:
document = self._store(
parsed_document.get_text(),
doc,
thumbnail
)
except ParseError as e:
self._ignore.append(doc)
self.log("error", "OCR FAILURE for {}: {}".format(doc, e))
self._cleanup_tempdir(tempdir)
self.log("error", "PARSE FAILURE for {}: {}".format(doc, e))
parsed_document.cleanup()
continue
else:
self._cleanup_tempdir(tempdir)
parsed_document.cleanup()
self._cleanup_doc(doc)
self.log(
@@ -151,142 +149,30 @@ class Consumer(object):
logging_group=self.logging_group
)
def _get_greyscale(self, tempdir, doc):
def _get_parser_class(self, doc):
"""
Greyscale images are easier for Tesseract to OCR
Determine the appropriate parser class based on the file
"""
self.log("info", "Generating greyscale image from {}".format(doc))
options = []
for parser in self.parsers:
result = parser(doc)
if result:
options.append(result)
# Convert PDF to multiple PNMs
pnm = os.path.join(tempdir, "convert-%04d.pnm")
run_convert(
self.CONVERT,
"-density", str(self.DENSITY),
"-depth", "8",
"-type", "grayscale",
doc, pnm,
)
# Get a list of converted images
pnms = []
for f in os.listdir(tempdir):
if f.endswith(".pnm"):
pnms.append(os.path.join(tempdir, f))
# Run unpaper in parallel on converted images
with Pool(processes=self.THREADS) as pool:
pool.map(run_unpaper, itertools.product([self.UNPAPER], pnms))
# Return list of converted images, processed with unpaper
pnms = []
for f in os.listdir(tempdir):
if f.endswith(".unpaper.pnm"):
pnms.append(os.path.join(tempdir, f))
return sorted(filter(lambda __: os.path.isfile(__), pnms))
def _get_thumbnail(self, tempdir, doc):
"""
The thumbnail of a PDF is just a 500px wide image of the first page.
"""
self.log("info", "Generating the thumbnail")
run_convert(
self.CONVERT,
"-scale", "500x5000",
"-alpha", "remove",
doc, os.path.join(tempdir, "convert-%04d.png")
)
return os.path.join(tempdir, "convert-0000.png")
def _guess_language(self, text):
try:
guess = langdetect.detect(text)
self.log("debug", "Language detected: {}".format(guess))
return guess
except Exception as e:
self.log("warning", "Language detection error: {}".format(e))
def _get_ocr(self, imgs):
"""
Attempts to do the best job possible OCR'ing the document based on
simple language detection trial & error.
"""
if not imgs:
raise OCRError("No images found")
self.log("info", "OCRing the document")
# Since the division gets rounded down by int, this calculation works
# for every edge-case, i.e. 1
middle = int(len(imgs) / 2)
raw_text = self._ocr([imgs[middle]], self.DEFAULT_OCR_LANGUAGE)
guessed_language = self._guess_language(raw_text)
if not guessed_language or guessed_language not in ISO639:
self.log("warning", "Language detection failed!")
if settings.FORGIVING_OCR:
self.log(
"warning",
"As FORGIVING_OCR is enabled, we're going to make the "
"best with what we have."
)
raw_text = self._assemble_ocr_sections(imgs, middle, raw_text)
return raw_text
raise OCRError("Language detection failed")
if ISO639[guessed_language] == self.DEFAULT_OCR_LANGUAGE:
raw_text = self._assemble_ocr_sections(imgs, middle, raw_text)
return raw_text
try:
return self._ocr(imgs, ISO639[guessed_language])
except pyocr.pyocr.tesseract.TesseractError:
if settings.FORGIVING_OCR:
self.log(
"warning",
"OCR for {} failed, but we're going to stick with what "
"we've got since FORGIVING_OCR is enabled.".format(
guessed_language
)
)
raw_text = self._assemble_ocr_sections(imgs, middle, raw_text)
return raw_text
raise OCRError(
"The guessed language is not available in this instance of "
"Tesseract."
self.log(
"info",
"Parsers available: {}".format(
", ".join([str(o["parser"].__name__) for o in options])
)
)
def _assemble_ocr_sections(self, imgs, middle, text):
"""
Given a `middle` value and the text that middle page represents, we OCR
the remainder of the document and return the whole thing.
"""
text = self._ocr(imgs[:middle], self.DEFAULT_OCR_LANGUAGE) + text
text += self._ocr(imgs[middle + 1:], self.DEFAULT_OCR_LANGUAGE)
return text
if not options:
return None
def _ocr(self, imgs, lang):
"""
Performs a single OCR attempt.
"""
if not imgs:
return ""
self.log("info", "Parsing for {}".format(lang))
with Pool(processes=self.THREADS) as pool:
r = pool.map(image_to_string, itertools.product(imgs, [lang]))
r = " ".join(r)
# Strip out excess white space to allow matching to go smoother
return strip_excess_whitespace(r)
# Return the parser with the highest weight.
return sorted(
options, key=lambda _: _["weight"], reverse=True)[0]["parser"]
def _store(self, text, doc, thumbnail):
@@ -332,10 +218,6 @@ class Consumer(object):
return document
def _cleanup_tempdir(self, d):
self.log("debug", "Deleting directory {}".format(d))
shutil.rmtree(d)
def _cleanup_doc(self, doc):
self.log("debug", "Deleting document {}".format(doc))
os.unlink(doc)
@@ -361,41 +243,3 @@ class Consumer(object):
with open(doc, "rb") as f:
checksum = hashlib.md5(f.read()).hexdigest()
return Document.objects.filter(checksum=checksum).exists()
def strip_excess_whitespace(text):
collapsed_spaces = re.sub(r"([^\S\r\n]+)", " ", text)
no_leading_whitespace = re.sub(
"([\n\r]+)([^\S\n\r]+)", '\\1', collapsed_spaces)
no_trailing_whitespace = re.sub("([^\S\n\r]+)$", '', no_leading_whitespace)
return no_trailing_whitespace
def image_to_string(args):
img, lang = args
ocr = pyocr.get_available_tools()[0]
with Image.open(os.path.join(Consumer.SCRATCH, img)) as f:
if ocr.can_detect_orientation():
try:
orientation = ocr.detect_orientation(f, lang=lang)
f = f.rotate(orientation["angle"], expand=1)
except (TesseractError, OtherTesseractError):
pass
return ocr.image_to_string(f, lang=lang)
def run_unpaper(args):
unpaper, pnm = args
subprocess.Popen(
(unpaper, pnm, pnm.replace(".pnm", ".unpaper.pnm"))).wait()
def run_convert(*args):
environment = os.environ.copy()
if settings.CONVERT_MEMORY_LIMIT:
environment["MAGICK_MEMORY_LIMIT"] = settings.CONVERT_MEMORY_LIMIT
if settings.CONVERT_TMPDIR:
environment["MAGICK_TMPDIR"] = settings.CONVERT_TMPDIR
subprocess.Popen(args, env=environment).wait()
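The new `_get_parser_class` asks every declared parser callable about the file and keeps the answer with the highest weight. A rough standalone sketch of that selection, assuming a hypothetical function `pick_parser` that receives the callables directly rather than through the Django signal:

```python
def pick_parser(declarations, doc):
    """Return the parser class with the highest weight, or None.

    `declarations` is a list of callables; each returns either None or a
    dict shaped like {"weight": int, "parser": SomeClass}, as in the diff.
    `pick_parser` itself is an illustrative name, not the project's API.
    """
    options = []
    for declare in declarations:
        result = declare(doc)
        if result:  # a falsy answer means "I cannot handle this file"
            options.append(result)

    if not options:
        return None

    # max() picks the first entry with the highest weight, which matches
    # sorting in descending order and taking element 0.
    return max(options, key=lambda option: option["weight"])["parser"]
```

Returning `None` instead of raising lets the caller log the problem and move the file to its ignore list, which is what the consumer loop in the diff does.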

@@ -8,7 +8,7 @@ class CorrespondentFilterSet(FilterSet):
class Meta(object):
model = Correspondent
fields = {
'name': [
"name": [
"startswith", "endswith", "contains",
"istartswith", "iendswith", "icontains"
],
@@ -21,7 +21,7 @@ class TagFilterSet(FilterSet):
class Meta(object):
model = Tag
fields = {
'name': [
"name": [
"startswith", "endswith", "contains",
"istartswith", "iendswith", "icontains"
],

@@ -2,7 +2,6 @@ import magic
import os
from datetime import datetime
from hashlib import sha256
from time import mktime
from django import forms
@@ -14,7 +13,6 @@ from .consumer import Consumer
class UploadForm(forms.Form):
SECRET = settings.SHARED_SECRET
TYPE_LOOKUP = {
"application/pdf": Document.TYPE_PDF,
"image/png": Document.TYPE_PNG,
@@ -32,10 +30,9 @@ class UploadForm(forms.Form):
required=False
)
document = forms.FileField()
signature = forms.CharField(max_length=256)
def __init__(self, *args, **kwargs):
forms.Form.__init__(*args, **kwargs)
forms.Form.__init__(self, *args, **kwargs)
self._file_type = None
def clean_correspondent(self):
@@ -82,17 +79,6 @@ class UploadForm(forms.Form):
return document
def clean(self):
corresp = self.clened_data.get("correspondent")
title = self.cleaned_data.get("title")
signature = self.cleaned_data.get("signature")
if sha256(corresp + title + self.SECRET).hexdigest() == signature:
return self.cleaned_data
raise forms.ValidationError("The signature provided did not validate")
def save(self):
"""
Since the consumer already does a lot of work, it's easier just to save
@@ -100,11 +86,11 @@ class UploadForm(forms.Form):
form do that as well. Think of it as a poor-man's queue server.
"""
correspondent = self.clened_data.get("correspondent")
correspondent = self.cleaned_data.get("correspondent")
title = self.cleaned_data.get("title")
document = self.cleaned_data.get("document")
t = int(mktime(datetime.now()))
t = int(mktime(datetime.now().timetuple()))
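The corrected line works because `time.mktime()` accepts a `struct_time` (or 9-tuple) and rejects a `datetime` object. A small sketch of the conversion, using a hypothetical helper name `unix_timestamp`:

```python
from datetime import datetime
from time import mktime


def unix_timestamp(moment):
    """Convert a naive local datetime to whole seconds since the epoch.

    `unix_timestamp` is an illustrative helper, not a function from the
    project. mktime() interprets the struct_time as local time.
    """
    # timetuple() produces the struct_time that mktime() requires;
    # passing the datetime itself raises TypeError.
    return int(mktime(moment.timetuple()))
```

Sub-second precision is dropped by `timetuple()`, which is fine for a value that is only used to build a file name.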
file_name = os.path.join(
Consumer.CONSUME,
"{} - {}.{}".format(correspondent, title, self._file_type)

@@ -43,7 +43,10 @@ class Message(Loggable):
and n attachments, and that we don't care about the message body.
"""
SECRET = settings.SHARED_SECRET
SECRET = os.getenv(
"PAPERLESS_EMAIL_SECRET",
os.getenv("PAPERLESS_SHARED_SECRET") # TODO: Remove after 2017/09
)
def __init__(self, data, group=None):
"""
@@ -153,11 +156,11 @@ class MailFetcher(Loggable):
Loggable.__init__(self)
self._connection = None
self._host = settings.MAIL_CONSUMPTION["HOST"]
self._port = settings.MAIL_CONSUMPTION["PORT"]
self._username = settings.MAIL_CONSUMPTION["USERNAME"]
self._password = settings.MAIL_CONSUMPTION["PASSWORD"]
self._inbox = settings.MAIL_CONSUMPTION["INBOX"]
self._host = os.getenv("PAPERLESS_CONSUME_MAIL_HOST")
self._port = os.getenv("PAPERLESS_CONSUME_MAIL_PORT")
self._username = os.getenv("PAPERLESS_CONSUME_MAIL_USER")
self._password = os.getenv("PAPERLESS_CONSUME_MAIL_PASS")
self._inbox = os.getenv("PAPERLESS_CONSUME_MAIL_INBOX", "INBOX")
self._enabled = bool(self._host)
@@ -219,7 +222,7 @@ class MailFetcher(Loggable):
if not login[0] == "OK":
raise MailFetcherError("Can't log into mail: {}".format(login[1]))
inbox = self._connection.select("INBOX")
inbox = self._connection.select(self._inbox)
if not inbox[0] == "OK":
raise MailFetcherError("Can't find the inbox: {}".format(inbox[1]))
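The mail settings now come straight from environment variables, with the older `PAPERLESS_SHARED_SECRET` kept as a fallback for the secret and `"INBOX"` as the default folder. A sketch of that lookup order, assuming a hypothetical function `read_mail_config` that accepts any mapping so it can be exercised without touching the real environment:

```python
import os


def read_mail_config(environ=os.environ):
    """Collect mail-consumer settings from an environment-like mapping.

    `read_mail_config` is an illustrative name. The variable names are the
    ones shown in the diff; values are strings or None, so a port would
    still need converting before any numeric use.
    """
    return {
        # Prefer the new variable, fall back to the older shared secret.
        "secret": environ.get(
            "PAPERLESS_EMAIL_SECRET",
            environ.get("PAPERLESS_SHARED_SECRET"),
        ),
        "host": environ.get("PAPERLESS_CONSUME_MAIL_HOST"),
        "port": environ.get("PAPERLESS_CONSUME_MAIL_PORT"),
        "inbox": environ.get("PAPERLESS_CONSUME_MAIL_INBOX", "INBOX"),
        # Mirrors `bool(self._host)`: no host means mail fetching is off.
        "enabled": bool(environ.get("PAPERLESS_CONSUME_MAIL_HOST")),
    }
```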

@@ -28,6 +28,7 @@ class Command(BaseCommand):
self.file_consumer = None
self.mail_fetcher = None
self.first_iteration = True
BaseCommand.__init__(self, *args, **kwargs)
@@ -66,6 +67,9 @@ class Command(BaseCommand):
self.file_consumer.consume()
# Occasionally fetch mail and store it to be consumed on the next loop
# We fetch email when we first start up so that it is not necessary to
# wait for 10 minutes after making changes to the config file.
delta = self.mail_fetcher.last_checked + self.MAIL_DELTA
if delta < datetime.datetime.now():
if self.first_iteration or delta < datetime.datetime.now():
self.first_iteration = False
self.mail_fetcher.pull()
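The added `first_iteration` flag makes the loop fetch mail immediately on start-up and only afterwards wait for the usual interval. The decision can be sketched as a pure function, with `should_fetch_mail` and its parameter names being illustrative assumptions:

```python
import datetime


def should_fetch_mail(first_iteration, last_checked, interval, now):
    """Decide whether the loop should pull mail on this pass.

    Illustrative helper: `interval` plays the role of MAIL_DELTA and
    `last_checked` the role of mail_fetcher.last_checked in the diff.
    """
    # Always fetch on the very first pass; later, only once the interval
    # since the last check has fully elapsed.
    return first_iteration or (last_checked + interval) < now
```

The caller is still responsible for clearing the flag after the first fetch, as the diff does with `self.first_iteration = False`.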

@@ -10,6 +10,7 @@ from documents.models import Document, Correspondent, Tag
from paperless.db import GnuPG
from ...mixins import Renderable
from documents.settings import EXPORTER_FILE_NAME, EXPORTER_THUMBNAIL_NAME
class Command(Renderable, BaseCommand):
@@ -61,15 +62,24 @@ class Command(Renderable, BaseCommand):
document = document_map[document_dict["pk"]]
target = os.path.join(self.target, document.file_name)
document_dict["__exported_file_name__"] = target
file_target = os.path.join(self.target, document.file_name)
print("Exporting: {}".format(target))
thumbnail_name = document.file_name + "-tumbnail.png"
thumbnail_target = os.path.join(self.target, thumbnail_name)
with open(target, "wb") as f:
document_dict[EXPORTER_FILE_NAME] = document.file_name
document_dict[EXPORTER_THUMBNAIL_NAME] = thumbnail_name
print("Exporting: {}".format(file_target))
t = int(time.mktime(document.created.timetuple()))
with open(file_target, "wb") as f:
f.write(GnuPG.decrypted(document.source_file))
t = int(time.mktime(document.created.timetuple()))
os.utime(target, times=(t, t))
os.utime(file_target, times=(t, t))
with open(thumbnail_target, "wb") as f:
f.write(GnuPG.decrypted(document.thumbnail_file))
os.utime(thumbnail_target, times=(t, t))
manifest += json.loads(
serializers.serialize("json", Correspondent.objects.all()))

@@ -10,6 +10,8 @@ from paperless.db import GnuPG
from ...mixins import Renderable
from documents.settings import EXPORTER_FILE_NAME, EXPORTER_THUMBNAIL_NAME
class Command(Renderable, BaseCommand):
@@ -70,13 +72,13 @@ class Command(Renderable, BaseCommand):
if not record["model"] == "documents.document":
continue
if "__exported_file_name__" not in record:
if EXPORTER_FILE_NAME not in record:
raise CommandError(
'The manifest file contains a record which does not '
'refer to an actual document file.'
)
doc_file = record["__exported_file_name__"]
doc_file = record[EXPORTER_FILE_NAME]
if not os.path.exists(os.path.join(self.source, doc_file)):
raise CommandError(
'The manifest file refers to "{}" which does not '
@@ -90,10 +92,21 @@ class Command(Renderable, BaseCommand):
if not record["model"] == "documents.document":
continue
doc_file = record["__exported_file_name__"]
doc_file = record[EXPORTER_FILE_NAME]
thumb_file = record[EXPORTER_THUMBNAIL_NAME]
document = Document.objects.get(pk=record["pk"])
with open(doc_file, "rb") as unencrypted:
document_path = os.path.join(self.source, doc_file)
thumbnail_path = os.path.join(self.source, thumb_file)
with open(document_path, "rb") as unencrypted:
with open(document.source_path, "wb") as encrypted:
print("Encrypting {} and saving it to {}".format(
doc_file, document.source_path))
encrypted.write(GnuPG.encrypted(unencrypted))
with open(thumbnail_path, "rb") as unencrypted:
with open(document.thumbnail_path, "wb") as encrypted:
print("Encrypting {} and saving it to {}".format(
thumb_file, document.thumbnail_path))
encrypted.write(GnuPG.encrypted(unencrypted))

@@ -50,7 +50,7 @@ class GroupConcat(models.Aggregate):
def _get_template(self, separator):
if self.engine == self.ENGINE_MYSQL:
return "%(function)s(%(expressions)s, SEPARATOR '{}')".format(
return "%(function)s(%(expressions)s SEPARATOR '{}')".format(
separator)
return "%(function)s(%(expressions)s, '{}')".format(separator)
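The one-character fix above matters because MySQL's `GROUP_CONCAT` takes `SEPARATOR` as a keyword after the expression list, not as a comma-separated argument. A sketch of how the two templates expand, assuming a hypothetical standalone function `group_concat_template` and a plain string in place of the class's engine constants:

```python
def group_concat_template(engine, separator):
    """Return the SQL template for a group-concat aggregate.

    Illustrative stand-in for GroupConcat._get_template; the "mysql"
    string is an assumption replacing the class's ENGINE_MYSQL constant.
    """
    if engine == "mysql":
        # MySQL syntax: GROUP_CONCAT(expr SEPARATOR ',') with no comma.
        return "%(function)s(%(expressions)s SEPARATOR '{}')".format(separator)
    # Other engines in the diff take the separator as a second argument.
    return "%(function)s(%(expressions)s, '{}')".format(separator)


# The %-placeholders are filled in later, when the SQL is rendered.
rendered_mysql = group_concat_template("mysql", ",") % {
    "function": "GROUP_CONCAT",
    "expressions": "name",
}
```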

@@ -3,6 +3,7 @@
from __future__ import unicode_literals
from django.db import migrations, models
from django.conf import settings
class Migration(migrations.Migration):
@@ -19,7 +20,7 @@ class Migration(migrations.Migration):
('id', models.AutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')),
('sender', models.CharField(blank=True, db_index=True, max_length=128)),
('title', models.CharField(blank=True, db_index=True, max_length=128)),
('content', models.TextField(db_index=True)),
('content', models.TextField(db_index=("mysql" not in settings.DATABASES["default"]["ENGINE"]))),
('created', models.DateTimeField(auto_now_add=True)),
('modified', models.DateTimeField(auto_now=True)),
],

@@ -47,7 +47,11 @@ class Migration(migrations.Migration):
],
),
migrations.RunPython(move_sender_strings_to_sender_model),
migrations.AlterField(
migrations.RemoveField(
model_name='document',
name='sender',
),
migrations.AddField(
model_name='document',
name='sender',
field=models.ForeignKey(blank=True, on_delete=django.db.models.deletion.CASCADE, to='documents.Sender'),

@@ -38,6 +38,9 @@ class GnuPG(object):
def move_documents_and_create_thumbnails(apps, schema_editor):
os.makedirs(os.path.join(settings.MEDIA_ROOT, "documents", "originals"), exist_ok=True)
os.makedirs(os.path.join(settings.MEDIA_ROOT, "documents", "thumbnails"), exist_ok=True)
documents = os.listdir(os.path.join(settings.MEDIA_ROOT, "documents"))
if set(documents) == {"originals", "thumbnails"}:

@@ -0,0 +1,20 @@
# -*- coding: utf-8 -*-
# Generated by Django 1.10.5 on 2017-03-25 15:58
from __future__ import unicode_literals
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('documents', '0015_add_insensitive_to_match'),
]
operations = [
migrations.AlterField(
model_name='document',
name='content',
field=models.TextField(blank=True, db_index=True, help_text='The raw, text-only data of the document. This field is primarily used for searching.'),
),
]

View File

@@ -0,0 +1,25 @@
# -*- coding: utf-8 -*-
# Generated by Django 1.10.5 on 2017-05-12 05:07
from __future__ import unicode_literals
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('documents', '0016_auto_20170325_1558'),
]
operations = [
migrations.AlterField(
model_name='correspondent',
name='matching_algorithm',
field=models.PositiveIntegerField(choices=[(1, 'Any'), (2, 'All'), (3, 'Literal'), (4, 'Regular Expression'), (5, 'Fuzzy Match')], default=1, help_text='Which algorithm you want to use when matching text to the OCR\'d PDF. Here, "any" looks for any occurrence of any word provided in the PDF, while "all" requires that every word provided appear in the PDF, albeit not in the order provided. A "literal" match means that the text you enter must appear in the PDF exactly as you\'ve entered it, and "regular expression" uses a regex to match the PDF. (If you don\'t know what a regex is, you probably don\'t want this option.) Finally, a "fuzzy match" looks for words or phrases that are mostly—but not exactly—the same, which can be useful for matching against documents containg imperfections that foil accurate OCR.'),
),
migrations.AlterField(
model_name='tag',
name='matching_algorithm',
field=models.PositiveIntegerField(choices=[(1, 'Any'), (2, 'All'), (3, 'Literal'), (4, 'Regular Expression'), (5, 'Fuzzy Match')], default=1, help_text='Which algorithm you want to use when matching text to the OCR\'d PDF. Here, "any" looks for any occurrence of any word provided in the PDF, while "all" requires that every word provided appear in the PDF, albeit not in the order provided. A "literal" match means that the text you enter must appear in the PDF exactly as you\'ve entered it, and "regular expression" uses a regex to match the PDF. (If you don\'t know what a regex is, you probably don\'t want this option.) Finally, a "fuzzy match" looks for words or phrases that are mostly—but not exactly—the same, which can be useful for matching against documents containg imperfections that foil accurate OCR.'),
),
]

@@ -0,0 +1,21 @@
# -*- coding: utf-8 -*-
# Generated by Django 1.10.5 on 2017-07-15 17:12
from __future__ import unicode_literals
from django.db import migrations, models
import django.db.models.deletion
class Migration(migrations.Migration):
dependencies = [
('documents', '0017_auto_20170512_0507'),
]
operations = [
migrations.AlterField(
model_name='document',
name='correspondent',
field=models.ForeignKey(blank=True, null=True, on_delete=django.db.models.deletion.SET_NULL, related_name='documents', to='documents.Correspondent'),
),
]
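The migration above switches the `correspondent` foreign key to `SET_NULL`, so deleting a correspondent leaves its documents in place with an empty reference. Django applies `on_delete` in its own deletion code rather than through a database constraint, so the following is only a swapped-in illustration of the same outcome using a SQLite `ON DELETE SET NULL` constraint; the table and column names are assumptions, not the project's schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE correspondent (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE document ("
    "id INTEGER PRIMARY KEY, "
    "correspondent_id INTEGER "
    "REFERENCES correspondent(id) ON DELETE SET NULL)"
)
conn.execute("INSERT INTO correspondent (id, name) VALUES (1, 'Bank')")
conn.execute("INSERT INTO document (id, correspondent_id) VALUES (10, 1)")

# Deleting the correspondent keeps the document and clears its reference.
conn.execute("DELETE FROM correspondent WHERE id = 1")
remaining = conn.execute(
    "SELECT id, correspondent_id FROM document"
).fetchall()
```

With a cascading delete the `document` row would have been removed along with the correspondent; here it survives with `correspondent_id` set to NULL.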

@@ -1,8 +1,3 @@
from django.contrib.auth.mixins import AccessMixin
from django.contrib.auth import authenticate, login
import base64
class Renderable(object):
"""
A handy mixin to make it easier/cleaner to print output based on a
@@ -12,46 +7,3 @@ class Renderable(object):
def _render(self, text, verbosity):
if self.verbosity >= verbosity:
print(text)
class SessionOrBasicAuthMixin(AccessMixin):
"""
Session or Basic Authentication mixin for Django.
It determines if the requester is already logged in or if they have
provided proper http-authorization and returning the view if all goes
well, otherwise responding with a 401.
Base for mixin found here: https://djangosnippets.org/snippets/3073/
"""
def dispatch(self, request, *args, **kwargs):
# check if user is authenticated via the session
if request.user.is_authenticated:
# Already logged in, just return the view.
return super(SessionOrBasicAuthMixin, self).dispatch(
request, *args, **kwargs
)
# apparently not authenticated via session, maybe via HTTP Basic?
if 'HTTP_AUTHORIZATION' in request.META:
auth = request.META['HTTP_AUTHORIZATION'].split()
if len(auth) == 2:
# NOTE: Support for only basic authentication
if auth[0].lower() == "basic":
authString = base64.b64decode(auth[1]).decode('utf-8')
uname, passwd = authString.split(':')
user = authenticate(username=uname, password=passwd)
if user is not None:
if user.is_active:
login(request, user)
request.user = user
return super(
SessionOrBasicAuthMixin, self
).dispatch(
request, *args, **kwargs
)
# nope, really not authenticated
return self.handle_no_permission()

@@ -5,6 +5,7 @@ import re
import uuid
from collections import OrderedDict
from fuzzywuzzy import fuzz
from django.conf import settings
from django.core.urlresolvers import reverse
@@ -21,11 +22,13 @@ class MatchingModel(models.Model):
MATCH_ALL = 2
MATCH_LITERAL = 3
MATCH_REGEX = 4
MATCH_FUZZY = 5
MATCHING_ALGORITHMS = (
(MATCH_ANY, "Any"),
(MATCH_ALL, "All"),
(MATCH_LITERAL, "Literal"),
(MATCH_REGEX, "Regular Expression"),
(MATCH_FUZZY, "Fuzzy Match"),
)
name = models.CharField(max_length=128, unique=True)
@@ -42,8 +45,11 @@ class MatchingModel(models.Model):
"provided appear in the PDF, albeit not in the order provided. A "
"\"literal\" match means that the text you enter must appear in "
"the PDF exactly as you've entered it, and \"regular expression\" "
"uses a regex to match the PDF. If you don't know what a regex "
"is, you probably don't want this option."
"uses a regex to match the PDF. (If you don't know what a regex "
"is, you probably don't want this option.) Finally, a \"fuzzy "
"match\" looks for words or phrases that are mostly—but not "
"exactly—the same, which can be useful for matching against "
"documents containg imperfections that foil accurate OCR."
)
)
@@ -104,6 +110,15 @@ class MatchingModel(models.Model):
return bool(re.search(
re.compile(self.match, **search_kwargs), text))
if self.matching_algorithm == self.MATCH_FUZZY:
match = re.sub(r'[^\w\s]', '', self.match)
text = re.sub(r'[^\w\s]', '', text)
if self.is_insensitive:
match = match.lower()
text = text.lower()
return True if fuzz.partial_ratio(match, text) >= 90 else False
raise NotImplementedError("Unsupported matching algorithm")
def save(self, *args, **kwargs):
@@ -157,14 +172,28 @@ class Document(models.Model):
TYPES = (TYPE_PDF, TYPE_PNG, TYPE_JPG, TYPE_GIF, TYPE_TIF,)
correspondent = models.ForeignKey(
Correspondent, blank=True, null=True, related_name="documents")
Correspondent,
blank=True,
null=True,
related_name="documents",
on_delete=models.SET_NULL
)
title = models.CharField(max_length=128, blank=True, db_index=True)
content = models.TextField(db_index=True)
content = models.TextField(
db_index=True,
blank=True,
help_text="The raw, text-only data of the document. This field is "
"primarily used for searching."
)
file_type = models.CharField(
max_length=4,
editable=False,
choices=tuple([(t, t.upper()) for t in TYPES])
)
tags = models.ManyToManyField(
Tag, related_name="documents", blank=True)
@@ -292,45 +321,45 @@ class FileInfo(object):
r"(?P<correspondent>.*) - "
r"(?P<title>.*) - "
r"(?P<tags>[a-z0-9\-,]*)"
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff?)$",
flags=re.IGNORECASE
)),
("created-title-tags", re.compile(
r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - "
r"(?P<title>.*) - "
r"(?P<tags>[a-z0-9\-,]*)"
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff?)$",
flags=re.IGNORECASE
)),
("created-correspondent-title", re.compile(
r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - "
r"(?P<correspondent>.*) - "
r"(?P<title>.*)"
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff?)$",
flags=re.IGNORECASE
)),
("created-title", re.compile(
r"^(?P<created>\d\d\d\d\d\d\d\d(\d\d\d\d\d\d)?Z) - "
r"(?P<title>.*)"
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff?)$",
flags=re.IGNORECASE
)),
("correspondent-title-tags", re.compile(
r"(?P<correspondent>.*) - "
r"(?P<title>.*) - "
r"(?P<tags>[a-z0-9\-,]*)"
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff?)$",
flags=re.IGNORECASE
)),
("correspondent-title", re.compile(
r"(?P<correspondent>.*) - "
r"(?P<title>.*)?"
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff?)$",
flags=re.IGNORECASE
)),
("title", re.compile(
r"(?P<title>.*)"
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff)$",
r"\.(?P<extension>pdf|jpe?g|png|gif|tiff?)$",
flags=re.IGNORECASE
))
])
@@ -373,6 +402,8 @@ class FileInfo(object):
r = extension.lower()
if r == "jpeg":
return "jpg"
if r == "tif":
return "tiff"
return r
@classmethod
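Changing `tiff` to `tiff?` in each pattern lets `.tif` files match, and the extra branch folds that spelling back into `tiff`. A compact sketch of both steps together, assuming a hypothetical helper `normalised_extension` that only looks at the extension part of the file name:

```python
import re

# Same extension alternation as in the diff; the trailing "?" makes the
# second "f" optional so both ".tif" and ".tiff" are accepted.
EXTENSION = re.compile(
    r"\.(?P<extension>pdf|jpe?g|png|gif|tiff?)$",
    flags=re.IGNORECASE,
)


def normalised_extension(file_name):
    """Return the canonical lower-case extension, or None if unsupported.

    Illustrative helper, not the project's FileInfo API.
    """
    match = EXTENSION.search(file_name)
    if not match:
        return None
    r = match.group("extension").lower()
    if r == "jpeg":
        return "jpg"
    if r == "tif":
        return "tiff"
    return r
```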

src/documents/parsers.py Normal file
@@ -0,0 +1,45 @@
import logging
import shutil
import tempfile
from django.conf import settings
class ParseError(Exception):
pass
class DocumentParser(object):
"""
Subclass this to make your own parser. Have a look at
`paperless_tesseract.parsers` for inspiration.
"""
SCRATCH = settings.SCRATCH_DIR
def __init__(self, path):
self.document_path = path
self.tempdir = tempfile.mkdtemp(prefix="paperless", dir=self.SCRATCH)
self.logger = logging.getLogger(__name__)
self.logging_group = None
def get_thumbnail(self):
"""
Returns the path to a file we can use as a thumbnail for this document.
"""
raise NotImplementedError()
def get_text(self):
"""
Returns the text from the document and only the text.
"""
raise NotImplementedError()
def log(self, level, message):
getattr(self.logger, level)(message, extra={
"group": self.logging_group
})
def cleanup(self):
self.log("debug", "Deleting directory {}".format(self.tempdir))
shutil.rmtree(self.tempdir)
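To show how a subclass might plug into this base class, here is a rough sketch of a plain-text parser together with the kind of callable the consumer queries for a weight. Everything in it is hypothetical: `SketchParser` is a stand-in that mirrors `DocumentParser` without Django settings, and `PlainTextParser` and `declare_plain_text` are invented for illustration. Thumbnail generation is deliberately left unimplemented.

```python
import os
import shutil
import tempfile


class SketchParser(object):
    """Stand-in with the same shape as DocumentParser, minus Django."""

    def __init__(self, path):
        self.document_path = path
        # Each parser instance owns a scratch directory, as in the diff.
        self.tempdir = tempfile.mkdtemp(prefix="paperless")

    def get_thumbnail(self):
        raise NotImplementedError()

    def get_text(self):
        raise NotImplementedError()

    def cleanup(self):
        shutil.rmtree(self.tempdir)


class PlainTextParser(SketchParser):
    """Hypothetical parser: the document's text is simply its contents."""

    def get_text(self):
        with open(self.document_path, encoding="utf-8") as f:
            return f.read()


def declare_plain_text(doc):
    """Answer the consumer's question: can this parser handle `doc`?"""
    if os.path.splitext(doc)[1].lower() == ".txt":
        return {"weight": 0, "parser": PlainTextParser}
    return None
```

The consumer would instantiate the returned class with the document path, call `get_text()`, and call `cleanup()` whether or not parsing succeeded.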

@@ -18,12 +18,21 @@ class TagSerializer(serializers.HyperlinkedModelSerializer):
"id", "slug", "name", "colour", "match", "matching_algorithm")
class CorrespondentField(serializers.HyperlinkedRelatedField):
def get_queryset(self):
return Correspondent.objects.all()
class TagsField(serializers.HyperlinkedRelatedField):
def get_queryset(self):
return Tag.objects.all()
class DocumentSerializer(serializers.ModelSerializer):
correspondent = serializers.HyperlinkedRelatedField(
read_only=True, view_name="drf:correspondent-detail", allow_null=True)
tags = serializers.HyperlinkedRelatedField(
read_only=True, view_name="drf:tag-detail", many=True)
correspondent = CorrespondentField(
view_name="drf:correspondent-detail", allow_null=True)
tags = TagsField(view_name="drf:tag-detail", many=True)
class Meta(object):
model = Document

@@ -0,0 +1,4 @@
# Defines the names of file/thumbnail for the manifest
# for exporting/importing commands
EXPORTER_FILE_NAME = "__exported_file_name__"
EXPORTER_THUMBNAIL_NAME = "__exported_thumbnail_name__"
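These two constants give the exporter and importer a shared vocabulary: the exporter writes bare file names into each manifest record, and the importer joins them onto its source directory. A sketch of that round trip, assuming hypothetical helpers `manifest_entry` and `resolve_paths` (the thumbnail suffix is copied verbatim from the exporter diff):

```python
import os

EXPORTER_FILE_NAME = "__exported_file_name__"
EXPORTER_THUMBNAIL_NAME = "__exported_thumbnail_name__"


def manifest_entry(file_name):
    """Build the export-specific keys for one document record.

    Illustrative helper; only names relative to the export directory are
    stored, so the export can be moved before it is imported.
    """
    return {
        EXPORTER_FILE_NAME: file_name,
        # Suffix copied as-is from the exporter in the diff.
        EXPORTER_THUMBNAIL_NAME: file_name + "-tumbnail.png",
    }


def resolve_paths(record, source):
    """Turn a manifest record back into paths under `source`."""
    return (
        os.path.join(source, record[EXPORTER_FILE_NAME]),
        os.path.join(source, record[EXPORTER_THUMBNAIL_NAME]),
    )
```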

@@ -2,3 +2,4 @@ from django.dispatch import Signal
document_consumption_started = Signal(providing_args=["filename"])
document_consumption_finished = Signal(providing_args=["document"])
document_consumer_declaration = Signal(providing_args=[])

@@ -1,6 +1,5 @@
import logging
import os
from subprocess import Popen
from django.conf import settings

@@ -10,3 +10,14 @@ td a.tag {
margin: 1px;
display: inline-block;
}
#result_list th.column-note {
text-align: right;
}
#result_list td.field-note {
text-align: right;
}
#result_list td textarea {
width: 90%;
height: 5em;
}

@@ -158,7 +158,7 @@
<script>
// We nee to re-build the select-all functionality as the old logic pointed
// We need to re-build the select-all functionality as the old logic pointed
// to a table and we're using divs now.
django.jQuery("#action-toggle").on("change", function(){
django.jQuery(".grid .box .result .checkbox input")

@@ -0,0 +1,17 @@
import factory
from ..models import Document, Correspondent
class CorrespondentFactory(factory.DjangoModelFactory):
class Meta:
model = Correspondent
name = factory.Faker("name")
class DocumentFactory(factory.DjangoModelFactory):
class Meta:
model = Document

@@ -1,22 +1,66 @@
import os
from unittest import mock, skipIf
import pyocr
from django.test import TestCase
from pyocr.libtesseract.tesseract_raw import \
TesseractError as OtherTesseractError
from unittest import mock
from ..consumer import Consumer
from ..models import FileInfo
from ..consumer import image_to_string, strip_excess_whitespace
class TestConsumer(TestCase):
class DummyParser(object):
pass
def test__get_parser_class_1_parser(self):
self.assertEqual(
self._get_consumer()._get_parser_class("doc.pdf"),
self.DummyParser
)
@mock.patch("documents.consumer.Consumer.CONSUME")
@mock.patch("documents.consumer.os.makedirs")
@mock.patch("documents.consumer.os.path.exists", return_value=True)
@mock.patch("documents.consumer.document_consumer_declaration.send")
def test__get_parser_class_n_parsers(self, m, *args):
class DummyParser1(object):
pass
class DummyParser2(object):
pass
m.return_value = (
(None, lambda _: {"weight": 0, "parser": DummyParser1}),
(None, lambda _: {"weight": 1, "parser": DummyParser2}),
)
self.assertEqual(Consumer()._get_parser_class("doc.pdf"), DummyParser2)
@mock.patch("documents.consumer.Consumer.CONSUME")
@mock.patch("documents.consumer.os.makedirs")
@mock.patch("documents.consumer.os.path.exists", return_value=True)
@mock.patch("documents.consumer.document_consumer_declaration.send")
def test__get_parser_class_0_parsers(self, m, *args):
m.return_value = ((None, lambda _: None),)
self.assertIsNone(Consumer()._get_parser_class("doc.pdf"))
@mock.patch("documents.consumer.Consumer.CONSUME")
@mock.patch("documents.consumer.os.makedirs")
@mock.patch("documents.consumer.os.path.exists", return_value=True)
@mock.patch("documents.consumer.document_consumer_declaration.send")
def _get_consumer(self, m, *args):
m.return_value = (
(None, lambda _: {"weight": 0, "parser": self.DummyParser}),
)
return Consumer()
class TestAttributes(TestCase):
TAGS = ("tag1", "tag2", "tag3")
EXTENSIONS = (
"pdf", "png", "jpg", "jpeg", "gif",
"PDF", "PNG", "JPG", "JPEG", "GIF",
"PdF", "PnG", "JpG", "JPeG", "GiF",
"pdf", "png", "jpg", "jpeg", "gif", "tiff", "tif",
"PDF", "PNG", "JPG", "JPEG", "GIF", "TIFF", "TIF",
"PdF", "PnG", "JpG", "JPeG", "GiF", "TiFf", "TiF",
)
def _test_guess_attributes_from_name(self, path, sender, title, tags):
@@ -36,6 +80,8 @@ class TestAttributes(TestCase):
self.assertEqual(tuple([t.slug for t in file_info.tags]), tags, f)
if extension.lower() == "jpeg":
self.assertEqual(file_info.extension, "jpg", f)
elif extension.lower() == "tif":
self.assertEqual(file_info.extension, "tiff", f)
else:
self.assertEqual(file_info.extension, extension.lower(), f)
@@ -308,71 +354,3 @@ class TestFieldPermutations(TestCase):
}
self._test_guessed_attributes(
template.format(**spec), **spec)
class FakeTesseract(object):
@staticmethod
def can_detect_orientation():
return True
@staticmethod
def detect_orientation(file_handle, lang):
raise OtherTesseractError("arbitrary status", "message")
@staticmethod
def image_to_string(file_handle, lang):
return "This is test text"
class FakePyOcr(object):
@staticmethod
def get_available_tools():
return [FakeTesseract]
class TestOCR(TestCase):
text_cases = [
("simple string", "simple string"),
(
"simple newline\n testing string",
"simple newline\ntesting string"
),
(
"utf-8 строка с пробелами в конце ",
"utf-8 строка с пробелами в конце"
)
]
SAMPLE_FILES = os.path.join(os.path.dirname(__file__), "samples")
TESSERACT_INSTALLED = bool(pyocr.get_available_tools())
def test_strip_excess_whitespace(self):
for source, result in self.text_cases:
actual_result = strip_excess_whitespace(source)
self.assertEqual(
result,
actual_result,
"strip_exceess_whitespace({}) != '{}', but '{}'".format(
source,
result,
actual_result
)
)
@skipIf(not TESSERACT_INSTALLED, "Tesseract not installed. Skipping")
@mock.patch("documents.consumer.Consumer.SCRATCH", SAMPLE_FILES)
@mock.patch("documents.consumer.pyocr", FakePyOcr)
def test_image_to_string_with_text_free_page(self):
"""
This test is sort of silly, since it's really just reproducing an odd
exception thrown by pyocr when it encounters a page with no text.
Actually running this test against an installation of Tesseract results
in a segmentation fault rooted somewhere deep inside pyocr where I
don't care to dig. Regardless, if you run the consumer normally,
text-free pages are now handled correctly so long as we work around
this weird exception.
"""
image_to_string(["no-text.png", "en"])

View File

@@ -3,6 +3,8 @@ from django.test import TestCase
from ..management.commands.document_importer import Command
from documents.settings import EXPORTER_FILE_NAME
class TestImporter(TestCase):
@@ -27,7 +29,7 @@ class TestImporter(TestCase):
cmd.manifest = [{
"model": "documents.document",
"__exported_file_name__": "noexist.pdf"
EXPORTER_FILE_NAME: "noexist.pdf"
}]
# self.assertRaises(CommandError, cmd._check_manifest)
with self.assertRaises(CommandError) as cm:

View File

@@ -149,6 +149,22 @@ class TestMatching(TestCase):
)
)
def test_match_fuzzy(self):
self._test_matching(
"Springfield, Miss.",
"MATCH_FUZZY",
(
"1220 Main Street, Springf eld, Miss.",
"1220 Main Street, Spring field, Miss.",
"1220 Main Street, Springfeld, Miss.",
"1220 Main Street Springfield Miss",
),
(
"1220 Main Street, Springfield, Mich.",
)
)
class TestApplications(TestCase):
"""

View File

@@ -0,0 +1,31 @@
from django.test import TestCase
from ..models import Document, Correspondent
from .factories import DocumentFactory, CorrespondentFactory
class CorrespondentTestCase(TestCase):
def test___str__(self):
for s in ("test", "οχι", "test with fun_charÅc'\"terß"):
correspondent = CorrespondentFactory.create(name=s)
self.assertEqual(str(correspondent), s)
class DocumentTestCase(TestCase):
def test_correspondent_deletion_does_not_cascade(self):
self.assertEqual(Correspondent.objects.all().count(), 0)
correspondent = CorrespondentFactory.create()
self.assertEqual(Correspondent.objects.all().count(), 1)
self.assertEqual(Document.objects.all().count(), 0)
DocumentFactory.create(correspondent=correspondent)
self.assertEqual(Document.objects.all().count(), 1)
self.assertIsNotNone(Document.objects.all().first().correspondent)
correspondent.delete()
self.assertEqual(Correspondent.objects.all().count(), 0)
self.assertEqual(Document.objects.all().count(), 1)
self.assertIsNone(Document.objects.all().first().correspondent)

View File

@@ -1,16 +1,16 @@
from django.http import HttpResponse
from django.views.decorators.csrf import csrf_exempt
from django.http import HttpResponse, HttpResponseBadRequest
from django.views.generic import DetailView, FormView, TemplateView
from django_filters.rest_framework import DjangoFilterBackend
from rest_framework.filters import SearchFilter, OrderingFilter
from paperless.db import GnuPG
from paperless.mixins import SessionOrBasicAuthMixin
from paperless.views import StandardPagination
from rest_framework.filters import OrderingFilter, SearchFilter
from rest_framework.mixins import (
DestroyModelMixin,
ListModelMixin,
RetrieveModelMixin,
UpdateModelMixin
)
from rest_framework.pagination import PageNumberPagination
from rest_framework.permissions import IsAuthenticated
from rest_framework.viewsets import (
GenericViewSet,
@@ -27,7 +27,6 @@ from .serialisers import (
LogSerializer,
TagSerializer
)
from .mixins import SessionOrBasicAuthMixin
class IndexView(TemplateView):
@@ -81,21 +80,12 @@ class PushView(SessionOrBasicAuthMixin, FormView):
form_class = UploadForm
@classmethod
def as_view(cls, **kwargs):
return csrf_exempt(FormView.as_view(**kwargs))
def form_valid(self, form):
return HttpResponse("1")
form.save()
return HttpResponse("1", status=202)
def form_invalid(self, form):
return HttpResponse("0")
class StandardPagination(PageNumberPagination):
page_size = 25
page_size_query_param = "page-size"
max_page_size = 100000
return HttpResponseBadRequest(str(form.errors))
class CorrespondentViewSet(ModelViewSet):

View File

@@ -84,3 +84,20 @@ def binaries_check(app_configs, **kwargs):
check_messages.append(Warning(error.format(binary), hint))
return check_messages
@register()
def config_check(app_configs, **kwargs):
warning = (
"It looks like you have PAPERLESS_SHARED_SECRET defined. Note that "
"in the \npast, this variable was used for both API authentication "
"and as the mail \nkeyword. As the API no longer uses it, this "
"variable has been renamed to \nPAPERLESS_EMAIL_SECRET, so if you're "
"using the mail feature, you'd best update \nyour variable name.\n\n"
"The old variable will stop working in a few months."
)
if os.getenv("PAPERLESS_SHARED_SECRET"):
return [Warning(warning)]
return []

46
src/paperless/mixins.py Normal file
View File

@@ -0,0 +1,46 @@
from django.contrib.auth.mixins import AccessMixin
from django.contrib.auth import authenticate, login
import base64
class SessionOrBasicAuthMixin(AccessMixin):
"""
Session or Basic Authentication mixin for Django.
It determines whether the requester is already logged in or has
provided valid HTTP Basic authorization. If so, it returns the view;
otherwise it falls through to handle_no_permission().
Base for mixin found here: https://djangosnippets.org/snippets/3073/
"""
def dispatch(self, request, *args, **kwargs):
# check if user is authenticated via the session
if request.user.is_authenticated:
# Already logged in, just return the view.
return super(SessionOrBasicAuthMixin, self).dispatch(
request, *args, **kwargs
)
# apparently not authenticated via session, maybe via HTTP Basic?
if 'HTTP_AUTHORIZATION' in request.META:
auth = request.META['HTTP_AUTHORIZATION'].split()
if len(auth) == 2:
# NOTE: Support for only basic authentication
if auth[0].lower() == "basic":
authString = base64.b64decode(auth[1]).decode('utf-8')
# Split on the first colon only: passwords may contain ":"
uname, passwd = authString.split(':', 1)
user = authenticate(username=uname, password=passwd)
if user is not None:
if user.is_active:
login(request, user)
request.user = user
return super(
SessionOrBasicAuthMixin, self
).dispatch(
request, *args, **kwargs
)
# nope, really not authenticated
return self.handle_no_permission()
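
The header handling in the mixin above can be sketched in isolation. This is a minimal, standard-library-only illustration and not the project's code: `parse_basic_auth` is a hypothetical helper name, and returning `None` for malformed input is an assumption about what a caller would want.

```python
import base64
import binascii


def parse_basic_auth(header):
    """Return (username, password) from a Basic auth header, or None."""
    parts = header.split()
    # Expect exactly "<scheme> <credentials>" with a case-insensitive scheme
    if len(parts) != 2 or parts[0].lower() != "basic":
        return None
    try:
        decoded = base64.b64decode(parts[1], validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        # Not valid base64, or not valid UTF-8 once decoded
        return None
    # Only the first colon separates the two fields
    username, separator, password = decoded.partition(":")
    if not separator:
        return None
    return username, password


token = base64.b64encode(b"alice:pa:ss").decode("ascii")
credentials = parse_basic_auth("Basic " + token)
```

Using `partition` keeps any later colons inside the password, and catching the decode errors means a malformed header is treated as "not authenticated" rather than raising.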

View File

@@ -4,10 +4,10 @@ Django settings for paperless project.
Generated by 'django-admin startproject' using Django 1.9.
For more information on this file, see
https://docs.djangoproject.com/en/1.9/topics/settings/
https://docs.djangoproject.com/en/1.10/topics/settings/
For the full list of settings and their values, see
https://docs.djangoproject.com/en/1.9/ref/settings/
https://docs.djangoproject.com/en/1.10/ref/settings/
"""
import os
@@ -25,7 +25,7 @@ BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
# Quick-start development settings - unsuitable for production
# See https://docs.djangoproject.com/en/1.9/howto/deployment/checklist/
# See https://docs.djangoproject.com/en/1.10/howto/deployment/checklist/
# The secret key has a default that should be fine so long as you're hosting
# Paperless on a closed network. However, if you're putting this anywhere
@@ -47,7 +47,8 @@ _allowed_hosts = os.getenv("PAPERLESS_ALLOWED_HOSTS")
if _allowed_hosts:
ALLOWED_HOSTS = _allowed_hosts.split(",")
FORCE_SCRIPT_NAME = os.getenv("PAPERLESS_FORCE_SCRIPT_NAME")
# Application definition
INSTALLED_APPS = [
@@ -61,15 +62,21 @@ INSTALLED_APPS = [
"django_extensions",
"documents.apps.DocumentsConfig",
"reminders.apps.RemindersConfig",
"paperless_tesseract.apps.PaperlessTesseractConfig",
"flat_responsive",
"django.contrib.admin",
"rest_framework",
"crispy_forms",
"django_filters"
]
if os.getenv("PAPERLESS_INSTALLED_APPS"):
INSTALLED_APPS += os.getenv("PAPERLESS_INSTALLED_APPS").split(",")
MIDDLEWARE_CLASSES = [
'django.middleware.security.SecurityMiddleware',
'django.contrib.sessions.middleware.SessionMiddleware',
@@ -103,7 +110,7 @@ WSGI_APPLICATION = 'paperless.wsgi.application'
# Database
# https://docs.djangoproject.com/en/1.9/ref/settings/#databases
# https://docs.djangoproject.com/en/1.10/ref/settings/#databases
DATABASES = {
"default": {
@@ -128,7 +135,7 @@ if os.getenv("PAPERLESS_DBUSER") and os.getenv("PAPERLESS_DBPASS"):
# Password validation
# https://docs.djangoproject.com/en/1.9/ref/settings/#auth-password-validators
# https://docs.djangoproject.com/en/1.10/ref/settings/#auth-password-validators
AUTH_PASSWORD_VALIDATORS = [
{
@@ -147,7 +154,7 @@ AUTH_PASSWORD_VALIDATORS = [
# Internationalization
# https://docs.djangoproject.com/en/1.9/topics/i18n/
# https://docs.djangoproject.com/en/1.10/topics/i18n/
LANGUAGE_CODE = 'en-us'
@@ -161,7 +168,7 @@ USE_TZ = True
# Static files (CSS, JavaScript, Images)
# https://docs.djangoproject.com/en/1.9/howto/static-files/
# https://docs.djangoproject.com/en/1.10/howto/static-files/
STATIC_ROOT = os.getenv(
"PAPERLESS_STATICDIR", os.path.join(BASE_DIR, "..", "static"))
@@ -231,18 +238,6 @@ CONSUMPTION_DIR = os.getenv("PAPERLESS_CONSUMPTION_DIR")
# slowly, you may want to use a higher value than the default.
CONSUMER_LOOP_TIME = int(os.getenv("PAPERLESS_CONSUMER_LOOP_TIME", 10))
# If you want to use IMAP mail consumption, populate this with useful values.
# If you leave HOST set to None, we assume you're not going to use this
# feature.
MAIL_CONSUMPTION = {
"HOST": os.getenv("PAPERLESS_CONSUME_MAIL_HOST"),
"PORT": os.getenv("PAPERLESS_CONSUME_MAIL_PORT"),
"USERNAME": os.getenv("PAPERLESS_CONSUME_MAIL_USER"),
"PASSWORD": os.getenv("PAPERLESS_CONSUME_MAIL_PASS"),
"USE_SSL": os.getenv("PAPERLESS_CONSUME_MAIL_USE_SSL", "y").lower() == "y", # If True, use SSL/TLS to connect
"INBOX": "INBOX" # The name of the inbox on the server
}
# This is used to encrypt the original documents and decrypt them later when
# you want to download them. Set it and change the permissions on this file to
# 0600, or set it to `None` and you'll be prompted for the passphrase at
@@ -252,11 +247,6 @@ MAIL_CONSUMPTION = {
# files.
PASSPHRASE = os.getenv("PAPERLESS_PASSPHRASE")
# If you intend to use the "API" to push files into the consumer, you'll need
# to provide a shared secret here. Leaving this as the default will disable
# the API.
SHARED_SECRET = os.getenv("PAPERLESS_SHARED_SECRET", "")
# Trigger a script after every successful document consumption?
PRE_CONSUME_SCRIPT = os.getenv("PAPERLESS_PRE_CONSUME_SCRIPT")
POST_CONSUME_SCRIPT = os.getenv("PAPERLESS_POST_CONSUME_SCRIPT")

View File

@@ -1,35 +1,22 @@
"""paperless URL Configuration
The `urlpatterns` list routes URLs to views. For more information please see:
https://docs.djangoproject.com/en/1.9/topics/http/urls/
Examples:
Function views
1. Add an import: from my_app import views
2. Add a URL to urlpatterns: url(r'^$', views.home, name='home')
Class-based views
1. Add an import: from other_app.views import Home
2. Add a URL to urlpatterns: url(r'^$', Home.as_view(), name='home')
Including another URLconf
1. Add an import: from blog import urls as blog_urls
2. Import the include() function: from django.conf.urls import url, include
3. Add a URL to urlpatterns: url(r'^blog/', include(blog_urls))
"""
from django.conf import settings
from django.conf.urls import url, static, include
from django.contrib import admin
from django.views.decorators.csrf import csrf_exempt
from rest_framework.routers import DefaultRouter
from documents.views import (
IndexView, FetchView, PushView,
FetchView, PushView,
CorrespondentViewSet, TagViewSet, DocumentViewSet, LogViewSet
)
from reminders.views import ReminderViewSet
router = DefaultRouter()
router.register(r'correspondents', CorrespondentViewSet)
router.register(r'tags', TagViewSet)
router.register(r'documents', DocumentViewSet)
router.register(r'logs', LogViewSet)
router.register(r"correspondents", CorrespondentViewSet)
router.register(r"documents", DocumentViewSet)
router.register(r"logs", LogViewSet)
router.register(r"reminders", ReminderViewSet)
router.register(r"tags", TagViewSet)
urlpatterns = [
@@ -40,9 +27,6 @@ urlpatterns = [
),
url(r"^api/", include(router.urls, namespace="drf")),
# Normal pages (coming soon)
# url(r"^$", IndexView.as_view(), name="index"),
# File downloads
url(
r"^fetch/(?P<kind>doc|thumb)/(?P<pk>\d+)$",
@@ -50,11 +34,18 @@ urlpatterns = [
name="fetch"
),
# File uploads
url(r"^push$", csrf_exempt(PushView.as_view()), name="push"),
# The Django admin
url(r"admin/", admin.site.urls),
url(r"", admin.site.urls), # This is going away
] + static.static(settings.MEDIA_URL, document_root=settings.MEDIA_ROOT)
if settings.SHARED_SECRET:
urlpatterns.insert(0, url(r"^push$", PushView.as_view(), name="push"))
# Text in each page's <h1> (and above login form).
admin.site.site_header = 'Paperless'
# Text at the end of each page's <title>.
admin.site.site_title = 'Paperless'
# Text at the top of the admin index page.
admin.site.index_title = 'Paperless administration'

View File

@@ -1 +1 @@
__version__ = (0, 3, 5)
__version__ = (0, 6, 1)

7
src/paperless/views.py Normal file
View File

@@ -0,0 +1,7 @@
from rest_framework.pagination import PageNumberPagination
class StandardPagination(PageNumberPagination):
page_size = 25
page_size_query_param = "page-size"
max_page_size = 100000

View File

@@ -4,7 +4,7 @@ WSGI config for paperless project.
It exposes the WSGI callable as a module-level variable named ``application``.
For more information on this file, see
https://docs.djangoproject.com/en/1.9/howto/deployment/wsgi/
https://docs.djangoproject.com/en/1.10/howto/deployment/wsgi/
"""
import os

View File

View File

@@ -0,0 +1,16 @@
from django.apps import AppConfig
class PaperlessTesseractConfig(AppConfig):
name = "paperless_tesseract"
def ready(self):
from documents.signals import document_consumer_declaration
from .signals import ConsumerDeclaration
document_consumer_declaration.connect(ConsumerDeclaration.handle)
AppConfig.ready(self)

View File

@@ -0,0 +1,214 @@
import itertools
import os
import re
import subprocess
from multiprocessing.pool import Pool
import langdetect
import pyocr
from django.conf import settings
from documents.parsers import DocumentParser, ParseError
from PIL import Image
from pyocr.libtesseract.tesseract_raw import \
TesseractError as OtherTesseractError
from pyocr.tesseract import TesseractError
from .languages import ISO639
class OCRError(Exception):
pass
class RasterisedDocumentParser(DocumentParser):
"""
This parser uses Tesseract to try and get some text out of a rasterised
image, whether it's a PDF, or other graphical format (JPEG, TIFF, etc.)
"""
CONVERT = settings.CONVERT_BINARY
DENSITY = settings.CONVERT_DENSITY if settings.CONVERT_DENSITY else 300
THREADS = int(settings.OCR_THREADS) if settings.OCR_THREADS else None
UNPAPER = settings.UNPAPER_BINARY
DEFAULT_OCR_LANGUAGE = settings.OCR_LANGUAGE
def get_thumbnail(self):
"""
The thumbnail of a PDF is just a 500px wide image of the first page.
"""
run_convert(
self.CONVERT,
"-scale", "500x5000",
"-alpha", "remove",
self.document_path, os.path.join(self.tempdir, "convert-%04d.png")
)
return os.path.join(self.tempdir, "convert-0000.png")
def get_text(self):
images = self._get_greyscale()
try:
return self._get_ocr(images)
except OCRError as e:
raise ParseError(e)
def _get_greyscale(self):
"""
Greyscale images are easier for Tesseract to OCR
"""
# Convert PDF to multiple PNMs
pnm = os.path.join(self.tempdir, "convert-%04d.pnm")
run_convert(
self.CONVERT,
"-density", str(self.DENSITY),
"-depth", "8",
"-type", "grayscale",
self.document_path, pnm,
)
# Get a list of converted images
pnms = []
for f in os.listdir(self.tempdir):
if f.endswith(".pnm"):
pnms.append(os.path.join(self.tempdir, f))
# Run unpaper in parallel on converted images
with Pool(processes=self.THREADS) as pool:
pool.map(run_unpaper, itertools.product([self.UNPAPER], pnms))
# Return list of converted images, processed with unpaper
pnms = []
for f in os.listdir(self.tempdir):
if f.endswith(".unpaper.pnm"):
pnms.append(os.path.join(self.tempdir, f))
return sorted(filter(lambda __: os.path.isfile(__), pnms))
def _guess_language(self, text):
try:
guess = langdetect.detect(text)
self.log("debug", "Language detected: {}".format(guess))
return guess
except Exception as e:
self.log("warning", "Language detection error: {}".format(e))
def _get_ocr(self, imgs):
"""
Attempts to do the best job possible OCR'ing the document based on
simple language detection trial & error.
"""
if not imgs:
raise OCRError("No images found")
self.log("info", "OCRing the document")
# int() rounds the division down, so this index is valid for every
# edge case, including a single-page document
middle = int(len(imgs) / 2)
raw_text = self._ocr([imgs[middle]], self.DEFAULT_OCR_LANGUAGE)
guessed_language = self._guess_language(raw_text)
if not guessed_language or guessed_language not in ISO639:
self.log("warning", "Language detection failed!")
if settings.FORGIVING_OCR:
self.log(
"warning",
"As FORGIVING_OCR is enabled, we're going to make the "
"best with what we have."
)
raw_text = self._assemble_ocr_sections(imgs, middle, raw_text)
return raw_text
raise OCRError("Language detection failed")
if ISO639[guessed_language] == self.DEFAULT_OCR_LANGUAGE:
raw_text = self._assemble_ocr_sections(imgs, middle, raw_text)
return raw_text
try:
return self._ocr(imgs, ISO639[guessed_language])
except pyocr.pyocr.tesseract.TesseractError:
if settings.FORGIVING_OCR:
self.log(
"warning",
"OCR for {} failed, but we're going to stick with what "
"we've got since FORGIVING_OCR is enabled.".format(
guessed_language
)
)
raw_text = self._assemble_ocr_sections(imgs, middle, raw_text)
return raw_text
raise OCRError(
"The guessed language is not available in this instance of "
"Tesseract."
)
def _ocr(self, imgs, lang):
"""
Performs a single OCR attempt.
"""
if not imgs:
return ""
self.log("info", "Parsing for {}".format(lang))
with Pool(processes=self.THREADS) as pool:
r = pool.map(image_to_string, itertools.product(imgs, [lang]))
r = " ".join(r)
# Strip out excess white space to allow matching to go smoother
return strip_excess_whitespace(r)
def _assemble_ocr_sections(self, imgs, middle, text):
"""
Given a `middle` value and the text that middle page represents, we OCR
the remainder of the document and return the whole thing.
"""
text = self._ocr(imgs[:middle], self.DEFAULT_OCR_LANGUAGE) + text
text += self._ocr(imgs[middle + 1:], self.DEFAULT_OCR_LANGUAGE)
return text
def run_convert(*args):
environment = os.environ.copy()
if settings.CONVERT_MEMORY_LIMIT:
environment["MAGICK_MEMORY_LIMIT"] = settings.CONVERT_MEMORY_LIMIT
if settings.CONVERT_TMPDIR:
environment["MAGICK_TMPDIR"] = settings.CONVERT_TMPDIR
subprocess.Popen(args, env=environment).wait()
def run_unpaper(args):
unpaper, pnm = args
subprocess.Popen(
(unpaper, pnm, pnm.replace(".pnm", ".unpaper.pnm"))).wait()
def strip_excess_whitespace(text):
collapsed_spaces = re.sub(r"([^\S\r\n]+)", " ", text)
no_leading_whitespace = re.sub(
"([\n\r]+)([^\S\n\r]+)", '\\1', collapsed_spaces)
no_trailing_whitespace = re.sub("([^\S\n\r]+)$", '', no_leading_whitespace)
return no_trailing_whitespace
def image_to_string(args):
img, lang = args
ocr = pyocr.get_available_tools()[0]
with Image.open(os.path.join(RasterisedDocumentParser.SCRATCH, img)) as f:
if ocr.can_detect_orientation():
try:
orientation = ocr.detect_orientation(f, lang=lang)
f = f.rotate(orientation["angle"], expand=1)
except (TesseractError, OtherTesseractError):
pass
return ocr.image_to_string(f, lang=lang)
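
The middle-page strategy used by `_get_ocr` and `_assemble_ocr_sections` above can be sketched without Tesseract. This is a simplified illustration under stated assumptions: `ocr_middle_page_first`, `fake_ocr` and the `guess_language` callback are hypothetical names, the ISO 639 mapping step is left out, and a failed guess is always treated as if FORGIVING_OCR were enabled.

```python
def ocr_middle_page_first(pages, default_lang, ocr, guess_language):
    """OCR the middle page first, then decide how to OCR the rest."""
    if not pages:
        raise ValueError("No images found")
    middle = len(pages) // 2
    sample = ocr([pages[middle]], default_lang)
    guessed = guess_language(sample)
    if guessed is None or guessed == default_lang:
        # The sample is already in the right language, so reuse it and
        # only OCR the pages on either side of it
        before = ocr(pages[:middle], default_lang)
        after = ocr(pages[middle + 1:], default_lang)
        return before + sample + after
    # A different language was detected: redo every page in that language
    return ocr(pages, guessed)


def fake_ocr(pages, lang):
    # Stand-in for the real OCR call: tags each page with the language used
    return "".join("[{}:{}]".format(page, lang) for page in pages)


same = ocr_middle_page_first(["p0", "p1", "p2"], "eng", fake_ocr, lambda text: "eng")
other = ocr_middle_page_first(["p0", "p1", "p2"], "eng", fake_ocr, lambda text: "deu")
```

When the guess matches the default language the middle page is OCR'd only once, which is the saving the approach is after.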

View File

@@ -0,0 +1,23 @@
import re
from .parsers import RasterisedDocumentParser
class ConsumerDeclaration(object):
MATCHING_FILES = re.compile("^.*\.(pdf|jpg|gif|png|tiff?|pnm|bmp)$")
@classmethod
def handle(cls, sender, **kwargs):
return cls.test
@classmethod
def test(cls, doc):
if cls.MATCHING_FILES.match(doc.lower()):
return {
"parser": RasterisedDocumentParser,
"weight": 0
}
return None
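
To see which file names the pattern above accepts, here is a small standalone check. It copies the regular expression as shown (written as a raw string) and wraps it in a hypothetical `matches` helper. One observation, offered tentatively: the alternation lists `jpg` but not `jpeg`, so a name ending in `.jpeg` would not match unless it is normalised somewhere else.

```python
import re

# Same pattern as ConsumerDeclaration.MATCHING_FILES, as a raw string
MATCHING_FILES = re.compile(r"^.*\.(pdf|jpg|gif|png|tiff?|pnm|bmp)$")


def matches(name):
    # Lower-case first, as ConsumerDeclaration.test() does
    return bool(MATCHING_FILES.match(name.lower()))


results = {
    name: matches(name)
    for name in ("Scan.TIF", "A document with a . in it.pdf", "doc.txt", "photo.jpeg", "doc")
}
```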

View File

Binary image file (32 KiB before and after).

View File

@@ -0,0 +1,80 @@
import os
from unittest import mock, skipIf
import pyocr
from django.test import TestCase
from pyocr.libtesseract.tesseract_raw import \
TesseractError as OtherTesseractError
from ..parsers import image_to_string, strip_excess_whitespace
class FakeTesseract(object):
@staticmethod
def can_detect_orientation():
return True
@staticmethod
def detect_orientation(file_handle, lang):
raise OtherTesseractError("arbitrary status", "message")
@staticmethod
def image_to_string(file_handle, lang):
return "This is test text"
class FakePyOcr(object):
@staticmethod
def get_available_tools():
return [FakeTesseract]
class TestOCR(TestCase):
text_cases = [
("simple string", "simple string"),
(
"simple newline\n testing string",
"simple newline\ntesting string"
),
(
"utf-8 строка с пробелами в конце ",
"utf-8 строка с пробелами в конце"
)
]
SAMPLE_FILES = os.path.join(os.path.dirname(__file__), "samples")
TESSERACT_INSTALLED = bool(pyocr.get_available_tools())
def test_strip_excess_whitespace(self):
for source, result in self.text_cases:
actual_result = strip_excess_whitespace(source)
self.assertEqual(
result,
actual_result,
"strip_excess_whitespace({}) != '{}', but '{}'".format(
source,
result,
actual_result
)
)
@skipIf(not TESSERACT_INSTALLED, "Tesseract not installed. Skipping")
@mock.patch(
"paperless_tesseract.parsers.RasterisedDocumentParser.SCRATCH",
SAMPLE_FILES
)
@mock.patch("paperless_tesseract.parsers.pyocr", FakePyOcr)
def test_image_to_string_with_text_free_page(self):
"""
This test is sort of silly, since it's really just reproducing an odd
exception thrown by pyocr when it encounters a page with no text.
Actually running this test against an installation of Tesseract results
in a segmentation fault rooted somewhere deep inside pyocr where I
don't care to dig. Regardless, if you run the consumer normally,
text-free pages are now handled correctly so long as we work around
this weird exception.
"""
image_to_string(["no-text.png", "en"])
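
The whitespace cases exercised by `test_strip_excess_whitespace` can be tried on their own. The sketch below restates the three substitutions from `strip_excess_whitespace` with raw-string patterns; the sample inputs are illustrative and were chosen here, since the runs of spaces in the original test data may not have survived in this view.

```python
import re


def strip_excess_whitespace(text):
    # 1. Collapse runs of spaces and tabs (but not newlines) to one space
    collapsed = re.sub(r"[^\S\r\n]+", " ", text)
    # 2. Drop spaces that directly follow a line break
    no_leading = re.sub(r"([\n\r]+)[^\S\n\r]+", r"\1", collapsed)
    # 3. Drop spaces at the very end of the text
    return re.sub(r"[^\S\n\r]+$", "", no_leading)


examples = [
    strip_excess_whitespace("simple   string"),
    strip_excess_whitespace("simple newline\n   testing string"),
    strip_excess_whitespace("utf-8 строка   "),
]
```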

View File

@@ -0,0 +1,36 @@
from django.test import TestCase
from ..signals import ConsumerDeclaration
class SignalsTestCase(TestCase):
def test_test_handles_various_file_names_true(self):
prefixes = (
"doc", "My Document", "Μυ Γρεεκ Δοψθμεντ", "Doc -with - tags",
"A document with a . in it", "Doc with -- in it"
)
suffixes = (
"pdf", "jpg", "gif", "png", "tiff", "tif", "pnm", "bmp",
"PDF", "JPG", "GIF", "PNG", "TIFF", "TIF", "PNM", "BMP",
"pDf", "jPg", "gIf", "pNg", "tIff", "tIf", "pNm", "bMp",
)
for prefix in prefixes:
for suffix in suffixes:
name = "{}.{}".format(prefix, suffix)
self.assertTrue(ConsumerDeclaration.test(name))
def test_test_handles_various_file_names_false(self):
prefixes = ("doc",)
suffixes = ("txt", "markdown", "",)
for prefix in prefixes:
for suffix in suffixes:
name = "{}.{}".format(prefix, suffix)
self.assertFalse(ConsumerDeclaration.test(name))
self.assertFalse(ConsumerDeclaration.test(""))
self.assertFalse(ConsumerDeclaration.test("doc"))

View File

@@ -1,3 +1,7 @@
[pytest]
DJANGO_SETTINGS_MODULE=paperless.settings
env =
PAPERLESS_CONSUME=/tmp
PAPERLESS_PASSPHRASE=THISISNOTASECRET
PAPERLESS_SECRET=paperless
PAPERLESS_EMAIL_SECRET=paperless

View File

20
src/reminders/admin.py Normal file
View File

@@ -0,0 +1,20 @@
from django.conf import settings
from django.contrib import admin
from .models import Reminder
class ReminderAdmin(admin.ModelAdmin):
class Media:
css = {
"all": ("paperless.css",)
}
list_per_page = settings.PAPERLESS_LIST_PER_PAGE
list_display = ("date", "document", "note")
list_filter = ("date",)
list_editable = ("note",)
admin.site.register(Reminder, ReminderAdmin)

5
src/reminders/apps.py Normal file
View File

@@ -0,0 +1,5 @@
from django.apps import AppConfig
class RemindersConfig(AppConfig):
name = "reminders"

14
src/reminders/filters.py Normal file
View File

@@ -0,0 +1,14 @@
from django_filters.rest_framework import CharFilter, FilterSet
from .models import Reminder
class ReminderFilterSet(FilterSet):
class Meta(object):
model = Reminder
fields = {
"document": ["exact"],
"date": ["gt", "lt", "gte", "lte", "exact"],
"note": ["istartswith", "iendswith", "icontains"]
}
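
The lookups declared in `ReminderFilterSet` translate into query-string parameters on the reminders endpoint. The sketch below builds such a URL with the standard library only. The host is a placeholder, `reminder_query` is a hypothetical helper, and the parameter names assume django-filter's usual `field__lookup` naming (with plain `field` for `exact`) and Django REST framework's default `ordering` parameter.

```python
from urllib.parse import urlencode


def reminder_query(base_url, **filters):
    """Build a URL for the reminders list with the given filter parameters."""
    # Sort the parameters so the resulting URL is stable
    query = urlencode(sorted(filters.items()))
    return "{}/api/reminders/?{}".format(base_url.rstrip("/"), query)


url = reminder_query(
    "http://localhost:8000",  # placeholder host
    date__gte="2017-04-01",
    note__icontains="tax",
    ordering="-date",
)
```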

View File

@@ -0,0 +1,27 @@
# -*- coding: utf-8 -*-
# Generated by Django 1.10.5 on 2017-03-25 15:58
from __future__ import unicode_literals
from django.db import migrations, models
import django.db.models.deletion
class Migration(migrations.Migration):
initial = True
dependencies = [
('documents', '0016_auto_20170325_1558'),
]
operations = [
migrations.CreateModel(
name='Reminder',
fields=[
('id', models.AutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')),
('date', models.DateTimeField()),
('note', models.TextField(blank=True)),
('document', models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, to='documents.Document')),
],
),
]

View File

8
src/reminders/models.py Normal file
View File

@@ -0,0 +1,8 @@
from django.db import models
class Reminder(models.Model):
document = models.ForeignKey("documents.Document")
date = models.DateTimeField()
note = models.TextField(blank=True)

View File

@@ -0,0 +1,14 @@
from documents.models import Document
from rest_framework import serializers
from .models import Reminder
class ReminderSerializer(serializers.HyperlinkedModelSerializer):
document = serializers.HyperlinkedRelatedField(
view_name="drf:document-detail", queryset=Document.objects)
class Meta(object):
model = Reminder
fields = ("id", "document", "date", "note")

3
src/reminders/tests.py Normal file
View File

@@ -0,0 +1,3 @@
from django.test import TestCase
# Create your tests here.

22
src/reminders/views.py Normal file
View File

@@ -0,0 +1,22 @@
from django_filters.rest_framework import DjangoFilterBackend
from rest_framework.filters import OrderingFilter
from rest_framework.permissions import IsAuthenticated
from rest_framework.viewsets import (
ModelViewSet,
)
from .filters import ReminderFilterSet
from .models import Reminder
from .serialisers import ReminderSerializer
from paperless.views import StandardPagination
class ReminderViewSet(ModelViewSet):
model = Reminder
queryset = Reminder.objects
serializer_class = ReminderSerializer
pagination_class = StandardPagination
permission_classes = (IsAuthenticated,)
filter_backends = (DjangoFilterBackend, OrderingFilter)
filter_class = ReminderFilterSet
ordering_fields = ("date", "document")

View File

@@ -5,15 +5,11 @@
[tox]
skipsdist = True
envlist = py34, py35, pep8
envlist = py34, py35, py36, pep8
[testenv]
commands = {envpython} manage.py test
commands = pytest
deps = -r{toxinidir}/../requirements.txt
setenv =
PAPERLESS_CONSUME=/tmp
PAPERLESS_PASSPHRASE=THISISNOTASECRET
PAPERLESS_SECRET=paperless
[testenv:pep8]
commands=pep8