166 Commits

Author SHA1 Message Date
Daniel Quinn
5f0962bc3e Travis integration: take 6 2016-02-21 01:58:09 +00:00
Daniel Quinn
300dc97e83 Travis integration: take 5 2016-02-21 01:53:10 +00:00
Daniel Quinn
e0b2d27e01 Travis integration: take 4 2016-02-21 01:50:04 +00:00
Daniel Quinn
6f7169d2d6 Travis integration: take 3 2016-02-21 01:46:49 +00:00
Daniel Quinn
55a7dc2444 pep8 2016-02-21 01:43:48 +00:00
Daniel Quinn
c7787bc076 Let's see if I can get Travis CI working on the first try 2016-02-21 01:37:57 +00:00
Daniel Quinn
0d46643026 Version bump 2016-02-21 01:24:30 +00:00
Daniel Quinn
17d3a44952 A crude API is in place 2016-02-21 00:55:38 +00:00
Daniel Quinn
809fb8fa1f Moved the default GNUPG home to /tmp for tox-friendliness 2016-02-21 00:29:59 +00:00
Daniel Quinn
440614eddc Got tox working 2016-02-21 00:29:21 +00:00
Daniel Quinn
422ae9303a pep8 2016-02-21 00:14:50 +00:00
Daniel Quinn
a5124cade6 Merge branch 'master' into feature/api 2016-02-20 22:55:42 +00:00
Daniel Quinn
224f4acdc3 Merge branch 'master' of github.com:danielquinn/paperless 2016-02-20 22:50:58 +00:00
Daniel Quinn
51b19f4c19 Issue #57 2016-02-20 22:30:01 +00:00
Daniel Quinn
5c6aa201be Merge pull request #50 from tikitu/docker-tweaks
Some small tweaks to the Docker setup and documentation
2016-02-20 00:06:12 +00:00
Tikitu de Jager
438b161a25 Move docker-compose.env to docker-compose.env.example & adjust docs
This file, like `docker-compose.yml`, should be edited by the user. To
avoid merge conflicts when pulling updates, the edited version should
not be committed to the repository.
2016-02-19 22:51:49 +02:00
Tikitu de Jager
147f8f72a2 Simplify instructions for exporting with docker
The export workflow reusing the `/consume` volume is complex and error-
prone, and not at all necessary if the `docker-compose.yml` file has a
volume for `/export` from the beginning.
2016-02-19 22:27:48 +02:00
Daniel Quinn
3a8755e4c8 Document the retagger
Fixes #54
2016-02-19 17:26:40 +00:00
Daniel Quinn
d9602312b1 Merge pull request #52 from pitkley/fix/detect-orientation-errors
Ignore error if orientation detection fails
2016-02-19 09:13:14 +00:00
Pit Kleyersburg
c45f951ca0 Ignore error if orientation detection fails
Fixes an additional issue that came up in #48.
2016-02-19 09:52:32 +01:00
Daniel Quinn
ec88ea73f6 #48: make the tag matching smarter 2016-02-19 00:45:02 +00:00
Daniel Quinn
99be40a433 Merge pull request #39 from pitkley/feature/dockerfile
Add Dockerfile for application and documentation
2016-02-18 22:01:54 +00:00
Pit Kleyersburg
724afa59c7 Add Dockerfile for application and documentation
This commit adds a `Dockerfile` to the root of the project, accompanied
by a `docker-compose.yml.example` for simplified deployment. The
`Dockerfile` is agnostic to whether it will be the webserver, the
consumer, or if it is run for a one-off command (i.e. creation of a
superuser, migration of the database, document export, ...).

The container's entrypoint is the `scripts/docker-entrypoint.sh` script.
This script verifies that the required permissions are set, remaps the
default user and/or group ID if required, and installs additional
languages if the user requests them.

After initialization, it analyzes the command the user supplied:

  - If the command starts with a slash, it is expected that the user
    wants to execute a binary file and the command will be executed
    without further intervention. (Using `exec` to effectively replace
    the started shell-script and not have any reaping-issues.)

  - If the command does not start with a slash, the command will be
    passed directly to the `manage.py` script without further
    modification. (Again using `exec`.)

The default command is set to `--help`.

If the user wants to execute a command that is not meant for `manage.py`
but doesn't start with a slash, the Docker `--entrypoint` parameter can
be used to circumvent the mechanics of `docker-entrypoint.sh`.

Further information can be found in `docs/setup.rst` and in
`docs/migrating.rst`.

For additional convenience, a `Dockerfile` has been added to the `docs/`
directory which allows for easy building and serving of the
documentation. This is documented in `docs/requirements.rst`.
2016-02-18 22:58:32 +01:00
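The dispatch rule described in the commit above can be sketched as follows. This is an illustration in Python, not the actual `scripts/docker-entrypoint.sh` (which is a shell script); the function name `resolve_command` and the `MANAGE_PY` value are hypothetical.

```python
from typing import List, Sequence

# Hypothetical location of manage.py; the real path inside the image may differ.
MANAGE_PY = "manage.py"


def resolve_command(args: Sequence[str]) -> List[str]:
    """Return the argv the entrypoint would hand to exec."""
    # With no command supplied, fall back to the default `--help`.
    argv = list(args) or ["--help"]
    if argv[0].startswith("/"):
        # Leading slash: treat it as a binary and run it unchanged.
        return argv
    # Anything else is passed straight through to manage.py.
    return [MANAGE_PY] + argv
```

In a real entrypoint the result would be handed to something like `os.execvp(argv[0], argv)`, so that the command replaces the wrapper process, which is the role `exec` plays in the shell script.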
Daniel Quinn
57bcb883bf Merge pull request #49 from pitkley/feature/detect-orientation
Detect image orientation if the OCR supports it. Fixes #47
2016-02-18 11:36:08 +00:00
Pit Kleyersburg
c34d57a872 Detect image orientation if the OCR supports it
Fixes issue #47.
2016-02-18 09:37:13 +01:00
Daniel Quinn
1e7ece81ee Fixes #45 2016-02-17 23:07:54 +00:00
Daniel Quinn
eb01bcf98b The Log class needed a __str__() method 2016-02-17 23:06:35 +00:00
Daniel Quinn
1c45ca10d4 Patched sorting 2016-02-17 00:11:57 +00:00
Daniel Quinn
550184cbae Patched sorting 2016-02-17 00:11:46 +00:00
Daniel Quinn
52f242574f Merge branch 'pitkley-fix/secure-temporary-files' 2016-02-17 00:10:54 +00:00
Daniel Quinn
6f95b05287 Support appropriate sorting for long documents 2016-02-17 00:10:05 +00:00
Pit Kleyersburg
46f8f492f5 Safely and non-randomly create scratch directory
Creating the scratch files in `_get_grayscale` using a random integer is,
for one, inherently unsafe and can cause a collision. For another, it
should be unnecessary given that the files will be cleaned up after
the OCR run.

Since we don't know if OCR runs might be parallel in the future, this
commit implements thread-safe and deterministic directory-creation.

Additionally, it fixes the call to `_cleanup` by `consume`. In the
current implementation, `_cleanup` is not called if the last
consumed document failed with an `OCRError`; this commit fixes that.
2016-02-16 12:15:57 +01:00
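A minimal sketch of the two ideas in the commit above: collision-safe scratch directories and cleanup that runs even when OCR fails. It uses `tempfile.mkdtemp` from the standard library, which may not be what the commit itself does; the `consume` signature, the `ocr` callable, and the `OCRError` class defined here are stand-ins for illustration.

```python
import shutil
import tempfile


class OCRError(Exception):
    """Stand-in for the project's OCR failure exception."""


def consume(documents, ocr, scratch_root=None):
    """Run `ocr(document, scratch_dir)` for each document.

    Returns the documents that failed with OCRError.
    """
    failed = []
    for document in documents:
        # mkdtemp creates the directory atomically, so two concurrent
        # runs cannot end up sharing a scratch directory.
        scratch = tempfile.mkdtemp(prefix="paperless-", dir=scratch_root)
        try:
            ocr(document, scratch)
        except OCRError:
            failed.append(document)
        finally:
            # Runs on success and on failure, including for the last document.
            shutil.rmtree(scratch, ignore_errors=True)
    return failed
```

Putting the cleanup in `finally` is what guarantees that a failure on the last document does not leave scratch files behind.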
Daniel Quinn
cebc44f2c9 API is halfway there 2016-02-16 09:28:34 +00:00
Daniel Quinn
bbe7a02b4d Added a screenshot and cleaned things up a bit. 2016-02-16 09:22:51 +00:00
Daniel Quinn
5de4951a46 Added a screenshot, now I have to figure out how to put it in the readme. 2016-02-16 09:08:35 +00:00
Daniel Quinn
8a5d4b1cc8 Merge branch 'master' of github.com:danielquinn/paperless 2016-02-15 22:38:25 +00:00
Daniel Quinn
2f0da8ab25 Added download_url to the Document model 2016-02-15 22:38:18 +00:00
Daniel Quinn
a256d5ee2f Merge pull request #37 from jat255/DOCFIX_documentation_badge
Make docs badge in readme redirect to documentation, not image
2016-02-15 16:59:30 +00:00
Joshua Taillon
d2757707b3 Make docs badge in readme redirect to documentation, not image 2016-02-15 11:58:07 -05:00
Daniel Quinn
9a437dc9f6 Merge pull request #35 from pitkley/fix/matching-logic
Fix matching if user supplied an empty value
2016-02-14 19:21:50 +00:00
Pit Kleyersburg
7b227ffa2f Fix matching if user supplied an empty value 2016-02-14 19:47:05 +01:00
Daniel Quinn
aea4af5d3b Version bump and feature update 2016-02-14 17:18:28 +00:00
Daniel Quinn
a0f4f6c5f2 Fixed merge conflict and did some pep8 2016-02-14 17:13:48 +00:00
Daniel Quinn
4689e2b975 Merge pull request #32 from pitkley/feature/single-page-langdetect
Detect language only on first page of PDF
2016-02-14 16:56:30 +00:00
Pit Kleyersburg
aeab9a0e81 Detect language only on one page of PDF
Currently, the entire document is processed to detect the language. If
a language other than the default is detected, the entire
document is processed again for the new language.

This PR analyzes the middle page for its language and either processes
the remaining pages with the default language if it didn't differ, or
processes all pages for the new guessed language.

The number of processed pages comes down from a worst case of `2n` to
a worst case of `n+1`.
2016-02-14 17:55:13 +01:00
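The page-count argument in the commit above can be sketched like this. The function and parameter names (`ocr_document`, `ocr_page`, `guess_language`) are hypothetical and do not reflect the project's actual API; both callables are supplied by the caller.

```python
def ocr_document(pages, ocr_page, guess_language, default_language="eng"):
    """OCR every page, detecting the language from the middle page only.

    `ocr_page(page, language)` returns the page text;
    `guess_language(text)` returns a language code (or None if unsure).
    """
    if not pages:
        return []
    middle = len(pages) // 2
    # First pass: only the middle page, using the default language.
    middle_text = ocr_page(pages[middle], default_language)
    language = guess_language(middle_text) or default_language
    if language == default_language:
        # Reuse the middle page's text: n OCR calls in total.
        return [
            middle_text if index == middle else ocr_page(page, language)
            for index, page in enumerate(pages)
        ]
    # The language differed: redo every page, n + 1 OCR calls in total.
    return [ocr_page(page, language) for page in pages]
```

For a document of `n` pages this makes `n` OCR calls when the guess matches the default and `n + 1` when it does not, compared with up to `2n` when the whole document is processed twice.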
Daniel Quinn
7843ea5037 Added and implemented a rudimentary logger 2016-02-14 16:09:52 +00:00
Daniel Quinn
9162e41507 Merge pull request #33 from pitkley/fix/parallelism
Ensure `OCR_THREADS` is integer, add documentation
2016-02-14 15:40:20 +00:00
Pit Kleyersburg
20b2408dbb Ensure OCR_THREADS is integer, add documentation 2016-02-14 16:37:38 +01:00
Daniel Quinn
88acf50fe0 Merge pull request #31 from pitkley/feature/paralellism
This is great. It seriously sped up the OCR time.
2016-02-14 15:29:05 +00:00
Pit Kleyersburg
f5beda9c56 Enable parallel OCR processing
At the moment, every page in a PDF is processed one by one using
tesseract. Since the processing of a single page is independent of every
other page, one can make use of multi-core machines.

This PR introduces a multiprocessing pool to process multiple pages
simultaneously. The number of threads to use can be specified in the
environment variable `PAPERLESS_OCR_THREADS`. This will default to the
number of cores/hyperthreads Python detects for your system.
2016-02-14 15:57:42 +01:00
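A rough sketch of the approach described in the commit above, assuming a stand-in `ocr_page` function in place of the real tesseract call. The helper names `get_thread_count` and `ocr_pages` are hypothetical; only the `PAPERLESS_OCR_THREADS` variable comes from the commit message.

```python
import multiprocessing
import os


def get_thread_count(environ=os.environ):
    """Read PAPERLESS_OCR_THREADS as an integer, defaulting to the CPU count."""
    raw = environ.get("PAPERLESS_OCR_THREADS")
    if raw is None:
        return multiprocessing.cpu_count()
    # int() raises ValueError on a non-numeric setting instead of
    # letting a string reach the pool.
    return int(raw)


def ocr_page(page):
    """Stand-in for the real per-page OCR call."""
    return page.upper()


def ocr_pages(pages, threads=None):
    """Process pages in parallel; results come back in page order."""
    with multiprocessing.Pool(threads or get_thread_count()) as pool:
        return pool.map(ocr_page, pages)


if __name__ == "__main__":
    print(ocr_pages(["page one", "page two", "page three"]))
```

`Pool.map` preserves input order, so the page texts can be concatenated directly once all workers finish. Converting the environment value with `int()` reflects the later fix that ensures `OCR_THREADS` is an integer.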