9872 Commits

Author SHA1 Message Date
Pit Kleyersburg
aeab9a0e81 Detect language only on one page of PDF
Currently, the entire document is processed in order to detect its
language. If the detected language differs from the default one, the
entire document is processed again for the new language.

This PR analyzes the middle page for its language and either processes
the remaining pages with the default language if it didn't differ, or
processes all pages for the new guessed language.

The number of processed pages drops from a worst case of `2n` to a
worst case of `n+1`.
2016-02-14 17:55:13 +01:00
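
The middle-page strategy described in this commit message can be sketched roughly as follows. This is an illustrative sketch only, not the code from the commit: `ocr_document`, `ocr_page`, and `guess_language` are hypothetical names, and the OCR and language-guessing steps are passed in as plain callables so the page-count behaviour can be observed without tesseract.

```python
def ocr_document(pages, default_lang, ocr_page, guess_language):
    """OCR every page, detecting the language from the middle page only.

    `ocr_page(page, lang)` and `guess_language(text)` are hypothetical
    callables standing in for the real OCR and language-detection steps.
    """
    if not pages:
        return []

    middle = len(pages) // 2
    # OCR only the middle page with the default language to get a sample.
    sample = ocr_page(pages[middle], default_lang)
    guessed = guess_language(sample) or default_lang

    if guessed == default_lang:
        # Reuse the sample; only the remaining n - 1 pages need OCR (n calls).
        return [
            sample if i == middle else ocr_page(page, default_lang)
            for i, page in enumerate(pages)
        ]

    # The language differs: OCR all n pages with it (n + 1 calls in total).
    return [ocr_page(page, guessed) for page in pages]


# Fake OCR and language guessing, used only to count calls.
calls = []


def fake_ocr(page, lang):
    calls.append((page, lang))
    return f"{page}:{lang}"


def fake_guess(text):
    return "deu"


if __name__ == "__main__":
    result = ocr_document(["p1", "p2", "p3"], "eng", fake_ocr, fake_guess)
    print(result, len(calls))
```

With three pages and a guessed language that differs from the default, the fake OCR function is called four times, which matches the `n+1` worst case stated above.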
Daniel Quinn
7843ea5037 Added and implemented a rudimentary logger 2016-02-14 16:09:52 +00:00
Daniel Quinn
9162e41507 Merge pull request #33 from pitkley/fix/parallelism
Ensure `OCR_THREADS` is integer, add documentation
2016-02-14 15:40:20 +00:00
Pit Kleyersburg
20b2408dbb Ensure OCR_THREADS is integer, add documentation 2016-02-14 16:37:38 +01:00
Daniel Quinn
88acf50fe0 Merge pull request #31 from pitkley/feature/paralellism
This is great. It seriously sped up the OCR time.
2016-02-14 15:29:05 +00:00
Pit Kleyersburg
f5beda9c56 Enable parallel OCR processing
At the moment, every page in a PDF is processed one by one using
tesseract. Since the processing of a single page is independent of every
other page, one can make use of multi-core machines.

This PR introduces a multiprocessing pool to process multiple pages
simultaneously. The number of threads to use can be specified in the
environment variable `PAPERLESS_OCR_THREADS`. This defaults to the
number of cores/hyperthreads Python detects for your system.
2016-02-14 15:57:42 +01:00
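
A minimal sketch of the approach this commit message describes, assuming a `multiprocessing.Pool` sized from `PAPERLESS_OCR_THREADS`. The function names `ocr_threads`, `ocr_page`, and `ocr_pages` are hypothetical, and `ocr_page` is a placeholder that does not call tesseract; the actual implementation in the commit may differ.

```python
import multiprocessing
import os


def ocr_threads(environ=os.environ):
    """Return the worker count as an integer.

    Reads PAPERLESS_OCR_THREADS and falls back to the CPU count that
    Python detects. Environment values are strings, hence the int().
    """
    return int(environ.get("PAPERLESS_OCR_THREADS", multiprocessing.cpu_count()))


def ocr_page(page):
    """Placeholder for the per-page OCR step (no tesseract call here)."""
    return page.upper()


def ocr_pages(pages, threads=None):
    """Process pages in parallel; pool.map keeps results in page order."""
    with multiprocessing.Pool(threads or ocr_threads()) as pool:
        return pool.map(ocr_page, pages)


if __name__ == "__main__":
    print(ocr_pages(["page one", "page two", "page three"]))
```

Because each page is independent, `pool.map` can hand pages to workers in any order while still returning the results in the original page order.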
Daniel Quinn
6b0a537bff Added support for a shared secret in email 2016-02-14 03:01:24 +00:00
Daniel Quinn
3b5d4cdd39 Added some error handling 2016-02-14 01:32:25 +00:00
Daniel Quinn
fc5d89c6fc Added a default algorithm 2016-02-14 01:30:41 +00:00
Daniel Quinn
d9b7851de9 Added a default algorithm 2016-02-14 01:30:18 +00:00
Daniel Quinn
cec9968cdb Documented consumption 2016-02-14 00:10:49 +00:00
Daniel Quinn
330dfa544b Fixed a typo in the description. There's no need for a new migration here. 2016-02-14 00:10:37 +00:00
Daniel Quinn
294f104474 Merge branch 'master' into feature/images-as-docs 2016-02-13 01:01:10 +00:00
Daniel Quinn
68fa7d68fa Merge branch 'master' of github.com:danielquinn/paperless 2016-02-13 00:59:36 +00:00
Daniel Quinn
2ed2d641b5 Added a note about the plight of Apple users. 2016-02-13 00:59:19 +00:00
Daniel Quinn
a846b3f7b8 Adding some more debugging 2016-02-13 00:57:05 +00:00
Daniel Quinn
b7859a0ff3 Merge pull request #26 from wttw/master
Document cloning from public URL rather than ssh
2016-02-12 20:30:07 +00:00
Steve Atkins
a4903049a3 Document cloning from public URL rather than ssh 2016-02-12 11:36:07 -08:00
Daniel Quinn
9ed8a2b2d7 Merge branch 'master' into feature/images-as-docs 2016-02-12 09:03:46 +00:00
Daniel Quinn
1d4b87ee46 Update for #22 2016-02-12 08:54:04 +00:00
Daniel Quinn
840472071c Added the required verbosity reference 2016-02-12 08:27:28 +00:00
Daniel Quinn
2421f559be Simpler regex 2016-02-12 08:27:09 +00:00
Daniel Quinn
a022fcb8f1 Fixed the auto-naming regexes 2016-02-11 22:05:55 +00:00
Daniel Quinn
7aadab23cc Added the Renderable mixin because DRY 2016-02-11 22:05:38 +00:00
Daniel Quinn
ef1639208c Tests for the consumer 2016-02-11 12:25:23 +00:00
Daniel Quinn
cef4abc01d version bump 2016-02-11 12:25:12 +00:00
Daniel Quinn
78ee138ad7 Added migration and changelog updates 2016-02-11 12:25:00 +00:00
Daniel Quinn
c423a13f85 Added a simple re-tagger 2016-02-11 12:24:18 +00:00
Daniel Quinn
39134b517e Cleaned up file_name() 2016-02-10 23:53:48 +00:00
Daniel Quinn
a892abc701 Added dateutil 2016-02-10 23:50:58 +00:00
Daniel Quinn
4a078dcfbc Merge branch 'master' into feature/images-as-docs 2016-02-09 17:20:45 +00:00
Daniel Quinn
642b2f7ee3 Merge pull request #18 from mrwacky42/master
Add other prerequisites for Vagrant
2016-02-09 09:41:53 +00:00
Sharif Nassar
6115b2f03d Add other prerequisites
Vagrant setup didn't work for me unless I manually installed tesseract and ImageMagick.
2016-02-09 01:07:48 -08:00
Daniel Quinn
0eaed36420 The 'API' is written but untested 2016-02-08 23:46:16 +00:00
Daniel Quinn
212752f46e Fixt the tags to be optional 2016-02-08 17:28:59 +00:00
Daniel Quinn
0c729e5675 Changed the name, forgot to change the check.
Closes #17
2016-02-08 11:14:57 +00:00
Daniel Quinn
e5e4ee0350 Added file magic 2016-02-08 11:12:14 +00:00
Daniel Quinn
c4311af263 Cleaned up the tests 2016-02-06 17:41:11 +00:00
Daniel Quinn
febb45af81 Prettied up the interface a little 2016-02-06 17:27:17 +00:00
Daniel Quinn
ce69e37256 Linked tag labels 2016-02-06 17:14:44 +00:00
Daniel Quinn
48761911b3 Image imports and consumption by mail work 2016-02-06 17:05:36 +00:00
Daniel Quinn
71075a691a The mailconsumer isn't a consumer at all. Best fixt that 2016-02-05 20:15:08 +00:00
Daniel Quinn
d8ad6b589b Added pytest and broke up the consumer into file and mail 2016-02-05 00:23:36 +00:00
Daniel Quinn
3bc89d23c8 Sorting the filters 2016-02-03 17:20:12 +00:00
Daniel Quinn
a70b40f618 Broke the consumer script into separate files and started on a mail consumer 2016-01-30 01:18:52 +00:00
Daniel Quinn
84d5f8cc5d Merge branch 'master' into feature/images-as-docs 2016-01-29 23:41:13 +00:00
Daniel Quinn
cf4c437eca Be a little more verbose about the passphrase 2016-01-29 23:40:57 +00:00
Daniel Quinn
8701007a7a Merge pull request #15 from jat255/DOC_setup_enh
Clarify how to start server on a different port/ip
2016-01-29 23:32:15 +00:00
Daniel Quinn
889fd93c5e Merge pull request #14 from gitter-badger/gitter-badge
Add a Gitter chat badge to README.rst
2016-01-29 23:31:25 +00:00
The Gitter Badger
77a2a5bb8e Add Gitter badge 2016-01-29 23:27:37 +00:00