Add support for using pre-existing text from PDFs

This commit is contained in:
Daniel Quinn
2018-01-30 20:13:35 +00:00
parent 7ad7323cc7
commit 269c32ce6a
7 changed files with 60 additions and 13 deletions

View File

@@ -3,7 +3,15 @@ Changelog
* 1.2.0
* New Docker image, now based on Alpine, thanks to the efforts of `addadi`_
and `Pit`_.
and `Pit`_.
* `BastianPoe`_ has added the long-awaited feature to automatically skip the
OCR step when the PDF already contains text. This can be overridden by
setting ``PAPERLESS_OCR_ALWAYS=YES`` either in your ``paperless.conf`` or
in the environment. Note that this also means that Paperless now requires
``libpoppler-cpp-dev`` to be installed. **You'll need to run
``pip install -r requirements.txt`` after the usual ``git pull`` to
properly update**.
* 1.1.0
* Fix for `#283`_, a redirect bug which broke interactions with
paperless-desktop. Thanks to `chris-aeviator`_ for reporting it.
@@ -272,6 +280,7 @@ Changelog
.. _chris-aeviator: https://github.com/chris-aeviator
.. _Dan Panzarella: https://github.com/pzl
.. _addadi: https://github.com/addadi
.. _BastianPoe: https://github.com/BastianPoe
.. _#20: https://github.com/danielquinn/paperless/issues/20
.. _#44: https://github.com/danielquinn/paperless/issues/44

View File

@@ -11,24 +11,27 @@ should work) that has the following software installed:
* `Tesseract`_, plus its language files matching your document base.
* `Imagemagick`_ version 6.7.5 or higher
* `unpaper`_
* `libpoppler-cpp-dev`_ PDF rendering library
.. _Python3: https://python.org/
.. _GNU Privacy Guard: https://gnupg.org
.. _Tesseract: https://github.com/tesseract-ocr
.. _Imagemagick: http://imagemagick.org/
.. _unpaper: https://www.flameeyes.eu/projects/unpaper
.. _libpoppler-cpp-dev: https://poppler.freedesktop.org/
Notably, you should confirm how you access your Python3 installation. Many
Linux distributions will install Python3 in parallel to Python2, using the names
``python3`` and ``python`` respectively. The same goes for ``pip3`` and
``pip``. Running Paperless with Python2 will likely break things, so make sure that
you're using the right version.
Linux distributions will install Python3 in parallel to Python2, using the
names ``python3`` and ``python`` respectively. The same goes for ``pip3`` and
``pip``. Running Paperless with Python2 will likely break things, so make sure
that you're using the right version.
For the purposes of simplicity, ``python`` and ``pip`` is used everywhere to
refer to their Python3 versions.
In addition to the above, there are a number of Python requirements, all of
which are listed in a file called ``requirements.txt`` in the project root directory.
which are listed in a file called ``requirements.txt`` in the project root
directory.
If you're not working on a virtual environment (like Vagrant or Docker), you
should probably be using a virtualenv, but that's your call. The reasons why
@@ -39,12 +42,13 @@ probably figure that out before continuing.
.. _requirements-apple:
Apple-tastic Complications
--------------------------
Problems with Imagemagick & PDFs
--------------------------------
Some users have `run into problems`_ with installing ImageMagick on Apple
systems using HomeBrew. The solution appears to be to install ghostscript as
well as ImageMagick:
Some users have `run into problems`_ with getting ImageMagick to do its thing
with PDFs. Often this is the case with Apple systems using HomeBrew, but other
Linuxes have been a problem as well. The solution appears to be to install
ghostscript as well as ImageMagick:
.. _run into problems: https://github.com/danielquinn/paperless/issues/25