Merge branch 'master' into dev

2025-11-05 03:26:11 -06:00 · 2020-12-08 16:46:13 +01:00
parent 3596c58ea6 fdf3c04014
commit 6a965a8467
4 changed files with 68 additions and 99 deletions
--- a/README.md
+++ b/README.md
@@ -38,6 +38,7 @@ Here's what you get:
 	* When adding documents from mails, paperless can move these mails to a new folder, mark them as read, flag them or delete them.
 * Machine learning powered document matching.
 	* Paperless learns from your documents and will be able to automatically assign tags, correspondents and types to documents once you've stored a few documents in paperless.
+* We have a mobile app that offers a 'Share with paperless' option over at https://github.com/qcasey/paperless_share. You can use that in combination with any of the mobile scanning apps out there. It's still a little rough around the edges, but it works!
 * A task processor that processes documents in parallel and also tells you when something goes wrong. On modern multi core systems, consumption is blazing fast.
 * Code cleanup in many, MANY areas. Some of the code from OG paperless was just overly complicated.
 * More tests, more stability.
@@ -50,7 +51,6 @@ For a complete list of changes from paperless, check out the [changelog](https:/

 - Make the front end nice (except mobile).
 - Test coverage at 90%.
- Store archived documents with an embedded OCR text layer, while keeping originals available. Making good progress in the `feature-ocrmypdf` branch.
 - Fix whatever bugs I and you find.

 ## Roadmap for versions beyond 1.0
--- a/docs/administration.rst
+++ b/docs/administration.rst
@@ -119,8 +119,11 @@ Updating paperless without docker

 After grabbing the new release and unpacking the contents, do the following:

-1.  Update python requirements. Paperless uses
-    `Pipenv`_ for managing dependencies:
+1.  Update dependencies. New paperless versions may require additional
+    dependencies. The dependencies required are listed in the section about
+    :ref:`bare metal installations <setup-bare_metal>`.
+
+2.  Update python requirements. If you use Pipenv, this is done with the following steps.

    .. code:: shell-session

@@ -132,14 +135,14 @@ After grabbing the new release and unpacking the contents, do the following:
    This creates a new virtual environment (or uses your existing environment)
    and installs all dependencies into it.

-2.  Collect static files.
+3.  Collect static files.

    .. code:: shell-session

        $ cd src
        $ pipenv run python3 manage.py collectstatic --clear
    
-3.  Migrate the database.
+4.  Migrate the database.

    .. code:: shell-session

@@ -153,14 +156,14 @@ Management utilities
 Paperless comes with some management commands that perform various maintenance
 tasks on your paperless instance. You can invoke these commands either by

-.. code:: bash
+.. code:: shell-session

    $ cd /path/to/paperless
    $ docker-compose run --rm webserver <command> <arguments>

 or

-.. code:: bash
+.. code:: shell-session

    $ cd /path/to/paperless/src
    $ pipenv run python manage.py <command> <arguments>
@@ -366,7 +369,7 @@ is specified, the archiver will only process that document.
 .. note::

    Some documents will cause errors and cannot be converted into PDF/A documents,
-    such as encrypted PDF documents. The archiver will skip over these Documents
+    such as encrypted PDF documents. The archiver will skip over these documents
    each time it sees them.

 .. _utilities-encyption:
--- a/docs/extending.rst
+++ b/docs/extending.rst
@@ -118,114 +118,80 @@ This will test and assemble everything and also build and tag a docker image.
 Extending Paperless
 ===================

-.. warning::
+Paperless does not have a fancy plugin system and probably never will. However,
+some parts of the application have been designed to allow easy integration of additional
+features without any modification to the base code.

-    This section is not updated to paperless-ng yet.
+Making custom parsers
+---------------------

-For the most part, Paperless is monolithic, so extending it is often best
-managed by way of modifying the code directly and issuing a pull request on
-`GitHub`_.  However, over time the project has been evolving to be a little
-more "pluggable" so that users can write their own stuff that talks to it.
+Paperless uses parsers to add documents to paperless. A parser is responsible for:

-.. _GitHub: https://github.com/the-paperless-project/paperless
+*   Retrieving the content from the original
+*   Creating a thumbnail
+*   Optional: Retrieving a created date from the original
+*   Optional: Creating an archived document from the original

+Custom parsers can be added to paperless to support more file types. In order to do that,
+you need to write the parser itself and announce its existence to paperless.

-.. _extending-parsers:
-
-Parsers
-------
-
-You can leverage Paperless' consumption model to have it consume files *other*
-than ones handled by default like ``.pdf``, ``.jpg``, and ``.tiff``.  To do so,
-you simply follow Django's convention of creating a new app, with a few key
-requirements.
-
-
-.. _extending-parsers-parserspy:
-
-parsers.py
-..........
-
-In this file, you create a class that extends
-``documents.parsers.DocumentParser`` and go about implementing the three
-required methods:
-
-* ``get_thumbnail()``: Returns the path to a file we can use as a thumbnail for
-  this document.
-* ``get_text()``: Returns the text from the document and only the text.
-* ``get_date()``: If possible, this returns the date of the document, otherwise
-  it should return ``None``.
-
-
-.. _extending-parsers-signalspy:
-
-signals.py
-..........
-
-At consumption time, Paperless emits a ``document_consumer_declaration``
-signal which your module has to react to in order to let the consumer know
-whether or not it's capable of handling a particular file.  Think of it like
-this:
-
-1. Consumer finds a file in the consumption directory.
-2. It asks all the available parsers: *"Hey, can you handle this file?"*
-3. Each parser responds with either ``None`` meaning they can't handle the
-   file, or a dictionary in the following format:
+The parser itself must extend ``documents.parsers.DocumentParser`` and must implement the
+methods ``parse`` and ``get_thumbnail``. You can provide your own implementation of
+``get_date`` if you don't want to rely on paperless' default date guessing mechanisms.

 .. code:: python

-    {
-        "parser": <the class name>,
-        "weight": <an integer>
-    }
+    class MyCustomParser(DocumentParser):

-The consumer compares the ``weight`` values from all respondents and uses the
-class with the highest value to consume the document.  The default parser,
-``RasterisedDocumentParser`` has a weight of ``0``.
+        def parse(self, document_path, mime_type):
+            # This method does not return anything. Rather, you should assign
+            # whatever you got from the document to the following fields:

+            # The content of the document.
+            self.text = "content"
+            
+            # Optional: path to a PDF document that you created from the original.
+            self.archive_path = os.path.join(self.tempdir, "archived.pdf")

-.. _extending-parsers-appspy:
+            # Optional: "created" date of the document.
+            self.date = get_created_from_metadata(document_path)

-apps.py
-.......
+        def get_thumbnail(self, document_path, mime_type):
+            # This should return the path to a thumbnail you created for this
+            # document.
+            return os.path.join(self.tempdir, "thumb.png")

-This is a standard Django file, but you'll need to add some code to it to
-connect your parser to the ``document_consumer_declaration`` signal.
+If you encounter any issues during parsing, raise a ``documents.parsers.ParseError``.

+The ``self.tempdir`` directory is a temporary directory that is guaranteed to be empty
+and is removed after consumption has finished. You can use that directory to store any
+intermediate files and also use it to store the thumbnail / archived document.

-.. _extending-parsers-finally:
-
-Finally
-.......
-
-The last step is to update ``settings.py`` to include your new module.
-Eventually, this will be dynamic, but at the moment, you have to edit the
-``INSTALLED_APPS`` section manually.  Simply add the path to your AppConfig to
-the list like this:
+After that, you need to announce your parser to paperless. You need to connect a
+handler to the ``document_consumer_declaration`` signal. Have a look at the file
+``src/paperless_tesseract/apps.py`` to see how that's done. The handler is a function
+that returns information about your parser:

 .. code:: python

-    INSTALLED_APPS = [
-        ...
-        "my_module.apps.MyModuleConfig",
-        ...
-    ]
+    def myparser_consumer_declaration(sender, **kwargs):
+        return {
+            "parser": MyCustomParser,
+            "weight": 0,
+            "mime_types": {
+                "application/pdf": ".pdf",
+                "image/jpeg": ".jpg",
+            }
+        }

-Order doesn't matter, but generally it's a good idea to place your module lower
-in the list so that you don't end up accidentally overriding project defaults
-somewhere.
+*   ``parser`` is a reference to a class that extends ``DocumentParser``.

+*   ``weight`` is used whenever two or more parsers are able to parse a file: The parser with
+    the higher weight wins. This can be used to override the parsers provided by
+    paperless.

-.. _extending-parsers-example:
-
-An Example
-..........
-
-The core Paperless functionality is based on this design, so if you want to see
-what a parser module should look like, have a look at `parsers.py`_,
-`signals.py`_, and `apps.py`_ in the `paperless_tesseract`_ module.
-
-.. _parsers.py: https://github.com/the-paperless-project/paperless/blob/master/src/paperless_tesseract/parsers.py
-.. _signals.py: https://github.com/the-paperless-project/paperless/blob/master/src/paperless_tesseract/signals.py
-.. _apps.py: https://github.com/the-paperless-project/paperless/blob/master/src/paperless_tesseract/apps.py
-.. _paperless_tesseract: https://github.com/the-paperless-project/paperless/blob/master/src/paperless_tesseract/
+*   ``mime_types`` is a dictionary. The keys are the mime types your parser supports and the value
+    is the default file extension that paperless should use when storing files and serving them for
+    download. We could guess that from the mime type, but some mime types have many extensions
+    associated with them and the python methods responsible for guessing the extension do not always
+    return the same value.
--- a/docs/faq.rst
+++ b/docs/faq.rst
@@ -73,7 +73,7 @@ in your browser and paperless has to do much less work to serve the data.

 **Q:** *How do I install paperless-ng on Raspberry Pi?*

-**A:** There is not docker image for ARM available. If you know how to build
+**A:** There is no docker image for ARM available. If you know how to build
 that automatically, I'm all ears. For now, you have to grab the latest release
 archive from the project page and build the image yourself. The release comes
 with the front end already compiled, so you don't have to do this on the Pi.
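
Note on the ``weight`` and ``mime_types`` fields documented in the docs/extending.rst hunk above: the rule "the parser with the higher weight wins" can be illustrated with a small standalone sketch. This is not paperless' actual consumer code. The function ``pick_parser`` and the placeholder classes ``DefaultPdfParser`` and ``MyCustomParser`` are hypothetical names introduced only for this illustration, and the declarations are plain dictionaries in the shape the handler returns.

```python
from typing import Optional


class DefaultPdfParser:
    """Hypothetical stand-in for a parser shipped with paperless."""


class MyCustomParser:
    """Hypothetical stand-in for a user-provided parser."""


def pick_parser(declarations: list, mime_type: str) -> Optional[dict]:
    """Return the highest-weight declaration supporting mime_type, or None.

    Assumption of this sketch: on equal weights, the first declaration wins.
    """
    candidates = [d for d in declarations if mime_type in d["mime_types"]]
    if not candidates:
        return None
    return max(candidates, key=lambda d: d["weight"])


declarations = [
    {
        "parser": DefaultPdfParser,
        "weight": 0,
        "mime_types": {"application/pdf": ".pdf"},
    },
    {
        "parser": MyCustomParser,
        "weight": 10,
        "mime_types": {"application/pdf": ".pdf", "image/jpeg": ".jpg"},
    },
]

chosen = pick_parser(declarations, "application/pdf")
# Default file extension to use when storing a file of this mime type.
extension = chosen["mime_types"]["application/pdf"]
```

In this sketch both parsers declare ``application/pdf``, so the custom parser is chosen because its weight is higher. A mime type that no declaration lists yields ``None``.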