documentation.

2026-02-07 23:42:46 -06:00 · 2020-12-06 14:41:14 +01:00
parent 65816a434c
commit 278f6da16a
3 changed files with 67 additions and 98 deletions
--- a/docs/administration.rst
+++ b/docs/administration.rst
@@ -119,8 +119,11 @@ Updating paperless without docker
 After grabbing the new release and unpacking the contents, do the following:
-1.  Update python requirements. Paperless uses
-    `Pipenv`_ for managing dependencies:
+1.  Update dependencies. New paperless versions may require additional
+    dependencies. The dependencies required are listed in the section about
+    :ref:`bare metal installations <setup-bare_metal>`.
+2.  Update python requirements. If you use Pipenv, this is done with the following steps.
     .. code:: shell-session
@@ -132,14 +135,14 @@ After grabbing the new release and unpacking the contents, do the following:
    This creates a new virtual environment (or uses your existing environment)
    and installs all dependencies into it.
-2.  Collect static files.
+3.  Collect static files.
    .. code:: shell-session
        $ cd src
        $ pipenv run python3 manage.py collectstatic --clear
-3.  Migrate the database.
+4.  Migrate the database.
    .. code:: shell-session
@@ -153,14 +156,14 @@ Management utilities
 Paperless comes with some management commands that perform various maintenance
 tasks on your paperless instance. You can invoke these commands either by
-.. code:: bash
+.. code:: shell-session
    $ cd /path/to/paperless
    $ docker-compose run --rm webserver <command> <arguments>
 or
-.. code:: bash
+.. code:: shell-session
    $ cd /path/to/paperless/src
    $ pipenv run python manage.py <command> <arguments>
@@ -366,7 +369,7 @@ is specified, the archiver will only process that document.
 .. note::
    Some documents will cause errors and cannot be converted into PDF/A documents,
-    such as encrypted PDF documents. The archiver will skip over these Documents
+    such as encrypted PDF documents. The archiver will skip over these documents
    each time it sees them.
 .. _utilities-encyption:
--- a/docs/extending.rst
+++ b/docs/extending.rst
@@ -118,114 +118,80 @@ This will test and assemble everything and also build and tag a docker image.
 Extending Paperless
 ===================
-.. warning::
-    This section is not updated to paperless-ng yet.
-For the most part, Paperless is monolithic, so extending it is often best
-managed by way of modifying the code directly and issuing a pull request on
-`GitHub`_.  However, over time the project has been evolving to be a little
-more "pluggable" so that users can write their own stuff that talks to it.
-.. _GitHub: https://github.com/the-paperless-project/paperless
-.. _extending-parsers:
-
-Parsers
--------
-You can leverage Paperless' consumption model to have it consume files *other*
-than ones handled by default like ``.pdf``, ``.jpg``, and ``.tiff``.  To do so,
-you simply follow Django's convention of creating a new app, with a few key
-requirements.
-.. _extending-parsers-parserspy:
-parsers.py
-..........
-In this file, you create a class that extends
-``documents.parsers.DocumentParser`` and go about implementing the three
-required methods:
-* ``get_thumbnail()``: Returns the path to a file we can use as a thumbnail for
-  this document.
-* ``get_text()``: Returns the text from the document and only the text.
-* ``get_date()``: If possible, this returns the date of the document, otherwise
-  it should return ``None``.
-.. _extending-parsers-signalspy:
-signals.py
-..........
-At consumption time, Paperless emits a ``document_consumer_declaration``
-signal which your module has to react to in order to let the consumer know
-whether or not it's capable of handling a particular file.  Think of it like
-this:
-1. Consumer finds a file in the consumption directory.
-2. It asks all the available parsers: *"Hey, can you handle this file?"*
-3. Each parser responds with either ``None`` meaning they can't handle the
-   file, or a dictionary in the following format:
+Paperless does not have any fancy plugin systems and will probably never have one. However,
+some parts of the application have been designed to allow easy integration of additional
+features without any modification to the base code.
+Making custom parsers
+---------------------
+Paperless uses parsers to add documents to paperless. A parser is responsible for:
+*   Retrieve the content from the original
+*   Create a thumbnail
+*   Optional: Retrieve a created date from the original
+*   Optional: Create an archived document from the original
+Custom parsers can be added to paperless to support more file types. In order to do that,
+you need to write the parser itself and announce its existence to paperless.
+The parser itself must extend ``documents.parsers.DocumentParser`` and must implement the
+methods ``parse`` and ``get_thumbnail``. You can provide your own implementation to
+``get_date`` if you don't want to rely on paperless' default date guessing mechanisms.
 .. code:: python
-    {
-        "parser": <the class name>,
-        "weight": <an integer>
-    }
-The consumer compares the ``weight`` values from all respondents and uses the
-class with the highest value to consume the document.  The default parser,
-``RasterisedDocumentParser`` has a weight of ``0``.
-.. _extending-parsers-appspy:
-apps.py
-.......
-This is a standard Django file, but you'll need to add some code to it to
-connect your parser to the ``document_consumer_declaration`` signal.
-.. _extending-parsers-finally:
-
-Finally
-.......
-The last step is to update ``settings.py`` to include your new module.
-Eventually, this will be dynamic, but at the moment, you have to edit the
-``INSTALLED_APPS`` section manually.  Simply add the path to your AppConfig to
-the list like this:
+    class MyCustomParser(DocumentParser):
+        def parse(self, document_path, mime_type):
+            # This method does not return anything. Rather, you should assign
+            # whatever you got from the document to the following fields:
+            # The content of the document.
+            self.text = "content"
+            # Optional: path to a PDF document that you created from the original.
+            self.archive_path = os.path.join(self.tempdir, "archived.pdf")
+            # Optional: "created" date of the document.
+            self.date = get_created_from_metadata(document_path)
+        def get_thumbnail(self, document_path, mime_type):
+            # This should return the path to a thumbnail you created for this
+            # document.
+            return os.path.join(self.tempdir, "thumb.png")
+If you encounter any issues during parsing, raise a ``documents.parsers.ParseError``.
+The ``self.tempdir`` directory is a temporary directory that is guaranteed to be empty
+and is removed after consumption has finished. You can use that directory to store any
+intermediate files and also use it to store the thumbnail / archived document.
+After that, you need to announce your parser to paperless. You need to connect a
+handler to the ``document_consumer_declaration`` signal. Have a look in the file
+``src/paperless_tesseract/apps.py`` to see how that's done. The handler is a method
+that returns information about your parser:
 .. code:: python
-    INSTALLED_APPS = [
-        ...
-        "my_module.apps.MyModuleConfig",
-        ...
-    ]
-Order doesn't matter, but generally it's a good idea to place your module lower
-in the list so that you don't end up accidentally overriding project defaults
-somewhere.
-.. _extending-parsers-example:
-
-An Example
-..........
-
-The core Paperless functionality is based on this design, so if you want to see
-what a parser module should look like, have a look at `parsers.py`_,
-`signals.py`_, and `apps.py`_ in the `paperless_tesseract`_ module.
-.. _parsers.py: https://github.com/the-paperless-project/paperless/blob/master/src/paperless_tesseract/parsers.py
-.. _signals.py: https://github.com/the-paperless-project/paperless/blob/master/src/paperless_tesseract/signals.py
-.. _apps.py: https://github.com/the-paperless-project/paperless/blob/master/src/paperless_tesseract/apps.py
-.. _paperless_tesseract: https://github.com/the-paperless-project/paperless/blob/master/src/paperless_tesseract/
+    def myparser_consumer_declaration(sender, **kwargs):
+        return {
+            "parser": MyCustomParser,
+            "weight": 0,
+            "mime_types": {
+                "application/pdf": ".pdf",
+                "image/jpeg": ".jpg",
+            }
+        }
+*   ``parser`` is a reference to a class that extends ``DocumentParser``.
+*   ``weight`` is used whenever two or more parsers are able to parse a file: The parser with
+    the higher weight wins. This can be used to override the parsers provided by
+    paperless.
+*   ``mime_types`` is a dictionary. The keys are the mime types your parser supports and the value
+    is the default file extension that paperless should use when storing files and serving them for
+    download. We could guess that from the file extensions, but some mime types have many extensions
+    associated with them and the python methods responsible for guessing the extension do not always
+    return the same value.
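Reviewer note on the ``weight`` rule in the new ``extending.rst`` text: the following is a minimal, standalone sketch of how a consumer could choose between several parser declarations. It is not paperless' actual implementation. ``select_parser``, ``DefaultParser``, the stand-in ``MyCustomParser`` class and the sample declarations are hypothetical names used only for illustration; only the dictionary keys ``parser``, ``weight`` and ``mime_types`` come from the documentation.

```python
class DefaultParser:
    """Hypothetical stand-in for a parser shipped with paperless."""


class MyCustomParser:
    """Hypothetical stand-in for a custom parser class."""


def select_parser(declarations, mime_type):
    """Return the parser class with the highest weight that lists
    mime_type in its declaration, or None if no parser supports it."""
    candidates = [d for d in declarations if mime_type in d["mime_types"]]
    if not candidates:
        return None
    # Among the parsers that support the mime type, the higher weight wins.
    return max(candidates, key=lambda d: d["weight"])["parser"]


# Sample declarations in the shape returned by a declaration handler.
declarations = [
    {
        "parser": DefaultParser,
        "weight": 0,
        "mime_types": {"application/pdf": ".pdf", "image/png": ".png"},
    },
    {
        "parser": MyCustomParser,
        "weight": 10,
        "mime_types": {"application/pdf": ".pdf", "image/jpeg": ".jpg"},
    },
]
```

With these sample declarations, a PDF is handed to ``MyCustomParser`` because its weight is higher, while a PNG still goes to ``DefaultParser`` because no other parser declares that mime type.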
--- a/docs/faq.rst
+++ b/docs/faq.rst
@@ -73,7 +73,7 @@ in your browser and paperless has to do much less work to serve the data.
 **Q:** *How do I install paperless-ng on Raspberry Pi?*
-**A:** There is not docker image for ARM available. If you know how to build
+**A:** There is no docker image for ARM available. If you know how to build
 that automatically, I'm all ears. For now, you have to grab the latest release
 archive from the project page and build the image yourself. The release comes
 with the front end already compiled, so you don't have to do this on the Pi.