documentation.

This commit is contained in:
jonaswinkler 2020-12-06 14:41:14 +01:00
parent 65816a434c
commit 278f6da16a
3 changed files with 67 additions and 98 deletions

View File

@ -119,8 +119,11 @@ Updating paperless without docker
After grabbing the new release and unpacking the contents, do the following:
1. Update python requirements. Paperless uses
`Pipenv`_ for managing dependencies:
1. Update dependencies. New paperless version may require additional
dependencies. The dependencies required are listed in the section about
:ref:`bare metal installations <setup-bare_metal>`.
2. Update python requirements. If you use Pipenv, this is done with the following steps.
.. code:: shell-session
@ -132,14 +135,14 @@ After grabbing the new release and unpacking the contents, do the following:
This creates a new virtual environment (or uses your existing environment)
and installs all dependencies into it.
2. Collect static files.
3. Collect static files.
.. code:: shell-session
$ cd src
$ pipenv run python3 manage.py collectstatic --clear
3. Migrate the database.
4. Migrate the database.
.. code:: shell-session
@ -153,14 +156,14 @@ Management utilities
Paperless comes with some management commands that perform various maintenance
tasks on your paperless instance. You can invoke these commands either by
.. code:: bash
.. code:: shell-session
$ cd /path/to/paperless
$ docker-compose run --rm webserver <command> <arguments>
or
.. code:: bash
.. code:: shell-session
$ cd /path/to/paperless/src
$ pipenv run python manage.py <command> <arguments>
@ -366,7 +369,7 @@ is specified, the archiver will only process that document.
.. note::
Some documents will cause errors and cannot be converted into PDF/A documents,
such as encrypted PDF documents. The archiver will skip over these Documents
such as encrypted PDF documents. The archiver will skip over these documents
each time it sees them.
.. _utilities-encyption:

View File

@ -118,114 +118,80 @@ This will test and assemble everything and also build and tag a docker image.
Extending Paperless
===================
.. warning::
Paperless does not have any fancy plugin systems and will probably never have. However,
some parts of the application have been designed to allow easy integration of additional
features without any modification to the base code.
This section is not updated to paperless-ng yet.
Making custom parsers
---------------------
For the most part, Paperless is monolithic, so extending it is often best
managed by way of modifying the code directly and issuing a pull request on
`GitHub`_. However, over time the project has been evolving to be a little
more "pluggable" so that users can write their own stuff that talks to it.
Paperless uses parsers to add documents to paperless. A parser is responsible for:
.. _GitHub: https://github.com/the-paperless-project/paperless
* Retrieve the content from the original
* Create a thumbnail
* Optional: Retrieve a created date from the original
* Optional: Create an archived document from the original
Custom parsers can be added to paperless to support more file types. In order to do that,
you need to write the parser itself and announce its existence to paperless.
.. _extending-parsers:
Parsers
-------
You can leverage Paperless' consumption model to have it consume files *other*
than ones handled by default like ``.pdf``, ``.jpg``, and ``.tiff``. To do so,
you simply follow Django's convention of creating a new app, with a few key
requirements.
.. _extending-parsers-parserspy:
parsers.py
..........
In this file, you create a class that extends
``documents.parsers.DocumentParser`` and go about implementing the three
required methods:
* ``get_thumbnail()``: Returns the path to a file we can use as a thumbnail for
this document.
* ``get_text()``: Returns the text from the document and only the text.
* ``get_date()``: If possible, this returns the date of the document, otherwise
it should return ``None``.
.. _extending-parsers-signalspy:
signals.py
..........
At consumption time, Paperless emits a ``document_consumer_declaration``
signal which your module has to react to in order to let the consumer know
whether or not it's capable of handling a particular file. Think of it like
this:
1. Consumer finds a file in the consumption directory.
2. It asks all the available parsers: *"Hey, can you handle this file?"*
3. Each parser responds with either ``None`` meaning they can't handle the
file, or a dictionary in the following format:
The parser itself must extend ``documents.parsers.DocumentParser`` and must implement the
methods ``parse`` and ``get_thumbnail``. You can provide your own implementation to
``get_date`` if you don't want to rely on paperless' default date guessing mechanisms.
.. code:: python
{
"parser": <the class name>,
"weight": <an integer>
}
class MyCustomParser(DocumentParser):
The consumer compares the ``weight`` values from all respondents and uses the
class with the highest value to consume the document. The default parser,
``RasterisedDocumentParser`` has a weight of ``0``.
def parse(self, document_path, mime_type):
# This method does not return anything. Rather, you should assign
# whatever you got from the document to the following fields:
# The content of the document.
self.text = "content"
# Optional: path to a PDF document that you created from the original.
self.archive_path = os.path.join(self.tempdir, "archived.pdf")
.. _extending-parsers-appspy:
# Optional: "created" date of the document.
self.date = get_created_from_metadata(document_path)
apps.py
.......
def get_thumbnail(self, document_path, mime_type):
# This should return the path to a thumbnail you created for this
# document.
return os.path.join(self.tempdir, "thumb.png")
This is a standard Django file, but you'll need to add some code to it to
connect your parser to the ``document_consumer_declaration`` signal.
If you encounter any issues during parsing, raise a ``documents.parsers.ParseError``.
The ``self.tempdir`` directory is a temporary directory that is guaranteed to be empty
and removed after consumption finished. You can use that directory to store any
intermediate files and also use it to store the thumbnail / archived document.
.. _extending-parsers-finally:
Finally
.......
The last step is to update ``settings.py`` to include your new module.
Eventually, this will be dynamic, but at the moment, you have to edit the
``INSTALLED_APPS`` section manually. Simply add the path to your AppConfig to
the list like this:
After that, you need to announce your parser to paperless. You need to connect a
handler to the ``document_consumer_declaration`` signal. Have a look in the file
``src/paperless_tesseract/apps.py`` on how that's done. The handler is a method
that returns information about your parser:
.. code:: python
INSTALLED_APPS = [
...
"my_module.apps.MyModuleConfig",
...
]
def myparser_consumer_declaration(sender, **kwargs):
return {
"parser": MyCustomParser,
"weight": 0,
"mime_types": {
"application/pdf": ".pdf",
"image/jpeg": ".jpg",
}
}
Order doesn't matter, but generally it's a good idea to place your module lower
in the list so that you don't end up accidentally overriding project defaults
somewhere.
* ``parser`` is a reference to a class that extends ``DocumentParser``.
* ``weight`` is used whenever two or more parsers are able to parse a file: The parser with
the higher weight wins. This can be used to override the parsers provided by
paperless.
.. _extending-parsers-example:
An Example
..........
The core Paperless functionality is based on this design, so if you want to see
what a parser module should look like, have a look at `parsers.py`_,
`signals.py`_, and `apps.py`_ in the `paperless_tesseract`_ module.
.. _parsers.py: https://github.com/the-paperless-project/paperless/blob/master/src/paperless_tesseract/parsers.py
.. _signals.py: https://github.com/the-paperless-project/paperless/blob/master/src/paperless_tesseract/signals.py
.. _apps.py: https://github.com/the-paperless-project/paperless/blob/master/src/paperless_tesseract/apps.py
.. _paperless_tesseract: https://github.com/the-paperless-project/paperless/blob/master/src/paperless_tesseract/
* ``mime_types`` is a dictionary. The keys are the mime types your parser supports and the value
is the default file extension that paperless should use when storing files and serving them for
download. We could guess that from the file extensions, but some mime types have many extensions
associated with them and the python methods responsible for guessing the extension do not always
return the same value.

View File

@ -73,7 +73,7 @@ in your browser and paperless has to do much less work to serve the data.
**Q:** *How do I install paperless-ng on Raspberry Pi?*
**A:** There is not docker image for ARM available. If you know how to build
**A:** There is no docker image for ARM available. If you know how to build
that automatically, I'm all ears. For now, you have to grab the latest release
archive from the project page and build the image yourself. The release comes
with the front end already compiled, so you don't have to do this on the Pi.