From 278f6da16afe33198be59761d2a2ac938b050271 Mon Sep 17 00:00:00 2001
From: jonaswinkler
Date: Sun, 6 Dec 2020 14:41:14 +0100
Subject: [PATCH] documentation.

---
 docs/administration.rst |  17 +++--
 docs/extending.rst      | 146 +++++++++++++++-------------------------
 docs/faq.rst            |   2 +-
 3 files changed, 67 insertions(+), 98 deletions(-)

diff --git a/docs/administration.rst b/docs/administration.rst
index 001d608e1..8885b7322 100644
--- a/docs/administration.rst
+++ b/docs/administration.rst
@@ -119,8 +119,11 @@ Updating paperless without docker
 
 After grabbing the new release and unpacking the contents, do the following:
 
-1.  Update python requirements. Paperless uses
-    `Pipenv`_ for managing dependencies:
+1.  Update dependencies. New paperless version may require additional
+    dependencies. The dependencies required are listed in the section about
+    :ref:`bare metal installations `.
+
+2.  Update python requirements. If you use Pipenv, this is done with the following steps.
 
     .. code:: shell-session
 
@@ -132,14 +135,14 @@ After grabbing the new release and unpacking the contents, do the following:
     This creates a new virtual environment (or uses your existing environment)
     and installs all dependencies into it.
 
-2.  Collect static files.
+3.  Collect static files.
 
     .. code:: shell-session
 
         $ cd src
         $ pipenv run python3 manage.py collectstatic --clear
 
-3.  Migrate the database.
+4.  Migrate the database.
 
     .. code:: shell-session
 
@@ -153,14 +156,14 @@ Management utilities
 Paperless comes with some management commands that perform various maintenance
 tasks on your paperless instance. You can invoke these commands either by
 
-.. code:: bash
+.. code:: shell-session
 
     $ cd /path/to/paperless
     $ docker-compose run --rm webserver
 
 or
 
-.. code:: bash
+.. code:: shell-session
 
     $ cd /path/to/paperless/src
     $ pipenv run python manage.py
@@ -366,7 +369,7 @@ is specified, the archiver will only process that document.
 .. note::
 
     Some documents will cause errors and cannot be converted into PDF/A documents,
-    such as encrypted PDF documents. The archiver will skip over these Documents
+    such as encrypted PDF documents. The archiver will skip over these documents
     each time it sees them.
 
 .. _utilities-encyption:
diff --git a/docs/extending.rst b/docs/extending.rst
index a0f14f2aa..28da1f56b 100644
--- a/docs/extending.rst
+++ b/docs/extending.rst
@@ -118,114 +118,80 @@ This will test and assemble everything and also build and tag a docker image.
 Extending Paperless
 ===================
 
-.. warning::
+Paperless does not have any fancy plugin systems and will probably never have. However,
+some parts of the application have been designed to allow easy integration of additional
+features without any modification to the base code.
 
-    This section is not updated to paperless-ng yet.
+Making custom parsers
+---------------------
 
-For the most part, Paperless is monolithic, so extending it is often best
-managed by way of modifying the code directly and issuing a pull request on
-`GitHub`_. However, over time the project has been evolving to be a little
-more "pluggable" so that users can write their own stuff that talks to it.
+Paperless uses parsers to add documents to paperless. A parser is responsible for:
 
-.. _GitHub: https://github.com/the-paperless-project/paperless
+* Retrieve the content from the original
+* Create a thumbnail
+* Optional: Retrieve a created date from the original
+* Optional: Create an archived document from the original
 
+Custom parsers can be added to paperless to support more file types. In order to do that,
+you need to write the parser itself and announce its existence to paperless.
 
-.. _extending-parsers:
-
-Parsers
--------
-
-You can leverage Paperless' consumption model to have it consume files *other*
-than ones handled by default like ``.pdf``, ``.jpg``, and ``.tiff``. To do so,
-you simply follow Django's convention of creating a new app, with a few key
-requirements.
-
-
-.. _extending-parsers-parserspy:
-
-parsers.py
-..........
-
-In this file, you create a class that extends
-``documents.parsers.DocumentParser`` and go about implementing the three
-required methods:
-
-* ``get_thumbnail()``: Returns the path to a file we can use as a thumbnail for
-  this document.
-* ``get_text()``: Returns the text from the document and only the text.
-* ``get_date()``: If possible, this returns the date of the document, otherwise
-  it should return ``None``.
-
-
-.. _extending-parsers-signalspy:
-
-signals.py
-..........
-
-At consumption time, Paperless emits a ``document_consumer_declaration``
-signal which your module has to react to in order to let the consumer know
-whether or not it's capable of handling a particular file. Think of it like
-this:
-
-1. Consumer finds a file in the consumption directory.
-2. It asks all the available parsers: *"Hey, can you handle this file?"*
-3. Each parser responds with either ``None`` meaning they can't handle the
-   file, or a dictionary in the following format:
+The parser itself must extend ``documents.parsers.DocumentParser`` and must implement the
+methods ``parse`` and ``get_thumbnail``. You can provide your own implementation to
+``get_date`` if you don't want to rely on paperless' default date guessing mechanisms.
 
 .. code:: python
 
-    {
-        "parser": ,
-        "weight":
-    }
+    class MyCustomParser(DocumentParser):
 
-The consumer compares the ``weight`` values from all respondents and uses the
-class with the highest value to consume the document. The default parser,
-``RasterisedDocumentParser`` has a weight of ``0``.
+        def parse(self, document_path, mime_type):
+            # This method does not return anything. Rather, you should assign
+            # whatever you got from the document to the following fields:
 
+            # The content of the document.
+            self.text = "content"
+
+            # Optional: path to a PDF document that you created from the original.
+            self.archive_path = os.path.join(self.tempdir, "archived.pdf")
 
-.. _extending-parsers-appspy:
+            # Optional: "created" date of the document.
+            self.date = get_created_from_metadata(document_path)
 
-apps.py
-.......
+        def get_thumbnail(self, document_path, mime_type):
+            # This should return the path to a thumbnail you created for this
+            # document.
+            return os.path.join(self.tempdir, "thumb.png")
 
-This is a standard Django file, but you'll need to add some code to it to
-connect your parser to the ``document_consumer_declaration`` signal.
+If you encounter any issues during parsing, raise a ``documents.parsers.ParseError``.
 
+The ``self.tempdir`` directory is a temporary directory that is guaranteed to be empty
+and removed after consumption finished. You can use that directory to store any
+intermediate files and also use it to store the thumbnail / archived document.
 
-.. _extending-parsers-finally:
-
-Finally
-.......
-
-The last step is to update ``settings.py`` to include your new module.
-Eventually, this will be dynamic, but at the moment, you have to edit the
-``INSTALLED_APPS`` section manually. Simply add the path to your AppConfig to
-the list like this:
+After that, you need to announce your parser to paperless. You need to connect a
+handler to the ``document_consumer_declaration`` signal. Have a look in the file
+``src/paperless_tesseract/apps.py`` on how that's done. The handler is a method
+that returns information about your parser:
 
 .. code:: python
 
-    INSTALLED_APPS = [
-        ...
-        "my_module.apps.MyModuleConfig",
-        ...
-    ]
+    def myparser_consumer_declaration(sender, **kwargs):
+        return {
+            "parser": MyCustomParser,
+            "weight": 0,
+            "mime_types": {
+                "application/pdf": ".pdf",
+                "image/jpeg": ".jpg",
+            }
+        }
 
-Order doesn't matter, but generally it's a good idea to place your module lower
-in the list so that you don't end up accidentally overriding project defaults
-somewhere.
+* ``parser`` is a reference to a class that extends ``DocumentParser``.
 
+* ``weight`` is used whenever two or more parsers are able to parse a file: The parser with
+  the higher weight wins. This can be used to override the parsers provided by
+  paperless.
 
-.. _extending-parsers-example:
-
-An Example
-..........
-
-The core Paperless functionality is based on this design, so if you want to see
-what a parser module should look like, have a look at `parsers.py`_,
-`signals.py`_, and `apps.py`_ in the `paperless_tesseract`_ module.
-
-.. _parsers.py: https://github.com/the-paperless-project/paperless/blob/master/src/paperless_tesseract/parsers.py
-.. _signals.py: https://github.com/the-paperless-project/paperless/blob/master/src/paperless_tesseract/signals.py
-.. _apps.py: https://github.com/the-paperless-project/paperless/blob/master/src/paperless_tesseract/apps.py
-.. _paperless_tesseract: https://github.com/the-paperless-project/paperless/blob/master/src/paperless_tesseract/
+* ``mime_types`` is a dictionary. The keys are the mime types your parser supports and the value
+  is the default file extension that paperless should use when storing files and serving them for
+  download. We could guess that from the file extensions, but some mime types have many extensions
+  associated with them and the python methods responsible for guessing the extension do not always
+  return the same value.
diff --git a/docs/faq.rst b/docs/faq.rst
index 887946074..6eac18617 100644
--- a/docs/faq.rst
+++ b/docs/faq.rst
@@ -73,7 +73,7 @@ in your browser and paperless has to do much less work to serve the data.
 
 **Q:** *How do I install paperless-ng on Raspberry Pi?*
 
-**A:** There is not docker image for ARM available. If you know how to build
+**A:** There is no docker image for ARM available. If you know how to build
 that automatically, I'm all ears. For now, you have to grab the latest release
 archive from the project page and build the image yourself. The release comes
 with the front end already compiled, so you don't have to do this on the Pi.