mirror of https://github.com/paperless-ngx/paperless-ngx.git
synced 2025-11-03 03:16:10 -06:00

Merge branch 'master' into dev
@@ -38,6 +38,7 @@ Here's what you get:
 	* When adding documents from mails, paperless can move these mails to a new folder, mark them as read, flag them or delete them.
 * Machine learning powered document matching.
 	* Paperless learns from your documents and will be able to automatically assign tags, correspondents and types to documents once you've stored a few documents in paperless.
+* We have a mobile app that offers a 'Share with paperless' option over at https://github.com/qcasey/paperless_share. You can use that in combination with any of the mobile scanning apps out there. It's still a little rough around the edges, but it works!
 * A task processor that processes documents in parallel and also tells you when something goes wrong. On modern multi core systems, consumption is blazing fast.
 * Code cleanup in many, MANY areas. Some of the code from OG paperless was just overly complicated.
 * More tests, more stability.
@@ -50,7 +51,6 @@ For a complete list of changes from paperless, check out the [changelog](https:/

 - Make the front end nice (except mobile).
 - Test coverage at 90%.
-- Store archived documents with an embedded OCR text layer, while keeping originals available. Making good progress in the `feature-ocrmypdf` branch.
 - Fix whatever bugs I and you find.

 ## Roadmap for versions beyond 1.0
@@ -119,8 +119,11 @@ Updating paperless without docker

 After grabbing the new release and unpacking the contents, do the following:

-1.  Update python requirements. Paperless uses
-    `Pipenv`_ for managing dependencies:
+1.  Update dependencies. New paperless version may require additional
+    dependencies. The dependencies required are listed in the section about
+    :ref:`bare metal installations <setup-bare_metal>`.
+
+2.  Update python requirements. If you use Pipenv, this is done with the following steps.

     .. code:: shell-session

@@ -132,14 +135,14 @@ After grabbing the new release and unpacking the contents, do the following:
     This creates a new virtual environment (or uses your existing environment)
     and installs all dependencies into it.

-2.  Collect static files.
+3.  Collect static files.

     .. code:: shell-session

         $ cd src
         $ pipenv run python3 manage.py collectstatic --clear

-3.  Migrate the database.
+4.  Migrate the database.

     .. code:: shell-session

@@ -153,14 +156,14 @@ Management utilities
 Paperless comes with some management commands that perform various maintenance
 tasks on your paperless instance. You can invoke these commands either by

-.. code:: bash
+.. code:: shell-session

     $ cd /path/to/paperless
     $ docker-compose run --rm webserver <command> <arguments>

 or

-.. code:: bash
+.. code:: shell-session

     $ cd /path/to/paperless/src
     $ pipenv run python manage.py <command> <arguments>
@@ -366,7 +369,7 @@ is specified, the archiver will only process that document.
 .. note::

     Some documents will cause errors and cannot be converted into PDF/A documents,
-    such as encrypted PDF documents. The archiver will skip over these Documents
+    such as encrypted PDF documents. The archiver will skip over these documents
     each time it sees them.

 .. _utilities-encyption:
@@ -118,114 +118,80 @@ This will test and assemble everything and also build and tag a docker image.
 Extending Paperless
 ===================

-.. warning::
+Paperless does not have any fancy plugin systems and will probably never have. However,
+some parts of the application have been designed to allow easy integration of additional
+features without any modification to the base code.

-    This section is not updated to paperless-ng yet.
+Making custom parsers
+---------------------

-For the most part, Paperless is monolithic, so extending it is often best
-managed by way of modifying the code directly and issuing a pull request on
-`GitHub`_.  However, over time the project has been evolving to be a little
-more "pluggable" so that users can write their own stuff that talks to it.
+Paperless uses parsers to add documents to paperless. A parser is responsible for:

-.. _GitHub: https://github.com/the-paperless-project/paperless
+*   Retrieve the content from the original
+*   Create a thumbnail
+*   Optional: Retrieve a created date from the original
+*   Optional: Create an archived document from the original

+Custom parsers can be added to paperless to support more file types. In order to do that,
+you need to write the parser itself and announce its existence to paperless.

-.. _extending-parsers:
-
-Parsers
--------
-
-You can leverage Paperless' consumption model to have it consume files *other*
-than ones handled by default like ``.pdf``, ``.jpg``, and ``.tiff``.  To do so,
-you simply follow Django's convention of creating a new app, with a few key
-requirements.
-
-
-.. _extending-parsers-parserspy:
-
-parsers.py
-..........
-
-In this file, you create a class that extends
-``documents.parsers.DocumentParser`` and go about implementing the three
-required methods:
-
-* ``get_thumbnail()``: Returns the path to a file we can use as a thumbnail for
-  this document.
-* ``get_text()``: Returns the text from the document and only the text.
-* ``get_date()``: If possible, this returns the date of the document, otherwise
-  it should return ``None``.
-
-
-.. _extending-parsers-signalspy:
-
-signals.py
-..........
-
-At consumption time, Paperless emits a ``document_consumer_declaration``
-signal which your module has to react to in order to let the consumer know
-whether or not it's capable of handling a particular file.  Think of it like
-this:
-
-1. Consumer finds a file in the consumption directory.
-2. It asks all the available parsers: *"Hey, can you handle this file?"*
-3. Each parser responds with either ``None`` meaning they can't handle the
-   file, or a dictionary in the following format:
+The parser itself must extend ``documents.parsers.DocumentParser`` and must implement the
+methods ``parse`` and ``get_thumbnail``. You can provide your own implementation to
+``get_date`` if you don't want to rely on paperless' default date guessing mechanisms.

 .. code:: python

-    {
-        "parser": <the class name>,
-        "weight": <an integer>
-    }
+    class MyCustomParser(DocumentParser):

-The consumer compares the ``weight`` values from all respondents and uses the
-class with the highest value to consume the document.  The default parser,
-``RasterisedDocumentParser`` has a weight of ``0``.
+        def parse(self, document_path, mime_type):
+            # This method does not return anything. Rather, you should assign
+            # whatever you got from the document to the following fields:

+            # The content of the document.
+            self.text = "content"
+
+            # Optional: path to a PDF document that you created from the original.
+            self.archive_path = os.path.join(self.tempdir, "archived.pdf")

-.. _extending-parsers-appspy:
+            # Optional: "created" date of the document.
+            self.date = get_created_from_metadata(document_path)

-apps.py
-.......
+        def get_thumbnail(self, document_path, mime_type):
+            # This should return the path to a thumbnail you created for this
+            # document.
+            return os.path.join(self.tempdir, "thumb.png")

-This is a standard Django file, but you'll need to add some code to it to
-connect your parser to the ``document_consumer_declaration`` signal.
+If you encounter any issues during parsing, raise a ``documents.parsers.ParseError``.

+The ``self.tempdir`` directory is a temporary directory that is guaranteed to be empty
+and removed after consumption finished. You can use that directory to store any
+intermediate files and also use it to store the thumbnail / archived document.

-.. _extending-parsers-finally:
-
-Finally
-.......
-
-The last step is to update ``settings.py`` to include your new module.
-Eventually, this will be dynamic, but at the moment, you have to edit the
-``INSTALLED_APPS`` section manually.  Simply add the path to your AppConfig to
-the list like this:
+After that, you need to announce your parser to paperless. You need to connect a
+handler to the ``document_consumer_declaration`` signal. Have a look in the file
+``src/paperless_tesseract/apps.py`` on how that's done. The handler is a method
+that returns information about your parser:

 .. code:: python

-    INSTALLED_APPS = [
-        ...
-        "my_module.apps.MyModuleConfig",
-        ...
-    ]
+    def myparser_consumer_declaration(sender, **kwargs):
+        return {
+            "parser": MyCustomParser,
+            "weight": 0,
+            "mime_types": {
+                "application/pdf": ".pdf",
+                "image/jpeg": ".jpg",
+            }
+        }

-Order doesn't matter, but generally it's a good idea to place your module lower
-in the list so that you don't end up accidentally overriding project defaults
-somewhere.
+*   ``parser`` is a reference to a class that extends ``DocumentParser``.

+*   ``weight`` is used whenever two or more parsers are able to parse a file: The parser with
+    the higher weight wins. This can be used to override the parsers provided by
+    paperless.

-.. _extending-parsers-example:
-
-An Example
-..........
-
-The core Paperless functionality is based on this design, so if you want to see
-what a parser module should look like, have a look at `parsers.py`_,
-`signals.py`_, and `apps.py`_ in the `paperless_tesseract`_ module.
-
-.. _parsers.py: https://github.com/the-paperless-project/paperless/blob/master/src/paperless_tesseract/parsers.py
-.. _signals.py: https://github.com/the-paperless-project/paperless/blob/master/src/paperless_tesseract/signals.py
-.. _apps.py: https://github.com/the-paperless-project/paperless/blob/master/src/paperless_tesseract/apps.py
-.. _paperless_tesseract: https://github.com/the-paperless-project/paperless/blob/master/src/paperless_tesseract/
+*   ``mime_types`` is a dictionary. The keys are the mime types your parser supports and the value
+    is the default file extension that paperless should use when storing files and serving them for
+    download. We could guess that from the file extensions, but some mime types have many extensions
+    associated with them and the python methods responsible for guessing the extension do not always
+    return the same value.
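Note on the hunk above: the new text says that each handler returns a dictionary with ``parser``, ``weight`` and ``mime_types``, and that the parser with the higher weight wins when several can handle a file. The following is a minimal, standalone sketch of that selection rule, not paperless' actual consumer code. The names ``pick_parser``, ``DefaultParser``, ``default_declaration`` and ``custom_declaration`` are hypothetical, the weights are made up for illustration, and the tie-breaking behaviour (first declaration wins) is an assumption the documentation does not specify.

```python
# Hypothetical stand-ins; real parsers would extend
# documents.parsers.DocumentParser.
class DefaultParser:
    pass


class MyCustomParser:
    pass


def default_declaration():
    # Same dictionary shape as the handler shown in the documentation.
    return {
        "parser": DefaultParser,
        "weight": 0,
        "mime_types": {"application/pdf": ".pdf", "image/jpeg": ".jpg"},
    }


def custom_declaration():
    # A higher weight lets this parser override the default one for PDFs.
    return {
        "parser": MyCustomParser,
        "weight": 10,
        "mime_types": {"application/pdf": ".pdf"},
    }


def pick_parser(declarations, mime_type):
    """Return (parser class, default extension) for the highest-weight
    declaration that supports mime_type, or None if nothing supports it."""
    candidates = [d for d in declarations if mime_type in d["mime_types"]]
    if not candidates:
        return None
    # max() keeps the first of several equal weights (assumed tie-break).
    best = max(candidates, key=lambda d: d["weight"])
    return best["parser"], best["mime_types"][mime_type]


declarations = [default_declaration(), custom_declaration()]
pdf_choice = pick_parser(declarations, "application/pdf")
jpeg_choice = pick_parser(declarations, "image/jpeg")
```

Here ``pdf_choice`` refers to the custom parser because of its higher weight, while ``jpeg_choice`` falls back to the default parser, which is the only one declaring ``image/jpeg``. The extension returned alongside the parser is the value from ``mime_types``, which is why the documentation asks for it explicitly instead of guessing it.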
@@ -73,7 +73,7 @@ in your browser and paperless has to do much less work to serve the data.

 **Q:** *How do I install paperless-ng on Raspberry Pi?*

-**A:** There is not docker image for ARM available. If you know how to build
+**A:** There is no docker image for ARM available. If you know how to build
 that automatically, I'm all ears. For now, you have to grab the latest release
 archive from the project page and build the image yourself. The release comes
 with the front end already compiled, so you don't have to do this on the Pi.