docs and readme

Jonas Winkler 2020-11-13 19:27:22 +01:00
parent c84b61807c
commit 070f8ee949
10 changed files with 85 additions and 150 deletions


@ -28,35 +28,14 @@ Here's what you get:
I wanted to make big changes to the project that will impact the way it is used by its users greatly. Among the users who currently use paperless in production there are probably many that don't want these changes right away. I also wanted to have more control over what goes into the code and what does not. Therefore, paperless-ng was created. NG stands for both Angular (the framework used for the Frontend) and next-gen. Publishing this project under a different name also avoids confusion between paperless and paperless-ng.
This is a list of changes that have been made to the original project.
The gist of the changes is the following:
## Added
- **A new single page UI** built with bootstrap and Angular. It's much more responsive than the django admin pages. It features the following improvements over the old django admin interface:
- *Dashboard.* The landing page shows some useful information, such as statistics, recently scanned documents, file uploading, and possibly more in the future.
- *Document uploading on the web page.* This is very crude right now, but gets the job done. It simply uploads the documents and stores them in the configured consumer directory. The API for that has always been in the project; there simply was no form in the UI to support it.
- *Full text search* with a proper document indexer: The search feature sorts documents by relevance to the search query, highlights query terms in the found documents and provides autocomplete while typing the query. This is still very basic but will see extensions in the future.
- *Saveable filters.* Save filter and sorting presets and optionally display a couple of documents from saved filters (e.g., your inbox sorted descending by added date, or tagged TODO, oldest to newest) on the dashboard.
- *Statistics.* Provides basic statistics about your document collection.
- **Document types.** Similar to correspondents, each document may have a type (e.g., invoice, letter, receipt, bank statement, ...). I initially intended to use this for some individual processing of differently typed documents; however, no such features exist yet.
- **Inbox tags.** These tags are automatically assigned to every newly scanned document. They are intended to be removed once you have manually edited the metadata of a document.
- **Automatic matching** for document types, correspondents, and tags. A new matching algorithm has been implemented (Auto), which is based on a classification model (simple feed-forward neural nets are used). This classifier is trained on your document collection and learns to assign metadata to new documents based on their similarity to existing documents.
- If, for example, all your bank statements for a specific account are tagged with "bank_account_1234" and the matching algorithm of that tag is set to Auto, the classifier learns relevant phrases and words in the documents and assigns this tag automatically to newly scanned and matching documents.
- This works reasonably well if there is a correlation between the tag and the content of the document. Tags such as 'TODO' or 'Contact Correspondent' cannot be assigned automatically.
- **Archive serial numbers.** These are there to support the recommended workflow for storing physical copies of very important documents. The idea is that if a document has to be kept in physical form, you write a running number on the document before scanning (the archive serial number) and keep these documents sorted by number in a binder. If you need to access a specific physical document at some point in time, search for the document in paperless, identify the ASN and grab the document.
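
To make the automatic matching idea above more concrete, here is a rough sketch. It is only a toy stand-in: instead of the feed-forward neural net the project uses, it collects the words that appear only in documents carrying a given tag and suggests that tag when enough of those words show up in a new document. The function names, the `min_hits` threshold, and the sample documents are all made up for illustration.

```python
# Toy stand-in for the "Auto" matching idea: learn which words are typical
# for a tag, then suggest the tag for new documents containing those words.
# This is NOT the feed-forward neural network mentioned above; all names
# here (tokenize, learn_tag_words, suggest_tag) are hypothetical.

def tokenize(text):
    """Split text into a set of lowercase words."""
    return set(text.lower().split())


def learn_tag_words(documents, tag):
    """Return words seen in documents carrying `tag` but in no other document."""
    tagged_words, other_words = set(), set()
    for text, tags in documents:
        target = tagged_words if tag in tags else other_words
        target.update(tokenize(text))
    return tagged_words - other_words


def suggest_tag(text, tag_words, min_hits=2):
    """Suggest the tag when at least `min_hits` learned words occur in `text`."""
    return len(tokenize(text) & tag_words) >= min_hits


# Made-up training collection: (document text, assigned tags)
collection = [
    ("Statement for account 1234 closing balance", {"bank_account_1234"}),
    ("Account 1234 statement with balance and interest", {"bank_account_1234"}),
    ("Invoice for office chairs", set()),
    ("Reminder to call the plumber", {"TODO"}),
]

bank_words = learn_tag_words(collection, "bank_account_1234")
print(suggest_tag("Your statement for account 1234", bank_words))
print(suggest_tag("Invoice for garden tools", bank_words))
```

A content-independent tag such as 'TODO' gives an approach like this nothing reliable to learn from, which is why such tags cannot be assigned automatically.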
* New front end. This will eventually be mobile friendly as well.
* New full text search.
* Machine learning powered document matching.
* Code cleanup in many, MANY areas.
## Modified
- **(BREAKING) REST API changes.** In order to support the new UI, changes had to be made to the API. Some filters are not available anymore, other filters were added. Furthermore, foreign key relationships are not expressed with URLs anymore, but with their respective ids. Also, the urls for fetching documents and thumbnails have changed. Redirects are in place to support the old urls.
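
As a rough illustration of the foreign key change, the sketch below rewrites an old-style payload, where relations are URLs, into the id-based shape. The field names, host, and URL layout are assumptions made for the example and are not taken from the actual API; check the API documentation for the real schema.

```python
from urllib.parse import urlparse


def id_from_url(url):
    """Return the trailing numeric id of a URL such as .../correspondents/12/."""
    path = urlparse(url).path.rstrip("/")
    return int(path.rsplit("/", 1)[-1])


def to_id_style(document):
    """Rewrite URL-valued relations of an old-style payload as plain ids."""
    converted = dict(document)
    if document.get("correspondent") is not None:
        converted["correspondent"] = id_from_url(document["correspondent"])
    converted["tags"] = [id_from_url(url) for url in document.get("tags", [])]
    return converted


# Hypothetical old-style payload; field names and URL layout are assumptions.
old_style = {
    "id": 7,
    "title": "Electricity bill",
    "correspondent": "http://localhost:8000/api/correspondents/12/",
    "tags": [
        "http://localhost:8000/api/tags/3/",
        "http://localhost:8000/api/tags/5/",
    ],
}

print(to_id_style(old_style))
```

A client written against the old API would need a change along these lines wherever it follows or submits relation URLs.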
## Internal changes
- Many improvements to the code. More concise logging of the consumer, better multithreading of the tesseract parser for large documents, fewer hacks overall.
- Updated docker image. This image runs everything in a single container. (Except the optional database, of course)
## Removed
These features were removed for two reasons. First, I did not feel these features contributed all that much to the overall project, and second, I don't want to maintain them.
- **(BREAKING) Reminders.** I have no idea what they were used for and thus removed them from the project.
- **Every customization made to the admin interface.** Since this is not the primary interface for the application anymore, there is no need to keep and maintain these. Besides, some changes were incompatible with the most recent versions of django. The interface is completely usable, though.
For a complete list of changes, check out the [changelog](https://paperless-ng.readthedocs.io/en/latest/changelog.html)
## Planned
@ -67,6 +46,7 @@ These features will make it into the application at some point, sorted by priori
- Ability to search for “Similar documents” in the search results
- Provide corrections for misspelled queries
- **More robust consumer** that shows its progress on the web page.
- **More rigid email processing**. Like, don't delete imported mail, provide filters, etc.
- **Arbitrary tag colors**. Allow the selection of any color with a color picker.
## On the chopping block.
@ -74,39 +54,34 @@ These features will make it into the application at some point, sorted by priori
I don't know if these features are used all that much. I don't exactly know how they work and will probably remove them at some point in the future.
- **GnuPG encryption.** Since it's disabled by default and the website allows transparent access to encrypted documents anyway, this doesn't really provide any benefit over having the application stored on an encrypted file system.
- **E-Mail scanning.** I don't use it and don't know the state of the implementation. I'll have to look into that.
# Getting started
The recommended way to deploy paperless is docker-compose.
The recommended way to deploy paperless is docker-compose. Use the provided docker-compose.yml files to get started. This pulls the image from Docker hub. Alternatively, you can build the image yourself.
git clone https://github.com/jonaswinkler/paperless
cd paperless
cp docker-compose.yml.example docker-compose.yml
cp docker-compose.env.example docker-compose.env
docker-compose up -d
Please be aware that this uses a postgres database instead of sqlite. If you want to continue using sqlite, remove the database-related options from the docker-compose.env file.
Read the [documentation](https://paperless-ng.readthedocs.io/en/latest/setup.html#installation) on how to get started.
Alternatively, you can install the dependencies and set up Apache and a database server yourself. Details for that will be available in the documentation.
# Migrating to paperless-ng
Don't do it yet. The migrations are in place, but I have not verified yet that they work.
Read the section about [migration](https://paperless-ng.readthedocs.io/en/latest/setup.html#migration-to-paperless-ng) in the documentation.
# Documentation
The documentation for Paperless is available on [ReadTheDocs](https://paperless-ng.readthedocs.io/). Updated documentation for this project is not yet available.
The documentation for Paperless-ng is available on [ReadTheDocs](https://paperless-ng.readthedocs.io/).
# Affiliated Projects
Paperless has been around a while now, and people are starting to build stuff on top of it. If you're one of those people, we can add your project to this list:
* [Paperless App](https://github.com/bauerj/paperless_app): An Android/iOS app for Paperless. This app is not compatible at this point.
* [Paperless App](https://github.com/bauerj/paperless_app): An Android/iOS app for Paperless.
* [Paperless Desktop](https://github.com/thomasbrueggemann/paperless-desktop): A desktop UI for your Paperless installation. Runs on Mac, Linux, and Windows.
* [ansible-role-paperless](https://github.com/ovv/ansible-role-paperless): An easy way to get Paperless running via Ansible.
* [paperless-cli](https://github.com/stgarf/paperless-cli): A golang command line binary to interact with a Paperless instance.
Compatibility with Paperless-ng is unknown.
# Important Note
Document scanners are typically used to scan sensitive documents. Things like your social insurance number, tax records, invoices, etc. Everything is stored in the clear without encryption by default (it needs to be searchable, so if someone has ideas on how to do that on encrypted data, I'm all ears). This means that Paperless should never be run on an untrusted host. Instead, I recommend that if you do want to use it, run it locally on a server in your own home.


@ -22,47 +22,6 @@ into an arbitrary directory.
Restoring
=========
Restoring your data is just as easy, since nearly all of your data exists either
in the file names, or in the contents of the files themselves. You just need to
create an empty database (just follow the
:ref:`installation instructions <setup-installation>` again) and then import the
``tags.json`` file you created as part of your backup. Lastly, copy your
exported documents into the consumption directory and start up the consumer.
.. code-block:: shell-session
$ cd /path/to/project
$ rm data/db.sqlite3 # Delete the database
$ cd src
$ ./manage.py migrate # Create the database
$ ./manage.py createsuperuser
$ ./manage.py loaddata /path/to/arbitrary/place/tags.json
$ cp /path/to/exported/docs/* /path/to/consumption/dir/
$ ./manage.py document_consumer
Importing your data if you are :ref:`using Docker <setup-installation-docker>`
is almost as simple:
.. code-block:: shell-session
# Stop and remove your current containers
$ docker-compose stop
$ docker-compose rm -f
# Recreate them, add the superuser
$ docker-compose up -d
$ docker-compose run --rm webserver createsuperuser
# Load the tags
$ cat /path/to/arbitrary/place/tags.json | docker-compose run --rm webserver loaddata_stdin -
# Load your exported documents into the consumption directory
# (How you do this highly depends on how you have set this up)
$ cp /path/to/exported/docs/* /path/to/mounted/consumption/dir/
After loading the documents into the consumption directory the consumer will
immediately start consuming the documents.
.. _administration-updating:
Updating paperless
@ -93,8 +52,7 @@ is typically added by way of an environment variable set in ``paperless.conf``.
You may want to take a look at the ``paperless.conf.example`` file to see if
there's anything new in there compared to what you've got in ``/etc``.
If you are :ref:`using Docker <setup-installation-docker>` the update process
is similar:
If you are using docker the update process is similar:
.. code-block:: shell-session
@ -162,6 +120,8 @@ depending on whether you use docker or not.
All commands have built-in help, which can be accessed by executing them with
the argument ``--help``.
.. _utilities-exporter:
Document exporter
=================
@ -290,14 +250,22 @@ scheduler.
Managing filenames
==================
If you use paperless' feature to assign custom filenames to your documents
(TODO ref), you can use this command to move all your files after changing
the naming scheme.
.. warning::
TBD
Since this command moves your documents around a lot, it is advised to do a
backup beforehand. The renaming logic is robust and will never overwrite
or delete a file, but you can never be too careful.
.. code::
document_renamer
The command takes no arguments and processes all your documents at once.
.. _utilities-encyption:


@ -220,8 +220,6 @@ the consumption process will begin with the newly modified file.
.. _pdf2pdfocr.py: https://github.com/LeoFCardoso/pdf2pdfocr
.. _consumption-director-hook-variables-post:
Post-consumption script
=======================


@ -1,4 +1,3 @@
.. _api:
************
The REST API
@ -17,12 +16,10 @@ installation.
.. _Django REST Framework: http://django-rest-framework.org/
.. _api-uploading:
Uploading
=========
File uploads in an API are hard and so far as I've been able to tell, there's
no standard way of accepting them, so rather than crowbar file uploads into the
REST API and endure that headache, I've left that process to a simple HTTP
POST, documented on the :ref:`consumption page <consumption-http>`.
POST.


@ -1,10 +1,12 @@
.. _paperless_changelog:
*********
Changelog
#########
*********
paperless-ng 1.0
================
################
* **Deprecated:** GnuPG. Don't use it. If you're still using it, be aware that it
offers no protection at all, since the passphrase is stored alongside with the
@ -49,7 +51,9 @@ paperless-ng 1.0
Username, database and password all default to ``paperless`` if not specified.
* **docker-compose.yml uses PostgreSQL by default.**
* **Modified [breaking]:** document_retagger management command rework. See TODO hyperref
* **Modified [breaking]:** document_retagger management command rework. See
:ref:`utilities-retagger` for details. Replaces ``document_correspondents``
management command.
* **Removed [breaking]:** Reminders.
* **Removed:** All customizations made to the django admin pages.
@ -75,11 +79,11 @@ paperless-ng 1.0
* Many more small changes here and there. The usual stuff.
2.7.0
=====
#####
* `syntonym`_ submitted a pull request to catch IMAP connection errors `#475`_.
* `Stéphane Brunner`_ added ``psycopg2`` to the Pipfile `#489`_. He also fixed
a syntax error in ``docker-compose.yml.example`` `#488`_ and added [DjangoQL](https://github.com/ivelum/djangoql),
a syntax error in ``docker-compose.yml.example`` `#488`_ and added `DjangoQL`_,
which allows a litany of handy search functionality `#492`_.
* `CkuT`_ and `JOKer`_ hacked out a simple, but super-helpful optimisation to
how the thumbnails are served up, improving performance considerably `#481`_.
@ -92,7 +96,7 @@ paperless-ng 1.0
2.6.1
=====
#####
* We now have a logo, complete with a favicon :-)
* Removed some problematic tests.
@ -104,7 +108,7 @@ paperless-ng 1.0
2.6.0
=====
#####
* Allow an infinite number of logs to be deleted. Thanks to `Ulli`_ for noting
the problem in `#433`_.
@ -125,7 +129,7 @@ paperless-ng 1.0
2.5.0
=====
#####
* **New dependency**: Paperless now optimises thumbnail generation with
`optipng`_, so you'll need to install that somewhere in your PATH or declare
@ -169,7 +173,7 @@ paperless-ng 1.0
2.4.0
=====
#####
* A new set of actions are now available thanks to `jonaswinkler`_'s very first
pull request! You can now do nifty things like tag documents in bulk, or set
@ -190,7 +194,7 @@ paperless-ng 1.0
2.3.0
=====
#####
* Support for consuming plain text & markdown documents was added by
`Joshua Taillon`_! This was a long-requested feature, and its addition is
@ -208,14 +212,14 @@ paperless-ng 1.0
2.2.1
=====
#####
* `Kyle Lucy`_ reported a bug quickly after the release of 2.2.0 where we broke
the ``DISABLE_LOGIN`` feature: `#392`_.
2.2.0
=====
#####
* Thanks to `dadosch`_, `Wolfgang Mader`_, and `Tim Brooks`_ this is the first
version of Paperless that supports Django 2.0! As a result of their hard
@ -232,7 +236,7 @@ paperless-ng 1.0
2.1.0
=====
#####
* `Enno Lohmeier`_ added three simple features that make Paperless a lot more
user (and developer) friendly:
@ -251,7 +255,7 @@ paperless-ng 1.0
2.0.0
=====
#####
This is a big release as we've changed a core-functionality of Paperless: we no
longer encrypt files with GPG by default.
@ -267,7 +271,7 @@ that it was more an annoyance than anything else, so this feature is now turned
off unless you explicitly set a passphrase in your config file.
Migrating from 1.x
------------------
==================
Encryption isn't gone, it's just off for new users. So long as you have
``PAPERLESS_PASSPHRASE`` set in your config or your environment, Paperless
@ -283,7 +287,7 @@ Special thanks to `erikarvstedt`_, `matthewmoto`_, and `mcronce`_ who did the
bulk of the work on this big change.
1.4.0
=====
#####
* `Quentin Dawans`_ has refactored the document consumer to allow for some
command-line options. Notably, you can now direct it to consume from a
@ -318,7 +322,7 @@ bulk of the work on this big change.
to some excellent work from `erikarvstedt`_ on `#351`_
1.3.0
=====
#####
* You can now run Paperless without a login, though you'll still have to create
at least one user. This is thanks to a pull-request from `matthewmoto`_:
@ -341,7 +345,7 @@ bulk of the work on this big change.
problem and helping me find where to fix it.
1.2.0
=====
#####
* New Docker image, now based on Alpine, thanks to the efforts of `addadi`_
and `Pit`_. This new image is dramatically smaller than the Debian-based
@ -360,7 +364,7 @@ bulk of the work on this big change.
in the document text.
1.1.0
=====
#####
* Fix for `#283`_, a redirect bug which broke interactions with
paperless-desktop. Thanks to `chris-aeviator`_ for reporting it.
@ -370,7 +374,7 @@ bulk of the work on this big change.
`Dan Panzarella`_
1.0.0
=====
#####
* Upgrade to Django 1.11. **You'll need to run
``pip install -r requirements.txt`` after the usual ``git pull`` to
@ -389,14 +393,14 @@ bulk of the work on this big change.
`Lukas Winkler`_'s issue `#278`_
0.8.0
=====
#####
* Paperless can now run in a subdirectory on a host (``/paperless``), rather
than always running in the root (``/``) thanks to `maphy-psd`_'s work on
`#255`_.
0.7.0
=====
#####
* **Potentially breaking change**: As per `#235`_, Paperless will no longer
automatically delete documents attached to correspondents when those
@ -408,7 +412,7 @@ bulk of the work on this big change.
`Kusti Skytén`_ for posting the correct solution in the Github issue.
0.6.0
=====
#####
* Abandon the shared-secret trick we were using for the POST API in favour
of BasicAuth or Django session.
@ -422,7 +426,7 @@ bulk of the work on this big change.
the help with this feature.
0.5.0
=====
#####
* Support for fuzzy matching in the auto-tagger & auto-correspondent systems
thanks to `Jake Gysland`_'s patch `#220`_.
@ -440,13 +444,13 @@ bulk of the work on this big change.
* Amended the Django Admin configuration to have nice headers (`#230`_)
0.4.1
=====
#####
* Fix for `#206`_ wherein the pluggable parser didn't recognise files with
all-caps suffixes like ``.PDF``
0.4.0
=====
#####
* Introducing reminders. See `#199`_ for more information, but the short
explanation is that you can now attach simple notes & times to documents
@ -456,7 +460,7 @@ bulk of the work on this big change.
like to make use of this feature in his project.
0.3.6
=====
#####
* Fix for `#200`_ (!!) where the API wasn't configured to allow updating the
correspondent or the tags for a document.
@ -470,7 +474,7 @@ bulk of the work on this big change.
documentation is on its way.
0.3.5
=====
#####
* A serious facelift for the documents listing page wherein we drop the
tabular layout in favour of a tiled interface.
@ -481,7 +485,7 @@ bulk of the work on this big change.
consumption.
0.3.4
=====
#####
* Removal of django-suit due to a licensing conflict I bumped into in 0.3.3.
Note that you *can* use Django Suit with Paperless, but only in a
@ -494,26 +498,26 @@ bulk of the work on this big change.
API thanks to @thomasbrueggemann. See `#179`_.
0.3.3
=====
#####
* Thumbnails in the UI and a Django-suit -based face-lift courtesy of @ekw!
* Timezone, items per page, and default language are now all configurable,
also thanks to @ekw.
0.3.2
=====
#####
* Fix for `#172`_: defaulting ALLOWED_HOSTS to ``["*"]`` and allowing the
user to set her own value via ``PAPERLESS_ALLOWED_HOSTS`` should the need
arise.
0.3.1
=====
#####
* Added a default value for ``CONVERT_BINARY``
0.3.0
=====
#####
* Updated to using django-filter 1.x
* Added some system checks so new users aren't confused by misconfigurations.
@ -526,7 +530,7 @@ bulk of the work on this big change.
``PAPERLESS_SHARED_SECRET`` respectively instead.
0.2.0
=====
#####
* `#150`_: The media root is now a variable you can set in
``paperless.conf``.
@ -554,7 +558,7 @@ bulk of the work on this big change.
to `Martin Honermeyer`_ and `Tim White`_ for working with me on this.
0.1.1
=====
#####
* Potentially **Breaking Change**: All references to "sender" in the code
have been renamed to "correspondent" to better reflect the nature of the
@ -578,7 +582,7 @@ bulk of the work on this big change.
to be imported but made unavailable.
0.1.0
=====
#####
* Docker support! Big thanks to `Wayne Werner`_, `Brian Conn`_, and
`Tikitu de Jager`_ for this one, and especially to `Pit`_
@ -597,14 +601,14 @@ bulk of the work on this big change.
* Added tox with pep8 checking
0.0.6
=====
#####
* Added support for parallel OCR (significant work from `Pit`_)
* Sped up the language detection (significant work from `Pit`_)
* Added simple logging
0.0.5
=====
#####
* Added support for image files as documents (png, jpg, gif, tiff)
* Added a crude means of HTTP POST for document imports
@ -613,7 +617,7 @@ bulk of the work on this big change.
* Documentation for the above as well as data migration
0.0.4
=====
#####
* Added automated tagging based on keyword matching
* Cleaned up the document listing page
@ -621,19 +625,19 @@ bulk of the work on this big change.
* Added ``pytz`` to the list of requirements
0.0.3
=====
#####
* Added basic tagging
0.0.2
=====
#####
* Added language detection
* Added datestamps to ``document_exporter``.
* Changed ``settings.TESSERACT_LANGUAGE`` to ``settings.OCR_LANGUAGE``.
0.0.1
=====
#####
* Initial release
@ -812,6 +816,6 @@ bulk of the work on this big change.
.. _#489: https://github.com/the-paperless-project/paperless/pull/489
.. _#492: https://github.com/the-paperless-project/paperless/pull/492
.. _pipenv: https://docs.pipenv.org/
.. _a new home on Docker Hub: https://hub.docker.com/r/danielquinn/paperless/
.. _optipng: http://optipng.sourceforge.net/
.. _DjangoQL: https://github.com/ivelum/djangoql


@ -1,7 +1,6 @@
.. _index:
*********
Paperless
=========
*********
Paperless is a simple Django application running in two parts:
a *Consumer* (the thing that does the indexing) and
@ -10,8 +9,6 @@ download already-indexed documents). If you want to learn more about its
functions keep on reading after the installation section.
.. _index-why-this-exists:
Why This Exists
===============


@ -1,3 +1,4 @@
.. _scanners:
***********************


@ -120,7 +120,7 @@ At its core, paperless-ng is still paperless and fully compatible. However, some
things have changed under the hood, so you need to adapt your setup depending on
how you installed paperless. The important things to keep in mind are as follows.
* Read the :ref:`paperless_changelog` and take note of breaking changes.
* Read the :ref:`changelog <paperless_changelog>` and take note of breaking changes.
* It is recommended to use postgresql as the database now. The docker-compose
deployment will automatically create a postgresql instance and instruct
paperless to use it. This means that if you use the docker-compose script


@ -1,12 +1,10 @@
.. _troubleshooting:
***************
Troubleshooting
===============
***************
.. _troubleshooting-languagemissing:
Consumer warns ``OCR for XX failed``
------------------------------------
####################################
If you find the OCR accuracy to be too low, and/or the document consumer warns
that ``OCR for XX failed, but we're going to stick with what we've got since
@ -20,10 +18,9 @@ box, and your documents are written in Spanish you may need to run::
apt-get install -y tesseract-ocr-spa
.. _troubleshooting-convertpixelcache:
Consumer dies with ``convert: unable to extent pixel cache``
------------------------------------------------------------
############################################################
During the consumption process, Paperless invokes ImageMagick's ``convert``
program to translate the source document into something that the OCR engine can
@ -48,10 +45,9 @@ that's actually on a physical disk (and writable by the user running
Paperless), like ``/var/tmp/paperless`` or ``/home/my_user/tmp`` in a pinch.
.. _troubleshooting-decompressionbombwarning:
DecompressionBombWarning and/or no text in the OCR output
---------------------------------------------------------
#########################################################
Some users have had issues using Paperless to consume PDFs that were created
by merging Very Large Scanned Images into one PDF. If this happens to you,
it's likely because the PDF you've created contains some very large pages
@ -72,4 +68,4 @@ with a DPI of 300, then merging the images into the single PDF with
For more information on this and situations like it, you should take a look
at `Issue #118`_ as that's where this tip originated.
.. _Issue #118: https://github.com/the-paperless-project/paperless/issues/118
.. _Issue #118: https://github.com/the-paperless-project/paperless/issues/118


@ -130,7 +130,6 @@ REST API
You can also submit a document using the REST API; see the API section for details.
.. _usage-recommended_workflow:
The recommended workflow