diff --git a/docs/_static/Screenshot_first_logged.png b/docs/_static/Screenshot_first_logged.png deleted file mode 100644 index 9b9c6072b..000000000 Binary files a/docs/_static/Screenshot_first_logged.png and /dev/null differ diff --git a/docs/_static/Screenshot_first_run_login.png b/docs/_static/Screenshot_first_run_login.png deleted file mode 100644 index d704f1682..000000000 Binary files a/docs/_static/Screenshot_first_run_login.png and /dev/null differ diff --git a/docs/_static/Screenshot_upload_and_scanned.png b/docs/_static/Screenshot_upload_and_scanned.png deleted file mode 100644 index 7b433b2ca..000000000 Binary files a/docs/_static/Screenshot_upload_and_scanned.png and /dev/null differ diff --git a/docs/administration.rst b/docs/administration.rst new file mode 100644 index 000000000..881d31135 --- /dev/null +++ b/docs/administration.rst @@ -0,0 +1,354 @@ + +************** +Administration +************** + + +Making backups +############## + +.. warning:: + + This section is not updated yet. + +So you're bored of this whole project, or you want to make a remote backup of +your files for whatever reason. This is easy to do, simply use the +:ref:`exporter <utilities-exporter>` to dump your documents and database out +into an arbitrary directory. + + +.. _migrating-restoring: + +Restoring +========= + +Restoring your data is just as easy, since nearly all of your data exists either +in the file names, or in the contents of the files themselves. You just need to +create an empty database (just follow the +:ref:`installation instructions <setup-installation>` again) and then import the +``tags.json`` file you created as part of your backup. Lastly, copy your +exported documents into the consumption directory and start up the consumer. + +.. 
code-block:: shell-session + + $ cd /path/to/project + $ rm data/db.sqlite3 # Delete the database + $ cd src + $ ./manage.py migrate # Create the database + $ ./manage.py createsuperuser + $ ./manage.py loaddata /path/to/arbitrary/place/tags.json + $ cp /path/to/exported/docs/* /path/to/consumption/dir/ + $ ./manage.py document_consumer + +Importing your data if you are :ref:`using Docker <setup-installation-docker>` +is almost as simple: + +.. code-block:: shell-session + + # Stop and remove your current containers + $ docker-compose stop + $ docker-compose rm -f + + # Recreate them, add the superuser + $ docker-compose up -d + $ docker-compose run --rm webserver createsuperuser + + # Load the tags + $ cat /path/to/arbitrary/place/tags.json | docker-compose run --rm webserver loaddata_stdin - + + # Load your exported documents into the consumption directory + # (How you do this highly depends on how you have set this up) + $ cp /path/to/exported/docs/* /path/to/mounted/consumption/dir/ + +After loading the documents into the consumption directory the consumer will +immediately start consuming the documents. + +.. _administration-updating: + +Updating paperless +################## + +.. warning:: + + This section is not updated yet. + +For the most part, all you have to do to update Paperless is run ``git pull`` +on the directory containing the project files, and then use Django's +``migrate`` command to execute any database schema updates that might have been +rolled in as part of the update: + +.. code-block:: shell-session + + $ cd /path/to/project + $ git pull + $ pip install -r requirements.txt + $ cd src + $ ./manage.py migrate + +Note that it's possible (even likely) that while ``git pull`` may update some +files, the ``migrate`` step may not update anything. This is totally normal. + +Additionally, as new features are added, the ability to control those features +is typically added by way of an environment variable set in ``paperless.conf``. 
+You may want to take a look at the ``paperless.conf.example`` file to see if +there's anything new in there compared to what you've got in ``/etc``. + +If you are :ref:`using Docker <setup-installation-docker>` the update process +is similar: + +.. code-block:: shell-session + + $ cd /path/to/project + $ git pull + $ docker build -t paperless . + $ docker-compose run --rm consumer migrate + $ docker-compose up -d + +If ``git pull`` doesn't report any changes, there is no need to continue with +the remaining steps. + +This depends on the route you've chosen to run paperless. + + a. If you are not using docker, update python requirements. Paperless uses + `Pipenv`_ for managing dependencies: + + .. code:: bash + + $ pip install --upgrade pipenv + $ cd /path/to/paperless + $ pipenv install + + This creates a new virtual environment (or uses your existing environment) + and installs all dependencies into it. Running commands inside the environment + is done via + + .. code:: bash + + $ cd /path/to/paperless/src + $ pipenv run python3 manage.py my_command + + You will also need to build the frontend each time a new update is pushed. + See updating paperless for more information. TODO REFERENCE + + b. If you are using docker, build the docker image. + + .. code:: bash + + $ docker build -t jonaswinkler/paperless-ng:latest . + + Copy either docker-compose.yml.example or docker-compose.yml.sqlite.example + to docker-compose.yml and adjust the consumption directory. + +Management utilities +#################### + +Paperless comes with some management commands that perform various maintenance +tasks on your paperless instance. You can invoke these commands either by + +.. code:: bash + + $ cd /path/to/paperless + $ docker-compose run --rm webserver <command> <arguments> + +or + +.. code:: bash + + $ cd /path/to/paperless/src + $ pipenv run python manage.py <command> <arguments> + +depending on whether you use docker or not. 
+ +All commands have built-in help, which can be accessed by executing them with +the argument ``--help``. + +Document exporter +================= + +The document exporter exports all your data from paperless into a folder for +backup or migration to another DMS. + +.. code:: + + document_exporter target + +``target`` is a folder to which the data gets written. This includes documents, +thumbnails and a ``manifest.json`` file. The manifest contains all metadata from +the database (correspondents, tags, etc). + +When you use the provided docker compose script, specify ``../export`` as the +target. This path inside the container is automatically mounted on your host on +the folder ``export``. + + +.. _utilities-importer: + +Document importer +================= + +The document importer takes the export produced by the `Document exporter`_ and +imports it into paperless. + +The importer works just like the exporter. You point it at a directory, and +the script does the rest of the work: + +.. code:: + + document_importer source + +When you use the provided docker compose script, put the export inside the +``export`` folder in your paperless source directory. Specify ``../export`` +as the ``source``. + + +.. _utilities-retagger: + +Document retagger +================= + +Say you've imported a few hundred documents and now want to introduce +a tag or set up a new correspondent, and apply its matching to all of +the currently-imported docs. This problem is common enough that +there are tools for it. + +.. code:: + + document_retagger [-h] [-c] [-T] [-t] [-i] [--use-first] [-f] + + optional arguments: + -c, --correspondent + -T, --tags + -t, --document_type + -i, --inbox-only + --use-first + -f, --overwrite + +Run this after changing or adding matching rules. It'll loop over all +of the documents in your database and attempt to match documents +according to the new rules. 
+ +Specify any combination of ``-c``, ``-T`` and ``-t`` to have the +retagger perform matching of the specified metadata type. If you don't +specify any of these options, the document retagger won't do anything. + +Specify ``-i`` to have the document retagger work on documents tagged +with inbox tags only. This is useful when you don't want to mess with +your already processed documents. + +When multiple document types or correspondents match a single document, +the retagger won't assign these to the document. Specify ``--use-first`` +to override this behaviour and just use the first correspondent or type +it finds. This option does not apply to tags, since any number of tags +can be applied to a document. + +Finally, ``-f`` specifies that you wish to overwrite already assigned +correspondents, types and/or tags. The default behaviour is to not +assign correspondents and types to documents that have this data already +assigned. ``-f`` works differently for tags: By default, only additional tags get +added to documents; no tags are removed. With ``-f``, tags that don't +match a document anymore get removed as well. + + +Managing the Automatic matching algorithm +========================================= + +The *Auto* matching algorithm requires a trained neural network to work. +This network needs to be updated whenever something in your data +changes. The docker image takes care of that automatically with the task +scheduler. You can manually renew the classifier by invoking the following +management command: + +.. code:: + + document_create_classifier + +This command takes no arguments. + + +Managing the document search index +================================== + +The document search index is responsible for delivering search results for the +website. The document index is automatically updated whenever documents get +added to, changed, or removed from paperless. 
However, if the search yields +non-existing documents or won't find anything, you may need to recreate the +index manually. + +.. code:: + + document_index {reindex,optimize} + +Specify ``reindex`` to have the index created from scratch. This may take some +time. + +Specify ``optimize`` to optimize the index. This updates certain aspects of +the index and usually makes queries faster and also ensures that the +autocompletion works properly. This command is regularly invoked by the task +scheduler. + + +Managing filenames +================== + +.. warning:: + + TBD + +.. code:: + + document_renamer + + +.. _utilities-encyption: + +Managing encryption +=================== + +Documents can be stored in Paperless using GnuPG encryption. + +.. danger:: + + Encryption is deprecated since paperless-ng 1.0 and doesn't really provide any + additional security, since you have to store the passphrase in a configuration + file on the same system as the encrypted documents for paperless to work. Also, + paperless provides transparent access to your encrypted documents. + + Consider running paperless on an encrypted filesystem instead, which will then + at least provide security against physical hardware theft. + +.. code:: + + change_storage_type [--passphrase PASSPHRASE] {gpg,unencrypted} {gpg,unencrypted} + + positional arguments: + {gpg,unencrypted} The state you want to change your documents from + {gpg,unencrypted} The state you want to change your documents to + + optional arguments: + --passphrase PASSPHRASE + +Enabling encryption +------------------- + +Basic usage to enable encryption of your document store (**USE A MORE SECURE PASSPHRASE**): + +(Note: If ``PAPERLESS_PASSPHRASE`` isn't set already, you need to specify it here) + +.. 
code:: + + change_storage_type [--passphrase SECR3TP4SSPHRA$E] unencrypted gpg + + +Disabling encryption +-------------------- + +Basic usage to disable encryption of your document store: + +(Note: Again, if ``PAPERLESS_PASSPHRASE`` isn't set already, you need to specify it here) + +.. code:: + + change_storage_type [--passphrase SECR3TP4SSPHRA$E] gpg unencrypted + + +.. _Pipenv: https://pipenv.pypa.io/en/latest/ \ No newline at end of file diff --git a/docs/advanced_usage.rst b/docs/advanced_usage.rst new file mode 100644 index 000000000..7faba180a --- /dev/null +++ b/docs/advanced_usage.rst @@ -0,0 +1,244 @@ +*************** +Advanced topics +*************** + +Paperless offers a couple of features that automate certain tasks and make your life +easier. + +Guesswork +######### + + +Any document you put into the consumption directory will be consumed, but if +you name the file right, it'll automatically set some values in the database +for you. This is the logic the consumer follows: + +1. Try to find the date, correspondent, title, and tags in the file name following + the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``. Note that + the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or + ``YYYYMMDDZ``. The ``Z`` refers to "Zulu time" AKA "UTC". + The tags are optional, so the format ``Date - Correspondent - Title.pdf`` + works as well. +2. If that doesn't work, we skip the date and try this pattern: + ``Correspondent - Title - tag,tag,tag.pdf``. +3. If that doesn't work, we try to find the correspondent and title in the file + name following the pattern: ``Correspondent - Title.pdf``. +4. If that doesn't work, just assume that the name of the file is the title. 
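The fallback order in the list above can be sketched in a few lines of Python. This is only an illustration of the described rules, not the consumer's actual code: the function name ``guess_metadata``, the helper ``_parse_date`` and the returned dictionary keys are made up for this example, and it assumes the fields are separated by exactly ``" - "`` (space, dash, space).

```python
import re
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical helper names; the real consumer may be structured differently.
DATE_RE = re.compile(r"^(\d{14}|\d{8})Z$")


def _parse_date(token):
    """Return a UTC datetime for YYYYMMDDHHMMSSZ / YYYYMMDDZ, else None."""
    m = DATE_RE.match(token)
    if not m:
        return None
    raw = m.group(1)
    fmt = "%Y%m%d%H%M%S" if len(raw) == 14 else "%Y%m%d"
    try:
        return datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc)
    except ValueError:
        return None  # digits present, but not a valid calendar date


def guess_metadata(filename):
    """Apply the four patterns in order and return whatever could be guessed."""
    stem = Path(filename).stem          # drop the extension
    parts = stem.split(" - ")           # assumed separator: space, dash, space
    result = {"created": None, "correspondent": None, "title": stem, "tags": []}

    created = _parse_date(parts[0]) if len(parts) in (3, 4) else None
    if created is not None:
        # 1. Date - Correspondent - Title [- tag,tag,tag]
        result.update(created=created, correspondent=parts[1], title=parts[2])
        if len(parts) == 4:
            result["tags"] = parts[3].split(",")
    elif len(parts) == 3:
        # 2. Correspondent - Title - tag,tag,tag
        result.update(correspondent=parts[0], title=parts[1],
                      tags=parts[2].split(","))
    elif len(parts) == 2:
        # 3. Correspondent - Title
        result.update(correspondent=parts[0], title=parts[1])
    # 4. Otherwise the whole name stays the title.
    return result
```

Calling ``guess_metadata("Another Company - Letter of Reference.jpg")`` falls through to the third pattern, while a name without any ``" - "`` separator ends up entirely in ``title``.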
+ +So given the above, the following examples would work as you'd expect: + +* ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf`` +* ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf`` +* ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf`` +* ``Another Company - Letter of Reference.jpg`` +* ``Dad's Recipe for Pancakes.png`` + +These however wouldn't work: + +* ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf`` +* ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf`` +* ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf`` +* ``Another Company- Letter of Reference.jpg`` + +Do I have to be so strict about naming? +======================================= + +Rather than using the strict document naming rules, one can also set the option +``PAPERLESS_FILENAME_DATE_ORDER`` in ``paperless.conf`` to any date order +that is accepted by dateparser_. Doing so will cause ``paperless`` to default +to any date format that is found in the title, instead of a date pulled from +the document's text, without requiring the strict formatting of the document +filename as described above. + +.. _dateparser: https://github.com/scrapinghub/dateparser/blob/v0.7.0/docs/usage.rst#settings + +Transforming filenames for parsing +================================== + +Some devices can't produce filenames that can be parsed by the default +parser. By configuring the option ``PAPERLESS_FILENAME_PARSE_TRANSFORMS`` in +``paperless.conf`` one can add transformations that are applied to the filename +before it's parsed. + +The option contains a list of dictionaries of regular expressions (key: +``pattern``) and replacements (key: ``repl``) in JSON format, which are +applied in order by passing them to ``re.subn``. Transformation stops +after the first match, so at most one transformation is applied. The general +syntax is + +.. 
code:: python + + [{"pattern":"pattern1", "repl":"repl1"}, {"pattern":"pattern2", "repl":"repl2"}, ..., {"pattern":"patternN", "repl":"replN"}] + +The example below is for a Brother ADS-2400N, a scanner that allows +different names to different hardware buttons (useful for handling +multiple entities in one instance), but insists on adding ``_<count>`` +to the filename. + +.. code:: python + + # Brother profile configuration, support "Name_Date_Count" (the default + # setting) and "Name_Count" (use "Name" as tag and "Count" as title). + PAPERLESS_FILENAME_PARSE_TRANSFORMS=[{"pattern":"^([a-z]+)_(\\d{8})_(\\d{6})_([0-9]+)\\.", "repl":"\\2\\3Z - \\4 - \\1."}, {"pattern":"^([a-z]+)_([0-9]+)\\.", "repl":" - \\2 - \\1."}] + + +Matching tags, correspondents and document types +################################################ + +After the consumer has tried to figure out what it could from the file name, +it starts looking at the content of the document itself. It will compare the +matching algorithms defined by every tag and correspondent already set in your +database to see if they apply to the text in that document. In other words, +if you defined a tag called ``Home Utility`` that had a ``match`` property of +``bc hydro`` and a ``matching_algorithm`` of ``literal``, Paperless will +automatically tag your newly-consumed document with your ``Home Utility`` tag +so long as the text ``bc hydro`` appears in the body of the document somewhere. + +The matching logic is quite powerful, and supports searching the text of your +document with different algorithms, and as such, some experimentation may be +necessary to get things right. + +In order to have a tag, correspondent or type assigned automatically to newly +consumed documents, assign a match and matching algorithm using the web +interface. These settings define when to assign correspondents, tags and types +to documents. 
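As a rough illustration of the ``Home Utility`` example above, the following sketch shows what a ``literal`` match amounts to. The ``Tag`` dataclass and both function names are hypothetical, and treating the comparison as case-insensitive is an assumption made for this example, not something the text above guarantees.

```python
from dataclasses import dataclass


@dataclass
class Tag:
    # Hypothetical stand-in for a tag with its matching settings.
    name: str
    match: str
    matching_algorithm: str


def literal_matches(tag, document_text):
    """True if the match text appears verbatim in the document text.

    Assumption: case is ignored when comparing.
    """
    return tag.match.lower() in document_text.lower()


def tags_for(document_text, tags):
    """Names of all tags whose literal match applies to the document."""
    return [
        t.name
        for t in tags
        if t.matching_algorithm == "literal" and literal_matches(t, document_text)
    ]
```

With a tag ``Tag("Home Utility", "bc hydro", "literal")``, a document whose text mentions "BC Hydro" anywhere would receive the ``Home Utility`` tag under these assumptions.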
+ +The following algorithms are available: + +* **Any:** Looks for any occurrence of any word provided in match in the PDF. + If you define the match as ``Bank1 Bank2``, it will match documents containing + either of these terms. +* **All:** Requires that every word provided appears in the PDF, though not + necessarily in the order provided. +* **Literal:** Matches only if the match appears exactly as provided in the PDF. +* **Regular expression:** Parses the match as a regular expression and tries to + find a match within the document. +* **Fuzzy match:** Not documented yet. Look at the source. +* **Auto:** Tries to automatically match new documents. This does not require you + to set a match. See the notes below. + +When using the "any" or "all" matching algorithms, you can search for terms +that consist of multiple words by enclosing them in double quotes. For example, +defining a match text of ``"Bank of America" BofA`` using the "any" algorithm +will match documents that contain either "Bank of America" or "BofA", but will +not match documents containing "Bank of South America". + +Then just save your tag/correspondent and run another document through the +consumer. Once complete, you should see the newly-created document, +automatically tagged with the appropriate data. + + +Automatic matching +================== + +Paperless-ng comes with a new matching algorithm called *Auto*. This matching +algorithm tries to assign tags, correspondents and document types to your +documents based on how you have assigned these on existing documents. It +uses a neural network under the hood. + +If, for example, all your bank statements of your account 123 at the Bank of +America are tagged with the tag "bofa_123" and the matching algorithm of this +tag is set to *Auto*, this neural network will examine your documents and +automatically learn when to assign this tag. 
+ +There are a couple of caveats you need to keep in mind when using this feature: + +* Changes to your documents are not immediately reflected by the matching + algorithm. The neural network needs to be *trained* on your documents after + changes. Paperless periodically (default: once each hour) checks for changes + and does this automatically for you. +* The Auto matching algorithm only takes into account documents which are NOT + in your inbox (i.e., documents that have no inbox tags assigned to them). This + ensures that the neural network only learns from documents which you have + correctly tagged before. +* The matching algorithm can only work if there is a correlation between the + tag, correspondent or document type and the document itself. Your bank + statements usually contain your bank account number and the name of the bank, + so this works reasonably well. However, tags such as "TODO" cannot be + automatically assigned. +* The matching algorithm needs a reasonable number of documents to identify when + to assign tags, correspondents, and types. If one out of a thousand documents + has the correspondent "Very obscure web shop I bought something from five years + ago", it will probably not assign this correspondent automatically if you buy + something from them again. The more documents, the better. + +Hooking into the consumption process +#################################### + +Sometimes you may want to do something arbitrary whenever a document is +consumed. Rather than try to predict what you may want to do, Paperless lets +you execute scripts of your own choosing just before or after a document is +consumed using a couple of simple hooks. + +Just write a script, put it somewhere that Paperless can read & execute, and +then put the path to that script in ``paperless.conf`` with the variable name +of either ``PAPERLESS_PRE_CONSUME_SCRIPT`` or +``PAPERLESS_POST_CONSUME_SCRIPT``. + +.. TODO HYPEREF TO CONFIG + +.. 
important:: + + These scripts are executed in a **blocking** process, which means that if + a script takes a long time to run, it can significantly slow down your + document consumption flow. If you want things to run asynchronously, + you'll have to fork the process in your script and exit. + + +Pre-consumption script +====================== + +Executed after the consumer sees a new document in the consumption folder, but +before any processing of the document is performed. This script receives exactly +one argument: + +* Document file name + +A simple but common example for this would be creating a simple script like +this: + +``/usr/local/bin/ocr-pdf`` + +.. code:: bash + + #!/usr/bin/env bash + pdf2pdfocr.py -i ${1} + +``/etc/paperless.conf`` + +.. code:: bash + + ... + PAPERLESS_PRE_CONSUME_SCRIPT="/usr/local/bin/ocr-pdf" + ... + +This will pass the path to the document about to be consumed to ``/usr/local/bin/ocr-pdf``, +which will in turn call `pdf2pdfocr.py`_ on your document, which will then +overwrite the file with an OCR'd version of the file and exit. At which point, +the consumption process will begin with the newly modified file. + +.. _pdf2pdfocr.py: https://github.com/LeoFCardoso/pdf2pdfocr + + +.. _consumption-director-hook-variables-post: + +Post-consumption script +======================= + +Executed after the consumer has successfully processed a document and has moved it +into paperless. It receives the following arguments: + +* Document id +* Generated file name +* Source path +* Thumbnail path +* Download URL +* Thumbnail URL +* Correspondent +* Tags + +The script can be in any language you like, but for a simple shell script +example, you can take a look at ``post-consumption-example.sh`` in the +``scripts`` directory in this project. + +The post consumption script cannot cancel the consumption process. diff --git a/docs/api.rst b/docs/api.rst index d08826a33..a561de89d 100644 --- a/docs/api.rst +++ b/docs/api.rst @@ -1,7 +1,12 @@ .. 
_api: +************ The REST API -############ +************ + +.. warning:: + + This section is not updated yet. Paperless makes use of the `Django REST Framework`_ standard API interface because of its inherent awesomeness. Conveniently, the system is also @@ -15,7 +20,7 @@ installation. .. _api-uploading: Uploading ---------- +========= File uploads in an API are hard and so far as I've been able to tell, there's no standard way of accepting them, so rather than crowbar file uploads into the diff --git a/docs/changelog.rst b/docs/changelog.rst index 8a0528a88..b2fc8692c 100644 --- a/docs/changelog.rst +++ b/docs/changelog.rst @@ -1,6 +1,79 @@ +.. _paperless_changelog: + Changelog ######### +paperless-ng 1.0 +================ + +* **Deprecated:** GnuPG. Don't use it. If you're still using it, be aware that it + offers no protection at all, since the passphrase is stored alongside the + encrypted documents themselves. This feature will most likely be removed in future + versions. + +* **Added:** New frontend. Features: + + * Single page application: It's much more responsive than the django admin pages. + * Dashboard. Shows recently scanned documents, or todos, or other documents + of your choice. Allows uploading of documents. Shows basic statistics. + * Better document list with multiple display options. + * Full text search with result highlighting, auto completion and scoring based + on the query. It uses a document search index in the background. + * Saveable filters. + * Better log viewer. + +* **Added:** Document types. Assign these to documents just like correspondents. + They may be used in the future to perform automatic operations on documents + depending on the type. +* **Added:** Inbox tags. Define an inbox tag and it will automatically be + assigned to any new document scanned into the system. +* **Added:** Automatic matching. A new matching algorithm that automatically + assigns tags, document types and correspondents to your documents. 
It uses + a neural network trained on your data. +* **Added:** Archive serial numbers. Assign these to quickly find documents stored in + physical binders. +* **Added:** Enabled the internal user management of django. This isn't really a + multi user solution, however, it allows more than one user to access the website + and set some basic permissions / renew passwords. + +* **Modified [breaking]:** REST Api changes: + + * New filters added, other filters removed (case sensitive filters, slug filters) + * Endpoints for thumbnails, previews and downloads replace the old ``/fetch/`` urls. Redirects are in place. + * Endpoint for document uploads replaces the old ``/push`` url. Redirects are in place. + * Foreign key relationships are now served as IDs, not as urls. + +* **Modified [breaking]:** PostgreSQL: + + * If ``PAPERLESS_DBHOST`` is specified in the settings, paperless uses postgresql instead of sqlite. + Username, database and password all default to ``paperless`` if not specified. + * **docker-compose.yml uses PostgreSQL by default.** + +* **Modified [breaking]:** document_retagger management command rework. See TODO hyperref +* **Removed [breaking]:** Reminders. +* **Removed:** All customizations made to the django admin pages. + +* **Internal changes:** Mostly code cleanup, including: + + * Rework of the code of the tesseract parser. This is now a lot cleaner. + * Rework of the filename handling code. It was a mess. + * Fixed some issues with the document exporter not exporting all documents when encountering duplicate filenames. + * Consumer rework: now uses the excellent watchdog library, lots of code removed. + * Added a task scheduler that takes care of checking mail, training the classifier and maintaining the document search index. + * Updated dependencies. Now uses Pipenv all around. + * Updated Dockerfile and docker-compose. Now uses ``supervisord`` to run everything paperless-related in a single container. 
+ +* **Settings:** + + * ``PAPERLESS_FORGIVING_OCR`` is now default and gone. Reason: Even if ``langdetect`` fails to detect + a language, tesseract still does a very good job at ocr'ing a document with the default language. + Certain language specifics such as umlauts may not get picked up properly. + * ``PAPERLESS_DEBUG`` defaults to ``false``. + * The presence of ``PAPERLESS_DBHOST`` now determines whether to use PostgreSQL or + sqlite. + +* Many more small changes here and there. The usual stuff. + 2.7.0 ===== diff --git a/docs/changelog_jonaswinkler.rst b/docs/changelog_jonaswinkler.rst deleted file mode 100644 index 35824198f..000000000 --- a/docs/changelog_jonaswinkler.rst +++ /dev/null @@ -1,15 +0,0 @@ -Changelog (jonaswinkler) -######################## - -1.0.0 -===== - -* First release based on paperless 2.6.0 -* Added: Automatic document classification using neural networks (replaces - regex-based tagging) -* Added: Document types -* Added: Archive serial number allows easy referencing of physical document - copies -* Added: Inbox tags (added automatically to newly consumed documents) -* Added: Document viewer on document edit page -* Database backend is now configurable diff --git a/docs/conf.py b/docs/conf.py index eb6720dbb..7ebc82ea7 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -54,7 +54,7 @@ source_suffix = '.rst' master_doc = 'index' # General information about the project. 
-project = u'Paperless' +project = u'Paperless-ng' copyright = u'2015, Daniel Quinn' # The version info for the project you're documenting, acts as replacement for @@ -205,7 +205,8 @@ try: import sphinx_rtd_theme html_theme = "sphinx_rtd_theme" html_theme_path = [sphinx_rtd_theme.get_html_theme_path()] -except ImportError: +except ImportError as e: + print("error " + str(e)) pass # -- Options for LaTeX output --------------------------------------------- diff --git a/docs/consumption.rst b/docs/consumption.rst deleted file mode 100644 index 15f6c6393..000000000 --- a/docs/consumption.rst +++ /dev/null @@ -1,255 +0,0 @@ -.. _consumption: - -Consumption -########### - -Once you've got Paperless setup, you need to start feeding documents into it. -Currently, there are three options: the consumption directory, IMAP (email), and -HTTP POST. - - -.. _consumption-directory: - -The Consumption Directory -========================= - -The primary method of getting documents into your database is by putting them in -the consumption directory. The ``document_consumer`` script runs in an infinite -loop looking for new additions to this directory and when it finds them, it goes -about the process of parsing them with the OCR, indexing what it finds, and -encrypting the PDF (if ``PAPERLESS_PASSPHRASE`` is set), storing it in the -media directory. - -Getting stuff into this directory is up to you. If you're running Paperless -on your local computer, you might just want to drag and drop files there, but if -you're running this on a server and want your scanner to automatically push -files to this directory, you'll need to setup some sort of service to accept the -files from the scanner. Typically, you're looking at an FTP server like -`Proftpd`_ or `Samba`_. - -.. _Proftpd: http://www.proftpd.org/ -.. _Samba: http://www.samba.org/ - -So where is this consumption directory? It's wherever you define it. Look for -the ``CONSUMPTION_DIR`` value in ``settings.py``. 
Set that to somewhere -appropriate for your use and put some documents in there. When you're ready, -follow the :ref:`consumer <utilities-consumer>` instructions to get it running. - - -.. _consumption-directory-hook: - -Hooking into the Consumption Process ------------------------------------- - -Sometimes you may want to do something arbitrary whenever a document is -consumed. Rather than try to predict what you may want to do, Paperless lets -you execute scripts of your own choosing just before or after a document is -consumed using a couple simple hooks. - -Just write a script, put it somewhere that Paperless can read & execute, and -then put the path to that script in ``paperless.conf`` with the variable name -of either ``PAPERLESS_PRE_CONSUME_SCRIPT`` or -``PAPERLESS_POST_CONSUME_SCRIPT``. The script will be executed before or -or after the document is consumed respectively. - -.. important:: - - These scripts are executed in a **blocking** process, which means that if - a script takes a long time to run, it can significantly slow down your - document consumption flow. If you want things to run asynchronously, - you'll have to fork the process in your script and exit. - - -.. _consumption-directory-hook-variables: - -What Can These Scripts Do? -.......................... - -It's your script, so you're only limited by your imagination and the laws of -physics. However, the following values are passed to the scripts in order: - - -.. _consumption-director-hook-variables-pre: - -Pre-consumption script -:::::::::::::::::::::: - -* Document file name - -A simple but common example for this would be creating a simple script like -this: - -``/usr/local/bin/ocr-pdf`` - -.. code:: bash - - #!/usr/bin/env bash - pdf2pdfocr.py -i ${1} - -``/etc/paperless.conf`` - -.. code:: bash - - ... - PAPERLESS_PRE_CONSUME_SCRIPT="/usr/local/bin/ocr-pdf" - ... 
- -This will pass the path to the document about to be consumed to ``/usr/local/bin/ocr-pdf``, -which will in turn call `pdf2pdfocr.py`_ on your document, which will then -overwrite the file with an OCR'd version of the file and exit. At which point, -the consumption process will begin with the newly modified file. - -.. _pdf2pdfocr.py: https://github.com/LeoFCardoso/pdf2pdfocr - - -.. _consumption-director-hook-variables-post: - -Post-consumption script -::::::::::::::::::::::: - -* Document id -* Generated file name -* Source path -* Thumbnail path -* Download URL -* Thumbnail URL -* Correspondent -* Tags - -The script can be in any language you like, but for a simple shell script -example, you can take a look at ``post-consumption-example.sh`` in the -``scripts`` directory in this project. - - -.. _consumption-imap: - -IMAP (Email) -============ - -Another handy way to get documents into your database is to email them to -yourself. The typical use-case would be to be out for lunch and want to send a -copy of the receipt back to your system at home. Paperless can be taught to -pull emails down from an arbitrary account and dump them into the consumption -directory where the process :ref:`above <consumption-directory>` will follow the -usual pattern on consuming the document. - -Some things you need to know about this feature: - -* It's disabled by default. By setting the values below it will be enabled. -* It's been tested in a limited environment, so it may not work for you (please - submit a pull request if you can!) -* It's designed to **delete mail from the server once consumed**. So don't go - pointing this to your personal email account and wonder where all your stuff - went. -* Currently, only one photo (attachment) per email will work. - -So, with all that in mind, here's what you do to get it running: - -1. Setup a new email account somewhere, or if you're feeling daring, create a - folder in an existing email box and note the path to that folder. -2. 
In ``/etc/paperless.conf`` set all of the appropriate values in - ``PATHS AND FOLDERS`` and ``SECURITY``. - If you decided to use a subfolder of an existing account, then make sure you - set ``PAPERLESS_CONSUME_MAIL_INBOX`` accordingly here. You also have to set - the ``PAPERLESS_EMAIL_SECRET`` to something you can remember 'cause you'll - have to include that in every email you send. -3. Restart the :ref:`consumer <utilities-consumer>`. The consumer will check - the configured email account at startup and from then on every 10 minutes - for something new and pulls down whatever it finds. -4. Send yourself an email! Note that the subject is treated as the file name, - so if you set the subject to ``Correspondent - Title - tag,tag,tag``, you'll - get what you expect. Also, you must include the aforementioned secret - string in every email so the fetcher knows that it's safe to import. - Note that Paperless only allows the email title to consist of safe characters - to be imported. These consist of alpha-numeric characters and ``-_ ,.'``. -5. After a few minutes, the consumer will poll your mailbox, pull down the - message, and place the attachment in the consumption directory with the - appropriate name. A few minutes later, the consumer will import it like any - other file. - - -.. _consumption-http: - -HTTP POST -========= - -You can also submit a document via HTTP POST, so long as you do so after -authenticating. To push your document to Paperless, send an HTTP POST to the -server with the following name/value pairs: - -* ``correspondent``: The name of the document's correspondent. Note that there - are restrictions on what characters you can use here. Specifically, - alphanumeric characters, `-`, `,`, `.`, and `'` are ok, everything else is - out. You also can't use the sequence ` - ` (space, dash, space). -* ``title``: The title of the document. The rules for characters is the same - here as the correspondent. 
-* ``document``: The file you're uploading - -Specify ``enctype="multipart/form-data"``, and then POST your file with:: - - Content-Disposition: form-data; name="document"; filename="whatever.pdf" - -An example of this in HTML is a typical form: - -.. code:: html - - <form method="post" enctype="multipart/form-data"> - <input type="text" name="correspondent" value="My Correspondent" /> - <input type="text" name="title" value="My Title" /> - <input type="file" name="document" /> - <input type="submit" name="go" value="Do the thing" /> - </form> - -But a potentially more useful way to do this would be in Python. Here we use -the requests library to handle basic authentication and to send the POST data -to the URL. - -.. code:: python - - import os - - from hashlib import sha256 - - import requests - from requests.auth import HTTPBasicAuth - - # You authenticate via BasicAuth or with a session id. - # We use BasicAuth here - username = "my-username" - password = "my-super-secret-password" - - # Where you have Paperless installed and listening - url = "http://localhost:8000/push" - - # Document metadata - correspondent = "Test Correspondent" - title = "Test Title" - - # The local file you want to push - path = "/path/to/some/directory/my-document.pdf" - - - with open(path, "rb") as f: - - response = requests.post( - url=url, - data={"title": title, "correspondent": correspondent}, - files={"document": (os.path.basename(path), f, "application/pdf")}, - auth=HTTPBasicAuth(username, password), - allow_redirects=False - ) - - if response.status_code == 202: - - # Everything worked out ok - print("Upload successful") - - else: - - # If you don't get a 202, it's probably because your credentials - # are wrong or something. This will give you a rough idea of what - # happened. 
- - print("We got HTTP status code: {}".format(response.status_code)) - for k, v in response.headers.items(): - print("{}: {}".format(k, v)) diff --git a/docs/customising.rst b/docs/customising.rst deleted file mode 100644 index 0d8e428cd..000000000 --- a/docs/customising.rst +++ /dev/null @@ -1,42 +0,0 @@ -.. _customising: - -Customising Paperless -##################### - -Currently, the Paperless' interface is just the default Django admin, which -while powerful, is rather boring. If you'd like to give the site a bit of a -face-lift, or if you simply want to adjust the colours, contrast, or font size -to make things easier to read, you can do that by adding your own CSS or -Javascript quite easily. - - -.. _customising-overrides: - -Overrides -========= - -On every page load, Paperless looks for two files in your media root directory -(the directory defined by your ``PAPERLESS_MEDIADIR`` configuration variable or -the default, ``<project root>/media/``) for two files: - -* ``overrides.css`` -* ``overrides.js`` - -If it finds either or both of those files, they'll be loaded into the page: the -CSS in the ``<head>``, and the Javascript stuffed into the last line of the -``<body>``. - - -.. _customising-overrides-note: - -An important note about customisation -------------------------------------- - -Any changes you make to the site with your CSS or Javascript are likely to -depend on the structure of the current HTML and/or the existing CSS rules. For -the most part it's safe to assume that these bits won't change, but *sometimes -they do* as features are added or bugs are fixed. - -If you make a change that you think others would appreciate though, submit it -as a pull request and maybe we can find a way to work it into the project by -default! \ No newline at end of file diff --git a/docs/guesswork.rst b/docs/guesswork.rst deleted file mode 100644 index c12ecd0c4..000000000 --- a/docs/guesswork.rst +++ /dev/null @@ -1,131 +0,0 @@ -.. 
_guesswork: - -Guesswork -######### - -During the consumption process, Paperless tries to guess some of the attributes -of the document it's looking at. To do this it uses two approaches: - - -.. _guesswork-naming: - -File Naming -=========== - -Any document you put into the consumption directory will be consumed, but if -you name the file right, it'll automatically set some values in the database -for you. This is is the logic the consumer follows: - -1. Try to find the correspondent, title, and tags in the file name following - the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``. Note that - the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or - ``YYYYMMDDZ``. The ``Z`` refers "Zulu time" AKA "UTC". - The tags are optional, so the format ``Date - Correspondent - Title.pdf`` - works as well. -2. If that doesn't work, we skip the date and try this pattern: - ``Correspondent - Title - tag,tag,tag.pdf``. -3. If that doesn't work, we try to find the correspondent and title in the file - name following the pattern: ``Correspondent - Title.pdf``. -4. If that doesn't work, just assume that the name of the file is the title. - -So given the above, the following examples would work as you'd expect: - -* ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf`` -* ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf`` -* ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf`` -* ``Another Company - Letter of Reference.jpg`` -* ``Dad's Recipe for Pancakes.png`` - -These however wouldn't work: - -* ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf`` -* ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf`` -* ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf`` -* ``Another Company- Letter of Reference.jpg`` - -Do I have to be so strict about naming? 
---------------------------------------- -Rather than using the strict document naming rules, one can also set the option -``PAPERLESS_FILENAME_DATE_ORDER`` in ``paperless.conf`` to any date order -that is accepted by dateparser_. Doing so will cause ``paperless`` to default -to any date format that is found in the title, instead of a date pulled from -the document's text, without requiring the strict formatting of the document -filename as described above. - -.. _dateparser: https://github.com/scrapinghub/dateparser/blob/v0.7.0/docs/usage.rst#settings - -Transforming filenames for parsing ----------------------------------- -Some devices can't produce filenames that can be parsed by the default -parser. By configuring the option ``PAPERLESS_FILENAME_PARSE_TRANSFORMS`` in -``paperless.conf`` one can add transformations that are applied to the filename -before it's parsed. - -The option contains a list of dictionaries of regular expressions (key: -``pattern``) and replacements (key: ``repl``) in JSON format, which are -applied in order by passing them to ``re.subn``. Transformation stops -after the first match, so at most one transformation is applied. The general -syntax is - -.. code:: python - - [{"pattern":"pattern1", "repl":"repl1"}, {"pattern":"pattern2", "repl":"repl2"}, ..., {"pattern":"patternN", "repl":"replN"}] - -The example below is for a Brother ADS-2400N, a scanner that allows -different names to different hardware buttons (useful for handling -multiple entities in one instance), but insists on adding ``_<count>`` -to the filename. - -.. code:: python - - # Brother profile configuration, support "Name_Date_Count" (the default - # setting) and "Name_Count" (use "Name" as tag and "Count" as title). - PAPERLESS_FILENAME_PARSE_TRANSFORMS=[{"pattern":"^([a-z]+)_(\\d{8})_(\\d{6})_([0-9]+)\\.", "repl":"\\2\\3Z - \\4 - \\1."}, {"pattern":"^([a-z]+)_([0-9]+)\\.", "repl":" - \\2 - \\1."}] - -.. 
_guesswork-content: - -Reading the Document Contents -============================= - -After the consumer has tried to figure out what it could from the file name, -it starts looking at the content of the document itself. It will compare the -matching algorithms defined by every tag and correspondent already set in your -database to see if they apply to the text in that document. In other words, -if you defined a tag called ``Home Utility`` that had a ``match`` property of -``bc hydro`` and a ``matching_algorithm`` of ``literal``, Paperless will -automatically tag your newly-consumed document with your ``Home Utility`` tag -so long as the text ``bc hydro`` appears in the body of the document somewhere. - -The matching logic is quite powerful, and supports searching the text of your -document with different algorithms, and as such, some experimentation may be -necessary to get things Just Right. - - -.. _guesswork-content-howto: - -How Do I Set Up These Matching Algorithms? ------------------------------------------- - -Setting up of the algorithms is easily done through the admin interface. When -you create a new correspondent or tag, there are optional fields for matching -text and matching algorithm. From the help info there: - -.. note:: - - Which algorithm you want to use when matching text to the OCR'd PDF. Here, - "any" looks for any occurrence of any word provided in the PDF, while "all" - requires that every word provided appear in the PDF, albeit not in the - order provided. A "literal" match means that the text you enter must - appear in the PDF exactly as you've entered it, and "regular expression" - uses a regex to match the PDF. If you don't know what a regex is, you - probably don't want this option. - -When using the "any" or "all" matching algorithms, you can search for terms -that consist of multiple words by enclosing them in double quotes. 
For example, -defining a match text of ``"Bank of America" BofA`` using the "any" algorithm, -will match documents that contain either "Bank of America" or "BofA", but will -not match documents containing "Bank of South America". - -Then just save your tag/correspondent and run another document through the -consumer. Once complete, you should see the newly-created document, -automatically tagged with the appropriate data. diff --git a/docs/index.rst b/docs/index.rst index 75046b3a4..531a9ae04 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -4,8 +4,8 @@ Paperless ========= Paperless is a simple Django application running in two parts: -a :ref:`consumer <utilities-consumer>` (the thing that does the indexing) and -the :ref:`webserver <utilities-webserver>` (the part that lets you search & +a *Consumer* (the thing that does the indexing) and +the *Web server* (the part that lets you search & download already-indexed documents). If you want to learn more about its functions keep on reading after the installation section. @@ -25,26 +25,34 @@ finding stuff again. I feed documents right from the post box into the scanner and then shred them. Perhaps you might find it useful too. +Paperless-ng +============ + +I wanted to make big changes to the project that will impact the way it is used +by its users greatly. Among the users who currently use paperless in production +there are probably many that don't want these changes right away. I also wanted +to have more control over what goes into the code and what does not. Therefore, +paperless-ng was created. NG stands for both Angular (the framework used for the +Frontend) and next-gen. Publishing this project under a different name also +avoids confusion between paperless and paperless-ng. + +It would be great if this project could eventually merge back into the main +repository, but it needs a lot more work before that can happen. Contents ======== .. 
toctree:: - :maxdepth: 2 + :maxdepth: 1 - requirements setup - consumption + usage_overview + advanced_usage + administration api - utilities - guesswork - migrating - customising extending troubleshooting contributing scanners - screenshots changelog - changelog_jonaswinkler diff --git a/docs/migrating.rst b/docs/migrating.rst deleted file mode 100644 index c3e702bd5..000000000 --- a/docs/migrating.rst +++ /dev/null @@ -1,109 +0,0 @@ -.. _migrating: - -Migrating, Updates, and Backups -=============================== - -As Paperless is still under active development, there's a lot that can change -as software updates roll out. You should backup often, so if anything goes -wrong during an update, you at least have a means of restoring to something -usable. Thankfully, there are automated ways of backing up, restoring, and -updating the software. - - -.. _migrating-backup: - -Backing Up ----------- - -So you're bored of this whole project, or you want to make a remote backup of -your files for whatever reason. This is easy to do, simply use the -:ref:`exporter <utilities-exporter>` to dump your documents and database out -into an arbitrary directory. - - -.. _migrating-restoring: - -Restoring ---------- - -Restoring your data is just as easy, since nearly all of your data exists either -in the file names, or in the contents of the files themselves. You just need to -create an empty database (just follow the -:ref:`installation instructions <setup-installation>` again) and then import the -``tags.json`` file you created as part of your backup. Lastly, copy your -exported documents into the consumption directory and start up the consumer. - -.. 
code-block:: shell-session - - $ cd /path/to/project - $ rm data/db.sqlite3 # Delete the database - $ cd src - $ ./manage.py migrate # Create the database - $ ./manage.py createsuperuser - $ ./manage.py loaddata /path/to/arbitrary/place/tags.json - $ cp /path/to/exported/docs/* /path/to/consumption/dir/ - $ ./manage.py document_consumer - -Importing your data if you are :ref:`using Docker <setup-installation-docker>` -is almost as simple: - -.. code-block:: shell-session - - # Stop and remove your current containers - $ docker-compose stop - $ docker-compose rm -f - - # Recreate them, add the superuser - $ docker-compose up -d - $ docker-compose run --rm webserver createsuperuser - - # Load the tags - $ cat /path/to/arbitrary/place/tags.json | docker-compose run --rm webserver loaddata_stdin - - - # Load your exported documents into the consumption directory - # (How you do this highly depends on how you have set this up) - $ cp /path/to/exported/docs/* /path/to/mounted/consumption/dir/ - -After loading the documents into the consumption directory the consumer will -immediately start consuming the documents. - - -.. _migrating-updates: - -Updates -------- - -For the most part, all you have to do to update Paperless is run ``git pull`` -on the directory containing the project files, and then use Django's -``migrate`` command to execute any database schema updates that might have been -rolled in as part of the update: - -.. code-block:: shell-session - - $ cd /path/to/project - $ git pull - $ pip install -r requirements.txt - $ cd src - $ ./manage.py migrate - -Note that it's possible (even likely) that while ``git pull`` may update some -files, the ``migrate`` step may not update anything. This is totally normal. - -Additionally, as new features are added, the ability to control those features -is typically added by way of an environment variable set in ``paperless.conf``. 
-You may want to take a look at the ``paperless.conf.example`` file to see if -there's anything new in there compared to what you've got in ``/etc``. - -If you are :ref:`using Docker <setup-installation-docker>` the update process -is similar: - -.. code-block:: shell-session - - $ cd /path/to/project - $ git pull - $ docker build -t paperless . - $ docker-compose run --rm consumer migrate - $ docker-compose up -d - -If ``git pull`` doesn't report any changes, there is no need to continue with -the remaining steps. diff --git a/docs/requirements.rst b/docs/requirements.rst deleted file mode 100644 index 54f0d9216..000000000 --- a/docs/requirements.rst +++ /dev/null @@ -1,125 +0,0 @@ -.. _requirements: - -Requirements -============ - -You need a Linux machine or Unix-like setup (theoretically an Apple machine -should work) that has the following software installed: - -* `Python3`_ (with development libraries, pip and virtualenv) -* `GNU Privacy Guard`_ -* `Tesseract`_, plus its language files matching your document base. -* `Imagemagick`_ version 6.7.5 or higher -* `unpaper`_ -* `libpoppler-cpp-dev`_ PDF rendering library -* `optipng`_ - -.. _Python3: https://python.org/ -.. _GNU Privacy Guard: https://gnupg.org -.. _Tesseract: https://github.com/tesseract-ocr -.. _Imagemagick: http://imagemagick.org/ -.. _unpaper: https://github.com/unpaper/unpaper -.. _libpoppler-cpp-dev: https://poppler.freedesktop.org/ -.. _optipng: http://optipng.sourceforge.net/ - -Notably, you should confirm how you access your Python3 installation. Many -Linux distributions will install Python3 in parallel to Python2, using the -names ``python3`` and ``python`` respectively. The same goes for ``pip3`` and -``pip``. Running Paperless with Python2 will likely break things, so make sure -that you're using the right version. - -For the purposes of simplicity, ``python`` and ``pip`` is used everywhere to -refer to their Python3 versions. 
- -In addition to the above, there are a number of Python requirements, all of -which are listed in a file called ``requirements.txt`` in the project root -directory. - -If you're not working on a virtual environment (like Docker), you -should probably be using a virtualenv, but that's your call. The reasons why -you might choose a virtualenv or not aren't really within the scope of this -document. Needless to say if you don't know what a virtualenv is, you should -probably figure that out before continuing. - - -.. _requirements-apple: - -Problems with Imagemagick & PDFs --------------------------------- - -Some users have `run into problems`_ with getting ImageMagick to do its thing -with PDFs. Often this is the case with Apple systems using HomeBrew, but other -Linuxes have been a problem as well. The solution appears to be to install -ghostscript as well as ImageMagick: - -.. _run into problems: https://github.com/the-paperless-project/paperless/issues/25 - -.. code:: bash - - $ brew install ghostscript - $ brew install imagemagick - $ brew install libmagic - - -.. _requirements-baremetal: - -Python-specific Requirements: No Virtualenv -------------------------------------------- - -If you don't care to use a virtual env, then installation of the Python -dependencies is easy: - -.. code:: bash - - $ pip install --user --requirement /path/to/paperless/requirements.txt - -This will download and install all of the requirements into -``${HOME}/.local``. Remember that your distribution may be using ``pip3`` as -mentioned above. - - -.. _requirements-virtualenv: - -Python-specific Requirements: Virtualenv ----------------------------------------- - -Using a virtualenv for this is pretty straightforward: create a virtualenv, -enter it, and install the requirements using the ``requirements.txt`` file: - -.. code:: bash - - $ virtualenv --python=/path/to/python3 /path/to/arbitrary/directory - $ . 
/path/to/arbitrary/directory/bin/activate - $ pip install --requirement /path/to/paperless/requirements.txt - -Now you're ready to go. Just remember to enter (activate) your virtualenv -whenever you want to use Paperless. - - -.. _requirements-documentation: - -Documentation -------------- - -As generation of the documentation is not required for the use of Paperless, -dependencies for this process are not included in ``requirements.txt``. If -you'd like to generate your own docs locally, you'll need to: - -.. code:: bash - - $ pip install sphinx - -and then cd into the ``docs`` directory and type ``make html``. - -If you are using Docker, you can use the following commands to build the -documentation and run a webserver serving it on `port 8001`_: - -.. code:: bash - - $ pwd - /path/to/paperless - - $ docker build -t paperless:docs -f docs/Dockerfile . - $ docker run --rm -it -p "8001:8000" paperless:docs - -.. _port 8001: http://127.0.0.1:8001 diff --git a/docs/requirements.txt b/docs/requirements.txt deleted file mode 100644 index e69de29bb..000000000 diff --git a/docs/scanners.rst b/docs/scanners.rst index 9815637b1..7e41ecd53 100644 --- a/docs/scanners.rst +++ b/docs/scanners.rst @@ -1,7 +1,8 @@ .. _scanners: -Scanner Recommendations -======================= +*********************** +Scanner recommendations +*********************** As Paperless operates by watching a folder for new files, doesn't care what scanner you use, but sometimes finding a scanner that will write to an FTP, @@ -23,16 +24,19 @@ that works right for you based on recommentations from other Paperless users. +---------+----------------+-----+-----+-----+----------------+ | Fujitsu | `ix500`_ | yes | | yes | `eonist`_ | +---------+----------------+-----+-----+-----+----------------+ +| Fujitsu | `S1300i`_ | yes | | yes | `jonaswinkler`_| ++---------+----------------+-----+-----+-----+----------------+ .. _ADS-1500W: https://www.brother.ca/en/p/ads1500w .. 
_MFC-J6930DW: https://www.brother.ca/en/p/MFCJ6930DW .. _MFC-J5910DW: https://www.brother.co.uk/printers/inkjet-printers/mfcj5910dw .. _MFC-9142CDN: https://www.brother.co.uk/printers/laser-printers/mfc9140cdn -.. _ix500: http://www.fujitsu.com/us/products/computing/peripheral/scanners/scansnap/ix500/ +.. _ix500: https://www.fujitsu.com/global/products/computing/peripheral/scanners/scansnap/ix500/ +.. _S1300i: https://www.fujitsu.com/global/products/computing/peripheral/scanners/soho/s1300i/ .. _danielquinn: https://github.com/danielquinn .. _ayounggun: https://github.com/ayounggun .. _bmsleight: https://github.com/bmsleight .. _eonist: https://github.com/eonist .. _REOLDEV: https://github.com/REOLDEV - +.. _jonaswinkler: https://github.com/jonaswinkler diff --git a/docs/screenshots.rst b/docs/screenshots.rst deleted file mode 100644 index 53f564dd6..000000000 --- a/docs/screenshots.rst +++ /dev/null @@ -1,16 +0,0 @@ -.. _screenshots: - -Screenshots -=========== - -Once everything is set-up login to paperless using the web front-end - -.. image:: ./_static/Screenshot_first_run_login.png - -Nice clean interface - -.. image:: ./_static/Screenshot_first_logged.png - -Some documents loaded in via ftp or using the scanners ftp. - -.. image:: ./_static/Screenshot_upload_and_scanned.png diff --git a/docs/setup.rst b/docs/setup.rst index 9b8e3a548..67b65951e 100644 --- a/docs/setup.rst +++ b/docs/setup.rst @@ -1,500 +1,187 @@ -.. _setup: +***** Setup -===== - -Paperless isn't a very complicated app, but there are a few components, so some -basic documentation is in order. If you follow along in this document and -still have trouble, please open an `issue on GitHub`_ so I can fill in the -gaps. - -.. _issue on GitHub: https://github.com/the-paperless-project/paperless/issues - - -.. _setup-download: +***** Download --------- +######## The source is currently only available via GitHub, so grab it from there, -either by using ``git``: +by using ``git``: .. 
code:: bash - $ git clone https://github.com/the-paperless-project/paperless.git + $ git clone https://github.com/jonaswinkler/paperless-ng.git $ cd paperless -or just download the tarball and go that route: - -.. code:: bash - - $ cd to the directory where you want to run Paperless - $ wget https://github.com/the-paperless-project/paperless/archive/master.zip - $ unzip master.zip - $ cd paperless-master - - -.. _setup-installation: - -Installation & Configuration ----------------------------- +Installation +############ You can go multiple routes with setting up and running Paperless: - * The `bare metal route`_ - * The `docker route`_ - * A suggested `linux containers route`_ +* The `docker route`_ +* The `bare metal route`_ +The recommended setup route is docker, since it takes care of all dependencies +for you. The `docker route`_ is quick & easy. -The `bare metal route`_ is a bit more complicated to setup but makes it easier +The `bare metal route`_ is more complicated to set up but makes it easier should you want to contribute some code back. -The `linux containers route`_ is quick, but makes alot of assumptions on the -set-up, on the other hand the script could be used to install on a base -debian or ubuntu server. +Docker Route +============ -.. _docker route: setup-installation-docker_ -.. _bare metal route: setup-installation-bare-metal_ -.. _Docker Machine: https://docs.docker.com/machine/ +1. Install `Docker`_ and `docker-compose`_. [#compose]_ -.. _setup-installation-bare-metal: + .. caution:: -Standard (Bare Metal) -+++++++++++++++++++++ + If you want to use the included ``docker-compose.yml.example`` file, you + need to have at least Docker version **17.09.0** and docker-compose + version **1.17.0**. -1. Install the requirements as per the :ref:`requirements <requirements>` page. -2. Within the extract of master.zip go to the ``src`` directory. -3. Copy ``../paperless.conf.example`` to ``/etc/paperless.conf`` and open it in - your favourite editor.
As this file contains passwords. It should only be - readable by user root and paperless! Set the values for: + See the `Docker installation guide`_ on how to install the current + version of Docker for your operating system or Linux distribution of + choice. To get an up-to-date version of docker-compose, follow the + `docker-compose installation guide`_ if your package repository doesn't + include it. - Set the values for: + .. _Docker installation guide: https://docs.docker.com/engine/installation/ + .. _docker-compose installation guide: https://docs.docker.com/compose/install/ - * ``PAPERLESS_CONSUMPTION_DIR``: this is where your documents will be - dumped to be consumed by Paperless. - * ``PAPERLESS_OCR_THREADS``: this is the number of threads the OCR process - will spawn to process document pages in parallel. - * ``PAPERLESS_PASSPHRASE``: this is only required if you want to use GPG to - encrypt your document files. This is the passphrase Paperless uses to - encrypt/decrypt the original documents. Don't worry about defining this - if you don't want to use encryption (the default). +2. Create a copy of ``docker-compose.yml.example`` as ``docker-compose.yml`` + and a copy of ``docker-compose.env.example`` as ``docker-compose.env``. + You'll be editing both these files: taking a copy ensures that you can + ``git pull`` to receive updates without risking merge conflicts with your + modified versions of the configuration files. +3. Modify ``docker-compose.yml`` to your preferences. You should change the path + to the consumption directory in this file. Find the line that specifies where + to mount the consumption directory: - Note also that if you're using the ``runserver`` as mentioned below, you - should make sure that PAPERLESS_DEBUG="true" or is just commented out as - this is the default. + .. code:: + + - ./consume:/usr/src/paperless/consume + + Replace the part BEFORE the colon with a local directory of your choice: -4. 
Initialise the SQLite database with ``./manage.py migrate``. -5. Collect the static files for the webserver with ``./manage.py collectstatic``. -6. Create a user for your Paperless instance with - ``./manage.py createsuperuser``. Follow the prompts to create your user. -7. Start the webserver with ``./manage.py runserver <IP>:<PORT>``. - If no specific IP or port is given, the default is ``127.0.0.1:8000`` also - known as http://localhost:8000/. - You should now be able to visit your (empty) installation at - `Paperless webserver`_ or whatever you chose before. You can login with the - user/pass you created in #5. + .. code:: -8. In a separate window, change to the ``src`` directory in this repo again, - but this time, you should start the consumer script with - ``./manage.py document_consumer``. -9. Scan something or put a file into the ``CONSUMPTION_DIR``. -10. Wait a few minutes -11. Visit the document list on your webserver, and it should be there, indexed - and downloadable. - -.. caution:: - - This installation is not secure. Once everything is working head over to - `Making things more permanent`_ - -.. _Paperless webserver: http://127.0.0.1:8000 -.. _Making things more permanent: setup-permanent_ - -.. _setup-installation-docker: - -Docker Method -+++++++++++++ - -1. Install `Docker`_. - - .. caution:: - - As mentioned earlier, this guide assumes that you use Docker natively - under Linux. If you are using `Docker Machine`_ under Mac OS X or - Windows, you will have to adapt IP addresses, volume-mounting, command - execution and maybe more. - -2. Install `docker-compose`_. [#compose]_ - - .. caution:: - - If you want to use the included ``docker-compose.yml.example`` file, you - need to have at least Docker version **1.12.0** and docker-compose - version **1.9.0**. - - See the `Docker installation guide`_ on how to install the current - version of Docker for your operating system or Linux distribution of - choice. 
To get an up-to-date version of docker-compose, follow the - `docker-compose installation guide`_ if your package repository doesn't - include it. - - .. _Docker installation guide: https://docs.docker.com/engine/installation/ - .. _docker-compose installation guide: https://docs.docker.com/compose/install/ - -3. Create a copy of ``docker-compose.yml.example`` as ``docker-compose.yml`` - and a copy of ``docker-compose.env.example`` as ``docker-compose.env``. - You'll be editing both these files: taking a copy ensures that you can - ``git pull`` to receive updates without risking merge conflicts with your - modified versions of the configuration files. -4. Modify ``docker-compose.yml`` to your preferences, following the - instructions in comments in the file. The only change that is a hard - requirement is to specify where the consumption directory should - mount.[#dockercomposeyml]_ - - .. caution:: - - If you are using NFS mounts for the consume directory you also need to - change the command to turn off inotify as it doesn't work with NFS - - ``command: ["document_consumer", "--no-inotify"]`` + - /home/jonaswinkler/paperless-inbox:/usr/src/paperless/consume + + Don't change the part after the colon or paperless won't find your documents. -5. Modify ``docker-compose.env`` and adapt the following environment variables: +4. Modify ``docker-compose.env``, following the comments in the file. The + most important change is to set ``USERMAP_UID`` and ``USERMAP_GID`` + to the uid and gid of your user on the host system. This ensures that + both the docker container and you on the host machine have write access + to the consumption directory. If your UID and GID on the host system are + 1000 (the default for the first normal user on most systems), it will + work out of the box without any modifications. - ``PAPERLESS_PASSPHRASE`` - This is the passphrase Paperless uses to encrypt/decrypt the original - document.
If you aren't planning on using GPG encryption, you can just - leave this undefined. - - ``PAPERLESS_OCR_THREADS`` - This is the number of threads the OCR process will spawn to process - document pages in parallel. If the variable is not set, Python determines - the core-count of your CPU and uses that value. - - ``PAPERLESS_OCR_LANGUAGES`` - If you want the OCR to recognize other languages in addition to the - default English, set this parameter to a space separated list of - three-letter language-codes after `ISO 639-2/T`_. For a list of available - languages -- including their three letter codes -- see the - `Alpine packagelist`_. - - ``USERMAP_UID`` and ``USERMAP_GID`` - If you want to mount the consumption volume (directory ``/consume`` within - the containers) to a host-directory -- which you probably want to do -- - access rights might be an issue. The default user and group ``paperless`` - in the containers have an id of 1000. The containers will enforce that the - owning group of the consumption directory will be ``paperless`` to be able - to delete consumed documents. If your host-system has a group with an ID - of 1000 and you don't want this group to have access rights to the - consumption directory, you can use ``USERMAP_GID`` to change the id in the - container and thus the one of the consumption directory. Furthermore, you - can change the id of the default user as well using ``USERMAP_UID``. - - ``PAPERLESS_USE_SSL`` - If you want Paperless to use SSL for the user interface, set this variable - to ``true``. You also need to copy your certificate and key to the ``data`` - directory, named ``ssl.cert`` and ``ssl.key``. - This is not an ideal solution and, if possible, a reverse proxy with nginx - is preferred. - -6. Run ``docker-compose up -d``. This will create and start the necessary +5. Run ``docker-compose up -d``. This will create and start the necessary containers. -7. To be able to login, you will need a super user. 
To create it, execute the - following command: - .. code-block:: shell-session +6. To be able to login, you will need a super user. To create it, execute the + following command: - $ docker-compose run --rm webserver createsuperuser + .. code-block:: shell-session - This will prompt you to set a username (default ``paperless``), an optional - e-mail address and finally a password. -8. The default ``docker-compose.yml`` exports the webserver on your local port - 8000. If you haven't adapted this, you should now be able to visit your - `Paperless webserver`_ at ``http://127.0.0.1:8000`` (or - ``https://127.0.0.1:8000`` if you enabled SSL). You can login with the - user and password you just created. -9. Add files to consumption directory the way you prefer to. Following are two - possible options: + $ docker-compose run --rm webserver createsuperuser - 1. Mount the consumption directory to a local host path by modifying your - ``docker-compose.yml``: - - .. code-block:: diff - - diff --git a/docker-compose.yml b/docker-compose.yml - --- a/docker-compose.yml - +++ b/docker-compose.yml - @@ -17,9 +18,8 @@ services: - volumes: - - paperless-data:/usr/src/paperless/data - - paperless-media:/usr/src/paperless/media - - - /consume - + - /local/path/you/choose:/consume - - .. danger:: - - While the consumption container will ensure at startup that it can - **delete** a consumed file from a host-mounted directory, it might - not be able to **read** the document in the first place if the access - rights to the file are incorrect. - - Make sure that the documents you put into the consumption directory - will either be readable by everyone (``chmod o+r file.pdf``) or - readable by the default user or group id 1000 (or the one you have - set with ``USERMAP_UID`` or ``USERMAP_GID`` respectively). - - 2. Use ``docker cp`` to copy your files directly into the container: - - .. 
code-block:: shell-session - - $ # Identify your containers - $ docker-compose ps - Name Command State Ports - ------------------------------------------------------------------------- - paperless_consumer_1 /sbin/docker-entrypoint.sh ... Exit 0 - paperless_webserver_1 /sbin/docker-entrypoint.sh ... Exit 0 - - $ docker cp /path/to/your/file.pdf paperless_consumer_1:/consume - - ``docker cp`` is a one-shot-command, just like ``cp``. This means that - every time you want to consume a new document, you will have to execute - ``docker cp`` again. You can of course automate this process, but option - 1 is generally the preferred one. - - .. danger:: - - ``docker cp`` will change the owning user and group of a copied file - to the acting user at the destination, which will be ``root``. - - You therefore need to ensure that the documents you want to copy into - the container are readable by everyone (``chmod o+r file.pdf``) - before copying them. + This will prompt you to set a username, an optional e-mail address and + finally a password. +7. The default ``docker-compose.yml`` exports the webserver on your local port + 8000. If you haven't adapted this, you should now be able to visit your + Paperless instance at ``http://127.0.0.1:8000``. You can login with the + user and password you just created. .. _Docker: https://www.docker.com/ .. _docker-compose: https://docs.docker.com/compose/install/ -.. _ISO 639-2/T: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes -.. _Alpine packagelist: https://pkgs.alpinelinux.org/packages?name=tesseract-ocr-data*&arch=x86_64 .. [#compose] You of course don't have to use docker-compose, but it simplifies deployment immensely. If you know your way around Docker, feel free to tinker around without using compose! -.. 
[#dockercomposeyml] If you're upgrading your docker-compose images from - version 1.1.0 or earlier, you might need to change in the - ``docker-compose.yml`` file the ``image: pitkley/paperless`` directive in - both the ``webserver`` and ``consumer`` sections to ``build: ./`` as per the - newer ``docker-compose.yml.example`` file +Bare Metal Route +================ -.. _setup-permanent: +.. warning:: -Making Things a Little more Permanent -------------------------------------- + TBD. Use docker for now. -Once you've tested things and are happy with the work flow, you should secure -the installation and automate the process of starting the webserver and -consumer. +Migration to paperless-ng +######################### +At its core, paperless-ng is still paperless and fully compatible. However, some +things have changed under the hood, so you need to adapt your setup depending on +how you installed paperless. The important things to keep in mind are as follows. -.. _setup-permanent-webserver: +* Read the :ref:`paperless_changelog` and take note of breaking changes. +* It is recommended to use postgresql as the database now. The docker-compose + deployment will automatically create a postgresql instance and instruct + paperless to use it. This means that if you use the docker-compose script + with your current paperless media and data volumes and used the default + sqlite database, **it will not use your sqlite database and it may seem + as if your documents are gone**. You may use the provided + ``docker-compose.yml.sqlite.example`` script, which does not use postgresql. +* The task scheduler of paperless, which is used to execute periodic tasks + such as email checking and maintenance, requires a `redis`_ message broker + instance. The docker-compose route takes care of that. +* The layout of the folder structure for your documents and data remains the + same. +* The frontend needs to be built from source. The docker image takes care of + that.
-Using a Real Webserver -++++++++++++++++++++++ +Migration to paperless-ng is then performed in a few simple steps: -The default is to use Django's development server, as that's easy and does the -job well enough on a home network. However it is heavily discouraged to use -it for more than that. +1. Do a backup for two purposes: If something goes wrong, you still have your + data. Second, if you don't like paperless-ng, you can switch back to + paperless. -If you want to do things right you should use a real webserver capable of -handling more than one thread. You will also have to let the webserver serve -the static files (CSS, JavaScript) from the directory configured in -``PAPERLESS_STATICDIR``. The default static files directory is ``../static``. +2. Replace the paperless source with paperless-ng. If you're using git, this + is done by: -For that you need to activate your virtual environment and collect the static -files with the command: + .. code:: bash -.. code:: bash + $ git remote set-url origin https://github.com/jonaswinkler/paperless-ng + $ git pull - $ cd <paperless directory>/src - $ ./manage.py collectstatic +3. If you are using docker, copy ``docker-compose.yml.example`` to + ``docker-compose.yml`` and ``docker-compose.env.example`` to + ``docker-compose.env``. Make adjustments to these files as necessary. + See `docker route`_ for details. +4. Update paperless. See :ref:`administration-updating` for details. -Apache -~~~~~~ +5. Start paperless-ng. -This is a configuration supplied by `steckerhalter`_ on GitHub. It uses Apache -and mod_wsgi, with a Paperless installation in ``/home/paperless/``: + .. code:: bash -.. code:: apache + $ docker-compose up + + This will also migrate your database as usual. Verify by inspecting the + output that the migration was successfully executed. CTRL-C will then + gracefully stop the container. After that, you can start paperless-ng as + usual with - <VirtualHost *:80> - ServerName example.com + ..
code:: bash - Alias /static/ /home/paperless/paperless/static/ - <Directory /home/paperless/paperless/static> - Require all granted - </Directory> + $ docker-compose up -d - WSGIScriptAlias / /home/paperless/paperless/src/paperless/wsgi.py - WSGIDaemonProcess example.com user=paperless group=paperless threads=5 python-path=/home/paperless/paperless/src:/home/paperless/.env/lib/python3.6/site-packages - WSGIProcessGroup example.com +6. Paperless installed a permanent redirect to ``admin/`` in your browser. This + redirect is still in place and prevents access to the new UI. Clear + everything related to paperless in your browser's data in order to fix + this issue. - <Directory /home/paperless/paperless/src/paperless> - <Files wsgi.py> - Require all granted - </Files> - </Directory> - </VirtualHost> +Moving data from sqlite to postgresql +===================================== -.. _steckerhalter: https://github.com/steckerhalter +.. warning:: + TBD. -Nginx + Gunicorn -~~~~~~~~~~~~~~~~ - -If you're using Nginx, the most common setup is to combine it with a -Python-based server like Gunicorn so that Nginx is acting as a proxy. Below is -a copy of a simple Nginx configuration fragment making use of a gunicorn -instance listening on localhost port 8000. - -.. code:: nginx - - server { - listen 80; - - index index.html index.htm index.php; - access_log /var/log/nginx/paperless_access.log; - error_log /var/log/nginx/paperless_error.log; - - location /static { - - autoindex on; - alias <path-to-paperless-static-directory>; - - } - - location / { - - proxy_set_header Host $http_host; - proxy_set_header X-Real-IP $remote_addr; - proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; - proxy_set_header X-Forwarded-Proto $scheme; - - proxy_pass http://127.0.0.1:8000; - } - } - - -The gunicorn server can be started with the command: - -..
code-block:: shell - - $ <path-to-paperless-virtual-environment>/bin/gunicorn --pythonpath=<path-to-paperless>/src paperless.wsgi -w 2 - - -.. _setup-permanent-standard-systemd: - -Standard (Bare Metal + Systemd) -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -If you're running on a bare metal system that's using Systemd, you can use the -service unit files in the ``scripts`` directory to set this up. - -1. You'll need to create a group and user called ``paperless`` (without login) -2. Setup Paperless to be in a place that this new user can read and write to. -3. Ensure ``/etc/paperless`` is readable by the ``paperless`` user. -4. Copy the service file from the ``scripts`` directory to - ``/etc/systemd/system``. - -.. code-block:: bash - - $ cp /path/to/paperless/scripts/paperless-consumer.service /etc/systemd/system/ - $ cp /path/to/paperless/scripts/paperless-webserver.service /etc/systemd/system/ - -5. Edit the service file to point the ``ExecStart`` line to the proper location - of your paperless install, referencing the appropriate Python binary. For - example: - ``ExecStart=/path/to/python3 /path/to/paperless/src/manage.py document_consumer``. -6. Start and enable (so they start on boot) the services. - -.. code-block:: bash - - $ systemctl enable paperless-consumer - $ systemctl enable paperless-webserver - $ systemctl start paperless-consumer - $ systemctl start paperless-webserver - - -.. _setup-permanent-standard-upstart: - -Standard (Bare Metal + Upstart) -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Ubuntu 14.04 and earlier use the `Upstart`_ init system to start services -during the boot process. To configure Upstart to run Paperless automatically -after restarting your system: - -1. Change to the directory where Upstart's configuration files are kept: - ``cd /etc/init`` -2. Create a new file: ``sudo nano paperless-server.conf`` -3. 
In the newly-created file enter:: - - start on (local-filesystems and net-device-up IFACE=eth0) - stop on shutdown - - respawn - respawn limit 10 5 - - script - exec <path to paperless virtual environment>/bin/gunicorn --pythonpath=<path to parperless>/src paperless.wsgi -w 2 - end script - - Note that you'll need to replace ``/srv/paperless/src/manage.py`` with the - path to the ``manage.py`` script in your installation directory. - - If you are using a network interface other than ``eth0``, you will have to - change ``IFACE=eth0``. For example, if you are connected via WiFi, you will - likely need to replace ``eth0`` above with ``wlan0``. To see all interfaces, - run ``ifconfig -a``. - - Save the file. - -4. Create a new file: ``sudo nano paperless-consumer.conf`` - -5. In the newly-created file enter:: - - start on (local-filesystems and net-device-up IFACE=eth0) - stop on shutdown - - respawn - respawn limit 10 5 - - script - exec <path to paperless virtual environment>/bin/python <path to parperless>/manage.py document_consumer - end script - - Replace the path placeholder and ``eth0`` with the appropriate value and save the file. - -These two configuration files together will start both the Paperless webserver -and document consumer processes when the file system and network interface -specified is available after boot. Furthermore, if either process ever exits -unexpectedly, Upstart will try to restart it a maximum of 10 times within a 5 -second period. - -.. _Upstart: http://upstart.ubuntu.com/ - - -.. _setup-permanent-docker: - -Docker -~~~~~~ - -If you're using Docker, you can set a restart-policy_ in the -``docker-compose.yml`` to have the containers automatically start with the -Docker daemon. - -.. _restart-policy: https://docs.docker.com/engine/reference/commandline/run/#restart-policies-restart - + .. 
_redis: https://redis.io/ diff --git a/docs/usage_overview.rst b/docs/usage_overview.rst new file mode 100644 index 000000000..719776d92 --- /dev/null +++ b/docs/usage_overview.rst @@ -0,0 +1,216 @@ +************** +Usage Overview +************** + +Paperless is an application that manages your personal documents. With +the help of a document scanner (see :ref:`scanners`), paperless transforms +your unwieldy physical document binders into a searchable archive and +provides many utilities for finding and managing your documents. + + +Terms and definitions +##################### + +Paperless essentially consists of two different parts for managing your +documents: + +* The *consumer* watches a specified folder and adds all documents in that + folder to paperless. +* The *web server* provides a UI that you use to manage and search for your + scanned documents. + +Each document has a couple of fields that you can assign to them: + +* A *Document* is a piece of paper that sometimes contains valuable + information. +* The *correspondent* of a document is the person, institution or company that + a document either originates from, or is sent to. +* A *tag* is a label that you can assign to documents. Think of tags as more + powerful folders: Multiple documents can be grouped together with a single + tag, however, a single document can also have multiple tags. This is not + possible with folders. The reason folders are not implemented in paperless + is simply that tags are much more versatile than folders. +* A *document type* is used to demarcate the type of a document such as letter, + bank statement, invoice, contract, etc. It is used to identify what a document + is about. +* The *date added* of a document is the date the document was scanned into + paperless. You cannot and should not change this date. +* The *date created* of a document is the date the document was initially issued.
+ This can be the date you bought a product, the date you signed a contract, or + the date a letter was sent to you. +* The *archive serial number* (short: ASN) of a document is the identifier of + the document in your physical document binders. See + :ref:`usage-recommended_workflow` below. +* The *content* of a document is the text that was OCR'ed from the document. + This text is fed into the search engine and is used for matching tags, + correspondents and document types. + +.. TODO: hyperref + +Frontend overview +################# + +.. warning:: + + TBD. Add some fancy screenshots! + +Adding documents to paperless +############################# + +Once you've got Paperless set up, you need to start feeding documents into it. +Currently, there are three options: the consumption directory, IMAP (email), and +the REST API. + + +The consumption directory +========================= + +The primary method of getting documents into your database is by putting them in +the consumption directory. The consumer runs in an infinite +loop looking for new additions to this directory and when it finds them, it goes +about the process of parsing them with the OCR, indexing what it finds, and storing +it in the media directory. + +Getting stuff into this directory is up to you. If you're running Paperless +on your local computer, you might just want to drag and drop files there, but if +you're running this on a server and want your scanner to automatically push +files to this directory, you'll need to set up some sort of service to accept the +files from the scanner. Typically, you're looking at an FTP server like +`Proftpd`_ or a Windows folder share with `Samba`_. + +.. _Proftpd: http://www.proftpd.org/ +.. _Samba: http://www.samba.org/ + +.. TODO: hyperref to configuration of the location of this magic folder. + + +IMAP (Email) +============ + +Another handy way to get documents into your database is to email them to +yourself.
The typical use-case would be to be out for lunch and want to send a +copy of the receipt back to your system at home. Paperless can be taught to +pull emails down from an arbitrary account and dump them into the consumption +directory where the consumer will follow the +usual pattern of consuming the document. + +Some things you need to know about this feature: + +* It's disabled by default. Setting the values below enables it. +* It's been tested in a limited environment, so it may not work for you (please + submit a pull request if you can!) +* It's designed to **delete mail from the server once consumed**. So don't go + pointing this to your personal email account and wonder where all your stuff + went. +* Currently, only one photo (attachment) per email will work. + +So, with all that in mind, here's what you do to get it running: + +1. Set up a new email account somewhere, or if you're feeling daring, create a + folder in an existing email box and note the path to that folder. +2. In ``/etc/paperless.conf`` set all of the appropriate values in + ``PATHS AND FOLDERS`` and ``SECURITY``. + If you decided to use a subfolder of an existing account, then make sure you + set ``PAPERLESS_CONSUME_MAIL_INBOX`` accordingly here. You also have to set + the ``PAPERLESS_EMAIL_SECRET`` to something you can remember 'cause you'll + have to include that in every email you send. +3. Restart paperless. Paperless will check + the configured email account at startup and from then on every 10 minutes + for something new and pull down whatever it finds. +4. Send yourself an email! Note that the subject is treated as the file name, + so if you set the subject to ``Correspondent - Title - tag,tag,tag``, you'll + get what you expect. Also, you must include the aforementioned secret + string in every email so the fetcher knows that it's safe to import. + Note that Paperless only allows the email title to consist of safe characters + to be imported.
These consist of alpha-numeric characters and ``-_ ,.'``. + + +REST API +======== + +You can also submit a document using the REST API; see the API section for details. + + +.. _usage-recommended_workflow: + +The recommended workflow +######################## + +Once you have familiarized yourself with paperless and are ready to use it +for all your documents, the recommended workflow for managing your documents +is as follows. This workflow also takes into account that some documents +have to be kept in physical form, but still ensures that you get all the +advantages for these documents as well. + +Preparations in paperless +========================= + +* Create an inbox tag that gets assigned to all new documents. +* Create a TODO tag. + +Processing of the physical documents +==================================== + +Keep a physical inbox. Whenever you receive a document that you need to +archive, put it into your inbox. Regularly, do the following for all documents +in your inbox: + +1. For each document, decide if you need to keep the document in physical + form. This applies to certain important documents, such as contracts and + certificates. +2. If you need to keep the document, write a running number on the document + before scanning, starting at one and counting upwards. This is the archive + serial number, or ASN for short. +3. Scan the document. +4. If the document has an ASN assigned, store it in a *single* binder, sorted + by ASN. Don't order this binder in any other way. +5. If the document has no ASN, throw it away. Yay! + +Over time, you will notice that your physical binder will fill up. If it is +full, label the binder with the range of ASNs in this binder (e.g., "Documents +1 to 343"), store the binder in your cellar or elsewhere, and start a new +binder. + +The idea behind this process is that you will never have to use the physical +binders to find a document. If you need a specific physical document, you +may find this document by: + +1.
Searching in paperless for the document. +2. Identifying the ASN of the document, since it appears on the scan. +3. Grabbing the relevant document binder and getting the document. This is easy since + they are sorted by ASN. + +Processing of documents in paperless +==================================== + +Once you have scanned in a document, proceed in paperless as follows. + +1. If the document has an ASN, assign the ASN to the document. +2. Assign a correspondent to the document (e.g., your employer, bank, etc.). + This isn't strictly necessary but helps in finding a document when you need + it. +3. Assign a document type (e.g., invoice, bank statement, etc.) to the document. + This isn't strictly necessary but helps in finding a document when you need + it. +4. Assign a proper title to the document (the name of an item you bought, the + subject of the letter, etc.). +5. Check that the date of the document is correct. Paperless tries to read + the date from the content of the document, but this fails sometimes if the + OCR is bad or multiple dates appear on the document. +6. Remove inbox tags from the documents. + + +Task management +=============== + +Some documents require attention and require you to act on the document. You +may take two different approaches to handle these documents based on how +regularly you intend to use paperless and scan documents. + +* If you scan and process your documents in paperless regularly, assign a + TODO tag to all scanned documents that you need to process. Create a saved + view on the dashboard that shows all documents with this tag. +* If you do not scan documents regularly and use paperless solely for archiving, + create a physical todo box next to your physical inbox and put documents you + need to process in the TODO box. Once you have performed the task associated with + the document, move it to the inbox.
diff --git a/docs/utilities.rst b/docs/utilities.rst deleted file mode 100644 index 3c7e8d542..000000000 --- a/docs/utilities.rst +++ /dev/null @@ -1,284 +0,0 @@ -.. _utilities: - -Utilities -========= - -There's basically three utilities to Paperless: the webserver, consumer, and -if needed, the exporter. They're all detailed here. - - -.. _utilities-webserver: - -The Webserver -------------- - -At the heart of it, Paperless is a simple Django webservice, and the entire -interface is based on Django's standard admin interface. Once running, visiting -the URL for your service delivers the admin, through which you can get a -detailed listing of all available documents, search for specific files, and -download whatever it is you're looking for. - - -.. _utilities-webserver-howto: - -How to Use It -............. - -The webserver is started via the ``manage.py`` script: - -.. code-block:: shell-session - - $ /path/to/paperless/src/manage.py runserver - -By default, the server runs on localhost, port 8000, but you can change this -with a few arguments, run ``manage.py --help`` for more information. - -Add the option ``--noreload`` to reduce resource usage. Otherwise, the server -continuously polls all source files for changes to auto-reload them. - -Note that when exiting this command your webserver will disappear. -If you want to run this full-time (which is kind of the point) -you'll need to have it start in the background -- something you'll need to -figure out for your own system. To get you started though, there are Systemd -service files in the ``scripts`` directory. - - -.. _utilities-consumer: - -The Consumer ------------- - -The consumer script runs in an infinite loop, constantly looking at a directory -for documents to parse and index. The process is pretty straightforward: - -1. Look in ``CONSUMPTION_DIR`` for a document. If one is found, go to #2. - If not, wait 10 seconds and try again. 
On Linux, new documents are detected - instantly via inotify, so there's no waiting involved. -2. Parse the document with Tesseract -3. Create a new record in the database with the OCR'd text -4. Attempt to automatically assign document attributes by doing some guesswork. - Read up on the :ref:`guesswork documentation<guesswork>` for more - information about this process. -5. Encrypt the document (if you have a passphrase set) and store it in the - ``media`` directory under ``documents/originals``. -6. Go to #1. - - -.. _utilities-consumer-howto: - -How to Use It -............. - -The consumer is started via the ``manage.py`` script: - -.. code-block:: shell-session - - $ /path/to/paperless/src/manage.py document_consumer - -This starts the service that will consume documents as they appear in -``CONSUMPTION_DIR``. - -Note that this command runs continuously, so exiting it will mean your webserver -disappears. If you want to run this full-time (which is kind of the point) -you'll need to have it start in the background -- something you'll need to -figure out for your own system. To get you started though, there are Systemd -service files in the ``scripts`` directory. - -Some command line arguments are available to customize the behavior of the -consumer. By default it will use ``/etc/paperless.conf`` values. Display the -help with: - -.. code-block:: shell-session - - $ /path/to/paperless/src/manage.py document_consumer --help - -.. _utilities-exporter: - -The Exporter ------------- - -Tired of fiddling with Paperless, or just want to do something stupid and are -afraid of accidentally damaging your files? You can export all of your -documents into neatly named, dated, and unencrypted files. - - -.. _utilities-exporter-howto: - -How to Use It -............. - -This too is done via the ``manage.py`` script: - -.. 
code-block:: shell-session - - $ /path/to/paperless/src/manage.py document_exporter /path/to/somewhere/ - -This will dump all of your unencrypted documents into ``/path/to/somewhere`` -for you to do with as you please. The files are accompanied with a special -file, ``manifest.json`` which can be used to :ref:`import the files -<utilities-importer>` at a later date if you wish. - - -.. _utilities-exporter-howto-docker: - -Docker -______ - -If you are :ref:`using Docker <setup-installation-docker>`, running the -expoorter is almost as easy. To mount a volume for exports, follow the -instructions in the ``docker-compose.yml.example`` file for the ``/export`` -volume (making the changes in your own ``docker-compose.yml`` file, of course). -Once you have the volume mounted, the command to run an export is: - -.. code-block:: shell-session - - $ docker-compose run --rm consumer document_exporter /export - -If you prefer to use ``docker run`` directly, supplying the necessary commandline -options: - -.. code-block:: shell-session - - $ # Identify your containers - $ docker-compose ps - Name Command State Ports - ------------------------------------------------------------------------- - paperless_consumer_1 /sbin/docker-entrypoint.sh ... Exit 0 - paperless_webserver_1 /sbin/docker-entrypoint.sh ... Exit 0 - - $ # Make sure to replace your passphrase and remove or adapt the id mapping - $ docker run --rm \ - --volumes-from paperless_data_1 \ - --volume /path/to/arbitrary/place:/export \ - -e PAPERLESS_PASSPHRASE=YOUR_PASSPHRASE \ - -e USERMAP_UID=1000 -e USERMAP_GID=1000 \ - paperless document_exporter /export - - -.. _utilities-importer: - -The Importer ------------- - -Looking to transfer Paperless data from one instance to another, or just want -to restore from a backup? This is your go-to toy. - - -.. _utilities-importer-howto: - -How to Use It -............. - -The importer works just like the exporter. 
You point it at a directory, and -the script does the rest of the work: - -.. code-block:: shell-session - - $ /path/to/paperless/src/manage.py document_importer /path/to/somewhere/ - -Docker -______ - -Assuming that you've already gone through the steps above in the -:ref:`export <utilities-exporter-howto-docker>` section, then the easiest thing -to do is just re-use the ``/export`` path you already setup: - -.. code-block:: shell-session - - $ docker-compose run --rm consumer document_importer /export - -Similarly, if you're not using docker-compose, you can adjust the export -instructions above to do the import. - - -.. _utilities-retagger: - -Re-running your tagging and correspondent matchers --------------------------------------------------- - -Say you've imported a few hundred documents and now want to introduce -a tag or set up a new correspondent, and apply its matching to all of -the currently-imported docs. This problem is common enough that -there are tools for it. - - -.. _utilities-retagger-howto: - -How to Do It -............ - -This too is done via the ``manage.py`` script: - -.. code:: bash - - $ /path/to/paperless/src/manage.py document_retagger - -Run this after changing or adding tagging rules. It'll loop over all -of the documents in your database and attempt to match all of your -tags to them. If one matches, it'll be applied. And don't worry, you -can run this as often as you like, it won't double-tag a document. - -.. code:: bash - - $ /path/to/paperless/src/manage.py document_correspondents - -This is the similar command to run after adding or changing a correspondent. - -.. _utilities-encyption: - -Enabling Encrpytion -------------------- - -Let's say you've imported a few documents to play around with paperless and now -you are using it more seriously and want to enable encryption of your files. - -.. utilities-encryption-howto: - -Basic Syntax -............. 
- -Again we'll use the ``manage.py`` script, passing ``change_storage_type``: - -.. code:: console - - $ /path/to/paperless/src/manage.py change_storage_type --help - usage: manage.py change_storage_type [-h] [--version] [-v {0,1,2,3}] - [--settings SETTINGS] - [--pythonpath PYTHONPATH] [--traceback] - [--no-color] [--passphrase PASSPHRASE] - {gpg,unencrypted} {gpg,unencrypted} - - This is how you migrate your stored documents from an encrypted state to an - unencrypted one (or vice-versa) - - positional arguments: - {gpg,unencrypted} The state you want to change your documents from - {gpg,unencrypted} The state you want to change your documents to - - optional arguments: - --passphrase PASSPHRASE - If PAPERLESS_PASSPHRASE isn't set already, you need to - specify it here - -Enabling Encryption -................... - -Basic usage to enable encryption of your document store (**USE A MORE SECURE PASSPHRASE**): - -(Note: If ``PAPERLESS_PASSPHRASE`` isn't set already, you need to specify it here) - -.. code:: bash - - $ /path/to/paperless/src/manage.py change_storage_type [--passphrase SECR3TP4SSPHRA$E] unencrypted gpg - - -Disabling Encryption -.................... - -Basic usage to enable encryption of your document store: - -(Note: Again, if ``PAPERLESS_PASSPHRASE`` isn't set already, you need to specify it here) - -.. code:: bash - - $ /path/to/paperless/src/manage.py change_storage_type [--passphrase SECR3TP4SSPHRA$E] gpg unencrypted