mirror of
https://github.com/paperless-ngx/paperless-ngx.git
synced 2025-07-28 18:24:38 -05:00
Move docs to material-mkdocs
483
docs/usage.md
Normal file
@@ -0,0 +1,483 @@
# Usage Overview

Paperless is an application that manages your personal documents. With
the help of a document scanner (see [the scanners wiki](https://github.com/paperless-ngx/paperless-ngx/wiki/Scanner-&-Software-Recommendations)), paperless transforms your unwieldy physical document binders
into a searchable archive and provides many utilities for finding and
managing your documents.

## Terms and definitions

Paperless essentially consists of two different parts for managing your
documents:

- The _consumer_ watches a specified folder and adds all documents in
  that folder to paperless.
- The _web server_ provides a UI that you use to manage and search for
  your scanned documents.

The following terms describe a document and the fields that you can
assign to it:

- A _document_ is a piece of paper that sometimes contains valuable
  information.
- The _correspondent_ of a document is the person, institution or
  company that a document either originates from, or is sent to.
- A _tag_ is a label that you can assign to documents. Think of tags
  as more powerful folders: multiple documents can be grouped together
  with a single tag, and a single document can also have multiple
  tags. This is not possible with folders. The reason folders are not
  implemented in paperless is simply that tags are much more versatile
  than folders.
- A _document type_ is used to demarcate the type of a document such
  as letter, bank statement, invoice, contract, etc. It is used to
  identify what a document is about.
- The _date added_ of a document is the date the document was scanned
  into paperless. You cannot and should not change this date.
- The _date created_ of a document is the date the document was
  initially issued. This can be the date you bought a product, the
  date you signed a contract, or the date a letter was sent to you.
- The _archive serial number_ (short: ASN) of a document is the
  identifier of the document in your physical document binders. See
  [the recommended workflow](#usage-recommended_workflow) below.
- The _content_ of a document is the text that was OCR'ed from the
  document. This text is fed into the search engine and is used for
  matching tags, correspondents and document types.

## Adding documents to paperless

Once you've got Paperless set up, you need to start feeding documents
into it. When you add documents to paperless, it performs the
following operations on them:

1.  OCR the document, if it has no text. Digital documents usually have
    text, and this step will be skipped for those documents.
2.  Paperless will create an archivable PDF/A document from your
    document. If this document is coming from your scanner, it will have
    embedded selectable text.
3.  Paperless performs automatic matching of tags, correspondents and
    types on the document before storing it in the database.

!!! tip

    This process can be configured to fit your needs. If you don't want
    paperless to create archived versions for digital documents, you can
    turn this off by setting `PAPERLESS_OCR_MODE=skip_noarchive`.
    Please read the
    [relevant section in the documentation](/configuration#ocr).

!!! note

    No matter which options you choose, Paperless will always store the
    original document that it found in the consumption directory or in the
    mail and will never overwrite that document. Archived versions are
    stored alongside the original versions.

### The consumption directory

The primary method of getting documents into your database is by putting
them in the consumption directory. The consumer runs in an infinite
loop, looking for new additions to this directory. When it finds them,
the consumer goes about the process of parsing them with the OCR,
indexing what it finds, and storing it in the media directory.

Getting stuff into this directory is up to you. If you're running
Paperless on your local computer, you might just want to drag and drop
files there, but if you're running this on a server and want your
scanner to automatically push files to this directory, you'll need to
set up some sort of service to accept the files from the scanner.
Typically, you're looking at an FTP server like
[Proftpd](http://www.proftpd.org/) or a Windows folder share with
[Samba](http://www.samba.org/).

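If a script rather than a scanner delivers the files, the hand-off into the consumption directory can be sketched roughly as follows. This is only an illustration: the function name `drop_into_consumption_dir` and both paths are hypothetical, and the naming scheme for avoiding collisions is an assumption of this sketch, not something paperless prescribes.

```python
import shutil
from pathlib import Path


def drop_into_consumption_dir(source: Path, consume_dir: Path) -> Path:
    """Copy `source` into `consume_dir` without overwriting a waiting file."""
    consume_dir.mkdir(parents=True, exist_ok=True)
    target = consume_dir / source.name
    counter = 1
    # Pick a free name such as "scan_1.pdf" if "scan.pdf" is still waiting.
    while target.exists():
        target = consume_dir / f"{source.stem}_{counter}{source.suffix}"
        counter += 1
    shutil.copy2(source, target)
    return target
```

A call such as `drop_into_consumption_dir(Path("scan.pdf"), Path("/path/to/consume"))` would then leave the file where the consumer can pick it up; both paths are placeholders for your own setup.
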
### Web UI Upload

The dashboard has a file drop field to upload documents to paperless.
Simply drag a file onto this field or select a file with the file
dialog. Multiple files are supported.

You can also upload documents on any other page of the web UI by
dragging-and-dropping files into your browser window.

### Mobile upload {#usage-mobile_upload}

The mobile app over at <https://github.com/qcasey/paperless_share>
allows Android users to share any documents with paperless. This can be
combined with any of the mobile scanning apps out there, such as Office
Lens.

Furthermore, there is the [Paperless
App](https://github.com/bauerj/paperless_app) as well, which not only
has document upload, but also document browsing and download features.

### IMAP (Email) {#usage-email}

You can tell paperless-ngx to consume documents from your email
accounts. This is a very flexible and powerful feature if you regularly
receive documents via mail that you need to archive. The mail consumer
can be configured via the frontend settings (/settings/mail) in the following
manner:

1.  Define e-mail accounts.
2.  Define mail rules for your account.

These rules perform the following:

1.  Connect to the mail server.
2.  Fetch all matching mails (as defined by folder, maximum age and the
    filters).
3.  Check if there are any consumable attachments.
4.  If so, instruct paperless to consume the attachments and optionally
    use the metadata provided in the rule for the new document.
5.  If documents were consumed from a mail, the rule action is performed
    on that mail.

Paperless will completely ignore mails that do not match your filters.
It will also only perform the action on mails that it has consumed
documents from.

The actions all ensure that the same mail is not consumed twice by
different means. These are as follows:

- **Delete:** Immediately deletes mail that paperless has consumed
  documents from. Use with caution.
- **Mark as read:** Mark consumed mail as read. Paperless will not
  consume documents from already read mails. If you read a mail before
  paperless sees it, it will be ignored.
- **Flag:** Sets the 'important' flag on mails with consumed
  documents. Paperless will not consume flagged mails.
- **Move to folder:** Moves consumed mails out of the way so that
  paperless won't consume them again.
- **Add custom Tag:** Adds a custom tag to mails with consumed
  documents (the IMAP standard calls these "keywords"). Paperless
  will not consume mails already tagged. Not all mail servers support
  this feature!

!!! warning

    The mail consumer will perform these actions on all mails it has
    consumed documents from. Keep in mind that the actual consumption
    process may fail for some reason, leaving you with missing documents in
    paperless.

!!! note

    With the correct set of rules, you can completely automate your email
    documents. Create rules for every correspondent you receive digital
    documents from and paperless will read them automatically. The default
    action "mark as read" is pretty tame and will not cause any damage or
    data loss whatsoever.

    You can also set up a special folder in your mail account for paperless
    and use your favorite mail client to move to-be-consumed mails into that
    folder automatically or manually and tell paperless to move them to yet
    another folder after consumption. It's up to you.

!!! note

    When defining a mail rule with a folder, you may need to try different
    characters to define how the sub-folders are separated. Common values
    include ".", "/" or "\|", but this varies by the mail server.
    Check the documentation for your mail server. In the event of an error
    fetching mail from a certain folder, check the Paperless logs. When a
    folder cannot be found, Paperless will attempt to write a list of all
    folders found in the account to the Paperless logs.

!!! note

    Paperless will process the rules in the order defined in the admin page.

    You can define catch-all rules and have them executed last to consume
    any documents not matched by previous rules. Such a rule may assign an
    "Unknown mail document" tag to consumed documents so you can inspect
    them further.

Paperless is set up to check your mails every 10 minutes. This can be
configured on the 'Scheduled tasks' page in the admin.

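The rule processing described in this section can be illustrated with a small in-memory model. This is a simplified sketch, not the code paperless runs: the `Mail` and `Rule` classes, the subject filter and the list of consumable suffixes are all hypothetical, and only the "mark as read" action is modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Mail:
    subject: str
    attachments: list[str] = field(default_factory=list)
    read: bool = False


@dataclass
class Rule:
    subject_filter: str  # only mails whose subject contains this text match
    consumable_suffixes: tuple[str, ...] = (".pdf",)


def apply_rule(rule: Rule, inbox: list[Mail]) -> list[str]:
    """Return the attachments handed over for consumption."""
    consumed: list[str] = []
    for mail in inbox:
        # Mails that are already read or do not match the filter are ignored.
        if mail.read or rule.subject_filter not in mail.subject:
            continue
        found = [
            name for name in mail.attachments
            if name.lower().endswith(rule.consumable_suffixes)
        ]
        if not found:
            continue  # nothing consumable, so the action is not performed
        consumed.extend(found)
        mail.read = True  # the rule action, applied only to consumed mails
    return consumed
```

Running the same rule twice over the same inbox returns nothing the second time, which is the point of the actions: a mail that has been consumed is marked so that it is not consumed again.
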
### REST API

You can also submit a document using the REST API; see the file uploads
section of the API documentation for details.

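As a rough sketch of what such an upload could look like with only the Python standard library: the endpoint path `/api/documents/post_document/`, the form field name `document` and the `Token` authorization scheme are assumptions here and should be checked against the API documentation for your version. The helper name `build_upload_request` is hypothetical.

```python
import urllib.request
import uuid


def build_upload_request(
    base_url: str, token: str, filename: str, content: bytes
) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data upload request."""
    boundary = uuid.uuid4().hex
    body = b"\r\n".join([
        f"--{boundary}".encode(),
        # Assumed form field name: "document".
        f'Content-Disposition: form-data; name="document"; filename="{filename}"'.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        content,
        f"--{boundary}--".encode(),
        b"",
    ])
    return urllib.request.Request(
        # Assumed endpoint path; verify it in the API documentation.
        url=f"{base_url.rstrip('/')}/api/documents/post_document/",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned request to `urllib.request.urlopen` would send it. Building the request separately keeps the sketch easy to inspect without a running server.
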
## Best practices {#basic-searching}

Paperless offers a couple of tools that help you organize your document
collection. However, it is up to you to use them in a way that helps you
organize documents and find specific documents when you need them. This
section offers a couple of ideas for managing your collection.

Document types allow you to classify documents according to what they
are. You can define types such as "Receipt", "Invoice", or
"Contract". If you used to collect all your receipts in a single
binder, you can recreate that system in paperless by defining a document
type, assigning documents to that type and then filtering by that type
to see only your receipts.

Not all documents need document types. Sometimes it's hard to determine
what the type of a document is or it is hard to justify creating a
document type that you only need once or twice. This is okay. As long as
the types you define help you organize your collection in the way you
want, paperless is doing its job.

Tags can be used in many different ways. Think of tags as more
versatile folders or binders. If you have a binder for documents related
to university, your car or health care, you can create these binders in
paperless by creating tags and assigning them to relevant documents.
Just as with document types, you can filter the document list by tags and
only see documents of a certain topic.

With physical documents, you'll often need to decide which folder the
document belongs to. The advantage of tags over folders and binders is
that a single document can have multiple tags. A physical document
cannot magically appear in two different folders, but with tags, this is
entirely possible.

!!! tip

    This can be used in many different ways. One example: Imagine you're
    working on a particular task, such as signing up for university. Usually
    you'll need to collect a bunch of different documents that are already
    sorted into various folders. With the tag system of paperless, you can
    create a new group of documents that are relevant to this task without
    destroying the already existing organization. When you're done with the
    task, you could delete the tag again, which would be equal to sorting
    documents back into the folder they belong in. Or keep the tag, up to
    you.

All of the logic above applies to correspondents as well. Attach them to
documents if you feel that they help you organize your collection.

When you've started organizing your documents, create a couple of saved
views for document collections you regularly access. This is equal to
having labeled physical binders on your desk, except that these saved
views are dynamic and simply update themselves as you add documents to
the system.

Here are a couple of examples of tags and types that you could use in your
collection.

- An `inbox` tag for newly added documents that you haven't manually
  edited yet.
- A tag `car` for everything car related (repairs, registration,
  insurance, etc.).
- A tag `todo` for documents that you still need to do something with,
  such as reply, or perform some task online.
- A tag `bank account x` for all bank statements related to that
  account.
- A tag `mail` for anything that you added to paperless via its mail
  processing capabilities.
- A tag `missing_metadata` when you still need to add some metadata to
  a document, but can't or don't want to do this right now.

## Searching {#basic-usage_searching}

Paperless offers an extensive searching mechanism that is designed to
allow you to quickly find a document you're looking for (for example,
that thing that just broke and you bought a couple of months ago, or that
contract you signed 8 years ago).

When you search paperless for a document, it tries to match this query
against your documents. Paperless will look for matching documents by
inspecting their content, title, correspondent, type and tags. Paperless
returns a scored list of results, so that documents matching your query
better will appear further up in the search results.

By default, paperless returns only documents which contain all words
typed in the search bar. However, paperless also offers advanced search
syntax if you want to drill down the results further.

Matching documents with logical expressions:

```
shopname AND (product1 OR product2)
```

Matching specific tags, correspondents or types:

```
type:invoice tag:unpaid
correspondent:university certificate
```

Matching dates:

```
created:[2005 to 2009]
added:yesterday
modified:today
```

Matching inexact words:

```
produ*name
```

!!! note

    Inexact terms are hard for search indexes. These queries might take a
    while to execute. That's why paperless offers auto complete and query
    correction.

All of these constructs can be combined as you see fit. Paperless uses
Whoosh's default query language; if you want to learn more about it,
head over to [Whoosh query
language](https://whoosh.readthedocs.io/en/latest/querylang.html). For
details on what date parsing utilities are available, see [Date
parsing](https://whoosh.readthedocs.io/en/latest/dates.html#parsing-date-queries).

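Because these constructs are plain text, queries are easy to assemble in a script. The following sketch combines free-text terms and `field:value` filters into one query string and URL-encodes it. The helper names are hypothetical, and the `/api/documents/` path with a `query` parameter is an assumption that should be checked against the API documentation.

```python
from urllib.parse import urlencode


def build_query(*terms: str, **fields: str) -> str:
    """Join free-text terms and field:value filters into one query string."""
    parts = list(terms)
    parts += [f"{name}:{value}" for name, value in fields.items()]
    return " ".join(parts)


def search_url(base_url: str, query: str) -> str:
    # Assumed endpoint and parameter name; verify them in the API docs.
    return f"{base_url.rstrip('/')}/api/documents/?{urlencode({'query': query})}"
```

For example, `build_query("certificate", correspondent="university")` yields the same text as the second example query above, with the terms in a different order.
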
## The recommended workflow {#usage-recommended_workflow}

Once you have familiarized yourself with paperless and are ready to use
it for all your documents, the recommended workflow for managing your
documents is as follows. This workflow also takes into account that some
documents have to be kept in physical form, but still ensures that you
get all the advantages for these documents as well.

The following diagram shows how easy it is to manage your documents.

[Diagram: the recommended workflow]

### Preparations in paperless

- Create an inbox tag that gets assigned to all new documents.
- Create a TODO tag.

### Processing of the physical documents

Keep a physical inbox. Whenever you receive a document that you need to
archive, put it into your inbox. Regularly, do the following for all
documents in your inbox:

1.  For each document, decide if you need to keep the document in
    physical form. This applies to certain important documents, such as
    contracts and certificates.
2.  If you need to keep the document, write a running number on the
    document before scanning, starting at one and counting upwards. This
    is the archive serial number, or ASN for short.
3.  Scan the document.
4.  If the document has an ASN assigned, store it in a _single_ binder,
    sorted by ASN. Don't order this binder in any other way.
5.  If the document has no ASN, throw it away. Yay!

Over time, you will notice that your physical binder will fill up. When it
is full, label the binder with the range of ASNs in this binder (e.g.,
"Documents 1 to 343"), store the binder in your cellar or elsewhere,
and start a new binder.

The idea behind this process is that you will never have to use the
physical binders to find a document. If you need a specific physical
document, you may find this document by:

1.  Searching in paperless for the document.
2.  Identifying the ASN of the document, since it appears on the scan.
3.  Grabbing the relevant document binder and getting the document. This is
    easy since they are sorted by ASN.

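Since every binder is labeled with its ASN range, finding the right binder for a given ASN is a simple sorted lookup. A minimal sketch, assuming you keep a list of the highest ASN in each binder; the function name and the example numbers other than 343 are hypothetical:

```python
from bisect import bisect_left


def find_binder(asn: int, last_asn_per_binder: list[int]) -> int:
    """Return the 1-based number of the binder that holds `asn`.

    `last_asn_per_binder` lists the highest ASN in each binder, ascending.
    """
    if asn < 1:
        raise ValueError("ASNs start at one")
    index = bisect_left(last_asn_per_binder, asn)
    if index == len(last_asn_per_binder):
        raise LookupError(f"ASN {asn} is beyond the last labeled binder")
    return index + 1
```

With binders ending at ASNs 343, 702 and 1015, the document with ASN 344 is the first one in the second binder.
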
### Processing of documents in paperless

Once you have scanned in a document, proceed in paperless as follows.

1.  If the document has an ASN, assign the ASN to the document.
2.  Assign a correspondent to the document (e.g., your employer, bank,
    etc.). This isn't strictly necessary but helps in finding a document
    when you need it.
3.  Assign a document type (e.g., invoice, bank statement, etc.) to the
    document. This isn't strictly necessary but helps in finding a
    document when you need it.
4.  Assign a proper title to the document (the name of an item you
    bought, the subject of the letter, etc.).
5.  Check that the date of the document is correct. Paperless tries to
    read the date from the content of the document, but this fails
    sometimes if the OCR is bad or multiple dates appear on the
    document.
6.  Remove inbox tags from the documents.

!!! tip

    You can set up manual matching rules for your correspondents and tags and
    paperless will assign them automatically. After consuming a couple of
    documents, you can even ask paperless to *learn* when to assign tags and
    correspondents by itself. For details on this feature, see the
    documentation on automatic matching.

### Task management

Some documents require attention and require you to act on the document.
You may take two different approaches to handle these documents based on
how regularly you intend to scan documents and use paperless.

- If you scan and process your documents in paperless regularly,
  assign a TODO tag to all scanned documents that you need to process.
  Create a saved view on the dashboard that shows all documents with
  this tag.
- If you do not scan documents regularly and use paperless solely for
  archiving, create a physical TODO box next to your physical inbox
  and put documents you need to process in the TODO box. Once you have
  performed the task associated with the document, move it to the
  inbox.

## Architecture

Paperless-ngx consists of the following components:

- **The webserver:** This serves the administration pages, the API,
  and the new frontend. This is the main tool you'll be using to interact
  with paperless. You may start the webserver directly with

    ```shell-session
    $ cd /path/to/paperless/src/
    $ gunicorn -c ../gunicorn.conf.py paperless.wsgi
    ```

    or by any other means such as Apache `mod_wsgi`.

- **The consumer:** This is what watches your consumption folder for
  documents. However, the consumer itself does not really consume your
  documents. Instead, it notifies a task processor that a new file is ready
  for consumption. I suppose it should be named differently. The consumer
  also used to check your emails, but that's now done elsewhere as
  well.

    Start the consumer with the management command `document_consumer`:

    ```shell-session
    $ cd /path/to/paperless/src/
    $ python3 manage.py document_consumer
    ```

- **The task processor:** Paperless relies on [Celery - Distributed
  Task Queue](https://docs.celeryq.dev/en/stable/index.html) for doing
  most of the heavy lifting. This is a task queue that accepts tasks
  from multiple sources and processes these in parallel. It also comes
  with a scheduler that executes certain commands periodically.

    This task processor is responsible for:

    - Consuming documents. When the consumer finds new documents, it
      notifies the task processor to start a consumption task.
    - The task processor also performs the consumption of any
      documents you upload through the web interface.
    - Consuming emails. It periodically checks your configured
      accounts for new emails and notifies the task processor to
      consume the attachment of an email.
    - Maintaining the search index and the automatic matching
      algorithm. These are things that paperless needs to do from time
      to time in order to operate properly.

    This allows paperless to process multiple documents from your
    consumption folder in parallel! On a modern multi core system, this
    makes the consumption process with full OCR blazingly fast.

    The task processor comes with a built-in admin interface that you
    can use to check whenever any of the tasks fail and inspect the
    errors (e.g., wrong email credentials, errors during consuming a
    specific file, etc.).

- A [redis](https://redis.io/) message broker: This is a really
  lightweight service that is responsible for getting the tasks from
  the webserver and the consumer to the task scheduler. These run in
  different processes (maybe even on different machines!), and
  therefore, this is necessary.

- Optional: A database server. Paperless supports PostgreSQL, MariaDB
  and SQLite for storing its data.
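
The division of labour between consumer, message broker and task processor can be illustrated with a toy model. This is only a sketch of the pattern: a `queue.Queue` stands in for the broker and threads stand in for the workers, whereas the real components are separate processes. The function name `run_pipeline` is hypothetical.

```python
import queue
import threading


def run_pipeline(new_files: list[str], workers: int = 2) -> list[str]:
    broker: queue.Queue = queue.Queue()  # stands in for the message broker
    processed: list[str] = []
    lock = threading.Lock()

    def task_processor() -> None:
        while True:
            path = broker.get()
            if path is None:  # sentinel: no more work for this worker
                break
            with lock:
                # Stands in for OCR, matching and indexing.
                processed.append(f"consumed {path}")

    threads = [threading.Thread(target=task_processor) for _ in range(workers)]
    for thread in threads:
        thread.start()
    # The consumer only announces new files; it does not process them itself.
    for path in new_files:
        broker.put(path)
    for _ in threads:
        broker.put(None)
    for thread in threads:
        thread.join()
    return sorted(processed)
```

Because the consumer only places work on the queue, several workers can pick up files at the same time, which mirrors the parallel consumption described above.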