mirror of
				https://github.com/paperless-ngx/paperless-ngx.git
				synced 2025-10-24 03:26:11 -05:00 
			
		
		
		
	Documented all of the guesswork Paperless does
This commit is contained in:
		| @@ -3,6 +3,8 @@ Changelog | |||||||
|  |  | ||||||
| * 0.2.0 | * 0.2.0 | ||||||
|  |  | ||||||
|  |   * `#89`_ Ported the auto-tagging code to correspondents as well.  Thanks to | ||||||
|  |     `Justin Snyman`_ for the pointers in the issue queue. | ||||||
|   * Added support for guessing the date from the file name along with the |   * Added support for guessing the date from the file name along with the | ||||||
|     correspondent, title, and tags.  Thanks to `Tikitu de Jager`_ for his pull |     correspondent, title, and tags.  Thanks to `Tikitu de Jager`_ for his pull | ||||||
|     request that I took forever to merge and to `Pit`_ for his efforts on the |     request that I took forever to merge and to `Pit`_ for his efforts on the | ||||||
| @@ -97,6 +99,7 @@ Changelog | |||||||
| .. _zedster: https://github.com/zedster | .. _zedster: https://github.com/zedster | ||||||
| .. _Martin Honermeyer: https://github.com/djmaze | .. _Martin Honermeyer: https://github.com/djmaze | ||||||
| .. _Tim White: https://github.com/timwhite | .. _Tim White: https://github.com/timwhite | ||||||
|  | .. _Justin Snyman: https://github.com/stringlytyped | ||||||
|  |  | ||||||
| .. _#20: https://github.com/danielquinn/paperless/issues/20 | .. _#20: https://github.com/danielquinn/paperless/issues/20 | ||||||
| .. _#44: https://github.com/danielquinn/paperless/issues/44 | .. _#44: https://github.com/danielquinn/paperless/issues/44 | ||||||
| @@ -110,4 +113,5 @@ Changelog | |||||||
| .. _#67: https://github.com/danielquinn/paperless/issues/67 | .. _#67: https://github.com/danielquinn/paperless/issues/67 | ||||||
| .. _#68: https://github.com/danielquinn/paperless/issues/68 | .. _#68: https://github.com/danielquinn/paperless/issues/68 | ||||||
| .. _#71: https://github.com/danielquinn/paperless/issues/71 | .. _#71: https://github.com/danielquinn/paperless/issues/71 | ||||||
|  | .. _#89: https://github.com/danielquinn/paperless/issues/89 | ||||||
| .. _#94: https://github.com/danielquinn/paperless/issues/71 | .. _#94: https://github.com/danielquinn/paperless/issues/71 | ||||||
|   | |||||||
| @@ -3,7 +3,7 @@ | |||||||
| Consumption | Consumption | ||||||
| ########### | ########### | ||||||
|  |  | ||||||
| Once you've got *Paperless* setup, you need to start feeding documents into it. | Once you've got Paperless set up, you need to start feeding documents into it. | ||||||
| Currently, there are three options: the consumption directory, IMAP (email), and | Currently, there are three options: the consumption directory, IMAP (email), and | ||||||
| HTTP POST. | HTTP POST. | ||||||
|  |  | ||||||
| @@ -35,41 +35,6 @@ appropriate for your use and put some documents in there.  When you're ready, | |||||||
| follow the :ref:`consumer <utilities-consumer>` instructions to get it running. | follow the :ref:`consumer <utilities-consumer>` instructions to get it running. | ||||||
|  |  | ||||||
|  |  | ||||||
| .. _consumption-directory-naming: |  | ||||||
|  |  | ||||||
| A Note on File Naming |  | ||||||
| --------------------- |  | ||||||
|  |  | ||||||
| Any document you put into the consumption directory will be consumed, but if |  | ||||||
| you name the file right, it'll automatically set some values in the database |  | ||||||
| for you.  This is is the logic the consumer follows: |  | ||||||
|  |  | ||||||
| 1. Try to find the correspondent, title, and tags in the file name following |  | ||||||
|    the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``.  Note that |  | ||||||
|    the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or |  | ||||||
|    ``YYYYMMDDZ``.  The ``Z`` is for "Zulu time" AKA "UTC". |  | ||||||
| 2. If that doesn't work, we skip the date and try this pattern: |  | ||||||
|    the pattern: ``Correspondent - Title - tag,tag,tag.pdf``. |  | ||||||
| 3. If that doesn't work, we try to find the correspondent and title in the file |  | ||||||
|    name following the pattern:  ``Correspondent - Title.pdf``. |  | ||||||
| 4. If that doesn't work, just assume that the name of the file is the title. |  | ||||||
|  |  | ||||||
| So given the above, the following examples would work as you'd expect: |  | ||||||
|  |  | ||||||
| * ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf`` |  | ||||||
| * ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf`` |  | ||||||
| * ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf`` |  | ||||||
| * ``Another Company - Letter of Reference.jpg`` |  | ||||||
| * ``Dad's Recipe for Pancakes.png`` |  | ||||||
|  |  | ||||||
| These however wouldn't work: |  | ||||||
|  |  | ||||||
| * ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf`` |  | ||||||
| * ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf`` |  | ||||||
| * ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf`` |  | ||||||
| * ``Another Company- Letter of Reference.jpg`` |  | ||||||
|  |  | ||||||
|  |  | ||||||
| .. _consumption-imap: | .. _consumption-imap: | ||||||
|  |  | ||||||
| IMAP (Email) | IMAP (Email) | ||||||
|   | |||||||
							
								
								
									
										85
									
								
								docs/guesswork.rst
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							| @@ -0,0 +1,85 @@ | |||||||
|  | .. _guesswork: | ||||||
|  |  | ||||||
|  | Guesswork | ||||||
|  | ######### | ||||||
|  |  | ||||||
|  | During the consumption process, Paperless tries to guess some of the attributes | ||||||
|  | of the document it's looking at.  To do this it uses two approaches: | ||||||
|  |  | ||||||
|  |  | ||||||
|  | .. _guesswork-naming: | ||||||
|  |  | ||||||
|  | File Naming | ||||||
|  | =========== | ||||||
|  |  | ||||||
|  | Any document you put into the consumption directory will be consumed, but if | ||||||
|  | you name the file right, it'll automatically set some values in the database | ||||||
|  | for you.  This is the logic the consumer follows: | ||||||
|  |  | ||||||
|  | 1. Try to find the correspondent, title, and tags in the file name following | ||||||
|  |    the pattern: ``Date - Correspondent - Title - tag,tag,tag.pdf``.  Note that | ||||||
|  |    the format of the date is **rigidly defined** as ``YYYYMMDDHHMMSSZ`` or | ||||||
|  |    ``YYYYMMDDZ``.  The ``Z`` refers to "Zulu time", AKA "UTC". | ||||||
|  | 2. If that doesn't work, we skip the date and try this pattern: | ||||||
|  |    ``Correspondent - Title - tag,tag,tag.pdf``. | ||||||
|  | 3. If that doesn't work, we try to find the correspondent and title in the file | ||||||
|  |    name following the pattern: ``Correspondent - Title.pdf``. | ||||||
|  | 4. If that doesn't work, just assume that the name of the file is the title. | ||||||
|  |  | ||||||
|  | So given the above, the following examples would work as you'd expect: | ||||||
|  |  | ||||||
|  | * ``20150314000700Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf`` | ||||||
|  | * ``20150314Z - Some Company Name - Invoice 2016-01-01 - money,invoices.pdf`` | ||||||
|  | * ``Some Company Name - Invoice 2016-01-01 - money,invoices.pdf`` | ||||||
|  | * ``Another Company - Letter of Reference.jpg`` | ||||||
|  | * ``Dad's Recipe for Pancakes.png`` | ||||||
|  |  | ||||||
|  | These however wouldn't work: | ||||||
|  |  | ||||||
|  | * ``2015-03-14 00:07:00 UTC - Some Company Name, Invoice 2016-01-01, money, invoices.pdf`` | ||||||
|  | * ``2015-03-14 - Some Company Name, Invoice 2016-01-01, money, invoices.pdf`` | ||||||
|  | * ``Some Company Name, Invoice 2016-01-01, money, invoices.pdf`` | ||||||
|  | * ``Another Company- Letter of Reference.jpg`` | ||||||
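The fallback order described above can be sketched as a cascade of regular expressions, tried from most to least specific. This is only an illustration and not the consumer's actual code: the function name ``guess_from_filename``, the pattern constants, and the assumption that tags are lowercase letters, digits, commas, and hyphens are all hypothetical.

```python
import re
from datetime import datetime, timezone

# Hypothetical building blocks; the real consumer may use different patterns.
DATE = r"(?P<date>\d{14}Z|\d{8}Z)"   # YYYYMMDDHHMMSSZ or YYYYMMDDZ
TAGS = r"(?P<tags>[a-z0-9,\-]+)"     # assumed tag alphabet
EXT = r"\.(?P<ext>\w+)$"

# Tried in order, mirroring steps 1 to 4 of the list above.
PATTERNS = [
    re.compile(r"^" + DATE + r" - (?P<correspondent>.+) - (?P<title>.+) - " + TAGS + EXT),
    re.compile(r"^(?P<correspondent>.+) - (?P<title>.+) - " + TAGS + EXT),
    re.compile(r"^(?P<correspondent>.+) - (?P<title>.+)" + EXT),
    re.compile(r"^(?P<title>.+)" + EXT),
]


def parse_date(stamp):
    """Turn a 'Zulu' timestamp string into an aware UTC datetime."""
    if stamp is None:
        return None
    fmt = "%Y%m%d%H%M%SZ" if len(stamp) == 15 else "%Y%m%dZ"
    return datetime.strptime(stamp, fmt).replace(tzinfo=timezone.utc)


def guess_from_filename(name):
    """Return the attributes that can be guessed from a file name."""
    for pattern in PATTERNS:
        found = pattern.match(name)
        if found is None:
            continue  # fall through to the next, less specific pattern
        groups = found.groupdict()
        return {
            "created": parse_date(groups.get("date")),
            "correspondent": groups.get("correspondent"),
            "title": groups["title"],
            "tags": groups["tags"].split(",") if groups.get("tags") else [],
        }
    # No extension at all: treat the whole name as the title.
    return {"created": None, "correspondent": None, "title": name, "tags": []}
```

Under these assumptions, a name such as ``Another Company- Letter of Reference.jpg`` falls all the way through to the last pattern, so the whole name becomes the title and no correspondent is set.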
|  |  | ||||||
|  |  | ||||||
|  | .. _guesswork-content: | ||||||
|  |  | ||||||
|  | Reading the Document Contents | ||||||
|  | ============================= | ||||||
|  |  | ||||||
|  | After the consumer has tried to figure out what it could from the file name, | ||||||
|  | it starts looking at the content of the document itself.  It tests the match | ||||||
|  | text and matching algorithm of every tag and correspondent already in your | ||||||
|  | database against the text of that document.  In other words, | ||||||
|  | if you defined a tag called ``Home Utility`` that had a ``match`` property of | ||||||
|  | ``bc hydro`` and a ``matching_algorithm`` of ``literal``, Paperless will | ||||||
|  | automatically tag your newly-consumed document with your ``Home Utility`` tag | ||||||
|  | so long as the text ``bc hydro`` appears in the body of the document somewhere. | ||||||
|  |  | ||||||
|  | The matching logic is quite powerful and supports several algorithms for | ||||||
|  | searching the text of your document, so some experimentation may be | ||||||
|  | necessary to get things Just Right. | ||||||
|  |  | ||||||
|  |  | ||||||
|  | .. _guesswork-content-howto: | ||||||
|  |  | ||||||
|  | How Do I Set Up These Matching Algorithms? | ||||||
|  | ------------------------------------------ | ||||||
|  |  | ||||||
|  | Setting up the algorithms is easily done through the admin interface.  When | ||||||
|  | you create a new correspondent or tag, there are optional fields for matching | ||||||
|  | text and matching algorithm.  From the help info there: | ||||||
|  |  | ||||||
|  | .. note:: | ||||||
|  |  | ||||||
|  |     Which algorithm you want to use when matching text to the OCR'd PDF.  Here, | ||||||
|  |     "any" looks for any occurrence of any word provided in the PDF, while "all" | ||||||
|  |     requires that every word provided appear in the PDF, albeit not in the | ||||||
|  |     order provided.  A "literal" match means that the text you enter must | ||||||
|  |     appear in the PDF exactly as you've entered it, and "regular expression" | ||||||
|  |     uses a regex to match the PDF.  If you don't know what a regex is, you | ||||||
|  |     probably don't want this option. | ||||||
|  |  | ||||||
|  | Then just save your tag/correspondent and run another document through the | ||||||
|  | consumer.  Once complete, you should see the newly-created document, | ||||||
|  | automatically tagged with the appropriate data. | ||||||
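To make the four algorithms from the help text concrete, here is a minimal sketch of how such a matcher could behave. It is not the project's implementation: the function ``matches``, the algorithm labels as plain strings, whole-word matching for "any" and "all", and case-insensitive comparison are all assumptions made for illustration.

```python
import re


def _has_word(word, text):
    """True if `word` occurs in `text` as a whole word (case-insensitive)."""
    return re.search(r"\b" + re.escape(word) + r"\b", text, re.IGNORECASE) is not None


def matches(match, algorithm, text):
    """Decide whether a tag's or correspondent's match text applies to `text`."""
    if not match.strip():
        return False  # nothing to match against
    words = match.split()
    if algorithm == "any":
        # At least one of the words appears somewhere in the text.
        return any(_has_word(word, text) for word in words)
    if algorithm == "all":
        # Every word appears, in any order.
        return all(_has_word(word, text) for word in words)
    if algorithm == "literal":
        # The text contains the match string exactly as entered.
        return match.lower() in text.lower()
    if algorithm == "regex":
        # The match string is treated as a regular expression.
        return re.search(match, text, re.IGNORECASE) is not None
    raise ValueError("unknown matching algorithm: " + algorithm)
```

With the ``Home Utility`` example above, ``matches("bc hydro", "literal", text)`` would be true only when the phrase appears as written, whereas ``"all"`` would also accept a document that mentions ``hydro`` and ``bc`` in separate places.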
| @@ -32,6 +32,7 @@ Contents | |||||||
|    consumption |    consumption | ||||||
|    api |    api | ||||||
|    utilities |    utilities | ||||||
|  |    guesswork | ||||||
|    migrating |    migrating | ||||||
|    troubleshooting |    troubleshooting | ||||||
|    changelog |    changelog | ||||||
|   | |||||||
| @@ -52,9 +52,12 @@ for PDF files to parse and index.  The process is pretty straightforward: | |||||||
|    wait 10 seconds and try again. |    wait 10 seconds and try again. | ||||||
| 2. Parse the PDF with Tesseract | 2. Parse the PDF with Tesseract | ||||||
| 3. Create a new record in the database with the OCR'd text | 3. Create a new record in the database with the OCR'd text | ||||||
| 4. Encrypt the PDF and store it in the ``media`` directory under | 4. Attempt to automatically assign document attributes by doing some guesswork. | ||||||
|  |    Read up on the :ref:`guesswork documentation<guesswork>` for more | ||||||
|  |    information about this process. | ||||||
|  | 5. Encrypt the PDF and store it in the ``media`` directory under | ||||||
|    ``documents/pdf``. |    ``documents/pdf``. | ||||||
| 5. Go to #1. | 6. Go to #1. | ||||||
|  |  | ||||||
|  |  | ||||||
| .. _utilities-consumer-howto: | .. _utilities-consumer-howto: | ||||||
|   | |||||||
	 Daniel Quinn