From 1e95d22e1a39ce81c288b757410ca61eba10b50b Mon Sep 17 00:00:00 2001 From: jonaswinkler Date: Mon, 23 Nov 2020 18:40:22 +0100 Subject: [PATCH 1/8] Update README.md --- README.md | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index 365b90211..09a442e55 100644 --- a/README.md +++ b/README.md @@ -35,7 +35,9 @@ The gist of the changes is the following: * New full text search. * New email processing. * Machine learning powered document matching. -* Code cleanup in many, MANY areas. +* A task processor that processes documents in parallel and also tells you when something goes wrong. +* Code cleanup in many, MANY areas. Some of the code was just overly complicated. +* More tests, more stability. If you want to see some screenshots of paperless-ng in action, [some are available in the documentation](https://paperless-ng.readthedocs.io/en/latest/screenshots.html). @@ -45,28 +47,31 @@ For a complete list of changes, check out the [changelog](https://paperless-ng.r These features will make it into the application at some point, sorted by priority. +- **Adding text to PDF documents.** I've seen there are libraries that do this for you. - **More search.** The search backend is incredibly versatile and customizable. Searching is the most important feature of this project and thus, I want to implement things like: - Group and limit search results by correspondent, show “more from this” links in the results. - Ability to search for “Similar documents” in the search results - Provide corrections for mispelled queries -- **More robust consumer** that shows its progress on the web page. +- **An interactive consumer** that shows its progress for documents it processes on the web page. + - With live updates and websockets. This already works on a dev branch, but requires a lot of new dependencies, which I'm not particularly happy about.
+ - Notifications when a document was added with buttons to open the new document right away. - **Arbitrary tag colors**. Allow the selection of any color with a color picker. ## On the chopping block. -- **GnuPG encrypion.** Since its disabled by default and the website allows transparent access to encrypted documents anyway, this doesn’t really provide any benefit over having the application stored on an encrypted file system. +- **GnuPG encrypion.** [Here's a note about encryption in paperless](https://paperless-ng.readthedocs.io/en/latest/administration.html#managing-encryption). The gist of it is that I don't see which attacks this implementation protects against. It gives a false sense of security to users who don't care about how it works. # Getting started -The recommended way to deploy paperless is docker-compose. Grab the latest release to get started. the dockerfiles archive contains just the docker files which will pull the image from docker hub. The source archive contains everything you need to build the docker image yourself. +The recommended way to deploy paperless is docker-compose. Don't clone the repository, grab the latest release to get started instead. The dockerfiles archive contains just the docker files which will pull the image from docker hub. The source archive contains everything you need to build the docker image yourself (i.e. if you want to run on Raspberry Pi). Read the [documentation](https://paperless-ng.readthedocs.io/en/latest/setup.html#installation) on how to get started. -Alternatively, you can install the dependencies and setup apache and a database server yourself. Details for that will be available in the documentation at some point. +Alternatively, you can install the dependencies and setup apache and a database server yourself. The documentation has information about the individual components of paperless that you need to take care of.
# Migrating to paperless-ng -Read the section about [migration](https://paperless-ng.readthedocs.io/en/latest/setup.html#migration-to-paperless-ng) in the documentation. +Read the section about [migration](https://paperless-ng.readthedocs.io/en/latest/setup.html#migration-to-paperless-ng) in the documentation. It's also entirely possible to go back to paperless by reverting the database migrations. # Documentation From ded8f865d83a113ba28c0539f517f97cf61c0fa7 Mon Sep 17 00:00:00 2001 From: jonaswinkler Date: Mon, 23 Nov 2020 18:45:23 +0100 Subject: [PATCH 2/8] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 09a442e55..7748ecb65 100644 --- a/README.md +++ b/README.md @@ -47,7 +47,7 @@ For a complete list of changes, check out the [changelog](https://paperless-ng.r These features will make it into the application at some point, sorted by priority. -- **Adding text to PDF documents.** I've seen there are libraries that do this for you. +- **Adding a text layer to ocr'ed PDF documents.** I've seen there are libraries that do this for you. - **More search.** The search backend is incredibly versatile and customizable. Searching is the most important feature of this project and thus, I want to implement things like: - Group and limit search results by correspondent, show “more from this” links in the results. - Ability to search for “Similar documents” in the search results From 977594fece71444ecdb2482017f59d8d2f5ee3a9 Mon Sep 17 00:00:00 2001 From: jonaswinkler Date: Tue, 24 Nov 2020 15:34:58 +0100 Subject: [PATCH 3/8] Update README.md --- README.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/README.md b/README.md index 7748ecb65..fc15dcda7 100644 --- a/README.md +++ b/README.md @@ -77,6 +77,10 @@ Read the section about [migration](https://paperless-ng.readthedocs.io/en/latest The documentation for Paperless-ng is available on [ReadTheDocs](https://paperless-ng.readthedocs.io/).
+# Suggestions? Questions? Something not working? + +Please open an issue and start a discussion about it! + # Affiliated Projects Paperless has been around a while now, and people are starting to build stuff on top of it. If you're one of those people, we can add your project to this list: From ae30fef6412caf7dad2ca87224639ad0b169fbf6 Mon Sep 17 00:00:00 2001 From: jonaswinkler Date: Thu, 26 Nov 2020 18:22:33 +0100 Subject: [PATCH 4/8] Update README.md --- README.md | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/README.md b/README.md index fc15dcda7..29a17e6e1 100644 --- a/README.md +++ b/README.md @@ -61,6 +61,17 @@ These features will make it into the application at some point, sorted by priori - **GnuPG encrypion.** [Here's a note about encryption in paperless](https://paperless-ng.readthedocs.io/en/latest/administration.html#managing-encryption). The gist of it is that I don't see which attacks this implementation protects against. It gives a false sense of security to users who don't care about how it works. +## Goals for 1.0 + +- Test coverage at 90%. +- Store archived documents with an embedded OCR text layer, while keeping originals available. Making good progress in the `feature-ocrmypdf` branch. +- Fix whatever bugs I and you find. + +## Non-goals for 1.0 + +- Mobile support. +- Any other big feature or improvement listed in the issues section. + # Getting started The recommended way to deploy paperless is docker-compose. Don't clone the repository, grab the latest release to get started instead. The dockerfiles archive contains just the docker files which will pull the image from docker hub. The source archive contains everything you need to build the docker image yourself (i.e. if you want to run on Raspberry Pi). 
From 68e0c21eb0f324ed0f7314cd380af398f3b2cb0f Mon Sep 17 00:00:00 2001 From: jonaswinkler Date: Thu, 26 Nov 2020 19:52:26 +0100 Subject: [PATCH 5/8] Update README.md --- README.md | 26 ++++++++++++-------------- 1 file changed, 12 insertions(+), 14 deletions(-) diff --git a/README.md b/README.md index 29a17e6e1..ef088ef9e 100644 --- a/README.md +++ b/README.md @@ -43,11 +43,14 @@ If you want to see some screenshots of paperless-ng in action, [some are availab For a complete list of changes, check out the [changelog](https://paperless-ng.readthedocs.io/en/latest/changelog.html) -## Planned +# Roadmap for 1.0 -These features will make it into the application at some point, sorted by priority. +- Test coverage at 90%. +- Store archived documents with an embedded OCR text layer, while keeping originals available. Making good progress in the `feature-ocrmypdf` branch. +- Fix whatever bugs I and you find + +## Roadmap for versions beyond 1.0 -- **Adding a text layer to ocr'ed PDF documents.** I've seen there are libraries that do this for you. - **More search.** The search backend is incredibly versatile and customizable. Searching is the most important feature of this project and thus, I want to implement things like: - Group and limit search results by correspondent, show “more from this” links in the results. - Ability to search for “Similar documents” in the search results @@ -61,17 +64,6 @@ These features will make it into the application at some point, sorted by priori - **GnuPG encrypion.** [Here's a note about encryption in paperless](https://paperless-ng.readthedocs.io/en/latest/administration.html#managing-encryption). The gist of it is that I don't see which attacks this implementation protects against. It gives a false sense of security to users who don't care about how it works. -## Goals for 1.0 - -- Test coverage at 90%. -- Store archived documents with an embedded OCR text layer, while keeping originals available. 
Making good progress in the `feature-ocrmypdf` branch. -- Fix whatever bugs I and you find. - -## Non-goals for 1.0 - -- Mobile support. -- Any other big feature or improvement listed in the issues section. - # Getting started The recommended way to deploy paperless is docker-compose. Don't clone the repository, grab the latest release to get started instead. The dockerfiles archive contains just the docker files which will pull the image from docker hub. The source archive contains everything you need to build the docker image yourself (i.e. if you want to run on Raspberry Pi). @@ -92,6 +84,12 @@ The documentation for Paperless-ng is available on [ReadTheDocs](https://paperle Please open an issue and start a discussion about it! +## Feel like helping out? + +There's still lots of things to be done, just have a look at the issue log. If you feel like contributing to the project, please do! Bug fixes and improvements to the front end (I just can't seem to get some of these CSS things right) are always welcome. + +If you want to implement something big: Please start a discussion about that in the issues! Maybe I've already had something similar in mind and we can make it happen together. However, keep in mind that the general roadmap is to make the existing features stable and get them tested. See the roadmap above. + # Affiliated Projects Paperless has been around a while now, and people are starting to build stuff on top of it.
If you're one of those people, we can add your project to this list: From 785577b2e87efcc9405ba61488e27eeb26497559 Mon Sep 17 00:00:00 2001 From: jonaswinkler Date: Fri, 27 Nov 2020 14:19:21 +0100 Subject: [PATCH 6/8] Update CONTRIBUTING.md --- CONTRIBUTING.md | 23 ++++++++++++++++++----- 1 file changed, 18 insertions(+), 5 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 08b19bdee..1611bea9e 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -1,13 +1,26 @@ # Contributing -If you feel that somethings is not working, please submit an issue. You can also ask questions on the issue tracker by tagging your question with the question tag. +There's still lots of things to be done, just have a look at the issue log. If you feel like contributing to the project, please do! Bug fixes and improvements to the front end (I just can't seem to get some of these CSS things right) are always welcome. -Pull requests are welcome, however, I will be a little bit more strict about what goes into the code and what does not. If you want to make a big change, please ask me about it first. +If you want to implement something big: Please start a discussion about that in the issues! Maybe I've already had something similar in mind and we can make it happen together. However, keep in mind that the general roadmap is to make the existing features stable and get them tested. See the roadmap in the readme. * When making additions to the project, consider if the majority of users will benefit from your change. If not, you're probably better of forking the project. * Also consider if your change will get in the way of other users. A good change is a change that enhances the experience of some users who want that change and does not affect users who do not care about the change. -However: +## Python -* Bug fixes and are always welcome. Docker makes things easier, however, I alone cannot ensure that this runs on all platforms.
-* Improvements to the styling of the front-end are always welcome. I'm no expert in things UX, and simply copied one of the Bootstrap examples. I think it turned out rather good, but I just can't seem to get some things working properly. +Use python 3.6 for development. Paperless supports python 3.6, 3.7 and 3.8. + +## Branches + +master always reflects the latest release. + +dev contains all changes that will be part of the next release. + +feature-X branches are for experimental stuff that will eventually be merged into dev, and then released as part of the next release. + +## Testing + +I'm trying to get most of paperless tested, so please do the same for your code! I know it's a hassle, but it makes sure that your code works now and will allow us to detect regressions easily. + +To test your code, execute `pytest` in the src/ directory. Executing that in the project root is no good. This also generates an HTML coverage report, which you can use to see if you missed anything important during testing. From 440a23a054114d17c48e1b15de9ed588f3372079 Mon Sep 17 00:00:00 2001 From: jonaswinkler Date: Fri, 27 Nov 2020 14:21:04 +0100 Subject: [PATCH 7/8] Update CONTRIBUTING.md --- CONTRIBUTING.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 1611bea9e..bd6080d35 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -15,7 +15,7 @@ Use python 3.6 for development. Paperless supports python 3.6, 3.7 and 3.8. master always reflects the latest release. -dev contains all changes that will be part of the next release. +dev contains all changes that will be part of the next release. Use this branch to start making your changes. feature-X branches are for experimental stuff that will eventually be merged into dev, and then released as part of the next release.
From 52b30576408e126d7da6080b7071cc95bdc4899f Mon Sep 17 00:00:00 2001 From: jonaswinkler Date: Sat, 28 Nov 2020 11:49:46 +0100 Subject: [PATCH 8/8] fixes to the search index --- src/documents/tasks.py | 4 +++- src/documents/tests/test_api.py | 7 ++++--- 2 files changed, 7 insertions(+), 4 deletions(-) diff --git a/src/documents/tasks.py b/src/documents/tasks.py index 3c9baad08..cd47892be 100644 --- a/src/documents/tasks.py +++ b/src/documents/tasks.py @@ -12,7 +12,9 @@ from documents.sanity_checker import SanityFailedError def index_optimize(): - index.open_index().optimize() + ix = index.open_index() + writer = AsyncWriter(ix) + writer.commit(optimize=True) def index_reindex(): diff --git a/src/documents/tests/test_api.py b/src/documents/tests/test_api.py index bb0581656..dabae6d82 100644 --- a/src/documents/tests/test_api.py +++ b/src/documents/tests/test_api.py @@ -5,6 +5,7 @@ from unittest import mock from django.contrib.auth.models import User from pathvalidate import ValidationError from rest_framework.test import APITestCase +from whoosh.writing import AsyncWriter from documents import index from documents.models import Document, Correspondent, DocumentType, Tag @@ -173,7 +174,7 @@ class DocumentApiTest(DirectoriesMixin, APITestCase): d1=Document.objects.create(title="invoice", content="the thing i bought at a shop and paid with bank account", checksum="A", pk=1) d2=Document.objects.create(title="bank statement 1", content="things i paid for in august", pk=2, checksum="B") d3=Document.objects.create(title="bank statement 3", content="things i paid for in september", pk=3, checksum="C") - with index.open_index(False).writer() as writer: + with AsyncWriter(index.open_index()) as writer: # Note to future self: there is a reason we dont use a model signal handler to update the index: some operations edit many documents at once # (retagger, renamer) and we don't want to open a writer for each of these, but rather perform the entire operation with one
writer. # That's why we cant open the writer in a model on_save handler or something. @@ -209,7 +210,7 @@ class DocumentApiTest(DirectoriesMixin, APITestCase): self.assertEqual(len(results), 0) def test_search_multi_page(self): - with index.open_index(False).writer() as writer: + with AsyncWriter(index.open_index()) as writer: for i in range(55): doc = Document.objects.create(checksum=str(i), pk=i+1, title=f"Document {i+1}", content="content") index.update_document(writer, doc) @@ -248,7 +249,7 @@ class DocumentApiTest(DirectoriesMixin, APITestCase): self.assertEqual(len(results), 5) def test_search_invalid_page(self): - with index.open_index(False).writer() as writer: + with AsyncWriter(index.open_index()) as writer: for i in range(15): doc = Document.objects.create(checksum=str(i), pk=i+1, title=f"Document {i+1}", content="content") index.update_document(writer, doc)
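A note on the "one writer for the entire operation" comment in `test_api.py` above: the tests index a whole batch of documents inside a single `with ... as writer:` block, so the index is committed once when the block exits rather than once per document. The sketch below illustrates that pattern, together with the paging arithmetic that `test_search_multi_page` seems to rely on (55 documents, with a later assertion expecting 5 results). It is a toy model using only the standard library: `ToyIndex`, `ToyWriter`, and the page size of 10 are hypothetical stand-ins invented for illustration, not Whoosh's or paperless-ng's actual API.

```python
class ToyWriter:
    """Hypothetical writer: buffers updates and commits them once on exit."""

    def __init__(self, index):
        self.index = index
        self.pending = {}

    def update_document(self, pk, content):
        # Buffer the change; nothing is visible to searches yet.
        self.pending[pk] = content

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Commit only if the block finished without an exception.
        if exc_type is None:
            self.index.docs.update(self.pending)
            self.index.commits += 1
        return False  # never swallow exceptions


class ToyIndex:
    """Hypothetical in-memory index; a stand-in, not a real search backend."""

    def __init__(self):
        self.docs = {}    # committed documents: pk -> content
        self.commits = 0  # number of commits performed so far

    def writer(self):
        return ToyWriter(self)

    def search(self, term, page=1, page_size=10):
        # Return (primary keys on the requested page, total number of hits).
        hits = sorted(pk for pk, content in self.docs.items()
                      if term in content.split())
        start = (page - 1) * page_size
        return hits[start:start + page_size], len(hits)


ix = ToyIndex()

# One writer for the whole batch, as in the tests: a single commit on exit.
with ix.writer() as writer:
    for i in range(55):
        writer.update_document(i + 1, "content")

first_page, total = ix.search("content", page=1)
last_page, _ = ix.search("content", page=6)
print(ix.commits, total, len(first_page), len(last_page))  # prints: 1 55 10 5
```

In this model, opening a fresh writer per document would produce 55 commits instead of one, which is the cost the comment warns about for bulk operations such as the retagger and renamer.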