58 Commits

Author SHA1 Message Date
Trenton H
f7e6361206 Just in case, catch an occasional NLTK error and return the basic processed content instead 2023-05-24 19:34:49 -07:00
Trenton H
aabcc9a1c4 Upgrades black to v23, upgrades ruff 2023-04-26 09:35:27 -07:00
Trenton H
30655f1b73 Fixes ruff not running isort against the codebase 2023-04-26 09:35:27 -07:00
Trenton H
d2c02b9102 Configures ruff as the one stop linter and resolves warnings it raised 2023-04-01 17:03:52 -07:00
Trenton H
ec2b0eb308 Changes out the settings and a decent amount of test code to be pathlib compatible 2023-03-06 09:16:07 -08:00
Trenton Holmes
73dc928832 Returns to using hashing against primary keys, at least for fields. Improves testing coverage 2023-02-28 08:13:10 -08:00
Trenton Holmes
303e81eb79 Changes from a hash based system to a time based system to prevent extra retrains 2023-02-28 08:13:10 -08:00
Trenton H
21cd76a181 Changes classifier training to hold less data in memory at the same time 2023-02-28 08:13:10 -08:00
Trenton H
2d71415ede Allows disabling NLTK, adds it as a consideration for low power devices 2022-10-10 08:58:23 -07:00
Trenton Holmes
a78d44ec5f Changes the NLTK language to be based on the Tesseract OCR language, with fallback to the default processing 2022-10-10 08:58:23 -07:00
Trenton H
0bc13c2a72 Allows configuration of the NLTK processing language 2022-10-10 08:58:23 -07:00
Trenton Holmes
70b1988a55 Fixes the download and usage of the downloaded data 2022-10-10 08:58:23 -07:00
Trenton Holmes
66884ea035 Updates the pre-processing of document content to be much more robust, with tokenization, stemming and stop word removal 2022-10-10 08:58:23 -07:00
Trenton Holmes
024fd8bc9b When raising an exception during exception handling, chain them together for slightly cleaner logs 2022-08-03 09:00:56 -07:00
Trenton Holmes
64be6dcb36 No need for a branch here, the loop takes care of it 2022-07-05 08:20:35 +02:00
Trenton Holmes
6bd585a9a0 Updates the classifier to catch warnings from scikit-learn and rebuild the model file when this happens 2022-07-05 08:20:35 +02:00
Markus
dd3b5c129c Feature: Dynamic document storage paths (#916)
* Added devcontainer

* Add feature storage paths

* Exclude tests and add versioning

* Check escaping

* Check escaping

* Check quoting

* Echo

* Escape

* Escape :

* Double escape \

* Escaping

* Remove if

* Escape colon

* Missing \

* Escape :

* Escape all

* test

* Remove sed

* Fix exclude

* Remove SED command

* Add LD_LIBRARY_PATH

* Adjusted to v1.7

* Updated test-cases

* Remove devcontainer

* Removed internal build-file

* Run pre-commit

* Corrected flake8 error

* Adjusted to v1.7

* Updated test-cases

* Corrected flake8 error

* Adjusted to new plural translations

* Small adjustments due to code-review backend

* Adjusted line-break

* Removed PAPERLESS prefix from settings variables

* Corrected style change due to search+replace

* First documentation draft

* Revert changes to Pipfile

* Add sphinx-autobuild with keep-outdated

* Revert merge error that resulted in the wrong storage path being evaluated

* Adjust styles of generated files ...

* Adds additional testing to cover dynamic storage path functionality

* Remove unnecessary condition

* Add hint to edit storage path dialog

* Correct spelling of pathes to paths

* Minor documentation tweaks

* Minor typo

* improving wrapping of filter editor buttons with new storage path button

* Update .gitignore

* Fix select border radius in non input-groups

* Better storage path edit hint

* Add note to edit storage path dialog re document_renamer

* Add note to bulk edit storage path re document_renamer

* Rename FILTER_STORAGE_DIRECTORY to PATH

* Fix broken filter rule parsing

* Show default storage if unspecified

* Remove note re storage path on bulk edit

* Add basic validation of filename variables

Co-authored-by: Markus Kling <markus@markus-kling.net>
Co-authored-by: Trenton Holmes <holmes.trenton@gmail.com>
Co-authored-by: Michael Shamoon <4887959+shamoon@users.noreply.github.com>
Co-authored-by: Quinn Casey <quinn@quinncasey.com>
2022-05-19 14:42:25 -07:00
Trenton Holmes
f62193099c Runs pyupgrade to Python 3.8+ and adds a hook for it 2022-05-06 09:04:08 -07:00
Trenton Holmes
e3f8531c2d Un-pickle and re-pickle the test models to resolve the version difference warning 2022-03-22 09:37:17 +01:00
Johann Bauer
5efa551946 Fix model test 2022-03-21 18:53:53 +01:00
Johann Bauer
9ceae3e0db Increase FORMAT_VERSION to force model re-creation 2022-03-21 18:11:18 +01:00
Trenton Holmes
6635fa5f0d Runs the pre-commit hooks over all the Python files 2022-03-11 11:34:28 -08:00
kpj
c56cb25b5f Format Python code with black 2022-02-27 15:26:41 +01:00
jonaswinkler
ddd9ac9a07 write classifier model to temporary file before copying to final location 2021-06-13 12:03:20 +02:00
jonaswinkler
ac9bd6c908 better exception handling 2021-05-19 23:11:24 +02:00
jonaswinkler
0f960755ae catch another exception regarding classifier loading 2021-05-19 22:57:52 +02:00
Jonas Winkler
dc565bd035 correct file mode 2021-05-16 01:22:51 +02:00
jonaswinkler
e4655866f3 fixes #689 2021-03-03 23:35:26 +01:00
jonaswinkler
dac21862fe load sklearn modules only when training data has changed 2021-02-15 11:25:25 +01:00
jonaswinkler
c946263f31 revert a faulty change that caused memory usage to explode #537 2021-02-13 19:51:04 +01:00
jonaswinkler
555e37958f better exception logging 2021-02-11 22:16:41 +01:00
jonaswinkler
85366024ec classifier cache timeout 2021-02-06 21:03:32 +01:00
jonaswinkler
a4c1252a3b classifier caching 2021-02-06 20:54:58 +01:00
jonaswinkler
e5a7dc0cc7 rework most of the logging 2021-02-05 01:10:29 +01:00
jonaswinkler
d08a530701 don't load sklearn libraries unless needed 2021-02-04 15:15:11 +01:00
jonaswinkler
3461e6f354 pycodestyle 2021-01-30 15:22:51 +01:00
jonaswinkler
a37e41ef0c centralized classifier loading, better error handling, no error messages when auto matching is not used 2021-01-30 14:22:23 +01:00
jonaswinkler
0bc68d7d1a more tests and bugfixes. 2020-11-27 15:36:32 +01:00
Jonas Winkler
c4f5f640ee tests for the classifier and fixes for edge cases with minimal data. 2020-11-26 14:18:34 +01:00
Jonas Winkler
a532200d10 code cleanup 2020-11-21 15:34:00 +01:00
Jonas Winkler
eb6805e37e code style fixes 2020-11-12 21:09:45 +01:00
Jonas Winkler
1c50b7693d fixes #31 2020-11-12 10:04:01 +01:00
Jonas Winkler
33f1c82943 updated the classifier. It's now much faster and does not retrain when data hasn't changed. 2020-11-06 14:46:06 +01:00
Jonas Winkler
9a4ff3f807 replaced usages of .id with .pk, fixed filename issue in exporter 2020-11-03 12:37:37 +01:00
Jonas Winkler
6ce493e3a7 the document classifier is now stateless 2020-10-29 14:33:42 +01:00
Jonas Winkler
dd16b7262e unified document matching, legacy and automatching work alongside now 2020-10-28 11:45:11 +01:00
Jonas Winkler
b71657964b Code style changes 2018-09-26 10:51:42 +02:00
Jonas Winkler
efc7bf1d23 Code style adjustments 2018-09-25 16:09:33 +02:00
Jonas Winkler
20233a1706 Code style changed 2018-09-13 14:15:16 +02:00
Jonas Winkler
35ea0f2add Merge branch 'machine-learning' into dev 2018-09-11 14:36:21 +02:00