125 Commits

Author SHA1 Message Date
Trenton H
9043f45350
Adds more documentation for OCR_PAGES and prevents using 0 for actual OCR (#5275) 2024-01-06 09:06:41 -08:00
Trenton H
a82e3771ae
Fix: Allows pre-consume scripts to modify the working path again (#5260)
* Allows pre-consume scripts to modify the working path again and generally cleans up some confusion about working copy vs original
2024-01-05 21:01:57 -08:00
Trenton H
061f33fb05
Feature: Allow setting backend configuration settings via the UI (#5126)
* Saving some start on this

* At least partially working for the tesseract parser

* Problems with migration testing need to figure out

* Work around that error

* Fixes max m_pixels

* Moving the settings to main paperless application

* Starting some consumer options

* More fixes and work

* Fixes these last tests

* Fix max_length on OcrSettings.mode field

* Fix all fields on Common & Ocr settings serializers

* Umbrella config view

* Revert "Umbrella config view"

This reverts commit fbaf9f4be30f89afeb509099180158a3406416a5.

* Updates to use a single configuration object for all settings

* Squashed commit of the following:

commit 8a0a49dd5766094f60462fbfbe62e9921fbd2373
Author: shamoon <4887959+shamoon@users.noreply.github.com>
Date:   Tue Dec 19 23:02:47 2023 -0800

    Fix formatting

commit 66b2d90c507b8afd9507813ff555e46198ea33b9
Author: shamoon <4887959+shamoon@users.noreply.github.com>
Date:   Tue Dec 19 22:36:35 2023 -0800

    Refactor frontend data models

commit 5723bd8dd823ee855625e250df39393e26709d48
Author: Adam Bogdał <adam@bogdal.pl>
Date:   Wed Dec 20 01:17:43 2023 +0100

    Fix: speed up admin panel for installs with a large number of documents (#5052)

commit 9b08ce176199bf9011a6634bb88f616846150d2b
Author: shamoon <4887959+shamoon@users.noreply.github.com>
Date:   Tue Dec 19 15:18:51 2023 -0800

    Update PULL_REQUEST_TEMPLATE.md

commit a6248bec2d793b7690feed95fcaf5eb34a75bfb6
Author: shamoon <4887959+shamoon@users.noreply.github.com>
Date:   Tue Dec 19 15:02:05 2023 -0800

    Chore: Update Angular to v17 (#4980)

commit b1f6f52486d5ba5c04af99b41315eb6428fd1fa8
Author: shamoon <4887959+shamoon@users.noreply.github.com>
Date:   Tue Dec 19 13:53:56 2023 -0800

    Fix: Don't allow null custom_fields property via API (#5063)

commit 638d9970fd468d8c02c91d19bd28f8b0796bdcb1
Author: shamoon <4887959+shamoon@users.noreply.github.com>
Date:   Tue Dec 19 13:43:50 2023 -0800

    Enhancement: symmetric document links (#4907)

commit 5e8de4c1da6eb4eb8f738b20962595c7536b30ec
Author: shamoon <4887959+shamoon@users.noreply.github.com>
Date:   Tue Dec 19 12:45:04 2023 -0800

    Enhancement: shared icon & shared by me filter (#4859)

commit 088bad90306025d3f6b139cbd0ad264a1cbecfe5
Author: Trenton H <797416+stumpylog@users.noreply.github.com>
Date:   Tue Dec 19 12:04:03 2023 -0800

    Bulk updates all the backend libraries (#5061)

* Saving some work on frontend config

* Very basic but dynamically-generated config form

* Saving work on slightly less ugly frontend config

* JSON validation for user_args field

* Fully dynamic config form

* Adds in some additional validators for a nicer error message

* Cleaning up the testing and coverage more

* Reverts unintentional change

* Adds documentation about the settings and the precedence

* Couple more commenting and style fixes

---------

Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2023-12-29 15:42:56 -08:00
Trenton H
92a920021d
Apply user arguments even in the case of the safe fallback to forcing OCR (#4981) 2023-12-14 11:20:47 -08:00
Trenton H
e3f4e0b775
Adds new setting to control color conversions (#4709) 2023-11-29 12:18:44 -08:00
Trenton H
e1b573adeb
Fix: Add a warning about a low image DPI which may cause OCR to fail (#4708) 2023-11-29 11:28:27 -08:00
Trenton H
facb7226fe
Chore: Backend bulk updates (#4509) 2023-11-13 17:09:56 +00:00
shamoon
e14f4c94c2
Fix: ghostscript rendering error doesn't trigger frontend failure message (#4092)
* Raise ParseError from gs rendering error

* catch all parser errors as generic exception

* Differentiate generic vs parse errors during consumption
2023-08-31 19:49:00 -07:00
Trenton H
7e768bfe23 When PDF/A rendering fails, add a warning the user may want to allow it to continue 2023-08-28 18:10:11 -07:00
Trenton H
70f3f98363 Let ruff autofix some things from the newest version 2023-06-13 20:15:18 -07:00
Trenton H
452c79f9a1 Improves the logging mixin and allows it to be typed better 2023-05-23 17:16:39 -07:00
Trenton H
111960c530 Adds better handling for files with invalid utf8 content 2023-05-13 09:29:18 -07:00
Trenton H
6f163111ce Upgrades black to v23, upgrades ruff 2023-04-26 09:35:27 -07:00
Trenton H
3bcbd05252 Fixes ruff not running isort against the codebase 2023-04-26 09:35:27 -07:00
Trenton H
ce41ac9158 Configures ruff as the one stop linter and resolves warnings it raised 2023-04-01 17:03:52 -07:00
Brandon Rothweiler
ca412e0184 Add PAPERLESS_OCR_SKIP_ARCHIVE_FILE config setting 2023-02-23 22:42:57 -05:00
Brandon Rothweiler
8a89f5ae27 Revert "Merge pull request #2732 from bdr99/skip_neverarchive"
This reverts commit 77b23d3acb573232e4e307b63a83f8ff557c0e7e, reversing
changes made to 5d8aa278315dcf92bfa1abe9e1fbd4911f8ed258.
2023-02-23 21:26:53 -05:00
Brandon Rothweiler
93a6391f96 Add a setting to disable creating an archive file 2023-02-22 15:27:17 -05:00
Trenton H
bdcba570cb Adding more test coverage, in particular around Tika and its parser 2023-02-05 11:01:55 -08:00
Trenton H
1e4923835b Small tweak to use the existing tempdir instead of a new one 2023-01-03 13:05:44 -08:00
Trenton Holmes
7be9ae9c02 Try a new way of extracting text from a given PDF file 2023-01-03 12:43:31 -08:00
Trenton H
59e0c1fe4e Let convert handle the removal of the alpha channel 2023-01-03 09:56:19 -08:00
Trenton Holmes
26c7fad005 If extracting text from a fallback file (i.e. forced), allow the text to be used 2023-01-01 09:57:15 -08:00
Trenton H
a2b7687c3b In the case of an RTL language being extracted via pdfminer.six, fall back to forced OCR, which handles RTL text better 2022-12-29 16:02:02 -08:00
Trenton H
e96d65f945 Allows parsing of WebP format images 2022-11-28 09:35:54 -08:00
Trenton H
b897d6de2e Don't use the sidecar file when redoing the OCR, it only contains new text 2022-11-22 07:22:41 -08:00
Trenton Holmes
d1aa08850d Reverts the change around skip_noarchive to align with how it is documented to work 2022-10-20 13:34:41 -07:00
Trenton Holmes
b3b2519bf0 Fixes the creation of an archive file, even if noarchive was specified 2022-08-20 13:47:56 -07:00
Trenton Holmes
b70e21a6d5 When raising an exception during exception handling, chain them together for slightly cleaner logs 2022-08-03 09:00:56 -07:00
Trenton Holmes
fc26fe0ac0
Updates to provide the user provided max pixel size to ocrmypdf 2022-05-22 16:56:08 -07:00
Trenton Holmes
3003bdd507 Runs pyupgrade to Python 3.8+ and adds a hook for it 2022-05-06 09:04:08 -07:00
Henning Häcker
3b4da70c85 extract OCR_MAX_IMAGE_PIXELS into settings.py 2022-03-30 09:23:45 +02:00
Henning Häcker
95199bd325 formatting according to black 2022-03-30 09:23:45 +02:00
Henning Häcker
a8887b211e implement PAPERLESS_OCR_MAX_IMAGE_PIXELS 2022-03-30 09:23:45 +02:00
Trenton Holmes
1771d18a21 Runs the pre-commit hooks over all the Python files 2022-03-11 11:34:28 -08:00
Trenton Holmes
85b210ebf6 Reduces number of warnings from testing from 165 to 128. In doing so, fixes a few minor things in the decrypt and export commands 2022-03-10 18:12:48 -08:00
kpj
fc695896dd Format Python code with black 2022-02-27 15:26:41 +01:00
Martin Müller
1e288100a9 Remove unneeded exception handler from has_alpha() 2022-02-21 22:58:19 +01:00
Martin Müller
2a47b3f1a1 Fix code style (line too long) 2022-02-21 22:34:34 +01:00
Martin Müller
41494ee689 Remove alpha layer from PNG files for img2pdf
Fixes issue #1254
2022-02-21 22:06:43 +01:00
jonaswinkler
23c6f849d6 fix bug with DPI calculation 2021-08-18 18:33:33 +02:00
jonaswinkler
1f707e86cc fix logging getting spammed with pdfminer warnings on JPG files 2021-06-13 12:09:16 +02:00
jonaswinkler
814d90745b Workaround for all PDFminer.six issues. 2021-05-15 12:15:32 +02:00
jonaswinkler
0e596bd1fc also apply \0 removal to sidecar contents 2021-03-22 23:08:34 +01:00
jonaswinkler
fda2bfbea7 better exception logging 2021-03-22 23:00:15 +01:00
jonaswinkler
d26c46e034 fixes #794 2021-03-22 22:46:35 +01:00
jonaswinkler
40ce38254b fixes #631 2021-03-14 14:42:48 +01:00
jonaswinkler
265432f2a5 fix up the ocrmypdf parameter construction for clean-final and redo 2021-02-21 23:39:19 +01:00
jonaswinkler
a13e9f23b1 use archived file for thumbnail, if available 2021-02-21 23:30:14 +01:00
jonaswinkler
14e2ad7bc4 more parameter checking 2021-02-21 22:19:24 +01:00