160 Commits

Author SHA1 Message Date
shamoon
e1d52f4884 Merge pull request #2302 from paperless-ngx/feature-fix-display-rtl-content 2023-01-10 07:30:52 -08:00
Trenton H
b91217064b Fixes some sample test files showing as modified after running tests 2023-01-05 08:39:48 -08:00
Trenton H
cd42d17ffb Small tweak to use the existing tempdir instead of a new one 2023-01-03 13:05:44 -08:00
Trenton Holmes
a185f94c4b Try a new way of extracting text from a given PDF file 2023-01-03 12:43:31 -08:00
Trenton H
fb20c92c51 Adds testing coverage of multipage TIFF with alpha, without and with alpha/sRGB 2023-01-03 09:56:19 -08:00
Trenton H
911d3cb567 Let convert handle the removal of the alpha channel 2023-01-03 09:56:19 -08:00
Trenton Holmes
22620caf6e If extracting text from a fallback file (ie forced), allow the text to be used 2023-01-01 09:57:15 -08:00
Trenton H
79aecebbd2 In the case of an RTL language being extracted via pdfminer.six, fall back to forced OCR, which handles RTL text better 2022-12-29 16:02:02 -08:00
Trenton Holmes
c83d2da67e Fixes language code checks around two part languages 2022-12-04 12:23:12 -08:00
shamoon
7edf178019 Merge pull request #2057 from paperless-ngx/fix/2044-lang-code-diffs (Bugfix: Some tesseract languages aren't detected as installed.) 2022-11-28 11:04:44 -08:00
Trenton H
68c62f3857 Allows parsing of WebP format images 2022-11-28 09:35:54 -08:00
Trenton Holmes
90f3266900 Fixes how a language code like chi-sim is treated in the checks 2022-11-27 08:28:22 -08:00
Trenton H
ffd9cd721d Adds a test to cover this edge case 2022-11-22 07:22:41 -08:00
Trenton H
be8fa418bb Don't use the sidecar file when redoing the OCR, it only contains new text 2022-11-22 07:22:41 -08:00
Trenton Holmes
1be8f39aa0 Reverts the change around skip_noarchive to align with how it is documented to work 2022-10-20 13:34:41 -07:00
Trenton Holmes
43d2545321 Fixes the creation of an archive file, even if noarchive was specified 2022-08-20 13:47:56 -07:00
Trenton Holmes
024fd8bc9b When raising an exception during exception handling, chain them together for slightly cleaner logs 2022-08-03 09:00:56 -07:00
Trenton Holmes
8660103563 Changes the simple-alpha parsing test to use a tempdir so the original isn't modified in Git 2022-07-02 16:19:22 +02:00
Trenton Holmes
95bbf47995 Updates to provide the user provided max pixel size to ocrmypdf 2022-05-22 16:56:08 -07:00
Trenton Holmes
f62193099c Runs pyupgrade to Python 3.8+ and adds a hook for it 2022-05-06 09:04:08 -07:00
Henning Häcker
f4a0d8c040 extract OCR_MAX_IMAGE_PIXELS into settings.py 2022-03-30 09:23:45 +02:00
Henning Häcker
6bc2cb0607 formatting according to black 2022-03-30 09:23:45 +02:00
Henning Häcker
2c1f0cd3ee implement PAPERLESS_OCR_MAX_IMAGE_PIXELS 2022-03-30 09:23:45 +02:00
Trenton Holmes
6635fa5f0d Runs the pre-commit hooks over all the Python files 2022-03-11 11:34:28 -08:00
Trenton Holmes
55486ac151 Reduces number of warnings from testing from 165 to 128. In doing so, fixes a few minor things in the decrypt and export commands 2022-03-10 18:12:48 -08:00
kpj
c56cb25b5f Format Python code with black 2022-02-27 15:26:41 +01:00
Martin Müller
5fa3ec6704 Remove unneded exception handler from has_alpha() 2022-02-21 22:58:19 +01:00
Martin Müller
a662ce03ea Modify test for PNG image with alpha 2022-02-21 22:38:25 +01:00
Martin Müller
b0afdc4841 Fix code style (line too long) 2022-02-21 22:34:34 +01:00
Martin Müller
01310b9742 Remove alpha layer from PNG files for img2pdf (Fixes issue #1254) 2022-02-21 22:06:43 +01:00
jonaswinkler
95abc7d6d7 fix bug with DPI calculation 2021-08-18 18:33:33 +02:00
jonaswinkler
1402f11dc8 fix logging getting spammed with pdfminer warnings on JPG files 2021-06-13 12:09:16 +02:00
jonaswinkler
271f9001dd Workaround for all PDFminer.six issues. 2021-05-15 12:15:32 +02:00
jonaswinkler
c9d76322eb also apply \0 removal to sidecar contents 2021-03-22 23:08:34 +01:00
jonaswinkler
d85a0f950f better exception logging 2021-03-22 23:00:15 +01:00
jonaswinkler
62f829ae82 fixes #794 2021-03-22 22:46:35 +01:00
jonaswinkler
3a67462396 fixes #631 2021-03-14 14:42:48 +01:00
jonaswinkler
81b787635e update dependencies 2021-02-28 13:01:26 +01:00
jonaswinkler
96088716d9 tests 2021-02-22 00:17:16 +01:00
jonaswinkler
8dd2e1098b fix up the ocrmypdf parameter construction for clean-final and redo 2021-02-21 23:39:19 +01:00
jonaswinkler
3f920a84da use archived file for thumbnail, if available 2021-02-21 23:30:14 +01:00
jonaswinkler
dce65dc0fa more parameter checking 2021-02-21 22:19:24 +01:00
jonaswinkler
3cfd97aa08 pycodestyle 2021-02-21 00:21:43 +01:00
jonaswinkler
26c65b29d5 tests 2021-02-21 00:18:34 +01:00
jonaswinkler
e3dd1863a9 completely reworked the OCRmyPDF parser. 2021-02-21 00:16:57 +01:00
jonaswinkler
99cb371483 add some test files 2021-02-21 00:13:08 +01:00
jonaswinkler
94cc9876d9 local import of ocrmypdf so that the webserver does not load that 2021-02-15 12:18:10 +01:00
jonaswinkler
b04d91d68c fix a bug with thumbnail generation when TIKA was enabled 2021-02-09 22:12:43 +01:00
jonaswinkler
e5a7dc0cc7 rework most of the logging 2021-02-05 01:10:29 +01:00
jonaswinkler
95f5c9f3a6 lazy loading for parsers 2021-02-04 13:17:24 +01:00