mirror of
https://github.com/paperless-ngx/paperless-ngx.git
synced 2025-04-02 13:45:10 -05:00
Feature: collate two single-sided multipage scans (#3784)
* Feature: collate two single-sided scans

  Some ADFs only support single-sided scans, making scanning double-sided documents a bit annoying. This new feature enables Paperless to do most of the work by merging two separate scans into a single one, collating the even and odd numbered pages.
* Documentation: clarify that collation is disabled by default
* Apply suggestions from code review

  Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
* Address code review remarks
* Grammar fixes

---------

Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
parent
9f5d47c320
commit
8c7554e081
@ -528,7 +528,7 @@ For how to enable barcode usage, see [the configuration](/configuration#barcodes
|
||||
The two settings may be enabled independently, but do have interactions as explained
|
||||
below.
|
||||
|
||||
### Document Splitting
|
||||
### Document Splitting {#document-splitting}
|
||||
|
||||
When enabled, Paperless will look for a barcode with the configured value and create a new document
|
||||
starting from the next page. The page with the barcode on it will _not_ be retained. It
|
||||
@ -543,3 +543,69 @@ If document splitting via barcode is also enabled, documents will be split when
|
||||
barcode is located. However, differing from the splitting, the page with the
|
||||
barcode _will_ be retained. This allows application of a barcode to any page, including
|
||||
one which holds data to keep in the document.

## Automatic collation of double-sided documents {#collate}

!!! note

    If your scanner supports double-sided scanning natively, you do not need this feature.

This feature is turned off by default. See the [configuration](/configuration#collate) for how to turn it on.

### Summary

If you have a scanner with an automatic document feeder (ADF) that only scans a single side,
this feature makes scanning double-sided documents much more convenient by automatically
collating two separate scans into one document, reordering the pages as necessary.

### Usage example

Suppose you have a double-sided document with 6 pages (3 sheets of paper). First,
put the stack into your ADF as normal, ensuring that page 1 is scanned first. Your ADF
will now scan pages 1, 3, and 5. Then you (or your scanner, if it supports it) upload
the scan into the correct sub-directory of the consume folder (`double-sided` by default;
keep in mind that Paperless will _not_ automatically create the directory for you).
Paperless will then process the scan and move it into an internal staging area.

The next step is to turn your stack upside down (without reordering the sheets of paper)
and scan it once again. Your ADF will now scan pages 6, 4, and 2, in that order. Once this
scan is copied into the sub-directory, Paperless will collate the previous scan with the
new one, reversing the order of the pages on the second, "even numbered" scan. The
resulting document will have pages 1-6 in the correct order, and this new file will
then be processed as normal.
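
The page reordering described above can be sketched with plain Python lists. This is an illustration only: pages are modelled as integers, and the function name `collate_pages` is hypothetical rather than part of Paperless. It mirrors the "insert each back side after its front side" idea, and adds an explicit length check because Python's `list.insert` does not raise when the index is past the end.

```python
from typing import List


def collate_pages(odd_scan: List[int], even_scan: List[int]) -> List[int]:
    """Interleave two single-sided scans into reading order.

    odd_scan holds the front sides in scan order (1, 3, 5, ...).
    even_scan holds the back sides in scan order, which is reversed
    (6, 4, 2) because the stack was turned upside down.
    """
    if len(even_scan) > len(odd_scan):
        raise ValueError("even scan has more pages than odd scan")
    merged = list(odd_scan)
    # Undo the reversal, then slot each back side right after its front side.
    for i, page in enumerate(reversed(even_scan)):
        merged.insert(2 * i + 1, page)
    return merged


print(collate_pages([1, 3, 5], [6, 4, 2]))  # [1, 2, 3, 4, 5, 6]
print(collate_pages([1, 3, 5], [4, 2]))     # [1, 2, 3, 4, 5]
```

The second call shows a shorter even scan, as happens when a trailing empty back side is left out: the remaining front side simply stays at the end.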

!!! tip

    When scanning the even numbered pages, you can omit the last empty pages, if there are
    any. For example, if page 6 is empty, you only need to scan pages 2 and 4. _Do not_ omit
    empty pages in the middle of the document.

### Things that could go wrong

Paperless will notice when the first, "odd numbered" scan has fewer pages than the second
scan (this can happen when, e.g., the ADF skipped a few pages in the first pass). In that
case, Paperless will remove the staging copy as well as the scan, and give you an error
message asking you to restart the process from scratch by scanning the odd pages again,
followed by the even pages.

Another thing that might happen is that you start a double-sided scan, but then forget
to upload the second file. To avoid collating the wrong documents if you then come back
a day later to scan a new double-sided document, Paperless will only keep an "odd numbered
pages" file for up to 30 minutes. If more time passes, it will consider the next incoming
scan a completely new "odd numbered pages" one. The old staging file will get discarded.
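
The 30-minute window can be pictured as a comparison between the staging file's modification time and the current time. The sketch below is a simplified stand-alone version; the helper name `staging_is_valid` and the file name `staging.pdf` are made up for the example.

```python
import os
import tempfile
import time
from pathlib import Path

TIMEOUT_MINUTES = 30  # the window described above


def staging_is_valid(staging: Path, now: float) -> bool:
    """Return True if a staging file exists and is young enough to collate with."""
    if not staging.exists():
        return False
    age_seconds = now - staging.stat().st_mtime
    return age_seconds <= TIMEOUT_MINUTES * 60


with tempfile.TemporaryDirectory() as tmp:
    staging = Path(tmp) / "staging.pdf"  # hypothetical file name
    staging.write_bytes(b"")
    now = time.time()
    # Pretend the odd pages arrived 31 minutes ago: too old.
    os.utime(staging, (now - 31 * 60,) * 2)
    print(staging_is_valid(staging, now))  # False
    # Pretend they arrived 5 minutes ago: still usable.
    os.utime(staging, (now - 5 * 60,) * 2)
    print(staging_is_valid(staging, now))  # True
```

Passing `now` in as a parameter, instead of reading the clock inside the helper, keeps the check easy to test.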

### Interaction with "subdirs as tags"

The collation feature can be used together with the "subdirs as tags" feature (but this is not
a requirement). Just create a correctly named double-sided subdir in the hierarchy and upload
your scans there. For example, both `double-sided/foo/bar` and `foo/bar/double-sided` will
cause the collated document to be treated as if it were uploaded into `foo/bar` and receive both
`foo` and `bar` tags, but not `double-sided`.
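
A small sketch of why both layouts end up in `foo/bar`: dropping every path component equal to the collation subdirectory name leaves the same remainder in either case. The helper `strip_subdir` and the file name `scan.pdf` are hypothetical and only serve the illustration.

```python
from pathlib import PurePosixPath


def strip_subdir(path: PurePosixPath, subdir_name: str = "double-sided") -> PurePosixPath:
    """Remove every path component that equals the collation subdirectory name."""
    return PurePosixPath(*(part for part in path.parts if part != subdir_name))


for upload in ("double-sided/foo/bar/scan.pdf", "foo/bar/double-sided/scan.pdf"):
    target = strip_subdir(PurePosixPath(upload))
    # The remaining parent directories are what would become tags.
    print(target.parent, list(target.parent.parts))
# foo/bar ['foo', 'bar']
# foo/bar ['foo', 'bar']
```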

### Interaction with document splitting

You can use the [document splitting](#document-splitting) feature, but if you use a normal
single-sided split marker page, the split document(s) will have an empty page at the front (or
whatever else was on the back of the split marker page). You can work around that by having
a split marker page that has the split barcode on _both_ sides. This way, the extra page will
get automatically removed.
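
To see why a barcode on both sides helps, here is a toy model of splitting applied to the collated page order. It is not Paperless' actual splitter: `split_on_barcode` is a hypothetical function, pages are plain strings, and it assumes that a chunk left empty between two adjacent separator pages is simply discarded.

```python
from typing import List

SEP = "SEP"  # stands in for a page carrying the split barcode


def split_on_barcode(pages: List[str]) -> List[List[str]]:
    """Start a new document after each separator page and drop the separator itself."""
    documents: List[List[str]] = [[]]
    for page in pages:
        if page == SEP:
            documents.append([])
        else:
            documents[-1].append(page)
    # Assumption: chunks left empty by adjacent separators are discarded.
    return [doc for doc in documents if doc]


# Collated order when the marker sheet is blank on its back:
print(split_on_barcode(["a1", "a2", SEP, "blank", "b1", "b2"]))
# [['a1', 'a2'], ['blank', 'b1', 'b2']]

# Collated order when the barcode is printed on both sides:
print(split_on_barcode(["a1", "a2", SEP, SEP, "b1", "b2"]))
# [['a1', 'a2'], ['b1', 'b2']]
```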
@ -1116,6 +1116,43 @@ combination with PAPERLESS_CONSUMER_BARCODE_UPSCALE bigger than 1.0.
|
||||
|
||||
Defaults to "300"

## Collate Double-Sided Documents {#collate}

`PAPERLESS_CONSUMER_ENABLE_COLLATE_DOUBLE_SIDED=<bool>`

: Enables automatic collation of two single-sided scans into a double-sided
document.

    This is useful if you have an automatic document feeder that only supports
    single-sided scans, but you need to scan a double-sided document. If your
    ADF supports double-sided scans natively, you do not need this feature.

    `PAPERLESS_CONSUMER_RECURSIVE` must be enabled for this to work.

    For more information, read the [corresponding section in the advanced
    documentation](/advanced_usage#collate).

    Defaults to false.

`PAPERLESS_CONSUMER_COLLATE_DOUBLE_SIDED_SUBDIR_NAME=<str>`

: The name of the subdirectory in which the collate feature expects documents to
arrive.

    This only has an effect if `PAPERLESS_CONSUMER_ENABLE_COLLATE_DOUBLE_SIDED`
    has been enabled. Note that Paperless will not automatically create the
    directory.

    Defaults to "double-sided".

`PAPERLESS_CONSUMER_COLLATE_DOUBLE_SIDED_TIFF_SUPPORT=<bool>`
: Whether TIFF image files should be supported when collating documents.
This will automatically convert any TIFF image(s) to PDFs for later
processing. This only has an effect if
`PAPERLESS_CONSUMER_ENABLE_COLLATE_DOUBLE_SIDED` has been enabled.

    Defaults to false.
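
As a rough picture of how the three settings and their defaults fit together, the sketch below reads them from an environment mapping. Only the variable names and default values come from the text above; the helpers `get_boolean` and `collate_settings`, and the set of spellings accepted as "true", are assumptions made for this example and may differ from what Paperless does internally.

```python
import os
from typing import Mapping


def get_boolean(env: Mapping[str, str], key: str, default: str = "NO") -> bool:
    """Hypothetical helper: treat a handful of common spellings as true."""
    return env.get(key, default).lower() in ("yes", "y", "1", "t", "true")


def collate_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the collation-related settings, applying the documented defaults."""
    return {
        "enabled": get_boolean(env, "PAPERLESS_CONSUMER_ENABLE_COLLATE_DOUBLE_SIDED"),
        "subdir_name": env.get(
            "PAPERLESS_CONSUMER_COLLATE_DOUBLE_SIDED_SUBDIR_NAME",
            "double-sided",
        ),
        "tiff_support": get_boolean(
            env,
            "PAPERLESS_CONSUMER_COLLATE_DOUBLE_SIDED_TIFF_SUPPORT",
        ),
    }


# With nothing set, every value falls back to its default.
print(collate_settings({}))
# {'enabled': False, 'subdir_name': 'double-sided', 'tiff_support': False}
```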
|
||||
|
||||
## Binaries
|
||||
|
||||
There are a few external software packages that Paperless expects to
|
||||
|
@ -68,6 +68,9 @@
|
||||
#PAPERLESS_CONSUMER_BARCODE_STRING=PATCHT
|
||||
#PAPERLESS_CONSUMER_BARCODE_UPSCALE=0.0
|
||||
#PAPERLESS_CONSUMER_BARCODE_DPI=300
|
||||
#PAPERLESS_CONSUMER_ENABLE_COLLATE_DOUBLE_SIDED=false
|
||||
#PAPERLESS_CONSUMER_COLLATE_DOUBLE_SIDED_SUBDIR_NAME=double-sided
|
||||
#PAPERLESS_CONSUMER_COLLATE_DOUBLE_SIDED_TIFF_SUPPORT=false
|
||||
#PAPERLESS_PRE_CONSUME_SCRIPT=/path/to/an/arbitrary/script.sh
|
||||
#PAPERLESS_POST_CONSUME_SCRIPT=/path/to/an/arbitrary/script.sh
|
||||
#PAPERLESS_FILENAME_DATE_ORDER=YMD
|
||||
|
@ -2,13 +2,11 @@ import logging
|
||||
import tempfile
|
||||
from dataclasses import dataclass
|
||||
from pathlib import Path
|
||||
from subprocess import run
|
||||
from typing import Dict
|
||||
from typing import Final
|
||||
from typing import List
|
||||
from typing import Optional
|
||||
|
||||
import img2pdf
|
||||
from django.conf import settings
|
||||
from pdf2image import convert_from_path
|
||||
from pdf2image.exceptions import PDFPageCountError
|
||||
@ -16,6 +14,7 @@ from pikepdf import Page
|
||||
from pikepdf import Pdf
|
||||
from PIL import Image
|
||||
|
||||
from documents.converters import convert_from_tiff_to_pdf
|
||||
from documents.data_models import DocumentSource
|
||||
from documents.utils import copy_basic_file_stats
|
||||
from documents.utils import copy_file_with_basic_stats
|
||||
@ -55,7 +54,7 @@ class BarcodeReader:
|
||||
self.mime: Final[str] = mime_type
|
||||
self.pdf_file: Path = self.file
|
||||
self.barcodes: List[Barcode] = []
|
||||
self.temp_dir: Optional[Path] = None
|
||||
self.temp_dir: Optional[tempfile.TemporaryDirectory] = None
|
||||
|
||||
if settings.CONSUMER_BARCODE_TIFF_SUPPORT:
|
||||
self.SUPPORTED_FILE_MIMES = {"application/pdf", "image/tiff"}
|
||||
@ -155,34 +154,7 @@ class BarcodeReader:
|
||||
if self.mime != "image/tiff":
|
||||
return
|
||||
|
||||
with Image.open(self.file) as im:
|
||||
has_alpha_layer = im.mode in ("RGBA", "LA")
|
||||
if has_alpha_layer:
|
||||
# Note the save into the temp folder, so as not to trigger a new
|
||||
# consume
|
||||
scratch_image = Path(self.temp_dir.name) / Path(self.file.name)
|
||||
run(
|
||||
[
|
||||
settings.CONVERT_BINARY,
|
||||
"-alpha",
|
||||
"off",
|
||||
self.file,
|
||||
scratch_image,
|
||||
],
|
||||
)
|
||||
else:
|
||||
# Not modifying the original, safe to use in place
|
||||
scratch_image = self.file
|
||||
|
||||
self.pdf_file = Path(self.temp_dir.name) / Path(self.file.name).with_suffix(
|
||||
".pdf",
|
||||
)
|
||||
|
||||
with scratch_image.open("rb") as img_file, self.pdf_file.open("wb") as pdf_file:
|
||||
pdf_file.write(img2pdf.convert(img_file))
|
||||
|
||||
# Copy what file stat is possible
|
||||
copy_basic_file_stats(self.file, self.pdf_file)
|
||||
self.pdf_file = convert_from_tiff_to_pdf(self.file, Path(self.temp_dir.name))
|
||||
|
||||
def detect(self) -> None:
|
||||
"""
|
||||
|
46
src/documents/converters.py
Normal file
@ -0,0 +1,46 @@
from pathlib import Path
from subprocess import run

import img2pdf
from django.conf import settings
from PIL import Image

from documents.utils import copy_basic_file_stats


def convert_from_tiff_to_pdf(tiff_path: Path, target_directory: Path) -> Path:
    """
    Converts a TIFF file into a PDF file.

    The PDF will be created in the given target_directory and share the name of
    the original TIFF file, as well as its stats (mtime etc.).

    Returns the path of the PDF created.
    """
    with Image.open(tiff_path) as im:
        has_alpha_layer = im.mode in ("RGBA", "LA")
    if has_alpha_layer:
        # Note the save into the temp folder, so as not to trigger a new
        # consume
        scratch_image = target_directory / tiff_path.name
        run(
            [
                settings.CONVERT_BINARY,
                "-alpha",
                "off",
                tiff_path,
                scratch_image,
            ],
        )
    else:
        # Not modifying the original, safe to use in place
        scratch_image = tiff_path

    pdf_path = (target_directory / tiff_path.name).with_suffix(".pdf")

    with scratch_image.open("rb") as img_file, pdf_path.open("wb") as pdf_file:
        pdf_file.write(img2pdf.convert(img_file))

    # Copy what file stat is possible
    copy_basic_file_stats(tiff_path, pdf_path)
    return pdf_path
|
131
src/documents/double_sided.py
Normal file
@ -0,0 +1,131 @@
import datetime as dt
import logging
import os
import shutil
from pathlib import Path

from django.conf import settings
from pikepdf import Pdf

from documents.consumer import ConsumerError
from documents.converters import convert_from_tiff_to_pdf
from documents.data_models import ConsumableDocument

logger = logging.getLogger("paperless.double_sided")

# Hardcoded for now, could be made a configurable setting if needed
TIMEOUT_MINUTES = 30

# Used by test cases
STAGING_FILE_NAME = "double-sided-staging.pdf"


def collate(input_doc: ConsumableDocument) -> str:
    """
    Tries to collate pages from 2 single sided scans of a double sided
    document.

    When called with a file, it checks whether or not a staging file
    exists. If not, the current file is turned into that staging file
    containing the odd numbered pages.

    If a staging file exists, and it is not too old, the current file is
    considered to be the second part (the even numbered pages) and it will
    collate the pages of both. The pages of the second file will be added
    in reverse order, since the ADF will have scanned the pages from bottom
    to top.

    Returns a status message on success, or raises a ConsumerError
    in case of failure.
    """

    # Make sure scratch dir exists, Consumer might not have run yet
    settings.SCRATCH_DIR.mkdir(exist_ok=True)

    if input_doc.mime_type == "application/pdf":
        pdf_file = input_doc.original_file
    elif (
        input_doc.mime_type == "image/tiff"
        and settings.CONSUMER_COLLATE_DOUBLE_SIDED_TIFF_SUPPORT
    ):
        pdf_file = convert_from_tiff_to_pdf(
            input_doc.original_file,
            settings.SCRATCH_DIR,
        )
        input_doc.original_file.unlink()
    else:
        raise ConsumerError("Unsupported file type for collation of double-sided scans")

    staging = settings.SCRATCH_DIR / STAGING_FILE_NAME

    valid_staging_exists = False
    if staging.exists():
        stats = os.stat(str(staging))
        # if the file is older than the timeout, we don't consider
        # it valid
        if dt.datetime.now().timestamp() - stats.st_mtime > TIMEOUT_MINUTES * 60:
            logger.warning("Outdated double sided staging file exists, deleting it")
            os.unlink(str(staging))
        else:
            valid_staging_exists = True

    if valid_staging_exists:
        try:
            # Collate pages from second PDF in reverse order
            with Pdf.open(staging) as pdf1, Pdf.open(pdf_file) as pdf2:
                pdf2.pages.reverse()
                try:
                    for i, page in enumerate(pdf2.pages):
                        pdf1.pages.insert(2 * i + 1, page)
                except IndexError:
                    raise ConsumerError(
                        "This second file (even numbered pages) contains more "
                        "pages than the first/odd numbered one. This means the "
                        "two uploaded files don't belong to the same double-"
                        "sided scan. Please retry, starting with the odd "
                        "numbered pages again.",
                    )
                # Merged file has the same path, but without the
                # double-sided subdir. Therefore, it is also in the
                # consumption dir and will be picked up for processing
                old_file = input_doc.original_file
                new_file = Path(
                    *(
                        part
                        for part in old_file.with_name(
                            f"{old_file.stem}-collated.pdf",
                        ).parts
                        if part != settings.CONSUMER_COLLATE_DOUBLE_SIDED_SUBDIR_NAME
                    ),
                )
                # If the user didn't create the subdirs yet, do it for them
                new_file.parent.mkdir(parents=True, exist_ok=True)
                pdf1.save(new_file)
            logger.info("Collated documents into new file %s", new_file)
            return (
                "Success. Even numbered pages of double sided scan collated "
                "with odd pages"
            )
        finally:
            # Delete staging and recently uploaded file no matter what.
            # If any error occurs, the user needs to be able to restart
            # the process from scratch; after all, the staging file
            # with the odd numbered pages might be the culprit
            pdf_file.unlink()
            staging.unlink()

    else:
        # In Python 3.9 move supports Path objects directly,
        # but for now we have to be compatible with 3.8
        shutil.move(str(pdf_file), str(staging))
        # update access and modification time so we know if the file
        # is outdated when another file gets uploaded
        os.utime(str(staging), (dt.datetime.now().timestamp(),) * 2)
        logger.info(
            "Got scan with odd numbered pages of double-sided scan, moved it to %s",
            staging,
        )
        return (
            "Received odd numbered pages of double sided scan, waiting up to "
            f"{TIMEOUT_MINUTES} minutes for even numbered pages"
        )
|
@ -25,6 +25,7 @@ from documents.consumer import Consumer
|
||||
from documents.consumer import ConsumerError
|
||||
from documents.data_models import ConsumableDocument
|
||||
from documents.data_models import DocumentMetadataOverrides
|
||||
from documents.double_sided import collate
|
||||
from documents.file_handling import create_source_path_directory
|
||||
from documents.file_handling import generate_unique_filename
|
||||
from documents.models import Correspondent
|
||||
@ -89,10 +90,40 @@ def consume_file(
|
||||
input_doc: ConsumableDocument,
|
||||
overrides: Optional[DocumentMetadataOverrides] = None,
|
||||
):
|
||||
def send_progress(status="SUCCESS", message="finished"):
|
||||
payload = {
|
||||
"filename": overrides.filename or input_doc.original_file.name,
|
||||
"task_id": None,
|
||||
"current_progress": 100,
|
||||
"max_progress": 100,
|
||||
"status": status,
|
||||
"message": message,
|
||||
}
|
||||
try:
|
||||
async_to_sync(get_channel_layer().group_send)(
|
||||
"status_updates",
|
||||
{"type": "status_update", "data": payload},
|
||||
)
|
||||
except ConnectionError as e:
|
||||
logger.warning(f"ConnectionError on status send: {e!s}")
|
||||
|
||||
# Default no overrides
|
||||
if overrides is None:
|
||||
overrides = DocumentMetadataOverrides()
|
||||
|
||||
# Handle collation of double-sided documents scanned in two parts
|
||||
if settings.CONSUMER_ENABLE_COLLATE_DOUBLE_SIDED and (
|
||||
settings.CONSUMER_COLLATE_DOUBLE_SIDED_SUBDIR_NAME
|
||||
in input_doc.original_file.parts
|
||||
):
|
||||
try:
|
||||
msg = collate(input_doc)
|
||||
send_progress(message=msg)
|
||||
return msg
|
||||
except ConsumerError as e:
|
||||
send_progress(status="FAILURE", message=e.args[0])
|
||||
raise e
|
||||
|
||||
# read all barcodes in the current document
|
||||
if settings.CONSUMER_ENABLE_BARCODES or settings.CONSUMER_ENABLE_ASN_BARCODE:
|
||||
with BarcodeReader(input_doc.original_file, input_doc.mime_type) as reader:
|
||||
@ -102,24 +133,9 @@ def consume_file(
|
||||
):
|
||||
# notify the sender, otherwise the progress bar
|
||||
# in the UI stays stuck
|
||||
payload = {
|
||||
"filename": overrides.filename or input_doc.original_file.name,
|
||||
"task_id": None,
|
||||
"current_progress": 100,
|
||||
"max_progress": 100,
|
||||
"status": "SUCCESS",
|
||||
"message": "finished",
|
||||
}
|
||||
try:
|
||||
async_to_sync(get_channel_layer().group_send)(
|
||||
"status_updates",
|
||||
{"type": "status_update", "data": payload},
|
||||
)
|
||||
except ConnectionError as e:
|
||||
logger.warning(f"ConnectionError on status send: {e!s}")
|
||||
send_progress()
|
||||
# consuming stops here, since the original document with
|
||||
# the barcodes has been split and will be consumed separately
|
||||
|
||||
input_doc.original_file.unlink()
|
||||
return "File successfully split"
|
||||
|
||||
|
BIN
src/documents/tests/samples/double-sided-even.pdf
Normal file
Binary file not shown.
BIN
src/documents/tests/samples/double-sided-odd.pdf
Normal file
Binary file not shown.
253
src/documents/tests/test_double_sided.py
Normal file
@ -0,0 +1,253 @@
|
||||
import datetime as dt
|
||||
import os
|
||||
import shutil
|
||||
from pathlib import Path
|
||||
from typing import Union
|
||||
from unittest import mock
|
||||
|
||||
from django.test import TestCase
|
||||
from django.test import override_settings
|
||||
from pdfminer.high_level import extract_text
|
||||
from pikepdf import Pdf
|
||||
|
||||
from documents import tasks
|
||||
from documents.consumer import ConsumerError
|
||||
from documents.data_models import ConsumableDocument
|
||||
from documents.data_models import DocumentSource
|
||||
from documents.double_sided import STAGING_FILE_NAME
|
||||
from documents.double_sided import TIMEOUT_MINUTES
|
||||
from documents.tests.utils import DirectoriesMixin
|
||||
from documents.tests.utils import FileSystemAssertsMixin
|
||||
|
||||
|
||||
@override_settings(
|
||||
CONSUMER_RECURSIVE=True,
|
||||
CONSUMER_ENABLE_COLLATE_DOUBLE_SIDED=True,
|
||||
)
|
||||
class TestDoubleSided(DirectoriesMixin, FileSystemAssertsMixin, TestCase):
|
||||
SAMPLE_DIR = Path(__file__).parent / "samples"
|
||||
|
||||
def setUp(self):
|
||||
super().setUp()
|
||||
self.dirs.double_sided_dir = self.dirs.consumption_dir / "double-sided"
|
||||
self.dirs.double_sided_dir.mkdir()
|
||||
self.staging_file = self.dirs.scratch_dir / STAGING_FILE_NAME
|
||||
|
||||
def consume_file(self, srcname, dstname: Union[str, Path] = "foo.pdf"):
|
||||
"""
|
||||
Starts the consume process and also ensures the
|
||||
destination file does not exist afterwards
|
||||
"""
|
||||
src = self.SAMPLE_DIR / srcname
|
||||
dst = self.dirs.double_sided_dir / dstname
|
||||
dst.parent.mkdir(parents=True, exist_ok=True)
|
||||
shutil.copy(src, dst)
|
||||
with mock.patch("documents.tasks.async_to_sync"), mock.patch(
|
||||
"documents.consumer.async_to_sync",
|
||||
):
|
||||
msg = tasks.consume_file(
|
||||
ConsumableDocument(
|
||||
source=DocumentSource.ConsumeFolder,
|
||||
original_file=dst,
|
||||
),
|
||||
None,
|
||||
)
|
||||
self.assertIsNotFile(dst)
|
||||
return msg
|
||||
|
||||
def create_staging_file(self, src="double-sided-odd.pdf", datetime=None):
|
||||
shutil.copy(self.SAMPLE_DIR / src, self.staging_file)
|
||||
if datetime is None:
|
||||
datetime = dt.datetime.now()
|
||||
os.utime(str(self.staging_file), (datetime.timestamp(),) * 2)
|
||||
|
||||
def test_odd_numbered_moved_to_staging(self):
|
||||
"""
|
||||
GIVEN:
|
||||
- No staging file exists
|
||||
WHEN:
|
||||
- A file is copied into the double-sided consume directory
|
||||
THEN:
|
||||
- The file becomes the new staging file
|
||||
- The file in the consume directory gets removed
|
||||
- The staging file has the st_mtime set to now
|
||||
- The user gets informed
|
||||
"""
|
||||
|
||||
msg = self.consume_file("double-sided-odd.pdf")
|
||||
|
||||
self.assertIsFile(self.staging_file)
|
||||
self.assertAlmostEqual(
|
||||
dt.datetime.fromtimestamp(self.staging_file.stat().st_mtime),
|
||||
dt.datetime.now(),
|
||||
delta=dt.timedelta(seconds=5),
|
||||
)
|
||||
self.assertIn("Received odd numbered pages", msg)
|
||||
|
||||
def test_collation(self):
|
||||
"""
|
||||
GIVEN:
|
||||
- A staging file not older than TIMEOUT_MINUTES with odd pages exists
|
||||
WHEN:
|
||||
- A file is copied into the double-sided consume directory
|
||||
THEN:
|
||||
- A new file containing the collated staging and uploaded file is
|
||||
created and put into the consume directory
|
||||
- The new file is named "foo-collated.pdf", where foo is the name of
|
||||
the second file
|
||||
- Both staging and uploaded file get deleted
|
||||
- The new file contains the pages in the correct order
|
||||
"""
|
||||
|
||||
self.create_staging_file()
|
||||
self.consume_file("double-sided-even.pdf", "some-random-name.pdf")
|
||||
|
||||
target = self.dirs.consumption_dir / "some-random-name-collated.pdf"
|
||||
self.assertIsFile(target)
|
||||
self.assertIsNotFile(self.staging_file)
|
||||
self.assertRegex(
|
||||
extract_text(str(target)),
|
||||
r"(?s)"
|
||||
r"This is page 1.*This is page 2.*This is page 3.*"
|
||||
r"This is page 4.*This is page 5",
|
||||
)
|
||||
|
||||
def test_staging_file_expiration(self):
|
||||
"""
|
||||
GIVEN:
|
||||
- A staging file older than TIMEOUT_MINUTES exists
|
||||
WHEN:
|
||||
- A file is copied into the double-sided consume directory
|
||||
THEN:
|
||||
- It becomes the new staging file
|
||||
"""
|
||||
|
||||
self.create_staging_file(
|
||||
datetime=dt.datetime.now()
|
||||
- dt.timedelta(minutes=TIMEOUT_MINUTES, seconds=1),
|
||||
)
|
||||
msg = self.consume_file("double-sided-odd.pdf")
|
||||
self.assertIsFile(self.staging_file)
|
||||
self.assertIn("Received odd numbered pages", msg)
|
||||
|
||||
def test_less_odd_pages_then_even_fails(self):
|
||||
"""
|
||||
GIVEN:
|
||||
- A valid staging file
|
||||
WHEN:
|
||||
- A file is copied into the double-sided consume directory
|
||||
that has more pages than the staging file
|
||||
THEN:
|
||||
- Both files get removed
|
||||
- A ConsumerError exception is thrown
|
||||
"""
|
||||
self.create_staging_file("simple.pdf")
|
||||
self.assertRaises(
|
||||
ConsumerError,
|
||||
self.consume_file,
|
||||
"double-sided-even.pdf",
|
||||
)
|
||||
self.assertIsNotFile(self.staging_file)
|
||||
|
||||
@override_settings(CONSUMER_COLLATE_DOUBLE_SIDED_TIFF_SUPPORT=True)
|
||||
def test_tiff_upload_enabled(self):
|
||||
"""
|
||||
GIVEN:
|
||||
- CONSUMER_COLLATE_DOUBLE_SIDED_TIFF_SUPPORT is true
|
||||
- No staging file exists
|
||||
WHEN:
|
||||
- A TIFF file gets uploaded into the double-sided
|
||||
consume dir
|
||||
THEN:
|
||||
- The file is converted into a PDF and moved to
|
||||
the staging file
|
||||
"""
|
||||
self.consume_file("simple.tiff", "simple.tiff")
|
||||
self.assertIsFile(self.staging_file)
|
||||
# Ensure the file is a valid PDF by trying to read it
|
||||
Pdf.open(self.staging_file)
|
||||
|
||||
@override_settings(CONSUMER_COLLATE_DOUBLE_SIDED_TIFF_SUPPORT=False)
|
||||
def test_tiff_upload_disabled(self):
|
||||
"""
|
||||
GIVEN:
|
||||
- CONSUMER_COLLATE_DOUBLE_SIDED_TIFF_SUPPORT is false
|
||||
- No staging file exists
|
||||
WHEN:
|
||||
- A TIFF file gets uploaded into the double-sided
|
||||
consume dir
|
||||
THEN:
|
||||
- A ConsumerError is raised
|
||||
"""
|
||||
self.assertRaises(
|
||||
ConsumerError,
|
||||
self.consume_file,
|
||||
"simple.tiff",
|
||||
"simple.tiff",
|
||||
)
|
||||
|
||||
@override_settings(CONSUMER_COLLATE_DOUBLE_SIDED_SUBDIR_NAME="quux")
|
||||
def test_different_upload_dir_name(self):
|
||||
"""
|
||||
GIVEN:
|
||||
- No staging file exists
|
||||
- CONSUMER_COLLATE_DOUBLE_SIDED_SUBDIR_NAME is set to quux
|
||||
WHEN:
|
||||
- A file is uploaded into the quux dir
|
||||
THEN:
|
||||
- A staging file is created
|
||||
"""
|
||||
self.consume_file("double-sided-odd.pdf", Path("..") / "quux" / "foo.pdf")
|
||||
self.assertIsFile(self.staging_file)
|
||||
|
||||
def test_only_double_sided_dir_is_handled(self):
|
||||
"""
|
||||
GIVEN:
|
||||
- No staging file exists
|
||||
WHEN:
|
||||
- A file is uploaded into the normal consumption dir
|
||||
THEN:
|
||||
- The file is processed as normal
|
||||
"""
|
||||
msg = self.consume_file("simple.pdf", Path("..") / "simple.pdf")
|
||||
self.assertIsNotFile(self.staging_file)
|
||||
self.assertRegex(msg, "Success. New document .* created")
|
||||
|
||||
def test_subdirectory_upload(self):
|
||||
"""
|
||||
GIVEN:
|
||||
- A staging file exists
|
||||
WHEN:
|
||||
- A file gets uploaded into foo/bar/double-sided
|
||||
or double-sided/foo/bar
|
||||
THEN:
|
||||
- The collated file gets put into foo/bar
|
||||
"""
|
||||
for path in [
|
||||
Path("foo") / "bar" / "double-sided",
|
||||
Path("double-sided") / "foo" / "bar",
|
||||
]:
|
||||
with self.subTest(path=path):
|
||||
# Ensure we get fresh directories for each run
|
||||
self.tearDown()
|
||||
self.setUp()
|
||||
|
||||
self.create_staging_file()
|
||||
self.consume_file("double-sided-odd.pdf", path / "foo.pdf")
|
||||
self.assertIsFile(
|
||||
self.dirs.consumption_dir / "foo" / "bar" / "foo-collated.pdf",
|
||||
)
|
||||
|
||||
@override_settings(CONSUMER_ENABLE_COLLATE_DOUBLE_SIDED=False)
|
||||
def test_disabled_double_sided_dir_upload(self):
|
||||
"""
|
||||
GIVEN:
|
||||
- CONSUMER_ENABLE_COLLATE_DOUBLE_SIDED is false
|
||||
WHEN:
|
||||
- A file is uploaded into the double-sided directory
|
||||
THEN:
|
||||
- The file is processed like a normal upload
|
||||
"""
|
||||
msg = self.consume_file("simple.pdf")
|
||||
self.assertIsNotFile(self.staging_file)
|
||||
self.assertRegex(msg, "Success. New document .* created")
|
@ -791,6 +791,18 @@ CONSUMER_BARCODE_DPI: Final[str] = int(
|
||||
os.getenv("PAPERLESS_CONSUMER_BARCODE_DPI", 300),
|
||||
)
|
||||
|
||||
CONSUMER_ENABLE_COLLATE_DOUBLE_SIDED: Final[bool] = __get_boolean(
|
||||
"PAPERLESS_CONSUMER_ENABLE_COLLATE_DOUBLE_SIDED",
|
||||
)
|
||||
|
||||
CONSUMER_COLLATE_DOUBLE_SIDED_SUBDIR_NAME: Final[str] = os.getenv(
|
||||
"PAPERLESS_CONSUMER_COLLATE_DOUBLE_SIDED_SUBDIR_NAME",
|
||||
"double-sided",
|
||||
)
|
||||
|
||||
CONSUMER_COLLATE_DOUBLE_SIDED_TIFF_SUPPORT: Final[bool] = __get_boolean(
|
||||
"PAPERLESS_CONSUMER_COLLATE_DOUBLE_SIDED_TIFF_SUPPORT",
|
||||
)
|
||||
|
||||
OCR_PAGES = int(os.getenv("PAPERLESS_OCR_PAGES", 0))
|
||||
|
||||
|