mirror of
https://github.com/paperless-ngx/paperless-ngx.git
synced 2026-01-30 23:08:59 -06:00
Draft up documentation on how to create a plugin
@@ -481,3 +481,147 @@ To get started:
5. The project is ready for debugging. Start either the fullstack debug or the individual debug
   processes. To spin up the project without debugging, run the task **Project Start: Run all Services**.

## Developing Date Parser Plugins

Paperless-ngx uses a plugin system for date parsing, allowing you to extend or replace the default date parsing behavior. Plugins are discovered using [Python entry points](https://setuptools.pypa.io/en/latest/userguide/entry_point.html).

### Creating a Date Parser Plugin

To create a custom date parser plugin, you need to:

1. Create a class that inherits from `DateParserPluginBase`
2. Implement the required abstract method
3. Register your plugin via an entry point

#### 1. Implementing the Parser Class

Your parser must extend `documents.plugins.date_parsing.DateParserPluginBase` and implement the `parse` method:

```python
import datetime
from collections.abc import Iterator

from documents.plugins.date_parsing import DateParserPluginBase


class MyDateParserPlugin(DateParserPluginBase):
    """
    Custom date parser implementation.
    """

    def parse(self, filename: str, content: str) -> Iterator[datetime.datetime]:
        """
        Parse dates from the document's filename and content.

        Args:
            filename: The original filename of the document
            content: The extracted text content of the document

        Yields:
            datetime.datetime: Valid datetime objects found in the document
        """
        # Placeholder logic: replace it with your own parsing logic.
        # Use self.config to access configuration settings.

        # Example: parse dates from the filename first, if a date order is configured
        if self.config.filename_date_order:
            filename_date = self._filter_date(
                self._parse_string(filename, self.config.filename_date_order),
            )
            if filename_date is not None:
                yield filename_date

        # Then parse dates from the content
        content_date = self._filter_date(
            self._parse_string(content, self.config.content_date_order),
        )
        if content_date is not None:
            yield content_date
```

#### 2. Configuration and Helper Methods

Your parser instance is initialized with a `DateParserConfig` object accessible via `self.config`. This provides:

- `languages: list[str]` - List of language codes for date parsing
- `timezone_str: str` - Timezone string for date localization
- `ignore_dates: set[datetime.date]` - Dates that should be filtered out
- `reference_time: datetime.datetime` - Current time for filtering future dates
- `filename_date_order: str | None` - Date order preference for filenames (e.g., "DMY", "MDY")
- `content_date_order: str` - Date order preference for content

The base class provides two helper methods you can use:

```python
def _parse_string(
    self,
    date_string: str,
    date_order: str,
) -> datetime.datetime | None:
    """
    Parse a single date string using dateparser with configured settings.
    """

def _filter_date(
    self,
    date: datetime.datetime | None,
) -> datetime.datetime | None:
    """
    Validate a parsed datetime against configured rules.
    Filters out dates before 1900, future dates, and ignored dates.
    """
```

#### 3. Resource Management (Optional)

If your plugin needs to acquire or release resources (database connections, API clients, etc.), override the context manager methods (`__enter__` and `__exit__`). Paperless-ngx always uses plugins as context managers, so resources can be released even when an error occurs.

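A minimal sketch of this pattern, assuming the base class follows Python's standard `__enter__`/`__exit__` protocol. `PluginBaseStandIn`, `FakeClient`, and `ClientBackedParser` are hypothetical names used so the snippet runs outside Paperless-ngx; a real plugin would subclass `DateParserPluginBase` instead.

```python
import datetime
from collections.abc import Iterator


class FakeClient:
    """Hypothetical resource that must be closed after use."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


class PluginBaseStandIn:
    """Stand-in for DateParserPluginBase (assumed to be a context manager)."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        return None


class ClientBackedParser(PluginBaseStandIn):
    def __enter__(self):
        # Acquire the resource when the plugin is entered
        self.client = FakeClient()
        return super().__enter__()

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        # Release the resource even if parsing raised an exception
        self.client.close()
        super().__exit__(exc_type, exc_value, traceback)

    def parse(self, filename: str, content: str) -> Iterator[datetime.datetime]:
        raise RuntimeError("simulated parsing failure")
        yield  # unreachable, but makes this method a generator


parser = ClientBackedParser()
try:
    with parser as active:
        list(active.parse("scan.pdf", ""))
except RuntimeError:
    pass
```

After the `with` block exits, `parser.client.closed` is `True` even though `parse` raised, which is the guarantee the context manager protocol provides.
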
#### 4. Registering Your Plugin

Register your plugin using a setuptools entry point in your package's `pyproject.toml`:

```toml
[project.entry-points."paperless_ngx.date_parsers"]
my_parser = "my_package.parsers:MyDateParserPlugin"
```

The entry point name (e.g., `"my_parser"`) is used for sorting when multiple plugins are found. Paperless-ngx will use the first plugin alphabetically by name if multiple plugins are discovered.

### Plugin Discovery

Paperless-ngx automatically discovers and loads date parser plugins at runtime. The discovery process:

1. Queries the `paperless_ngx.date_parsers` entry point group
2. Validates that each plugin is a subclass of `DateParserPluginBase`
3. Sorts valid plugins alphabetically by entry point name
4. Uses the first valid plugin, or falls back to the default `RegexDateParserPlugin` if none are found

If multiple plugins are installed, a warning is logged indicating which plugin was selected.

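The selection steps above can be sketched as follows. This illustrates the documented behaviour and is not the code Paperless-ngx runs: `BaseStandIn`, `DefaultStandIn`, `FakeEntryPoint`, and `select_parser_class` are hypothetical names, and the fake entry points mimic only the `name` attribute and `load()` method of real `importlib.metadata.EntryPoint` objects.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("plugin_discovery_sketch")


class BaseStandIn:
    """Stand-in for DateParserPluginBase."""


class DefaultStandIn(BaseStandIn):
    """Stand-in for the default RegexDateParserPlugin."""


@dataclass
class FakeEntryPoint:
    """Mimics the parts of an entry point used below."""

    name: str
    target: object

    def load(self) -> object:
        return self.target


def select_parser_class(entry_points) -> type[BaseStandIn]:
    valid = []
    for ep in entry_points:
        candidate = ep.load()
        # Keep only classes that subclass the plugin base
        if isinstance(candidate, type) and issubclass(candidate, BaseStandIn):
            valid.append((ep.name, candidate))
    if not valid:
        # Nothing usable was found: fall back to the default parser
        return DefaultStandIn
    # Sort alphabetically by entry point name and take the first
    valid.sort(key=lambda item: item[0])
    if len(valid) > 1:
        logger.warning("Multiple date parsers found, using %r", valid[0][0])
    return valid[0][1]


class ZuluParser(BaseStandIn):
    """Hypothetical third-party plugin."""


class AlphaParser(BaseStandIn):
    """Hypothetical third-party plugin."""


found = [
    FakeEntryPoint("zulu", ZuluParser),
    FakeEntryPoint("alpha", AlphaParser),
    FakeEntryPoint("broken", object()),  # not a plugin class, so it is skipped
]
selected = select_parser_class(found)
```

Because the winner depends only on the entry point name, choosing a name that sorts early is the way to take precedence over other installed plugins.
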
### Example: Simple Date Parser

Here's a minimal example that only looks for ISO 8601 dates:

```python
import datetime
import re
from collections.abc import Iterator

from documents.plugins.date_parsing.base import DateParserPluginBase


class ISODateParserPlugin(DateParserPluginBase):
    """
    Parser that only matches ISO 8601 formatted dates (YYYY-MM-DD).
    """

    ISO_REGEX = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")

    def parse(self, filename: str, content: str) -> Iterator[datetime.datetime]:
        # Combine filename and content for searching
        text = f"{filename} {content}"

        for match in self.ISO_REGEX.finditer(text):
            date_string = match.group(1)
            # Use helper method to parse with configured timezone
            date = self._parse_string(date_string, "YMD")
            # Use helper method to validate the date
            filtered_date = self._filter_date(date)
            if filtered_date is not None:
                yield filtered_date
```

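Two properties of the `ISO_REGEX` pattern are worth knowing before relying on it. It matches on shape only, so it also captures strings such as `2024-13-45`, and because `\b` treats an underscore as a word character, a date directly preceded by `_` (common in filenames) is not matched. The standalone sketch below exercises the same regular expression outside Paperless-ngx and uses `datetime.date.fromisoformat` as a strict check; `iso_candidates` is a hypothetical helper, and the sketch does not call `_parse_string`, whose handling of such input is not covered here.

```python
import datetime
import re

ISO_REGEX = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")


def iso_candidates(text: str) -> list[datetime.date]:
    """Return the real calendar dates among the regex matches in text."""
    dates = []
    for match in ISO_REGEX.finditer(text):
        try:
            dates.append(datetime.date.fromisoformat(match.group(1)))
        except ValueError:
            # Right shape, but not a real calendar date
            continue
    return dates


# The underscore before the first date prevents a match at that position
sample = "scan_2024-03-15.pdf due 2024-13-45, paid 2024-04-01"
found_dates = iso_candidates(sample)
```

If filenames with underscores matter for your documents, a lookaround such as `(?<!\d)` and `(?!\d)` in place of `\b` is one possible alternative.
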