Mirror of https://github.com/paperless-ngx/paperless-ngx.git (synced 2025-08-26 01:16:16 +00:00)
Performance: Classifier performance optimizations (#10363)
src/documents/tests/samples/content.txt (normal file, 34 lines added)
@@ -0,0 +1,34 @@
Sample textual document content.
Include as many characters as possible, to check the classifier's vectorization.

Hey 00, this is "a" test0707 content.
This is an example document — created on 2025-06-25.

Digits: 0123456789
Punctuation: . , ; : ! ? ' " ( ) [ ] { } — – …
English text: The quick brown fox jumps over the lazy dog.
English stop words: We’ve been doing it before.
Accented Latin (diacritics): àâäæçéèêëîïôœùûüÿñ
Arabic: لقد قام المترجم بعمل جيد
Greek: Αλφα, Βήτα, Γάμμα, Δέλτα, Ωμέγα
Cyrillic: Привет, как дела? Добро пожаловать!
Chinese (Simplified): 你好,世界!今天的天气很好。
Chinese (Traditional): 歡迎來到世界,今天天氣很好。
Japanese (Kanji, Hiragana, Katakana): 東京へ行きます。カタカナ、ひらがな、漢字。
Korean (Hangul): 안녕하세요. 오늘 날씨 어때요?
Arabic: مرحبًا، كيف حالك؟
Hebrew: שלום, מה שלומך?
Emoji: 😀 🐍 📘 ✅ ©️ 🇺🇳
Symbols: © ® ™ § ¶ † ‡ ∞ µ ∑ ∆ √
Math: ∫₀^∞ x² dx = ∞, π ≈ 3.14159, ∇·E = ρ/ε₀
Currency: 1$ € ¥ £ ₹
Date formats: 25/06/2025, June 25, 2025, 2025年6月25日
Quote in French: « Bonjour, ça va ? »
Quote in German: „Guten Tag! Wie geht's?“
Newline test:
\r\n
\r

Tab\ttest\tspacing
/ = +) ( []) ~ * #192 +33601010101 § ¤
End of document.
src/documents/tests/samples/preprocessed_content.txt (normal file, 1 line added)
@@ -0,0 +1 @@
sample textual document content include as many characters as possible to check the classifier s vectorization hey 00 this is a test0707 content this is an example document created on 2025 06 25 digits 0123456789 punctuation english text the quick brown fox jumps over the lazy dog english stop words we ve been doing it before accented latin diacritics àâäæçéèêëîïôœùûüÿñ arabic لقد قام المترجم بعمل جيد greek αλφα βήτα γάμμα δέλτα ωμέγα cyrillic привет как дела добро пожаловать chinese simplified 你好 世界 今天的天气很好 chinese traditional 歡迎來到世界 今天天氣很好 japanese kanji hiragana katakana 東京へ行きます カタカナ ひらがな 漢字 korean hangul 안녕하세요 오늘 날씨 어때요 arabic مرحب ا كيف حالك hebrew שלום מה שלומך emoji symbols µ math ₀ x² dx π 3 14159 e ρ ε₀ currency 1 date formats 25 06 2025 june 25 2025 2025年6月25日 quote in french bonjour ça va quote in german guten tag wie geht s newline test r n r tab ttest tspacing 192 33601010101 end of document
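
The fixture line above looks like the result of lowercasing the sample text, replacing every run of non-word characters with a space, and collapsing whitespace. The following is a minimal sketch under that assumption. The function name `preprocess_sketch` is hypothetical and is not taken from the commit, and the project's actual preprocessing may differ in detail.

```python
import re


def preprocess_sketch(text: str) -> str:
    """Hypothetical normalization step, inferred from the fixture only."""
    # Lowercase first so later comparisons are case-insensitive.
    lowered = text.lower()
    # Replace each run of characters that are neither word characters nor
    # whitespace with one space. In Python 3, \w is Unicode-aware for str
    # patterns, so letters and digits from non-Latin scripts are kept.
    no_punct = re.sub(r"[^\w\s]+", " ", lowered)
    # Collapse whitespace runs (spaces, tabs, newlines) into single spaces.
    return re.sub(r"\s+", " ", no_punct).strip()


if __name__ == "__main__":
    print(preprocess_sketch('Hey 00, this is "a" test0707 content.'))
    # hey 00 this is a test0707 content
```

Because `\w` is Unicode-aware, this sketch keeps accented Latin, Greek, Cyrillic, and CJK text while dropping emoji, currency signs, and most symbols, which matches what the fixture line shows.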
@@ -0,0 +1 @@
sampl textual document content includ mani charact possibl check classifi vector hey 00 test0707 content exampl document creat 2025 06 25 digit 0123456789 punctuat english text quick brown fox jump lazi dog english stop word accent latin diacrit àâäæçéèêëîïôœùûüÿñ arab لقد قام المترجم بعمل جيد greek αλφα βήτα γάμμα δέλτα ωμέγα cyril привет как дела добро пожаловать chines simplifi 你好 世界 今天的天气很好 chines tradit 歡迎來到世界 今天天氣很好 japanes kanji hiragana katakana 東京へ行きます カタカナ ひらがな 漢字 korean hangul 안녕하세요 오늘 날씨 어때요 arab مرحب ا كيف حالك hebrew שלום מה שלומך emoji symbol µ math ₀ x² dx π 3 14159 e ρ ε₀ currenc 1 date format 25 06 2025 june 25 2025 2025年6月25日 quot french bonjour ça va quot german guten tag wie geht newlin test r n r tab ttest tspace 192 33601010101 end document