fix(language): enhance language detection and text processing

- Improve language detection by removing newline characters from the input text
- Add error handling and fallback mechanism to deal with text containing control characters
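
The two points above can be sketched in isolation as follows. This is a minimal illustration, not the project's code: `detect_lang_sketch`, `strip_control_chars`, and `picky_detector` are hypothetical names, and `picky_detector` is a stand-in that simply raises on control characters so the fallback path can be exercised without the real detection library. The second `try` block of the original function is cut off in the diff and is not reproduced here.

```python
import unicodedata
from typing import Callable


def strip_control_chars(text: str) -> str:
    """Drop every character whose Unicode general category starts with 'C'."""
    return "".join(ch for ch in text if unicodedata.category(ch)[0] != "C")


def detect_lang_sketch(text: str, detector: Callable[[str], str]) -> str:
    """Mirror the flow in the diff, with the detector passed in (assumption)."""
    if len(text) == 0:
        return ""

    # Step 1: remove newlines before the first detection attempt.
    text = text.replace("\n", "")
    try:
        lang_upper = detector(text)
    except Exception:
        # Step 2: fall back to the text with control characters removed.
        lang_upper = detector(strip_control_chars(text))

    return lang_upper.lower()


def picky_detector(text: str) -> str:
    """Hypothetical detector: rejects control characters, else returns 'EN'."""
    if any(unicodedata.category(ch)[0] == "C" for ch in text):
        raise ValueError("control character in input")
    return "EN"


if __name__ == "__main__":
    print(detect_lang_sketch("hello\nworld", picky_detector))
    print(detect_lang_sketch("hello\x00world", picky_detector))
```

Note that `"\n"` itself belongs to category `Cc`, so the fallback would strip it as well; removing it up front presumably lets the first attempt succeed on ordinary multi-line text without going through the exception path.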
myhloli 10 months ago
parent
commit
29681c4f79
1 changed file with 3 additions and 0 deletions
+ 3 - 0
magic_pdf/libs/language.py

@@ -16,11 +16,14 @@ def detect_lang(text: str) -> str:
 
     if len(text) == 0:
         return ""
+
+    text = text.replace("\n", "")
     try:
         lang_upper = detect_language(text)
     except:
         html_no_ctrl_chars = ''.join([l for l in text if unicodedata.category(l)[0] not in ['C', ]])
         lang_upper = detect_language(html_no_ctrl_chars)
+
     try:
         lang = lang_upper.lower()
     except: