If you've been coding in Python for a while, you already know re for regex and maybe NLTK for NLP. But you also know the pain: text full of messy encodings, weird punctuation, invisible characters, and HTML soup. Parsing it with vanilla tools feels like untangling headphones from 2010.
I've been there. I've built scrapers, log parsers, and natural language pipelines. Over the years I've collected a small arsenal of lesser-known but insanely useful libraries that make text cleaning, parsing and analysis dead simple. Here are my favourites.
1. ftfy — Fix Text for You (Invisible Unicode Problems Gone)
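Ever seen "FranÃ§ais" instead of "Français"? That's mojibake: text that was decoded with the wrong character encoding. ftfy detects and repairs it automatically.
import ftfy
text = "FranÃ§ais - piÃ±ata"
print(ftfy.fix_text(text))
# Français - piñata
By default it also straightens curly quotes, normalizes line breaks, and strips stray control characters. No regex headaches.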
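To see where mojibake comes from, here is a minimal standard-library sketch. It is not how ftfy works internally: it assumes you already know the text was UTF-8 wrongly decoded as Latin-1, whereas ftfy's job is to work out which mistake happened. The variable names are illustrative.

```python
# Reproduce and undo one specific kind of mojibake with the standard library.
# Assumption: the text was UTF-8 bytes wrongly decoded as Latin-1.
original = "Français - piñata"

# Step 1: the bug. UTF-8 bytes are read back with the wrong codec.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)   # FranÃ§ais - piÃ±ata

# Step 2: the manual fix. Reverse the wrong decode, then decode correctly.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # Français - piñata
```

The manual round trip only works when you know both codecs involved, which is rarely the case with scraped or user-submitted text.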
2. Unidecode — Strip Accents Cleanly
When you need ASCII-only slugs or filenames, unidecode transliterates anything:
from unidecode import unidecode
print(unidecode("Супер пример - Français"))
# Super primer - Francais
Perfect for generating URLs or filenames from messy text.
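Turning text into a URL slug takes one more step after transliteration. The sketch below uses the standard library's unicodedata as a stand-in so it runs without dependencies; ascii_fold and slugify are hypothetical helper names, not part of unidecode. The stand-in only strips accents from Latin letters, so for Cyrillic or other scripts you would swap in unidecode.

```python
import re
import unicodedata


def ascii_fold(text: str) -> str:
    """Strip accents using only the standard library.

    Stand-in for unidecode(): it handles accented Latin letters but simply
    drops characters from other scripts (Cyrillic, CJK, ...).
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def slugify(text: str) -> str:
    """Hypothetical helper: lowercase ASCII words joined by hyphens."""
    folded = ascii_fold(text).lower()
    # Collapse every run of non-alphanumeric characters into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", folded).strip("-")


print(slugify("Crème brûlée à la Française!"))  # creme-brulee-a-la-francaise
```

Replacing the body of ascii_fold with a call to unidecode keeps the rest of the helper unchanged.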
3. dateparser — Parse Human Dates Like a Boss
Stop writing brittle regexes for "yesterday" or "last Friday 8pm":
import dateparser
print(dateparser.parse("next Thursday at 5pm"))
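# 2025-09-25 17:00:00
The exact output depends on the day you run it, because relative phrases resolve against the current date. dateparser supports more than 200 language locales and handles time zones out of the box.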
4. Textacy — NLP Building Blocks Without Boilerplate
textacy sits on top of spaCy and adds higher-level helpers for keyword extraction, readability scores, and term frequencies.
import spacy
from textacy.extract import keyterms
nlp = spacy.load("en_core_web_sm")  # install first: python -m spacy download en_core_web_sm
doc = nlp("Python makes text processing simple.")
keywords = keyterms.textrank(doc, topn=3)  # (term, score) pairs
print(list(keywords))
Fewer lines, more power.
5. PySBD — Sentence Boundary Detection Without False Splits
You think splitting on "." works? It doesn't. PySBD, a rule-based port of the Ruby Pragmatic Segmenter, handles abbreviations and multilingual text:
import pysbd
seg = pysbd.Segmenter(language="en", clean=True)
print(seg.segment("Dr. Smith went to Washington. It rained."))
# ['Dr. Smith went to Washington.', 'It rained.']
Great for preparing text before summarisation or tokenization.
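For contrast, here is what the naive approach does to the same sentence. This is a standard-library sketch of the failure mode, not of PySBD itself.

```python
import re

text = "Dr. Smith went to Washington. It rained."

# Naive approach: split wherever a period is followed by whitespace.
naive = re.split(r"(?<=\.)\s+", text)
print(naive)
# ['Dr.', 'Smith went to Washington.', 'It rained.']
```

The abbreviation "Dr." is treated as a sentence of its own, which is exactly the kind of false split a rule-based segmenter is built to avoid.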
6. justext — Strip Boilerplate from HTML
Need only the article body without ads or nav bars? justext classifies each paragraph as content or boilerplate, so you don't have to hand-pick selectors the way you would with BeautifulSoup alone.
import requests, justext
html = requests.get("https://example.com").text
paragraphs = justext.justext(html, justext.get_stoplist("English"))
clean_text = " ".join(p.text for p in paragraphs if not p.is_boilerplate)
print(clean_text)
Boom: instant readable content.
7. Clean-Text — One-Liner Cleaning Pipelines
Lowercase text and strip URLs, punctuation, or emojis in one call:
from cleantext import clean
txt = "Visit https://example.com 😃!!"
# replace_with_url="" drops URLs instead of substituting the default "<URL>" placeholder
print(clean(txt, no_urls=True, no_emoji=True, lower=True, replace_with_url=""))
# visit !!
Chain options instead of chaining regexes.
8. Polyglot — Fast Multilingual Named Entity Recognition
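Need language detection, transliteration, or entity extraction beyond English? Coverage varies by task: the project documentation lists 196 languages for detection, 69 for transliteration, and 40 for named entity recognition.
from polyglot.text import Text
# Models are downloaded separately, e.g.: polyglot download embeddings2.es ner2.es
txt = Text("Elon Musk vive en Texas.")
print(txt.entities)       # entity chunks tagged I-PER, I-LOC, or I-ORG
print(txt.language.code)  # es
Great for global datasets where English-only tools choke.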
9. FlashText — Keyword Extraction at Lightning Speed
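A regex alternation over thousands of keywords slows down as the list grows. flashtext matches against a trie instead, so replacing or extracting keywords takes time roughly proportional to the length of the text, not the number of keywords.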
from flashtext import KeywordProcessor
kp = KeywordProcessor()
kp.add_keywords_from_dict({'python': ['python', 'py']})
print(kp.extract_keywords("I love Py and python libraries"))
# ['python', 'python']
Ideal for massive logs or real-time filtering.
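The reason this kind of lookup stays fast can be sketched with a plain dictionary. This is a simplified stand-in and not FlashText's implementation: it handles single-word keywords only, while FlashText walks a character-level trie that also covers multi-word phrases. The function names build_lookup and extract_keywords are hypothetical.

```python
import re


def build_lookup(keyword_dict: dict[str, list[str]]) -> dict[str, str]:
    """Map every lowercase variant to its canonical keyword."""
    lookup = {}
    for canonical, variants in keyword_dict.items():
        for variant in variants:
            lookup[variant.lower()] = canonical
    return lookup


def extract_keywords(text: str, lookup: dict[str, str]) -> list[str]:
    """One pass over the words; each check is a constant-time dict lookup."""
    words = re.findall(r"\w+", text.lower())
    return [lookup[word] for word in words if word in lookup]


lookup = build_lookup({"python": ["python", "py"]})
print(extract_keywords("I love Py and python libraries", lookup))
# ['python', 'python']
```

Adding ten thousand more keywords grows the dictionary but not the number of steps per word, which is the property that matters for large logs.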
10. LangDetect — Dead Simple Language Detection
Detect the language of any string with one call:
from langdetect import detect
print(detect("Ceci est un test"))
# fr
Use it as a gatekeeper before choosing the right pipeline.
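One way the gatekeeper idea might look in practice is sketched below. Everything here is hypothetical: route_by_language, the pipelines, and the stub detector are illustrative names, and the stub exists only so the sketch runs without dependencies. In real code you would pass langdetect's detect function as detect_fn.

```python
from typing import Callable


def route_by_language(
    text: str,
    detect_fn: Callable[[str], str],
    pipelines: dict[str, Callable[[str], str]],
    default: Callable[[str], str],
) -> str:
    """Run the pipeline registered for the detected language code."""
    try:
        code = detect_fn(text)
    except Exception:
        # Detectors can fail on input with nothing to analyse (e.g. "").
        return default(text)
    return pipelines.get(code, default)(text)


# Hypothetical pipelines; real ones would call spaCy, textacy, and so on.
pipelines = {
    "fr": lambda t: f"[french pipeline] {t}",
    "en": lambda t: f"[english pipeline] {t}",
}


def stub_detect(text: str) -> str:
    """Toy detector for the demo only; replace with langdetect.detect."""
    return "fr" if "est" in text.split() else "en"


print(route_by_language("Ceci est un test", stub_detect, pipelines, lambda t: t))
# [french pipeline] Ceci est un test
```

Keeping the detector as a parameter also makes the routing logic easy to unit test without loading any language models.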
Thanks for reading!