intelli3text
Ingestion • Cleaning • Paragraph-level Language ID • spaCy normalization • PDF/JSON export (PT/EN/ES)
What does it do?
intelli3text ingests Web, PDF, DOCX, and TXT sources, cleans and normalizes the text, detects the language of each paragraph, and exports JSON or a rich PDF report. The first run automatically downloads the required models; after that, it works offline.
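To make "language per paragraph" concrete, here is a toy sketch of the idea. It is not how intelli3text detects language: the stopword lists and the helper names split_paragraphs and guess_language are hypothetical and exist only for this illustration, which assumes paragraphs are separated by blank lines.

```python
import re

# Tiny, hand-picked stopword lists (illustrative only, NOT intelli3text's models).
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "it", "for", "with"},
    "pt": {"o", "a", "de", "que", "e", "do", "da", "em", "um", "para", "não", "uma", "os", "com"},
    "es": {"el", "la", "de", "que", "y", "en", "un", "los", "del", "las", "por", "con", "una", "es"},
}


def split_paragraphs(text):
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def guess_language(paragraph):
    """Return the language whose stopwords occur most often, or 'unknown'."""
    tokens = re.findall(r"\w+", paragraph.lower())
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


document = (
    "The cat is in the house and it sleeps.\n\n"
    "O gato não está em casa para uma festa.\n\n"
    "El perro y los gatos duermen en las calles del pueblo."
)

for paragraph in split_paragraphs(document):
    print(guess_language(paragraph), "|", paragraph)
```

Tagging each paragraph separately is what lets a mixed PT/EN/ES document keep a label per paragraph instead of a single label for the whole text.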
Install
pip install intelli3text
Quickstart (CLI)
intelli3text "https://en.wikipedia.org/wiki/Natural_language_processing" --export-pdf report.pdf
Quickstart (Python)
from intelli3text import PipelineBuilder, Intelli3Config

# Configure the cleaners, the model preference, and the PDF export.
cfg = Intelli3Config(
    cleaners=["ftfy", "clean_text", "pdf_breaks"],
    nlp_model_pref="lg",
    export={"pdf": {"path": "report.pdf", "include_global_normalized": True}},
)

# Build the pipeline and process a local PDF.
pipeline = PipelineBuilder(cfg).build()
res = pipeline.process("paper.pdf")

# Print the document-level language and the number of paragraphs.
print(res["language_global"], len(res["paragraphs"]))
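If you want to keep the result for later use, a minimal sketch along these lines may help. It assumes the result is a plain, JSON-serializable dict. Only the language_global and paragraphs keys come from the example above; the per-paragraph fields text and language, and the helpers result_to_json and save_result, are hypothetical stand-ins rather than part of the intelli3text API.

```python
import json
from pathlib import Path


def result_to_json(res):
    """Serialize a result dict, keeping accented characters readable."""
    return json.dumps(res, ensure_ascii=False, indent=2)


def save_result(res, path):
    """Write the serialized result to `path` as UTF-8 and return the Path."""
    out = Path(path)
    out.write_text(result_to_json(res), encoding="utf-8")
    return out


# Stand-in for a pipeline result; the per-paragraph keys are assumed.
sample = {
    "language_global": "pt",
    "paragraphs": [
        {"text": "Olá, mundo. Isto não é um teste.", "language": "pt"},
        {"text": "Hello, world.", "language": "en"},
    ],
}

print(result_to_json(sample))
```

Passing ensure_ascii=False keeps Portuguese and Spanish characters as-is in the output instead of escaping them as \uXXXX sequences.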
Documentation
Open the full site on GitHub Pages: Docs