From raw documents to clean, normalized text — with paragraph-level Language ID
Ingest web pages, PDF, DOCX, or TXT; repair Unicode and strip noise; detect the language of each paragraph (PT/EN/ES) with fastText; normalize with spaCy (lemmatization and stopword removal); and export an auditable PDF report. The first run auto-downloads models; subsequent runs work offline.
Simple install
pip install intelli3text. No extra scripts.
Cleaning pipeline
Unicode fixes, noise removal, PDF-aware line-break & hyphen handling.
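The cleaning steps above can be sketched with the standard library alone. This is an illustrative approximation of what the pipeline does, not the library's internal code; the function name `clean_pdf_text` is hypothetical:

```python
import re
import unicodedata

def clean_pdf_text(raw: str) -> str:
    """Sketch of the cleaning stages: Unicode normalization,
    de-hyphenation across line breaks, soft line-break joining."""
    # Normalize Unicode (NFKC folds ligatures such as "ﬁ" -> "fi")
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words hyphenated across a line break: "lan-\nguage" -> "language"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse single newlines inside a paragraph; keep blank-line breaks
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text

print(clean_pdf_text("The ﬁrst lan-\nguage model.\n\nNew paragraph."))
# → The first language model.
#
#   New paragraph.
```

The actual package also runs ftfy and clean-text passes, which handle a much wider range of mojibake and noise than this sketch.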
Paragraph-level LID
fastText LID (176 languages); the global language is the most frequent per-paragraph label.
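The per-paragraph-then-majority-vote scheme can be shown in a few lines. Here `predict` is a hypothetical stand-in for a fastText LID model (the real pipeline uses the `lid.176` model, whose labels look like `__label__pt`); only the aggregation logic is the point:

```python
from collections import Counter

def detect_paragraph_langs(paragraphs, predict):
    """Label each paragraph, then pick the most frequent
    label as the document's global language."""
    labels = [predict(p) for p in paragraphs]
    global_lang = Counter(labels).most_common(1)[0][0]
    return labels, global_lang

# Toy predictor for illustration only (not real language ID).
stub = lambda p: "pt" if ("ç" in p or "ã" in p) else "en"

labels, glob = detect_paragraph_langs(
    ["A educação é importante.", "Education matters.", "A ciência avança."],
    stub,
)
print(labels, glob)  # → ['pt', 'en', 'pt'] pt
```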
spaCy normalization
Lemmas with stopwords and punctuation removed, for PT/EN/ES (model preference lg → md → sm, with offline fallback).
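The lg → md → sm fallback amounts to trying model names in preference order and catching the load failure. A minimal sketch, with a stub `loader` in place of `spacy.load` (which raises `OSError` when a model isn't installed); the helper names are hypothetical:

```python
def load_with_fallback(model_names, loader):
    """Return the first model that loads, trying names in
    preference order (e.g. lg -> md -> sm)."""
    last_err = None
    for name in model_names:
        try:
            return loader(name)
        except OSError as err:  # raised by spacy.load for a missing model
            last_err = err
    raise last_err or OSError("no model names given")

# Stub loader: pretend only the small Portuguese model is installed.
def stub_load(name):
    if name != "pt_core_news_sm":
        raise OSError(f"Can't find model '{name}'")
    return f"<nlp {name}>"

nlp = load_with_fallback(
    ["pt_core_news_lg", "pt_core_news_md", "pt_core_news_sm"], stub_load
)
print(nlp)  # → <nlp pt_core_news_sm>
```

With a real pipeline the filtering step then keeps `token.lemma_` for tokens where neither `token.is_stop` nor `token.is_punct` is set.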
CLI & Python API
Script quick tasks or embed as a component.
PDF export
Full report: summary, global normalized, and per-paragraph sections.
Quickstart (CLI)
pip install intelli3text
intelli3text "https://en.wikipedia.org/wiki/Howard_Gardner" --export-pdf out.pdf
# Prints JSON to stdout and saves a detailed PDF report
Quickstart (Python)
from intelli3text import PipelineBuilder, Intelli3Config
cfg = Intelli3Config(
    cleaners=["ftfy", "clean_text", "pdf_breaks"],
    nlp_model_pref="lg",
    export={"pdf": {"path": "out.pdf", "include_global_normalized": True}},
)
pipeline = PipelineBuilder(cfg).build()
res = pipeline.process("https://en.wikipedia.org/wiki/Howard_Gardner")
print("Global language:", res["language_global"], "Paragraphs:", len(res["paragraphs"]))