From raw documents to clean, normalized text — with paragraph-level Language ID

Ingest Web/PDF/DOCX/TXT documents, fix Unicode and strip noise, detect the language of each paragraph (PT/EN/ES) with fastText, normalize with spaCy (lemmatization and stopword removal), and export an auditable PDF report. The first run auto-downloads models; subsequent runs work offline.

Simple install

pip install intelli3text. No extra scripts.

Cleaning pipeline

Unicode fixes, noise removal, PDF-aware line-break & hyphen handling.
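The package's actual cleaners aren't reproduced here, but a stdlib-only sketch of the PDF-aware step (joining words hyphenated across line breaks and merging hard-wrapped lines) might look like this; the function name is illustrative:

```python
import re
import unicodedata

def clean_pdf_text(raw: str) -> str:
    """Sketch: NFC-normalize, join hyphenated line breaks,
    and merge hard-wrapped lines inside a paragraph."""
    text = unicodedata.normalize("NFC", raw)
    # Join words split by a hyphen at a line break: "multi-\nlingual" -> "multilingual"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse single line breaks (hard wraps) into spaces;
    # keep blank lines as paragraph breaks
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Squeeze runs of horizontal whitespace
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()

sample = "A multi-\nlingual cor-\npus of doc-\numents.\n\nSecond paragraph."
print(clean_pdf_text(sample))
# -> A multilingual corpus of documents.
#
#    Second paragraph.
```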

Paragraph-level LID

fastText language identification (176 languages). The global language is the most frequent per-paragraph label.
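The "most frequent" rule amounts to a majority vote over the per-paragraph labels. A minimal sketch (the helper name is illustrative, not the library's API):

```python
from collections import Counter

def global_language(paragraph_langs: list[str]) -> str:
    """Pick the document-level language as the most frequent
    per-paragraph label (ties broken by first occurrence)."""
    counts = Counter(paragraph_langs)
    return counts.most_common(1)[0][0]

print(global_language(["pt", "pt", "en", "pt", "es"]))  # -> pt
```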

spaCy normalization

Lemmas with stopwords and punctuation removed, for PT/EN/ES (model preference lg→md→sm, with offline fallback).
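The library's internal model loader isn't shown, but the lg→md→sm preference can be sketched as expanding a size preference into an ordered list of real spaCy model names to try; the helper below is an assumption, not intelli3text's API:

```python
# Hypothetical helper: expand a size preference into the spaCy model
# names to try, in order, so a missing "lg" model degrades gracefully.
SIZES = ["lg", "md", "sm"]
MODEL_BY_LANG = {
    "pt": "pt_core_news_{}",
    "en": "en_core_web_{}",
    "es": "es_core_news_{}",
}

def model_candidates(lang: str, pref: str = "lg") -> list[str]:
    order = SIZES[SIZES.index(pref):]   # e.g. "md" -> ["md", "sm"]
    template = MODEL_BY_LANG[lang]
    return [template.format(size) for size in order]

print(model_candidates("pt", "lg"))
# -> ['pt_core_news_lg', 'pt_core_news_md', 'pt_core_news_sm']
```

A loader would then try `spacy.load(name)` for each candidate and keep the first that succeeds.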

CLI & Python API

Script quick tasks from the command line, or embed the pipeline as a component in your own code.

PDF export

Full report: summary, globally normalized text, and per-paragraph sections.

Quickstart (CLI)

pip install intelli3text
intelli3text "https://en.wikipedia.org/wiki/Howard_Gardner" --export-pdf out.pdf
# Prints JSON to stdout and saves a detailed PDF report

Quickstart (Python)

from intelli3text import PipelineBuilder, Intelli3Config

cfg = Intelli3Config(
    cleaners=["ftfy", "clean_text", "pdf_breaks"],  # cleaning stages, applied in order
    nlp_model_pref="lg",  # preferred spaCy model size (falls back to md, then sm)
    export={"pdf": {"path": "out.pdf", "include_global_normalized": True}},
)
pipeline = PipelineBuilder(cfg).build()
res = pipeline.process("https://en.wikipedia.org/wiki/Howard_Gardner")
print("Global language:", res["language_global"], "Paragraphs:", len(res["paragraphs"]))

Documentation