intelli3text

Ingestion • Cleaning • Paragraph-level Language ID • spaCy normalization • PDF/JSON export (PT/EN/ES)

What does it do?

intelli3text ingests web pages, PDF, DOCX, and TXT files, cleans and normalizes the text, detects the language of each paragraph, and exports JSON or a rich PDF report. The first run downloads the required models automatically; after that, it works offline.

Install

pip install intelli3text

Quickstart (CLI)

intelli3text "https://en.wikipedia.org/wiki/Natural_language_processing" --export-pdf report.pdf

Quickstart (Python)

from intelli3text import PipelineBuilder, Intelli3Config

cfg = Intelli3Config(
    cleaners=["ftfy", "clean_text", "pdf_breaks"],
    nlp_model_pref="lg",
    export={"pdf": {"path": "report.pdf", "include_global_normalized": True}},
)

pipeline = PipelineBuilder(cfg).build()
res = pipeline.process("paper.pdf")
print(res["language_global"], len(res["paragraphs"]))
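The result above exposes a global language and a list of paragraphs. As a rough illustration of what paragraph-level language ID means, here is a self-contained toy sketch that does not use intelli3text at all. The stopword lists, the helper names (`guess_language`, `split_paragraphs`, `analyze`), and the per-paragraph fields `text` and `language` are assumptions made for this example; only the `language_global` and `paragraphs` keys come from the quickstart. The library's actual detector and output schema may well differ.

```python
import json
import re
from collections import Counter

# Toy stopword lists for EN/PT/ES. Illustrative only: this is NOT the
# detector intelli3text uses, just a stand-in to show the idea.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "pt": {"o", "a", "e", "de", "que", "do", "da", "em", "não", "um", "uma", "os"},
    "es": {"el", "la", "y", "de", "que", "en", "los", "las", "un", "una", "es", "no"},
}


def guess_language(text):
    """Return the language whose stopwords occur most often, or 'unknown'."""
    tokens = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def split_paragraphs(text):
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def analyze(text):
    """Label each paragraph, then take the most common label as the global one."""
    # 'text' and 'language' are hypothetical field names for this sketch.
    paragraphs = [
        {"text": p, "language": guess_language(p)} for p in split_paragraphs(text)
    ]
    counts = Counter(p["language"] for p in paragraphs)
    language_global = counts.most_common(1)[0][0] if counts else "unknown"
    return {"language_global": language_global, "paragraphs": paragraphs}


if __name__ == "__main__":
    sample = (
        "The cat is in the house and it is happy.\n\n"
        "O gato não está em casa e os cães da rua que passam.\n\n"
        "The dog is in the garden."
    )
    print(json.dumps(analyze(sample), ensure_ascii=False, indent=2))
```

Detecting per paragraph rather than per document lets a mixed-language file keep an accurate label on each paragraph while still reporting one overall language.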

Documentation

Open the full site on GitHub Pages: Docs