Invoz — Audio ML Pipeline

Invoz is a production ML pipeline that analyzes spoken English across 11 scoring dimensions — 7 acoustic (pitch variability, speech rate, pause patterns, volume dynamics, filler words, articulation clarity, rhythm) and 4 linguistic (vocabulary richness, grammar accuracy, coherence, discourse markers). Built from 46 research papers in speech pathology and computational linguistics.
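
The dimension split above can be sketched as a simple data structure. This is an illustrative sketch only: the names and the flat-average aggregation are assumptions, not the production rubric.

```python
# Hypothetical layout of the 11 dimensions described above.
# Dimension names and the aggregation rule are illustrative.
ACOUSTIC = [
    "pitch_variability", "speech_rate", "pause_patterns", "volume_dynamics",
    "filler_words", "articulation_clarity", "rhythm",
]
LINGUISTIC = [
    "vocabulary_richness", "grammar_accuracy", "coherence", "discourse_markers",
]

def overall_score(scores: dict) -> float:
    """Average per-dimension scores (0-100) into one headline number."""
    dims = ACOUSTIC + LINGUISTIC
    return sum(scores[d] for d in dims) / len(dims)
```

In the real pipeline each dimension is scored against its own research-backed rubric; the flat average here just shows the shape of the output.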


The Problem

Non-native English speakers get generic feedback like "speak more clearly" with no specifics. Existing tools score pronunciation at the word level but miss the acoustic and linguistic patterns that actually make speech effective — rhythm, pause placement, vocabulary range, discourse structure.

The Solution

A multi-model pipeline: Whisper for transcription, wav2vec2 for phoneme-level analysis, Parselmouth for acoustic features (F0, jitter, shimmer), Silero VAD for precise speech/silence segmentation, and Claude for linguistic coaching. Each dimension is scored independently with research-backed rubrics.
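
The stage ordering can be sketched as plain orchestration code. Every stage function below is a stub standing in for the named model (Silero VAD, Whisper, wav2vec2, Parselmouth, Claude); none of this is the production implementation.

```python
# Illustrative orchestration of the stages named above, with stub stages.
from dataclasses import dataclass, field

@dataclass
class Analysis:
    segments: list = field(default_factory=list)   # speech/silence spans
    transcript: str = ""
    phonemes: list = field(default_factory=list)
    acoustic: dict = field(default_factory=dict)
    coaching: str = ""

# Stubs in place of the real models.
def detect_speech(path):    return [(0.0, 1.2), (1.5, 3.0)]  # Silero VAD
def transcribe(path):       return "hello world"             # Whisper
def align_phonemes(path):   return ["HH", "AH", "L", "OW"]   # wav2vec2
def extract_features(path): return {"f0_mean_hz": 180.0}     # Parselmouth
def coach(analysis):        return "placeholder feedback"    # Claude

def run_pipeline(audio_path: str) -> Analysis:
    result = Analysis()
    result.segments = detect_speech(audio_path)
    result.transcript = transcribe(audio_path)
    result.phonemes = align_phonemes(audio_path)
    result.acoustic = extract_features(audio_path)
    result.coaching = coach(result)
    return result
```

Each stage writes into one shared result object, so dimension scorers downstream can draw on any combination of acoustic and linguistic evidence.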

The Outcome

Speakers get actionable, dimension-specific feedback — not "speak better" but "your pause-to-speech ratio is 0.12 (target: 0.20–0.25) — try inserting pauses after key points." Production-deployed at invoz.io.
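
The pause-to-speech ratio quoted above falls out directly from the VAD's speech spans. A minimal sketch, assuming non-overlapping `(start, end)` spans in seconds:

```python
def pause_to_speech_ratio(segments, total_duration):
    """Silence time divided by speech time, from VAD speech spans.

    segments: sorted, non-overlapping (start_s, end_s) speech spans.
    total_duration: clip length in seconds.
    """
    speech = sum(end - start for start, end in segments)
    pause = total_duration - speech
    return pause / speech if speech else float("inf")

# 8 s of speech in a 10 s clip -> ratio 0.25, inside the 0.20-0.25 target band
ratio = pause_to_speech_ratio([(0.0, 5.0), (6.0, 9.0)], 10.0)
```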

Key Features

  • 11-dimension scoring (7 acoustic + 4 linguistic)
  • Research-backed rubrics from 46 papers
  • Phoneme-level pronunciation analysis via wav2vec2
  • Acoustic feature extraction (pitch, jitter, shimmer) via Parselmouth
  • LLM-powered coaching with specific improvement suggestions
  • Real-time processing with streaming results
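
In the pipeline these perturbation measures come from Parselmouth/Praat; as an illustration of what they measure, here is a pure-Python sketch of the standard "local" jitter and shimmer formulas (mean absolute difference of consecutive cycles over the mean):

```python
# Illustrative formulas only -- the pipeline itself uses Parselmouth/Praat.
def local_jitter(periods_s):
    """Local jitter: mean |diff of consecutive glottal periods| / mean period."""
    diffs = [abs(a - b) for a, b in zip(periods_s, periods_s[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods_s) / len(periods_s))

def local_shimmer(amplitudes):
    """Local shimmer: the same formula applied to per-cycle peak amplitudes."""
    diffs = [abs(a - b) for a, b in zip(amplitudes, amplitudes[1:])]
    return (sum(diffs) / len(diffs)) / (sum(amplitudes) / len(amplitudes))
```

Higher jitter and shimmer indicate less stable phonation, which feeds the articulation-clarity dimension.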

Technology Stack

Python · FastAPI · Whisper · wav2vec2 · Parselmouth · Silero VAD · Claude API

Interested in this project?

I'd love to discuss the technical details, challenges overcome, or similar projects I could build for you.

View Live App · Let's discuss this project