Invoz — Audio ML Pipeline

Invoz is a production ML pipeline that analyzes spoken English across 11 scoring dimensions — 7 acoustic (pitch variability, speech rate, pause patterns, volume dynamics, filler words, articulation clarity, rhythm) and 4 linguistic (vocabulary richness, grammar accuracy, coherence, discourse markers). Built from 46 research papers in speech pathology and computational linguistics.
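
The dimension split above can be sketched as a simple data structure. This is an illustrative sketch only: the names and the flat-average aggregation are assumptions, not the production rubric.

```python
# Hypothetical layout of the 11 dimensions described above.
# Dimension names and the aggregation rule are illustrative.
ACOUSTIC = [
    "pitch_variability", "speech_rate", "pause_patterns", "volume_dynamics",
    "filler_words", "articulation_clarity", "rhythm",
]
LINGUISTIC = [
    "vocabulary_richness", "grammar_accuracy", "coherence", "discourse_markers",
]

def overall_score(scores: dict) -> float:
    """Average per-dimension scores (0-100) into one headline number."""
    dims = ACOUSTIC + LINGUISTIC
    return sum(scores[d] for d in dims) / len(dims)
```

In the real pipeline each dimension is scored against its own research-backed rubric; the flat average here just shows the shape of the output.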


The Problem

Non-native English speakers get generic feedback like "speak more clearly" with no specifics. Existing tools score pronunciation at the word level but miss the acoustic and linguistic patterns that actually make speech effective — rhythm, pause placement, vocabulary range, discourse structure.

The Solution

A multi-model pipeline: Whisper for transcription, wav2vec2 for phoneme-level analysis, Parselmouth for acoustic features (F0, jitter, shimmer), Silero VAD for precise speech/silence segmentation, and Claude for linguistic coaching. Each dimension is scored independently with research-backed rubrics.
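
The stage ordering can be sketched as plain orchestration code. Every stage function below is a stub standing in for the named model (Silero VAD, Whisper, wav2vec2, Parselmouth, Claude); none of this is the production implementation.

```python
# Illustrative orchestration of the stages named above, with stub stages.
from dataclasses import dataclass, field

@dataclass
class Analysis:
    segments: list = field(default_factory=list)   # speech/silence spans
    transcript: str = ""
    phonemes: list = field(default_factory=list)
    acoustic: dict = field(default_factory=dict)
    coaching: str = ""

# Stubs in place of the real models.
def detect_speech(path):    return [(0.0, 1.2), (1.5, 3.0)]  # Silero VAD
def transcribe(path):       return "hello world"             # Whisper
def align_phonemes(path):   return ["HH", "AH", "L", "OW"]   # wav2vec2
def extract_features(path): return {"f0_mean_hz": 180.0}     # Parselmouth
def coach(analysis):        return "placeholder feedback"    # Claude

def run_pipeline(audio_path: str) -> Analysis:
    result = Analysis()
    result.segments = detect_speech(audio_path)
    result.transcript = transcribe(audio_path)
    result.phonemes = align_phonemes(audio_path)
    result.acoustic = extract_features(audio_path)
    result.coaching = coach(result)
    return result
```

Each stage writes into one shared result object, so dimension scorers downstream can draw on any combination of acoustic and linguistic evidence.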

The Outcome

Speakers get actionable, dimension-specific feedback — not "speak better" but "your pause-to-speech ratio is 0.12 (target: 0.20–0.25) — try inserting pauses after key points." Production-deployed at invoz.io.
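
The pause-to-speech ratio quoted above falls out directly from the VAD's speech spans. A minimal sketch, assuming non-overlapping `(start, end)` spans in seconds:

```python
def pause_to_speech_ratio(segments, total_duration):
    """Silence time divided by speech time, from VAD speech spans.

    segments: sorted, non-overlapping (start_s, end_s) speech spans.
    total_duration: clip length in seconds.
    """
    speech = sum(end - start for start, end in segments)
    pause = total_duration - speech
    return pause / speech if speech else float("inf")

# 8 s of speech in a 10 s clip -> ratio 0.25, inside the 0.20-0.25 target band
ratio = pause_to_speech_ratio([(0.0, 5.0), (6.0, 9.0)], 10.0)
```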

Key Features

  • 11-dimension scoring (7 acoustic + 4 linguistic)
  • Research-backed rubrics from 46 papers
  • Phoneme-level pronunciation analysis via wav2vec2
  • Acoustic feature extraction (pitch, jitter, shimmer) via Parselmouth
  • LLM-powered coaching with specific improvement suggestions
  • Real-time processing with streaming results
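
In the pipeline these perturbation measures come from Parselmouth/Praat; as an illustration of what they measure, here is a pure-Python sketch of the standard "local" jitter and shimmer formulas (mean absolute difference of consecutive cycles over the mean):

```python
# Illustrative formulas only -- the pipeline itself uses Parselmouth/Praat.
def local_jitter(periods_s):
    """Local jitter: mean |diff of consecutive glottal periods| / mean period."""
    diffs = [abs(a - b) for a, b in zip(periods_s, periods_s[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods_s) / len(periods_s))

def local_shimmer(amplitudes):
    """Local shimmer: the same formula applied to per-cycle peak amplitudes."""
    diffs = [abs(a - b) for a, b in zip(amplitudes, amplitudes[1:])]
    return (sum(diffs) / len(diffs)) / (sum(amplitudes) / len(amplitudes))
```

Higher jitter and shimmer indicate less stable phonation, which feeds the articulation-clarity dimension.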

Technology Stack

Python · FastAPI · Whisper · wav2vec2 · Parselmouth · Silero VAD · Claude API

Interested in this project?

I'd love to discuss the technical details, challenges overcome, or similar projects I could build for you.

View Live App · Let's discuss this project