Paradigm Shifts in Speech Processing

From fragmented multi-stage hybrid pipelines to single-shot multimodal models

1985–1995

1995–2005

2005–2015

2015–2025

2025 onwards

Recognition

1985–1995: HMM; GMM; N-gram LMs
1995–2005: GMM-HMM; MFCC + N-gram; TDNN (early)
2005–2015: DNN-HMM; RNN / LSTM; Early end-to-end (CTC)
2015–2025: Transformer ASR (Conformer, Whisper); Self-supervised pretraining (wav2vec 2.0, HuBERT); Multilingual / multimodal models
2025 onwards: LLMs!

Synthesis

1985–1995: LPC; Formant synthesizers; Diphone concatenation
1995–2005: Unit selection; HMM parametric (SPSS); STRAIGHT vocoder
2005–2015: Statistical parametric (HMM-based); WORLD / early neural vocoders
2015–2025: WaveNet; Tacotron 2, VITS, VALL-E; Flow / diffusion models
2025 onwards: LLMs!

Translation

1985–1995: Rule-based MT; Word-based SMT
1995–2005: Phrase-based SMT; Alignment templates; Discriminative tuning (MERT/MIRA)
2005–2015: Neural MT (seq2seq RNNs); Attention (2015); Early end-to-end speech translation
2015–2025: Transformer NMT (Transformer, mBART, DeepL); SeamlessM4T, Whisper Translate; LLM-based multimodal translation
2025 onwards: LLMs!