Paradigm Shifts in Speech Processing

From fragmented multi-stage hybrid pipelines to single-shot multimodal models

1985–1995
1995–2005
2005–2015
2015–2025
2025 onwards
Recognition
  • HMM
  • GMM
  • N-gram LMs
  • GMM-HMM
  • MFCC + N-gram
  • TDNN (early)
  • DNN-HMM
  • RNN / LSTM
  • Early end-to-end (CTC)
  • Transformer ASR (Conformer, Whisper)
  • Self-supervised pretraining (wav2vec 2.0, HuBERT)
  • Multilingual / multimodal models
  • LLMs!
Synthesis
  • LPC
  • Formant synthesizers
  • Diphone concatenation
  • Unit selection
  • HMM parametric (SPSS)
  • STRAIGHT vocoder
  • Statistical parametric (HMM-based)
  • WORLD / early neural vocoders
  • WaveNet
  • Tacotron 2, VITS, VALL-E
  • Flow / diffusion models
  • LLMs!
Translation
  • Rule-based MT
  • Word-based SMT
  • Phrase-based SMT
  • Alignment templates
  • Discriminative tuning (MERT/MIRA)
  • Neural MT (seq2seq RNNs)
  • Attention (2015)
  • Early end-to-end speech translation
  • Transformer NMT (Transformer, mBART, DeepL)
  • SeamlessM4T, Whisper Translate
  • LLM-based multimodal translation
  • LLMs!