Paradigm Shifts in Speech Processing
From fragmented multi-stage hybrid pipelines to single-shot multimodal models
1985–1995
1995–2005
2005–2015
2015–2025
2025 onwards
Recognition
- GMM-HMM
- MFCC + N-gram
- TDNN (early)
- DNN-HMM
- RNN / LSTM
- Early end-to-end (CTC)
- Transformer ASR (Conformer, Whisper)
- Self-supervised pretraining (wav2vec 2.0, HuBERT)
- Multilingual / multimodal models
Synthesis
- LPC
- Formant synthesizers
- Diphone concatenation
- Unit selection
- HMM parametric (SPSS)
- STRAIGHT vocoder
- Statistical parametric (HMM-based)
- WORLD / early neural vocoders
- WaveNet
- Tacotron 2, VITS, VALL-E
- Flow / diffusion models
Translation
- Rule-based MT
- Word-based SMT
- Phrase-based SMT
- Alignment templates
- Discriminative tuning (MERT/MIRA)
- Neural MT (seq2seq RNNs)
- Attention (2015)
- Early end-to-end speech translation
- Transformer NMT (Transformer, mBART, DeepL)
- SeamlessM4T, Whisper Translate
- LLM-based multimodal translation