What a Welsh ASR Project from 2012 Can Teach Today’s Speech Translation Models
How a daily synthetic-data pipeline for Welsh ASR anticipated modern curriculum-learning speech-to-text translation systems.
Twelve years ago, we built a Welsh automatic speech recognition (ASR) system that did something rather unusual: it updated itself every day using synthetic speech. At the time, this was a pragmatic hack to cope with a language whose vocabulary changes faster than its audio data can be collected.
Today, new research in multilingual speech-to-text translation (S2TT) is using remarkably similar ideas — but applying them at global scale. Looking back, it’s striking how many of the modern techniques echo the Welsh pipeline we built years earlier.
This short retrospective looks at what we did, why it mattered, and how it connects to state-of-the-art multilingual models today.
The Problem We Were Solving
Welsh has a rare linguistic property: its spelling is almost perfectly phonetic. If you know how a Welsh word is spelled, you can, with very few exceptions, pronounce it correctly.
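Welsh's regular spelling-to-sound mapping can be sketched as a greedy, longest-match lookup over a small rule table. The table below is a deliberately tiny, simplified subset (real Welsh grapheme-to-phoneme conversion also handles vowel length, stress, and dialect variation), but it shows why rule-based conversion is feasible at all:

```python
# Toy sketch of rule-based Welsh grapheme-to-phoneme conversion.
# The rule table is a simplified subset, not a complete Welsh G2P.
RULES = {
    # Digraphs, which the Welsh alphabet treats as single letters
    "ch": "χ", "dd": "ð", "ff": "f", "ll": "ɬ", "th": "θ",
    # A few single letters whose IPA value differs from the spelling
    "c": "k", "f": "v",
}

def g2p(word: str) -> str:
    """Greedy longest-match transcription: try a digraph before a single letter."""
    phones, i = [], 0
    while i < len(word):
        if word[i:i + 2] in RULES:            # digraph rule wins
            phones.append(RULES[word[i:i + 2]])
            i += 2
        else:                                  # single letter, identity fallback
            phones.append(RULES.get(word[i], word[i]))
            i += 1
    return "".join(phones)

print(g2p("bach"))    # "ch" maps to a single phone
print(g2p("ffordd"))  # "ff" and "dd" each map to one phone
```

For languages with irregular orthographies (English above all), no table this small could work; for Welsh, a modest expansion of this idea gets remarkably far.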
But its vocabulary evolves quickly:
- new English loanwords enter daily;
- media, politics, and culture introduce new named entities;
- small corpus size makes Welsh a low-resource language.
To maintain good ASR performance, we needed to teach the model new words as soon as they appeared.
Natural speech recordings were too slow and too expensive. So we turned to synthetic data.
Our Approach: Daily Synthetic Vocabulary Expansion
Every day, we automatically scraped the BBC Welsh-language site for new terms. Then we:
1. Converted the new words to synthetic Welsh speech. Because spelling → pronunciation is deterministic, the speech quality was excellent.
2. Injected these synthetic samples into the ASR training pipeline. This kept the model's vocabulary fresh without needing any new human recordings.
3. Fine-tuned the model incrementally. The system continuously absorbed new words, place names, and contemporary terminology.
In effect, we built a continual learning ASR system, long before the term was widely used.
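That daily loop can be sketched in a few lines. Only the term-extraction step below is concrete; `synthesize()` and `fine_tune()` are hypothetical stand-ins for the TTS engine and the incremental trainer, not the original system's API:

```python
import re

def extract_new_terms(scraped_text: str, known_vocab: set) -> list:
    """Tokenise scraped text and keep only words the model hasn't seen."""
    tokens = set(re.findall(r"[a-zâêîôûŵŷàèìòù]+", scraped_text.lower()))
    return sorted(tokens - known_vocab)

def synthesize(word: str) -> bytes:
    """Placeholder for the Welsh TTS engine (returns fake audio bytes)."""
    return f"<audio:{word}>".encode()

def fine_tune(samples: list) -> int:
    """Placeholder for one incremental training pass; returns sample count."""
    return len(samples)

def daily_update(scraped_text: str, known_vocab: set) -> int:
    """One day's cycle: scrape → synthesise → fine-tune → grow the vocab."""
    new_terms = extract_new_terms(scraped_text, known_vocab)
    samples = [(synthesize(w), w) for w in new_terms]  # (audio, transcript)
    known_vocab.update(new_terms)                      # vocabulary stays fresh
    return fine_tune(samples)

vocab = {"yn", "y", "a"}
n = daily_update("Eisteddfod newydd yn y dref", vocab)
print(n)  # number of new words folded into training today
```

The design choice that made this cheap is the pairing in `samples`: because the transcript is generated alongside the audio, every synthetic clip arrives perfectly labelled, with no human annotation step at all.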
+--------------------------------------------------------------------------------+
| COMPARISON OF TWO PIPELINES (HIGH-LEVEL) |
+--------------------------------------------------------------------------------+
OUR WELSH ASR PIPELINE (c. 2012)             MODERN CURRICULUM S2TT (2024)
------------------------------------- -------------------------------------
[1] BASE ACOUSTIC MODEL [1] STAGE 1: ASR-ONLY TRAINING
- Train ASR on Welsh speech data - Train model to transcribe speech
- Leverage Welsh grapheme→phoneme rules - Builds strong speech understanding
| |
v v
[2] DAILY TEXT SCRAPING [2] STAGE 2: BILINGUAL S2TT
- Crawl BBC Cymru for new terms - Train on limited speech→text
- Expand vocabulary organically - Learn mapping between languages
| |
v v
[3] SYNTHETIC SPEECH GENERATION [3] SYNTHETIC SPEECH EXPANSION
- Use Welsh TTS to create audio for new words - Generate synthetic speech for
- High fidelity because spelling fully predicts under-resourced languages
pronunciation - Greatly expands training data
| |
v v
[4] CONTINUAL MODEL UPDATES [4] STAGE 3: MANY-TO-MANY S2TT
- Daily fine-tuning with fresh samples - Train model across many language
- Model stays current with cultural & lexical shifts pairs using synthetic + real speech
- Zero downtime vocabulary coverage - Universal, scalable translation
+--------------------------------------------------------------------------------+
| KEY SIMILARITIES & INSIGHTS |
+--------------------------------------------------------------------------------+
• Both pipelines use synthetic speech to overcome data scarcity
• Both expand vocabulary dynamically using fresh text data
• Both rely on continual or curriculum-style progression
• Welsh’s deterministic pronunciation gave near-perfect synthetic data
• The 2012 pipeline anticipated techniques now used for global S2TT systems
What Modern Research Is Doing Now
Fast forward to today, and a major line of research is trying to solve the same core problem:
How do we build robust speech translation systems when some languages have very little audio data?
Recent work on many-to-many speech-to-text translation proposes a solution that looks uncannily familiar. Modern pipelines are also:
- curriculum-based (train on simple tasks first);
- synthetic data-driven (generate speech where none exists);
- continually expanding (add data for hundreds of language pairs).
The difference is scale: they are building universal many-to-many translators, whereas we focused on one very special language. But the principles overlap almost perfectly.
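A staged curriculum of this kind can be expressed as data, not architecture. The sketch below mirrors the three stages in the diagram above; the stage names and data tags are illustrative placeholders, not drawn from any specific paper's codebase:

```python
# Hedged sketch of a staged training curriculum: each stage broadens the
# task, and later stages mix real and synthetic speech. Names are illustrative.
CURRICULUM = [
    ("stage1_asr",            "transcribe", ["real_speech"]),
    ("stage2_bilingual_s2tt", "translate",  ["real_parallel"]),
    ("stage3_many_to_many",   "translate",  ["real_parallel", "synthetic_speech"]),
]

def run_curriculum(train_step, curriculum=CURRICULUM):
    """Run each stage in order; train_step(stage, task, sources) does one pass."""
    completed = []
    for stage, task, sources in curriculum:
        train_step(stage, task, sources)
        completed.append(stage)
    return completed

# A no-op trainer, just to show the staged ordering.
order = run_curriculum(lambda stage, task, sources: None)
print(order)
```

Keeping the schedule as plain data makes the curriculum auditable and easy to extend, which matters once the list of language pairs runs into the hundreds.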
Why Welsh Was the Perfect Testbed
Welsh provided a unique advantage: synthetic speech generation was nearly lossless, because pronunciation is fully predictable from spelling.
For many languages — English, French, Chinese — this is not true. Modern systems must compensate with phoneme models, alignment heuristics, or multi-speaker text-to-speech.
Our pipeline succeeded because Welsh’s orthography removed all that complexity. For the purpose of teaching the model new vocabulary, the synthetic audio was effectively as useful as recorded speech.
In a sense, Welsh gave us a preview of what synthetic training pipelines could achieve before the rest of the world caught up.
Lessons for Today’s Speech Models
Looking back, two lessons stand out:
1. Synthetic speech is an under-used, high-leverage resource
If the goal is rapid vocabulary coverage or domain adaptation, synthetic data is hard to beat on cost and turnaround time.
2. Continual training matters as much as model architecture
A model that updates itself — even in small increments — often outperforms a larger model frozen in time.
Both of these principles now sit at the heart of modern multilingual S2TT research.
Conclusion
Our Welsh ASR project was built to solve a specific, practical problem: keeping up with a living, evolving language. But in hindsight, it aligned closely with the direction that speech technology has taken over the past decade.
Today’s universal translators are rediscovering the same ideas — synthetic data, incremental learning, and curriculum-like training — and applying them across hundreds of languages.
For us, the Welsh system wasn’t an experiment in future-proof machine learning. It was just common sense for a minority language with its own rhythm and pace. But its core ideas now underpin some of the most ambitious speech-AI systems being built today.