文档编号 // 43ECFD 在线

Unit-selection Speech Synthesis

2019-09-30

更新: 2026-04-25

2403 字符

这篇文章写于 2019，已经超过 7 年了。内容可能已经过时。

1. About Unit-selection Speech Synthesis

Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech computer or speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech.

And for Unit selection synthesis is using large databases of recorded speech. During database creation, each recorded utterance is segmented into some or all of the following: individual phones, diphones, half-phones, syllables, morphemes, words, phrases, and sentences.

2. Unit-selection Speech Synthesis

In Unit-selection Speech Synthesis, we use a large speech database which provides all syllable in. Like Japanese, Use those pronunciation:

あ	か	さ	た	な	あ	ま	や	ら	わ	が	ざ	だ	ば	ぱ
い	き	し	ち	に	ひ	み		り		ぎ	じ		び	ぴ
う	く	す	つ	ぬ	ふ	む	ゆ	る		ぐ	ず		ぶ	ぷ
え	け	せ	て	ね	へ	め		れ		げ	ぜ	で	べ	ぺ
お	こ	そ	と	の	ほ	も	よ	ろ	を	ご	ぞ	ど	ぼ	ぽ
	きゃ	しゃ	ちゃ	にゃ	ひゃ	みゃ		りゃ		ぎゃ	じゃ		びゃ	ぴゃ
	きゅ	しゅ	ちゅ	にゅ	ひゅ	みゅ		りゅ		ぎゅ	じゅ		びゅ	ぴゅ
	きょ	しょ	ちょ	にょ	ひょ	みょ		りょ		ぎょ	じょ		びょ	ぴょ
	きぇ	しぇ	ちぇ	にぇ	ひぇ	みぇ		りぇ		ぎぇ	じぇ		びぇ	ぴぇ
ふぁ	ふぃ	ふぇ	ふぉ
いぁ	うぃ	うぇ	うぉ
つぁ	つぃ	つぇ	つぉ
すぃ	てぃ	てゅ	とぅ
ずぃ	でぃ	でゅ	どぅ
ゔぁ	ゔぃ	ゔ	ゔぇ	ゔぉ

We need collect all pronounce of those words and record it. After that, we need add label for the $Initials$ and $finals$,than, do $prosodic\ parameter\ extraction$ and $spectral\ parameter\ extraction$. After that load into the Unit-selection Synthesis model which waiting for synthesis.

The Unit-selection Speech Synthesis Algorithm

$$ \begin{aligned} C^{c}\left(u_{i-1}, u_{i}\right) &=\sum_{k=1}^{q} w_{k}^{c} C_{k}^{c}\left(u_{i-1}, u_{i}\right) \\ C^{\prime}\left(t_{i}, u_{i}\right) &=\sum_{j=1}^{p} w_{j}^{\prime} C_{j}^{t}\left(t_{i}, u_{i}\right) \end{aligned} $$

$$ \hat{u}_{1}^{n}=\arg \min _{u^{i}}\left\{C\left(t_{1}^{n}, u_{1}^{n}\right)\right\} $$

$$ C\left(t_{1}^{n}, u_{1}^{n}\right)=\sum_{i=1}^{n} C^{\prime}\left(t_{i}, u_{i}\right)+\sum_{i=1}^{n} C^{c}\left(u_{i-1}, u_{i}\right) $$

$$ C^{c}\left(u_{i-1}, u_{i}\right) : \text { Concatenation cost } $$

$$ C^{t}\left(t_{i}, u_{i}\right) : \text { Target cost } $$

REFERENCES

https://en.wikipedia.org/wiki/Speech_synthesis
Expressive Prosody for Unit-selection Speech Synthesis

微信

支付宝

导航 // 相关文章

编辑 Libtorch，从入门到入坟（一）安装，HelloWorld

JUCE 使用内部字库显示中文等字体