HTS - HMM-based speech synthesis system

What is HTS?

The HTS official website describes it as follows:

Over the last ten years, the quality of speech synthesis has drastically improved with the rise of general corpus-based speech synthesis. Especially, state-of-the-art unit selection speech synthesis can generate natural-sounding high quality speech. However, for constructing human-like talking machines, speech synthesis systems are required to have an ability to generate speech with arbitrary speaker's voice characteristics, various speaking styles including native and non-native speaking styles in different languages, varying emphasis and focus, and/or emotional expressions; it is still difficult to have such flexibility with unit-selection synthesizers, since they need a large-scale speech corpus for each voice.

In recent years, a kind of statistical parametric speech synthesis based on hidden Markov models (HMMs) has been developed. The system has the following features:

Original speaker’s voice characteristics can easily be reproduced because all speech features including spectral, excitation, and duration parameters are modeled in a unified framework of HMM, and then generated from the trained HMMs themselves. Using a very small amount of adaptation speech data, voice characteristics can easily be modified by transforming HMM parameters by a speaker adaptation technique used in speech recognition systems. From these features, the HMM-based speech synthesis approach is expected to be useful for constructing speech synthesizers which can give us the flexibility we have in human voices.

In this tutorial, the system architecture is outlined, and then basic techniques used in the system, including algorithms for speech parameter generation from HMM, are described with simple examples. Relation to the unit selection approach, trajectory modeling, recent improvements, and evaluation methodologies are summarized. Techniques developed for increasing the flexibility and improving the speech quality are also reviewed.

Put simply, HTS is a corpus-based, statistical parametric speech synthesis system built on hidden Markov models (HMMs).

Hidden Markov Model (HMM)

A hidden Markov model (HMM) is a statistical model used to describe a Markov process with hidden, unknown parameters. The difficulty lies in determining the hidden parameters of the process from the observable ones; those parameters can then be used for further analysis, such as pattern recognition.

An HMM describes hidden, unknown parameters with a statistical model. A classic example: a friend in Tokyo decides each day's activity {walk in the park, go shopping, clean the room} based on that day's weather {rainy, sunny}. All I can see are her daily tweets: "Ah, I walked in the park the day before yesterday, went shopping yesterday, and cleaned my room today!" From those tweets I can infer the weather in Tokyo over those three days. In this example, the observable states are the activities and the hidden states are the weather.

Any HMM can be described by the following five-tuple:

obs: the observation sequence
states: the hidden states
start_p: the initial probabilities (of the hidden states)
trans_p: the transition probabilities (between hidden states)
emit_p: the emission probabilities (the probability that a hidden state produces a given observable state)
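As a concrete illustration, the five-tuple for the Tokyo weather example above can be written as plain Python data structures. The specific probability values below are my own assumptions for illustration; the example does not give numbers.

```python
# Five-tuple of the weather/activity HMM from the Tokyo example.
# All numeric probabilities are assumed for illustration.
obs = ("walk", "shop", "clean")            # what the tweets report
states = ("Rainy", "Sunny")                # the hidden weather
start_p = {"Rainy": 0.6, "Sunny": 0.4}     # initial weather distribution
trans_p = {                                # weather-to-weather transitions
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
emit_p = {                                 # P(activity | weather)
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}

# Sanity check: every probability row sums to 1.
rows = [start_p, *trans_p.values(), *emit_p.values()]
assert all(abs(sum(r.values()) - 1.0) < 1e-9 for r in rows)
```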


Here is a simple example:

Suppose I have three different dice. The first is an ordinary six-sided die (call it D6): each face (1, 2, 3, 4, 5, 6) comes up with probability 1/6. The second is a tetrahedron (call it D4): each face (1, 2, 3, 4) comes up with probability 1/4. The third has eight faces (call it D8): each face (1, 2, 3, 4, 5, 6, 7, 8) comes up with probability 1/8.


Suppose we start rolling. First we pick one of the three dice, each with probability 1/3; then we roll it and get a number between 1 and 8. Repeating this process yields a sequence of numbers, each between 1 and 8. For example, after ten rolls we might get: 1 6 3 5 2 7 3 5 2 4

This sequence is called the visible state chain. But in a hidden Markov model we have not only this visible state chain but also a hidden state chain. In this example, the hidden state chain is the sequence of dice used, for instance: D6 D8 D8 D6 D4 D8 D6 D6 D4 D8

Generally, the Markov chain referred to in an HMM is the hidden state chain, because there are transition probabilities between the hidden states (the dice). In our example, the die after a D6 is D4, D6, or D8, each with probability 1/3; likewise, after a D4 or a D8 each of the three dice follows with probability 1/3. This setup keeps the explanation simple, but the transition probabilities can in fact be set arbitrarily. For instance, we could define a new HMM in which a D6 can never be followed by a D4, and a D6 is followed by another D6 with probability 0.9 and by a D8 with probability 0.1.
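The modified transition probabilities just described can be written out directly as a table; a small sketch:

```python
# Transition table for the modified HMM described above:
# a D6 can never be followed by a D4; D6 -> D6 with 0.9, D6 -> D8 with 0.1.
# D4 and D8 keep the original uniform 1/3 transitions.
trans_p = {
    "D4": {"D4": 1 / 3, "D6": 1 / 3, "D8": 1 / 3},
    "D6": {"D4": 0.0, "D6": 0.9, "D8": 0.1},
    "D8": {"D4": 1 / 3, "D6": 1 / 3, "D8": 1 / 3},
}

# Each row is still a valid probability distribution.
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in trans_p.values())
```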

Similarly, although there are no transition probabilities between the visible states, there is a probability between each hidden state and each visible state, called the emission probability. In our example, the six-sided die (D6) emits a 1 with probability 1/6, and it emits 2, 3, 4, 5, and 6 each with probability 1/6 as well. The emission probabilities can also be defined differently: for example, a six-sided die that a casino has tampered with might roll a 1 with probability 1/2 and each of 2, 3, 4, 5, and 6 with probability 1/10.
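With transition and emission probabilities fixed, generating a visible chain together with its hidden chain takes only a few lines. The sketch below simulates the fair-dice setup from the example (uniform 1/3 transitions, uniform faces); the function name is my own.

```python
import random

# Simulate the dice HMM from the example: pick a die uniformly (1/3 each),
# emit a uniformly random face, then transition uniformly to the next die.
SIDES = {"D4": 4, "D6": 6, "D8": 8}

def roll_chain(n, seed=42):
    """Return (hidden_chain, visible_chain) for n rolls."""
    rng = random.Random(seed)
    hidden, visible = [], []
    die = rng.choice(list(SIDES))                   # initial die, prob 1/3 each
    for _ in range(n):
        hidden.append(die)
        visible.append(rng.randint(1, SIDES[die]))  # uniform face of that die
        die = rng.choice(list(SIDES))               # uniform 1/3 transition
    return hidden, visible

hidden, visible = roll_chain(10)
# Every visible number is a legal face of the die that produced it.
assert all(1 <= v <= SIDES[d] for d, v in zip(hidden, visible))
```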



For an HMM, if all the transition probabilities between hidden states and all the emission probabilities from hidden states to visible states are known in advance, simulation is quite easy. In practice, however, part of this information is usually missing: sometimes you know how many kinds of dice there are and what each die is, but not the sequence in which they were rolled; sometimes you only see the results of many rolls and know nothing else. Applying algorithms to estimate this missing information is therefore an important problem.

HTS modules

HTS architecture diagram:


Downloading and installing HTS

Software environment requirements:

Download, compile, and install.

HTS directory structure

The HTS directory

The Data directory

The label directory

References:
https://www.cnblogs.com/skyme/p/4651331.html
H. Zen et al., Recent development of the HMM-based speech synthesis system (HTS)
T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.
J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.
T. Yoshimura, T. Masuko, K. Tokuda, T. Kobayashi, and T. Kitamura. Speaker interpolation for HMM-based speech synthesis system. J. Acoust. Soc. Jpn. (E), 21(4):199–206, 2000.
K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.
K. Tokuda, H. Zen, J. Yamagishi, T. Masuko, S. Sako, T. Toda, A.W. Black, T. Nose, and K. Oura. The HMM-based speech synthesis system (HTS). http://hts.sp.nitech.ac.jp/.
R. Sproat, J. Hirschberg, and D. Yarowsky. A corpus-based synthesizer. In Proc. ICSLP, pages 563–566, 1992.

This work, "HTS Introduction and Installation," is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Based on the work at http://www.GloomyGhost.com/2018/12/08/HTSoverview.html.

