Speech Synthesis: Past and Present

  • Arman Kaliev Saint Petersburg National Research University of Information Technologies, Mechanics and Optics, 49, Kronverksky pr., 197101, Saint Petersburg, Russia https://orcid.org/0000-0001-8399-8379
  • Sergey V. Rybin Saint Petersburg National Research University of Information Technologies, Mechanics and Optics, 49, Kronverksky pr., 197101, Saint Petersburg, Russia http://orcid.org/0000-0002-9095-3168
Keywords: intonational speech synthesis, speech signals, emotional speech, Unit Selection, deep neural networks, prosody, acoustic parameters

Abstract

The article traces the development of speech synthesis methods from their beginnings to the present. It considers the main approaches that have played an important role in the evolution of speech synthesis, as well as modern state-of-the-art methods, and provides an extensive bibliography on the subject.

Author Biographies

Arman Kaliev, Saint Petersburg National Research University of Information Technologies, Mechanics and Optics, 49, Kronverksky pr., 197101, Saint Petersburg, Russia

Postgraduate student, kaliyev.arman@yandex.kz

Sergey V. Rybin, Saint Petersburg National Research University of Information Technologies, Mechanics and Optics, 49, Kronverksky pr., 197101, Saint Petersburg, Russia

PhD, Associate Professor, Department of Speech Information Systems, ITMO University, STC Ltd., svrybin@itmo.ru

References

B.-H. Juang and L. Rabiner, “Automatic Speech Recognition — A Brief History of the Technology Development,” UC Santa Barbara, 2004. [Online]. Available: http://www.ece.ucsb.edu/Faculty/Rabiner/ece259/Reprints/354_LALI-ASRHistory-1nal-10-8.pdf

C. G. Kratzenstein, “Sur la naissance de la formation des voyelles,” J. Phys., vol. 21, pp. 358–380, 1782.

H. Dudley and T. H. Tarnoczy, “The Speaking Machine of Wolfgang von Kempelen,” J. Acoust. Soc. Am., vol. 22, pp. 151–166, 1950.

C. Wheatstone, The Scientific Papers of Sir Charles Wheatstone, London: The Physical Society of London, 1879.

M. J. Jones and R.-A. Knight, eds., The Bloomsbury companion to phonetics, London: A&C Black, 2013.

A. I. Solomennik, “Tekhnologiya sinteza rechi v istoriko-metodologicheskom aspekte” [Technology speech synthesis in the historical and methodological aspect], Speech Technology, no. 1, pp. 42–57, 2013 (in Russian).

H. von Helmholtz and A. J. Ellis, On the Sensations of Tone as a Physiological Basis for the Theory of Music, London: Longmans, Green and Company, 1875, p. 576.

H. Fletcher, “The nature of speech and its interpretation,” The Bell System Technical Journal, vol. 1, no. 1, pp. 129–144, 1922.

H. Dudley, “The Vocoder,” Bell Labs Record, vol. 17, pp. 122–126, 1939.

H. Dudley, R. R. Riesz, and S. A. Watkins, “A Synthetic Speaker,” J. Franklin Institute, vol. 227, pp. 739–764, 1939.

R. Hoffmann, P. Birkholz, F. Gabriel, and R. Jäckel, “From Kratzenstein to the Soviet Vocoder: Some Results of a Historic Research Project in Speech Technology,” in International Conference on Speech and Computer (SPECOM 2018), Springer, Cham, 2018, pp. 215–225; doi: 10.1007/978-3-319-99579-3_23

R. Hoffmann, “Zur Entwicklung des Vocoders in Deutschland” [On the development of the vocoder in Germany], Jahrestagung für Akustik, DAGA, pp. 149–150, 2011 (in German).

R. Hoffmann and G. Gramm, “The Sennheiser vocoder goes digital: On a German R&D project in the 1970s,” in 2nd International Workshop on the History of Speech Communication Research (HSCR 2017), 2017, pp. 35–44.

A. Solzhenitsyn, The First Circle, Moscow: INCOM NV, 1991.

M. R. Schroeder, Computer speech: recognition, compression, synthesis, Springer Science & Business Media, vol. 35, 2013.

D. Tompkins, How to Wreck a Nice Beach: The Vocoder from World War II to Hip-Hop, The Machine Speaks, Melville House, 2011; doi: 10.12801/1947-5403.2012.04.02.04

“N. V. Kotel’nikova ob ottse” [N. V. Kotelnikova about her father], in Kotelnikov, Sud’ba, okhvativshaya vek, N. V. Kotel’nikova and A. S. Prohorov, eds., vol. 2, Moscow: Fizmatlit, 2011 (in Russian).

K. F. Kolachev, V kruge tret’em. Vospominaniya i razmyshleniya o rabote Marfinskoi laboratorii v 1948–1951 godakh [In the third circle. Memoirs and reflections on the work of the Marfino laboratory in 1948–1951], Moscow, 1999 (in Russian).

M. R. Schroeder and E. E. David, “A vocoder for transmitting 10 kc/s speech over a 3.5 kc/s channel,” Acta Acustica united with Acustica, vol. 10, no. 1, pp. 35–43, 1960.

W. A. Munson and H. C. Montgomery, “A speech analyzer and synthesizer,” The Journal of the Acoustical Society of America, vol. 22, no. 5, pp. 678–678, 1950.

M. A. Sapozhkov, Rechevoi signal v kibernetike i svyazi [Speech signal in cybernetics and communication], Moscow: Svyaz’izdat, 1963.

W. Koenig, H. K. Dunn, and L. Y. Lacy, “The sound spectrograph,” The Journal of the Acoustical Society of America, vol. 18, no. 1, pp. 19–49, 1946.

F. S. Cooper, A. M. Liberman, and J. M. Borst, “The interconversion of audible and visible patterns as a basis for research in the perception of speech,” Proceedings of the National Academy of Sciences of the United States of America, vol. 37, no. 5, p. 318, 1951.

R. W. Young, “Review of U.S. Patent 2,432,321, Translation of Visual Symbols, R. K. Potter, assignor (9 December 1947),” The Journal of the Acoustical Society of America, vol. 20, no. 6, pp. 888–889, 1948; doi: 10.1121/1.1906454

R. W. Sproat and J. P. Olive, “Text-to-Speech Synthesis,” AT&T Technical Journal, vol. 74, no. 2, pp. 35–44, 1995.

D. H. Klatt, “Review of text-to-speech conversion for English,” The Journal of the Acoustical Society of America, vol. 82, no. 3, pp. 737–793, 1987.

J. Goldsmith and B. Laks, “Generative phonology: its origins, its principles, and its successors,” The Cambridge History of Linguistics, Cambridge: Cambridge University Press, 2011.

J. Mullennix, Computer Synthesized Speech Technologies: Tools for Aiding Impairment, IGI Global, 2010.

D. Suendermann, H. Höge, and A. Black, “Challenges in speech synthesis,” Speech Technology, Boston: Springer, pp. 19–32, 2010.

D. G. Stork, HAL’s Legacy: 2001’s Computer as Dream and Reality, MIT Press, 1997.

A. W. Black and K. A. Lenzo, “Building synthetic voices,” Language Technologies Institute, Carnegie Mellon University and Cepstral LLC., vol. 4, no. 2, p. 62, 2003.

B. M. Lobanov and L. I. Tsirul’nik, Komp’yuternyi sintez i klonirovanie rechi [Computer synthesis and speech cloning], Minsk: Belarusian Science, 2008.

F. Charpentier and M. Stella, “Diphone synthesis using an overlap-add technique for speech waveforms concatenation,” in ICASSP ’86. IEEE International Conference on Acoustics, Speech, and Signal Processing, Tokyo, Japan, 1986, pp. 2015–2018; doi: 10.1109/ICASSP.1986.1168657

P. Taylor, Text-to-Speech Synthesis, Cambridge University Press, 2009.

N. Campbell and A.W. Black, “Prosody and the selection of source units for concatenative synthesis,” Progress in speech synthesis, New York: Springer, pp. 279–292, 1997.

Y. Sagisaka, N. Kaiki, N. Iwahashi, and K. Mimura, “ATR ν-Talk Speech Synthesis System,” Proc. ICSLP, pp. 483–486, 1992.

A. Hunt and A. Black, “Unit selection in a concatenative speech synthesis system using a large speech database,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1996, pp. 373–376.

M. Ostendorf and I. Bulyko, “The impact of speech recognition on speech synthesis,” in Proc. IEEE Workshop on Speech Synthesis, Santa Monica, CA, 2002, pp. 99–106.

J. Hirschberg, “Pitch accent in context: predicting intonational prominence from text,” Artificial Intelligence, vol. 63, no. 1-2, pp. 305–340, 1993.

M. Q. Wang and J. Hirschberg, “Automatic classification of intonational phrase boundaries,” Computer Speech & Language, vol. 6, no. 2, pp. 175–196, 1992.

K. Ross and M. Ostendorf, “Prediction of abstract prosodic labels for speech synthesis,” Computer Speech & Language, vol. 10, no. 3, pp. 155–185, 1996.

P. Taylor and A. W. Black, “Assigning phrase breaks from part-of-speech sequences,” Computer Speech & Language, vol. 12, no. 2, pp. 99–117, 1998.

C. S. Fordyce and M. Ostendorf, “Prosody prediction for speech synthesis using transformational rule-based learning,” in Proc. 5th Int. Conf. on Spoken Language Processing, (ICSLP), Sydney, Australia, 1998.

K. E. A. Silverman, “On customizing prosody in speech synthesis: Names and addresses as a case in point,” in Proc. of the workshop on Human Language Technology. Association for Computational Linguistics, 1993, pp. 317–322.

M. Narendranath, H. A. Murthy, S. Rajendran, and B. Yegnanarayana, “Transformation of formants for voice conversion using artificial neural networks,” Speech Communication, vol. 16, no. 2, pp. 207–216, 1995.

T. Watanabe, T. Murakami, M. Namba, T. Hoya, and Y. Ishida, “Transformation of spectral envelope for voice conversion based on radial basis function networks,” in Proc. 7th International Conference on Spoken Language Processing (ICSLP), Denver, USA, 2002.

O. Karaali, “Speech synthesis with neural networks,” in Proc. of the World Congress on Neural Networks (WCNN’96), San Diego, USA, 1996, pp. 45–50.

H. Zen, A. Senior, and M. Schuster, “Statistical parametric speech synthesis using deep neural networks,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada, 2013, pp. 7962–7966; doi: 10.1109/ICASSP.2013.6639215

R. E. Donovan and P. C. Woodland, “Improvements in an HMM-based speech synthesizer,” in Proc. 4th European Conf. on Speech Communication and Technology (ESCA), Madrid, Spain, 1995.

N. Campbell and A.W. Black, “Prosody and the selection of source units for concatenative synthesis,” in Progress in speech synthesis, J. P. H. van Santen, J. P. Olive, R. W. Sproat, J. Hirschberg, eds., New York, NY: Springer, 1997, pp. 279–292; doi: 10.1007/978-1-4612-1894-4_22

Y. Jiang, X. Zhou, C. Ding, Y.-J. Hu, Z.-H. Ling, and L.-R. Dai, “The USTC system for Blizzard Challenge 2018,” Blizzard Challenge Workshop, 2018.

T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, “Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis,” in Sixth European Conf. on Speech Communication and Technology (EUROSPEECH’99), Budapest, Hungary, 1999.

Z. H. Ling and R. H. Wang, “HMM-based unit selection using frame sized speech segments,” in Ninth Int. Conf. on Spoken Language Processing (INTERSPEECH 2006 — ICSLP), Pittsburgh, PA, 2006.

A. W. Black, “CLUSTERGEN: A statistical parametric synthesizer using trajectory modeling,” in Ninth Int. Conf. on Spoken Language Processing (INTERSPEECH 2006 — ICSLP), Pittsburgh, PA, 2006.

H. Zen, T. Toda, M. Nakamura, and K. Tokuda, “Details of the Nitech HMM-based speech synthesis system for the Blizzard Challenge 2005,” IEICE Transactions on Information and Systems, vol. 90, no. 1, pp. 325–333, 2007.

K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, “Speech parameter generation algorithms for HMM-based speech synthesis,” in 2000 IEEE Int. Conf. on Acoustics, Speech, and Signal Proc., 2000, vol. 3, pp. 1315–1318; doi: 10.1109/ICASSP.2000.861820

M. Tamura, T. Masuko, K. Tokuda, and T. Kobayashi, “Adaptation of pitch and spectrum for HMM-based speech synthesis using MLLR,” in 2001 IEEE Int. Conf. on Acoustics, Speech, and Signal Proc., Salt Lake City, UT, 2001, vol. 2, pp. 805–808; doi: 10.1109/ICASSP.2001.941037

T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, “Speaker interpolation in HMM-based speech synthesis system,” in 5th European Conf. on Speech Communication and Technology (EUROSPEECH ’97), Rhodes, Greece, 1997.

K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, “Eigenvoices for HMM-based speech synthesis,” in 7th Int. Conf. on Spoken Language Proc., (INTERSPEECH 2002), Denver, Colorado, 2002, pp. 1269–1272.

T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi, “A style control technique for HMM-based expressive speech synthesis,” IEICE Transactions on Information and Systems, vol. 90, no. 9, pp. 1406–1413, 2007.

Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura, “Miniaturization of HMM-based speech synthesis,” in Autumn Meeting of ASJ, 2004, pp. 325–326.

S. J. Kim, J. J. Kim, and M. Hahn, “HMM-based Korean speech synthesis system for handheld devices,” IEEE Transactions on Consumer Electronics, vol. 52, no. 4, pp. 1384–1390, 2006; doi: 10.1109/TCE.2006.273160

A. Gutkin, X. Gonzalvo, S. Breuer, and P. Taylor, “Quantized HMMs for low footprint text-to-speech synthesis,” in Eleventh Annu. Conf. of the Int. Speech Communication Association, Makuhari, Chiba, Japan, 2010, pp. 837–840.

J. Yamagishi, T. Nose, H. Zen, Z.-H. Ling, T. Toda, K. Tokuda, S. King, and S. Renals, “Robust speaker-adaptive HMM-based text-to-speech synthesis,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 6, pp. 1208–1230, 2009; doi: 10.1109/TASL.2009.2016394

M. Tachibana, S. Izawa, T. Nose, and T. Kobayashi, “Speaker and style adaptation using average voice model for style control in HMM-based speech synthesis,” in 2008 IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Las Vegas, NV, 2008, pp. 4633–4636; doi: 10.1109/ICASSP.2008.4518689

T. Nose, M. Tachibana, and T. Kobayashi, “HMM-based style control for expressive speech synthesis with arbitrary speaker’s voice using model adaptation,” IEICE Transactions on Information and Systems, vol. 92, no. 3, pp. 489–497, 2009; doi: 10.1587/transinf.E92.D.489

H. Zen, K. Tokuda, and A.W. Black, “Statistical parametric speech synthesis,” Speech communication, vol. 51, no. 11, pp. 1039–1064, 2009; doi: 10.1016/j.specom.2009.04.004

Z.-H. Ling, S.-Y. Kang, H. Zen, A. Senior, M. Schuster, X.-J. Qian, H. Meng, L. Deng, “Deep learning for acoustic modeling in parametric speech generation: A systematic review of existing techniques and future trends,” IEEE Signal Processing Magazine, vol. 32, no. 3, pp. 35–52, 2015; doi: 10.1109/MSP.2014.2359987

J. Dean et al., “Large scale distributed deep networks,” in Advances in Neural Information Processing Systems (NIPS 2012), Lake Tahoe, NV, 2012, pp. 1223–1231.

D. Yu, L. Deng, and G. Dahl, “Roles of pre-training and fine-tuning in context-dependent DBN-HMMs for real-world speech recognition,” Proc. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2010.

G. Dahl, D. Yu, L. Deng, and A. Acero, “Large vocabulary continuous speech recognition with context-dependent DBN-HMMs,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP 2011), Prague, Czech Republic, 2011, pp. 4688–4691.

G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on audio, speech, and language processing, vol. 20, no. 1, pp. 30–42, 2012; doi: 10.1109/TASL.2011.2134090

A. Mohamed, G. E. Dahl, and G. Hinton, “Acoustic modeling using deep belief networks,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 14–22, 2012; doi: 10.1109/TASL.2011.2109382

T. N. Sainath, B. Kingsbury, H. Soltau, and B. Ramabhadran, “Optimization techniques to improve training speed of deep neural networks for large speech tasks,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 11, pp. 2267–2276, 2013; doi: 10.1109/TASL.2013.2284378

L. Deng, M. L. Seltzer, D. Yu, A. Acero, A. R. Mohamed, and G. Hinton, “Binary coding of speech spectrograms using a deep auto-encoder,” in Eleventh Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, 2010, pp. 1692–1695.

X. L. Zhang and J. Wu, “Deep belief networks based voice activity detection,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 4, pp. 697–710, 2013.

Z. H. Ling, L. Deng, and D. Yu, “Modeling spectral envelopes using restricted Boltzmann machines for statistical parametric speech synthesis,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 2013, pp. 7825–7829; doi: 10.1109/ICASSP.2013.6639187

Z. H. Ling, L. Deng, and D. Yu, “Modeling spectral envelopes using restricted Boltzmann machines and deep belief networks for statistical parametric speech synthesis,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2129–2139, 2013; doi: 10.1109/TASL.2013.2269291

S. Kang, X. Qian, and H. Meng, “Multi-distribution deep belief network for speech synthesis,” in 2013 IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 2013, pp. 8012–8016; doi: 10.1109/ICASSP.2013.6639225

R. Fernandez, A. Rendel, B. Ramabhadran, and R. Hoory, “F0 contour prediction with a deep belief network-Gaussian process hybrid model,” in 2013 IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 2013, pp. 6885–6889; doi: 10.1109/ICASSP.2013.6638996

H. Lu, S. King, and O. Watts, “Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis,” in Eighth ISCA Workshop on Speech Synthesis, Barcelona, Catalonia, Spain, 2013.

L.-H. Chen, Z.-H. Ling, Y. Song, and L.-R. Dai, “Joint spectral distribution modeling using restricted Boltzmann machines for voice conversion,” in 14th Annu. Conf. of the Int. Speech Communication Association (INTERSPEECH 2013), Lyon, France, 2013, pp. 3052–3056.

T. Nakashika, R. Takashima, T. Takiguchi, and Y. Ariki, “Voice conversion in high-order eigen space using deep belief nets,” in 14th Annu. Conf. of the Int. Speech Communication Association (INTERSPEECH 2013), Lyon, France, 2013, pp. 369–372.

Z. Wu, E. S. Chng, and H. Li, “Conditional restricted Boltzmann machine for voice conversion,” in 2013 IEEE China Summit and Int. Conf. on Signal and Information Processing, Beijing, China, 2013, pp. 104–108; doi: 10.1109/ChinaSIP.2013.6625307

X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Speech enhancement based on deep denoising autoencoder,” in 14th Annu. Conf. of the Int. Speech Communication Association (INTERSPEECH 2013), Lyon, France, 2013, pp. 436–440.

B. Xia and C. Bao, “Speech enhancement with weighted denoising auto-encoder,” in 14th Annu. Conf. of the Int. Speech Communication Association (INTERSPEECH 2013), Lyon, France, 2013, pp. 3444–3448.

Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “An experimental study on speech enhancement based on deep neural networks,” IEEE Signal processing letters, vol. 21, no. 1, pp. 65–68, 2014; doi: 10.1109/LSP.2013.2291240

H. Zen and H. Sak, “Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis,” in 2015 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP 2015), Brisbane, QLD, Australia, 2015, pp. 4470–4474; doi: 10.1109/ICASSP.2015.7178816

H. Zen, Y. Agiomyrgiannakis, N. Egberts, F. Henderson, and P. Szczepaniak, “Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices,” arXiv preprint, arXiv:1606.06061, 2016.

Y. Saito, S. Takamichi, and H. Saruwatari, “Statistical parametric speech synthesis incorporating generative adversarial networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 1, pp. 84–96, 2018; doi: 10.1109/TASLP.2017.2761547

B. Bollepalli, L. Juvela, and P. Alku, “Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis,” in Proc. INTERSPEECH, Stockholm, Sweden, Aug. 2017, pp. 3394–3398.

B. Liu, S. Nie, Y. Zhang, D. Ke, S. Liang, and W. Liu, “Boosting noise robustness of acoustic model via deep adversarial training,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, 2018, pp. 5034–5038; doi: 10.1109/ICASSP.2018.8462093

J. Han, Z. Zhang, Z. Ren, F. Ringeval, and B. Schuller, “Towards conditional adversarial training for predicting emotions from speech,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, 2018, pp. 6822–6826; doi: 10.1109/ICASSP.2018.8462579

A. van den Oord et al., “WaveNet: A generative model for raw audio,” in 9th ISCA Speech Synthesis Workshop (SSW9), 2016, p. 125.

S. Ö. Arik et al., “Deep Voice: Real-time neural text-to-speech,” in Proc. of the 34th Int. Conf. on Machine Learning (ICML), Sydney, Australia, 2017, pp. 195–204.

Y. Wang et al., “Tacotron: Towards end-to-end speech synthesis,” arXiv preprint arXiv:1703.10135, 2017.

J. Shen et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, 2018, pp. 4779–4783; doi: 10.1109/ICASSP.2018.8461368

Y. Yasuda, X. Wang, S. Takaki, and J. Yamagishi, “Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019, pp. 6905–6909.

M. Schröder, “Emotional speech synthesis: A review,” in Proc. 7th European Conf. on Speech Communication and Technology, Aalborg, Denmark, 2001, pp. 561–564.

D. Govind and S. R. M. Prasanna, “Expressive speech synthesis: a review,” International Journal of Speech Technology, vol. 16, no. 2, pp. 237–260, 2013.

J. E. Cahn, “The generation of affect in synthesized speech,” Journal of the American Voice I/O Society, vol. 8, no. 1, pp. 1–19, 1989.

F. Burkhardt and W. F. Sendlmeier, “Verification of acoustical correlates of emotional speech using formant-synthesis,” in ISCA Tutorial and Research Workshop on Speech and Emotion, Newcastle, Northern Ireland, UK, 2000, pp. 151–156.

M. Schröder, “Expressive speech synthesis: Past, present, and possible futures,” in Affective Information Processing, J. Tao, T. Tan, Eds., London: Springer, 2009, pp. 111–126; doi: 10.1007/978-1-84800-306-4_7

M. Schröder, “Can emotions be synthesized without controlling voice quality,” Phonus, vol. 4, pp. 35–50, 1999.

A. Iida and N. Campbell, “Speech database design for a concatenative text-to-speech synthesis system for individuals with communication disorders,” International Journal of Speech Technology, vol. 6, no. 4, pp. 379–392, 2003; doi: 10.1023/A:1025761017833

W. L. Johnson, S. S. Narayanan, R. Whitney, R. Das, M. Bulut, and C. LaBore, “Limited domain synthesis of expressive military speech for animated characters,” in Proc. of 2002 IEEE Workshop on Speech Synthesis, Santa Monica, CA, 2002, pp. 163–166; doi: 10.1109/WSS.2002.1224399

J. F. Pitrelli, R. Bakis, E. M. Eide, R. Fernandez, W. Hamza, and M. A. Picheny, “The IBM expressive text-to-speech synthesis system for American English,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1099–1108, 2006; doi: 10.1109/TASL.2006.876123

W. Hamza, E. Eide, R. Bakis, M. Picheny, and J. Pitrelli, “The IBM expressive speech synthesis system,” in 8th Int. Conf. on Spoken Language Processing (INTERSPEECH 2004), Jeju Island, Korea, 2004, pp. 2577–2580.

Y. Zhao, D. Peng, L. Wang, M. Chu, Y. Chen, P. Yu, and J. Guo, “Constructing stylistic synthesis databases from audio books,” in 9th Int. Conf. on Spoken Language Processing (INTERSPEECH 2006), Pittsburgh, PA, 2006.

K. Prahallad, A. R. Toth, and A. W. Black, “Automatic building of synthetic voices from large multiparagraph speech databases,” in Proc. 8th Annu. Conf. of the International Speech Communication Association, Antwerp, Belgium, 2007.

N. Braunschweiler, M. J. F. Gales, and S. Buchholz, “Lightly supervised recognition for automatic alignment of large coherent speech recordings,” in Proc. 11th Annu. Conf. of the International Speech Communication Association, Makuhari, Japan, 2010.

F. Eyben, S. Buchholz, N. Braunschweiler et al., “Unsupervised clustering of emotion and voice styles for expressive TTS,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 2012, pp. 4009–4012; doi: 10.1109/ICASSP.2012.6288797

E. Szekely, J. P. Cabral, P. Cahill, and J. Carson-Berndsen, “Clustering expressive speech styles in audiobooks using glottal source parameters,” in 12th Annu. Conf. of the Int. Speech Communication Association, Florence, Italy, 2011, pp. 2409–2412.

Published
2019-03-28
How to Cite
Kaliev, A., & Rybin, S. V. (2019). Speech Synthesis: Past and Present. Computer Tools in Education, (1), 5-28. https://doi.org/10.32603/2071-2340-2019-1-5-28
Section
Computer science