Music Information Retrieval: Modern Tasks and Technologies

  • Kirill Igorevich Abrosimov, ITMO University, 49 Kronverksky Pr., bldg. A, 197101, Saint Petersburg, Russia, http://orcid.org/0000-0001-9262-0474
  • Sergei Vitalievich Rybin, ITMO University, 49 Kronverksky Pr., bldg. A, 197101, Saint Petersburg, Russia, http://orcid.org/0000-0002-9095-3168
Keywords: computational musicology, music information retrieval, music generation, automatic music transcription, musical instrument sound synthesis, music search, singing voice synthesis.

Abstract

The paper surveys Music Information Retrieval, an actively developing area of computational musicology. It describes several core tasks and technologies of this field, such as music generation, automatic music transcription, musical instrument sound synthesis, and music search. Particular attention is paid to one of the most interesting problems at the intersection of speech and music technologies: singing voice synthesis. Various approaches to this task, the existing challenges, and methods for addressing them are discussed.

Author Biographies

Kirill Igorevich Abrosimov, ITMO University, 49 Kronverksky Pr., bldg. A, 197101, Saint Petersburg, Russia

Master's student, Faculty of Information Technologies and Programming, ITMO University, abrosimov.kirill.1999@mail.ru

Sergei Vitalievich Rybin, ITMO University, 49 Kronverksky Pr., bldg. A, 197101, Saint Petersburg, Russia

Candidate of Physical and Mathematical Sciences, Associate Professor, Faculty of Information Technologies and Programming, ITMO University; Faculty of Computer Science and Technology, Saint Petersburg Electrotechnical University "LETI", svrybin@itmo.ru

References

A. Volk, F. Wiering, and P. van Kranenburg, “Unfolding the potential of computational musicology,” in Proc. of the 13th Int. Conf. on Informatics and Semiotics in Organisations, Leeuwarden, The Netherlands, July 4–6, 2011, pp. 137–144.

M. Müller, Fundamentals of Music Processing: Audio, Analysis, Algorithms, Applications, Berlin: Springer, 2015; doi:10.1007/978-3-319-21945-5

M. Sh. Bronfeld, The Introduction to Musicology: Textbook for students and teachers of higher musical educational institutions, Saint Petersburg, Russia: Planeta Muziki, 2022 (in Russian).

A. V. Gladkiy and I. A. Melchuk, Elements of Mathematical Linguistics, Moscow: Nauka, 1969 (in Russian).

R. G. Piotrovsky, K. B. Bektaev, and A. A. Piotrovskaya, Mathematical Linguistics: Textbook for pedagogical institutes, Moscow: Vysshaia shkola, 1977 (in Russian).

M. K. Timofeeva, Introduction to Mathematical Linguistics: Practicum, Novosibirsk, Russia: Novosibirsk, 2018 (in Russian).

E. I. Bolshakova, E. S. Klyshinsky, D. V. Lande, A. A. Noskov, O. V. Peskova, and E. V. Yagunova, Automatic natural language processing and computational linguistics: textbook, Moscow: HSE MIEM, 2011 (in Russian).

L. S. Lomakin and A. S. Surkov, Information technologies for analysis and modeling of text structures, Voronezh: Scientific Book Publ., 2015 (in Russian).

A. Balakrishnan, DeepPlaylist: Using Recurrent Neural Networks to Predict Song Similarity, Stanford, CA, USA: Stanford University, 2016.

Z. Rafii, A. Liutkus, F. R. Stöter, S. I. Mimilakis, D. FitzGerald, and B. Pardo, “An Overview of Lead and Accompaniment Separation in Music,” in arXiv, [Online], preprint arXiv:1804.08300, 2018.

R. G. Lyons, Understanding Digital Signal Processing, Moscow: Binom, 2015 (in Russian).

A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, 3rd ed., Moscow: Technosfera, 2012 (in Russian).

A. Klapuri and M. Davy, Signal Processing Methods for Music Transcription, New York, USA: Springer Science and Business Media, 2006.

P. Taylor, Text-to-speech synthesis, Cambridge, England: Cambridge university press, 2009.

D. Guennec, Study of Unit Selection Text-To-Speech Synthesis Algorithms, [PhD diss.], Université Rennes 1, Rennes, Brittany, France, 2017.

M. B. Stolbov, Fundamentals of analysis and processing of speech signals, [Textbook], St. Petersburg: ITMO University, 2021 (in Russian).

V. Rybin and A. Kaliev, “Speech Synthesis: Past and Present,” Computer Tools in Education, no. 1, pp. 5–28, 2019 (in Russian); doi:10.32603/2071-2340-2019-1-5-28

E. R. Miranda and J. Biles, Evolutionary Computer Music, London: Springer Science and Business Media, 2007; doi:10.1007/978-1-84628-600-1

M. Fingerhut, “Music Information Retrieval, or how to search for (and maybe find) music and do away with incipits,” in Proc. of IAML-IASA Congress, Oslo, Norway, August 8–13, 2004, p. 17.

M. Good, “MusicXML: An Internet-Friendly Format for Sheet Music,” in Proc. of XML 2001, Boston, USA, December 9–14, 2001, pp. 03–04.

D. Huber, The MIDI Manual: A Practical Guide to MIDI within Modern Music Production (Audio Engineering Society Presents), 4th ed., Oxfordshire, England: Routledge, 2020.

I. Oppenheim, “The ABC Music standard 2.0,” in abc.sourceforge.net, 21 Feb. 2008, [Online]. Available: https://abc.sourceforge.net/standard/abc2-draft.html

K. Blum, “OOoLilyPond: Creating musical snippets in LibreOffice documents,” in github.com 2017. [Online]. Available: https://github.com/openlilylib/LO-ly

H. Davis and S. M. Mohammad, “Generating Music from Literature,” in Proc. of the 3rd Workshop on Computational Linguistics for Literature (CLFL), Gothenburg, Sweden, 2014, pp. 1–10; doi:10.3115/v1/W14-0901

N. B. Zubareva and P. A. Kulichkin, Secrets of music and mathematical modeling: Algebra or harmony?.. Harmony and algebra!, Moscow: URSS, 2022 (in Russian).

M. Toro, C. Rueda, C. Agón, and G. Assayag, “Gelisp: a framework to represent musical constraint satisfaction problems and search strategies,” J. of Theoretical and Applied Information Technology, vol. 86, no. 2, pp. 327–331, 2016.

D. Quick and P. Hudak, “Grammar-based automated music composition in Haskell,” in Proc. ACM SIGPLAN Workshop on Functional Art, Music, Modeling and Design, FARM’13, pp. 59–70, 2013.

H. V. Koops, J. P. Magalhães, and W. B. de Haas, “A functional approach to automatic melody harmonisation,” in Proc. ACM SIGPLAN Workshop on Functional Art, Music, Modeling and Design, FARM’13, pp. 47–58, 2013.

N. dos S. Cunha, A. Subramanian, and D. Herremans, “Generating guitar solos by integer programming,” Journal of the Operational Research Society, vol. 69, no. 6, pp. 971–985, 2017; doi:10.1080/01605682.2017.1390528

J. A. Biles, “GenJam: A Genetic Algorithm for Generating Jazz Solos,” in Proc. International Computer Music Conference (ICMC), 1994, pp. 131–137.

C. Fox, “Genetic Hierarchical Music Structure,” in Proc. of the Nineteenth International Florida Artificial Intelligence Research Society Conference, Melbourne Beach, Florida, USA, May 11–13, 2006, Washington, DC, USA: AAAI Press, pp. 243–247.

S. B. Silas, Algorithmic Composition and Reductionist Analysis: Can a Machine Compose?, Cambridge, England: Cambridge University New Music Society, 1997.

J. D. Fernández and F. Vico, “AI Methods in Algorithmic Composition: A Comprehensive Survey,” J. of Artificial Intelligence Research, vol. 48, pp. 513–582, 2013.

M. Marchini and H. Purwins, “Unsupervised Analysis and Generation of Audio Percussion Sequences,” in Lecture Notes in Computer Science book series, Berlin: Springer, pp. 205–218, 2011; doi:10.1007/978-3-642-23126-1_14

K. I. Abrosimov and A. S. Surkova, “Music generation algorithm based on abc notation and distributive semantics,” in Proc. of the XXVII Int. Scientific and Technical Conference “Information Systems and Technologies” (IST-2021), Nizhny Novgorod State Technical University n.a. R. E. Alekseev, pp. 776–781, 2021 (in Russian).

G. Hadjeres, F. Pachet, and F. Nielsen, “DeepBach: a Steerable Model for Bach Chorales Generation,” in Proc. of the 34th Int. Conf. on Machine Learning, PMLR, pp. 1362–1371, 2017.

J. S. Bach, 389 Chorales (Choral-Gesänge), Los Angeles, CA: Alfred Publishing Company, 1985.

D. D. Johnson, “Composing Music With Recurrent Neural Networks,” in www.danieldjohnson.com, 2015. [Online]. Available: https://www.danieldjohnson.com/2015/08/03/composing-music-with-recurrent-neural-networks

H. Zen and H. Sak, “Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis,” in Proc. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 4470–4474, 2015.

J. Benesty, M. M. Sondhi, and Y. Huang, eds., Springer Handbook of Speech Processing, Berlin: Springer, 2008.

C. Hawthorne et al., “Onsets and frames: Dual-objective piano transcription,” in Proc. of the Int. Society for Music Information Retrieval Conference, Paris, France, Sep. 23–27, 2018, pp. 50–57.

Y. Hiramatsu, E. Nakamura, and K. Yoshii, “Joint Estimation of Note Values and Voices for Audio-to-Score Piano Transcription,” in Proc. of the 22nd Int. Society for Music Information Retrieval Conference, ISMIR 2021, Online, November 7–12, pp. 278–284, 2021.

Y. Ozaki, J. McBride, et al., “Agreement Among Human and Automated Transcriptions of Global Songs,” in Proc. of the 22nd International Society for Music Information Retrieval Conference, ISMIR 2021, Online, November 7–12, pp. 500–508, 2021; doi:10.31234/osf.io/jsa4u

N. Fletcher and T. Rossing, The Physics of Musical Instruments, New York: Springer-Verlag, 1998.

J. H. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, and K. Simonyan, “Neural audio synthesis of musical notes with WaveNet autoencoders,” in Proc. of the 34th Int. Conf. on Machine Learning, vol. 70, pp. 1068–, 2017.

J. Nistal, S. Lattner, and G. Richard, “DarkGAN: Exploiting Knowledge Distillation for Comprehensible Audio Synthesis with GANs,” in Proc. of the 22nd Int. Society for Music Information Retrieval Conf., ISMIR 2021, Online, November 7–12, pp. 482–494, 2021.

I. J. Goodfellow et al., “Generative Adversarial Networks,” in arXiv, [Online], preprint arXiv:1406.2661, 2014.

B. Hayes, C. Saitis, and G. Fazekas, “Neural Waveshaping Synthesis,” in Proc. of the 22nd Int. Society for Music Information Retrieval Conf., ISMIR 2021, Online, November 7–12, pp. 254–261, 2021; doi:10.48550/arXiv.2107.05050

A. L.-C. Wang, “An Industrial-Strength Audio Search Algorithm,” in Proc. of the 4th Int. Conf. on Music Information Retrieval (ISMIR 2003), Baltimore, USA, October, pp. 27–30, 2003; doi:10.5281/zenodo.1416340

B. M. Lobanov and L. I. Tsirul'nik, Computer Synthesis and Speech Cloning, Minsk, Belarus: Belorusskaya Nauka (in Russian).

P. Chandna, M. Blaauw, J. Bonada, and E. Gómez, “WGANSing: A Multi-Voice Singing Voice Synthesizer Based on the Wasserstein-GAN,” in Proc. of the 27th European Signal Processing Conference (EUSIPCO), A Coruña, Spain, September 2–6, 2019; doi:10.23919/EUSIPCO.2019.8903099

J. Lee, H.-S. Choi, J. H. Koo, and K. Lee, “Disentangling Timbre and Singing Style with Multi-Singer Singing Synthesis System,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP’2020), Barcelona, Spain, May 4–8, 2020; doi:10.1109/ICASSP40776.2020.9054636

C. F. Liao, J. Y. Liu, and Y. H. Yang, “KaraSinger: Score-Free Singing Voice Synthesis with VQ-VAE Using Mel-Spectrograms,” in Proc. ICASSP, pp. 956–960, 2022; doi:10.1109/ICASSP43922.2022.9747441

A. W. Black, “CLUSTERGEN: A statistical parametric synthesizer using trajectory modeling,” in 9th Int. Conf. on Spoken Language Processing, Pittsburgh, PA, USA, September 17–21, 2006; doi:10.21437/Interspeech.2006-488

Y. Gu et al., “ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders,” in Proc. of the 12th Int. Symposium on Chinese Spoken Language Processing (ISCSLP’2021), Hong Kong, Jan. 24–27, pp. 1–5, 2021; doi:10.1109/ISCSLP49672.2021.9362104

L. Lu, J. Wu, J. Luan, X. Tan, and L. Zhou, “XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System,” in Proc. 21st Annual Conf. of the Int. Speech Communication Association, Virtual Event, Shanghai, China, Oct. 25–29, pp. 1306–1310, 2020; doi:10.21437/Interspeech.2020-1410

Y. Ren et al., “FastSpeech: Fast, robust and controllable text to speech,” in NeurIPS, pp. 1–13, 2019; doi:10.48550/arXiv.1905.09263

S. Choi and J. Nam, “A Melody-Unsupervision Model for Singing Voice Synthesis,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP’2022), Singapore, May 23–27, 2022, pp. 7242–7246; doi:10.1109/ICASSP43922.2022.9747422

Z. Zhang, Y. Zheng, X. Li, and L. Lu, “WeSinger: Data-augmented Singing Voice Synthesis with Auxiliary Losses,” in arXiv, [Online], preprint arXiv:2203.10750, 2022.

S. Guo, J. Shi, T. Qian, S. Watanabe, and Q. Jin, “SingAug: Data Augmentation for Singing Voice Synthesis with Cycle-consistent Training Strategy,” in Proc. INTERSPEECH 2022, pp. 4272–4276, 2022; doi:10.21437/Interspeech.2022-

M. Morise, F. Yokomori, and K. Ozawa, “WORLD: a vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016; doi:10.1587/transinf.2015EDP7457

Published
2023-03-28
How to Cite
Abrosimov, K. I., & Rybin, S. V. (2023). Music Information Retrieval: Modern Tasks and Technologies. Computer Tools in Education, (1), 74–95. https://doi.org/10.32603/2071-2340-2023-1-74-95
Issue
Section
Informatics