Music information retrieval: modern challenges and technology
Abstract
This paper discusses Music Information Retrieval, a field of computational musicology that is developing actively today. The paper describes some of the main tasks and technologies of this area, such as music generation, automatic music transcription, synthesis of musical instrument sounds, and music retrieval. Special attention is paid to one of the most interesting tasks at the junction of speech and music technologies: singing voice synthesis. Different approaches to this task, existing problems, and methods for solving them are discussed.
References
A. Volk, F. Wiering, and P. van Kranenburg, “Unfolding the potential of computational musicology,” in Proc. of the 13th Int. Conf. on Informatics and Semiotics in Organisations, Leeuwarden, The Netherlands, July 4–6, 2011, pp. 137–144.
M. Müller, Fundamentals of Music Processing: Audio, Analysis, Algorithms, Applications, Berlin: Springer, 2015; doi:10.1007/978-3-319-21945-5
M. Sh. Bronfeld, The Introduction to Musicology: Textbook for students and teachers of higher musical educational institutions, Saint Petersburg, Russia: Planeta Muziki, 2022 (in Russian).
A. V. Gladkiy and I. A. Melchuk, Elements of Mathematical Linguistics, Moscow: Nauka, 1969 (in Russian).
R. G. Piotrovsky, K. B. Bektaev, and A. A. Piotrovskaya, Mathematical Linguistics: Textbook for pedagogical institutes, Moscow: Vysshaia shkola, 1977 (in Russian).
M. K. Timofeeva, Introduction to Mathematical Linguistics: Practicum, Novosibirsk, Russia: Novosibirsk, 2018 (in Russian).
E. I. Bolshakova, E. S. Klyshinsky, D. V. Lande, A. A. Noskov, O. V. Peskova, and E. V. Yagunova, Automatic natural language processing and computational linguistics: textbook, Moscow: HSE MIEM, 2011 (in Russian).
L. S. Lomakina and A. S. Surkova, Information technologies for analysis and modeling of text structures, Voronezh: Scientific Book Publ., 2015 (in Russian).
A. Balakrishnan, DeepPlaylist: Using Recurrent Neural Networks to Predict Song Similarity, Stanford, CA, USA: Stanford University, 2016.
Z. Rafii, A. Liutkus, F. R. Stöter, S. I. Mimilakis, D. FitzGerald, and B. Pardo, “An Overview of Lead and Accompaniment Separation in Music,” in arXiv, [Online], preprint arXiv:1804.08300, 2018.
R. G. Lyons, Understanding Digital Signal Processing, Moscow: Binom, 2015 (in Russian).
A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, 3rd ed., Moscow: Technosfera, 2012 (in Russian).
A. Klapuri and M. Davy, eds., Signal Processing Methods for Music Transcription, New York, USA: Springer Science and Business Media, 2006.
P. Taylor, Text-to-Speech Synthesis, Cambridge, England: Cambridge University Press, 2009.
D. Guennec, Study of Unit Selection Text-To-Speech Synthesis Algorithms, [PhD diss.], Université Rennes 1, Rennes, Brittany, France, 2017.
M. B. Stolbov, Fundamentals of analysis and processing of speech signals, [Textbook], St. Petersburg: ITMO University, 2021 (in Russian).
V. Rybin and A. Kaliev, “Speech Synthesis: Past and Present,” Computer Tools in Education, no. 1, pp. 5–28, 2019 (in Russian); doi:10.32603/2071-2340-2019-1-5-28
E. R. Miranda and J. Biles, Evolutionary Computer Music, London: Springer Science and Business Media, 2007; doi:10.1007/978-1-84628-600-1
M. Fingerhut, “Music Information Retrieval, or how to search for (and maybe find) music and do away with incipits,” in Proc. of IAML-IASA Congress, Oslo, Norway, August 8–13, 2004, p. 17.
M. Good, “MusicXML: An Internet-Friendly Format for Sheet Music,” in Proc. of XML 2001, Boston, USA, December 9–14, 2001, pp. 03–04.
D. Huber, The MIDI Manual: A Practical Guide to MIDI within Modern Music Production (Audio Engineering Society Presents), 4th ed., Oxfordshire, England: Routledge, 2020.
I. Oppenheim, “The ABC Music standard 2.0,” in abc.sourceforge.net, 21 Feb. 2008, [Online]. Available: https://abc.sourceforge.net/standard/abc2-draft.html
K. Blum, “OOoLilyPond: Creating musical snippets in LibreOffice documents,” in github.com 2017. [Online]. Available: https://github.com/openlilylib/LO-ly
D. Hannah and M. Saif, “Generating Music from Literature,” in Proc. of the 3rd Workshop on Computational Linguistics for Literature (CLFL), Gothenburg, Sweden, 2014, pp. 1–10; doi:10.3115/v1/W14-0901
N. B. Zubareva and P. A. Kulichkin, Secrets of music and mathematical modeling: Algebra or harmony?.. Harmony and algebra!, Moscow: URSS, 2022 (in Russian).
M. Toro, C. Rueda, C. Agón, and G. Assayag, “Gelisp: a framework to represent musical constraint satisfaction problems and search strategies,” J. of Theoretical and Applied Information Technology, vol. 86, no. 2, pp. 327–331, 2016.
D. Quick and P. Hudak, “Grammar-based automated music composition in Haskell,” in Proc. ACM SIGPLAN Workshop on Functional Art, Music, Modeling and Design, FARM’13, pp. 59–70, 2013.
H. V. Koops, J. P. Magalhães, and W. B. de Haas, “A functional approach to automatic melody harmonisation,” in Proc. ACM SIGPLAN Workshop on Functional Art, Music, Modeling and Design, FARM’13, pp. 47–58, 2013.
N. dos S. Cunha, A. Subramanian, and D. Herremans, “Generating guitar solos by integer programming,” Journal of the Operational Research Society, vol. 69, no. 6, pp. 971–985, 2017; doi:10.1080/01605682.2017.1390528
J. A. Biles, “GenJam: A Genetic Algorithm for Generating Jazz Solos,” in Proc. International Computer Music Conference (ICMC), 1994, pp. 131–137.
C. Fox, “Genetic Hierarchical Music Structure,” in Proc. of the Nineteenth International Florida Artificial Intelligence Research Society Conference, Melbourne Beach, Florida, USA, May 11–13, 2006, Washington, DC, USA: AAAI Press, pp. 243–247, 2006.
S. B. Silas, Algorithmic Composition and Reductionist Analysis: Can a Machine Compose?, Cambridge, England: Cambridge University New Music Society, 1997.
J. D. Fernández and F. Vico, “AI Methods in Algorithmic Composition: A Comprehensive Survey,” J. of Artificial Intelligence Research, vol. 48, pp. 513–582, 2013.
M. Marchini and H. Purwins, “Unsupervised Analysis and Generation of Audio Percussion Sequences,” in Lecture Notes in Computer Science, Berlin: Springer, 2011, pp. 205–218; doi:10.1007/978-3-642-23126-1_14
K. I. Abrosimov and A. S. Surkova, “Music generation algorithm based on abc notation and distributive semantics,” in Proc. of the XXVII Int. Scientific and Technical Conference “Information Systems and Technologies” (IST-2021), Nizhny Novgorod State Technical University n.a. R. E. Alekseev, pp. 776–781, 2021 (in Russian).
G. Hadjeres, F. Pachet, and F. Nielsen, “DeepBach: a Steerable Model for Bach Chorales Generation,” in Proc. of the 34th Int. Conf. on Machine Learning, PMLR, pp. 1362–1371, 2017.
J. S. Bach, 389 Chorales (Choral-Gesänge), Los Angeles, CA: Alfred Publishing Company, 1985.
D. D. Johnson, “Composing Music With Recurrent Neural Networks,” in www.danieldjohnson.com. [Online]. Available: https://www.danieldjohnson.com/2015/08/03/composing-music-with-recurrent-neural-networks
H. Zen and H. Sak, “Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis,” in Proc. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 4470–4474, 2015.
J. Benesty, M. M. Sondhi, and Y. Huang, eds., Springer Handbook of Speech Processing, Berlin: Springer, 2008.
C. Hawthorne et al., “Onsets and frames: Dual-objective piano transcription,” in Proc. of the Int. Society for Music Information Retrieval Conference, Paris, France, Sep. 23–27, 2018, pp. 50–57.
Y. Hiramatsu, E. Nakamura, and K. Yoshii, “Joint Estimation of Note Values and Voices for Audio-to-Score Piano Transcription,” in Proc. of the 22nd Int. Society for Music Information Retrieval Conference, ISMIR 2021, Online, November 7–12, pp. 278–284, 2021.
Y. Ozaki, J. McBride, et al., “Agreement Among Human and Automated Transcriptions of Global Songs,” in Proc. of the 22nd Int. Society for Music Information Retrieval Conference, ISMIR 2021, Online, November 7–12, pp. 500–508, 2021; doi:10.31234/osf.io/jsa4u
N. Fletcher and T. Rossing, The Physics of Musical Instruments, New York: Springer-Verlag, 1998.
J. H. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, and K. Simonyan, “Neural audio synthesis of musical notes with WaveNet autoencoders,” in Proc. of the 34th Int. Conf. on Machine Learning, vol. 70, pp. 1068–, 2017.
J. Nistal, S. Lattner, and G. Richard, “DarkGAN: Exploiting Knowledge Distillation for Comprehensible Audio Synthesis with GANs,” in Proc. of the 22nd Int. Society for Music Information Retrieval Conf., ISMIR 2021, Online, November 7–12, pp. 482–494, 2021.
I. J. Goodfellow et al., “Generative Adversarial Networks,” in arXiv, [Online], preprint arXiv:1406.2661, 2014.
B. Hayes, C. Saitis, and G. Fazekas, “Neural Waveshaping Synthesis,” in Proc. of the 22nd Int. Society for Music Information Retrieval Conf., ISMIR 2021, Online, November 7–12, pp. 254–261, 2021; doi:10.48550/arXiv.2107.05050
A. L.-C. Wang, “An Industrial-Strength Audio Search Algorithm,” in Proc. of the 4th Int. Conf. on Music Information Retrieval (ISMIR 2003), Baltimore, USA, October 27–30, 2003; doi:10.5281/zenodo.1416340
B. M. Lobanov and L. I. Tsirul’nik, Computer synthesis and speech cloning, Minsk, Belarus: Belorusskaya Nauka (in Russian).
P. Chandna, M. Blaauw, J. Bonada, and E. Gómez, “WGANSing: A Multi-Voice Singing Voice Synthesizer Based on the Wasserstein-GAN,” in Proc. of the 27th European Signal Processing Conference (EUSIPCO), A Coruña, Spain, September 2–6, 2019; doi:10.23919/EUSIPCO.2019.8903099
J. Lee, H.-S. Choi, J. H. Koo, and K. Lee, “Disentangling Timbre and Singing Style with Multi-Singer Singing Synthesis System,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP’2020), Barcelona, Spain, May 4–8, 2020; doi:10.1109/ICASSP40776.2020.9054636
C. F. Liao, J. Y. Liu, and Y. H. Yang, “KaraSinger: Score-Free Singing Voice Synthesis with VQ-VAE Using Mel-Spectrograms,” in Proc. ICASSP, pp. 956–960, 2022; doi:10.1109/ICASSP43922.2022.9747441
A. W. Black, “CLUSTERGEN: A statistical parametric synthesizer using trajectory modeling,” in 9th Int. Conf. on Spoken Language Processing, Pittsburgh, PA, USA, September 17–21, 2006; doi:10.21437/Interspeech.2006-488
Y. Gu et al., “ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders,” in Proc. of the 12th Int. Symposium on Chinese Spoken Language Processing (ISCSLP’2021), Hong Kong, Jan. 24–27, pp. 1–5, 2021; doi:10.1109/ISCSLP49672.2021.9362104
L. Lu, J. Wu, J. Luan, X. Tan, and L. Zhou, “XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System,” in Proc. 21st Annual Conf. of the Int. Speech Communication Association, Virtual Event, Shanghai, China, Oct. 25–29, pp. 1306–1310, 2020; doi:10.21437/Interspeech.2020-1410
Y. Ren et al., “FastSpeech: Fast, robust and controllable text to speech,” in Proc. NeurIPS, pp. 1–13, 2019; doi:10.48550/arXiv.1905.09263
S. Choi and J. Nam, “A Melody-Unsupervision Model for Singing Voice Synthesis,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP’2022), Singapore, May 23–27, 2022, pp. 7242–7246; doi:10.1109/ICASSP43922.2022.9747422
Z. Zhang, Y. Zheng, X. Li, and L. Lu, “WeSinger: Data-augmented Singing Voice Synthesis with Auxiliary Losses,” in arXiv, [Online], preprint arXiv:2203.10750, 2022.
S. Guo, J. Shi, T. Qian, S. Watanabe, and Q. Jin, “SingAug: Data Augmentation for Singing Voice Synthesis with Cycle-consistent Training Strategy,” in Proc. INTERSPEECH 2022, pp. 4272–4276, 2022; doi:10.21437/Interspeech.2022-
M. Morise, F. Yokomori, and K. Ozawa, “WORLD: a vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016; doi:10.1587/transinf.2015EDP7457
This work is licensed under a Creative Commons Attribution 4.0 International License.