- Research Article
- 10.5334/tismir.298
- Feb 26, 2026
- Transactions of the International Society for Music Information Retrieval
- Tim Eipert + 1 more
- Research Article
- 10.5334/tismir.326
- Feb 13, 2026
- Transactions of the International Society for Music Information Retrieval
- Stefan Balke + 6 more
The Real World Computing (RWC) Music Database has been a cornerstone of Music Information Retrieval (MIR) research for over two decades, offering high‑quality recordings across multiple genres, including popular, classical, and jazz music. Beyond its extensive audio collection, the dataset is enriched by aligned Musical Instrument Digital Interface (MIDI) encodings and complementary annotations, including beat, structure, and chord labels, making it a valuable resource for music structure analysis, beat tracking, chord recognition, automatic transcription, and music synchronization. Originally, the RWC audio material was distributed on physical media and made available for purchase at a nominal price. A significant development, announced and initiated with this paper, is the release of the RWC dataset under a Creative Commons license, making it freely accessible for research purposes. This transition significantly enhances the dataset’s usability and supports broader adoption within the MIR research community. We outline the steps taken to enable this release and share a vision for transforming RWC into a community‑driven resource that promotes open research and collaboration. With the audio recordings now hosted on Zenodo, we also discuss strategies for dataset maintenance, annotation expansion, and reproducibility through collaborative platforms such as GitHub. This shift promotes transparency and inclusivity, helping to ensure the dataset’s continued relevance for cutting‑edge MIR research. We further revisit the historical significance of the RWC dataset, incorporating insights from an interview with its original creator, Masataka Goto, and provide an overview of its current applications and future potential. In summary, by embracing an open and community‑supported approach, we aim not only to renew the dataset’s impact and preserve its legacy within the MIR community but also to shed light on broader best practices for open, collaborative, and sustainable research infrastructures.
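As a concrete illustration of the new hosting arrangement, the sketch below lists the files attached to a Zenodo record through Zenodo's public REST API. The record ID shown is a placeholder, not the actual RWC deposit; the real identifier would come from the official release announcement.

```python
# Minimal sketch: listing the files of a Zenodo record via the public
# REST API. The record ID below is a placeholder, not the actual RWC
# deposit; substitute the ID from the official announcement.
import json
import urllib.request

RECORD_ID = "0000000"  # hypothetical placeholder for an RWC Zenodo record

url = f"https://zenodo.org/api/records/{RECORD_ID}"
with urllib.request.urlopen(url) as response:
    record = json.load(response)

# Each file entry carries a filename, size, and a direct download link.
for entry in record.get("files", []):
    print(entry["key"], entry["size"], entry["links"]["self"])
```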
- Research Article
- 10.5334/tismir.230
- Dec 23, 2025
- Transactions of the International Society for Music Information Retrieval
- Katelyn Emerson + 1 more
In many musical styles, performing a piece of music means producing an ‘interpretation’ of a score. This interpretation involves performers manipulating musical parameters such as timing, dynamics, timbre, and pitch to communicate their artistic conception of the piece, often to an audience. Much previous research into musical interpretation has examined aspects of expressive performance strategies. However, these studies have largely focused on the sounds produced in the performance, investigating players’ manipulation of musical parameters while giving little attention to the performance’s broader context and impact. Multimodal datasets, which contain multiple diverse data types offering distinct perspectives on a musical performance (e.g. audio, Musical Instrument Digital Interface, video, motion capture, physiological data), can support more holistic cross- and interdisciplinary study of performers’ interpretative decision-making and its effects on audiences. We propose a taxonomy of modalities relevant to the study of musicians’ interpretations of musical scores. These modalities are distinct facets of the performance or its context through which the performance and musical interpretation can be analysed (e.g. ‘venue acoustics’, ‘performer movements’, ‘performance sound’). We use this taxonomy to systematically review relevant open-access multimodal datasets and the modalities they support. Underrepresented modalities are then highlighted, along with practical suggestions for including data that support these modalities in future datasets. We next examine key challenges of reporting and working with multimodal datasets, emphasising the need for standardisation of data reporting and reliable options for data storage and access. Finally, we summarise the broader interdisciplinary applications of these datasets in artificial intelligence and performance research.
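To make the taxonomy idea concrete, here is a minimal sketch of how modality tags could be attached to dataset entries so that underrepresented modalities become countable. The modality names follow the abstract; the dataset entries themselves are hypothetical.

```python
# Minimal sketch of encoding a modality taxonomy as tags on datasets,
# in the spirit of the review. The modality names come from the
# abstract; the example dataset entries are hypothetical.
from dataclasses import dataclass, field
from enum import Enum, auto


class Modality(Enum):
    PERFORMANCE_SOUND = auto()      # e.g. audio recordings
    PERFORMER_MOVEMENTS = auto()    # e.g. motion capture, video
    VENUE_ACOUSTICS = auto()        # e.g. room impulse responses
    PHYSIOLOGICAL = auto()          # e.g. heart rate, EMG


@dataclass
class DatasetEntry:
    name: str
    modalities: set = field(default_factory=set)


# Hypothetical entries illustrating how coverage gaps become visible.
corpus = [
    DatasetEntry("dataset_a", {Modality.PERFORMANCE_SOUND}),
    DatasetEntry("dataset_b", {Modality.PERFORMANCE_SOUND,
                               Modality.PERFORMER_MOVEMENTS}),
]

# Count how many datasets support each modality to spot underrepresentation.
for m in Modality:
    count = sum(m in d.modalities for d in corpus)
    print(f"{m.name}: {count} dataset(s)")
```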
- Research Article
- 10.5334/tismir.225
- Nov 14, 2025
- Transactions of the International Society for Music Information Retrieval
- Ross Greer + 2 more
In this research, we introduce extensions of the ImproVision framework for multimodal musical human–machine communication. ImproVision Equilibrium integrates real‑time pitch detection, consonant chord determination, and visual cues to guide an ensemble from dissonance to harmony. In addition to Equilibrium, we introduce ImproVision Gestured Improvisation, a complementary mode in which musicians use body gestures to guide generative machine improvisation. These systems demonstrate a spectrum of human–machine interaction dynamics. We evaluate the ImproVision framework using the Standardized Procedure for Evaluating Creative Systems (SPECS) methodology, assessing its capacity for co‑creation, communication, and adaptability. Potential applications span from ensemble rehearsal aids to interactive performance tools, opening new avenues through intelligent, responsive, and multimodal machine participation in the arts.
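As one hedged illustration of an ingredient of such a pipeline, the sketch below decides whether a set of detected pitches forms a consonant sonority using a simple interval-class heuristic. This is an illustrative rule, not the authors' actual consonance model.

```python
# Minimal sketch of one building block described above: deciding whether
# a set of detected pitches forms a consonant sonority. This is a simple
# interval-class heuristic, not the ImproVision authors' actual rule.
from itertools import combinations

# Interval classes (0-6) conventionally treated as dissonant: minor
# second/major seventh (1), major second/minor seventh (2), tritone (6).
DISSONANT_INTERVAL_CLASSES = {1, 2, 6}


def is_consonant(midi_pitches):
    """Return True if no pair of pitches forms a dissonant interval class."""
    pitch_classes = {p % 12 for p in midi_pitches}
    for a, b in combinations(pitch_classes, 2):
        ic = min((a - b) % 12, (b - a) % 12)  # interval class, 0..6
        if ic in DISSONANT_INTERVAL_CLASSES:
            return False
    return True


print(is_consonant([60, 64, 67]))  # C major triad -> True
print(is_consonant([60, 61, 67]))  # contains a minor second -> False
```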
- Research Article
- 10.5334/tismir.265
- Nov 6, 2025
- Transactions of the International Society for Music Information Retrieval
- Juan Sebastián Gómez-Cañón + 4 more
Over the past two decades, the music information retrieval (MIR) community has grown significantly in both the volume and diversity of research contributions. However, questions remain about who is, and who is not, represented within the community. The influence of Western views shapes MIR research, affecting author representation, topic selection, cross-cultural considerations, and reproducibility. While discussions on the impact of Western centricity have gained traction in adjacent fields, there remains a need to critically assess its presence and limitations within MIR. This study analyzes the corpus of 2,458 ISMIR conference papers published from 2000 to 2024 to examine the geographic and institutional distribution of authors. Our findings indicate that ISMIR research remains Western-centric, with disproportionate representation from the Global North, albeit with increasing cross-institutional collaborations. We provide design suggestions to support more geographically diverse authorship. In support of our findings and to facilitate future research, we release the aggregated data as an open dataset, ISMIR25Meta, along with a topic-based visualizer, ISMIR25Viz.
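The sketch below illustrates the kind of aggregation such a study performs: counting papers per author country from a tabular export. The file name and column names are assumptions; the released ISMIR25Meta dataset defines the actual schema.

```python
# Minimal sketch of the kind of aggregation such a study performs:
# counting papers per author-affiliation country. The file name and
# column names ("year", "country") are hypothetical; consult the
# released ISMIR25Meta dataset for the actual schema.
import pandas as pd

papers = pd.read_csv("ismir25meta.csv")  # hypothetical export of the dataset

# Papers per country, to examine the geographic distribution of authorship.
by_country = papers.groupby("country").size().sort_values(ascending=False)
print(by_country.head(10))

# Share of papers per year from the most-represented country, as a rough
# proxy for how concentrated authorship is over time.
top = by_country.index[0]
share = (papers.assign(is_top=papers["country"].eq(top))
               .groupby("year")["is_top"].mean())
print(share)
```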
- Research Article
- 10.5334/tismir.250
- Sep 18, 2025
- Transactions of the International Society for Music Information Retrieval
- Jun-You Wang + 2 more
Motif discovery in polyphonic symbolic music data is an important yet challenging task in music processing. In this paper, we propose a novel motif-discovery method that combines traditional rule-based repeated pattern discovery algorithms with a machine learning–based model that performs motif note identification, i.e., identifying whether or not a note belongs to a motif. More specifically, the motif note identification model extracts motif notes for subsequent repeated pattern discovery. Removing non-motif notes reduces unwanted outputs in repeated pattern discovery and thereby improves performance. With a limited amount of training data, motif note identification can be implemented by fine-tuning a pre-trained model for symbolic music using pseudo-labels. The results demonstrate the feasibility of applying data-driven methods to assist motif discovery, specifically on the occurrence and three-layer metrics, in settings where labeled training data for motifs and repeated patterns are scarce.
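The sketch below illustrates the two-stage pipeline in miniature: a learned scorer filters out likely non-motif notes, and a naive exact-repeat discovery step runs on the survivors. Both the scorer stub and the discovery routine are simplified placeholders, not the paper's models.

```python
# Minimal sketch of the two-stage pipeline described above: a learned
# motif-note identifier filters the note stream, and a rule-based
# repeated pattern discovery step runs on the surviving notes. Both
# `motif_note_scores` and the exact-match discovery below are
# simplified placeholders, not the paper's actual models.
def motif_note_scores(notes):
    """Placeholder for the fine-tuned identification model: one
    probability per note that it belongs to a motif."""
    return [0.5 for _ in notes]  # stand-in; a real model scores each note


def discover_repeated_patterns(pitches, min_len=3):
    """Naive rule-based discovery: exact repeated pitch subsequences."""
    found = set()
    n = len(pitches)
    for length in range(min_len, n // 2 + 1):
        seen = set()
        for i in range(n - length + 1):
            key = tuple(pitches[i:i + length])
            if key in seen:
                found.add(key)
            seen.add(key)
    return found


def motif_discovery(notes, threshold=0.5):
    scores = motif_note_scores(notes)
    # Removing likely non-motif notes reduces spurious repeats downstream.
    kept = [n for n, s in zip(notes, scores) if s >= threshold]
    return discover_repeated_patterns([n["pitch"] for n in kept])


notes = [{"pitch": p} for p in [60, 62, 64, 60, 62, 64, 67]]
print(motif_discovery(notes))  # finds the repeated (60, 62, 64) fragment
```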
- Research Article
- 10.5334/tismir.251
- Sep 9, 2025
- Transactions of the International Society for Music Information Retrieval
- Alain Riou + 6 more
In this paper, we introduce PESTO, a self-supervised learning approach for single-pitch estimation using a Siamese architecture. Our model processes individual frames of a Variable-$Q$ Transform (VQT) and predicts pitch distributions. The neural network is designed to be equivariant to translations, notably thanks to a Toeplitz fully-connected layer. In addition, we construct pitch-shifted pairs by translating and cropping the VQT frames and train our model with a novel class-based transposition-equivariant objective, eliminating the need for annotated data. Thanks to this architecture and training objective, our model achieves remarkable performance while being very lightweight ($130$k parameters). Evaluations on music and speech datasets (MIR-1K, MDB-stem-synth, and PTDB) demonstrate that PESTO not only outperforms self-supervised baselines but also competes with supervised methods, exhibiting superior cross-dataset generalization. Finally, we enhance PESTO's practical utility by developing a streamable VQT implementation using cached convolutions. Combined with our model's low latency (less than 10 ms) and minimal parameter count, this makes PESTO particularly suitable for real-time applications.
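The pitch-shifted pair construction lends itself to a short sketch: since VQT bins are spaced logarithmically in frequency, transposition by k bins amounts to a translation along the bin axis, and cropping both views to a common length yields aligned training pairs. The bin count and shift range below are illustrative, not PESTO's actual values.

```python
# Minimal sketch of the pitch-shifted pair construction described above:
# because a VQT has log-frequency bins, transposing by k bins amounts to
# translating the frame along the bin axis. Cropping both frames to a
# shared length yields an aligned (anchor, shifted) training pair.
# Bin counts and shift range are illustrative, not PESTO's actual values.
import numpy as np

rng = np.random.default_rng(0)


def make_pitch_shifted_pair(vqt_frame, max_shift=12):
    """Return (anchor, shifted, k): two crops of the same frame offset
    by k bins, so the model can learn transposition equivariance."""
    n_bins = vqt_frame.shape[0]
    crop_len = n_bins - 2 * max_shift  # leave headroom on both sides
    k = int(rng.integers(-max_shift, max_shift + 1))
    start = max_shift
    anchor = vqt_frame[start:start + crop_len]
    shifted = vqt_frame[start + k:start + k + crop_len]
    return anchor, shifted, k


frame = rng.random(200)  # stand-in for one VQT magnitude frame
a, s, k = make_pitch_shifted_pair(frame)
print(a.shape, s.shape, k)  # equal-length crops offset by k bins
```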
- Research Article
- 10.5334/tismir.256
- Sep 4, 2025
- Transactions of the International Society for Music Information Retrieval
- Fabio Morreale + 4 more
Up until recently, most approaches to music generation were based on deductive logic: generative rules were devised on the basis of musicians’ preferences, subjective appreciation, and dominant music theories. Machine learning (ML) introduced a paradigm shift: vast datasets of existing music are used to train neural networks capable of generating new compositions, supposedly without embedding predefined musical rules. We first outline how rule-based systems depend on a series of reductionist processes and assumptions about music that affect what can be generated. We then examine ML-based generative music systems and show that they are still unable to generate the full theoretical space of musical possibilities, that they remain grounded in reductionist processes, and that their soundness is still affected by unquestioned assumptions. We also identify the limitations of the semantic bridges used to form musical meaning and of the epistemic framework of cascading modules. Finally, we propose that the artistic potential of ML systems might lie beyond attempts to replicate human music-making methods.
- Research Article
- 10.5334/tismir.216
- Sep 4, 2025
- Transactions of the International Society for Music Information Retrieval
- William Wilson + 4 more
Following the exposition of quantitative, identifiable idiosyncrasy in violin performance via neural network classification, we demonstrate that smartwatch-based synchronous audio-gesture logging facilitates interpretable practice feedback in violin performance. The novelty of our approach is twofold: we exploit convenient multimodal data capture using a consumer smartwatch, recording wrist-movement and audio data in parallel, and we prioritise delivering performance insights in their most interpretable form, quantifying tonal and temporal performance trends. Using such accessible hardware to surface meaningful, approachable performance insights maximises the feasibility of our approach for real-world teaching and learning environments. The presented analyses draw upon a primary dataset compiled from nine violinists executing defined performance exercises; recordings are segmented via note onset detection and then analysed. Identified trends include a cross-participant tendency to ‘rush’ up-bows relative to down-bows, along with lower temporal and tonal consistency when bowing spiccato versus legato.
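A minimal sketch of this analysis style follows, assuming a recorded exercise file and an alternating down-bow/up-bow pattern standing in for bow-direction labels that would, in the paper, come from the synchronised wrist-motion data: segment at note onsets with librosa, then compare inter-onset intervals across bow directions.

```python
# Minimal sketch of the analysis style described above: segment a
# recording at note onsets, then compare inter-onset intervals between
# up-bows and down-bows. Onset detection uses librosa; the alternating
# bow-direction labelling is an assumption standing in for labels that
# would come from the synchronised wrist-motion data.
import librosa
import numpy as np

y, sr = librosa.load("exercise_take.wav")  # hypothetical recording file
onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")

# Inter-onset intervals: shorter-than-expected intervals indicate rushing.
iois = np.diff(onsets)

# Assumed labelling: a detache exercise alternating down-bow / up-bow.
down_iois = iois[0::2]
up_iois = iois[1::2]

print(f"mean down-bow IOI: {down_iois.mean():.3f} s")
print(f"mean up-bow IOI:   {up_iois.mean():.3f} s")
# If up-bow IOIs are consistently shorter, the player 'rushes' up-bows.
```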
- Research Article
- 10.5334/tismir.222
- Jul 31, 2025
- Transactions of the International Society for Music Information Retrieval
- Anna-Maria Christodoulou + 3 more
Music question–answering (MQA) is a machine learning task in which a computational system analyzes and answers questions about music‑related data. Traditional methods prioritize audio, overlooking the visual and embodied aspects crucial to understanding music performance. We introduce MusiQAl, a multimodal dataset of 310 music performance videos and 11,793 human‑annotated question–answer pairs spanning diverse musical traditions and styles. Grounded in musicology and music psychology, MusiQAl emphasizes multimodal reasoning, causal inference, and cross‑cultural understanding of performer–music interaction. We benchmark the AVST and LAVISH architectures on MusiQAl, revealing their strengths and limitations and underscoring the importance of integrating multimodal learning and domain expertise to advance MQA and music information retrieval.
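As a hedged sketch of how a model might be scored on such a benchmark, the snippet below computes exact-match accuracy over (video, question, answer) items. The annotation file name, field names, and the predict stub are assumptions, since the actual MusiQAl format and the AVST/LAVISH interfaces are not specified here.

```python
# Minimal sketch of scoring a model on a music QA benchmark via
# exact-match accuracy. The JSON file name, field names, and the
# `predict` stub are hypothetical; the actual MusiQAl format and the
# AVST/LAVISH model interfaces will differ.
import json


def predict(video_path, question):
    """Stand-in for an audio-visual QA model's inference call."""
    return "placeholder answer"


with open("musiqal_test.json") as f:  # hypothetical annotation file
    items = json.load(f)  # assumed: list of {"video", "question", "answer"}

correct = 0
for item in items:
    hypothesis = predict(item["video"], item["question"])
    correct += hypothesis.strip().lower() == item["answer"].strip().lower()

print(f"exact-match accuracy: {correct / max(len(items), 1):.3f}")
```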