Articles published on Audio Data
5364 Search results
Sort by Recency
- New
- Research Article
- 10.3390/app16052446
- Mar 3, 2026
- Applied Sciences
- Fabricio Quirós-Corella + 3 more
The Greater Caribbean manatee faces significant conservation challenges due to a lack of demographic data in low-visibility habitats. To address this, we present a refined automated manatee-counting pipeline integrating deep learning-based call detection with unsupervised individual counting. We resolved significant computational bottlenecks by implementing an offline feature extraction strategy, bypassing a 13-h processing lag for 43,031 audio samples. To mitigate overfitting in imbalanced bioacoustic datasets, non-parametric bootstrap resampling was employed to generate 100,000 balanced spectrograms. Benchmarking revealed that transfer learning via a VGG-16 backbone achieved a mean 10-fold cross-validation accuracy of 98.92% (±0.08%) and an F1-score of 98.08% for genuine vocalizations. Following detection, individual counting used k-means clustering on prioritized music information retrieval descriptors (spectral bandwidth, centroid, and roll-off) to resolve distinct acoustic signatures. This framework identified three individuals with a silhouette coefficient of 79.20%, demonstrating superior cohesion over previous benchmarks. These results confirm the automated manatee-counting method as a robust, scalable framework for generating the scientific evidence required for regional conservation policies.
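The clustering stage described in this abstract can be sketched with scikit-learn. The snippet below is a minimal illustration, not the authors' pipeline: the per-call feature vectors (spectral bandwidth, centroid, roll-off) are synthetic stand-ins drawn around three hypothetical class means, and `KMeans` plus `silhouette_score` simply mirror the kind of unsupervised counting and cohesion check the abstract reports.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Hypothetical per-call descriptors in Hz: [bandwidth, centroid, roll-off].
# Three synthetic "individuals", 30 detected calls each.
calls = np.vstack([
    rng.normal(loc=[800, 4000, 6000], scale=50, size=(30, 3)),
    rng.normal(loc=[1200, 5000, 7500], scale=50, size=(30, 3)),
    rng.normal(loc=[600, 3000, 5000], scale=50, size=(30, 3)),
])

# Standardize so no single descriptor dominates the Euclidean distance.
z = (calls - calls.mean(axis=0)) / calls.std(axis=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(z)
score = silhouette_score(z, km.labels_)
print(f"estimated individuals: 3, silhouette: {score:.2f}")
```

The silhouette coefficient (in [-1, 1]) is the cohesion measure the abstract cites; standardizing before clustering is a common choice when descriptors live on different Hz scales.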
- New
- Research Article
- 10.1016/j.infbeh.2025.102175
- Mar 1, 2026
- Infant behavior & development
- Alexander Turner + 2 more
Applying a Transformer-based machine-learning model to classify caregiver and infant behaviours during dyadic interactions.
- New
- Research Article
- 10.55041/ijsrem56918
- Feb 25, 2026
- International Journal of Scientific Research in Engineering and Management
- Sagar A Gavade + 4 more
Abstract The increasing consumption of digital music has created a demand for intelligent and automated music identification systems. This research presents the development of an audio-based song identification system using Python and a cloud-based music recognition API. The system is designed to detect a song from a short audio clip uploaded by the user. It processes the audio through a backend server and communicates with a cloud recognition service to identify the corresponding song metadata. Detected results include the song title, artist name, and direct search links to popular streaming platforms. The proposed solution demonstrates the integration of web technologies, RESTful services, and audio fingerprinting mechanisms in a scalable and efficient architecture. Experimental evaluation confirms that the system achieves reliable performance for clear audio samples and offers a practical framework for real-world multimedia applications.
- New
- Research Article
- 10.31449/inf.v50i6.9998
- Feb 21, 2026
- Informatica
- K Revathi + 1 more
Audio steganography embeds multimedia files within audio files. Traditional audio steganography used the Least Significant Bit algorithm, which required eight samples to embed one byte of a text file. This study reduces the utilization of audio samples by compressing healthcare data from three characters to one using the proposed 24×8 compression algorithm. Three audio files were used to transmit the probability distributions, encoded values, and index values. The encoded values are embedded using the Bit Comparison and Substitution-3 algorithm, with a three-bit difference from audio samples. It is retrieved using the Bit Comparison and Retrieval-3 algorithm and decompressed with the 8×24 decompression algorithm. For enhanced security, healthcare data was encrypted using the Incremental Order Value Algorithm and decrypted with the Decremental Reverse Order Value Algorithm. The least significant bit algorithm embeds the probability distributions and index values with a secret key. Audio files from Mixkit and healthcare data from the COVID Dialogue Dataset were used for evaluation. The proposed algorithms achieved an average throughput of 8592.74 KB/s, surpassing the 3-DES algorithm due to the incremental shift in ASCII values within healthcare data. A compression ratio of 3:1 was achieved by compressing 3 bytes of data to 1, outperforming Huffman and LZW. The embedding algorithm achieved a PSNR of 42.1480 dB and a BER of 7.0165 × 10⁻⁵, demonstrating improved efficiency with reduced audio samples in embedding compared to the traditional LSB algorithm. 15000 bytes of healthcare data were embedded into 5000 audio samples, resulting in 15000-bit differences between the cover and stego audio files.
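For context on the baseline this paper improves on, the traditional one-bit-per-sample LSB scheme can be sketched in a few lines of NumPy. This is the classic textbook method (eight samples per byte), not the authors' Bit Comparison and Substitution-3 or compression algorithms, and the payload string is invented for illustration.

```python
import numpy as np

def lsb_embed(samples: np.ndarray, payload: bytes) -> np.ndarray:
    """Embed payload bits into the least significant bit of each sample.

    Classic LSB: one bit per sample, so one byte costs eight samples --
    exactly the overhead the paper's 24x8 compression aims to reduce.
    """
    bits = np.unpackbits(np.frombuffer(payload, dtype=np.uint8))
    if len(bits) > len(samples):
        raise ValueError("cover audio too short for payload")
    stego = samples.copy()
    stego[: len(bits)] = (stego[: len(bits)] & ~1) | bits
    return stego

def lsb_extract(stego: np.ndarray, n_bytes: int) -> bytes:
    """Read back n_bytes from the LSBs of the stego samples."""
    bits = (stego[: n_bytes * 8] & 1).astype(np.uint8)
    return np.packbits(bits).tobytes()

# Hypothetical 16-bit cover samples and a toy healthcare-style payload.
cover = np.random.default_rng(1).integers(0, 2**15, size=1024, dtype=np.int64)
stego = lsb_embed(cover, b"BP 120/80")
print(lsb_extract(stego, 9))
```

Because only the lowest bit changes, each sample differs from the cover by at most 1, which is what keeps PSNR high in LSB-style schemes.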
- New
- Research Article
- 10.1038/s41597-026-06851-x
- Feb 21, 2026
- Scientific data
- Mark Dourado + 3 more
The GaMMA (Gaze, Motion, and Multi-talker Audio) corpus captures the behavior of polyadic conversations among native Danish speakers under both normal and cocktail party conditions. Eleven groups of four normal-hearing participants are recorded while engaged in natural and spontaneous interactions. All conversations were conducted without conversational tasks. Each group was intentionally composed of participants with prior intragroup and interpersonal relations. Gaze and motion data were collected using an optical tracking system and eye-tracking glasses, while speech was recorded via omnidirectional head-worn microphones and binaural hearing aid microphones with low occlusion. Calibrations were conducted before trials and compensation filters were created to account for differences in microphone placements. Processed versions of the audio signals, with background noise attenuated and crosstalk removed, were used to compute speech activity for all participants. The corpus, including both raw and processed gaze and audio data, as well as filters, calibration signals, and speech activity output, is publicly available.
- New
- Research Article
- 10.29121/shodhkosh.v7.i1s.2026.7196
- Feb 17, 2026
- ShodhKosh: Journal of Visual and Performing Arts
- Dr Biswajit Kalita + 5 more
Artificial intelligence in arts education opens possibilities for personalized and immersive learning. This paper proposes an Adaptive AI-Tutor focused specifically on teaching theatre and drama, improving skills in performance, script interpretation, emotional expression, and stage presence through intelligent, data-driven guidance. Unlike traditional e-learning systems, the framework includes multimodal analysis of audio, video, and textual data gathered during rehearsals and live performances. The system uses Transformer-based language models to generate dialogues and analyze scripts, speech recognition to assess pronunciation and prosody, and computer vision to evaluate facial expression and body language. Learner profiling and behavioral modeling capture individual strengths, weaknesses, emotional states, and learning progress over time. By incorporating affective computing, the AI-Tutor dynamically adjusts its feedback and teaching methods based on identified emotional responses, supporting both cognitive and emotional growth. An adaptive recommendation engine proposes custom exercises, scene interpretations, and character development strategies based on the learner's performance measures. Experimental validation shows higher accuracy in dialogue delivery, expressive modulation, and engagement than traditional rehearsal approaches. The proposed system shows how intelligent tutoring systems can be applied not only in STEM but also in the creative arts, encouraging inclusive, scalable, and performance-based theatre education. The study contributes a new interdisciplinary bridge between artificial intelligence, performing arts pedagogy, and adaptive learning technologies.
- New
- Research Article
- 10.1002/adma.202514881
- Feb 15, 2026
- Advanced materials (Deerfield Beach, Fla.)
- Mengyuan Li + 19 more
All-organic red-green-blue (RGB) visible light communication (VLC) systems hold significant promise for future wireless communications because they can be readily integrated with existing lighting infrastructures. However, the stability, efficiency, and exciton decay times of printed deep-blue organic light-emitting diodes (OLEDs) currently fall short of the high bandwidth, rapid response, and swift data transmission requirements of VLC systems. Herein, two printable deep-blue fluorescent light-emitting π-conjugated polymers (LπCPs) were fabricated based on a multi-dimensional self-encapsulation strategy for application in all-OLED RGB VLC systems. The printed fluorescent films displayed remarkably fast decay lifetimes of ∼0.30 ns, enabling high bandwidth and fast response. The deep-blue OLEDs presented a CIE coordinate of (0.15, 0.06), narrow deep-blue emission with a full width at half maximum (FWHM) of 21 nm, a high external quantum efficiency (EQE) of 1.94%, and a high brightness of 6698 cd/m² with remarkable durability. Finally, preliminary printed all-OLED RGB VLC systems were successfully established and, through efficient energy transfer, demonstrated the transmission of pseudo-random binary sequence (PRBS) signals and audio data at a rate of 1 Mbps. The fast response times, on the order of microseconds, highlight the potential of these all-OLED VLC systems for high-speed data transmission.
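The PRBS test signals mentioned here are conventionally produced by a linear-feedback shift register. As a sketch, the function below generates PRBS7 (the x^7 + x^6 + 1 polynomial, period 127), a common choice for link testing; the abstract does not specify which PRBS order the authors used, so the order and seed here are assumptions.

```python
def prbs7(n_bits: int, seed: int = 0x7F) -> list[int]:
    """Generate a PRBS7 bit pattern from the x^7 + x^6 + 1 LFSR (period 127)."""
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be non-zero")
    out = []
    for _ in range(n_bits):
        fb = ((state >> 6) ^ (state >> 5)) & 1  # XOR of register bits 7 and 6
        out.append(fb)
        state = ((state << 1) | fb) & 0x7F      # shift, feed back into bit 1
    return out

print(prbs7(16))
```

A maximal-length sequence like this repeats every 127 bits and is nearly balanced (64 ones, 63 zeros per period), which is why it serves as a stand-in for random data when measuring link bandwidth and bit error rate.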
- New
- Research Article
- 10.1093/schbul/sbag003.233
- Feb 13, 2026
- Schizophrenia Bulletin
- Bing Yan
Abstract Background: Insomnia is a common comorbidity in anxiety disorders, worsening daytime function. Digital health interventions, particularly audio-based content on streaming platforms, are gaining attention for their accessibility. "Sleep playlists" on music platforms, often featuring ambient music and nature soundscapes, may theoretically promote relaxation by modulating autonomic activity. However, controlled studies systematically evaluating the effects of standardized sleep playlists on insomnia symptoms in anxiety patients are lacking. This study investigated the efficacy of a 4-week nighttime listening intervention using a mainstream music platform's standardized sleep playlist, targeting sleep quality and anxiety symptoms in patients with comorbid insomnia and anxiety. Methods: This was a two-arm, parallel-group randomized controlled trial. Eighty-six anxiety patients with Pittsburgh Sleep Quality Index (PSQI) scores >7 were enrolled and randomized to a playlist intervention group (n = 43) or a wait-list control group (n = 43). Both continued existing treatments. The intervention group was instructed to listen to a designated sleep playlist on a specified music platform (e.g., the NetEase Cloud Music "Deep Sleep" official playlist) for at least 30 minutes before bedtime nightly for 4 weeks. The playlist contained slow-tempo instrumental pieces and linear synth pads. The control group received no directed audio intervention. The primary outcome was change in PSQI total score. Secondary outcomes included the Insomnia Severity Index (ISI), the Self-rating Anxiety Scale (SAS), and adherence (based on platform backend data). All scales were administered at baseline (T0) and post-intervention (T1). Data were analyzed using Analysis of Covariance (ANCOVA) in SPSS 23.0, controlling for baseline scores.
Results: ANCOVA revealed that, after adjusting for baseline scores, the intervention group had a significantly lower adjusted mean PSQI score at T1 (9.2) than the control group (12.5), with a statistically significant between-group difference (F = 18.34, p<.001). The reduction in ISI scores was also significantly greater in the intervention group (p<.01). For anxiety symptoms, the intervention group showed a greater reduction in adjusted mean SAS scores (48.6) compared to the control group (52.1) (p<.05). Adherence data showed a mean weekly compliance rate of 78.4% in the intervention group. Follow-up feedback from a subset (n = 20) highlighted three core experiential themes: "Attentional Anchoring Effect" (85% reported music diverted thoughts from anxiety), "Environmental Masking and Safety" (80% felt sound masked nighttime noises, creating safety), and "Sleep Ritual Establishment" (75% stated regular listening became a sleep cue). Discussion: A 4-week standardized sleep playlist intervention significantly improved sleep quality and partially alleviated anxiety in patients with comorbid insomnia and anxiety. Potential mechanisms may include establishing a conditioned relaxation response through regular auditory stimuli, providing non-pharmacological sensory shielding, and serving as a low-cognitive-load distractor. These findings support using readily accessible digital audio resources as a feasible adjunct tool for improving sleep within comprehensive anxiety management. Clinicians may consider "prescribing" such interventions as part of behavioral therapy. Future research should compare personalized versus standardized playlists, examine the effects of specific musical elements, and evaluate long-term utility in relapse prevention.
- New
- Research Article
- 10.1097/nnr.0000000000000893
- Feb 10, 2026
- Nursing research
- Katrina M Long + 10 more
Manual transcription can be resource- and time-consuming, while software-based audio coding offers a potentially cheaper and faster alternative. This study aimed to compare the time efficiency, cost effectiveness, and researcher experience of thematic analysis of audio recordings versus transcripts. This was a mixed-methods crossover study with two conditions (audio coding, transcript coding) and three categories of coders (novice, competent, and expert). Ten researchers coded 18 interview segments using NVivo, half in each format. Demographics, coding times, and coding experiences were collected. On average, transcript coding took less time than audio coding, and NVivo experience was negatively associated with coding time across conditions. Economic analysis showed that audio coding cost less than 60% as much as transcript coding. Audio coding was perceived as more difficult, yet coders agreed that both methods led to similar code quality. Audio coding may be a cost-saving alternative to transcript coding. The potential cost savings, coupled with the more "naturalistic" source of audio data, may make audio coding an appropriate approach to consider for the qualitative researcher, despite coder perceptions of its greater difficulty. Audio coding should be considered as part of a qualitative project to enhance immersion in the data or improve coding efficiency. However, this approach should be preceded by careful consideration of the most effective computer-assisted qualitative data analysis software and extensive training and familiarization with audio coding procedures prior to analysis.
- New
- Research Article
- 10.4314/etsj.v16i2.13
- Feb 10, 2026
- Environmental Technology and Science Journal
- B.A Omodunbi + 3 more
Emotion recognition from speech plays a crucial role in enhancing human–computer interaction by enabling systems to interpret and respond to users’ emotional states. This study develops and evaluates a Speech Emotion Recognition (SER) system using three machine learning techniques: Support Vector Machines (SVM), Multilayer Perceptrons (MLP), and Convolutional Neural Networks (CNNs). The system is trained and tested on the RAVDESS dataset, which contains 1,440 professionally recorded audio samples representing a wide range of emotions. Our approach involves careful preprocessing of the audio signals, extraction of key acoustic features, and comparative performance evaluation of the three models using standard metrics. Results show that each model exhibits unique strengths and limitations, with CNNs achieving the most robust feature learning and generalization. The study underscores the importance of diverse feature representation for accurate emotion classification and provides insight into how different model architectures handle emotional nuances in speech. Identified challenges such as dataset diversity, feature selection, and computational complexity are discussed, along with recommendations for future research to improve SER systems’ real-world adaptability. This work contributes to ongoing efforts toward developing emotionally aware technologies that can enhance natural human–machine communication.
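The comparative setup described here can be sketched in scikit-learn. This is a minimal illustration under stated assumptions, not the authors' pipeline: the 13-dimensional vectors stand in for MFCC-style acoustic features, the three classes and their separations are invented, and the CNN branch is omitted since it would need a deep learning framework.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_per_class, n_feats = 60, 13  # 13 stands in for MFCC coefficients

# Synthetic "acoustic feature" vectors for three emotion classes,
# drawn around distinct class means purely for illustration.
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(n_per_class, n_feats))
               for m in (0.0, 1.5, 3.0)])
y = np.repeat([0, 1, 2], n_per_class)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

accs = {}
for name, model in [
    ("SVM", make_pipeline(StandardScaler(), SVC(kernel="rbf"))),
    ("MLP", make_pipeline(StandardScaler(),
                          MLPClassifier(hidden_layer_sizes=(32,),
                                        max_iter=1000, random_state=0))),
]:
    accs[name] = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name} accuracy: {accs[name]:.2f}")
```

Wrapping each model with `StandardScaler` in a pipeline keeps the comparison fair, since both SVMs and MLPs are sensitive to feature scale.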
- New
- Research Article
- 10.1044/2025_jslhr-24-00713
- Feb 10, 2026
- Journal of speech, language, and hearing research : JSLHR
- Amélie Brisebois + 3 more
Lexical performance in discourse is of considerable interest in acquired communication disorders. The transcription-free core lexicon measure evaluates the most typical words a person uses during communication. This study aimed (a) to develop core lexicon lists in Laurentian French speakers without brain injury and (b) to assess their psychometric properties. Spoken discourse was elicited using the picture description task from the Western Aphasia Battery-Revised (WAB-R; Kertesz, 2006) and the Cinderella Story Telling (CST) task. Participants were Laurentian French speakers from Quebec, aged 50-79 years, without brain injury. Sixty-six completed the WAB-R task, and 48 completed the CST task. Core noun and verb lists were created using the CLAN program, including words produced by at least 50% of the sample. Two raters scored all audio samples. Intra- and interrater reliability and long-term test-retest reliability were calculated. Construct validity was examined through correlations with micro- and macrostructural discourse measures. Four core lexicon lists were generated. For the WAB-R, 19 nouns and five verbs were identified; for the CST, 19 nouns and 16 verbs were identified. Intrarater reliability was excellent across variables, and interrater reliability was excellent for all core noun lists and CST core verbs and good for WAB-R core verbs. Long-term test-retest reliability ranged from poor to moderate across measures. Core lexicon scores were significantly and positively correlated with 12 macrostructural and nine microstructural variables. This study supports the rater reliability and construct validity of core lexicon measures in Laurentian French speakers across two discourse tasks. It also provides the first long-term test-retest reliability data for core lexicon scoring, offering insights that guide its clinical and research applications. https://doi.org/10.23641/asha.31236010.
- New
- Research Article
- 10.1177/09574565261419827
- Feb 7, 2026
- Noise & Vibration Worldwide
- Atul Dhakar + 2 more
This study proposes an advanced methodology for fault detection and prediction in air compressor (AC) systems using acoustic signal analysis under both healthy and faulty operating conditions. Audio data were acquired using a unidirectional microphone interfaced with an NI 9234 data acquisition module and an NI 9172 chassis. The collected signals were processed using recent non-traditional techniques, namely Local Mean Decomposition (LMD) and Empirical Mode Decomposition (EMD), to extract detailed fault-related characteristics. To identify the most dominant fault among seven faulty conditions, a Bubble Cloud (B-Cloud) analysis was employed using 15 statistical indicators (SIs) as input features. These indicators were subsequently classified using discriminant-analysis-based machine learning algorithms, including Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA). The experimental results reveal that LMD provides superior signal decomposition performance compared to EMD due to its enhanced capability in isolating intrinsic oscillatory components. Among all SIs, the Kurtosis index proved to be the most sensitive and reliable feature for fault discrimination, particularly when combined with LMD outputs. Furthermore, LDA achieved the highest classification accuracy of 88.88%, outperforming QDA, and demonstrating its suitability for real-time fault prediction. Overall, the proposed framework offers a robust, accurate, and efficient solution for identifying critical fault conditions in AC systems, supporting improved predictive maintenance and system reliability.
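The sensitivity of the kurtosis indicator to impulsive faults, which this abstract highlights, is easy to demonstrate on synthetic signals. The fault model below (a clean tone plus sparse added impulses) is a hypothetical stand-in for real compressor acoustics, not the study's data.

```python
import numpy as np

def kurtosis(x: np.ndarray) -> float:
    """Pearson kurtosis: the fourth standardized moment (3.0 for a Gaussian)."""
    x = x - x.mean()
    return float(np.mean(x**4) / np.mean(x**2) ** 2)

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 8000, endpoint=False)
healthy = np.sin(2 * np.pi * 50 * t) + 0.1 * rng.normal(size=t.size)

# Faulty signal: same baseline plus sparse impulsive bursts, the kind of
# transient a developing mechanical fault injects into the acoustic signal.
faulty = healthy.copy()
faulty[::400] += 6.0

print(kurtosis(healthy), kurtosis(faulty))
```

A smooth periodic signal has low kurtosis (1.5 for a pure sine), while rare high-amplitude impulses inflate the fourth moment far more than the second, so kurtosis rises sharply under impulsive faults. That heavy-tail sensitivity is why it pairs well with decompositions like LMD that isolate the impulsive components.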
- New
- Research Article
- 10.1145/3796236
- Feb 6, 2026
- ACM Transactions on Accessible Computing
- Amama Mahmood + 2 more
Our work addresses the challenges older adults face with commercial Voice Assistants (VAs), notably in conversation breakdowns and error handling. Traditional methods of collecting user experiences—usage logs and post-hoc interviews—do not fully capture the intricacies of older adults’ interactions with VAs, particularly regarding their reactions to errors. To bridge this gap, we equipped 15 older adults’ homes with smart speakers integrated with custom audio recorders to collect “in-the-wild” audio interaction data for detailed error analysis. Recognizing the growing use of Large Language Models (LLMs) to enhance capabilities of voice assistants, our study also explored how this integration of LLMs changes older adults’ interaction dynamics, specifically during errors. Midway through our study, we deployed ChatGPT-powered VA to investigate its efficacy for older adults. Our research suggests that while technical improvements—such as leveraging vocal and verbal responses combined with LLMs’ contextual capabilities—can enhance error prevention and management in VAs, interaction-level challenges still remain, particularly those unique to older adults. We propose design considerations to better align future VAs with older adults’ expectations and lived experiences.
- Research Article
- 10.1038/s41746-025-02299-2
- Feb 3, 2026
- NPJ digital medicine
- Sapir Gershov + 3 more
Artificial Intelligence (AI) is reshaping medical education, particularly in the domain of competency-based assessment, where current methods remain subjective and resource-intensive. We introduce a multimodal AI framework that integrates video, audio, and patient monitor data to provide objective and interpretable competency assessments. Using 90 anesthesia residents, we established "ideal" performance benchmarks and trained an anomaly detection model (MEMTO) to quantify deviations from these benchmarks. Competency scores derived from these deviations showed strong alignment with expert ratings (Spearman's ρ = 0.78; ICC = 0.75) and demonstrated high ranking precision (Relative L2-distance = 0.12). SHAP analysis revealed that communication and eye contact with the patient monitor are key drivers of variability. By linking AI-assisted anomaly detection with interpretable feedback, our framework addresses critical challenges of fairness, reliability, and transparency in simulation-based education. This work provides actionable evidence for integrating AI into medical training and advancing scalable, equitable evaluation of competence.
- Research Article
- 10.17507/tpls.1602.07
- Feb 1, 2026
- Theory and Practice in Language Studies
- Wahyu Indrayatti + 3 more
The purpose of this study is to describe 1) the linguistic and communicative needs of students in the context of tourism and hospitality, 2) the integration of Action-oriented approach principles into digital textbook design for practical and contextual learning, and 3) the characteristics of digital learning media for improving speaking skills. This qualitative study used a questionnaire administered to students. The results show that most students responded positively to the textbook material offered by the researcher. Students’ communicative needs include various general tasks in the hospitality and tourism sector. The digital textbook design integrates real tasks and professional contexts into the material, exercises, and evaluation components, matched to the needs of beginner students. The interactive learning media for the vocational field provides various digital audio and video features, which can encourage student independence in practicing speaking skills.
- Research Article
- 10.1111/bjhp.70051
- Feb 1, 2026
- British Journal of Health Psychology
- Lily Hawkins + 7 more
Objectives: Understanding the fidelity of delivery of complex health behaviour interventions is crucial in determining their effectiveness and identifying aspects needing refinement. PROGROUP is a group‐based intervention for people with severe obesity. It aims to promote a shared social identity to support behaviour change. Data from a feasibility randomized controlled trial (fRCT) were used to assess fidelity of intervention delivery and the impact on patient experiences, to optimize the intervention for a main trial. Methods: Data from 18 patient and five facilitator interviews, audio and video data of group sessions, two fidelity checklists, support calls and a group processes questionnaire were used to assess fidelity of delivery to intervention principles, patients' experience of the intervention and areas for optimization. Results: The number of activities delivered and facilitator confidence and rapport with the group affected fidelity to intervention principles. The facilitators' delivery style, group composition and attendance affected the groups' sense of social identity. Accordingly, the intervention content was revised to ensure better balance between educational material and group activities, to increase facilitator confidence and enable flexible delivery. Conclusions: The success of group‐based interventions relies on the facilitator addressing the group's needs and creating conditions for a shared social identity to develop. Assessment of fidelity to the manual content and core function of PROGROUP enabled identification of components needing refinement, incorporating both facilitator and patient perspectives. The assessment and optimization process offer a blueprint for evaluating other group‐based interventions.
- Research Article
- 10.1016/j.actpsy.2025.106099
- Feb 1, 2026
- Acta psychologica
- Zheng Wangxiongjie
CEO perceived personality and corporate risk disclosure in prospectus: A multimodal machine learning analysis.
- Research Article
- 10.1016/j.jad.2025.120644
- Feb 1, 2026
- Journal of affective disorders
- Yu Jin + 8 more
Depression screening with textual and audio features based on large language models and machine learning.
- Research Article
- 10.22134/cc4kt319
- Jan 31, 2026
- REVISTA TRACE
- León García Lam
Below, I present a transcribed version of the presentation given by Professor Dominique Chemín on October 3, 2018, during the XX Congress of Otopames (2018), held at the Regional Museum of San Luis Potosí in tribute to the ethnological work of his wife Heidi Chemin Bässler. The transcription was made from a digital audio recording. The initial version faithfully reflects the orality of the speech: it included fillers, repetitions characteristic of Professor Chemín, as well as audience reactions —laughter, applause— and even spontaneous interruptions, such as a phone call taken during the presentation. In contrast, this final version was edited to remove those elements, the style was adjusted, and readability was optimized. However, both the recording and the first transcription are available upon request for those who need to consult them.
- Research Article
- 10.1017/pan.2025.10031
- Jan 30, 2026
- Political Analysis
- Rafael Mestre + 1 more
Abstract Political science is a field rich in multimodal information sources, from televised debates to parliamentary briefings. This paper bridges a gap between computer science and political science in multimodal data analysis using audio. The adoption of multimodal analyses in political science (e.g., video/audio with text-as-data approaches) has been relatively slow due to the unequal distribution of the computational power and skills needed. We provide solutions to challenges encountered when analyzing audio, advancing the potential for multimodal data analysis in political science. Using a dataset of all televised U.S. presidential debates from 1960 to 2020, we focus on three features encountered when analyzing audio data: low-level descriptors (LLDs), such as pitch or energy; Mel-frequency cepstral coefficients (MFCCs); and audio embeddings/encodings, like Wav2Vec. We showcase four applications: (a) forced alignment of audio text using MFCCs, time-stamping transcripts, and speaker information; (b) speech characterization using LLDs; (c) custom-made classification models with audio embeddings and MFCCs; and (d) emotional recognition models using Wav2Vec for classification of discrete emotions and their valence-arousal dominance. We provide explanations to help understand how these features can be applied to different political research questions and advice on guarding against naive interpretation, for both experienced researchers and those who want to start working with audio.
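The low-level descriptors mentioned in this abstract are typically computed framewise over short windows. The NumPy sketch below extracts two classic LLDs (log energy and zero-crossing rate) from a synthetic signal; it is a from-scratch illustration with assumed frame sizes, not the toolchain (e.g., openSMILE-style extractors) a production analysis would use, and pitch estimation is omitted for brevity.

```python
import numpy as np

def frame_llds(signal: np.ndarray, sr: int, frame_ms: float = 25.0,
               hop_ms: float = 10.0) -> np.ndarray:
    """Compute two low-level descriptors per frame: log energy and
    zero-crossing rate (ZCR), using 25 ms windows with a 10 ms hop."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame) // hop
    feats = np.empty((n_frames, 2))
    for i in range(n_frames):
        w = signal[i * hop : i * hop + frame]
        feats[i, 0] = np.log(np.sum(w**2) + 1e-10)             # log energy
        feats[i, 1] = np.mean(np.abs(np.diff(np.sign(w))) > 0)  # ZCR
    return feats

sr = 16000
t = np.arange(sr // 2) / sr
# Toy one-second signal: loud low-frequency speech-like first half,
# quiet high-frequency second half.
sig = np.concatenate([np.sin(2 * np.pi * 100 * t),
                      0.1 * np.sin(2 * np.pi * 2000 * t)])
llds = frame_llds(sig, sr)
print(llds.shape)
```

Energy tracks loudness and turn-taking, while ZCR rises with high-frequency content, so even these two descriptors separate the halves of the toy signal: early frames show high energy and low ZCR, late frames the reverse.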