An Embedded Variable Bit-Rate Audio Coder for Ubiquitous Speech Communications
In this paper, we propose an embedded variable bit-rate (VBR) audio coder that provides the best-fitting quality of service (QoS) and better service connectivity for ubiquitous speech communications. It offers scalable bandwidth from narrowband to wideband speech signals, and an embedded 8–32 kbit/s VBR that adapts to the network condition and terminal capacity. In the design of the embedded VBR coder, the narrowband signals are compressed by an existing standard speech coding method for compatibility with the G.729 coder, and the remaining signals are then compressed hierarchically on the basis of CELP enhancement and transform coding with the temporal noise shaping (TNS) method. Objective and subjective quality tests show that the proposed embedded VBR audio coder provides reasonable quality compared with existing audio coders such as G.722 and G.722.2 in terms of mean opinion score (MOS) and wideband perceptual evaluation of speech quality (PESQ-WB).
- Research Article
1
- 10.9734/jerr/2018/v3i116704
- Nov 24, 2018
- Journal of Engineering Research and Reports
Because voice service is the major offering of telecommunication networks, its level of Quality of Service (QoS) largely determines the performance of these networks. This work evaluated the state-of-the-art Perceptual Evaluation of Speech Quality (PESQ) objective model for perceptual estimation of the quality of transmitted speech signals. Perceptual estimation of speech quality is predominantly done by subjective techniques, with the results presented as Mean Opinion Scores (MOS) on a scale from 1 for poor quality to 5 for excellent quality. Despite the constraints of the subjective approach to perceptual speech quality estimation, its scores serve as the basis for correlating quality scores from objective techniques. Original (reference) speech samples were recorded using professional studio equipment and software, guided by the provisions of ITU-T P.830. The samples were transmitted over three mobile wireless networks. A speech database consisting of 64 original (32 male and 32 female) and 192 transmitted speech samples was developed. Reference samples and their corresponding transmitted (network-degraded) samples were run through the PESQ model to estimate their quality scores. The raw PESQ quality scores lie within the range of -0.5 to 4.5. They were mapped to the MOS scale for linear comparison of the scales. Study of the PESQ model showed several shortcomings, some of which have been improved upon by previous researchers. Evaluating the PESQ mapping function (in ITU-T Rec. P.862.1) showed the need for better coverage of the MOS scale. The solution of the logistic growth function was analysed and its parameters were optimised, which resulted in a new robust logistic mapping function. The raw PESQ quality scores were mapped using the developed mapping function as well as two known standard mapping functions, namely the ITU-T P.862.1 and the Morfitt and Cotanis mapping functions.
The mapped scores, known as PESQ MOS-listening quality objective (PESQ MOS-LQO), obtained with the three functions were tested using ANOVA at a significance level of 0.05. The developed logistic mapping function offered a quality score coverage of 98.6% of the MOS scale. Evaluated against the two known standard mapping functions, the developed function offered improvements of 11.8 and 4.9 percentage points over their 86.8% and 93.7% coverage of the MOS scale, respectively. At this significance level, an F-value of 60.6042, a critical F of 3.04, and a p-value of 4.61721E-21 were obtained. With p < 0.05, the null hypothesis was rejected, and the critical F value being less than the F-statistic confirmed the rejection. Therefore, the score distribution of at least one of the functions has a different mean and belongs to a separate population of performance.
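The mapping step described above can be illustrated with the standard ITU-T P.862.1 logistic function, which is one of the two reference mappings the abstract names. The sketch below is only that baseline, not the new mapping function developed in the study (its parameters are not given here). The helper `anova_rejects_null` is a hypothetical name, and its default `alpha` of 0.05 is an assumption taken from the "p < 0.05" criterion in the text.

```python
import math


def pesq_raw_to_mos_lqo(raw_score: float) -> float:
    """ITU-T P.862.1 logistic mapping from a raw PESQ score to MOS-LQO."""
    return 0.999 + (4.999 - 0.999) / (1.0 + math.exp(-1.4945 * raw_score + 4.6607))


def anova_rejects_null(f_statistic: float, f_critical: float,
                       p_value: float, alpha: float = 0.05) -> bool:
    """Hypothetical helper: reject the null hypothesis when p < alpha.

    F > F-critical is the equivalent check quoted alongside the p-value.
    """
    return p_value < alpha and f_statistic > f_critical


if __name__ == "__main__":
    # Raw PESQ scores span -0.5 to 4.5; print where the mapping puts them.
    for raw in (-0.5, 1.0, 2.5, 4.5):
        print(f"raw PESQ {raw:+.1f} -> MOS-LQO {pesq_raw_to_mos_lqo(raw):.3f}")
    # Values reported in the abstract.
    print(anova_rejects_null(60.6042, 3.04, 4.61721e-21))
```

Because the logistic curve saturates before reaching either end of the 1 to 5 scale, the mapped scores cannot cover the whole MOS range, which is the kind of coverage gap the study sets out to reduce.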
- Research Article
4
- 10.1007/s10772-011-9126-0
- Jan 12, 2012
- International Journal of Speech Technology
Today, the primary constraints in wireless communication systems are limited bandwidth and power. Wireless systems that transmit speech therefore need efficient and effective methods for maintaining speech quality, especially at the receiving end, with maximum saving of bandwidth and power. Amongst all elements of the communication system (transmitter, channel and receiver), the transmission channel (the carrier of information/data, also called the medium) is the most critical and plays a key role in the transmission and reception of information/data. Channel conditions decide the quality of speech at the receiver. Modeling a channel is a complex task, and many techniques are adopted to mitigate the effect of the channel. AMR (Adaptive Multi-Rate) is one such technique that counteracts the deleterious effect of the channel on speech. It employs a variable bit rate, dynamically switching between specific bit rates (called modes of operation) depending upon the channel conditions. In this paper, the application of a Code Excited Linear Prediction (CELP) source coder to speech, followed by an AMR codec, is investigated. An e-test bench is created in MATLAB to implement the CELP-based AMR codec scheme, which is studied through a series of simulations. Both subjective and objective evaluations are carried out. Objective evaluations are categorized into waveform-based, spectral-based and perceptual-based analysis. The simulation results are recorded and compared in various graphs and tables, covering parameters such as Absolute Error (ABS), Mean Square Error (MSE), Root Mean Square Error (RMSE), Signal to Noise Ratio (SNR), segmental SNR (segSNR) (Y. Hu and P. Loizou in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., vol. 1, pp. 153-156, 2006a; Proc. Interspeech, pp. 1447-1450, 2006b), Weighted-Slope Spectral distance (WSS) (Y. Hu and P. Loizou in Speech Commun. 49, 588-601, 2007), Perceptual Evaluation of Speech Quality (PESQ) (ITU-T rec. P.862, 2000), Log-Likelihood Ratio (LLR), Itakura-Saito Distance measure (ISD), Cepstrum Distance Measures (CEP) (V. Turbin and N. Faucheur in Proc. Online Workshop Meas. Speech Audio Quality Netw., pp. 81-84, 2005), Frequency Weighted Segmental SNR (fwSNRseg), Predicted rating of overall Quality (Covl), Rating of Speech Distortion (Csig), Rating of Background Distortion (Cbak) (ITU-T rec. P.835, 2003) and Mean Opinion Score (MOS). Simulation results clearly indicate that it is possible to produce variable bit rates (tuned to channel conditions) in a CELP coder by modifying the coefficients of the coder while still maintaining good speech quality. Further, the higher the bit rate used, the better the quality of speech (which can be verified from the results obtained with PESQ and MOS analysis), while the simulation delay time also increases.
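The waveform-based measures listed above have simple textbook definitions, sketched below for two equal-length sample sequences. This is a minimal illustration and not the paper's MATLAB test bench. The frame length of 160 samples and the per-frame clamp to the range -10 to 35 dB in `segmental_snr_db` are assumptions (commonly used values), as are all function names.

```python
import math


def mse(ref, deg):
    """Mean square error between reference and degraded samples."""
    return sum((r - d) ** 2 for r, d in zip(ref, deg)) / len(ref)


def rmse(ref, deg):
    """Root mean square error."""
    return math.sqrt(mse(ref, deg))


def snr_db(ref, deg, eps=1e-12):
    """Overall SNR in dB: signal energy over error energy."""
    signal = sum(r * r for r in ref)
    noise = sum((r - d) ** 2 for r, d in zip(ref, deg))
    # eps guards against log(0) and division by zero for silent or identical input.
    return 10.0 * math.log10((signal + eps) / (noise + eps))


def segmental_snr_db(ref, deg, frame_len=160, lo=-10.0, hi=35.0):
    """Mean of per-frame SNRs, each clamped to [lo, hi] dB (assumed limits)."""
    frame_snrs = []
    for start in range(0, len(ref) - frame_len + 1, frame_len):
        r = ref[start:start + frame_len]
        d = deg[start:start + frame_len]
        frame_snrs.append(min(hi, max(lo, snr_db(r, d))))
    if not frame_snrs:
        raise ValueError("signal is shorter than one frame")
    return sum(frame_snrs) / len(frame_snrs)


if __name__ == "__main__":
    reference = [1.0, -1.0] * 160   # toy signal, two frames of 160 samples
    degraded = [0.9, -0.9] * 160    # same signal attenuated by 10 %
    print(mse(reference, degraded), rmse(reference, degraded))
    print(snr_db(reference, degraded), segmental_snr_db(reference, degraded))
```

Clamping each frame keeps silent or near-perfect frames from dominating the average, which is the usual motivation for segSNR over a single global SNR.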
- Conference Article
9
- 10.1109/icwt.2017.8284163
- Jul 1, 2017
Mean Opinion Score (MOS) is one of the indicators used to determine the quality of telecommunications services. This study measured the quality of telephone conversations on Over-The-Top (OTT) call services. The study was conducted using the Perceptual Evaluation of Speech Quality (PESQ) in accordance with ITU-T P.862. Based on ITU-T P.862, MOS values are expressed as scores from 1.0 to 4.5.
- Conference Article
- 10.1109/icassp49357.2023.10096572
- Jun 4, 2023
Perceptual speech quality is an important performance metric for teleconferencing applications. The mean opinion score (MOS) is standardized for the perceptual evaluation of speech quality and is obtained by asking listeners to rate the quality of a speech sample. Recently, there has been increasing research interest in developing models for estimating MOS blindly. Here we propose a multitask framework to include additional labels and data in training to improve the performance of a blind MOS estimation model. Experimental results indicate that the proposed model can be trained to jointly estimate MOS, reverberation time (T60), and clarity (C50) by combining two disjoint data sets in training, one containing only MOS labels and the other containing only T60 and C50 labels. Furthermore, we use a semi-supervised framework to combine two MOS data sets in training, one containing only MOS labels (per ITU-T Recommendation P.808), and the other containing separate scores for speech signal, background noise, and overall quality (per ITU-T Recommendation P.835). Finally, we present preliminary results for addressing individual rater bias in the MOS labels.
- Conference Article
4
- 10.1109/csnt.2012.124
- May 1, 2012
This paper investigates the application of the Code Excited Linear Prediction (CELP) algorithm to an Adaptive Multi-Rate Wideband (AMR-WB) coder. The proposed coder can adaptively change its bit rate based on the C/I ratio, depending on channel conditions. The coder has nine bit rates from 6.6 kbps to 23.85 kbps. An e-test bench is created in MATLAB to implement the proposed coder, and a series of simulations is carried out to judge its performance using subjective and objective analysis. Simulation results clearly indicate that it is possible to produce a variable bit rate (by tuning to channel conditions) in a CELP coder by modifying the coefficients of the coder while still maintaining speech quality comparable to the AMR-WB coder standardized by 3GPP and ITU-T [5]. It is also evident from the simulation results that Signal to Noise Ratio (SNR), segmental SNR, Perceptual Evaluation of Speech Quality (PESQ) and Mean Opinion Score (MOS) increase with the bit rate of the proposed coder, while Absolute Error (Abs Err), Mean Square Error (MSE) and Root Mean Square Error (RMSE) decrease as the bit rate increases.
- Research Article
5
- 10.1121/1.4877764
- Apr 1, 2014
- The Journal of the Acoustical Society of America
In this paper, localization and separation of acoustic sources are examined. Depending on the number of sources in relation to the number of array channels, the problem is investigated in underdetermined and overdetermined configurations. In the underdetermined configuration, virtual monopole sources are assumed at uniformly spaced angles. The problem is then formulated as a compressive sampling (CS) problem, which can be solved by linearly constrained ℓ1-norm convex (CVX) optimization. The solution yields the directions of the real sources and the source signal spectrum, which enables localization and reconstruction of sources in one shot. In the overdetermined configuration, source localization and signal separation are carried out in two steps. First, the directions of arrival (DOA) are estimated with Minimum Variance Distortionless Response (MVDR) or Multiple Signal Classification (MUSIC). Next, Tikhonov regularization (TIKR) is used to recover the source spectrum. In the localization problem for both configurations, a Neyman-Pearson detector is employed to determine thresholds for source detection. Numerical and experimental results show that the proposed methods produce improved speech quality in terms of mean opinion score (MOS) in the perceptual evaluation of speech quality (PESQ) test.
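The Tikhonov step mentioned above has a standard closed form, written out here as a sketch. All symbols are assumptions introduced for illustration and do not come from the abstract: $\mathbf{p}$ is the vector of array channel spectra at one frequency, $\mathbf{A}$ is the steering matrix built from the estimated directions of arrival, $\mathbf{s}$ is the source spectrum vector, $\mathbf{n}$ is noise, and $\beta$ is the regularization parameter.

```latex
% Assumed array model at one frequency bin
\mathbf{p} = \mathbf{A}\,\mathbf{s} + \mathbf{n}

% Tikhonov-regularized least-squares problem
\hat{\mathbf{s}} = \arg\min_{\mathbf{s}}
  \left\lVert \mathbf{p} - \mathbf{A}\mathbf{s} \right\rVert_2^2
  + \beta^2 \left\lVert \mathbf{s} \right\rVert_2^2

% Closed-form solution
\hat{\mathbf{s}} = \left( \mathbf{A}^{H}\mathbf{A} + \beta^{2}\mathbf{I} \right)^{-1}
  \mathbf{A}^{H}\,\mathbf{p}
```

With $\beta = 0$ this reduces to the ordinary least-squares pseudo-inverse; a positive $\beta$ trades a small bias for stability when $\mathbf{A}^{H}\mathbf{A}$ is ill-conditioned, for example when two estimated directions are close together.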
- Research Article
4
- 10.1007/s10772-012-9178-9
- Oct 23, 2012
- International Journal of Speech Technology
This paper deals with the implementation of variable bit rate steganographic data transmission over the ETSI GSM 06.10 FR coder at five different bit rates. A few modifications are then suggested in the Regular Pulse Excitation (RPE) section of the ETSI GSM FR coder, which are claimed to yield a state-of-the-art proposed GSM FR coder. Compared with the ETSI GSM FR coder, the proposed coder exhibits steganographic data transmission at the same bit rates. To facilitate this, a few RPE pulses are identified and used for embedding and hiding the information bits. The key element of this research is to allow joint speech coding and data hiding, which is accomplished with two different approaches: the Fixed and the Joint Approach. Both approaches are implemented on both the standard and the proposed coders for an overall analytical evaluation of performance using subjective (Mean Opinion Score and Degraded MOS) and objective (Perceptual Evaluation of Speech Quality) analysis. A small amount of data is represented as the stego signal, which can be embedded in different encoded wave files (chosen from the NOIZEUS corpus) that serve as the carrier signal. Simulation results for both coders reveal the trade-off between data embedding rate and recovered speech quality (for both approaches). It is evident from both subjective and objective analysis that the proposed coder offers comparable performance with less simulation delay because of its inherent constructional difference. For both coders, the Joint approach performs better, but at the cost of more simulation delay.
- Research Article
11
- 10.4304/jmm.8.3.291-298
- Jun 8, 2013
- Journal of Multimedia
In Quality of Experience (QoE) evaluation of a communication system, speech quality is an important factor. Perceptual Evaluation of Speech Quality (PESQ) is a well-known objective speech quality assessment method for voice QoE evaluation. It was proposed by the International Telecommunication Union (ITU) and is standardized as ITU-T Recommendation P.862. PESQ applies the Bark frequency scale when estimating the Mean Opinion Score (MOS) for the speech quality of a voice communication system. Through our research, however, we find that the PESQ algorithm has limitations in evaluating speech quality. To address these limitations, this paper proposes a new objective evaluation algorithm that uses the ERB scale in place of the Bark scale. The ERB scale describes the frequency selectivity of the human ear more accurately than the Bark scale at lower frequencies. We call the new algorithm NPESQ (New Perceptual Evaluation of Speech Quality), and its accuracy is tested in our experiment. An experimental comparison of PESQ and NPESQ demonstrates the improvement of NPESQ. Therefore, it can be concluded that the new algorithm can improve the accuracy of the measurement.
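The difference between the two auditory scales can be seen with commonly cited closed-form approximations, sketched below. These are textbook formulas (Traunmüller for Bark, Glasberg and Moore for ERB, Zwicker and Terhardt for critical bandwidth) chosen here as assumptions; the abstract does not say which exact warping PESQ or NPESQ uses, and the function names are illustrative.

```python
import math


def hz_to_bark(f_hz: float) -> float:
    """Bark value of a frequency (Traunmüller approximation)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def hz_to_erb_rate(f_hz: float) -> float:
    """ERB-rate (number of ERBs below f) after Glasberg and Moore."""
    return 21.4 * math.log10(1.0 + 0.00437 * f_hz)


def erb_bandwidth_hz(f_hz: float) -> float:
    """Equivalent rectangular bandwidth of the auditory filter centred at f."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)


def critical_bandwidth_hz(f_hz: float) -> float:
    """Bark-scale critical bandwidth (Zwicker and Terhardt approximation)."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69


if __name__ == "__main__":
    for f in (100.0, 500.0, 1000.0, 4000.0):
        print(f"{f:6.0f} Hz: Bark {hz_to_bark(f):5.2f}, ERB-rate {hz_to_erb_rate(f):5.2f}, "
              f"CB {critical_bandwidth_hz(f):6.1f} Hz, ERB {erb_bandwidth_hz(f):6.1f} Hz")
```

Under these approximations the critical bandwidth stays near 100 Hz at low frequencies while the ERB keeps narrowing, which is the low-frequency difference in resolution the abstract points to.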
- Research Article
5
- 10.1007/s00530-017-0549-6
- Apr 18, 2017
- Multimedia Systems
This paper proposes two models of Mean Opinion Score (MOS) estimation based on Thai users and the Thai language, referring to packet loss effects, for the G.726 and G.729 codecs. Absolute Category Rating (ACR) listening tests were conducted with 89 and 107 participants for the development of the G.726 and G.729 MOS estimation models respectively, while the same tests were conducted with a total of 60 participants for the evaluation of both models. Packet loss rates were 0–15%, with 5 test conditions for G.726 and 6 test conditions for G.729; each condition was tested with at least 16 participants. After the data were gathered, the MOS estimation models for both codecs were created and then evaluated with the test sets against Perceptual Evaluation of Speech Quality (PESQ), a popular measurement method. As one contribution of this study, evaluation using Mean Absolute Percentage Error (MAPE) showed that the proposed models for G.726 and G.729 performed better than PESQ, reducing the MAPE by about 30% and 17% respectively.
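The MAPE comparison described above follows the usual definition, sketched here with the subjective MOS as the reference. The function name and the sample scores are hypothetical and are not data from the study.

```python
def mape(reference, predicted):
    """Mean Absolute Percentage Error, in percent, of predictions against reference values."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((r - p) / r) for r, p in zip(reference, predicted))
    return 100.0 * total / len(reference)


if __name__ == "__main__":
    # Hypothetical subjective MOS per packet-loss condition, and two sets of estimates.
    subjective_mos = [4.0, 3.5, 3.0, 2.5, 2.0]
    model_estimate = [3.9, 3.4, 3.1, 2.6, 2.1]
    pesq_estimate = [4.2, 3.9, 3.5, 3.0, 2.6]
    print(f"model MAPE: {mape(subjective_mos, model_estimate):.1f} %")
    print(f"PESQ  MAPE: {mape(subjective_mos, pesq_estimate):.1f} %")
```

Since MOS never falls below 1, the division by the reference value is safe for this use, although MAPE does weight errors at low MOS more heavily than the same absolute error at high MOS.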
- Book Chapter
4
- 10.1007/978-3-319-03783-7_22
- Jan 1, 2013
This paper presents the study of VoIP quality measurements from two popular codecs, G.711 and G.729, using the methods of Perceptual Evaluation of Speech Quality (PESQ) and Thai speech. In this study, from four lists of Thai speech, it has been found that G.711 provides better voice quality than G.729 in every condition of packet loss. Also, it has been found that Objective Listening Quality - Mean Opinion Score (MOS-LQO) of male speech is slightly higher than MOS-LQO of female speech, whereas MOS of child speech is the lowest. Then, MOS-LQO values from four Thai speech lists have been compared. Next, MOS-LQO from PESQ of male and female speech at the best condition have been compared with the Subjective Listening Quality Mean Opinion Score (MOS-LQS) from ACR listening tests in another laboratory. Lastly, referring to packet loss effects, objective MOS from PESQ have been compared with subjective MOS from conversation tests. It has been found that there is no significant difference among MOS-LQO from the four Thai speech lists, but it has been found that there is a significant difference between subjective MOS and objective MOS from each codec in each condition. Therefore, one can say that this is evidence that PESQ requires intensive study with Thai speech to modify PESQ for VoIP quality measurement in Thai environments confidently.
- Research Article
2
- 10.9734/jsrr/2019/v22i630106
- Apr 3, 2019
- Journal of Scientific Research and Reports
This paper evaluates the voice quality of four Global System for Mobile Communication (GSM) providers in five selected cities in Kwara State, with a view to network performance evaluation and quality of service (QoS) improvement of the GSM network system. Three assessment parameters for evaluating QoS on the network were mainly adopted: network accessibility, service retainability and connection quality. The parameters were applied to four GSM networks in the studied areas using the customers’ complaints method. In addition, a standard method for measuring call voice quality and Mean Opinion Score (MOS), known as Perceptual Evaluation of Speech Quality (PESQ) and defined in International Telecommunication Union Telecommunication Standardization Sector (ITU-T) standard P.862, is adopted. The two methods were then compared to assess the call voice quality of the four GSM networks. The Key Performance Indicators (KPIs) on which the GSM networks were tested include call set-up success rate (CSSR), call drop rate (CDR), call completion success rate (CCSR), handover success rate (HSR) and traffic channel congestion rate (TCHR). The results of the study show that the Quality of Service of the GSM system in the selected cities is unreliable. The study also shows that GSM network accessibility and retainability in the country are unsatisfactory. However, the call voice quality was observed to be at its peak in these cities across the four network providers. At the end of this manuscript, suggestions are given on how to advance both the Quality of Service and the positive impact of the GSM network in the selected areas and the country as a whole.
- Conference Article
15
- 10.1109/qomex.2009.5246960
- Jul 1, 2009
ITU-T P.862, “Perceptual Evaluation of Speech Quality (PESQ)”, is well known as an intrusive objective speech quality assessment method. Some reports have found that the PESQ time alignment mechanism fails to estimate delay when signals with a high packet loss rate and dynamic time processing are present. A new time-alignment algorithm to improve PESQ accuracy for time-scale modified voice transmission is suggested here. In the proposed model, the time alignment of reference and degraded speech is estimated using Dynamic Time Warping (DTW), in contrast to the correlation and splitting methods used in standard PESQ. Comparative results versus subjective Mean Opinion Score (MOS) show improvement in cases where dynamic time processing of signals is present.
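The idea behind DTW-based alignment can be sketched with the classic dynamic-programming recurrence on two scalar sequences. This is a generic illustration under simplifying assumptions: the paper's algorithm presumably works on speech feature frames and its step constraints are not stated here, so the absolute-difference local cost and the function name `dtw_distance` are assumptions.

```python
def dtw_distance(ref, deg):
    """Accumulated cost of the best monotonic alignment between two sequences."""
    n, m = len(ref), len(deg)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning ref[:i] with deg[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(ref[i - 1] - deg[j - 1])       # assumed local distance
            cost[i][j] = local + min(cost[i - 1][j],      # ref advances, deg repeats
                                     cost[i][j - 1],      # deg advances, ref repeats
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


if __name__ == "__main__":
    reference = [1, 2, 3, 4]
    stretched = [1, 2, 2, 3, 3, 4]   # same shape, time-scale modified
    print(dtw_distance(reference, stretched))
```

Unlike a single cross-correlation delay estimate, the warping path may stretch or compress time locally, which is why it suits degraded signals whose delay varies along the utterance.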
- Book Chapter
8
- 10.1007/978-3-642-04994-1_8
- Jan 1, 2009
The Voice over Internet Protocol (VoIP) services are currently present in our personal and professional activities and will be key services in Next Generation Networks (NGN). Hence, in order to keep and attract new customers, the quality of delivery for VoIP services needs to be measured and optimized to ensure Quality of Service (QoS) and Quality of Experience (QoE) support to users in future multimedia networking systems. This paper presents the requirements to assess the quality level of VoIP services in NGN and analyzes the limitations of the well-known E-Model and Perceptual Evaluation of Speech Quality (PESQ) metrics for quality evaluation of VoIP services. Additionally, a new QoE metric, named Advanced Model for Perceptual Evaluation of Speech Quality (AdmPESQ), is proposed to overcome the limitations of current proposals concerning packet loss and packet delay awareness and to improve the VoIP assessment process. Performance evaluation was carried out based on simulation experiments to show the benefits of AdmPESQ in assessing the impact of VoIP services on the user's expectation.
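For context on the E-Model mentioned above, its output rating R is conventionally converted to an estimated MOS with the formula from ITU-T G.107, sketched below. This shows only that standard conversion, not the AdmPESQ metric proposed in the chapter; the function name is illustrative.

```python
def r_factor_to_mos(r: float) -> float:
    """Convert an E-Model transmission rating R to estimated MOS (ITU-T G.107)."""
    if r <= 0.0:
        return 1.0
    if r >= 100.0:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6


if __name__ == "__main__":
    for r in (50.0, 70.0, 80.0, 93.2):
        print(f"R = {r:5.1f} -> MOS {r_factor_to_mos(r):.2f}")
```

The E-Model computes R from network and terminal parameters without listening to any signal, whereas PESQ compares a reference signal with the degraded one, so the two metrics capture impairments such as delay and packet loss differently.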
- Research Article
4
- 10.1186/1687-1499-2012-53
- Feb 20, 2012
- EURASIP Journal on Wireless Communications and Networking
- Research Article
25
- 10.1007/s11042-019-7630-4
- Apr 30, 2019
- Multimedia Tools and Applications
Nowadays, cases of theft of important data, both by employees of an organization and by outside hackers, are increasing day by day. New methods for information hiding and secret communication are therefore needed, and steganography is one option. Steganography embeds a secret message into another meaningful message (the cover media) without disturbing the features of the cover media. A novel approach to audio steganography is proposed in this paper, in which both the secret message and the cover media are digital audio. The proposed approach is robust to both LSB removal and re-sampling attacks. It adds an extra layer of security because a transformation function is applied to the amplitude bits of the secret audio before embedding. It is also more resistant to white Gaussian noise (WGN) addition during transmission of the stego file. The proposed approach is suitable for embedding secret audio during real-time audio communication because its processing time is low while its embedding capacity is high. Its embedding capacity is the same as that of the conventional LSB approach because both approaches insert one bit of the secret into each sample of the cover audio. The standard parameters Perceptual Evaluation of Speech Quality (PESQ) and Mean Opinion Score (MOS) are used to measure the imperceptibility between cover audio and stego audio. For the proposed approach, with no attack, PESQ and MOS are found to be 4.47 and 5, which are very close to their respective highest values of 4.5 and 5.
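The conventional LSB baseline that the abstract compares against, one secret bit per cover sample, can be sketched as follows. This is only that baseline: the proposed approach additionally applies a transformation function to the secret audio, which is not described here and is not reproduced. Function names and sample values are hypothetical, and samples are assumed to be signed 16-bit integers.

```python
def embed_lsb(cover_samples, secret_bits):
    """Write one secret bit into the least significant bit of each cover sample."""
    if len(secret_bits) > len(cover_samples):
        raise ValueError("cover audio is too short for the secret")
    stego = list(cover_samples)
    for i, bit in enumerate(secret_bits):
        # Clear the LSB, then set it to the secret bit (works for negative ints too).
        stego[i] = (stego[i] & ~1) | (bit & 1)
    return stego


def extract_lsb(stego_samples, n_bits):
    """Read the first n_bits least significant bits back out."""
    return [s & 1 for s in stego_samples[:n_bits]]


if __name__ == "__main__":
    cover = [1000, -2001, 32767, -32768, 5, 6, 7, 8]   # hypothetical 16-bit samples
    secret = [1, 0, 1, 1, 0, 0, 1, 0]
    stego = embed_lsb(cover, secret)
    print(stego)
    print(extract_lsb(stego, len(secret)) == secret)
```

Each sample changes by at most one quantization step, which is why plain LSB embedding is hard to hear but also why it is fragile: anything that rewrites the low bits, such as LSB removal or re-sampling, destroys the payload.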