ridone August 17, 2018 at 7:12 am #149
As shown in Eq. (9), the modified spectrum (with the proposed constraints incorporated) can be obtained by applying a binary mask to the enhanced spectrum. In computational auditory scene analysis (CASA) applications, a binary mask is often applied to the noisy speech spectrum to recover the target signal [20–23]. In this section, we show that the proposed residual constraints (and the associated binary mask) are related to the ideal binary mask used in CASA and robust speech recognition applications.
The goal of CASA techniques is to segregate the target signal from a sound mixture, and several techniques have been proposed in the literature to achieve that. These techniques can be model-based [24,25] or based on auditory scene analysis principles. Some of the latter techniques use the ideal time-frequency (T-F) binary mask [20,21,27]. The ideal binary mask (IdBM) takes values of zero or one and is constructed by comparing the local SNR in each T-F unit (or frequency bin) against a threshold (e.g., 0 dB). It is commonly applied to the T-F representation of a mixture signal: units assigned a value of zero are eliminated, while units assigned a value of one pass through intact. The ideal binary mask provides the only known criterion (SNR ≥ δ dB, for a preset threshold δ) for improving speech intelligibility, and this was confirmed by several intelligibility studies with normal-hearing [28,29] and hearing-impaired listeners [30,31].
IdBM techniques often introduce musical noise, which is caused by errors in the estimation of the time-frequency masks and manifests in isolated T-F units. A number of techniques have been proposed to suppress the musical noise distortions introduced by IdBM techniques [32,33]. While musical noise might be distracting to listeners, it has not been found to be detrimental to speech intelligibility. This was confirmed in two listening studies with IdBM-processed speech [28,29] and in one study with estimated time-frequency masks. Despite the presence of musical noise, normal-hearing listeners were able to recognize speech processed with estimated and ideal [28,29] binary masks with nearly 100% accuracy.
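The construction of the ideal binary mask described above can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' implementation: it assumes the target and masker powers are known separately for every T-F unit (which is what makes the mask "ideal"), and the function names, the `eps` guard, and the toy numbers are all hypothetical.

```python
import math


def local_snr_db(target_power, masker_power, eps=1e-12):
    """Local SNR in dB for one T-F unit (eps is an assumed guard against log of zero)."""
    return 10.0 * math.log10((target_power + eps) / (masker_power + eps))


def ideal_binary_mask(target_power, masker_power, threshold_db=0.0):
    """Return a 0/1 mask: 1 where the local SNR >= threshold_db, 0 otherwise.

    Both inputs are lists of frames; each frame is a list of per-bin powers.
    """
    return [
        [1 if local_snr_db(t, m) >= threshold_db else 0
         for t, m in zip(t_frame, m_frame)]
        for t_frame, m_frame in zip(target_power, masker_power)
    ]


def apply_mask(mixture, mask):
    """Zero the T-F units marked 0 and pass the units marked 1 through intact."""
    return [
        [x * g for x, g in zip(x_frame, g_frame)]
        for x_frame, g_frame in zip(mixture, mask)
    ]


# Toy data (made up for illustration): 2 frames x 3 frequency bins.
target_power = [[4.0, 0.1, 1.0], [0.5, 9.0, 0.0]]
masker_power = [[1.0, 1.0, 1.0], [2.0, 1.0, 1.0]]
mixture_mag = [[2.2, 1.0, 1.4], [1.6, 3.2, 1.0]]

mask = ideal_binary_mask(target_power, masker_power, threshold_db=0.0)
masked = apply_mask(mixture_mag, mask)
print(mask)    # [[1, 0, 1], [0, 1, 0]]
print(masked)  # [[2.2, 0.0, 1.4], [0.0, 3.2, 0.0]]
```

Raising `threshold_db` discards more units, and lowering it retains more. In practice the target and masker are not available separately, so the mask has to be estimated, which is where the mask errors behind musical noise come from.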
The reasons for the improvement in intelligibility with the IdBM are not entirely clear. Li and Wang argued that the IdBM maximizes the SNR because it minimizes the sum of the missing target energy that is discarded and the masker energy that is retained. More specifically, it was proven that the IdBM criterion maximizes the SNRESI metric given in Eq. (1). The IdBM was also shown to maximize the time-domain segmental and overall SNR measures, which are often used to assess speech quality. Neither of these measures, however, correlates with speech intelligibility. We provide proof in the Appendix that the IdBM criterion maximizes the geometric average of the spectral SNRs and consequently maximizes the articulation index (AI), a metric known to correlate highly with speech intelligibility.
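The energy argument attributed to Li and Wang can be checked numerically on a toy example. The sketch below is an illustration under stated assumptions, not a reproduction of their proof: it assumes a 0 dB threshold, treats the error of a mask as the target power in discarded units plus the masker power in retained units, and uses made-up power values and a hypothetical helper name.

```python
from itertools import product


def mask_error_energy(mask, target_power, masker_power):
    """Missing target energy (units with mask 0) plus retained masker energy (mask 1)."""
    return sum(
        m_p if g == 1 else t_p
        for g, t_p, m_p in zip(mask, target_power, masker_power)
    )


# Toy per-unit powers (made up for illustration).
target_power = [4.0, 0.1, 1.5, 0.5, 9.0]
masker_power = [1.0, 1.0, 1.0, 2.0, 1.0]

# 0 dB criterion: keep a unit when target power >= masker power.
idbm = [1 if t >= m else 0 for t, m in zip(target_power, masker_power)]
idbm_error = mask_error_energy(idbm, target_power, masker_power)

# Brute force over all 2^5 binary masks to find the smallest possible error.
best_error = min(
    mask_error_energy(candidate, target_power, masker_power)
    for candidate in product((0, 1), repeat=len(target_power))
)

print(idbm)                      # [1, 0, 1, 0, 1]
print(idbm_error == best_error)  # True
```

The brute-force search agrees with the 0 dB rule because each unit contributes independently: keeping a unit costs its masker power and discarding it costs its target power, so taking the smaller of the two in every unit minimizes the total.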
As it turns out, the ideal binary mask is not only related to the proposed residual constraints, but is also a special case of the proposed residual constraint for regions I and II. Put differently, the proposed binary mask (see example in Eq. (9)) is a generalized form of the ideal binary mask used in CASA applications. As mentioned earlier, if the estimated magnitude spectrum is restricted to fall within regions I and II, then the SNRESI metric will always be greater than 1 (i.e., > 0 dB). As demonstrated in Figure 4, the stimuli constrained in region I+II consistently improved speech intelligibility for all three enhancement algorithms tested. As mentioned earlier, the composite constraint required for the estimated magnitude spectra to fall in region I+II is given