White Paper: Automated Sound Quality Estimation

February 17, 2009 by admin
Filed under: Voice and Sound Quality Testing Software 

Automated sound signals quality estimation

1. INTRODUCTION

Sound signal quality estimation is becoming increasingly important with the spread of mobile communications, synthetic telephony systems, VoIP and various portable sound recording and reproducing devices. It is natural to look for a method that provides an objective estimate (i.e. one independent of the judgment of a particular listener) and that can be automated. Such a method is important both for comparing competing commercial products and for optimising the parameters of proprietary products.

One of the main parameters of systems for the compression, transmission and reproduction of sound information is the quality of the restored, received or reproduced sound.

Quantitative measurement of sound quality has specific features because the final receiver of a sound signal is always a human, and a human is also the source of most sound signals. Sound signal quality is therefore determined not only by the technical characteristics of the sound processing and transmission systems, but also by the individual peculiarities of speech perception and production, which vary over time and from person to person.

2. REVIEW OF QUALITY ESTIMATION METHODS

Methods for measuring speech quality are divided into subjective and objective ones. Subjective methods include a person's hearing as a component of the measuring system. Objective methods, by contrast, exclude human hearing from the measurement process.

The most widespread subjective method of speech quality estimation is MOS (Mean Opinion Score), an estimation on a five-point scale.

This score is obtained by processing the ratings that groups of listeners give to sequences of sound signals reproduced by various audio systems. Each listener rates each signal, and the results are then averaged.
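The averaging step can be sketched as follows. This is only an illustration: the function name, the panel of four listeners and their ratings are hypothetical, and a real MOS test also prescribes listening conditions that are not modelled here.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average the 1..5 ratings that listeners gave to each signal.

    `ratings` maps a signal name to the list of scores, one per listener.
    """
    for name, scores in ratings.items():
        if not scores or any(s < 1 or s > 5 for s in scores):
            raise ValueError(f"scores for {name!r} must be non-empty and in 1..5")
    return {name: mean(scores) for name, scores in ratings.items()}


# Hypothetical panel: four listeners rate the output of two audio systems.
panel = {"system_a": [4, 5, 4, 4], "system_b": [2, 3, 3, 2]}
mos = mean_opinion_score(panel)
print(mos)  # {'system_a': 4.25, 'system_b': 2.5}
```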

Organizing and carrying out subjective estimation is difficult, time-consuming and expensive. Research has therefore been conducted to find objective methods that give fast, automated estimates which correspond well to the results of subjective tests.

There are various automatic estimation methods; some of them are given below [1]:

AI (Articulation Index). The whole frequency range of the speech signal is divided into 20 bands. The bandwidths are chosen so that every band contributes equally to speech perception. The signal-to-noise ratio is calculated within every band, and the articulation index is taken to be the weighted sum of the band values.

The disadvantage of the articulation index is that, although it is oriented towards speech signals, it does not take into account the properties of hearing and speech production.
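A minimal sketch of the weighted sum is given below. The equal band weights follow the description above; clipping each band's signal-to-noise ratio to a useful range of 0 to 30 dB is an assumption taken from the classical formulation, and the measured values are hypothetical.

```python
def articulation_index(band_snr_db, snr_range_db=30.0):
    """Equal-weight sum of clipped per-band signal-to-noise ratios, in 0..1."""
    if not band_snr_db:
        raise ValueError("at least one band is required")
    weight = 1.0 / len(band_snr_db)  # every band contributes equally
    total = 0.0
    for snr in band_snr_db:
        # Assumed useful range: below 0 dB a band adds nothing,
        # above snr_range_db it adds its full share.
        clipped = min(max(snr, 0.0), snr_range_db)
        total += weight * clipped / snr_range_db
    return total


# Hypothetical measurement: 10 bands at 30 dB and 10 bands at 15 dB.
snrs = [30.0] * 10 + [15.0] * 10
print(round(articulation_index(snrs), 3))  # 0.75
```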

SII (Speech Intelligibility Index) is an evolution of the AI method and is defined in the American standard ANSI S3.5-1997. It provides 4 measuring procedures on different band groups: 21 critical bands, 18 one-third-octave bands, 17 equally contributing critical bands and 6 octave bands. The signal-to-noise ratio is calculated within every band, and the total SII coefficient, ranging from 0 to 1, is computed.

The speech intelligibility index, however, takes into account only the properties of hearing, not speech production.

STI (Speech Transmission Index). A speech signal can be approximately considered as a broadband signal modulated by a low-frequency signal. The articulation rate determines the modulation frequency. When the modulation depth decreases, the speech signal becomes more noise-like and its intelligibility decreases. Accordingly, the loss of intelligibility can be estimated from the decrease in modulation depth.

The whole speech range is divided into 7 octave bands. Octave-band noise is used as the input. The intensity distribution of the test signal agrees with that of speech signals. The modulation frequencies vary from 0.63 to 12.5 Hz in one-third-octave steps (14 frequencies in all).

The STI measuring method is stated in the International standard IEC 268-16.
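The set of modulation frequencies can be generated as in the sketch below. Fourteen one-third-octave steps ending near 12.5 Hz imply a starting point of about 0.63 Hz, which is also the lower limit quoted for the noise generator in section 3.2; that starting value is an assumption here. The computed values are exact powers of two, whereas standards quote rounded nominal values.

```python
def modulation_frequencies(start_hz=0.63, count=14):
    """One-third-octave series: each step multiplies the frequency by 2**(1/3)."""
    return [start_hz * 2 ** (k / 3) for k in range(count)]


freqs = modulation_frequencies()
print([round(f, 2) for f in freqs])
```

Every third value doubles (0.63, 1.26, 2.52, ...), and the last one is about 12.7 Hz, which corresponds to the nominal 12.5 Hz.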

RASTI/STIPA (Rapid Speech Transmission Index). The STI method requires many measurements and calculations. A simplified method was therefore developed that measures in only 2 bands with 5 modulation frequencies, which reduces the number of measurements and calculations. For good intelligibility, RASTI values must be at least 0.6.

Both the speech transmission index (STI) and the rapid speech transmission index (RASTI) imitate the speech production process by means of a noise model, but this way of accounting for the properties of speech production and hearing is far from optimal.

C50 (clarity factor) characterises the clearness and clarity of sound. It is computed as the ratio of near echo to far echo and is based on the fact that echo reduces signal intelligibility. The ratio is calculated in several frequency bands; near echo (arriving within 33 ms) is treated as useful signal and far echo (arriving after 33 ms) as disturbing signal.

The clarity factor takes into account only one of the possible kinds of distortion, so it should be applied only as one of several approaches to speech quality estimation.
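A rough sketch of such an early-to-late ratio for a single band is shown below. Expressing the ratio in decibels and measuring it on an impulse response are assumptions; the impulse response itself is a hypothetical exponential decay. The split time is left as a parameter: 33 ms follows the text above, while the name C50 conventionally refers to a 50 ms boundary.

```python
import math


def clarity_db(impulse_response, sample_rate_hz, split_ms=33.0):
    """Ratio of early to late energy of an impulse response, in dB."""
    split = int(round(sample_rate_hz * split_ms / 1000.0))
    early = sum(x * x for x in impulse_response[:split])   # useful signal
    late = sum(x * x for x in impulse_response[split:])    # disturbing signal
    if early <= 0.0 or late <= 0.0:
        raise ValueError("both early and late parts must contain energy")
    return 10.0 * math.log10(early / late)


# Hypothetical impulse response: exponential decay, 0.5 s at 8 kHz.
rate = 8000
ir = [math.exp(-n / (0.05 * rate)) for n in range(rate // 2)]
print(round(clarity_db(ir, rate), 2))
```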

ITU P.862 PESQ (Perceptual Evaluation of Speech Quality). PESQ is an objective measurement method that predicts the results of subjective listening tests on telephony systems. PESQ uses a sensory model to compare the original, unprocessed signal with the degraded signal from the network or network element. The resulting quality score is similar to the subjective “Mean Opinion Score” (MOS) measured using panel tests according to ITU-T P.800. The PESQ scores are calibrated using a large database of subjective tests. The method takes into account coding distortions, errors, packet loss, delay and variable delay, and filtering in analogue network components.

Although PESQ is one of the most popular tools, it has a number of disadvantages. It requires speech-like test signals, because many systems are optimized for speech and respond in an unrepresentative way to non-speech signals (e.g. tones, noise, ITU-T P.50). The PESQ test signal is chosen by the tester, so vendor estimations may differ from end-customer estimations. The method equalizes signal levels, which is theoretically questionable because speech at different loudness levels may have different spectra. PESQ also fails to detect the significant quality loss that occurs when the voice is equalized so that it has far less low-frequency and high-frequency energy than the original voice file.

The need to develop new methods and to improve existing ones arises from the desire to bring objective and subjective quality estimates closer together and to make explicit use of our knowledge about hearing and speech production in such systems.

Whether an arbitrary or a specialized signal is used as the source signal depends on the purpose of the estimation (speech intelligibility evaluation, sound reproduction quality, quality of speech transmitted through communication channels, etc.); choosing the signal to suit the purpose increases the objectivity of the estimation.

3. GENERAL SCHEME OF THE SYSTEM

Figure 1 shows the general scheme of the quality estimation system for sound signals.

Fig.1. General scheme of the quality estimation system for sound signals

The generator of test signals forms a sound signal according to one of the sound flow models. The signal can be either a specialized set of sound signals or the output of the statistical speech model. (The signal models are considered in detail later.) The generator's signal can either be saved for later use or passed on for processing and estimation. The bank of signals stores sound data produced by the signal generator or obtained from external sources.

Accordingly, the input of the estimation block is either the generator signal directly or a signal from the bank. The test signal is fed both to the synchronizer and to the device under test, which can be, for example, a vocoder or a communication channel. The output signal of the device under test is also fed to the synchronizer.

The synchronizer aligns the initial signal and the processed signal in time. The synchronized signals are passed in chunks to the analytical module, which determines the degree of similarity between the signals and issues the quality estimation as a measure of similarity between the initial and processed signals.

Let us consider the functioning of the system modules in detail.

3.1. Generator of test signals

The generator of test signals consists of a generator of noise signals and a simplified statistical speech model. Both generators simulate the process of “speaking”, but they approach the simulation of speech production differently. The statistical model forms the sound flow on the basis of human speech patterns, whereas the generator of noise signals is based on knowledge about sound perception and speech production.

3.2. Generator of noise signals

The generator of noise signals uses a speech flow model like the one used in the STI method. The idea is that a speech signal can be approximately considered as a broadband signal modulated by a low-frequency signal. The articulation rate determines the modulation frequency, which varies from 0.63 to 13.44 Hz.

The modulated signal is noise obtained from white noise by cutting out the critical bands of hearing or of speech production. In the first case the generated signal allows the quality of sound signals in general to be estimated; in the second case, the quality of speech signals in particular. The critical bands are considered in detail in the description of the analytical module.
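The following sketch illustrates the idea of band-limited noise whose intensity is modulated at a low frequency. It is not the generator described in this paper: the band limits, the modulation depth, the sinusoidal envelope and the approximation of filtered noise by a sum of random-phase sinusoids are all assumptions made to keep the example short.

```python
import math
import random


def modulated_band_noise(f_lo, f_hi, mod_hz, depth, rate=8000, seconds=1.0,
                         components=40, seed=0):
    """Band-limited noise (sum of random-phase tones) with a slow envelope."""
    rng = random.Random(seed)
    # Each tone has a random frequency inside the band and a random phase.
    tones = [(rng.uniform(f_lo, f_hi), rng.uniform(0.0, 2.0 * math.pi))
             for _ in range(components)]
    out = []
    for i in range(int(rate * seconds)):
        t = i / rate
        carrier = sum(math.sin(2.0 * math.pi * f * t + p) for f, p in tones)
        envelope = 1.0 + depth * math.cos(2.0 * math.pi * mod_hz * t)
        out.append(envelope * carrier / components)
    return out


# Hypothetical band 300..3400 Hz, modulated at 4 Hz with depth 0.8.
signal = modulated_band_noise(300.0, 3400.0, mod_hz=4.0, depth=0.8)
print(len(signal), max(abs(s) for s in signal))
```

Reducing `depth` towards zero makes the output approach stationary noise, which is the situation that the model associates with decreasing intelligibility.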

3.3. Statistical speech model

Language consists of sounds, and every individual generates a unique set of sounds. However, one can distinguish standard speakers (SS), who generate average kinds of sounds. Standard speakers are subdivided according to age, gender, region, social status, education, occupation, etc.

For every standard speaker one should determine the sound frequencies, the probabilities of sounds following each other, the intonation contours, the vocabularies and the physical properties of individual sounds. Based on these data one can simulate a natural speech flow.
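The use of the probabilities of sounds following each other can be illustrated with a first-order Markov chain. The three sound classes and all probabilities below are hypothetical; a real speech base would hold such a table for every standard speaker and for individual sounds rather than classes.

```python
import random

# Hypothetical probabilities of the next sound class given the current one.
TRANSITIONS = {
    "vowel":     {"vowel": 0.10, "consonant": 0.70, "pause": 0.20},
    "consonant": {"vowel": 0.80, "consonant": 0.15, "pause": 0.05},
    "pause":     {"vowel": 0.40, "consonant": 0.60, "pause": 0.00},
}


def generate_sounds(length, start="pause", seed=0):
    """Draw a sequence of sound classes from the transition table."""
    rng = random.Random(seed)
    state, sequence = start, []
    for _ in range(length):
        choices = TRANSITIONS[state]
        state = rng.choices(list(choices), weights=list(choices.values()))[0]
        sequence.append(state)
    return sequence


sequence = generate_sounds(20)
print(sequence)
```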

The system should also include statistical information about the population structure, which makes it possible to generate speech flows with the features that characterize the population of some region or of the whole country.

Broadly speaking, the statistical model (fig. 2) contains statistical data about the population structure, speech bases of standard speakers, speech signal processing facilities (synthesis algorithms), means of determining the parameters of speech sounds, and algorithms for generating distributions of sounds and standard speakers.

Fig.2. Extended structure of the statistical model

The interface block provides interaction with the outside world (or the user) and also synchronizes the functions of the other blocks of the statistical model.

The block of speaker choice generates a sample of standard speakers (or a sequence of indexes of standard speakers). Depending on the command, either a representative sample of standard speakers or a sample from a single standard speaker can be generated. The sample is representative in the sense that the distribution of speech parameters in it corresponds to the distribution of speech parameters in the population described by the model.

The sequence of indexes of standard speakers is saved in the block of standard speaker choice for later use.
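The two modes of the speaker choice block can be sketched as weighted random sampling. The speaker names and population shares are hypothetical, and the function signature is only an illustration of the "representative sample" and "single speaker" commands.

```python
import random
from collections import Counter

# Hypothetical shares of standard speakers in the modelled population.
POPULATION = {"adult_male": 0.40, "adult_female": 0.42, "child": 0.18}


def sample_speakers(count, single_speaker=None, seed=0):
    """Return `count` speaker names: one fixed speaker or a representative mix."""
    if single_speaker is not None:
        if single_speaker not in POPULATION:
            raise KeyError(single_speaker)
        return [single_speaker] * count
    rng = random.Random(seed)
    names = list(POPULATION)
    weights = [POPULATION[name] for name in names]
    return rng.choices(names, weights=weights, k=count)


sample = sample_speakers(10000)
shares = {name: n / len(sample) for name, n in Counter(sample).items()}
print(shares)
```

For a large sample the observed shares approach the shares stored in the model, which is the sense in which the sample is representative.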

The block of sound choice forms the prosodic description (the descriptions of sounds). Depending on the command, the prosodic description is constituted either for a representative sample of sounds, for a specified sequence of sounds, or for one specified sound.

The prosodic description is saved in the prosodic buffer for later use.

The block of speech flow transforms the descriptions of sounds into samples of the speech signal.

The block of descriptions of standard speakers stores the descriptions of standard speakers and, on request, returns the necessary parts of the descriptions, the number of speakers and the list of speakers.

3.4. Signals synchronizer

The synchronizer aligns the initial and processed signals in the time domain. The input of the synchronizer receives signal segments (pDATA) whose duration is equal to a VAD (Voice Activity Detection) frame; the VAD activity flag for each segment is specified in the pDATA segment.

Any sound signal can be separated into active and inactive phases. The former correspond to active sound processes, the latter to low-level background noise. The simplest way to separate these two phases is by signal energy level. However, this approach is not accurate enough. In our approach, the VAD algorithm from recommendation G.723 (where it is part of the vocoder) is used for this purpose.
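For comparison, the simple energy-based separation mentioned above can be sketched as follows. This is not the G.723 algorithm used in the system; the fixed threshold and the frame length of 240 samples (30 ms at 8 kHz) are assumptions.

```python
import math


def energy_vad(samples, frame_len=240, threshold=1e-3):
    """Mark a frame active when its mean squared amplitude exceeds threshold."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        flags.append(energy > threshold)
    return flags


# Hypothetical signal: 60 ms of silence, 60 ms of a 440 Hz tone, 60 ms of silence.
rate = 8000
silence = [0.0] * 480
tone = [0.5 * math.sin(2.0 * math.pi * 440.0 * n / rate) for n in range(480)]
flags = energy_vad(silence + tone + silence)
print(flags)  # [False, False, True, True, False, False]
```

A fixed threshold fails as soon as the background noise level changes, which is one reason why such a detector is not accurate enough in practice.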

After filtering, the state flags and signal frames enter the synchronizer blocks, which combine active signal fragments and pauses. The blocks use common data: a buffer of the active reference (etalon) signal (EBuffer1), a buffer of the active signal under test (TBuffer1), a buffer of the reference signal pause (EBuffer0), a buffer of the pause of the signal under test (TBuffer0), and readiness flags for the buffers of active signal and pauses (dReady[0..1]). There is also a counter of synchronization errors (dErrorCounter).

The output of the synchronizer is a pair of buffers with active signals or a pair of buffers with pauses. Either of the synchronizer blocks can trigger the output of a pair of synchronized buffers.

The synchronized buffers and the activity flag are the input of the analytical module.

3.5. Analytical module

The analytical module compares the matched pairs of active-phase fragments and of inactive-phase fragments separately, which gives a more accurate estimation.

The integral spectrum is determined for each fragment using the discrete cosine transform (DCT). The spectrum integration is calculated according to a proprietary formula.

In the spectrum calculation the windows overlap by N/2 samples, and the well-known Hamming or Blackman-Harris window function is applied to every window.
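The windowing and transform steps can be sketched as below. The proprietary spectrum integration is not reproduced; the sketch only shows frames of N samples taken with a hop of N/2, a Hamming window and an unnormalised DCT-II whose squared coefficients serve as energies. The window length of 64 and the test signal are assumptions.

```python
import math


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def dct2(frame):
    """Unnormalised DCT-II: X[k] = sum_i x[i] * cos(pi / N * (i + 0.5) * k)."""
    n = len(frame)
    return [sum(x * math.cos(math.pi / n * (i + 0.5) * k)
                for i, x in enumerate(frame))
            for k in range(n)]


def frame_spectra(samples, n=64):
    """Energy spectra of Hamming-windowed frames that overlap by n/2 samples."""
    window = hamming(n)
    hop = n // 2
    spectra = []
    for start in range(0, len(samples) - n + 1, hop):
        frame = [s * w for s, w in zip(samples[start:start + n], window)]
        spectra.append([c * c for c in dct2(frame)])
    return spectra


# Hypothetical test signal that coincides with DCT basis function number 8.
samples = [math.cos(math.pi / 64 * (i + 0.5) * 8) for i in range(256)]
spectra = frame_spectra(samples)
peak = max(range(64), key=lambda k: spectra[0][k])
print(len(spectra), peak)
```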

The spectral energy levels in the bands are determined for all sets of bands. Several groups of critical bands [2-6] are already known; they were determined by different authors on the basis of different models of sound perception and speech production.

Band boundaries (initial and terminal indexes) as well as band energy values are determined according to a set of proprietary formulas.

The initial quality estimation value is taken as 100%. It then decreases in proportion to the difference between the band energies. Quality estimation values are determined for every set of bands. The overall quality estimation over all bands is calculated according to proprietary formulas.
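Since the actual formulas are proprietary, the sketch below uses a stand-in rule purely to illustrate the principle: the estimate starts at 100% and each band subtracts its weight times the relative difference between the reference and test energies. The band boundaries, the relative-difference measure and the weights are all hypothetical.

```python
def band_energies(spectrum, bands):
    """Sum the spectral energy inside each (first_index, last_index) band."""
    return [sum(spectrum[lo:hi + 1]) for lo, hi in bands]


def quality_percent(ref_spectrum, test_spectrum, bands, weights=None):
    """Start from 100% and subtract the weighted relative band differences."""
    if weights is None:
        weights = [1.0 / len(bands)] * len(bands)  # equally probable bands
    ref = band_energies(ref_spectrum, bands)
    test = band_energies(test_spectrum, bands)
    quality = 100.0
    for w, e_ref, e_test in zip(weights, ref, test):
        biggest = max(e_ref, e_test)
        if biggest > 0.0:
            quality -= 100.0 * w * abs(e_ref - e_test) / biggest
    return max(quality, 0.0)


# Hypothetical spectra: the upper band of the test signal lost half its energy.
bands = [(0, 3), (4, 7)]
reference = [1.0] * 8
degraded = [1.0] * 4 + [0.5] * 4
print(quality_percent(reference, degraded, bands))  # 75.0
```

Passing unequal `weights` corresponds to taking band importance into account instead of treating all bands as equally probable.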

To determine sound (D) and word (W) intelligibility the following formulas may be used:

[Formula (4): not preserved], where

S = 0.8 D^2 + 0.2 D^4 (Pokrovskiy's known formula)

[Formula (5): not preserved]

To go from the quality loss coefficient to the sound intelligibility value, a corresponding table is used.

To determine values at intermediate points, interpolation (for example, the Lagrange interpolation polynomial) is used. Figure 3 shows the dependence S(dQ).

Fig.3. Dependence of the syllable intelligibility on the quality estimation value
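Interpolation between table entries can be sketched with the Lagrange polynomial as follows. The four table points are hypothetical placeholders, not the values behind Figure 3.

```python
def lagrange(points, x):
    """Evaluate the Lagrange polynomial through the (x_i, y_i) points at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# Hypothetical table: quality estimation dQ (%) -> syllable intelligibility S (%).
TABLE = [(40.0, 20.0), (60.0, 50.0), (80.0, 80.0), (100.0, 98.0)]
print(round(lagrange(TABLE, 70.0), 2))  # 65.75
```

With many table points a single high-degree polynomial can oscillate, so in practice it may be safer to interpolate over a few neighbouring points only.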

Quality estimations can be translated similarly into MOS estimation values.

4. IMPLEMENTATION & CONCLUSIONS

The algorithms described are implemented for voice quality estimation and for the comparison of external initial signals and signals under test.

Arbitrary signals recorded with a sampling frequency of 8 kHz and a sample size of 16 bits can be used as external signals. It is assumed that the signal under test is obtained from the initial signal as a result of some transformation (for example, compression/restoration, transmission through communication channels, filtering). In addition, a recording of a phonetically representative text (PhRT) read aloud by several speakers of different ages and both genders can be used as an external initial signal.

The internal initial signals (i.e. signals that the user of the program has no access to) are the signals generated according to the noise model (the generator is described above) and the signals generated on the basis of the statistical model.

The internal signals are fed into the sound data compression/restoration system, implemented, for example, as a DLL with a specified interface. The signal processed by the methods contained in the DLL is considered the signal under test and is subjected to the quality estimation procedure described earlier.

The presented method of sound signal quality estimation has a number of advantages over known methods of quality measurement, namely:

  • it is universal, since it allows judging the quality of signals from various sources and processed in different ways;

  • the quality estimation can be optimized depending on the purpose:

    • in speed (for example, it is possible to obtain a rough estimation quickly);

    • in signal type (using different bands for speech signals and sound signals in general);

  • the resulting estimations correlate well with MOS values;

  • quality estimations obtained for speech signals can be translated into values of various kinds of intelligibility.

Table 1 presents quality estimations of several standard voice codecs, obtained on various test signals using the suggested method and the implementation described. The table also contains MOS estimations for comparison.

Table 1. Sound quality estimation of vocoders

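Codec | MOS | Noise model: - | Noise model: Vc | Statistic model, minimal: - | Statistic model, minimal: Vc | Statistic model, reduced: - | Statistic model, reduced: Vc | Statistic model, complete: - | Statistic model, complete: Vc | PhRT: - | PhRT: Vc
A-Law | 4.10 | 4.79 | 4.73 | 4.78 | 4.78 | 4.78 | 4.78 | 4.79 | 4.80 | 4.80 | 4.84
Mu-Law | 4.10 | 4.79 | 4.84 | 4.77 | 4.77 | 4.77 | 4.78 | 4.78 | 4.79 | 4.79 | 4.82
G.723.6.3 | 3.90 | 4.25 | 4.48 | 4.21 | 4.29 | 4.22 | 4.33 | 4.15 | 4.04 | 4.08 | 3.95
GSM.6.10 | 3.70 | 3.20 | 1.99 | 3.01 | 1.65 | 3.04 | 1.78 | 4.22 | 3.66 | 4.01 | 3.21
G.723.5.3 | 3.65 | 4.23 | 4.44 | 4.18 | 4.27 | 4.19 | 4.32 | 4.14 | 4.04 | 4.06 | 3.93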

The columns marked «-» contain the estimations obtained under the assumption that all bands are equally probable, and the columns marked «Vc» contain the estimations obtained taking the importance coefficients into account.
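As a rough check of how the estimations track MOS, the Pearson correlation between the MOS column and one estimation column of Table 1 can be computed as below. The sketch assumes that the last value in each row is the PhRT estimation with importance coefficients (Vc). With only five codecs the result is indicative at best.

```python
import math

# Rows in table order: A-Law, Mu-Law, G.723.6.3, GSM.6.10, G.723.5.3.
MOS = [4.10, 4.10, 3.90, 3.70, 3.65]
PHRT_VC = [4.84, 4.82, 3.95, 3.21, 3.93]  # assumed to be the PhRT "Vc" column


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r = pearson(MOS, PHRT_VC)
print(round(r, 2))
```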

5. TRENDS OF DEVELOPMENT

Given the structure of the suggested quality estimation system for sound signals, the system can be developed in the following directions:

  • improvement of the test signal model. The noise model can be supplemented with a set of multiband modulated noise signals, the data and algorithms of the statistical speech model can be enriched, and the number of prepared test signals (such as PhRT recordings) can be enlarged;

  • development of more advanced synchronization algorithms based, for example, on the coincidence of maxima in the signal energy spectra;

  • modernization of the acoustic model to take into account masking effects and the fact that pure tones and band noise affect hearing somewhat differently;

  • modernization of the signal comparison scheme. The current distance measure is not accurate enough for strongly differing signals. For greater universality of the system it is desirable to use correlation analysis methods for comparison;

  • to solve a number of practical problems, the system needs to be able to work with multichannel (stereo, quadro, etc.) signals and to produce immediate quality estimations;

  • fully correct translation of the objective estimations into MOS values requires further experimental research.

REFERENCES

1. Aldoshina I., “Bases of psychoacoustics”, The sound producer, 2002, №5, 8

2. Sekunov N., “Processing of a sound on PC”, bhv, Saint-Petersburg, 2001

3. Sapozhkov M.A., “Speech signal in cybernetics and communications”, Svyazizdat, Moscow, 1963

4. Pokrovskiy N.B., “Calculation and measurement of speech legibility”, Svyazizdat, Moscow, 1962

5. Sorokin V.N., “Speech synthesis”, Nauka, Moscow, 1992
