
S. Singh et al. (Eds.): ICAPR 2005, LNCS 3686, pp. 522 – 528, 2005. © Springer-Verlag Berlin Heidelberg 2005

Transformations of LPC and LSF Parameters to Speech Recognition Features

Vladimir Fabregas Surigué de Alencar and Abraham Alcaim

Pontifícia Universidade Católica do Rio de Janeiro – PUC-RIO, Centro de Estudos em Telecomunicações – CETUC,

Rua Marquês de São Vicente, 225, 22453-900, Rio de Janeiro / RJ, Brazil

Vladimir, [email protected]

Abstract. In this paper, we describe and present an overall evaluation of several features for distributed speech recognition systems. These systems are based on a client-server architecture. This means that recognizers access only the coded parameters of the speech coder employed in communication networks (e.g., cellular mobile and IP networks). The recognition features considered in this paper are obtained from transformations of codec parameters. In particular, features generated from LPC and LSF parameters, in intervals of 10 ms and 20 ms, are analyzed in a continuous observation HMM-based speaker-independent recognizer.

1 Introduction

The growth of the Internet and mobile communication systems has stimulated a great effort to realize speech processing applications in these networks. A particularly important problem is Automatic Speech Recognition (ASR) in a server system, based on the acoustic parameters extracted and quantized at the user terminal. Such systems, usually known as Distributed Speech Recognition (DSR), are very attractive due to the complexity and large memory requirements of ASR systems.

Speech coding schemes used in mobile communication systems and IP networks operate at low bit rates and utilize, in general, LPC (Linear Predictive Coding) algorithms based on a speech production model. In this model, an excitation signal is applied to an all-pole filter (characterized by the LPC parameters) that represents the spectral envelope information of the speech signal. Usually, the LPC parameters are transformed to LSF (Line Spectral Frequencies), due to attractive properties of the latter for quantization and interpolation. Speech coders employed in cellular and IP networks use these parameters to characterize the speech spectral envelope.

In distributed ASR systems, it is preferable to use the codec parameters directly rather than to extract them from the decoded signal [1]. Since these parameters are not the most adequate ones for the remote recognition system, it is important to consider and examine different codec parameter transformations in order to improve the recognition performance. The main contribution of this paper is to provide a global analysis of different speech features reported in the literature, aiming at improving the performance of DSR


systems. Moreover, the results are presented at two frame rates: 100 Hz (typical of speech recognizers) and 50 Hz (usually employed by speech codecs).

Features obtained from the LPC and LSF parameters are described in Sections 2 and 3, respectively. Experimental results are presented and analyzed in Section 4. Finally, conclusions are summarized in Section 5.

2 Recognition Features Obtained from Transformations of LPC Parameters

This section deals with the recognition features that can be extracted directly from the LPC parameters, without the need to reconstruct the speech signal. This approach is attractive for DSR due to the speech decoding structures used in mobile and VoIP (Voice over IP) systems. In these structures, LPC parameters are obtained in a stage prior to speech reconstruction. This means that speech features extracted in this stage are computationally more attractive. Moreover, as we have previously mentioned, the use of codec parameters is more efficient for speech recognition than generating features from reconstructed speech.

Recognition features that can be obtained from the LPC parameters are the LPCC (LPC Cepstrum) and the MLPCC (Mel-Frequency LPCC) [2]. The LPCC are computed from the LPC parameters by means of a recursive equation, and the MLPCC are derived from a first-order all-pass filtering operation.

2.1 LPC Cepstrum (LPCC)

The extraction process of the LPCC features from the LPC coefficients is formulated in the z-transform domain, using the complex logarithm of the LPC system transfer function, which is analogous to the cepstrum computation from the discrete Fourier transform of the speech signal [2]. The i-th LPCC parameter is given by the following recursive equation:

$$
c_i =
\begin{cases}
\ln(G), & i = 0 \\[4pt]
a_i + \sum_{j=1}^{i-1} \dfrac{j}{i}\, c_j\, a_{i-j}, & 0 < i \le p \\[4pt]
\sum_{j=i-p}^{i-1} \dfrac{j}{i}\, c_j\, a_{i-j}, & i > p
\end{cases}
\qquad (1)
$$

where $a_i$ is the i-th LPC parameter, $p$ is the LPC system order and $G$ is the gain factor of the system.
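
As an illustration, the recursion in (1) maps directly to a short routine. The following is a minimal Python sketch (the function name and array conventions are ours, not from the paper), assuming the LPC parameters follow the sign convention used in (1):

```python
import numpy as np

def lpcc(a, G, n_coef):
    """LPC cepstrum via the recursion in Eq. (1).

    a      : LPC parameters a_1..a_p (sign convention of Eq. (1))
    G      : gain factor of the LPC system
    n_coef : number of cepstral coefficients c_1..c_n to return
    """
    p = len(a)
    c = np.zeros(n_coef + 1)
    c[0] = np.log(G)                                    # c_0 = ln(G)
    for i in range(1, n_coef + 1):
        # sum of (j/i) * c_j * a_{i-j}; a is 0-indexed, so a_{i-j} is a[i-j-1]
        acc = sum(j * c[j] * a[i - j - 1] for j in range(max(1, i - p), i))
        c[i] = acc / i + (a[i - 1] if i <= p else 0.0)  # extra a_i term only for i <= p
    return c[1:]                                        # c_1..c_n

# e.g., ten LPCC features from a 10th-order LPC model: lpcc(a10, 1.0, 10)
```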

2.2 Mel-Frequency LPCC (MLPCC)

The MLPCC feature is obtained by transforming the real frequency axis of the LPCC to the mel frequency scale. This is performed by a bank of n first-order all-pass filters, where n is the number of LPCC features [3]. The filters have their first-order all-pass transfer function $\psi(z)$ [4] given by


$$
\psi(z) = \frac{z^{-1} - a^*}{1 - a\,z^{-1}} \qquad (2)
$$

where $a$ is the all-pass filter coefficient and $a^*$ is the complex conjugate of $a$. Each LPCC parameter, $c_i$, is processed by a different filter.

Since the purpose of each filtering operation is to approximate the mel frequency scale, it is important to analyze the relationship between the transfer function given by (2) and the transformation of the frequency axis. In order to simplify the filter implementation, let $a$ be a real number [5]. Now rewrite $\psi$ as a function of $e^{j\Omega}$:

$$
\psi\!\left(e^{j\Omega}\right) = e^{-j\theta(\Omega)} \qquad (3)
$$

where Ω is the real frequency. From (2) and (3), we can derive the mel scale frequency as a function of the real frequency Ω :

$$
\theta(\Omega) = \arctan\!\left[\frac{(1 - a^2)\sin\Omega}{(1 + a^2)\cos\Omega - 2a}\right] \qquad (4)
$$

By changing the value of $a$, it is possible to adjust $\theta(\Omega)$ to the mel scale curve. At an 8 kHz sampling frequency, the value of $a$ that best approximates the mel scale curve is 0.3624 [5].

The outputs of the filter bank are the MLPCC features.
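
To make the approximation concrete, the sketch below (our own numerical check, not from the paper) evaluates $\theta(\Omega)$ from (4) with a = 0.3624 and compares it against the conventional mel curve 2595 log10(1 + f/700) — an assumption on our part, used only as the reference — with both curves normalized to [0, 1] over the 0–4 kHz band:

```python
import numpy as np

def theta(omega, a=0.3624):
    # Phase warping from Eq. (4); arctan2 keeps theta continuous on [0, pi]
    return np.arctan2((1 - a**2) * np.sin(omega),
                      (1 + a**2) * np.cos(omega) - 2 * a)

def mel(f_hz):
    # Conventional mel scale, used here only as the reference curve
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

fs = 8000.0                              # 8 kHz sampling frequency
f = np.linspace(0.0, fs / 2, 512)
omega = 2 * np.pi * f / fs               # real frequency axis, 0..pi
warp_norm = theta(omega) / np.pi         # theta(pi) = pi, so this spans [0, 1]
mel_norm = mel(f) / mel(fs / 2)
print("max normalized deviation:", np.abs(warp_norm - mel_norm).max())
```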

3 Recognition Features Obtained from Transformations of LSF Parameters

The Line Spectral Frequencies (LSFs) are often used for speech coding due to their high coding efficiency and their attractive interpolation properties [6].

Extracting recognition features from the LSFs avoids a speech decoding operation, as well as a conversion of LSF to LPC. A distributed speech recognition system that adopts this strategy becomes computationally more efficient than any other one based on speech reconstruction or LPC parameters. The recognition features which can be obtained from LSFs are the PCC (Pseudo-Cepstral Coefficients) [7], MPCC (Mel-Frequency PCC) [7], PCEP (Pseudo-Cepstrum) [1] and the MPCEP (Mel-Frequency PCEP) [1].

It is worth mentioning that these features, which are directly obtained from the LSFs, correspond to approximations of the LPCC and MLPCC features. Using these approximations, we avoid recovering the LPC parameters in order to obtain the recognition features.

3.1 Pseudo-Cepstral Coefficients (PCC)

The PCC is computed directly from the LSFs. However, its derivation is based on the LPCC. Mathematical manipulations and approximations allow it to be expressed in terms of the LSFs [7]. The n-th PCC is given by the equation


$$
c_n = \frac{1 + (-1)^n}{2n} + \frac{1}{n}\sum_{i=1}^{p}\cos(n w_i) \qquad (5)
$$

where $w_i$ is the i-th LSF parameter.

3.2 Pseudo-Cepstrum (PCEP)

Using the mathematical expression of the PCC features, it is straightforward to obtain the PCEP [1]. They are derived from the PCC by eliminating the $\frac{1 + (-1)^n}{2n}$ term. Note that this term does not depend on the speech signal, i.e., it does not depend on the LSF parameters. The n-th PCEP is given by

$$
\hat{d}_n = \frac{1}{n}\sum_{i=1}^{p}\cos(n w_i) \qquad (6)
$$

It is fair to expect good spectral performance from the PCEP, because they provide a spectral envelope very similar to the one provided by the cepstrum, which is generated from the original speech signal [1]. The PCEP features have the advantage of an even lower computational load than the PCC.
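
For concreteness, both (5) and (6) reduce to a few lines of code. The sketch below (our own naming; LSFs assumed in radians) makes the shared structure explicit, with the PCEP simply dropping the signal-independent term:

```python
import numpy as np

def pcc(w, n_coef):
    """Pseudo-cepstral coefficients, Eq. (5), from LSFs w_1..w_p in radians."""
    w = np.asarray(w)
    n = np.arange(1, n_coef + 1)
    cos_sum = np.cos(np.outer(n, w)).sum(axis=1)   # sum_i cos(n w_i), one value per n
    return (1 + (-1.0) ** n) / (2 * n) + cos_sum / n

def pcep(w, n_coef):
    """Pseudo-cepstrum, Eq. (6): the PCC without the signal-independent term."""
    w = np.asarray(w)
    n = np.arange(1, n_coef + 1)
    return np.cos(np.outer(n, w)).sum(axis=1) / n
```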

3.3 Mel-Frequency PCC (MPCC)

To obtain the MPCC features from the PCC, the LSFs $w_i$ are replaced by $w_i^m$, which are defined by the transformation

$$
w_i^m = w_i + 2\arctan\!\left[\frac{0.45\,\sin w_i}{1 - 0.45\,\cos w_i}\right] \qquad (7)
$$

This expression transforms the frequency axis of a particular set of parameters to the mel scale frequency axis [8]. The MPCC features are expressed by

$$
\hat{c}_n^m = \frac{1 + (-1)^n}{2n} + \frac{1}{n}\sum_{i=1}^{p}\cos(n w_i^m) \qquad (8)
$$

where $\hat{c}_n^m$ is the n-th MPCC.

3.4 Mel-Frequency PCEP (MPCEP)

Following the same procedure described for the MPCC, we can express the MPCEP features by

$$
\hat{d}_n^m = \frac{1}{n}\sum_{i=1}^{p}\cos(n w_i^m) \qquad (9)
$$

where $\hat{d}_n^m$ is the n-th MPCEP.
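
The mel-frequency variants only require warping the LSFs through (7) before applying (5) or (6). A self-contained sketch under the same assumptions as before (our function names, LSFs in radians):

```python
import numpy as np

def mel_warp_lsf(w):
    """Warp LSFs (in radians) to the mel-like frequency axis via Eq. (7)."""
    w = np.asarray(w)
    return w + 2 * np.arctan(0.45 * np.sin(w) / (1 - 0.45 * np.cos(w)))

def mpcc(w, n_coef):
    """MPCC, Eq. (8): the PCC formula evaluated on mel-warped LSFs."""
    wm = mel_warp_lsf(w)
    n = np.arange(1, n_coef + 1)
    cos_sum = np.cos(np.outer(n, wm)).sum(axis=1)
    return (1 + (-1.0) ** n) / (2 * n) + cos_sum / n

def mpcep(w, n_coef):
    """MPCEP, Eq. (9): the PCEP formula evaluated on mel-warped LSFs."""
    wm = mel_warp_lsf(w)
    n = np.arange(1, n_coef + 1)
    return np.cos(np.outer(n, wm)).sum(axis=1) / n
```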


4 Experimental Results

The goal of the experiments carried out in this work is to determine which speech recognition features represent a good trade-off between recognition performance and computational load. Of course, the analysis is performed having in mind that they will be used in distributed speech recognition systems. Figure 1 illustrates the feature extractors and systems investigated in this section. It should be remarked that quantization effects are not taken into account in this work. We focus on a global comparative analysis of the features at two different frame rates.

Fig. 1. Feature extractors and ASR systems

According to Fig. 1, the following feature extractors will be examined:

• Feature Extractor (1) – provides MFCC (Mel-Frequency Cepstrum Coefficients) features [9]-[10] from the original speech signal in 10 ms and in 20 ms frame intervals

• Feature Extractor (2) – provides the PCC, PCEP, MPCC and MPCEP features from the LSFs in 10 ms and in 20 ms frame intervals

• Feature Extractor (3) – provides the LPCC and MLPCC features from the LPC parameters in 10 ms and in 20 ms frame intervals

It is worth remarking that the MFCC is generated from the original speech signal. It is considered here in order to provide a performance benchmark for the other features. It is also worth noting that the MFCC is usually employed in speech recognition systems that do not operate over communication networks. This feature cannot be used in communication networks where no additional information is transmitted to the remote ASR system besides what is sent by the encoder.

In all experiments, the feature extractors generate one set of 10 parameters plus their derivatives (∆ parameters), representing a total of 20 recognition features.
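
The paper does not spell out how the ∆ parameters are computed; a common choice, consistent with the HTK toolkit used in this section [9], is the regression formula sketched below (the half-window size of 2 is our assumption, matching HTK's default):

```python
import numpy as np

def delta(feats, win=2):
    """Regression-based delta coefficients (HTK-style) for a (T, D) feature matrix."""
    T = len(feats)
    padded = np.pad(feats, ((win, win), (0, 0)), mode="edge")  # replicate edge frames
    denom = 2.0 * sum(th * th for th in range(1, win + 1))
    d = np.zeros_like(feats, dtype=float)
    for th in range(1, win + 1):
        d += th * (padded[win + th: win + th + T] - padded[win - th: win - th + T])
    return d / denom

# 10 static parameters plus their deltas give the 20 features per frame:
# feats20 = np.hstack([feats10, delta(feats10)])
```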

In the simulations carried out in this work, the speech frames have 25 ms duration and the frame rate is either 100 Hz or 50 Hz, depending on the desired rate of the LPC or LSF extractions.

The 100 Hz frame rate was chosen because this is the usual value employed by speech recognizers to provide good performance. The 50 Hz frame rate was chosen because this value is usual in voice coders operating in IP networks and mobile environments.

The ASR system considered in our experiments is a speaker-independent, isolated-word recognizer. The speech database is composed of 50 male speakers and 50 female speakers, each one repeating three times the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and the word "meia" in Portuguese. This represents a total of 3,300 words. A distribution of 70% and 30% of the speech database was used for training and testing, respectively.

The recognition systems use five-state continuous observation HMMs (Hidden Markov Models) with a mixture of three Gaussians per state. They were implemented with the HTK (HMM Toolkit) software [9].

Table 1 shows the recognition performance results when the features are extracted every 10 ms and every 20 ms, corresponding to 100 Hz and 50 Hz frame rates, respectively. It can be seen that the 20 ms feature generation yields a much lower performance (around 5%) when compared to the 10 ms feature extraction. We can also verify that the mel scale features (MLPCC, MPCEP and MPCC) always provide better performance than the real frequency features (LPCC, PCEP and PCC). This difference is about 3%. Moreover, it can be observed that the speech recognition features for distributed environments (MLPCC, MPCEP and MPCC) show fairly good results when compared to the MFCC, obtained from the original speech signal. The difference in recognition rate is around 1%.

Table 1. Recognition performance

Frame Rate   LPCC     PCC      PCEP     MLPCC    MPCC     MPCEP    MFCC
100 Hz       95.80%   94.60%   95.00%   98.30%   97.50%   98.20%   99.40%
50 Hz        90.80%   90.20%   90.40%   93.80%   93.10%   93.70%   95.00%

It is important to remember that the MPCEP and MPCC features are obtained at the first stage of the decoder, directly from the LSFs. On the other hand, the MLPCC features can only be generated at the second stage of the decoder, i.e., after the LSF-to-LPC conversion. These characteristics make the MPCEP and the MPCC computationally more efficient than the MLPCC. This is particularly interesting for systems that provide recognition services and do not intend to simultaneously reconstruct the speech signal. It can also be observed from Table 1 that the maximum performance loss of the MPCC and the MPCEP, compared to the MLPCC, is 0.7%, at a frame rate of 50 Hz.

An interesting conclusion that can also be drawn from Table 1 is that the MPCEP features always outperform the MPCC, besides being simpler to compute.

Finally, comparing the MPCEP and MLPCC performances in Table 1, it can be seen that the difference in recognition rate is only 0.1% at both frame rates. This result is of major interest if we also take computational complexity into account: the MPCEP is an approximation to the MLPCC and provides great computational savings over that feature.


5 Conclusions

We have analyzed the impact of various speech features on the performance of speech recognizers. The features were obtained from transformations of the LSF and LPC parameters. The results presented in this paper can be useful for distributed speech recognition systems operating in mobile and IP communication networks. We have concluded that the MPCEP feature, obtained from the LSFs, presents the best trade-off between recognition accuracy and computational load. Comparing the recognition performances for features extracted at 100 Hz and 50 Hz frame rates, we have observed a degradation of approximately 4% of the latter relative to the former. Note that the 50 Hz frame rate is the usual condition in speech codecs. It is clear, therefore, that additional processing techniques, such as parameter interpolation, have to be applied in order to achieve results closer to the ones obtained at the 100 Hz frame rate.

References

1. H. S. Choi, H. K. Kim, and H. S. Lee, "Speech Recognition Using Quantized LSP Parameters and their Transformations in Digital Communication," Speech Communication, vol. 30, pp. 223-233, (2000)

2. Y. Ohshima, "Environmental Robustness in Speech Recognition using Physiologically-Motivated Signal Processing," Ph.D. Thesis, Carnegie Mellon University, Pittsburgh, Pennsylvania, December (1993)

3. A. V. Oppenheim and D. H. Johnson, "Discrete Representation of Signals," Proc. IEEE, vol. 60, pp. 681-691, June (1972)

4. S. K. Mitra, Digital Signal Processing: A Computer-Based Approach, McGraw-Hill International Editions, (1998)

5. M. Wölfel, J. McDonough, and A. Waibel, "Minimum Variance Distortionless Response on a Warped Frequency Scale," Proc. Eurospeech, Geneva, (2003)

6. W. B. Kleijn and K. K. Paliwal, Speech Coding and Synthesis, Amsterdam, The Netherlands: Elsevier, (1995)

7. H. K. Kim, S. H. Choi, and H. S. Lee, "On Approximating Line Spectral Frequencies to LPC Cepstral Coefficients," IEEE Trans. Speech and Audio Processing, vol. 8, pp. 195-199, March (2000)

8. F. S. Gurgen, S. Sagayama, and S. Furui, "Line Spectrum Frequency-Based Distance Measures for Speech Recognition," Proc. ICSLP, Kobe, Japan, pp. 521-524, November (1990)

9. S. Young, G. Evermann, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book (for HTK Version 3.2.1), December (2002)

10. S. B. Davis and P. Mermelstein, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences," IEEE Trans. ASSP, vol. 28, pp. 357-366, August (1980)