FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO
Hierarchical Dynamical Systems
Pedro Manuel Nunes Sequeira
PREPARAÇÃO DA DISSERTAÇÃO
Advisor: Jaime dos Santos Cardoso
Co-Advisor: José Carlos Príncipe
February 18, 2015
Contents
1 Introduction
2 Literature Review
  2.1 Speech Perception
    2.1.1 Intensity perception (Loudness)
    2.1.2 Frequency perception (Pitch)
    2.1.3 Masking
    2.1.4 Phonemes and allophones
  2.2 Algorithms
    2.2.1 Hidden Markov Models
3 Hierarchical Dynamical Systems
  3.1 Hierarchical architectures
  3.2 Hierarchical Linear Dynamical Systems
4 Work Plan
  4.1 Calendarization
  4.2 Tools and Resources
  4.3 Work Done
References
List of Figures
2.1 Normal equal-loudness-level contours
2.2 Pitch perception with frequency
2.3 Impulse response of cochlear filters (gammatone)
2.4 Example of a 3-state HMM (from Makhoul and Schwartz (1995))
2.5 Phonetic HMM (from Makhoul and Schwartz (1995))
3.1 DBN architecture
3.2 RBM architecture
3.3 DBN/DNN architecture
3.4 DNN-HMM architecture
3.5 Long Short-Term Memory Cell
3.6 Bidirectional Recurrent Neural Network
3.7 Deep Bidirectional Long Short-Term Memory
3.8 Hierarchical Dynamical System
4.1 Gantt Chart
Chapter 1
Introduction
This document is the final report of the course Preparação da Dissertação of the Master’s degree
in Electrical and Computer Engineering at FEUP.
The thesis consists of the study of the HLDS (Hierarchical Linear Dynamical Systems)
algorithm, described in (Cinar and Principe, 2014; Cinar et al., 2014), which showed promising
results in pitch estimation of isolated notes.
The first stage of the dissertation is the replication of the results presented in the papers. In the
second stage, an adaptation of the algorithm to speech recognition will be attempted, more
specifically to phonetic classification and recognition.
The main motivation for choosing a hierarchical model is the fact that the auditory cortex has a
layered hierarchical structure (Read et al., 2002). This approach is useful for modeling sequences
on several time scales.
In (Liao, 2005), the state of the art in time series clustering is shown to consist mostly of
modifications of existing static-data methods to work on time series. However, (Cinar and
Principe, 2014; Cinar et al., 2014) argue that those methods do not take full advantage of the
temporal information present in the data, ignoring the time dependency between the features.
For this reason, dynamical systems are used, since they intrinsically model temporal structure.
Chapter 2 explains the main characteristics of speech and the difficulties that its recognition
entails, and presents the state-of-the-art algorithms. Chapter 3 describes other hierarchical
models and the algorithm of interest. Finally, Chapter 4 presents the planned time schedule for
the semester.
Chapter 2
Literature Review
2.1 Speech Perception
To be able to adapt an algorithm to speech recognition, it is necessary to have some notion of
the way the human perceptual system reacts to sound waves.
It is important to realize that the relationship between the human auditory perception of sound
and the associated physical quantities is neither simple nor linear (Huang et al., 2001).
The human ear has three sections: the outer ear, the middle ear and the inner ear. The relevant
structure of the inner ear for sound perception is the cochlea, which behaves as a filter bank.
2.1.1 Intensity perception (Loudness)
Sounds with greater intensity generally sound louder. However, the sensitivity of the ear varies
with the frequency and quality of the sound. Figure 2.1 shows the graph of equal-loudness
contours adopted by the ISO British Standard (2003), which describes this non-linearity in
detail.
Figure 2.1: Normal equal-loudness-level contours
2.1.2 Frequency perception (Pitch)
Since the cochlea behaves as a spectrum analyzer, some effort has been made to model its behavior.
As referred to in Fletcher (1940), the cochlea behaves as a bank of overlapping auditory filters,
whose bandwidths are called critical bandwidths.
Western musical pitch is described in octaves and semitones, which form a logarithmic frequency
scale. Scales based on the human perceptual system, however, are roughly logarithmic at high
frequencies and linear at low frequencies. Two of these scales, the Mel scale and the Bark scale,
are expressed by equations 2.2 and 2.1. They are normalized and plotted together in Figure 2.2.
Bark scale: b(f) = 13 arctan(0.00076 f) + 3.5 arctan((f / 7500)^2)    (2.1)

Mel scale: B(f) = 1127 ln(1 + f / 700)    (2.2)
Figure 2.2: Pitch perception with frequency
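As an illustration, both scales can be evaluated directly from equations 2.1 and 2.2. The sketch below is in Python (NumPy) rather than the Matlab used in this thesis; the constant 7500 in the Bark formula is Zwicker's standard approximation.

```python
import numpy as np

def mel_scale(f):
    """Mel scale (Eq. 2.2): roughly linear below ~1 kHz, logarithmic above."""
    return 1127.0 * np.log(1.0 + f / 700.0)

def bark_scale(f):
    """Bark scale (Eq. 2.1, Zwicker's approximation)."""
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

f = np.linspace(0.0, 8000.0, 5)
print(mel_scale(f))
print(bark_scale(f))
```

Note that the Mel scale maps 1000 Hz to roughly 1000 mel, which is how the scale is anchored.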
2.1.3 Masking
Frequency masking is a phenomenon whereby one sound cannot be perceived if another sound
close in frequency has a high enough level: the louder sound masks the other one (Huang et al.,
2001). If two sounds played at the same time have frequencies close enough, they will be inter-
preted as a combination tone instead of two separate sounds. This happens due to the filter bank
associated with the cochlea, which splits the signal into different frequency components that are
coded independently on the auditory nerve, which transmits the information to the brain.
One model of auditory filters widely used for speech recognition is the Gammatone Filter,
which is described in (Lyon et al., 2010; Qi et al., 2013) and is employed in the HLDS. Its impulse
response is given by equation 2.3 and represented in Figure 2.3.
Gammatone: g(t) = a t^(n-1) e^(-2π b t) cos(2π f t + φ)    (2.3)
Figure 2.3: Impulse response of cochlear filters (gammatone)
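Equation 2.3 can be evaluated directly to reproduce an impulse response like the one in Figure 2.3. The Python sketch below uses the filter order n = 4 commonly chosen for cochlear modeling; the bandwidth b, amplitude a and phase φ are illustrative placeholders (in practice b is tied to the equivalent rectangular bandwidth of the centre frequency).

```python
import numpy as np

def gammatone_ir(f_c, fs=16000, n=4, b=125.0, a=1.0, phi=0.0, dur=0.05):
    """Impulse response of a gammatone filter (Eq. 2.3):
    g(t) = a * t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*f_c*t + phi)."""
    t = np.arange(int(dur * fs)) / fs          # time axis in seconds
    return a * t ** (n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * f_c * t + phi)

g = gammatone_ir(1000.0)                       # 1 kHz centre frequency
```

The gamma-distribution envelope t^(n-1) e^(-2πbt) rises from zero, peaks, and decays, while the cosine carrier oscillates at the centre frequency, matching the shape in Figure 2.3.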
2.1.4 Phonemes and allophones
In speech science, the term phoneme is used to denote any of the minimal units of speech sound
in a language that can serve to distinguish one word from another (Huang et al., 2001). It is an ab-
straction and does not define a realization, as it is a context-independent and speaker-independent
concept. Their phonetic realizations are called allophones. The International Phonetic Alphabet
is defined in the international standard (ipa, 2005).
2.2 Algorithms
2.2.1 Hidden Markov Models
The Hidden Markov Model used to be the state of the art in Speech Recognition systems (Makhoul
and Schwartz, 1995). It can be viewed as a state machine whose transitions are defined by
probabilities. Also, unlike non-hidden Markov models, a state does not correspond to an output
symbol, but defines a probability distribution over output symbols. This model is illustrated in
Figure 2.4, where the a_ij are the transition probabilities and the b's are the output probabilities.
Figure 2.4: Example of a 3-state HMM (from Makhoul and Schwartz (1995))
The procedure of modeling phonetic speech events using HMMs as a generative model is
described by Makhoul and Schwartz (1995) in Figure 2.5.
Figure 2.5: Phonetic HMM (from Makhoul and Schwartz (1995))
We start by noting that the structure of the model only allows transitions in one direction,
representing the flow of time; this is known as a "left-to-right" model. Transitions from one
state to itself serve to model different phoneme durations.
The reason for the need of three states is the coarticulation effect: the acoustic realization of a
phoneme is affected by the preceding and following phonemes, especially by the two neighboring
ones.
The codebook of spectral templates represents the space of possible speech spectra. They
serve as output symbols of the HMM. From the moment we enter state 1 until we leave state 3,
a sequence of symbols is generated; this sequence corresponds to a single phoneme.
The recognition process uses the same model. There is one HMM for each phonetic context of
interest. Usually the same structure is employed for every HMM, only differing in the transition
probabilities. For a given input speech spectrum that has been quantized to one of the templates,
we find the probability that the template was the output generated by the model. If we consider
that the state sequence followed the path 1 → 2, we multiply the current path probability by the
corresponding transition probability and, using the new frame of speech, by the probability that
it was the output of state 2. We continue this process until the model is exited. This procedure is
done for every phoneme model and all possible state paths. In the end, the result with the highest
probability is considered the recognized sequence of phonemes.
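The search over state paths described above is, in essence, the Viterbi algorithm. A minimal Python sketch on a toy discrete-output left-to-right model follows; the transition and output probabilities are invented for illustration only.

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Most likely state path for a discrete-output HMM.
    A[i, j]: transition prob i->j; B[i, k]: prob of symbol k in state i;
    pi[i]: initial state prob; obs: sequence of symbol indices."""
    n_states, T = A.shape[0], len(obs)
    delta = np.zeros((T, n_states))           # best path probability ending in each state
    psi = np.zeros((T, n_states), dtype=int)  # back-pointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for j in range(n_states):
            scores = delta[t - 1] * A[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] * B[j, obs[t]]
    # backtrack from the most probable final state
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(np.max(delta[-1]))

# Toy 3-state "left-to-right" phoneme model with 2 output symbols
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0]])
B = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.1, 0.9]])
pi = np.array([1.0, 0.0, 0.0])
path, p = viterbi(A, B, pi, [0, 0, 1, 1])
print(path)  # → [0, 1, 2, 2]
```

In practice the search is run over all phoneme models and the probabilities are kept in the log domain to avoid numerical underflow on long utterances.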
Chapter 3
Hierarchical Dynamical Systems
3.1 Hierarchical architectures
Deep learning, also known as hierarchical learning, is a class of machine learning techniques in
which information processing is done in many stages. These models are composed of many layers
of nonlinear processing in which each lower layer's outputs are the inputs to the layer above
(Deng, 2012).
It has become increasingly popular since the development of new training algorithms and
the increase in hardware capabilities (GPUs). These algorithms have shown success in many
applications, such as: audio processing, speech recognition, hand-writing recognition, computer
vision, object recognition and information retrieval.
Most deep learning architectures can be described as either generative, discriminative or hybrid.
• Generative deep architecture — a model that characterizes the joint probability distri-
bution of the observed data and the corresponding classes. Since it models a probability
distribution, it can be used to generate synthetic data in the input space. Furthermore, using
Bayes' theorem, one can transform this model into a discriminative model.
• Discriminative deep architecture — used for class assignment. This is often done by
characterizing the a posteriori class probabilities conditioned on the input data.
• Hybrid deep architecture — a model whose goal is class assignment but uses the outcomes
of generative models, or a model where discriminative criteria are used to learn parameters
of a generative model.
Deep learning originated in attempts to increase the number of layers in feed-forward neural
networks, or multi-layer perceptrons (MLPs). This did not work, since the learning algorithms
of the time (back-propagation) would get trapped in poor local optima.
This difficulty in training deep models was eased by the research of Hinton et al. (2006),
which introduced the Deep Belief Network (DBN). As illustrated in Figure 3.1, this is a multi-
layered probabilistic generative model whose two highest layers have symmetric connections and
whose lower layers have top-down connections with the layer above. The hidden layers consist
of Restricted Boltzmann Machines (RBMs): networks of symmetrically connected neuron-like
units which form a bipartite graph with respect to the visible and hidden units, see Figure 3.2.
Figure 3.1: DBN architecture Figure 3.2: RBM architecture
The learning is done in a greedy, layer-by-layer fashion. This algorithm allows a much better
initialization of the deep neural network model, see Figure 3.3, and has been shown to be
effective in speech recognition (Hinton et al., 2012). It is shown that this method achieves
maximum likelihood learning. Since the learning is unsupervised, when classification is desired,
a final layer of variables (corresponding to the labels) is added.
Figure 3.3: DBN/DNN architecture
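The greedy procedure trains one RBM at a time. A minimal Python sketch of its building block, a single CD-1 (contrastive divergence) update for a binary RBM, is shown below; CD-1 is the usual practical approximation to the maximum-likelihood gradient, and the layer sizes and data here are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_v, b_h, lr=0.1):
    """One contrastive-divergence (CD-1) update for a binary RBM.
    v0: batch of visible vectors, shape (batch, n_visible)."""
    # positive phase: sample hidden units given the data
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # negative phase: one step of Gibbs sampling back to the visible layer
    p_v1 = sigmoid(h0 @ W.T + b_v)
    p_h1 = sigmoid(p_v1 @ W + b_h)
    # update parameters from the difference of data and model correlations
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(v0)
    b_v += lr * (v0 - p_v1).mean(axis=0)
    b_h += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_v, b_h

n_v, n_h = 6, 3
W = 0.01 * rng.standard_normal((n_v, n_h))
b_v, b_h = np.zeros(n_v), np.zeros(n_h)
data = (rng.random((20, n_v)) < 0.5).astype(float)
for _ in range(10):
    W, b_v, b_h = cd1_step(data, W, b_v, b_h)
```

Stacking is then done greedily: the hidden activations of a trained RBM are fed as "data" to the RBM of the layer above.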
Another interesting deep model described in (Deng, 2012) is an interface between the previ-
ously referred DBN-DNN and the HMM. This overcomes the limitation of the input vectors being
restricted to having a fixed dimensionality, which might be relevant in applications such as speech
recognition and video processing that require sequence recognition. The HMM is a convenient
tool for enabling what was a static classifier to handle dynamic or sequential patterns.
This architecture, represented in Figure 3.4 has been successfully used in speech recognition
in (Dahl et al., 2012).
Figure 3.4: DNN-HMM architecture
Recurrent neural networks (RNNs) have a larger state space and richer dynamics than HMMs,
making them powerful in modeling sequential data like speech, which is an intrinsically dynamic
process. The depth in time of the RNN is given by the model's structure, which makes its hidden
state a function of all previous hidden states, as can be observed in equations 3.1. The
non-linearity H usually represents an elementwise sigmoid function. There are some noticeable
similarities between this model and the state space model of the HLDS.
h_t = H(W_xh x_t + W_hh h_{t-1} + b_h)
y_t = W_hy h_t + b_y    (3.1)
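Equations 3.1 unroll into a simple loop over time. The Python sketch below uses an elementwise sigmoid for H; the weights are random placeholders, since only the shape of the computation is of interest here.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y):
    """Vanilla RNN (Eq. 3.1): h_t = sigmoid(W_xh x_t + W_hh h_{t-1} + b_h),
    y_t = W_hy h_t + b_y. The hidden state carries all previous context."""
    h = np.zeros(W_hh.shape[0])
    ys = []
    for x_t in x_seq:
        h = 1.0 / (1.0 + np.exp(-(W_xh @ x_t + W_hh @ h + b_h)))
        ys.append(W_hy @ h + b_y)
    return np.array(ys), h

n_in, n_hid, n_out = 2, 4, 1
rng = np.random.default_rng(1)
ys, h_T = rnn_forward(rng.standard_normal((5, n_in)),
                      rng.standard_normal((n_hid, n_in)),
                      rng.standard_normal((n_hid, n_hid)),
                      rng.standard_normal((n_out, n_hid)),
                      np.zeros(n_hid), np.zeros(n_out))
```

Because h_t depends on h_{t-1}, which depends on h_{t-2}, and so on, the final hidden state is a function of the entire input history, which is the "depth in time" referred to above.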
To adapt the standard RNN model to speech recognition, the authors of (Graves et al., 2013b,a)
introduced three extensions, creating what they called the Deep Bidirectional Long Short-Term
Memory (DBLSTM). This model has achieved the lowest error rates recorded so far on the TIMIT
database.
Firstly, they introduced a much more complicated non-linearity H, represented in Figure 3.5.
Secondly, they made it possible for the model to make use not only of previous context, but also
to exploit future context. This is possible since, in speech recognition applications, whole
utterances are transcribed at once. They included this functionality by adding two hidden layers
to the hierarchy which process the data in both directions of time. This is illustrated in
Figure 3.6.
Figure 3.5: Long Short-term Memory Cell
Lastly, motivated by the interest in deep architectures for their ability to build progressively
higher representations of the data, they stacked multiple of these structures on top of each other,
as shown in Figure 3.7.
Figure 3.6: Bidirectional Recurrent Neural Network
Figure 3.7: Deep Bidirectional Long Short-Term Memory
3.2 Hierarchical Linear Dynamical Systems
The HLDS model (Cinar and Principe, 2014; Cinar et al., 2014) consists of a hierarchical
structure whose layers are coupled linear dynamical systems. A block diagram is presented in
Figure 3.8. The system dynamics are described in equations 3.2, 3.3 and 3.4.
The model’s hidden states are xt ∈ Rn, ut ∈ Rk and zt ∈ Rs for the first, second and third layer
respectively. The observation vector is yt ∈ Rm.
The dimensionality decreases as we go up in the hierarchy (n > k > s), so that the states are
restricted to progressively smaller representation spaces, which are used for clustering.
The authors decided to insert some a priori information in the model by using Gammatone
filters, which are reliable models of cochlear filters, in the observation matrix. A fixed-point
behavior is imposed by the identity transition matrix in the highest layer of the hierarchy. This
stabilizes the system, since each layer is driven by the one above it, resulting in the creation of
clusters in the state space.
As we can see in the model's equations, the model can be re-written in a joint state space
(equation 3.4). This enables the estimation of the hidden states of all layers simultaneously
using the standard Kalman filter equations.
This model learns by estimating the parameters of the matrices while inferring the states of the
HLDS. This is called sequential estimation. For the same observation, we consider two dual
systems: the usual state system and a second one which represents the parameter dynamics. To
create this parameter system, we vectorize the original system's matrices and treat those
parameters as if they were states. For this dual system we consider an identity transition matrix.
Therefore, two Kalman filters are used in parallel, one for estimating the states and another for
estimating the parameters.
Figure 3.8: Hierarchical Dynamical System
z_t = z_{t-1} + p_t
u_t = G u_{t-1} + D z_{t-1} + r_t
x_t = F x_{t-1} + B u_{t-1} + w_t
y_t = H x_t + v_t    (3.2)

[z_t]   [I 0 0] [z_{t-1}]   [p_t]
[u_t] = [D G 0] [u_{t-1}] + [r_t]
[x_t]   [0 B F] [x_{t-1}]   [w_t]

y_t = [0 0 H] [z_t; u_t; x_t] + v_t    (3.3)

X̃_t = F̃ X̃_{t-1} + W̃_t
y_t = H̃ X̃_t + v_t    (3.4)
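Assuming the block structure of equation 3.3, the joint matrices F̃ and H̃ can be assembled and the hidden states of all layers inferred with the standard Kalman recursions, as in equation 3.4. The Python sketch below uses illustrative layer sizes and randomly chosen coupling matrices, and omits the parallel parameter-estimation filter described above.

```python
import numpy as np

def kalman_step(x_est, P, y, F, H, Q, R):
    """One predict/update step of the standard Kalman filter on the
    joint HLDS state (Eq. 3.4): X_t = F X_{t-1} + w_t, y_t = H X_t + v_t."""
    # predict
    x_pred = F @ x_est
    P_pred = F @ P @ F.T + Q
    # update with the new observation y
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (y - H @ x_pred)
    P_new = (np.eye(len(x_est)) - K @ H) @ P_pred
    return x_new, P_new

# Toy layer sizes s < k < n, and the block-structured joint matrices of Eq. 3.3
s, k, n, m = 1, 2, 3, 4
rng = np.random.default_rng(2)
G, D = 0.5 * np.eye(k), 0.1 * rng.standard_normal((k, s))
F_l, B = 0.5 * np.eye(n), 0.1 * rng.standard_normal((n, k))
H_l = rng.standard_normal((m, n))
F_joint = np.block([[np.eye(s), np.zeros((s, k)), np.zeros((s, n))],
                    [D,                G,         np.zeros((k, n))],
                    [np.zeros((n, s)), B,         F_l]])
H_joint = np.hstack([np.zeros((m, s)), np.zeros((m, k)), H_l])
x, P = np.zeros(s + k + n), np.eye(s + k + n)
for y in rng.standard_normal((10, m)):
    x, P = kalman_step(x, P, y, F_joint, H_joint,
                       0.01 * np.eye(s + k + n), 0.1 * np.eye(m))
```

Note how the identity block for the top layer and the zero blocks above the diagonal encode the top-down driving structure: each layer is influenced only by itself and the layer above.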
Chapter 4
Work Plan
4.1 Calendarization
The work plan for this thesis is illustrated in the Gantt chart in Figure 4.1. The thesis officially
starts on the 18th of February and the due date is considered to be the 29th of June, which
corresponds to a total available time of 131 days. The constituent tasks are the following:
• Web page development — As a requirement of Preparação da Dissertação, a personal website
with weekly reports has to be created and updated during the thesis development. This will
be maintained for the entire duration of the work, except for the time reserved for the writing;
• Implementation of the Hierarchical model in study — This will be the starting point
of the thesis; everything will be built on top of this initial system. The time expected to
complete this task is 4 weeks;
• Testing the model performance in musical data — The results will be compared with
the ones shown in the original paper. This is essential for verifying the correctness of the
implementation. The time expected to complete this task is 2 weeks;
• Adaptation and Implementation of the HLDS algorithm for speech — This is the point
where most difficulties are expected to appear. The algorithm is not expected to perform
well without some modification, due to the convergence time until a cluster is reached. The
time expected to complete this task is 4 weeks;
• Testing the new model performance in speech data — The experimental results will be
measured and compared with the state-of-the-art algorithms. This will make it possible to
test new ideas by seeing what does or does not improve the performance of the algorithm.
The time expected to complete this task is 4 weeks;
• Writing the thesis and scientific article — This is the last stage of the project and will
consist of writing a report describing in detail all the work done, the experiments made and
the results obtained. The writing of a scientific article is also expected. The time expected
to complete this task is at least 4 weeks. However, depending on the workload, this task
might begin earlier, in parallel with the remaining work.
Figure 4.1: Gantt Chart
4.2 Tools and Resources
The implementation of the algorithms and experiments will be done in Matlab. For the first stage
of the thesis, which consists of implementing the initial algorithm and testing it on music, the
"Musical Instrument Samples" from the University of Iowa Electronic Music Studios (iow, 1997)
will be used, as it was the database employed in the original papers. For the adaptation and
testing of the algorithm for speech, the TIMIT Speech Corpus (Garofolo et al., 1993) will be the
database of choice. The Matlab toolboxes used will be MatlabADT (MATLAB Audio Database
Toolbox), for easy access to the TIMIT database, and the Auditory Toolbox, for generating the
auditory filters required by the algorithms.
4.3 Work Done
So far, the algorithm has been studied and its implementation is about halfway through.
Moreover, the tools and databases required have been acquired. There was a Skype meeting with
the authors of the algorithm in question to clarify some implementation details.
References
University of Iowa Electronic Music Studios, "Musical Instrument Samples". theremin.music.uiowa.edu/, 1997. Accessed: 2015-02-17.

International Phonetic Association, "The International Phonetic Alphabet". internationalphoneticassociation.org/sites/default/files/IPA_chart_%28C%292005.pdf, 2005. Accessed: 2015-02-17.

ISO British Standard. 226:2003. Acoustics: normal equal-loudness-level contours. BSi, 2003.

Goktug T. Cinar and Jose C. Principe. Clustering of time series using a hierarchical linear dynamical system. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 6741–6745. IEEE, 2014.

Goktug T. Cinar, Carlos A. Loza, and Jose C. Principe. Hierarchical linear dynamical systems: A new model for clustering of time series. In Neural Networks (IJCNN), 2014 International Joint Conference on, pages 2464–2470. IEEE, 2014.

George E. Dahl, Dong Yu, Li Deng, and Alex Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. Audio, Speech, and Language Processing, IEEE Transactions on, 20(1):30–42, 2012.

Li Deng. Three classes of deep learning architectures and their applications: A tutorial survey. APSIPA Transactions on Signal and Information Processing, 2012. URL http://research.microsoft.com/apps/pubs/default.aspx?id=192937.

Harvey Fletcher. Auditory patterns. Reviews of Modern Physics, 12(1):47, 1940.

John S. Garofolo, Linguistic Data Consortium, et al. TIMIT: Acoustic-phonetic continuous speech corpus. Linguistic Data Consortium, 1993.

Alex Graves, Navdeep Jaitly, and A.-R. Mohamed. Hybrid speech recognition with deep bidirectional LSTM. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 273–278. IEEE, 2013a.

Alex Graves, A.-R. Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6645–6649. IEEE, 2013b.

Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82–97, 2012.

Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, and Raj Reddy (foreword). Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall PTR, 2001.

T. Warren Liao. Clustering of time series data: a survey. Pattern Recognition, 38(11):1857–1874, 2005.

Richard F. Lyon, Andreas G. Katsiamis, and Emmanuel M. Drakakis. History and future of auditory filter models. In Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, pages 3809–3812. IEEE, 2010.

John Makhoul and Richard Schwartz. State of the art in continuous speech recognition. Proceedings of the National Academy of Sciences, 92(22):9956–9963, 1995.

Jun Qi, Dong Wang, Yi Jiang, and Runsheng Liu. Auditory features based on gammatone filters for robust speech recognition. In Circuits and Systems (ISCAS), 2013 IEEE International Symposium on, pages 305–308. IEEE, 2013.

Heather L. Read, Jeffery A. Winer, and Christoph E. Schreiner. Functional architecture of auditory cortex. Current Opinion in Neurobiology, 12(4):433–440, 2002.