Dialog History Construction with Long-Short Term Memory for Robust Generative Dialog State Tracking

One of the crucial components of a dialog system is the dialog state tracker, which infers the user's intention from preliminary speech processing. Since the overall performance of the dialog system is heavily affected by that of the dialog state tracker, it has been one of the core areas of research on dialog systems. In this paper, we present a dialog state tracker that combines a generative probabilistic model of dialog state tracking with a recurrent neural network that encodes important aspects of the dialog history. We describe a two-step gradient descent algorithm that optimizes the tracker with a complex loss function. We demonstrate that this approach yields a dialog state tracker that performs competitively with the top-performing trackers that participated in the first and second Dialog State Tracking Challenges.


Introduction
Spoken dialog systems (SDS) are a rapidly growing subject of research with the grand challenge of developing intelligent systems capable of understanding conversational commands by human users. The generic architecture of a spoken dialog system is shown in Figure 1: the system decodes the user's voice into a list of hypotheses in the form of action-type and slot-value pairs (see Table 1 for details), which is referred to as spoken language understanding (SLU). The system then analyzes the hypotheses to decide how to respond to the user, which is referred to as dialog management (DM). The DM can be further decomposed into two sub-tasks: dialog state tracking, which identifies the user goal, and dialog policy optimization, which finds the appropriate response. The actual vocal response is produced by speech generation.
Due to the inevitable misinterpretation of the user utterance (i.e. ambiguity) in SLU, robust DM is essential in any dialog system. Since the task of determining an appropriate response based on the history of SLU data aligns well with the reinforcement learning framework, a large number of previous studies were based on the Markov decision process (MDP, Levin et al. 1998; Daubigney et al.; Williams and Young 2007; Young et al. 2010; Gasic and Young 2014). One of the most representative works is the SDS-POMDP (Young, 2006). It treats SLU data as noisy and partial observations of the underlying state of the dialog, which encapsulates sufficient information for DM: the user's intention, utterance, and the dialog history. Actions of the SDS-POMDP are defined as abstract system responses, one of which is selected to maximize the expected accumulation of rewards. In this fashion, the SDS-POMDP provides a unified decision-theoretic model for DM.
Given a pre-determined dialog strategy, the SDS-POMDP tracks the dialog state using a Bayesian filter. This is often referred to as a generative dialog state tracker, which facilitates identifying core components that can be engineered individually through standard statistical machine learning techniques. However, the performance of the optimized tracker still critically depends on what information is captured from the dialog for bookkeeping. If an important aspect of the dialog is not captured and kept appropriately, the tracker cannot be accurate in inferring the user's intention. In most of the previous work, this bookkeeping was done by heuristically selecting the information and embedding it into the dialog history (Young et al., 2010).
On the other hand, we note that there are a number of recent approaches using discriminative models, which do not involve an explicit bookkeeping mechanism (Metallinou et al., 2013; Lee, 2013; Williams, 2014). They use features that are defined over a finite window of previous dialog turns. A more recent work using a recurrent neural network (Henderson et al., 2014b) enables learning with features that are defined over a window of arbitrary length. In these approaches, the bookkeeping mechanism can be viewed as implicit and parameterized, and it is optimized through training. This in general made such trackers perform much better than generative ones.
In this paper, we present a generative dialog state tracker that uses the Long-Short Term Memory (LSTM), a deep-learning architecture for recurrent neural networks that overcomes the major drawback of the original recurrent neural network, namely vanishing gradients (Bengio et al., 1994; Hochreiter and Schmidhuber, 1997). With this approach, we can preserve the intuitiveness and compositionality of generative dialog state trackers while autonomously bookkeeping appropriate information over time, which results in a boost in performance. This is especially useful in practical applications, where we typically have some prior knowledge of the characteristics of the dialog domain and the speech and language processing units.
Our method extends the Bayesian filter in the SDS-POMDP so that the dialog history is replaced with an embedding vector computed by the LSTM. We then propose a two-step optimization algorithm to deal with local convergence when training the model. Our tracker performed better than any tracker from the first Dialog State Tracking Challenge (DSTC1, Williams et al. 2013) and on par with the top-performing trackers from the second Dialog State Tracking Challenge (DSTC2, Henderson et al. 2014a). This paper is an extension of Kim et al. (2013) and Lee et al. (2014) that uses the LSTM for the dialog history.
The rest of the paper is organized as follows. Section 2 describes the Bayesian filtering model of the SDS-POMDP, with the additional techniques we adopted in this work. Section 3 describes the component probability models outlined in Section 2. The optimization algorithm designed for our tracker is explained in Section 4. In Section 5, the experimental setup, results and analyses are described. We conclude the paper with discussions in Section 6.

Bayesian Filtering for the Dialog State Tracking
We start from the basic setting of the SDS-POMDP framework: in each turn of the dialog, the system executes system action a, and the user with goal g responds to the system with user action u. The SLU module processes the user action and generates the result as an N-best list (in other words, an observation) o = [(ũ1, f1), . . . , (ũN, fN)] of hypothesized user actions ũi and their associated confidence scores fi. Table 1 shows a dialog where the SLU generates N-best lists from noisy user actions and the system responds to the user after tracking the true user goals from the SLU output.
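As a concrete illustration, one turn's N-best list can be represented as follows. This is a hedged sketch: the class and field names, and the bus-route slot values, are our own illustration, not from the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    """One SLU hypothesis: a user action type, slot-value pairs, and the
    associated confidence score f_i."""
    action_type: str                    # e.g. "inform", "affirm", "negate"
    slot_values: dict = field(default_factory=dict)
    confidence: float = 0.0

# One turn's observation o = [(u~_1, f_1), ..., (u~_N, f_N)]
observation = [
    Hypothesis("inform", {"route": "61c"}, 0.7),
    Hypothesis("inform", {"route": "61d"}, 0.2),
]
```

Note that the confidence scores over the N-best list need not sum to one; the remaining mass covers off-list actions.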
Here, the dialog state in each turn is defined as s = (u, g, h), where h is the dialog history encapsulating additional information needed for tracking the dialog state (Williams, 2008). Because the system cannot exactly identify the user goal, it maintains a posterior probability distribution over dialog states, called a belief. The belief over the dialog states is updated by Bayesian filtering:

b'(u', g', h') = η Pr(o | u', g', h', a) Σ_{u,g,h} Pr(u', g', h' | u, g, h, a) b(u, g, h)    (1)

where η is the normalization constant. With reasonable independence assumptions, the joint probabilities in the equation can be factorized and approximated into simpler forms. Further assuming that the user goal does not change, we obtain

b'(u', g', h') = η Pr(o | u') Pr(u' | g', h', a) Σ_{u,h} Pr(h' | u, g', h, a) b(u, g', h)

where Pr(o|u') is the observation model, Pr(u'|g', h', a) is the user action model, and Pr(h'|u, g', h, a) is the history model. According to this formulation, the posterior computation has to be carried out over all possible user goals in order to obtain the normalizing constant η. This is not feasible for domains with large goal spaces, so the belief is maintained over partitions of user goals instead. Beginning with one root partition with probability 1, partitions are split whenever a distinction is required by an observation, i.e. a user action hypothesis from the SLU output. This confines the possible goal states to the values that have appeared at least once as an SLU hypothesis, and provides scalability without a significant loss in accuracy as long as the coverage of the N-best list is extensive enough to include the true user action. By defining the belief refinement model Pr(ψ'|ψ) to be the probability mass ratio by which a partition splits, the Bayesian filtering equation (4) becomes

b'(ψ', h') = η Σ_{u'} Pr(o | u') Pr(u' | ψ', h', a) Σ_{h,u} Pr(h' | u, ψ, h, a) Pr(ψ' | ψ) b(u, ψ ⊃ ψ', h)    (5)

where ψ ⊃ ψ' denotes that ψ is a parent partition (superset) of ψ'.
Even with the partitioned approach, the total number of partitions grows very fast as the dialog progresses (exponentially in the number of observed goal types, polynomially in the number of observed goals), so it is necessary to limit the number of partitions. We used the incremental partition recombination algorithm (Williams, 2010), which recombines the less important partitions in each turn whenever the number of partitions exceeds a threshold.
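The recombination step can be sketched as follows. This is a simplified illustration with hypothetical names; for brevity, low-belief partitions are folded into a single merged remainder rather than back into their exact parents as in Williams (2010).

```python
def recombine(partitions, max_partitions):
    """partitions: list of (goal_set, belief) pairs.
    If the partition count exceeds the threshold, keep the highest-belief
    partitions and fold the remainder into a single merged partition."""
    if len(partitions) <= max_partitions:
        return partitions
    ranked = sorted(partitions, key=lambda p: p[1], reverse=True)
    kept = ranked[:max_partitions - 1]
    merged = ranked[max_partitions - 1:]
    rest_goals = set().union(*(g for g, _ in merged))   # union of merged goal sets
    rest_belief = sum(b for _, b in merged)             # belief mass is additive
    return kept + [(rest_goals, rest_belief)]
```

Because belief mass is additive over disjoint goal sets, merging partitions this way preserves the total belief.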

Designing Component Probability Models
Equation (5) needs a model for each component probability term: the observation model, the user action model, the history model, and the belief refinement model. Although assigning simple statistics to each component probability model (Young et al., 2010; Kim et al., 2013) can be a valid choice, Lee et al. (2014) showed that designing each probability model appropriately and optimizing it results in much better performance. In the following subsections, each component probability model is described.

Observation model
The observation model Pr(o|u) represents the probability of the structured observation o = [(ũ1, f1), . . . , (ũN, fN)] given user action u. Here, we assume that the observation model depends only on the type of the user action u and its associated confidence score fi, to obtain a simple model. Then, we obtain

Pr(o | u = ũi) ≈ k (exp(w_type(u)) fi + exp(b_type(u)))

where w_type(u) is the weight associated with the specific action type of u and b_type(u) is the bias term for the same action type; both are exponentiated to preserve non-negativity. k is the normalizing constant, which can be ignored since it is subsumed by the constant η in the belief update equation (5).
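Under the assumed form above, the unnormalized observation score can be sketched as follows; the function and parameter names are our own illustration.

```python
import math

def observation_score(action_type, confidence, w, b):
    """Unnormalized Pr(o|u): exponentiated per-type weight times the SLU
    confidence, plus an exponentiated per-type bias. The constant k is
    dropped, as it is absorbed by eta in the belief update."""
    return math.exp(w[action_type]) * confidence + math.exp(b[action_type])
```

For example, with w["inform"] = 0 and b["inform"] = -2, a hypothesis with confidence 0.7 scores 0.7 + e^(-2).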
The user action type here is a domain-specific attribute of the SLU hypothesis, given a priori (e.g. inform, affirm, negate in Table 1). Refer to the DSTC1 or DSTC2 handbook for a full description of the user and system action types.

Belief refinement model
The belief refinement model Pr(ψ'|ψ) defines the split ratio of partitions when a partition is required to split due to an observation. By definition, the full specification of the belief refinement model corresponds to the prior distribution over partitions, which is the sum of the prior probabilities of the individual user goals included in each partition. We define the prior on goals Pr(g) as a smoothed empirical distribution over the goals observed in the training data. With the prior probability of each individual user goal, we compute the belief refinement model by summing the prior probabilities of user goals to obtain the ratio: Pr(ψ'|ψ) = Pr(ψ')/Pr(ψ), where Pr(ψ) = Σ_{g∈ψ} Pr(g), if ψ is divided into ψ' and other partitions at this turn by the observation, and Pr(ψ'|ψ) = 0 otherwise.
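A minimal sketch of the goal prior and split ratio follows. The add-alpha smoothing form is our assumption, since the text specifies only a "smoothed empirical distribution"; function names are illustrative.

```python
def goal_prior(counts, alpha=1.0):
    """Add-alpha smoothed empirical distribution over user goals.
    counts: dict mapping goal -> training-set count. (Smoothing form assumed.)"""
    total = sum(counts.values()) + alpha * len(counts)
    return {g: (c + alpha) / total for g, c in counts.items()}

def split_ratio(child_goals, parent_goals, prior):
    """Pr(psi'|psi): prior mass of the child partition over the parent's
    when the parent splits; 0 if the child is not contained in the parent."""
    if not child_goals <= parent_goals:
        return 0.0
    return sum(prior[g] for g in child_goals) / sum(prior[g] for g in parent_goals)
```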

User action model and history model
In previous approaches (Young et al., 2010; Lee et al., 2014), the user action model Pr(u' | ψ', h', a) only utilized the features of the current turn and relied on manually defined bookkeeping methods for h'. This is potentially disadvantageous compared to trackers that are able to learn which information to maintain over multiple dialog turns (Metallinou et al., 2013; Lee, 2013; Williams, 2014; Henderson et al., 2014b). The disadvantage becomes apparent in complex domains like DSTC2, where the conversation topic changes over time. In DSTC2, whose domain was a restaurant information system, the user usually starts by providing search constraints and then proceeds to choosing the restaurant to be informed about. Learning such phase transitions, across which the user acts differently, is therefore crucial for designing a robust user action model.
Recurrent neural networks provide a natural model for handling such problems, as they are able to model dynamic sequences with complex behaviors by learning what information to keep. Adopting a recurrent neural network for dialog state tracking has proven successful in Henderson et al. (2014b), while our approach focuses on learning the dependencies over dialog turns in a generative state tracker, encoding them as the dialog history. Instead of the conventional recurrent neural network, we adopt one of its variants, the Long-Short Term Memory (LSTM; Hochreiter and Schmidhuber 1997), which effectively mitigates the vanishing gradient problem in training recurrent neural networks (Bengio et al., 1994).

LSTM ARCHITECTURE USED FOR USER ACTION MODEL
The input to the LSTM is a feature vector constructed from u, ψ and a. We used every combination of [user action type, system action type, partition-action consistency] that appears in the training set as the feature set, which results in at most a 1-of-(|u| · |a| · 4) coding.
In the following equations of the LSTM:

• φ(u_t, ψ_t, a_t) is the 1-of-(|u| · |a| · 4) coded feature vector, which is the input to the LSTM.
• C_{t−1}, C_t are the cell values stored in the LSTM.
• y_{t−1}, y_t are the outputs of the LSTM, which are also stored.

(Since we only consider the feature combinations that appear in the training set, the actual vector length is much smaller than |u| · |a| · 4 because of nonexistent system action-user action combinations. Also, because an utterance can carry two or more action types, there are cases where the coding is not exactly 1-of-k.)

The feature vector first passes through a linearly compressing input layer, x_t = w_l φ(u_t, ψ_t, a_t). The input layer is then passed through the input gate to compute a new cell value: the cell value is obtained by adding the gated input to the previous cell value if it is not forgotten, as follows.

i_t = σ(W_i x_t + U_i y_{t−1} + b_i)
f_t = σ(W_f x_t + U_f y_{t−1} + b_f)
C_t = f_t * C_{t−1} + i_t * tanh(W_c x_t + U_c y_{t−1} + b_c)

where * denotes the element-wise multiplication between vectors. The output is then computed by the output gate:

o_t = σ(W_o x_t + U_o y_{t−1} + b_o)
y_t = o_t * tanh(C_t)

We finally obtain the user action model value Pr(u_t | ψ_t, h_t, a_t) by adding the direct edge from the feature function and applying a softmax:

Pr(u_t | ψ_t, h_t, a_t) = exp(w_d · φ(u_t, ψ_t, a_t) + v · y_t) / Σ_u exp(w_d · φ(u, ψ_t, a_t) + v · y_t(u))

where y_t on the right-hand side is implicitly dependent on h_t via the information stored in the LSTM. In practice, however, we cannot sum the LSTM outputs over every possible u. Instead, we evaluate the LSTM only for the hypotheses ũ_i from the N-best list, and use a single off-list action ũ_offlist to cover the remaining actions.
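The LSTM step and the softmax over per-action scores can be sketched as follows, with vectors reduced to scalars for brevity; all parameter names are illustrative, and the score composition (direct edge plus LSTM output) is an assumption about the exact form.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, c_prev, y_prev, p):
    """One LSTM step with scalar 'vectors'; p maps parameter names to floats."""
    i = sigmoid(p["wi"] * x + p["ui"] * y_prev + p["bi"])   # input gate
    f = sigmoid(p["wf"] * x + p["uf"] * y_prev + p["bf"])   # forget gate
    c = f * c_prev + i * math.tanh(p["wc"] * x + p["uc"] * y_prev + p["bc"])
    o = sigmoid(p["wo"] * x + p["uo"] * y_prev + p["bo"])   # output gate
    y = o * math.tanh(c)
    return c, y

def user_action_model(scores):
    """Softmax over per-action scores (direct edge plus LSTM output)."""
    m = max(scores.values())
    exps = {u: math.exp(s - m) for u, s in scores.items()}
    z = sum(exps.values())
    return {u: e / z for u, e in exps.items()}
```

Subtracting the maximum score before exponentiating is the usual numerically stable softmax.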

HISTORY MODEL AND M-BEST LSTMS
Here, h ⊃ h' denotes that h is the parent history of h' (i.e. h' = {h, a, u, ψ}). This implies that we should maintain |h| |ψ| partitions and |h| LSTMs at each turn to compute the posterior probability over user goals exactly. However, this is impractical due to the exponentially growing number of histories. To gain a better insight, the following is the quantity to be evaluated for the dialog state tracking problem:

b'(ψ') = Σ_{h'} b'(ψ', h') = η Σ_{h'} Σ_{u'} Pr(o | u') Pr(u' | ψ', h', a) Pr(ψ' | ψ) b(u ∈ h', ψ ⊃ ψ', h ⊃ h')

It can be seen that we are taking a weighted average of LSTMs over the joint belief b(u ∈ h', ψ ⊃ ψ', h ⊃ h') for every h'. Similar to the partition recombination technique (Williams, 2010) that limits the number of partitions to a constant, we can approximate this by limiting the number of histories to M (equivalently, M LSTMs). Since histories cannot be aggregated, all but the M histories with the highest belief are ignored.
A belief update with M LSTMs is then performed, as specified in Algorithm 1.
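The M-best history approximation can be sketched as a pruning step over the belief mass assigned to histories; names are illustrative.

```python
def prune_histories(beliefs, m):
    """beliefs: dict mapping history id -> belief mass.
    Keep the M highest-belief histories and renormalize; the rest are
    dropped, since history states (LSTM cell/output pairs) cannot be merged."""
    top = sorted(beliefs.items(), key=lambda kv: kv[1], reverse=True)[:m]
    z = sum(b for _, b in top)
    return {h: b / z for h, b in top}
```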

Optimization
In this paper, we use the L2 metric as the loss function, since it has been found to be the most influential on dialog system performance (Lee, 2014). The model is hence optimized to minimize the L2 loss

L = Σ_i Σ_t Σ_{ψ ∈ Ψ_{i,t}} (b(ψ) − r(ψ))²

where i sums over the N training instances, t sums over the T turns of each training instance, and Ψ_{i,t} is the group of partitions at turn t of instance i. r(ψ) is the binary label, with value 1 if and only if the partition ψ contains the true user goal. In the rest of this section, the gradient methods used to optimize this loss function are described.
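The loss can be computed with a straightforward nested sum; the data layout here is our own illustration.

```python
def l2_loss(instances):
    """L2 loss over all instances and turns.
    instances: list of instances; each instance is a list of turns;
    each turn is a list of (belief, label) pairs, with label r(psi) in {0, 1}."""
    return sum((b - r) ** 2
               for instance in instances
               for turn in instance
               for b, r in turn)
```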

Cascading gradient
Taking the insight from the BPTT algorithm (Mozer, 1989), we can unfold the update formula through time and calculate the gradient with respect to the weight vectors. For instance, for a parameter w from the observation model, ∂L/∂w can be obtained by applying the chain rule through the belief recursion:

∂b'(ψ')/∂w = ∂b'(ψ')/∂w |_{b fixed} + Σ_ψ (∂b'(ψ')/∂b(ψ)) (∂b(ψ)/∂w)

We call this the cascading gradient, since computing ∂b'(ψ')/∂w requires the gradient from the previous dialog turn, ∂b(ψ)/∂w, and hence reflects the temporal impact of the parameter change throughout the dialog turns. Once we obtain the gradients, we can simultaneously update all the parameters with any gradient-based algorithm; in this paper, L-BFGS was used.
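The cascading idea can be illustrated on a toy scalar filter (not the paper's actual update): the gradient is accumulated forward through turns by the product rule, and can be checked against finite differences.

```python
import math

def cascading_gradient(w, xs, b0=1.0):
    """Forward-accumulate the belief b_t and its gradient db_t/dw through
    turns for the toy update b_t = sigmoid(w * x_t) * b_{t-1}, which stands
    in for the real filtering equation."""
    b, db = b0, 0.0
    for x in xs:
        s = 1.0 / (1.0 + math.exp(-w * x))
        ds = s * (1.0 - s) * x        # d/dw sigmoid(w * x)
        db = ds * b + s * db          # cascade: direct term + previous-turn term
        b = s * b
    return b, db
```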

Initialization using simple gradient
L-BFGS with the cascading gradient, however, seriously suffers from local convergence, since the cost function is highly complex in the parameter space, mostly due to the repeated appearance of features throughout the dialog turns. Randomized initialization had a limited effect because of the large dimensionality of the parameter space.
If we ignore the gradient from the previous turn and treat each turn as an individual training instance, the gradient becomes, for a parameter w of the observation model,

∂b'(ψ')/∂w ≈ ∂b'(ψ')/∂w |_{b fixed}

Since this is an approximation that captures only the direct impact of the features and makes the loss function much simpler, it takes the parameters close to the global optimum, increasing the chance of L-BFGS converging to a good local optimum. Our optimization algorithm is hence composed of two steps: first initializing with the simple gradient, and then training with the cascading gradient.

Experiment and Analysis

In the experiments, we used datasets from both DSTC1 and DSTC2. Three labeled training datasets (train1a, train2, train3) and four test datasets (test1, test2, test3, test4) were included in DSTC1, whereas two labeled training datasets (DSTC2 train, DSTC2 dev) and one test dataset (DSTC2 test) were included in DSTC2. A description of the datasets is provided in Table 2 and Table 3. Test4 is omitted in this paper since a significant number of missing or incorrect labels were found in it.

We can group the datasets based on their characteristics. Datasets train1a, train2, test1, and test2 have many turns in each call. However, they include only one hypothesis in each dialog turn (1-best SLU output) and the user goal rarely changes. These datasets can be considered relatively easy, since there is not much information to examine in inferring the user goal.
On the other hand, datasets train3, test3, DSTC2 train, DSTC2 dev and DSTC2 test are considered harder. These datasets contain dialogs that are closer to real-world conversation, with latent relations among system actions, user actions and dialog flows. Dependencies that span multiple dialog turns are more frequent, and in the DSTC2 datasets they even change over turns.
For each target test dataset, we chose the training datasets that are known to be similar. Specifically, train1a and train2 are used to train the tracker tested on test1 and test2, whereas train3 is used for test3, and DSTC2 train and DSTC2 dev are used for DSTC2 test. We only used the SLU data for observations (i.e. ignored ASR information), and our results are therefore compared with teams that only used SLU data. Note that the evaluations of our algorithm were performed after the release of the test sets, while the other teams' results were evaluated before the release.
We measured the tracker performance according to the following evaluation metrics used in the DSTC:

• accuracy (acc) measures the rate of the most likely hypothesis h1 being correct.
• average score (avgp) measures the average of scores assigned to the correct hypotheses.
• L2 follows the definition in equation (18).
• mean reciprocal rank (mrr) measures the average of 1/R, where R is the minimum rank of the correct hypothesis.
• ROC equal error rate (eer) is the sum of the false accept (FA) and false reject (FR) rates at the point where the FA rate equals the FR rate.
• ROC.v1.P measures the correct accept (CA) rate when the false accept (FA) rate is at most P%.
N(FA), N(CR), N(CA) and N(FR) are the numbers of false accepts (FA), correct rejects (CR), correct accepts (CA), and false rejects (FR), respectively, and N_D is the total number of data instances. The evaluation takes the most likely hypothesis h1 and its score s1 and compares them with the threshold θ and the ground-truth user goal h*. Each evaluation increments the appropriate counter: if s1 ≥ θ, then N(CA) is incremented when h1 = h* and N(FA) otherwise; if s1 < θ, then N(FR) is incremented when h1 = h* and N(CR) otherwise. The CA rate is defined as N(CA)/N_D and the FA rate as N(FA)/N_D. ROC.v2.P from the DSTC1 metrics is not included in the experiment.
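The counter updates can be sketched as follows; the function and variable names are our own illustration of the rule described above.

```python
def roc_counts(predictions, theta):
    """predictions: list of (h1, s1, h_star) triples; returns the four
    ROC counters. h1 is accepted if its score s1 >= theta, then compared
    with the ground-truth goal h_star."""
    counts = {"CA": 0, "FA": 0, "CR": 0, "FR": 0}
    for h1, s1, h_star in predictions:
        if s1 >= theta:
            counts["CA" if h1 == h_star else "FA"] += 1
        else:
            counts["FR" if h1 == h_star else "CR"] += 1
    return counts
```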
Along with the datasets, baseline trackers that work in a simple deterministic manner were included. For example, the most basic one picks, for each goal slot, the SLU hypothesis with the highest confidence observed so far. A more detailed description is given in the DSTC2 handbook.

DSTC1 Result
Table 4 compares the performance of our tracker against that of the most competitive entries from the DSTC1 participants. Base stands for the baseline tracker, and X-Y stands for team X, entry Y. The following are the summary descriptions provided by the teams:

• Team 1: deep neural network (Henderson et al., 2013)
• Team 6: feature-rich discriminative model (Lee and Eskenazi, 2013)
• Team 9: generative dialog state tracker with basic statistics (Kim et al., 2013)

The following abbreviations are used for our tracker with different history models and optimization algorithms:

• C: direct-edge-only user action model optimized with the cascading gradient.
• SC: direct-edge-only user action model initialized with the simple gradient and optimized with the cascading gradient.
• LC: LSTM-history-based user action model optimized with the cascading gradient.
• LSC: LSTM-history-based user action model initialized with the simple gradient and optimized with the cascading gradient.
We chose the trackers with the minimum training objective among 100 random seeds. The number of partitions is limited to 10, and the number of histories (equivalently, LSTMs) is limited to 3. 50 units are used for all layers (input, cell, output) of the LSTMs. This parameter setting enables real-time tracking of the dialog state while preserving most of the required information.
Overall, the trackers initialized with the simple gradient outperform the trackers without the initialization step. The local convergence problem of the cascading gradient is most severe in test3, where the problem is more complex, although the other domains also show a clear difference in scores. The trackers optimized with the two-step procedure outperform the trackers of the other teams that participated in DSTC1.
In the case of history construction using the LSTM, DSTC1 test1 and test2, which are categorized as easy datasets, show no performance gain from the history construction. This is expected, since the dialogs of test1 and test2 are so simple that keeping extra information from the dialog history does not help. The results on test1 with the LSTMs are even worse than without, since the increased number of parameters raises the problem of over-fitting.
On the other hand, a distinct difference between the trackers with and without LSTMs is observed in test3. Although the accuracies are similar (0.891 vs. 0.898), the L2 scores show a clear decrease from 0.189 to 0.172 with LSTMs when optimized with the two-step procedure.

DSTC2 Result
Similar to the previous section, we compare our tracker to the other trackers submitted to DSTC2. We do not include trackers that additionally use ASR (automatic speech recognition) information; for a fair comparison, we restricted the comparison to trackers that, like ours, only used SLU data.
The following are the summary descriptions provided by the teams:

• Team 1: linear-chain conditional random field (Kim and Banchs, 2014)
• Team 4: recurrent neural network (Henderson et al., 2014b)
• Team 6: maximum-entropy Markov model (Ren et al., 2014)
• Team 7: combined model of rule-based, maximum-entropy and deep neural network trackers (Sun et al., 2014)

Table 5: Results of the trackers of DSTC2 and our tracker using various models and optimization algorithms, evaluated on the joint slot. The bold face denotes the top score in each evaluation metric.
Evaluation is taken over the joint slot, which was the featured metric in DSTC2 and checks the joint correctness of every goal slot. The LSTM parameter configuration identical to the DSTC1 experiment was used for this experiment.
Since this is a complex dialog domain similar to the test3 dataset in DSTC1, initialization with the simple gradient showed a significant performance improvement for both models, with and without LSTMs. Moreover, the LSTM yielded an additional significant performance improvement compared to the DSTC1 datasets, successfully capturing more complex dependencies over dialog turns. While C, SC and LC yielded scores competitive with the other teams, LSC scored the highest among the trackers using SLU data.
Note that team 4 (Henderson et al., 2014b) adopted a recurrent neural network and achieves very similar scores (0.737 vs. 0.741). For a detailed comparison, we replicated the work of team 4 following the description in Henderson's thesis (Henderson, 2015), presented above as 4 rep. 4 rep also achieves a very similar score, and the differences among the three systems (4-3, 4 rep, and LSC) do not seem significant when the randomness of the learning algorithm is considered.
However, we found that our system requires much less computational cost than the replicated system. The model itself uses ten times fewer parameters (3 × 10^5 parameters for 4 rep versus 3 × 10^4 parameters for LSC) to achieve similar or better performance, due to the careful choice of component probability models encoding prior knowledge.

Conclusion and Discussion
In this paper, we proposed a robust generative dialog state tracker that uses the Long-Short Term Memory (LSTM) for the dialog history. Based on Bayesian filtering in the SDS-POMDP framework, we designed each component probability model to capture important dependencies in dialog state tracking. The LSTM is used for the user action model, aimed at learning complex system-user action dependencies over time, and has exhibited a performance improvement in complex domains where bookkeeping of important dialog aspects is essential for accurate tracking.
We also proposed the two-step optimization algorithm that optimizes the tracker. The performance of a local optimization algorithm used to train the tracker was highly dependent on the initial solution, as the objective function is highly nonlinear in the parameter space. We therefore introduced a preliminary optimization stage using gradient descent with approximated gradients, calculated by ignoring the dependencies over time steps. The solution from the preliminary optimization was used as the initial solution for the second stage, where we calculated the exact gradients by dynamic programming. This two-step algorithm consistently improved the performance over the single-step optimization approach with exact gradients and a random initial solution.

5. The performance gains between trackers hence cannot be directly compared with the gains from DSTC1; it is harder to get a good score on the joint slot than the average score over individual slots.
We have demonstrated the performance of the tracker and the effectiveness of the optimization algorithm by comparing to the state-of-the-art trackers submitted to Dialog State Tracking Challenges 1 and 2.

Figure 1: A diagram of general spoken dialog systems.

Figure 2: The Long-Short Term Memory architecture used to model the user action model. Compared to the usual LSTM architecture, a linearly compressing input layer and a direct edge from input to output are added. The gray-colored cell vector is maintained through time. w_d, w_l are weight matrices (or vectors) to be learned; b_i, b_f, b_c, b_o are bias vectors.

Table 1: An example of a dialog depicted in DSTC1.

According to the model above, what we maintain as the history state h_t is the cell layer and the output layer of the previous turn's LSTM, in which the sequence [u_1, a_1, ψ_1, . . . , u_{t−1}, a_{t−1}, ψ_{t−1}] is implicitly embedded. The history model Pr(h_t | u_{t−1}, ψ_{t−1}, h_{t−1}, a_{t−1}) is therefore deterministic, and the Bayesian filtering equation then becomes

b'(ψ', h') = η Σ_{u'} Pr(o | u') Pr(u' | ψ', h', a) Σ_{h,u} Pr(h' | u, ψ, h, a) Pr(ψ' | ψ) b(u, ψ ⊃ ψ', h)
= η Σ_{u'} Pr(o | u') Pr(u' | ψ', h', a) Pr(ψ' | ψ) b(u, ψ ⊃ ψ', h)

where, in the second line, the inner sum collapses because the history model is deterministic and (u, ψ, h) are the components of h'.

Table 2 :
Description of the DSTC1 datasets used for the tracker (Bus Information System)

Table 3 :
Description of the DSTC2 datasets used for the tracker (Restaurant Information System)

Table 4 :
Results of the trackers of DSTC1 and our tracker using various models and optimization algorithms, evaluated by the average over all slots. The bold face denotes the top score in each evaluation metric.