Recurrent Polynomial Network for Dialogue State Tracking

Dialogue state tracking (DST) is the process of estimating the distribution over dialogue states as a dialogue progresses. Recent studies on the constrained Markov Bayesian polynomial (CMBP) framework took a first step towards bridging the gap between rule-based and statistical approaches to DST. In this paper, the gap is further bridged by a novel framework, the recurrent polynomial network (RPN). RPN's unique structure gives the framework all the advantages of CMBP, including efficiency, portability and interpretability. Additionally, RPN achieves more properties of statistical approaches than CMBP. RPN was evaluated on the corpora of the second and third Dialog State Tracking Challenges (DSTC-2/3). Experiments showed that RPN can significantly outperform both traditional rule-based approaches and statistical approaches with a similar feature set. Compared with state-of-the-art statistical DST approaches that use much richer features, RPN is also competitive.


Introduction
A task-oriented spoken dialogue system (SDS) is a system that can interact with a user to accomplish a predefined task through speech. It usually has three modules: input, output and control, shown in Figure 1. The input module consists of automatic speech recognition (ASR) and spoken language understanding (SLU), with which the user's speech is converted into text and semantics-level user dialogue acts are extracted. Once the user dialogue acts are received, the control module, also called dialogue management, accomplishes two missions. One mission is dialogue state tracking (DST), the process of estimating the distribution over dialogue states, an encoding of the machine's understanding of the conversation, as the dialogue progresses. The other mission is to choose semantics-level machine dialogue acts to direct the dialogue given the dialogue state, referred to as dialogue decision making. The output module converts the machine acts into text via natural language generation and generates speech from the text via text-to-speech synthesis.
Dialogue management is the core of an SDS. Traditionally, dialogue states are assumed to be observable and hand-crafted rules are employed for dialogue management in most commercial SDSs. However, because of unpredictable user behaviour and inevitable ASR and SLU errors, dialogue state tracking and decision making are difficult (Williams and Young, 2007). Consequently, in recent years there has been a research trend from rule-based dialogue management towards statistical dialogue management. The partially observable Markov decision process (POMDP) framework offers a well-founded theory for both dialogue state tracking and decision making in statistical dialogue management (Williams and Young, 2007; Thomson and Young, 2010; Gašić and Young, 2011; Young et al., 2010). In previous studies of POMDP, dialogue state tracking and decision making were usually investigated together. In recent years, to advance the research of statistical dialogue management, the DST problem has been separated out of the statistical dialogue management framework so that a variety of models can be investigated for DST. Most early studies of POMDP-based DST were devoted to generative models (Young et al., 2010). Fundamental weaknesses of generative models were revealed by the results of Williams (2012). In contrast, discriminative state tracking models have been successfully used for SDSs (Deng et al., 2013). The results of the Dialog State Tracking Challenge (DSTC) (Williams et al., 2013; Henderson et al., 2014c,a) further demonstrated the power of discriminative statistical models, such as Maximum Entropy (MaxEnt) (Lee and Eskenazi, 2013), Conditional Random Field (Lee, 2013), Deep Neural Network (DNN) (Sun et al., 2014b), and Recurrent Neural Network (RNN) (Henderson et al., 2014d). In addition to discriminative statistical models, discriminative rule-based models have also been investigated for DST due to their efficiency, portability and interpretability, and some of them showed good performance and generalisation ability in DSTC
(Zilka et al., 2013; Wang and Lemon, 2013). However, both rule-based and statistical approaches have some disadvantages. Statistical approaches have shown large variation in performance and poor generalisation ability due to the lack of data (Williams, 2012). Moreover, statistical models usually have more complex model structures and features than rule-based models, and thus can hardly achieve the efficiency, portability and interpretability of rule-based models. As for rule-based models, their performance is usually not competitive with the best statistical approaches. Furthermore, there is no general way to design rule-based models from prior knowledge, and no way to improve their performance when training data become available.
Recent studies on the constrained Markov Bayesian polynomial (CMBP) framework took the first step towards bridging the gap between rule-based and statistical approaches for DST (Sun et al., 2014a; Yu et al., 2015). CMBP formulates rule-based DST in a general way and allows data-driven rules to be generated. Concretely, in the CMBP framework, DST models are defined as polynomial functions of a set of features, whose coefficients are integers satisfying a set of constraints in which prior knowledge is encoded. The optimal DST model is selected by evaluating each model on training data. Yu et al. (2015) further extended CMBP to real-coefficient polynomials, where the real coefficients can be estimated by optimizing DST performance on training data using grid search. CMBP offers a way to improve performance when training data are available and achieves performance competitive with state-of-the-art statistical approaches, while keeping most of the advantages of rule-based models. Nevertheless, adding features to CMBP is not as easy as in most statistical approaches, because additional prior knowledge must be added to keep the search space manageable. For the same reason, increasing the model complexity, e.g. by using a higher-order polynomial or by introducing hidden variables, is also not very convenient.
In this paper, a novel framework, referred to as recurrent polynomial network (RPN), is proposed to further bridge the gap between rule-based and statistical approaches for DST. RPN's unique structure enables the framework to have all the advantages of CMBP including efficiency, portability and interpretability. Additionally, RPN achieves more properties of statistical approaches than CMBP.
The DSTCs have provided the first common testbed in a standard format, along with a suite of evaluation metrics to facilitate direct comparisons among DST models (Williams et al., 2013). To evaluate the effectiveness of RPN for DST, both the dataset from the second Dialog State Tracking Challenge (DSTC-2), which is in the restaurant domain (Henderson et al., 2014c), and the dataset from the third Dialog State Tracking Challenge (DSTC-3), which is in the tourist domain (Henderson et al., 2014a), are used. For both datasets, the dialogue state tracker receives SLU N-best hypotheses for each user turn, each hypothesis having a set of act-slot-value tuples with a confidence score. The dialogue state tracker is supposed to output a set of distributions over the dialogue state. In this paper, only joint goal tracking, the most difficult and general task of DSTC-2/3, is of interest.
The rest of the paper is organized as follows. Section 2 discusses ways of bridging rule-based and statistical approaches. Section 3 formulates RPN. The RPN framework for DST is described in section 4, followed by experiments in section 5. Finally, section 6 concludes the paper.

Bridging Rule-based and Statistical Approaches
Broadly, it is straightforward to come up with two possible ways to bridge rule-based and statistical approaches: one starts from rule-based models, while the other starts from statistical models. CMBP takes the first way, being derived as an extension of a rule-based model (Sun et al., 2014a; Yu et al., 2015). Inspired by the observation that many rule-based models, such as those proposed by Wang and Lemon (2013) and Zilka et al. (2013), are based on Bayes' theorem, the CMBP framework defines a DST rule as a polynomial function of a set of probabilities, since Bayes' theorem is essentially summation and multiplication of probabilities. Here, the polynomial coefficients can be seen as parameters. To make the model have good DST performance, prior knowledge or intuition is encoded into the polynomial functions by setting certain constraints on the polynomial coefficients, and the coefficients can further be optimized in a data-driven fashion. Therefore, starting from rule-based models, CMBP can directly incorporate prior knowledge or intuition into DST, while at the same time the model is allowed to be data-driven.
More concretely, assuming slot and value independence, a CMBP model can be defined as

$$b_t(v) = \mathcal{P}\big(b_{t-1}(v),\, P_t^+(v),\, P_t^-(v),\, \tilde P_t^+(v),\, \tilde P_t^-(v),\, b_{t-1}^r,\, 1\big) \quad (1)$$

subject to a set of constraints on the coefficients, where $b_t(v)$, $P_t^+(v)$, $P_t^-(v)$, $\tilde P_t^+(v)$, $\tilde P_t^-(v)$, $b_t^r$ are all probabilistic features, defined as below:
• $b_t(v)$: belief of "the value being v at turn t"
• $P_t^+(v)$: sum of scores of SLU hypotheses informing or affirming value v at turn t
• $P_t^-(v)$: sum of scores of SLU hypotheses denying or negating value v at turn t
• $\tilde P_t^+(v)$, $\tilde P_t^-(v)$: the corresponding sums of scores for values other than v at turn t
• $b_t^r$: belief of the value being 'None' (the value not mentioned) at turn t
and $\mathcal{P}(\cdot)$ is a multivariate polynomial function

$$\mathcal{P}(x_0, \cdots, x_D) = \sum_{0 \le k_1 \le \cdots \le k_n \le D} g_{k_1,\cdots,k_n}\, x_{k_1} \cdots x_{k_n} \quad (2)$$

where $D+1$ is the number of input variables, $n$ is the order of the polynomial, and $g_{k_1,\cdots,k_n}$ are the parameters of CMBP. Order 3 gives a good trade-off between complexity and performance, hence order 3 is used in our previous work (Sun et al., 2014a; Yu et al., 2015) and in this paper.
The constraints in equation (1) encode all necessary probabilistic conditions as well as prior knowledge or intuition. For example, the rule "goal belief should be unchanged or positively correlated with the positive scores from SLU" can be represented by the constraint $\partial \mathcal{P} / \partial P_t^+ \ge 0$. The definition of CMBP formulates a search space of rule-based models, in which it is easy to employ a data-driven criterion to find a rule-based model with good performance. Since CMBP is originally motivated by Bayesian probability operations, which leads to the natural use of integer polynomial coefficients ($g \in \mathbb{Z}$), the data-driven optimization can be formulated as an integer programming problem. Additionally, CMBP can also be viewed as a statistical approach; hence, the polynomial coefficients can be extended to real numbers. The optimization of real coefficients can be done by first obtaining an integer-coefficient CMBP and then performing hill-climbing search (Yu et al., 2015).
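As an illustration, an order-n CMBP of the kind described above can be evaluated by enumerating the non-decreasing index tuples of its monomials. This is a minimal sketch: the sparse coefficient dictionary and the function name are illustrative assumptions, not the paper's implementation.

```python
from itertools import combinations_with_replacement
from math import prod

def cmbp_eval(features, g, order=3):
    """Evaluate a CMBP polynomial: the sum over non-decreasing index
    tuples (k1 <= ... <= kn) of g[(k1,...,kn)] * x_k1 * ... * x_kn.
    `features` should end with the constant 1 so that lower-order
    monomials are covered; `g` is a sparse coefficient dict."""
    total = 0.0
    for ks in combinations_with_replacement(range(len(features)), order):
        coeff = g.get(ks, 0.0)
        if coeff:
            total += coeff * prod(features[k] for k in ks)
    return total

# Toy order-3 example with x2 as the constant 1:
# b_t = b_{t-1} + P_t^+, encoded as g_{0,2,2} = g_{1,2,2} = 1.
g = {(0, 2, 2): 1.0, (1, 2, 2): 1.0}
b_t = cmbp_eval([0.3, 0.5, 1.0], g)   # -> 0.8
```

Pairing each feature with the constant covers the lower-order monomials, which is why the constant 1 appears in the CMBP feature list above.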

Recurrent Polynomial Network
Recurrent polynomial network, which is proposed in this paper, takes the other way to bridge rule-based and statistical approaches. The basic idea of RPN is to enable a kind of statistical model to take advantage of prior knowledge or intuition by using the parameters of rule-based models to initialize the parameters of statistical models.
Like common neural networks, RPN is a statistical approach, so it is as easy to add features and try complex structures in RPN as in neural networks. However, compared with common neural networks, which are "black boxes", an RPN can essentially be seen as a polynomial function. Hence, considering that a CMBP is also a polynomial function, the prior knowledge and intuition encoded in a CMBP can be transferred to an RPN by using the parameters of the CMBP to initialize the RPN carefully. In this way, RPN bridges rule-based models and statistical models.
A recurrent polynomial network is a computational network. The network contains multiple edges and loops. Each node is either an input node, which is used to represent an input value, or a computation node. Each node x is assigned an initial value $u_x^{(0)}$ at time 0, and its value is updated at times 1, 2, ···. Both the type of an edge and the type of a node decide how node values are updated. There are two types of edges. One type, referred to as type-1, indicates that the value update at time t takes the value of a node at time t − 1, i.e. type-1 edges are recurrent edges, while the other type, referred to as type-2, indicates that the value update at time t takes another node's value at time t. For simplicity, let $I_x$ be the set of nodes y which are linked to node x by a type-1 edge, and $\hat I_x$ be the set of nodes y which are linked to node x by a type-2 edge. Based on these definitions, two types of computation nodes, sum and product, are introduced. Specifically, at time t > 0, if node x is a sum node, its value is updated by

$$u_x^{(t)} = \sum_{y \in I_x} w_{x,y}\, u_y^{(t-1)} + \sum_{y \in \hat I_x} \hat w_{x,y}\, u_y^{(t)}$$

where $w, \hat w \in \mathbb{R}$ are the weights of the edges. Similarly, if node x is a product node, its value is updated by

$$u_x^{(t)} = \prod_{y \in I_x} \big(u_y^{(t-1)}\big)^{M_{x,y}} \prod_{y \in \hat I_x} \big(u_y^{(t)}\big)^{\hat M_{x,y}}$$

where $M_{x,y}$ and $\hat M_{x,y}$ are integers, denoting the multiplicity of the type-1 edge $\overrightarrow{yx}$ and the multiplicity of the type-2 edge $\overrightarrow{yx}$ respectively. Note that only $w$, $\hat w$ are parameters of the RPN, while $M_{x,y}$, $\hat M_{x,y}$ are constant given the structure of the RPN.
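The two update rules can be sketched in Python; the dictionary-based adjacency and function names below are illustrative assumptions, not the paper's implementation.

```python
def sum_node(prev_vals, cur_vals, w, w_hat):
    """Sum node update: weighted sum of type-1 neighbours' values at
    time t-1 (prev_vals, weights w) and type-2 neighbours' values at
    time t (cur_vals, weights w_hat). w and w_hat are trainable."""
    return (sum(w[y] * prev_vals[y] for y in w) +
            sum(w_hat[y] * cur_vals[y] for y in w_hat))

def product_node(prev_vals, cur_vals, M, M_hat):
    """Product node update: product of neighbour values raised to the
    fixed, structural edge multiplicities M (type-1) and M_hat (type-2)."""
    out = 1.0
    for y, m in M.items():
        out *= prev_vals[y] ** m
    for y, m in M_hat.items():
        out *= cur_vals[y] ** m
    return out

# A product node with type-2 edges of multiplicity 2 and 1 computes
# the monomial (u_0)^2 * u_1:
val = product_node({}, {0: 0.5, 1: 2.0}, {}, {0: 2, 1: 1})   # -> 0.5
```

Note that only the sum-node weights are learned; the product-node multiplicities are fixed by the network structure, which is what keeps every node's value a polynomial of the inputs.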
Let $u^{(t)}$, $\hat u^{(t)}$ denote the vector of computation nodes' values and the vector of input nodes' values at time t respectively; then in a well-defined RPN, each computation node's value is a polynomial function

$$u_x^{(t)} = \mathcal{P}_x\big(u^{(t-1)} \oplus \hat u^{(t)}\big)$$

where ⊕ denotes vector concatenation and each $\mathcal{P}_x$ has the form of equation (2). Each computation node can be regarded as an output node. For example, for the RPN in figure 2, node c and node d can be set as output nodes, and their corresponding polynomial functions can be read off the network structure.

RPN for Dialogue State Tracking
As introduced in section 1, in this paper the dialogue state tracker receives SLU N-best hypotheses for each user turn, each hypothesis having a set of act-slot-value tuples with a confidence score. The dialogue state tracker is supposed to output a set of distributions over the joint user goal, that is, the value for each slot. For simplicity and consistency with the work of Sun et al. (2014a) and Yu et al. (2015), slot and value independence are assumed in the RPN model for dialogue state tracking, though neither CMBP nor RPN is limited to these assumptions. Besides, in the rest of the paper, $b_t(v)$, $P_t^+(v)$, $P_t^-(v)$, $\tilde P_t^+(v)$, $\tilde P_t^-(v)$ are abbreviated as $b_t$, $P_t^+$, $P_t^-$, $\tilde P_t^+$, $\tilde P_t^-$ respectively where there is no ambiguity.

Structure
Before describing details of the structure used in real situations, to help understand the correspondence between RPN and CMBP, let us first look at a simplified case with a smaller feature set and a smaller order: the correspondence between the RPN shown in figure 3 and the order-2 polynomial (8) with features $b_{t-1}$, $P_t^+$, 1. Recall that a CMBP of polynomial order 2 with these 3 features is the following equation (refer to equation (2)):

$$b_t = \sum_{0 \le k_1 \le k_2 \le 2} g_{k_1,k_2}\, x_{k_1} x_{k_2}, \quad (x_0, x_1, x_2) = (b_{t-1}, P_t^+, 1) \quad (8)$$

The RPN in figure 3 has three layers. The first layer contains only input nodes. The second layer contains only product nodes. The third layer contains only sum nodes. Every product node in the second layer denotes a monomial of order 2 such as $(b_{t-1})^2$, $b_{t-1} P_t^+$ and so on. Every product node in the second layer is linked to the sum node in the third layer, whose value is a weighted sum of the values of the product nodes. With the weights set according to the coefficients in equation (8), the value of the sum node in the third layer is essentially the $b_t$ of equation (8).
Like the simplified case above, a layered RPN structure, shown in figure 4, is used for dialogue state tracking in our first trial; it essentially corresponds to an order-3 CMBP, though the RPN framework is not limited to the layered topology. Recall that a CMBP of polynomial order 3 is used, as shown in the following equation (refer to equation (2)):

$$b_t = \sum_{0 \le k_1 \le k_2 \le k_3 \le 5} g_{k_1,k_2,k_3}\, f_{k_1} f_{k_2} f_{k_3} \quad (10)$$

Let (l, i) denote the index of the i-th node in the l-th layer. The detailed definitions of each layer are as follows:
• First layer / Input layer: Input nodes are features at turn t, corresponding to the variables of CMBP in section 2, i.e. $u_{(1,i)}^{(t)} = f_i$. While 7 features are used in previous work on CMBP (Sun et al., 2014a; Yu et al., 2015), only 6 of them are used in RPN, with feature $b_{t-1}^r$ removed ($b_t^r$ is defined in section 2). Since our experiments showed that the performance of CMBP would not become worse without feature $b_{t-1}^r$, to make the structure more compact, $b_{t-1}^r$ is not used in this paper for RPN. Accordingly, the CMBPs mentioned in the rest of the paper do not use this feature either.
• Second layer: The value of every product node in the second layer is a monomial, as in the simplified case, and every product node has in-degree 3, corresponding to the order of the CMBP.
Every monomial in CMBP is the product of three repeatable features. Correspondingly, the value of every product node in the second layer is the product of the values of three repeatable nodes in the first layer. Every triple $0 \le k_1 \le k_2 \le k_3 \le 5$ defines one product node, and different nodes in the second layer are created by distinct triples. So given the 6 input features, there are $\binom{6+3-1}{3} = 56$ nodes in the second layer.
To simplify the notation, a bijection from second-layer nodes to monomials is defined, mapping each node to its triple $(k_1, k_2, k_3)$, where $D + 1 = 6$ is the number of nodes in the first layer, i.e. the input feature dimension.
• Third layer: The value of the sum node x = (3, 0) in the third layer corresponds to the output value of the CMBP.
Every product node in the second layer is linked to it, and node x's value is $u_x^{(t)} = \sum_{y \in \hat I_x} \hat w_{x,y}\, u_y^{(t)}$. With only sum and product operations involved, every node's value is essentially a polynomial of the input features. And just as in a recurrent neural network, a node at time t can be linked to a node at time t + 1; that is why this model is called a recurrent polynomial network.
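The size of the second layer follows from counting multisets: order-3 monomials over 6 repeatable features number $\binom{6+3-1}{3} = 56$. A quick check in Python (the function name is illustrative):

```python
from math import comb

def num_product_nodes(num_features, order=3):
    """Number of monomials of a given order over repeatable features:
    multisets of size `order` drawn from `num_features` symbols,
    i.e. C(num_features + order - 1, order)."""
    return comb(num_features + order - 1, order)

print(num_product_nodes(6))    # 56 second-layer nodes for 6 features
print(num_product_nodes(10))   # 220 nodes after adding 4 features
```

The same formula explains why the later move from 6 to 10 features only grows the second layer from 56 to 220 nodes.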
The parameters of the RPN can be set according to the CMBP coefficients $g_{k_1,k_2,k_3}$ in equation (10) so that the output value is the same as the value of the CMBP, which is a direct way of applying prior knowledge and intuition to statistical models. This is explained in detail in section 4.4.

Activation Function
In DST, the output value is a belief, which should lie in [0, 1], while the values of computation nodes in RPN are not bounded within any interval. Experiments showed that if the weights are not properly set in RPN and a belief $b_{t-1}$ output by RPN is larger than 1, then $b_t$ may grow much larger, because $b_t$ is a weighted sum of monomials such as $(b_{t-1})^3$. Beliefs of later turns such as $b_{t+10}$ may then grow so large that they can hardly be represented.
Therefore, an activation function is needed to map $b_t$ to a legal belief value (referred to as $\tilde b_t$) in (0, 1). Three kinds of functions have been considered: the logistic function, the clip function, and the softclip function. A logistic function is defined as

$$logistic(x) = \frac{L}{1 + e^{-\eta (x - x_0)}}$$

It can map $\mathbb{R}$ to (0, 1) by setting L = 1. However, even with carefully chosen η and $x_0$, such as η = 5, $x_0$ = 0.5, the gap between $\tilde b_t$ and $b_t$ can hardly be ignored even when $b_t$ is in the range (0, 1), which makes it more difficult for RPN to inherit the prior knowledge of CMBP. For example, suppose the logistic function is used and $P_t^+$, $P_t^-$, $\tilde P_t^+$, $\tilde P_t^-$ are all 0 at some turn t. If $b_{t-1}$ is in [0, 1] and the activation function is linear on [0, 1], then using the constraints given by previous work on CMBP (Sun et al., 2014a; Yu et al., 2015), with certain parameters set to constants, it is easily ensured that $b_t = b_{t-1}$. However, the constraints in CMBP would have to be changed to achieve this property if the logistic function were used.
As an alternative, a clip function is defined as

$$clip(x) = \min(\max(x, 0), 1)$$

Its derivative is 0 whenever the input lies outside [0, 1]. Thus, if $L$ is the loss function, $\partial L / \partial b_t$ would be 0 whatever $\partial L / \partial \tilde b_t$ is. This gradient vanishing phenomenon may affect the effectiveness of the backpropagation training in section 4.5.
So an activation function softclip(·) is introduced, which is a combination of the logistic function and the clip function. Let ε denote a small value such as 0.01, and δ denote the offset of the sigmoid function such that $sigmoid(\epsilon - 0.5 + \delta) = \epsilon$. Here the sigmoid function refers to the special case of the logistic function defined by the formula

$$sigmoid(x) = \frac{1}{1 + e^{-x}}$$

The softclip function is defined as

$$softclip(x) = \begin{cases} sigmoid(x - 0.5 + \delta) & x < \epsilon \\ x & \epsilon \le x \le 1 - \epsilon \\ sigmoid(x - 0.5 - \delta) & x > 1 - \epsilon \end{cases}$$

softclip : $\mathbb{R} \to (0, 1)$ is a non-decreasing, continuous function. However, it is not differentiable at $x = \epsilon$ or $x = 1 - \epsilon$, so we define its derivative piecewise, taking the derivative of the corresponding branch on each interval. The function is like a clip function; however, its derivative may be small on some inputs but is never zero. Figure 5 shows the comparison among the clip, logistic, and softclip functions.
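Under the reconstruction above (identity on [ε, 1−ε], sigmoid tails joined continuously via the offset δ), softclip and its piecewise derivative can be sketched as follows. This is an illustrative reading of the definition, not the authors' code; δ is obtained in closed form from $sigmoid(\epsilon - 0.5 + \delta) = \epsilon$.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def _delta(eps):
    # Solve sigmoid(eps - 0.5 + delta) = eps:
    # eps - 0.5 + delta = logit(eps) = log(eps / (1 - eps))
    return math.log(eps / (1.0 - eps)) - (eps - 0.5)

def softclip(x, eps=0.01):
    """Identity on [eps, 1-eps]; sigmoid tails outside, shifted by
    delta so the three pieces join continuously at eps and 1-eps."""
    d = _delta(eps)
    if x < eps:
        return sigmoid(x - 0.5 + d)
    if x > 1.0 - eps:
        return sigmoid(x - 0.5 - d)
    return x

def softclip_grad(x, eps=0.01):
    """Derivative used in backprop: 1 in the middle, a small but
    nonzero sigmoid slope on the tails (one-sided at the kinks)."""
    d = _delta(eps)
    if x < eps:
        s = sigmoid(x - 0.5 + d)
        return s * (1.0 - s)
    if x > 1.0 - eps:
        s = sigmoid(x - 0.5 - d)
        return s * (1.0 - s)
    return 1.0
```

The nonzero tail slope is the point of the construction: unlike the hard clip, an out-of-range belief still receives a gradient during backpropagation.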
With the activation function, a new type of computation node, referred to as an activation node, is introduced. An activation node takes only one input and has only one input edge, of type-2, i.e. $|\hat I_x| = 1$ and $I_x = \emptyset$. The value of an activation node x is calculated as $u_x^{(t)} = softclip\big(u_{\bar x}^{(t)}\big)$, where $\bar x$ denotes the input node of node x, i.e. $\hat I_x = \{\bar x\}$.
The activation function is used in the rest of the paper. Figure 6 gives an example of an RPN with activation functions, whose structure is constructed by adding activation nodes to the RPN in figure 4.
Figure 6: RPN for DST with activation functions

Further Exploration on Structure
Adding features to CMBP is not easy, because additional prior knowledge must be added to keep the search space manageable. Concretely, adding features introduces new monomials. Since the naive search space grows exponentially with the number of monomials, the search space tends to be too large to explore when new features are added. Hence, to reduce the search space, additional prior knowledge is needed, which introduces new constraints on the polynomial coefficients. For the same reason, increasing the model complexity is also not very convenient in CMBP.
In contrast, since RPN can be seen as a statistical model, it is as easy to add new features and use more complex structures in RPN as in most statistical approaches such as RNN. At the same time, no matter what new features are used and how complex the structure is, RPN can always take advantage of prior knowledge and intuition, as discussed in section 4.4. In this paper, both new features and a more complex structure are explored.
Adding new features can be done by simply adding input nodes corresponding to the new features, and then adding product nodes corresponding to the new possible monomials introduced by the new input nodes. In this paper, for slot s and value v at turn t, in addition to $f_0 \sim f_5$, which are defined as $b_{t-1}(v)$, $P_t^+(v)$, $P_t^-(v)$, $\tilde P_t^+(v)$, $\tilde P_t^-(v)$, and 1 respectively, 4 new features are investigated. $f_6$ and $f_7$ are features of system acts at the last turn:
• $f_6$ = canthelp(s, t, v) ∪ canthelp.missing_slot_value(s, t): 1 if the system cannot offer a venue with the constraint s = v, or the value of slot s is not known for the selected venue, otherwise 0.
• $f_7$ = select(s, t, v): 1 if the system asks the user to pick a suggested value for slot s, otherwise 0.
$f_6$ and $f_7$ are introduced because the user is likely to change their goal after the machine acts canthelp(s, v), canthelp.missing_slot_value(s, t) and select(s, v). $f_8$ and $f_9$ are features of user acts at the current turn:
• $f_8$ = inform(s, t, v): 1 if one of the SLU hypotheses is the user informing that slot s is v, otherwise 0.
• $f_9$ = deny(s, t, v): 1 if one of the SLU hypotheses is the user denying that slot s is v, otherwise 0.
$f_8$ and $f_9$ are features about SLU act type, introduced to make the system robust when the confidence scores of the SLU hypotheses are not reliable.
In this paper, the complexity of evaluating and training RPN for DST does not increase sharply, because a constant order 3 is used and the number of product nodes in the second layer grows only from 56 to 220 when the number of features grows from 6 to 10.
In addition to new features, an RPN with a more complex structure is also investigated in this paper. To capture properties of the dialogue process beyond the belief $b_t$, a new sum node x = (3, 1) in the third layer is introduced. The connections of (3, 1) are the same as those of (3, 0), so it introduces a new recurrent connection. The exact meaning of its value is unknown; however, it is the only value used to record information other than $b_t$ from previous turns, since every input feature except $b_t$ is a feature of the current turn t. Compared with $b_t$, there are fewer restrictions on the value of (3, 1), since its value is not directly supervised by the label. Hence, introducing (3, 1) may help reduce the effect of inaccurate labels.
The structure of the RPN with the 4 new features and 1 new sum node, together with the new activation nodes introduced in section 4.2, is shown in figure 7.

RPN Initialization
Like most neural network models such as RNN, the initialization of RPN can be done by setting each weight, i.e. w and ŵ, to a small random value. However, with its unique structure, the initialization can be done much better by taking advantage of the relationship between CMBP and RPN introduced in section 4.1.
When RPN is initialized according to a CMBP, prior knowledge and constraints are used to set the RPN's initial parameters to a suboptimal point in the whole parameter space. RPN as a statistical model can fully utilize the advantages of statistical approaches. Moreover, RPN improves on real-coefficient CMBP even though both use data samples to train their parameters. In the work of Yu et al. (2015), real-coefficient CMBP uses hill climbing to adjust only the parameters that are initially non-zero, and the change of each parameter is always a multiple of 0.1. RPN can adjust all parameters concurrently, including those initialized to 0, while the complexity of adjusting all parameters concurrently is nearly the same as that of adjusting one parameter in CMBP. Besides, the changes of parameters can be large or small, depending on the learning rate. Thus, RPN and CMBP both bridge rule-based and statistical models, while RPN is a statistical model utilizing the advantages of rules and CMBP is a rule-based model utilizing statistical advantages.
In fact, given a CMBP, an RPN can achieve the same performance as the CMBP simply by setting its weights according to the coefficients of the CMBP. To illustrate this, the steps of initializing the RPN in figure 7 with a CMBP of features $f_0 \sim f_5$ are described below.
First, to ensure that the newly added sum node x = (3, 1) does not influence the output $b_t$ of the RPN with the initial parameters, $\hat w_{x,y}$ is set to 0 for all y, so node x's value $u_x^{(t)}$ is always 0. Next, since the RPN in figure 7 has more features than the CMBP does, the weights related to the new features should be set to 0. Specifically, suppose node x is the sum node in the third layer denoting $b_t$ before activation, and node y is one of the product nodes in the second layer denoting a monomial. If product node y involves any of the features $f_6$, $f_7$, $f_8$, $f_9$ or the added sum node, then node y's value is not a monomial of the CMBP, and the weight $\hat w_{x,y}$ should be set to 0.
Finally, if product node y is a product of features $f_0 \sim f_5$ only, then (supposing the order of the CMBP is 3) $\hat w_{x,y}$ is set to the CMBP coefficient $g_{k_1,k_2,k_3}$ of the monomial that y denotes. For RPNs of other structures, the initialization can be done by following similar steps. Experiments show that after training, only a few weights are larger than 0.1, no matter whether CMBP or random initialization is used.
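The three initialization steps can be sketched as follows, assuming second-layer product nodes are indexed by non-decreasing triples over the 10 features, with $f_0 \sim f_5$ being the original CMBP features. The index layout and names are illustrative, not the paper's implementation.

```python
from itertools import combinations_with_replacement

def init_output_weights(g, num_features=10, cmbp_features=6, order=3):
    """Initialize the third-layer sum node's type-2 weights from CMBP
    coefficients g (a dict keyed by non-decreasing triples).
    Product nodes built only from the original CMBP features f0..f5
    get the matching coefficient; every node touching a new feature
    (f6..f9) gets weight 0, so the initial RPN output equals the
    CMBP output. (Weights to the extra sum node (3,1) are likewise
    zeroed, which this sketch omits.)"""
    w_hat = {}
    for ks in combinations_with_replacement(range(num_features), order):
        if max(ks) < cmbp_features:
            w_hat[ks] = g.get(ks, 0.0)   # transfer CMBP coefficient
        else:
            w_hat[ks] = 0.0              # new-feature monomial: off
    return w_hat

w = init_output_weights({(0, 5, 5): 1.0})
# The CMBP monomial f0 * 1 * 1 keeps its coefficient; any monomial
# involving a new feature starts switched off.
```

With this start, the first training epoch already performs at the CMBP level, and gradient descent is free to switch on the new-feature monomials afterwards.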

Training RPN
For a slot s, value v, and time t, let $l_t$ be the indicator of the goal s = v being part of the joint goal at turn t in the dialogue label.
Suppose node x is the output node, and $u_x^{(t)}$ is the output value at turn t. If the mean squared error (MSE) is used as the training criterion and there are T turns, the cost L is

$$L = \frac{1}{T} \sum_{t=1}^{T} \big(u_x^{(t)} - l_t\big)^2$$

Forward Pass For each training sample, every node's value at every time is evaluated first. When evaluating $u_x^{(t)}$, the values of the nodes in $I_x$ and $\hat I_x$ must be evaluated beforehand, with the computation formula determined by the type of node x. In particular, for a layered RPN structure, we can simply evaluate $u_{x_1}^{(t_1)}$ earlier than $u_{x_2}^{(t_2)}$ if $t_1 < t_2$, or if $t_1 = t_2$ and $x_1$'s layer number is smaller than $x_2$'s.
Backward Pass Backpropagation through time (BPTT) is used in training RPN. Let $\delta_x^{(t)}$ denote the error of node x at time t. If x is the output node, $\delta_x^{(t)}$ is set according to its label $l_t$ and output value $u_x^{(t)}$; otherwise $\delta_x^{(t)}$ is initialized to 0. After a node's error δ is determined, it can be passed to $\delta_y$ ($y \in I_x \cup \hat I_x$). Error passing follows the reversed direction of the edges, so the order of nodes passing error can follow the reverse of the order used for evaluating node values.
The training procedure is as follows. Initialize $\Delta w_{x,y} = 0$, $\Delta \hat w_{x,y} = 0$ for every x, y, and initialize the values of recurrent nodes at turn 0 to 0. For each training sample (slot s, value v), for t ← 1 to T and for each layer d ← 1 to 4, evaluate $u_x^{(t)}$ for each node x in layer d; then run the backward pass in the reverse order. After $\delta_x^{(t)}$ has been evaluated, the increment on weight $\hat w_{x,y}$ can be calculated by

$$\Delta \hat w_{x,y} \leftarrow \Delta \hat w_{x,y} - \alpha\, \delta_x^{(t)}\, u_y^{(t)}$$

where α is the learning rate. $\Delta w_{x,y}$ can be evaluated similarly, using $u_y^{(t-1)}$. Note that only $w_{x,y}$ and $\hat w_{x,y}$ are parameters.
The complete formulas for evaluating node values $u_x^{(t)}$ and passing errors $\delta_x^{(t)}$ can be found in the appendix. The pseudocode of training is shown in algorithm 1.
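As a concrete illustration of the forward/backward passes, here is a full-batch BPTT step for a degenerate one-node RPN: a single sum node $b_t = w\, b_{t-1} + \hat w\, p_t$ with one recurrent type-1 edge and one type-2 input edge, trained under the MSE criterion. This toy is a sketch under those assumptions, not the paper's full network.

```python
def bptt_step(p, labels, w, w_hat, b0=0.0, alpha=0.1):
    """One full-batch gradient step for the toy RPN b_t = w*b_{t-1}
    + w_hat*p_t with loss L = (1/T) * sum_t (b_t - l_t)^2."""
    T = len(p)
    # Forward pass: evaluate node values turn by turn.
    b = [b0]
    for t in range(T):
        b.append(w * b[t] + w_hat * p[t])
    # Backward pass (BPTT): delta_t = dL/db_t, passed backwards
    # through the recurrent edge (factor w) plus the local MSE term.
    dw = dw_hat = 0.0
    delta = 0.0
    for t in range(T, 0, -1):
        delta = delta * w + 2.0 * (b[t] - labels[t - 1]) / T
        dw += delta * b[t - 1]        # type-1 edge uses b_{t-1}
        dw_hat += delta * p[t - 1]    # type-2 edge uses p_t
    return w - alpha * dw, w_hat - alpha * dw_hat
```

When the prediction already matches the label the gradients are zero and the weights are unchanged, mirroring the fact that a CMBP-initialized RPN starts from a working tracker and is only nudged by the data.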

Experiment
As introduced in section 1, the DSTC-2 and DSTC-3 tasks are used in this paper to evaluate the proposed approach. Both tasks provide training dialogues with turn-level ASR hypotheses, SLU hypotheses and user goal labels. The DSTC-2 task provides 2118 training dialogues in the restaurant domain (Henderson et al., 2014c), while in DSTC-3, only 10 in-domain training dialogues in the tourist domain are provided, because the DSTC-3 task is to adapt a tracker trained on DSTC-2 data to the new domain with very few dialogues (Henderson et al., 2014a). Table 1 summarizes the sizes of the DSTC-2 and DSTC-3 datasets. The DST evaluation criteria are joint goal accuracy and L2 (Henderson et al., 2014c,a). Accuracy is defined as the fraction of turns in which the tracker's 1-best joint goal hypothesis is correct; the larger the better. L2 is the L2 norm between the distribution over all hypotheses output by the tracker and the correct goal distribution (a delta function); the smaller the better. Besides, schedule 2 and labelling scheme A, defined in (Henderson et al., 2013), are used in both tasks. Specifically, schedule 2 only counts the turns where new information about some slot is observed, either in a system confirmation action or in the SLU list. Labelling scheme A means the labelled state is accumulated forwards through the whole dialogue. For example, the goal for slot s is "None" until it is informed as s = v by the user; from then on, it is labelled as v until the user informs otherwise.
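The two metrics can be sketched as follows, assuming per-turn score dictionaries over joint-goal hypotheses; the data layout and function name are illustrative, and details such as schedule-2 turn selection are omitted.

```python
def joint_goal_metrics(hyp_dists, labels):
    """Joint goal accuracy and average L2 (sketch of the DSTC-2/3
    metrics). hyp_dists: per-turn dict mapping hypothesis -> score;
    labels: per-turn correct joint goal. Accuracy checks the 1-best
    hypothesis; L2 is the norm between the tracker's distribution
    and a delta function on the correct goal."""
    correct, l2s = 0, []
    for dist, label in zip(hyp_dists, labels):
        best = max(dist, key=dist.get)
        correct += (best == label)
        # squared error against the delta distribution on the label
        sq = sum((p - (h == label)) ** 2 for h, p in dist.items())
        if label not in dist:          # mass the tracker missed entirely
            sq += 1.0
        l2s.append(sq ** 0.5)
    n = len(labels)
    return correct / n, sum(l2s) / n

acc, l2 = joint_goal_metrics([{"a": 0.8, "b": 0.2}], ["a"])
```

In the one-turn example, the 1-best hypothesis "a" is correct (accuracy 1.0) and the L2 distance to the delta distribution is sqrt(0.2² + 0.2²).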
It has been shown that the organiser-provided live SLU confidence was not good enough (Zhu et al., 2014; Sun et al., 2014b). Hence, most of the state-of-the-art results from DSTC-2 and DSTC-3 used refined SLU (either explicitly rebuilding an SLU component or taking the ASR hypotheses into the trackers (Williams, 2014; Sun et al., 2014b; Henderson et al., 2014d,b; Kadlec et al., 2014; Sun et al., 2014a)). Accordingly, except for the results directly taken from other papers (shown in tables 5 and 6), all experiments in this paper used the output of a refined semantic parser (Zhu et al., 2014; Sun et al., 2014b) instead of the live SLU provided by the organizer.
For all experiments, MSE is used as the training criterion and full-batch training is used. For both the DSTC-2 and DSTC-3 tasks, dstc2trn and dstc2dev are used, with 60% of the data used for training and 40% for validation. Validation is performed every 5 epochs. The learning rate is set to 1.0 initially. During training, the learning rate is halved when the MSE starts increasing. Training is stopped when the learning rate is sufficiently small or the maximum number of training epochs is reached; here, the maximum number of training epochs is set to 250.

Investigation on RPN Configurations
This section describes the experiments comparing different configurations of RPN. All experiments were performed on both the DSTC-2 and DSTC-3 tasks.
As indicated in section 4.4, an RPN can be initialized by a CMBP. Table 2 shows the performance comparison between initialization with a CMBP and with random values. In this experiment, the structure shown in figure 6 is used. The performance of the RPN initialized by random values sampled from N(0, 0.01) is compared with the performance of the RPN initialized by the integer-coefficient CMBP. Here, the CMBP has 11 non-zero coefficients and has the best performance in DSTC-2. It can be seen from table 2 that the RPN initialized by the CMBP coefficients significantly outperforms the RPN initialized by random values. This demonstrates that the prior knowledge and intuition encoded in CMBP can be transferred to RPN to improve RPN's performance, which is one of RPN's advantages in bridging rule-based and statistical models. In the rest of the experiments, all RPNs use CMBP coefficients for initialization. Since section 4.3 shows that it is convenient to add features and try more complex structures, it is interesting to investigate RPNs with different feature sets and structures, as shown in table 3. It can be seen that while no obvious correlation between performance and the different configurations of feature sets and structures can be observed on dstc2eval, new features and new recurrent connections significantly help improve the performance of RPN on dstc3eval. Thus, in the rest of the paper, both new features and new recurrent connections are used in RPN, unless otherwise stated.


Comparison with Other DST Approaches
The previous subsection investigated how to obtain the RPN with the best configuration. In this subsection, the performance of RPN is compared to both rule-based and statistical approaches. To make a fair comparison, all statistical models, together with RPN, use a similar feature set in this subsection. Altogether, 2 rule-based trackers and 2 statistical trackers were built for performance comparison.
• MaxConf is a rule-based model commonly used in spoken dialogue systems, which always selects the value with the highest confidence score from the 1st turn to the current turn. It was used as one of the primary baselines in DSTC-2 and DSTC-3.
• HWU is a rule-based model proposed by Wang and Lemon (2013). It is regarded as a simple yet competitive baseline for DSTC-2 and DSTC-3.
• DNN is a statistical model with the same probability features as RPN. Since the DNN does not have recurrent structures while RPN does, to account for this fairly, the DNN feature set at the t-th turn additionally includes P(t), the highest confidence score from the 1st turn to the t-th turn. The DNN has 3 hidden layers with 64 nodes per layer.
• MaxEnt is a statistical tracker based on a Maximum Entropy model with the same input features as the DNN.
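The MaxConf baseline above amounts to a running maximum over SLU confidence scores. A minimal sketch, where `turn_scores` is a hypothetical list of per-turn mappings from candidate values to confidence scores:

```python
def maxconf_track(turn_scores):
    """Sketch of the MaxConf rule: at each turn, track the value
    with the highest confidence score seen in any turn so far.

    `turn_scores` is a hypothetical list of {value: confidence}
    dicts, one per turn.
    """
    best_value, best_score, states = None, float("-inf"), []
    for scores in turn_scores:
        for value, conf in scores.items():
            if conf > best_score:          # new overall maximum
                best_value, best_score = value, conf
        states.append(best_value)          # tracked value after this turn
    return states
```
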
It can be observed that, with a similar feature set, RPN outperforms both rule-based and statistical approaches in terms of joint goal accuracy. Statistical significance tests were also performed assuming a binomial distribution for each turn; RPN was shown to significantly outperform both rule-based and statistical approaches at the 95% confidence level. For L2, RPN is competitive with both the rule-based and the statistical approaches.
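A test of the kind described above, treating each turn as an independent Bernoulli trial, can be sketched with a normal approximation to the binomial. This is an illustrative two-proportion z-test under that assumption, not necessarily the exact procedure used in the paper:

```python
from math import sqrt, erf

def binomial_significance(acc_a, acc_b, n, alpha=0.05):
    """Sketch: compare two trackers' joint-goal accuracies over the
    same n turns, each turn modeled as a Bernoulli trial, using a
    pooled two-proportion z-test (normal approximation)."""
    p = (acc_a + acc_b) / 2.0              # pooled accuracy
    se = sqrt(2.0 * p * (1.0 - p) / n)     # std. error of the difference
    z = (acc_a - acc_b) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))
    return p_value < alpha                 # True: difference significant
```
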

Comparison with State-of-the-art DSTC Trackers
In the DSTCs, the state-of-the-art trackers mostly employed statistical approaches, usually with richer feature sets and more complicated model structures than the statistical models in section 5.2. In this section, the proposed RPN approach is compared to the best submitted trackers in DSTC-2/3 and the best CMBP trackers, regardless of fairness of feature selection and the SLU refinement approach. The results are shown in table 5 and table 6. Note that the structure shown in figure 7, with the richer feature set and a new recurrent connection, is used here. In DSTC-2, Williams (2014)'s system employed batch ASR hypothesis information (i.e. off-line ASR re-decoded results) and cannot be used as a normal on-line model in practice. Hence, the practically best tracker is Henderson et al. (2014d). It can be observed from table 5 that RPN ranks only second to the best practical tracker in accuracy and L2. Considering that RPN only uses probabilistic features and very limited added features, and can operate very efficiently, it is quite competitive. It can be seen from table 6 that RPN trained on DSTC-2 can achieve state-of-the-art performance on DSTC-3 without modifying the tracking method, outperforming all the submitted trackers in DSTC-3 including the RNN system. This demonstrates that RPN successfully inherits the good generalization ability of rule-based models. Considering that the feature set and structure of RPN in this paper are relatively simple, future work will investigate richer features and more complex structures.

Conclusion
This paper proposes a novel framework, referred to as recurrent polynomial network, to bridge rule-based and statistical approaches. With the ability to incorporate prior knowledge into a statistical framework, RPN has the advantages of both rule-based and statistical approaches. Experiments on two DSTC tasks showed that the proposed approach not only is more stable than many major statistical approaches, but also has competitive performance, outperforming many state-of-the-art trackers.

Figure 1 :
Figure 1: Diagram of a spoken dialogue system (SDS)

Figure 2 :
Figure 2: A simple example of RPN. The types of nodes a, b, c, d are input, input, product, and sum, respectively. The edge from d to d is of type-1, while the other edges are of type-2. M_{a,c} = 2, M_{b,c} = M_{c,d} = M_{d,d} = 1.
is a weighted sum of the values of product nodes u^{(t)}_{2,i}, where the weights correspond to g_{k1,k2,k3} in equation (10).

Figure 5 :
Figure 5: Comparison among clip function, logistic function, and softclip function

Figure 7 :
Figure 7: RPN with new features and more complex structure for DST

In this paper, full batch is used in training RPN for DST. In each training epoch, Δw_{xy} and Δŵ_{xy} are calculated for every training sample and added together. The weights w_{xy} and ŵ_{xy} are updated by

w_{xy} = w_{xy} − Δw_{xy} (23)
ŵ_{xy} = ŵ_{xy} − Δŵ_{xy}
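The full-batch update in equation (23) accumulates per-sample gradient contributions over the whole training set and applies them once per epoch. A minimal sketch, where `grads` and `grads_hat` are hypothetical lists of per-sample gradient arrays for the two weight sets:

```python
import numpy as np

def full_batch_update(w, w_hat, grads, grads_hat):
    """Sketch of the full-batch update of equation (23): per-sample
    deltas are summed over all training samples and subtracted from
    the weights once per epoch."""
    delta_w = np.sum(grads, axis=0)          # add contributions together
    delta_w_hat = np.sum(grads_hat, axis=0)
    return w - delta_w, w_hat - delta_w_hat  # apply the epoch update
```
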

Table 2 :
Performance comparison between the RPN initialized by random values and the RPN initialized by the CMBP coefficients on dstc2eval and dstc3eval.

Table 3 :
Performance comparison among RPNs with different configurations on dstc2eval and dstc3eval.

Table 4 :
Performance comparison among RPN, rule-based and statistical approaches with similar feature sets on dstc2eval and dstc3eval. The performance of CMBP in the table is that of the RPN which has been initialized but not trained.

Table 5 :
Performance comparison among RPN, real-coefficient CMBP and the best trackers of DSTC-2 on dstc2eval. Baseline* is the best result from the 4 baselines in DSTC-2.

Table 6 :
Performance comparison among RPN, real-coefficient CMBP and the best trackers of DSTC-3 on dstc3eval. Baseline* is the best result from the 4 baselines in DSTC-3.