Universal transformer Hawkes process with adaptive recursive iteration
Introduction
In this era of informatization, natural phenomena and human activities are recorded as large amounts of asynchronous sequential data, for instance, records of earthquakes and aftershocks (Ogata, 1998), transaction histories of financial markets (Bacry et al., 2015), electronic health records (Wang et al., 2018), equipment failure histories (Zhang et al., 2020a) and user behavior in social networks such as Weibo, Twitter and Facebook (Zhou et al., 2013, Zhao et al., 2015). All of these are asynchronous event sequence data.
These asynchronous sequences contain valuable knowledge and information, and researchers use a variety of methods to mine them from the data. Among these methods, the point process is one of the most widely used in this field, and among point process models, the Hawkes process (Hawkes, 1971) is one of the most common. Its self-exciting mechanism captures the interaction between events to some extent, so applications of the Hawkes process have achieved notable results in sequence analysis.
For example, Zhou et al. present an alternating direction method of multipliers (ADMM)-based algorithm for learning the Hawkes process to discover the hidden network of social influence (Zhou et al., 2013), and Xu et al. use a non-parametric Hawkes process to reveal the Granger causality among users' activities in watching internet protocol television (IPTV) (Xu et al., 2016). Hansen et al. demonstrate the great expressive potential of the Hawkes process in neuroscience with the least absolute shrinkage and selection operator (LASSO) method (Hansen et al., 2015). Reynaud-Bouret and Schbath provide a new way to detect favored or avoided distances between genomic events along deoxyribonucleic acid (DNA) sequences (Reynaud-Bouret and Schbath, 2010). Zhang et al. (2020a) modify the traditional Hawkes process by introducing a time-dependent background intensity, and use it to analyze the background probability of failure and the relationships between failures in compressor stations.
However, the traditional Hawkes process only considers a positive superposition of the effects of historical events, which severely constrains the fitting ability of this type of model. Meanwhile, the lack of nonlinear operations in the traditional Hawkes process sets an upper limit on its expressive ability. Thus, in recent years, owing to the strong fitting ability of neural networks and especially the sequence modeling ability of RNNs, research in this field has shifted to neural point process models. For instance, Du et al. (2016) embed the sequence information (time-stamps and event types) into an RNN and propose recurrent marked temporal point processes (RMTPP), which model the conditional intensity function through a nonlinear dependency on the history. Similarly, Mei and Eisner (2017) propose a continuous-time LSTM (Long Short-Term Memory) to model the conditional intensity of the point process, called the neural Hawkes process, in which the influence of previous events decays continuously with time. Xiao et al. (2017) use two RNNs to model the conditional intensity function: one processes time-stamp information, and the other processes historical event information.
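To make the RMTPP idea concrete, the intensity between events is driven by the RNN hidden state of the last event; the sketch below uses the exponential form λ*(t) = exp(vᵀh_j + w(t − t_j) + b), with all parameter values illustrative rather than learned:

```python
import math

def rmtpp_intensity(t, t_j, h_j, v, w, b):
    """RMTPP-style conditional intensity: the RNN hidden state h_j
    summarizes the history up to the j-th event, and the intensity
    between events follows lambda(t) = exp(v . h_j + w * (t - t_j) + b)."""
    return math.exp(sum(vi * hi for vi, hi in zip(v, h_j)) + w * (t - t_j) + b)

# Illustrative hidden state and parameters (not from any trained model).
h = [0.2, -0.1, 0.4]
v, w, b = [0.5, 1.0, -0.3], 0.1, -1.0
print(rmtpp_intensity(t=2.0, t_j=1.5, h_j=h, v=v, w=w, b=b))
```

With a positive time weight w, the intensity grows as time elapses since the last event; a negative w would make it decay, mimicking a fading influence.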
Inevitably, these RNN-based models also inherit the drawbacks of RNNs. For instance, a patient may take a long time to develop symptoms of certain sequelae with obvious long-term characteristics, such as diabetes, cancer and other chronic diseases, yet RNN-based models struggle to reveal long-term dependencies between distant events in a sequence (Bengio et al., 1994). An ideal point process model should be able to handle such cases. Moreover, training RNN-based models often suffers from vanishing and exploding gradients (Pascanu et al., 2013), which degrade the performance of the model.
It is worth noting that in traditional sequence learning problems, such as machine translation (Raganato et al., 2018) and speech recognition (Dong et al., 2018), the transformer model (Vaswani et al., 2017), based on the self-attention mechanism (Bahdanau et al., 2015), achieves distinct performance improvements without CNNs or RNNs; meanwhile, the recurrence-free transformer structure gives the model higher computational efficiency. These achievements offer new insight into the development of sequential data learning. On account of this, Zhang et al. (2020b) present the self-attention Hawkes process, and Zuo et al. (2020) further propose the transformer Hawkes process based on the attention mechanism and the encoder structure of the transformer. This model uses a pure transformer structure without RNNs or CNNs and achieves state-of-the-art performance, but there is still much room for improvement: the transformer simply stacks encoder layers to learn from sequence data and forgoes the recursive inductive bias of RNNs, which may be more important than commonly believed.
Dehghani et al. (2019) point out that re-introducing recurrent computation into the transformer can improve its performance; the resulting model, the universal transformer, combines the advantages of the transformer and the RNN by organically integrating the self-attention and recurrence mechanisms. In the recurrence process, Dehghani et al. use the adaptive computation time (ACT) mechanism (Graves, 2016) to decide when the recurrence halts. The experimental results of the universal transformer demonstrate the effectiveness of combining self-attention with recurrence.
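The ACT halting loop can be sketched as follows. The refinement step here is a toy tanh transition standing in for an encoder block, and all weights are random illustrative values; the essential idea is that a learned halting unit accumulates probability mass until it exceeds 1 − ε, and the output is the halting-weighted mixture of intermediate states:

```python
import numpy as np

rng = np.random.default_rng(0)
W_h = rng.normal(size=(8, 8)) * 0.1          # toy transition (stands in for an encoder block)
w_p, b_p = rng.normal(size=8) * 0.1, 0.0     # halting unit parameters

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def act_refine(h, max_steps=10, eps=0.01):
    """Adaptive computation time (Graves, 2016): keep refining the state h
    and accumulate halting probabilities until they exceed 1 - eps.
    Returns the halting-weighted mixture of intermediate states and the
    number of refinement steps actually used."""
    cum_p, out = 0.0, np.zeros_like(h)
    for step in range(max_steps):
        h = np.tanh(W_h @ h)                  # one refinement iteration
        p = float(sigmoid(w_p @ h + b_p))     # halting probability of this step
        if cum_p + p > 1.0 - eps or step == max_steps - 1:
            out += (1.0 - cum_p) * h          # remainder weight goes to the last step
            break
        out += p * h
        cum_p += p
    return out, step + 1

h0 = rng.normal(size=8)
refined, n_steps = act_refine(h0)
print(n_steps)  # number of iterations chosen adaptively for this input
```

Because the loop can terminate early, easy inputs spend fewer refinement iterations than hard ones, which is exactly the adaptive behavior the ACT mechanism is meant to provide.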
Building on the achievements of the universal transformer, we work out a new framework for the transformer Hawkes process based on its ideas, which we name the universal transformer Hawkes process (UTHP). We introduce a recurrent structure into the transformer Hawkes process, making our model Turing-complete in contrast to the previous transformer model. Moreover, we add a convolution module to the position-wise feed-forward layer, which enhances the local perception ability of the universal transformer Hawkes process. We conduct experiments on multiple datasets against state-of-the-art baselines to validate the effectiveness of our model. We also examine whether additional RNN layers have a positive impact on fitting the mutual interdependence among events in a sequence. In addition, to demonstrate the effectiveness of the ACT mechanism, we compare the performance of the universal transformer with and without ACT and verify that the dynamic halting mechanism makes the model perform better overall.
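One way to realize such a convolution-augmented position-wise feed-forward layer is sketched below in PyTorch. The layer sizes, kernel width and the placement of the convolution before the feed-forward sublayer are our illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ConvFeedForward(nn.Module):
    """Sketch of a position-wise feed-forward layer augmented with a 1-D
    convolution over the sequence dimension, giving each position a view
    of its local neighborhood before the position-wise transform."""
    def __init__(self, d_model=16, d_ff=32, kernel_size=3):
        super().__init__()
        # Same-padding convolution mixes information from neighboring events.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2)
        # Standard position-wise feed-forward sublayer.
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))

    def forward(self, x):  # x: (batch, seq_len, d_model)
        # Conv1d expects (batch, channels, seq_len), so transpose around it.
        y = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.ff(torch.relu(y))

x = torch.randn(2, 5, 16)
print(ConvFeedForward()(x).shape)  # shape is preserved: (2, 5, 16)
```

The design choice here is that the convolution operates only along the sequence axis, so the layer stays position-wise in spirit while gaining the local receptive field that plain stacked linear layers lack.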
Our paper is organized as follows. Section 2 introduces related work on the Hawkes process and neural-network-based point processes. Section 3 describes our model in detail, including its structure, conditional intensity function, prediction tasks and training process. Section 4 presents experimental results that illustrate the advantages of the universal transformer Hawkes process and the ACT mechanism. Finally, Section 5 concludes the article.
Hawkes process
The Hawkes process has a conditional intensity of the form λ(t) = μ(t) + Σ_{i: t_i < t} φ(t − t_i), where μ(t) is the background intensity function, which indicates the background probability of event occurrence, φ(·) is the impact function, which measures the influence of a historical event, and the summation records the impact of all historical events on the current instant. The traditional Hawkes process model in Eq. (1) assumes a positive superposition of past historical impacts. Until now, there have been many variants of the Hawkes process.
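With a constant background rate and an exponential impact function (a common parametric choice, used here only for illustration), the intensity can be evaluated as:

```python
import math

def hawkes_intensity(t, history, mu=0.5, alpha=0.8, beta=1.0):
    """Conditional intensity lambda(t) of a univariate Hawkes process with
    constant background rate mu and exponential impact function
    phi(s) = alpha * exp(-beta * s)."""
    return mu + sum(alpha * math.exp(-beta * (t - ti))
                    for ti in history if ti < t)

# Each past event raises the intensity; the boost decays over time.
events = [1.0, 2.5, 4.0]
print(hawkes_intensity(4.1, events))  # shortly after an event: elevated
print(hawkes_intensity(9.0, events))  # long after: back near mu
```

This makes the self-exciting behavior visible: the intensity jumps right after each event and relaxes toward the background rate μ as the exponential kernels decay.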
Proposed model
For sequences of asynchronous events, we first need to determine how to model them. The symbols used in the paper are shown in Table 1, and in general, we formulate the problem as follows: assume there are S sequences in the dataset of asynchronous events, represented as {s_1, s_2, …, s_S}; note that their lengths are not the same. The n-th sequence s_n has length L_n, and each sequence is composed of tuples (t_i, k_i), where t_i is the time-stamp of the i-th event and k_i is the corresponding event type.
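This data layout can be sketched as follows; the time-stamps and event types are made up for illustration:

```python
# Asynchronous event sequences: each sequence is a list of
# (time-stamp, event-type) tuples, and lengths differ across sequences.
dataset = [
    [(0.7, 2), (1.4, 0), (3.1, 1)],           # sequence 1, length 3
    [(0.2, 1), (0.9, 1), (2.5, 0), (4.8, 2)]  # sequence 2, length 4
]

for n, seq in enumerate(dataset, start=1):
    times = [t for t, _ in seq]
    # Within a sequence, time-stamps must be non-decreasing.
    assert all(a <= b for a, b in zip(times, times[1:]))
    print(f"sequence {n}: length {len(seq)}, span {times[-1] - times[0]:.1f}")
```

Unlike the fixed-step sequences that RNNs classically assume, the inter-event gaps here carry information themselves, which is why the time-stamps are modeled explicitly.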
Experiments
We compare our model with three baselines on six event sequence datasets. We evaluate these models by per-event log-likelihood (in nats), root mean square error (RMSE) and event prediction accuracy on held-out test sets. We first introduce the details of the datasets and baselines, and then list our experimental results.
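These three metrics can be computed as in the minimal sketch below; the predictions and targets are stand-ins, not model outputs:

```python
import math

def per_event_loglik(loglik_total, n_events):
    """Per-event log-likelihood in nats: total sequence log-likelihood
    divided by the number of events."""
    return loglik_total / n_events

def rmse(pred_times, true_times):
    """Root mean square error of predicted next-event times."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred_times, true_times))
                     / len(true_times))

def accuracy(pred_types, true_types):
    """Fraction of correctly predicted event types."""
    return sum(p == t for p, t in zip(pred_types, true_types)) / len(true_types)

print(rmse([1.0, 2.0], [1.0, 4.0]))    # sqrt((0 + 4) / 2) = sqrt(2)
print(accuracy([0, 1, 1], [0, 1, 2]))  # 2 of 3 types correct
```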
Conclusions and future works
In this paper, we propose UTHP, a new neural point process model for analyzing asynchronous event sequences. UTHP combines the self-attention mechanism of the transformer with the recurrence mechanism of the RNN, allowing our model to organically integrate the advantages of both. Moreover, to let UTHP adaptively determine when to stop refining the hidden variables, we introduce the ACT mechanism into UTHP. Experimental results verify that our model outperforms the state-of-the-art baselines.
CRediT authorship contribution statement
Lu-ning Zhang: Conceptualization, Methodology, Software, Formal analysis, Investigation, Writing – original draft, Visualization, Data curation. Jian-wei Liu: Conceptualization, Methodology, Validation, Resources, Formal analysis, Writing – review & editing, Supervision, Funding acquisition, Project administration. Zhi-yan Song: Software. Xin Zuo: Supervision, Project administration.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the Science Foundation of China University of Petroleum, Beijing (No. 2462020YXZZ023). Thanks to Hong-yuan Mei and Si-miao Zuo for their generous help in our research; their help greatly improved it.
Lu-ning Zhang, born in 1995. He is currently working toward the Ph.D. degree in control theory and control engineering with the Department of Automation, College of Information Science and Engineering, China University of Petroleum, Beijing Campus (CUP). His research interests include deep learning and pattern recognition.
Postal address: 260 mailbox China University of Petroleum, Changping District Beijing, 102249, China
References (37)
- et al., Improving social harm indices with a modulated Hawkes process. Int. J. Forecast. (2018)
- et al., Survival analysis of failures based on Hawkes process with Weibull base intensity. Eng. Appl. Artif. Intell. (2020)
- et al., Layer normalization (2016)
- et al., Hawkes processes in finance. Market Microstruct. Liquidity (2015)
- Bahdanau, Dzmitry, Cho, Kyunghyun, Bengio, Yoshua, 2015. Neural machine translation by jointly learning to align and...
- et al., Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. (1994)
- Natural language processing
- et al., An Introduction to the Theory of Point Processes: Volume II: General Theory and Structure (2007)
- Dehghani, Mostafa, Gouws, Stephan, Vinyals, Oriol, Uszkoreit, Jakob, Kaiser, Lukasz, 2019. Universal transformers. In:...
- et al., Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition
- Adaptive computation time for recurrent neural networks
- Lasso and probabilistic inequalities for multivariate point processes. Bernoulli
- Spectra of some self-exciting and mutually exciting point processes. Biometrika
- Hawkes processes and their applications to finance: A review. Quant. Finance
- MIMIC-III, a freely accessible critical care database. Sci. Data
Jian-Wei Liu, born in 1966. He received the Ph.D. degree in control theory and control engineering from DongHua University in 2006. He is now an associate professor with the Department of Automation, College of Information Science and Engineering, China University of Petroleum, Beijing Campus (CUP). His research interests include pattern recognition and intelligent systems, machine learning, and the analysis, prediction and control of complex nonlinear systems. In these areas he has published over 200 papers in international journals and conference proceedings.
Postal address: 260 mailbox China University of Petroleum, Changping District Beijing, 102249, China
Zhi-yan Song, born in 1997. She is currently working toward the master's degree in control science and engineering with the Department of Automation, College of Information Science and Engineering, China University of Petroleum, Beijing Campus (CUP). Her research interests include deep learning and pattern recognition.
Postal address: 260 mailbox China University of Petroleum, Changping District Beijing, 102249, China
Xin Zuo, born in 1964. He is now a professor with the Department of Automation, College of Information Science and Engineering, China University of Petroleum, Beijing Campus (CUP). His research interests include intelligent control, analysis and design of safety instrumented systems, and advanced process control.
Postal address: 260 mailbox China University of Petroleum, Changping District Beijing, 102249, China