Universal transformer Hawkes process with adaptive recursive iteration

https://doi.org/10.1016/j.engappai.2021.104416

Abstract

Asynchronous event sequences are ubiquitous in the natural world and in human activities, for example earthquake records and users' activities on social media. How to distill information from these seemingly disorganized data is a persistent research topic. One of the most useful tools is the point process model, on the basis of which researchers have obtained many notable results. In recent years, point process models built on neural networks, especially recurrent neural networks (RNNs), have been proposed, and compared with traditional models their performance is greatly improved. Inspired by the transformer model, which can learn sequence data efficiently without recurrent or convolutional structures, the transformer Hawkes process was proposed and achieves state-of-the-art performance. However, recent research shows that re-introducing recursive computation into the transformer can further improve its performance. We therefore propose a new transformer Hawkes process model, the universal transformer Hawkes process (UTHP), which contains both a recursive mechanism and a self-attention mechanism; to improve the local perception ability of the model, we also introduce a convolutional neural network (CNN) into the position-wise feed-forward part. We conduct experiments on several datasets to validate the effectiveness of UTHP and to explore the changes brought about by the recursive mechanism. These experiments on multiple datasets demonstrate that our proposed model improves on previous state-of-the-art models.

Introduction

In this era of informatization, nature and human activities are recorded as large amounts of asynchronous sequential data, for instance, records of earthquakes and aftershocks (Ogata, 1998), transaction histories of financial markets (Bacry et al., 2015), electronic health records (Wang et al., 2018), equipment failure histories (Zhang et al., 2020a), and user behavior in social networks (Zhou et al., 2013, Zhao et al., 2015) such as Weibo, Twitter, and Facebook. All of these are asynchronous event sequence data.

These asynchronous sequences contain valuable knowledge and information, and researchers employ a variety of methods to mine them from the data. Among these methods, the point process is one of the most widely used, and among point process models, the Hawkes process (Hawkes, 1971) is one of the most common. The self-exciting mechanism of the Hawkes process captures the interaction between events to some extent, so applying it to sequence analysis has achieved solid results.

For example, Zhou et al. (2013) present an alternating direction method of multipliers (ADMM)-based algorithm for learning Hawkes processes to discover hidden networks of social influence, and Xu et al. (2016) use a non-parametric Hawkes process to reveal Granger causality in users' activity of watching internet protocol television (IPTV). Hansen et al. (2015) demonstrate the expressive potential of the Hawkes process in neuroscience using the least absolute shrinkage and selection operator (LASSO) method. Reynaud-Bouret and Schbath (2010) provide a new way to detect favored or avoided distances between genomic events along deoxyribonucleic acid (DNA) sequences. Zhang et al. (2020a) modify the traditional Hawkes process, introducing a time-dependent background intensity to analyze the background probability of failure and the relationships between failures in compressor stations.

However, the traditional Hawkes process only considers the positive superposition of the effects of historical events, which severely constrains the fitting ability of this type of model, and its lack of nonlinear operations also places an upper limit on its expressive power. Thus, in recent years, owing to the strong fitting ability of neural networks and especially the sequence-modeling ability of RNNs, research in this field has shifted to neural point process models. For instance, Du et al. (2016) embed sequence information (time-stamps and event types) into an RNN and propose the recurrent marked temporal point process (RMTPP), which models the conditional intensity function through a nonlinear dependency on the history. Similarly, Mei and Eisner (2017) propose a continuous-time LSTM (long short-term memory) to model the conditional intensity of a point process, known as the neural Hawkes process, in which the influence of previous events decays continuously with time. Xiao et al. (2017) use two RNNs to model the conditional intensity function, one processing time-stamp information and the other processing historical event information.
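To make the RMTPP idea above concrete, the following is a schematic PyTorch sketch, not the authors' exact architecture: events are embedded from their (time, type) pairs, an RNN summarizes the history, and an intensity value is read from the hidden state. All dimensions, the GRU cell, and the softplus intensity head are our own illustrative assumptions (RMTPP itself uses a specific exponential intensity form).

```python
import torch
import torch.nn as nn

class RMTPPSketch(nn.Module):
    """Schematic of the RMTPP idea: embed (time, type), run an RNN,
    and read a conditional intensity from the hidden state.
    Dimensions and the intensity head are illustrative assumptions."""
    def __init__(self, n_types=5, d_emb=16, d_hidden=32):
        super().__init__()
        self.type_emb = nn.Embedding(n_types, d_emb)
        self.rnn = nn.GRU(d_emb + 1, d_hidden, batch_first=True)
        self.head = nn.Linear(d_hidden, 1)

    def forward(self, times, types):
        # times: (B, L) inter-event gaps; types: (B, L) event-type indices.
        x = torch.cat([self.type_emb(types), times.unsqueeze(-1)], dim=-1)
        h, _ = self.rnn(x)
        # softplus keeps the intensity positive.
        return nn.functional.softplus(self.head(h)).squeeze(-1)

model = RMTPPSketch()
lam = model(torch.rand(2, 7), torch.randint(0, 5, (2, 7)))  # (2, 7) intensities
```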

Inevitably, these RNN-based models also inherit the drawbacks of RNNs. Some phenomena have obvious long-term characteristics: for instance, a patient may take a long time to develop symptoms of sequelae such as diabetes, cancer, and other chronic diseases, yet RNN-based models struggle to reveal long-term dependencies between distant events in a sequence (Bengio et al., 1994). An ideal point process model should be able to handle such cases. Moreover, the training of RNN-based models often suffers from problems such as vanishing and exploding gradients (Pascanu et al., 2013), which in turn degrade model performance.

It is worth noting that in traditional sequence learning problems, such as machine translation (Raganato et al., 2018) and speech recognition (Dong et al., 2018), the transformer model (Vaswani et al., 2017), based on the self-attention mechanism (Bahdanau et al., 2015), achieves distinct performance improvements without CNNs or RNNs; meanwhile, its recurrence-free structure yields higher computational efficiency. These achievements offer new insight into the development of sequential data learning. On this basis, Zhang et al. (2020b) present the self-attention Hawkes process, and Zuo et al. (2020) further propose the transformer Hawkes process, built on the attention mechanism and encoder structure of the transformer. This model uses a pure transformer structure without RNNs or CNNs and achieves state-of-the-art performance, but there is still much room for improvement: the transformer simply stacks encoder layers to learn sequence data and forgoes the recursive inductive bias of RNNs, which may be more important than commonly believed.

Dehghani et al. (2019) point out that re-introducing recurrent computation into the transformer can promote its performance; the resulting model, the universal transformer, combines the advantages of the transformer and RNNs by organically integrating the self-attention and recurrence mechanisms. In the recurrence process, Dehghani et al. use the adaptive computation time (ACT) mechanism (Graves, 2016) to decide when the recurrence halts. The experimental results of the universal transformer demonstrate the effectiveness of combining self-attention with recurrence.

Based on the achievements of the universal transformer, we work out a new transformer Hawkes process framework, which we name the universal transformer Hawkes process. We introduce a recurrent structure into the transformer Hawkes process, which makes our model Turing-complete, unlike the previous transformer model. Moreover, we add a convolution module to the position-wise feed-forward layer, which enhances the local perception ability of the universal transformer Hawkes process. We conduct experiments on multiple datasets against state-of-the-art baselines to validate the effectiveness of our model, and we examine whether the additional recurrent layers have a positive effect on fitting the mutual interdependence among events in a sequence. In addition, to demonstrate the effectiveness of the ACT mechanism, we compare the performance of universal transformers with and without the ACT mechanism and verify that the dynamic halting mechanism makes the model perform better overall.
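To illustrate the combination described above, here is a minimal PyTorch sketch of one possible UTHP-style encoder: a single shared self-attention block is applied recurrently to the event embeddings, a 1-D convolution provides the local perception in the position-wise feed-forward layer, and an ACT-style halting unit decides per position when to stop iterating. All module names, hyperparameters, and the halting threshold are our own illustrative assumptions, not the authors' implementation (a full ACT also handles the leftover probability mass at the final step).

```python
import torch
import torch.nn as nn

class UniversalEncoderACT(nn.Module):
    """Hypothetical sketch: one shared encoder block applied recurrently
    with ACT-style halting; not the paper's code."""
    def __init__(self, d_model=64, n_heads=4, d_ff=256, max_steps=6, threshold=0.99):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Convolutional position-wise feed-forward, for local perception.
        self.ffn = nn.Sequential(
            nn.Conv1d(d_model, d_ff, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_ff, d_model, kernel_size=1),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.halt = nn.Linear(d_model, 1)   # per-position halting probability
        self.max_steps = max_steps
        self.threshold = threshold

    def refine(self, h):
        # One refinement iteration: self-attention, then convolutional FFN.
        a, _ = self.attn(h, h, h)
        h = self.norm1(h + a)
        f = self.ffn(h.transpose(1, 2)).transpose(1, 2)  # Conv1d wants (B, C, L)
        return self.norm2(h + f)

    def forward(self, h):
        batch, length, _ = h.shape
        cum_p = h.new_zeros(batch, length)   # accumulated halting probability
        out = torch.zeros_like(h)
        for _ in range(self.max_steps):
            running = cum_p < self.threshold
            p = torch.sigmoid(self.halt(h)).squeeze(-1)
            # Positions about to cross the threshold spend their remainder.
            crossing = running & (cum_p + p >= self.threshold)
            p = torch.where(crossing, 1.0 - cum_p, p) * running.float()
            cum_p = cum_p + p
            h = self.refine(h)
            out = out + p.unsqueeze(-1) * h  # ACT: weighted sum of states
            if bool((cum_p >= self.threshold).all()):
                break
        return out

enc = UniversalEncoderACT()
hidden = torch.randn(2, 10, 64)              # (batch, sequence length, d_model)
print(enc(hidden).shape)                     # torch.Size([2, 10, 64])
```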

Our paper is organized as follows. Section 2 introduces related work on the Hawkes process and neural point processes. Section 3 describes our model in detail, including its structure, conditional intensity function, prediction tasks, and training process. Section 4 presents our experimental results to illustrate the advantages of the universal transformer Hawkes process and the ACT mechanism. Finally, Section 5 concludes the article.

Section snippets

Hawkes process

The Hawkes process has the following form:

$$\lambda(t) = \mu(t) + \sum_{i:\, t_i < t} \phi(t - t_i) \tag{1}$$

where $\mu(t)$ is the background intensity function, indicating the background probability of event occurrence; $\phi(t)$ is the impact function, which measures the influence of a historical event; and $\sum_{i:\, t_i < t} \phi(t - t_i)$ records the impact of all historical events up to the current instant. The traditional Hawkes process model in Eq. (1) assumes a positive superposition of past historical impacts. Until now, there are many variants of the Hawkes process…
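To make Eq. (1) concrete, the following sketch evaluates this intensity for a one-dimensional Hawkes process, assuming the commonly used exponential impact function $\phi(t) = \alpha\beta e^{-\beta t}$ and a constant background intensity $\mu$; the parameter values are illustrative and not taken from the paper.

```python
import numpy as np

def hawkes_intensity(t, history, mu=0.5, alpha=0.8, beta=1.0):
    """lambda(t) = mu + sum over t_i < t of alpha * beta * exp(-beta * (t - t_i))."""
    past = np.asarray(history)
    past = past[past < t]                     # only events strictly before t
    return mu + np.sum(alpha * beta * np.exp(-beta * (t - past)))

events = [0.4, 1.1, 1.3, 2.6]                 # example event time-stamps
print(hawkes_intensity(3.0, events))          # elevated by the recent event at 2.6
```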

Proposed model

For sequences of asynchronous events, we need to determine how to model them. The symbols used in the paper are shown in Table 1. In general, we formulate the data as follows: assume there are $N$ sequences in the dataset of asynchronous events, represented as $S_e = \{s_n\}_{n=1}^{N}$. Note that the sequences differ in length: the $n$th sequence $s_n = \{(t_i, c_i)\}_{i=1}^{I_n}$ has length $I_n$, i.e., it is composed of $I_n$ tuples, where $t_i$ is the time-stamp of the $i$th event and $c_i$ is its event type…
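For illustration, this notation maps onto a simple data structure, shown in the sketch below; the padding values and batch layout are our own assumptions, not the paper's preprocessing.

```python
import numpy as np

# Sketch of the notation: a dataset Se = {s_n} of N variable-length sequences,
# each a list of (t_i, c_i) tuples; the padding scheme is an assumption.
dataset = [
    [(0.4, 2), (1.1, 0), (2.6, 1)],          # s_1, length I_1 = 3
    [(0.2, 1), (0.9, 2)],                    # s_2, length I_2 = 2
]

def pad_batch(seqs, pad_time=0.0, pad_type=-1):
    """Pad sequences to a common length so they can be batched."""
    max_len = max(len(s) for s in seqs)
    times = np.full((len(seqs), max_len), pad_time)
    types = np.full((len(seqs), max_len), pad_type, dtype=int)
    mask = np.zeros((len(seqs), max_len), dtype=bool)
    for n, s in enumerate(seqs):
        times[n, :len(s)] = [t for t, _ in s]
        types[n, :len(s)] = [c for _, c in s]
        mask[n, :len(s)] = True               # True marks real events
    return times, types, mask

times, types, mask = pad_batch(dataset)
```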

Experiments

We compare our model with three baselines on six event sequence datasets, evaluating the models by per-event log-likelihood (in nats), root mean square error (RMSE), and event prediction accuracy on held-out test sets. We first introduce the details of the datasets and baselines, and then present our experimental results.
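For reference, a minimal sketch of how these three held-out metrics are typically computed is given below; the averaging conventions are assumed rather than taken from the paper's exact evaluation protocol.

```python
import numpy as np

def per_event_loglik(loglik_per_seq, n_events_per_seq):
    """Per-event log-likelihood in nats: total log-likelihood / total events."""
    return np.sum(loglik_per_seq) / np.sum(n_events_per_seq)

def rmse(pred_times, true_times):
    """Root mean square error of predicted next-event times."""
    return np.sqrt(np.mean((np.asarray(pred_times) - np.asarray(true_times)) ** 2))

def accuracy(pred_types, true_types):
    """Fraction of correctly predicted event types."""
    return np.mean(np.asarray(pred_types) == np.asarray(true_types))

print(rmse([1.2, 2.1], [1.0, 2.5]))           # ~0.316
```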

Conclusions and future works

In this paper, we propose UTHP, a new neural point process model for analyzing asynchronous event sequences. UTHP combines the self-attention mechanism of the transformer with the recurrence mechanism of RNNs, allowing our model to organically integrate the advantages of both; moreover, to let UTHP adaptively determine when to stop refining the hidden variables, we introduce the ACT mechanism. Experimental results verify that our model…

CRediT authorship contribution statement

Lu-ning Zhang: Conceptualization, Methodology, Software, Formal analysis, Investigation, Writing – original draft, Visualization, Data curation. Jian-wei Liu: Conceptualization, Methodology, Validation, Resources, Formal analysis, Writing – review & editing, Supervision, Funding acquisition, Project administration. Zhi-yan Song: Software. Xin Zuo: Supervision, Project administration.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the Science Foundation of China University of Petroleum, Beijing (No. 2462020YXZZ023). We thank Hong-yuan Mei and Si-miao Zuo for their generous help during our research; their help greatly improved this work.


References (37)

  • Mohler, George, et al., 2018. Improving social harm indices with a modulated Hawkes process. Int. J. Forecast.
  • Zhang, Lu-ning, et al., 2020. Survival analysis of failures based on Hawkes process with Weibull base intensity. Eng. Appl. Artif. Intell.
  • Ba, Jimmy Lei, et al., 2016. Layer normalization.
  • Bacry, Emmanuel, et al., 2015. Hawkes processes in finance. Market Microstruct. Liquidity.
  • Bahdanau, Dzmitry, Cho, Kyunghyun, Bengio, Yoshua, 2015. Neural machine translation by jointly learning to align and...
  • Bengio, Yoshua, et al., 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw.
  • Chowdhary, K.R. Natural language processing.
  • Daley, Daryl J., et al., 2007. An Introduction to the Theory of Point Processes: Volume II: General Theory and Structure.
  • Dehghani, Mostafa, Gouws, Stephan, Vinyals, Oriol, Uszkoreit, Jakob, Kaiser, Lukasz, 2019. Universal transformers. In:...
  • Dong, Linhao, et al., 2018. Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition.
  • Du, Nan, Dai, Hanjun, Trivedi, Rakshit, Upadhyay, Utkarsh, Gomez-Rodriguez, Manuel, Song, Le, 2016. Recurrent marked...
  • Graves, Alex, 2016. Adaptive computation time for recurrent neural networks.
  • Hansen, Niels Richard, et al., 2015. Lasso and probabilistic inequalities for multivariate point processes. Bernoulli.
  • Hawkes, Alan G., 1971. Spectra of some self-exciting and mutually exciting point processes. Biometrika.
  • Hawkes, Alan G., 2018. Hawkes processes and their applications to finance: A review. Quant. Finance.
  • He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian, 2016. Deep residual learning for image recognition. In:...
  • Johnson, Alistair E.W., et al., 2016. MIMIC-III, a freely accessible critical care database. Sci. Data.
  • Kingma, Diederik P., Ba, Jimmy, 2015. Adam: A method for stochastic optimization. In: 3rd International Conference on...

Lu-ning Zhang, born in 1995. He is currently working toward the Ph.D. degree in control theory and control engineering with the Department of Automation, College of Information Science and Engineering, China University of Petroleum, Beijing Campus (CUP). His research interests include deep learning and pattern recognition.

luning_[email protected]

Postal address: 260 mailbox, China University of Petroleum, Changping District, Beijing, 102249, China

Jian-wei Liu, born in 1966. He received the Ph.D. degree in control theory and control engineering from Donghua University in 2006. He is now an associate professor with the Department of Automation, College of Information Science and Engineering, China University of Petroleum, Beijing Campus (CUP). His research interests include pattern recognition and intelligent systems, machine learning, and the analysis, prediction, and control of complex nonlinear systems. In these areas he has published over 200 papers in international journals and conference proceedings.

[email protected]

Postal address: 260 mailbox, China University of Petroleum, Changping District, Beijing, 102249, China

Zhi-yan Song, born in 1997. She is currently working toward the master's degree in control science and engineering with the Department of Automation, College of Information Science and Engineering, China University of Petroleum, Beijing Campus (CUP). Her research interests include deep learning and pattern recognition.

[email protected]

Postal address: 260 mailbox, China University of Petroleum, Changping District, Beijing, 102249, China

Xin Zuo, born in 1964. He is now a professor with the Department of Automation, College of Information Science and Engineering, China University of Petroleum, Beijing Campus (CUP). His research interests include intelligent control, analysis and design of safety instrumented systems, and advanced process control.

[email protected]

Postal address: 260 mailbox, China University of Petroleum, Changping District, Beijing, 102249, China
