Open Access 19.05.2022 | Original Article

Discriminative and efficient non-local attention network for league of legends highlight detection

Authors: Qian Wan, Aruna Wang, Guoshuai Zhang, Le Liu, Jiaji Wu

Published in: Complex & Intelligent Systems | Issue 6/2022


Abstract

With the growing popularity of eSports, video highlight detection, which encapsulates the most informative parts of a match in a few seconds, has become a critical part of live competition. However, learning the spatial–temporal dependency efficiently and discriminatively for league of legends (LoL) highlight detection remains an open problem. In this study, we propose a novel discriminative and efficient non-local attention network (DENAN) for LoL highlight detection. In particular, both spatial and temporal dependencies are learned by a lightweight, end-to-end trainable framework. An auxiliary triplet loss is used during training to learn robust LoL video feature representations and further improve DENAN's performance. Experimental results on the NALCS and LMS datasets demonstrate the effectiveness of our method in terms of both performance and computation cost.

Introduction

Recently, eSports has become increasingly popular among players and viewers. League of legends (LoL), one of the most famous online eSports titles, has attracted numerous players and fans around the world. According to a Statista report, the LoL World Championship peaked at approximately 46 million concurrent viewers in 2020. Highlight replay is an important part of live streaming because it shows the most exciting fight fragments of the match. However, current LoL highlight replay is still largely produced manually. Therefore, it is critical to implement automatic and efficient highlight detection for LoL.
As an attempt to automate highlight generation, video highlight detection has attracted the attention of both academia and industry. The goal of highlight detection is to generate a short video clip from a candidate video that captures a user's primary attention or interest [1]. Existing video highlight detection methods can be divided into two main categories: structure-driven methods [2, 3] and keyframe-based methods [4–6]. Structure-driven methods rely on a well-defined data structure in the detected video, such as audience cheering or chatting, score changes, or other special events. Keyframe-based methods aim to optimize the feature representation from the frame level to the clip level: keyframes are extracted using the histogram of oriented gradients (HOG), scale-invariant feature transform (SIFT), and clustering algorithms, and video browsing is then performed near the selected keyframes to smooth out the video highlights. With recent advances in convolutional neural networks (CNNs), CNN-based methods have been extensively used in video highlight detection [4, 6–8].
To learn a robust feature representation for video highlight detection, some studies have considered the temporal dependency between video frames. One study [6] used long short-term memory (LSTM) [9] to model the variable-range temporal dependency among video frames and generate both representative and compact video summaries. A two-layer recurrent neural network (RNN) was used to construct a hierarchical RNN, which exploits the long temporal dependency among frames [10]. However, learning discriminative and robust video feature representations solely from temporal dependency is insufficient, and it is natural to model the spatial–temporal dependency inside the video. In one study, the features extracted from the spatial and temporal streams were combined to develop a novel pairwise deep ranking model [7]. Another study proposed a deep ranking model that produces a score map for each video segment based on the spatial and temporal streams. For LoL highlight detection, we intend to determine, along the time dimension, when the audience is most interested in the live streaming. In addition, it is important to determine where in the frame the audience will pay the most attention. Therefore, explicitly learning the spatial–temporal dependency for the LoL highlight detection task is appealing.
A few attempts have recently been made to integrate attention-based methods into video highlight detection, with the goal of weighting the importance of different frames. An attention mechanism learned with a Bi-LSTM was used to model the importance of different source frames for video highlight detection [11]. A self-attention mechanism was used to replace the complex LSTM, capturing the temporal relationship between input frame features before computing a weighted average of all input features [12]. However, while promising performances have been reported, most current attention-based methods still rely on complex LSTMs to learn the temporal dependency. Therefore, a model that captures the spatial–temporal dependency at a low computational cost is required for LoL highlight detection.
In this study, we attempt to capture the spatial–temporal feature representation of the candidate video with an attention mechanism, yielding an efficient, discriminative, and end-to-end trainable LoL video highlight detector. Building on [13], we propose a discriminative and efficient non-local attention network (DENAN), as shown in Fig. 1, which incorporates the non-local attention mechanism into ShuffleNetV2 [14], a light-weight CNN for classification. The non-local attention module refines the LoL video sequence representations by generating mean-weighted attention over the features of different spatial and temporal locations in the sequences. DENAN explores the spatial–temporal diversity of LoL video sequences and learns the sequence representation discriminatively and efficiently. The main contributions of this study are as follows:
1.
We propose an end-to-end discriminative and efficient non-local attention network that learns both spatial and temporal dependencies for LoL video highlight detection.
 
2.
We significantly reduce the computation cost of the LoL video highlight detection task while improving performance.
 
3.
Experimental results on the NALCS and LMS datasets show that our proposed method outperforms existing methods in terms of efficiency and accuracy.
 
Related work

In this section, we review recent methods related to our work, covering video highlight detection, eSports video highlight detection, and attention-based video highlight detection.

Video highlight detection

Research on video highlight detection has mainly proceeded along two directions: (a) keyframe-based methods and (b) structure-driven methods. Keyframe-based methods use a subset of representative keyframes from the original video to generate highlights. Most early methods extract keyframes independently and treat highlight detection as a classification task. Borth et al. proposed a keyframe extraction approach in which the video is first segmented into shots using shot boundary detection, and keyframes are then obtained using the k-means algorithm [15]. Lin et al. used a context-specific highlight support vector machine (SVM) model to summarize video sequences without watching the entire video by predicting the contextual information of each video segment [16]. Although these methods achieved remarkable performance, they only extract low-level features and ignore the temporal dependency that describes the relationship between highlight and non-highlight frames.
Unlike keyframe-based methods, structure-driven methods exploit a well-defined data structure in the detected video, such as audience cheering and chatting, score changes, or other special events. Structure-driven methods are therefore well suited to sports video highlight detection and have attracted the attention of many researchers [6, 17, 18]. Zhao et al. proposed a highlight detection model based on audio energy and motion activity [17]. Hsieh et al. proposed a more flexible solution for finding important and meaningful events in sports games by analyzing the messages shared between users on microblog services [18]. Although sports video highlight detection has improved, these methods rely on audio, textual, and psychological data, which are not always easy to obtain.

eSports video highlight detection

With the recent rapid development of eSports, video highlight detection has attracted the attention of both industry and academia. Fu et al. proposed a CNN-LSTM model for LoL that combines visual features and real-world audience discourse. Song et al. proposed a cascaded prediction approach that learns convolution filters of visual effects to detect video highlights in Heroes of the Storm, LoL, and Dota 2 [19]. Wang et al. [20] proposed a multi-stream framework to fuse spatial and temporal information with audio features extracted from Honor of Kings videos.
Recent eSports video highlight detection methods have attempted to address these problems from a cross-modal perspective. However, in real-world applications, cross-modal information, such as audience chat or audio signals, is either difficult to obtain or requires additional computation. Therefore, we consider eSports video highlight detection using only visual features.

Attention-based video highlight detection

The objective of attention-based video highlight detection is to find what the user pays the most attention to, which is highly correlated with highlights. Ma et al. presented a generic framework built on a user attention model, which estimates how much attention viewers may pay to video content [21]. Ejaz et al. proposed an efficient visual attention model based on keyframe extraction and reduced the computational cost using a temporal gradient based on dynamic visual saliency detection [22]. Additionally, some studies have integrated spatial–temporal cues into attention-based video highlight detection, with the goal of determining when and where users are most interested. A novel 3D attention model was proposed that can automatically localize the key elements in a video without any extra supervised annotations [23].
A self-attention mechanism was proposed to model long-range dependency in machine translation [24]. Inspired by the self-attention mechanism [24] and the non-local means algorithm [25], Wang et al. proposed non-local attention [13], which computes the response at one position as a weighted sum of the features at all positions, capturing long-range spatial–temporal dependency for video representation. Our work is similar to [23], but [23] computes coarse-grained spatial–temporal attention in which the attention matrix only models the relationship between different channels and is shared along the spatial and temporal dimensions. In contrast, DENAN captures a fine-grained relationship at the pixel level, relating all positions along the spatial and temporal dimensions.

Methods

In this section, we first give an overview of our proposed DENAN. Then, each sub-module of DENAN is described in detail.

Network architecture for video highlight detection of LoL

Figure 2 shows an overview of our proposed network, which consists of three parts: (1) a LoL video encoder, which converts LoL video sequences into deep feature representations; ShuffleNetV2 [14] is used as the encoder; (2) a non-local attention module, which captures both spatial–temporal and long-range dependencies in LoL video sequences; and (3) discriminative loss functions for training; triplet loss (TL) [26] and cross-entropy (CE) loss are used to optimize the entire framework.

Video encoder for LoL

We adopt an efficient video encoder with low computation cost to meet the requirements of a real-time video highlight detection system. Inspired by recent studies [14, 27] that designed efficient and light-weight 2D universal CNNs, we adopt ShuffleNetV2 [14] (Fig. 3) as the frame-level video encoder for LoL.
The number of feature channels in a light-weight network is limited by the available computing resources. Compared with ShuffleNetV1 [27], the main improvement of ShuffleNetV2 [14] is the channel split operation. In particular, the channels of the input feature map are divided into one branch with \({C}^{^{\prime}}\) channels and another branch with \(\left(C-{C}^{^{\prime}}\right)\) channels, where \(C\) is the total number of channels. After a three-layer convolution operation on one of the branches, the features of the two branches are concatenated. A channel shuffle operation is then applied so that the information of the two branches can interact.
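For illustration, the following is a minimal PyTorch sketch of the channel split and channel shuffle operations in the stride-1 ShuffleNetV2 unit. It is a simplified reading of [14] (the downsampling unit, which has no split, is omitted), not the reference implementation.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
    # Reorder channels so that the two branches can exchange information.
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

class ShuffleV2Unit(nn.Module):
    """Stride-1 ShuffleNetV2 unit (simplified): channel split -> 3-layer branch -> concat -> shuffle."""
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2  # C' = C / 2 in the split described above
        self.branch = nn.Sequential(
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False), nn.BatchNorm2d(half),  # depthwise
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x.chunk(2, dim=1)             # channel split: C' and C - C' channels
        out = torch.cat((x1, self.branch(x2)), dim=1)
        return channel_shuffle(out, groups=2)  # let the two branches interact
```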
Consider a set of video sequences \({\{{X}_{i}\}}_{i=1}^{N}\), where \({X}_{i}\) denotes one LoL video sequence and \(N\) is the total number of video sequences. Each sequence contains \(T\) frames, where \(T\) is the video sequence length. The frames of one sequence are fed frame by frame into the video encoder to obtain a set of frame-level feature maps \({\chi }_{i}={\{{x}_{i}^{t}\}}_{t=1}^{T}\), where \({x}_{i}^{t}\) is the feature map of the \({t}\)th frame in the \({i}\)th sequence. The output of the video encoder is then fed into the non-local attention module to further capture the spatial–temporal features of the LoL video.
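Because the encoder is a 2D CNN, each frame is processed independently. The following is a minimal sketch (assuming PyTorch and any frame-level encoder that maps an RGB image to a feature map, such as ShuffleNetV2 with its final pooling and classification layers removed, as described in the implementation details below):

```python
import torch

def encode_sequence(encoder: torch.nn.Module, clips: torch.Tensor) -> torch.Tensor:
    """Apply a frame-level 2D encoder to a batch of video sequences.

    clips: (N, T, 3, H, W) batch of N sequences of T RGB frames.
    Returns frame-level feature maps chi of shape (N, T, C, h, w)."""
    n, t, c, h, w = clips.shape
    frames = clips.reshape(n * t, c, h, w)        # fold time into the batch dimension
    feats = encoder(frames)                       # (N*T, C', h', w') frame-level feature maps
    return feats.reshape(n, t, *feats.shape[1:])  # unfold back to (N, T, C', h', w')
```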

Spatial–temporal feature extraction using non-local attention module

The input of the non-local attention module is the set of frame-level feature sequences \({\chi }_{i}={\{{x}_{i}^{t}\}}_{t=1}^{T}\). Following the non-local formulation proposed in [13, 25], the non-local operation is defined as follows:
$${\mathcal{F}}_{i,j}=\frac{1}{N({\chi }_{i})}\sum_{\forall k}s\left({\chi }_{i,j},{\chi }_{i,k}\right)g\left({\chi }_{i,k}\right),$$
(1)
where \({\chi }_{i}\) denotes the extracted feature sequence, \(j\) is the position index at which the response is computed, \(k\) enumerates all possible positions of the input feature sequence in both the spatial and temporal dimensions, \(s\left({\chi }_{i,j},{\chi }_{i,k}\right)\) denotes the relationship between position \(j\) and position \(k\) of the input feature sequence, \(g({\chi }_{i,k})\) computes the feature representation of the input \({\chi }_{i}\) at position \(k\), and \(N({\chi }_{i})\) is a normalization factor.
We design a non-local attention module based on the non-local operation [13] to capture spatial and temporal pixel dependencies in LoL videos (Fig. 4). Following Eq. (1), which captures the spatial and temporal long-range pixel relationship, the non-local attention is defined as follows:
$${\stackrel{\sim }{\mathcal{F}}}_{i,j}=\frac{1}{N}\sum_{\forall k}\theta {\left({\upchi }_{i,j}\right)}^{T}\phi \left({\chi }_{i,k}\right)g\left({\chi }_{i,k}\right),$$
(2)
where \({\stackrel{\sim }{\mathcal{F}}}_{i,j}\) is the element of \({\stackrel{\sim }{\mathcal{F}}}_{i}\) at position \(j\), and \(k\) enumerates all positions of \({\chi }_{i}\). Here, \(\chi_{i,j}\) and \(\chi_{i,k}\) are projected into an embedding space using linear transformations: \(\theta \left({\chi }_{i,j}\right)={W}_{\theta }{\chi }_{i,j}\), \(\phi \left({\chi }_{i,k}\right)={W}_{\phi }{\chi }_{i,k}\), and \(g\left({\chi }_{i,k}\right)={W}_{g}{\chi }_{i,k}\), where \(W_{\theta }\), \(W_{\phi }\), and \(W_{g}\) are weights to be learned. Equation (2) is similar to the self-attention mechanism proposed in [24].
To project \({\stackrel{\sim }{\mathcal{F}}}_{i}\) back into the original space, the non-local operation is wrapped into the non-local attention module shown in Fig. 4, which is defined as follows:
$${Z}_{i}={W}_{z}{\stackrel{\sim }{\mathcal{F}}}_{i}+{\chi }_{i},$$
(3)
where \({W}_{z}\) is a linear projection matrix, implemented as a \(1\times 1\times 1\) convolution, and \({Z}_{i}\) is the final output of the non-local attention module. A video highlight is a span of continuous frames, so the frames are not independent in time. Moreover, determining whether a particular frame is a highlight also requires considering the long-range pixel dependency in the spatial dimension. Therefore, we should consider the relationships within LoL videos along both the spatial and temporal dimensions.
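The following is a minimal PyTorch sketch of Eqs. (2) and (3), using the dot-product form written above (1/N normalization over all positions) with \(1\times 1\times 1\) convolutions for \(\theta\), \(\phi\), \(g\), and \(W_{z}\). The halved inner channel dimension is an assumption borrowed from [13] rather than a value stated here.

```python
import torch
import torch.nn as nn

class NonLocalAttention(nn.Module):
    """Space-time non-local attention block, a sketch of Eqs. (2)-(3).

    Input/output shape: (N, C, T, H, W)."""
    def __init__(self, channels: int):
        super().__init__()
        inner = channels // 2
        self.theta = nn.Conv3d(channels, inner, kernel_size=1)
        self.phi = nn.Conv3d(channels, inner, kernel_size=1)
        self.g = nn.Conv3d(channels, inner, kernel_size=1)
        self.w_z = nn.Conv3d(inner, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, t, h, w = x.shape
        m = t * h * w                                   # number of spatial-temporal positions
        q = self.theta(x).reshape(n, -1, m)             # (N, C/2, THW)
        k = self.phi(x).reshape(n, -1, m)
        v = self.g(x).reshape(n, -1, m)
        attn = torch.bmm(q.transpose(1, 2), k) / m      # (N, THW, THW), dot-product form of Eq. (2)
        f = torch.bmm(v, attn.transpose(1, 2))          # weighted sum of g(.) over all positions
        f = f.reshape(n, -1, t, h, w)
        return self.w_z(f) + x                          # Eq. (3): project back and add the residual
```

For frame-level features of shape (N, T, C, h, w), the tensor can be permuted to (N, C, T, h, w) before being passed to this block; the embedded-Gaussian (softmax) variant of [13] would replace the division by the number of positions with a softmax over the last dimension.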
To further refine the spatial–temporal information of the learned features, we use global average pooling (GAP) to aggregate information in spatial and temporal dimensions as follows:
$$ \widetilde{{z_{i}^{c} }} = \frac{1}{T \times H \times W}\mathop \sum \limits_{t = 1}^{T} \mathop \sum \limits_{h = 1}^{H} \mathop \sum \limits_{w = 1}^{W} z_{i}^{t} \left( {h,w,c} \right), $$
(4)
where \(\stackrel{\sim }{{z}_{i}}=[\stackrel{\sim }{{z}_{i}^{1}},\dots ,\stackrel{\sim }{{z}_{i}^{C}}]\) is the feature representation vector of \({\chi }_{i}\), \(c\) denotes the channel index of \(\stackrel{\sim }{{z}_{i}^{c}}\) in \(\stackrel{\sim }{{z}_{i}}\), and \(C=1024\) in our proposed DENAN.
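Equation (4) is a global average over the temporal and spatial dimensions; a one-line sketch, assuming the (N, C, T, H, W) layout used in the non-local sketch above:

```python
import torch

def spatial_temporal_gap(z: torch.Tensor) -> torch.Tensor:
    # Eq. (4): average over the temporal (T) and spatial (H, W) dimensions of a
    # (N, C, T, H, W) tensor, giving one C-dimensional descriptor per sequence.
    return z.mean(dim=(2, 3, 4))
```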

Loss function for training DENAN

To extract discriminative and robust features with DENAN, we combine the CE loss with the TL [26] to train our framework for LoL video highlight detection. The CE loss is defined as follows:
$${\mathcal{L}}_{\mathrm{CE}}=-\frac{1}{N}\sum_{i=1}^{N}{y}_{i}\mathrm{log}f\left(\stackrel{\sim }{{z}_{i}}\right),$$
(5)
$$f\left(\stackrel{\sim }{{z}_{i}}\right)=\frac{\mathrm{exp}({W}_{{y}_{i}}^{T}{z}_{i,j})}{{\sum }_{j}\mathrm{exp}({W}_{{y}_{j}}^{T}{z}_{i,j})},$$
(6)
where \(f\left(\cdot \right)\) denotes the softmax function, \({W}_{y}\) is the classifier weight, the output \(f\left(\stackrel{\sim }{{z}_{i}}\right)\) denotes the probability that the input frame is a highlight, \(N\) is the number of LoL video sequences in a mini-batch, and \(j\) indexes all classes, i.e., highlight and non-highlight.
Meanwhile, the discriminativeness of the features extracted by DENAN is improved using the TL [26]. The batch-hard TL is defined as follows:
$$ {\mathcal{L}}_{{{\text{TL}}}} = \mathop \sum \limits_{a,p,n} \alpha + \overbrace {{\mathop {\max }\limits_{{y_{a} = y_{p} }} \left\| {\tilde{z}_{a} - \tilde{z}_{p} } \right\|^{2} }}^{{\text{hardest positive}}} - \underbrace {{\mathop {\min }\limits_{{y_{a} \ne y_{n} }} \left\| {\tilde{z}_{a} - \tilde{z}_{n} } \right\|^{2} }}_{{\text{hardest negative}}}, $$
(7)
where \({\tilde{z }}_{a}\), \({\tilde{z }}_{p}\), and \({\tilde{z }}_{n}\) are the features extracted from the anchor, positive, and negative samples, respectively, and \(\alpha\) is the margin hyper-parameter that controls the gap between intra-class and inter-class distances. Here, positive and negative samples refer to LoL video sequences with the same class as, or a different class from, the anchor.
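A PyTorch sketch of the batch-hard TL of Eq. (7) follows. The hinge at zero comes from the standard batch-hard formulation of [26], and averaging over anchors is a normalization choice, not a detail stated here.

```python
import torch

def batch_hard_triplet_loss(feats: torch.Tensor, labels: torch.Tensor, margin: float = 0.1) -> torch.Tensor:
    """Batch-hard triplet loss of Eq. (7).

    feats: (N, C) sequence descriptors z_tilde; labels: (N,) highlight / non-highlight class labels."""
    dist = torch.cdist(feats, feats).pow(2)                           # squared pairwise distances ||z_a - z_b||^2
    same = labels.unsqueeze(0) == labels.unsqueeze(1)                 # (N, N) mask of same-class pairs
    pos = dist.masked_fill(~same, float('-inf')).max(dim=1).values    # hardest positive per anchor
    neg = dist.masked_fill(same, float('inf')).min(dim=1).values      # hardest negative per anchor
    # Hinge at zero as in the standard batch-hard formulation [26], averaged over anchors.
    return (margin + pos - neg).clamp(min=0).mean()
```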
In summary, the loss function for DENAN training is a combination of TL and CE losses, which is defined as follows:
$${\mathcal{L}}_{\mathrm{total}}={\mathcal{L}}_{\mathrm{CE}}+\lambda {\mathcal{L}}_{\mathrm{TL}},$$
(8)
where \(\lambda \) controls the balance between the TL and CE losses; \(\lambda \) is set to 1 in this work.
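A short sketch combining Eqs. (5)–(8), assuming the `batch_hard_triplet_loss` sketch above and integer class labels (highlight vs. non-highlight):

```python
import torch
import torch.nn.functional as F

def denan_loss(logits: torch.Tensor, feats: torch.Tensor, labels: torch.Tensor,
               lam: float = 1.0, margin: float = 0.1) -> torch.Tensor:
    # Eq. (8): cross-entropy on the highlight / non-highlight logits (Eqs. (5)-(6))
    # plus the batch-hard triplet loss of Eq. (7), weighted by lambda (= 1 in this work).
    ce = F.cross_entropy(logits, labels)            # labels: (N,) long tensor of class indices
    tl = batch_hard_triplet_loss(feats, labels, margin)
    return ce + lam * tl
```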

Experiment

Datasets and evaluation metrics

We trained and evaluated our proposed DENAN on two datasets: NALCS [28] and LMS [28]. The NALCS dataset contains 218 LoL videos from the 2017 Spring season, split into 128 training videos, 40 validation videos, and 50 test videos. The average length of each video is between 30 and 50 min, containing both highlight and non-highlight frames. The data labeling process is described in detail in [28].
The LMS dataset contains 103 LoL videos, including 57 training videos, 18 validation videos, and 28 test videos. In our experiments, the training and validation sets of these two datasets were used for training, while the test sets were used for testing.
Following the metrics commonly used in video summarization tasks [6, 8, 29], we use precision (\(\mathrm{P}\)), recall (\(\mathrm{R}\)), and F1-score (\(\mathrm{F}1\)) as evaluation metrics to evaluate the performance of DENAN. Let \(\mathrm{TP}\) denote highlight frames that are correctly predicted as highlights, \(\mathrm{FP}\) denote non-highlight frames that are incorrectly predicted as highlights, and \(\mathrm{FN}\) denote highlight frames that are incorrectly predicted as non-highlights. Then, \(\mathrm{P}\), \(\mathrm{R}\), and \(\mathrm{F}1\) are calculated as follows:
$$\mathrm{P}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}},$$
(9)
$$\mathrm{R}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}},$$
(10)
$$\mathrm{F}1=\frac{2\mathrm{PR}}{\mathrm{P}+\mathrm{R}}.$$
(11)
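These metrics can be computed directly from frame-level predictions; a minimal sketch (function name and argument layout are illustrative):

```python
def precision_recall_f1(pred, gold):
    """Frame-level precision, recall, and F1 from Eqs. (9)-(11).

    pred, gold: iterables of 0/1 highlight labels for each evaluated frame."""
    tp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gold) if p == 0 and g == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```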
To evaluate the computational complexity of DENAN, we also report the number of floating point operations (FLOPs) and the number of parameters (Num Params) [14, 30]. In particular, FLOPs is the number of floating point operations the model performs when processing one sequence of data, while Num Params is the number of DENAN parameters.
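Num Params can be obtained by counting trainable parameters, as in the sketch below; FLOPs are usually measured with a separate profiling tool, which is not shown here.

```python
import torch

def count_parameters(model: torch.nn.Module) -> float:
    # Num Params in millions: total number of trainable parameters of the model.
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```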

Implementation details

In LoL, the most attractive actions, such as escapes and kills, occur in the later part of a highlight fragment. Because no one can predict what will happen until the last moment, the later part of the highlight fragment is highly related to the final result. To generate a proper data format, we sampled 5000 positive frames from the last labeled positive frames as the real positives for training on the NALCS and LMS datasets. Additionally, another 5000 negative frames were sampled over all negative frames. Each sampled frame was taken as the first frame of a video sequence, and the remaining frames of the sequence were sampled every ten frames in the 720P 30 FPS video. During testing, frames were evaluated every 30 frames in the 720P 30 FPS video.
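A heavily hedged sketch of this sampling procedure, assuming per-frame 0/1 labels: exactly how positive start frames are drawn from "the last labeled positive frames", and the direction in which sequences extend from the start frame, are interpretations rather than details stated here, and the function and argument names are illustrative.

```python
import random

def build_sequences(frame_labels, num_pos=5000, num_neg=5000, seq_len=16, stride=10):
    """Sample training sequences: positive starts near the ends of highlight segments,
    negative starts anywhere among non-highlight frames; each start is expanded into
    seq_len frame indices taken every `stride` frames (going forward in this sketch)."""
    # last frame index of every contiguous highlight segment
    seg_ends = [i for i in range(len(frame_labels) - 1)
                if frame_labels[i] == 1 and frame_labels[i + 1] == 0]
    negatives = [i for i, y in enumerate(frame_labels) if y == 0]

    pos_starts = random.choices(seg_ends, k=num_pos)
    neg_starts = random.choices(negatives, k=num_neg)

    def expand(start):
        return [min(start + j * stride, len(frame_labels) - 1) for j in range(seq_len)]

    return ([(expand(s), 1) for s in pos_starts] +
            [(expand(s), 0) for s in neg_starts])
```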
Our proposed DENAN uses an ImageNet-pretrained ShuffleNetV2 [14] as the backbone, with the last two layers of ShuffleNetV2 (the GAP and FC layers) removed. All frames in the LoL videos are resized to \(224\times 224\). Each mini-batch contains 32 LoL video sequences of 16 frames each, i.e., 512 frames. During training, the initial learning rate is set to 0.01 and is decreased to 0.001 at the 20th epoch. The maximum number of epochs is set to 60, which is sufficient to reach convergence. The SGD algorithm is used to optimize the parameters, with momentum and weight decay set to 0.9 and \(10^{-4}\), respectively.
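A sketch of the training loop with the hyper-parameters stated above, assuming a hypothetical `model` that returns (class logits, sequence descriptor) and a `train_loader` yielding (clips, labels), and reusing the `denan_loss` sketch above:

```python
import torch

def train_denan(model: torch.nn.Module, train_loader, epochs: int = 60):
    """Training loop sketch for DENAN with the stated hyper-parameters."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
    # learning rate 0.01, dropped to 0.001 at the 20th epoch
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20], gamma=0.1)
    for _ in range(epochs):
        for clips, labels in train_loader:        # 32 sequences of 16 frames per mini-batch
            logits, feats = model(clips)
            loss = denan_loss(logits, feats, labels)   # Eq. (8), sketched earlier
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```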

Discussion of the experimental results

In this section, we conduct a series of experiments on the NALCS and LMS datasets to demonstrate the validity of each component of our proposed DENAN. Additionally, we investigate the effect of the TL margin and the length of the video sequence on model performance.
Table 1 shows the results of an ablation study for each component of DENAN. The ShuffleNetV2 network trained with the CE loss on the NALCS and LMS datasets serves as the baseline; NAN stands for the non-local attention network and TL denotes the triplet loss. Compared with the baseline, baseline + NAN changes P, R, and F1 by 0.04, − 0.02, and 0.01 on NALCS and by 0.07, − 0.04, and 0.01 on LMS, respectively. Baseline + TL, in which the TL is added to the baseline training, improves P, R, and F1 by 0.01, 0.01, and 0.01 on NALCS and by 0.02, 0.01, and 0.01 on LMS, respectively. The objective of the TL is to learn discriminative and robust feature representations, while the objective of non-local attention is to capture the long-range dependencies along the spatial and temporal dimensions. When the TL and non-local attention are combined, the discriminativeness of the feature representations learned by the non-local attention network is improved: P, R, and F1 change by − 0.03, 0.05, and 0.01 on the NALCS dataset and by − 0.03, 0.04, and 0.01 on the LMS dataset over baseline + NAN, respectively. These experimental results show the effectiveness of non-local attention in capturing spatial and temporal long-range dependencies.
Table 1
Ablation study using the NALCS and LMS datasets with each DENAN component

NAN | TL | NALCS P | NALCS R | NALCS F1 | LMS P | LMS R | LMS F1 | Num params (M) | FLOPs (M)
×   | ×  | 0.76    | 0.71    | 0.74     | 0.69  | 0.78  | 0.74   | 1.25           | 143.88
×   | √  | 0.77    | 0.72    | 0.75     | 0.71  | 0.79  | 0.75   | 1.25           | 143.88
√   | ×  | 0.80    | 0.69    | 0.75     | 0.76  | 0.74  | 0.75   | 3.35           | 246.77
√   | √  | 0.77    | 0.74    | 0.76     | 0.73  | 0.78  | 0.76   | 3.35           | 246.77

Best in bold
"×" denotes an unused component, while "√" denotes a used component
In Fig. 5, we conduct experiments on both the NALCS and LMS datasets with baseline + NAN + TL to evaluate the effect of the TL margin, which controls the minimum distance between the hardest positive and the hardest negative. Our framework achieves the best results on the NALCS and LMS datasets when \(\alpha =0.1\). Smaller values of the TL margin yield promising results because the TL acts directly on the feature representation and can enhance the robustness of a backbone trained with the CE loss alone. When the TL margin is set to a larger value, DENAN overfits.
Table 2 shows the results of an ablation study on the length of the input LoL video sequences on the NALCS and LMS datasets. For a fair comparison, we use the model trained with a video sequence length of 16 and evaluate it with different lengths (4, 8, and 16). Compared with \(T=4\), when \(T=16\), P, R, and F1 increase by 0.11, 0.04, and 0.08, respectively, on the NALCS dataset, and by 0.08, 0.08, and 0.08, respectively, on the LMS dataset. This is because our proposed DENAN captures the spatial–temporal dependency between different frames. In theory, the performance of the model should increase as the length of the video sequence grows; however, the length is limited by GPU memory, so we can only use a maximum length of 16.
Table 2
Ablation study on both NALCS and LMS datasets with different LoL video sequence lengths

Length | NALCS P | NALCS R | NALCS F1 | LMS P | LMS R | LMS F1
4      | 0.66    | 0.70    | 0.68     | 0.65  | 0.70  | 0.68
8      | 0.73    | 0.67    | 0.70     | 0.66  | 0.73  | 0.70
16     | 0.77    | 0.74    | 0.76     | 0.73  | 0.78  | 0.76

Best in bold

Comparison with state-of-the-art methods

As shown in Table 3, our approach outperforms state-of-the-art methods on the NALCS and LMS datasets when evaluated with P, R, F1, FLOPs, and Num Params.
Table 3
Comparison with state-of-the-art methods on both NALCS and LMS datasets

Method       | NALCS P | NALCS R | NALCS F1 | LMS P | LMS R | LMS F1 | Num params (M) | FLOPs (M)
DR-DSN [8]   | 0.62    | 0.80    | 0.70     | 0.57  | 0.84  | 0.68   | 2.62           | 41.97
Iv-LSTM [28] | 0.79    | 0.70    | 0.75     | 0.72  | 0.68  | 0.70   | 21.35          | 3663.31
Ours         | 0.77    | 0.74    | 0.76     | 0.73  | 0.78  | 0.76   | 3.35           | 246.77

Best in bold
In Table 3, the P, R, and F1 of the proposed DENAN are 0.77, 0.74, and 0.76, respectively, on the NALCS dataset and 0.73, 0.78, and 0.76, respectively, on the LMS dataset, with 3.35 M parameters and 246.77 M FLOPs. DR-DSN [8] formulates video summarization as a sequential decision-making process and trains a deep summarization network to indicate the likelihood of a frame being selected for the summary; the selected frames are then used as video highlights. Our method achieves an F1 that is 0.06 higher on the NALCS dataset and 0.08 higher on the LMS dataset. In terms of Num Params and FLOPs alone, DR-DSN appears lighter than DENAN. However, DR-DSN first extracts frame-level features using GoogLeNet [31]; this feature extraction is not counted in the figures above, and the extracted features are always local. Therefore, our model still has a computational complexity advantage. Iv-LSTM [28] performs video highlight prediction in LoL based on joint visual features and textual analysis of the audience commentary; it achieves 0.79, 0.70, and 0.75 on the NALCS dataset and 0.72, 0.68, and 0.70 on the LMS dataset. Our proposed method differs from Iv-LSTM by − 0.02, 0.04, and 0.01 on NALCS and by 0.01, 0.10, and 0.06 on LMS, respectively. DENAN outperforms Iv-LSTM using only visual features, which demonstrates the effectiveness of DENAN for LoL video highlight detection.

Visualization of DENAN performance

We compare the LoL video highlights produced by Tencent with those detected by our proposed DENAN. Figure 6a, c shows the LoL video highlight replays made by Tencent, while Fig. 6b, d shows the LoL video highlights detected by our proposed DENAN. Highlight frames and non-highlight frames are marked with green and red blocks, respectively. Most highlight frames are detected correctly, and our proposed DENAN outperforms state-of-the-art methods, including Tencent's current method. Our proposed DENAN not only achieves accurate video highlight detection (higher P, R, and F1), but also has lower FLOPs and Num Params, enabling automatic and real-time video highlight detection.
Figure 7 visualizes the predictions of our proposed method on LoL videos from the NALCS and LMS datasets. Video highlight and non-highlight frames are denoted as 1 and − 1, respectively. The upper parts of Fig. 7a, b compare the ground truth (label) and the prediction (predict), where the turquoise line represents the ground truth of each frame in the video and the light-pink line represents the prediction of our proposed DENAN. The lower parts show the overlap between the ground truth and the prediction, with green indicating a correct prediction and red indicating an incorrect one. As shown in Fig. 7, most positive labels are predicted correctly, which demonstrates the effectiveness of our method.

Conclusion

In this study, we proposed DENAN for LoL video highlight detection with low computation cost, in which a light-weight ShuffleNetV2 video encoder is used to extract frame-level features from LoL video sequences and non-local attention is used to capture spatial–temporal long-range dependencies. Training with the CE loss and the TL further improves the performance of DENAN. Experimental results on the NALCS and LMS datasets demonstrate the validity of the proposed method.

Acknowledgements

This work was supported by the Key Research and Development Program of Shaanxi Province-Key Industry Innovation Chain (Group)-Industrial Field under No. 2019ZDLGY10-06.

Declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Xiong B, Kalantidis Y, Ghadiyaram D, Grauman K (2019) Less is more: learning highlight detection from video duration. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 1258–1267
2. Nepal S, Srinivasan U, Reynolds G (2001) Automatic detection of "goal" segments in basketball videos. In: Proceedings of the ninth ACM international conference on multimedia, pp 261–269
3. Rui Y, Gupta A, Acero A (2000) Automatically extracting highlights for TV baseball programs. In: Proceedings of the eighth ACM international conference on multimedia, pp 105–115
5. Song Y, Vallmiyana J, Stent A et al (2015) TVSum: summarizing web videos using titles. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5179–5187
7. Yao T, Mei T, Rui Y (2016) Highlight detection with pairwise deep ranking for first-person video summarization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 982–990
8. Zhou K, Qiao Y, Xiang T (2018) Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. In: Thirty-second AAAI conference on artificial intelligence
10. Zhao B, Li X, Lu X (2017) Hierarchical recurrent neural network for video summarization. In: Proceedings of the 25th ACM international conference on multimedia, pp 863–871
11. Ji Z, Xiong K, Pang Y, Li X (2019) Video summarization with attention-based encoder–decoder networks. IEEE Trans Circuits Syst Video Technol 30(6):1709–1717
12. Fajtl J, Sokeh HS, Argyriou V, Monekosso D, Remagnino P (2018) Summarizing videos with attention. In: Asian conference on computer vision. Springer, Cham, pp 39–54
13. Wang X, Girshick R, Gupta A et al (2018) Non-local neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7794–7803
15. Borth D, Ulges A, Schulze C et al (2008) Keyframe extraction for video tagging and summarization. Informatiktage 2008:45–48
16. Lin Y-L, Morariu VI, Hsu W (2015) Summarizing while recording: context-based highlight detection for egocentric videos. In: Proceedings of the IEEE international conference on computer vision workshops, pp 51–59
17. Zhao Z, Jiang S, Huang Q et al (2006) Highlight summarization in sports video based on replay detection. In: 2006 IEEE international conference on multimedia and expo. IEEE Publications, pp 1613–1616
18. Hsieh L-C, Lee C-W, Chiu T-H et al (2012) Live semantic sport highlight detection based on analyzing tweets of Twitter. In: 2012 IEEE international conference on multimedia and expo. IEEE Publications, pp 949–954
21. Ma Y-F, Lu L, Zhang H-J et al (2002) A user attention model for video summarization. In: Proceedings of the tenth ACM international conference on multimedia, pp 533–542
24. Vaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need. Adv Neural Inf Process Syst, pp 5998–6008
25. Buades A, Coll B, Morel J-M (2005) A non-local algorithm for image denoising. In: 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR'05), vol 2. IEEE Publications, pp 60–65
27. Zhang X, Zhou X, Lin M et al (2018) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6848–6856
28. Fu C-Y, Lee J, Bansal M, Berg AC (2017) Video highlight prediction using audience chat reactions. In: EMNLP
29. Mahasseni B, Lam M, Todorovic S (2017) Unsupervised video summarization with adversarial LSTM networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 202–211
31. Szegedy C, Liu W, Jia Y et al (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9
Metadata
Title: Discriminative and efficient non-local attention network for league of legends highlight detection
Authors: Qian Wan, Aruna Wang, Guoshuai Zhang, Le Liu, Jiaji Wu
Publication date: 19.05.2022
Publisher: Springer International Publishing
Published in: Complex & Intelligent Systems / Issue 6/2022
Print ISSN: 2199-4536
Electronic ISSN: 2198-6053
DOI: https://doi.org/10.1007/s40747-022-00762-1
