A multisource fusion framework driven by user-defined knowledge for egocentric activity recognition

Yu, Haibin; Jia, Wenyan; Li, Zhen; Gong, Feixiang; Yuan, Ding; Zhang, Hong; Sun, Mingui

doi:10.1186/s13634-019-0612-x

Research
Open access
Published: 22 February 2019

A multisource fusion framework driven by user-defined knowledge for egocentric activity recognition

Haibin Yu ORCID: orcid.org/0000-0002-0900-5276¹,
Wenyan Jia²,
Zhen Li³,
Feixiang Gong⁴,
Ding Yuan⁵,
Hong Zhang⁵ &
…
Mingui Sun⁶

EURASIP Journal on Advances in Signal Processing volume 2019, Article number: 14 (2019) Cite this article

2386 Accesses
10 Citations
Metrics details

Abstract

Recently, egocentric activity recognition has attracted considerable attention in the pattern recognition and artificial intelligence communities because of its widespread applicability to human systems, including the evaluation of dietary and physical activity and the monitoring of patients and older adults. In this paper, we present a knowledge-driven multisource fusion framework for the recognition of egocentric activities in daily living (ADL). This framework employs Dezert–Smarandache theory across three information sources: the wearer’s knowledge, images acquired by a wearable camera, and sensor data from wearable inertial measurement units and GPS. A simple likelihood table is designed to provide routine ADL information for each individual. A well-trained convolutional neural network is then used to produce a set of textual tags that, along with routine information and other sensor data, are used to recognize ADLs based on information theory-based statistics and a support vector machine. Our experiments show that the proposed method accurately recognizes 15 predefined ADL classes, including a variety of sedentary activities that have previously been difficult to recognize. When applied to real-life data recorded using a self-constructed wearable device, our method outperforms previous approaches, and an average accuracy of 85.4% is achieved for the 15 ADLs.

1 Introduction

In recent years, a variety of camera-based smart wearable devices have emerged in addition to smart watches and wristbands, such as Google Glass, Microsoft SenseCam, and Narrative. These wearables usually contain not only a camera, but also other sensors such as inertial measurement units (IMUs), global positioning system (GPS), temperature sensors, light sensors, barometers, and physiological sensors. These sensors automatically collect video/image, motion/orientation, environmental, and health data. Because these data are collected from the viewpoint of the wearer, they are called egocentric or first-person data. Tools for the automatic analysis and interpretation of egocentric data have been developed and applied to healthcare [1, 2], rehabilitation [3], smart homes/offices [4], sports [5], and security monitoring [6]. Egocentric activity recognition has now become a major topic of research in the fields of pattern recognition and artificial intelligence [7, 8].

Traditional methods of egocentric activity recognition often utilize motion sensor data from the IMU only and process these data using conventional classification techniques [9]. However, the performance of motion-based methods depends on the location of the IMU sensor on the body, and the classification accuracy tends to be lower when used to distinguish more complex activities in daily living (ADL), especially for certain sedentary activities. A wearable camera can provide more ADL information than motion sensors alone. Therefore, vision-based activity recognition using a wearable camera has become the focus of research in the field of egocentric activity recognition [10, 11].

In recent years, with the continuous development of the deep learning framework, the accuracy of image/video recognition has been improved greatly, and numerous vision-based activity recognition methods, such as deep learning, have emerged [12,13,14]. It has been reported that deep learning achieved a performance improvement of roughly 10% over the traditional trajectory tracking methods [14]. Although there has been significant progress in egocentric ADL recognition, the performance of vision-based methods is still subject to a number of constraints, such as the location of the wearable camera on the human body, image quality, variations in lighting conditions, occlusion, and illumination. In practical applications, no single sensor can be applied for all possible conditions. A common practice to avoid the risk of misrecognition by a single sensor is to fuse multiple recognition results for the same target from different sensors. Therefore, efforts have been made to combine vision and other sensor data for egocentric ADL recognition. For example, egocentric video and IMU data captured synchronously by Google Glass were used to recognize a number of ADL events [15]. Multiple streams of data were processed using convolutional neural networks (CNNs) and long- and short-term memory (LSTM), and the results were fused by maximum pooling. The average accuracy for 20 distinct ADLs reached 80.5%, whereas using individual video and sensor data only yielded accuracies of 75% and 49.5%, respectively. In [16], the dense trajectories of egocentric videos and temporally enhanced trajectory-like features of sensor data were extracted separately and then fused using the multimodal Fisher vector approach. The average recognition accuracy after fusion was 83.7%, compared to 78.4% for video-only and 69.0% for sensor-only data. These results show that, for egocentric ADL recognition, it is beneficial to integrate IMU sensors and cameras at both the hardware and algorithm levels.

Some commonly used multisource fusion methods include Bayesian reasoning, fuzzy-set reasoning, expert systems, and evidence theory composed of Dempster–Shafer evidence theory (DST) [17] and Dezert–Smarandache theory (DSmT) [18]. Among these methods, DST and DSmT have a simple form of reasoning and can represent imprecise and uncertain information using basic belief assignment functions, thus mimicking human thinking in uncertainty reasoning. By generalizing the discernment framework and proportionally redistributing the conflicting beliefs, DSmT usually outperforms DST when dealing with multisource fusion cases with conflicting evidence sources.

In egocentric ADL recognition using evidence theory, an activity model is often required to convert the activity data or features from different sources to the basic belief assignment (BBA). Generally, activity models can be divided into two types: data-driven and knowledge-driven [19]. Most ADLs have certain regularities because they occur at a relatively fixed time and place, and interact with a fixed combination of objects. As a result, abundant information about when, where, and how ADLs occur can be used to establish a knowledge base. Therefore, for ADL recognition, the knowledge-driven model is more intuitive and potentially powerful. Although no special knowledge-driven model for egocentric ADL recognition currently exists, some knowledge-driven models have been established in fields such as ADL recognition in smart homes, e.g., descriptive logic model [20], event calculus model [21], and activity ontology model [22]. Although these models offer semantic clarity and logical simplicity, they are usually complex. Users must contact the developers to convert their own daily routines into model parameters. Considering that this kind of model is best created by the wearers themselves, the current methods for knowledge representation require substantial simplification to improve their usability and adaptability for egocentric ADL recognition.

In this study, we propose a new knowledge-driven multisource fusion framework for egocentric ADL recognition and apply it to egocentric image sequences and other sensor data captured by a self-developed chest-worn device (eButton) [23] for diet and physical activity assessment. The main contributions of this study are as follows:

(1)
A knowledge-driven multisource fusion framework based on DSmT is established for the fusion of prior knowledge, vision-based results, and sensor-based results. This framework enables the accurate recognition of up to 15 kinds of ADLs, including a variety of sedentary activities that are hard to recognize using traditional motion-based methods, e.g., computer use, meetings, reading, telephone use, watching television, and writing.
(2)
The proposed knowledge-driven ADL model can be established by the device user. Previously, users were required to consult with an expert who could represent the user’s life experience quantitatively using certain index values. Our framework simplifies this process significantly, allowing individuals to express their ADL routines using a set of simple association tables.
(3)
A novel activity recognition algorithm based on egocentric images is proposed. With the help of “bags of tags” determined by CNN-based automatic image annotation, the complex image classification task is reduced to a text classification problem. Furthermore, the entropy-based term frequency-inverse document frequency (TF-IDF) algorithm is used to perform feature extraction and ADL recognition.

The remainder of this paper is organized as follows. Our methods for ADL recognition are described in detail in Section 2. A series of experimental results demonstrating the performance of the proposed framework are presented in Section 3. The comparison with existing methods is shown in Section 4. Finally, we conclude this paper in Section 5 by summarizing our approach and results and discussing some directions for future research.

2 Methods

Our multisource ADL recognition method is illustrated in Fig. 1. Conceptually, it consists of four main components: (1) basic information about the ADL routines of an individual (the user of the wearable device) is acquired using a “condition–activity” association table, (2) a CNN-based automatic image annotation pre-classifies the textual results using an entropy representation, (3) a set of motion and GPS data is processed and pre-classified using a support vector machine (SVM), and (4) a final classification is performed analytically by fusing the pre-classified results represented in terms of BBAs based on the DSmT framework.

2.1 BBA of user knowledge

It is widely accepted that “the person who knows you the best is yourself,” although this is not universally true (e.g., a doctor may know better regarding illnesses). Nevertheless, people know their own lifestyle and ADL routines far better than other people or a computer. Therefore, we develop a knowledge-driven ADL model that can be established by the user of a wearable device. Previously, such a model would require the person to consult an expert who represents the user’s life experience quantitatively using certain index values [20,21,22]. In our framework, we simplify this process significantly to allow individuals to express their ADL routines using a set of simple association tables.

Let us consider r sources of information ɛ₁, ɛ₂, …, ɛ_r. As each source may contain multiple information entities, each source ɛ_i is represented as a vector. With this definition, we represent pairwise relationships (ɛ_i, ɛ_j) from the r sources as a rectangular matrix. The matrix entry in row ɛ_i and column ɛ_j expresses the strength (a positive number) of the relation between these two elements. As the relationship between the two elements is not commutative, i.e., A leads to B does not imply B leads to A, the relationship matrix for (ɛ_i, ɛ_j) is generally asymmetric. As an important special case, (ɛ_i, ɛ_j) for i = j represents the relationships among the elements of ɛ_i. According to Zintik and Zupan [24], all (ɛ_i, ɛ_j) can be tiled into a large, sparse global matrix.

As our knowledge-driven model runs under the framework of the Dezert–Smarandache theory, all activity-related conditions (e.g., time, place, order of occurrence) must be specified through the construction of numerical BBAs. Thus, if we view the ADLs and the conditions as different information sources, we can use the above theoretical framework to represent ADLs in relationships with certain conditions, including their time, place, and order of occurrence, and then fill the pairwise matrices (or tables) numerically. In our application, we require a simple and intuitive form that can be used by individuals. Therefore, we design each matrix as an association table containing integer values from 0 (impossible) to 10 (assured). For example, to represent one’s ADLs at different clock times, a hypothetical individual’s time–activity table is presented in Table 1. In this table, the wearer can adjust the time period according to his/her daily routine, especially activities with relatively clear start times, such as getting up, starting work, leaving work, and sleeping. Multiple time–activity tables may be required for weekdays and weekends/holidays (see the examples in Appendixes 1 and 2). Similarly, a location–activity table and an activity transition table (i.e., a table specifying the previous activity and the current activity) can be designed to further enrich the knowledge-driven model. Our experiments indicate that such tables can be completed quickly with little training.

Table 1 Sample time–activity table

A multisource fusion framework driven by user-defined knowledge for egocentric activity recognition

Abstract

1 Introduction

2 Methods

2.1 BBA of user knowledge

2.2 BBA of images

2.2.1 Semantic feature extraction by CNN

2.2.2 BBA construction from BoTs

2.3 BBA of IMU and GPS sensors

2.4 Hierarchical fusion of knowledge, image, and sensor data by DSmT

3 Experimental results

3.1 Experimental setup and data acquisition

3.2 Experimental results

3.2.1 ADL recognition results using images

3.2.2 ADL recognition results using motion sensors

3.2.3 Fusion of three data sources using 2-L HFNDCS

3.2.4 Fusion results of the image-based method and the sensor-based method using simplified 2-L HFNDCS

4 Comparison and discussion

4.1 Performance comparison on their respective datasets

4.2 Discussion of the comparison on the respective datasets

4.3 Performance comparison on the same dataset

4.4 Discussion of the comparison on the same dataset

5 Conclusion

Abbreviations

References

Acknowledgements

Funding

Availability of data and materials

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Publisher’s Note

Appendices

Appendix 1

1.1 Detailed time–activity tables for wearer 1

Appendix 2

1.1 Detailed time–activity tables for wearer 2

Rights and permissions

About this article

Cite this article

Share this article

Keywords