About this book

This book presents a comprehensive treatment of visual analysis of behaviour from computational-modelling and algorithm-design perspectives. Topics and features: covers learning group activity models, unsupervised behaviour profiling, hierarchical behaviour discovery, learning behavioural context, modelling rare behaviours, and “man-in-the-loop” active learning; examines multi-camera behaviour correlation, person re-identification, and “connecting-the-dots” for abnormal behaviour detection; discusses the Bayesian information criterion, Bayesian networks, the “bag-of-words” representation, canonical correlation analysis, dynamic Bayesian networks, Gaussian mixtures, and Gibbs sampling; investigates hidden conditional random fields, hidden Markov models, human silhouette shapes, latent Dirichlet allocation, local binary patterns, locality preserving projection, and Markov processes; explores probabilistic graphical models, probabilistic topic models, space-time interest points, spectral clustering, and support vector machines.
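
To make two of the techniques named above concrete, the following is a minimal sketch, not taken from the book, of fitting Gaussian mixtures of increasing size and selecting among them with the Bayesian information criterion; the data is synthetic.

```python
# Illustrative sketch: choosing the number of Gaussian mixture components
# with the Bayesian information criterion (BIC). Synthetic data only.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 2-D "behaviour feature" data drawn from three clusters.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([0, 0], [4, 0], [2, 3])])

bic_scores = []
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bic_scores.append(gmm.bic(X))  # lower BIC = better fit/complexity trade-off

best_k = int(np.argmin(bic_scores)) + 1
print("BIC-selected number of components:", best_k)
```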

Table of contents

Frontmatter

Introduction

Frontmatter

Chapter 1. About Behaviour

Abstract
Understanding and interpreting the behaviours of objects, and in particular those of humans, is central to social interaction and communication. In particular, visual behaviour refers to the actions or reactions of a sensory mechanism in response to a visual stimulus, for example, the navigation mechanism of nocturnal bees in dim light, or the visual search by eye movement of infants or drivers in response to their surrounding environment. If visual behaviour as a search mechanism is a perceptual function that actively scans a visual environment in order to focus attention and seek an object of interest among distracters, visual analysis of behaviour is a perceptual task that interprets the actions and reactions of objects, such as people, interacting or co-existing with other objects in a visual environment. Recognising objects visually by behaviour and activity rather than by shape and size is important to the human visual system. Since the 1970s, the computer vision community has endeavoured to bring intelligent perceptual capabilities to artificial visual sensors. This endeavour has intensified in recent years with the need to understand massive quantities of video data, with the aim of comprehending not only objects spatially in a snapshot but also their spatio-temporal relations over time in a sequence of images. In this chapter, we introduce the computational study of visual analysis of behaviour, in particular of human behaviour, and outline the opportunities and challenges involved.
Shaogang Gong, Tao Xiang

Chapter 2. Behaviour in Context

Abstract
Interpreting behaviour from object action and activity is inherently subject to the context of the visual environment within which the action and activity take place. Context embodies not only the spatial and temporal setting, but also the intended functionality of object action and activity. For instance, one recognises, often by inference, whether a hand-held object is a mobile phone or a calculator by its position relative to other body parts, such as its closeness to the ears, even if the two objects are visually similar and partially occluded by the hand. Similarly for behaviour recognition, the arrival of a bus in busy traffic is more readily inferred by looking at the behaviour of passengers at a bus stop. Computer vision research on visual analysis of behaviour embraces a wide range of studies on developing computational models and systems for interpreting behaviour in different contexts. In this chapter, we introduce a range of established topics and emerging trends in visual analysis of behaviour, from understanding facial expression, body gesture, action and intent, to the analysis of group activity, crowd and distributed behaviour, and gaining holistic awareness.
Shaogang Gong, Tao Xiang

Chapter 3. Towards Modelling Behaviour

Abstract
Automatic interpretation of object behaviour requires constructing computational models of behaviour. In particular, it is desirable to learn behaviour models automatically from visual observations. For a computer to learn a behaviour model from data, one needs to select a suitable representation, develop a robust interpretation mechanism, and adopt an effective strategy for model learning. In this chapter, we introduce four different approaches to behaviour representation from visual data: object-based, part-based, pixel-based, and event-based representations. Behavioural interpretation of activities is commonly treated as a problem of reasoning about spatio-temporal correlations and causal relationships among temporal processes in a multivariate space within which activities are represented. We introduce a statistical learning approach, in particular probabilistic graphical models, to underpin the mechanism for behavioural interpretation. A statistical behaviour model is learned from training data, and we give an overview of different learning strategies for building behaviour models, ranging from supervised, unsupervised, semi-supervised and weakly supervised learning to active learning.
Shaogang Gong, Tao Xiang
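
As a concrete illustration of the kind of probabilistic graphical model referred to above, the following minimal sketch, with toy parameters rather than anything from the book, computes the forward pass of a discrete hidden Markov model to evaluate the likelihood of an observed event sequence.

```python
# Minimal sketch of temporal reasoning with a probabilistic graphical model:
# the forward pass of a discrete hidden Markov model. Toy parameters only.
import numpy as np

pi = np.array([0.6, 0.4])                 # initial state distribution
A = np.array([[0.7, 0.3], [0.2, 0.8]])    # state transition matrix
B = np.array([[0.9, 0.1], [0.3, 0.7]])    # P(observation | state)
obs = [0, 0, 1, 1]                        # an observed event sequence

alpha = pi * B[:, obs[0]]                 # initialise forward messages
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]         # propagate belief through time

print("P(observation sequence) =", alpha.sum())
```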

Single-Object Behaviour

Frontmatter

Chapter 4. Understanding Facial Expression

Abstract
Facial expression is a natural and efficient means for humans to communicate their emotions and intentions, as communication is primarily carried out face to face. An expression can be recognised either from static face images in isolation or from sequences of face images. The former assumes that the static visual appearance of a face contains enough information to convey an expression. The latter exploits information from the facial movement generated by an expression. Computer-based automatic facial expression recognition considers two problems: face image representation and expression classification. A good representational scheme aims to derive a set of features from face images that can most effectively capture the characteristics of facial expression. The optimal features should not only minimise the visual appearance differences within the same type of expression, known as within-class variations, but also maximise the differences between two different types of expression, known as between-class variations. If indiscriminative image features are selected for a representation, it is difficult to achieve good recognition regardless of the choice of classification mechanism. In this chapter, we consider the problems of how to construct a suitable representation and design an effective classification model for both static-image-based and dynamic-sequence-based automatic facial expression recognition.
Shaogang Gong, Tao Xiang
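
A minimal sketch of one plausible pipeline for the two problems above, assuming local binary pattern (LBP) features and a support vector machine classifier, both of which appear among the book's keyword techniques; the face images and labels below are synthetic stand-ins.

```python
# Assumed pipeline, not the book's exact method: LBP histograms as the face
# representation, an SVM as the expression classifier. Synthetic data only.
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC

def lbp_histogram(image, P=8, R=1.0):
    """Represent a grey-level face image by its uniform-LBP histogram."""
    lbp = local_binary_pattern(image, P, R, method="uniform")
    hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
    return hist

rng = np.random.default_rng(0)
faces = rng.integers(0, 256, size=(40, 64, 64)).astype(np.uint8)  # stand-in faces
labels = rng.integers(0, 2, size=40)                              # two expressions

X = np.array([lbp_histogram(f) for f in faces])
clf = SVC(kernel="rbf").fit(X[:30], labels[:30])
print("predicted expressions:", clf.predict(X[30:]))
```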

Chapter 5. Modelling Gesture

Abstract
Gesture, particularly hand gesture, is an important part of human body language, conveying messages and revealing human intention and emotional state. Automatic interpretation of gesture provides an important means for interaction and communication between humans and computers, going beyond the conventional text- and graphics-based interface. Broadly speaking, a human gesture can be composed of movements of any body part, although the most relevant body parts are the face and hands. In this sense, facial expression is a special case of gesture. Facial expression and hand movement often act together to define a coherent gesture, and can be better understood if analysed together. This is especially true when interpreting human emotional state based on visual observation of body language. A gesture is a dynamic process, typically characterised by the spatio-temporal trajectory of body motion, and modelled as trajectories in a multivariate feature space. In this chapter, we describe plausible methods for tracking both individual body parts and overall body movement to construct trajectories for gesture representation. Unsupervised learning is explored for automatically segmenting a gesture movement sequence into atomic components and discovering the number of distinctive gesture classes. We also study supervised learning for modelling a gesture sequence as a stochastic process, with classification of different gesture processes learned from data. In addition, the problem of affective state recognition is considered by analysing facial expression and body gesture together to interpret the emotional state of a person.
Shaogang Gong, Tao Xiang
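
The supervised strategy described above can be illustrated with a minimal sketch, assuming the hmmlearn package and synthetic trajectories: one Gaussian hidden Markov model is trained per gesture class, and a new trajectory is classified by maximum likelihood.

```python
# Sketch only: each gesture class modelled as a stochastic process (Gaussian
# HMM); classification by highest log-likelihood. Trajectories are synthetic.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)

def toy_trajectory(drift):
    """A 2-D hand trajectory: a random walk with a class-specific drift."""
    return np.cumsum(rng.normal(drift, 0.1, size=(50, 2)), axis=0)

# Train one HMM per gesture class on a few example trajectories.
models = {}
for name, drift in [("wave", 0.2), ("point", -0.2)]:
    examples = [toy_trajectory(drift) for _ in range(5)]
    X, lengths = np.vstack(examples), [len(e) for e in examples]
    models[name] = GaussianHMM(n_components=3).fit(X, lengths)

# Classify an unseen trajectory by the model with the highest likelihood.
test = toy_trajectory(0.2)
print(max(models, key=lambda n: models[n].score(test)))
```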

Chapter 6. Action Recognition

Abstract
Understanding the meaning of actions is essential for human social communication. Automatic visual analysis of actions is important for visual surveillance, video indexing and browsing, and analysis of sporting events. Whilst facial expression and gesture are mainly associated with the movement of one or a few individual body parts, actions are typically associated with whole-body movement, for instance, walking, sitting down, or riding a horse. Actions may also involve multiple people interacting, such as hugging or shaking hands. Modelling and interpreting action is challenging because the same action performed by different people can be visually different, owing to variations in the visual appearance of the people and changes in viewing angle, position, occlusion, illumination, and shadow. Given the different visual environments under which action recognition is likely to be performed, the representation and modelling of actions differ from those for facial expression and gesture. In this chapter, we study three different approaches to modelling and interpreting actions observed under different viewing conditions. These conditions range from a relatively static scene against a non-cluttered background viewed from a not-too-distant viewpoint, to a greater distance with a non-stationary viewpoint, and finally to an unconstrained and crowded visual environment against a background of many distracters.
Shaogang Gong, Tao Xiang
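
As an illustration, hedged as one common pipeline rather than the chapter's specific methods, the sketch below quantises local descriptors (random stand-ins for space-time interest point features) against a learned codebook, forms a bag-of-words histogram per clip, and classifies with a support vector machine.

```python
# Sketch of a bag-of-words action-recognition pipeline consistent with the
# book's keywords (space-time interest points, bag-of-words, SVM).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Each "clip" yields a variable number of 32-D local descriptors (synthetic).
clips = [rng.normal(size=(rng.integers(40, 80), 32)) for _ in range(30)]
labels = rng.integers(0, 3, size=30)  # three action classes (toy labels)

codebook = KMeans(n_clusters=16, n_init=10, random_state=0).fit(np.vstack(clips))

def bag_of_words(descriptors):
    """Histogram of codebook assignments for one clip."""
    words = codebook.predict(descriptors)
    return np.bincount(words, minlength=16) / len(words)

X = np.array([bag_of_words(c) for c in clips])
clf = LinearSVC().fit(X[:24], labels[:24])
print("predicted actions:", clf.predict(X[24:]))
```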

Group Behaviour

Frontmatter

Chapter 7. Supervised Learning of Group Activity

Abstract
In a public space, actions of individuals are commonly observed as elements of group activities, and are likely to involve multiple objects interacting or co-existing in a shared space. Group activity modelling is concerned with modelling not only the actions of individual objects in isolation, but also the interactions and causal relationships among individual actions. In order to make semantic sense of visual observations of group activities, a supervised learning model aims first to segment a video stream temporally into plausible activity elements, then to construct a model from the observed visual data for describing different categories of activities, and finally to recognise a new instance of activity by classifying it into one of the known categories. To this end, three problems need to be addressed: (1) how to select visual features that best represent activities; (2) how to perform automatic video segmentation; and (3) how to model the temporal and causal correlations among objects whose actions are considered to form meaningful group activities. In this chapter, we describe a contextual-event-based group activity representation, two different methods for activity-based video segmentation, and a dynamic Bayesian network for supervised learning of an activity model whose structure is learned automatically from visual observations.
Shaogang Gong, Tao Xiang
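
Problem (2) above, activity-based video segmentation, can be illustrated by a deliberately simplified sketch: slide a window over per-frame event histograms and place a segment boundary wherever the activity statistics of adjacent windows diverge. The divergence test and threshold here are illustrative assumptions, not the chapter's method.

```python
# Simplified sketch of activity-based temporal segmentation. Synthetic data.
import numpy as np

def js_divergence(p, q, eps=1e-10):
    """Jensen-Shannon divergence between two normalised histograms."""
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def segment(event_histograms, window=10, threshold=0.2):
    """Place a boundary where adjacent windows of activity diverge."""
    boundaries = []
    for t in range(window, len(event_histograms) - window):
        before = event_histograms[t - window:t].mean(axis=0)
        after = event_histograms[t:t + window].mean(axis=0)
        if js_divergence(before, after) > threshold:
            boundaries.append(t)
    return boundaries

# Two synthetic activity regimes produce boundaries near frame 50.
rng = np.random.default_rng(0)
frames = np.vstack([rng.dirichlet(np.ones(6), 50),
                    rng.dirichlet(np.r_[np.ones(2) * 20, np.ones(4) * 0.2], 50)])
print(segment(frames))
```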

Chapter 8. Unsupervised Behaviour Profiling

Abstract
Given a large quantity of unprocessed video of object activities, the goal of automatic behaviour profiling is to learn a model that is capable of detecting unseen abnormal behaviour patterns whilst recognising novel instances of expected normal behaviour patterns. In this context, an anomaly is defined as an atypical behaviour pattern that is not represented by sufficient examples in previous observations. Behaviour profiling is performed by unsupervised learning, and anomaly detection is treated as a binary classification problem. One of the main challenges for a binary classification model is to differentiate a true anomaly from outliers that give false positives. In this chapter, we consider a clustering model that discovers the intrinsic grouping of behaviour patterns. The method does not require manual data labelling for either feature extraction or discovery of grouping. This is crucial because manual labelling of behaviour patterns is often impractical given the vast amount of video data, and is prone to inconsistency and error. The method performs incremental learning to cope with changes of behavioural context. It also detects anomalies on-line, so that a decision on whether a behaviour pattern is normal is made as soon as sufficient visual evidence is collected, without waiting for the completion of the observed pattern.
Shaogang Gong, Tao Xiang
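
The on-line decision idea in the last sentence can be sketched minimally as follows, assuming a Gaussian mixture fitted to unlabelled normal data as the profile model and synthetic features; the threshold is an illustrative assumption.

```python
# Sketch: score an unfolding behaviour pattern frame by frame under a model
# of normal behaviour and decide as soon as evidence suffices. Synthetic data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
normal_model = GaussianMixture(n_components=3, random_state=0).fit(
    rng.normal(size=(500, 4)))           # unlabelled "normal" training data

def online_anomaly(frames, threshold=-8.0, min_frames=5):
    """Flag a pattern as anomalous once mean log-likelihood drops too low."""
    scores = []
    for t, frame in enumerate(frames, start=1):
        scores.append(normal_model.score_samples(frame[None, :])[0])
        if t >= min_frames and np.mean(scores) < threshold:
            return "anomalous", t        # early decision, pattern incomplete
    return "normal", len(frames)

print(online_anomaly(rng.normal(loc=5.0, size=(30, 4))))
```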

Chapter 9. Hierarchical Behaviour Discovery

Abstract
Behaviour of groups of objects observed in a crowded public space is typically complex and uncertain. What is considered ‘subjectively interesting behaviour’ to a human observer can be influenced by a wide variety of factors including: (1) the activity of a single object over time; (2) the correlated spatial states of multiple objects, for example, a piece of abandoned luggage is defined by separation from its owner; and (3) higher-order spatial and temporal correlations among multiple entities, for instance, traffic flow at a road intersection has a particular spatio-temporal order, beyond mere co-occurrence, dictated by traffic lights. Constructing computational models that are both flexible and accurate in representing such complex and uncertain characteristics of behaviour is challenging. A dynamic topic model possesses unique computational attributes that make it an attractive framework for addressing these challenges. In this chapter, we describe a Markov clustering topic model designed for unsupervised modelling and on-line processing of multi-object spatio-temporal behaviours in crowded public scenes. A Markov clustering topic model draws on machine learning theories of probabilistic topic models and dynamic Bayesian networks to achieve robust hierarchical modelling of behaviours and their dynamics.
Shaogang Gong, Tao Xiang
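
The Markov clustering topic model itself is beyond a short sketch, but its probabilistic-topic-model ingredient can be illustrated with ordinary latent Dirichlet allocation (one of the book's keyword techniques) applied to synthetic bag-of-visual-words counts from video clips.

```python
# Simplified static stand-in for the MCTM: plain LDA over bag-of-visual-words
# counts, discovering co-occurring local motion events as "topics".
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# 200 clips ("documents"), 50 quantised motion events ("words"), toy counts.
counts = rng.poisson(1.0, size=(200, 50))

lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(counts)
clip_topics = lda.transform(counts)       # per-clip topic mixtures
print("dominant topic of clip 0:", clip_topics[0].argmax())
```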

Chapter 10. Learning Behavioural Context

Abstract
Visual context is the environment, background, and settings within which objects and associated behaviours are observed visually. A semantic interpretation of behaviour depends largely on knowledge of the spatial and temporal context defining where and when it occurs, and the correlational context specifying expectations about the behaviours of other correlated objects in its vicinity. In this chapter, we address the problem of how to computationally model behavioural context for context-aware behavioural anomaly detection in a visual scene. We consider models for learning three types of behavioural context: spatial context, correlational context, and temporal context. Behaviour spatial context provides situational awareness about where a behaviour is likely to take place. A public space can often be segmented by activities into a number of distinctive regions, called “semantic regions”. Behaviours of certain characteristics are expected in one region but differ from those observed in other regions. Behaviour correlational context specifies how the interpretation of a behaviour can be affected by the behaviours of other objects, either nearby in the same semantic region or further away in other regions. Object behaviours in a complex scene are often correlated and need to be interpreted together rather than in isolation. Behaviour temporal context provides information on when different behaviours are expected to happen, both inside each semantic region and across regions.
Shaogang Gong, Tao Xiang
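
Learning spatial context can be sketched as clustering scene cells by their activity statistics into “semantic regions”; the sketch below uses spectral clustering (a keyword technique of the book) on synthetic per-cell activity histograms and is illustrative only.

```python
# Sketch: grid cells described by activity histograms, clustered into
# "semantic regions" with spectral clustering. Synthetic statistics only.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
# Toy scene: 100 cells; two halves carry different activity statistics.
left = rng.dirichlet(np.ones(8) * 5, size=50)
right = rng.dirichlet(np.r_[np.ones(4) * 20, np.ones(4) * 0.5], size=50)
cells = np.vstack([left, right])

regions = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                             n_neighbors=10, random_state=0).fit_predict(cells)
print("cells per discovered region:", np.bincount(regions))
```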

Chapter 11. Modelling Rare and Subtle Behaviours

Abstract
One of the most desired capabilities for automated visual analysis of behaviour is the identification of rarely occurring and subtle behaviours of genuine interest. This is of practical value because the behaviours of greatest interest for detection are normally rare, for example civil disobedience, shoplifting, and driving offences, and may be intentionally disguised to be visually subtle compared with more obvious ongoing behaviours in a public space. Rare behaviours by definition have few examples for a model to learn from. The most interesting rare behaviours are often also subtle and do not exhibit an abundance of strong visual features in the data that describe them. In this chapter, we consider the problem of learning behaviour models from rare and subtle examples. By ‘rare’, we mean as few as one example. By ‘subtle’, we mean weak visual evidence: there may be only a few pixels associated with a behaviour of interest captured in video data, and only a few more pixels differentiating a rare behaviour from a typical one. To eliminate the prohibitive cost, in both time and inconsistency, of the manual labelling required by traditional supervised methods, we describe a weakly supervised framework in which a user need not, and may not be able to, explicitly locate the target behaviours of interest in the training video data.
Shaogang Gong, Tao Xiang
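
As a generic illustration of weak supervision, and an assumption rather than the chapter's framework, the sketch below uses only video-level labels with no localisation of the target behaviour, scoring a video by its highest-scoring clip in multiple-instance fashion.

```python
# Generic weak-supervision sketch: video-level labels only; a video is
# scored by the maximum over its clips. All data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def toy_video(contains_rare):
    """A bag of 20 clip features; at most one clip carries weak evidence."""
    clips = rng.normal(size=(20, 8))
    if contains_rare:
        clips[rng.integers(20)] += np.r_[3.0, np.zeros(7)]  # subtle signal
    return clips

videos = [toy_video(i % 2 == 1) for i in range(40)]
video_labels = np.array([i % 2 for i in range(40)], dtype=bool)

# Naive weak supervision: propagate each video-level label to all its clips.
clf = LogisticRegression(max_iter=1000).fit(
    np.vstack(videos), np.repeat(video_labels, 20))

# Score a video by its highest-scoring clip (a multiple-instance heuristic).
scores = np.array([clf.predict_proba(v)[:, 1].max() for v in videos])
flags = scores > np.median(scores)
print("video-level accuracy:", np.mean(flags == video_labels))
```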

Chapter 12. Man in the Loop

Abstract
Video data captured from a public space is typically characterised by a highly imbalanced behaviour class distribution. Most of the captured examples reflect normal behaviours; unusual behaviours, whether rare or abnormal, constitute only a small proportion of the observed data. Whilst an unsupervised learning based model can be constructed to detect unusual behaviours through a process of outlier detection, an outlier detection based model is fundamentally limited in a number of ways: (1) the model has difficulty detecting subtle unusual behaviours; (2) the model does not exploit information from detected unusual behaviours, so image noise and errors in feature extraction can be mistaken for genuine unusual behaviours of interest, giving false alarms in detection; and (3) a large number of rare but benign behaviours give rise to false alarms in unusual behaviour detection. Human knowledge can be exploited in a different way to address this problem. Instead of supervising the input to model learning by labelling the training data, human knowledge can be utilised more effectively by giving selective feedback on the model output. In this chapter, we describe an active learning model for seeking human feedback on model-selected queries. The query selection criteria are internal to the model rather than decided by the human observer. This ensures not only that the most relevant human feedback is sought, but also that the model is not subjected to human bias unsupported by data.
Shaogang Gong, Tao Xiang
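
The query selection principle described above can be sketched with uncertainty sampling, a standard active learning criterion used here as an illustrative stand-in; the human observer is simulated by an oracle on synthetic data.

```python
# Sketch of model-internal query selection: the model queries the example it
# is least certain about, not one chosen by the observer. Synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
pool = rng.normal(size=(300, 5))
oracle = (pool[:, 0] + 0.3 * rng.normal(size=300) > 0).astype(int)  # "human"

# Start with a few labelled examples from each class.
labelled = list(np.where(oracle == 0)[0][:5]) + list(np.where(oracle == 1)[0][:5])

for _ in range(20):                        # 20 rounds of human feedback
    clf = LogisticRegression().fit(pool[labelled], oracle[labelled])
    proba = clf.predict_proba(pool)[:, 1]
    # Query the pool example the current model is least certain about.
    candidates = [i for i in np.argsort(np.abs(proba - 0.5)) if i not in labelled]
    labelled.append(int(candidates[0]))

print("accuracy after 20 active queries:", round(clf.score(pool, oracle), 3))
```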

Distributed Behaviour

Frontmatter

Chapter 13. Multi-camera Behaviour Correlation

Abstract
For gaining a more holistic sense of situational awareness, a primary goal of a multi-camera system is to provide a more complete record and to survey a trail of object activities in wide-area spaces, both individually and collectively. This allows for a global interpretation of objects’ latent behaviour patterns and intent. In a multi-camera system, disjoint cameras with non-overlapping fields of view are more prevalent, owing to the desire to maximise spatial coverage of a wide-area scene. However, global behaviour analysis across multiple disjoint cameras is hampered by a number of obstacles, such as inter-camera visual appearance variation, unknown and arbitrary inter-camera gaps, lack of visual detail and crowdedness, and visual context variation. To overcome these obstacles, a key to visual analysis of multi-camera behaviour lies in how well a model can correlate partial observations of object behaviours from different locations in order to carry out ‘joined-up reasoning’. In this chapter, we describe a framework for modelling a joined-up representation of a synchronised global space, within which local activities from different observational viewpoints can be analysed and interpreted holistically. The focus is on developing a suitable mechanism capable of discovering and quantifying unknown correlations in temporal ordering and temporal delays among different camera views.
Shaogang Gong, Tao Xiang
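
One concrete instance of quantifying inter-camera temporal delay can be sketched by cross-correlating the activity time series of two disjoint views; the series below are synthetic, with camera B observing activity 15 time steps after camera A.

```python
# Sketch: estimate the unknown delay between two disjoint camera views by
# cross-correlating their activity time series. Synthetic series only.
import numpy as np

rng = np.random.default_rng(0)
activity_a = rng.poisson(2.0, size=500).astype(float)   # camera A activity
true_delay = 15
activity_b = np.roll(activity_a, true_delay) + rng.normal(0, 0.5, size=500)

a = activity_a - activity_a.mean()
b = activity_b - activity_b.mean()
xcorr = np.correlate(b, a, mode="full")           # correlation at all lags
lags = np.arange(-len(a) + 1, len(a))
print("estimated inter-camera delay:", lags[np.argmax(xcorr)])  # ~15
```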

Chapter 14. Person Re-identification

Abstract
A fundamental task for a distributed multi-camera system is to associate people across camera views at different locations and times. In a crowded and uncontrolled environment observed by cameras from a distance, person re-identification by biometrics such as face and gait is infeasible due to insufficient image detail and arbitrary viewing conditions. Visual appearance features, extracted mainly from clothing, are intrinsically weak for matching people; for instance, most people in public spaces wear dark clothes in winter. A person’s appearance can also change significantly between camera views given large changes in view angle, lighting, background clutter, and occlusion. This can result in different people appearing more alike across camera views than the same person does. In this chapter, we describe a method for learning the optimal matching distance criterion, regardless of feature representation. This approach to person re-identification shifts the burden of computation from finding some universally optimal imagery features to discovering a matching mechanism that adaptively selects different features that are locally optimal for each and every pair of matches. Moreover, behaviour correlations hold useful spatio-temporal contextual information about expectations on where and when a person may re-appear in a networked visible space. This information is utilised to improve matching accuracy through context-aware search.
Shaogang Gong, Tao Xiang
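
A minimal matching sketch, under assumed appearance features rather than the chapter's learned criterion: probe features from one camera are ranked against a gallery from another by Euclidean distance, and performance is read off as the rank of the true match.

```python
# Re-identification sketch with synthetic colour-histogram-style features;
# the book's point is that the matching criterion itself should be learned.
import numpy as np

rng = np.random.default_rng(0)
n_people, dim = 50, 16
gallery = rng.dirichlet(np.ones(dim), size=n_people)          # camera B
probes = gallery + rng.normal(0, 0.02, size=gallery.shape)    # camera A views

def rank_of_true_match(probe_id):
    """Position of the true identity in the distance-sorted gallery."""
    dists = np.linalg.norm(gallery - probes[probe_id], axis=1)
    return int(np.where(np.argsort(dists) == probe_id)[0][0]) + 1

ranks = [rank_of_true_match(i) for i in range(n_people)]
print("rank-1 matching rate:", np.mean(np.array(ranks) == 1))
```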

Chapter 15. Connecting the Dots

Abstract
One of the ultimate goals for automated visual analysis of distributed object behaviour is to bring about a coherent understanding of partially observed, uncertain sensory data from the past and present, and to ‘connect the dots’ in composing a big picture of global situational awareness for explaining away anomalies and discovering hidden patterns of significance. To that end, we consider the computational task of, and plausible models for, modelling global behaviours and detecting global abnormal activities across distributed and disjoint multiple cameras. For constructing a global behaviour model to detect holistic anomalies, besides model sensitivity and robustness, model tractability and scalability are of great importance: a typical distributed camera network may consist of dozens to hundreds of cameras, many of which cover a wide-area scene of distinctive semantic activity regions. In this chapter, we describe three different approaches to building a model for discovering and describing global behaviour patterns emerging from a distributed network of local activity regions. We examine their characteristics in coping with uncertainty and complexity in the temporal delays between activities observed in different camera views, and in maintaining a manageable computational cost and memory consumption.
Shaogang Gong, Tao Xiang

Backmatter
