About this book

This edited book is one of the first to describe how Autonomous Virtual Humans and Social Robots can interact with real people and be aware of the surrounding world using machine learning and AI. It includes:

· Many algorithms related to awareness of the surrounding world, such as the recognition of objects and the interpretation of various sources of data provided by cameras, microphones, and wearable sensors

· Deep Learning Methods to provide solutions to Visual Attention, Quality Perception, and Visual Material Recognition

· How Face Recognition and Speech Synthesis will replace the traditional mouse and keyboard interfaces

· Semantic modeling and rendering, and how these domains play an important role in Virtual and Augmented Reality applications.

Intelligent Scene Modeling and Human-Computer Interaction explains how to understand the composition of very complex scenes and how to build them, and emphasizes the semantic methods needed to interact with them intelligently. It offers readers a unique opportunity to comprehend the rapid changes and continuous development in the field of Intelligent Scene Modeling.



Intelligent Scene Modeling (ISM)


Chapter 1. Introduction

Technological advances have a tremendous impact on everyone’s life and are interwoven into everything we do. The computer, a significant invention of the last century, has quietly morphed into many devices that have rapidly become indispensable to our daily life, such as phones, cameras, TVs, and autonomous vehicles.
Nadia Magnenat Thalmann, Jian Jun Zhang, Manoj Ramanathan, Daniel Thalmann

Chapter 2. Object Detection: State of the Art and Beyond

As one of the fundamental problems of scene understanding and modeling, object detection has attracted extensive attention in the research communities of computer vision and artificial intelligence. Recently, inspired by the success of deep learning, various deep neural network-based models have been proposed and have become the de facto solution for object detection. In this chapter, we present an overview of object detection techniques in the era of deep learning. We first formulate the problem of object detection in the framework of deep learning, and then present two mainstream architectures, i.e., the one-stage model and the two-stage model, along with widely used detectors such as Fast R-CNN, YOLO, and their variants. Lastly, we discuss possible improvements to current methods and outline trends for further study.
Hanhui Li, Xudong Jiang, Nadia Magnenat Thalmann
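
As a rough illustration of the detection setting summarized above: detectors of both families typically output scored bounding boxes, and a common post-processing step (not described in this abstract) is greedy non-maximum suppression based on intersection-over-union. The sketch below assumes boxes given as (x_min, y_min, x_max, y_max) tuples; the names `iou` and `nms` and the 0.5 threshold are illustrative choices, not taken from the chapter.

```python
from typing import List, Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Width and height of the overlap region (zero if the boxes are disjoint).
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = overlap_w * overlap_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    # Visit boxes from highest to lowest confidence score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

For example, given two heavily overlapping boxes and one distant box, `nms` keeps the higher-scoring box of the overlapping pair together with the distant one.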

Chapter 3. NBNN-Based Discriminative 3D Action and Gesture Recognition

Non-parametric models such as Naive Bayes Nearest Neighbor (NBNN) (Boiman et al., Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, pp 1–8, 2008) have achieved great success in object recognition. This success motivates us to develop a non-parametric model for recognizing skeleton-based action and gesture sequences. In our proposed method, each action/gesture instance is represented by a set of temporal stage descriptors composed of features from spatial joints in a 3D pose. Considering the sparsity of the joints involved in certain actions/gestures and the redundancy of stage descriptors, we choose Principal Component Analysis (PCA) as a pattern mining tool to pick out informative joints with high variance. To further boost the discriminative ability of the low-dimensional stage descriptor, we introduce the idea proposed in Yuan et al. (2009) to help the discriminative variation patterns learnt by PCA to emerge. Experiments on two benchmark datasets, the MSR-Action 3D dataset and the SBU Interaction dataset, show the efficiency of the proposed method. Evaluation on the SBU Interaction dataset shows that our method can achieve better performance than state-of-the-art results obtained with sophisticated models such as deep learning.
Junwu Weng, Xudong Jiang, Junsong Yuan
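
The basic NBNN decision rule referred to above can be sketched as follows: each query descriptor is matched to its nearest neighbour in every class, and the class with the smallest summed squared distance wins. This is a minimal sketch of plain NBNN only; it omits the chapter's PCA-based joint selection and discriminative weighting, and the function names and toy 2D descriptors are hypothetical.

```python
from typing import Dict, List, Sequence

Descriptor = Sequence[float]


def squared_distance(a: Descriptor, b: Descriptor) -> float:
    """Squared Euclidean distance between two descriptors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nbnn_classify(query: List[Descriptor],
                  class_descriptors: Dict[str, List[Descriptor]]) -> str:
    """Return the label whose descriptor pool best explains the query set."""
    totals: Dict[str, float] = {}
    for label, pool in class_descriptors.items():
        # Sum, over all query descriptors, the distance to the nearest
        # descriptor belonging to this class.
        totals[label] = sum(
            min(squared_distance(d, p) for p in pool) for d in query
        )
    # The predicted class minimizes the total descriptor-to-class distance.
    return min(totals, key=lambda label: totals[label])
```

In the chapter's setting, the query set would hold the stage descriptors of one action or gesture instance, and each pool would hold the stage descriptors of all training instances of one class.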

Chapter 4. Random Forests with Optimized Leaves for Hough-Voting

Random forest-based Hough-voting techniques are important in numerous computer vision problems such as pose estimation and gesture recognition. In particular, the voting weights of leaf nodes in random forests have a big impact on performance. We propose to improve Hough-voting with random forests by learning optimized weights of leaf nodes during training. We investigate two formulations of the leaf weight optimization problem, applying either L2 or L0 constraints to the weights. We show that with additional L0 constraints, we are able to simultaneously obtain optimized leaf weights and prune unreliable leaf nodes in the forests, at the cost of additional computation during training. We have applied the proposed algorithms to a number of different problems in computer vision, including hand pose estimation, head pose estimation, and hand gesture recognition. The experimental results show that with L2-regularization, regression and classification accuracy are improved considerably. Further, with L0-regularization, many unreliable leaf nodes are suppressed and the tree structure is compressed considerably, while the performance is still comparable to that of L2-regularization.
Hui Liang, Junsong Yuan
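
To make the role of leaf weights concrete, the sketch below shows weighted Hough voting in one dimension: each reached leaf casts a vote for a position, votes are accumulated into bins in proportion to the leaf weight, and the estimate is the centre of the heaviest bin. A weight of zero stands in for a leaf pruned by an L0-style constraint. This is an assumption-laden illustration of the voting step only; how the weights are learned is not shown, and `hough_vote`, the vote layout, and the bin width are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed vote layout for this sketch: (voted position, leaf weight).
Vote = Tuple[float, float]


def hough_vote(votes: List[Vote], bin_width: float = 1.0) -> float:
    """Accumulate weighted votes and return the centre of the heaviest bin."""
    accumulator: Dict[int, float] = defaultdict(float)
    for position, weight in votes:
        if weight == 0.0:
            continue  # a pruned leaf casts no vote
        accumulator[int(position // bin_width)] += weight
    if not accumulator:
        raise ValueError("no leaf cast a vote")
    # The bin with the largest accumulated weight gives the estimate.
    best_bin = max(accumulator, key=lambda b: accumulator[b])
    return (best_bin + 0.5) * bin_width
```

With this rule, two moderately weighted leaves that agree on a position can outvote a single heavily weighted leaf elsewhere, which is why the choice of weights affects the final estimate.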

Chapter 5. Modeling Human Perception of 3D Scenes

The ultimate goal of computer graphics is to create images for viewing by people. Artificial intelligence-based methods should therefore consider what is known about human visual perception in the vision science literature. Modeling visual perception of 3D scenes requires the representation of several complex processes. In this chapter, we survey artificial intelligence and machine learning-based solutions for modeling human perception of 3D scenes. We also suggest future research directions. The topics we cover include modeling human visual attention, 3D object quality perception, and material recognition.
Zeynep Cipiloglu Yildiz, Abdullah Bulbul, Tolga Capin

Chapter 6. Model Reconstruction of Real-World 3D Objects: An Application with Microsoft HoloLens

Digital reconstruction of 3D real-world objects has long been a fundamental requirement in computer graphics and vision for virtual reality (VR) and mixed-reality (MR) applications. In recent years, the availability of portable, low-cost sensing devices such as the Kinect sensor, capable of acquiring RGB-Depth data in real time, has brought about profound advances in object reconstruction approaches. In this chapter, we present our research on using the RGB-Depth sensors embedded in off-the-shelf MR devices such as the Microsoft HoloLens for object model reconstruction. As MR devices are primarily designed to use their RGB-Depth sensors for environmental mapping (via mesh geometry), they lack the capability for object reconstruction. We fill this gap by proposing a pipeline for an automated ray-casting-based texture mapping approach to the object mesh geometry acquirable from HoloLens. Our preliminary results from reconstructions of real-world objects of different sizes and shapes demonstrate that our approach produces acceptable reconstruction quality with efficient computation.
Younhyun Jung, Yuhao Wu, Hoijoon Jung, Jinman Kim

Chapter 7. Semantic Modeling and Rendering

3D objects are receiving more attention in various applications, including computer games, movies, urban planning, and training. For modeling large-scale 3D scenes, semantic modeling methods are gaining attention as a way to handle the complexity of the modeling process efficiently. Semantic modeling can be used to model a large-scale scene through interactive editing of the scene and automatic generation of complex environments. Semantic models can also be used to support intelligent behaviour in virtual scenes, semantic rendering, and adaptive visualization of complex 3D objects. In this chapter, methods to create semantic models from real objects are discussed. In addition, application examples of semantic models are illustrated with a large-scale urban scene. Semantic modeling and rendering provide the ability to create interactive and responsive virtual worlds by considering semantics in addition to shape appearance.
J. Divya Udayan, HyungSeok Kim

Chapter 8. Content-Aware Semantic Indoor Scene Modeling from a Single Image

Digitalizing indoor scenes into a 3D virtual world enables people to visit and roam in their daily-life environments through remote devices. However, reconstructing indoor geometry with enriched semantics (e.g., the room layout, object categories, and support relationships) requires computers to parse and holistically understand the scene context, which is challenging considering the complexity and clutter of our living surroundings. With the rapid development of deep learning techniques, modeling indoor scenes from single RGB images has become feasible. In this chapter, we introduce an automatic method for semantic indoor scene modeling based on deep convolutional features. Specifically, we decouple the task of indoor scene modeling into different hierarchies of scene understanding subtasks to parse semantic and geometric contents from scene images (i.e., object masks, scene depth map, and room layout). On top of these semantic and geometric contents, we deploy a data-driven support relation inference to estimate the physical contact between indoor objects. Under the support context, we adopt an image-CAD matching strategy to retrieve an indoor scene, proceeding from global searching to local fine-tuning. The experiments show that this method can retrieve CAD models efficiently with enriched semantics, and demonstrate its feasibility in handling severe object occlusions.
Yinyu Nie, Jian Chang, Jian Jun Zhang

Chapter 9. Interactive Labeling for Generation of CityGML Building Models from Meshes

While 3D building models inherently contain semantic information that is useful for industrial applications such as urban planning and construction analysis, the building data obtained from acquisition and modeling processes are often mere geometry/graphics information. This paper considers the problem of converting 3D building models from a graphical format such as Wavefront OBJ to CityGML, which includes semantic information. The problem is challenging due to the lack of explicit semantic information in graphical models. We present a recommendation-based approach to generate CityGML files. The basic idea of the approach is to automatically or semi-automatically group faces of the building models and recommend them to the user for labeling. The underlying technique is based on geometric clustering and graph-cut-based selection, which can effectively group faces with the same semantic attributes. Our method is efficient and reliable: it relies on the user’s perception of shape for labeling, while reducing user interaction by grouping and recommending faces using geometric algorithms. The experiments demonstrate that the proposed approach can substantially speed up the generation of CityGML models.
Pradeep Kumar Jayaraman, Jianmin Zheng, Yihao Chen, Chee Boon Tan

Chapter 10. TooltY: An Approach for the Combination of Motion Capture and 3D Reconstruction to Present Tool Usage in 3D Environments

Visualization techniques for the usage of tools, handicrafts, and assembly operations are employed for demonstrating processes (e.g., assembly instructions). To date, the most commonly used techniques include written information, sketches in manuals, and video instructions. Technology has since produced mature methods for transforming movement into digital information that can be processed and replicated. Motion capture together with 3D reconstruction techniques can provide new ways of digitizing handicrafts. At the same time, Virtual Humans can be used to present craft processes, as well as to demonstrate the usage of tools. For this, the tools utilized in these processes need to be transferred to the digital world. In this paper, we present TooltY, a 3D authoring platform for tool usage presentation in 3D environments, to demonstrate simple operations (e.g., usage of a hammer, scissors, screwdriver), where the tools are the product of 3D reconstruction. The movement of the Virtual Humans derives from motion capture, while the movement of the tools is induced from the human motion capture using a novel approach. The products of TooltY are Virtual Environments that can be experienced in 3D or through immersion in Virtual Reality.
Evropi Stefanidi, Nikolaos Partarakis, Xenophon Zabulis, Paul Zikas, George Papagiannakis, Nadia Magnenat Thalmann

Chapter 11. Generating 3D Facial Expressions with Recurrent Neural Networks

Learning-based methods have proved effective at high-quality image synthesis tasks, such as content-preserving image rendering in different styles and the generation of new images depicting learned objects. Some of the properties that make neural networks suitable for such tasks, for example, robustness to the input’s low-level features and the ability to retrieve contextual information, are also desirable in the 3D shape domain. Over the last decades, data-driven methods have shown successful results in 3D shape modeling tasks, such as human face and body shape synthesis. Subtle, abstract properties of the geometry that are instantly detected by our eyes but are nontrivial to synthesize have successfully been achieved by tuning a shape model built from example shapes. Recent successful learning techniques, e.g., deep neural networks, also exploit this shape model, since the regular grid assumption of 2D images has no straightforward equivalent in common 3D shape representations, and such techniques therefore do not easily generalize to 3D shapes. Here, we concentrate on the 3D facial expression generation task, an important problem in computer graphics and other application domains, where existing data-driven approaches mostly rely on direct shape capture or shape transfer. At the core of our approach is a recurrent neural network with a marker-based shape representation. The network is trained on a set of motion-captured facial expression sequences to estimate a sequence of pose changes, and thus to generate a specific facial expression. Our technique promises to significantly improve the quality of generated expressions while extending the potential applicability of neural networks to sequences of 3D shapes.
Hyewon Seo, Guoliang Luo

Human Computer Interaction (HCI)


Chapter 12. Facilitating Decision-Making with Multimodal Interfaces in Collaborative Analytical Sessions

In collaborative visual analytics sessions, participants analyze data and cooperate toward a shared vision. These decision-making processes are challenging and time-consuming. In this chapter, we introduce a system for facilitating decision-making in exploratory and collaborative visual analytics sessions. Our system comprises an assistant analytical agent, a multi-display wall, and a framework for interactive visual analytics. The assistant agent understands participants’ ongoing conversations and exhibits information about the data on the displays. The displays are also used to show the current state of the session. In addition, the agent answers the participants’ questions, whether about the data or open-domain, and preserves the productivity and efficiency of the session by confirming that the participants do not deviate from the session’s goal. Meanwhile, our visual analytics medium makes data tangible, and hence more comprehensible and natural to operate with. The results of our qualitative study indicate that the proposed system fosters productive multi-user decision-making processes.
Yasin Findik, Hasan Alp Boz, Selim Balcisoy

Chapter 13. Human–Technology Interaction: The State-of-the-Art and the Lack of Naturalism

This chapter serves as a state-of-the-art review, presenting the limitations of the existing technology used to date in the broad area of human–computer interaction. Although different kinds of agents have been used to contribute to several domains, such as education, health, and entertainment, in both virtual and physical environments, a virtual character or robot that makes a human feel as comfortable as when interacting with another human has not yet been reported. What is mainly missing from the current state of the art is a direct comparison of all these technologies with original human–human communication. We need to keep studying human–human communication, and not only features of HCI, because what is missing is an understanding of how we, as humans, react in different contexts of communication. Through this kind of research, we can contribute to enhancing the naturalism of every kind of agent, offering a higher level of understanding and affection in the context of everyday communication.
Evangelia Baka, Nadia Magnenat Thalmann

Chapter 14. Survey of Speechless Interaction Techniques in Social Robotics

With recent developments in the fields of artificial intelligence, machine learning, and deep learning, the field of social robotics has gained momentum. Any social robot needs to interact with humans and its environment. Any human–robot interaction involves two aspects: speech-based and speechless interactions. Of the two, the latter is an essential requirement for making the social robot appear convincing and believable. In this study, we bridge the robotics hardware with the non-speech components of communication. We discuss how the notion of a digital ecosystem with a social robot is a powerful one, connecting the conceptual framework of biological ecology with the swiftly expanding digital world. Traditional speechless interaction considers mainly non-verbal communication cues such as gazing, user action/gesture-based interaction, body language, emotion, and personality detection. But in the current scenario, with social robotics finding more applications in healthcare (autism care), education (for deaf and speech-impaired people), office work (insurance), etc., other speechless communication techniques can also be considered and integrated into them. By reading, the robot will be able to understand any document or online content, which can help the robot to tell stories, interact with speech-impaired people, and handle incoming mail and parcels. A robot can engage in online communication ranging from simple emails, tweets, and status updates to sophisticated communication frameworks such as Facebook, Twitter, and Gmail. In this review, we look at speechless interaction methods and the corresponding reaction models in social robotics, not restricted to non-verbal cues but also including how HRI is impacted by the internet, which is a relatively new research domain.
Manoj Ramanathan, Ranjan Satapathy, Nadia Magnenat Thalmann

Chapter 15. Exploring Potential and Acceptance of Socially Intelligent Robot

Socially intelligent robots are designed mainly for social interaction with humans. They can assume two useful roles: a functional one and an affective one. They aim to serve as an interface between humans and technology, and to improve people's quality of life by providing companionship and assisting them in everyday tasks and routines. Although these robots are receiving growing attention in the literature, no comprehensive review has yet been performed to investigate the effectiveness and usefulness of these two roles. Therefore, we systematically reviewed and analyzed human interaction with the socially intelligent robot Nadine under four different scenarios: interviewer, teacher, customer guide, and companion. To support our research, we recorded EEG signals, body gestures, facial expressions, and psychometric data through a valid questionnaire. The ultimate purpose of this paper is to allow the understanding of human expectations and acceptance of socially intelligent robots.
Nidhi Mishra, Evangelia Baka, Nadia Magnenat Thalmann