2014 | Book

Computer Vision and Machine Learning with RGB-D Sensors


About this Book

This book presents an interdisciplinary selection of cutting-edge research on RGB-D based computer vision. Features:

- Discusses the calibration of color and depth cameras, the reduction of noise on depth maps, and methods for capturing human performance in 3D
- Reviews a selection of applications which use RGB-D information to reconstruct human figures, evaluate energy consumption, and obtain accurate action classification
- Presents an approach for 3D object retrieval and for the reconstruction of gas flow from multiple Kinect cameras
- Describes an RGB-D computer vision system designed to assist the visually impaired and another for smart-environment sensing to assist elderly and disabled people
- Examines the effective features that characterize static hand poses and introduces a unified framework to enforce both temporal and spatial constraints for hand parsing
- Proposes a new classifier architecture for real-time hand pose recognition and a novel hand segmentation and gesture recognition system

Table of Contents

Frontmatter

Surveys

Frontmatter
Chapter 1. 3D Depth Cameras in Vision: Benefits and Limitations of the Hardware
With an Emphasis on the First- and Second-Generation Kinect Models
Abstract
The second-generation Microsoft Kinect uses time-of-flight technology, while the first-generation Kinect uses structured-light technology. This raises the question of whether one of these technologies is “better” than the other. In this chapter, readers will find an overview of 3D camera technology and of the artifacts that occur in depth maps.
Achuta Kadambi, Ayush Bhandari, Ramesh Raskar
Chapter 2. A State of the Art Report on Multiple RGB-D Sensor Research and on Publicly Available RGB-D Datasets
Abstract
That the Microsoft Kinect, an RGB-D sensor, transformed the gaming and end-consumer sector was anticipated by its developers. That it also made an impact on rigorous computer vision research has probably been a surprise to the whole community. Shortly before the commercial deployment of its successor, the Kinect One, the research literature is filling with reviews and state-of-the-art papers summarizing the development of the past 3 years. This chapter describes significant research projects which have built on sensor setups that include two or more RGB-D sensors in one scene, and on RGB-D datasets captured with them that were made publicly available.
Kai Berger

Reconstruction, Mapping and Synthesis

Frontmatter
Chapter 3. Calibration Between Depth and Color Sensors for Commodity Depth Cameras
Abstract
Commodity depth cameras have recently enabled many interesting new applications in the research community. These applications often require calibration information between the color and the depth cameras. Traditional checkerboard-based calibration schemes fail to work well for the depth camera, since corner features cannot be reliably detected in the depth image. In this chapter, we present a maximum likelihood solution for joint depth and color calibration based on two principles. First, in the depth image, points on the checkerboard must be coplanar, and the plane is known from color camera calibration. Second, additional point correspondences between the depth and color images may be manually specified or automatically established to help improve calibration accuracy. Uncertainty in the depth values is taken into account systematically. The proposed algorithm is reliable and accurate, as demonstrated by extensive experimental results on simulated and real-world examples.
Cha Zhang, Zhengyou Zhang
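To make the chapter's first principle concrete, here is a minimal sketch of a plane-based calibration residual: checkerboard points seen by the depth camera, mapped into the color camera frame by a candidate rotation and translation, should lie on the plane known from color calibration. The function names and synthetic data below are illustrative, not the authors' implementation.

```python
# Minimal sketch: estimate depth-to-color extrinsics by minimizing
# point-to-plane distances. Assumes the checkerboard plane (n, d) is
# already known from color camera calibration.
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def residuals(params, depth_pts, plane_n, plane_d):
    """Signed point-to-plane distances after mapping into the color frame."""
    R = Rotation.from_rotvec(params[:3]).as_matrix()
    t = params[3:]
    pts_color = depth_pts @ R.T + t        # transform into color camera frame
    return pts_color @ plane_n - plane_d   # distance to the calibration plane

# Synthetic example: plane z = 1.0 in the color frame, noisy depth samples.
rng = np.random.default_rng(0)
plane_n, plane_d = np.array([0.0, 0.0, 1.0]), 1.0
xy = rng.uniform(-0.2, 0.2, size=(100, 2))
depth_pts = np.column_stack([xy, np.full(100, 0.98)])       # misaligned board
depth_pts += rng.normal(scale=1e-3, size=depth_pts.shape)   # depth noise

fit = least_squares(residuals, x0=np.zeros(6),
                    args=(depth_pts, plane_n, plane_d))
print("rotation vector:", fit.x[:3], "translation:", fit.x[3:])
```

Note that a single plane constrains only part of the six-parameter pose, which is one reason the chapter combines multiple checkerboard poses with the additional point correspondences of its second principle.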
Chapter 4. Depth Map Denoising via CDT-Based Joint Bilateral Filter
Abstract
Bi-modal image processing can be defined as a series of steps taken to enhance a target image with a guidance image, using exploitable information derived from acquiring two images of the same scene with different modalities. However, while the potential benefit of bi-modal processing may be significant, there is an inherent risk: if noise or defects in the guidance image are allowed to transfer to the target image, the target image may become corrupted rather than improved. In this chapter, we present a new method to enhance a noisy depth map from its color information via a joint bilateral filter (JBF) based on the common distance transform (CDT). The method is composed of two main steps: CDT map generation and CDT-based JBF. In the first step, a CDT map is generated that represents the degree of pixel-modal similarity between a depth pixel and its corresponding color pixel. Then, based on the CDT map, the JBF is carried out to enhance the depth information with the aid of the color information. Experimental results show that the CDT-based JBF outperforms other conventional methods, both objectively and subjectively, in terms of noise reduction as well as the suppression of inherent visual artifacts.
Andreas Koschan, Mongi Abidi
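For readers unfamiliar with the baseline, here is a minimal joint bilateral filter sketch: spatial weights come from pixel distance, and range weights come from the color guidance image rather than the noisy depth map. The chapter's CDT-based weighting is an additional similarity term not reproduced here; this is the plain JBF it builds on, with a grayscale guide assumed for brevity.

```python
# Minimal joint bilateral filter: smooth the depth map, with edges
# preserved wherever the (grayscale) guidance image has edges.
import numpy as np

def joint_bilateral(depth, guide, radius=3, sigma_s=2.0, sigma_r=10.0):
    h, w = depth.shape
    out = np.zeros((h, w), dtype=np.float64)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs**2 + ys**2) / (2 * sigma_s**2))   # spatial kernel
    pad_d = np.pad(depth.astype(np.float64), radius, mode='edge')
    pad_g = np.pad(guide.astype(np.float64), radius, mode='edge')
    for y in range(h):
        for x in range(w):
            win_d = pad_d[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            win_g = pad_g[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            # range weights from the guidance image, not the depth map
            rng_w = np.exp(-(win_g - guide[y, x])**2 / (2 * sigma_r**2))
            w_all = spatial * rng_w
            out[y, x] = (w_all * win_d).sum() / w_all.sum()
    return out
```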
Chapter 5. Human Performance Capture Using Multiple Handheld Kinects
Abstract
Capturing real performances of human actors has been an important topic in the fields of computer graphics and computer vision in the last few decades. The reconstructed 3D performance can be used for character animation and free-viewpoint video. While most of the available performance capture approaches rely on a 3D video studio with tens of RGB cameras, this chapter presents a method for marker-less performance capture of single or multiple human characters using only three handheld Kinects. Compared with the RGB camera approaches, the proposed method is more convenient for data acquisition, requiring far fewer cameras and allowing capture with handheld devices. The method reconstructs human skeletal poses, deforming surface geometry, and camera poses for every time step of the depth video. It succeeds on general, uncontrolled indoor scenes with potentially dynamic backgrounds, even for the reconstruction of multiple closely interacting characters.
Yebin Liu, Genzhi Ye, Yangang Wang, Qionghai Dai, Christian Theobalt
Chapter 6. Human-Centered 3D Home Applications via Low-Cost RGBD Cameras
Abstract
In this chapter, we introduce three human-centered 3D home applications realized with low-cost RGBD cameras. The first application creates personalized avatars via multiple Kinects: it reconstructs a real human body to provide everyday users with personalized avatars and to enhance the interactive experience in game and virtual reality environments. The second application automatically evaluates the energy consumption of users in gaming scenarios from a model with a tracked skeleton, which may help users gauge the effect of their exercise and manage their diet or weight. The final application is a real-time system that automatically classifies human actions acquired by a consumer-priced RGBD sensor.
Zhenbao Liu, Shuhui Bu, Junwei Han
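The second application's idea, estimating energy expenditure from a tracked skeleton, can be sketched with a toy proxy: accumulate per-segment kinetic energy from frame-to-frame joint displacement. The segment masses and the model itself are illustrative assumptions, far simpler than the chapter's actual model.

```python
# Toy energy proxy from tracked skeleton frames. Segment masses are
# assumed, illustrative values, not the chapter's calibrated model.
import numpy as np

SEGMENT_MASS_KG = {"hand": 0.5, "forearm": 1.2, "upper_arm": 2.0}  # assumed

def energy_proxy(frames, fps=30.0):
    """frames: list of {segment_name: np.array([x, y, z])} positions in m."""
    total = 0.0
    for prev, cur in zip(frames, frames[1:]):
        for seg, mass in SEGMENT_MASS_KG.items():
            speed = np.linalg.norm(cur[seg] - prev[seg]) * fps  # m/s
            total += 0.5 * mass * speed**2 / fps  # kinetic energy per interval
    return total  # joules, as a very rough activity proxy
```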
Chapter 7. Matching of 3D Objects Based on 3D Curves
Abstract
In this chapter, we introduce a novel approach to 3D object retrieval capable of matching query objects generated by a user with those captured by a depth device (RGB-D). Our processing pipeline consists of several steps. In the preprocessing step, we first detect edges in the depth image and merge them into 2D object curves, which allows a back-projection to 3D space. Then, we estimate a local coordinate system for these 3D curves. In the next step, distinctive feature points are localised and shortest paths between these points are determined. Subsequently, the shortest paths are represented by robust descriptors invariant to rotation, scaling, and translation. Finally, all the information collected to describe the object is used for matching, where the matching process is transformed into a Maximum Weight Subgraph search. Excellent retrieval results achieved in a comprehensive setup of challenging experiments show the benefits of our method compared to the state of the art.
Christian Feinen, Joanna Czajkowska, Marcin Grzegorzek, Longin Jan Latecki
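The first two pipeline steps, edge detection in the depth image and back-projection of edge pixels to 3D, can be sketched as follows. The intrinsics are assumed Kinect-class values, and the curve-linking step that merges edges into 2D object curves is omitted.

```python
# Minimal sketch: Canny edges on a depth map, then pinhole back-projection
# of edge pixels to 3D points. Intrinsics (fx, fy, cx, cy) are assumed
# values for a 640x480 Kinect-class depth sensor.
import cv2
import numpy as np

fx = fy = 525.0
cx, cy = 319.5, 239.5  # assumed principal point

def depth_edges_to_3d(depth_m):
    """depth_m: float32 depth map in meters. Returns Nx3 edge points in 3D."""
    d8 = cv2.normalize(depth_m, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    edges = cv2.Canny(d8, 50, 150)            # 2D edge map of the depth image
    v, u = np.nonzero(edges)                  # pixel coordinates of edges
    z = depth_m[v, u]
    valid = z > 0                             # drop invalid depth readings
    u, v, z = u[valid], v[valid], z[valid]
    x = (u - cx) * z / fx                     # pinhole back-projection
    y = (v - cy) * z / fy
    return np.column_stack([x, y, z])
```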
Chapter 8. Using Sparse Optical Flow for Two-Phase Gas Flow Capturing with Multiple Kinect
Abstract
The use of multiple Microsoft Kinects has become prominent in the last 2 years and has enjoyed widespread acceptance. While several works have been published on mitigating quality degradation in the precomputed depth image, this work focuses on employing an optical flow suited to dot patterns, such as the one projected by the Kinect, to retrieve subtle scene data alterations for reconstruction. The method is employed in a multiple-Kinect vision architecture to detect the interface of propane flowing around occluding objects in air.
Kai Berger, Marc Kastner, Yannic Schroeder, Stefan Guthe
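As a generic stand-in for the chapter's dot-pattern flow, the sketch below tracks dot-like features between consecutive IR frames with pyramidal Lucas-Kanade. The authors' method is tailored to the Kinect's projected pattern; this only illustrates the sparse-flow machinery.

```python
# Minimal sparse optical flow between two IR frames of the dot pattern.
import cv2
import numpy as np

def track_dots(ir_prev, ir_next):
    """ir_prev, ir_next: uint8 grayscale IR frames of the projected pattern."""
    pts = cv2.goodFeaturesToTrack(ir_prev, maxCorners=2000,
                                  qualityLevel=0.01, minDistance=3)
    if pts is None:
        return np.empty((0, 2)), np.empty((0, 2))
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(ir_prev, ir_next, pts, None,
                                                 winSize=(9, 9), maxLevel=2)
    ok = status.ravel() == 1
    # matched dot positions before/after; their displacements are the flow
    return pts[ok].reshape(-1, 2), nxt[ok].reshape(-1, 2)
```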

Detection, Segmentation and Tracking

Frontmatter
Chapter 9. RGB-D Sensor-Based Computer Vision Assistive Technology for Visually Impaired Persons
Abstract
A computer vision-based wayfinding and navigation aid can improve the mobility of blind and visually impaired people, allowing them to travel independently. In this chapter, we focus on RGB-D sensor-based computer vision technologies for assisting blind and visually impaired persons. We first briefly review existing computer vision-based assistive technology for the visually impaired. Then we provide a detailed description of recent RGB-D sensor-based assistive technology for blind or visually impaired people. Next, we present a prototype system that detects and recognizes stairs and pedestrian crosswalks from RGB-D images. Since both stairs and pedestrian crosswalks are characterized by a group of parallel lines, the Hough transform is applied to extract the concurrent parallel lines from the RGB (red, green, and blue) channels. The depth channel is then employed to distinguish pedestrian crosswalks from stairs. Detected stairs are further identified as stairs going up (upstairs) or stairs going down (downstairs), and the distance between the camera and the stairs is estimated for blind users. Detection and recognition results on our collected datasets demonstrate the effectiveness and efficiency of the developed prototype. We conclude the chapter with a discussion of future directions.
Yingli Tian
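The parallel-line cue can be sketched with OpenCV's probabilistic Hough transform: extract line segments, then keep bundles that share a similar orientation, since both stairs and crosswalks appear as groups of near-parallel lines. The thresholds below are illustrative, not the chapter's tuned values.

```python
# Minimal sketch: find groups of near-parallel Hough line segments.
import cv2
import numpy as np

def parallel_line_groups(bgr, angle_tol_deg=5.0, min_group=4):
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    segs = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=60,
                           minLineLength=60, maxLineGap=10)
    if segs is None:
        return []
    segs = segs.reshape(-1, 4)
    angles = np.degrees(np.arctan2(segs[:, 3] - segs[:, 1],
                                   segs[:, 2] - segs[:, 0])) % 180
    groups, used = [], np.zeros(len(segs), dtype=bool)
    for i in range(len(segs)):
        if used[i]:
            continue
        # wrapped angular distance in [-90, 90) degrees
        near = np.abs((angles - angles[i] + 90) % 180 - 90) < angle_tol_deg
        near &= ~used
        if near.sum() >= min_group:    # enough concurrent parallel lines
            groups.append(segs[near])
            used |= near
    return groups
```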
Chapter 10. RGB-D Human Identification and Tracking in a Smart Environment
Abstract
Elderly and disabled people can particularly benefit from smart environments with integrated sensors, as these offer basic assistive functionalities enabling personal independence and increased safety. In a smart environment, the key issue is to quickly sense the location and identity of its users. In this chapter, we aim at enhancing the robustness of human detection and identification in a home environment based on the Kinect, a new multimodal sensor. The contribution of our work is that we employ different cameras for different algorithmic modules, based on an investigation of the suitability of each camera in the Kinect for a specific processing task, resulting in an efficient and robust human detection, tracking, and re-identification system. The complete system consists of three processing modules: (1) object labeling and human detection based on depth data, (2) human reentry identification based on both RGB and depth information, and (3) human tracking based on RGB data. Experimental results show that each algorithmic module works well and that the complete system can accurately track up to three persons in a real situation.
Jungong Han, Junwei Han
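A minimal sketch of the first module, depth-based object labeling and human detection, under common assumptions: subtract a static depth background, threshold, clean up with morphology, and keep connected components large enough to be a person. The thresholds are illustrative, not the chapter's values.

```python
# Minimal depth-based foreground person detection via background
# subtraction and connected-component labeling.
import cv2
import numpy as np

def detect_people(depth_mm, background_mm, diff_thresh=100, min_area=5000):
    """depth_mm, background_mm: uint16 depth maps in millimeters."""
    fg = (np.abs(depth_mm.astype(np.int32)
                 - background_mm.astype(np.int32)) > diff_thresh)
    fg = fg.astype(np.uint8)
    fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(fg)
    # keep blobs large enough to be a person (label 0 is the background)
    return [centroids[i] for i in range(1, n)
            if stats[i, cv2.CC_STAT_AREA] >= min_area]
```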

Learning-based Recognition

Frontmatter
Chapter 11. Feature Descriptors for Depth-Based Hand Gesture Recognition
Abstract
Depth data acquired by consumer depth cameras provide a very informative description of the hand pose that can be exploited for accurate gesture recognition. A typical hand gesture recognition pipeline requires identifying the hand, extracting relevant features, and exploiting a suitable machine learning technique to recognize the performed gesture. This chapter deals with the recognition of static poses. It starts by describing how the hand can be extracted from the scene by exploiting depth and color data. Then several different features that can be extracted from the depth data are presented. Finally, a multi-class support vector machine (SVM) classifier is applied to the presented features in order to evaluate the performance of the various descriptors.
Fabio Dominio, Giulio Marin, Mauro Piazza, Pietro Zanuttigh
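As a sketch of the final step only, the snippet below trains a multi-class SVM on precomputed descriptors with scikit-learn. The descriptor matrix X is a random placeholder standing in for the depth features the chapter evaluates.

```python
# Minimal multi-class SVM evaluation on placeholder depth descriptors.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 64))        # placeholder descriptors (N x D)
y = rng.integers(0, 10, size=300)     # placeholder gesture labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=10.0))  # one-vs-one
clf.fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```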
Chapter 12. Hand Parsing and Gesture Recognition with a Commodity Depth Camera
Abstract
Hand pose tracking and gesture recognition are useful techniques in human–computer interaction (HCI) scenarios, but previous work in this field suffers from a lack of discriminative features to differentiate and track hand parts. In this chapter, we present a robust hand parsing scheme to obtain a high-level, discriminative representation of the hand from a raw depth image. A novel distance-adaptive feature selection method is proposed to generate more discriminative depth-context features for hand parsing. A random decision forest is adopted for per-pixel labeling and combined with a temporal prior to form an ensemble of classifiers for enhanced performance. To enforce spatial smoothness and remove misclassified isolated regions, we further build a superpixel Markov random field, which is capable of handling per-pixel labeling errors at variable scales. To demonstrate the effectiveness of the proposed method, we compare it to benchmark methods: it produces 17.2 % higher accuracy on synthesized datasets for single-frame parsing, and tests on real-world sequences show our method is more robust to complex hand poses. In addition, we develop a hand gesture recognition algorithm based on the hand parsing results. Experiments show our method achieves good performance compared to state-of-the-art methods.
Hui Liang, Junsong Yuan
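The depth-context features mentioned above belong to the family popularized for Kinect body-part labeling: depth differences probed at offsets scaled by the center pixel's depth, so the feature is invariant to the hand's distance from the camera. Below is a minimal sketch of one such feature; the chapter's distance-adaptive offset selection is not reproduced.

```python
# Minimal depth-context feature: compare depth at two offsets around a
# pixel, scaling the offsets by the center depth for depth invariance.
import numpy as np

def depth_context_feature(depth_m, y, x, off1, off2):
    """off1, off2: (dy, dx) offsets in meter-normalized units."""
    z = depth_m[y, x]
    if z <= 0:
        return 0.0
    def probe(off):
        py = int(y + off[0] / z)       # scale offset by inverse depth
        px = int(x + off[1] / z)
        h, w = depth_m.shape
        if 0 <= py < h and 0 <= px < w and depth_m[py, px] > 0:
            return depth_m[py, px]
        return 10.0                    # large constant for invalid probes
    return probe(off1) - probe(off2)   # fed to a random-forest split test
```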
Chapter 13. Learning Fast Hand Pose Recognition
Abstract
Practical real-time hand pose recognition requires a classifier of high accuracy running at millisecond speed. We present a novel classifier architecture, the Discriminative Ferns Ensemble (DFE), to address this challenge. The architecture optimizes both classification speed and accuracy when a large training set is available. Speed is obtained using simple binary features and direct indexing into a set of tables, and accuracy by using a large-capacity model and careful discriminative optimization. The proposed framework is applied to the problem of hand pose recognition in depth and infrared images, using a very large training set. Both the accuracy and the classification time obtained are considerably superior to relevant competing methods, allowing one to reach accuracy targets with runtimes orders of magnitude faster than the competition. We show empirically that, using DFE, we can significantly reduce classification time by increasing training sample size for a fixed target accuracy. Finally, scalability to a large number of classes is tested using a synthetically generated data set of 81 classes.
Eyal Krupka, Alon Vinnikov, Ben Klein, Aharon Bar-Hillel, Daniel Freedman, Simon Stachniak, Cem Keskin
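The fern mechanics described above (simple binary features plus direct table indexing) can be sketched compactly: k pixel comparisons form a k-bit index into a per-fern score table, and the ensemble sums table rows across ferns. The discriminative training of features and tables, the paper's actual contribution, is omitted; the structure below is a generic fern with illustrative names.

```python
# Minimal fern-ensemble classifier skeleton over image patches.
import numpy as np

class Fern:
    def __init__(self, pixel_pairs, table):
        self.pairs = pixel_pairs   # k pairs ((y1, x1), (y2, x2)) to compare
        self.table = table         # (2**k, n_classes) score table, trained

    def index(self, patch):
        idx = 0
        for (y1, x1), (y2, x2) in self.pairs:       # k binary features
            idx = (idx << 1) | int(patch[y1, x1] > patch[y2, x2])
        return idx

    def scores(self, patch):
        return self.table[self.index(patch)]        # direct table lookup

def classify(patch, ferns):
    """Sum per-fern class scores over the ensemble, pick the best class."""
    total = sum(f.scores(patch) for f in ferns)
    return int(np.argmax(total))
```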
Chapter 14. Real-Time Hand Gesture Recognition Using RGB-D Sensor
Abstract
RGB-D sensor-based gesture recognition is one of the most effective techniques for human–computer interaction (HCI). In this chapter, we propose a new hand motion capture procedure for establishing a real gesture data set. A hand partition scheme is designed for color-based semi-automatic labeling. This method is integrated into a vision-based hand gesture recognition framework for developing desktop applications. We use the Kinect sensor to achieve more reliable and accurate tracking in the desktop environment. Moreover, a hand contour model is proposed to simplify the gesture matching process and reduce its computational complexity. The framework allows tracking hand gestures in 3D space and matching gestures with the simple contour model, and thus supports complex real-time interactions. Experimental evaluations and a real-world demo of hand gesture interaction demonstrate the effectiveness of this framework.
Yuan Yao, Fan Zhang, Yun Fu
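The contour-matching idea can be sketched with generic OpenCV machinery: extract the largest contour from a binary hand mask and compare it against stored templates with Hu-moment shape matching. The chapter's contour model is more refined; cv2.matchShapes is a stand-in here, not the authors' matcher.

```python
# Minimal contour-based gesture matching against stored templates.
import cv2

def match_gesture(hand_mask, templates):
    """hand_mask: uint8 binary image; templates: {name: contour}."""
    contours, _ = cv2.findContours(hand_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    hand = max(contours, key=cv2.contourArea)      # largest blob = hand
    scores = {name: cv2.matchShapes(hand, tpl, cv2.CONTOURS_MATCH_I1, 0.0)
              for name, tpl in templates.items()}
    return min(scores, key=scores.get)             # lowest distance wins
```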
Backmatter
Metadata
Title
Computer Vision and Machine Learning with RGB-D Sensors
Edited by
Ling Shao
Jungong Han
Pushmeet Kohli
Zhengyou Zhang
Copyright Year
2014
Electronic ISBN
978-3-319-08651-4
Print ISBN
978-3-319-08650-7
DOI
https://doi.org/10.1007/978-3-319-08651-4