Elsevier

Pattern Recognition

Volume 44, Issues 10–11, October–November 2011, Pages 2297-2304

Semi-supervised Elastic net for pedestrian counting

https://doi.org/10.1016/j.patcog.2010.10.002

Abstract

Pedestrian counting plays an important role in public safety and intelligent transportation. Most pedestrian counting algorithms based on supervised learning require much labeling work and rarely exploit the topological information of unlabelled data in a video. In this paper, we propose a Semi-Supervised Elastic Net (SSEN) regression method by utilizing sequential information between unlabelled samples and their temporally neighboring samples as a regularization term. Compared with a state-of-the-art algorithm, extensive experiments indicate that our algorithm can not only select sparse representative features from the original feature space without losing their interpretability, but also attain superior prediction performance with only very few labelled frames.

Introduction

Pedestrian counting in public places plays a key role in many applications, such as evacuating people from a dense region to a sparse one when an emergency happens [1], or optimizing the design of traffic infrastructure to provide better transportation services. Furthermore, social security and surveillance strongly depend on the effectiveness of pedestrian counting [2].

Generally speaking, the number of pedestrians can be estimated in two different ways. The first approach is based on detection or tracking methods, e.g. Leibe et al. [3] and Kim and Cipolla [4]. Such methods can count pedestrians accurately when pedestrian density is low. As the density increases, however, their performance deteriorates because occlusion becomes severe in a dense crowd. Variations among pedestrians, such as height, pose and large bags, also impair performance. The second approach directly estimates the number of pedestrians in a scene from the pedestrians' pixel and texture information. Because it does not identify individuals, one remarkable advantage of this approach is that pedestrians' privacy is well preserved [5]. Assuming that the number of pixels occupied by pedestrians has a linear relationship with the pedestrian count, Davies et al. [6] used pixel information to count pedestrians. However, this approach suffers from (1) the perspective effect, which means that the size of a person in an image changes with the distance from the person to the camera, and (2) overlap between pedestrians. For example, two overlapping pedestrians occupy fewer pixels than two non-overlapping pedestrians at the same distance from the camera. Both effects invalidate the linear-relationship assumption mentioned above.

To solve the perspective issue, Chan et al. [5] proposed a perspective normalization map in which each entry is a normalization factor for the corresponding pixel, with larger factors assigned to pixels that correspond to areas farther from the camera. Ma et al. [7] proposed another perspective correction matrix, based on the fact that a person in the scene is perpendicular to the ground plane, whether standing or walking. As for the overlapping issue, one remedy is to estimate the pedestrian density directly from each image. Marana et al. [8] claimed that some texture descriptors, such as the Gray Level Co-occurrence Matrix (GLCM), can describe crowd density well, since a dense crowd has finer texture whereas a sparse crowd has coarser texture. Observing that pedestrian edges are also an effective feature for describing pedestrian density, Marana et al. [9] proposed the Minkowski fractal dimension and Kong et al. [10] employed an edge orientation histogram, both based on pedestrian edges. Chan et al. [5] further improved performance by combining the Minkowski fractal dimension and edge orientation histogram features.
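As a rough illustration of perspective normalization, the sketch below weights each foreground pixel by a per-row factor that grows toward the far end of the scene, so that a distant pedestrian covering few pixels contributes about as much as a near one. The linear interpolation between rows, the function names and the two weight values are assumptions made for illustration; they are not taken from [5] or [7].

```python
def perspective_weights(height, near_weight=1.0, far_weight=4.0):
    """Per-row weights; row 0 is the farthest row, row height-1 the nearest.

    Linear interpolation between the two weights is an assumption.
    """
    if height == 1:
        return [near_weight]
    step = (near_weight - far_weight) / (height - 1)
    return [far_weight + step * r for r in range(height)]


def weighted_pixel_count(mask, weights):
    """Sum of foreground pixels (1s in mask), each scaled by its row weight."""
    return sum(weights[r] * sum(row) for r, row in enumerate(mask))


# Toy 4x4 mask: a distant pedestrian covers 1 pixel, a near one covers 4.
mask = [
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
weights = perspective_weights(len(mask))
raw_count = sum(sum(row) for row in mask)
normalized_count = weighted_pixel_count(mask, weights)
```

In this toy mask the raw pixel count is dominated by the near pedestrian, whereas the weighted count gives the distant pedestrian a comparable share.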

Note that all the aforementioned pedestrian counting algorithms operate under the supervised learning framework. A major disadvantage of these algorithms is that good performance requires a large number of labelled frames for learning. However, labeling many frames is tedious and labor-intensive. In addition, mislabeling happens easily during manual labeling.

Fortunately, a large number of unlabelled frames are available, and they provide useful topological information that can benefit pedestrian counting. In the machine learning community, the semi-supervised learning framework, which exploits unlabelled samples, has been extensively investigated over the last decade [11]. One representative strategy of semi-supervised learning is to use regularization to exploit the structure of the unlabelled data. Zhu [11] proved that if the unlabelled data structure approximates the true population distribution well, an expected result will be obtained. To better exploit information in unlabelled data, Zhu and Goldberg [12], Zhang et al. [13] and Hou et al. [14] introduced domain knowledge to build the regularization term for regression and dimension reduction, respectively. It is worth noting that semi-supervised regression is less studied than semi-supervised classification. The main reason is that the number of categories is finite in classification tasks, whereas target values are continuous in regression tasks. Zhou and Li [15] proposed a co-training style algorithm in which two different regressors make their own estimates on unlabelled data and each refines its performance by learning from the other's results. Rwebangira and Lafferty [16] proposed a semi-supervised locally linear regression that adds a weighted regularization term to the regression function. However, it is not easy to generalize these semi-supervised approaches directly to pedestrian counting because (1) it is difficult to find two proper views or distance metrics on pedestrian data for co-training, and (2) the locally linear property does not always hold for pedestrian data with rich features.

Another common issue in pedestrian counting is that many redundant features need to be removed when a collection of features is extracted from pedestrian video. Although the lasso proposed by Tibshirani [17] can help to remove such features, its efficiency depends greatly on the number of dimensions. Moreover, when the pairwise correlation within a group of features is high, the lasso chooses only one feature from the group. To overcome these two limitations, the Elastic net, a variant of the lasso [17], was proposed by adding an L2-norm constraint [18]. Therefore, we use the Elastic net to remove the redundant features that are only weakly related to the properties of pedestrians.
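The difference between the two penalties on correlated features can be seen in a small experiment. The sketch below is a plain coordinate-descent solver written for illustration; it uses the penalized form (1/2n)·||y − Xβ||² + l1·||β||₁ + (l2/2)·||β||₂² rather than a constrained form, without an intercept or any rescaling of the solution. The function names, the toy data and the penalty values are assumptions, not the settings used in this paper.

```python
def soft_threshold(value, threshold):
    """Soft-thresholding operator used by L1-penalized coordinate descent."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_cd(X, y, l1, l2, sweeps=500):
    """Coordinate descent for
    (1/2n) * ||y - X b||^2 + l1 * ||b||_1 + (l2/2) * ||b||_2^2.
    Setting l2 = 0 gives the lasso in penalized form.
    """
    n, m = len(X), len(X[0])
    beta = [0.0] * m
    residual = list(y)  # residual = y - X beta, with beta = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(m)]
    for _ in range(sweeps):
        for j in range(m):
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(
                X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)
            ) / n
            new_bj = soft_threshold(rho, l1) / (col_sq[j] + l2)
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


# Two identical (perfectly correlated) features; the response is 2 * feature.
X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
y = [2.0, 4.0, 6.0, 8.0]
lasso_beta = elastic_net_cd(X, y, l1=0.1, l2=0.0)
enet_beta = elastic_net_cd(X, y, l1=0.1, l2=0.5)
```

On this toy data the lasso run places essentially all of the weight on the first feature, while the run with the additional L2 term splits the weight evenly between the two identical features.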

To better reflect the degree of pedestrian density and to make effective use of the structural information in the unlabelled pedestrian image sequence, we also introduce the statistical landscape features (SLF) proposed by Xu and Chen [19] to extract statistical features closely related to pedestrian counts. We then propose the semi-supervised Elastic net (SSEN) by incorporating the sequential correlation between frames into the Elastic net. In this way, we can remove redundant features from SLF while using sequential information in the unlabelled frames to improve pedestrian counting performance.
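One way to picture how sequential information could enter a regression objective is a temporal smoothness penalty that asks consecutive frames to receive similar predicted counts. The sketch below assumes a penalty of the form gamma · Σ_t (x_t·β − x_{t+1}·β)² and shows that it can be folded into an ordinary least-squares problem by appending one extra row per pair of consecutive frames. This particular penalty, the function names and the toy numbers are assumptions for illustration only; the regularization term actually used by SSEN is defined in the section on the algorithm and may differ.

```python
import math


def augment_with_temporal_smoothness(X_labelled, y_labelled, X_sequence, gamma):
    """Fold gamma * sum_t (x_t.b - x_{t+1}.b)^2 into a least-squares problem.

    Each pair of consecutive frames adds one row sqrt(gamma) * (x_t - x_{t+1})
    with target 0. This form of the penalty is an assumption.
    """
    scale = math.sqrt(gamma)
    X_aug = [list(row) for row in X_labelled]
    y_aug = list(y_labelled)
    for prev, nxt in zip(X_sequence, X_sequence[1:]):
        X_aug.append([scale * (a - b) for a, b in zip(prev, nxt)])
        y_aug.append(0.0)
    return X_aug, y_aug


def squared_loss(X, y, beta):
    """Sum of squared residuals of the linear model X beta against y."""
    return sum(
        (yi - sum(b * x for b, x in zip(beta, row))) ** 2
        for row, yi in zip(X, y)
    )


# Four consecutive frames with two features each; only the first and last
# frames carry a (hypothetical) pedestrian count.
X_sequence = [[1.0, 0.0], [2.0, 1.0], [2.0, 2.0], [3.0, 2.0]]
X_labelled = [X_sequence[0], X_sequence[3]]
y_labelled = [1.0, 5.0]
beta = [1.0, 0.5]
gamma = 4.0

X_aug, y_aug = augment_with_temporal_smoothness(
    X_labelled, y_labelled, X_sequence, gamma
)
combined = squared_loss(X_aug, y_aug, beta)

# The same quantity computed term by term, as a consistency check.
preds = [sum(b * x for b, x in zip(beta, row)) for row in X_sequence]
smoothness = sum((p - q) ** 2 for p, q in zip(preds, preds[1:]))
explicit = squared_loss(X_labelled, y_labelled, beta) + gamma * smoothness
```

Because the augmented problem is again a sum of squared residuals, any solver for the Elastic net could in principle be applied to `X_aug` and `y_aug` unchanged, which is the appeal of writing the penalty this way.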

The main contributions of the proposed SSEN algorithm are as follows: (1) To the best of our knowledge, this is the first work to employ semi-supervised learning to solve the pedestrian counting problem. (2) Domain knowledge, namely the sequential information extracted from pedestrian videos, is exploited. (3) The proposed SSEN algorithm effectively reduces the original features without losing interpretability. (4) Compared with the state-of-the-art pedestrian counting algorithm [5], which generally requires hundreds of labelled frames to obtain a relatively good result, SSEN achieves better performance even with very few labelled frames.

The remainder of this paper is organized as follows. In Section 2, we describe the procedure of feature extraction, and in Section 3, we detail our SSEN algorithm. Experimental results are provided and analyzed in Section 4. Finally, Section 5 concludes this work.

Section snippets

Feature extraction

To extract a collection of pedestrian features, it is necessary to segment the foreground from the background. We use a moving average [20] to compute a foreground mask, and then smooth the mask image with a median filter and mathematical morphology. Finally, we obtain the foreground image by multiplying the smoothed mask image with the corresponding frame. Fig. 1 shows an example.
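The segmentation steps can be sketched on a toy grayscale frame as follows. The exponential running average used for the background, the threshold, the 3×3 window and all function names are assumptions made for illustration, and the morphological smoothing step is omitted for brevity; the exact procedure of [20] may differ.

```python
def update_background(background, frame, alpha=0.05):
    """Exponential running average of the background (assumed update rule)."""
    return [
        [(1 - alpha) * b + alpha * f for b, f in zip(brow, frow)]
        for brow, frow in zip(background, frame)
    ]


def foreground_mask(background, frame, threshold=30.0):
    """Mark pixels that differ from the background by more than threshold."""
    return [
        [1 if abs(f - b) > threshold else 0 for b, f in zip(brow, frow)]
        for brow, frow in zip(background, frame)
    ]


def median_filter_3x3(mask):
    """3x3 median filter; windows are clipped at the image border and the
    upper middle value is taken when a clipped window has even size."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            window = sorted(
                mask[rr][cc]
                for rr in range(max(0, r - 1), min(h, r + 2))
                for cc in range(max(0, c - 1), min(w, c + 2))
            )
            out[r][c] = window[len(window) // 2]
    return out


def apply_mask(mask, frame):
    """Multiply the smoothed mask with the frame to keep foreground pixels."""
    return [[m * f for m, f in zip(mrow, frow)] for mrow, frow in zip(mask, frame)]


# Toy 6x6 scene: flat background, one bright 3x3 blob and one noise pixel.
background = [[10.0] * 6 for _ in range(6)]
frame = [[10.0] * 6 for _ in range(6)]
for r in range(2, 5):
    for c in range(2, 5):
        frame[r][c] = 200.0
frame[0][0] = 200.0  # isolated noise pixel

mask = median_filter_3x3(foreground_mask(background, frame))
foreground = apply_mask(mask, frame)
background = update_background(background, frame)
```

In this toy scene the median filter discards the isolated noise pixel while keeping the interior of the blob, which is the role the smoothing step plays before the mask is multiplied with the frame.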

We extract six sets of features from the mask and foreground images so that the properties of pedestrians in

Elastic net

To aid understanding of the proposed SSEN algorithm, we first introduce the lasso. Given a collection of data points $\{(x_i, y_i) \mid i = 1, 2, \ldots, n\}$, where each independent variable $x_i$ consists of $m$ features $x_{i1}, \ldots, x_{im}$ and $y_i$ is the corresponding response variable, the lasso solves the following problem [17]:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{m} \beta_j x_{ij} \Big)^2 \quad \text{s.t.} \quad \sum_{j=1}^{m} |\beta_j| \le t,$$

where $\beta = (\beta_1, \beta_2, \ldots, \beta_m)^T$ is the vector of feature weights and $t \ge 0$ is a threshold. When $t$ is large enough, lasso will obtain a similar result as the linear

Semi-supervised Elastic net with sequential data

One common way to construct a semi-supervised algorithm is to add a regularization term built from unlabelled data to improve learning performance. Such regularization implicitly assumes some topological structure in the data, for example, a manifold structure [23]. If the assumption does not hold, however, introducing the regularization may degrade performance. The other way is to use domain knowledge to achieve semi-supervised learning. Although the knowledge has no direct

Experiments

To evaluate the performance of the proposed SSEN algorithm, we carry out a series of experiments on two benchmark datasets, i.e., the UCSD pedestrian dataset [5] and the Fudan pedestrian dataset. The UCSD dataset provides 2000 frames of size 238×158 extracted from a video, with ground truth. Fig. 5(a) shows several sequential frames from the dataset. The Fudan pedestrian dataset contains 1500 sequential frames of size 320×240, as shown in Fig.

Conclusion

In this paper we have proposed a semi-supervised Elastic net algorithm to count pedestrians in an image sequence. By using sequential information in the video as a regularization term, the proposed algorithm can select representative features from the original high-dimensional features without sacrificing interpretability, and attain better prediction performance than a state-of-the-art algorithm with only a small amount of training data. In the future, we will study a more effective

Acknowledgements

This work was supported in part by the NSFC (nos. 60635030, 60975044) and 973 program (no. 2010CB327900), Shanghai Leading Academic Discipline Project No. B114 and Scientific research start-up fund of the hundred talents program of CASIA (Y0J2021MZ1). The authors would like to thank the reviewers of the first version of this paper for their various comments, which helped to greatly improve the presentation.

Ben Tan received his B.S. degree in the Department of Computer Science and Technology at Tongji University, Shanghai, China, in 2007. He is now a master student in School of Computer Science, Fudan University, Shanghai, China. His current research interests include image processing and machine learning.

References (26)

  • C.P. Hou et al., Multiple view semi-supervised dimensionality reduction, Pattern Recognition (2010)
  • J.J. Verbeek et al., Gaussian fields for semi-supervised regression and correspondence learning, Pattern Recognition (2006)
  • L. Nanni et al., Dynamic plan generation and real-time management techniques for traffic evacuation, IEEE Transactions on Intelligent Transportation Systems (2008)
  • B.B. Zhan et al., Crowd analysis: a survey, Journal of Machine Vision and Applications (2008)
  • B. Leibe, E. Seemann, B. Schiele, Pedestrian detection in crowded scenes, in: IEEE Conference on Computer Vision and...
  • T.-K. Kim, R. Cipolla, Mcboost: multiple classifier boosting for perceptual co-clustering of images and visual...
  • A.B. Chan, Z.J. Liang, N. Vasconcelos, Privacy preserving crowd monitoring: counting people without people models or...
  • A.C. Davies et al., Crowd monitoring using image processing, Electronics & Communication Engineering Journal (1995)
  • R. Ma, L. Li, W. Huang, On pixel count based crowd density estimation for visual surveillance, in: IEEE Conference on...
  • A.N. Marana et al., Automatic estimation of crowd density using texture, Safety Science (1997)
  • A.N. Marana, L.F. Costa, R.A. Lotufo, S.A. Velastin, Estimating crowd density with Minkowski fractal dimension, in:...
  • D. Kong, D. Gray, H. Tao, A viewpoint invariant approach for crowd counting, in: IEEE International Conference on...
  • X.J. Zhu, Semi-supervised learning literature survey, Technical Report 1530, Computer Sciences, University of...

Junping Zhang received the M.S. degree in control theory and control engineering from Hunan University, Changsha, China, in 2000 and the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2003. Since 2003, he has been an Associate Professor with the School of Computer Science, Fudan University, Shanghai, China. His research includes machine learning, pattern recognition, image processing, biometric authentication, and intelligent transportation systems. He has been an associate editor of IEEE Intelligent Systems since 2009 and of IEEE Transactions on Intelligent Transportation Systems since 2010.

Liang Wang received both the B.Eng. and M.Eng. degrees in Electronic Engineering from Anhui University, China and the Ph.D. degree in Pattern Recognition and Intelligent System from the Institute of Automation, Chinese Academy of Sciences (CAS). From 2004 to 2009, he was with Imperial College London, UK, Monash University, and University of Melbourne, Australia, respectively. Currently, he is a Lecturer with the Department of Computer Science, University of Bath, United Kingdom. His major research interests include machine learning, pattern recognition, computer vision, multimedia processing, and data mining. He has published widely in highly ranked international journals such as IEEE TPAMI, IEEE TIP, IEEE TKDE, and IEEE TCSVT and at leading international conferences such as CVPR, ICCV and ICDM. He has served more than 20 major international journals and more than 40 major international conferences. He is an associate editor of IEEE Transactions on Systems, Man and Cybernetics: Part B, International Journal of Image and Graphics (WorldSci), International Journal of Signal Processing (Elsevier), and Neurocomputing (Elsevier). He is a lead guest editor of three special issues appearing in PRL (Pattern Recognition Letters), IJPRAI (International Journal of Pattern Recognition and Artificial Intelligence) and IEEE TSMC-B, as well as a co-editor of five edited books.
