Abstract
With the explosion of multimedia data, data of different modalities often coexist in web repositories. To improve multimedia semantics understanding, it is therefore increasingly important to explore the intricate underlying cross-media correlation rather than rely on single-modality distance measures. Cross-media distance metric learning focuses on measuring the correlation between multimedia data of different modalities. However, content heterogeneity and the semantic gap make cross-media distance very challenging to measure. In this paper, we propose a novel cross-media distance metric learning framework based on sparse feature selection and multi-view matching. First, we employ sparse feature selection to select a subset of relevant features and remove redundant ones from high-dimensional image and audio features. Second, we maximize the canonical correlation coefficient during image-audio feature dimension reduction for cross-media correlation mining. Third, we construct a Multi-modal Semantic Graph to find the embedded manifold cross-media correlation. Finally, we fuse the canonical correlation and the manifold information into multi-view matching, which harmonizes the different correlations through an iterative process, and build a Cross-media Semantic Space for cross-media distance measurement. Experiments are conducted on an image-audio dataset for cross-media retrieval. The results are encouraging and show that our approach is effective.
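The canonical-correlation step mentioned in the abstract can be illustrated with a minimal sketch. It assumes plain regularized linear CCA implemented in NumPy and is not the authors' implementation; the function name `linear_cca`, the regularizer `reg`, and the synthetic `image_feats` and `audio_feats` matrices are hypothetical and used only for illustration.

```python
import numpy as np

def linear_cca(X, Y, k=2, reg=1e-3):
    """Regularized linear CCA (illustrative sketch, not the paper's method).

    X: (n, p) features of one modality, Y: (n, q) features of the other.
    Returns projections Wx (p, k), Wy (q, k) and the top-k canonical correlations.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Covariance blocks; the ridge term keeps the Cholesky factors well defined.
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)
    Lx = np.linalg.cholesky(Cxx)  # Cxx = Lx @ Lx.T
    Ly = np.linalg.cholesky(Cyy)  # Cyy = Ly @ Ly.T
    # Whitened cross-covariance T = Lx^{-1} Cxy Ly^{-T}.
    A = np.linalg.solve(Lx, Cxy)
    T = np.linalg.solve(Ly, A.T).T
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    # Map the singular vectors back to the original feature spaces.
    Wx = np.linalg.solve(Lx.T, U[:, :k])
    Wy = np.linalg.solve(Ly.T, Vt.T[:, :k])
    return Wx, Wy, s[:k]

# Synthetic stand-ins for paired image and audio features sharing one latent factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
image_feats = latent @ rng.normal(size=(1, 5)) + 0.1 * rng.normal(size=(200, 5))
audio_feats = latent @ rng.normal(size=(1, 4)) + 0.1 * rng.normal(size=(200, 4))

Wx, Wy, corrs = linear_cca(image_feats, audio_feats, k=2)
print("canonical correlations:", corrs)
```

Once both modalities are projected with `Wx` and `Wy`, samples of different types lie in a common low-dimensional space where an ordinary distance can be computed. The framework described in the abstract additionally combines this correlation with manifold information from the Multi-modal Semantic Graph, which this sketch does not attempt.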
Acknowledgments
This research is supported by the National Natural Science Foundation of China (No. 61373109, No. 61003127, No. 61273303 and No. 61440016), the State Key Laboratory of Software Engineering (SKLSE2012-09-31), the Program for Outstanding Young Science and Technology Innovation Teams in Higher Education Institutions of Hubei Province, China (No. T201202), and the Natural Science Foundation of Hubei Province of China (2014CFB247).
About this article
Cite this article
Zhang, H., Gao, X., Wu, P. et al. A cross-media distance metric learning framework based on multi-view correlation mining and matching. World Wide Web 19, 181–197 (2016). https://doi.org/10.1007/s11280-015-0342-4
DOI: https://doi.org/10.1007/s11280-015-0342-4