skip to main content
10.1145/1031171.1031238acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
Article

Distance-function design and fusion for sequence data

Published:13 November 2004Publication History

ABSTRACT

Sequence-data mining plays a key role in many scientific studies and real-world applications such as bioinformatics, data stream, and sensor networks, where sequence data are processed and their semantics interpreted. In this paper we address two relevant issues: sequence-data representation, and representation-to-semantics mapping. For representation, since the best one is dependent upon the application being used and even the type of query, we propose representing sequence data in multiple views. For each representation, we propose methods to construct a <i>valid kernel</i> as the distance function to measure <i>similarity</i> between sequences. For mapping, we then find the best combination of the individual distance functions, which measure similarity of different views, to depict the target semantics. We propose a <i>super-kernel function-fusion</i> scheme to achieve the optimal mapping. Through theoretical analysis and empirical studies on UCI and real world datasets, we show our approach of multi-view representation and fusion to be mathematically valid and very effective for practical purposes.

References

  1. http://www.bioinfo.rpi.edu/zukerm/bio-5495/rnafold-html/node2.html.]]Google ScholarGoogle Scholar
  2. H. Andrd-J6nsson and D. Badal. Using signature files for querying time-series data. In proceedings of Principles of Data Mining and Knowledge Discovery, 1 st European Symposium, 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Y. B., H. Jagadish, and C. Faloutsos. Efficient retrieval of similar time sequences under time warping. In Proc. of the 14th Int'l Conference on Data Engineering. Orlando, FL, 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Y. Bengio. Gradient-based optimization of hyper-parameters. Neural Computation, 12(8), 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. T. Bozkaya, N. Yazdani, and Z. M. Ozsoyoglu. Matching and indexing sequences of different lengths. In Proc. of the 6th Int'l Conference on Knowledge Discovery and Data Mining., 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. S. Chandrasekaran, B. S. Manjunath, Y. F. Wang, J. Winkeler, and H. Zhang. An eigenspace update algorithm for image analysis. Graphical Models and Image Processing, 59(5), 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1-3), 2002.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. G. Chartrand. Introductory graph theory. New York: Dover, 1985.]]Google ScholarGoogle Scholar
  10. L. Chen, M. Tamer, and V. Oria. Symbolic representation and retrieval of moving object trajectories. University of Waterloo School of Computer Science Waterloo, Canada, Technical Report CS-2003-30., 2003.]]Google ScholarGoogle Scholar
  11. C. S. Daw, C. E. A. Finney, and E. R. Tracy. A review of symbolic analysis of experimental data. Review of Scientific Instruments, 74(2), 2003.]]Google ScholarGoogle ScholarCross RefCross Ref
  12. C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In proc. of the ACM SIGMOD Int l Conference on Management of Data., 1994.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. P. Geurts. Pattern extraction for time series classification. In Proc. of the 5th European Conference on Principles of Data Mining and Knowledge Discovery, 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. L. Hammel and J. Patel. Searching on the secondary structure of protein sequences. Proceedings of the 28th VLDB Conference, 2002.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. T. Hishiki, N. Collier, C. Nobata, T. Ohta, N. Ogata, T. Sekimizu, R. Steiner, H. Park, and J. Tsujii. Developing nlp tools for genome informatics: An information extraction perspective. In Genome Informatics. Universal Academy Press, Inc., Tokyo, Japan, 1998., 1998.]]Google ScholarGoogle Scholar
  16. Y. Huang and P. S. Yu. Adaptive query processing for time-series data. In Proc. of the 5th Int'l Conference on Knowledge Discovery and Data Mining., 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Y. Huhtala, J. Karkkainen, and H. Toivonen. Mining for similarities in aligned time series using wavelets. Data Mining and Knowlege Discovery: Theory, Tools, and Technology, SPIE Proceeding Series, 1999.]]Google ScholarGoogle Scholar
  18. E. Hunt, M. P. Atkinson, and R. W. Irving. A database index to large biological sequences. Proceedings of the 27th VLDB Conference, 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. P. Indyk, N. Koudas, and S. Muthukrishnan. Indentifying representation trends in massive time series data sets using sketches. In Proc. of the 26th Int'l Conference on Very Large Data Bases(VLDB)., 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. H. JA and M. BJ. The meaning and use of the area under a receiver operating characteristic (roc) curve. Radiology, 143(1), 1982.]]Google ScholarGoogle Scholar
  21. E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra. Locally adaptive dimensionality reduction for indexing large time series databases. In Proc. of ACM SIGMOD Conference on Management of Data., 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. E. Keogh and S. Kasetty. On the need for time series data mining benchmarks: A survey and empirical demonstration. SIGKDD, 2002.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input spaces. Proceedings of International Conference on Machine Learning (ICML), 2002.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. F. Korn, H. Jagadish, and C. Faloutsos. Efficiently supporting ad hoc queries in large datasets of time sequences. In Proc. of the ACM SIGMOD Int'l Conference on Management of Data., 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. G. R. G. Lanckriet, M. H. Deng, N. Cristianini, M. I. Jordan, and W. S. Noble. Kernel-based data fusion and its application to protein function prediction in yeast. Proceedings of the Pacific Symposium on Biocomputing, 2004.]]Google ScholarGoogle Scholar
  26. J. Lin, E. Keogh, S. Lonardi, and B. Chiu. A symbolic representation of time series, with implications for streaming algorithms. In Proc. of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, 2003.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. V. Moulton, M. Zuker, M. Steel, R. Pointon, and D. Penny. Metrics on rna secondary structures. Journal of Computational Biology, 2000.]]Google ScholarGoogle ScholarCross RefCross Ref
  28. S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of molecular biology, 1970.]]Google ScholarGoogle Scholar
  29. S. Park, S. Kim, and W. W. Chu. Segment-based approach for subsequence searches in sequence databases. In proc. of the 16th ACM Symposium on Applied Computing. Las Vegas., 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. V. Roth, J. Laub, J. Buhmann, and K.-R. Muller. Going metric: Denoising pairwise data. In Neural Information Processing Systems (NIPS), 2002.]]Google ScholarGoogle Scholar
  31. V. Roth and V. Steinhage. Nonlinear discriminant analysis using kernel functions. NIPS, 1999.]]Google ScholarGoogle Scholar
  32. B. Scholkopf and A. Smola. Learning with kernels. MIT Press, 2001.]]Google ScholarGoogle Scholar
  33. T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of molecular biology, 1981.]]Google ScholarGoogle Scholar
  34. V. Vapnik. The nature of statistical learning theory. Springer, NY, 1995.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. C. Watkins. Dynamic alignment kernels. Technical Report CSD-TR-98-11, 1999.]]Google ScholarGoogle Scholar
  36. G. Wu, Y. Wu, L. Jiao, Y.-F. Wang, and E. Chang. Multi-camera spatio-temporal fusion and biased sequence-data learning for security surveillance. Proceedings of the 11th Annual ACM International Conference on Multimedia (ACMMM), 2003.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Y. Wu, C.-Y. Lin, E. Y. Chang, and J. R. Smith. Multimodal kernel fusion for news video concept detection. IEEE International Conference on Image Processing (ICIP), 2004.]]Google ScholarGoogle Scholar

Index Terms

  1. Distance-function design and fusion for sequence data

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader