Abstract
Research and industry increasingly make use of large amounts of data to guide decision-making. To do this, however, data needs to be analyzed in typically nontrivial refinement processes, which require technical expertise about methods and algorithms, experience with how a precise analysis should proceed, and knowledge about an exploding number of analytic approaches. To alleviate these problems, a plethora of different systems have been proposed that “intelligently” help users to analyze their data.
This article provides a first survey of almost 30 years of research on intelligent discovery assistants (IDAs). It explicates the types of help IDAs can provide to users and the kinds of (background) knowledge they leverage to provide this help. Furthermore, it provides an overview of the systems developed over this period, identifies their most important features, and sketches an ideal future IDA as well as the challenges on the road ahead.