Skip to main content

2017 | OriginalPaper | Buchkapitel

Generating Fixed-Size Training Sets for Large and Streaming Datasets

verfasst von : Stefanos Ougiaroglou, Georgios Arampatzis, Dimitris A. Dervos, Georgios Evangelidis

Erschienen in: Advances in Databases and Information Systems

Verlag: Springer International Publishing

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

The k Nearest Neighbor is a popular and versatile classifier but requires a relatively small training set in order to perform adequately, a prerequisite not satisfiable with the large volumes of training data that are nowadays available from streaming environments. Conventional Data Reduction Techniques that select or generate training prototypes are also inappropriate in such environments. Dynamic RHC (dRHC) is a prototype generation algorithm that can update its condensing set when new training data arrives. However, after repetitive updates, the size of the condensing set may become unpredictably large. This paper proposes dRHC2, a new variation of dRHC, which remedies the aforementioned drawback. dRHC2 keeps the size of the condensing set in a convenient, manageable by the classifier, level by ranking the prototypes and removing the least important ones. dRHC2 is tested on several datasets and the experimental results reveal that it is more efficient and noise tolerant than dRHC and is comparable to dRHC in terms of accuracy.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Aggarwal, C.: Data Streams: Models and Algorithms. Advances in Database Systems Series. Springer, Heidelberg (2007)CrossRefMATH Aggarwal, C.: Data Streams: Models and Algorithms. Advances in Database Systems Series. Springer, Heidelberg (2007)CrossRefMATH
3.
Zurück zum Zitat Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S.: KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. Multi. Valued Logic Soft Comput. 17(2–3), 255–287 (2011) Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S.: KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. Multi. Valued Logic Soft Comput. 17(2–3), 255–287 (2011)
7.
Zurück zum Zitat Gama, J.A., Sebastião, R., Rodrigues, P.P.: Issues in evaluation of stream learning algorithms. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 329–338, KDD 2009. ACM, New York (2009). http://doi.acm.org/10.1145/1557019.1557060 Gama, J.A., Sebastião, R., Rodrigues, P.P.: Issues in evaluation of stream learning algorithms. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 329–338, KDD 2009. ACM, New York (2009). http://​doi.​acm.​org/​10.​1145/​1557019.​1557060
9.
Zurück zum Zitat Hart, P.E.: The condensed nearest neighbor rule. IEEE Trans. Inf. Theory 14(3), 515–516 (1968)CrossRef Hart, P.E.: The condensed nearest neighbor rule. IEEE Trans. Inf. Theory 14(3), 515–516 (1968)CrossRef
10.
Zurück zum Zitat Olvera-Lopez, J.A., Carrasco-Ochoa, J.A., Trinidad, J.F.M.: A new fast prototype selection method based on clustering. Pattern Anal. Appl. 13(2), 131–141 (2010)MathSciNetCrossRef Olvera-Lopez, J.A., Carrasco-Ochoa, J.A., Trinidad, J.F.M.: A new fast prototype selection method based on clustering. Pattern Anal. Appl. 13(2), 131–141 (2010)MathSciNetCrossRef
13.
14.
Zurück zum Zitat Sánchez, J.S.: High training set size reduction by space partitioning and prototype abstraction. Pattern Recogn. 37(7), 1561–1564 (2004)CrossRef Sánchez, J.S.: High training set size reduction by space partitioning and prototype abstraction. Pattern Recogn. 37(7), 1561–1564 (2004)CrossRef
16.
Zurück zum Zitat Tsymbal, A.: The problem of concept drift: definitions and related work. Technical report TCD-CS-2004-15, The University of Dublin, Trinity College, Department of Computer Science, Dublin, Ireland (2004) Tsymbal, A.: The problem of concept drift: definitions and related work. Technical report TCD-CS-2004-15, The University of Dublin, Trinity College, Department of Computer Science, Dublin, Ireland (2004)
Metadaten
Titel
Generating Fixed-Size Training Sets for Large and Streaming Datasets
verfasst von
Stefanos Ougiaroglou
Georgios Arampatzis
Dimitris A. Dervos
Georgios Evangelidis
Copyright-Jahr
2017
DOI
https://doi.org/10.1007/978-3-319-66917-5_7