Skip to main content
Erschienen in: Data Mining and Knowledge Discovery 1/2006

01.01.2006

Computing LTS Regression for Large Data Sets

verfasst von: PETER J. ROUSSEEUW, KATRIEN VAN DRIESSEN

Erschienen in: Data Mining and Knowledge Discovery | Ausgabe 1/2006

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Data mining aims to extract previously unknown patterns or substructures from large databases. In statistics, this is what methods of robust estimation and outlier detection were constructed for, see e.g. Rousseeuw and Leroy (1987). Here we will focus on least trimmed squares (LTS) regression, which is based on the subset of h cases (out of n) whose least squares fit possesses the smallest sum of squared residuals. The coverage h may be set between n/2 and n. The computation time of existing LTS algorithms grows too much with the size of the data set, precluding their use for data mining. In this paper we develop a new algorithm called FAST-LTS. The basic ideas are an inequality involving order statistics and sums of squared residuals, and techniques which we call ‘selective iteration’ and ‘nested extensions’. We also use an intercept adjustment technique to improve the precision. For small data sets FAST-LTS typically finds the exact LTS, whereas for larger data sets it gives more accurate results than existing algorithms for LTS and is faster by orders of magnitude. This allows us to apply FAST-LTS to large databases.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
Zurück zum Zitat Agulló, J. 1997a. Computación de estimadores con alto punto de ruptura. Ph.D. Thesis, University of Alicante, Spain. Agulló, J. 1997a. Computación de estimadores con alto punto de ruptura. Ph.D. Thesis, University of Alicante, Spain.
Zurück zum Zitat Agulló, J. 1997b. Exact algorithms to compute the least median of squares estimate in multiple linear regression. In L1-Statistical Procedures and Related Topics, Y. Dodge (ed.), The IMS Lecture Notes – Monograph Series, Volume 31, pp. 133–146. Agulló, J. 1997b. Exact algorithms to compute the least median of squares estimate in multiple linear regression. In L1-Statistical Procedures and Related Topics, Y. Dodge (ed.), The IMS Lecture Notes – Monograph Series, Volume 31, pp. 133–146.
Zurück zum Zitat Chork, C.J. 1990. Unmasking multivariate anomalous observations in exploration geochemical data from sheeted-vein tin mineralization near Emmaville, N.S.W., Australia. Journal of Geochemical Exploration, 37:191–203. Chork, C.J. 1990. Unmasking multivariate anomalous observations in exploration geochemical data from sheeted-vein tin mineralization near Emmaville, N.S.W., Australia. Journal of Geochemical Exploration, 37:191–203.
Zurück zum Zitat Coakley, C.W. and Hettmansperger, T.P. 1993. A bounded influence, high breakdown, efficient regression estimator. Journal of the American Statistical Association, 88:872–880. Coakley, C.W. and Hettmansperger, T.P. 1993. A bounded influence, high breakdown, efficient regression estimator. Journal of the American Statistical Association, 88:872–880.
Zurück zum Zitat Hawkins, D.M. 1994. The feasible solution algorithm for least trimmed squares regression. Computational Statistics and Data Analysis, 17:185–196. Hawkins, D.M. 1994. The feasible solution algorithm for least trimmed squares regression. Computational Statistics and Data Analysis, 17:185–196.
Zurück zum Zitat Hawkins, D.M. and Olive, D.J. 1999. Improved feasible solution algorithms for high breakdown estimation. Computational Statistics and Data Analysis, 30:1–11. Hawkins, D.M. and Olive, D.J. 1999. Improved feasible solution algorithms for high breakdown estimation. Computational Statistics and Data Analysis, 30:1–11.
Zurück zum Zitat Hössjer, O. 1994. Rank-based estimates in the linear model with high breakdown point. Journal of the American Statistical Association, 89:149–158. Hössjer, O. 1994. Rank-based estimates in the linear model with high breakdown point. Journal of the American Statistical Association, 89:149–158.
Zurück zum Zitat Huang, Z. 1998. Extensions of the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2:283–304. Huang, Z. 1998. Extensions of the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2:283–304.
Zurück zum Zitat Kaufman, L. and Rousseeuw, P.J. 1986. Clustering large data sets. In Pattern Recognition in Practice II, E.S. Gelsema and L.N. Kanal (eds.) Elsevier/North-Holland, pp. 425–437. Kaufman, L. and Rousseeuw, P.J. 1986. Clustering large data sets. In Pattern Recognition in Practice II, E.S. Gelsema and L.N. Kanal (eds.) Elsevier/North-Holland, pp. 425–437.
Zurück zum Zitat Kaufman, L. and Rousseeuw, P.J. 1990. Finding Groups in Data, New York: John Wiley. Kaufman, L. and Rousseeuw, P.J. 1990. Finding Groups in Data, New York: John Wiley.
Zurück zum Zitat Meer, P., Mintz, D., Rosenfeld, A., and Kim, D. 1991. Robust regression methods in computer vision: a review. International Journal of Computer Vision, 6:59–70. Meer, P., Mintz, D., Rosenfeld, A., and Kim, D. 1991. Robust regression methods in computer vision: a review. International Journal of Computer Vision, 6:59–70.
Zurück zum Zitat Mili, L., Phaniraj, V., and Rousseeuw, P.J. 1991. Least median of squares estimation in power systems (with discussion). IEEE Trans. on Power Systems, 6:511–523. Mili, L., Phaniraj, V., and Rousseeuw, P.J. 1991. Least median of squares estimation in power systems (with discussion). IEEE Trans. on Power Systems, 6:511–523.
Zurück zum Zitat Mili, L., Cheniae, N.S., and Rousseeuw, P.J. 1996. Robust state estimation based on projection statistics (with discussion). IEEE Trans. on Power Systems, 11:1118–1127. Mili, L., Cheniae, N.S., and Rousseeuw, P.J. 1996. Robust state estimation based on projection statistics (with discussion). IEEE Trans. on Power Systems, 11:1118–1127.
Zurück zum Zitat Ng, R.T. and Han, J., 1994. Efficient and effective clustering methods for spatial data mining. Proceedings of the International Conference on Very Large Data Bases (VLDB ’94), Santiago, Chile, September 1994, pp. 144–155. Ng, R.T. and Han, J., 1994. Efficient and effective clustering methods for spatial data mining. Proceedings of the International Conference on Very Large Data Bases (VLDB ’94), Santiago, Chile, September 1994, pp. 144–155.
Zurück zum Zitat Odewahn, S.C., Djorgovski, S.G., Brunner, R.J., and Gal, R. 1998. Data From the Digitized Palomar Sky Survey. Technical Report, California Institute of Technology. Odewahn, S.C., Djorgovski, S.G., Brunner, R.J., and Gal, R. 1998. Data From the Digitized Palomar Sky Survey. Technical Report, California Institute of Technology.
Zurück zum Zitat Rousseeuw, P.J. 1984. Least median of squares regression. Journal of the American Statistical Association, 79:871–880. Rousseeuw, P.J. 1984. Least median of squares regression. Journal of the American Statistical Association, 79:871–880.
Zurück zum Zitat Rousseeuw, P.J. 1985. Multivariate estimation with high breakdown point. In Mathematical Statistics and Applications, Vol B, W. Grossmann, G. Pflug, I. Vincze and W. Wertz (eds.) Dordrecht: Reidel, pp. 283–297. Rousseeuw, P.J. 1985. Multivariate estimation with high breakdown point. In Mathematical Statistics and Applications, Vol B, W. Grossmann, G. Pflug, I. Vincze and W. Wertz (eds.) Dordrecht: Reidel, pp. 283–297.
Zurück zum Zitat Rousseeuw, P.J. 1997. Introduction to positive-breakdown methods. In Handbook of Statistics, Vol. 15: Robust Inference, G.S. Maddala and C.R. Rao (eds.) Amsterdam: Elsevier, pp. 101–121. Rousseeuw, P.J. 1997. Introduction to positive-breakdown methods. In Handbook of Statistics, Vol. 15: Robust Inference, G.S. Maddala and C.R. Rao (eds.) Amsterdam: Elsevier, pp. 101–121.
Zurück zum Zitat Rousseeuw, P.J. and Hubert, M. 1997. Recent developments in PROGRESS. In \({\rm L}_1\)-Statistical Procedures and Related Topics, Y. Dodge (ed.) The IMS Lecture Notes – Monograph Series, Vol. 31, pp. 201–214. Rousseeuw, P.J. and Hubert, M. 1997. Recent developments in PROGRESS. In \({\rm L}_1\)-Statistical Procedures and Related Topics, Y. Dodge (ed.) The IMS Lecture Notes – Monograph Series, Vol. 31, pp. 201–214.
Zurück zum Zitat Rousseeuw, P.J. and Leroy, A.M. 1987. Robust Regression and Outlier Detection, New York: John Wiley. Rousseeuw, P.J. and Leroy, A.M. 1987. Robust Regression and Outlier Detection, New York: John Wiley.
Zurück zum Zitat Rousseeuw, P.J. and Van Driessen, K. 1999. A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41:212–223. Rousseeuw, P.J. and Van Driessen, K. 1999. A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41:212–223.
Zurück zum Zitat Rousseeuw, P.J. and van Zomeren, B.C., 1990. Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association, 85:633–639. Rousseeuw, P.J. and van Zomeren, B.C., 1990. Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association, 85:633–639.
Zurück zum Zitat Steele, J.M. and Steiger, W.L. 1986. Algorithms and complexity for least median of squares regression. Discrete Applied Mathematics, 14:93–100. Steele, J.M. and Steiger, W.L. 1986. Algorithms and complexity for least median of squares regression. Discrete Applied Mathematics, 14:93–100.
Zurück zum Zitat Stromberg, A.J. 1993. Computing the exact least median of squares estimate and stability diagnostics in multiple linear regression. SIAM Journal of Scientific Computing, 14:1289–1299. Stromberg, A.J. 1993. Computing the exact least median of squares estimate and stability diagnostics in multiple linear regression. SIAM Journal of Scientific Computing, 14:1289–1299.
Zurück zum Zitat Simpson, D.G., Ruppert, D., and Carroll, R.J. 1992. On one-step GM-estimates and stability of inferences in linear regression. Journal of the American Statistical Association, 87:439–450. Simpson, D.G., Ruppert, D., and Carroll, R.J. 1992. On one-step GM-estimates and stability of inferences in linear regression. Journal of the American Statistical Association, 87:439–450.
Zurück zum Zitat Wang, C.M., Vecchia, D.F., Young, M. and Brilliant, N.A. 1997. Robust regression applied to optical fiber dimensional quality control. Technometrics, 39:25–33. Wang, C.M., Vecchia, D.F., Young, M. and Brilliant, N.A. 1997. Robust regression applied to optical fiber dimensional quality control. Technometrics, 39:25–33.
Zurück zum Zitat Woodruff, D.L. and Rocke, D.M. 1994. Computable robust estimation of multivariate location and shape in high dimension using compound estimators. Journal of the American Statistical Association, 89:888–896. Woodruff, D.L. and Rocke, D.M. 1994. Computable robust estimation of multivariate location and shape in high dimension using compound estimators. Journal of the American Statistical Association, 89:888–896.
Zurück zum Zitat Yohai, V.J. 1987. High breakdown point and high efficiency robust estimates for regression. Annals of Statistics, 15:642–656. Yohai, V.J. 1987. High breakdown point and high efficiency robust estimates for regression. Annals of Statistics, 15:642–656.
Zurück zum Zitat Zhang, T., Ramakrishnan, R., and Livny, M. 1997. BIRCH: a new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, 1:141–182. Zhang, T., Ramakrishnan, R., and Livny, M. 1997. BIRCH: a new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, 1:141–182.
Metadaten
Titel
Computing LTS Regression for Large Data Sets
verfasst von
PETER J. ROUSSEEUW
KATRIEN VAN DRIESSEN
Publikationsdatum
01.01.2006
Verlag
Springer US
Erschienen in
Data Mining and Knowledge Discovery / Ausgabe 1/2006
Print ISSN: 1384-5810
Elektronische ISSN: 1573-756X
DOI
https://doi.org/10.1007/s10618-005-0024-4

Weitere Artikel der Ausgabe 1/2006

Data Mining and Knowledge Discovery 1/2006 Zur Ausgabe