Classification and Data Analysis


Entropy Optimizing Methods for the Estimation of Tables

A new procedure for the problem of recovering tabular data in case of incomplete or inconsistent information is presented. It generalizes the well-known RAS (or IPF) algorithm by allowing a wider class of constraints concerning the table entries, such as equalities and inequalities over arbitrary cross sections. The theoretical background of the procedure is outlined and some examples of applications are reported.

Uwe Blien, Friedrich Graef
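
As a point of reference for the abstract above, the classical RAS (IPF) algorithm that the procedure generalizes can be sketched as follows. This is a minimal illustration of plain IPF with row and column totals only, not the authors' generalized method; the function name `ipf`, the seed table and the target margins are assumptions made for the example.

```python
def ipf(table, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Classical RAS / IPF: alternately rescale rows and columns to match target margins."""
    t = [[float(x) for x in row] for row in table]
    for _ in range(max_iter):
        # Row step: scale each row so that it sums to its target.
        for i, target in enumerate(row_targets):
            s = sum(t[i])
            if s > 0:
                t[i] = [x * target / s for x in t[i]]
        # Column step: scale each column so that it sums to its target.
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in t)
            if s > 0:
                for row in t:
                    row[j] *= target / s
        # Column sums are exact after the column step, so check the row sums.
        err = max(abs(sum(t[i]) - row_targets[i]) for i in range(len(t)))
        if err < tol:
            break
    return t


# Hypothetical seed table and margins (both margins sum to 100).
seed = [[1.0, 2.0], [3.0, 4.0]]
fitted = ipf(seed, row_targets=[40.0, 60.0], col_targets=[30.0, 70.0])
```

The fitted table matches the prescribed margins while keeping the cross-product ratio of the seed table, which is the property that makes IPF an entropy-optimizing adjustment.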

Automatic Spectral Classification

In this paper we report on a joint research project between astronomers and philosophers of science. The philosophical and astronomical goals are described and the astronomical background is briefly reviewed. We present the current status of our development of methods for tackling the relevant classification problems, i.e.: (1) application of Bayes’ decision rule for “simple” classification of all spectra in the data base; (2) minimum cost rule classification for compilation of complete samples of rare stellar objects; and (3) Bayes classification with application of an atypicality index reject criterion for the detection of non-stellar spectra. We report on the discovery of an extremely metal poor halo star by application of method (2) to a small fraction of our data. A method for adequate handling of low signal-to-noise ratio spectra is presented. The classification methods presented are currently applied to a large data base of digital spectra.

N. Christlieb, G. Graßhoff, A. Nelke, A. Schlemminger, L. Wisotzki
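
The difference between rules (1) and (2) in the abstract above can be illustrated with a small sketch. The class names, posterior probabilities and cost values below are hypothetical and are not taken from the paper; the sketch only shows how a minimum cost rule can favour a rare class that the plain Bayes rule would miss.

```python
def bayes_rule(posterior):
    """Pick the class with the largest posterior probability."""
    return max(posterior, key=posterior.get)


def minimum_cost_rule(posterior, cost):
    """Pick the decision with the smallest expected cost.

    cost[decision][true_class] is the (assumed) cost of deciding `decision`
    when the object really belongs to `true_class`.
    """
    def expected_cost(decision):
        return sum(cost[decision][true] * p for true, p in posterior.items())

    return min(posterior, key=expected_cost)


# Hypothetical posterior for one spectrum and a hypothetical cost table:
# missing a rare object is taken to be 20 times worse than a false alarm.
posterior = {"common": 0.9, "rare": 0.1}
cost = {
    "common": {"common": 0.0, "rare": 20.0},
    "rare": {"common": 1.0, "rare": 0.0},
}

simple_decision = bayes_rule(posterior)
cost_decision = minimum_cost_rule(posterior, cost)
```

Here the plain Bayes rule selects "common", while the minimum cost rule selects "rare" because the expected cost of deciding "common" (2.0) exceeds that of deciding "rare" (0.9). This trade-off is what makes a cost-based rule suitable when the aim is a complete sample of rare objects.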

An Approach to Modelling Directional and Betweenness Data

A model is introduced to analyze and represent spatially either directly obtained or derived judgments expressing directions in space or betweenness relations among objects. For the two-dimensional case the construction rules to find a spatial representation of the object points are given in detail. A small example illustrates the procedure. Consistency tests are reported which allow a differentiated analysis of this kind of cognitive structure. The three-dimensional case is discussed briefly. Finally, it is illustrated how ratings, rankings, and co-occurrence information are transformed to betweenness data.

Hubert Feger

The Application of Random Coincidence Graphs for Testing the Homogeneity of Data

Graph-theoretic classification models provide us with probability models which can be used to study the structure of a data set. In models of random interval graphs or, generally, random coincidence graphs, points are drawn “at random” and joined by lines if their mutual distances are smaller than a threshold d. This is exactly the procedure of finding linkage clusters. We present exact and asymptotic results for properties of those random graphs, especially for the properties that the expected numbers of isolated edges and of isolated vertices remain positive finite as the total number of vertices grows. These properties can serve as test statistics for testing the homogeneity in a data set; they can be used to derive tests for goodness of fit as well.

E. Godehardt, J. Jaworski, D. Godehardt
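
The graph construction described in the abstract above can be sketched for the one-dimensional (interval graph) case. This is an illustrative assumption-laden example, not the authors' code: the function name `interval_graph_counts`, the sample points and the threshold are hypothetical, and the exact and asymptotic distributions needed for an actual test are not reproduced here.

```python
import random


def interval_graph_counts(points, d):
    """Join points whose distance is below d; return (isolated vertices, isolated edges)."""
    n = len(points)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if abs(points[i] - points[j]) < d:
                neighbors[i].add(j)
                neighbors[j].add(i)
    isolated_vertices = sum(1 for nb in neighbors if not nb)
    # An isolated edge is a pair of vertices that are each other's only neighbor.
    endpoints = sum(
        1
        for nb in neighbors
        if len(nb) == 1 and len(neighbors[next(iter(nb))]) == 1
    )
    return isolated_vertices, endpoints // 2


# Small hand-checkable example: one isolated vertex (0.5), one isolated
# edge (0.0 and 0.05), and one triangle (0.9, 0.93, 0.96).
example = interval_graph_counts([0.0, 0.05, 0.5, 0.9, 0.93, 0.96], d=0.1)

# One draw under the homogeneity hypothesis: 50 uniform points on [0, 1].
rng = random.Random(0)
sample = [rng.random() for _ in range(50)]
simulated = interval_graph_counts(sample, d=0.01)
```

Repeating the uniform draw many times would give an empirical reference distribution for the two counts, against which the counts observed in real data could be compared.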

City-Block Scaling: Smoothing Strategies for Avoiding Local Minima

Multidimensional scaling (MDS) with city-block distances suffers from many local minima if the Stress function is minimized. In fact, the problem can be viewed as a combinatorial problem, where finding the correct order of the coordinates on a dimension is crucial for attaining the minimum. Several strategies have been proposed for arriving at a global minimum of the Stress function. We pay particular attention to Pliner’s (1996) smoothing strategy for unidimensional scaling, which smoothes the concave part of the Stress function. We discuss three extensions of this strategy to the multidimensional case with city-block distances. The first extension is shown to lead to problems because it yields a unidimensional solution. A second extension, proposed by Pliner (1986), and a third extension, distance smoothing introduced here, do not have this problem. Numerical experiments with the smoothing strategy have been limited to the unidimensional case. Therefore, we present a comparison study using real data, which shows that the smoothing strategy performs better than three other strategies considered.

P. J. F. Groenen, W. J. Heiser, J. J. Meulman

Probability Models and Limit Theorems for Random Interval Graphs with Applications to Cluster Analysis

Assume that n k-dimensional data points have been obtained and subjected to a cluster analysis algorithm. A potential concern is whether the resulting clusters have a “causal” interpretation or whether they are merely consequences of a “random” fluctuation. In this report, the asymptotic properties of a number of potentially useful combinatorial tests based on the theory of random interval graphs are described. Some preliminary numerical results illustrating their possible application as a method of resolving the above question are provided.

B. Harris, E. Godehardt

Labor Supply Decisions in Germany — A Semiparametric Regression Analysis

This paper analyzes labor supply decisions of married women in order to identify differences between East and West Germany. The semiparametric General Additive Model (GAM) was chosen to avoid assumptions about the functional type of correlation and to discover characteristics in behavior. The estimator is based on a partial integration following Linton and Nielsen (1995). The analytical features of the new estimator are easier to determine than in the traditional back-fitting algorithm. This analysis unveiled significant differences in labor supply behavior between East and West Germany.

Wolfram Kempe

A Multiplicative Approach to Partitioning the Risk of Disease

Analysing the interrelations between several exposure factors that affect the risk of developing a disease is an important aim in epidemiological studies. The relative risk is perhaps the most popular parameter for quantifying the strength of such interrelations. The paper introduces the concept of factorial relative risks as one possibility of generalizing the relative risk parameter to the context of several interrelated exposures. An axiomatic justification of the new parameter is given and it is outlined that the factorial relative risk of a single exposure factor in a multifactorial setting is a measure of its individual contribution to the joint effect of all considered exposures.

M. Land, O. Gefeller

Multiple Media Stream Data Analysis: Theory and Applications

This paper presents a new model for multiple media stream data analysis as well as descriptions of some applications of this model. This model formalizes the exploitation of correlations between multiple, potentially heterogeneous, media streams in support of numerous application areas. The goal of the technique is to determine temporal and spatial alignments which optimize a correlation function and indicate commonality and synchronization between media streams. It also provides a framework for comparison of media in unrelated domains.

F. Makedon, C. Owen

Multimedia Data Analysis using ImageTcl

ImageTcl is a new system which provides powerful Tcl/Tk based media scripting capabilities similar to those of the ViewSystem and Rivl in a unique environment that allows rapid prototyping and development of new components in the C++ language. Powerful user tools automate the creation of new components as well as the addition of new data types and file formats. Applications using ImageTcl at the Dartmouth Experimental Visualization Laboratory (DEVLAB) include multiple stream media data analysis, automatic image annotation, and image sequence motion analysis. ImageTcl combines the high speed of compiled languages with the testing and parameterization advantages of scripting languages.

F. Makedon, C. Owen

Robust Bivariate Boxplots and Visualization of Multivariate Data

Zani et al. (1997) suggested a simple way of constructing a bivariate boxplot based on convex hull peeling and B-spline smoothing. This approach defines a natural, smooth and completely nonparametric region in ℝ² which retains the correlation in the data and adapts to differing spread in the various directions. In this paper we initially consider some variations of this method. The proposed approach shows some advantages with respect to that suggested by Goldberg and Iglewicz (1992), because we do not need to estimate either the standard deviations of the two variables or a correlation measure. Furthermore we also show how, in the presence of p-dimensional data, the data visualization method based on the construction of the scatterplot matrix with superimposed bivariate boxplots in each diagram can become a very useful tool for the detection of multivariate outliers, the analysis of multivariate transformations and more generally for the ordering of multidimensional data.

M. Riani, S. Zani, A. Corbellini

Unsupervised Fuzzy Classification of Multispectral Imagery Using Spatial-Spectral Features

Pixel-wise spectral classification is a widely used technique to produce thematic maps from remotely sensed multispectral imagery. It is commonly based on purely spectral features. In our approach we additionally consider spatial features in the form of local context information. After all, spatial context is the defining property of an image. Markov random field modeling provides the assumption that the probability that a pixel belongs to a certain class depends on the pixel’s local neighborhood. We enhance the ICM algorithm of Besag (1986) to account for the fuzzy class membership in the fuzzy clustering algorithm of Bezdek (1973). The algorithm presented here was tested on simulated and real remotely sensed multispectral imagery. We demonstrate the improvement of the clustering as achieved by the additional spatial fuzzy neighborhood features.

Rafael Wiemker

Mathematical and Statistical Methods


Some News about C.A.MAN: Computer Assisted Analysis of Mixtures

The paper reviews recent developments in the area of computer assisted analysis of mixture distributions (C.A.MAN). Nonparametric mixture distribution modelling of heterogeneity in populations can become the standard model in many biometric applications since it also incorporates the homogeneous situation as a special case. The approach is nonparametric for the mixing distribution, which includes leaving the number of components (subpopulations) of the mixing distribution unknown. Besides developments in theory and algorithms, the work focuses on various biometric applications.

D. Böhning, E. Dietz

Mathematical Aspects of the Feature Pattern Analysis

The Feature Pattern Analysis (FPA), as introduced by Feger (1988), is a method which analyzes a set of observed patterns with respect to co-occurrence. The mathematical formalism of the FPA and the several logically equivalent alternative forms of its representation as geometrical configurations, sets of contingencies and sets of prediction rules are described. Mathematical conditions for the uniqueness and existence of Type I and Type II FPA-solutions are discussed. A fast algorithm is developed to construct a two-dimensional FPA-solution using Hasse diagrams.

Michelle Brehm

A Note on the Off-Block-Diagonal Approximation of the Burt Matrix as Applied in Joint Correspondence Analysis

Joint correspondence analysis (JCA) is a commonly applied variation of multiple correspondence analysis (MCA) where the block-diagonal part of the Burt matrix is not considered in the fit. Examples shown here underline that this approach may in some cases lead to ambiguous results which may violate desirable properties of the representation.

Johannes Faßbinder

A New Look at the Visual Performance of Nonparametric Hazard Rate Estimators

Nonparametric curve estimation by kernel methods has attracted widespread interest in theoretical and applied statistics. One area of conflict between theory and application relates to the evaluation of the performance of the estimators. Recently, Marron and Tsybakov (1995) proposed visual error criteria for addressing this issue of controversy in density estimation. Their core idea consists in using integrated alternatives to the Hausdorff distance for measuring the closeness of two sets based on the Euclidean distance. In this paper, we transfer these ideas to hazard rate estimation from censored data. We are able to derive similar results that help to understand when the application of the new criteria will lead to answers that differ from those given by the conventional approach.

O. Gefeller, N. L. Hjort

Multilevel Modeling: When and Why

Multilevel models have become popular for the analysis of a variety of problems. This chapter gives a summary of the reasons for using multilevel models, and provides examples why these reasons are indeed valid. Next, recent (simulation) research is reviewed on the robustness and power of the usual estimation procedures with varying sample sizes.

J. Hox

Upper Bounds for the P-Values of a Scan Statistic with a Variable Window

It is asked if n independent events occurring in a given time interval are clustered or if, alternatively, the null hypothesis of a uniform distribution can be adopted. A simple scan statistic defined to be the maximum number of events within any sub-interval (or window) of given length has been used as test statistic in this context. Nagarwalla (1996) described a modification of this scan statistic, based on a generalized likelihood ratio statistic, which no longer assumes that the window width is fixed a priori. Unfortunately, the distribution of this statistic is not known and a simulation procedure had to be applied. In this paper a much simpler statistic is proposed which can be considered as an approximation of Nagarwalla’s statistic. For this new statistic, upper bounds for the upper tail probabilities are given. Thus, the new test can be performed without recourse to a simulation. Furthermore, no restrictions on the cluster size are imposed. The procedure is illustrated by examples from epidemiology.

J. Krauth
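
The simple fixed-window scan statistic mentioned at the start of the abstract above (the maximum number of events within any window of given length) can be computed as sketched below. This covers only that classical statistic, not Nagarwalla's variable-window statistic nor the new statistic and bounds proposed in the paper; the function name, the event times and the choice of a closed window are assumptions for illustration.

```python
def scan_statistic(events, w):
    """Maximum number of events in any closed window [t, t + w]."""
    ts = sorted(events)
    best = 0
    right = 0
    # It suffices to try windows that start at an event time.
    for left in range(len(ts)):
        if right < left:
            right = left
        # Advance `right` past all events inside [ts[left], ts[left] + w].
        while right < len(ts) and ts[right] - ts[left] <= w:
            right += 1
        best = max(best, right - left)
    return best


# Hypothetical event times on the unit interval.
times = [0.1, 0.12, 0.15, 0.5, 0.52, 0.9]
max_in_window = scan_statistic(times, w=0.1)
```

For these times the densest window of length 0.1 starts at 0.1 and contains three events, so `max_in_window` is 3. Judging whether such a value is unusually large under the uniform null hypothesis is the problem the paper addresses with its upper bounds.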

A Branch-and-bound Algorithm for Boolean Regression

This paper proposes a branch-and-bound algorithm to trace disjunctive (conjunctive) combinations of binary predictor variables to predict a binary criterion variable. The algorithm allows for finding logical classification rules that can be used to derive whether or not a given object belongs to a given category based on the attribute pattern of the object. An objective function is minimized which takes into account both accuracy in prediction and cost of the predictors. A simulation study is presented in which the performance of the algorithm is evaluated.

Iwin Leenen, Iven Van Mechelen

Mathematical Classification and Clustering: From How to What and Why

Although some clustering techniques are well known and widely used, their theoretical foundations are still unclear. We consider an approach, approximation clustering, as a unifying framework for providing theoretical foundations for some popular techniques. The questions of interrelation of the models with each other and with some other methods (especially in contingency and spatial data analyses) are also discussed.

B. Mirkin

Heteroskedastic Linear Regression Models: A Bayesian Analysis

Heteroskedastic linear regression models with linear and non-linear multiplicative and additive specifications are analysed. A Bayesian estimation approach based on natural conjugate priors and a Markov Chain Monte Carlo (MCMC) method is proposed. The numerical computation is done using the Gibbs and Metropolis sampling algorithms. Simulated data sets are examined. A marginal likelihood analysis is proposed to compare specifications for modelling the heteroskedasticity.

W. Polasek, S. Liu, S. Jin

A Heuristic Partial-Least-Squares Approach to Estimating Dynamic Path Models

An approach to dynamic modelling with latent variables is proposed. It has been developed on the basis of H. Wold’s Partial Least Squares (PLS). An operator matrix containing the lag operator L is substituted for the path coefficient matrix of Wold’s static PLS model. The original PLS estimation algorithm is virtually applicable to what is called the dynamic PLS model (DPLS). Lagged and leaded latent variables are used in the iterative process of estimating the weights of the manifest variables. The path coefficients are estimated by OLS or an appropriate dynamic modelling method. The redundancy coefficient allows the forecasting validity to be measured. DPLS has been programmed in PC-ISP/DGS©. Some properties of DPLS will be shown by simulation.

Hans Gerhard Strohe

World Wide Web and the Internet


Using Logic for the Specification of Hypermedia Documents

We describe an approach to the specification of hypermedia systems using first order logic. The static part concerning the document structure is described using Horn Clauses; the dynamic part uses Smolka’s feature logic. The linguistic mechanisms for a hypermedia description language are outlined, and it is emphasized that an object oriented approach is helpful.

Ernst-Erich Doberkat

Project TeleTeaching Mannheim — Heidelberg

The technology needed for teleteaching is widely available today. Interest focuses on the use of multimedia technology and high-speed networks to disseminate course content and work and deepen understanding on the part of the students. The Universities of Mannheim and Heidelberg are engaged in a joint pilot project to develop and test new technologies for teleteaching in a digital network. High-capacity multimedia workstations and PCs are linked via ATM to enable access over the network to lectures, exercises and stored teaching materials. The departments of education and psychology of the two universities are scientifically advising and evaluating the project.

Wolfgang Effelsberg, Werner Geyer, Andreas Eckert

WWW-Access to Relational Databases

After a general discussion of Web Database Connectivity this paper presents a concept for a transaction-based Web-frontend for relational databases. The prototype was developed in Objective-C using NeXT’s WebObjects and allows consistent operations even across different database systems. The future work will focus on views, workflows, and user administration.

W. Esswein, A. Selz

Technology, Data, Relevancy: A Culture-Theoretical Look at the Internet

The Internet is a medium that gives access to an unknown amount and quality of knowledge. This knowledge is, moreover, stored in several ways and distributed over several continents. As a potential source for problem-solving, this knowledge cannot be ignored. Clearly enough, the Internet and its users share important characteristics with other socio-cultural systems. Assuming this parallel, three points seem to be particularly interesting: the problem of access, the problem of meaning, and the problem of relevancy. My discussion of these problem areas first with respect to culture in general and then with respect to the Internet seeks both to enlarge the discussion and to substantiate a proposition that opposes the idea of an “information overload” suggested by the title of this conference. The proposition says: “Information overload” is a phenomenon of stress. It results, at least partially, from our insistence on clinging to solutions determined by what we take to be the very nature of the object itself. As a result, we tend to ignore behavioral strategies that evolved to cope with relevancy problems.

Peter M. Hejl

Self-Organizing Maps of Very Large Document Collections: Justification for the WEBSOM Method

Powerful methods are needed for interactive exploration and search from collections of miscellaneous textual documents that are available in the electronic media. Searching from text documents has traditionally been based on keywords and Boolean expressions. With the WEBSOM method a document collection may be organized into a map display that provides an overview of the collection and facilitates interactive browsing. Interesting documents can be retrieved by a content addressable search. The WEBSOM method is based on using the Self-Organizing Map algorithm for automatically learning relevant structures in the text and for organizing the document collection.

T. Honkela, S. Kaski, T. Kohonen, K. Lagus

Segment-Specific Aspects of Designing Online Services in the Internet

The success of Internet services depends on, e.g., the identification and description of relevant user segments and efforts to establish segment-specific attraction. In contrast to other Internet surveys the approach discussed here takes into consideration behavioral aspects of potential users related to the design of online services in the World Wide Web (Web for short). Preference data have been collected using a technical realization of online pairwise comparisons of selected homepages together with association data concerning indicators for homepage valuation and data on attitudes towards Web site features. Segment-specific results concerning Web site design aspects will be reported.

T. Klein, W. Gaul, F. Wartenberg

Design of World Wide Web Information Systems

Public global networks such as the Internet with the World Wide Web increasingly represent a suitable platform for distributed information systems. Yet, the lack of methods for conceptual modelling (as known from database design) often leads to quick and dirty implementations with uncontrollable data, data redundancies and data inconsistencies. In this paper, we present a method for the conceptual modelling of information systems within the World Wide Web. Starting from an extended Entity Relationship model, a page link scheme can be derived according to a classification of web pages and their links.

K. Lenz, A. Oberweis

Large WWW Systems: New Phenomena, Problems and Solutions

This paper is a compact written version of an invited paper presented on the occasion of the 21st Annual Meeting of the German Society for Classification. It is structured into three parts. In the first we try to show how WWW is starting to influence all kinds of areas of our lives, and in doing so new and unforeseen phenomena and problems arise. In the second part we explain how modern WWW systems are starting to reduce the problems discussed. And in the last part we analyse a number of statements often made about WWW that — when carefully analysed — turn out to be superficial, if not incorrect. We do not include many references to written papers, but rather pointers to further information on the WWW.

H. Maurer

Structured Visualization of Search Result List

Information highways are being built around the world. The impact on the business of companies like Siemens is tremendous. New techniques for information processing are required. The integration of features for navigation, in addition to the static presentation, enables the user to find the relevant information by browsing through the information space. Through structured visualization, which means representing the information in the form of a 3-D net, the “Kontextgestalt”, the user gets an overview of the relevant topics the information set is concerned with.

U. Preiser

Speech and Pattern Recognition


Application of Discriminative Methods for Isolated Word Recognition

This paper describes a Hidden Markov Model based system for automatic recognition of isolated digits over telephone lines. For an LDA based linear feature transformation the classes to discriminate are chosen to be the HMM states. For MCE training this selection of classes is compared to the usage of the lexical words treated as classes. Experiments show that for MCE based reestimation of model parameters the latter choice is more appropriate, although in the case of Maximum Likelihood trained parameters the correlation between Word Error rate and State Error rate is quite high.

Josef G. Bauer

Statistical Classifiers in Computer Vision

This paper introduces a unified Bayesian approach to 3-D computer vision using segmented image features. The theoretical part summarizes the basic requirements of statistical object recognition systems. Non-standard types of models are introduced using parametric probability density functions, which allow the implementation of Bayesian classifiers for object recognition purposes. The importance of model densities is demonstrated by concrete examples. Normally distributed features are used for automatic learning, localization, and classification. The contribution concludes with the experimental evaluation of the presented theoretical approach.

J. Hornegger, D. Paulus, H. Niemann

Speech Signal Classification with Hybrid Systems

This paper gives a brief overview of two successful hybrid approaches that combine artificial neural networks and Hidden Markov Models for speech signal classification tasks. At first, a short description of traditional stochastic-based Hidden Markov Model speech recognizers with different kinds of emission probabilities is given. The first proposed hybrid approach uses a neural network that approximates arbitrary emission densities in a model-free way. The second hybrid system uses discrete models and a neural network that is trained to work as an optimal vector quantizer. The paper compares both systems and integrates them in the traditional stochastic model framework. Speech recognition results are given for the speaker-independent continuous speech ARPA resource management database.

Ch. Neukirchen, G. Rigoll

Stochastic Modelling of Knowledge Sources in Automatic Speech Recognition

This paper gives an overview of the stochastic approach in automatic speech recognition. The Bayes decision rule along with its application to the speech recognition problem is discussed. There are five topics in stochastic modelling for speech recognition that are studied in more detail: the EM algorithm, the probabilistic interpretation of neural net outputs, the method of decision trees, the leaving-one-out method for language modelling and the maximum entropy approach to language modelling.

Hermann Ney

Classification of Speech Pattern Using Locally Recurrent Neural Networks

The subject of automatic speech recognition is the classification of speech patterns as phones, syllables or words. Speech is generated by a complex articulation process which is influenced by coarticulation effects and depends on speaker characteristics. Thus, static as well as dynamic aspects of the resulting speech signal must be captured during feature extraction. Optimum classification of speech patterns consisting of feature vectors can only be achieved if this representation of information is adequate for the chosen classification method. Here we present such a combination consisting of psychoacoustically oriented features and locally recurrent neural networks.

H. Reininger, K. Kasper, H. Wüst

Knowledge and Databases


Information Gathering for Vague Queries Using Case Retrieval Nets

Case Retrieval Nets (CRNs) have been developed for the efficient and flexible retrieval of cases from large case bases in the context of Case Based Reasoning. The information access in CRNs is performed by a bottom-up spreading activation process according to similarity and relevance from query related nodes to case nodes. Special attention is paid to the handling of vague queries in CRNs.

Hans-Dieter Burkhard

Characterizing Bibliographic Databases by Content — an Experimental Approach

With the growing number of bibliographic databases accessible online, the user has to choose between numerous sources. For that, content descriptions of the databases are necessary. In this paper we present an approach to describe bibliographic databases by content. It is based on classification profiles, which show the portion of a given database each class of a classification has. Approximations to these profiles can be gained automatically with a limited number of queries with characteristic terms for each class of a classification. A way to obtain characteristic terms is presented and the method is tested on different databases.

M. Dreger, S. Göbel, S. Lohrum

Medoc: Searching Heterogeneous Bibliographic and Text Databases

The Medoc system aims at providing Computer Science researchers and practitioners with information they need, on their desktops. On the one hand, this includes building a database of full text documents with browsing and navigation functions in addition to the usual search facilities. On the other hand, transparent access to heterogeneous full text and bibliographic databases is provided. The paper presented develops a document model needed to support all of the functions for the heterogeneous databases and sketches some methods for querying heterogeneous databases.

Kai Großjohann, Cornelia Haber, Ricarda Weber

Supervised Learning with Qualitative and Mixed Attributes

A Local Scaling Approach to Discriminate between Good and Bad Credit Risks

Building classification tools to discriminate between good and bad credit risks is a supervised learning task which can be solved using different approaches (Graf and Nakhaeizadeh (1994)). In constructing such tools, generally, a set of training data, containing qualitative and quantitative attributes, is used to learn the discriminant rules. In the real world of credit applications a lot of the available information about the customer and his payment behaviour appears in qualitative, categorical attributes. On the other hand, many approaches of supervised learning require quantitative, numerical input variables to be processed in the learning algorithms. Qualitative attributes first have to be transformed into a numerical form before they can be used for the learning process. One very simple approach to handle that problem is to code each possible value of all qualitative categorical attributes in new, separate binary attributes. This leads to an increasing number of input variables, and the learning process to build the rules gets more complicated. In particular, neural networks need more time for training and often lose accuracy. In this paper we consider different scaling approaches (here the number of variables does not increase) to transform categorical into numerical attributes (Nishisato (1994)). We use them as input variables to learn the discriminant rules and develop a method of local scaling to enhance accuracy and stability of the rules. Using real world credit data, we evaluate the different approaches and compare the results.

Harald Kauderer, Hans-Joachim Mucha
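
The "very simple approach" that the abstract above contrasts with scaling, coding each value of a categorical attribute as a separate binary attribute, can be sketched as follows. The attribute names and records are hypothetical, and the sketch does not cover the scaling or local scaling methods studied in the paper.

```python
def one_hot(records, attribute):
    """Replace one categorical attribute by one binary attribute per observed value."""
    values = sorted({r[attribute] for r in records})
    encoded = []
    for r in records:
        # Keep all other attributes unchanged.
        row = {k: v for k, v in r.items() if k != attribute}
        # Add one 0/1 indicator per category value.
        for v in values:
            row[f"{attribute}={v}"] = 1 if r[attribute] == v else 0
        encoded.append(row)
    return encoded


# Hypothetical credit applicants with one numerical and one categorical attribute.
applicants = [
    {"income": 2500, "housing": "rent"},
    {"income": 4100, "housing": "own"},
    {"income": 1800, "housing": "family"},
]
encoded = one_hot(applicants, "housing")
```

The single `housing` attribute becomes three binary inputs, so each record grows from two to four variables. With many categorical attributes and many values this growth is the drawback that scaling approaches, which map each category to one number, are meant to avoid.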



A Comparison of Traditional Segmentation Methods with Segmentation Based upon Artificial Neural Networks by Means of Conjoint Data from a Monte-Carlo-Simulation

Simulated data are needed to compare traditional segmentation methods with segmentation by neural networks, because only under these circumstances is the quality of reproduction between methods comparable. Therefore conjoint data with differently distributed errors are created by a Monte-Carlo-Simulation. The results of a segmentation by neural networks are compared with those of a segmentation by traditional methods in order to reveal whether the introduced neural networks are capable of a better segmentation at all and, if so, for which structure of the starting data the segmentation by neural networks appears to be particularly promising.

H. Gierl, S. Schwanenberg

Classification of Pricing Strategies in a Competitive Environment

The general structure of multiperiod price paths mainly depends on consumer characteristics, competitive reactions and restrictions which describe additional salient features of the underlying pricing situation. We present an approach which generalizes well-known price response functions in the area of reference price research and discuss price paths for important classes of competitive pricing strategies.

M. Löffler, W. Gaul

Predicting the Amount of Purchase by a Procedure Using Multidimensional Scaling: An Application to Scanner Data on Beer

A prediction procedure based on two multidimensional scaling methods, INDSCAL and PREFMAP, was applied to scanner data on a brand of beer and its competitive brands at a supermarket. The data, collected at the supermarket during the first 13 weeks after the introduction of the brand, were analyzed by the procedure to predict the amount of purchase of that brand and the competitive brands from weeks 14 to 39. The predicted market share of the brand in the category of beer from weeks 27 to 39 at the supermarket was close to the actual figure.

A. Okada, A. Miyauchi

