
## About this book

The 13th conference of the Gesellschaft für Klassifikation e. V. took place at the Universität Augsburg from April 10 to 12, 1989, with the local organization by the Lehrstuhl für Mathematische Methoden der Wirtschaftswissenschaften. The wide-ranging subject of the conference, Conceptual and Numerical Analysis of Data, was intended to indicate the variety of the concepts of data and information as well as the manifold methods of analysing and structuring. Based on the announcements of papers received, four sections were arranged:

1. Data Analysis and Classification: Basic Concepts and Methods
2. Applications in Library Sciences, Documentation and Information Sciences
3. Applications in Economics and Social Sciences
4. Applications in Natural Sciences and Computer Sciences

This classification does not separate strictly, but it shows that theoretical and applied researchers from very different disciplines were willing to present a paper. In 60 survey and special lectures the speakers reported on developments in theory and applications, encouraging the interdisciplinary dialogue of all participants. This volume contains 42 selected papers grouped according to the four sections. We now give a short insight into the presented papers. Several problems of concept analysis, cluster analysis, data analysis and multivariate statistics are considered in 18 papers of section 1. The geometric representation of a concept lattice is a collection of figures in the plane corresponding to the given concepts in such a way that the subconcept-superconcept-relation corresponds to the containment relation between the figures.

## Table of contents

### Robustness of Estimation Methods against Small Sample Sizes and Nonnormality in Confirmatory Factor Analysis Models

The goal of this paper is to compare three estimation methods in analysing a confirmatory factor analysis model using a Monte Carlo simulation design. Nonnormal data are generated and analysed to assess the effects of sample size and estimation methods on parameter bias, standard errors, and chi-square test values.

Ingo Balderjahn

### Probabilistic Aspects in Cluster Analysis

Cluster analysis provides methods and algorithms for partitioning a set of objects O = {1, …, n} (or data vectors x₁, …, xₙ ∈ ℝᵖ) into a suitable number of classes C₁, …, Cₘ ⊆ O such that these classes are homogeneous and each of them comprises only objects which are 'similar' in some sense. The historical evolution shows a surprising trend from an algorithmic, heuristic and applications oriented point of view (Sokal/Sneath 1963) to a more basic, theory oriented investigation of the structural, mathematical and statistical properties of clustering methods. Nowadays, the questions to be answered are of the type 'How many clusters are there?', 'Is there a classification structure?', 'Is the calculated classification adequate?', 'Which are the strongest clusters?' etc.

H. H. Bock

### Symbolic Cluster Analysis

The aim of this paper is to introduce the symbolic approach in data analysis and to show that it extends data analysis to more complex data which may be closer to the multidimensional reality. We introduce several kinds of symbolic objects ("events", "assertions", and also "hordes" and "synthesis" objects) which are defined by a logical conjunction of properties concerning the variables. They can, for instance, take several values on the same variable, and they are adapted to the case of missing and nonsense values. Background knowledge may be represented by hierarchical or pyramidal taxonomies. In clustering the problem remains to find inter-class structures such as partitions, hierarchies and pyramids on symbolic objects. Symbolic data analysis is conducted on several principles: accuracy of the representation, coherence between the kind of objects used at input and output, knowledge predominance for driving the algorithms, self-explanation of the results. We define order, union and intersection between symbolic objects and we conclude that they are organised according to an inheritance lattice. We study several properties and qualities of symbolic objects, of classes and of classifications of symbolic objects. Modal symbolic objects are then introduced. Finally, we present an algorithm to represent the clusters of a partition by modal assertions and obtain a locally optimal partition according to a given criterion.

E. Diday, M. Paula Brito

### Analysis of Nonnormal Longitudinal Data with Generalized Linear Models

This paper surveys models and methods for time series and panel data based on extensions of generalized linear models as a unifying tool. Though the linear normal case is covered within this frame, focus is on nonnormal, in particular discrete and categorical data. Models are classified according to their type of parameters, which may be constant, time-varying or cross-varying.

Ludwig Fahrmeir

### Cluster Methods for Qualitative Data

The cluster methods studied in this paper are intended to be applicable to the following types of data:

D1: Data represented by vectors with components consisting of finitely many linearly ordered alternative attributes.

D2: More generally, naturally preordered data with a length $$L(x, y)$$ for all data $$x \precsim y$$, defined by

$$L(x,y) := \begin{cases} 0 & \text{if } x \sim y \\ \sup \{ |K| : K \subsetneqq [x,y],\ K \text{ is a finite proper chain} \} & \text{else} \end{cases}$$

which may be infinite. Because of measure theoretic reasons (cf. Herden (1988)) only data $$x \precsim y$$ are allowed with neither some z such that z < x nor some z′ such that y < z′. If x, y ∈ {0, 1}ⁿ (n ∈ ℕ, n ≥ 2), then L(x, y) is the number of components for which x and y do not coincide.

D3: Still more generally, data with a dissimilarity coefficient which only has ordinal significance.

Gerhard Herden
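As an illustration of the binary special case mentioned in the abstract above, the following minimal Python sketch counts the components in which two 0/1 vectors differ. The function name `length_binary` is hypothetical, and the sketch does not check the preorder condition on x and y; it only shows the counting rule stated for {0, 1}ⁿ.

```python
def length_binary(x, y):
    """Number of components in which two 0/1 vectors differ.

    Illustrative only: the name and the input checks are assumptions,
    and the preorder condition on x and y is not verified here.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of components")
    if any(v not in (0, 1) for v in (*x, *y)):
        raise ValueError("components must be 0 or 1")
    return sum(xi != yi for xi, yi in zip(x, y))


x = (0, 0, 1, 0)
y = (0, 1, 1, 1)
print(length_binary(x, y))
```

In this example x and y differ in the second and fourth components.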

### On Testing for and against Inequality Restrictions

This paper deals with testing problems with the null and the alternative hypothesis both determined by inequalities. In the linear model, new finite sample results on the likelihood ratio (LR) statistic are presented which apply if hypotheses are given by homogeneous linear inequality constraints. A union-intersection (UI) and a convex hull statistic are also proposed, and it is noted that UI dominates LR in certain cases.

Heinz Kaufmann

### Evaluation of a Survey with Methods of Formal Concept Analysis

‘Formal Concept Analysis’ is a new method of data analysis which is based on a set-theoretic model for conceptual hierarchies. Formal Concept Analysis mathematizes the philosophical understanding of a concept as a unit of thoughts consisting of two parts: the extension and the intension. The extension of a concept consists of all objects which belong to the concept. The intension comprises all attributes which are valid for all these objects.

Wolfgang Kollewe
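The extension/intension pairing described in the abstract above can be sketched in a few lines of Python. This is a minimal sketch, not the author's procedure: the toy context, the attribute names and the function names `intent`, `extent` and `concept_from_objects` are all assumptions made for illustration.

```python
def intent(objects, context):
    """Attributes that are valid for all of the given objects."""
    shared = set().union(*context.values())  # start from every attribute
    for obj in objects:
        shared &= context[obj]
    return shared


def extent(attributes, context):
    """Objects that have all of the given attributes."""
    return {obj for obj, attrs in context.items() if attributes <= attrs}


def concept_from_objects(objects, context):
    """Smallest concept (extension, intension) containing the given objects."""
    b = intent(objects, context)
    a = extent(b, context)
    return a, b


# Hypothetical toy context: object -> set of attributes it has.
context = {
    "frog": {"needs_water", "lives_on_land", "can_move"},
    "fish": {"needs_water", "lives_in_water", "can_move"},
    "reed": {"needs_water", "lives_in_water", "lives_on_land"},
}

extension, intension = concept_from_objects({"frog", "fish"}, context)
print(sorted(extension), sorted(intension))
```

Here the concept generated by frog and fish has exactly those two objects as its extension and the two attributes they share as its intension.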

### A New Modification of the Rand Index for Comparing Partitions

The comparison of different partitions is an essential problem in evaluating numerical classifications. Measures of the similarity between partitions can be used not only to compare the results obtained from different cluster analyses, but also to compare the known structure of a data set with the result of a cluster analysis. A large number of such indices have been proposed. Reviews and comparisons are, for example, given in (1, 5, 7, 11, 12, 15). Especially (11) appears to reveal the fact that a modification of the Rand index (14) as described in (8) is particularly suitable for measuring the similarity between partitions.

Joachim Krauth
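For orientation, the plain (unmodified) Rand index can be computed as the share of object pairs on which two partitions agree. The sketch below shows only this basic index; it is not the modification discussed in the abstract above, and the function name `rand_index` and the label encoding are assumptions.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Plain Rand index: fraction of object pairs treated alike by both partitions.

    A pair counts as an agreement if both partitions put the two objects
    in the same class, or both put them in different classes.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same objects")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two objects are required")
    agreements = 0
    pairs = 0
    for i, j in combinations(range(n), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agreements += same_a == same_b
        pairs += 1
    return agreements / pairs


known = [0, 0, 1, 1]       # hypothetical known structure
clustered = [0, 0, 1, 2]   # hypothetical cluster analysis result
print(rand_index(known, clustered))
```

In this example five of the six object pairs are treated alike, since only the pair formed by the last two objects is split by one partition and kept together by the other. The index depends only on the partitions, not on how the classes are numbered.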

### Relations between Models for Three-Way Factor Analysis and Multidimensional Scaling

Unconstrained spatial models for analyzing three-way data may be classified by the type of geometric model (distance models, scalar product models), the number of reduced modes (7), the number of spaces and their interrelations (1), to state three of many organizing ways. Number of spaces and their relations refers to a classification of the models into trilinear and quadrilinear cases: (a) trilinear models: CANDECOMP and INDSCAL (2) rest on a trilinear model, as do PARAFAC (4) and SUMMAX (9); (b) quadrilinear models: Tucker’s Three-Mode Factor Analysis (10), (7) and SUMMAX in its extended form (9) use a quadrilinear model.

Sabine Krolak-Schwerdt, André Kohler

### Algorithms in Multidimensional Scaling

The ideas described in this paper are motivated by a basic problem in Multidimensional Scaling which has not yet been solved satisfactorily. Suppose that pairwise dissimilarities δij, 1 ≤ i, j ≤ n, are given as nonnegative real numbers between n objects. These originate from the special type of application and may be, for example in cartography, measurements of pairwise distances disturbed by additive random errors or, in psychology, individual assessments of the deviation in behaviour of n persons measured on a certain real scale. There are many other applications in a variety of fields; a structured overview may be obtained from de Leeuw & Heiser (1980, 1982).

Rudolf Mathar

### Extensions of Correspondence Analysis for the Statistical Exploration of Multidimensional Contingency Tables

In this paper four different generalizations of canonical correlation analysis to Q ≥ 3 sets of random variables are proposed, their application to indicator variables is studied, and the resulting extensions of correspondence analysis (CA) to Q-dimensional contingency tables are presented. The determination of canonical variates leads to generalized eigenvalue problems which can be solved using a globally convergent algorithm, based on Watson’s iteration.

Renate Meyer

### Numerical Classification of Biased Estimators

In recent years some alternatives to the least squares estimation function in the linear model have been introduced. Most of these functions can be represented as a linear or non-linear transformation of the least squares function. It is shown that these estimation functions can be derived as minimum norm estimation functions in the class of linear transforms of the least squares estimation function. In addition a numerical classification of these functions is given which yields an appropriate algorithm for computation.

Paul Müller

### An Agglomerative Algorithm of Overlapping Clustering

Cluster analysis is especially concerned with algorithms for computing partitions or hierarchies on given object sets. For several economic problems and other areas, however, the determination of partitions representing the classification structure of data is too specific and narrow. On the contrary, naturally occurring overlaps should not be suppressed. This raises the problem of how to limit the overlaps.

O. Opitz, R. Wiedemann

### Isotonic Regression — For a Monotone Index on a Hierarchy

Let X be a finite set, ℍ(X) the set of all hierarchies on X, and let D(X) and U(X) denote the sets of distances and ultrametrics respectively, cp. Degens (1985). If a distance d and a positive weight function w are given, we can state the following approximation problem: Look for the ultrametric $$d_u$$ which is a weighted least squares proximum to d, defined by

$${\left\| {d - {d_u}} \right\|_w} \leqslant {\left\| {d - {d'_u}} \right\|_w} \quad \forall {d'_u} \in U(X) \tag{1}$$

Krivanek (1986) proved the NP-completeness of problem (1). In analogy to the maximum parsimony problem we will call it the ‘large average linkage problem’: No knowledge of the structure (hierarchy) of the ultrametric is given.

Manfred Tittel, Paul O. Degens
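To make the objects in the approximation problem above concrete, the following minimal Python sketch checks the ultrametric inequality and evaluates a weighted squared error between a distance and a candidate ultrametric. It does not solve the (NP-complete) problem. The function names, the dictionary encoding of distances and the reading of the weighted norm as a weighted sum of squared differences are assumptions.

```python
from itertools import permutations


def pair(x, y):
    """Unordered pair used as a dictionary key (hypothetical encoding)."""
    return frozenset((x, y))


def is_ultrametric(d, points, tol=1e-12):
    """Check d(x, z) <= max(d(x, y), d(y, z)) for all distinct triples."""
    for x, y, z in permutations(points, 3):
        if d[pair(x, z)] > max(d[pair(x, y)], d[pair(y, z)]) + tol:
            return False
    return True


def weighted_sq_error(d, d_u, w):
    """Assumed squared weighted norm: sum over pairs of w * (d - d_u)^2."""
    return sum(w[p] * (d[p] - d_u[p]) ** 2 for p in d)


points = ["a", "b", "c"]
d = {pair("a", "b"): 1.0, pair("a", "c"): 3.0, pair("b", "c"): 2.0}
d_u = {pair("a", "b"): 1.0, pair("a", "c"): 2.5, pair("b", "c"): 2.5}
w = {p: 1.0 for p in d}  # unit weights

print(is_ultrametric(d, points), is_ultrametric(d_u, points))
print(weighted_sq_error(d, d_u, w))
```

In the example, d violates the ultrametric inequality while the candidate d_u satisfies it, so d_u is one admissible competitor in problem (1).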

### On the Interpretation of Median Relations

Within the broader field of the analysis of qualitative data, the corresponding classification sub-problem can be stated as follows. Let $$N = \{1, 2, \ldots, n\}$$ be the set of objects and $$M = \{k_1, k_2, \ldots, k_m\}$$ the set of variables; the complete information available on the problem is summarized in the data matrix $$D = (d_{ik})_{n,m}$$, where the entry $$d_{ik}$$ represents the property of object i ∈ N with respect to variable k ∈ M.

Ulrich Tüshaus

### Least Squares Approximation of Additive Trees

Several authors have proposed procedures for fitting an additive tree to a distance, and some of them make use of the least squares principle, e.g. Sattath & Tversky (1977), Carroll & Pruzansky (1980) and de Soete (1983). However, a rigorous mathematical treatment of the underlying least squares approximation problem is missing. We start to fill this gap by considering the approximation problem for a fixed tree structure. We present a characterization of the unique solution based on the comparison of averages. Some conclusions from this characterization give a first insight into the general approximation problem with unknown tree structure. Let X be a finite set of objects. Following Buneman (1971), a tree structure on X can be represented by a system of compatible splits: Each edge of the tree structure is represented by a bipartition $$\{A, A^c\}$$ of X, called a split, and two splits $$\{A, A^c\}$$ and $$\{B, B^c\}$$ are called compatible if $$A \subseteq B \vee A \subseteq B^c \vee A^c \subseteq B \vee A^c \subseteq B^c$$ holds (cp. fig. 1).

Werner Vach
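The compatibility condition for splits quoted in the abstract above translates directly into set operations. The following is a minimal sketch; the function name `compatible` and the example sets are assumptions.

```python
def compatible(a, b, ground_set):
    """Check whether the splits {A, A^c} and {B, B^c} of X are compatible.

    Two splits are compatible if one side of the first split is contained
    in one side of the second split.
    """
    x = frozenset(ground_set)
    a, b = frozenset(a), frozenset(b)
    if not (a <= x and b <= x):
        raise ValueError("A and B must be subsets of the ground set")
    a_c, b_c = x - a, x - b
    return a <= b or a <= b_c or a_c <= b or a_c <= b_c


X = {1, 2, 3, 4}
print(compatible({1, 2}, {1}, X))     # complement of {1,2} lies in complement of {1}
print(compatible({1, 2}, {1, 3}, X))  # all four intersections are non-empty
```

Equivalently, two splits are compatible exactly when at least one of the four intersections of a side of the first split with a side of the second split is empty.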

### Geometric Representation of Concept Lattices

A fundamental method of data analysis is to visualize structures of data by geometric configurations in the plane. Since concept lattices are basic structures for data contexts, it is desirable to study systematically the geometric representation for concept lattices. In this paper we want to understand the notion of geometric representation to be based on the containment of figures; more specifically, a geometric representation of a concept lattice shall be given by a collection of figures in the plane corresponding to the concepts in such a way that the subconcept-superconcept-relation corresponds to the containment relation between the figures. It is most common to represent objects by points in which case an object has a given attribute if and only if the figure corresponding to the concept generated by this attribute contains the point which represents the object. Although the line diagram has been proven to be a general and successful tool for representing concept lattices (cf. [Wi84]), there are situations where the geometric representation is also useful and even more appropriate.

Rudolf Wille

### On Properties of Additive Tree Algorithms

There is a growing interest in methods used for analyzing dissimilarity measurements which occur in many fields of biology, medicine, psychology etc. These data may be analyzed by using some graph theoretical structures. We concentrate on a special graph: the additive tree. Different methods have been proposed for constructing additive trees for given dissimilarity data, but there is less knowledge concerning properties of these methods. We present some suggestions for analyzing properties of additive tree constructing methods which allow a preliminary rating of proposed methods.

K. Wolf, P. O. Degens

### Knowledge Structures and Knowledge Representation: Psychological Models of Conceptual Order

The present paper is concerned with a critical examination of recent theoretical approaches to concepts. Specifically, it addresses the question of how to model natural concepts whose exemplars vary along a continuum of concept membership. In the first section, relevant positions in the theory of concepts and categorization are discussed. Subsequently a representational model accounting for people’s comprehension of simple and complex concepts is outlined.

Thomas Eckes

### Qualitative and Numerical Data in a Three-dimensional System

Data management in interdisciplinary research on decay mechanisms of mediaeval wall paintings

H. Glashoff

### New Concepts and Terms During the French Revolution. A Classification of the Neologisms According to their Origin

Often the change of a political regime is characterized by the creation of new institutions. Seldom was a society so deeply affected, even turned upside down as the French society during the Revolution of 1789. Indeed, the aim of the revolutionaries was to erase completely the Ancien Régime.

Henri Leclercq

### Einige sprachliche Probleme bei der Arbeit an einer Klassifikation und deren Registern (Some Linguistic Problems in Working on a Classification and Its Indexes)

The following reflections on classification and classification indexes are not intended as a presentation of solutions, but rather as a way of pointing out problems and thus as a request for discussion and exchange of experience.

Bernd Lorenz

### Data Analysis in Literary Studies

In the present paper, the term "literary studies" will cover all aspects of research, criticism or interpretation devoted to literary works. Data analysis in literary studies, it is argued, should combine classified and verbal indexing, seek thesaurus control and make use of electronic data processing. The paper will briefly comment on prevailing deficits of data analysis in this field, outline the theory of facet analysis for literary studies, and describe MLA as the current example pursuing that approach.

Heiner Schnelling

### Have Very Large Data Bases Methodological Relevance?

The Humanities, like all provinces of Academia, have been considerably influenced by the “micro computer revolution” of the last few decades. While, as other fields, the Humanities, too, have their fair share of persons who are doubtful about the praiseworthiness of these developments on more general grounds, a specific argument has been raised in the Humanities, which, in our opinion, is fairly unique to them.

Manfred Thaller

### Priority-Based Classification of Available Information — An Important Aspect of Future User Interfaces —

In line with placing more emphasis on the end user, an attempt at "total integration" is being made, as may be seen in the products subsumed under the term "office and engineering systems". On the one hand, this refers to hardware-related integration (cp. multi-functional workstation); on the other hand, it implies the consolidation of conventional and autonomous applications. Software-based integration now results in an arsenal of "tools" being offered, similar to a big comprehensive tool kit, which includes the related "material" (cp. files, data bases) to be processed. The integration of further functions, which applies primarily to office systems (cp. electronic archive), increasingly results in the quantitative problem of preparing a huge quantity of data and documents in a suitable way for subsequent retrieval. The selection of uniquely defined names then implies an overwhelming quantity of identifiers at the user interface, which is further increased by the use of synonym tables and descriptors. Therefore, the user is faced with numerous problems, such as identification conflicts due to the coincidence of names, long and nevertheless incomprehensible identifiers, and restrictions with respect to the selection of names.

Peter Zoller

### Computer-Based and Quantitative Methods in Market Research

A shift from an industrial to an information society is taking place in the leading industrial nations of the world. This shift is partly due to the development of new information and communication technology, and has significant effects on the field of marketing and particularly on market research.

Klaus Ambrosi

### Sample Techniques Used in Marketing Research

Green and Tull define marketing research as "the systematic and objective search for and analysis of information relevant to the identification and solution of any problem in the field of marketing" (Green, Tull 1970). The study of markets by sample surveys is a possible instrument for the search for primary information. The population for the sampling procedure is given as a collection of all elements defined over space and time which are relevant to the scope of a marketing problem.

Thomas Bausch

### Generalized Latent Class Analysis: A New Methodology for Market Structure Analysis

In this paper a generalization of LCA (Latent Class Analysis) is presented which allows a simultaneous classification and MDS (Multidimensional Scaling) of ordered categorical data. This approach is managerially useful in several ways, because additional background variables can be directly incorporated to identify latent class specific response probabilities. Furthermore, this technique allows a graphical representation of the classification results obtained. Essential features of the methodology will be demonstrated in the empirical part of this paper.

I. Böckenholt, W. Gaul

### The Use of the Logical Programming Language PROLOG as a Classification Method

A classification method is described here which is based on a classification of objects determined by experts.

Roberto Buzzi

### Classification and Selection of Consumer Purchase Behaviour Models

Consumer purchase behaviour analysis situations can be characterized by two extremes: On the one side, a multitude of models for analyzing data of consumer purchase behaviour is available; on the other side, potential users who are not so familiar with details of the corresponding methodology refrain from applying adequate models in day-to-day activities of market research and marketing. In this paper, we will give a classification of consumer purchase behaviour models and describe a selection procedure which, on the basis of the data provided and the market diagnostics desired, helps to find (an) appropriate model(s). An example based on household panel data provided by a German market research institute is included for illustration.

R. Decker, W. Gaul

### Towards a New Socioeconomic Classification Scheme for Farm Households Using a Cluster Analysis Technique

In developed economies farm households are increasingly involved in non-farm activities and consequently, total household income depends a great deal on non-farm sources. This changing pattern of the prevalence in farm and non-farm work and income provides a challenging task regarding the identification and separation of farm households with corresponding characteristics of labor supply and sources of livelihood. There are different approaches to tackle this problem, e.g. by establishing a scheme which attempts to classify farm households according to socioeconomic characteristics. Their main objectives are a thorough representation, monitoring and interpretation of structural changes within the agricultural sector. Additionally, these schemes facilitate the identification of farm households with social or economic/financial difficulties and they provide assistance in designing and implementing policy instruments. Most commonly, the categorization schemes are based on criteria like the main activity of the farm manager or the predominant source of livelihood. This results in a scheme which identifies full-time or part-time farm holdings.

Rolf H. Gebauer

### Identification of Multiple Criteria Decision Making

Assuming that m alternatives given by $$A = \{a_1, \ldots, a_m\}$$ are evaluated by n criteria, called $$Z = \{z_1, \ldots, z_n\}$$, we obtain the utility function

$$u: A \times Z \to \mathbb{R} \quad \text{with} \quad u(a_i, z_j) = u_{ij} \tag{1.1}$$

In the classical utility theory u is a linear function. If we are able to set weights $$w_1, \ldots, w_n$$ with

$$w_k \geqq 0, \quad \sum\limits_{k = 1}^n w_k = 1$$

representing the importance of $$z_1, \ldots, z_n$$, we get a complete preordering ≳ on pairs of alternatives:

$$a_i \begin{array}{c} > \\ (\sim) \end{array} a_j \Leftrightarrow \sum\limits_{k = 1}^n u_{ik} w_k \begin{array}{c} > \\ (\sim) \end{array} \sum\limits_{k = 1}^n u_{jk} w_k \tag{1.2}$$

Obviously each preordering depends on the weights of criteria. Changing the weights we may obtain other preorderings.

Magdalena Mißler-Behr, Otto Opitz
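The dependence of the preordering (1.2) on the weights can be illustrated with a small Python sketch. It assumes the utilities u_ik are stored row-wise per alternative; the function names `weighted_utilities` and `ranking` and the numbers in the example are hypothetical.

```python
import math


def weighted_utilities(u, w):
    """Weighted sums sum_k u[i][k] * w[k] for every alternative i."""
    if any(w_k < 0 for w_k in w) or not math.isclose(sum(w), 1.0):
        raise ValueError("weights must be nonnegative and sum to 1")
    if any(len(row) != len(w) for row in u):
        raise ValueError("each alternative needs one utility per criterion")
    return [sum(u_ik * w_k for u_ik, w_k in zip(row, w)) for row in u]


def ranking(u, w):
    """Indices of the alternatives, best first (ties keep their input order)."""
    scores = weighted_utilities(u, w)
    return sorted(range(len(u)), key=lambda i: -scores[i])


# Hypothetical utilities of three alternatives under two criteria.
u = [
    [1.0, 0.0],  # a_1: strong on z_1 only
    [0.0, 1.0],  # a_2: strong on z_2 only
    [0.6, 0.6],  # a_3: balanced
]

print(ranking(u, [0.5, 0.5]))  # balanced alternative comes first
print(ranking(u, [0.9, 0.1]))  # heavy weight on z_1 changes the order
```

With equal weights the balanced alternative is ranked first, whereas a heavy weight on the first criterion moves a_1 to the top, which is the effect the abstract describes.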

### Explorative Data Analysis and Macroeconomics

For empirical economic studies, econometric methods and time series analysis are most important. The stochastic methods of estimating and testing used are well investigated. However, these methods do not always work with the desired success. For example in Weichhardt’s study (1982) analysing economic forecasting errors, it is hardly possible to perceive the interesting structural changes of the development of economic variables. Looking for the reasons, we sometimes find that inadequate model specifications are used which do not reflect the lack of structural stability of the model.

R. Pauly, W. Hauke, O. Opitz

### A Microeconometric Study of Travelling Behaviour

The growing availability of leisure in industrialized countries has increased economists’ interest in those industries which meet the needs and demands created by leisure such as entertainment, sports and tourism. In particular travelling has become an economic factor and travelling abroad considerably affects balances of payments, at least for European countries. It is therefore of interest to know why people spend their holidays abroad and what explanatory variables direct the demand for foreign tourism. There have been a number of empirical studies related to this question. However, they all used aggregated data. See, for example, O’Hagan and Harrison (1984) and Smeral (1985). These authors used demand systems (Almost Ideal Demand System and modifications) to model the demand for foreign travel. They also pointed out that most consumers will most likely decide in two or even three steps: At first they decide about travelling at all, then they decide about staying in their country or going abroad, and in a third step they decide about the (foreign) destination. However, no attempt has been made to test this assumption.

Gerd Ronning

### Inference Techniques in Decision Support Systems — Comparison and Example from Data Analysis

A decision support system (DSS) is a knowledge-based computing system designed to give expert advice in solving problems of some fairly complex knowledge-rich domain.

### The Classification of Organisms — The Hurdle of Homology —

Homology is one of the central concepts in biological systematics, irrespective of the "systematic schools", be it phenetic, cladistic or evolutionary (for a discussion of these terms see, for instance, Felsenstein, 1983). That hypotheses on phylogenetic relations should always be based on homologous characters (e.g. Rieppel, 1980) is almost a truism.

Walter Erdelen

### Limits in the Reconstruction of Phylogenetic Trees Exemplified with 5S rRNA Sequences

We analyzed all archaebacterial and eukaryotic 5S rRNA sequences using the method of statistical geometry in sequence space. Special emphasis was put on the analysis of the evolutionary relationships of eukaryotes. We show that the evolutionary tree of the major groups of eukaryotic 5S rRNA sequences exhibits a more or less bush-like structure compared with the archaebacterial sequences and certain classes of well defined biological eukaryotic groups.

Arndt von Haeseler

### The Concept of Information in Computer Science

As a discipline of information processing, computer science should be based upon some special concept of information. But neither the fundamental theories nor the main methods and applications refer to such a concept, cf. Hotz [1] or Zemanek [2]. We are going to point out that information, although it actually appears only as a hidden concept, is nevertheless a fundamental concept and that its structural properties allow a precise definition.

Ingbert Kupka

### Exploring Homologous tRNA Sequence Data: Positional Mutation Rates and Genetic Distance

The literature provides different proposals for mathematical or biological models for the evolutionary interpretation of sequence data. For example, Felsenstein (1983) and Weir (1985, 1988a) reviewed the state of the art. Many papers propose models with parameters for different kinds of nucleotide substitutions (e.g. Kimura, 1980; Barry and Hartigan, 1987; Cavender and Felsenstein, 1987; Rempe, 1988). Most models need some assumption of independent and identical distributions for each site. Moreover, they often propose to estimate a lot of parameters together with the unknown phylogeny. Some work was done concerning the sequence itself; e.g. Avery (1987) analysed aspects of intron data and Haselman, Chappelear and Fox (1988) discussed secondary and tertiary interactions of tRNA.

Berthold Lausen

### Nucleotide Sequence Analysis of Conserved Genes from Bacteria

A deductive approach has to be chosen to resolve the phylogenetic relationships of prokaryotes since they lack the ontogeny and fossil records of higher organisms. It should be possible by comparative analyses of homologous macromolecules that are universally distributed among organisms, show a high degree of functional constancy and are sufficiently conserved to span the full evolutionary spectrum.

Wolfgang Ludwig, Karl Heinz Schleifer

### An Analysis of Throughput Measurements on a Computer Network

A Nixdorf broadband network has existed at the University of Siegen since 1986. This network uses a homogeneous infrastructure to serve all the university’s communication requirements. On a broadband network several logical nets may be implemented simultaneously using different transfer rates.

R. Ostermann, L. Hofmann, J. W. Münch

### AC-Characteristics and the Pass/Fail Performance of a Memory Chip

The continuing rapid development in memory device technology results in denser and more complex storage structures. At present, Dynamic Random Access Memories (DRAM) are able to store about 1,000,000 bits. Speed quantities like access, hold and setup times are AC-parameters (alternating current). They depend on internal processes and indicate the efficiency of a chip. The measured values of these parameters show the timing conditions of a chip which is working correctly. If all AC-parameters of a chip meet the specification, the chip passes the test and can be classified into a group which indicates its speed. With each chip generation the number of AC-parameters may increase, e.g. the timing conditions of the Siemens 1 M DRAM are characterized by 48 AC-parameters. Analysing the timing conditions becomes more and more difficult since testing produces a vast amount of data. By the selection of more suitable parameters, the analysis of data is simplified and the costs of testing are reduced. Our results are shown by the example of the read cycle.

Beate Roos, Paul O. Degens

### Backmatter
