
About this book

This volume contains revised versions of selected papers presented during the 23rd Annual Conference of the German Classification Society GfKl (Gesellschaft für Klassifikation). The conference took place at the University of Bielefeld (Germany) in March 1999 under the title "Classification and Information Processing at the Turn of the Millennium". Researchers and practitioners - interested in data analysis, classification, and information processing in the broad sense, including computer science, multimedia, WWW, knowledge discovery, and data mining as well as special application areas such as (in alphabetical order) biology, finance, genome analysis, marketing, medicine, public health, and text analysis - had the opportunity to discuss recent developments and to establish cross-disciplinary cooperation in their fields of interest. Additionally, software and book presentations as well as several tutorial courses were organized. The scientific program of the conference included 18 plenary or semiplenary lectures and more than 100 presentations in special sections. The peer-reviewed papers are presented in 5 chapters as follows:

• Data Analysis and Classification
• Computer Science, Computational Statistics, and Data Mining
• Management Science, Marketing, and Finance
• Biology, Genome Analysis, and Medicine
• Text Analysis and Information Retrieval

As an unambiguous assignment of results to single chapters is sometimes difficult, papers are grouped in a way that the editors found appropriate.

Table of Contents

Frontmatter

Data Analysis and Classification

Frontmatter

New Developments in Latent Variable Models

Three new developments in latent variable models are discussed. The first development is the extension of latent variable models with metrical observed variables and indicators to limited dependent variables. The second development concerns the inclusion of models that are non-linear in the latent variables. Finally, finite mixtures of conditional mean- and covariance structure models are considered.

G. Arminger, J. Wittenberg, P. Stein

Regression-Type Models for Kohonen’s Self-Organizing Networks

Kohonen’s self-organizing maps (SOMs) visualize high-dimensional data x1, x2, … ∈ Rp essentially by combining two steps: (1) clustering the data points into classes, and (2) displaying ‘neighbouring’ classes by ‘neighbouring’ vertices in a given low-dimensional lattice (map). This paper generalizes the classical approach, where each class is represented by a typical point in Rp (class center), to the case where (a) the data (x1, y1), (x2, y2), … ∈ Rp+q are partitioned into an explanatory part xk ∈ Rp and a response part yk ∈ Rq and (b) each class is represented by a regression hyperplane. The class-specific linear regression functions can be combined to a global one, and classes with similar response functions are visualized as neighbours in the map. In analogy to classical cluster analysis, we define an optimum regression-type SOM by a (discrete or continuous) generalized clustering criterion (K-criterion) and propose SOM algorithms derived from k-means, MacQueen’s sequential approach, or from stochastic approximation. The resulting SOMs can be applied, e.g., in segmented credit scoring or for response surface approximation.

H.-H. Bock
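
To make the clusterwise-regression idea above concrete, the following is a minimal sketch: a plain k-means-style alternation between assigning points to class-specific regression hyperplanes and refitting those hyperplanes by least squares. It deliberately ignores the SOM lattice and neighbourhood structure, so it is not the authors' algorithm, only an illustration of the underlying K-criterion idea; all names and the toy data are assumptions.

```python
import numpy as np

def clusterwise_regression(X, y, n_classes=3, n_iter=50, seed=0):
    """Alternate between (1) assigning each point to the class whose regression
    hyperplane fits it best and (2) refitting the class-specific least-squares
    hyperplanes -- a crude stand-in for optimizing a clustering criterion."""
    rng = np.random.default_rng(seed)
    n = len(y)
    labels = rng.integers(n_classes, size=n)
    Xd = np.column_stack([np.ones(n), X])            # design matrix with intercept
    betas = np.zeros((n_classes, Xd.shape[1]))
    for _ in range(n_iter):
        for k in range(n_classes):                   # step 2: refit hyperplanes
            mask = labels == k
            if mask.sum() > Xd.shape[1]:
                betas[k], *_ = np.linalg.lstsq(Xd[mask], y[mask], rcond=None)
        resid = (y[:, None] - Xd @ betas.T) ** 2     # squared residual per class
        new_labels = resid.argmin(axis=1)            # step 1: reassign points
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, betas

# toy usage: two noisy linear segments
X = np.linspace(0, 1, 200).reshape(-1, 1)
y = np.where(X[:, 0] < 0.5, 2 * X[:, 0], 1.5 - X[:, 0]) + 0.05 * np.random.randn(200)
labels, betas = clusterwise_regression(X, y, n_classes=2)
```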

Relational Data Analysis, Discrimination and Classification

In relational data analysis, we focus on the representation of the objects (individuals) in a multivariate data matrix as points, and the variables as vectors in a common low-dimensional space. The highlight will be on discriminant analysis, and will include nominal and ordinal variables. Goodness-of-fit (discrimination) is given by the Between-to-Total sum of squares. Results are also compared with respect to classification, where the allocation is obtained by the use of Euclidean distances in low-dimensional space, which can be related to Bayes’ rule, giving a posteriori probabilities.

J. J. Meulman

Data Types and Information: Beyond the Current Practice of Data Analysis

The role of “scaling” is to upgrade a given measurement, for instance, from nominal to interval, thereby affording added accuracy of derived measurement. The higher-level measurement, however, imposes greater restrictions on what one can do with it. Thus, the present study discusses the opposite process, that is, downgrading (or desensitizing) high-level measurement, say from interval to nominal, and letting a scaling procedure upgrade it again, so that in the process of upgrading one may capture not only linear relations but also nonlinear relations from less restricted measurement. How much more information is captured by this process is not certain, but one attempt is presented together with two numerical examples of the process.

S. Nishisato

Two-Mode Three-Way Asymmetric Multidimensional Scaling with Constraints on Asymmetry

A model and an accompanying algorithm for two-mode three-way asymmetric multidimensional scaling are presented. The present model imposes a constraint on asymmetry compared with the model of Okada, Imaizumi (1997): each source has a different magnitude of asymmetry, but all sources are constrained so that the relative importance of the asymmetry along dimensions is constant for all sources. The accompanying nonmetric algorithm was developed from similar work in Okada, Imaizumi (1997). An application to interpersonal attraction data among university students is presented.

A. Okada, T. Imaizumi

Regularized Discriminant Analysis with Optimally Scaled Data

Linear discriminant analysis is a well known procedure of discrimination which is equivalent to canonical correlation analysis where the linear predictors define one set of variables, and a set of dummy variables representing class membership defines the other set. Here we propose a new way of discrimination of observations explained by a set of variables with mixed scaling level. We use a nonparametric discriminant procedure based on optimally scaling the data to estimate the distribution of the object scores. Next, we propose a multivariate kernel distribution of the variables with a variety of window widths which controls the degree of smoothness of the estimate. We choose the window width of the kernel distribution and the dimension of object scores matrix by cross-validation (leave-one-out).

H. Bensmail, J. J. Meulman

Recurrence Properties in Two-Mode Hierarchical Clustering

We study the agglomerative method proposed by Eckes, Orlik (1993) for constructing a two-mode hierarchical clustering, and we propose a Lance, Williams (1966) type recurrence formula for the criterion used. This formula guarantees that no inversions will occur during the agglomerative hierarchical clustering process.

W. Castillo, J. Trejos
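
For reference, the classical Lance, Williams (1966) recurrence to which this abstract alludes updates the dissimilarity between a newly merged cluster i ∪ j and any other cluster k from the pre-merge dissimilarities alone; the coefficients below are the generic ones, not the two-mode coefficients derived in the paper.

```latex
d(k,\, i \cup j) \;=\; \alpha_i\, d(k,i) \;+\; \alpha_j\, d(k,j)
  \;+\; \beta\, d(i,j) \;+\; \gamma\, \lvert d(k,i) - d(k,j) \rvert .
```

Particular choices of the coefficients recover single, complete, average and Ward linkage; the paper's contribution is a recurrence of this form for the two-mode criterion of Eckes, Orlik (1993) that rules out inversions.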

On Pyramidal Classification in the Delimitation of Climatic Regions in Catalonia

In this paper we define a new probabilistic model (which we call p-Weibull), considering the Weibull distribution function with an additional parameter, related to the probability of a wet day. We show that from this model and a classification method (for instance the pyramidal one) several climatic regions can be delimited. Each region can be characterized in terms of the same p-Weibull model in order to simulate daily precipitation in each region and any period of time. As an example we get several climatic regions in Catalonia.

J. Conde, M. A. Colomer, C. Capdevila, A. Gil

On the Modal Understanding of Triadic Contexts

A triadic context consists of sets of formal objects, formal attributes, and formal conditions together with the formalization of the ternary relation saying when an object has an attribute under a certain condition. The modal understanding of necessity and possibility occurs naturally in triadic contexts, especially when the dyadic relationships between formal objects and attributes are considered: a formal object g has “necessarily” a formal attribute m if g has m under all formal conditions of the context; g has “possibly” m if g has m under some formal condition. Such necessity and possibility relations give rise to dyadic contexts allowing a modal analysis of triadic data contexts. How this analysis can be approached is shown by examples. Theoretically, we point out how a “modal Attribute Logic” may be developed.

F. Dau, R. Wille

Clustering and Scaling: Grouping Variables in Burt Matrices

Although clustering and scaling are different techniques, it has been shown that in the case of a two-way cross-table, cluster analysis and correspondence analysis provide similar solutions (Greenacre 1988a, 1993; Lebart 1994). In this paper we extend this approach to the multiple case. Instead of analyzing two variables we use a set of variables. When scaling the data we apply joint correspondence analysis which is the generalization of simple correspondence analysis (Greenacre 1988b, 1993). The suggested clustering process is hierarchical whereby the similarity matrix consists of standardized χ2-values computed from the subtables of the Burt matrix. As an example we use 25 variables of cultural competences taken from the German General Social Survey 1986 (ALLBUS).

S. Gabler, J. Blasius

Standardisation of Data Set under Different Measurement Scales

Standardisation of multivariate observations is an important stage that precedes the determination of distances (dissimilarities) in clustering and multidimensional scaling. Different studies (e.g. Milligan, Cooper (1988)) show the effect of standardisation on the cluster structure in various data configurations. In the paper a survey of standardisation formulas is given. Then we consider the problem of different scales of measurement and their impact on the selection of the standardisation formula and the selection of the appropriate dissimilarity (or similarity) measure.

K. Jajuga, M. Walesiak

Neural Networks for Two-Group Classification Problems with Monotonicity Hints

Neural networks are competitive tools for classification problems. In this context, a hint is any piece of prior side information about the classification. Common examples are monotonicity hints. The present paper focuses on learning vector quantization neural networks and gives a simple yet effective technique which guarantees that the predictions of the network obey the required monotonicity properties in a strict fashion. The method is based on a proper modification of the Euclidean distance between input and codebook vectors.

P. Lory, D. Gietl
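
As background, the sketch below shows the plain learning vector quantization (LVQ1) scheme that the paper builds on: codebook vectors are pulled towards same-class inputs and pushed away from different-class inputs based on the Euclidean winner. The paper's specific distance modification that enforces the monotonicity hints is not reproduced; everything here is standard LVQ and the names are assumptions.

```python
import numpy as np

def lvq1(X, y, n_codebooks_per_class=2, lr=0.05, epochs=30, seed=0):
    """Plain LVQ1: move the winning codebook vector towards a training point
    of the same class and away from a point of a different class."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    cb, cb_y = [], []
    for c in classes:                 # initialise codebooks from random class members
        idx = rng.choice(np.flatnonzero(y == c), n_codebooks_per_class, replace=False)
        cb.append(X[idx])
        cb_y.append(np.full(n_codebooks_per_class, c))
    cb, cb_y = np.vstack(cb).astype(float), np.concatenate(cb_y)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            d = np.linalg.norm(cb - X[i], axis=1)    # Euclidean distance to codebooks
            w = d.argmin()                           # winning codebook vector
            sign = 1.0 if cb_y[w] == y[i] else -1.0
            cb[w] += sign * lr * (X[i] - cb[w])
    return cb, cb_y

def predict(cb, cb_y, X):
    d = np.linalg.norm(X[:, None, :] - cb[None, :, :], axis=2)
    return cb_y[d.argmin(axis=1)]
```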

A Bridge from the Nearest Neighbour to the Fixed Bandwidth in Nonparametric Functional Estimation

Nonparametric functional estimation using kernel estimators is an evolving area of research in applied statistics with a direct connection to biometrical applications. For example, estimation of the hazard rate quantifies the instantaneous risk of failure or death that can be used in the context of medical survival analysis. We will deal with the crucial problem in practice, the bandwidth selection. A generalized bandwidth parameter incorporates as special cases the nearest neighbour and the fixed bandwidth. Implementation of this parameter in a kernel estimator points to solutions of the bandwidth selection problem. Consistency and rate of convergence of this generalized estimator are shown and support its use in practice.

R. Pflüger, O. Gefeller
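
The two special cases that the generalized bandwidth parameter of this contribution bridges can be written down directly. Below is a small illustrative sketch of both endpoints for a one-dimensional density estimate: a fixed bandwidth used everywhere versus a local bandwidth given by the distance to the k-th nearest observation. This is generic kernel-estimation code under assumed notation, not the authors' generalized estimator.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde_fixed(x_grid, data, h):
    """Fixed-bandwidth kernel density estimate: the same h at every point."""
    u = (x_grid[:, None] - data[None, :]) / h
    return gaussian_kernel(u).mean(axis=1) / h

def kde_knn(x_grid, data, k):
    """Nearest-neighbour bandwidth: at each grid point use the distance to
    the k-th nearest observation as the local bandwidth."""
    d = np.abs(x_grid[:, None] - data[None, :])
    h_local = np.sort(d, axis=1)[:, k - 1]          # k-th nearest-neighbour distance
    u = d / h_local[:, None]
    return gaussian_kernel(u).mean(axis=1) / h_local

data = np.random.randn(300)
grid = np.linspace(-4, 4, 200)
f_fixed, f_knn = kde_fixed(grid, data, h=0.4), kde_knn(grid, data, k=30)
```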

Clustering Techniques for the Detection of Business Cycles

In this paper business cycles are considered as a multivariate phenomenon and not as a univariate one determined e.g. by the GNP. The subject is to look for the number of phases of a business cycle, which can be motivated by the number of clusters in a given dataset of macro-economic variables. Different approaches to distances in the data are tried in a fuzzy cluster analysis to pursue this goal.

W. Theis, C. Weihs

Simulated Annealing Optimization for Two-mode Partitioning

The Alternating Exchanges method proposed by Gaul, Schader (1996) for two-mode partitioning by minimizing an additive criterion is based on a local-search strategy, and hence it may only find local optima of the criterion. In order to overcome these local minima we propose to use the Metropolis rule for acceptance of possible transfers and a simulated annealing-type algorithm for two-mode non-overlapping partitioning. We show the performance of the algorithm by some examples; the results obtained are encouraging. We also prove the monotonicity of the method of Baier et al. (1997) for two-mode partitioning based on a k-means scheme.

J. Trejos, W. Castillo
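
The Metropolis acceptance rule mentioned above can be illustrated with a generic simulated-annealing skeleton for partitioning, where `cost` evaluates the partitioning criterion and `random_transfer` returns a modified copy of the current partition (e.g. with one object moved to another class). This is a hedged sketch of the general technique; the actual two-mode criterion and transfer moves of the paper are not reproduced.

```python
import math
import random

def simulated_annealing(init_partition, cost, random_transfer,
                        t0=1.0, cooling=0.95, iters_per_temp=100, t_min=1e-3):
    """Generic simulated annealing: always accept improving transfers and accept
    a worsening transfer with probability exp(-delta / T) (Metropolis rule)."""
    current, best = init_partition, init_partition
    c_cur = c_best = cost(current)
    t = t0
    while t > t_min:
        for _ in range(iters_per_temp):
            candidate = random_transfer(current)     # propose a transfer move
            delta = cost(candidate) - c_cur
            if delta <= 0 or random.random() < math.exp(-delta / t):
                current, c_cur = candidate, c_cur + delta
                if c_cur < c_best:
                    best, c_best = current, c_cur
        t *= cooling                                 # geometric cooling schedule
    return best, c_best
```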

Computer Science, Computational Statistics, and Data Mining

Frontmatter

Competitiveness of Evolutionary Algorithms

The development of efficient hardware for computational purposes has led to a new class of optimization algorithms. These methods are oriented towards the mechanisms of evolutionary processes and are therefore called evolutionary algorithms. The main idea presented here is the transfer of selection, recombination and mutation techniques within biological evolution to optimization strategies. The basic forms of genetic algorithms and evolutionary strategies are presented in this paper as well. Also, results of empirical studies will demonstrate the competitiveness of these algorithms.

W. Hauke

Intelligent Data Analysis via Linguistic Data Summaries: A Fuzzy Logic Approach

We propose the use of linguistic data summaries, exemplified by the proposition “most employees are young and well paid” (with a degree of validity), to grasp relevant relations in data. We employ Yager’s (1982, 1991) concept of a linguistic data summary based on a calculus of linguistically quantified propositions. We advocate an interactive approach in that the user formulates the class of summaries of interest. We show how this is related to fuzzy database queries, and how to employ here Kacprzyk, Zadrożny’s (1994a-1997b) fuzzy querying add-on, FQUERY for Access. We present an implementation for a sales database of a computer retailer.

J. Kacprzyk
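
The calculus of linguistically quantified propositions underlying such summaries can be sketched in a few lines: compute the compatibility of each record with the summarizer, average, and pass the proportion through the fuzzy quantifier. The membership functions and data below are purely illustrative assumptions, not taken from the paper or from FQUERY for Access.

```python
import numpy as np

def mu_young(age):               # illustrative fuzzy set "young"
    return np.clip((40 - age) / 15, 0, 1)

def mu_well_paid(salary):        # illustrative fuzzy set "well paid"
    return np.clip((salary - 2500) / 1500, 0, 1)

def mu_most(r):                  # fuzzy quantifier "most" on a proportion r in [0, 1]
    return np.clip(2 * r - 0.6, 0, 1)

def truth_of_summary(ages, salaries):
    """Degree of validity of 'most employees are young AND well paid':
    combine with min (t-norm), average over the records, apply the quantifier."""
    compat = np.minimum(mu_young(ages), mu_well_paid(salaries))
    return float(mu_most(compat.mean()))

ages = np.array([24, 31, 28, 45, 52, 26])
salaries = np.array([3200, 2800, 4100, 2300, 2600, 3900])
print(truth_of_summary(ages, salaries))   # degree of validity in [0, 1]
```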

Algorithm Selection Support for Classification

Providing user support for the application of Data Mining algorithms in the field of Knowledge Discovery in Databases (KDD) is an important issue. Based on ideas from the fields of statistics, machine learning and knowledge engineering we provided a general framework for defining user support. The general framework contains a combined top-down and bottom-up strategy to tackle this problem. In the current paper we describe the Algorithm Selection Tool (AST) that is one component in our framework. AST is designed to support algorithm selection in the knowledge discovery process with a case-based reasoning approach. We discuss the architecture of AST and explain the basic components. We present the evaluation of our approach in a systematic analysis of the case retrieval behaviour and thus of the selection support offered by our system.

G. Lindner, R. Studer

Society of Knowledge — Interdisciplinary Perspectives of Information Processing in Virtual Networks

Computer-assisted informational and communicational networks are the driving forces in the evolution towards a knowledge society. They are creating virtual worlds of knowledge storing, business planning, arts and entertainment. How are they changing research and teaching in the natural sciences and medicine, economics, the social sciences and humanities? Can knowledge management learn from evolutionary strategies? Knowledge management in complex networks needs the assistance of autonomous, mobile, and intelligent software agents. The ethical goal of knowledge management in virtual net worlds is a humanistic service that improves the living conditions in the knowledge society.

K. Mainzer

What is Computational Statistics?

Slightly changing Yates’ wording one might state: “today computers are taken for granted.” So the answer to the question posed seems to be obvious: computational statistics is what statisticians do with the computer. Based on the experience as editor of a journal on computational statistics, this paper will list and classify what authors reveal in their contributions. At first look this list is impressive: bootstrap, simulation, statistical tables, to mention just a few. Many studies clearly could not have been done without a computer. Unfortunately, when taking a closer look one comes to the conclusion that old problems are still unsolved and new problems have emerged. To mention just some: numerical problems, random number generation, experimental design, how to communicate statistical software solutions, reproducible statistical data analysis. Due to lack of space not all of these topics could be dealt with. Hopefully, some necessary prerequisites everybody should know before considering himself a computational statistician are listed.

P. Naeve

A Development Framework for Nature Analogic Heuristics

This paper classifies important nature analogic heuristics, such as Genetic Algorithms, Evolutionary Strategies, Simulated Annealing, and Tabu Search. Their central elements are compared by means of the descriptive A-R-O Model. It is shown how components of the procedures can be successfully interchanged, so that hybrid heuristics and their high potentials become available for future use. A development framework, called the Seven Steps of Development, that allows structured design of these methods and their hybrids is offered.

M. Feldmann

Systematizing and Evaluating Data Mining Methods

Data mining is presented as a method for data analysis that possesses the ability to discover new and so far hidden knowledge in existing databases. A brief presentation and categorization of data mining methods will be followed by the compilation of a catalogue of criteria. This catalogue enables the classification of the methods. The different algorithms used to implement data mining methods are not regarded. In conclusion, the analysis process of data mining methods is systematized by way of example.

C. Heidsieck, W. Uhr

On Textures: A Sketch of a Texture-Based Image Segmentation Approach

In the context of this work we present a new approach to domain-independent texture segmentation. The texture segmentation is implemented by a combination of region- and edge-oriented segmentation methods. The region-oriented segmentation procedure is based on texture energy measures (Laws (1979)). For each pixel an energy value is calculated, which serves as the homogeneity criterion. The edge-oriented method measures texture edges by energy distributions in certain directions (Yu et al. (1991)). The texture edges then serve as the discontinuity criterion for the blob coloring algorithm (Ballard, Brown (1982)) adapted to grey values. The texture region image and the texture edges image are combined into one texture-based segmented image.

T. Hermes, A. Miene, P. Kreyenhop

Automatic Texture Classification by Visual Properties

In this paper we provide an approach for automatic texture description based on visual texture properties. A set of texture features which directly coincide with the human visual perception of textures could be useful for e.g. domain independent texture classification in image retrieval systems like IRIS (Hermes et al. (1995)). Therefore, this set of items concerning the visual properties was tested on natural textures as well as on synthetic textures. Various statistical texture features were evaluated and we found a direct mapping between certain statistical texture features and the items of the set of visual texture properties. Therefore, we provide a way of domain independent classification of textures.

T. Hermes, A. Miene, O. Moehrke

Java Servlets versus CGI — Implications for Remote Data Analysis

The Common Gateway Interface (CGI) was the first attempt to enable the creation of dynamic HTML pages which represent a very suitable concept to meet the requirements of web-based applications for remote data analysis (RDA). CGI scripts are still popular, but by now there are new approaches which should be able to solve the main CGI problems. In this paper, we present the most promising one: Java Servlets. We will discuss the advantages and drawbacks of Java Servlets compared to CGI scripts. Moreover, we will do some performance measurements on the basis of simple classification problems and introduce the key functions of the Java Servlet API.

S. Kuhlins, A. Korthaus

Generating Automatically Local Feature Groups of Complex and Deformed Objects

Automatic generation of significant feature groups from a given set of basic features, i.e. creation of abstract characteristics, is a fundamental problem to be solved in pattern recognition. With the presented method it is possible to automatically generate local and important feature groups of contours of 2D objects, resulting in reasonable class membership. The method was designed for the development of general classifiers, relying on automatically generated, local and important feature groups derived from all kinds of basic features, for example contour points, colours, and infra-red spectra.

D. Pechtel, K.-D. Kuhnert

Decision Tree Construction by Association Rules

Association rules are used to build decision trees with the help of so-called multivariate splits that can take some (or all) attributes for the description of underlying (data) objects into consideration. It is shown that the new DTAR (Decision Tree Construction by Association Rules) method produces small trees with better interpretability than other multivariate decision tree construction techniques. Data sets well known to the data mining and machine learning community are analyzed to demonstrate the abilities of the new method.

F. Säuberlich, W. Gaul

Learning Complex Similarity Measures

Case-based reasoning (CBR) is a knowledge processing concept that has shown success in various problem classes. One key challenge in CBR is the construction of a measure that adequately models the similarity between two cases. Typically, a similarity measure consists of a set of feature-specific distance functions coupled with an underlying feature weighting (importance) scheme. While the definition of the distance functions is often straightforward, the estimation of the weighting scheme requires a deep understanding of the domain and the underlying connections. The paper in hand addresses this problem. It shows how discrimination knowledge, which is coded within an already solved classification problem, can be transformed into a similarity measure. Moreover, it demonstrates our approach on the problem of diagnosing heart diseases.

B. Stein, O. Niggemann, U. Husemeyer

Graphical Programming in Statistics: The XGPL Prototype

Modern statistical software systems usually consist of a menu system supported by a textual programming language. Presently, another component is being developed within the host system Xtremes, namely, a graphical programming language called XGPL. Within the graph editor of XGPL, one may program a diagram of nodes and edges (the dependence graph) where (a) the nodes represent the input of data and parameters, the data processing or statistical procedures and the output of results, (b) the edges connect the input and output ports of the nodes and pass the data, specified parameters and results to dependent nodes.

M. Thomas, R.-D. Reiss

Management Science, Marketing, and Finance

Frontmatter

Measuring Brand Loyalty on the Individual Level: a Comparative Study

This paper discusses several alternatives to measure loyalty for inexpensive, frequently bought, nondurable consumer goods. All these measures look at loyalty from a behavioristic perspective and are based upon individual purchase records as provided by e.g. household panel data. In the literature many different proposals to address this issue can be found which focus on various aspects of loyalty. We take a look at them from a data mining perspective, i.e., which measure(s) can routinely be used to infer a customer’s degree of loyalty. In particular eleven different, easy to calculate quantities are compared with each other. Using empirical data over eleven different markets it turns out that an entropy based measure is consistently highly correlated with the other indices and thus capable of grasping various aspects of loyalty.

U. Wagner, C. Boyer
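
One common way to operationalize "an entropy based measure" of loyalty is the normalized Shannon entropy of a household's brand-purchase shares, reversed so that 1 means perfect loyalty. The sketch below is an assumed operationalization for illustration only, not necessarily the exact index compared in the paper.

```python
import numpy as np

def entropy_loyalty(purchase_counts):
    """Loyalty score in [0, 1]: 1 = all purchases of one brand,
    0 = purchases spread evenly over all brands bought in the category."""
    counts = np.asarray(purchase_counts, dtype=float)
    counts = counts[counts > 0]
    p = counts / counts.sum()
    if len(p) == 1:
        return 1.0
    h = -(p * np.log(p)).sum()               # Shannon entropy of purchase shares
    return 1.0 - h / np.log(len(p))          # normalize by the maximum entropy

print(entropy_loyalty([10, 0, 0]))   # 1.0: fully loyal household
print(entropy_loyalty([4, 3, 3]))    # close to 0: switches freely between brands
```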

Facility Location Planning with Qualitative Location Factors

In the literature concerning facility location planning quantitative as well as qualitative methods can be found. Since empirical studies mention the importance of qualitative location factors beside quantitative criteria, this paper deals with corresponding planning methods in particular. The methods known from the literature are to criticize, because the given quantitative and qualitative information is not represented adequately. To face this criticism, approaches resulting by the application of multivariate methods of data analysis are presented.

U. Bankhofer

Testing the Multinomial Logit Model

A general test to check the adequateness of a regression model against nonparametric alternatives is presented. This test procedure is then applied to the well known multinomial logit model and its power is considered in a simulation study. Finally, the multinomial logit model is tested for a real scanner panel data set. The null hypothesis that the multinomial logit model is an adequate model for the real data set is rejected.

K. Bartels, Y. Boztug, M. Müller
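
For reference, the multinomial logit choice probabilities whose adequacy is tested here take the familiar form (standard notation, not specific to the paper):

```latex
P(Y_i = j \mid x_i) \;=\; \frac{\exp\!\left(x_i^{\top}\beta_j\right)}
  {\sum_{k=1}^{J}\exp\!\left(x_i^{\top}\beta_k\right)}, \qquad j = 1,\dots,J,
```

with one coefficient vector normalized (e.g. the first alternative's coefficients set to zero) for identification. The nonparametric alternatives in the test relax this parametric link between the covariates and the choice probabilities.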

Time Optimal Portfolio Selection: Mean-variance-efficient Sets for Arithmetic and Geometric Brownian Price Processes

Time optimal portfolio selection may be considered as an alternative or a supplement to the standard approach originated by Markowitz. It is appropriate for investors who are interested in reaching a predefined level of wealth and whose preferences can be defined on the feasible distributions of the times at which this goal is reached for the first time. That is, investors are concerned with the first passage time or holding period distributions of the feasible portfolio price processes. The time optimal approach therefore does not require the troubling assumption of a known investment horizon, but concentrates on growth. Based on mean and variance of this distribution, the efficient sets are characterized for constant proportion investment strategies assuming multivariate (1) arithmetic and (2) geometric Brownian asset price processes, given the availability of a risk free asset. Both assumptions yield similar efficient sets with qualitatively different properties in comparison with the standard approach. Furthermore, the sets of efficient portfolios in the time optimal models are shown to be subsets of the efficient set according to the standard model.

T. Burkhardt

Analyzing Multigroup Data with Structural Equation Models

In empirical applications of structural equation modeling, researchers often assume that the sample under investigation is homogeneous unless observed characteristics allow for a division of the sample into mutually exclusive homogeneous subgroups. If such information is not available, unobserved heterogeneity can be taken into account by a finite-mixture approach (Arminger et al. (1998); Jedidi et al. (1997)). The simulation study presented in this paper reveals that this approach clearly outperforms a sequential procedure combining cluster and multigroup analysis.

N. Görz, L. Hildebrandt, D. Annacker

Validity of Customized and Adaptive Hybrid Conjoint Analysis

In this paper we compare the validity of a new type of customized hybrid conjoint analysis called Customized Computerized Conjoint Analysis (CCC) with the Adaptive Conjoint Analysis (ACA) and two self-explicated models. CCC combines self-explicated preference structure measurement with individually designed full-profile conjoint analysis in a fully computerized interview. For a sample of almost 500 German potential customers of refrigerators we find (similar to Srinivasan, Park (1997)) surprisingly robust results of the self-explicated approaches compared to CCC as well as ACA.

S. Hensel-Börner, H. Sattler

Analysis of the Assessments of One’s Values Among Different Cohorts in Age and Sex by Multidimensional Scaling

Differences in assessments of one’s values on personal or private issues among different cohorts in age and sex in the Japanese society were investigated. Responses of about 2,700 people to 13 items were analyzed by multidimensional scaling. The two-dimensional configuration of items given by INDSCAL showed that one dimension represents extroverted or introverted values and the other dimension represents materialistic or post-materialistic values. Each of the 10 different cohorts was imbedded as a vector or as an anti-ideal point in the configuration. It was shown that differences in values among cohorts were closely connected with the role change induced by one’s life events.

Y. Kimura, A. Okada

Impacts of Hedging with Futures on Optimal Production Levels

Decisions about the kind of risk management not only influence the optimal hedging volume, but also affect the optimal production level. We compare hedging advices in two different model environments. Optimal production levels (and optimal hedging volumes) are determined in models with and without futures markets in a conventional modeling approach and in a rational expectations equilibrium (REE). Results differ substantially. Hedging on the futures market is always advantageous and leads to an increased output in the conventional setting, if short hedging (normal hedge or reversed hedge) is optimal. In our REE-model, in case of non-zero output level, only normal hedge is compatible with the assumption of risk-averse market participants. Nevertheless, an active risk management is not always advantageous. If the number of producers exceeds the number of speculators, producers should only take positions on the futures market, when information about a decreasing spot market demand (which has to lie between certain bounds) prevails in the market.

J. Limperger

How to Integrate Stock Price Jumps into Portfolio Selection

Jumps call for a completely new portfolio theory: We have to integrate jump risks explicitly into portfolio planning via correction terms. Therefore, it is clearly a second-best solution either to ignore jumps as rare events or to simply adjust stocks’ mean and standard deviation and to continue to apply the optimization formulas of the pure diffusion case.

B. Nietert

A Multivariate GARCH-M Model for Exchange Rates in the US, Germany and Japan

After the so-called Asia crisis in the summer of 1997 the financial markets were shaken by increased volatility transmission around the world. Therefore, in this paper we analyse the daily exchange rates in New York, Germany, and Japan over a period of 2 years (June 21, 1996 to June 22, 1998). We estimate a VAR-GARCH-in-mean model and examine the multivariate volatility effects between the time series. We are also interested in the question of whether or not the volatility of the 3 exchange rates feeds back on the returns of the exchange rates. Using the marginal likelihood criterion for model selection we choose a VAR-GARCH-M (1,1,2,2) model. The model is estimated using MCMC methods and the coefficients show a quite rich transmission pattern between the financial markets. Comparing the predictive densities we see that the VAR-GARCH-M model produces forecasts with much smaller standard deviations.

W. Polasek, L. Ren

A Model for Constructing Quadratic Objective Functions and its Applications to Price Adjustments

We develop a model for constructing quadratic objective functions in n target variables. At the input, a decision maker is asked a few relatively simple questions about his ordinal preferences (to compare two-dimensional alternatives in terms of ‘better’, ‘worse’, ‘indifferent’). At the output, the model mathematically derives a quadratic objective function used to evaluate n-dimensional alternatives. The model is provided with operational restrictions for the monotonicity of the objective function (= either only growth, or only decrease in every variable) and quasi-concavity of the objective function (= convexity of the associated preference). Constructing a monotonic quasi-concave quadratic objective function from ordinal data is reduced to a least squares problem with linear and polynomial inequality constraints. In illustration, we construct a quadratic objective function of ski station customers, which is then used to adjust prices of 10 ski stations in the south of Stuttgart.

A. Tangian

Biology, Genome Analysis, and Medicine

Frontmatter

Phylogeny Inference by Minimum Conflict

While standard phylogenetic methods often yield good trees from multiple sequence alignments, sometimes these trees contradict the wide body of taxonomic expert knowledge. One reason is that “ancient” character states can persist in a group of taxa, although they are eroded in all other members of a larger monophylum. Then, the group is placed into its own subtree although it may not be monophyletic. We will explain this erosion trap, and how to overcome it, using a specific kind of outgroup comparison. The technique is part of a divide-and-conquer tree inference method which tries to find splits with minimum conflicting evidence, guided by the inconsistency patterns triggered by incorrect splits.

G. Fuellen, J.-W. Wägele

Detecting Change-Points in Aircraft Noise Effects

Economic reasons require the capacity of many existing airports to be extended. Residents living close to the airports, however, are not willing to accept the increased stress resulting from the higher amount of starts and landings and are thus forming a strong opposition. It is of interest to consider the relationship between causal variables (e.g., effective level of energy or vertical distance from the flightpath) and binomial distributed effect variables (e.g., percentage of highly annoyed residents). Typically, this dose-response curve is a decreasing curve with a high plateau near the flightpath, a low plateau far away from the flightpath and a more or less extended intermediate region. Here, we propose estimates and tests for one and two change-points, which provide us with a means to identify the zone where the stress for the affected population is intolerable as well as the zone where no stress due to aircraft noise is observed. The procedures are illustrated by data from a study at Düsseldorf airport.

J. Krauth

A Generalized Estimating Equations Approach to Quantifying the Influence of Meteorological Conditions on Hand Dermatitis

Generalized estimating equations (GEE) have been introduced as the result of transferring the concept of generalized linear models to the context of dependent response variables. Here, we focus on the methodological aspects of applying the generalized estimating equations approach to longitudinal binary data on hand dermatitis in hairdressing apprentices. We studied the influence of temperature and absolute humidity on the occurrence of irritant skin changes. Due to an increased efficiency of the parameter estimates the GEE approach revealed a significant association of existing hand dermatitis with low temperature and low absolute humidity of ambient air.

M. Land, W. Uter, O. Gefeller
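
A GEE analysis of this kind can be approximated with today's off-the-shelf tooling. Below is a minimal sketch using statsmodels' GEE implementation with a binomial family and an exchangeable working correlation; the simulated data, column names, and effect sizes are hypothetical stand-ins for the study's hand-dermatitis data, not the authors' setup.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# simulated longitudinal data: repeated skin examinations per apprentice
rng = np.random.default_rng(0)
n_subjects, n_visits = 60, 6
temp = rng.uniform(0, 25, size=(n_subjects, n_visits))         # ambient temperature
hum = 2.0 + 0.3 * temp + rng.normal(0, 1, temp.shape)          # absolute humidity
logit = 0.5 - 0.12 * temp - 0.20 * hum + rng.normal(0, 0.5, (n_subjects, 1))
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))                  # irritant skin change yes/no

df = pd.DataFrame({
    "dermatitis": y.ravel(),
    "temperature": temp.ravel(),
    "abs_humidity": hum.ravel(),
    "apprentice": np.repeat(np.arange(n_subjects), n_visits),  # cluster id
})

model = smf.gee(
    "dermatitis ~ temperature + abs_humidity",
    groups="apprentice", data=df,
    family=sm.families.Binomial(),
    cov_struct=sm.cov_struct.Exchangeable(),    # working correlation within apprentice
)
print(model.fit().summary())
```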

Discrimination in Paired Data Situations

Paired data occur in a natural way in ophthalmology and other medical fields concerning paired organs. In our paper we deal with the diagnosis of the eye disease “glaucoma” using quantitative measurements. We compare discriminant analysis to the conditional Rosner model. We derive a characterization of the Rosner model as a special case of linear discriminant analysis and present a simple graphical display which allows one to check the assumptions of the Rosner model and the impact of diagnostic information from the fellow eye. In our clinical example, for known disease status of the fellow organ, pathological values of this organ decreased the disease probability of the selected organ.

P. Martus, M. Korth, M. Wisse

Classification of Multidimensional Visual Field Data for Glaucoma Diagnosis

Visual field loss and glaucomatous optic disk damage are typical findings in glaucoma disease. The question is examined how the stages of optic disk damage are reflected in different patterns of visual field loss. Due to the complex situation of 59-dimensional visual field data and the assumed non-linear association with the optic disk diagnosis, a neural network was used to classify visual field patterns into morphometric stages of glaucoma. Good classification results were achieved with a backpropagation network consisting of 59 input neurons and 15 neurons in two hidden layers.

M. Meyer, A. Jünemann, M. Wisse, J. B. Jonas

Quality of Life of Cancer Patients: A Multi-set Analysis

In this research the quality of life of cancer patients is studied. The basic idea is that cancer patients experience a lower quality of life than healthy people. We investigated this hypothesis using a multi-set technique named OVERALS. The hypothesis could partly be confirmed.

E. van der Burg, I. Noordermeer, H. de Haes

A New Approach on Nonparametric Repeated Measures Analysis of Variance

In a previous paper, Wernecke, Kalb (1999) applied the multivariate repeated measures analysis of variance (Timm (1980)) as well as the method of data alignment (Hildebrand (1980)) to a cross-over clinical trial. The latter method consists of an adjustment for those factors not regarded in the current analysis and a subsequent rank analysis of variance. However, the precondition of independent groups for this analysis is only valid from a practical point of view (Jones, Kenward (1990) assume the groups to be independent if the wash-out phases are large enough). Furthermore, ranking after alignment presupposes quantitative scaling. Recently, Brunner, Langer (1998) considered general rank-score statistics for general repeated measures designs, compound symmetry designs and designs for longitudinal data with continuous or ordered categorical data. Therefore, it seems worth applying this approach to our data and comparing the results of all the discussed procedures.

K.-D. Wernecke, J. Kaufmann

Text Analysis and Information Retrieval

Frontmatter

GERHARD (German Harvest Automated Retrieval and Directory)

GERHARD is a fully automatic indexing and classification system for the German World-Wide Web for integrated searching and browsing. It consists of a database-driven robot that collects academically relevant documents, a fast classification module, a database and a trilingual (German, English, and French) user interface for the service itself as well as for maintaining the service. The integration of searching and browsing mechanisms allows the user to look for “similar” documents very easily. At the moment there are 400 areas of the German WWW defined that are considered academically relevant, with more than a million documents classified and indexed. The project GERHARD was started in October 1996 and finished in March 1998 by BIS University Oldenburg in cooperation with OFFIS e.V. Oldenburg and with ISIV (Institute for Semantic Information Processing), Univ. Osnabrück. It has been funded by the German Research Foundation (DFG). Since 1 April 1998 the service has been publicly available and can be accessed via http://www.gerhard.de.

K.-U. Carstensen, B. Diekmann, G. Möller

Automatic Labelling of References for Internet Information Systems

Today users of Internet information services like e.g. Yahoo! or AltaVista often experience high search costs. One important reason for this is the necessity to browse long reference lists manually, because of the well-known problems of relevance ranking. A possible remedy is to complement the references with automatically generated labels which provide valuable information about the referenced information source. Presenting suitably labelled lists of references to users aims at improving the clarity and thus comprehensibility of the information offered and at reducing the search cost. In the following we survey several dimensions for labelling (time, frequency of usage, region, language, subject, industry, and preferences) and the corresponding classification problems. To solve these problems automatically we sketch for each problem a pragmatic mix of machine learning methods and report selected results.

A. Geyer-Schulz, M. Hahsler

Architecture and Retrieval Methods of a Search Assistant for Scientific Libraries

In this paper, we present the design and retrieval methodology of an intuitively operated retrieval assistant (RA) which supports the thematic search in databases of scientific libraries. The retrieval assistant establishes innovative and more adequate means for expressing a user’s search interest by adopting aggregation operators of natural language (e.g. almost all, as many as possible), the interpretation of which is accomplished by novel methods from fuzzy set theory. These operators can be used in their intuitive meaning, i.e. just as in everyday language, for aggregating over sets of weighted search terms. The required scalability of the system is ensured through its multi-tier architecture, which disburdens both the clients and the external database servers by introducing an (arbitrarily replicable) intermediary to perform the computationally intensive aggregation step.

I. Glöckner, A. Knoll

Computer Aided Evaluations of Linguistic Atlases: From Automatic Classification to Dialectometry

After a general introduction to the ALD Project, this paper deals with procedures for the automatic classification of dialect data as implemented in IRS. The linguist's prior knowledge is employed to set the focus for further classification techniques. Using the Weighted Levenshtein Distance, IRS calculates dissimilarities between the dialect responses and applies a hierarchical cluster analysis to find groups in the data. These classified responses of many linguistic atlas maps can be synoptically evaluated and visualised with VDM; this application implements a wide variety of methods known from numerical classification to investigate basilectal structures hidden in linguistic atlases.

E. Haimerl
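
The distance-plus-clustering pipeline described above can be mimicked in a few lines: an edit distance between dialect responses followed by hierarchical clustering of the resulting dissimilarity matrix. The sketch below uses a weighted Levenshtein distance with unit weights and average linkage; it is a generic illustration with invented strings, not the weighting scheme or cluster method of IRS/VDM.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def levenshtein(a, b, w_ins=1.0, w_del=1.0, w_sub=1.0):
    """Weighted edit distance via dynamic programming."""
    d = np.zeros((len(a) + 1, len(b) + 1))
    d[:, 0] = np.arange(len(a) + 1) * w_del
    d[0, :] = np.arange(len(b) + 1) * w_ins
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else w_sub
            d[i, j] = min(d[i - 1, j] + w_del, d[i, j - 1] + w_ins, d[i - 1, j - 1] + sub)
    return d[len(a), len(b)]

# illustrative dialect responses for one atlas map (invented strings)
responses = ["casa", "caza", "chasa", "maison", "mayson"]
n = len(responses)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = levenshtein(responses[i], responses[j])

Z = linkage(squareform(dist), method="average")    # hierarchical clustering
print(fcluster(Z, t=2, criterion="maxclust"))      # cut into two dialect groups
```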

Structured Phrases for Indexing and Retrieval of Research Topics

This paper presents a new approach to improve the quality of information systems in scientific domains. Based on an analysis of the form of metadata descriptions of research results, we develop a model which uses structured phrase representations instead of keyword-based methodologies for indexing and retrieval.

W. Lenski, E. Wette-Roch

Erratum to: Management Science, Marketing, and Finance

Erratum to: Facility Location Planning with Qualitative Location Factors

U. Bankhofer

Backmatter
