Abstract
Appropriate preparation of data for analysis is a key element of empirical research. Depending on the source of the data or the nature of the phenomenon studied, some observations may differ significantly from the others. Including such cases in a study may seriously distort the profile of the population under examination; nevertheless, omitting them can be equally disadvantageous. When analyzing dynamically changing phenomena, especially in the case of big data, a relatively small number of outliers may constitute a coherent and internally homogeneous group which, as subsequent observations are registered, may grow into an independent cluster. Whether or not an outlier is ultimately removed from the dataset, the researcher must first be aware of its existence. For this purpose, an appropriate anomaly detection method should be used. Identifying such units allows the researcher to make an informed decision about the further steps of the analysis.
The assessment of the usefulness of outlier detection methods is increasingly influenced by their applicability to big data problems. The algorithms should be effective for large, diverse datasets that are additionally subject to constant change. For these reasons, apart from high sensitivity, low computational time and the adaptability of the algorithm are also important.
The aim of the presented research is to assess the usefulness of Isolation Forests in outlier detection. The properties of the algorithm and its extensions will be analyzed, and the results of simulation and empirical studies on selected datasets will be presented. The evaluation of the algorithm will take into account the impact of particular features of big datasets on the effectiveness of the methods analyzed.
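To make the setting concrete, the following is a minimal sketch of Isolation Forest outlier detection on synthetic data. It uses scikit-learn's implementation, not the authors' own code, and the synthetic data, the contamination rate, and all parameter values are illustrative assumptions rather than choices from the study itself.

```python
# Minimal sketch: flagging anomalies with an Isolation Forest.
# Uses scikit-learn's IsolationForest; all data and parameters are
# illustrative assumptions, not taken from the paper.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# A compact inlier cluster plus a few distant points (synthetic example).
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(low=6.0, high=8.0, size=(5, 2))
X = np.vstack([inliers, outliers])

# contamination is the expected share of outliers; it must be tuned
# per dataset and is an assumption here.
model = IsolationForest(n_estimators=100, contamination=0.05,
                        random_state=42)
labels = model.fit_predict(X)  # -1 marks anomalies, 1 marks inliers

print("flagged as anomalies:", int((labels == -1).sum()))
```

Because isolation trees separate anomalous points with few random splits, the distant points receive short average path lengths and are flagged without any distance or density computation, which is part of what makes the method attractive for large, changing datasets.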