Incorporating background knowledge into unsupervised algorithms has been shown to improve both model quality and execution speed. However, these gains depend on the quantity and quality of the background knowledge being exploited. In this work, we study the problem of selecting Must-Link and Cannot-Link constraints for semi-supervised clustering. We propose an algorithm that takes as input a set of labeled data instances from which constraints can be derived, ranks these instances by their usability, and then derives constraints from the top-ranked instances only. Our experiments show that the algorithm selects, and thereby reduces, the set of candidate constraints without compromising the quality of the derived model.
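To make the rank-then-derive pipeline concrete, the following is a minimal sketch and not the proposed algorithm itself. It assumes that a usability score per labeled instance is already available; the abstract does not specify how usability is measured, so the `scores` argument, the function name `derive_constraints`, and the cut-off `k` are all hypothetical. The sketch keeps the `k` top-ranked instances and emits a Must-Link constraint for every pair sharing a label and a Cannot-Link constraint for every pair with different labels.

```python
from itertools import combinations


def derive_constraints(labeled, scores, k):
    """Derive pairwise constraints from the k top-ranked labeled instances.

    labeled: dict mapping instance id -> class label
    scores:  dict mapping instance id -> usability score
             (hypothetical placeholder; the ranking criterion is an assumption)
    k:       number of top-ranked instances to keep (hypothetical parameter)
    """
    # Rank instances by descending usability score and keep the top k.
    top = sorted(labeled, key=lambda i: scores[i], reverse=True)[:k]

    must_link, cannot_link = [], []
    # Each unordered pair of retained instances yields exactly one constraint.
    for a, b in combinations(top, 2):
        if labeled[a] == labeled[b]:
            must_link.append((a, b))    # same label: Must-Link
        else:
            cannot_link.append((a, b))  # different labels: Cannot-Link
    return must_link, cannot_link


# Toy example: four labeled instances, keep the three highest-scoring ones.
labeled = {0: "a", 1: "a", 2: "b", 3: "b"}
scores = {0: 0.9, 1: 0.2, 2: 0.8, 3: 0.7}
ml, cl = derive_constraints(labeled, scores, k=3)
```

Because n retained instances produce n(n-1)/2 pairwise constraints, restricting derivation to the top-ranked instances shrinks the candidate constraint set quadratically: all four instances above would give six constraints, while the top three give three.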