The Web has been rapidly “deepened” with the prevalence of databases online. On this “deep Web,” numerous sources are
, providing schema-rich data. Their schemas define the
. This paper proposes clustering sources by their query schemas
, which is critical for enabling both
, by organizing sources with similar query capabilities. Abstractly, this problem is essentially one of clustering categorical data (by viewing each query schema as a transaction). Our approach hypothesizes that “homogeneous sources” are characterized by the same hidden generative models for their schemas. To find clusters governed by such statistical distributions, we propose a novel objective function, model-differentiation, which employs principled hypothesis testing to maximize statistical heterogeneity among clusters. Our evaluation shows that, on clustering Web query schemas, the model-differentiation function outperforms existing objective functions when used with the hierarchical agglomerative clustering algorithm.
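
As a rough illustration of the idea, the sketch below treats each query schema as a transaction (a set of attribute names) and runs hierarchical agglomerative clustering over them. The abstract does not define the model-differentiation function, so the merge criterion here is an assumed stand-in: a Pearson chi-square homogeneity statistic over attribute counts, with the least heterogeneous pair of clusters merged at each step. The function names `chi_square` and `cluster`, and the sample attribute names in the usage note, are hypothetical.

```python
from collections import Counter
from itertools import combinations


def chi_square(a: Counter, b: Counter) -> float:
    """Pearson chi-square statistic for homogeneity of two attribute-count vectors.

    Assumed stand-in for a hypothesis-testing objective; not the paper's definition.
    Both counters must be non-empty.
    """
    total_a, total_b = sum(a.values()), sum(b.values())
    grand = total_a + total_b
    stat = 0.0
    for attr in set(a) | set(b):
        col = a[attr] + b[attr]  # Counter returns 0 for missing attributes
        for observed, row in ((a[attr], total_a), (b[attr], total_b)):
            expected = row * col / grand
            stat += (observed - expected) ** 2 / expected
    return stat


def cluster(schemas, k):
    """Agglomeratively merge schemas (sets of attribute names) into k clusters."""
    # Each cluster is (member indices, aggregated attribute counts).
    clusters = [([i], Counter(s)) for i, s in enumerate(schemas)]
    while len(clusters) > k:
        # Merge the pair whose attribute distributions look most alike,
        # which leaves the remaining clusters as heterogeneous as possible.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: chi_square(clusters[p[0]][1], clusters[p[1]][1]),
        )
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return [sorted(members) for members, _ in clusters]
```

For example, calling `cluster` with `k=2` on two book-like schemas (`{"title", "author", "isbn"}`, `{"title", "author", "publisher"}`) and two airfare-like schemas (`{"from", "to", "departure date"}`, `{"from", "to", "return date"}`) groups the schemas by domain. A greedy pairwise statistic is used only to keep the sketch short; an objective defined over the whole partition would be evaluated differently.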