Predicting the diversity of words and multi-words (n-grams) in a text corpus, and their frequency distributions, is important in NLP and language modeling, and is becoming critical to the design of modern applications such as Large Language Models, e.g. for guiding tokenization and corpus analysis for pre-training. This requires the ability to model the behaviour of very large-scale corpora, to handle multi-words as subwords or phrases, and to capture the distribution of n-grams across different frequency ranges, in particular low-occurrence n-grams. We present a scalable model to predict the number of distinct n-grams and their frequency distributions over an extended range of corpus sizes, from hundreds of millions to hundreds of billions of words (a factor of 1000). This led us to a novel approach that explicitly incorporates into the model the dependency of its parameters on corpus size across this extended range.
Across this extended range of corpus sizes, the model estimates, for a given language corpus, the cumulative number of distinct n-grams (\(1\le n\le 6\)) with frequency greater than or equal to a given threshold \(k\ge 1\), as well as the number of n-grams occurring with each given frequency. Unlike most approaches, which assume an open, potentially infinite word vocabulary, this model relies on the finiteness of the vocabulary. The model achieves low and stable average relative errors (circa 2%) for low frequencies, starting with singletons, from 1-grams to 6-grams, across the full range of corpus sizes, in English and German.
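To make the target quantities concrete, the following minimal Python sketch (an illustration of the quantities being predicted, not the paper's model) computes, from a tokenized corpus, the empirical cumulative number of distinct n-grams with frequency at least \(k\) and the number of n-grams with a given equal frequency; the function names and toy corpus are hypothetical.

```python
from collections import Counter

def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def distinct_at_least(counts: Counter, k: int) -> int:
    """Cumulative count: distinct n-grams occurring at least k times."""
    return sum(1 for c in counts.values() if c >= k)

def equal_frequency(counts: Counter, k: int) -> int:
    """Equal-frequency count: distinct n-grams occurring exactly k times."""
    return sum(1 for c in counts.values() if c == k)

# Toy whitespace-tokenized corpus (hypothetical example).
tokens = "the cat sat on the mat and the cat slept".split()
for n in (1, 2):
    counts = ngram_counts(tokens, n)
    print(n, distinct_at_least(counts, 1), distinct_at_least(counts, 2),
          equal_frequency(counts, 1))
```

The model described above predicts these quantities directly from the corpus size, without exhaustively counting n-grams as this sketch does; the sketch only serves to define the empirical values against which such predictions are evaluated.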