A Scalable Model for Frequency Distribution of Low Occurrence Multi-words Towards Handling Very Large Spectrum of Text Corpora Sizes

  • 2026
  • OriginalPaper
  • Chapter

Abstract

This chapter develops a scalable model for predicting the frequency distribution of low-occurrence multi-words in large text corpora. The model addresses n-grams, sequences of n consecutive words, whose statistical distributions matter for applications such as indexing, term extraction, compression, cache design, and translation. Understanding how n-gram distributions vary with corpus size also helps guide tokenization strategies and corpus analysis for pre-training language models. Traditional models consider only moderate-sized corpora and single words; the chapter addresses this limitation and emphasizes the role of multi-word n-grams in capturing semantic specificity and language structure. The proposed model predicts the cumulative number of distinct n-grams and their sizes, achieving very low and stable average relative errors across corpus sizes ranging from hundreds of millions to hundreds of billions of words. The chapter also describes how the model parameters are estimated, including the use of cross-validation and spline-based regression, and presents experimental results that demonstrate the model's accuracy and stability. The findings suggest that the approach is promising for very large corpora and opens possibilities for handling relevant low-occurrence multi-words in emerging applications based on large language models.
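
To make the quantities in the abstract concrete, the following minimal Python sketch counts n-grams in a toy token list and tabulates how many distinct n-grams occur exactly k times. It is an illustration only, not the authors' model: the function names are hypothetical, and reading the "cumulative number of distinct n-grams" as the count of n-grams with at most a given number of occurrences is an assumption, since the abstract does not define it.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return an iterator over every sequence of n consecutive tokens."""
    return zip(*(tokens[i:] for i in range(n)))


def frequency_distribution(tokens, n):
    """Map each occurrence count k to the number of distinct n-grams seen exactly k times."""
    counts = Counter(ngrams(tokens, n))
    return Counter(counts.values())


def cumulative_distinct(dist, max_k):
    """Assumed reading: number of distinct n-grams occurring at most max_k times."""
    return sum(size for k, size in dist.items() if k <= max_k)


# Toy corpus; a real study would stream billions of words instead.
tokens = "the cat sat on the mat and the cat sat on the rug".split()
dist = frequency_distribution(tokens, 2)
low_occurrence = cumulative_distinct(dist, 1)  # bigrams seen exactly once
all_distinct = cumulative_distinct(dist, 2)    # bigrams seen once or twice
```

Repeating this tabulation on corpora of increasing size would give the kind of empirical curve that a predictive model of distinct n-gram counts is fitted against.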

Title
A Scalable Model for Frequency Distribution of Low Occurrence Multi-words Towards Handling Very Large Spectrum of Text Corpora Sizes
Authors
Joaquim F. Silva
Jose C. Cunha
Copyright Year
2026
DOI
https://doi.org/10.1007/978-3-032-06109-6_23