
Towards a Digital Archivist: Applications of LLMs in Automated Web Archive Description

  • 2026
  • OriginalPaper
  • Chapter


Abstract

This chapter examines the automated description of web archives using large language models, specifically Qwen3-8B. The model is fine-tuned on a curated dataset of high-quality web page descriptions so that it generates accurate, contextually relevant metadata. The pipeline extracts and cleans HTML records from WARC files, then generates descriptive metadata. Professionals, including archivists and librarians, evaluate the fine-tuned model for reliability and usability. The results show that the model produces descriptions with high semantic fidelity that need minimal manual editing. The chapter also covers the system's implementation, including a user-friendly interface for processing and previewing archived web content, and it identifies the challenges and future improvements involved in handling sensitive content and non-English web pages. The study concludes that the automated system significantly enhances metadata workflows in digital preservation and offers a robust way to generate trustworthy archival descriptions.
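The "extract and clean HTML records" step mentioned in the abstract can be illustrated with a minimal sketch. This is not the chapter's implementation: it assumes the raw HTML payload of a WARC response record has already been read (for example with a WARC-reading library), and the names `SKIPPED_TAGS`, `VisibleTextExtractor`, and `clean_html_record` are hypothetical, introduced only for illustration. It uses only the Python standard library to strip scripts and styles and collapse whitespace, leaving a title and body text that could then be passed to a description model.

```python
from html.parser import HTMLParser

# Assumed list of tags whose contents are not visible page text.
SKIPPED_TAGS = {"script", "style", "noscript", "template"}


class VisibleTextExtractor(HTMLParser):
    """Collects the <title> and the visible text of an HTML document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0      # > 0 while inside a skipped tag
        self._in_title = False
        self.title_parts = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script/style contents
        if self._in_title:
            self.title_parts.append(data)
        else:
            self.text_parts.append(data)


def clean_html_record(payload: bytes, encoding: str = "utf-8") -> dict:
    """Hypothetical helper: turn one HTML payload into a title and plain text."""
    html = payload.decode(encoding, errors="replace")
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the model input is compact.
    title = " ".join("".join(parser.title_parts).split())
    text = " ".join(" ".join(parser.text_parts).split())
    return {"title": title, "text": text}


if __name__ == "__main__":
    sample = b"<html><head><title>Example</title></head><body><p>Hi</p></body></html>"
    print(clean_html_record(sample))
```

A real pipeline would likely also need to detect each record's character encoding and filter out non-HTML records before this step; both are left out of the sketch.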

Title
Towards a Digital Archivist: Applications of LLMs in Automated Web Archive Description
Author
Hao Zhang
Copyright Year
2026
Publisher
Springer Nature Singapore
DOI
https://doi.org/10.1007/978-981-95-4861-3_40
