Cross-Modal Communication Technology
- 2026
- Book
- Authors
- Xin Wei
- Dan Wu
- Liang Zhou
- Book Series
- Wireless Networks
- Publisher
- Springer Nature Switzerland
About this book
This book offers a systematic and in-depth investigation of Cross-Modal Communication Technology. In Chapter 1, the authors present the background and motivation of this domain. In Chapter 2, the authors outline multi-modal services and a general cross-modal communication architecture, including typical service categories, communication requirements, architectural details, challenges, and design principles. In Chapters 3, 4, and 5, the authors respectively propose three key techniques within the cross-modal communication architecture: cross-modal coding, cross-modal transmission, and cross-modal signal recovery. For each technique, the authors provide an in-depth presentation of the representative algorithms and the related experimental results. Chapter 6 describes an established database and the three developed cross-modal communication prototype systems. Chapter 7 introduces extended techniques, such as cross-modal semantic communications and cross-modal communications in extremely low-resource scenarios. Finally, the authors provide a summary of the entire book and discuss future research directions in Chapter 8. Readers will gain a deep understanding of the development and technical details of this new communication paradigm for the emerging multi-modal services.
Currently, haptic information is gradually being integrated into traditional audio-visual-dominant multimedia services, forming multi-modal services. Multi-modal services have been considered killer applications in the B5G and 6G eras, including remote industrial robotic grasping, tele-surgery and diagnosis, online immersive shopping and gaming, etc. To support the thriving multi-modal services, the cross-modal communication paradigm is proposed. Compared with traditional multimedia communications, the cross-modal communication paradigm is characterized by the collaborative transmission and processing of audio, visual, and haptic signals and streams. By fully exploring the potential correlations among audio, visual, and haptic modalities, it enhances natural human-machine interactions and immersive experiences in multi-modal services, while meeting the requirements of low latency, high reliability, and high throughput.
This book targets graduate and undergraduate students majoring in the areas of wireless communications, computer science and engineering, and electrical engineering, as well as researchers working in this field. Professionals working on multimedia communications for the emerging multi-modal services will also find this book valuable.
Table of Contents
Frontmatter
Chapter 1. Introduction
Xin Wei, Dan Wu, Liang Zhou
Abstract
As the demand for comprehensive digitization and intelligence continues to grow, multi-modal services that integrate audio, visual, and haptic elements have emerged [1, 2]. The International Telecommunication Union (ITU) Network 2030 Focus Group has identified these services as killer applications for the B5G and 6G era due to their ability to enhance immersive experiences in scenarios such as industrial manufacturing, rehabilitation medicine, and intelligent education. Cisco's white paper on future network trends projects a 100-fold increase in global mobile data traffic by 2030, with 81% attributed to multi-modal services [3]. The rapid growth of multi-modal services has led to a massive increase in multi-modal data, including audio, visual, and haptic signals, posing significant challenges to current multimedia communication systems. The IMT-2030 (6G) white paper published by the China Academy of Information and Communications Technology highlights that, for fully immersive real-time interaction, transmission delay must be under 1 ms, data rate should be over 10 Gbps, and reliability should reach 99.99999% [4]. However, existing 5G-based systems have 20 ms end-to-end delay and a packet loss rate of around 10^-5, with high deployment costs. Therefore, communication technology should be re-designed to accommodate the surge in multi-modal data volume and fulfill the requirements of humans' immersive experiences.
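The gap between the cited 5G baseline and the IMT-2030 targets can be made concrete with a quick back-of-the-envelope check. This is a sketch using only the figures quoted in the abstract; the variable names are ours, and 99.99999% reliability is read as a residual loss rate of 10^-7:

```python
# Figures quoted in Chapter 1: IMT-2030 immersive-interaction targets
# vs. the current 5G baseline.
TARGET_LATENCY_MS, CURRENT_LATENCY_MS = 1.0, 20.0
TARGET_LOSS_RATE, CURRENT_LOSS_RATE = 1e-7, 1e-5  # 99.99999% reliability -> 1e-7

latency_gap = CURRENT_LATENCY_MS / TARGET_LATENCY_MS  # 20x reduction needed
loss_gap = CURRENT_LOSS_RATE / TARGET_LOSS_RATE       # 100x improvement needed

print(f"latency must improve {latency_gap:.0f}x, reliability {loss_gap:.0f}x")
```

In other words, end-to-end latency must shrink by a factor of 20 and residual loss by two orders of magnitude, which motivates the re-design argued for above.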
Chapter 2. Cross-Modal Communication Architecture
Xin Wei, Dan Wu, Liang Zhou
Abstract
In this chapter, the typical multi-modal services that have emerged in recent years are firstly described. These services can be categorized into three main types: haptic-dominant, audio-visual-dominant, and equally important multi-modal services. Following this, an analysis of the characteristics of different modalities is conducted. On the one hand, the heterogeneous communication requirements of audio, visual, and haptic signals are examined. On the other hand, the diverse communication requirements of the different categories of multi-modal services are also presented. Subsequently, to meet these requirements, a cross-modal communication architecture is constructed and the details of its components are described. Finally, the key techniques within this architecture are introduced. Additionally, the issues that need to be addressed as well as the core ideas behind the design of these key techniques are also discussed.
Chapter 3. Cross-Modal Coding
Xin Wei, Dan Wu, Liang Zhou
Abstract
In this chapter, a visual-haptic mutual redundancy elimination scheme is firstly provided by considering mutual assistance between these two modalities. Then, an audio-haptic redundancy elimination mechanism is explored from the perspective of time-frequency masking. Subsequently, by exploring the benefit of potential correlation, a visual-aided haptic coding scheme is proposed to support cross-modal communications. From a theoretical perspective, the minimum number of bits required to compress haptic signals under the rate conditions of visual streams is explored. From a technical perspective, AI-empowered cross-modal prediction and channel coding is designed to support this scheme. Finally, to further achieve the optimum rate of multi-modal streams, a general cross-modal coding strategy is presented, which concerns the mutual assistance between visual and haptic modalities.
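The theoretical point that visual correlation lowers the bit budget for haptic compression can be illustrated with conditional entropy: when the decoder has a correlated visual feature V, the information-theoretic rate bound for the haptic symbol H drops from H(H) to H(H|V). A toy computation (the joint distribution below is a made-up illustration, not from the book):

```python
import math

# Hypothetical joint distribution p(v, h) over a binary visual feature V
# and a binary haptic symbol H; mass on the diagonal encodes correlation.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def entropy(dist):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(q * math.log2(q) for q in dist if q > 0)

p_h = [sum(q for (v, h), q in p.items() if h == x) for x in (0, 1)]
p_v = [sum(q for (v, h), q in p.items() if v == x) for x in (0, 1)]
H_h = entropy(p_h)                        # bits/symbol to code H alone: 1.0
H_h_given_v = entropy(list(p.values())) - entropy(p_v)  # H(H|V) = H(V,H) - H(V)

print(H_h, H_h_given_v)  # H(H|V) < H(H): side information saves bits
```

Here H(H) is 1 bit but H(H|V) is roughly 0.72 bits, so a coder that exploits the visual stream needs fewer bits per haptic symbol, which is the intuition behind the visual-aided haptic coding scheme.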
Chapter 4. Cross-Modal Transmission
Xin Wei, Dan Wu, Liang Zhou
Abstract
In this chapter, the cross-modal transmission techniques are described in detail, which can be divided into two main categories. Schemes in the first category belong to the preemptive-resume scheduling pattern, where haptic streams have higher priority than audio-visual streams during transmission. On one hand, low-latency and high-reliability demands for haptic streaming should be guaranteed. Considering this, the intra-modal correlation-based haptic streaming scheme is proposed, which explores correlation within the haptic modality to employ efficient transmission. On the other hand, due to the occasional arrival of haptic streams, the throughput of audio-visual streams may well be reduced or even interrupted. The inter-modal correlation-based audio-visual streaming and content-aware stream scheduling schemes are respectively designed, handling the fluctuant audio-visual transmission quality issues with the aid of the haptic modality. Schemes in the second category belong to the non-preemptive-resume scheduling pattern. In other words, the transmission resources allocated to haptic and audio-visual modalities are under comprehensive and coordinated consideration. Among them, the modal-aware stream scheduling scheme flexibly settles the transmission priority of the modal stream instead of the data flow, achieving a tradeoff between various requirements. The adaptive prediction-aware scheme predicts and transmits the multi-modal signals in advance to reduce the whole latency during stream delivery. The dynamic transmission mode selection scheme tries to improve resource utilization efficiency by optimizing both the chosen transmission strategy and the allocated resource. Finally, the edge intelligence-empowered transmission decision scheme carefully considers factors such as communication, caching, computation, and control capacities, and realizes autonomous transmission decisions through AI technology.
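The first category's core idea, strict priority for haptic traffic, can be sketched as a priority queue in which a haptic packet always dequeues ahead of queued audio-visual packets. This is a minimal illustration of the scheduling pattern only; the packet fields and priority values are our own, not the book's:

```python
import heapq

# Strict-priority sketch: haptic (priority 0) preempts audio-visual (priority 1).
PRIORITY = {"haptic": 0, "audio": 1, "visual": 1}

class PrioritySched:
    def __init__(self):
        self._q, self._seq = [], 0
    def push(self, modality, payload):
        # Sequence number preserves FIFO order within the same priority level.
        heapq.heappush(self._q, (PRIORITY[modality], self._seq, modality, payload))
        self._seq += 1
    def pop(self):
        return heapq.heappop(self._q)[2:]  # -> (modality, payload)

s = PrioritySched()
s.push("visual", "frame-1")
s.push("audio", "chunk-1")
s.push("haptic", "force-1")   # arrives last, served first
print(s.pop())                # ('haptic', 'force-1')
```

The non-preemptive schemes in the second category replace this fixed priority with a jointly optimized allocation across modalities, which is what the modal-aware and prediction-aware schemes above address.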
Chapter 5. Cross-Modal Signal Recovery
Xin Wei, Dan Wu, Liang Zhou
Abstract
To handle lost, damaged, or delayed modality signals after transmission, corresponding signal recovery schemes should be designed. Firstly, for haptic-dominant multi-modal services, an audio-visual-aided haptic signal recovery scheme is proposed, while for audio-visual-dominant multi-modal services, a haptic-aided visual signal recovery scheme is presented. The core idea is to utilize one modality's signal (audio-visual or haptic) to recover the other, desired modality's signal (haptic or visual) by leveraging their potential correlations. Then, for equally important multi-modal services, a visual-haptic mutual signal recovery scheme is provided, which can be seen as the general case of the two schemes mentioned above. Subsequently, besides these three representative cross-modal signal recovery solutions, information retrieval is another method with fast response and high-reliability characteristics. Accordingly, an information retrieval-based signal recovery scheme is designed. Finally, in order to enhance the display quality at the receiver and guarantee human experience, a cross-modal super-resolution reconstruction strategy is proposed, which achieves high-resolution visual signals from the received low-resolution visual signals and the related haptic signals.
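The correlation-exploiting idea behind these recovery schemes can be sketched with a linear stand-in for the learned cross-modal model: fit a mapping from a received visual feature to the haptic signal on intact pairs, then use it to estimate a lost haptic block. The data is synthetic and the linear model is an illustrative assumption; the book's schemes use far richer models:

```python
import numpy as np

# Synthetic correlated pair: a visual feature (e.g. a hypothetical
# contact-area measure) and a haptic force signal derived from it.
rng = np.random.default_rng(0)
visual = rng.uniform(0.0, 1.0, 100)
haptic = 2.0 * visual + 0.5 + rng.normal(0.0, 0.05, 100)

# Fit the cross-modal model on the first 80 intact pairs,
# then recover the "lost" last 20 haptic samples from visual alone.
a, b = np.polyfit(visual[:80], haptic[:80], 1)
recovered = a * visual[80:] + b
err = float(np.mean(np.abs(recovered - haptic[80:])))
print(f"mean abs recovery error: {err:.3f}")  # small vs. signal range ~[0.5, 2.5]
```

The same pattern, with the linear fit replaced by a learned network and with recovery running in both directions, underlies the mutual visual-haptic scheme described above.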
Chapter 6. Cross-Modal Communication Prototype System
Xin Wei, Dan Wu, Liang Zhou
Abstract
Based on theoretical and technical investigations, in this chapter, several developed cross-modal communication prototype systems in the areas of industrial manipulation, smart education, and intelligent healthcare are described. Firstly, a large-scale multi-modal database named VisTouch is established, which supports research on cross-modal communications. Secondly, the details about the design and implementation of a visual-haptic human-machine interactive system are provided, which effectively meets the demands of remote industrial robotic grasping applications. Thirdly, a virtual acupuncture skill training application is constructed to support the emerging remote practical teaching demands. Finally, a remote throat swab sampling platform is constructed, which is of great significance in protecting the lives of medical staff and breaking the chains of virus transmission during the COVID-19 pandemic.
Chapter 7. Extended Techniques
Xin Wei, Dan Wu, Liang Zhou
Abstract
In this chapter, two extended cross-modal communication techniques are presented. On the one hand, cross-modal communication technology has been deeply integrated with semantic communications, which has attracted widespread attention in recent years, forming a new communication paradigm called cross-modal semantic communications. This new communication paradigm fully explores the semantics both within each modality (intra-modal semantics) and among modalities (inter-modal semantics), and combines them organically. It can effectively solve the polysemy and ambiguity issues existing in current semantic communications and ensure the stability of semantic transmissions. On the other hand, in scenarios with extremely limited communication and computing resources, such as underwater equipment troubleshooting, agricultural operations in remote mountainous areas, and emergency rescue missions responding to natural disasters, existing regular cross-modal communication techniques often fail, and the associated algorithms need to be re-designed. This chapter takes cross-modal haptic signal recovery in an emergency rescue scenario as an example to illustrate the encountered challenges and available solutions in this direction.
Chapter 8. Conclusion
Xin Wei, Dan Wu, Liang Zhou
Abstract
To support the thriving multi-modal services, the cross-modal communication paradigm is proposed by the MultiMedia Communication (MMC) Research Group, which comprises researchers from Nanjing University of Posts and Telecommunications and Army Engineering University of PLA. This book offers a systematic and in-depth investigation into this highly promising research area. Several concluding remarks follow. Firstly, this book provides a deep introduction to the characteristics of the current and future multi-modal services. Based on these characteristics, a general cross-modal communication architecture is constructed. Three key techniques, their encountered challenges, and design principles are also analyzed. It can be concluded that both the architecture construction and the design of the key techniques should center on the core notion of potential correlations among modalities, or "semantics". In other words, the focus should be on uncovering, analyzing, and utilizing semantic information that reflects the potential correlations among modalities.
- Title
- Cross-Modal Communication Technology
- Authors
-
Xin Wei
Dan Wu
Liang Zhou
- Copyright Year
- 2026
- Publisher
- Springer Nature Switzerland
- Electronic ISBN
- 978-3-032-00957-9
- Print ISBN
- 978-3-032-00956-2
- DOI
- https://doi.org/10.1007/978-3-032-00957-9
PDF files of this book have been created in accordance with the PDF/UA-1 standard to enhance accessibility, including screen reader support, described non-text content (images, graphs), bookmarks for easy navigation, keyboard-friendly links and forms, and searchable, selectable text. We recognize the importance of accessibility, and we welcome queries about accessibility for any of our products. If you have a question or an access need, please get in touch with us at accessibilitysupport@springernature.com.