In this section, we first provide an overview of prior works on general style transfer in Sect. 2.1, including image, audio, and text style transfer. Then, we focus on motion style transfer in Sect. 2.2. We also review motion synthesis from multi-modal data in Sect. 2.3.
2.1 Style transfer
In recent years, style transfer has achieved impressive progress across various fields, including computer vision, speech and music processing, natural language processing, and motion animation.
In the field of computer vision, the pioneering work of Gatys et al. [1] introduces the concept of style transfer and leverages the hierarchical layers of convolutional neural networks (CNNs) to extract both the underlying content structures and the stylistic elements, using an optimization-based technique to transfer styles between arbitrary images. Later, Li et al. [15] propose whitening and coloring transforms (WCTs) to stylize images by matching the second-order statistics of content and style features. More generally, Huang et al. [16] introduce an adaptive instance normalization (AdaIN) layer to address the challenge of applying arbitrary target styles in image style transfer; AdaIN has since been broadly adopted to fuse style and content information in image generation and image-to-image translation [17–19]. Zhu et al. [6] introduce CycleGAN, which uses a pair of generators and discriminators to learn mappings between two unpaired image domains. This general idea is further developed in StarGAN [13], which incorporates domain labels as additional input and enables image style transfer among multiple domains, such as facial appearances and expressions.
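For reference, the AdaIN operation of [16] aligns the channel-wise statistics of a content feature map $x$ with those of a style feature map $y$:
$$\mathrm{AdaIN}(x, y) = \sigma(y)\,\frac{x - \mu(x)}{\sigma(x)} + \mu(y),$$
where $\mu(\cdot)$ and $\sigma(\cdot)$ denote the channel-wise mean and standard deviation. This parameter-free normalization is what enables arbitrary-style transfer in a single feed-forward pass.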
Voice conversion (VC) refers to the technique of converting non-linguistic or para-linguistic information of a source speech into that of a desired target speech while keeping the linguistic content unchanged. While some early VC frameworks have achieved success [20, 21], they rely on precisely aligned parallel data of source and target speech. To address this limitation, researchers have turned to non-parallel VC techniques. For example, Hsu et al. [22] construct a VC system from non-parallel speech with variational autoencoders and Wasserstein GANs. Kameoka et al. [23] build an auxiliary classifier VAE with information-theoretic regularization for model training. Kaneko and Kameoka [24] propose CycleGAN-VC, a variation of the CycleGAN architecture that uses gated CNNs and an identity-mapping loss. This was later improved by CycleGAN-VC2 [10], which adds a 2-1-2D convolution structure and a two-step adversarial loss to boost performance. The approach has also been extended to StarGAN-based architectures to enable many-to-many mappings across different domains [14, 25]. Fu et al. [11] incorporate transformers and curriculum learning into voice conversion to improve training efficiency.
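For completeness, writing $G_{X \to Y}$ and $G_{Y \to X}$ for the two generators, the cycle-consistency and identity-mapping objectives at the core of the CycleGAN-VC family take the standard forms
$$\mathcal{L}_{cyc} = \mathbb{E}_{x}\big[\lVert G_{Y \to X}(G_{X \to Y}(x)) - x \rVert_1\big] + \mathbb{E}_{y}\big[\lVert G_{X \to Y}(G_{Y \to X}(y)) - y \rVert_1\big], \quad
\mathcal{L}_{id} = \mathbb{E}_{y}\big[\lVert G_{X \to Y}(y) - y \rVert_1\big] + \mathbb{E}_{x}\big[\lVert G_{Y \to X}(x) - x \rVert_1\big],$$
which preserve the underlying content across the two unpaired domains without requiring aligned data.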
With the development of MIDI format parsing, research has also been carried out on transferring symbolic music styles, as demonstrated by Groove2Groove [26], which employs an encoder-decoder architecture trained on parallel data, and by Malik et al. [27], who introduce StyleNet with a shared GenreNet to learn various styles for music translation. Brunner et al. [2] use a CycleGAN-based approach for MIDI music. Ding et al. [28] design SteelyGAN, a symbolic-domain transfer approach that combines both pixel-level and latent-level features. Regarding style transfer in natural language processing (NLP), Mueller et al. [29] propose recurrent variational autoencoders (VAEs) to modify text sequences. Fu et al. [30] develop a multi-decoder and style-embedding model with adversarial networks to learn content and style representations. Dai et al. [31] propose the Style Transformer, a network with a tailored training scheme that integrates an attention mechanism and makes no assumption about the latent representation of the source text. Finally, Xu et al. [32] introduce a cycled reinforcement learning approach for unpaired sentiment-to-sentiment translation.
Our research focuses on transferring motion data, specifically dance movements. We use the CycleGAN-VC2 backbone, originally designed for voice conversion, as our foundation, and extend the model to a StarGAN-based framework to improve scalability. We also incorporate an additional music modality into our approach to improve training performance.
2.2 Motion style transfer
Motion style transfer has been a longstanding challenge in computer animation: the motion style of a source animation is transferred to a target animation while the key content, such as structure, timing, and spatial relationships, is preserved. Prior research in motion style transfer relied on handcrafted features [33–38]. Since style is a challenging attribute to define precisely, most modern studies advocate data-driven approaches for feature extraction [4, 39–45]. Commonly used models for style transfer include K-nearest neighbors (KNNs) [46], convolutional autoencoders [39], temporally invariant AdaIN layers [5], CycleGAN [45], spatial-temporal graph neural networks [44], and autoregressive flows [47]. Furthermore, certain studies focus on efficient real-time style transfer [41, 43, 46]. However, all these studies target relatively simple human movements, such as exercise and locomotion, where the stylistic variation is often limited, e.g., the transfer between child and adult locomotion [45]. In contrast, our work deals with the transfer of dance movements, which possess a significant level of complexity in terms of postures, transitions, rhythms, and artistic styles. Consequently, our research may have more empirical and practical value for the video game and film industries. Given these intricacies, our method differs significantly from the reviewed research: we utilize transformers and curriculum learning on top of CycleGAN-VC2 to enable more effective training on more complex motion data.
Another important task that accompanies motion style transfer is evaluating the quality of the synthesized animation. While subjective surveys help estimate movement quality, e.g., by recruiting a group of dance experts with defined requirements, relying on them for evaluation can be expensive, time-consuming, and poorly reproducible [38]. Objective metrics for quantitative evaluation eliminate the need for human involvement and thus avoid these issues. The Fréchet Inception Distance (FID) [48], which has proven effective in assessing synthesized images in computer vision, has become a standard for evaluating image generative models. Building on the success of FID, Wang et al. [49] extend the concept to motion data. Yoon et al. [50] define the Fréchet Gesture Distance (FGD), which evaluates speech-driven gesture generation by the distance between gesture feature distributions. Maiorca et al. [38] transform motions into image representations and introduce the Fréchet Motion Distance (FMD) to assess the quality and diversity of synthesized motion. Valle-Pérez et al. [8] evaluate the realism of music-based dance generation by measuring the Fréchet distance between the distributions of poses and movements. For the motion style transfer task, we propose a Fréchet Pose Distance (FPD), based on the distribution of key poses, to assess content preservation, as well as a Fréchet Motion Distance (FMD), the Fréchet distance between the distributions of real and generated dance motions, to evaluate transfer strength.
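All of these metrics build on the same quantity: modeling the real and generated feature distributions as Gaussians with means $\mu_r, \mu_g$ and covariances $\Sigma_r, \Sigma_g$, the squared Fréchet distance has the closed form
$$d^2\big((\mu_r, \Sigma_r), (\mu_g, \Sigma_g)\big) = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big),$$
so a lower value indicates that the two distributions, and hence the corresponding pose or motion sets, are closer.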
2.3 Music-conditioned motion synthesis
Numerous studies have focused on human motion synthesis, utilizing techniques such as deep feedforward networks [51], convolutional networks [52], recurrent models [53], graph neural networks [54], and autoencoders [55]. Dance and music are often intertwined, leading to an emerging research topic known as cross-modal motion generation, which aims to better understand the association between modalities and to improve music-conditioned motion synthesis. Early works in cross-modal motion generation mostly relied on statistical models [56–58]; these models typically generate motions by selecting pre-existing dance moves that match particular music features, such as rhythm, intensity, and structure. With the recent advances in deep learning and the availability of large-scale datasets, learning-based methods have been developed to learn the patterns connecting music and motion. For example, ChoreoMaster [9] proposes an embedding module to capture music-dance connections, while DeepDance [59] designs a cross-modal association system to correlate dance motion with music. Lee et al. [60] propose a decomposition-to-composition framework that leverages MM-GAN for music-based dance unit organization. The DanceNet model, proposed in [60], uses a musical context-aware encoder to fuse music and motion features. DanceFormer [61] utilizes kinematics-enhanced transformer-guided networks to perform motion curve regression. In a recent work, Valle-Pérez et al. [8] successfully employ cross-modal transformers to model the relationship between music and motion distributions.
Music-conditioned dance synthesis refers to the task of generating dance motion sequences that are synchronized with a given musical context. In contrast, our work focuses on the dance style transfer task, which involves manipulating the style of existing dance movements while preserving their content information. Although our style transfer model does not require music as a conditioning input, incorporating music can enhance the quality of the generated movements.