AImoclips: A Benchmark for Evaluating Emotion Conveyance in Text-to-Music Generation

Gyehun Go, Satbyul Han, Ahyeon Choi, Eunjin Choi, Juhan Nam and Jeong Mi Park

Abstract

Recent advances in text-to-music (TTM) generation have enabled controllable and expressive music creation from natural language prompts. However, the emotional fidelity of TTM systems remains largely underexplored relative to evaluation criteria such as human preference or text alignment. In this study, we introduce AImoclips, a benchmark for evaluating how well TTM systems convey intended emotions to human listeners, covering both open-source and commercial models. We selected 12 emotion intents spanning the four quadrants of the valence-arousal space and used six state-of-the-art TTM systems to generate over 1,000 music clips. A total of 111 participants rated the perceived valence and arousal of each clip on a 9-point Likert scale. Our results show that commercial systems tend to produce music perceived as more pleasant than intended, whereas open-source systems tend to produce music perceived as less pleasant than intended. Emotions are conveyed more accurately under high-arousal conditions across all models. Additionally, all systems exhibit a bias toward emotional neutrality, highlighting a key limitation in affective controllability. This benchmark offers valuable insights into model-specific emotion rendering characteristics and supports the future development of emotionally aligned TTM systems.


Method

The AImoclips benchmark was created and evaluated in a multi-step process. First, 12 emotion intents were selected from the valence-arousal space, excluding intermediate values to ensure clear emotional separation. Six state-of-the-art text-to-music (TTM) systems, including both open-source and commercial models, were then used to generate 991 ten-second music clips from these emotion intents. Finally, an online survey was conducted with 111 participants, who rated the valence and arousal of each clip on a 9-point Likert scale, providing a robust human evaluation of the emotional conveyance of the generated music. Images illustrating the concepts of valence and arousal were displayed alongside the survey questions to help participants understand these dimensions.
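
The ground-truth valence and arousal scores of the 12 emotion intents appear in the Examples section below. As a minimal illustration (not the benchmark's actual code), the sketch below groups those intents into the four quadrants of the valence-arousal space, assuming the midpoint of the 9-point scale (5) as the neutral boundary:

    # Sketch: grouping the 12 emotion intents into valence-arousal quadrants.
    # The coordinates are the ground-truth intent scores listed in the Examples
    # table; splitting at the scale midpoint (5 on the 1-9 scale) is an
    # assumption made here for illustration.

    INTENTS = {
        "happy":   (8.47, 6.05), "excited":  (8.11, 6.43), "energetic": (7.57, 6.10),
        "relaxed": (7.25, 2.49), "tranquil": (7.11, 2.61), "calm":      (6.89, 1.67),
        "anxious": (3.80, 6.20), "scared":   (2.80, 6.10), "angry":     (2.53, 6.20),
        "gloomy":  (3.15, 3.32), "dull":     (3.40, 1.67), "sad":       (2.10, 3.49),
    }

    MIDPOINT = 5.0  # assumed neutral point of the 9-point Likert scale

    def quadrant(valence: float, arousal: float) -> str:
        """Map a (valence, arousal) pair to one of the four quadrants."""
        v = "high-valence" if valence >= MIDPOINT else "low-valence"
        a = "high-arousal" if arousal >= MIDPOINT else "low-arousal"
        return f"{v}/{a}"

    if __name__ == "__main__":
        for name, (v, a) in INTENTS.items():
            print(f"{name:10s} -> {quadrant(v, a)}")

Under this split, each quadrant contains three of the twelve intents, so the four regions of the valence-arousal space are equally represented.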


Results

Our analysis reveals significant differences in emotional conveyance across text-to-music models. Commercial models generally skewed toward higher valence, producing music perceived as more positive than intended, whereas open-source models skewed toward lower valence, producing music perceived as more negative than intended. High-arousal emotions were conveyed more accurately across all systems. A consistent finding was a systematic bias toward emotional neutrality, indicating that current TTM models have difficulty rendering strong, unambiguous emotions. These results underscore the need for improved affective control in future text-to-music generation systems.

Figure: Mean valence and arousal deviations for each TTM system, averaging (clip rating - corresponding emotion-intent score) across all emotion intents.
Figure: Mean valence and arousal deviations for each valence-arousal quadrant, averaging (clip rating - corresponding emotion-intent score) across all TTM systems.
Figure: Valence-arousal quadrant distributions for each TTM system. Stars show mean ratings per quadrant, 'X' marks show the ground-truth scores of the emotion intents, and ellipses indicate 95% confidence regions.
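
The deviation statistic described in the captions above can be made concrete with a short sketch. This is not the authors' analysis code; the Rating record and its field names are hypothetical, and only the formula (clip rating minus emotion-intent ground truth, averaged per system) follows the captions:

    # Sketch: mean valence/arousal deviation per TTM system, i.e. the average of
    # (clip rating - emotion-intent ground truth), as in the figure captions.
    # The Rating record and its fields are hypothetical placeholders.
    from dataclasses import dataclass
    from collections import defaultdict

    @dataclass
    class Rating:
        system: str        # TTM system that generated the clip
        valence: float     # participant's valence rating (1-9)
        arousal: float     # participant's arousal rating (1-9)
        gt_valence: float  # ground-truth valence of the emotion intent
        gt_arousal: float  # ground-truth arousal of the emotion intent

    def mean_deviation_per_system(ratings):
        """Return {system: (mean valence deviation, mean arousal deviation)}."""
        sums = defaultdict(lambda: [0.0, 0.0, 0])
        for r in ratings:
            acc = sums[r.system]
            acc[0] += r.valence - r.gt_valence
            acc[1] += r.arousal - r.gt_arousal
            acc[2] += 1
        return {s: (dv / n, da / n) for s, (dv, da, n) in sums.items()}

    # Example, using the Suno "calm" row from the Examples table:
    # mean_deviation_per_system([Rating("Suno", 6.0, 2.67, 6.89, 1.67)])
    # -> approximately {"Suno": (-0.89, 1.00)}

The same per-quadrant averages follow by grouping ratings on the quadrant of their emotion intent instead of the generating system.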

Examples

TTM System        | Emotion Intent | Valence GT | Valence Rated | Arousal GT | Arousal Rated
AudioLDM 2        | angry          | 2.53       | 4.4           | 6.2        | 5.6
AudioLDM 2        | gloomy         | 3.15       | 3.83          | 3.32       | 3.0
AudioLDM 2        | happy          | 8.47       | 6.8           | 6.05       | 5.8
AudioLDM 2        | relaxed        | 7.25       | 6.0           | 2.49       | 3.17
MusicGen          | angry          | 2.53       | 4.2           | 6.2        | 5.4
MusicGen          | gloomy         | 3.15       | 5.33          | 3.32       | 4.67
MusicGen          | happy          | 8.47       | 6.5           | 6.05       | 5.83
MusicGen          | tranquil       | 7.11       | 4.2           | 2.61       | 3.6
Mustango          | anxious        | 3.8        | 4.43          | 6.2        | 3.71
Mustango          | calm           | 6.89       | 5.2           | 1.67       | 4.6
Mustango          | energetic      | 7.57       | 4.67          | 6.1        | 6.17
Mustango          | sad            | 2.1        | 4.29          | 3.49       | 2.71
Stable Audio Open | dull           | 3.4        | 7.17          | 1.67       | 4.83
Stable Audio Open | excited        | 8.11       | 6.13          | 6.43       | 5.63
Stable Audio Open | relaxed        | 7.25       | 4.0           | 2.49       | 2.67
Stable Audio Open | scared         | 2.8        | 4.14          | 6.1        | 5.57
Suno              | calm           | 6.89       | 6.0           | 1.67       | 2.67
Suno              | dull           | 3.4        | 6.57          | 1.67       | 2.57
Suno              | energetic      | 7.57       | 4.57          | 6.1        | 5.43
Suno              | scared         | 2.8        | 7.4           | 6.1        | 6.6
Udio              | anxious        | 3.8        | 5.0           | 6.2        | 4.6
Udio              | excited        | 8.11       | 7.0           | 6.43       | 6.71
Udio              | sad            | 2.1        | 5.0           | 3.49       | 3.25
Udio              | tranquil       | 7.11       | 5.0           | 2.61       | 2.63

BibTeX