AImoclips: A Benchmark for Evaluating Emotion Conveyance in Text-to-Music Generation
Gyehun Go, Satbyul Han, Ahyeon Choi, Eunjin Choi, Juhan Nam and Jeong Mi Park
Abstract
Recent advances in text-to-music (TTM) generation have enabled controllable and expressive music creation using natural language prompts. However, the emotional fidelity of TTM systems remains largely underexplored compared to aspects such as human preference or text alignment. In this study, we introduce AImoclips, a benchmark for evaluating how well TTM systems convey intended emotions to human listeners, covering both open-source and commercial models. We selected 12 emotion intents spanning the four quadrants of the valence-arousal space and used six state-of-the-art TTM systems to generate over 1,000 music clips. A total of 111 participants rated the perceived valence and arousal of each clip on a 9-point Likert scale. Our results show that commercial systems tend to produce music perceived as more pleasant than intended, while open-source systems tend toward the opposite. Emotions are conveyed more accurately under high-arousal conditions across all models. Additionally, all systems exhibit a bias toward emotional neutrality, highlighting a key limitation in affective controllability. This benchmark offers valuable insights into model-specific emotion rendering characteristics and supports the future development of emotionally aligned TTM systems.
Method
The method involved a multi-step process to create and evaluate the AImoclips benchmark. First, 12 emotion intents were selected from the valence-arousal space, excluding intermediate values to ensure clear emotional separation. Six state-of-the-art text-to-music (TTM) systems, including both open-source and commercial models, were then used to generate 991 ten-second music clips from these emotion intents. Finally, an online survey was conducted with 111 participants to collect valence and arousal ratings for each clip on a 9-point Likert scale, providing a robust human evaluation of the emotional conveyance of the generated music. The images shown below were displayed alongside the survey questions to help participants understand the concepts of valence and arousal.


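For illustration, the sketch below shows one way the 12 emotion intents could be grouped by valence-arousal quadrant and how per-clip listener ratings might be aggregated per system and intent. The quadrant assignments follow the conventional circumplex reading of these emotion labels, and the column names are assumptions for this sketch rather than details taken from the study.

```python
import pandas as pd

# Assumed quadrant grouping of the 12 emotion intents (circumplex-style reading,
# not taken verbatim from the study).
EMOTION_QUADRANTS = {
    "happy": "high-arousal / positive",  "excited": "high-arousal / positive",
    "energetic": "high-arousal / positive",
    "angry": "high-arousal / negative",  "anxious": "high-arousal / negative",
    "scared": "high-arousal / negative",
    "sad": "low-arousal / negative",     "gloomy": "low-arousal / negative",
    "dull": "low-arousal / negative",
    "relaxed": "low-arousal / positive", "calm": "low-arousal / positive",
    "tranquil": "low-arousal / positive",
}

def summarize_ratings(ratings: pd.DataFrame) -> pd.DataFrame:
    """Aggregate 9-point Likert ratings per TTM system and emotion intent.

    Expects hypothetical columns: 'system', 'emotion', 'valence', 'arousal'
    (one row per participant rating of one clip).
    """
    ratings = ratings.assign(quadrant=ratings["emotion"].map(EMOTION_QUADRANTS))
    return (
        ratings
        .groupby(["system", "quadrant", "emotion"], as_index=False)[["valence", "arousal"]]
        .mean()
        .rename(columns={"valence": "mean_valence", "arousal": "mean_arousal"})
    )
```

Grouping by quadrant as well as by intent makes it straightforward to compare, for example, high-arousal against low-arousal conditions for each system.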
Results
Our analysis reveals significant differences in emotional conveyance across the evaluated text-to-music models. Commercial models generally skewed toward higher valence, producing music that was perceived as more positive than intended, whereas open-source models tended to do the opposite. We observed that high-arousal emotions were conveyed more accurately across all systems. A consistent finding was a systematic bias toward emotional neutrality, indicating that current TTM models have difficulty rendering strong, unambiguous emotions. These results underscore the need for improved affective control in future text-to-music generation systems.
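As a concrete reading of these findings, the following sketch computes two per-clip quantities that correspond to the reported effects: a signed valence error (positive when a clip is perceived as more pleasant than intended) and a neutrality bias (how far a rating is pulled toward the midpoint of the 9-point scale relative to the intended value). The data layout and column names are assumptions for illustration, not the study's analysis code.

```python
import numpy as np
import pandas as pd

SCALE_MIDPOINT = 5  # midpoint of the 9-point Likert scale

def emotion_errors(df: pd.DataFrame) -> pd.DataFrame:
    """Add error metrics, assuming hypothetical columns
    'intended_valence' and 'rated_valence' on the 9-point scale."""
    out = df.copy()
    # Positive values: the clip was perceived as more pleasant than intended.
    out["valence_error"] = out["rated_valence"] - out["intended_valence"]
    # Neutrality bias: positive values mean the rating sits closer to the
    # scale midpoint than the intended value, i.e. it was pulled toward neutral.
    out["neutrality_bias"] = (
        np.abs(out["intended_valence"] - SCALE_MIDPOINT)
        - np.abs(out["rated_valence"] - SCALE_MIDPOINT)
    )
    return out
```

Averaging these quantities per system would distinguish, for instance, a commercial model's positive mean valence error from an open-source model's negative one.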
Examples
| Music Source | TTM System | Emotion Intent | Valence (Ground Truth vs Rated) | Arousal (Ground Truth vs Rated) |
|---|---|---|---|---|
| | AudioLDM 2 | angry | | |
| | AudioLDM 2 | gloomy | | |
| | AudioLDM 2 | happy | | |
| | AudioLDM 2 | relaxed | | |
| | MusicGen | angry | | |
| | MusicGen | gloomy | | |
| | MusicGen | happy | | |
| | MusicGen | tranquil | | |
| | Mustango | anxious | | |
| | Mustango | calm | | |
| | Mustango | energetic | | |
| | Mustango | sad | | |
| | Stable Audio Open | dull | | |
| | Stable Audio Open | excited | | |
| | Stable Audio Open | relaxed | | |
| | Stable Audio Open | scared | | |
| | Suno | calm | | |
| | Suno | dull | | |
| | Suno | energetic | | |
| | Suno | scared | | |
| | Udio | anxious | | |
| | Udio | excited | | |
| | Udio | sad | | |
| | Udio | tranquil | | |