While there are many symbolic music (MIDI) generators available, very few of them can be conditioned on high-level features such as emotions. Existing emotion-labeled MIDI datasets are limited, containing only a few thousand samples.

In this project, I first created an emotion-based MIDI dataset that is two orders of magnitude larger than the existing ones. The emotion labels are continuous valence-arousal values, enabling fine-grained conditioning. Then, I built multiple architectures for emotion-based symbolic music generation. To the best of my knowledge, this is the only transformer-based generator that can use continuous-valued conditions while processing discrete tokens.

The paper is available on arXiv (Sulun et al., 2022). The source code is available on GitHub. Below, I present some output samples.

Supplementary material

Constant conditioning

In the table below, the left, middle, and right columns contain samples generated with negative (unpleasant), neutral, and positive (pleasant) valence condition values, respectively. Similarly, the top, middle, and bottom rows contain samples generated with positive (excited), neutral, and negative (calm) arousal condition values, respectively. Each cell contains samples generated by the three different models, named discrete-token (DT), continuous-token (CT), and continuous-concatenated (CC). Note that every sample is the first random sample generated with its configuration, so none of them are cherry-picked.

Arousal \ Valence	Negative	Neutral	Positive
Positive	DT: CT: CC:	DT: CT: CC:	DT: CT: CC:
Neutral	DT: CT: CC:	DT: CT: CC:	DT: CT: CC:
Negative	DT: CT: CC:	DT: CT: CC:	DT: CT: CC:
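
The model names hint at where the condition enters the transformer. The sketch below is one reading of the names "continuous-token" and "continuous-concatenated", not the implementation from the paper: it assumes CT turns the (valence, arousal) pair into one extra vector placed at the start of the sequence, and CC appends a projected condition to every token's feature vector. The functions `embed_token`, `project_condition`, `prepend_condition`, and `concatenate_condition`, as well as the sizes `D_MODEL` and `d_cond`, are hypothetical, and the "learned" weights are replaced by fixed pseudo-random numbers.

```python
import random

D_MODEL = 4  # hypothetical embedding size, kept tiny for readability


def embed_token(token_id, d_model=D_MODEL):
    """Stand-in for a learned embedding table: one fixed vector per token id."""
    rng = random.Random(token_id)
    return [rng.uniform(-1.0, 1.0) for _ in range(d_model)]


def project_condition(valence, arousal, d_out):
    """Stand-in for a learned linear layer mapping (valence, arousal) to d_out features."""
    rng = random.Random(12345)
    weights = [(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)) for _ in range(d_out)]
    return [w_v * valence + w_a * arousal for w_v, w_a in weights]


def prepend_condition(token_ids, valence, arousal):
    """Assumed CT style: the condition becomes one extra vector at the sequence start."""
    condition = project_condition(valence, arousal, D_MODEL)
    return [condition] + [embed_token(t) for t in token_ids]


def concatenate_condition(token_ids, valence, arousal, d_cond=2):
    """Assumed CC style: the condition is appended to every token's features."""
    condition = project_condition(valence, arousal, d_cond)
    return [embed_token(t) + condition for t in token_ids]
```

Under these assumptions, the first variant lengthens the sequence by one position and keeps the feature size, while the second keeps the sequence length and widens every vector by `d_cond`. In both cases the condition stays a real-valued vector and is never mapped to a discrete token.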

Dynamic conditioning

I also present samples generated using dynamic conditioning, where the condition values change over time. I used the continuous-token (CT) and continuous-concatenated (CC) models, since only they allow dynamic conditioning. Unlike the samples presented above, these samples are cherry-picked.
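
As a rough illustration of a time-varying condition, the sketch below builds a schedule of (valence, arousal) pairs, one per generation step. This page does not say how the values were varied, so the linear ramp, the assumed value range of -1 to 1, and the helper names `linear_schedule` and `dynamic_conditions` are all assumptions made for illustration.

```python
def linear_schedule(start, end, n_steps):
    """Linearly interpolate a condition value from start to end over n_steps."""
    if n_steps < 2:
        return [float(start)] * n_steps
    step = (end - start) / (n_steps - 1)
    return [start + i * step for i in range(n_steps)]


def dynamic_conditions(valence_range, arousal_range, n_steps):
    """Pair up a valence ramp and an arousal ramp, one (valence, arousal) per step."""
    valence = linear_schedule(valence_range[0], valence_range[1], n_steps)
    arousal = linear_schedule(arousal_range[0], arousal_range[1], n_steps)
    return list(zip(valence, arousal))


# Hypothetical example: increasing valence with decreasing arousal over 5 steps.
schedule = dynamic_conditions((-1.0, 1.0), (1.0, -1.0), 5)
```

Swapping the endpoints of either range gives the other three trajectories listed below.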

Increasing valence, increasing arousal

CT:

CC:

Decreasing valence, decreasing arousal

CT:

CC:

Increasing valence, decreasing arousal

CT:

CC:

Decreasing valence, increasing arousal

CT:

CC:

Cherry-picked samples

Here I present cherry-picked samples generated for four basic emotions: happy, relaxed, sad, and angry.
These emotions occupy the four quadrants of the valence-arousal plane, as shown below:

Emotions

ANGRY DT: CT: CC:	HAPPY DT: CT: CC:
SAD DT: CT: CC:	RELAXED DT: CT: CC:
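
The quadrant layout above can be expressed as a small lookup, sketched here under the assumption that zero is the neutral point on both axes and that values exactly at zero fall on the positive side. The function name `basic_emotion` is hypothetical.

```python
def basic_emotion(valence, arousal):
    """Map a (valence, arousal) pair to the basic emotion of its quadrant."""
    if valence >= 0:
        # Pleasant half of the plane: excited is happy, calm is relaxed.
        return "happy" if arousal >= 0 else "relaxed"
    # Unpleasant half of the plane: excited is angry, calm is sad.
    return "angry" if arousal >= 0 else "sad"
```

For example, a pleasant but calm condition such as `basic_emotion(0.7, -0.6)` lands in the "relaxed" quadrant.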

Related lightning talk at EPIA 2023:

In this paper we present a new approach for the generation of multi-instrument symbolic music driven by musical emotion. The principal novelty of our approach centres on conditioning a state-of-the-art transformer based on continuous-valued valence and arousal labels. In addition, we provide a new large-scale dataset of symbolic music paired with emotion labels in terms of valence and arousal. We evaluate our approach in a quantitative manner in two ways, first by measuring its note prediction accuracy, and second via a regression task in the valence-arousal plane. Our results demonstrate that our proposed approaches outperform conditioning using control tokens which is representative of the current state of the art.

Emotion-based Symbolic Music Generation