Integration of nonparametric fuzzy classification with an evolutionary-developmental framework to perform music sentiment-based analysis and composition
Abstract
Over the past years, several approaches have been developed to create algorithmic music composers. Most existing solutions focus on composing music that appears theoretically correct or interesting to the listener. However, few methods have targeted sentiment-based music composition: generating music that expresses human emotions. The few existing methods are restricted in the spectrum of emotions they can express (usually to two dimensions: valence and arousal) as well as in the level of sophistication of the music they compose (usually monophonic, following translation-based, predefined templates or heuristic textures). In this paper, we introduce a new algorithmic framework for autonomous music sentiment-based expression and composition, titled MUSEC, which perceives an extensible set of six primary human emotions (namely anger, fear, joy, love, sadness, and surprise) expressed by a MIDI musical file and then composes (creates) new polyphonic, (pseudo-)thematic, and diversified musical pieces that express these emotions. Unlike existing solutions, MUSEC is: (i) a hybrid crossover between supervised learning (SL, to learn sentiments from music) and evolutionary computation (for music composition, MC), where SL serves as the fitness function of MC to compose music that expresses target sentiments; (ii) extensible in the panel of emotions it can convey, producing pieces that reflect a target crisp sentiment (e.g., love) or a collection of fuzzy sentiments (e.g., 65% happy, 20% sad, and 15% angry), compared with the crisp-only or two-dimensional (valence/arousal) sentiment models used in existing solutions; and (iii) based on the evolutionary-developmental model, using an extensive set of specially designed music-theoretic mutation operators (trill, staccato, repeat, compress, etc.), stochastically orchestrated to add atomic (individual chord-level) and thematic (chord pattern-level) variability to the composed polyphonic pieces, compared with traditional evolutionary solutions that produce monophonic and non-thematic music. We conducted a large battery of tests to evaluate MUSEC’s effectiveness and efficiency in both sentiment analysis and composition. It was trained on a specially constructed set of 120 MIDI pieces, including 70 sentiment-annotated pieces: the first significant dataset of sentiment-labeled MIDI music made available online as a benchmark for future research in this area. Results are encouraging and highlight the potential of our approach in different application domains, ranging over music information retrieval, music composition, assistive music therapy, and emotional intelligence.
The simplest form of musical texture, in which only one note is played at a time, in contrast with polyphonic music, where more than one note is played simultaneously.
A fuzzy classifier is a classifier which assigns membership scores to input data objects, producing fuzzy categories with fuzzy boundaries, such that an object, e.g., a musical piece, can belong to more than one category at the same time (e.g., 80% excitement and 20% fear), in contrast with traditional crisp classifiers which categorize data into crisp/distinct categories (Kotsiantis 2007). In our current system, we utilize fuzzy k-NN due to its flexibility and effectiveness, yet any other fuzzy classifier could be used, e.g., (Abu 2017; Abu et al. 2016; Amin et al. 2018; Fahmi et al. 2017, 2018, 2019).
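To make the fuzzy membership idea concrete, the following is a minimal sketch of a generic fuzzy k-NN classifier in the style of Keller et al., assuming numeric feature vectors and a training membership matrix; the feature values, distance metric, and parameters shown are illustrative and do not reflect MUSEC's actual configuration.

```python
import numpy as np

def fuzzy_knn_memberships(query, train_X, train_memberships, k=5, m=2.0):
    """Generic fuzzy k-NN sketch: weight the k nearest neighbours' class
    memberships by inverse distance (fuzzifier m) and return a fuzzy
    membership vector for the query object."""
    dists = np.linalg.norm(train_X - query, axis=1)
    nn = np.argsort(dists)[:k]                           # k nearest neighbours
    w = 1.0 / (dists[nn] ** (2.0 / (m - 1.0)) + 1e-12)   # inverse-distance weights
    u = (w[:, None] * train_memberships[nn]).sum(axis=0) / w.sum()
    return u  # sums to 1 when the training membership rows do

# Toy example: 2-D feature vectors with fuzzy labels over (excitement, fear)
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
U = np.array([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]])
print(fuzzy_knn_memberships(np.array([0.7, 0.3]), X, U, k=3))
```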
A fundamental frequency is the lowest frequency produced by the oscillation of an object. In music, it is perceived as the lowest partial (simple tone) present, as distinct from the higher-frequency harmonics. In the remainder of this paper, the terms frequency and fundamental frequency are used interchangeably, unless explicitly stated otherwise.
Supervised learning is a machine learning approach which learns a function that maps an input (e.g., a musical piece) to an output (e.g., a sentiment category or sentiment score) based on sample input–output pairs, so-called labeled training data, where each sample pair consists of a given input object (e.g., a music feature vector) and a desired output value (e.g., a sentiment category or a sentiment score). The produced mapping function is an approximation of the true mapping function underlying the sample training pairs (Kotsiantis 2007).
An evolutionary algorithm can be defined as a population-based metaheuristic optimization algorithm, which uses mechanisms inspired by biological evolution, such as reproduction, mutation, crossover, and selection. Candidate solutions to the optimization problem play the role of individuals in a population, and the fitness function determines the quality of the solutions. The evolution of the population then takes place through the repeated application of the above operators (Goldberg 1989; Whitley and Sutton 2012).
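The following is a minimal, generic sketch of such an evolutionary loop, assuming user-supplied fitness, mutation, and crossover callables; it illustrates the general mechanism only and does not reproduce MUSEC's selection or trimming strategies.

```python
import random

def evolve(init_population, fitness, mutate, crossover,
           generations=50, population_size=50):
    """Generic evolutionary loop: rank individuals by fitness, select parents,
    recombine and mutate them, and repeat for a fixed number of generations."""
    population = list(init_population)
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[:population_size // 2]   # truncation selection
        offspring = []
        while len(offspring) < population_size:
            a, b = random.sample(parents, 2)          # pick two distinct parents
            offspring.append(mutate(crossover(a, b))) # recombine, then mutate
        population = offspring
    return max(population, key=fitness)               # best individual found
```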
To determine the dominant key, a chroma histogram for the input music file is first computed, denoting the percentage of total piece duration in which every chroma can be heard. The histogram is then used to compute likelihood scores using Temperley’s key profiles, following a Bayesian approach (Temperley 2002). The key with the highest score is finally selected as the dominant key (Temperley 2002).
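As an illustration of this key-finding procedure, the sketch below computes a duration-weighted chroma histogram and scores it against the 24 rotated major/minor key profiles; the dot-product score is a simplified stand-in for Temperley's Bayesian likelihood, and the profile values are assumed to be supplied from (Temperley 2002).

```python
import numpy as np

def dominant_key(note_pitches, note_durations, major_profile, minor_profile):
    """Chroma-histogram key finding (simplified): weight each pitch class by
    the total time it sounds, then score the histogram against the 24 rotated
    major/minor key profiles and return the best-scoring key."""
    chroma = np.zeros(12)
    for pitch, dur in zip(note_pitches, note_durations):
        chroma[pitch % 12] += dur            # accumulate duration per pitch class
    chroma /= chroma.sum()                   # fraction of total duration per chroma

    best_key, best_score = None, -np.inf
    for tonic in range(12):
        for mode, profile in (("major", major_profile), ("minor", minor_profile)):
            score = float(np.dot(chroma, np.roll(profile, tonic)))
            if score > best_score:
                best_key, best_score = (tonic, mode), score
    return best_key                          # e.g., (0, "major") for C major
```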
Dominant key misidentification can occasionally occur, particularly for pieces in which modulations occur very frequently and for atonal music (e.g., modern music which does not abide by a fixed key) (Temperley 2002; Kyogu 2008).
Note that 100% accuracy in chord progression identification is difficult to obtain due to the very nature of chord progressions, where: (i) the same chord progression can be played in many different ways while still portraying the same musical structure, and (ii) it is often difficult to separate consecutive chords, since notes are sometimes shared between them. Our heuristic performs accurately on relatively simple music with a clear chord structure and a clear separation between chords, with no rapid transitions between them.
Consider two chord progression sequences A and B, consisting of chords A1, A2, …, Am and B1, B2, …, Bn, respectively. Without loss of generality, consider the case where m < n. Following the standard TPSD algorithm in Ayadi et al. (2016), the shorter sequence is compared with the longer one at every position, e.g., A1, …, Am versus B1, …, Bm, then A1, …, Am versus B2, …, Bm+1, and so forth until A1, …, Am versus Bn−m+1, …, Bn. The comparison yielding the smallest difference is then selected as the final similarity (or distance) value. With the more efficient version of the TPSD algorithm in Bas De Haas et al. (2013), the chord progression sequences are only compared from their starting positions, i.e., A1, …, Am is only compared with B1, …, Bm, and that score is utilized as the chord progression similarity (distance) score. Despite this linear relaxation of the original algorithm, TPSD computation remains more expensive than all other feature similarity computations put together (cf. experiments in Sect. 6.2.2).
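The two comparison schemes can be sketched as follows, assuming a placeholder chord_dist callable standing in for the actual tonal-pitch-space step distance between two chords used by TPSD.

```python
def progression_distance(A, B, chord_dist):
    """Sliding comparison (standard TPSD style): align the shorter progression
    with every offset of the longer one, average a per-chord distance at each
    offset, and keep the smallest value."""
    if len(A) > len(B):
        A, B = B, A
    m, n = len(A), len(B)
    best = float("inf")
    for offset in range(n - m + 1):
        d = sum(chord_dist(a, b) for a, b in zip(A, B[offset:offset + m])) / m
        best = min(best, d)
    return best

def progression_distance_linear(A, B, chord_dist):
    """Relaxed variant (Bas De Haas et al. 2013 style): compare the sequences
    from their starting positions only."""
    if len(A) > len(B):
        A, B = B, A
    return sum(chord_dist(a, b) for a, b in zip(A, B[:len(A)])) / len(A)
```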
Available online at: http://sigappfr.acm.org/Projects/MUSEC, SL survey form #1 (first part, 24 pieces), #2 (second part, 8 pieces), and #3 (third part, 8 pieces), along with the resulting sentiment-labeled dataset.
In our current implementation of MC, we hard-coded the chord probability distribution (through which a chord is selected) based on empirical sampling from our training set. Yet, learning the chord probability distribution can be a research project in and of itself, and can entail different composition styles. For instance, the distribution could be learned from a composer’s composition corpus, to produce pieces following the composer’s own style (which we further discuss as an ongoing work in Sect. 8).
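As a simplified illustration of such empirical sampling, the sketch below builds a chord probability distribution from a corpus of progressions and draws a chord from it; the function names and toy progressions are hypothetical, and MUSEC's actual hard-coded distribution is not derived this way at run time.

```python
import random
from collections import Counter

def build_chord_distribution(training_progressions):
    """Count how often each chord symbol appears across the corpus and
    normalise the counts into an empirical probability distribution."""
    counts = Counter(chord for prog in training_progressions for chord in prog)
    total = sum(counts.values())
    return {chord: c / total for chord, c in counts.items()}

def sample_chord(distribution):
    """Draw one chord according to the empirical probabilities."""
    chords, probs = zip(*distribution.items())
    return random.choices(chords, weights=probs, k=1)[0]

# Usage with toy progressions (chord names are illustrative only):
dist = build_chord_distribution([["C", "G", "Am", "F"], ["C", "F", "G", "C"]])
print(sample_chord(dist))
```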
Pearson correlation coefficient. Note that any other vector similarity measure (such as cosine or Dice) could have been used. We adopt PCC here since it is commonly utilized in the literature (Abbasi et al. 2008; O’Connor et al. 2010).
We consider this strategy to be similar to the way some human composers usually write music: producing multiple candidate (trial) pieces, slicing and mixing them up, developing them and making them evolve until reaching a final pool of best candidates, from which the single best candidate is usually adopted as the actual final piece.
We adopted a ratio R = 0.7 in our current study, so that 70% of the offspring would be subject to fitness trimming, whereas only 30% would undergo variability trimming.
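One possible reading of this split, assuming that "trimming" amounts to ranking a pool of offspring by a score and keeping only the top candidates, is sketched below; the signature, the random partitioning, and the proportional keep are illustrative, not MUSEC's exact procedure.

```python
import random

def trim_offspring(offspring, fitness, variability, keep, ratio=0.7):
    """Illustrative split trimming: a random R-fraction of the offspring is
    ranked by fitness and the remaining (1 - R) fraction by variability;
    the top candidates of each pool are kept, proportionally to R."""
    shuffled = random.sample(offspring, len(offspring))   # random partition
    cut = int(ratio * len(shuffled))
    by_fitness = sorted(shuffled[:cut], key=fitness, reverse=True)
    by_variability = sorted(shuffled[cut:], key=variability, reverse=True)
    k_fit = int(ratio * keep)
    return by_fitness[:k_fit] + by_variability[:keep - k_fit]
```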
Note that the number of beats in a piece is naturally smaller than the number of notes. While there is no straightforward relationship between the two, they can be paralleled to sentences and words in flat text: beats represent music sentences, and notes represent the sentences’ words. In our sample test dataset of 100 pieces, the number of beats was on average 4 to 8 times smaller than the number of notes.
PCC = σxy/(σx × σy), where x and y designate the user- and system-generated similarity values, respectively, σx and σy denote the standard deviations of x and y, respectively, and σxy denotes the covariance between the x and y variables. PCC ∈ [− 1, 1], such that: − 1 designates that one of the variables is a decreasing function of the other variable (i.e., music pieces deemed similar by human testers are deemed dissimilar by the system, and vice versa), 1 designates that one of the variables is an increasing function of the other variable (i.e., pieces are deemed similar/dissimilar by human testers and the system alike), and 0 means that the variables are not correlated.
MSE, computed as an average Euclidean distance measure, is a good indication of how close similarity scores are to human ratings, one by one (for every pair of pieces), whereas PCC compares the behavior of the vector of similarity ratings (for all pairs of pieces) as a whole.
While we could have asked the testers to provide a confidence score associated with every sentiment score, we felt this would complicate things for non-expert testers, especially since our objective was to capture their inherent feelings when listening to the music pieces, rather than have them “rationalize” their ratings by adding confidence scores. Nonetheless, considering tester rating confidence is an interesting factor that we plan to evaluate in a future study.
With the 100-piece training set, the system had “less” to learn since it was trained on a more or less homogeneous training set, and thus over-fitted w.r.t. the well-represented sentiments, namely joy and sadness, but was less successful in inferring less-represented sentiments like anger and fear.
To help illustrate this concept, let us consider the following example, consisting of three vectors: V1 = (0.8, 0.6), V2 = (0.95, 0.45), and V3 = (0.65, 0.75). Let V1 be our target vector and let V2 and V3 be our system estimate vectors. Upon first inspection, it is obvious that V2 is a better representative of V1 than V3, since it more or less exhibits the same behavior as V1 (higher first term). This similarity in behavior is visible through PCC, where PCC(V1, V2) = 1 and PCC(V1, V3) = − 1. However, with MSE, we obtain MSE(V1, V2) = MSE(V1, V3) = 0.0225. This shows that MSE is only a good indication of how close scores are to target sentiments one by one, while PCC reflects the overall similarity of a predicted sentiment vector to the target vector as a whole.
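This example can be verified in a few lines of Python (numpy's corrcoef implements PCC):

```python
import numpy as np

V1 = np.array([0.8, 0.6])    # target sentiment vector
V2 = np.array([0.95, 0.45])  # estimate with the same behavior as V1
V3 = np.array([0.65, 0.75])  # estimate with the opposite behavior

def mse(a, b):
    return float(np.mean((a - b) ** 2))

print(np.corrcoef(V1, V2)[0, 1], np.corrcoef(V1, V3)[0, 1])  # 1.0 and -1.0
print(mse(V1, V2), mse(V1, V3))                              # 0.0225 and 0.0225
```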
The Turing test was proposed by Alan Turing in 1950, designed to test the ability of a machine to exhibit intelligent behavior that is equivalent to or indistinguishable from that of a human. It was originally used to evaluate machines mimicking human conversation (originally referred to as the “imitation game”). A machine passes the Turing test if, after a number of questions, the human tester (asking questions) cannot know if the answers come from a human or a machine (Epstein et al. 2009).
Anthony Bou Fayad is a professional composer, pianist, and music instructor at the Antonine University’s School of Music, located in Baabda, Mont Lebanon. He also holds a Master’s in Computer Engineering, specializing in multimedia data processing, which allowed him to easily understand the context and purpose of our study and helped us set up the experimental process. Mr. Bou Fayad was partly remunerated for his efforts, mainly for playing and digitally recording all pieces, while volunteering his consulting services.
We used a population size S = 50, a generation size N varying between 50 and 80, a branching factor B = 10, and a fitness-to-variability ratio R = 0.7. All mutation probabilities were set to 0.1.
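For reference, these settings can be summarized in a small configuration sketch; the field names are illustrative and are not MUSEC's actual identifiers.

```python
from dataclasses import dataclass

@dataclass
class EvolutionConfig:
    """Hyper-parameters reported for the composition experiments."""
    population_size: int = 50             # S
    generations: int = 50                 # N, varied between 50 and 80
    branching_factor: int = 10            # B
    fitness_to_variability: float = 0.7   # R
    mutation_probability: float = 0.1     # applied to every mutation operator
```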
Recall that states where both valence and arousal dimensions converge (e.g., both valence and arousal are high, or both are low) occur more often than states where they diverge, indicating a potential bias or ambiguity in the model (as stated by the model’s creator in Russell 1980).