In 2007, Shane Legg and Marcus Hutter published a paper that attempted something audacious: a single mathematical formula for intelligence. Their Universal Intelligence measure defined intelligence as an agent’s ability to achieve goals across a wide range of environments, weighted by the complexity of each environment. One number. Theoretically grounded. Applicable to any system — human, animal, or machine.

The formula worked because intelligence, however contested its definition, has a relatively clear objective function: goal achievement in environments. You can formalise environments. You can measure outcomes. You can weight environments by their Kolmogorov complexity, so that simpler environments count for more.

Music has no such luck.

There is no universally accepted metric for the musicality of a song — no formula that takes a piece of music as input and outputs a number representing its quality, beauty, or power. But the quest to build one has a surprisingly long history, and the components that might eventually make it possible are more advanced than most musicians realise.

1. The First Attempt: Birkhoff’s Aesthetic Measure (1933)

The earliest serious attempt to mathematise aesthetic quality came from George David Birkhoff, a Harvard mathematician who spent a year travelling the world studying art, music, and poetry before publishing Aesthetic Measure in 1933.

His formula was elegant:

M = O / C

Where M is aesthetic measure, O is aesthetic order (symmetry, harmony, pattern), and C is complexity (the number of distinct elements). Birkhoff applied this to vases, polygons, melodies, and English poetry.

The insight is still relevant: beauty arises from the ratio of order to complexity. A melody that is perfectly ordered but simple (a single repeated note) scores low on both O and C — trivial. A melody that is wildly complex but disordered (random notes) has high C but low O — chaotic. The “best” music, in Birkhoff’s framework, maximises order relative to complexity.
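As a toy illustration: Birkhoff's actual scoring rules were hand-built per domain, so the definitions of order and complexity below are stand-ins invented purely for this sketch.

```python
from collections import Counter

def aesthetic_measure(notes):
    """Crude M = O / C for a sequence of note names (illustrative only)."""
    complexity = len(set(notes))                   # C: number of distinct elements
    bigrams = Counter(zip(notes, notes[1:]))       # two-note patterns
    order = sum(c - 1 for c in bigrams.values())   # O: how often patterns recur
    return order / complexity if complexity else 0.0

print(aesthetic_measure(["C", "D", "E", "C", "D", "E", "F"]))  # patterned: 0.5
print(aesthetic_measure(["C", "F#", "A", "Eb", "B", "G#"]))    # patternless: 0.0
```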

The limitation is equally clear. Birkhoff’s O is hand-defined for each domain — he specified what “order” means for polygons (symmetry, equilibrium) and for melodies (repetition, tonal resolution) separately. There is no universal definition of order that works across all musical traditions. What counts as “ordered” in a Bach fugue is fundamentally different from what counts as “ordered” in a Coltrane solo or a raga alaap.

Still, the core idea — that aesthetic value lives in the relationship between pattern and complexity — has proven durable. Nearly every subsequent attempt to measure musicality is, in some sense, a refinement of Birkhoff.

2. The Wundt Curve: Optimal Complexity and the Inverted U

In the 1960s, psychologist Daniel Berlyne formalised what many musicians intuit: there is a sweet spot of complexity where enjoyment peaks. Too simple, and the music is boring. Too complex, and it is overwhelming. The relationship between complexity and pleasure follows an inverted U-shape — the Wundt curve, named after Wilhelm Wundt, who first described the pattern in the 19th century.

This has been rigorously validated in music. Research using computational definitions of complexity and entropy as independent variables has found reliable evidence of the Wundt effect in aesthetic musical judgments. Preference is maximised at moderate levels of complexity. Stimuli that are too simple lead to boredom and habituation, while stimuli that are excessively complex lead to cognitive overload and negative affect.

The Wundt curve tells us something important: musicality is not about maximising any single dimension. It is about calibrating complexity to the listener. And here is the catch — the curve shifts. Repeated listening increases tolerance for complexity. Expert listeners have a curve shifted to the right compared to novices. A jazz musician and a pop listener have different sweet spots. The “optimal complexity” is not a property of the music alone — it is a property of the music-listener system.

This is the first hint of why a universal metric is hard. The Wundt curve is real, but it has a free parameter: the listener.
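One minimal way to encode the inverted U with its free parameter is a bell-shaped preference curve whose peak location belongs to the listener, not the music. The Gaussian form and the width value below are illustrative assumptions, not a fitted psychological model:

```python
import math

def wundt_preference(complexity, listener_optimum, width=1.0):
    """Inverted-U preference: peaks when a piece's complexity matches the
    listener's optimum. Shape and width are assumptions for illustration."""
    return math.exp(-((complexity - listener_optimum) ** 2) / (2 * width ** 2))

# The same piece (complexity 4.0 on an arbitrary scale) lands at different
# points on different listeners' curves.
piece_complexity = 4.0
print(wundt_preference(piece_complexity, listener_optimum=2.5))  # novice: past the peak
print(wundt_preference(piece_complexity, listener_optimum=4.0))  # expert: at the peak
```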

3. Information Theory Enters Music

The most rigorous modern approaches to measuring musicality come from information theory — the mathematical framework Claude Shannon developed in 1948 to quantify communication.

Shannon Entropy

Shannon entropy measures the average surprise in a sequence. Applied to music, it quantifies how unpredictable the note choices are. A melody that uses only two notes in a repeating pattern has low entropy. A melody that uses all twelve chromatic notes with equal probability has high entropy.
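Concretely, entropy can be estimated from a note sequence's empirical distribution. A minimal sketch (this zeroth-order estimate ignores sequential structure, which the IDyOM model discussed below addresses):

```python
import math
from collections import Counter

def shannon_entropy(sequence):
    """Average surprise in bits, from the empirical symbol distribution."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

two_note = ["C", "G"] * 8                       # repeating two-note pattern
chromatic = ["C", "C#", "D", "D#", "E", "F",
             "F#", "G", "G#", "A", "A#", "B"]   # all twelve notes, once each
print(shannon_entropy(two_note))    # 1.0 bit
print(shannon_entropy(chromatic))   # log2(12) ≈ 3.58 bits
```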

Researchers have computed Shannon entropy across thousands of musical works and found that it reliably distinguishes compositional forms. Bach chorales have low entropy — they are harmonically predictable. Toccatas and preludes have high entropy — they communicate more information through complexity. A 2024 study analysing 10,000 songs using a complexity-entropy causality plane showed that genres cluster in distinct regions of this space. You can identify a genre by its information signature.

But entropy alone is not musicality. Random noise has maximum entropy. The question is not “how much information?” but “how is the information structured?”

Kolmogorov Complexity

Kolmogorov complexity takes a different approach: it defines the complexity of a piece of information as the length of the shortest computer program that could reproduce it. A melody built on a simple repeating pattern has low Kolmogorov complexity — you can describe it in a few instructions. A melody with no discernible pattern has high Kolmogorov complexity — the only way to describe it is to reproduce it note for note.

Researchers have applied this to music using compression algorithms (specifically Lempel-Ziv compression, the basis of ZIP files) as a practical approximation. More compressible music is less complex. Studies of Irish traditional dance music, for instance, have used this method to distinguish more repetitive tunes from more varied ones.
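This approximation is easy to reproduce. The sketch below uses Python's zlib (DEFLATE, an LZ77 derivative) as a stand-in for the Lempel-Ziv compressors used in such studies:

```python
import random
import zlib

def compression_ratio(notes):
    """Approximate Kolmogorov complexity as compressed size over raw size.
    Lower ratio = more compressible = structurally simpler."""
    raw = "".join(notes).encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)

random.seed(0)
repetitive = ["C", "D", "E", "D"] * 32                    # strong repeating pattern
chaotic = [random.choice("ABCDEFG") for _ in range(128)]  # no discernible pattern
print(compression_ratio(repetitive))  # low: a short "program" reproduces it
print(compression_ratio(chaotic))     # high: nearly incompressible
```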

The Kolmogorov approach gives us a metric for structural richness, but again, complexity is not quality. A randomly generated MIDI file has very high Kolmogorov complexity. “Twinkle, Twinkle, Little Star” has very low Kolmogorov complexity. Neither fact tells us which one is better music.

4. IDyOM: The Closest Thing to a Universal Musicality Engine

Marcus Pearce’s Information Dynamics of Music (IDyOM) model, developed in his 2005 PhD thesis and refined through 2018, is arguably the most sophisticated computational model of musical perception that exists.

IDyOM treats music listening as a prediction problem. The model processes music event by event — note by note — and at each point generates a probability distribution over what the next note might be. It then measures two things:

Information Content (IC) — how surprising the actual next note is, given the predicted distribution. A note with high IC was unexpected. A note with low IC was predicted.

Entropy — how uncertain the model was before hearing the next note. High entropy means many notes were roughly equally likely. Low entropy means the model was fairly confident about what was coming.

The model’s architecture mirrors human listening. It maintains two sub-models: a Long-Term Model (LTM) trained on a large corpus of melodies (representing a lifetime of musical experience) and a Short-Term Model (STM) that learns patterns within the current piece (representing real-time adaptation). The two are combined using a weighted geometric mean based on their respective confidence.
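The sketch below is not IDyOM itself, which uses variable-order models with sophisticated smoothing and entropy-based weighting, but a deliberately simplified bigram version of the same architecture: a corpus-trained LTM, an online STM, a geometric-mean combination (with a fixed weight standing in for confidence-based weighting), and per-note IC and entropy.

```python
import math
from collections import Counter, defaultdict

ALPHABET = list("CDEFGAB")  # toy pitch alphabet

class BigramModel:
    """P(next | previous) with add-one smoothing over the toy alphabet."""
    def __init__(self):
        self.counts = defaultdict(Counter)

    def update(self, prev, nxt):
        self.counts[prev][nxt] += 1

    def prob(self, prev, nxt):
        c = self.counts[prev]
        return (c[nxt] + 1) / (sum(c.values()) + len(ALPHABET))

def analyse(melody, ltm, weight=0.5):
    """Per-note information content (bits) and entropy, IDyOM-style."""
    stm = BigramModel()  # learns within the current piece only
    results = []
    for prev, nxt in zip(melody, melody[1:]):
        # Weighted geometric mean of LTM and STM predictions, renormalised.
        dist = {n: ltm.prob(prev, n) ** weight * stm.prob(prev, n) ** (1 - weight)
                for n in ALPHABET}
        z = sum(dist.values())
        ic = -math.log2(dist[nxt] / z)                               # surprise
        h = -sum((q / z) * math.log2(q / z) for q in dist.values())  # uncertainty
        results.append((nxt, round(ic, 2), round(h, 2)))
        stm.update(prev, nxt)  # the STM adapts in real time
    return results

# Train the LTM on a tiny "lifetime" corpus, then listen to a new piece.
ltm = BigramModel()
for piece in [list("CDEFGFED"), list("CEGEC"), list("GFEDC")]:
    for a, b in zip(piece, piece[1:]):
        ltm.update(a, b)

for note, ic, h in analyse(list("CDEFGAGFE"), ltm):
    print(f"{note}: IC={ic} bits, entropy={h} bits")
```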

IDyOM explains up to 83% of the variance in listeners’ pitch expectations. It can model perceptual expectation, boundary perception, metre induction, similarity, memory, emotional response, and aesthetic experience. It is, in effect, a computational simulation of a listener’s brain.

Why IDyOM Matters for the Musicality Question

IDyOM does not output a single “musicality score.” But it provides the infrastructure for one. If you accept that musical pleasure arises from the dynamics of expectation — the interplay between prediction and surprise — then IDyOM gives you a way to measure that interplay computationally, for any piece of music, relative to any listener model.

The key insight: the same piece of music will have different IC and entropy profiles for different listener models (trained on different corpora). A Western-trained model will find a raga highly surprising. An Indian-classical-trained model will find it coherent. This is not a bug — it is the formal representation of cultural context in musical perception.

5. The Neuroscience: Dopamine, Prediction, and the Sweet Spot

In 2019, Cheung et al. published a landmark study in Current Biology that connected the information-theoretic framework to brain chemistry. Using IDyOM to calculate the uncertainty and surprise associated with 80,000 chords from US Billboard pop songs, they found that:

Musical pleasure is maximised when uncertainty is low but the outcome is surprising — or when uncertainty is high and the outcome turns out to be unsurprising.

In other words, the brain rewards two scenarios: (1) when you are confident about what is coming and the music defies that expectation anyway, a genuine, pointed surprise; and (2) when you have no idea what is coming and the music resolves into something that makes immediate sense in retrospect, ambiguity dissolving into clarity. The least pleasurable case is high uncertainty followed by high surprise: a listener who cannot form a prediction and then gets jolted anyway hears noise, not payoff.

This maps directly onto the Wundt curve, but with a critical addition: it is not just complexity that matters, but the temporal dynamics of complexity. A song that is uniformly surprising is exhausting. A song that is uniformly predictable is boring. A song that modulates between the two — building expectation here, violating it there — triggers the dopamine reward system.
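A toy scoring function consistent with the reported interaction, where pleasure tracks the mismatch between uncertainty and surprise; the functional form is an assumption for illustration, not the study's regression model:

```python
def pleasure_proxy(uncertainty, surprise):
    """Toy stand-in for the Cheung et al. interaction: pleasure tracks the
    mismatch between uncertainty and surprise, both scaled to [0, 1].
    The functional form is an illustrative assumption."""
    return abs(uncertainty - surprise)

print(pleasure_proxy(0.1, 0.9))  # confident expectation, violated: high
print(pleasure_proxy(0.9, 0.1))  # ambiguity that resolves cleanly: high
print(pleasure_proxy(0.1, 0.1))  # predictable and predicted: low
print(pleasure_proxy(0.9, 0.9))  # chaos that stays chaotic: low
```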

The brain regions involved are the same ones implicated in other reward pathways: the amygdala, hippocampus, auditory cortex, and — crucially — the nucleus accumbens, which mediates dopamine-driven pleasure. Musical pleasure is, neurochemically, the same kind of pleasure as food, sex, and drugs. The mechanism is prediction error.

A separate line of research on musical groove found the same pattern for rhythm: the pleasurable urge to move to a beat depends on “a fine-tuned interplay between predictability arising from repetitive rhythmic patterns, and surprise arising from rhythmic deviations.” The groove sweet spot is at moderate rhythmic complexity.

6. Zipf’s Law: The Statistical Signature of “Good” Music

Zipf’s law — the observation that in natural language, the frequency of a word is inversely proportional to its rank — also appears in music. When you rank musical events (notes, intervals, chords) by frequency, the distribution follows a power law.

Research has shown that Zipf distributions across certain musical dimensions appear to be a necessary, but not sufficient, condition for pleasant music. Music that violates Zipf’s law — where note distributions are too uniform (random) or too concentrated (repetitive) — tends to be perceived as less musical.

This is a statistical fingerprint of what Birkhoff was trying to capture with O/C: there is a specific distribution of elements that characterises music humans find engaging. Not too uniform (noise), not too concentrated (monotony), but following the power-law distribution that characterises natural, structured communication.
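Testing a piece against Zipf's law is mechanically simple: rank the element frequencies, fit a line in log-log space, and check for a slope near -1. A minimal sketch:

```python
import numpy as np
from collections import Counter

def zipf_exponent(events):
    """Fit log(frequency) = a * log(rank) + b; Zipfian data gives a ≈ -1."""
    freqs = np.array(sorted(Counter(events).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return slope

# Synthetic note sequence with roughly Zipfian note usage.
rng = np.random.default_rng(0)
notes = list("CDEFGAB")
probs = np.array([1 / r for r in range(1, 8)])
events = rng.choice(notes, size=2000, p=probs / probs.sum())
print(zipf_exponent(events))       # close to -1: Zipfian
print(zipf_exponent(notes * 100))  # uniform usage: slope near 0
```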

7. Network Science: Mapping Musical Structure (2025)

In January 2025, researchers from Sapienza University of Rome and the University of Padova published a study that applies network science to music complexity. Their method: treat each note as a node in a network, connect notes that follow each other sequentially, and weight the edges by transition frequency.

They automated this for approximately 20,000 songs across six genres spanning four centuries, producing what amounts to a structural fingerprint for each piece. Their findings:

  • Classical and Jazz compositions have the highest network complexity and melodic diversity
  • Recently developed genres (pop, electronic) have lower network complexity
  • There is a measurable trend toward simplification across all genres over time, including classical and jazz
  • Digital tools and streaming platforms appear to be driving homogenisation

This approach gives us a genre-independent measure of structural complexity. A song’s network — its pattern of note transitions — can be compared to any other song’s network regardless of key, tempo, or instrumentation. It is one of the most promising candidates for a cross-cultural complexity metric.
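A stripped-down version of the method fits in a few lines. "Network complexity" is reduced here to node count, edge count, and mean outgoing-transition entropy, which are simpler proxies than the measures used in the paper:

```python
import math
from collections import Counter, defaultdict

def transition_network(notes):
    """Directed weighted network: nodes are notes, edge weights are
    counts of transitions between successive notes."""
    edges = defaultdict(Counter)
    for a, b in zip(notes, notes[1:]):
        edges[a][b] += 1
    return edges

def network_summary(notes):
    net = transition_network(notes)
    n_nodes = len(set(notes))
    n_edges = sum(len(targets) for targets in net.values())
    # Mean entropy of outgoing transitions: a simple proxy for how
    # "branchy" the melodic structure is (not the paper's exact metric).
    entropies = []
    for targets in net.values():
        total = sum(targets.values())
        entropies.append(-sum((c / total) * math.log2(c / total)
                              for c in targets.values()))
    return n_nodes, n_edges, sum(entropies) / len(entropies)

print(network_summary(list("CDECDECDE")))         # few nodes, deterministic paths
print(network_summary(list("CDEFGABCEGBDFACE")))  # denser, branchier network
```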

8. The Declining Complexity Trend

In 2024, Madeline Hamilton and Marcus Pearce (the creator of IDyOM) published a study analysing the melodic complexity of every song that reached the top five of the US Billboard year-end singles charts between 1950 and 2022.

Their finding: both pitch complexity and rhythmic complexity have decreased by approximately 30% over this period.

They identified three “melodic revolutions” — sudden drops in complexity — in 1975 (disco, new wave), 1996 (hip-hop’s mainstream breakthrough), and 2000 (the digital production era). Each drop coincided with a shift in dominant production methods and audience listening habits.

The important caveat: the simplification of melody does not necessarily mean music has become less complex overall. Other dimensions — timbre, production texture, rhythmic layering — may have increased in complexity as melody simplified. The total information content may have remained constant or even increased, but redistributed across different musical parameters.

This is a critical point for any universal metric: you cannot measure musicality along a single dimension. A metric that captures melodic complexity but ignores timbral complexity will conclude that Billie Eilish is less “musical” than Mozart. That conclusion reveals the metric’s limitations, not Eilish’s.

9. The Commercial Proxy: Spotify’s Audio Features

Spotify’s API exposes a set of audio features for every track: danceability, energy, valence (emotional positivity), acousticness, instrumentalness, speechiness, liveness, and tempo. These are computed algorithmically from audio signals, though Spotify does not disclose the exact methods.

Researchers have used these features to predict commercial success with 60-82% accuracy. A Stanford project (HITPREDICT) and multiple academic studies have built classifiers that distinguish hits from non-hits based on audio features alone.
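The typical setup in such studies is a supervised classifier over the feature vector. A minimal sketch with scikit-learn; the data here is synthetic and the "hit" rule is fabricated purely so the example runs, whereas the real studies train on labelled chart data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

FEATURES = ["danceability", "energy", "valence", "acousticness",
            "instrumentalness", "speechiness", "liveness", "tempo"]

# Synthetic stand-in for a labelled dataset of (audio features, hit/non-hit).
rng = np.random.default_rng(42)
X = rng.random((500, len(FEATURES)))
# Fabricated rule for demo purposes only: "hits" skew danceable and energetic.
y = ((X[:, 0] + X[:, 1]) / 2 + 0.1 * rng.standard_normal(500) > 0.55).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```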

But here is the fundamental problem: popularity is not musicality. “Baby Shark” has over 14 billion YouTube views. It is not a paragon of musical quality. Spotify’s features measure what a song sounds like — fast, loud, happy, acoustic — but not how well it works as music.

Moreover, validation studies have found that Spotify’s valence and energy metrics correlate moderately with human ratings, but danceability — perhaps the most musically relevant feature — does not strongly correlate with how danceable humans actually find the music. The algorithms are opaque, and their relationship to genuine musical perception is uncertain.

10. The Raga Problem: Why Non-Western Music Breaks Everything

Most computational musicality research is built on Western tonal music. The twelve-tone equal temperament scale, major/minor tonality, and four-bar phrase structures are baked into the models.

Indian classical music — both Hindustani and Carnatic — operates on fundamentally different principles. A raga is not a scale — it is a melodic framework that specifies not just which notes are permitted but how they should be approached, what ornaments are appropriate, which phrases are characteristic, and even what time of day the raga should be performed. Two ragas can use identical notes but be completely different entities based on their phrase grammar.

Computational musicology for raga analysis exists — researchers use transition probability matrices, multifractal analysis, Mel-Frequency Cepstral Coefficients, and deep learning for raga identification and classification. But measuring the quality of a raga performance is a different problem entirely. As one review notes: “the artistic sense of raga cannot be captured only by form of rules.”
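Of the techniques above, the transition probability matrix is the simplest to sketch, and it captures part of the phrase-grammar distinction: two ragas sharing the same notes produce different matrices. The phrase below is invented for illustration, not an authentic raga line:

```python
import numpy as np

SWARAS = ["Sa", "Re", "Ga", "Ma", "Pa", "Dha", "Ni"]
INDEX = {s: i for i, s in enumerate(SWARAS)}

def transition_matrix(phrase):
    """Row-normalised first-order transition probabilities over swaras."""
    counts = np.zeros((7, 7))
    for a, b in zip(phrase, phrase[1:]):
        counts[INDEX[a], INDEX[b]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts),
                     where=row_sums > 0)

# Illustrative phrase only.
phrase = ["Sa", "Re", "Ga", "Ma", "Ga", "Re", "Sa", "Ni", "Sa"]
print(np.round(transition_matrix(phrase), 2))
```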

The challenge extends beyond ragas. Maqam music (Arabic/Turkish), gamelan (Indonesian), mbira traditions (Southern African), and countless other systems have structures that don’t map cleanly onto Western information-theoretic models. Any truly universal musicality metric must account for these differences — must be able to evaluate a Coltrane solo and a Bismillah Khan shehnai performance on their own terms.

This is the musical equivalent of Legg’s requirement that intelligence be measured across a “wide range of environments.” A metric that only works for Western tonal music is not universal — it is parochial.

11. AI Evaluation: The Current State

The rise of AI music generation (Suno, Udio, Soundverse) has created an urgent practical need for music quality metrics. If machines can generate millions of songs, how do you sort the good from the bad?

Current AI music evaluation relies on a mix of approaches:

  • Fréchet Audio Distance (FAD) — measures how similar AI-generated audio is to a reference distribution of real music. Lower FAD means more “realistic.” But realistic is not the same as good. (A sketch of the computation follows this list.)
  • Mean Opinion Score (MOS) — human listeners rate samples on a scale. The gold standard, but expensive and slow.
  • Musical Turing Tests — can listeners distinguish AI-generated from human-composed music? Useful for measuring realism, not quality.
  • Multi-dimensional frameworks combining analytical hierarchy processes, Likert scales, and emotional state estimation to assess coherence, novelty, and emotional resonance.
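FAD itself is computable from summary statistics alone: fit a Gaussian to embeddings of real music, another to embeddings of generated music, and take the Fréchet distance between the two. A sketch with numpy and scipy; the embeddings here are random placeholders, whereas in practice they come from a pretrained audio model such as VGGish:

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(emb_real, emb_fake):
    """Fréchet distance between Gaussians fitted to two embedding sets:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * (S1 S2)^(1/2))."""
    mu1, mu2 = emb_real.mean(axis=0), emb_fake.mean(axis=0)
    s1 = np.cov(emb_real, rowvar=False)
    s2 = np.cov(emb_fake, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):  # discard tiny numerical imaginary parts
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2 * covmean))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(1000, 16))   # placeholder "real" embeddings
close = rng.normal(0.1, 1.0, size=(1000, 16))  # similar distribution: low FAD
far = rng.normal(2.0, 0.5, size=(1000, 16))    # shifted distribution: high FAD
print(frechet_audio_distance(real, close))
print(frechet_audio_distance(real, far))
```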

The honest assessment: there is still no clear method to computationally assess musical creativity and quality. Most evaluations fall back on human judgment.

12. What a Universal Musicality Measure Would Need

Drawing from everything above, a true Universal Musicality Measure — the musical equivalent of Legg-Hutter — would need to:

1. Be multi-dimensional. Melody, harmony, rhythm, timbre, lyrics, production, and structure all contribute to musicality. A metric that captures only one dimension will systematically undervalue music that is complex in other dimensions.

2. Be listener-relative. The Wundt curve, Cheung’s dopamine research, and IDyOM’s dual-model architecture all point to the same conclusion: musical quality is a relational property between a piece of music and a listener’s prior experience. A universal metric must include a listener model — trained on a corpus that represents the listener’s musical background — the way Legg’s measure weights environments by complexity.

3. Capture temporal dynamics. Static measures (average entropy, total complexity) miss what makes music work: the trajectory of expectation and surprise over time. The metric must evaluate how a piece modulates between tension and resolution across its duration.

4. Be culture-agnostic in framework but culture-specific in parameterisation. The same mathematical framework (information dynamics, network structure, prediction error) should apply to any musical tradition, but the reference corpus (what counts as “expected” or “surprising”) must be tradition-specific. A metric that treats Western tonality as the default is not universal.

5. Account for the Zipfian sweet spot. The distribution of musical elements should follow power-law patterns — neither too uniform nor too concentrated. This is a necessary, though not sufficient, condition.

6. Integrate production and timbral dimensions. A 2024 analysis showed that as melodic complexity has declined, other dimensions may have compensated. Any metric that ignores timbre, spatial design, and production texture is measuring an increasingly incomplete picture of modern music.

The Hypothetical Formula

If one were to attempt the formalisation:

Musicality(S, L) = f( IC_trajectory(S, L), H_trajectory(S, L), Network_complexity(S), Zipf_fit(S), Timbral_diversity(S) )

Where:

  • S is the song
  • L is the listener model (trained corpus representing prior musical experience)
  • IC_trajectory is the information content (surprise) over time
  • H_trajectory is the entropy (uncertainty) over time
  • Network_complexity is the structural richness of note transitions
  • Zipf_fit is how well the element distributions follow power-law scaling
  • Timbral_diversity captures production and spectral complexity
  • f is a function that evaluates whether these dimensions collectively maintain the optimal information gradient — the sweet spot of the Wundt curve — across the song’s duration, for the given listener model

Nobody has built this. The components exist — IDyOM, network science metrics, Zipf analysis, timbral feature extraction — but they have never been integrated into a single framework with a listener-relative architecture.
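If someone did attempt the integration, the skeleton might look like the sketch below: each component of the formula as a pluggable function, with the listener model threaded through the two terms that require it. Every name and signature here is hypothetical; each stub would be backed by the machinery surveyed above.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class ListenerModel:
    """Stand-in for a predictive model trained on a listener's corpus."""
    corpus_name: str
    # In a real system: an IDyOM-style LTM plus an online STM.

@dataclass
class MusicalityReport:
    ic_trajectory: Sequence[float]       # surprise over time (listener-relative)
    entropy_trajectory: Sequence[float]  # uncertainty over time (listener-relative)
    network_complexity: float            # structural richness of note transitions
    zipf_fit: float                      # power-law exponent of element ranks
    timbral_diversity: float             # spectral / production complexity

def musicality(song, listener: ListenerModel,
               ic_fn: Callable, entropy_fn: Callable,
               network_fn: Callable, zipf_fn: Callable,
               timbre_fn: Callable) -> MusicalityReport:
    """Hypothetical integration: compute each dimension, listener-relative
    where the theory demands it, and hand the results to a final function f
    that scores how well the piece holds the Wundt sweet spot over time."""
    return MusicalityReport(
        ic_trajectory=ic_fn(song, listener),
        entropy_trajectory=entropy_fn(song, listener),
        network_complexity=network_fn(song),
        zipf_fit=zipf_fn(song),
        timbral_diversity=timbre_fn(song),
    )
```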

13. Why It Might Never Be Solved (And Why That’s Fine)

There is a philosophical argument that a universal musicality metric is not just difficult but conceptually incoherent. Intelligence, in Legg’s framework, is about achieving goals — and goals can be objectively evaluated. Musical quality is about aesthetic experience — and aesthetic experience may not have an objective ground truth.

The counterargument: we don’t need the metric to be objective — we need it to be formal. Legg’s measure is not objective in any absolute sense either — it depends on the choice of reference machine. But it provides a rigorous framework for comparing intelligences under the same assumptions. A musicality metric could do the same: not declare absolute quality, but provide a rigorous framework for comparing musical works under the same listener model.

The practical applications would be immediate: better AI music evaluation, personalised music recommendation that goes beyond collaborative filtering, computational tools for composers and producers, and — perhaps most valuably — a formal vocabulary for discussing what makes music work that transcends subjective assertion.

For now, the best we have is a constellation of partial measures, each illuminating one facet of what we experience as musicality. Birkhoff’s order-to-complexity ratio. The Wundt curve’s optimal arousal point. IDyOM’s prediction dynamics. Cheung’s dopamine sweet spot. Network science’s structural fingerprints. Zipf’s statistical signature.

The universal metric doesn’t exist yet. But the mathematical tools to build it do.


Sources: Legg & Hutter, “Universal Intelligence” (Minds and Machines, 2007); Birkhoff, “Aesthetic Measure” (Harvard, 1933); Berlyne, “Aesthetics and Psychobiology” (1971); Pearce, “Statistical learning and probabilistic prediction in music cognition” (Annals of the New York Academy of Sciences, 2018); Cheung et al., “Uncertainty and Surprise Jointly Predict Musical Pleasure” (Current Biology, 2019); Hamilton & Pearce, “Melodic complexity in popular music” (2024); Sapienza University / University of Padova, “Decoding Musical Evolution Through Network Science” (2025); Frontiers in Psychology, “The sweet spot between predictability and surprise” (2022); Smithsonian Magazine; Discover Magazine; MusicRadar; arXiv.