US20140372479A1 - Music searching methods based on human perception - Google Patents

Music searching methods based on human perception

Info

Publication number
US20140372479A1
Authority
US
United States
Prior art keywords
musical recording
scalar
recording
music
selected musical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US14/329,368
Inventor
Maxwell J. Wells
Navdeep S. Dhillon
David Waller
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gracenote Inc
Original Assignee
Gracenote Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gracenote Inc
Priority to US14/329,368
Publication of US20140372479A1
Assigned to CITIBANK, N.A., AS COLLATERAL AGENT: SUPPLEMENTAL SECURITY AGREEMENT. Assignors: GRACENOTE DIGITAL VENTURES, LLC, GRACENOTE MEDIA SERVICES, LLC, GRACENOTE, INC.
Assigned to GRACENOTE DIGITAL VENTURES, LLC, GRACENOTE, INC.: RELEASE (REEL 042262 / FRAME 0601). Assignors: CITIBANK, N.A.

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/63 Querying
    • G06F16/632 Query formulation
    • G06F16/634 Query by example, e.g. query by humming
    • G06F17/30758
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43 Querying
    • G06F16/432 Query formulation
    • G06F16/433 Query formulation using audio data
    • G06F16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/30026
    • G06F17/30743
    • G06F2213/00 Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F2213/0038 System on Chip
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/01 Automatic library building

Definitions

  • Modern computers have made possible the efficient assemblage and searching of large databases of information.
  • Text-based information can be searched for key words.
  • databases containing recordings of music could only be searched via the textual metadata associated with each recording rather than via the acoustical content of the music itself.
  • the metadata includes information such as title, artist, duration, publisher, classification applied by publisher or others, instrumentation, and recording methods.
  • searching by sound requires less knowledge on the part of the searcher; they don't have to know, for example, the names of artists or titles.
  • a second reason is that textual metadata tends to put music into classes or genres, and a search in one genre can limit the discovery of songs from other genres that may be attractive to a listener. Yet another reason is that searching by the content of the music allows searches when textual information is absent, inaccurate, or inconsistent.
  • Muscle Fish LLC in Berkeley, Calif. has developed computer methods for classification, search and retrieval of all kinds of sound recordings. These methods are based on computationally extracting many “parameters” from each sound recording to develop a vector, containing a large number of data points, which characteristically describes or represents the sound. These methods are described in a paper entitled Classification, Search, and Retrieval of Audio by Erling Wold, Thom Blum, Douglas Keislar, and James Wheaton which was published in September 1999 on the Muscle Fish website at Musclefish.com, and in U.S. Pat. No. 5,918,223 to Blum et al entitled “Method and article of manufacture for content-based analysis, storage, retrieval, and segmentation of audio information.”
  • the Blum patent describes how the authors selected a set of parameters that can be computationally derived from any sound recording with no particular emphasis on music. Data for each parameter is gathered over a period of time, such as two seconds.
  • the parameters are well known in the art and can be easily computed.
  • the parameters include variation in loudness over the duration of the recording (which captures beat information as well as other information), variation in fundamental frequency over the duration of the recording (often called “pitch”), variation in average frequency over the duration of the recording (often called “brightness”), and computation over time of a parameter called the mel frequency cepstrum coefficient (MFCC).
  • DFT log discrete Fourier transform
  • a large vector of parameters is generated for a representative sample or each section of each recording.
  • a human will then select many recordings as all comprising a single class as perceived by the human, and the computer system will then derive from these examples appropriate ranges for each parameter to characterize that class of sounds and distinguish it from other classes of sounds in the database. Based on this approach, it is not important that any of the parameters relate to human perception. It is only important that the data within the vectors be capable of distinguishing sounds into classes as classified by humans where music is merely one of the classes.
  • the invention builds on the extraction of many parameters from each recording as described by Blum and adds improvements that are particularly adapted to music and to allowing humans to find desired musical recordings within a large database of musical recordings.
  • the invention is a method which performs additional computations with the Blum parameters and other similar parameters to model descriptors of the music based on human perception.
  • descriptive terms which immediately achieve substantial agreement among human listeners include: energy level and good/bad for dancing.
  • Other descriptors of music with significant agreement among listeners include: sadness, happiness, anger, symphonicness, relative amount of singer contribution, melodicness, band size, countryness, metallicness, smoothness, coolness, salsa influence, reggae influence and recording quality.
  • Other descriptors which immediately achieve consensus or which achieve consensus over time will be discovered as the methods achieve widespread implementation.
  • the originally extracted parameter data are mathematically combined with an algorithm that achieves a high correlation with one of the scalars of human perception.
  • Each algorithm is empirically derived by working with large numbers of recordings and asking a listener, or preferably large numbers of listeners, to place the recordings in relative position compared to each other with respect to a descriptor such as energy level, or any of the above-listed descriptors, or any other descriptor.
  • the data which is mathematically extracted from the music is further processed based on a computational model of human perception to represent the music as a set of scalars, each of which corresponds as well as possible with human perception.
  • parameters such as those used by Blum are first calculated, but then, instead of storing the data of these calculated parameters, the data from the parameters is processed through each one of the algorithms to achieve a single number, a scalar, which represents the music with respect to a particular descriptor based on human perception. This collection of scalars is then stored for each musical recording.
  • the invented methods work with a second derivative of data which is extracted from the first derivative of data. Processing through these algorithms is “lossy” in the sense that the original parameter data cannot be recreated from the set of scalars, just as the first derivative data computations of the parameters are “lossy” because the original music cannot be recreated from the parameter data.
  • By characterizing each musical recording as a set of scalar descriptors, each of which is based on human perception, the system allows a human to search for a recording where the descriptors fall within certain ranges or which, compared to a sample recording, has a little bit more or a little bit less, or a lot more or a lot less of any identified scalar characteristic.
  • Each algorithm for converting the parameter data, which is a first derivative of the music, into descriptor data, which is a second derivative based on the first derivative can be of any type.
  • the simplest type of algorithm simply applies a multiplied weighting factor to each data point measured for each parameter and adds all of the results to achieve a single number.
  • the weightings need not be linear.
  • Each weighting can have a complex function for producing the desired contribution to the resultant scalar.
  • Each weighting can even be generated from a look-up table based on the value of each parameter datum. What is important is that the algorithm is empirically developed to achieve a high correlation with human perception of the selected descriptor.
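  • As an illustrative sketch only, the weighted-sum combination described above might look like the following. The parameter names, weight values, and look-up table entries are hypothetical and are not taken from the disclosure; the two helper functions stand in for "a multiplied weighting factor" and "a look-up table based on the value of each parameter datum".

```python
from typing import Callable, Dict


def linear(weight: float) -> Callable[[float], float]:
    """Weighting that multiplies the parameter value by a fixed factor."""
    return lambda value: weight * value


def lookup(table: Dict[int, float]) -> Callable[[float], float]:
    """Weighting read from a table keyed on the rounded parameter value."""
    return lambda value: table.get(round(value), 0.0)


def descriptor_scalar(params: Dict[str, float],
                      weightings: Dict[str, Callable[[float], float]]) -> float:
    """Combine parameter data into one scalar by summing each contribution."""
    return sum(weightings[name](value) for name, value in params.items())


# Hypothetical parameter data and weightings for a descriptor such as "energy".
params = {"loudness": 0.5, "tempo": 120.0, "brightness": 3.0}
weightings = {
    "loudness": linear(4.0),
    "tempo": linear(0.01),
    "brightness": lookup({3: 0.5, 4: 1.0}),
}
energy = descriptor_scalar(params, weightings)
print(round(energy, 3))
```

  • In practice the weightings would be derived empirically from listener data rather than chosen by hand as they are here.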
  • If a user wishes to find a recording which is similar to a recording which has not yet been processed into a set of scalar descriptors, the user provides a representative portion of the recording, or directs the system where to find it.
  • the computer system extracts the parameters, combines them according to the algorithms with the appropriate weightings and develops a set of scalars for the recording which are calculated by the same methods as the scalars for recordings in the database. The new recording can then easily be compared to any recording already in the database.
  • FIG. 1 shows the prior art method described by Blum.
  • FIG. 2 describes the method by which Blum would find sounds that sound alike.
  • FIG. 3 is an illustration of the current invention as it is used to create a database of descriptors of music.
  • FIG. 4 illustrates a method for creating and searching a database.
  • FIG. 5 is an example of an interface used to interact with a database.
  • FIG. 6 illustrates how the current invention is used to find music that humans perceive as sounding alike using weighted parameters.
  • FIG. 7 illustrates how the current invention is used to find music that humans perceive as sounding alike using weighted descriptors.
  • FIG. 8 is a method for determining the perceptual salience of a parameter.
  • The prior art method of categorizing sounds, as described by Blum, is illustrated in FIG. 1.
  • a database of stored sounds 101 is fed to a parameter extractor 102 , where the sounds are decomposed using digital signal processing (DSP) techniques that are known in the art to create parameters 103 such as loudness, brightness, fundamental frequency, and cepstrum.
  • a human selects a sound, step 202 , from a database of stored sounds 201 , that falls into a category of sounds. For example, “laughter.” This is the target sound.
  • the target sound is fed into a parameter extractor 204 to create a number of parameters 205 , with a large number of data points over time for each parameter.
  • the parameters from the sound are then stored as a vector in an n-dimensional space 206 .
  • the parameter values of the target sound are adjusted, step 207 .
  • a new sound is played, step 208 , which corresponds with the adjusted parameter values.
  • a human listens to the new sound and determines whether or not the new sound is perceptually similar to the target sound, step 209 . If it is not, branch 210 , the parameters are again adjusted, step 207 , until a similar sound is found, or there are no more sounds. If a similar sound is found, then the parameter values of that sound are used to determine an area of similar sounding sounds.
  • This classification is binary. Either something is like the target class or it is not. There is an assumption that mathematical distance of a parameter vector from the parameter vectors of the target class is related to the perceptual similarity of the sounds. Blum claims that this technique works for cataloging transient sounds. However, for music, it is unlikely that the relationship between two parameter vectors would have any relevance to the perceptual relationship of the music represented by those vectors. This is because the parameters bear insufficient relevance to the human perception of music. Also, music cannot be adequately represented with a binary classification system. A music classification system must account for the fact that a piece of music can belong to one of several classes. For example, something may be country with a rock influence.
  • a classification system must account for the fact that a piece of music may be a better or worse example of a class. Some music may have more or less elements of some descriptive element. For example, a piece of music may have more or less energy, or be more or less country than another piece.
  • a set of scalar descriptors allows for more comprehensive and inclusive searches of music.
  • a database of stored music 301 is played to one or more humans, step 302 , who rate the music on the amount of one or more descriptors.
  • the same music is fed into a parameter extractor 303 , that uses methods known in the art to extract parameters 304 that are relevant to the perception of music, such as tempo, rhythm complexity, rhythm strength, brightness, dynamic range, and harmonicity. Numerous different methods for extracting each of these parameters are known in the art.
  • a model of a descriptor 305 is created by combining the parameters with different weightings for each parameter. The weightings may vary with the value of the parameter.
  • the parameter “brightness” may contribute to a descriptor value only when it is above a threshold or below a threshold or within a range.
  • the model is refined, step 307 , by minimizing the difference, calculated in step 306 , between the human-derived descriptor value and the machine-derived value.
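  • A minimal sketch of this refinement, assuming an ordinary least-squares fit is used to minimize the difference between human-derived and machine-derived values. The song data, the ratings, and the brightness threshold are invented for illustration, and `gated` is a hypothetical helper showing a parameter that contributes only above a threshold.

```python
from typing import List


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_weights(features: List[List[float]], ratings: List[float]) -> List[float]:
    """Least-squares weights (intercept first) via the normal equations."""
    rows = [[1.0] + f for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ratings)) for i in range(k)]
    return solve(xtx, xty)


def gated(brightness: float, threshold: float = 0.5) -> float:
    """Hypothetical gate: brightness contributes only above the threshold."""
    return brightness if brightness > threshold else 0.0


# Invented data: [scaled tempo, gated brightness] per song, with human ratings.
songs = [[0.2, gated(0.3)], [0.5, gated(0.7)], [0.8, gated(0.9)], [1.0, gated(0.4)]]
human_ratings = [1.4, 3.4, 4.4, 3.0]

weights = fit_weights(songs, human_ratings)
machine_ratings = [weights[0] + sum(w * x for w, x in zip(weights[1:], s)) for s in songs]
errors = [h - m for h, m in zip(human_ratings, machine_ratings)]
print([round(w, 3) for w in weights])
```

  • The invented ratings were generated from known weights so that the fit can be checked by inspection; real listener data would leave a nonzero residual error.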
  • the objective of using human listeners is to create models of their perceptual processes of the descriptors, using a subset of all music, and to apply that model to the categorization of the set of all music.
  • the limit on the goodness of the fit of the model is determined by, among other things:
  • the preferred method for collecting human data is to use a panel of ear-pickers who listen to each song and ascertain the quantity of each of several descriptors, using a Likert scale, which is well known in the art. For example, nine ear-pickers are given a rating scale with a list of descriptors, as shown below:
  • Another method uses music with known quantities of some descriptor as defined by the purpose to which it is put by the music-buying public, or by music critics.
  • Our technique rank orders the recommended songs by their quantity of the descriptor, either using the values that come with the music, or using our panel of ear-pickers. For example, www.jamaicans.com/eddyedwards features music with Caribbean rhythms that are high energy.
  • Amazon.Com features a mood matcher in which music critics have categorized music according to its uses. For example, for line dancing they recommend the following
  • Another method uses professional programmers who create descriptors to create programs of music for play in public locations. For example, one company has discovered by trial and error the type of music they need to play at different times of the day to energize or pacify the customers of locations for which they provide the ambient music. They use the descriptor “energy,” rated on a scale of 1 (low) to 5 (high). We used 5 songs at each of 5 energy levels and in 5 genres (125 songs total), extracted the parameters and created a model of the descriptor “energy” on the basis of the company's human-derived energy values. We then applied that model to 125 different songs and found an 88% match between the values of our machine-derived descriptors and the values from the human-derived descriptors.
  • the preferred method for representing each descriptor uses generalized linear models, known in the art (e.g. McCullagh and Nelder (1989) Generalized Linear Models, Chapman and Hall).
  • the preferred model of “energy” uses linear regression, and looks like this:
  • the preferred weighting values are:
  • $\beta_8$ = 0 if no key; 10 if minor keys; 20 if major keys
  • Another method of optimizing a descriptor model involves using non linear models. For example:
  • Yet another method involves using heuristics. For example, if the beats per minute value of a song is less than 60, then the energy cannot be more than some predetermined value.
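  • A heuristic of this kind can be sketched as a simple cap applied after the model has produced a value. The cap of 3.0 is a hypothetical stand-in for the "predetermined value", which the text does not specify.

```python
def apply_tempo_heuristic(energy: float, bpm: float,
                          slow_bpm: float = 60.0,
                          energy_cap: float = 3.0) -> float:
    """Limit the energy descriptor for songs slower than slow_bpm.

    energy_cap is an assumed placeholder for the predetermined value.
    """
    if bpm < slow_bpm:
        return min(energy, energy_cap)
    return energy


print(apply_tempo_heuristic(4.5, bpm=50.0))
print(apply_tempo_heuristic(4.5, bpm=120.0))
```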
  • each descriptor model is a machine-derived descriptor 308 .
  • descriptors for each song are stored in a database 309 .
  • the presently preferred descriptors for use in the preferred system are:
  • Once the models of the descriptors have been created, using a subset of all available music 301, they are applied to the classification of other music 311. Tests are conducted to determine the fit between the descriptor models derived from the subset of music and the other music, by substituting some of the other music 311 into the process beginning with the parameter extractor 303. As new music becomes available, a subset of the new music is tested against the models by placing it into the process. Any adjustments of the models are applied to all of the previously processed music, either by reprocessing the music, or, preferably, reprocessing the parameters originally derived from the music. From time to time, a subset of new music is placed into the process at step 301. Thus, any changes in the tastes of human observers, or in styles of music can be measured and accommodated in the descriptor models.
  • Harmonicity is related to the number of peaks in the frequency domain which are an Integer Multiple of the Fundamental (IMF) frequency.
  • the harmonicity value is expressed as a ratio of the number of computed IMFs to a maximum IMF value (specified to be four).
  • Harmonicity values H are computed for time windows of length equal to one second for a total of 20 seconds. Mean and standard deviation values are additional parameters taken over the vector H.
  • Loudness is defined to be the root mean square value of the song signal. Loudness values L were computed for time windows of length equal to one second for a total of 20 seconds.
  • Dynamic Range: the standard deviation of loudness over 20 values, calculated once per second for 20 seconds.
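  • The loudness and dynamic range parameters can be sketched as follows. The sketch assumes the signal is already available as a list of samples, and it uses the population standard deviation; the text does not say whether the population or sample form is intended. The tiny sample rate in the example is only for readability.

```python
import math
from statistics import pstdev
from typing import List


def loudness_per_second(signal: List[float], sample_rate: int,
                        seconds: int = 20) -> List[float]:
    """Root mean square of the signal in consecutive one-second windows."""
    values = []
    for i in range(seconds):
        window = signal[i * sample_rate:(i + 1) * sample_rate]
        if not window:
            break  # the signal is shorter than the requested duration
        values.append(math.sqrt(sum(x * x for x in window) / len(window)))
    return values


def dynamic_range(loudness: List[float]) -> float:
    """Standard deviation of the per-second loudness values (population form assumed)."""
    return pstdev(loudness)


# Toy signal: one quiet second followed by one loud second at 4 samples per second.
toy_signal = [1.0] * 4 + [3.0] * 4
loudness = loudness_per_second(toy_signal, sample_rate=4)
print(loudness, dynamic_range(loudness))
```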
  • Rhythm Strength is calculated in the same process used to extract tempo. First, a short-time Fourier transform spectrogram of the song is performed, using a window size of 92.8 ms (Hanning windowed), and a frequency resolution of 10.77 Hz. For each frequency bin in the range of 0-500 Hz, an onset track is formed by computing the first difference of the time history of amplitude in that bin. Large positive values in the onset track for a certain frequency bin correspond to rapid onsets in amplitude at that frequency. Negative values (corresponding to decreases in amplitude) in the onset tracks are truncated to zero, since the onsets are deemed to be most important in determining the temporal locations of beats.
  • a correlogram of the onset tracks is then computed by calculating the unbiased autocorrelation of each onset track.
  • the frequency bins are sorted in decreasing order based on the activity in the correlation function, and the twenty most active correlation functions are further analyzed to extract tempo information.
  • Each of the selected correlation functions is analyzed using a peak detection algorithm and a robust peak separation method in order to determine the time lag between onsets in the amplitude of the corresponding frequency bin. If a lag can be identified with reasonable confidence, and if the value lies between 222 ms and 2 seconds, then a rhythmic component has been detected in that frequency bin. The lags of all of the detected components are then resolved to a single lag value by means of a weighted greatest common divisor algorithm, where the weighting is dependent on the total energy in that frequency bin, the activity in the correlation function for that frequency bin, and the degree of confidence achieved by the peak detection and peak separation algorithms for that frequency bin.
  • the tempo of the song is set to be the inverse of the resolved lag.
  • the rhythm strength is the sum of the activity levels of the 20 most active correlation functions, normalized by the total energy in the song.
  • the activity level of each correlation function is defined as the sum-of-squares of the negative elements of second difference of that function. It is a measure of how strong and how repetitive the beat onsets are in that frequency bin.
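  • The onset track, correlogram, and activity level described above can be sketched for a single frequency bin as shown below. The spectrogram stage is omitted: the sketch assumes the amplitude history of each bin is already available, and `total_energy` is passed in as a plain number because its exact computation is not given here.

```python
from typing import List


def onset_track(amplitudes: List[float]) -> List[float]:
    """First difference of a bin's amplitude history, negatives truncated to zero."""
    return [max(b - a, 0.0) for a, b in zip(amplitudes, amplitudes[1:])]


def unbiased_autocorrelation(track: List[float]) -> List[float]:
    """Autocorrelation in which each lag is divided by its number of terms."""
    n = len(track)
    return [sum(track[i] * track[i + lag] for i in range(n - lag)) / (n - lag)
            for lag in range(n)]


def activity(corr: List[float]) -> float:
    """Sum of squares of the negative elements of the second difference."""
    second_diff = [corr[i + 1] - 2 * corr[i] + corr[i - 1]
                   for i in range(1, len(corr) - 1)]
    return sum(d * d for d in second_diff if d < 0)


def rhythm_strength(onset_tracks: List[List[float]], total_energy: float,
                    top: int = 20) -> float:
    """Sum of the `top` largest activity levels, normalized by total energy."""
    levels = sorted((activity(unbiased_autocorrelation(t)) for t in onset_tracks),
                    reverse=True)
    return sum(levels[:top]) / total_energy


# Toy data: one bin with a regular pulse and one bin with constant amplitude.
beat_track = onset_track([0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
flat_track = onset_track([1.0] * 9)
strength = rhythm_strength([beat_track, flat_track], total_energy=6.0)
print(strength)
```

  • The pulsed bin produces a strongly oscillating correlation function and therefore a high activity level, while the constant bin contributes nothing.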
  • Rhythm Complexity: the number of rhythmic events per measure.
  • a measure prototype is created by dividing the onset tracks into segments whose length corresponds to the length of one measure. These segments are then summed together to create a single, average onset track for one measure of the song; this is the measure prototype.
  • the rhythm complexity is calculated as the number of distinct peaks in the measure prototype.
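  • A rough sketch of the measure prototype and peak count, assuming a single onset track and a measure length already expressed in samples. The definition of a "distinct peak" used here (a value above a threshold that exceeds both neighbours) is an assumption; the text does not define it.

```python
from typing import List


def measure_prototype(onsets: List[float], measure_len: int) -> List[float]:
    """Average the onset track over consecutive measure-length segments."""
    n_measures = len(onsets) // measure_len
    if n_measures == 0:
        raise ValueError("onset track is shorter than one measure")
    return [sum(onsets[m * measure_len + i] for m in range(n_measures)) / n_measures
            for i in range(measure_len)]


def rhythm_complexity(prototype: List[float], threshold: float = 0.0) -> int:
    """Count local maxima above the threshold (assumed peak definition)."""
    count = 0
    for i, value in enumerate(prototype):
        left = prototype[i - 1] if i > 0 else float("-inf")
        right = prototype[i + 1] if i < len(prototype) - 1 else float("-inf")
        if value > threshold and value > left and value > right:
            count += 1
    return count


# Toy onset track: two identical measures of four samples each.
toy_onsets = [1.0, 0.0, 0.5, 0.0, 1.0, 0.0, 0.5, 0.0]
prototype = measure_prototype(toy_onsets, measure_len=4)
print(prototype, rhythm_complexity(prototype))
```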
  • Articulation is the ratio of note length, i.e., the duration from note start to note end (L), to note spacing, i.e., the duration from one note start to the next note start (S).
  • An L:S ratio close to 1.00 reflects legato articulation. Ratios less than 1.00 are considered staccato.
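  • For illustration, the L:S ratio can be computed from note start and end times as below. Averaging the ratio over all notes is an assumption made for this sketch; the text defines the ratio for a note but not how a song-level value is formed.

```python
from typing import List


def articulation(note_starts: List[float], note_ends: List[float]) -> float:
    """Mean ratio of note length (L) to note spacing (S) over consecutive notes."""
    ratios = []
    for i in range(len(note_starts) - 1):
        length = note_ends[i] - note_starts[i]          # L: start to end
        spacing = note_starts[i + 1] - note_starts[i]   # S: start to next start
        ratios.append(length / spacing)
    if not ratios:
        raise ValueError("at least two notes are needed")
    return sum(ratios) / len(ratios)


starts = [0.0, 0.5, 1.0]
legato = articulation(starts, [0.5, 1.0, 1.5])      # each note lasts until the next
staccato = articulation(starts, [0.25, 0.75, 1.25])  # each note fills half the gap
print(legato, staccato)
```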
  • Attack is the speed at which a note achieves its half-peak amplitude.
  • Note Duration Pitch is extracted by using a peak separation algorithm to find the separation in peaks in the autocorrelation of the frequency domain of the song signal.
  • the peak separation algorithm uses a windowed threshold peak detection algorithm which uses a sliding window and finds the location of the maximum value in each of the peak-containing-regions in every window.
  • a peak-containing-region is defined as a compact set of points where all points in the set are above the specified threshold, and the surrounding points are below the threshold.
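  • The notion of a peak-containing-region can be sketched as follows. For brevity the sketch scans the whole sequence once rather than using a sliding window, which is a simplification of the described algorithm; the threshold and input values are invented.

```python
from typing import List


def peak_locations(values: List[float], threshold: float) -> List[int]:
    """Index of the maximum in each contiguous run of values above the threshold."""
    peaks = []
    start = None
    # A sentinel equal to the threshold closes a region that runs to the end.
    for i, value in enumerate(values + [threshold]):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            peaks.append(max(range(start, i), key=lambda j: values[j]))
            start = None
    return peaks


def peak_separations(peaks: List[int]) -> List[int]:
    """Distance, in samples, between consecutive peak locations."""
    return [b - a for a, b in zip(peaks, peaks[1:])]


toy_values = [0.1, 0.9, 0.7, 0.2, 0.1, 0.6, 0.8, 0.3]
peaks = peak_locations(toy_values, threshold=0.5)
print(peaks, peak_separations(peaks))
```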
  • a confidence measure in the pitch is returned as well; confidence is equal to harmonicity.
  • Pitch values P are computed for time windows of length equal to 0.1 second. Changes of less than 10 Hz are considered to be one note.
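  • One way to turn the sequence of pitch values into note durations is sketched below. It assumes each pitch value is compared with the immediately preceding value; comparing against the pitch at the start of the note would be an equally plausible reading of the text.

```python
from typing import List


def note_durations(pitches: List[float], frame_seconds: float = 0.1,
                   tolerance_hz: float = 10.0) -> List[float]:
    """Group consecutive pitch frames into notes and return each note's duration."""
    if not pitches:
        return []
    durations = []
    frames = 1
    for previous, current in zip(pitches, pitches[1:]):
        if abs(current - previous) < tolerance_hz:
            frames += 1  # change under 10 Hz: still the same note
        else:
            durations.append(frames * frame_seconds)
            frames = 1
    durations.append(frames * frame_seconds)
    return durations


# Toy pitch track in Hz, one value per 0.1-second window.
toy_pitches = [440.0, 442.0, 438.0, 523.0, 525.0, 330.0]
durations = note_durations(toy_pitches)
print([round(d, 2) for d in durations])
```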
  • Sound Salience: uses a modified version of the rhythm extraction algorithm, in which spectral events without rapid onsets are identified.
  • Another method for optimizing the fit between the human-derived descriptor value and the machine-derived value is to adjust the extractor performance, step 310 .
  • This can be accomplished by using extractors in which some or all of the internal values, for example the sample duration, or the upper or lower bounds of the frequencies being analyzed, are adjusted.
  • the adjustment is accomplished by having an iterative program repeat the extraction and adjust the values until the error, step 306 , between the human values and the machine values is minimized.
  • the parameter extractor 801 is tested, step 802 , by visually or audibly displaying the parameter value while concurrently playing the music from which it was extracted. For example, a time history of the harmonicity values of a song, sampled every 100 ms, is displayed on a screen, with a moving cursor line. The computer plays the music and moves the cursor line so that the position of the cursor line on the x-axis is synchronized with the music. If the listener or listeners can perceive the correct connection between the parameter value and the changes in the music then the parameter extractor is considered for use in further model development, branch 803 .
  • the database of descriptors 309 can be combined with other meta data about the music, such as the name of the artist, the name of the song, the date of recording etc.
  • One method for searching the database is illustrated in FIG. 4 .
  • a large set of music 401 representing “all music” is sent to the parameter extractor 402 .
  • the descriptors are then combined with other meta data and pointers to the location of the music, for example URLs of the places where the music can be purchased, to create a database 405 .
  • a user can interrogate the database 405 , and receive the results of that interrogation using an interface 406 .
  • An example of an interface is illustrated in FIG. 5.
  • a user can type in textual queries using the text box 504. For example: “Song title: Samba pa ti, Artist: Santana.” The user then submits the query by pressing the sort by similarity button 503. The song “Samba pa ti” becomes the target song, and appears in the target box 501.
  • the computer searches the n-dimensional database 405 of vectors made up of the song descriptors, looking for the smallest vector distances between the target song and other songs. These songs are listed in the hit list box 502 in order of increasing vector distance (decreasing similarity). An indication of the vector distance between each hit song and the target song is shown beside each hit song, expressed as a percent, with 100% being the same song. For example, the top of the list may be “Girl from Ipanema by Stan Getz, 85%.”
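  • A minimal sketch of this similarity search, assuming Euclidean distance between descriptor vectors. The mapping from distance to a percentage is not specified in the text, so `similarity_percent` is a hypothetical choice (linear, with 100% at zero distance), and the song names and descriptor values are invented.

```python
import math
from typing import Dict, List, Sequence, Tuple


def rank_by_similarity(target: Sequence[float],
                       songs: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Return (title, distance) pairs in order of increasing vector distance."""
    return sorted(((title, math.dist(target, vector)) for title, vector in songs.items()),
                  key=lambda pair: pair[1])


def similarity_percent(distance: float, max_distance: float) -> float:
    """Hypothetical mapping: 100% at zero distance, 0% at the largest possible distance."""
    return 100.0 * (1.0 - distance / max_distance)


# Invented descriptor vectors (energy, danceability, anger) on an assumed 0-10 scale.
target = (5.0, 6.0, 7.0)
database = {
    "Song A": (9.0, 6.0, 7.0),
    "Song B": (5.0, 7.0, 7.0),
    "Song C": (5.0, 6.0, 7.0),
}
hits = rank_by_similarity(target, database)
max_distance = math.dist((0.0, 0.0, 0.0), (10.0, 10.0, 10.0))
for title, distance in hits:
    print(title, round(similarity_percent(distance, max_distance), 1))
```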
  • Another type of query allows a user to arrange the songs by profile.
  • the user presses the sort by number button 509 which lists all of the songs in the database in numerical order.
  • the user can scroll through the songs using a scroll bar 506 . They can select a song by clicking on it, and play the song by double clicking on it. Any selected song is profiled by the slider bars 505 . These show the scalar values of each of several descriptors. The current preferred method uses energy, danceability and anger.
  • Pressing the sort by profile button 510 places the highlighted song in the target box 501 and lists songs in the hit list box 502 that have the closest values to the values of the target song.
  • Yet another type of query allows the user to sort by similarity plus profile.
  • a target song is chosen and the songs in the hit box 502 are listed by similarity.
  • the user performs a search of this subset of songs by using the slider bars 505 .
  • the slider values of the original target song are 5, 6, 7 for energy, danceability and anger respectively.
  • the user increases the energy value from 5 to 9 and presses the search similar songs button 507 .
  • the profile of the target song remains as a ghosted image 508 on the energy slider.
  • the computer searches the subset of songs for a song with values of 9, 6 and 7. Songs with that profile are arranged in decreasing order of similarity. Songs without that profile are appended, arranged by decreasing similarity.
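  • The ordering described for the similarity-plus-profile query can be sketched with a two-part sort key. Exact matching of the profile values is assumed here for simplicity, and the hit data are invented.

```python
from typing import List, Tuple

# (title, (energy, danceability, anger), similarity to the target song)
Hit = Tuple[str, Tuple[int, int, int], float]


def sort_by_similarity_plus_profile(hits: List[Hit],
                                    profile: Tuple[int, int, int]) -> List[str]:
    """Profile matches first, then the rest; each group by decreasing similarity."""
    # False sorts before True, so songs matching the profile come first.
    ordered = sorted(hits, key=lambda hit: (hit[1] != profile, -hit[2]))
    return [title for title, _, _ in ordered]


toy_hits = [
    ("Song A", (5, 6, 7), 0.85),
    ("Song B", (9, 6, 7), 0.60),
    ("Song C", (9, 6, 7), 0.75),
    ("Song D", (4, 6, 7), 0.90),
]
result = sort_by_similarity_plus_profile(toy_hits, profile=(9, 6, 7))
print(result)
```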
  • the target song remains in the target box 501 .
  • a user can choose a new target song by clicking on a song in the hit list box 502 .
  • FIG. 6 illustrates how the invention is used to find music that sounds to human listeners like any musical composition selected by a user.
  • the processes in FIG. 3 are repeated to create a set of parameters 604 . These are used to create a model of likeness 605 by a process described below.
  • One or more humans listen to pairs of songs and judge their similarity on a scale, for example from 1 to 9, where 1 is very dissimilar, and 9 is very similar, step 602 .
  • the objective of using human listeners is to create a model of their perceptual process of “likeness”, using a subset of all music, and to apply that model to the set of all music.
  • the preferred method for collecting human data is to use a panel of ear-pickers who listen to pairs of songs and score their similarity, using a Likert scale, which is well known in the art.
  • Another method is to have people visit a web site on which they can listen to pairs of songs and make similarity judgments. These judgments could be on a scale of 1 to 9, or could be yes/no. It is possible to estimate a scalar value of similarity based on a large number of binary judgments using statistical techniques known in the art.
  • the objective of the model is to predict these perceived differences using the extracted parameters of music.
  • a list of numbers is calculated for the comparison of each song to each other song.
  • the list of numbers consists of a value for each parameter where the value is the difference between the parameter value for a first song and the value of the same parameter for a second song.
  • When the model is used to compare one song to another for likeness, the list of parameter differences between the two songs is calculated and these differences are the inputs to the model.
  • the model then yields a number that predicts the likeness that people would judge for the same two songs.
  • the model processes the list of difference values by applying weights and heuristics to each value.
  • the preferred method of creating a model of likeness is to sum the weighted parameter values. For example, for three parameters, A, B and C, the following steps are used for songs 1 and 2:
  • $\sqrt[n]{(A_1 - A_2)^n},\ \sqrt[n]{(B_1 - B_2)^n},\ \sqrt[n]{(C_1 - C_2)^n}$
  • STEP 3 weight and sum the differences.
  • the values of the weights (β1-βn) are determined by linear regression, as explained below.
  • the values of the weights are determined by a process of linear regression, step 608 , which seeks to minimize the difference, step 606 , between the human-derived likeness values, step 602 , and the output from the model, 605 .
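  • The regression step can be sketched as an ordinary least-squares fit of human likeness scores against parameter differences. Everything in the example is illustrative: the difference values and likeness scores are made up, the intercept term is an assumption, and the solver is a plain Gauss-Jordan elimination rather than any method named in the text.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_weights(diffs, likeness):
    """Least-squares weights (intercept first) via the normal equations."""
    rows = [[1.0] + list(d) for d in diffs]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, likeness)) for i in range(k)]
    return solve(xtx, xty)

def predict(weights, diff):
    """Weighted sum of the differences plus the intercept."""
    return weights[0] + sum(w * d for w, d in zip(weights[1:], diff))

# Hypothetical absolute differences (A, B, C) for six song pairs and the
# likeness score (1-9) a listening panel might have given each pair.
diffs = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1), (2, 1, 0)]
likeness = [9.0, 7.0, 8.0, 8.5, 5.5, 4.0]

weights = fit_weights(diffs, likeness)
print(weights)
```

  • Because larger differences should lower the predicted likeness, the fitted weights on the differences come out negative in this toy data set.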
  • the preferred model of likeness is:
  • Another method for deriving likeness is to calculate the correlation coefficients (r) of the parameter values between each pair of songs in the database, and to create a matrix of similarity for the songs, with high correlation equating to high similarity.
  • the parameter values are normalized to ensure that they all have the same range.
  • Song 1 provides the parameters for the x values
  • song 2 provides the parameters for the y values in the following formula:
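  • Assuming the coefficient meant here is the standard Pearson product-moment correlation, it can be computed for two songs as sketched below. The normalized parameter vectors are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length parameter vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical parameter values, already normalized to the range 0..1.
song_1 = [0.1, 0.5, 0.9]
song_2 = [0.2, 0.6, 1.0]
song_3 = [0.9, 0.5, 0.1]

print(pearson_r(song_1, song_2), pearson_r(song_1, song_3))
```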
  • the preferred method of storing and organizing the parameter differences data is as a multi dimensional vector in a multi dimensional database 607 .
  • the resulting matrix contains n*(n−1)/2 cells where n is the number of songs.
  • the model is used by starting with a target song 609 , calculating the difference in value for each parameter between the comparison song and the songs in the target database, steps 603 - 605 , and then applying the model to the difference values to arrive at a value which represents the likeness between the comparison song and each song in the target database.
  • An alternative method precomputes the 2 dimensional similarity matrix such that each song is connected with only those songs with which there is a match above a predetermined value.
  • the low matching songs are culled from the similarity matrix. This decreases the size of the database and can increase the search speed.
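  • The pairwise matrix and the culling step can be sketched as follows. The likeness function, the two-value profiles, and the threshold of 7 are placeholders chosen only for illustration.

```python
from itertools import combinations

def toy_likeness(p, q):
    """Placeholder likeness: 9 minus the summed absolute difference."""
    return 9 - sum(abs(a - b) for a, b in zip(p, q))

def build_similarity(songs, likeness, threshold):
    """Score every unordered pair once and keep only pairs at or above threshold."""
    kept = {}
    for a, b in combinations(sorted(songs), 2):
        score = likeness(songs[a], songs[b])
        if score >= threshold:
            kept[(a, b)] = score
    return kept

songs = {"a": (5, 5), "b": (5, 6), "c": (1, 1), "d": (6, 6)}
n = len(songs)
all_pairs = n * (n - 1) // 2          # cells in the full pairwise matrix
kept = build_similarity(songs, toy_likeness, threshold=7)
print(all_pairs, sorted(kept))
```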
  • the limit on the goodness of the fit of the model is determined by, among other things:
  • FIG. 7 illustrates an alternative method for finding music that sounds to human listeners like any musical composition selected by a user.
  • the processes in FIG. 3 are repeated to create a set of descriptors 706 . These are used to create a model of likeness 707 by a process similar to that used to create the model of likeness using parameters 605 .
  • One or more humans listen to pairs of songs and judge their similarity on a scale, for example from 1 to 9, where 1 is very dissimilar, and 9 is very similar, step 702 .
  • the objective of using human listeners is to create a model of their perceptual process of “likeness”, using a subset of all music, and to apply that model to the set of all music.
  • the preferred or alternative methods of collecting human data described with FIG. 6 are used.
  • the objective of the model is to predict the perceived differences using the modelled descriptors of music.
  • a list of numbers is calculated for the comparison of each song to each other song.
  • the list of numbers consists of a value for each descriptor where the value is the difference between the descriptor value for a first song and the value of the same descriptor for a second song.
  • When the model is used to compare one song to another for likeness, the list of descriptor differences between the two songs is calculated and these differences are the inputs to the model.
  • the model then yields a number that predicts the likeness that people would judge for the same two songs.
  • the model processes the list of descriptor difference values by applying weights and heuristics to each value.
  • the preferred method of creating a model of likeness is to sum the weighted descriptor difference values. For example, for three descriptors, A, B and C, the following steps are used for songs 1 and 2:
  • $\sqrt[n]{(A_1 - A_2)^n},\ \sqrt[n]{(B_1 - B_2)^n},\ \sqrt[n]{(C_1 - C_2)^n}$
  • the values of the weights are determined by a process of linear regression, step 710 , which seeks to minimize the difference, step 708 , between the human-derived likeness values, step 702 , and the output from the model.
  • the preferred model of likeness is:
  • the preferred and alternative methods of organizing and storing the parameter differences data for calculating likeness are also used when the process uses descriptors.
  • the sorting results in a maximum of L^n similarity classes, where L is the number of levels of each descriptor and n is the number of descriptors. In this case, 59049 classes.
  • the songs in each class have the identical descriptor values. This provides an alternative type of likeness.
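  • Grouping songs into classes of identical descriptor values can be sketched as below. The figure of 59049 is consistent with 9 levels and 5 descriptors (9^5), which is the assumption made here; the song names and profiles are hypothetical.

```python
from collections import defaultdict

LEVELS = 9        # assumed number of levels per descriptor (1..9)
DESCRIPTORS = 5   # assumed number of descriptors
max_classes = LEVELS ** DESCRIPTORS

def group_by_profile(songs):
    """Map each distinct descriptor tuple to the songs that share it."""
    classes = defaultdict(list)
    for name, profile in songs.items():
        classes[tuple(profile)].append(name)
    return dict(classes)

songs = {
    "song_a": (9, 6, 7, 2, 8),
    "song_b": (9, 6, 7, 2, 8),
    "song_c": (3, 4, 5, 1, 2),
}
classes = group_by_profile(songs)
print(max_classes, classes)
```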
  • the likeness database may be quickly searched by starting with any song and immediately finding the closest songs.
  • a more enhanced search combines a similarity search with a descriptor search and the descriptor adjustment step described above. The steps are:

Abstract

A method for characterizing a musical recording as a set of scalar descriptors, each of which is based on human perception. A group of people listens to a large number of musical recordings and assigns to each one many scalar values, each value describing a characteristic of the music as judged by the human listeners. Typical scalar values include energy level, happiness, danceability, melodicness, tempo, and anger. Each of the pieces of music judged by the listeners is then computationally processed to extract a large number of parameters which characterize the electronic signal within the recording. Algorithms are empirically generated which correlate the extracted parameters with the judgments based on human perception to build a model for each of the scalars of human perception. These models can then be applied to other music which has not been judged by the group of listeners to give to each piece of music a set of scalar values based on human perception. The set of scalar values can be used to find other pieces that sound similar to humans or vary in a dimension of one of the scalars.

Description

    CROSS REFERENCE TO RELATED APPLICATIONS
  • This application is a continuation of U.S. patent application Ser. No. 13/667,683, filed on Nov. 2, 2012, which is a continuation of U.S. patent application Ser. No. 09/556,086, filed on Apr. 21, 2000, which claims the benefit of priority to U.S. Provisional Patent Application Ser. No. 60/153,768, filed on Sep. 14, 1999, the priority benefit of each of which is claimed hereby, and each of which is incorporated by reference herein in its entirety.
  • 1.0 BACKGROUND
  • Modern computers have made possible the efficient assemblage and searching of large databases of information. Text-based information can be searched for key words. Until recently, databases containing recordings of music could only be searched via the textual metadata associated with each recording rather than via the acoustical content of the music itself. The metadata includes information such as title, artist, duration, publisher, classification applied by publisher or others, instrumentation, and recording methods. For several reasons it is highly desirable to be able to search the content of the music to find music which sounds to humans like other music, or which has more or less of a specified quality as perceived by a human than another piece of music. One reason is that searching by sound requires less knowledge on the part of the searcher; they don't have to know, for example, the names of artists or titles. A second reason is that textual metadata tends to put music into classes or genres, and a search in one genre can limit the discovery of songs from other genres that may be attractive to a listener. Yet another reason is that searching by the content of the music allows searches when textual information is absent, inaccurate, or inconsistent.
  • A company called Muscle Fish LLC in Berkeley, Calif. has developed computer methods for classification, search and retrieval of all kinds of sound recordings. These methods are based on computationally extracting many “parameters” from each sound recording to develop a vector, containing a large number of data points, which characteristically describes or represents the sound. These methods are described in a paper entitled Classification, Search, and Retrieval of Audio by Erling Wold, Thom Blum, Douglas Keislar, and James Wheaton which was published in September 1999 on the Muscle Fish website at Musclefish.com, and in U.S. Pat. No. 5,918,223 to Blum et al entitled “Method and article of manufacture for content-based analysis, storage, retrieval, and segmentation of audio information.”
  • The Blum patent describes how the authors selected a set of parameters that can be computationally derived from any sound recording with no particular emphasis on music. Data for each parameter is gathered over a period of time, such as two seconds. The parameters are well known in the art and can be easily computed. The parameters include variation in loudness over the duration of the recording (which captures beat information as well as other information), variation in fundamental frequency over the duration of the recording (often called “pitch”), variation in average frequency over the duration of the recording (often called “brightness”), and computation over time of a parameter called the mel frequency cepstrum coefficient (MFCC).
  • Mel frequency cepstra are data derived by resampling a uniformly-spaced frequency axis to a mel spacing, which is roughly linear below 1000 Hz and logarithmic above 1000 Hz. Mel cepstra are the most commonly used front-end features in speech recognition systems. While the mel frequency spacing is derived from human perception, no other aspect of cepstral processing is connected with human perception. The processing before taking the mel spacing involves, in one approach, taking a log discrete Fourier transform (DFT) of a frame of data, followed by an inverse DFT. The resulting time domain signal compacts the resonant information close to the t=0 axis and pushes any periodicity out to higher time. For monophonic sounds, such as speech, this approach is effective for pitch tracking, since the resonant and periodic information has little overlap. But for polyphonic signals such as music, this separability would typically not exist.
  • These parameters are chosen not because they correlate closely with human perception, but rather because they are well known and, in computationally extracted form, they distinguish well the different sounds of all kinds with no adaptation to distinguishing different pieces of music. In other words, they are mathematically distinctive parameters, not parameters which are distinctive based on human perception of music. That correlation with human perception is not deemed important by the Blum authors is demonstrated by their discussion of the loudness parameter. When describing the extraction of the loudness parameter, the authors acknowledge that the loudness which is measured mathematically does not correlate with human perception of loudness at high and low frequencies. They comment that the frequency response of the human ear could be modeled if desired, but, for the purposes of their invention, there is no benefit.
  • In the Blum system, a large vector of parameters is generated for a representative sample or each section of each recording. A human will then select many recordings as all comprising a single class as perceived by the human, and the computer system will then derive from these examples appropriate ranges for each parameter to characterize that class of sounds and distinguish it from other classes of sounds in the database. Based on this approach, it is not important that any of the parameters relate to human perception. It is only important that the data within the vectors be capable of distinguishing sounds into classes as classified by humans where music is merely one of the classes.
  • 2.0 SUMMARY OF THE INVENTION
  • The invention builds on the extraction of many parameters from each recording as described by Blum and adds improvements that are particularly adapted to music and to allowing humans to find desired musical recordings within a large database of musical recordings. The invention is a method which performs additional computations with the Blum parameters and other similar parameters to model descriptors of the music based on human perception.
  • When human listeners compare one recording of music to another, they may use many different words to describe the recordings and compare the differences. There will often be little agreement between the humans as to the meanings of the descriptive words, or about the quantity associated with those words. However, for selected descriptive words, there will often be substantial agreement among large numbers of people.
  • For example, descriptive terms which immediately achieve substantial agreement among human listeners include: energy level and good/bad for dancing. Other descriptors of music with significant agreement among listeners include: sadness, happiness, anger, symphonicness, relative amount of singer contribution, melodicness, band size, countryness, metallicness, smoothness, coolness, salsa influence, reggae influence and recording quality. Other descriptors which immediately achieve consensus or which achieve consensus over time will be discovered as the methods achieve widespread implementation.
  • All of the above-mentioned descriptors of music are scalars—that is, they are one dimensional measures rather than multidimensional vectors of data. Scalars are chosen because most people cannot easily describe or even understand the difference between one multidimensional vector describing a recording of music and another multidimensional vector describing another recording of music. People tend to express themselves and think in scalars, such as “this recording is better for dancing than that recording.” Sometimes they may combine many multi dimensional characteristics into a single category and think and express themselves in a scalar of that category. For example “this music is more country than that music.”
  • In the methods of the present invention, rather than using a multidimensional vector with large amounts of data extracted from each musical recording, the originally extracted parameter data are mathematically combined with an algorithm that achieves a high correlation with one of the scalars of human perception. Each algorithm is empirically derived by working with large numbers of recordings and asking a listener, or preferably large numbers of listeners, to place the recordings in relative position compared to each other with respect to a descriptor such as energy level, or any of the above-listed descriptors, or any other descriptor. In other words, the data which is mathematically extracted from the music is further processed based on a computational model of human perception to represent the music as a set of scalars, each of which corresponds as well as possible with human perception.
  • In the invented methods, parameters such as those used by Blum are first calculated, but then, instead of storing the data of these calculated parameters, the data from the parameters is processed through each one of the algorithms to achieve a single number, a scalar, which represents the music with respect to a particular descriptor based on human perception. This collection of scalars is then stored for each musical recording. In other words, instead of working with a first derivative of data extracted from the original music recordings (the parameters), the invented methods work with a second derivative of data which is extracted from the first derivative of data. Processing through these algorithms is “lossy” in the sense that the original parameter data cannot be recreated from the set of scalars, just as the first derivative data computations of the parameters are “lossy” because the original music cannot be recreated from the parameter data.
  • By characterizing each musical recording as a set of scalar descriptors, each of which is based on human perception, the system allows a human to search for a recording where the descriptors fall within certain ranges or which, compared to a sample recording, has a little bit more or a little bit less, or a lot more or a lot less of any identified scalar characteristic.
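  • Both kinds of search, by descriptor range and by "more than a sample recording," can be sketched with a small in-memory database. The recordings, descriptor names, and values below are hypothetical.

```python
def in_ranges(profile, ranges):
    """True when every constrained descriptor lies inside its (low, high) range."""
    return all(lo <= profile[name] <= hi for name, (lo, hi) in ranges.items())

def search_by_ranges(database, ranges):
    return sorted(name for name, p in database.items() if in_ranges(p, ranges))

def more_of(database, sample, descriptor, amount):
    """Recordings with at least `amount` more of one descriptor than the sample."""
    base = database[sample][descriptor]
    return sorted(name for name, p in database.items()
                  if p[descriptor] >= base + amount)

database = {
    "s1": {"energy": 5, "happiness": 6},
    "s2": {"energy": 7, "happiness": 6},
    "s3": {"energy": 9, "happiness": 2},
}
print(search_by_ranges(database, {"energy": (6, 9), "happiness": (5, 9)}))
print(more_of(database, "s1", "energy", 1))
```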
  • Each algorithm for converting the parameter data, which is a first derivative of the music, into descriptor data, which is a second derivative based on the first derivative, can be of any type. The simplest type of algorithm simply applies a multiplied weighting factor to each data point measured for each parameter and adds all of the results to achieve a single number. However, the weightings need not be linear. Each weighting can have a complex function for producing the desired contribution to the resultant scalar. Each weighting can even be generated from a look-up table based on the value of each parameter datum. What is important is that the algorithm is empirically developed to achieve a high correlation with human perception of the selected descriptor. Of course, the algorithms will constantly be improved over time, as will the parameter extraction methods, to achieve better and better empirical correlation with human perception of the descriptors. Whenever improved extraction algorithms or combinatorial algorithms are ready for use, the set of scalars for each recording in the entire database is recalculated.
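  • The three styles of weighting mentioned above (a multiplied factor, a more complex function, and a look-up table) can be sketched in one small combiner. The parameter names, factors, threshold, and table entries are invented for illustration and are not values from the text.

```python
def linear_weight(value, factor=2.0):
    """Simple multiplied weighting factor."""
    return factor * value

def threshold_weight(value, threshold=0.5, factor=3.0):
    """Non-linear weighting: contributes only above a threshold."""
    return factor * value if value > threshold else 0.0

# Look-up table of (upper bound, contribution) pairs, checked in order.
LOOKUP = [(0.25, 0.0), (0.75, 1.0), (float("inf"), 2.5)]

def table_weight(value):
    """Contribution read from a look-up table keyed on the parameter value."""
    for upper_bound, contribution in LOOKUP:
        if value <= upper_bound:
            return contribution

def descriptor(params):
    """Combine the weighted parameters into a single scalar."""
    return (linear_weight(params["tempo"])
            + threshold_weight(params["brightness"])
            + table_weight(params["rhythm_strength"]))

print(descriptor({"tempo": 1.0, "brightness": 0.4, "rhythm_strength": 0.5}))
```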
  • If a user wishes to find a recording which is similar to a recording which has not yet been processed into a set of scalar descriptors, the user provides a representative portion of the recording, or directs the system where to find it. The computer system extracts the parameters, combines them according to the algorithms with the appropriate weightings and develops a set of scalars for the recording which are calculated by the same methods as the scalars for recordings in the database. The new recording can then easily be compared to any recording already in the database.
  • The prior art literature describes many methods for extracting parameters from music. The inventors of these methods often apply labels which correspond somewhat with human perception such as “pitch,” or “brightness.” However, few of them, or none of them, correlate highly with human perception. In fact, there are many competing methods which yield different results for calculating a parameter with the same label such as “pitch,” or “brightness.” Although these competing calculation methods were typically derived in an attempt to roughly model human perception based on theories of sound, for most or all of the parameters, human perception can be better modeled by combining the results of several calculation methods in an algorithm with various weightings. This fact of reality is demonstrated by the well known phenomenon that the human ear requires greater amplification of low frequencies and high frequencies to perceive them as the same loudness compared to mid-range frequencies when played at low volume. It is well known that the best models for human perception of loudness apply a compensating algorithm based on frequency. Of course, loudness is not a meaningful descriptor of music because any musical recording can be played at high volume or low volume and original music can be recorded at high volume or low volume.
  • 3.0 BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 shows the prior art method described by Blum.
  • FIG. 2 describes the method by which Blum would find sounds that sound alike.
  • FIG. 3 is an illustration of the current invention as it is used to create a database of descriptors of music.
  • FIG. 4 illustrates a method for creating and searching a database.
  • FIG. 5 is an example of an interface used to interact with a database.
  • FIG. 6 illustrates how the current invention is used to find music that humans perceive as sounding alike using weighted parameters.
  • FIG. 7 illustrates how the current invention is used to find music that humans perceive as sounding alike using weighted descriptors.
  • FIG. 8 is a method for determining the perceptual salience of a parameter.
  • 4.0 DETAILED DESCRIPTION
  • 4.1 Prior Art
  • The prior art method of categorizing sounds, as described by Blum, is illustrated in FIG. 1. A database of stored sounds 101 is fed to a parameter extractor 102, where the sounds are decomposed using digital signal processing (DSP) techniques that are known in the art to create parameters 103 such as loudness, brightness, fundamental frequency, and cepstrum. These parameters are stored as multi-dimensional vectors in an n-dimensional space of parameters 104.
  • In order to find sounds that are alike, the procedure illustrated in FIG. 2 is used. In Phase 1, a human selects a sound, step 202, from a database of stored sounds 201, that falls into a category of sounds. For example, “laughter.” This is the target sound. The target sound is fed into a parameter extractor 204 to create a number of parameters 205, with a large number of data points over time for each parameter. The parameters from the sound are then stored as a vector in an n-dimensional space 206.
  • It is assumed that all sounds within, or close to, the area describing laughter all sound like laughter. One way of exploring this is shown in Phase 2 of FIG. 2. The parameter values of the target sound are adjusted, step 207. A new sound is played, step 208, which corresponds with the adjusted parameter values. A human listens to the new sound and determines whether or not the new sound is perceptually similar to the target sound, step 209. If it is not, branch 210, the parameters are again adjusted, step 207, until a similar sound is found, or there are no more sounds. If a similar sound is found, then the parameter values of that sound are used to determine an area of similar sounding sounds.
  • The prior art puts sounds into classes. This classification is binary. Either something is like the target class or it is not. There is an assumption that mathematical distance of a parameter vector from the parameter vectors of the target class is related to the perceptual similarity of the sounds. Blum claims that this technique works for cataloging transient sounds. However, for music, it is unlikely that the relationship between two parameter vectors would have any relevance to the perceptual relationship of the music represented by those vectors. This is because the parameters bear insufficient relevance to the human perception of music. Also, music cannot be adequately represented with a binary classification system. A music classification system must account for the fact that a piece of music can belong to one of several classes. For example, something may be country with a rock influence. Also, a classification system must account for the fact that a piece of music may be a better or worse example of a class. Some music may have more or less elements of some descriptive element. For example, a piece of music may have more or less energy, or be more or less country than another piece. A set of scalar descriptors allows for more comprehensive and inclusive searches of music.
  • 4.2 Current Invention, Modeling Descriptors (FIGS. 3 and 8)
  • 4.2.1 Overview of Operation
  • The invention described here is illustrated in FIG. 3. A database of stored music 301 is played to one or more humans, step 302, who rate the music on the amount of one or more descriptors. The same music is fed into a parameter extractor 303, that uses methods known in the art to extract parameters 304 that are relevant to the perception of music, such as tempo, rhythm complexity, rhythm strength, brightness, dynamic range, and harmonicity. Numerous different methods for extracting each of these parameters are known in the art. A model of a descriptor 305 is created by combining the parameters with different weightings for each parameter. The weightings may vary with the value of the parameter. For example, the parameter “brightness” may contribute to a descriptor value only when it is above a threshold or below a threshold or within a range. The model is refined, step 307, by minimizing the difference, calculated in step 306, between the human-derived descriptor value and the machine-derived value.
  • The objective of using human listeners is to create models of their perceptual processes of the descriptors, using a subset of all music, and to apply that model to the categorization of the set of all music.
  • 4.2.2 Limitations of the Models
  • The limit on the goodness of the fit of the model (its predictive power) is determined by, among other things:
      • (1) The variability between the human responses, step 302: High variability means that people do not agree on the descriptors and any model will have poor predictive power.
      • (2) The intra-song variability in the set of all music: High variability within a song means that any one part of the song will be a poor representation of any other part of the same song. This will impede the listeners' task of judging a representative descriptor for that song, resulting in less accurate data and a less accurate model.
      • (3) How well the subset represents the set: The ability to create a model depends on the existence of patterns of parameters in similar songs. If the subset on which the model is based does not represent the patterns that are present in the set, then the model will be incorrect.
      • We can improve the performance of the model by:
        • (a) Choosing descriptors for which there is low inter-rater variability. Our criterion for selecting descriptors is a minimum correlation coefficient (r) of 0.5 between the mean from at least 5 human raters, and the individual scores from those raters.
        • (b) Applying the technique to music which has low intra-song variability (e.g. some Classical music has high variability within a song). One method for determining intra-song variability is to extract parameters from a series of short contiguous samples of the music. If more than half of the parameters have a standard deviation greater than twice the mean, then the song is classified as having high intra-song variability. Similarly, if more than half of the mean parameters of a song lie more than 3 standard deviations from the mean of the population, that song is classified as having high intra-song variability. This excludes less than 1% of all songs.
        • (c) Using statistical sampling techniques known in the art (for example, for political polling) to ensure that the subset represents the set.
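  • The first intra-song variability test in (b) can be sketched as follows. The choice of population standard deviation, the use of the absolute value of the mean, and the sample data are assumptions made for the example.

```python
import statistics

def high_intra_song_variability(samples):
    """samples: one parameter vector per short contiguous segment of a song.

    Returns True when more than half of the parameters have a standard
    deviation greater than twice their mean across the segments.
    """
    columns = list(zip(*samples))
    flagged = 0
    for values in columns:
        mean = statistics.fmean(values)
        spread = statistics.pstdev(values)
        if spread > 2 * abs(mean):
            flagged += 1
    return flagged > len(columns) / 2

# Hypothetical segments: three parameters per segment.
steady = [(1.0, 2.0, 3.0), (1.1, 2.1, 2.9), (0.9, 1.9, 3.1)]
erratic = [(-1.0, 5.0, 1.0), (1.0, -5.0, 1.0), (-1.0, 5.0, 1.0), (1.2, -4.0, 1.0)]

print(high_intra_song_variability(steady), high_intra_song_variability(erratic))
```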
    4.2.3 Collecting Human Data
  • The preferred method for collecting human data is to use a panel of ear-pickers who listen to each song and ascertain the quantity of each of several descriptors, using a Likert scale, which is well known in the art. For example, nine ear-pickers are given a rating scale with a list of descriptors, as shown below:
  • Rating scales (each item is scored from 1 to 9; the anchor labels are listed from the low end of the scale to the high end)
    1. Energy: How much does the song make you want to move or sing?
       1 2 3 4 5 6 7 8 9
       No movement, very light energy / Little movement, light energy / Some movement, medium energy / A lot of movement, high energy / Very much movement, very high energy
    2. Rhythm salience: What is the relative contribution of the rhythm to the overall sound of the song?
       1 2 3 4 5 6 7 8 9
       No rhythmic component / Very little rhythm / Moderate rhythm / A lot of rhythm / The song is all rhythm
    3. Melodic salience: What is the relative contribution of the melody (lead singer/lead instrument) to the overall sound of the song?
       1 2 3 4 5 6 7 8 9
       No melody / Melody is not too important / Melody is moderately important / Melody is quite important / The song is all melody
    4. Tempo: Is the song slow or fast?
       1 2 3 4 5 6 7 8 9
       Extremely slow / Pretty slow / Moderate tempo / Pretty fast / Extremely fast
    5. Density: How dense is the song? How many sounds per second?
       1 2 3 4 5 6 7 8 9
       Not at all dense / A little dense / Moderate density / Fairly dense / Extremely dense
    6. Mood (happiness): What is the overall mood or emotional valence of the song?
       1 2 3 4 5 6 7 8 9
       Extremely sad / Pretty sad / Neither happy nor sad / Fairly happy / Extremely happy
  • These data are analyzed, and only those descriptors for which there is high inter-rater agreement (low variability) are used in the development of the system. For example, the correlations between the mean ratings and the individual ratings for 7 descriptors are shown in the table below. All of the mean correlation values are above 0.5 and the descriptors anger, danceability, density, energy, happiness, melodic salience, and rhythmic salience are all acceptable for use within the present system.
  • Pearson correlation of the mean subject rating with each subject's rating
    Subject: ANGER, DANCE, DENSITY, ENERGY, HAPPY, MELODIC SALIENCE, RHYTHM SALIENCE
    101: 0.76 0.74 0.68 0.88 0.83 0.67 0.73
    102: 0.73 0.74 0.83 0.86 0.67 0.82
    103: 0.65 0.68 0.85 0.77 0.72 0.30
    104: 0.56 0.40 0.71 0.77 0.66 0.72 0.63
    105: 0.75 0.74 0.69 0.85 0.72 0.14 0.37
    106: 0.70 0.88 0.75 0.88 0.83 0.39 0.87
    107: 0.69 0.84 0.71 0.81 0.59 0.57 0.74
    108: 0.63 0.89 0.76 0.89 0.83 0.57 0.86
    201: 0.82 0.85 0.86 0.81 0.73 0.44 0.83
    Mean correlation values: 0.70 0.76 0.73 0.84 0.76 0.54 0.68
    Standard error of mean: 0.03 0.06 0.02 0.01 0.03 0.06 0.07
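  • The acceptance test for a descriptor can be sketched as follows: correlate each rater's scores with the panel mean, average the correlations, and require at least 0.5 from at least 5 raters. The rating data are invented, and the helper names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation; undefined if either list is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def mean_rater_correlation(ratings):
    """ratings[i][j] is rater i's score for song j."""
    song_means = [sum(col) / len(col) for col in zip(*ratings)]
    rs = [pearson_r(rater, song_means) for rater in ratings]
    return sum(rs) / len(rs)

def descriptor_is_usable(ratings, min_r=0.5, min_raters=5):
    return (len(ratings) >= min_raters
            and mean_rater_correlation(ratings) >= min_r)

# Hypothetical 1-9 ratings from five raters for five songs.
agreed = [[1, 3, 5, 7, 9], [2, 3, 5, 8, 9], [1, 4, 5, 6, 9],
          [2, 2, 6, 7, 8], [1, 3, 4, 7, 9]]
disputed = [[1, 9, 1, 9, 1], [9, 1, 9, 1, 9], [1, 9, 1, 9, 1],
            [9, 1, 9, 1, 9], [5, 4, 6, 5, 5]]

print(descriptor_is_usable(agreed), descriptor_is_usable(disputed))
```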
  • Another method uses music with known quantities of some descriptor as defined by the purpose to which it is put by the music-buying public, or by music critics. Our technique rank orders the recommended songs by their quantity of the descriptor, either using the values that come with the music, or using our panel of ear-pickers. For example, www.jamaicans.com/eddyedwards features music with Caribbean rhythms that are high energy. Amazon.Com features a mood matcher in which music critics have categorized music according to its uses. For example, for line dancing they recommend the following:
  • Wreck Your Life by Old 97's
  • The Best Of Billy Ray Cyrus by Billy Ray Cyrus
  • Guitars, Cadillacs, Etc., Etc. by Dwight Yoakam
  • American Legends Best Of The Early Years by Hank Williams
  • Vol. 1-Hot Country Hits by Mcdaniel, et al
  • Another method uses professional programmers who create descriptors to create programs of music for play in public locations. For example, one company has discovered by trial and error the type of music they need to play at different times of the day to energize or pacify the customers of locations for which they provide the ambient music. They use the descriptor “energy,” rated on a scale of 1 (low) to 5 (high). We used 5 songs at each of 5 energy levels and in 5 genres (125 songs total), extracted the parameters and created a model of the descriptor “energy” on the basis of the company's human-derived energy values. We then applied that model to 125 different songs and found an 88% match between the values of our machine-derived descriptors and the values from the human-derived descriptors.
  • 4.2.5 Modeling
  • The preferred method for representing each descriptor uses generalized linear models, known in the art (e.g. McCullagh and Nelder (1989) Generalized Linear Models, Chapman and Hall). For example, the preferred model of “energy” uses linear regression, and looks like this:

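  • $\mathrm{Energy} = \beta_0 + \beta_1 \cdot \mathrm{Harmonicity} + \beta_2 \cdot \mathrm{DynamicRange} + \beta_3 \cdot \mathrm{Loudness} + \beta_4 \cdot \mathrm{RhythmComplexity} + \beta_5 \cdot \mathrm{RhythmStrength}$  (1)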
  • The preferred weighting values are:
  • β0=4.92
  • β1=−1.12
  • β2=−45.09
  • β3=−7.84
  • β4=0.016
  • β5=0.001
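  • As an illustration, equation (1) with the preferred weights listed above can be evaluated as in the following Python sketch. The function name, the dictionary layout and the sample parameter values are assumptions made for the example; the text does not state the units or scaling of each extracted parameter.

```python
# Preferred weights for equation (1), as listed in the text.
ENERGY_INTERCEPT = 4.92  # beta_0
ENERGY_WEIGHTS = {
    "harmonicity": -1.12,        # beta_1
    "dynamic_range": -45.09,     # beta_2
    "loudness": -7.84,           # beta_3
    "rhythm_complexity": 0.016,  # beta_4
    "rhythm_strength": 0.001,    # beta_5
}

def energy(params):
    """Evaluate the linear energy model for one song's extracted parameters."""
    return ENERGY_INTERCEPT + sum(
        weight * params[name] for name, weight in ENERGY_WEIGHTS.items()
    )

# Hypothetical parameter values, chosen only to exercise the formula.
example = {
    "harmonicity": 0.5,
    "dynamic_range": 0.02,
    "loudness": 0.1,
    "rhythm_complexity": 8,
    "rhythm_strength": 100,
}
print(round(energy(example), 4))
```

  • With these weights, the dynamic range and loudness terms dominate for parameters of comparable magnitude, so the result is sensitive to how those two parameters are scaled.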
  • The preferred descriptor model for “happiness” is:

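  • $\mathrm{Happiness} = \beta_0 + \beta_1 \cdot \mathrm{Articulation} + \beta_2 \cdot \mathrm{Attack} + \beta_3 \cdot \mathrm{NoteDuration} + \beta_4 \cdot \mathrm{Tempo} + \beta_5 \cdot \mathrm{DynamicRangeLow} + \beta_6 \cdot \mathrm{DynamicRangeHigh} + \beta_7 \cdot \mathrm{SoundSalience} + \beta_8(\mathrm{Key})$  (2)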
  • The preferred weighting values are:
  • β0=6.51
  • β1=−4.14
  • β2=8.64
  • β3=−15.84
  • β4=14.73
  • β5=6.1
  • β6=−8.7
  • β7=11.00
  • β8=0 if no key; 10 if minor keys; 20 if major keys
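  • The happiness model differs from the energy model in that its last term is a lookup on the detected key rather than a product. A minimal Python sketch of equation (2) follows; the names and the representation of the key as None, "minor" or "major" are assumptions made for the example.

```python
# Preferred weights for equation (2), as listed in the text.
HAPPINESS_INTERCEPT = 6.51  # beta_0
HAPPINESS_WEIGHTS = {
    "articulation": -4.14,        # beta_1
    "attack": 8.64,               # beta_2
    "note_duration": -15.84,      # beta_3
    "tempo": 14.73,               # beta_4
    "dynamic_range_low": 6.1,     # beta_5
    "dynamic_range_high": -8.7,   # beta_6
    "sound_salience": 11.00,      # beta_7
}
# beta_8 is a function of the key: 0 if no key, 10 if minor, 20 if major.
KEY_TERM = {None: 0.0, "minor": 10.0, "major": 20.0}

def happiness(params, key):
    """Evaluate the happiness model; `key` is None, "minor" or "major"."""
    linear = sum(w * params[name] for name, w in HAPPINESS_WEIGHTS.items())
    return HAPPINESS_INTERCEPT + linear + KEY_TERM[key]

# With all parameters at zero, only the intercept and the key term remain.
zeros = {name: 0.0 for name in HAPPINESS_WEIGHTS}
print(happiness(zeros, None), happiness(zeros, "minor"), happiness(zeros, "major"))
```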
  • It is likely possible to improve each model by adjusting the weighting values β0 to βn so that they vary with the input value of the parameter, by using different extraction methods for one or more parameters, or by adding other parameters to the parameter extraction step.
  • Another method of optimizing a descriptor model involves using nonlinear models. For example:

  • $\mathrm{Energy} = F\big(\beta_1 \cdot \mathrm{tempo}\,(\beta_2 \cdot \mathrm{tempo} + \beta_3 \cdot \mathrm{sound\ salience})\big)$  (3)
      • where F is the cumulative normal distribution:

  • $F(x) = \dfrac{1}{\sqrt{2\pi\sigma^2}} \displaystyle\int_{-\infty}^{x} e^{-(t-\mu)^2 / (2\sigma^2)} \, dt$
      • σ = the standard deviation of x (σ² is the variance)
      • μ = the mean of x
        The values of βn are set to 1.0. Other values will be substituted as we develop the process.
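  • The cumulative normal distribution can be computed with the error function. The Python sketch below shows one possible reading of equation (3), in which the argument of F is β1·tempo·(β2·tempo + β3·sound salience) and μ and σ are estimated from the arguments of the songs being scored. That reading, the function names and the sample values are assumptions; the text does not spell out the grouping of terms.

```python
import math
from statistics import mean, pstdev

def normal_cdf(x, mu, sigma):
    """Cumulative normal distribution F(x) with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def nonlinear_energy(songs, b1=1.0, b2=1.0, b3=1.0):
    """songs: list of (tempo, sound_salience) pairs. Returns one value in (0, 1) per song.

    Assumes at least two songs with differing arguments, so that sigma is nonzero.
    """
    xs = [b1 * tempo * (b2 * tempo + b3 * salience) for tempo, salience in songs]
    mu, sigma = mean(xs), pstdev(xs)
    return [normal_cdf(x, mu, sigma) for x in xs]

# Hypothetical (tempo in beats per minute, sound salience) pairs.
print(nonlinear_energy([(60, 0.2), (120, 0.5), (180, 0.8)]))
```

  • Because F is bounded between 0 and 1, a model of this kind saturates at both extremes, which a linear model cannot do.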
  • Yet another method involves using heuristics. For example, if the beats per minute value of a song is less than 60, then the energy cannot be more than some predetermined value.
  • The output from each descriptor model is a machine-derived descriptor 308. Several such descriptors for each song are stored in a database 309. The presently preferred descriptors for use in the preferred system are:
  • Energy
  • Tempo
  • Mood (happiness)
  • Mood (anger)
  • Danceability
  • Once the models of the descriptors have been created, using a subset of all available music 301, they are applied to the classification of other music 311. Tests are conducted to determine the fit between the descriptor models derived from the subset of music and the other music, by substituting some of the other music 311 into the process beginning with the parameter extractor 303. As new music becomes available, a subset of the new music is tested against the models by placing it into the process. Any adjustments of the models are applied to all of the previously processed music, either by reprocessing the music, or, preferably, reprocessing the parameters originally derived from the music. From time to time, a subset of new music is placed into the process at step 301. Thus, any changes in the tastes of human observers, or in styles of music can be measured and accommodated in the descriptor models.
  • 4.2.4 Parameter Extractors
  • The following parameter extraction methods are preferred:
  • Harmonicity: Harmonicity is related to the number of peaks in the frequency domain which are an Integer Multiple of the Fundamental (IMF) frequency. The harmonicity value is expressed as a ratio of the number of computed IMFs to a maximum IMF value (specified to be four). Harmonicity values H are computed for time windows of length equal to one second for a total of 20 seconds. Mean and standard deviation values are additional parameters taken over the vector H.
  • Loudness: Loudness is defined to be the root mean square value of the song signal. Loudness values L were computed for time windows of length equal to one second for a total of 20 seconds.
  • Dynamic Range: Standard deviation of loudness for 20 values, calculated 1/second for 20 seconds.
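  • The loudness and dynamic range extractors can be sketched in Python as follows. The function names are hypothetical, the signal is assumed to be a list of samples, and the population standard deviation is used because the text does not say whether a sample or population estimate is intended.

```python
import math
from statistics import pstdev

def loudness_per_window(samples, sample_rate, window_s=1.0, total_s=20.0):
    """Root mean square value of each consecutive window of the signal."""
    win = int(round(window_s * sample_rate))
    n_windows = int(round(total_s / window_s))
    values = []
    for k in range(n_windows):
        chunk = samples[k * win:(k + 1) * win]
        if len(chunk) < win:
            break  # the signal is shorter than the requested analysis length
        values.append(math.sqrt(sum(s * s for s in chunk) / win))
    return values

def dynamic_range(loudness_values):
    """Standard deviation of the per-window loudness values."""
    return pstdev(loudness_values)

# Synthetic 20-second signal at a toy sample rate of 100 Hz:
# full-scale in even-numbered seconds, silent in odd-numbered seconds.
signal = [1.0 if (i // 100) % 2 == 0 else 0.0 for i in range(2000)]
loudness = loudness_per_window(signal, sample_rate=100)
print(len(loudness), dynamic_range(loudness))
```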
  • Rhythm Strength: Rhythm strength is calculated in the same process used to extract tempo. First, a short-time Fourier transform spectrogram of the song is performed, using a window size of 92.8 ms (Hanning windowed), and a frequency resolution of 10.77 Hz. For each frequency bin in the range of 0-500 Hz, an onset track is formed by computing the first difference of the time history of amplitude in that bin. Large positive values in the onset track for a certain frequency bin correspond to rapid onsets in amplitude at that frequency. Negative values (corresponding to decreases in amplitude) in the onset tracks are truncated to zero, since the onsets are deemed to be most important in determining the temporal locations of beats. A correlogram of the onset tracks is then computed by calculating the unbiased autocorrelation of each onset track. The frequency bins are sorted in decreasing order based on the activity in the correlation function, and the twenty most active correlation functions are further analyzed to extract tempo information.
  • Each of the selected correlation functions is analyzed using a peak detection algorithm and a robust peak separation method in order to determine the time lag between onsets in the amplitude of the corresponding frequency bin. If a lag can be identified with reasonable confidence, and if the value lies between 222 ms and 2 seconds, then a rhythmic component has been detected in that frequency bin. The lags of all of the detected components are then resolved to a single lag value by means of a weighted greatest common divisor algorithm, where the weighting is dependent on the total energy in that frequency bin, the activity in the correlation function for that frequency bin, and the degree of confidence achieved by the peak detection and peak separation algorithms for that frequency bin. The tempo of the song is set to be the inverse of the resolved lag.
  • The rhythm strength is the sum of the activity levels of the 20 most active correlation functions, normalized by the total energy in the song. The activity level of each correlation function is defined as the sum-of-squares of the negative elements of second difference of that function. It is a measure of how strong and how repetitive the beat onsets are in that frequency bin.
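  • The core of the rhythm strength calculation (onset tracks, unbiased autocorrelation and activity level) can be sketched in Python as below. The sketch assumes that the spectrogram has already been computed and that each frequency bin is given as a list of amplitudes over time; the function names and the toy inputs are hypothetical, and the tempo-resolution steps (peak detection and the weighted greatest common divisor) are not shown.

```python
def onset_track(amplitudes):
    """First difference of one bin's amplitude history; decreases are truncated to zero."""
    return [max(b - a, 0.0) for a, b in zip(amplitudes, amplitudes[1:])]

def unbiased_autocorrelation(x, max_lag):
    """r[k] = sum over t of x[t] * x[t + k], divided by (N - k), for k = 0..max_lag."""
    n = len(x)
    return [sum(x[t] * x[t + k] for t in range(n - k)) / (n - k)
            for k in range(max_lag + 1)]

def activity(corr):
    """Sum of squares of the negative elements of the second difference."""
    second = [corr[i - 1] - 2.0 * corr[i] + corr[i + 1]
              for i in range(1, len(corr) - 1)]
    return sum(d * d for d in second if d < 0)

def rhythm_strength(bin_amplitudes, total_energy, max_lag, top_n=20):
    """Sum of the top_n activity levels, normalized by the total energy of the song."""
    levels = sorted(
        (activity(unbiased_autocorrelation(onset_track(a), max_lag))
         for a in bin_amplitudes),
        reverse=True,
    )
    return sum(levels[:top_n]) / total_energy

# Toy input: one bin with a pulse every four frames, one bin with constant amplitude.
pulsed = [0.0, 1.0, 0.0, 0.0] * 8
flat = [0.5] * 32
print(rhythm_strength([pulsed, flat], total_energy=8.0, max_lag=8))
```

  • The constant-amplitude bin has no onsets and contributes zero activity, whereas the pulsed bin produces a sharp autocorrelation peak at a lag of four frames and therefore a large negative second difference.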
  • Rhythm Complexity: The number of rhythmic events per measure. A measure prototype is created by dividing the onset tracks into segments whose length corresponds to the length of one measure. These segments are then summed together to create a single, average onset track for one measure of the song; this is the measure prototype. The rhythm complexity is calculated as the number of distinct peaks in the measure prototype.
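  • A minimal Python sketch of the measure prototype and peak count follows. It assumes a single onset track and a known measure length in frames. The definition of a "distinct peak" used here (a local maximum above 10% of the largest value, with the measure treated as circular) is an assumption, since the text does not define it.

```python
def measure_prototype(onset, measure_len):
    """Average the onset track over consecutive measure-length segments."""
    n_measures = len(onset) // measure_len
    if n_measures == 0:
        raise ValueError("onset track is shorter than one measure")
    proto = [0.0] * measure_len
    for m in range(n_measures):
        for i in range(measure_len):
            proto[i] += onset[m * measure_len + i]
    return [v / n_measures for v in proto]

def rhythm_complexity(proto, rel_threshold=0.1):
    """Count local maxima of the prototype that exceed rel_threshold * max(proto)."""
    floor = rel_threshold * max(proto)
    n = len(proto)
    return sum(
        1 for i in range(n)
        if proto[i] > floor
        and proto[i] > proto[i - 1]          # index -1 wraps to the last frame
        and proto[i] > proto[(i + 1) % n]
    )

# Toy onset track: three measures of four frames, strong onset on frame 0, weak on frame 2.
onset = [1.0, 0.0, 0.5, 0.0] * 3
proto = measure_prototype(onset, measure_len=4)
print(proto, rhythm_complexity(proto))
```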
  • Articulation: The ratio of note length (L), i.e., the duration from note start to note end, to note spacing (S), i.e., the duration from one note start to the next note start. An L:S ratio close to 1.00 reflects legato articulation. Ratios less than 1.00 are considered staccato.
  • Attack: The speed at which a note achieves its half-peak amplitude.
  • Note Duration: Pitch is extracted by using a peak separation algorithm to find the separation in peaks in the autocorrelation of the frequency domain of the song signal. The peak separation algorithm uses a windowed threshold peak detection algorithm which uses a sliding window and finds the location of the maximum value in each of the peak-containing-regions in every window. A peak-containing-region is defined as a compact set of points where all points in the set are above the specified threshold, and the surrounding points are below the threshold. A confidence measure in the pitch is returned as well; confidence is equal to harmonicity. Pitch values P are computed for time windows of length equal to 0.1 second. Changes of less than 10 Hz are considered to be one note.
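  • The threshold peak detection described above can be sketched for a single window as follows. The function names are hypothetical. Treating a region that runs to the end of the window as a peak-containing-region, and reporting separations as differences between successive peak locations, are assumptions made for the example.

```python
def peak_locations(values, threshold):
    """Return the index of the maximum in each run of points above the threshold."""
    peaks = []
    start = None  # index where the current above-threshold run began
    # A trailing sentinel closes a run that reaches the end of the window.
    for i, v in enumerate(list(values) + [float("-inf")]):
        if v > threshold:
            if start is None:
                start = i
        elif start is not None:
            peaks.append(max(range(start, i), key=lambda j: values[j]))
            start = None
    return peaks

def peak_separations(peaks):
    """Distances between successive peak locations."""
    return [b - a for a, b in zip(peaks, peaks[1:])]

# Toy autocorrelation-like window with three regions above the threshold.
window = [0, 1, 3, 1, 0, 0, 2, 5, 2, 0, 0, 4, 0]
locations = peak_locations(window, threshold=0.5)
print(locations, peak_separations(locations))
```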
  • Tempo: The tempo extraction technique is described in an earlier section on rhythm strength.
  • Dynamic Range Low: Standard deviation of loudness calculated over a duration of 10 seconds.
  • Dynamic Range High: Standard deviation of loudness calculated over a duration of 0.1 seconds.
  • Sound Salience: Uses a modified version of the rhythm extraction algorithm, in which spectral events without rapid onsets are identified.
  • Key: Determines the key by the distribution of the notes identified with the note extractor.
  • 4.2.6 Optimizing the Parameter Extractors
  • Another method for optimizing the fit between the human-derived descriptor value and the machine-derived value is to adjust the extractor performance, step 310. This can be accomplished by using extractors in which some or all of the internal values, for example the sample duration, or the upper or lower bounds of the frequencies being analyzed, are adjusted. The adjustment is accomplished by having an iterative program repeat the extraction and adjust the values until the error, step 306, between the human values and the machine values is minimized.
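  • The iterative adjustment can be illustrated with a simple exhaustive search over candidate settings. In the Python sketch below, `extract`, `model` and the candidate list are placeholders supplied by the caller, and the mean squared error is an assumed error measure; the text does not specify the search strategy or the error function.

```python
def tune_extractor(extract, model, songs, human_values, candidates):
    """Try each candidate internal value and keep the one with the smallest error.

    extract(song, setting) returns a parameter value for one song.
    model(parameter) returns the machine-derived descriptor value.
    """
    best_setting, best_error = None, float("inf")
    for setting in candidates:
        predicted = [model(extract(song, setting)) for song in songs]
        error = sum((p - h) ** 2 for p, h in zip(predicted, human_values)) / len(songs)
        if error < best_error:
            best_setting, best_error = setting, error
    return best_setting, best_error

# Toy stand-ins: the "extractor" scales a number by the setting, the "model" is the identity.
best = tune_extractor(
    extract=lambda song, setting: song * setting,
    model=lambda parameter: parameter,
    songs=[1.0, 2.0, 3.0],
    human_values=[2.0, 4.0, 6.0],
    candidates=[1.0, 2.0, 3.0],
)
print(best)
```

  • For extractors with several internal values, the same loop can be run over combinations of settings, although the number of combinations grows quickly.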
  • It is important that the parameters extracted from the music have some perceptual salience. This is tested using the technique illustrated in FIG. 8. The parameter extractor 801 is tested, step 802, by visually or audibly displaying the parameter value while concurrently playing the music from which it was extracted. For example, a time history of the harmonicity values of a song, sampled every 100 ms, is displayed on a screen, with a moving cursor line. The computer plays the music and moves the cursor line so that the position of the cursor line on the x-axis is synchronized with the music. If the listener or listeners can perceive the correct connection between the parameter value and the changes in the music then the parameter extractor is considered for use in further model development, branch 803.
  • If it is not perceived, or not perceived correctly, then that extractor is rejected or subjected to further improvement, branch 804.
  • 4.3 Interacting with the Database (FIGS. 4 and 5)
  • Once the database of descriptors 309 has been created, it can be combined with other meta data about the music, such as the name of the artist, the name of the song, the date of recording etc. One method for searching the database is illustrated in FIG. 4. A large set of music 401, representing “all music” is sent to the parameter extractor 402. The descriptors are then combined with other meta data and pointers to the location of the music, for example URLs of the places where the music can be purchased, to create a database 405. A user can interrogate the database 405, and receive the results of that interrogation using an interface 406.
  • An example of an interface is illustrated in FIG. 5. A user can type in textual queries using the text box 504, for example: “Song title: Samba pa ti, Artist: Santana.” The user then submits the query by pressing the sort by similarity button 503. The song “Samba pa ti” becomes the target song and appears in the target box 501. The computer searches the n-dimensional database 405 of vectors made up of the song descriptors, looking for the smallest vector distances between the target song and other songs. These songs are listed in a hit list box 502 in order of increasing vector distance (decreasing similarity). An indication of the vector distance between each hit song and the target song is shown beside each hit song, expressed as a percent, with 100% being the same song. For example, the top of the list may be “Girl from Ipanema by Stan Getz, 85%.”
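  • The vector-distance search can be sketched in Python as follows. The Euclidean distance and the conversion of distance to a percentage (100% at zero distance, 0% at the largest possible distance) are assumptions, since the text does not give either formula; the song titles and descriptor values are hypothetical. `math.dist` requires Python 3.8 or later.

```python
import math

def rank_by_similarity(target, database, max_distance):
    """Return (title, percent) pairs, most similar first.

    database maps each title to a descriptor vector of the same length as target.
    """
    hits = []
    for title, vector in database.items():
        distance = math.dist(target, vector)
        hits.append((title, 100.0 * (1.0 - distance / max_distance)))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Hypothetical (energy, danceability, anger) vectors on a 1-to-9 scale.
database = {
    "song A": (5, 6, 7),
    "song B": (5, 6, 9),
    "song C": (9, 6, 7),
}
largest = math.dist((1, 1, 1), (9, 9, 9))  # farthest two songs can be on this scale
for title, percent in rank_by_similarity((5, 6, 7), database, largest):
    print(f"{title}: {percent:.0f}%")
```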
  • Another type of query allows a user to arrange the songs by profile. First, the user presses the sort by number button 509, which lists all of the songs in the database in numerical order. The user can scroll through the songs using a scroll bar 506. They can select a song by clicking on it, and play the song by double clicking on it. Any selected song is profiled by the slider bars 505. These show the scalar values of each of several descriptors. The current preferred method uses energy, danceability and anger. Pressing the sort by profile button 510 places the highlighted song in the target box 501 and lists songs in the hit list box 502 that have the closest values to the values of the target song.
  • Yet another type of query allows the user to sort by similarity plus profile. First a target song is chosen and the songs in the hit list box 502 are listed by similarity. Then the user performs a search of this subset of songs by using the slider bars 505. For example, the slider values of the original target song are 5, 6 and 7 for energy, danceability and anger respectively. The user increases the energy value from 5 to 9 and presses the search similar songs button 507. The profile of the target song remains as a ghosted image 508 on the energy slider. The computer searches the subset of songs for a song with values of 9, 6 and 7. Songs with that profile are arranged in decreasing order of similarity. Songs without that profile are appended, arranged by decreasing similarity. The target song remains in the target box 501. A user can choose a new target song by clicking on a song in the hit list box 502.
  • 4.4 Modeling Likeness with Parameters (FIG. 6)
  • FIG. 6 illustrates how the invention is used to find music that sounds to human listeners like any musical composition selected by a user. The processes in FIG. 3 are repeated to create a set of parameters 604. These are used to create a model of likeness 605 by a process described below.
  • 4.4.1 Collecting Human Data
  • One or more humans listen to pairs of songs and judge their similarity on a scale, for example from 1 to 9, where 1 is very dissimilar, and 9 is very similar, step 602.
  • The objective of using human listeners is to create a model of their perceptual process of “likeness”, using a subset of all music, and to apply that model to the set of all music.
  • The preferred method for collecting human data is to use a panel of ear-pickers who listen to pairs of songs and score their similarity, using a Likert scale, which is well known in the art.
  • Another method is to have people visit a web site on which they can listen to pairs of songs and make similarity judgments. These judgments could be on a scale of 1 to 9, or could be yes/no. It is possible to estimate a scalar value of similarity based on a large number of binary judgments using statistical techniques known in the art.
  • 4.4.2 Creating the Likeness Model
  • The objective of the model is to predict these perceived differences using the extracted parameters of music. To build the model, a list of numbers is calculated for the comparison of each song to each other song. The list of numbers consists of a value for each parameter where the value is the difference between the parameter value for a first song and the value of the same parameter for a second song. When the model is used to compare one song to another for likeness, the list of parameter differences between the two songs is calculated and these differences are the inputs to the model. The model then yields a number that predicts the likeness that people would judge for the same two songs. The model processes the list of difference values by applying weights and heuristics to each value.
  • The preferred method of creating a model of likeness is to sum the weighted parameter values. For example, for three parameters, A, B and C, the following steps are used for songs 1 and 2:
  • STEP 1: subtract the parameters of each song

  • $A_1 - A_2,\quad B_1 - B_2,\quad C_1 - C_2$
  • STEP 2: calculate the absolute differences. Our preferred method uses a value of n=2, but other values can be used.

  • $\sqrt[n]{(A_1 - A_2)^n},\quad \sqrt[n]{(B_1 - B_2)^n},\quad \sqrt[n]{(C_1 - C_2)^n}$
  • STEP 3: weight and sum the differences. The values of the weights (β1 to βn) are determined by linear regression, as explained below.

  • $\mathrm{Likeness} = \beta_1 \cdot A_{\mathrm{difference}} + \beta_2 \cdot B_{\mathrm{difference}} + \beta_3 \cdot C_{\mathrm{difference}}$
  • The values of the weights are determined by a process of linear regression, step 608, which seeks to minimize the difference, step 606, between the human-derived likeness values, step 602, and the output from the model, 605. The preferred model of likeness is:

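  • $\mathrm{Likeness} = \beta_0 + \beta_1 \cdot \text{mean loudness} + \beta_2 \cdot \text{rhythm strength} + \beta_3 \cdot \text{tempo} + \beta_4 \cdot \text{dynamic range} + \beta_5 \cdot \text{mean brightness} + \beta_6 \cdot \text{mean harmonicity} + \beta_7 \cdot \text{rhythm complexity} + \beta_8 \cdot \text{standard deviation brightness} + \beta_9 \cdot \text{standard deviation harmonicity}$  (4)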
  • Where
      • β0=0
      • β1=−0.108
      • β2=−0.225
      • β3=−0.127
      • β4=−0.015
      • β5=−0.296
      • β6=−0.223
      • β7=−0.122
      • β8=0.277
      • β9=−0.074
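  • Steps 1 to 3 and the weights of equation (4) can be combined as in the Python sketch below. The names are hypothetical, and the sketch assumes that each term of equation (4) is applied to the absolute difference of the corresponding parameter between the two songs, as in the three-step procedure; the text does not state how the parameters are scaled before differencing.

```python
# Weights of equation (4), applied to per-parameter differences (beta_0 = 0).
LIKENESS_WEIGHTS = {
    "mean_loudness": -0.108,
    "rhythm_strength": -0.225,
    "tempo": -0.127,
    "dynamic_range": -0.015,
    "mean_brightness": -0.296,
    "mean_harmonicity": -0.223,
    "rhythm_complexity": -0.122,
    "std_brightness": 0.277,
    "std_harmonicity": -0.074,
}

def nth_root_difference(a, b, n=2):
    """Steps 1 and 2: the n-th root of the n-th power of the difference."""
    return (abs(a - b) ** n) ** (1.0 / n)

def likeness(song1, song2, weights=LIKENESS_WEIGHTS, intercept=0.0, n=2):
    """Step 3: weight and sum the parameter differences."""
    return intercept + sum(
        w * nth_root_difference(song1[name], song2[name], n)
        for name, w in weights.items()
    )

# Two hypothetical songs that differ only in tempo, by one unit.
base = {name: 0.5 for name in LIKENESS_WEIGHTS}
faster = dict(base, tempo=1.5)
print(likeness(base, base), likeness(base, faster))
```

  • Because most of the weights are negative, identical songs score 0 and larger parameter differences generally push the score lower.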
  • Another method for deriving likeness is to calculate the correlation coefficients (r) of the parameter values between each pair of songs in the database, and to create a matrix of similarity for the songs, with high correlation equating to high similarity. The parameter values are normalized to ensure that they all have the same range. Song 1 provides the parameters for the x values, and song 2 provides the parameters for the y values in the following formula:

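  • $r = \dfrac{\sum_i (x_i - x_{\mathrm{mean}})(y_i - y_{\mathrm{mean}})}{\sqrt{\left[\sum_i (x_i - x_{\mathrm{mean}})^2\right]\left[\sum_i (y_i - y_{\mathrm{mean}})^2\right]}}$
      • where $x_{\mathrm{mean}}$ and $y_{\mathrm{mean}}$ are the means of the normalized parameter values for songs 1 and 2, respectively.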
      • The rationale behind using correlation coefficients is that if the parameters of two songs have a high positive correlation which is statistically significant then the two songs will be judged to be alike.
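  • The correlation coefficient between two songs can be computed directly from their normalized parameter vectors, as in this short Python sketch. The function name and the sample vectors are hypothetical, and the test of statistical significance mentioned above is not shown.

```python
import math

def correlation(x, y):
    """Correlation coefficient r between two equal-length parameter vectors."""
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    numerator = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    denominator = math.sqrt(
        sum((a - x_mean) ** 2 for a in x) * sum((b - y_mean) ** 2 for b in y)
    )
    return numerator / denominator

# Hypothetical normalized parameter values for three songs.
song1 = [0.1, 0.5, 0.9, 0.3]
song2 = [0.2, 0.6, 1.0, 0.4]   # same shape as song1, shifted
song3 = [0.9, 0.5, 0.1, 0.7]   # mirror image of song1
print(correlation(song1, song2), correlation(song1, song3))
```

  • Note that r is undefined when either vector is constant, because the denominator is then zero.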
  • 4.4.3 Organizing and Storing the Likeness Data
  • The preferred method of storing and organizing the parameter differences data is as a multidimensional vector in a multidimensional database 607. The resulting matrix contains n*(n−1)/2 cells, where n is the number of songs. The model is used by starting with a target song 609, calculating the difference in value for each parameter between the comparison song and the songs in the target database, steps 603-605, and then applying the model to the difference values to arrive at a value which represents the likeness between the comparison song and each song in the target database.
  • An alternative method precomputes the 2 dimensional similarity matrix such that each song is connected with only those songs with which there is a match above a predetermined value. Thus, the low matching songs are culled from the similarity matrix. This decreases the size of the database and can increase the search speed.
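  • The culled similarity matrix can be sketched as a dictionary that stores only the pairs above the threshold. In the Python sketch below, the `similarity` function is supplied by the caller and the toy similarity used in the example is purely illustrative.

```python
from itertools import combinations

def culled_similarity(song_ids, similarity, threshold):
    """Precompute similarities and keep only pairs whose value exceeds the threshold."""
    kept = {}
    for a, b in combinations(song_ids, 2):  # n*(n-1)/2 unordered pairs
        value = similarity(a, b)
        if value > threshold:
            kept.setdefault(a, {})[b] = value
            kept.setdefault(b, {})[a] = value
    return kept

# Toy example: songs are numbered 1 to 4 and similarity falls off with distance.
matrix = culled_similarity(
    [1, 2, 3, 4],
    similarity=lambda a, b: 1.0 / abs(a - b),
    threshold=0.6,
)
print(matrix)
```

  • Of the six possible pairs in this example, only the three pairs of adjacent songs survive, so a lookup for one song touches only its stored neighbors.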
  • 4.4.4 Limitations of the Model
  • The limit on the goodness of the fit of the model (its predictive power) is determined by, among other things:
      • (1) The variability between the human responses, step 602. High variability means that people do not agree on the descriptors and any model will have poor predictive power.
      • (2) The intra-song variability in the set of all music. High variability within a song means that any one part of the song will be a poor representation of any other part of the same song. This will impede the listeners' task of judging similarity between such songs, resulting in less accurate data and a less accurate model.
      • (3) How well the subset represents the set. The ability to create a model depends on the existence of patterns of parameters in similar songs. If the subset on which the model is based does not represent the patterns that are present in the set, then the model will be incorrect.
  • We have found that a group of 12 human observers had a correlation coefficient of 0.5 or greater in what they consider sounds alike. This indicates that there is sufficient inter-rater reliability to be able to model the process. We can further improve our chances of successfully using the model to predict what sounds alike by:
      • (a) Only applying the technique to music which has low intra-song variability (e.g. some Classical music has high variability within songs).
      • (b) Using statistical sampling techniques known in the art (for example, for political polling) to ensure that the subset represents the set.
  • 4.5 Modeling Likeness with Descriptors (FIG. 7)
  • FIG. 7 illustrates an alternative method for finding music that sounds to human listeners like any musical composition selected by a user. The processes in FIG. 3 are repeated to create a set of descriptors 706. These are used to create a model of likeness 707 by a process similar to that used to create the model of likeness using parameters 605.
  • 4.5.1 Collecting Human Data
  • One or more humans listen to pairs of songs and judge their similarity on a scale, for example from 1 to 9, where 1 is very dissimilar, and 9 is very similar, step 702. The objective of using human listeners is to create a model of their perceptual process of “likeness”, using a subset of all music, and to apply that model to the set of all music. The preferred or alternative methods of collecting human data described with FIG. 6 are used.
  • 4.5.2 Creating the Likeness Model
  • The objective of the model is to predict the perceived differences using the modelled descriptors of music. To build the model, a list of numbers is calculated for the comparison of each song to each other song. The list of numbers consists of a value for each descriptor where the value is the difference between the descriptor value for a first song and the value of the same descriptor for a second song. When the model is used to compare one song to another for likeness, the list of descriptor differences between the two songs is calculated and these differences are the inputs to the model. The model then yields a number that predicts the likeness that people would judge for the same two songs. The model processes the list of descriptor difference values by applying weights and heuristics to each value.
  • The preferred method of creating a model of likeness is to sum the weighted descriptor difference values. For example, for three descriptors, A, B and C, the following steps are used for songs 1 and 2:
  • STEP 1: subtract the descriptors of each song

  • $A_1 - A_2,\quad B_1 - B_2,\quad C_1 - C_2$
  • STEP 2: calculate the absolute differences. Our preferred method uses a value of n=2, but other values can be used.

  • $\sqrt[n]{(A_1 - A_2)^n},\quad \sqrt[n]{(B_1 - B_2)^n},\quad \sqrt[n]{(C_1 - C_2)^n}$
  • STEP 3: weight and sum the differences. The values of the weights (β1 to βn) are determined by linear regression, as explained below.

  • $\mathrm{Likeness} = \beta_1 \cdot A_{\mathrm{difference}} + \beta_2 \cdot B_{\mathrm{difference}} + \beta_3 \cdot C_{\mathrm{difference}}$
  • The values of the weights are determined by a process of linear regression, step 710, which seeks to minimize the difference, step 708, between the human-derived likeness values, step 702, and the output from the model. The preferred model of likeness is:

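  • $\mathrm{Likeness} = \beta_0 + \beta_1 \cdot \mathrm{Energy} + \beta_2 \cdot \mathrm{Tempo} + \beta_3 \cdot \mathrm{Happiness} + \beta_4 \cdot \mathrm{Anger} + \beta_5 \cdot \mathrm{Danceability}$  (5)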
  • β0=0
  • β1=11.4
  • β2=0.87
  • β3=4.1
  • β4=6.3
  • β5=15.94
  • The alternative methods of calculating likeness using correlations of parameters are also used with the descriptors. The preferred weightings of the descriptors are:
  • Energy=4.5
  • Tempo=2.1
  • Happiness=0.78
  • Anger=0.55
  • Danceability=3.7
  • 4.5.3 Organizing and Storing the Likeness Data
  • The preferred and alternative methods of organizing and storing the parameter differences data for calculating likeness are also used when the process uses descriptors. In addition, there is yet another alternative for calculating likeness. It involves precomputing a series of hierarchical listings of the songs organized by their descriptor values. For example, all of the songs in the database are organized into nine classes according to their energy values. Then all of the songs with an energy value of 9 are organized by their tempo values, then all of the songs with energy value 8 are organized by their tempo values, and so on, through all levels of all of the descriptors. The sorting results in a maximum of $L^n$ similarity classes, where L is the number of levels of each descriptor and n is the number of descriptors. In this case, with L=9 and n=5, that is $9^5$ = 59049 classes. The songs in each class have identical descriptor values. This provides an alternative type of likeness.
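  • The hierarchical listing is equivalent to grouping songs by the tuple of their descriptor levels. The Python sketch below assumes integer levels from 1 to 9 and uses the five preferred descriptors in a fixed order; the order after energy and tempo, the names and the sample songs are assumptions made for the example.

```python
from collections import defaultdict

DESCRIPTORS = ("energy", "tempo", "happiness", "anger", "danceability")
LEVELS = 9  # number of levels of each descriptor

def similarity_classes(songs):
    """Group song titles by their tuple of descriptor levels.

    songs maps each title to a dict of integer descriptor levels (1..LEVELS).
    """
    classes = defaultdict(list)
    for title, levels in songs.items():
        classes[tuple(levels[name] for name in DESCRIPTORS)].append(title)
    return dict(classes)

# Upper bound on the number of classes: L to the power n.
print(LEVELS ** len(DESCRIPTORS))  # 59049

# Hypothetical songs: the first two share every descriptor level.
songs = {
    "song A": {"energy": 9, "tempo": 7, "happiness": 5, "anger": 2, "danceability": 8},
    "song B": {"energy": 9, "tempo": 7, "happiness": 5, "anger": 2, "danceability": 8},
    "song C": {"energy": 8, "tempo": 7, "happiness": 5, "anger": 2, "danceability": 8},
}
print(similarity_classes(songs))
```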
  • 4.5.4 Limitations of the Model
  • There are the same limitations on the model based on descriptors as there are on the model based on parameters.
  • 4.5.6 Searching the Likeness Database
  • The likeness database may be quickly searched by starting with any song and immediately finding the closest songs. A more enhanced search combines a similarity search with a descriptor search and the descriptor adjustment step described above. The steps are:
      • 1a. Find a list of likeness matches, including some that are somewhat different
      • 1b. Present only the acceptable likeness songs.
      • 2. When a search adjusted by a scalar descriptor is requested, rank the entire list of likeness matches (1a) by the descriptor to be adjusted.
      • 3. Present a new list based on the adjusted values with the best matches at the top.
        This means that the likeness list (1a.) compiled for the original search is much broader than the list displayed for the user (1b), and includes songs that are less similar to the initial target song than would be tolerated by the listener. These poorer likeness matches lie below the presentability threshold for the initial target.
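  • The steps above can be sketched in Python as two functions over the broad likeness list (1a). The record layout, the threshold value, and the choice to rank by closeness to the requested descriptor value with ties broken by likeness are assumptions; the text only says that the entire list is ranked by the adjusted descriptor.

```python
def present(matches, threshold):
    """Step 1b: show only matches at or above the presentability threshold, best first."""
    acceptable = [m for m in matches if m["likeness"] >= threshold]
    return sorted(acceptable, key=lambda m: m["likeness"], reverse=True)

def adjust(matches, descriptor, requested_value):
    """Steps 2 and 3: re-rank the entire broad list (1a) by the adjusted descriptor.

    Songs closest to the requested value come first; ties go to the higher likeness.
    """
    return sorted(
        matches,
        key=lambda m: (abs(m[descriptor] - requested_value), -m["likeness"]),
    )

# Hypothetical broad list (1a) for a target song whose energy is 5.
broad_list = [
    {"title": "a", "likeness": 0.9, "energy": 5},
    {"title": "b", "likeness": 0.8, "energy": 6},
    {"title": "c", "likeness": 0.4, "energy": 9},
    {"title": "d", "likeness": 0.3, "energy": 9},
]
print([m["title"] for m in present(broad_list, threshold=0.5)])
print([m["title"] for m in adjust(broad_list, "energy", requested_value=9)])
```

  • In this example, songs "c" and "d" fall below the presentability threshold for the initial search, but they move to the top once the user asks for an energy of 9.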

Claims (21)

1-20. (canceled)
21. A method comprising:
receiving, from a device, a query that references a selected musical recording for which a similar recording is sought;
calculating a parameter of the selected musical recording from an analysis of music in at least a portion of the selected musical recording;
determining a scalar of the selected musical recording, the scalar being determined based on the calculated parameter and indicating an extent to which a descriptor is humanly perceptible in the selected musical recording, the determining of scalar being performed by a computer;
searching a database of reference scalars each associated with a corresponding reference musical recording to obtain a search result based on a comparison of the scalar to the reference scalars, each reference scalar being determined based on a corresponding reference parameter calculated and indicating a corresponding extent to which the descriptor is humanly perceptible in the corresponding reference musical recording; and
providing the search result to the device.
22. The method of claim 21, wherein:
the determining of the scalar of the selected musical recording includes multiplying the parameter by a weight factor.
23. The method of claim 22 further comprising:
determining the weight factor based on ratings submitted by human listeners of a set of musical recordings from which the selected musical recording is absent.
24. The method of claim 22 further comprising:
accessing the weight factor from a lookup table that correlates the weight factor with the calculated parameter.
25. The method of claim 24 further comprising:
generating the lookup table based on ratings submitted by human listeners of a set of musical recordings from which the reference musical recordings and the selected musical recording are absent.
26. The method of claim 22 further comprising:
each reference scalar is determined based on its corresponding reference parameter being multiplied by the weight factor that is multiplied to the parameter of the selected musical recording.
27. The method of claim 21, wherein:
the received query that references the selected musical recording specifies a range that corresponds to the descriptor; and
the providing of the search result is based on the scalar of the selected musical recording being within the range specified by the received query.
28. The method of claim 21, wherein:
the calculating of the parameter of the selected musical recording includes performing a mathematical analysis of sounds of the music in at least the portion of the selected musical recording.
29. The method of claim 21, wherein:
the calculated parameter represents a mathematical measurement of sounds of the music in at least the portion of the selected musical recording, the mathematical measurement being selected from a group consisting of harmonicity, loudness, dynamic range, rhythm strength, rhythm complexity, articulation, attack, note duration, tempo, and key.
30. The method of claim 21, wherein:
the determined scalar quantifies the extent to which the descriptor is humanly perceptible in the selected musical recording, the descriptor being a text phrase selected from a group consisting of sadness, happiness, anger, symphonicness, amount of singer contribution, melodicness, band size, countryness, metallicness, smoothness, coolness, salsa influence, reggae influence, recording quality, energy level, and danceability.
31. The method of claim 21, wherein:
the calculated parameter of the selected musical recording is a value of an acoustic characteristic calculated for sounds of the music in at least the portion of the selected musical recording.
32. The method of claim 21, wherein:
the determined scalar of the selected musical recording is a value of a musical characteristic that is perceivable by humans in the music in the selected musical recording.
33. A non-transitory computer-readable medium comprising a computer program that, when executed by a computer, causes the computer to perform operations comprising:
receiving, from a device, a query that references a selected musical recording for which a similar recording is sought;
calculating a parameter of the selected musical recording from an analysis of music in at least a portion of the selected musical recording;
determining a scalar of the selected musical recording, the scalar being determined based on the calculated parameter and indicating an extent to which a descriptor is humanly perceptible in the selected musical recording;
searching a database of reference scalars each associated with a corresponding reference musical recording to obtain a search result based on a comparison of the scalar to the reference scalars, each reference scalar being determined based on a corresponding reference parameter calculated and indicating a corresponding extent to which the descriptor is humanly perceptible in the corresponding reference musical recording; and
providing the search result to the device.
34. The non-transitory computer-readable medium of claim 33, wherein:
the determining of the scalar of the selected musical recording includes multiplying the parameter by a weight factor.
35. The non-transitory computer-readable medium of claim 33, wherein:
the received query that references the selected musical recording specifies a range that corresponds to the descriptor; and
the providing of the search result is based on the scalar of the selected musical recording being within the range specified by the received query.
36. A system comprising:
a computer; and
a computer program that, when executed by the computer, causes the computer to perform operations comprising:
receiving, from a device, a query that references a selected musical recording for which a similar recording is sought;
calculating a parameter of the selected musical recording from an analysis of music in at least a portion of the selected musical recording;
determining a scalar of the selected musical recording, the scalar being determined based on the calculated parameter and indicating an extent to which a descriptor is humanly perceptible in the selected musical recording;
searching a database of reference scalars each associated with a corresponding reference musical recording to obtain a search result based on a comparison of the scalar to the reference scalars, each reference scalar being determined based on a corresponding reference parameter calculated and indicating a corresponding extent to which the descriptor is humanly perceptible in the corresponding reference musical recording; and
providing the search result to the device.
37. The system of claim 36, wherein:
the determining of the scalar of the selected musical recording includes multiplying the parameter by a weight factor.
38. The system of claim 37, wherein the operations further comprise:
determining the weight factor based on ratings submitted by human listeners of a set of musical recordings from which the selected musical recording is absent.
39. The system of claim 37, wherein the operations further comprise:
accessing the weight factor from a lookup table that correlates the weight factor with the calculated parameter.
40. The system of claim 36, wherein:
the received query that references the selected musical recording specifies a range that corresponds to the descriptor; and
the providing of the search result is based on the scalar of the selected musical recording being within the range specified by the received query.
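The range condition of claim 40 can be read as a gate on the result: the result is provided only when the scalar of the selected recording falls inside the range given in the query. The sketch below follows that reading, which is an assumption. The function name `provide_result`, the inclusive bounds, and the empty-list return for an out-of-range scalar are all hypothetical.

```python
from typing import List, Tuple


def provide_result(selected_scalar: float,
                   descriptor_range: Tuple[float, float],
                   search_result: List[str]) -> List[str]:
    """Return the search result only if the scalar lies within the range."""
    low, high = descriptor_range
    if low > high:
        raise ValueError("range lower bound exceeds upper bound")
    # Assumed: bounds are inclusive, and an out-of-range scalar yields no result.
    if low <= selected_scalar <= high:
        return search_result
    return []
```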
US14/329,368 1999-09-14 2014-07-11 Music searching methods based on human perception Abandoned US20140372479A1 (en)

Priority Applications (1)

Application Number Publication Number Priority Date Filing Date Title
US14/329,368 US20140372479A1 (en) 1999-09-14 2014-07-11 Music searching methods based on human perception

Applications Claiming Priority (4)

Application Number Publication Number Priority Date Filing Date Title
US15376899P 1999-09-14 1999-09-14
US09/556,086 US8326584B1 (en) 1999-09-14 2000-04-21 Music searching methods based on human perception
US13/667,683 US8805657B2 (en) 1999-09-14 2012-11-02 Music searching methods based on human perception
US14/329,368 US20140372479A1 (en) 1999-09-14 2014-07-11 Music searching methods based on human perception

Related Parent Applications (1)

Application Number Relation Publication Number Priority Date Filing Date Title
US13/667,683 Continuation US8805657B2 (en) 1999-09-14 2012-11-02 Music searching methods based on human perception

Publications (1)

Publication Number Publication Date
US20140372479A1 (en) 2014-12-18

Family

Family ID: 26850847

Family Applications (3)

Application Number Status Publication Number Priority Date Filing Date Title
US09/556,086 Active 2025-04-21 US8326584B1 (en) 1999-09-14 2000-04-21 Music searching methods based on human perception
US13/667,683 Expired - Fee Related US8805657B2 (en) 1999-09-14 2012-11-02 Music searching methods based on human perception
US14/329,368 Abandoned US20140372479A1 (en) 1999-09-14 2014-07-11 Music searching methods based on human perception

Country Status (3)

Country Link
US (3) US8326584B1 (en)
AU (2) AU7490000A (en)
WO (2) WO2001020483A2 (en)

Families Citing this family (56)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6505160B1 (en) 1995-07-27 2003-01-07 Digimarc Corporation Connected audio and other media objects
US8326584B1 (en) 1999-09-14 2012-12-04 Gracenote, Inc. Music searching methods based on human perception
US7194752B1 (en) * 1999-10-19 2007-03-20 Iceberg Industries, Llc Method and apparatus for automatically recognizing input audio and/or video streams
US7343553B1 (en) * 2000-05-19 2008-03-11 Evan John Kaye Voice clip identification method
US7277766B1 (en) 2000-10-24 2007-10-02 Moodlogic, Inc. Method and system for analyzing digital audio files
US7890374B1 (en) 2000-10-24 2011-02-15 Rovi Technologies Corporation System and method for presenting music to consumers
WO2002051063A1 (en) 2000-12-21 2002-06-27 Digimarc Corporation Methods, apparatus and programs for generating and utilizing content signatures
US7421376B1 (en) 2001-04-24 2008-09-02 Auditude, Inc. Comparison of data signals using characteristic electronic thumbprints
US7962482B2 (en) * 2001-05-16 2011-06-14 Pandora Media, Inc. Methods and systems for utilizing contextual feedback to generate and modify playlists
EP1410380B1 (en) 2001-07-20 2010-04-28 Gracenote, Inc. Automatic identification of sound recordings
AU2002323413A1 (en) * 2001-08-27 2003-03-10 Gracenote, Inc. Playlist generation, delivery and navigation
US20050010604A1 (en) 2001-12-05 2005-01-13 Digital Networks North America, Inc. Automatic identification of DVD title using internet technologies and fuzzy matching techniques
JP2005062971A (en) * 2003-08-19 2005-03-10 Pioneer Electronic Corp Content retrieval system
US8396800B1 (en) * 2003-11-03 2013-03-12 James W. Wieder Adaptive personalized music and entertainment
EP1550942A1 (en) * 2004-01-05 2005-07-06 Thomson Licensing S.A. User interface for a device for playback of audio files
JP2007534995A (en) * 2004-04-29 2007-11-29 コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ Method and system for classifying audio signals
US7567899B2 (en) 2004-12-30 2009-07-28 All Media Guide, Llc Methods and apparatus for audio recognition
US20070094215A1 (en) * 2005-08-03 2007-04-26 Toms Mona L Reducing genre metadata
US7516074B2 (en) 2005-09-01 2009-04-07 Auditude, Inc. Extraction and matching of characteristic fingerprints from audio signals
DE602006013666D1 (en) * 2005-09-29 2010-05-27 Koninkl Philips Electronics Nv Method and device for automatically preparing a playlist by segmental characteristic comparison
JP2009510658A (en) 2005-09-30 2009-03-12 コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ Method and apparatus for processing audio for playback
JP5512126B2 (en) 2005-10-17 2014-06-04 コーニンクレッカ フィリップス エヌ ヴェ Method for deriving a set of features for an audio input signal
WO2008117232A2 (en) * 2007-03-27 2008-10-02 Koninklijke Philips Electronics N.V. Apparatus for creating a multimedia file list
US20080274687A1 (en) 2007-05-02 2008-11-06 Roberts Dale T Dynamic mixed media package
DE102007034031A1 (en) * 2007-07-20 2009-01-22 Robert Bosch Gmbh Method of determining similarity, device and use thereof
US7958130B2 (en) 2008-05-26 2011-06-07 Microsoft Corporation Similarity-based content sampling and relevance feedback
EP2159720A1 (en) * 2008-08-28 2010-03-03 Bach Technology AS Apparatus and method for generating a collection profile and for communicating based on the collection profile
US10567823B2 (en) 2008-11-26 2020-02-18 Free Stream Media Corp. Relevant advertisement generation based on a user operating a client device communicatively coupled with a networked media device
US10419541B2 (en) 2008-11-26 2019-09-17 Free Stream Media Corp. Remotely control devices over a network without authentication or registration
US10631068B2 (en) 2008-11-26 2020-04-21 Free Stream Media Corp. Content exposure attribution based on renderings of related content across multiple devices
US9519772B2 (en) 2008-11-26 2016-12-13 Free Stream Media Corp. Relevancy improvement through targeting of information based on data gathered from a networked device associated with a security sandbox of a client device
US9961388B2 (en) 2008-11-26 2018-05-01 David Harrison Exposure of public internet protocol addresses in an advertising exchange server to improve relevancy of advertisements
US10880340B2 (en) 2008-11-26 2020-12-29 Free Stream Media Corp. Relevancy improvement through targeting of information based on data gathered from a networked device associated with a security sandbox of a client device
US9154942B2 (en) 2008-11-26 2015-10-06 Free Stream Media Corp. Zero configuration communication between a browser and a networked media device
US10334324B2 (en) 2008-11-26 2019-06-25 Free Stream Media Corp. Relevant advertisement generation based on a user operating a client device communicatively coupled with a networked media device
US8180891B1 (en) 2008-11-26 2012-05-15 Free Stream Media Corp. Discovery, access control, and communication with networked services from within a security sandbox
US9986279B2 (en) 2008-11-26 2018-05-29 Free Stream Media Corp. Discovery, access control, and communication with networked services
US10977693B2 (en) 2008-11-26 2021-04-13 Free Stream Media Corp. Association of content identifier of audio-visual data with additional data through capture infrastructure
US8996538B1 (en) 2009-05-06 2015-03-31 Gracenote, Inc. Systems, methods, and apparatus for generating an audio-visual presentation using characteristics of audio, visual and symbolic media objects
US8805854B2 (en) * 2009-06-23 2014-08-12 Gracenote, Inc. Methods and apparatus for determining a mood profile associated with media data
US8620967B2 (en) 2009-06-11 2013-12-31 Rovi Technologies Corporation Managing metadata for occurrences of a recording
US8161071B2 (en) 2009-09-30 2012-04-17 United Video Properties, Inc. Systems and methods for audio asset storage and management
US8886531B2 (en) 2010-01-13 2014-11-11 Rovi Technologies Corporation Apparatus and method for generating an audio fingerprint and using a two-stage query
US9093120B2 (en) 2011-02-10 2015-07-28 Yahoo! Inc. Audio fingerprint extraction by scaling in time and resampling
US10055493B2 (en) * 2011-05-09 2018-08-21 Google Llc Generating a playlist
US8865993B2 (en) * 2012-11-02 2014-10-21 Mixed In Key Llc Musical composition processing system for processing musical composition for energy level and related methods
US9037278B2 (en) * 2013-03-12 2015-05-19 Jeffrey Scott Smith System and method of predicting user audio file preferences
US9905233B1 (en) 2014-08-07 2018-02-27 Digimarc Corporation Methods and apparatus for facilitating ambient content recognition using digital watermarks, and related arrangements
US9501568B2 (en) 2015-01-02 2016-11-22 Gracenote, Inc. Audio matching based on harmonogram
US10372757B2 (en) 2015-05-19 2019-08-06 Spotify Ab Search media content based upon tempo
US11113346B2 (en) 2016-06-09 2021-09-07 Spotify Ab Search media content based upon tempo
US10984035B2 (en) * 2016-06-09 2021-04-20 Spotify Ab Identifying media content
US11750989B2 (en) 2018-04-05 2023-09-05 Cochlear Limited Advanced hearing prosthesis recipient habilitation and/or rehabilitation
US20200320404A1 (en) * 2019-04-02 2020-10-08 International Business Machines Corporation Construction of a machine learning model
US11783723B1 (en) 2019-06-13 2023-10-10 Dance4Healing Inc. Method and system for music and dance recommendations
CN116842616B (en) * 2023-06-30 2024-01-26 同济大学 Method for designing speed perception enhanced rhythm curve based on frequency of side wall of underground road

Citations (1)

Publication number Priority date Publication date Assignee Title
US5918223A (en) * 1996-07-22 1999-06-29 Muscle Fish Method and article of manufacture for content-based analysis, storage, retrieval, and segmentation of audio information

Family Cites Families (37)

Publication number Priority date Publication date Assignee Title
US4697209A (en) 1984-04-26 1987-09-29 A. C. Nielsen Company Methods and apparatus for automatically identifying programs viewed or recorded
US4677466A (en) 1985-07-29 1987-06-30 A. C. Nielsen Company Broadcast program identification method and apparatus
US4843562A (en) 1987-06-24 1989-06-27 Broadcast Data Systems Limited Partnership Broadcast information classification system and method
US5019899A (en) 1988-11-01 1991-05-28 Control Data Corporation Electronic data encoding and recognition system
US5510572A (en) 1992-01-12 1996-04-23 Casio Computer Co., Ltd. Apparatus for analyzing and harmonizing melody using results of melody analysis
US5436653A (en) 1992-04-30 1995-07-25 The Arbitron Company Method and system for recognition of broadcast segments
US5536902A (en) 1993-04-14 1996-07-16 Yamaha Corporation Method of and apparatus for analyzing and synthesizing a sound by extracting and controlling a sound parameter
US5499294A (en) 1993-11-24 1996-03-12 The United States Of America As Represented By The Administrator Of The National Aeronautics And Space Administration Digital camera with apparatus for authentication of images produced from an image file
US5616876A (en) * 1995-04-19 1997-04-01 Microsoft Corporation System and methods for selecting music on the basis of subjective content
JP3307156B2 (en) 1995-04-24 2002-07-24 ヤマハ株式会社 Music information analyzer
US5751672A (en) 1995-07-26 1998-05-12 Sony Corporation Compact disc changer utilizing disc database
JP3196604B2 (en) 1995-09-27 2001-08-06 ヤマハ株式会社 Chord analyzer
US5767893A (en) 1995-10-11 1998-06-16 International Business Machines Corporation Method and apparatus for content based downloading of video programs
US5874686A (en) 1995-10-31 1999-02-23 Ghias; Asif U. Apparatus and method for searching a melody
JP2778567B2 (en) 1995-12-23 1998-07-23 日本電気株式会社 Signal encoding apparatus and method
US5693903A (en) 1996-04-04 1997-12-02 Coda Music Technology, Inc. Apparatus and method for analyzing vocal audio data to provide accompaniment to a vocalist
WO1998025269A1 (en) 1996-12-02 1998-06-11 Thomson Consumer Electronics, Inc. Apparatus and method for identifying the information stored on a medium
US5925843A (en) 1997-02-12 1999-07-20 Virtual Music Entertainment, Inc. Song identification and synchronization
US5987525A (en) 1997-04-15 1999-11-16 Cddb, Inc. Network delivery of interactive entertainment synchronized to playback of audio recordings
US5960411A (en) 1997-09-12 1999-09-28 Amazon.Com, Inc. Method and system for placing a purchase order via a communications network
JP3765171B2 (en) * 1997-10-07 2006-04-12 ヤマハ株式会社 Speech encoding / decoding system
US6076111A (en) 1997-10-24 2000-06-13 Pictra, Inc. Methods and apparatuses for transferring data between data processing systems which transfer a representation of the data before transferring the data
JPH11232286A (en) 1998-02-12 1999-08-27 Hitachi Ltd Information retrieving system
US6201176B1 (en) * 1998-05-07 2001-03-13 Canon Kabushiki Kaisha System and method for querying a music database
US6226618B1 (en) 1998-08-13 2001-05-01 International Business Machines Corporation Electronic content delivery system
AUPP547898A0 (en) * 1998-08-26 1998-09-17 Canon Kabushiki Kaisha System and method for automatic music generation
US6266429B1 (en) 1998-09-23 2001-07-24 Philips Electronics North America Corporation Method for confirming the integrity of an image transmitted with a loss
US8332478B2 (en) 1998-10-01 2012-12-11 Digimarc Corporation Context sensitive connected content
US6498955B1 (en) * 1999-03-19 2002-12-24 Accenture Llp Member preference control of an environment
US8326584B1 (en) 1999-09-14 2012-12-04 Gracenote, Inc. Music searching methods based on human perception
US7548851B1 (en) 1999-10-12 2009-06-16 Jack Lau Digital multimedia jukebox
US6539395B1 (en) * 2000-03-22 2003-03-25 Mood Logic, Inc. Method for creating a database for comparing music
US6990453B2 (en) 2000-07-31 2006-01-24 Landmark Digital Services Llc System and methods for recognizing sound and music signals in high noise and distortion
JP2002049631A (en) 2000-08-01 2002-02-15 Sony Corp Information providing device, method and recording medium
US6748360B2 (en) 2000-11-03 2004-06-08 International Business Machines Corporation System for selling a product utilizing audio content identification
JP4723171B2 (en) 2001-02-12 2011-07-13 グレースノート インク Generating and matching multimedia content hashes
US7784103B2 (en) 2004-10-19 2010-08-24 Rovi Solutions Corporation Method and apparatus for storing copy protection information separately from protected content

Non-Patent Citations (1)

Title
Martin et al.: "Music Content Analysis through Models of Audition", ACM Multimedia Workshop '98, ACM 1998; pp. 1-8. *

Also Published As

Publication number Publication date
AU7490000A (en) 2001-04-17
US20130191088A1 (en) 2013-07-25
WO2001020609A3 (en) 2004-02-19
WO2001020483A3 (en) 2004-02-19
WO2001020483A2 (en) 2001-03-22
US8805657B2 (en) 2014-08-12
US8326584B1 (en) 2012-12-04
AU7489900A (en) 2001-04-17
WO2001020609A2 (en) 2001-03-22

Similar Documents

Publication Publication Date Title
US8805657B2 (en) Music searching methods based on human perception
Salamon et al. Tonal representations for music retrieval: from version identification to query-by-humming
Li et al. Music data mining
EP1244093B1 (en) Sound features extracting apparatus, sound data registering apparatus, sound data retrieving apparatus and methods and programs for implementing the same
Typke et al. A survey of music information retrieval systems
Casey et al. Content-based music information retrieval: Current directions and future challenges
Bartsch et al. To catch a chorus: Using chroma-based representations for audio thumbnailing
Pohle et al. Evaluation of frequently used audio features for classification of music into perceptual categories
US20040093354A1 (en) Method and system of representing musical information in a digital representation for use in content-based multimedia information retrieval
Welsh et al. Querying large collections of music for similarity
Marolt A mid-level representation for melody-based retrieval in audio collections
WO2007029002A2 (en) Music analysis
Hargreaves et al. Structural segmentation of multitrack audio
Moelants et al. Exploring African tone scales
Liu et al. Content-based audio classification and retrieval using a fuzzy logic system: towards multimedia search engines
Gómez et al. Comparative analysis of music recordings from western and non-western traditions by automatic tonal feature extraction
Lippincott Issues in content-based music information retrieval
Li et al. Music data mining: an introduction
Leman et al. Tendencies, perspectives, and opportunities of musical audio-mining
Reiss et al. Benchmarking music information retrieval systems
Lidy Evaluation of new audio features and their utilization in novel music retrieval applications
US20030120679A1 (en) Method for creating a database index for a piece of music and for retrieval of piece of music
EP4250134A1 (en) System and method for automated music pitching
Kostek et al. Processing of musical data employing rough sets and artificial neural networks
Lindenbaum et al. Musical features extraction for audio-based search

Legal Events

Date Code Title Description
AS Assignment

Owner name: CITIBANK, N.A., AS COLLATERAL AGENT, NEW YORK

Free format text: SUPPLEMENTAL SECURITY AGREEMENT;ASSIGNORS:GRACENOTE, INC.;GRACENOTE MEDIA SERVICES, LLC;GRACENOTE DIGITAL VENTURES, LLC;REEL/FRAME:042262/0601

Effective date: 20170412

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: GRACENOTE DIGITAL VENTURES, LLC, NEW YORK

Free format text: RELEASE (REEL 042262 / FRAME 0601);ASSIGNOR:CITIBANK, N.A.;REEL/FRAME:061748/0001

Effective date: 20221011

Owner name: GRACENOTE, INC., NEW YORK

Free format text: RELEASE (REEL 042262 / FRAME 0601);ASSIGNOR:CITIBANK, N.A.;REEL/FRAME:061748/0001

Effective date: 20221011