WO2011144617A1

WO2011144617A1 - Apparatus and method for extending or compressing time sections of an audio signal

Info

Publication number: WO2011144617A1
Application number: PCT/EP2011/057979
Authority: WO
Inventors: Frederik Nagel; Stefan Geyersberger; Sascha Disch; Max Neuendorf
Original assignee: Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.
Priority date: 2010-05-19
Filing date: 2011-05-17
Publication date: 2011-11-24
Also published as: EP2388780A1; AR081014A1; TW201207847A

Abstract

An audio signal processor (100) comprising an analysis means (104), a manipulation factor unit (106), and a time-stretching and compression device (108). The analysis means (104) is implemented to determine a first measure of information content (M₁) of a first time section of an audio signal and a second measure of information content (M₂) of a second time section. The manipulation factor unit (106) is implemented to determine a time manipulation factor (ΔD₁) for the first time section in dependence on the first measure of information content (M₁) and the second measure of information content (M₂). The time-stretching and compression device (108) is implemented to time-stretch or compress the first time section according to the manipulation factor (ΔD₁) and to treat the second time section differently from the first time section. A corresponding method for adjusting time information content variations of an audio signal (s) is also disclosed. The determination of the first measure of information content and of the second measure of information content may be based on externally provided control information such as meta-data provided along with the audio signal.

Description

APPARATUS AND METHOD FOR EXTENDING OR COMPRESSING TIME SECTIONS OF AN AUDIO SIGNAL

Description

Embodiments according to the invention relate to audio processing and particularly to an apparatus and a method for extending or compressing an audio signal in a time section-wise manner.

Recorded audio signals may be played back at a speed different from the original speed at which the (original) audio signal was recorded. This may be useful to slow down or accelerate the audio signal so that a listener may receive the information conveyed by the audio signal at a rate that is convenient to the listener. The listener may, for example, choose a relatively fast playback speed when seeking a certain segment within the audio signal by paying attention to a particular keyword used within the segment sought. Typically, the listener will not be able to mentally process the entire information conveyed by the audio signal if the audio signal is played back at a fast speed. Nevertheless, the keyword typically can be recognized by the listener, even at relatively high playback speeds, if the listener concentrates on detecting the keyword. Another option is to choose a slower speed for playback which is useful when the listener wants to extract relevant information from the audio signal. For example, the audio signal may have been recorded during a court hearing and a transcript of the audio signal needs to be prepared. Yet another example can be found in the field of aviation when the audio signal has been extracted from a flight data recorder, wherein specialists are commissioned with the identification of various utterances and sounds that can be heard when playing the audio signal. Varying the playback speed of the audio signal may facilitate the identification of the various recorded sounds.

Currently known methods for temporarily extending and compressing an audio signal receive control information from a user, e.g. the listener. For example, a time-stretching method has been introduced under the name "Phavorit" and described in an article by T. Karrer, E. Lee, and J. Borchers, "PhaVoRIT: A Phase Vocoder for Real Time Interactive Time- Stretching" in Proc. ICMC, Nov. 2006, pp.708-715". This time-stretching method is based on a phase Vocoder for which the user may select the degree of time-stretching at runtime.

This and other currently known methods process an entire audio signal in an across-the-board manner or they have to be explicitly controlled by the user. The alternation of the playback speed of an audio signal typically causes a change of the pitch of the audio signal. If this is not desired, many different time stretching methods may be used, such as synchronous overlap and add (SOLA), pitch synchronous overlap and add (PSOLA), waveform similarity overlap and add (WSOLA), Pointer Interval Controlled OverLap and Add (PICOL), Time Domain Harmonic Scaling (TDHS), Minimum Perceived Loss Time Compression/Expansion (MPEX), or the phase vocoder. Each of these techniques has some advantages for certain signals. The following, however, concentrates on the phase vocoder. In an article "The Phase Vocoder: A Tutorial" by Mark Dolson, Computer Audio Research Laboratory, Center for Music Experiment, University of California, San Diego, the operation of the phase vocoder is explained. The phase vocoder is a signal processing technique (typically digital) that can be used to perform very high fidelity time-scaling, pitch transposition and other modifications of recorded sound.

In the German Patent Application publication DE 10 2008 015 702 Al , a device and a method for bandwidth expansion of an audio signal is described. The device uses a phase vocoder in filterbank implementation or transformation implementation for temporarily spreading the audio signal by a predetermined, constant factor.

It would be desirable to provide a device and a method for automatically and selectively stretch and/or compress individual sections of audio signals, in particular, speech signals. This operation may be carried out with either SOLA, WSOLA, PSOLA, PICOLA, TDHS, MPEX, phase vocoder or other time or pitch scaling techniques.

This desire and/or other desires are addressed by an audio signal processor according to claim 1, a method according to claim 14, or a computer program according to claim 15.

An embodiment of the invention provides an audio signal processor which comprises an analysis means, a manipulation factor unit, and a time-stretching and compression device. The analysis means is implemented to determine a first measure of information content of a first time section of an audio signal and a second measure of information content of a second time section. The manipulation factor unit is implemented to determine a time manipulation factor for the first time section in dependence on the first measure of information content and the second measure of information content. The time-stretching and compression device is implemented to time-stretch or compress the first time section according to the manipulation factor and to treat the second time section differently from the first time section. By applying different manipulation factors regarding the time stretching and compression to different time sections, time sections having a higher measure of information content (e.g. a higher information density), can be time-stretched or temporarily extended. In the alternative, time sections having a relatively low measure of information content may be temporarily compressed or even deleted from the signal. The audio signal processor also facilitates a combination of both options. With the proposed audio signal processor it is possible to distribute information content more evenly over the duration of the audio signal. In the related field of perceptual speech and audio coding, current methods of speech and audio coding may not code signal components that are perceived as noise-like, but synthesize an equally noise-like perceived signal at the receiver side, possibly using a few parametric values transmitted from the sender to the receiver. This receiver side substitution is typically limited to noise. This technique is called Perceptual Noise Substitution (PNS). The substituted signal components typically are not dispensable, but contain e.g. sibilant sounds etc. with a high semantic content.

Another technique used in mobile phones is the insertion of comfort noise. The purpose of this technique is to reduce the amount of data that needs to be transmitted or stored, especially in the case of noise. In contrast, the proposed audio signal processor makes it possible to use the noise-filled time section as a freed-up resource that is usable for other information. This inherently helps to maintain the quality and/or intelligibility of important signals portions, while less effort is spent for the coding of noise like segments of an audio signal. This functionality of the audio signal processor is not limited to noise or noise-like signal components, but also to other signal components having a low measure of information content. The choice of what kind of signals qualify as having a relatively high measure of information content is a question of implementation, configuration, and user preferences. This question and its solution are typically addressed during an implementation process of the analysis means. Another difference to the above mentioned current methods of speech and audio coding is that with the proposed method and apparatus a noise-filled time section is not filled with a synthesized noise that is modeled to imitate the original noise. Instead, the noise- filled time section and other time sections having a low measure of information content may be filled with payload information. With PNS-based methods, the noise is re-synthesized at the decoder side. However, irrelevant speech segments and pauses are not considered separately. Furthermore, these methods regarding audio coding do not modify durations of time sections within the audio signal or the duration of the complete audio signal, because this would be contradictory to the goal of the audio coding methods, which is to achieve a high degree of similarity between the original signal at the coder side and the decoded signal at the decoder side. The determination of the first measure of information content and of the second measure of information content may be based on externally provided control information, that is, the analysis means may be configured to identify and/or extract the control information. For example, the analysis of the externally provided control information may be performed in real time on the basis of information additionally supplied along with the audio signal. The information content data could be sent along as meta-information or meta-data, which indicates the information content of one or several time sections.

Current methods for time-stretching and compressing of audio signals act on an audio signal in a global, across-the-board-manner which results in pauses and superfluous sections being time-stretched or compressed at the same time and to the same degree as other time sections. Audio coding methods that exploit the irrelevance of information, do so either exclusively with respect to masking or with respect to noise.

The audio signal processor according to the teachings disclosed herein extracts parameters regarding a degree of (time section-wise) stretching and compressing from the signal itself, which does not appear to be implemented in currently known methods.

The audio signal processor may be used to automatically detect or estimate speech pauses and to use these time intervals for a selective speech stretching. The words may be played back at a slower speed. In turn, the pauses are shortened. Primarily, pauses designate a being silent of the speaker. However, the term may be extended to so-called "filled" pauses. Filled pauses refer to makeshift words such as "er", urn", "well", etc., or the repetition of words and word parts, for example, "I mean mean mean...". Stuttered syllables fall into this category as well. All of these pauses have in common that they do not contain information in the sense of exchanging facts and are thus substantially negligible as irrelevant. In the literature, these pauses are sometimes referred to as "filled pauses".

The time stretching of selected time sections may improve the comprehensibility of the audio signal and enable non-native speakers, aurally challenged persons, senior citizens etc. to follow spoken texts more easily. In addition, such detection could be used in audio or speech coding, as filled pauses may be coded at an inferior quality or not at all. In some embodiments of the teachings disclosed herein, the audio signal processor may further comprise a comparator implemented to compare the first measure of information of the content of the first time section to a threshold and to classify the first time section, in dependence on a respective result of the comparison, as a section having a higher measure of information content or as a section having a lower measure of information content. A section bounding means may be provided that is implemented to shift boundaries between the sections having the higher measure of information content and the sections having the lower measure of information content into or towards the sections having lower information content. The time-stretching and compression device may further be implemented to time-stretch or compress the sections having a higher measure of information content by a factor corresponding to the shift of the boundaries of the first time section.

Another embodiment of the teachings disclosed herein provides a method for adjusting time information content variations of an audio signal which comprises: determining a first measure of information content of a first time section of the audio signal and a second measure of information content of a second time section of the audio signal; determining a time manipulation factor for the first time section in dependence on the first measure of information content and the second measure of information content; and processing the audio signal such that the first time section is time-stretched or compressed according to the time manipulation factor and that the second time section is processed differently from the first time section.

The dependent claims relate to further enhancements and/or details of the audio signal processor, the method for adjusting time information content variations of the audio signal, and/or the computer program.

Brief Description of the Drawings

The accompanying drawings are included to provide a further understanding of embodiments and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments and, together with the description, serve to explain the principles of the embodiments. Other embodiments and many of the intended advantages of embodiments will be readily appreciated, as they become better understood with reference to the following detailed description. Like reference numerals designate corresponding similar parts.

Fig. 1 is a schematic block diagram of an audio signal processor according to the teachings of this document; Fig. 2 is a schematic block diagram of another embodiment of an audio signal processor according to the teachings of this document; Fig. 3 is a diagram illustrating measures of information content for a plurality of time sections of the audio signal over time;

Fig. 4 shows another diagram of measure of information content over time illustrating a hysteresis concept;

Fig. 5 is a schematic block diagram of another embodiment of an audio signal processor according to the teachings disclosed herein;

Fig. 6 is a schematic block diagram of another embodiment of an audio signal processor according to the teachings disclosed herein;

Fig. 7 is a schematic flowchart of an embodiment of a method for adjusting time information content variations of an audio signal according to the teachings disclosed herein;

Fig. 8 shows a flowchart of another embodiment of a method for adjusting time information content variations,

Figs. 9a to 9e show diagrams of an energy quantity of the audio signal over time to illustrate various actions of an embodiment of the method for adjusting time information content variations; and

Figs. 10a to lOe show diagrams of a time manipulation factor over time for different implementations or configurations of the audio signal processor and/or the method for adjusting time information content variations.

Detailed Description of the Embodiments Fig. 1 shows a schematic block diagram of an audio signal processor 100 according to an embodiment of the teachings of this document. The audio signal processor 100 receives an audio signal s as an input. The audio signal s is illustrated at the top of Fig. 1 in a diagram of a signal amplitude over time. The audio signal contains relatively large amplitude values in a first time section (section 1), which extends between two time instants ti and t₂. Without loss of generality and as an illustrative example, it is assumed that the first section contains relevant information in the form of spoken words and therefore has a high measure of information content. A second time section (section 2) extends between the time instants t₂ and t₃. On average, the amplitude values are lower in the second section than in the first section. For the purpose of illustration, it is assumed that this indicates a low measure of information content for the second time section. Within the audio signal processor 100, the audio signal s is input to a section identifier 102. The section identifier 102 may perform a coarse analysis of the audio signal s to determine changes in characteristic properties of the audio signal s from one time instant to another. Large changes may be an indicator of boundaries between two time sections of the audio signal s. A simpler implementation of the section identifier 102 splits the audio signal s into time sections of equal length (e.g. between 1/10 seconds to several seconds). Other implementations of the section identifier 102 are also possible. The section identifier 102 produces a set of time instant values {tj, t₂, ...} to be used by other components of the audio signal processor 100. The audio signal s may be provided to the audio signal processor as a digital, pulse code modulated (PCM) signal. Other forms for the audio signal s are also possible, and even an analog representation. In the case of s being an analog signal, it would either be analog-to- digital converted for subsequent digital signal processing or it would be analyzed and processed as an analog signal.

The set of time instants {tj, t₂, ... } is transmitted to an analysis means 104 in the form of, for example, a vector or a list. The analysis means processes the audio signal s in a time section- wise manner to determine a plurality of measures of information content for the plurality of time sections. Thus, the analysis means 104 determines a high measure of information content for the first section shown in the time diagram for the audio signal s, and a lower measure of information content for the second section. The measures of information content are indicated by reference signs Mi, M₂, ... .

In order to determine and quantify the measure of information content for a given time section, the analysis means 104 may analyze the audio signal within the time sections in a variety of ways, relatively simple implementation is based on evaluating the strength of the audio signal within the given time sections. To this end, an average amplitude or power of the audio signal within the given time sections may be determined. The determination of a maximal value within the given time section is another option. An amplitude-based or power- based analysis of the audio signal is suitable to distinguish between silent parts and non-silent parts of the audio signal. A more complex approach is to perform a spectral analysis of the time section to find out how the audio signal is distributed in the frequency-domain. A relatively uniform distribution of the audio signal's power spectrum over the frequency range occupied by the audio signal could indicate that the audio signal mostly consists of noise in the evaluated time section. Yet another option for the implementation of the analysis means 104 is given by a pattern detection. The audio signal in the time section is compared with a plurality of sound samples and the most similar sound samples are retained. Each sound sample may have an information associated with it indicating the nature of the sound sample, e.g. high measure of information content or low measure of information content. A more elaborate approach could even distinguish between, e.g. a man's voice, a woman's voice, a child's voice, noise, traffic noise, etc. Based on the result of the comparison, the analysis means may determine whether the time section in question has a high measure of information content or a low measure of information content. As an option, the analysis means 104 may locally measure the speech velocity (e.g. syllables rate), for determining whether a time section of the audio signal s essentially comprises spoken language, and if applicable, for determining a speech velocity within the time section. The information about the speech velocity may be used for controlling the time- stretching and or compression of individual time sections within the audio signal s. Another option for the analysis means 104 is to receive externally provided control data, for example as meta-information provided along with the audio signal s. The set of measures of information content {M_l5 M₂,... } is transmitted to a manipulation factor unit 106. The manipulation factor unit 106 determines a plurality of manipulation factors {ADi, AD₂,... } (the letter D stands for "duration"). For example, the manipulation factor unit 106 may assign a manipulation factor Δϋ, resulting in a time-stretch to be performed on a corresponding time section i if the corresponding measure of information content Mj is high. In contrast, time sections with a low measure of information content are assigned a manipulation factor that results in a compression of the corresponding time section. The manipulation factor unit 106 may optionally receive the set of time instants {tj, t₂,... }, too. Based on the information about the time instants of the boundaries between the various time sections, the manipulation factor unit 106 can evaluate how much margin is available in time sections with low measure of information content that can be used for time-stretching adjacent time sections having a higher measure of information content. This may be useful if it is intended that the time-stretching and compression will not modify the temporal positions of the corresponding time section within the entire audio signal and/or a total duration of the entire audio signal. For example, consider the case where the audio signal is an audio track of a movie picture. Assuming that one time section corresponds more or less to one line of an actor, it is important that the time section of the audio signal is played back substantially at the same time as the image of movie picture shows the actor pronouncing the line. Although a perfect synchronization of the played back audio signal is typically no longer feasible due to the time-stretching or compression of the time section, at least the beginning of the actor's line can be synchronized to the image of the movie picture so that the viewer knows what the actor said during a specific scene. Thus, the audio signal processor and the manipulation factor unit in particular may be implemented to preserve the temporal position of a given time section in the audio signal with respect to a beginning, an end, or a center of the given time section.

The set of manipulation factors {ADi, AD₂,...} is sent from the manipulation factor unit 106 to a time-stretching and compression device 108 in the form of e.g. a vector, a list, or a handover in one or more registers of the audio signal processor 100. The time-stretching and compression device 108 also receives the set of time instants {ti, t₂, ... } from the section identifier 102, so that the time-stretching and compression device 108 can perform time- stretching and/or compression operations at the intervals indicated by the time instants provided from the section identifier 102. Time-stretching and compression may be done by resampling the audio signal at a higher or lower sampling rate. The resampled audio signal is then decimated or interpolated in order to obtain the original sampling rate again. Resampling and decimating or interpolating the audio signal typically causes a modification of the pitch of the audio signal in the affected time section. The modification of the pitch may be used as an indicator to the listener by how much a particular time section has been time-stretched or compressed. If the modification of the pitch is not desired, it may be prevented, e.g., by using a phase vocoder. The phase vocoder provides a high quality solution for time-scale modification of signals. Pitch-scale modifications are usually implemented as a combination of time-scaling and sampling rate conversion. For a detailed description of phase- vocoders, reference is made to the following citations:

• "The Phase- Vocoder: A Tutorial ", Mark Dolson, Computer Music Journal, Vol. 10, No. 4, pages 14-27, 1986; · "New Phase- Vocoder Techniques for Pitch-Shifting, Harmonizing and Other Exotic

Effects", Jean Laroche and Mark Dolson, Proceedings 1999 IEEE, Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, Oct. 17-20, 1999, pages 91-94;

• "New Approach to Transient Processing in the Phase Vocoder", A. Robel, Proceedings of the International Conference on Digital Audio Effects of DAFx-03,

London, UK, September 8-1 1, 2003, pages DAFx-1 to DAFx6;

"Phase-locked Vocoder", Meller Puckette, Proceedings 1995 IEEE,

Conference on Applications of Signal Processing to Audio and Acoustic Noise;

US Patent Nr. 6,549,884.

Other available methods and techniques to time-stretch and/or compress time sections of the audio signal are provided by the PSOLA, WSOLA, SOLA, PICOLA, TDHS, and MPEX method.

The output of the time-stretching and compression device 108 and typically also of the audio signal processor 100 is a modified audio signal s', as illustrated in the second time diagram of Fig. 1. It can be seen that the first section in the modified audio signals (section Γ) has been time- stretched at the cost of the second time section (section 2'). This leads to a shift of the boundary between the first section and the second section t₂ to a new value t₂'. The time instants ΐ and t₃' are substantially unchanged and therefore substantially equal to ti and t₃, respectively. Note however that, in departure from the illustration in Fig. 1, the time section to the right of section 2 might also have been subjected to a time-stretching operation. In that case the time instant t₃ would have been shifted to the left, so that the time interval for section 2 would be even stronger compressed.

Fig. 2 shows another embodiment of the audio signal processor according to the teachings of this document. The section identifier 102, the analysis means 104, and the time-stretching and compression device 108 are substantially identical to the ones described in Fig. 1. The analysis means 104 provides the set of measurements of information content {M^ M₂, ... } to a comparator 204 which compares the measure of information content to a threshold M_thr and classifies each one of the plurality of time sections as having a high(er) measure of information content or a low(er) measure of information content. Thus, two classes are formed which reflects the fact that a time section can be either time-stretched or compressed. A third possibility would be to leave some of the time sections unaltered, giving rise to a potential third category and a second threshold for the measurement of the information content.

The threshold(s) may either be predetermined and fixed or variable in order to be adapted to the properties of a given audio signal. For example, one strategy may be to determine the threshold M_{t r} in a manner that the number of sections with high measure of information content is approximately equal to the number of time sections with low measure of information content. Thus, a relatively high number of boundaries between sections of different information content measure are obtained which increases the degrees of freedom for the manipulation factor unit 106 and/or the time-stretching and compression device 108. To this end, all of the measures of information content would be determined in a first step, then sorted according to their respective measures of information content, and finally the threshold would be set to the mean measure of information content. The comparator 204 produces a set of classification values {Ci , C₂, ... } which is provided to a section bounding means 206. The section bounding means 206 is implemented to shift boundaries between the section having a higher measure of information content and the sections having a lower measure of information content for the benefit of the former time sections and at the cost of the latter time sections. For example, the boundaries are shifted into the time sections having the lower measure of information content. These section bounding means 206 further receive the set of time instants {tj, t₂, ... } from the section identifier 102. The set of time instants is also supplied to the manipulation factor unit 106, as well as a set of shifted time instants

t₂', ...} by determining the difference between the original time instants and the shifted time instants. The manipulation factor unit 106 can determine the time-stretch or compression factors for the various time intervals. The determined manipulation factors are again transmitted to the time-stretching and compression device 108.

Fig. 3 shows a diagram of the measurement of information content determined for a plurality of time sections. The measure of information content M is piece-wise constant for the duration of at least one time section in this embodiment. The measure of information content is compared to a threshold Mthr. Based on a result of the comparison, the time sections are classified as sections having a higher measure of information content or as sections having a lower measure of information content. Two or more adjacent sections having the same classification may be combined to a contiguous region of time sections. For the purposes of time-stretching and compressions, the contiguous regions may be regarded as a unit, e.g. the same manipulation factor applies for all time sections within the contiguous region. The boundaries between adjacent contiguous regions are determined and shifted by an amount depending on the time manipulation factors valid for the two adjacent contiguous regions typically into the one of the contiguous regions that has the lower measure of information content. Time-stretching or compressing the first time section comprises time-stretching or compressing the time sections making up a contiguous region having a higher measure of information content into at least one adjacent contiguous region having a lower measure of information content, in correspondence to the shifted boundary and the amount of shifting.

Fig. 4 shows a graph similar to the one shown in Fig. 3 of a measure for information content M as determined for a plurality of time sections. Depending on the length of the time sections, in particular, when the length of the time sections and/or the threshold M_thr is predetermined and fixed, a certain audio signal may cause a lot of transitions from time sections with low measure of information content Μ,· = LO to time sections with a high measure of information content Mj = HI. This may, for example, happen when the audio signal has been recorded at a low recording level or the speaker has spoken in a relatively soft voice. In order to avoid too rapid changes between high and low information content measure sections, the comparator 204 may be arranged to determine the classification result using a hysteresis. As can be seen in Fig. 4, the comparator 204 uses two threshold values, M_hi and Mi₀. A boundary between a preceding time section with a low measurement of information content and a subsequent time section with a high measure of information content occurs if the higher threshold Mhi is exceeded in the upward direction. On the other hand, a transition from a time section with high information content measure to one with low information content measure occurs when the lower threshold Mi₀ is exceeded in the downward direction. Thus, the contiguous regions resulting from combining several adjacent time sections are larger than without a hysteresis. It can thus be avoided that the audio signal is split up into too many contiguous regions, leading to a high number of manipulation factors. It may be confusing to the listener if the manipulation factor changes too frequently.

The choice of, and interaction between, the values for the thresholds M_thr, M_hi, M]₀ and the length of the elementary time sections may be subject to a preprocessing step in which the audio signal is evaluated with respect to e.g. an average level of information content.

Fig. 5 shows a schematic block diagram of another embodiment for an audio signal processor 100 according to the teachings of this document. The audio signal processor 100 now further comprises a limiting device 508 for the time-stretching or compression. The limiting device 508 is implemented to determine a current threshold for the time stretching or compression of the section having higher information content and to limit the time stretching and compression to the current threshold. Fig. 5 shows an embodiment in which the limiting device 508 implements an upper threshold AD_max, and a lower threshold AD_mi_n. In the interval [ADmin, AD_max], the limiting device 508 is substantially a unity function, i.e. an output of the limiting device 508 is substantially equal to an input thereof. Outside this interval, the output value is limited to the respective lower or upper value. The output of the limiting device 508 is a set of limited manipulation factors {ADi ', Δϋ₂',... }. The limiting device 508 and a corresponding limiting action of a method for adjusting time information content variations of the audio signal avoids time-stretching or compressing the time sections of the audio signals s with excessive manipulation factors which would result in a speech being played back too slowly or too fast, for example.

Fig. 6 shows a schematic block diagram of another embodiment of an audio signal processor 100 according to the teachings disclosed herein. The audio signal s is also supplied to a speech velocity measuring device for determining whether a time section of the audio signal s essentially comprises spoken language, and if applicable, for determining a speech velocity within the time section. A set of section related speech velocity measures {vj, v₂,... } is output by the speech velocity measuring device 602 and forwarded to a threshold setting device 608. The threshold setting device 608 is connected to the speech velocity measuring device 602 and intended for determining, based on the determined speech velocity, at least one threshold for the manipulation factor valid for the time section in question. The threshold setting device 608 is further connected at an output of the threshold setting device 608 to the limiting device 508. The limiting device 508 receives a current threshold value or several current threshold values AD_max and AD_mj_n from the threshold setting device 608.

The embodiment of the audio signal processor 100 shown in Fig. 6 can be used for controlling the time-stretching and/or compression of individual time sections within the audio signal s. In particular, the degree of time-stretching and/or compressing can be determined as a function of an instantaneous speech velocity. By controlling the audio signal processing via an estimate of the instantaneous speech velocity, a balanced, substantially uniform speech velocity may be obtained over the entire speech signal due to such processing. This may be particularly helpful with intermittently performed speech or an irregular speech velocity. The speech comprehensibility of such speech presentations thus may be improved.

The set of estimated speech velocities {v_l5 v₂,...} may also be supplied directly to the manipulation factor unit 106 instead of to the threshold setting device 608, or in addition thereto. It is also possible to use the speech velocity estimate as a measure of information content in the various time sections, or as a precursor thereof. In this case, the speech velocity measuring device 602 may be a part of the analysis means 104. In the context of the method for adjusting time information content variations of the audio signal s, the following actions may be performed in connection with a speech velocity measurement:

• determining whether the audio signal s essentially comprises spoken text within a given time section;

• determining a speech velocity of the spoken text during the given time section when the audio signal s essentially comprises spoken text within the given time section; and

• determining, in dependence on the speech velocity, at least one threshold for the manipulation factor for the given time section.

One method that may be used to determine or estimate the speech velocity is to detect phonemes in the audio signal s and to count the number of phonemes per time unit. Per definition, a phoneme is the smallest segmental unit of sound employed to form meaningful contrasts between utterances in a language or dialect.

Fig. 7 shows a schematic flowchart of an embodiment of the method for adjusting time information content variations in the audio signal s. The method illustrated by the flowchart comprises some optional actions that are not part of a base embodiment of the method. After the start of the method, first and second measures of information content are determined, the first measure of information content corresponding to a first time section of the audio signal s and the second measure of information content corresponding to a second time section of the audio signal s (reference number 702). As shown in the box with reference number 704, at least the first measure of information content may be compared with a threshold M_thr. Typically, the measures of information content M; of all time sections are compared with the threshold. The comparison of the information content measures with the threshold M_thr is a preparatory action to a classification of the one or more time sections as a section with a high measure of information content or as a section with a low measure of information content (reference sign 706). Alternative embodiments may use three of more classes instead of only the two classes for high and low information content. The grouping of time sections having approximately equal measures of information content into a countable number of classes makes it possible to combine adjacent time sections having equal classification results, that is being in the same class, to form larger contiguous regions within the audio signal class, in which the measure of information content is approximately constant. Such a contiguous region may correspond to e.g. a complete sentence spoken by a speaker without any significant pauses. The combination of the adjacent time sections is represented by the box 708 in the flowchart of Fig. 7. In this embodiment of the method, the measures of information content are determined for relatively short time sections (on the order of a fraction of a second to a few seconds, e.g. 0.5 seconds, 1 second, 2 seconds, or 5 seconds). Thus, a relatively fine granularity can be achieved which facilitates a relatively precise detection of time instants in the audio signal s, at which the measure of information content varies significantly, such as at the end of a spoken passage followed by a pause or silence. On the other hand, the contiguous regions are typically larger than a single time section and thus allow the time-stretching or compressing of longer passages.

At 710 boundaries between the adjacent contiguous regions are determined, and then, at 712 a security zone is inserted in the contiguous regions having a low measure of information content. The security zone is typically inserted adjacent to the boundary to the time section with a high measure of information content. This will be explained in more detail below in the context of Fig. 9c. In short, the insertion of the security zone is done to prevent the beginnings and ends of spoken passages to be treated as having a low measure of information content only, which may occur due to edge effects or certain phenomena of spoken language occurring at the beginning or the end. The security zone is then attached to the adjacent region having a high measure of information content. Thus, the security zone will be treated as a part of the high information content measure or region, i.e. it undergoes the same time-stretching and/or compression (cf. reference sign 714).

A time manipulation factor is determined for the first time section in dependence on the first measure of information content, and the second measure of information content at 716. The determination of the manipulation factor ADj may evaluate how much resource in the form of time intervals having a low measure of information content, is available around a time section having a high measure of information content, so that the high information content section may be time-stretched into the low information content section. When time sections containing substantial pauses or silence are compressed for the benefit of time sections containing spoken language, the determination of the time manipulation factor may keep a shorter pause or silence which may help, for example, a listener to mentally segment two subsequent sentences from each other. A currently valid threshold AD_max, AD_min for the time manipulation factors ADj is determined at an action having the reference sign 718 in Fig. 7. Then, at 720, the time manipulation factor ADj for a given time section is limited according to the current threshold AD_max, AD_mj_n. The audio signal s is then processed by time-stretching or compressing the first time section(s) as indicated by action 722 in Fig. 7. It is to be noted that the method may be repeated or that only selected actions of the method may be repeated.

Fig. 8 shows another schematic flow diagram of another embodiment of the method for adjusting time information content variations. The speech signal s is supplied to a pause detection 802 and to an optional filled pause detection 804. Filled pauses contain less important information, such as make-shift words (mh, oh, well, etc), or repeated words, to name a few. At an action 806, the pauses are at least partially removed. The pause removal may comprise a determination of modified time instants in the audio signal s to which non- pause time sections of the audio signal s may be extended. A result of the pause removal action 806 is supplied to a function block 818 charged with the creation of a time-stretching function. Both, the pause removal 806 and the time-stretching function 818 are controlled by control parameters 808, such as thresholds. The time-stretching function 818 is then applied to the audio signal s at 822, which yields the modified audio signal s' .

In Figs. 9A to 9E, a simple implementation of the method for adjusting time information content variations is visualized. By means of an evaluation of the signal energy shown in Fig. 9A, pauses are determined which are illustrated in Fig. 9B in the form of hatched rectangles. The determination of the pauses has located the pauses in time intervals in which the signal energy of the audio signal s is relatively low and possibly close to zero. When the energy is below a certain threshold, the presence of a pause is assumed and hence detected. In addition, security zones are inserted at both ends of the detected pauses in order to prevent a removal of low energy parts of words, such as "F" or "H" sounds. The security zones are represented as thick lines to the left and right of each detected pause in Fig. 9C.

Fig. 9D illustrates how a ratio of pauses versus speech activity is calculated. The time interval di represents the duration of a first segment containing speech activity (including the security zone). The time interval d₂, represents the duration which is available to the time-stretching function 818 (Fig. 8), when the left pause is used to this end. The time-interval d₂ does not consider that this particular pause may also be utilized by the center segment of speech activity. This may be resolved at a later stage by calculating an average split point within the pause. The calculation of the average split point may be a weighted average calculation based on the individual durations of the various time segments containing speech activity.

Fig. 9E shows the result after the time-stretching or compression has been performed in accordance with the preparatory calculations.

It is to be noted that although the duration of the modified audio signal s' shown in Fig. 9E is longer than the duration of the original audio signal s, this is not necessarily the case. In particular, the three segments containing speech activity illustrated in Figs. 9A to 9E, may be maintained at their respective temporal positions, if desired. As such, the time instants of the beginning, the end, or the middle of the segments with each activity may be fixed, and hence, equal in the original audio signal s and in the modified audio signal s' .

The time-stretching can be done by stretching speech segments into adjacent pauses.

In the alternative, an estimation of a pause identity can be performed over time, the result of which may then be used for the actual time-stretching or compression. Based on the detected pauses, a speech stretching function is calculated which, among others, limits the variation of the stretching as illustrated in Fig. 10A. Fig. 10A shows a function of single time-stretch factors as a step function. Fig. 10B shows a function of interpolated time manipulation factors or stretching factors based on the step function shown in Fig. 1 OA. When time-stretching or compressing is based on interpolated time manipulation factors, the listener may more easily adapt to the gradually increasing or decreasing speech velocity as opposed to the abrupt changes of the time manipulation factors shown in Fig. 10A which might lead to equally abrupt changes of the speech velocity in the modified audio signal s'. The interpolation of the time manipulation factors may be performed by a manipulation factor smoother for smoothing the manipulation factor over time.

Fig. IOC shows a limited variation of the time stretching factors. This fixes the minimal and maximal allowable time-stretching and/or compression. The minimal and maximal threshold needs to be determined, for example, by the fact that excessive time manipulation factors may lead to an unnatural rendering of the audio signals. Furthermore, the sound quality may suffer when a given audio signal or a time section thereof is stretched too much, since the original audio signal only contains a limited number of samples if it is available in a digital format (e.g. PCM). In principle, also an analog signal typically suffers from a loss of sound quality when it is time stretched or shrinked by e.g. electro-mechanical means. Fig. 10D also shows a limited variation of the time-stretching / compression, however, adapted to the signal. The degree of time-stretching or compression changes slowly with the signal. The variations within short time segments, however, are limited. The slowly varying lower and upper thresholds AD_mj_n(t) and AD_max(t) may be determined by moving averages over relatively long time intervals, e.g. 10 seconds, 30 seconds, or 1 minute, or values in between.

Fig. 10E shows the time dilatation function for alternative embodiments of the audio signal processor 100 and the method for adjusting time information content variations. Pauses are not cut or deleted, but remain in the audio signal. Only the regions with speech activity are time-stretched or "compressed", whereas the "filled" pauses remain unmodified.

The teachings disclosed in this document, in particular, the audio signal processor, the method for adjusting time information content variation and the computer program, enable to time- stretch / compress audio signals in a signal adaptive manner without human interaction. It is possible to detect filled or empty pauses and to process them differently from active speech segments. Moreover, it is possible to playback audio signals slower while maintaining the pitch. In particular, spoken language can be played back at a lower speed and thus be more easily comprehensible without necessarily lengthening the duration of the audio signal.

In the alternative, the total duration may be modified if the pauses are maintained with their original duration. However, the pauses do not need to be time-stretched or compressed along with the rest of the audio signal, so that the new total duration is shorter than a new total duration that would have been obtained by globally time-stretching the entire audio signal. The same applies in principle to the compression of the audio signal so that the total duration of the audio signal, after compression using the proposed methods, would be longer than a total duration of a conventionally (across-the-board) time-compressed audio signal.

The audio signal processor 100 may further comprise a deleting device implemented to delete the content of the second time section when the second measure of information content M₂ is lower than a deletion threshold. The deletion or erasing of the content in the second time section may be useful if the content comprises repeated word, repeated syllables, makeshift words etc.. Without the deletion, theses words, syllables, and sounds would be for example compressed and thus played back at a higher speed than originally recorded, which might be distracting for the listener. In order to identify signal passages in the audio signal s that contain superfluous words or sounds, a pattern detector may be used which is compares the signal passages with reference signal passages stored in a database. The reference signal passages may comprise the above mentioned makeshift words when pronounced by various speakers, superfluous sounds such as throat clearing, and other similar patterns. Word repetitions and syllable repetitions may be detected e.g. by an autocorrelation function. Note that word repetitions may be common and perfectly correct in some languages (for example in German) which should be taken into account by a word or syllable repetition function. An excision means may be used to remove the repeated words or syllables from the audio signal. The teachings disclosed in this document may be employed in the field of the distribution of audio content, such as digital radio, internet streaming, and audio communication applications. In particular, applications are imaginable in two categories:

• real-time applications, e.g. speech communication and audio coding; and

· processing of already recorded material, e.g. radio plays, lectures, etc.

The teachings disclosed herein may be beneficial for persons wishing to follow foreign languages more easily or studying foreign languages. Access to radio plays and audiobooks is facilitated to mentally challenged people as well as senior citizens. Furthermore, applications in the field of training linguistically challenged persons are also possible.

Some original audio signals may comprise rather long pauses. If these pauses are compressed so that the listener does not need to wait a long time between two speech activity segments, a sound or speech synthesizer may insert a short information about the original duration of the pauses, such as a succession of short beeps, each beep representing a pause of, for example, one minute. A duration of the pause could also be represented by using different pitches of the sound, a low sound indicating the long pause and a high pitched sound indicating a short pause. A speech synthesizer could be used to insert the words "pause of X minutes Y seconds".

Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus. Some or all of the method steps may be executed by (or using) a hardware apparatus, like for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, some one or more of the most important method steps may be executed by such an apparatus.

Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blue-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.

Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.

Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.

Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.

In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.

A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitionary.

A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet. A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein. A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.

A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver .

In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are preferably performed by any hardware apparatus. The above described embodiments are merely illustrative for the principles of the present invention. It is understood that modifications and variations of the arrangements and the details described herein will be apparent to others skilled in the art. It is the intent, therefore, to be limited only by the scope of the impending patent claims and not by the specific details presented by way of description and explanation of the embodiments herein.

Claims

1. An audio signal processor (100), comprising: an analysis means (104) implemented to determine a first measure of information content (Mi) of a first time section of an audio signal and a second measure of information content (M₂) of a second time section; a manipulation factor unit (106) implemented to determine a time manipulation factor (AD for the first time section in dependence on the first measure of information content (Mi) and the second measure of information content (M₂); a time-stretching and compression device (108) implemented to time-stretch or compress the first time section according to the manipulation factor (ADi) and to treat the second time section differently from the first time section.

2. The audio signal processor (100) according to claim 1 , further comprising: a comparator (204) implemented to compare the first measure of information content

(Mi) of the first time section to a threshold and to classify the first time section, in dependence on a respective result of the comparison, as a section having a higher measure of information content or as a section having a lower measure of information content; and a section bounding means (206) implemented to shift boundaries between the sections having a higher measure of information content and the sections having a lower measure of information content into the sections having a lower measure of information content; wherein the time-stretching and compression device (108) is further implemented to time-stretch or compress a section having a higher measure of information content by a factor corresponding to the shift of the boundaries of the first time section.

3. The audio signal processor according to claim 2, further comprising: a limiting device (508) for the time-stretching or compression, wherein the limiting device is implemented to determine a current threshold (AD_mj_n, AD_max) for the time-stretching or compression of the section having higher information content and to limit the time- stretching and compression to the current threshold.

4. The audio signal processor according to claim 3, wherein the limiting device (508) is implemented to evaluate a moving average of the first measure of information content.

5. The audio signal processor according to claim 3 or 4, wherein the limiting device (508) is further implemented to vary the current threshold over the duration of the audio signal (s) in order to adjust sectional variations of the measure of information content (Mi, M₂).

6. The audio signal processor (100) according to any one of claims 1 to 5, further comprising: a pause density estimator implemented to perform pause density estimation over time, a result of which determines a shifting measure for shifting the boundaries.

7. The audio signal processor (100) according to any one of claims 1 to 6, wherein the analysis means (104) is implemented to identify a certain time section as a pause in the audio signal (s) and to set the manipulation factor (Mi, M₂) for the certain time section to a neutral value so that the certain time section is not time-stretched or compressed.

8. The audio signal processor according to one of claims 1 to 7, further comprising: a speech velocity measuring device (602) implemented to determine whether a time section of the audio signal (s) essentially comprises spoken language, and implemented to determine a speech velocity within the time section; a threshold setting device (608) connected to the speech velocity measuring device (602) and implemented to determine, based on the determined speech velocity, at least one threshold for the manipulation factor valid for the time section.

9. The audio signal processor (100) according to one of claims 1 to 8, further comprising: a deleting device implemented to delete content of the second time section when the second measure of information content is lower than a deletion threshold.

10. The audio signal processor (100) according to one of claims 1 to 9, wherein the time- stretching and compression means comprises at least one of SOLA, WSOLA, PSOLA, PICOLA, TDHS, MPEX or phase vocoder algorithm.

11. The audio signal processor (100) according to one of claims 1 to 10, further comprising: a total signal time-stretching and compression means implemented to time-stretch or compress time sections having a higher measure of information content and to leave time sections having a lower measure of information content substantially unaltered with regard to their duration.

12. The audio signal processor (100) according to any one of claims 1 to 1 1 , further comprising: a manipulation factor smoother for smoothing the manipulation factor over time.

13. The audio signal processor (100) according to any one of claims 1 to 12, further comprising: a repetition detector implemented to detect repeated passages within the audio signal; an excision means implemented to excise repeated passages from the audio signal.

14. A method for adjusting time information content variations of an audio signal (s), comprising: determining (702) a first measure of information content (Mi) of a first time section of the audio signal and a second measure of information content (M₂) of a second time section of the audio signal (s); determining (716) a time manipulation factor (ADj) for the first time section in dependence on the first measure of information content (Mi) and the second measure of information content (M₂); processing the audio signal (s) such that the first time section is time-stretched or compressed according to the time manipulation factor (ADi) and that the second time section is processed differently from the first time section.

15. A computer program having a program code for performing the method according to claim 14 when the program runs on a computer.