US20170125037A1 - Electronic device and method for recognizing speech - Google Patents
- Publication number
- US20170125037A1 (application US15/340,528)
- Authority
- US
- United States
- Prior art keywords
- audio signals
- speech
- power value
- direction information
- audio signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/84—Detection of presence or absence of voice signals for discriminating voice from noise
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166—Microphone arrays; Beamforming
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L2025/783—Detection of presence or absence of voice signals based on threshold decision
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
Definitions
- Apparatuses and methods consistent with the present disclosure relate to an electronic device and a method for recognizing a speech, and more particularly, to an electronic device and a method for detecting a speech section in an audio signal.
- the speech recognition technology means a technology of understanding an intention of an uttered speech of a user from a speech signal input from hardware or software device or a system and performing an operation based on the understood intention.
- the speech recognition technology recognizes various sounds generated from the surrounding environment as well as a speech signal for the uttered speech of the user and therefore may not correctly perform the intended operation of the user.
- General methods for detecting a speech section include a method for detecting a speech section using energy for each audio signal in a frame unit, a method for detecting a speech section using zero crossing for each audio signal in a frame unit, and a method for extracting a feature vector from an audio signal in a frame unit and detecting a speech section by determining the existence or nonexistence of a speech signal from the extracted feature vector using a support vector machine (SVM).
- the method for detecting a speech section using energy of an audio signal in a frame unit or zero crossing uses energy or zero crossing for audio signals for each frame.
- This existing method requires relatively less computation than other methods for detecting a speech section to determine whether the audio signal of each frame is a speech signal, but may often erroneously detect a noise signal, as well as the speech signal, as the speech section.
- The method for detecting a speech section using a feature vector extracted from an audio signal in a frame unit and an SVM detects only the speech signal from the audio signals for each frame more accurately than the foregoing methods using energy or zero crossing, but requires more computation to determine the existence or nonexistence of the speech signal in the audio signals for each frame and therefore may consume much more CPU resources than other methods for detecting a speech section.
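The energy and zero-crossing measures discussed above can be sketched as follows. This is only an illustration of the general techniques, not code from the disclosure; the function names, the sample frames, and the threshold value are assumptions made for the example.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def is_speech_by_energy(frame, energy_threshold):
    # Hypothetical decision rule: the frame counts as speech
    # when its energy exceeds the given threshold.
    return frame_energy(frame) > energy_threshold


loud = [0.5, -0.4, 0.6, -0.5]      # could be speech or loud noise
quiet = [0.01, -0.01, 0.02, -0.01]
print(is_speech_by_energy(loud, 0.01), is_speech_by_energy(quiet, 0.01))
```

Because the rule looks only at signal strength, a loud noise frame passes it just as a speech frame does, which is the kind of false detection described above.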
- Exemplary embodiments overcome the above disadvantages and other disadvantages not described above. Also, the embodiments are not required to overcome the disadvantages described above, and an exemplary embodiment may not overcome any of the problems described above.
- the present disclosure correctly detects a speech section including a speech signal from an input audio signal in an electronic device.
- the present disclosure inputs speech signals of a short distance and a far distance and detects a speech section based on sound direction tracking of the speech signals in the electronic device.
- a method for recognizing a speech by an electronic device includes: receiving sounds generated from a sound source through a plurality of microphones; calculating power values of a plurality of audio signals generated by performing signal processing on each sound input through the plurality of microphones and calculating direction information on the sound source based on the calculated power values and storing the calculated direction information; and performing the speech recognition on a speech section included in the audio signal based on the direction information on the sound source.
- the speech section may be detected based on audio signals corresponding to starting and ending points among the plurality of audio signals and the speech recognition may be performed on the detected speech section.
- the storing may include: calculating a maximum power value and a minimum power value from the plurality of signal-processed audio signals; calculating a power ratio from the calculated maximum power value and minimum power value; determining at least one audio signal of which the calculated power ratio is equal to or more than a preset threshold value; and calculating the direction information on the sound source from a sound corresponding to the at least one determined audio signal and storing the calculated direction information and an index for the at least one audio signal.
- the storing may further include: comparing the minimum power value calculated from the plurality of audio signals with a pre-stored minimum power value to determine a power value having a smaller size as the minimum power value for the plurality of audio signals, if the minimum power value calculated from a previous audio signal is pre-stored.
- the storing may further include: resetting a minimum power value calculated from an N-th audio signal to an initial value, if a predefined N-th audio signal is input.
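The minimum-power bookkeeping in the two preceding items (keep the smaller of the new and the pre-stored minimum, and reset at a predefined N-th audio signal) might look like the sketch below. It is a simplified running minimum, not the MCRA algorithm itself; the class name, the use of infinity as the initial value, and the choice to reset after reporting the N-th value are assumptions.

```python
class MinPowerTracker:
    """Running minimum of per-frame power with a periodic reset (illustrative)."""

    def __init__(self, reset_period, initial_value=float("inf")):
        self.reset_period = reset_period    # the predefined N (assumed to count frames)
        self.initial_value = initial_value  # assumed initial value
        self.stored_min = initial_value
        self.count = 0

    def update(self, frame_min_power):
        self.count += 1
        # Keep the smaller of the newly calculated and the pre-stored minimum.
        self.stored_min = min(self.stored_min, frame_min_power)
        current = self.stored_min
        # When the N-th frame has been processed, reset for the next window.
        if self.count == self.reset_period:
            self.stored_min = self.initial_value
            self.count = 0
        return current


tracker = MinPowerTracker(reset_period=3)
history = [tracker.update(p) for p in [5.0, 3.0, 4.0, 10.0]]
print(history)
```

Resetting periodically lets the stored minimum rise again when the background noise level increases, instead of staying pinned to an old, very small value.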
- N*(N-1)/2 power values may be calculated from the plurality of audio signals using a generalized cross-correlation phase transform (GCC-PHAT) algorithm and the largest value among the N*(N-1)/2 power values may be determined as the maximum power value, if the number of microphones is N, and the minimum power value is calculated from the plurality of audio signals using a minima-controlled recursive average (MCRA) algorithm.
- the direction information may be angle information between sound direction of the sound source from which the sounds corresponding to each of the plurality of audio signals are generated and the plurality of microphones, and in the calculating of the maximum power value and the minimum power value, the direction information on the sound source from which the sounds corresponding to each of the plurality of audio signals are generated may be calculated from a delay value corresponding to the determined maximum power value.
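One common way to turn a delay value into angle information is the far-field two-microphone model, in which the delay equals d·sin(θ)/c. The sketch below assumes that model; the disclosure does not state which geometry it uses, and the function name, the speed of sound constant, and the parameter names are assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second, assumed room-temperature value


def delay_to_angle(delay_samples, sample_rate, mic_distance):
    """Angle in degrees between the array broadside and the sound direction.

    Assumes a far-field source and two microphones mic_distance metres apart.
    """
    tau = delay_samples / sample_rate            # delay in seconds
    x = SPEED_OF_SOUND * tau / mic_distance      # equals sin(theta) in this model
    x = max(-1.0, min(1.0, x))                   # clamp against rounding and noise
    return math.degrees(math.asin(x))


print(delay_to_angle(2, 16000, 0.1))
```

A delay of zero maps to a source directly in front of the microphone pair, and the sign of the delay indicates on which side the source lies.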
- the speech recognition may be performed on a speech section included in audio signals corresponding to at least two direction information if the at least two of the plurality of direction information is included in a preset error range or the error range of the two direction information is less than a preset threshold value.
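The check that two pieces of direction information fall within a preset error range could be sketched as below. Treating the directions as angles in degrees that wrap at 360, and using an inclusive comparison, are assumptions of this example.

```python
def directions_agree(angle_a, angle_b, max_error_deg):
    """True when two direction angles differ by at most max_error_deg."""
    diff = abs(angle_a - angle_b) % 360.0
    diff = min(diff, 360.0 - diff)  # shortest way around the circle
    return diff <= max_error_deg


print(directions_agree(30.0, 33.0, 5.0), directions_agree(30.0, 60.0, 5.0))
```

Audio signals whose directions agree in this sense would be treated as coming from the same sound source and therefore as belonging to the same speech section.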
- the performing of the speech recognition may include: detecting the speech section from the audio signal based on the index for the at least one audio signal of which the power ratio is equal to or more than a preset threshold value; performing signal processing on the audio signal in the detected speech section based on the direction information on the sound source from which a sound corresponding to at least one audio signal of which the power ratio is equal to or more than a preset threshold value is generated; and performing the speech recognition from the signal-processed audio signal and transforming the speech into a text.
- the signal processing may be performed on the audio signal in the detected speech section using at least one of a beam-forming scheme including at least one of linearly constrained minimum variance (LCMA) and minimum variance distortion-less response (MVDR), a geometric source separation (GSS) scheme, and a blind source extraction (BSE) scheme.
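As an illustration of one of the beam-forming schemes named above, the standard MVDR weight formula w = R⁻¹d / (dᴴR⁻¹d) can be sketched for two microphones and a single frequency bin. The covariance matrix and steering vector values below are made up for the example, and the function name is hypothetical; the disclosure does not give an implementation.

```python
import cmath


def mvdr_weights_2mic(R, d):
    """MVDR weights for two microphones at one frequency bin.

    R: 2x2 Hermitian noise covariance as nested lists (assumed known).
    d: steering vector of two complex numbers for the target direction.
    """
    (a, b), (c, e) = R
    det = a * e - b * c
    # Closed-form inverse of a 2x2 matrix.
    r_inv = [[e / det, -b / det], [-c / det, a / det]]
    # u = R^-1 d
    u = [r_inv[0][0] * d[0] + r_inv[0][1] * d[1],
         r_inv[1][0] * d[0] + r_inv[1][1] * d[1]]
    # denom = d^H R^-1 d
    denom = d[0].conjugate() * u[0] + d[1].conjugate() * u[1]
    return [u[0] / denom, u[1] / denom]


R = [[2.0, 0.5 + 0.5j], [0.5 - 0.5j, 1.0]]   # made-up covariance
d = [1.0 + 0j, cmath.exp(-0.3j)]             # made-up steering vector
w = mvdr_weights_2mic(R, d)
# Response of the beamformer toward the target direction, w^H d.
response = w[0].conjugate() * d[0] + w[1].conjugate() * d[1]
print(abs(response))
```

The response toward the target direction has unit gain, which is the distortion-less property the scheme's name refers to, while the weights minimize the output power contributed by everything else.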
- an electronic device includes: an input receiving sounds generated from a sound source through a plurality of microphones; a memory storing direction information on the sound source; and a processor performing signal processing on each sound input through the plurality of microphones, calculating power values of a plurality of signal-processed audio signals, calculating direction information on the sound source based on the calculated power values and storing the calculated direction information in the memory, and performing the speech recognition on a speech section included in the audio signal based on the direction information on the sound source.
- the processor may detect the speech section based on audio signals corresponding to starting and ending points among the plurality of audio signals and perform the speech recognition on the detected speech section.
- the processor may calculate a maximum power value and a minimum power value from the plurality of signal-processed audio signals, calculate a power ratio from the calculated maximum power value and minimum power value, calculate the direction information on the sound source from a sound corresponding to at least one audio signal of which the calculated power ratio is equal to or more than a preset threshold value, and store the calculated direction information and an index for the at least one audio signal in a memory.
- the processor may compare the minimum power value calculated from the plurality of audio signals with a pre-stored minimum power value to determine a power value having a smaller size as the minimum power value for the plurality of audio signals, if the minimum power value calculated from a previous audio signal is pre-stored in the memory.
- the processor may reset a minimum power value calculated from an N-th audio signal to an initial value, if a predefined N-th audio signal is input.
- the processor may calculate N*(N-1)/2 power values from the plurality of audio signals using a generalized cross-correlation phase transform (GCC-PHAT) algorithm, determine the largest value among the N*(N-1)/2 power values as the maximum power value, if the number of microphones is N, and calculate the minimum power value from the plurality of audio signals using a minima-controlled recursive average (MCRA) algorithm.
- the direction information may be the angle information between sound direction of the sound source from which the sounds corresponding to each of the plurality of audio signals are generated and the plurality of microphones, and the processor may calculate the direction information on the sound source from which the sounds corresponding to each of the plurality of audio signals are generated from a delay value corresponding to the determined maximum power value.
- the processor may perform the speech recognition on a speech section included in audio signals corresponding to at least two direction information if the at least two of the plurality of direction information is included in a preset error range or the error range of the two direction information is less than a preset threshold value.
- the processor may detect the speech section from the audio signal based on the index for the at least one audio signal of which the power ratio is equal to or more than a preset threshold value, perform signal processing on the audio signal in the detected speech section based on the direction information on the sound source from which a sound corresponding to at least one audio signal of which the power ratio is equal to or more than a preset threshold value is generated, and perform the speech recognition from the signal-processed audio signal and transform the speech into a text.
- the signal processing may be performed on the audio signal in the detected speech section using at least one of a beam-forming scheme including at least one of linearly constrained minimum variance (LCMA) and minimum variance distortion-less response (MVDR), a geometric source separation (GSS) scheme, and a blind source extraction (BSE) scheme.
- a non-transitory computer-readable storage stores a method, the method including receiving sound generated from a sound source via a plurality of microphones; calculating power values of a plurality of audio signals generated by performing signal processing on each sound input through the plurality of microphones; calculating direction information for the sound source based on the power values and storing the direction information; and performing speech recognition on a speech section included in an audio signal based on the direction information for the sound source.
- a system includes: a sound source; a plurality of microphones receiving sound produced by the sound source; a memory storing a direction of the sound source; and a computer processor calculating power values of audio signals from the microphones, calculating the direction of the sound source using the power values, storing the direction in the memory and performing speech recognition on a speech section of an audio signal responsive to the direction.
- a system includes a sound source; an array of microphones receiving sound signals of sound produced by the sound source, the microphones having one of different locations and different directionalities; and a computer processor calculating power values of sound signals from the array of microphones, selecting an audio signal from the sound signals using the power values and corresponding different locations and different directionalities, and performing speech recognition on a speech section of the audio signal in a noisy environment.
- a method includes: calculating power values of audio signals generated by microphones receiving sound from a sound source, calculating a direction of the sound source based on the power values and storing the direction, and identifying end points of a speech section responsive to an angle of the direction; and performing speech recognition on the speech section.
- the electronic device may correctly detect only the speech section from the audio signal while improving the processing speed for the speech section detection.
- FIG. 1 is an exemplified diagram illustrating the environment in which an electronic device according to an exemplary embodiment of the present disclosure performs speech recognition
- FIG. 2A is a schematic block diagram of the electronic device for recognizing a speech according to an exemplary embodiment of the present disclosure
- FIG. 2B is a detailed block diagram of the electronic device for recognizing a speech according to an exemplary embodiment of the present disclosure
- FIG. 3 is a block diagram illustrating a configuration of performing speech recognition in a processor according to an exemplary embodiment of the present disclosure
- FIG. 4 is a detailed block diagram of a sound source direction detection module according to an exemplary embodiment of the present disclosure
- FIGS. 5A to 5C are exemplified diagrams illustrating a speech section detection from an input audio signal in the electronic device according to an exemplary embodiment of the present disclosure
- FIGS. 6A and 6B are exemplified diagrams illustrating a result of tracking a sound source direction from the input audio signal in the electronic device according to an exemplary embodiment of the present disclosure
- FIG. 7 is an exemplified diagram of internet of things services provided from the electronic device according to the exemplary embodiment of the present disclosure.
- FIG. 8 is a flow chart of a method for performing speech recognition by an electronic device according to an exemplary embodiment of the present disclosure
- FIG. 9 is a first flow chart of a method for storing direction information on a sound source for at least one audio signal determined as a speech section and an index for at least one audio signal, by the electronic device according to an exemplary embodiment of the present disclosure.
- FIG. 10 is a second flow chart of a method for storing direction information on a sound source for at least one audio signal determined as a speech section and an index for at least one audio signal determined as the speech section, by an electronic device according to another exemplary embodiment of the present disclosure.
- ordinal numbers like “first”, “second”, or the like may be used.
- the ordinal numbers are used to differentiate like or similar components from each other and the meaning of the terms should not be restrictively analyzed by the use of the ordinal numbers.
- a use order, a disposition order, or the like of components coupled to the ordinal numbers should not be limited by the numbers. If necessary, the respective ordinal numbers may also be used by being replaced by each other.
- Terms such as "module", "unit", and "part" are terms naming components for performing at least one function or operation, and these components may be implemented as hardware or software or implemented by a combination of hardware and software.
- the plurality of “modules”, “units”, “parts”, etc. may be integrated as at least one module or chip to be implemented as at least one processor (not illustrated), except for the case in which each of the “modules”, “units”, “parts”, etc., need to be implemented as individual specific hardware.
- When any portion is connected to other portions, this includes a direct connection and an indirect connection through other media.
- FIG. 1 is an exemplified diagram illustrating the environment in which an electronic device according to an exemplary embodiment of the present disclosure performs speech recognition.
- an electronic device 100 for recognizing a speech performs speech recognition from a speech signal for an uttered speech of a user 2 .
- the electronic device 100 for recognizing a speech may be peripheral devices such as a robot 1 , TV 4 , and a cleaner 5 in the house or a terminal device 3 that may control each of the peripheral devices such as the robot 1 , the TV 4 , and the cleaner 5 .
- the electronic device 100 may receive the speech signal for the uttered speech of the user 2 through a plurality of microphones that are installed in the electronic device 100 or receive the speech signal for the uttered speech of the user 2 from the plurality of microphones that are installed in the house.
- the electronic device 100 may receive a sound generated from a sound source that includes the speech signal for the uttered speech of the user 2 and a noise signal for noise generated from the surrounding environment through the plurality of microphones.
- When receiving sounds generated from the sound source through the plurality of microphones, the electronic device 100 performs signal processing on each sound input through each of the microphones. Next, the electronic device 100 calculates power values of a plurality of signal-processed audio signals and determines a direction of the sound source based on the calculated power values. Next, the electronic device 100 performs the speech recognition by removing the noise signal from signal-processed audio data from the sound input through the determined direction of the sound source and detecting only the speech signal. Therefore, the electronic device 100 may reduce the problem of wrongly recognizing the noise signal as the speech signal.
- the microphones that are installed in the electronic device 100 (see FIG. 2 ) or installed at different locations in the house may include a plurality of microphone arrays having directionalities and may receive the sound generated from the sound source including the speech signal for the uttered speech of the user 2 in various directions through the plurality of microphone arrays.
- the microphone that is installed in the electronic device 100 or installed in the house may have a single configuration.
- FIG. 2A is a schematic block diagram of the electronic device 100 for recognizing a speech according to an exemplary embodiment of the present disclosure
- FIG. 2B is a detailed block diagram of the electronic device for recognizing a speech according to an exemplary embodiment of the present disclosure.
- the electronic device 100 is configured to include an input 110 , a memory 120 , and a processor 130 .
- the input 110 includes a plurality of microphones 111 and receives a sound generated from a sound source through a plurality of microphones 111 .
- the present disclosure is not limited thereto, and when a single microphone 111 is provided, the corresponding microphone 111 may receive the sound generated from the sound source in various directions through the plurality of microphone arrays.
- the sound source may include the speech signal for the uttered speech of the user and the noise signal for noise generated from the surrounding environment.
- the memory 120 stores a direction or direction information for the sound source.
- the processor 130 performs the signal processing on each sound input through the plurality of microphones 111 and calculates the power values of the plurality of signal-processed audio signals. Next, the processor 130 calculates the direction information on the sound source based on the calculated power values and stores the direction information on the calculated sound source in the memory 120 . Next, the processor 130 performs the speech recognition on a speech section included in the audio signal based on the direction information on the sound source.
- the processor 130 calculates a maximum power value and a minimum power value from each of the signal-processed audio signals if the signal-processed audio signals are input from each sound input through the plurality of microphones 111 .
- the processor 130 calculates a power ratio from the maximum power value and the minimum power value that are calculated from each of the audio signals.
- the processor 130 compares the power ratio calculated from each of the audio signals with a preset threshold value and calculates the direction information on the sound source from the sound corresponding to at least one audio signal having the power ratio that is equal to or more than the preset threshold value.
- the processor 130 stores the calculated direction information on the sound source and an index for at least one audio signal having the power ratio that is equal to or more than the preset threshold value in the memory 120 .
- the index information is identification information on the audio signal and according to the exemplary embodiment of the present disclosure, may be information on time when the audio signal is input.
- the processor 130 detects the speech sections from the audio signals each corresponding to starting and ending points for the uttered speech of the user among the plurality of audio signals based on the direction information on the sound source and the index that are stored in the memory 120 and performs the speech recognition on the detected speech sections.
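Finding starting and ending points from the stored indexes could be sketched as grouping nearby indexes into sections. The sketch treats each index as a frame number, which is an assumption (the disclosure says the index may be time information), and the `max_gap` parameter is hypothetical.

```python
def speech_sections(stored_indices, max_gap=1):
    """Group stored frame indexes into (start, end) speech sections.

    Indexes that are at most max_gap apart are placed in the same section.
    """
    sections = []
    for idx in sorted(stored_indices):
        if sections and idx - sections[-1][1] <= max_gap:
            sections[-1][1] = idx          # extend the current section
        else:
            sections.append([idx, idx])    # start a new section
    return [tuple(s) for s in sections]


print(speech_sections([3, 4, 5, 9, 10]))
```

Each resulting pair gives the frames corresponding to the starting and ending points of one candidate speech section.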
- When receiving the sounds generated from the sound source through the plurality of microphones 111, the processor 130 performs the signal processing on each sound input through the plurality of microphones 111 to produce the audio signals. Next, the processor 130 may take L samples of each of the signal-processed audio signals and then group the L sampled audio signals into a frame unit.
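Grouping L samples into a frame unit can be sketched as follows. Non-overlapping frames and dropping an incomplete trailing frame are assumptions of this example; the function name is hypothetical.

```python
def split_into_frames(samples, frame_length):
    """Group samples into consecutive, non-overlapping frames of frame_length samples."""
    n_frames = len(samples) // frame_length  # an incomplete last frame is dropped
    return [samples[i * frame_length:(i + 1) * frame_length] for i in range(n_frames)]


frames = split_into_frames(list(range(10)), 4)
print(frames)
```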
- the processor 130 calculates the maximum power value and the minimum power value from each of the plurality of audio signals and calculates the power ratio from the calculated maximum power value and minimum power value.
- the maximum power value and the minimum power value may be a signal strength value for the audio signal. Therefore, the processor 130 may calculate the power ratio from the maximum power value having the largest signal strength value and the minimum power value having the smallest signal strength value among the plurality of audio signals.
- the processor 130 stores, in the memory 120 , the direction information on the sound source that generated the sound corresponding to at least one audio signal whose power ratio, calculated from the maximum power value and the minimum power value, is equal to or more than the preset threshold value among the plurality of audio signals, together with the index for that audio signal.
- the processor 130 calculates N*(N−1)/2 power values from the plurality of audio signals using a generalized cross-correlation phase transform (GCC-PHAT) algorithm, where N is the number of microphones 111 . Next, the processor 130 may determine the largest value among the calculated N*(N−1)/2 power values as the maximum power value.
- for example, when the number of microphones 111 is two, the processor 130 may calculate one power value from the plurality of audio signals. In this case, the processor 130 may determine the calculated power value as the maximum power value. Meanwhile, when the number of microphones 111 is three, the processor 130 may calculate three power values from the plurality of audio signals and determine the largest value among the three power values as the maximum power value.
- the processor 130 may calculate the N*(N−1)/2 power values and delay values for each of the plurality of audio signals from the plurality of audio signals using the cross-correlation function of the following <Equation 1>.
- the delay values for each of the plurality of audio signals may be the information on the time when the audio signals are differently input to each of the plurality of microphones 111 depending on a distance between the plurality of microphones 111 .
- i and j are the indexes for the audio signals input from the plurality of microphones 111 and X_i(k) is a discrete Fourier transform (DFT) signal for an i-th audio signal input from a first microphone among the plurality of microphones 111 . Further, X_j(k) is a DFT signal for a j-th audio signal input from a second microphone among the plurality of microphones 111 . Further, ( )* represents a complex conjugate and k represents an index for a discrete frequency.
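For reference, the textbook form of the GCC-PHAT cross-correlation can be written with the symbols i, j, X_i(k), X_j(k), ( )* and k as follows. This is the standard form from the signal-processing literature and is an assumption here, not a quotation of <Equation 1>. The lag variable τ, the DFT length L (taken to be the frame length), and the upright j for the imaginary unit (distinct from the microphone index j) are notation introduced only for this illustration.

```latex
R_{ij}(\tau) \;=\; \frac{1}{L} \sum_{k=0}^{L-1}
  \frac{X_i(k)\, X_j^{*}(k)}{\left| X_i(k)\, X_j^{*}(k) \right|}\;
  e^{\,\mathrm{j}\, 2\pi k \tau / L},
\qquad
\hat{\tau}_{ij} \;=\; \arg\max_{\tau} R_{ij}(\tau)
```

Under this reading, the peak value of R_ij(τ) would serve as the power value for the microphone pair (i, j), and the lag at which the peak occurs would serve as the delay value.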
- in addition to the cross-correlation function of the above <Equation 1>, a form modified from the above <Equation 1> may be used, such as one of various whitening methods for increasing resolving power, a method for allocating different weightings to each frequency, or a regularization method for preventing divergence.
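The pairwise power and delay calculation described above can be sketched in Python with NumPy. This is a minimal illustration that assumes the standard GCC-PHAT formulation; the function names `gcc_phat`, `pairwise_peaks`, and `max_power_and_delay`, the small `eps` regularizer, and the sign convention of the delay are all hypothetical choices, not details taken from the disclosure.

```python
import itertools

import numpy as np


def gcc_phat(x_i, x_j, eps=1e-12):
    """Return (peak_value, delay_in_samples) for one microphone pair.

    A negative delay means x_j lags x_i (assumed sign convention).
    """
    n = len(x_i) + len(x_j)              # zero-pad to avoid circular wrap-around
    X_i = np.fft.rfft(x_i, n=n)
    X_j = np.fft.rfft(x_j, n=n)
    cross = X_i * np.conj(X_j)
    cross /= np.abs(cross) + eps         # PHAT weighting: keep the phase only
    r = np.fft.irfft(cross, n=n)
    max_lag = n // 2
    # Reorder so that lags run from -max_lag to +max_lag.
    r = np.concatenate((r[-max_lag:], r[:max_lag + 1]))
    peak = int(np.argmax(r))
    return float(r[peak]), peak - max_lag


def pairwise_peaks(signals):
    """Compute GCC-PHAT peaks for all N*(N-1)/2 microphone pairs."""
    return {
        (i, j): gcc_phat(signals[i], signals[j])
        for i, j in itertools.combinations(range(len(signals)), 2)
    }


def max_power_and_delay(signals):
    """Pick the pair with the largest peak as the maximum power value."""
    peaks = pairwise_peaks(signals)
    pair = max(peaks, key=lambda p: peaks[p][0])
    power, delay = peaks[pair]
    return power, delay, pair
```

With two microphones the dictionary holds a single entry, and with three microphones it holds three, matching the N*(N−1)/2 count.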
- the processor 130 may calculate the minimum power value from the plurality of audio signals using a minima-controlled recursive averaging (MCRA) algorithm.
- the processor 130 may calculate the power ratio from the maximum power value having the largest value among the power values calculated using the cross-correlation function of the above <Equation 1> and the minimum power value calculated using the MCRA algorithm.
- the processor 130 determines whether the minimum power value calculated from the previous audio signal is pre-stored in the memory 120 , prior to calculating the power ratio from the maximum power value and the minimum power value. As the determination result, if the minimum power value is not pre-stored in the memory 120 , the processor 130 may calculate the power ratio from the maximum power value having the largest value among the power values calculated using the cross-correlation function of the above <Equation 1> and the minimum power value calculated using the MCRA algorithm.
- if the minimum power value is pre-stored in the memory 120 , the processor 130 compares the minimum power value calculated from the plurality of audio signals currently input with the pre-stored minimum power value and selects the smaller of the two. In detail, if the pre-stored minimum power value is smaller than the minimum power value currently calculated, the processor 130 calculates the power ratio from the pre-stored minimum power value and the maximum power value calculated from the plurality of audio signals currently input.
- if the pre-stored minimum power value is larger than the minimum power value currently calculated, the processor 130 updates the minimum power value pre-stored in the memory 120 to the minimum power value calculated from the plurality of audio signals currently input.
- in this case, the processor 130 may calculate the power ratio from the maximum power value and the minimum power value that are calculated from the plurality of audio signals currently input.
- the processor 130 performs the update of the minimum power value only before the sound corresponding to the pre-stored K-th audio signal is input. That is, if the sound corresponding to the pre-stored K-th audio signal is input, the processor 130 may reset the minimum power value calculated from the K-th audio signal to an initial value and store the initial value in the memory 120 .
- the processor 130 calculates the power ratio from the maximum power value and the minimum power value calculated from the (K+1)-th audio signal. Further, the processor 130 compares the minimum power value of the (K+1)-th audio signal with the minimum power value of the K-th audio signal reset to the initial value. As the comparison result, if the minimum power value of the (K+1)-th audio signal is the smaller of the two, the processor 130 updates the minimum power value pre-stored in the memory 120 to the minimum power value of the (K+1)-th audio signal, and if it is the larger of the two, the processor 130 keeps the minimum power value pre-stored in the memory 120 .
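The minimum-power bookkeeping described above can be sketched as a small Python class. This is an illustration under stated assumptions: the class name `PowerRatioTracker`, the parameter `reset_period_k` (standing in for K), the use of infinity as the initial value, and the `floor` guard against division by zero are all hypothetical, and the per-frame minimum is taken as a given input rather than computed by an MCRA algorithm.

```python
import math


class PowerRatioTracker:
    """Tracks the stored minimum power and returns max/min power ratios."""

    def __init__(self, reset_period_k, initial_min=math.inf, floor=1e-12):
        self.reset_period_k = reset_period_k  # assumed meaning of K
        self.initial_min = initial_min        # assumed initial value
        self.floor = floor                    # avoids division by zero
        self.stored_min = None                # nothing pre-stored yet
        self.frames_seen = 0

    def update(self, max_power, current_min):
        """Process one frame and return its power ratio."""
        # Keep the smaller of the pre-stored and current minimum values;
        # with nothing stored, fall back to the current minimum.
        if self.stored_min is None or current_min < self.stored_min:
            self.stored_min = current_min
        ratio = max_power / max(self.stored_min, self.floor)

        # After the K-th frame, reset the stored minimum to the initial value.
        self.frames_seen += 1
        if self.frames_seen % self.reset_period_k == 0:
            self.stored_min = self.initial_min
        return ratio
```

A frame would then be treated as a speech candidate when the returned ratio is equal to or more than the preset threshold value.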
- the processor 130 compares each power ratio with the preset threshold value and stores, in the memory 120 , the direction information on the sound source that generated the sound corresponding to at least one audio signal having the power ratio that is equal to or more than the preset threshold value, together with the index for that audio signal. Once the direction information on the sound source and the index are stored in the memory 120 , the processor 130 may determine the starting and ending points of the speech section included in the audio signal based on the direction information on the sound source stored in the memory 120 .
- the processor 130 may determine the audio signals corresponding to at least two pieces of direction information as the audio signals of the starting and ending points if the at least two pieces of direction information on the plurality of sound sources fall within the preset error range or if the difference between the at least two pieces of direction information is less than the preset threshold value.
- the direction information is the angle information between the sound direction of the sound sources from which the sounds corresponding to the plurality of audio signals are generated and the plurality of microphones 111 . Therefore, the processor 130 may calculate, from the delay value calculated by the above <Equation 1>, the angle information that is the direction information on those sound sources. The memory 120 may store the angle information on the audio signals for which a power ratio equal to or more than the preset threshold value is calculated, together with the index for the corresponding audio signal.
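One common way to turn a delay value into angle information is the far-field relation sin(θ) = c·τ / d for a pair of microphones spaced d apart. The sketch below assumes that model; the disclosure does not state which conversion is used, and the function name, the 343 m/s speed of sound, and the parameters `sample_rate_hz` and `mic_spacing_m` are illustrative assumptions.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at room temperature


def delay_to_angle_deg(delay_samples, sample_rate_hz, mic_spacing_m):
    """Convert an inter-microphone delay to an arrival angle in degrees.

    Assumes a far-field source and a two-microphone pair; 0 degrees is
    broadside (directly in front of the pair).
    """
    delay_s = delay_samples / sample_rate_hz
    sin_theta = SPEED_OF_SOUND_M_S * delay_s / mic_spacing_m
    # Clamp to the valid range in case noise yields an impossible delay.
    sin_theta = max(-1.0, min(1.0, sin_theta))
    return math.degrees(math.asin(sin_theta))
```

The angle computed this way could be stored together with the frame index whenever the power ratio is equal to or more than the preset threshold value.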
- the processor 130 may determine whether the angle information on each of the plurality of audio signals pre-stored in the memory 120 belongs to the preset error range to acquire the angle information included in the preset error range. If at least two pieces of angle information included in the preset error range are acquired, the processor 130 determines the audio signals corresponding to the acquired angle information as speech signals of a static sound source.
- the processor 130 compares a difference value in the angle information on each of the first and second audio signals with the preset threshold value. As the comparison result, if the difference value in the angle information on each of the first and second audio signals is less than the preset threshold value, the processor 130 determines the first and second audio signals as a speech signal of a dynamic sound source.
- the processor 130 may determine each of at least two audio signals determined as the speech signal as the audio signals of the starting and ending points.
- the processor 130 may detect the speech sections based on the indexes for the audio signals determined as the starting and ending points. If the speech sections are detected, the processor 130 performs the signal processing on the audio signals included in the speech sections based on the direction information on the sound sources for the audio signals determined as the starting and ending points.
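The static-source case of the end point determination above can be sketched as follows. This is a simplified illustration: the function name `detect_speech_section`, the record layout of (index, angle) pairs, and the choice of taking the largest group of mutually consistent angles are assumptions, and the dynamic-source case is not covered.

```python
def detect_speech_section(records, error_range_deg):
    """Find starting and ending indexes from stored (index, angle) records.

    records: list of (frame_index, angle_deg) for frames whose power ratio
    was equal to or more than the threshold.
    Returns (start_index, end_index, mean_angle_deg), or None if fewer than
    two angles fall within the error range of one another.
    """
    best_group = []
    for _, ref_angle in records:
        group = [
            (idx, ang)
            for idx, ang in records
            if abs(ang - ref_angle) <= error_range_deg
        ]
        if len(group) > len(best_group):
            best_group = group

    if len(best_group) < 2:
        return None

    indexes = [idx for idx, _ in best_group]
    mean_angle = sum(ang for _, ang in best_group) / len(best_group)
    return min(indexes), max(indexes), mean_angle
```

The returned indexes delimit the speech section, and the mean angle gives the direction to use when processing the audio signals inside it.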
- the processor 130 may perform signal processing to amplify the audio signal from the sound input from the corresponding direction, based on the direction information on the sound sources for the audio signals determined as the starting and ending points among the audio signals included in the speech sections, and to attenuate the audio signals in the remaining directions.
- the processor 130 may perform signal processing to amplify the audio signal in the direction corresponding to the direction information on the sound sources for the audio signals determined as the starting and ending points from the audio signal in the previously detected speech section and attenuate the audio signals in the remaining directions by at least one of a beam-forming scheme including at least one of linearly constrained minimum variance (LCMV) and minimum variance distortion-less response (MVDR), a geometric source separation (GSS) scheme, and a blind source extraction (BSE) scheme.
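As an illustration of one of the listed schemes, the textbook MVDR weights for a single frequency bin are w = R⁻¹d / (dᴴR⁻¹d), where R is the noise covariance matrix and d is the steering vector toward the speech direction. The sketch below assumes a uniform linear array and a far-field source; the function names and parameters are hypothetical and are not taken from the disclosure.

```python
import numpy as np


def steering_vector(num_mics, mic_spacing_m, angle_deg, freq_hz, c=343.0):
    """Far-field steering vector for a uniform linear array (assumed geometry)."""
    delays = np.arange(num_mics) * mic_spacing_m * np.sin(np.deg2rad(angle_deg)) / c
    return np.exp(-2j * np.pi * freq_hz * delays)


def mvdr_weights(noise_cov, steering):
    """Textbook MVDR weights: w = R^-1 d / (d^H R^-1 d)."""
    r_inv_d = np.linalg.solve(noise_cov, steering)
    return r_inv_d / (steering.conj() @ r_inv_d)


def apply_weights(weights, mic_spectra):
    """Beamformer output y = w^H x for one frequency bin."""
    return np.vdot(weights, mic_spectra)
```

By construction the response toward the steered direction is exactly one, so the speech direction passes undistorted while power arriving from other directions is minimized.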
- the processor 130 performs the speech recognition from the audio signal in the signal-processed speech section and transforms it into a text.
- the processor 130 may perform the speech recognition from the audio signal in the signal-processed speech section using a speech to text (STT) algorithm and transform it into a text form.
- the foregoing input 110 may include a plurality of microphones 111 , a manipulator 113 , a touch input 115 , and a user input 117 .
- the plurality of microphones 111 output the uttered speech of the user or audio signals generated from other living environments to the processor 130 .
- the manipulator 113 may be implemented as a key pad including various function keys, a numeric key, a special key, a character key, or the like and when a display 191 to be described below is implemented in a touch screen form, the touch input 115 may be implemented as a touch pad having a mutual layer structure with the display 191 . In this case, the touch input 115 may receive a touch command for an icon displayed through the display 191 .
- the user input 117 may receive an IR signal or an RF signal from at least one peripheral device (not illustrated). Therefore, the foregoing processor 130 may control an operation of the electronic device 100 based on the IR signal or the RF signal that is input through the user input 117 .
- the IR or RF signal may be a control signal or a speech signal for controlling the operation of the electronic device 100 .
- the electronic device 100 may further include various components besides the input 110 , the memory 120 , and the processor 130 that are described above.
- when the electronic device 100 is implemented as a display device such as a smart phone or a smart TV, as illustrated in FIGS. 2A and 2B , the electronic device 100 may further include a communicator 140 , a speech processor 150 , a photographer 160 , a sensor 170 , a signal processor 180 , and an output 190 .
- the communicator 140 performs data communication with at least one peripheral device (not illustrated).
- the communicator 140 may transmit the speech signal for the uttered speech of the user to a speech recognition server (not illustrated) and receive a speech recognition result in a text form that is recognized from the speech recognition server (not illustrated).
- the communicator 140 may perform data communication with a web server (not illustrated) to receive content corresponding to a user command or a content related search result.
- the communicator 140 may include a short range communication module 141 , a wireless communication module 143 such as a wireless LAN module, and a connector 145 including at least one wired communication module such as a high-definition multimedia interface (HDMI), a universal serial bus (USB), or Institute of Electrical and Electronics Engineers (IEEE) 1394.
- the short range communication module 141 is configured to wirelessly perform short range communication between the portable terminal device (not illustrated) and the electronic device 100 .
- the short range communication module 141 may include at least one of a Bluetooth module, an infrared data association (IrDA) module, a near field communication (NFC) module, a WIFI module, and a Zigbee module.
- the wireless communication module 143 is a module that is connected to an external network according to a wireless communication protocol such as IEEE to perform communications.
- the wireless communication module may further include a mobile communication module which is connected to a mobile communication network according to various mobile communication standards such as 3rd generation (3G), 3rd generation partnership project (3GPP), and long term evolution (LTE) to perform communications.
- the communicator 140 may be implemented by the above-mentioned various short range communication schemes and may adopt other communication technologies not mentioned in the present specification as needed.
- the connector 145 is configured to provide an interface with various source devices such as USB 2.0, USB 3.0, HDMI, and IEEE 1394.
- the connector 145 may receive content data transmitted from an external server (not illustrated) through a wired cable connected to the connector 145 according to the control command of the processor 130 to be described below or may transmit pre-stored content data to an external recording medium. Further, the connector 145 may receive power from a power source through the wired cable physically connected to the connector 145 .
- the speech processor 150 is configured to perform the speech recognition on the uttered speech section of the user among the audio signals input through the plurality of microphones 111 .
- when the speech section is detected from the input audio signals, the speech processor 150 performs, according to the control command of the processor 130 , a pre-processing process of amplifying the plurality of audio signals included in the detected speech section and attenuating the remaining audio signals that are noise signals.
- the speech processor 150 may perform the speech recognition on the uttered speech of the user using a speech recognition algorithm like the STT algorithm for the speech sections in which the audio signals are amplified.
- the photographer 160 is configured to capture still images or moving images according to the user command and may be implemented as a plurality of cameras, such as a front camera and a rear camera.
- the sensor 170 senses various operation states of the electronic device 100 and a user interaction.
- the sensor 170 may sense a gripped state in which the user grips the electronic device 100 .
- the electronic device 100 may be rotated or inclined in various directions.
- the sensor 170 may use at least one of various sensors such as a geomagnetic sensor, a gyro sensor, and an acceleration sensor to sense a gradient, etc., of the electronic device 100 that is gripped by the user based on a rotational motion or a gravity direction.
- the signal processor 180 may be configured to process the content received through the communicator 140 and image data and audio data of the content stored in the memory 120 .
- the signal processor 180 may perform various image processing, such as decoding, scaling, noise filtering, frame rate conversion, and resolution conversion, on the image data included in the content.
- the signal processor 180 may perform various audio signal processing, such as decoding, amplification, and noise filtering, on the audio data included in the content.
- the output 190 outputs the signal-processed content through the signal processor 180 .
- the output 190 may output the content through at least one of the display 191 and an audio output 192 . That is, the display 191 may display the image data that are image processed by the signal processor 180 and the audio output 192 may output the audio data, which have undergone audio signal processing, in an audible sound form.
- the display 191 that displays the image data may be implemented as a liquid crystal display (LCD), an organic light emitting display (OLED), a plasma display panel (PDP), or the like.
- the display 191 may be implemented in a touch screen form in which it forms a mutual layer structure with the touch input 115 .
- the foregoing processor 130 may include a CPU 131 , a ROM 132 , a RAM 133 , and a GPU 135 that may be connected to one another via a bus 137 .
- the CPU 131 accesses the memory 120 to perform booting using an O/S stored in the memory 120 . Further, the CPU 131 performs various operations using various programs, content, data, and the like that are stored in the memory 120 .
- a set of commands for system booting, and the like are stored in the ROM 132 .
- the CPU 131 copies the O/S stored in the memory 120 to the RAM 133 according to the command stored in the ROM 132 and executes the O/S to boot the system. If the booting is completed, the CPU 131 copies the various programs stored in the memory 120 to the RAM 133 and executes the programs copied to the RAM 133 to execute various operations.
- the GPU 135 generates a display screen including various objects like an icon, an image, a text, or the like.
- the GPU 135 calculates attribute values, such as coordinate values at which the respective objects will be displayed, shapes, sizes, and colors of the objects, based on a layout of the screen according to the received control command and generates display screens having various layouts including the objects based on the calculated attribute values.
- the processor 130 may be implemented as a system-on-a-chip or system on chip (SoC) by being combined with various components such as the input 110 , the communicator 140 , and the sensor 170 that are described above.
- the operation of the processor 130 may be executed by programs that are stored in the memory 120 .
- the memory 120 may be implemented as at least one of the ROM 132 , the RAM 133 , a memory card (for example, an SD card or a memory stick) that may be detached from and attached to the electronic device 100 , a non-volatile memory, a volatile memory, a hard disk drive (HDD), and a solid state drive (SSD).
- the processor 130 that detects the speech sections from the plurality of audio signals may detect the speech sections from the plurality of audio signals using the program module stored in the memory 120 as illustrated in FIG. 3 .
- FIG. 3 is a block diagram illustrating a configuration of performing speech recognition in a processor according to an exemplary embodiment of the present disclosure.
- the processor 130 may include a sound source direction detection module 121 , a sound source direction recorder 122 , an end point detection module 123 , a speech signal processing module 124 , and a speech recognition module 125 .
- the sound source direction detection module 121 may calculate a maximum power value and a minimum power value from each of the plurality of audio signals and acquire the direction information on the sound sources from which the sounds corresponding to each of the plurality of audio signals are generated and indexes for the plurality of audio signals based on the calculated maximum power value and minimum power value.
- FIG. 4 is a detailed block diagram of the sound source direction detection module according to an exemplary embodiment of the present disclosure.
- the sound source direction detection module 121 includes a sound source direction calculation module 121 - 1 and a speech section detection module 121 - 2 .
- the sound source direction calculation module 121 - 1 calculates N*(N−1)/2 power values and delay values for each of the plurality of audio signals from the audio signals input through the plurality of microphones 111 - 1 and 111 - 2 based on a cross-correlation function.
- the speech section detection module 121 - 2 acquires a maximum power value among the calculated power values and the delay value corresponding to the maximum power value from the sound source direction calculation module 121 - 1 .
- the speech section detection module 121 - 2 calculates the minimum power value from the plurality of audio signals using an MCRA algorithm.
- the maximum power value and the minimum power value may be signal strength values for the audio signals.
- the speech section detection module 121 - 2 compares the calculated minimum power value with the pre-stored minimum power value to select a minimum power value having a smaller size and calculates a power ratio from the selected minimum power value and the maximum power value calculated from the plurality of audio signals. Next, the speech section detection module 121 - 2 compares the power ratio calculated from the maximum power value and the minimum power value with the preset threshold value to detect audio signals having the power ratio that is equal to or more than the preset threshold value and outputs the direction information on the sound source for the audio signal and the index for the audio signal from the detected audio signals.
- the sound source direction recorder 122 may record the direction information on the sound source for the audio signal and the index for the audio signal that are output through the speech section detection module 121 - 2 in the memory 120 .
- the end point detection module 123 may determine the starting and ending points of the speech section included in the audio signal based on the direction information on the sound source recorded in the memory 120 .
- the direction information on the sound source recorded in the memory 120 may be the angle information between a sound direction of the sound sources from which the sounds corresponding to each of the plurality of audio signals are generated and the plurality of microphones 111 - 1 and 111 - 2 .
- the end point detection module 123 determines whether the angle information on each of the plurality of audio signals pre-stored in the memory 120 is included in the preset error range and, if at least two pieces of angle information included in the preset error range are acquired, determines the audio signals corresponding to the acquired angle information as speech signals from a static sound source.
- the end point detection module 123 may determine the first and second audio signals as speech signals from a dynamic sound source depending on whether a difference value of the angle information of each of the first and second audio signals is less than the preset threshold value.
- the end point detection module 123 may determine each of at least two audio signals determined as the speech signal as the audio signals of the starting and ending points.
- the speech signal processing module 124 detects the speech sections based on the indexes for the audio signals determined as the starting and ending points. Next, the speech signal processing module 124 performs the signal processing to amplify the audio signal in the direction corresponding to the direction information on the sound sources for the audio signals determined as the starting and ending points and attenuate the audio signals in the remaining directions. Therefore, the speech recognition module 125 may perform the speech recognition from the audio signal in the speech section that is signal-processed by the speech signal processing module 124 to transform the speech signal for the uttered speech of the user into the text.
- the electronic device 100 may detect the section having the power ratio equal to or more than the preset threshold value as the speech section based on the power ratio calculated from the plurality of audio signals, thereby accurately detecting the speech section for the uttered speech of the user even in an environment in which a lot of noise is present. Further, the electronic device 100 according to the exemplary embodiment of the present disclosure performs the speech recognition only in the detected speech section, thereby reducing the computation required to perform the speech recognition compared with before.
- FIGS. 5A to 5C are exemplified diagrams illustrating a speech section detection from an input audio signal in the electronic device according to an exemplary embodiment of the present disclosure.
- the sounds including the speech signal may be received through the plurality of microphones 111 .
- sections A to F 410 to 460 may be the speech sections including the speech signal and the remaining sections may be the noise sections including the noise signal.
- the electronic device 100 performs the signal processing on each of the input sounds. Next, the electronic device 100 calculates the maximum power value and the minimum power value from each of the plurality of audio signals that are signal-processed, and calculates the power ratio from the calculated maximum power value and minimum power value.
- a power ratio of sections A′ to F′ 411 to 461 corresponding to the sections A to F 410 to 460 may be equal to or more than a preset threshold value 470 . Therefore, the electronic device 100 may detect the sections A′ to F′ 411 to 461 having the power ratio equal to or more than the preset threshold value 470 as the speech sections.
- angles of each audio signal of sections A′′ to F′′ 413 to 463 corresponding to the sections A′ to F′ 411 to 461 as the speech sections are present within the preset error range and angles of other sections may be present outside the preset error range.
- the electronic device 100 may amplify only the audio signals in the directions corresponding to the angles present within the error range among the audio signals in the speech sections that are the sections A′ to F′ 411 to 461 having the power ratio equal to or more than the preset threshold value 470 .
- FIG. 6 is an exemplified diagram illustrating a result of tracking a sound source direction from the input audio signal in the electronic device according to an exemplary embodiment of the present disclosure.
- the speech section may be detected from the audio signals input through the plurality of microphones 111 .
- the electronic device 100 may perform the signal processing to amplify an audio signal in a specific direction among the audio signals in the speech section detected from the audio signals and attenuate audio signals in the remaining directions.
- the electronic device 100 amplifies the audio signal in the direction corresponding to the corresponding angle information among the audio signals in the previously detected speech section based on the angle information on the sound sources for at least two audio signals determined as the starting and ending points among the plurality of audio signals having the power ratio equal to or more than the preset threshold value. Further, the electronic device 100 attenuates the audio signals in the remaining directions, other than the audio signal in the direction corresponding to the corresponding angle information, among the audio signals in the previously detected speech section.
- the electronic device 100 may amplify audio signals in speech processing sections 510 to 560 corresponding to the sections A to F 410 to 460 detected as the speech section and attenuate audio signals in the remaining sections.
- the electronic device 100 may provide various internet of things services based on the foregoing exemplary embodiments.
- FIG. 7 is an exemplified diagram of internet of things services provided from the electronic device according to the exemplary embodiment of the present disclosure.
- the electronic device 100 may perform the speech recognition from the speech signal for the uttered speech of the user and control home appliances such as first and second TVs 10 and 10 ′, an air conditioner 20 , a refrigerator 30 , and a washing machine 40 in the house based on the recognized speech command.
- the user may utter a speech command ‘turn on the TV!’ in his/her own room. If the speech command of the user is uttered, the electronic device 100 receives sounds generated from sound sources including speech signals corresponding to the speech commands of the user through the plurality of microphones and performs signal processing on each of the input sounds.
- the electronic device 100 determines the directions in which the speech commands of the user are uttered based on the series of operations described above.
- the electronic device 100 identifies home appliances associated with the directions in which the speech commands of the user are uttered based on the pre-stored direction information on each home appliance.
- the electronic device 100 may store the identification information corresponding to the first and second TVs 10 and 10 ′, the air conditioner 20 , the refrigerator 30 , and the washing machine 40 , respectively and the direction information on each home appliance. Therefore, the electronic device 100 may compare the direction in which the speech command of the user is uttered with the pre-stored direction information on each home appliance to detect the direction in which the speech command of the user is uttered and the home appliances present within the preset range.
- the first TV 10 is located in a living room and the second TV 10 ′ may be located in a room in which the user is currently located. Further, the home appliances present in the direction in which the speech command of the user is uttered and within the preset range may be the second TV 10 ′. In this case, the electronic device 100 may transmit a power on control signal to the second TV 10 ′ in the room in which the user is currently located among the first and second TVs 10 and 10 ′ based on the speech command of the user.
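The appliance selection described above can be sketched as a lookup against pre-stored direction information. This is an illustration only: the record fields (`id`, `type`, `angle_deg`), the function names, and the rule of choosing the nearest matching appliance within the preset range are assumptions and are not specified in the disclosure.

```python
def angular_difference(a_deg, b_deg):
    """Smallest absolute difference between two angles, in degrees."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def select_appliance(command_angle_deg, target_type, appliances, preset_range_deg):
    """Return the id of the matching appliance nearest the utterance direction.

    appliances: list of dicts with hypothetical keys "id", "type", "angle_deg".
    Returns None when no appliance of the requested type lies within range.
    """
    matches = [
        a
        for a in appliances
        if a["type"] == target_type
        and angular_difference(a["angle_deg"], command_angle_deg) <= preset_range_deg
    ]
    if not matches:
        return None
    nearest = min(
        matches, key=lambda a: angular_difference(a["angle_deg"], command_angle_deg)
    )
    return nearest["id"]
```

For a 'turn on the TV!' command uttered from a direction close to the second TV, the lookup would return the second TV rather than the one in the living room.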
- the second TV 10 ′ may perform a power on operation based on the power on control signal received from the electronic device 100 , so that the user may watch a broadcast through the second TV 10 ′ present in the room in which the user is currently located.
- FIG. 8 is a flow chart of a method for performing speech recognition by an electronic device according to an exemplary embodiment of the present disclosure.
- the electronic device 100 performs the signal processing on each of the input sounds to generate the plurality of signal-processed audio signals (S 710 ).
- the electronic device 100 may sample each of the signal-processed audio signals into L samples and then generate the L sampled audio signals in a frame unit. When the plurality of audio signals are generated, the electronic device 100 calculates the power values from each of the plurality of audio signals (S 720 ).
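- The framing step can be sketched as below. The frame length corresponds to the L samples mentioned above; the helper names are hypothetical, and the mean-square power shown here is only a stand-in, since the disclosure derives its power values from a cross-correlation function.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], frame_len: int) -> List[List[float]]:
    """Group samples into consecutive frames of frame_len samples.
    A trailing partial frame is dropped."""
    n_frames = len(samples) // frame_len
    return [list(samples[i * frame_len:(i + 1) * frame_len]) for i in range(n_frames)]


def frame_power(frame: Sequence[float]) -> float:
    """Mean-square power of one frame (illustrative stand-in only)."""
    return sum(x * x for x in frame) / len(frame)


frames = split_into_frames([1.0, -1.0, 2.0, -2.0, 0.5], frame_len=2)
powers = [frame_power(f) for f in frames]
```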
- the electronic device 100 stores the direction information on the sound source from which the sound corresponding to at least one of the plurality of audio signals is generated and the index for at least one audio signal based on the power values calculated from the plurality of audio signals (S 730 ).
- the electronic device 100 determines the starting and ending points of the speech sections included in all the audio signals based on the direction information on the pre-stored sound source (S 740 ).
- the electronic device 100 may determine each of the audio signals corresponding to at least two pieces of direction information as the audio signals of the starting and ending points if the at least two pieces of direction information among the plurality of direction information are included in the preset error range or the error between the at least two pieces of direction information is less than the preset threshold value.
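- One possible reading of this rule is sketched below: stored entries whose directions stay close to each other form a run, and the first and last indexes of the run serve as the starting and ending points. The function name, the 10-degree tolerance, and the choice of the longest run are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def find_speech_section(
    stored: List[Tuple[int, float]],
    max_angle_diff_deg: float = 10.0,  # hypothetical "preset error range"
) -> Optional[Tuple[int, int]]:
    """stored holds (audio signal index, angle in degrees), sorted by index.
    Returns (start index, end index) of the longest run whose angles stay
    within max_angle_diff_deg of the run's first angle, or None when no run
    contains at least two entries."""
    best = None
    run_start = 0
    for i in range(1, len(stored) + 1):
        run_ended = (
            i == len(stored)
            or abs(stored[i][1] - stored[run_start][1]) > max_angle_diff_deg
        )
        if run_ended:
            run_len = i - run_start
            if run_len >= 2 and (best is None or run_len > best[2]):
                best = (stored[run_start][0], stored[i - 1][0], run_len)
            run_start = i
    return None if best is None else (best[0], best[1])


section = find_speech_section([(3, 42.0), (4, 44.0), (5, 41.0), (9, 120.0)])
```

- In this example the entries at indexes 3 to 5 point in nearly the same direction, so they bound the speech section, while the isolated entry at index 9 is ignored.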
- the electronic device 100 detects the speech sections from all the audio signals based on the indexes for the audio signals corresponding to the starting and ending points and performs the speech recognition on the detected speech section (S 750 ).
- the electronic device 100 may detect the speech sections including the speech signals among all the audio signals based on the indexes for the audio signals corresponding to the starting and ending points. Next, the electronic device 100 performs the preprocessing process of amplifying the plurality of audio signals included in the speech sections and attenuating the remaining audio signals that are noise signals.
- the electronic device 100 may perform signal processing to amplify the audio signal in the direction corresponding to the direction information on the sound sources for the audio signals determined as the starting and ending points from the audio signal in the previously detected speech section and attenuate the audio signals in the remaining directions by at least one of a beam-forming scheme including at least one of linearly constrained minimum variance (LCMV) and minimum variance distortion-less response (MVDR), a geometric source separation (GSS) scheme, and a blind source extraction (BSE) scheme.
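- For reference, the MVDR beamformer named above has a standard closed form. The symbols below are notation introduced here for illustration and do not appear in the disclosure: R is the spatial covariance matrix of the microphone signals, a(θ) is the steering vector toward the stored direction θ, and w is the vector of beamformer weights.

```latex
\mathbf{w}_{\mathrm{MVDR}}
  = \frac{\mathbf{R}^{-1}\,\mathbf{a}(\theta)}
         {\mathbf{a}(\theta)^{H}\,\mathbf{R}^{-1}\,\mathbf{a}(\theta)},
\qquad
\text{which minimizes } \mathbf{w}^{H}\mathbf{R}\,\mathbf{w}
\text{ subject to } \mathbf{w}^{H}\mathbf{a}(\theta) = 1 .
```

- The constraint keeps the signal arriving from direction θ undistorted, while minimizing the output power attenuates sound arriving from other directions.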
- the electronic device 100 may perform the speech recognition on the uttered speech of the user using a speech recognition algorithm like the STT algorithm for the speech sections in which the audio signals are amplified.
- FIG. 9 is a first flow chart of a method for storing direction information on a sound source for at least one audio signal determined as a speech section and an index for at least one audio signal, by the electronic device according to an exemplary embodiment of the present disclosure.
- the electronic device 100 calculates the maximum power value and the minimum power value from each of the plurality of audio signals (S 810 ). Next, the electronic device 100 calculates the power ratio from the calculated maximum power value and minimum power value (S 820 ). Next, the electronic device 100 determines at least one audio signal of which the calculated power ratio is equal to or more than the preset threshold value among the plurality of audio signals and stores the direction information on the sound source for at least one audio signal determined and the index for at least one audio signal (S 830 and S 840 ).
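- Steps S 810 to S 840 can be sketched as a simple thresholding loop. The function name, the argument layout, and the example threshold of 4.0 are hypothetical; the maximum and minimum power values are assumed to have been computed already.

```python
from typing import List, Sequence, Tuple


def select_speech_frames(
    max_powers: Sequence[float],
    min_powers: Sequence[float],
    directions_deg: Sequence[float],
    threshold: float,
) -> List[Tuple[int, float]]:
    """Return (index, direction) for every audio signal whose power ratio
    (maximum power / minimum power) is equal to or more than the threshold."""
    stored = []
    for index, (p_max, p_min, angle) in enumerate(
        zip(max_powers, min_powers, directions_deg)
    ):
        if p_min <= 0.0:
            continue  # guard against division by zero; an added assumption
        if p_max / p_min >= threshold:
            stored.append((index, angle))
    return stored


stored = select_speech_frames(
    max_powers=[1.0, 8.0, 9.0, 1.2],
    min_powers=[1.0, 1.0, 1.0, 1.0],
    directions_deg=[10.0, 45.0, 47.0, 200.0],
    threshold=4.0,
)
```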
- the electronic device 100 calculates N*(N−1)/2 power values from the plurality of audio signals using the generalized cross-correlation phase transform (GCC-PHAT) algorithm. Next, the electronic device 100 may determine the largest value among the calculated N*(N−1)/2 power values as the maximum power value.
- the electronic device 100 may calculate the N*(N−1)/2 power values from the plurality of audio signals and delay values for each of the plurality of audio signals using the cross-correlation function like the above <Equation 1>.
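- A textbook GCC-PHAT computation over all microphone pairs is sketched below in plain Python. Equation 1 is not reproduced in this passage, so treating the peak of the PHAT-weighted cross-correlation as the pair's power value is an assumption, as are the function names. A naive DFT is used to keep the sketch dependency-free; a real implementation would use an FFT.

```python
import cmath
from itertools import combinations


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat(sig_a, sig_b, max_lag):
    """Return (peak value, delay in samples) of the PHAT-weighted
    cross-correlation. A positive delay means sig_a lags sig_b."""
    n = len(sig_a) + len(sig_b)  # zero-pad to avoid circular wrap-around
    fa = dft(list(sig_a) + [0.0] * (n - len(sig_a)))
    fb = dft(list(sig_b) + [0.0] * (n - len(sig_b)))
    cross = [a * b.conjugate() for a, b in zip(fa, fb)]
    # Phase transform: keep only the phase of each frequency bin.
    whitened = [c / abs(c) if abs(c) > 1e-12 else 0j for c in cross]
    corr = [v.real for v in dft(whitened, inverse=True)]
    best_lag, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        val = corr[lag % n]
        if val > best_val:
            best_lag, best_val = lag, val
    return best_val, best_lag


def pairwise_gcc_phat(channels, max_lag):
    """Evaluate GCC-PHAT for every microphone pair: N*(N-1)/2 results."""
    return {
        (i, j): gcc_phat(channels[i], channels[j], max_lag)
        for i, j in combinations(range(len(channels)), 2)
    }


# Synthetic check: mic_b hears the same pulse three samples later than mic_a.
mic_a = [0.0] * 16
mic_a[2], mic_a[3], mic_a[4] = 1.0, -0.5, 0.25
mic_b = [0.0] * 3 + mic_a[:-3]
results = pairwise_gcc_phat([mic_a, mic_b, mic_a], max_lag=8)
max_power = max(value for value, _ in results.values())
```

- For three microphones the dictionary holds three entries, and the largest peak value among them plays the role of the maximum power value.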
- the delay values for each of the plurality of audio signals may be the information on the time when the audio signals are differently input to each of the plurality of microphones depending on the distance between the plurality of microphones. Therefore, the electronic device 100 may calculate the direction information on the sound source for the plurality of audio signals from the delay values for each of the plurality of frames.
- the direction information is the angle information between the sound direction of the sound sources for the plurality of audio signals and the plurality of microphones 111 . Therefore, the electronic device 100 may calculate the angle information that is the direction information on the sound source for the plurality of audio signals from the delay values calculated from the above <Equation 1>.
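- The conversion from a delay value to angle information is not spelled out in this passage. A common far-field model for one pair of microphones is sketched below; the function name, the speed of sound constant, the sampling rate, and the microphone spacing are assumptions, and the angle is measured from the direction perpendicular to the line joining the two microphones.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # approximate value for air at about 20 degrees C


def delay_to_angle_deg(delay_samples: float, sample_rate_hz: float,
                       mic_spacing_m: float) -> float:
    """Far-field arrival angle, in degrees, for one microphone pair."""
    delay_s = delay_samples / sample_rate_hz
    ratio = SPEED_OF_SOUND_M_S * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp against measurement noise
    return math.degrees(math.asin(ratio))


# Hypothetical setup: 16 kHz sampling, microphones 8.575 cm apart.
angle = delay_to_angle_deg(delay_samples=2, sample_rate_hz=16000,
                           mic_spacing_m=0.08575)
```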
- the electronic device 100 may calculate the minimum power value from the plurality of audio signals using a minima-controlled recursive average (MCRA) algorithm. Therefore, the electronic device 100 may calculate the power ratio from the maximum power value, which is the largest of the power values calculated using the cross-correlation function like the above <Equation 1>, and the minimum power value calculated using the MCRA algorithm. Once the power ratio is calculated, the electronic device 100 may compare it with the preset threshold value and store, for at least one audio signal among the plurality of audio signals whose power ratio is equal to or more than the preset threshold value, the direction information on the sound source and the index for that audio signal.
- the electronic device 100 may store the minimum power value using the minima-controlled recursive average (MCRA) algorithm. Therefore, if the minimum power value is stored and then the audio signal is input, the electronic device 100 may compare the minimum power value calculated from the input audio signal with the pre-stored minimum power value to calculate the power ratio based on the lower value of the two minimum power values.
- FIG. 10 is a second flow chart of a method for storing, by the electronic device according to another exemplary embodiment of the present disclosure, direction information on a sound source for at least one audio signal determined as a speech section and an index for at least one audio signal determined as the speech section.
- the electronic device 100 determines whether the plurality of audio signals are the predefined K-th audio signal (S 910 ). As the determination result, if the plurality of audio signals are not the predefined K-th audio signal, the electronic device 100 calculates the maximum power value and the minimum power value from the plurality of audio signals and compares the calculated minimum power value with the previous minimum power value pre-stored in the memory as described with reference to FIG. 9 (S 920 ). As the comparison result, if the currently calculated minimum power value is smaller than the minimum power value pre-stored in the memory, the electronic device 100 updates the minimum power value pre-stored in the memory to the minimum power value calculated from the plurality of audio signals (S 930 ).
- the electronic device 100 calculates the power ratio and the direction information from the previously calculated maximum power value and minimum power value (S 940 ).
- the method for calculating a power ratio and direction information from a plurality of audio signals is already described in detail with reference to FIG. 9 and therefore the detailed description thereof will be omitted.
- the electronic device 100 determines the previous minimum power value as the value for calculating the power ratio (S 950 ).
- the electronic device 100 may calculate the power ratio and the direction information from the maximum power value calculated from the plurality of audio signals and the previous minimum power value pre-stored in the memory, based on the foregoing step S 940 .
- the electronic device 100 may compare the calculated power ratio with the preset threshold value and store, for at least one audio signal among the plurality of audio signals whose power ratio is equal to or more than the preset threshold value, the direction information on the sound source and the index for that audio signal (S 960 and S 970 ).
- the electronic device 100 resets the minimum power value calculated from the K-th audio signal to be the initial value and stores it in the memory (S 980 ) and then performs the operations of the foregoing steps S 940 to S 970 .
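- The compare, update, and reset bookkeeping of FIG. 10 can be sketched as a small tracker. This is not the MCRA algorithm itself; the class name and the interpretation of the K-th audio signal as every K-th input are assumptions.

```python
class MinPowerTracker:
    """Tracks the minimum power value used for the power ratio and restarts
    the tracking at every K-th audio signal."""

    def __init__(self, reset_period: int):
        self.reset_period = reset_period  # the predefined K
        self.stored_min = None
        self.count = 0

    def update(self, frame_min_power: float) -> float:
        """Return the minimum power to use for the current power ratio."""
        self.count += 1
        if self.stored_min is None:
            self.stored_min = frame_min_power
            return frame_min_power
        if self.count % self.reset_period == 0:
            used = self.stored_min             # S 950: use the previous minimum
            self.stored_min = frame_min_power  # S 980: restart from this value
            return used
        # S 920 and S 930: keep the smaller of the stored and current minimum.
        self.stored_min = min(self.stored_min, frame_min_power)
        return self.stored_min


tracker = MinPowerTracker(reset_period=4)
used_minimums = [tracker.update(p) for p in [5.0, 3.0, 4.0, 9.0, 7.0, 8.0]]
```

- Resetting periodically keeps a single unusually quiet moment from holding the minimum down indefinitely, which appears to be the purpose of the K-th signal check.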
- the direction information on the sound source for at least one audio signal and the index for at least one audio signal are stored in the memory, as illustrated in FIG.
- the starting and ending points of the speech sections included in all the audio signals may be determined based on the direction information on the sound sources for the plurality of audio signals pre-stored in the memory and the speech sections included in all the audio signals may be detected based on the index information on the audio signals corresponding to the starting and ending points determined.
- the electronic device 100 may perform the preprocessing process of amplifying the plurality of audio signals included in the speech section and attenuating the remaining audio signals that are noise signals and then perform the speech recognition on the uttered speech of the user using the speech recognition algorithm like the STT algorithm in the speech section in which the audio signal is amplified.
- the electronic device 100 according to the exemplary embodiment of the present disclosure preferably repeats each of the steps of FIGS. 8 to 10 as described above until an event such as power off or deactivation of a speech recognition mode occurs.
- the method for recognizing a speech by an electronic device 100 may be implemented by at least one execution program for performing the speech recognition as described above, in which the execution program may be stored in a non-transitory computer readable medium or storage.
- a non-transitory computer readable medium is not a medium that stores data therein for a while, such as a register, a cache, a memory, or the like, but means a medium that semi-permanently stores data therein and is readable by a device.
- the foregoing programs may be stored in various types of recording media that are readable by a terminal, such as a random access memory (RAM), a flash memory, a read only memory (ROM), an erasable programmable ROM (EPROM), an electronically erasable programmable ROM (EEPROM), a register, a hard disk, a removable disk, a memory card, a universal serial bus (USB) memory, a compact-disk (CD) ROM, and the like.
Abstract
Description
- This application claims priority from Korean Patent Application No. 10-2015-0153033, filed on Nov. 2, 2015 in the Korean Intellectual Property Office, the disclosure of which is incorporated herein by reference in its entirety.
- 1. Field
- Apparatuses and methods consistent with the present disclosure relate to an electronic device and a method for recognizing a speech, and more particularly, to an electronic device and a method for detecting a speech section in an audio signal.
- 2. Description of the Related Art
- A speech recognition technology controlling various kinds of electronic devices using a speech signal has been widely used. Generally, the speech recognition technology means a technology of understanding an intention of an uttered speech of a user from a speech signal input from a hardware or software device or a system and performing an operation based on the understood intention.
- However, the speech recognition technology recognizes various sounds generated from the surrounding environment as well as a speech signal for the uttered speech of the user and therefore may not correctly perform the intended operation of the user.
- Therefore, various speech section detection algorithms for detecting only a speech section for an uttered speech of a user from an input audio signal have been developed.
- As a general method for detecting a speech section, there are a method for detecting a speech section using energy for each audio signal in a frame unit, a method for detecting a speech section using zero crossing for each audio signal in a frame unit, a method for extracting a feature vector from an audio signal in a frame unit and detecting a speech section by determining existence and nonexistence of a speech signal from the pre-extracted feature vector using a support vector machine (SVM), or the like.
- The methods for detecting a speech section using energy or zero crossing operate on the energy or the zero crossings of the audio signals for each frame. As a result, these methods require relatively less computation than other methods to determine whether the audio signals for each frame are the speech signal, but may often erroneously detect a noise signal, as well as the speech signal, as the speech section.
- Meanwhile, the method for detecting a speech section using a feature vector extracted from an audio signal in a frame unit and an SVM detects only the speech signal from the audio signals for each frame more accurately than the foregoing methods using energy or zero crossing, but requires more computation to determine the existence and nonexistence of the speech signal and therefore may consume much more CPU resources than other methods for detecting a speech section.
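- The two low-cost features mentioned above can be sketched as follows. The function names and the decision rule (high energy together with a low zero-crossing rate) are a common heuristic assumed here for illustration, and the thresholds are hypothetical.

```python
from typing import Sequence


def short_time_energy(frame: Sequence[float]) -> float:
    """Sum of squared samples in one frame."""
    return sum(x * x for x in frame)


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ.
    The frame must hold at least two samples."""
    crossings = sum(
        1 for prev, cur in zip(frame, frame[1:]) if (prev >= 0) != (cur >= 0)
    )
    return crossings / (len(frame) - 1)


def is_speech_frame(frame: Sequence[float], energy_threshold: float,
                    zcr_threshold: float) -> bool:
    """Heuristic: treat a frame as speech when it is loud and crosses zero rarely."""
    return (short_time_energy(frame) >= energy_threshold
            and zero_crossing_rate(frame) <= zcr_threshold)
```

- Both features cost a single pass over the frame, which is why these methods are cheap; a loud noise burst can satisfy the same thresholds, which is the false-detection problem described above.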
- Additional aspects and/or advantages will be set forth in part in the description which follows and, in part, will be apparent from the description, or may be learned by practice of the embodiments.
- Exemplary embodiments overcome the above disadvantages and other disadvantages not described above. Also, the embodiments are not required to overcome the disadvantages described above, and an exemplary embodiment may not overcome any of the problems described above.
- The present disclosure correctly detects a speech section including a speech signal from an input audio signal in an electronic device.
- The present disclosure receives speech signals uttered at a short distance and at a far distance and detects a speech section based on sound direction tracking of the speech signals in the electronic device.
- According to an aspect of the present disclosure, a method for recognizing a speech by an electronic device includes: receiving sounds generated from a sound source through a plurality of microphones; calculating power values of a plurality of audio signals generated by performing signal processing on each sound input through the plurality of microphones and calculating direction information on the sound source based on the calculated power values and storing the calculated direction information; and performing the speech recognition on a speech section included in the audio signal based on the direction information on the sound source.
- In the performing of the speech recognition, the speech section may be detected based on audio signals corresponding to starting and ending points among the plurality of audio signals and the speech recognition may be performed on the detected speech section.
- The storing may include: calculating a maximum power value and a minimum power value from the plurality of signal-processed audio signals; calculating a power ratio from the calculated maximum power value and minimum power value; determining at least one audio signal of which the calculated power ratio is equal to or more than a preset threshold value; and calculating the direction information on the sound source from a sound corresponding to the at least one determined audio signal and storing the calculated direction information and an index for the at least one determined audio signal.
- The storing may further include: comparing the minimum power value calculated from the plurality of audio signals with a pre-stored minimum power value to determine the smaller power value as the minimum power value for the plurality of audio signals, if the minimum power value calculated from a previous audio signal is pre-stored.
- The storing may further include: resetting a minimum power value calculated from an N-th audio signal to an initial value, if a predefined N-th audio signal is input.
- In the calculating of the maximum power value and the minimum power value, N*(N−1)/2 power values may be calculated from the plurality of audio signals using a generalized cross-correlation phase transform (GCC-PHAT) algorithm and the largest value among the N*(N−1)/2 power values may be determined as the maximum power value, if the number of microphones is N, and the minimum power value is calculated from the plurality of audio signals using a minima-controlled recursive average (MCRA) algorithm.
- The direction information may be angle information between sound direction of the sound source from which the sounds corresponding to each of the plurality of audio signals are generated and the plurality of microphones, and in the calculating of the maximum power value and the minimum power value, the direction information on the sound source from which the sounds corresponding to each of the plurality of audio signals are generated may be calculated from a delay value corresponding to the determined maximum power value.
- In the performing of the speech recognition, the speech recognition may be performed on a speech section included in audio signals corresponding to at least two pieces of direction information if the at least two pieces of direction information among the plurality of direction information are included in a preset error range or the error between the two pieces of direction information is less than a preset threshold value.
- The performing of the speech recognition may include: detecting the speech section from the audio signal based on the index for the at least one audio signal of which the power ratio is equal to or more than a preset threshold value; performing signal processing on the audio signal in the detected speech section based on the direction information on the sound source from which a sound corresponding to at least one audio signal of which the power ratio is equal to or more than a preset threshold value is generated; and performing the speech recognition from the signal-processed audio signal and transforming the speech into a text.
- In the performing of the signal processing, the signal processing may be performed on the audio signal in the detected speech section using at least one of a beam-forming scheme including at least one of linearly constrained minimum variance (LCMV) and minimum variance distortion-less response (MVDR), a geometric source separation (GSS) scheme, and a blind source extraction (BSE) scheme.
- According to another aspect of the present disclosure, an electronic device includes: an input receiving sounds generated from a sound source through a plurality of microphones; a memory storing direction information on the sound source; and a processor performing signal processing on each sound input through the plurality of microphones, calculating power values of a plurality of signal-processed audio signals, calculating direction information on the sound source based on the calculated power values and storing the calculated direction information in the memory, and performing the speech recognition on a speech section included in the audio signal based on the direction information on the sound source.
- The processor may detect the speech section based on audio signals corresponding to starting and ending points among the plurality of audio signals and perform the speech recognition on the detected speech section.
- The processor may calculate a maximum power value and a minimum power value from the plurality of signal-processed audio signals, calculate a power ratio from the calculated maximum power value and minimum power value, calculate the direction information on the sound source from a sound corresponding to at least one audio signal of which the calculated power ratio is equal to or more than a preset threshold value, and store the calculated direction information and an index for the at least one audio signal in a memory.
- The processor may compare the minimum power value calculated from the plurality of audio signals with a pre-stored minimum power value to determine the smaller power value as the minimum power value for the plurality of audio signals, if the minimum power value calculated from a previous audio signal is pre-stored in the memory.
- The processor may reset a minimum power value calculated from an N-th audio signal to an initial value, if a predefined N-th audio signal is input.
- The processor may calculate N*(N−1)/2 power values from the plurality of audio signals using a generalized cross-correlation phase transform (GCC-PHAT) algorithm, determine the largest value among the N*(N−1)/2 power values as the maximum power value, if the number of microphones is N, and calculate the minimum power value from the plurality of audio signals using a minima-controlled recursive average (MCRA) algorithm.
- The direction information may be the angle information between sound direction of the sound source from which the sounds corresponding to each of the plurality of audio signals are generated and the plurality of microphones, and the processor may calculate the direction information on the sound source from which the sounds corresponding to each of the plurality of audio signals are generated from a delay value corresponding to the determined maximum power value.
- The processor may perform the speech recognition on a speech section included in audio signals corresponding to at least two pieces of direction information if the at least two pieces of direction information among the plurality of direction information are included in a preset error range or the error between the two pieces of direction information is less than a preset threshold value.
- The processor may detect the speech section from the audio signal based on the index for the at least one audio signal of which the power ratio is equal to or more than a preset threshold value, perform signal processing on the audio signal in the detected speech section based on the direction information on the sound source from which a sound corresponding to at least one audio signal of which the power ratio is equal to or more than a preset threshold value is generated, and perform the speech recognition from the signal-processed audio signal and transform the speech into a text.
- The signal processing may be performed on the audio signal in the detected speech section using at least one of a beam-forming scheme including at least one of linearly constrained minimum variance (LCMV) and minimum variance distortion-less response (MVDR), a geometric source separation (GSS) scheme, and a blind source extraction (BSE) scheme.
- A computer program stored in a recording medium and combined with an electronic device to perform the following steps of: receiving sounds generated from a sound source through a plurality of microphones; calculating power values of a plurality of audio signals generated by performing signal processing on each sound input through the plurality of microphones and calculating direction information on the sound source based on the calculated power values and storing the calculated direction information; and performing the speech recognition on a speech section included in the audio signal based on the direction information on the sound source.
- According to an aspect of the present disclosure, a non-transitory computer readable storage stores a method, the method including receiving sound generated from a sound source via a plurality of microphones; calculating power values of a plurality of audio signals generated by performing signal processing on each sound input through the plurality of microphones; calculating direction information for the sound source based on the power values and storing the direction information; and performing speech recognition on a speech section included in an audio signal based on the direction information for the sound source.
- According to an aspect of the present disclosure, a system includes: a sound source; a plurality of microphones receiving sound produced by the sound source; a memory storing a direction of the sound source; and a computer processor calculating power values of audio signals from the microphones, calculating the direction of the sound source using the power values, storing the direction in the memory and performing speech recognition on a speech section of an audio signal responsive to the direction.
- According to an aspect of the present disclosure, a system includes a sound source; an array of microphones receiving sound signals of sound produced by the sound source, the microphones having one of different locations and different directionalities; and a computer processor calculating power values of sound signals from the array of microphones, selecting an audio signal from the sound signals using the power values and corresponding different locations and different directionalities, and performing speech recognition on a speech section of the audio signal in a noisy environment.
- According to an aspect of the present disclosure, a method includes: calculating power values of audio signals generated by microphones receiving sound from a sound source, calculating a direction of the sound source based on the power values and storing the direction, and identifying end points of a speech section responsive to an angle of the direction; and performing speech recognition on the speech section.
- As described above, according to various exemplary embodiments of the present disclosure, the electronic device may correctly detect only the speech section from the audio signal while improving the processing speed for the speech section detection.
- The above and/or other aspects will be more apparent by describing certain exemplary embodiments with reference to the accompanying drawings, in which:
- FIG. 1 is an exemplified diagram illustrating the environment in which an electronic device according to an exemplary embodiment of the present disclosure performs speech recognition;
- FIG. 2A is a schematic block diagram of the electronic device for recognizing a speech according to an exemplary embodiment of the present disclosure;
- FIG. 2B is a detailed block diagram of the electronic device for recognizing a speech according to an exemplary embodiment of the present disclosure;
- FIG. 3 is a block diagram illustrating a configuration of performing speech recognition in a processor according to an exemplary embodiment of the present disclosure;
- FIG. 4 is a detailed block diagram of a sound source direction detection module according to an exemplary embodiment of the present disclosure;
- FIGS. 5A to 5C are exemplified diagrams illustrating a speech section detection from an input audio signal in the electronic device according to an exemplary embodiment of the present disclosure;
- FIGS. 6A and 6B are exemplified diagrams illustrating a result of tracking a sound source direction from the input audio signal in the electronic device according to an exemplary embodiment of the present disclosure;
- FIG. 7 is an exemplified diagram of internet of things services provided from the electronic device according to the exemplary embodiment of the present disclosure;
- FIG. 8 is a flow chart of a method for performing speech recognition by an electronic device according to an exemplary embodiment of the present disclosure;
- FIG. 9 is a first flow chart of a method for storing direction information on a sound source for at least one audio signal determined as a speech section and an index for at least one audio signal, by the electronic device according to an exemplary embodiment of the present disclosure; and
- FIG. 10 is a second flow chart of a method for storing direction information on a sound source for at least one audio signal determined as a speech section and an index for at least one audio signal determined as the speech section, by an electronic device according to another exemplary embodiment of the present disclosure.
- Reference will now be made in detail to the embodiments, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the like elements throughout. The embodiments are described below by referring to the figures.
- Prior to describing in detail the present disclosure, a description method of the present specification and drawings will be described.
- First, terms used in the present specification and claims are selected as general terms in consideration of functions of various exemplary embodiments of the present disclosure. However, these terms may be changed depending on intention of a person in the art, legal or technical analysis, appearance of new technologies, or the like. Further, some terms may be arbitrarily selected by the present applicant. These terms may be analyzed as meaning defined in the present specification, and if terms are not defined in detail, the terms may also be analyzed based on the overall content of the present specification and general technology knowledge of the technical field in the art.
- Further, like reference numerals or signs described in the respective drawings accompanying in the present specification represent parts or components performing substantially the same function. For convenience of explanation and understanding, other exemplary embodiments will be described using like reference numerals or signs. In other words, even though components having like reference numerals are all illustrated in a plurality of drawings, the plurality of drawings do not mean an exemplary embodiment.
- Further, to differentiate between components in the present specification and claim, terms including ordinal numbers like “first”, “second”, or the like may be used. The ordinal numbers are used to differentiate like or similar components from each other and the meaning of the terms should not be restrictively analyzed by the use of the ordinal numbers. For example, a use order, a disposition order, or the like of components coupled to the ordinal numbers should not be limited by the numbers. If necessary, the respective ordinal numbers may also be used by being replaced by each other.
 - In the present specification, singular forms are intended to include plural forms unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” or “have” used in this specification specify the presence of stated features, numerals, steps, operations, components, parts, or a combination thereof, but do not preclude the presence or addition of one or more other features, numerals, steps, operations, components, parts, or a combination thereof.
 - Further, in the exemplary embodiment of the present disclosure, the terms “module”, “unit”, “part”, etc., name components that perform at least one function or operation, and these components may be implemented as hardware, as software, or as a combination of hardware and software. Further, a plurality of “modules”, “units”, “parts”, etc., may be integrated into at least one module or chip and implemented as at least one processor (not illustrated), except for the case in which each of the “modules”, “units”, “parts”, etc., needs to be implemented as individual specific hardware.
 - Further, in the exemplary embodiment of the present disclosure, when any portion is connected to other portions, this includes a direct connection and an indirect connection through other media. In addition, unless explicitly described otherwise, the statement that any portion includes any components will be understood to imply the inclusion of other components but not the exclusion of any other components.
- Hereinafter, various exemplary embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.
-
FIG. 1 is an exemplary diagram illustrating an environment in which an electronic device according to an exemplary embodiment of the present disclosure performs speech recognition. - As illustrated in
FIG. 1, an electronic device 100 (see FIG. 2 for details) for recognizing speech performs speech recognition on a speech signal for speech uttered by a user 2. The electronic device 100 may be a peripheral device in the house, such as a robot 1, a TV 4, or a cleaner 5, or a terminal device 3 that may control each of the peripheral devices such as the robot 1, the TV 4, and the cleaner 5. - The
electronic device 100 may receive the speech signal for the uttered speech of the user 2 through a plurality of microphones that are installed in the electronic device 100, or may receive the speech signal for the uttered speech of the user 2 from a plurality of microphones that are installed in the house. - Meanwhile, when there is noise generated from the surrounding environment like a sound of the
TV 4 when a speech command is uttered, the electronic device 100 may receive, through the plurality of microphones, a sound generated from a sound source that includes both the speech signal for the uttered speech of the user 2 and a noise signal for noise generated from the surrounding environment. - When receiving sounds generated from the sound source through the plurality of microphones, the
electronic device 100 performs signal processing on each sound input through each of the microphones. Next, the electronic device 100 calculates power values of a plurality of signal-processed audio signals and determines a direction of the sound source based on the calculated power values. Next, the electronic device 100 performs the speech recognition by removing the noise signal from the signal-processed audio data for the sound input from the determined direction of the sound source and detecting only the speech signal. Therefore, the electronic device 100 may mitigate the problem of wrongly recognizing the noise signal as the speech signal. - Meanwhile, the microphones that are installed in the electronic device 100 (see
FIG. 2) or installed at different locations in the house may include a plurality of microphone arrays having directionalities and may receive the sound generated from the sound source including the speech signal for the uttered speech of the user 2 in various directions through the plurality of microphone arrays. As such, when the microphone includes the plurality of microphone arrays, a single microphone may be installed in the electronic device 100 or in the house. -
FIG. 2A is a schematic block diagram of the electronic device 100 for recognizing speech according to an exemplary embodiment of the present disclosure, and FIG. 2B is a detailed block diagram of the electronic device for recognizing speech according to an exemplary embodiment of the present disclosure. - As illustrated in
FIGS. 2A and 2B, the electronic device 100 is configured to include an input 110, a memory 120, and a processor 130. - As illustrated in
FIG. 2B, the input 110 includes a plurality of microphones 111 and receives a sound generated from a sound source through the plurality of microphones 111. - However, the present disclosure is not limited thereto, and when the
microphone 111 is configured as a single microphone, the corresponding microphone 111 may receive the sound generated from the sound source in various directions through a plurality of microphone arrays. Here, the sound generated from the sound source may include the speech signal for the uttered speech of the user and the noise signal for noise generated from the surrounding environment. - The
memory 120 stores a direction or direction information for the sound source. - The
processor 130 performs the signal processing on each sound input through the plurality of microphones 111 and calculates the power values of the plurality of signal-processed audio signals. Next, the processor 130 calculates the direction information on the sound source based on the calculated power values and stores the calculated direction information in the memory 120. Next, the processor 130 performs the speech recognition on a speech section included in the audio signal based on the direction information on the sound source. - In detail, the
processor 130 calculates a maximum power value and a minimum power value from each of the signal-processed audio signals when the signal-processed audio signals are obtained from each sound input through the plurality of microphones 111. Next, the processor 130 calculates a power ratio from the maximum power value and the minimum power value that are calculated from each of the audio signals. Next, the processor 130 compares the power ratio calculated from each of the audio signals with a preset threshold value and calculates the direction information on the sound source from the sound corresponding to at least one audio signal having a power ratio that is equal to or more than the preset threshold value. Next, the processor 130 stores, in the memory 120, the calculated direction information on the sound source and an index for the at least one audio signal having the power ratio that is equal to or more than the preset threshold value. - Here, the index is identification information on the audio signal and, according to the exemplary embodiment of the present disclosure, may be information on the time when the audio signal is input.
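 - As a minimal sketch of the thresholding and storing step just described, the following Python fragment keeps the index and direction of every audio signal whose power ratio reaches the threshold. The names `FrameMeasurement` and `select_speech_candidates`, the sample values, and the reading of "power ratio" as the maximum power value divided by the minimum power value are assumptions made for illustration and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class FrameMeasurement:
    index: int        # identification of the audio signal, e.g. its input time
    max_power: float  # maximum power value of the frame
    min_power: float  # minimum power value of the frame
    angle_deg: float  # direction information on the sound source


def select_speech_candidates(frames, threshold):
    """Return (index, angle) for frames whose power ratio is >= threshold."""
    stored = []
    for frame in frames:
        if frame.min_power <= 0:
            continue  # ratio undefined; skipping is an assumption of this sketch
        ratio = frame.max_power / frame.min_power  # assumed definition of the ratio
        if ratio >= threshold:
            stored.append((frame.index, frame.angle_deg))
    return stored


# Hypothetical measurements: only frames 1 and 2 stand out from the noise floor.
frames = [
    FrameMeasurement(0, 0.2, 0.1, 10.0),
    FrameMeasurement(1, 0.9, 0.1, 31.0),
    FrameMeasurement(2, 0.8, 0.1, 29.0),
    FrameMeasurement(3, 0.15, 0.1, -70.0),
]
records = select_speech_candidates(frames, threshold=5.0)
```

Only the stored (index, direction) pairs are needed later to look for the starting and ending points of the speech section, which is why the sketch discards the power values once the comparison is done.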
- Next, the
processor 130 detects the speech sections from the audio signals corresponding to the starting and ending points of the uttered speech of the user among the plurality of audio signals, based on the direction information on the sound source and the index that are stored in the memory 120, and performs the speech recognition on the detected speech sections. - In detail, when receiving the sounds generated from the sound source through the plurality of
microphones 111, the processor 130 performs the signal processing on each sound input through the plurality of microphones 111 to obtain the audio signals. Next, the processor 130 may take L samples of each of the signal-processed audio signals and then group the L sampled values into a frame. - Next, the
processor 130 calculates the maximum power value and the minimum power value from each of the plurality of audio signals and calculates the power ratio from the calculated maximum power value and minimum power value. Here, the maximum power value and the minimum power value may be signal strength values for the audio signal. Therefore, the processor 130 may calculate the power ratio from the maximum power value having the largest signal strength value and the minimum power value having the smallest signal strength value among the plurality of audio signals. - Next, the
processor 130 stores, in the memory 120, the direction information on the sound source that generated the sound corresponding to at least one audio signal whose power ratio, calculated from the maximum power value and the minimum power value, is equal to or more than the preset threshold value among the plurality of audio signals, together with the index for that audio signal. - According to the exemplary embodiment of the present disclosure, when the number of
microphones 111 is N, the processor 130 calculates N*(N−1)/2 power values from the plurality of audio signals using a generalized cross-correlation phase transform (GCC-PHAT) algorithm. Next, the processor 130 may determine the largest value among the calculated N*(N−1)/2 power values as the maximum power value. - For example, when the number of
microphones 111 is two, the processor 130 may calculate one power value from the plurality of audio signals. In this case, the processor 130 may determine the calculated power value as the maximum power value. Meanwhile, when the number of microphones 111 is three, the processor 130 may calculate three power values from the plurality of audio signals and determine the largest value among the three power values as the maximum power value. - Meanwhile, the
processor 130 may calculate the N*(N−1)/2 power values and the delay values for each of the plurality of audio signals using a cross-correlation function like the following <Equation 1>. Here, the delay values for each of the plurality of audio signals may be information on the differences in the times at which the audio signals are input to each of the plurality of microphones 111, which depend on the distance between the plurality of microphones 111. -
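 - The expression for Equation 1 is not reproduced in this text. A standard form of the GCC-PHAT cross-correlation that is consistent with the symbols defined for Equation 1 is sketched below as an assumption, not as the equation of the disclosure. The lag variable n, the DFT length M, the 1/M normalization, and the symbols P_ij (power value of a microphone pair) and n̂_ij (its delay value) are introduced here for illustration only.

```latex
R_{ij}(n) = \frac{1}{M}\sum_{k=0}^{M-1}
  \frac{X_i(k)\,[X_j(k)]^{*}}{\left|X_i(k)\,[X_j(k)]^{*}\right|}\,
  e^{\sqrt{-1}\,2\pi k n / M},
\qquad
P_{ij} = \max_{n} R_{ij}(n),
\qquad
\hat{n}_{ij} = \arg\max_{n} R_{ij}(n)
```

Under this reading, each of the N*(N−1)/2 microphone pairs (i, j) contributes one peak height P_ij and one lag n̂_ij, which would correspond to the power values and delay values mentioned in the text.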
- In the
above Equation 1, i and j are the indexes for the audio signals input from the plurality of microphones 111, and Xi(k) is a discrete Fourier transform (DFT) signal for an i-th audio signal input from a first microphone among the plurality of microphones 111. Further, Xj(k) is a DFT signal for a j-th audio signal input from a second microphone among the plurality of microphones 111. Further, ( )* represents a complex conjugate and k represents an index for a discrete frequency. - Meanwhile, according to the exemplary embodiment of the present disclosure, the cross-correlation function like the above <
Equation 1> may be used as is, or one of various whitening methods for increasing resolving power, a method for allocating a different weight to each frequency, or a regularization method for preventing divergence may be used in a form modified from the above <Equation 1>. - Meanwhile, the
processor 130 may calculate the minimum power value from the plurality of audio signals using a minima-controlled recursive averaging (MCRA) algorithm. Here, the generalized cross-correlation phase transform (GCC-PHAT) algorithm and the MCRA algorithm are known technologies, and therefore a detailed description thereof will be omitted. - Therefore, the
processor 130 may calculate the power ratio from the maximum power value, which is the largest of the power values calculated using the cross-correlation function like the above <Equation 1>, and the minimum power value calculated using the MCRA algorithm. - Meanwhile, the
processor 130 determines whether a minimum power value calculated from a previous audio signal is pre-stored in the memory 120, prior to calculating the power ratio from the maximum power value and the minimum power value. As the determination result, if no minimum power value is pre-stored in the memory 120, the processor 130 may calculate the power ratio from the maximum power value, which is the largest of the power values calculated using the cross-correlation function like the above <Equation 1>, and the minimum power value calculated using the MCRA algorithm. - Meanwhile, if the minimum power value calculated from the previous audio signal is pre-stored in the
memory 120, the processor 130 compares the minimum power value calculated from the plurality of audio signals currently input with the pre-stored minimum power value and selects the smaller of the two. In detail, if the pre-stored minimum power value is smaller than the minimum power value currently calculated, the processor 130 calculates the power ratio from the pre-stored minimum power value and the maximum power value calculated from the plurality of audio signals currently input. - Meanwhile, if it is determined that the size of the minimum power value currently calculated is smaller than that of the pre-stored minimum power value, the
processor 130 updates the minimum power value pre-stored in the memory 120 to the minimum power value calculated from the plurality of audio signals currently input. Next, the processor 130 may calculate the power ratio from the maximum power value and the minimum power value that are calculated from the plurality of audio signals currently input. - Meanwhile, the
processor 130 performs the update of the minimum power value only until the sound corresponding to a predefined K-th audio signal is input. That is, if the sound corresponding to the predefined K-th audio signal is input, the processor 130 may reset the minimum power value calculated from the K-th audio signal to an initial value and store the initial value in the memory 120. - Meanwhile, if a sound corresponding to a K+1-th audio signal is input, the
processor 130 calculates the power ratio from the maximum power value and the minimum power value calculated from the K+1-th audio signal. Further, the processor 130 compares the minimum power value of the K+1-th audio signal with the minimum power value of the K-th audio signal that was reset to the initial value. As the comparison result, if the minimum power value of the K+1-th audio signal is smaller, the processor 130 updates the minimum power value pre-stored in the memory 120 to the minimum power value of the K+1-th audio signal, and if the minimum power value of the K+1-th audio signal is larger, the processor 130 keeps the minimum power value pre-stored in the memory 120. - Meanwhile, if the power ratio is calculated from the plurality of audio signals by the foregoing performance operation, the
processor 130 compares each power ratio with the preset threshold value and stores, in the memory 120, the direction information on the sound source that generated the sound corresponding to at least one audio signal having a power ratio equal to or more than the preset threshold value, together with the index for that audio signal. Once the direction information on the sound source and the index for the at least one audio signal are stored in the memory 120, the processor 130 may determine the starting and ending points of the speech section included in the audio signal based on the direction information on the sound source stored in the memory 120. According to the exemplary embodiment of the present disclosure, when the direction information on the plurality of sound sources is stored in the memory 120, the processor 130 may determine the audio signals corresponding to at least two pieces of direction information as the audio signals of the starting and ending points if the at least two pieces of direction information fall within the preset error range or if the difference between the at least two pieces of direction information is less than the preset threshold value. - Here, the direction information is the angle information between the sound direction of the sound sources from which the sounds corresponding to the plurality of audio signals are generated and the plurality of
microphones 111. Therefore, the processor 130 may calculate the angle information, which is the direction information on the sound sources from which the sounds corresponding to the plurality of audio signals are generated, from the delay value calculated by the above <Equation 1>, and the memory 120 may store the angle information on the audio signals for which a power ratio equal to or more than the preset threshold value is calculated and the index for the corresponding audio signal. - Therefore, the
processor 130 may determine whether the angle information on each of the plurality of audio signals pre-stored in the memory 120 belongs to the preset error range to acquire the angle information included in the preset error range. If at least two pieces of angle information included in the preset error range are acquired, the processor 130 determines the audio signals corresponding to the acquired angle information as a speech signal of a static sound source. - Meanwhile, if a difference in the angle information on each of the first and second audio signals among the plurality of pre-stored audio signals does not belong to the preset error range, the
processor 130 compares the difference value in the angle information on the first and second audio signals with the preset threshold value. As the comparison result, if the difference value is less than the preset threshold value, the processor 130 determines the first and second audio signals as a speech signal of a dynamic sound source. - If it is determined that at least two of the plurality of audio signals pre-stored in the
memory 120 are the speech signal by the various analyses, the processor 130 may determine each of the at least two audio signals determined as the speech signal as the audio signals of the starting and ending points. - If at least two audio signals are determined as the audio signals of the starting and ending points, the
processor 130 may detect the speech sections based on the indexes for the audio signals determined as the starting and ending points. If the speech sections are detected, the processor 130 performs the signal processing on the audio signals included in the speech sections based on the direction information on the sound sources for the audio signals determined as the starting and ending points. - In detail, the
processor 130 may perform signal processing that, based on the direction information on the sound sources for the audio signals determined as the starting and ending points, amplifies the audio signals in the speech sections that were obtained from the sound input from the corresponding direction and attenuates the audio signals from the remaining directions. - According to the exemplary embodiment of the present disclosure, the
processor 130 may perform signal processing to amplify, within the audio signal in the previously detected speech section, the audio signal in the direction corresponding to the direction information on the sound sources for the audio signals determined as the starting and ending points, and to attenuate the audio signals in the remaining directions, by at least one of a beam-forming scheme including at least one of linearly constrained minimum variance (LCMV) and minimum variance distortionless response (MVDR), a geometric source separation (GSS) scheme, and a blind source extraction (BSE) scheme. - Next, the
processor 130 performs the speech recognition on the audio signal in the signal-processed speech section and transforms it into text. According to the exemplary embodiment of the present disclosure, the processor 130 may perform the speech recognition on the audio signal in the signal-processed speech section using a speech to text (STT) algorithm and transform it into a text form. - Meanwhile, as illustrated in
FIGS. 2A and 2B, the foregoing input 110 may include a plurality of microphones 111, a manipulator 113, a touch input 115, and a user input 117. The plurality of microphones 111 output the uttered speech of the user or audio signals generated from other living environments to the processor 130. - The
manipulator 113 may be implemented as a key pad including various function keys, numeric keys, special keys, character keys, or the like, and when a display 191 to be described below is implemented in a touch screen form, the touch input 115 may be implemented as a touch pad having a mutual layer structure with the display 191. In this case, the touch input 115 may receive a touch command for an icon displayed through the display 191 to be described below. - The
user input 117 may receive an IR signal or an RF signal from at least one peripheral device (not illustrated). Therefore, the foregoing processor 130 may control an operation of the electronic device 100 based on the IR signal or the RF signal that is input through the user input 117. Here, the IR or RF signal may be a control signal or a speech signal for controlling the operation of the electronic device 100. - Meanwhile, the
electronic device 100 according to the exemplary embodiment of the present disclosure may further include various components besides the input 110, the memory 120, and the processor 130 that are described above. - According to the exemplary embodiment of the present disclosure, when the
electronic device 100 is implemented as a display device such as a smart phone or a smart TV, as illustrated in FIGS. 2A and 2B, the electronic device 100 may further include a communicator 140, a speech processor 150, a photographer 160, a sensor 170, a signal processor 180, and an output 190. - The
communicator 140 performs data communication with at least one peripheral device (not illustrated). According to an exemplary embodiment of the present disclosure, the communicator 140 may transmit the speech signal for the uttered speech of the user to a speech recognition server (not illustrated) and receive, from the speech recognition server, a speech recognition result in a text form. According to another exemplary embodiment of the present disclosure, the communicator 140 may perform data communication with a web server (not illustrated) to receive content corresponding to a user command or a content related search result. - As illustrated in
FIGS. 2A and 2B, the communicator 140 may include a short range communication module 141, a wireless communication module 143 such as a wireless LAN module, and a connector 145 including at least one of wired communication modules such as a high-definition multimedia interface (HDMI), a universal serial bus (USB), and Institute of Electrical and Electronics Engineers (IEEE) 1394. - The short
range communication module 141 is configured to wirelessly perform short range communication between a portable terminal device (not illustrated) and the electronic device 100. Here, the short range communication module 141 may include at least one of a Bluetooth module, an infrared data association (IrDA) module, a near field communication (NFC) module, a Wi-Fi module, and a Zigbee module. - Further, the
wireless communication module 143 is a module that is connected to an external network according to a wireless communication protocol such as IEEE to perform communications. In addition, the wireless communication module may further include a mobile communication module which is connected to a mobile communication network according to various mobile communication standards such as 3rd generation (3G), 3rd generation partnership project (3GPP), and long term evolution (LTE) to perform communications. - As such, the
communicator 140 may be implemented by the above-mentioned various short range communication schemes and may adopt other communication technologies not mentioned in the present specification as needed. - Meanwhile, the
connector 145 is configured to provide an interface with various source devices through standards such as USB 2.0, USB 3.0, HDMI, and IEEE 1394. The connector 145 may receive content data transmitted from an external server (not illustrated) through a wired cable connected to the connector 145 according to the control command of the processor 130, or may transmit pre-stored content data to an external recording medium. Further, the connector 145 may receive power from a power source through the wired cable physically connected to the connector 145. - The
speech processor 150 is configured to perform the speech recognition on the uttered speech section of the user among the audio signals input through the plurality of microphones 111. In detail, when the speech section is detected from the input audio signals, the speech processor 150 performs, according to the control command of the processor 130, a pre-processing process of amplifying the plurality of audio signals included in the detected speech section and attenuating the remaining audio signals, which are noise signals. Next, the speech processor 150 may perform the speech recognition on the uttered speech of the user using a speech recognition algorithm like the STT algorithm for the speech sections in which the audio signals are amplified. - The
photographer 160 photographs still images or moving images according to the user command and may be implemented as a plurality of cameras, such as a front camera and a rear camera. - The
sensor 170 senses various operation states of the electronic device 100 and a user interaction. In particular, the sensor 170 may sense a gripped state in which the user grips the electronic device 100. In detail, the electronic device 100 may be rotated or inclined in various directions. In this case, the sensor 170 may use at least one of various sensors such as a geomagnetic sensor, a gyro sensor, and an acceleration sensor to sense a tilt, etc., of the electronic device 100 that is gripped by the user based on a rotational motion or a gravity direction. - The
signal processor 180 may be configured to process the content received through the communicator 140 and image data and audio data of the content stored in the memory 120. In detail, the signal processor 180 may perform various image processing, such as decoding, scaling, noise filtering, frame rate conversion, and resolution conversion, on the image data included in the content. Further, the signal processor 180 may perform various audio signal processing, such as decoding, amplification, and noise filtering, on the audio data included in the content. - The
output 190 outputs the content that has been signal-processed by the signal processor 180. The output 190 may output the content through at least one of the display 191 and an audio output 192. That is, the display 191 may display the image data that are image processed by the signal processor 180 and the audio output 192 may output the audio data, which have undergone audio signal processing, in an audible sound form. - Meanwhile, the
display 191 that displays the image data may be implemented as a liquid crystal display (LCD), an organic light emitting display (OLED), a plasma display panel (PDP), or the like. In particular, the display 191 may be implemented in a touch screen form in which it forms a mutual layer structure with the touch input 115. - Meanwhile, the foregoing
processor 130 may include a CPU 131, a ROM 132, a RAM 133, and a GPU 135 that may be connected to one another via a bus 137. - The
CPU 131 accesses the memory 120 to perform booting using an O/S stored in the memory 120. Further, the CPU 131 performs various operations using various programs, content, data, and the like that are stored in the memory 120. - A set of commands for system booting, and the like are stored in the
ROM 132. When a turn on command is input and thus power is supplied, the CPU 131 copies the O/S stored in the memory 120 to the RAM 133 according to the command stored in the ROM 132 and executes the O/S to boot the system. If the booting is completed, the CPU 131 copies the various programs stored in the memory 120 to the RAM 133 and executes the programs copied to the RAM 133 to execute various operations. - The
GPU 135 generates a display screen including various objects like an icon, an image, a text, or the like. In detail, the GPU 135 calculates attribute values, such as coordinate values at which the respective objects will be displayed, shapes, sizes, and colors of the objects, based on a layout of the screen according to the received control command and generates display screens having various layouts including the objects based on the calculated attribute values. - The
processor 130 may be implemented as a system on chip (SoC) by being combined with various components such as the input 110, the communicator 140, and the sensor 170 that are described above. - Meanwhile, the operation of the
processor 130 may be executed by programs that are stored in the memory 120. Here, the memory 120 may be implemented as at least one of the ROM 132, the RAM 133, a memory card (for example, an SD card or a memory stick) that may be detached from and attached to the electronic device 100, a non-volatile memory, a volatile memory, a hard disk drive (HDD), and a solid state drive (SSD). - Meanwhile, as described above, the
processor 130 that detects the speech sections from the plurality of audio signals may do so using the program modules stored in the memory 120, as illustrated in FIG. 3. -
FIG. 3 is a block diagram illustrating a configuration of performing speech recognition in a processor according to an exemplary embodiment of the present disclosure. - As illustrated in
FIG. 3, the processor 130 may include a sound source direction detection module 121, a sound source direction recorder 122, an end point detection module 123, a speech signal processing module 124, and a speech recognition module 125. - If a plurality of signal-processed audio signals are input from sounds input through a plurality of microphones 111-1 and 111-2 or the
microphone 111 including a plurality of microphone arrays, the sound source direction detection module 121 may calculate a maximum power value and a minimum power value from each of the plurality of audio signals and acquire the direction information on the sound sources from which the sounds corresponding to each of the plurality of audio signals are generated and indexes for the plurality of audio signals based on the calculated maximum power value and minimum power value. -
FIG. 4 is a detailed block diagram of the sound source direction detection module according to an exemplary embodiment of the present disclosure. - As illustrated in
FIG. 4, the sound source direction detection module 121 includes a sound source direction calculation module 121-1 and a speech section detection module 121-2. - The sound source direction calculation module 121-1 calculates N*(N−1)/2 power values and delay values for each of the plurality of audio signals from the audio signals input through the plurality of microphones 111-1 and 111-2 based on a cross-correlation function.
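 - The pairwise calculation described for the sound source direction calculation module can be illustrated with a small, standard-library-only GCC-PHAT sketch. It is not the implementation of the disclosure: the function names, the naive DFT (adequate only for short frames), the `eps` regularizer, the 1/n normalization, and the sign convention (a positive delay means the first signal of the pair arrives later) are all assumptions made for this example.

```python
import cmath
from itertools import combinations


def dft(x):
    """Naive O(n^2) discrete Fourier transform, sufficient for short frames."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def gcc_phat(sig_i, sig_j, eps=1e-12):
    """Return (peak power, delay in samples) for one microphone pair."""
    n = len(sig_i)
    xi, xj = dft(sig_i), dft(sig_j)
    # Cross-spectrum X_i(k) * conj(X_j(k)), whitened so only the phase remains.
    cross = [a * b.conjugate() for a, b in zip(xi, xj)]
    white = [c / (abs(c) + eps) for c in cross]
    # Inverse DFT gives the correlation over circular lags 0..n-1.
    corr = [sum(white[k] * cmath.exp(2j * cmath.pi * k * lag / n)
                for k in range(n)).real / n
            for lag in range(n)]
    peak = max(range(n), key=lambda lag: corr[lag])
    delay = peak if peak <= n // 2 else peak - n  # map circular lag to signed lag
    return corr[peak], delay


def pairwise_power_and_delay(signals):
    """Compute GCC-PHAT for all N*(N-1)/2 microphone pairs."""
    return {(i, j): gcc_phat(signals[i], signals[j])
            for i, j in combinations(range(len(signals)), 2)}


def impulse(at, n=32):
    """Hypothetical test signal: a single click arriving at sample `at`."""
    return [1.0 if t == at else 0.0 for t in range(n)]


# Three microphones hear the same click at samples 4, 7 and 9.
mics = [impulse(4), impulse(7), impulse(9)]
results = pairwise_power_and_delay(mics)
```

With three microphones the dictionary holds three entries, one per pair. For the pair (0, 1) the delay is negative because microphone 0 hears the click three samples before microphone 1.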
- The speech section detection module 121-2 acquires a maximum power value among the calculated power values and the delay value corresponding to the maximum power value from the sound source direction calculation module 121-1. Next, the speech section detection module 121-2 calculates the minimum power value from the plurality of audio signals using an MCRA algorithm. Here, the maximum power value and the minimum power value may be signal strength values for the audio signals.
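 - To illustrate how the maximum power value and its delay value might be picked out and turned into direction information, the sketch below selects the strongest microphone pair and converts its delay into an angle. The disclosure only states that the angle is calculated from the delay value; the far-field formula, the speed of sound, the sample rate, the microphone spacing, and all function names used here are assumptions for illustration.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at about 20 degrees Celsius


def strongest_pair(pair_results):
    """pair_results maps (i, j) -> (power, delay_samples); return the max-power pair."""
    pair = max(pair_results, key=lambda p: pair_results[p][0])
    power, delay = pair_results[pair]
    return pair, power, delay


def delay_to_angle_deg(delay_samples, sample_rate_hz, mic_spacing_m):
    """Far-field angle relative to broadside of a two-microphone pair (assumed geometry)."""
    tau = delay_samples / sample_rate_hz               # delay in seconds
    s = SPEED_OF_SOUND_M_S * tau / mic_spacing_m       # sine of the arrival angle
    s = max(-1.0, min(1.0, s))                         # clamp noisy or rounded delays
    return math.degrees(math.asin(s))


# Hypothetical per-pair results for three microphones.
pair_results = {(0, 1): (0.42, 2), (0, 2): (0.87, -3), (1, 2): (0.55, -5)}
pair, max_power, delay = strongest_pair(pair_results)
angle = delay_to_angle_deg(delay, sample_rate_hz=16000, mic_spacing_m=0.1)
```

The clamp matters in practice: a delay estimate that slightly exceeds the physical maximum for the given spacing would otherwise make `asin` raise an error.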
 - If the minimum power value is calculated, the speech section detection module 121-2 compares the calculated minimum power value with the pre-stored minimum power value, selects the smaller of the two, and calculates a power ratio from the selected minimum power value and the maximum power value calculated from the plurality of audio signals. Next, the speech section detection module 121-2 compares the power ratio with the preset threshold value to detect audio signals having a power ratio equal to or more than the preset threshold value, and outputs the direction information on the sound source and the index for each detected audio signal.
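
The comparison and thresholding described above can be sketched as follows. This is a minimal sketch, not the disclosed implementation: it assumes the power ratio is the maximum power value divided by the selected minimum power value, and the function name, the tuple layout, and the small `eps` guard against division by zero are hypothetical.

```python
def detect_speech_frames(frames, threshold, stored_min=None, eps=1e-12):
    """Select frames whose power ratio reaches the threshold.

    frames: iterable of (index, max_power, min_power, angle_deg) tuples.
    Returns (records, stored_min), where records is a list of
    (index, angle_deg) pairs to be recorded in memory.
    Assumption: power ratio = max_power / selected minimum power.
    """
    records = []
    for index, max_power, min_power, angle_deg in frames:
        # Keep the smaller of the current and the pre-stored minimum.
        if stored_min is None or min_power < stored_min:
            stored_min = min_power
        ratio = max_power / max(stored_min, eps)
        if ratio >= threshold:
            records.append((index, angle_deg))
    return records, stored_min
```

Returning the updated minimum alongside the records lets a caller carry the pre-stored minimum power value from one batch of frames to the next.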
 - Therefore, the sound source direction recorder 122 may record, in the memory 120, the direction information on the sound source for the audio signal and the index for the audio signal that are output through the speech section detection module 121-2.
- If the direction information on the sound source for at least one of the plurality of audio signals and the index for at least one of the plurality of audio signals are recorded in the memory 120 by the series of operations, the end point detection module 123 may determine the starting and ending points of the speech section included in the audio signal based on the direction information on the sound source recorded in the memory 120. As described above, the direction information on the sound source recorded in the memory 120 may be the angle information between a sound direction of the sound sources from which the sounds corresponding to each of the plurality of audio signals are generated and the plurality of microphones 111-1 and 111-2.
- Therefore, the end point detection module 123 determines whether the angle information on each of the plurality of audio signals pre-stored in the memory 120 is included in the preset error range and, if at least two pieces of angle information included in the preset error range are acquired, determines the audio signals corresponding to the acquired angle information as speech signals from a static sound source.
- Meanwhile, if a difference in the angle information of the first and second audio signals among the plurality of pre-stored audio signals is not included in the preset error range, the end point detection module 123 may determine the first and second audio signals as speech signals from a dynamic sound source depending on whether the difference value of the angle information of the first and second audio signals is less than the preset threshold value.
- If it is determined by these analyses that at least two of the plurality of audio signals pre-stored in the memory 120 are speech signals, the end point detection module 123 may determine the at least two audio signals determined as speech signals as the audio signals of the starting and ending points.
- If the audio signals of the starting and ending points are determined, the speech signal processing module 124 detects the speech sections based on the indexes for the audio signals determined as the starting and ending points. Next, the speech signal processing module 124 performs the signal processing to amplify the audio signal in the direction corresponding to the direction information on the sound sources for the audio signals determined as the starting and ending points and attenuate the audio signals in the remaining directions. Therefore, the speech recognition module 125 may perform the speech recognition on the audio signal in the speech section that is signal-processed by the speech signal processing module 124 to transform the speech signal for the uttered speech of the user into text.
- As such, the electronic device 100 according to the exemplary embodiment of the present disclosure may detect the section having the power ratio equal to or more than the preset threshold value as the speech section based on the power ratio calculated from the plurality of audio signals, thereby accurately detecting the speech section for the uttered speech of the user even in an environment in which a lot of noise is present. Further, the electronic device 100 according to the exemplary embodiment of the present disclosure performs the speech recognition only in the detected speech section, thereby requiring less computation to perform the speech recognition than before. -
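
The static sound source case of the end point determination described above can be sketched as follows. The grouping rule is an assumption made for illustration: each recorded angle is tried as a reference, the largest group of records whose angles lie within `error_range_deg` of that reference is kept, and the smallest and largest indexes in the group are taken as the starting and ending points. The function name and the record layout are hypothetical, and angle wrap-around and the dynamic sound source case are not handled.

```python
def find_endpoints(records, error_range_deg):
    """Return (start_index, end_index) or None.

    records: list of (index, angle_deg) pairs previously stored in memory.
    At least two records must agree in angle within error_range_deg.
    """
    best_group = []
    for _, ref_angle in records:
        group = [
            index
            for index, angle in records
            if abs(angle - ref_angle) <= error_range_deg
        ]
        if len(group) > len(best_group):
            best_group = group
    if len(best_group) < 2:
        return None
    return min(best_group), max(best_group)
```

A record whose angle falls far outside the group, such as a noise burst from another direction, does not affect the returned end points.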
FIGS. 5A to 5C are exemplary diagrams illustrating speech section detection from an input audio signal in the electronic device according to an exemplary embodiment of the present disclosure.
- As illustrated in FIG. 5A, the sounds including the speech signal may be received through the plurality of microphones 111. Here, sections A to F 410 to 460 may be the speech sections including the speech signal and the remaining sections may be the noise sections including the noise signal.
- In detail, if the sounds generated from the sound source are input through the plurality of microphones 111, the electronic device 100 performs the signal processing on each of the input sounds. Next, the electronic device 100 calculates the maximum power value and the minimum power value from each of the plurality of signal-processed audio signals, and calculates the power ratio from the calculated maximum power value and minimum power value.
- As illustrated in FIG. 5B, the power ratio of sections A′ to F′ 411 to 461 corresponding to the sections A to F 410 to 460 may be equal to or more than a preset threshold value 470. Therefore, the electronic device 100 may detect the sections A′ to F′ 411 to 461 having the power ratio equal to or more than the preset threshold value 470 as the speech sections.
- Meanwhile, as illustrated in FIG. 5C, the angles of the audio signals of sections A″ to F″ 413 to 463 corresponding to the sections A′ to F′ 411 to 461 detected as the speech sections are present within the preset error range, and the angles of the other sections may be present outside the preset error range. In this case, as illustrated in FIG. 6 to be described below, the electronic device 100 may amplify only the audio signals in the directions corresponding to the angles present within the error range among the audio signals in the speech sections, that is, the sections A′ to F′ 411 to 461 having the power ratio equal to or more than the preset threshold value 470. -
FIG. 6, including FIGS. 6A and 6B, is an exemplary diagram illustrating a result of tracking a sound source direction from the input audio signal in the electronic device according to an exemplary embodiment of the present disclosure.
- Referring to FIG. 5, including FIGS. 5A to 5C, the speech section may be detected from the audio signals input through the plurality of microphones 111.
- If the speech section is detected from the audio signals, the electronic device 100 may perform the signal processing to amplify an audio signal in a specific direction among the audio signals in the detected speech section and attenuate the audio signals in the remaining directions.
- In detail, the electronic device 100 uses the angle information on the sound sources for the at least two audio signals determined as the starting and ending points among the plurality of audio signals having the power ratio equal to or more than the preset threshold value. Based on this angle information, the electronic device 100 amplifies the audio signal in the direction corresponding to the angle information among the audio signals in the previously detected speech section. Further, the electronic device 100 attenuates the audio signals in the remaining directions among the audio signals in the previously detected speech section.
- Therefore, as illustrated in FIGS. 6A and 6B, the electronic device 100 may amplify the audio signals in speech processing sections 510 to 560 corresponding to the sections A to F 410 to 460 detected as the speech sections and attenuate the audio signals in the remaining sections.
- Meanwhile, the electronic device 100 according to the exemplary embodiment of the present disclosure may provide various internet of things services based on the foregoing exemplary embodiments. -
FIG. 7 is an exemplary diagram of internet of things services provided by the electronic device according to the exemplary embodiment of the present disclosure.
- As illustrated in FIG. 7, the electronic device 100 may perform the speech recognition on the speech signal for the uttered speech of the user and control home appliances such as first and second TVs 10 and 10′, an air conditioner 20, a refrigerator 30, and a washing machine 40 in the house based on the recognized speech command.
- For example, the user may utter a speech command ‘turn on the TV!’ in his/her own room. If the speech command of the user is uttered, the electronic device 100 receives, through the plurality of microphones, sounds generated from sound sources including the speech signal corresponding to the speech command of the user and performs signal processing on each of the input sounds.
- Next, the electronic device 100 identifies the direction in which the speech command of the user is uttered based on the series of operations described above. Next, the electronic device 100 identifies the home appliances associated with the direction in which the speech command of the user is uttered based on the pre-stored direction information on each home appliance.
- In detail, the electronic device 100 may store the identification information corresponding to each of the first and second TVs 10 and 10′, the air conditioner 20, the refrigerator 30, and the washing machine 40, and the direction information on each home appliance. Therefore, the electronic device 100 may compare the direction in which the speech command of the user is uttered with the pre-stored direction information on each home appliance to detect the home appliances present within the preset range of the direction in which the speech command of the user is uttered.
- In the foregoing example, the first TV 10 may be located in a living room and the second TV 10′ may be located in the room in which the user is currently located. Further, the home appliance present within the preset range of the direction in which the speech command of the user is uttered may be the second TV 10′. In this case, the electronic device 100 may transmit a power on control signal to the second TV 10′ in the room in which the user is currently located among the first and second TVs 10 and 10′.
- Therefore, the second TV 10′ may perform a power on operation based on the power on control signal received from the electronic device 100, so that the user can watch broadcasting through the second TV 10′ present in the room in which the user is currently located.
- Hereinafter, a method for performing speech recognition by the electronic device 100 according to the exemplary embodiment of the present disclosure will be described in detail. -
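
The appliance selection in the FIG. 7 example can be sketched as a lookup over stored direction information. Everything in this sketch is hypothetical: the dictionary keys, the identifiers, the angles, and the rule of picking the closest appliance of the requested type within the preset range are assumptions for illustration only.

```python
def select_appliance(speech_angle_deg, appliances, appliance_type, preset_range_deg):
    """Return the id of the closest matching appliance, or None.

    appliances: list of dicts with hypothetical keys "id", "type", "angle_deg".
    An appliance matches when its type equals appliance_type and its stored
    direction lies within preset_range_deg of the speech direction.
    """
    best_id, best_diff = None, None
    for appliance in appliances:
        if appliance["type"] != appliance_type:
            continue
        # Angular distance on a circle, so that 355 and 5 degrees are 10 apart.
        diff = abs(appliance["angle_deg"] - speech_angle_deg) % 360.0
        diff = min(diff, 360.0 - diff)
        if diff <= preset_range_deg and (best_diff is None or diff < best_diff):
            best_id, best_diff = appliance["id"], diff
    return best_id
```

For a command such as ‘turn on the TV!’, the caller would pass the recognized appliance type and the direction of the utterance, and send the power on control signal to the returned appliance if one is found.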
FIG. 8 is a flow chart of a method for performing speech recognition by an electronic device according to an exemplary embodiment of the present disclosure.
- As illustrated in FIG. 8, if the sounds generated from the sound sources are input through the plurality of microphones, the electronic device 100 performs the signal processing on each of the input sounds to generate the plurality of signal-processed audio signals (S710). In detail, the electronic device 100 may sample each of the signal-processed audio signals into L samples and then generate the L sampled audio signals in a frame unit. When the plurality of audio signals are generated, the electronic device 100 calculates the power values from each of the plurality of audio signals (S720). Next, the electronic device 100 stores, based on the power values calculated from the plurality of audio signals, the direction information on the sound source from which the sound corresponding to at least one of the plurality of audio signals is generated and the index for the at least one audio signal (S730).
- Next, the electronic device 100 determines the starting and ending points of the speech sections included in all the audio signals based on the pre-stored direction information on the sound source (S740).
- According to the exemplary embodiment of the present disclosure, when the direction information on the sound sources for each of the plurality of sound sources is stored in the memory 120, the electronic device 100 may determine each of the audio signals corresponding to at least two pieces of direction information as the audio signals of the starting and ending points if the at least two pieces of direction information among the plurality of pieces of direction information are included in the preset error range or the error range of the at least two pieces of direction information is less than the preset threshold value.
- Next, the electronic device 100 detects the speech sections from all the audio signals based on the indexes for the audio signals corresponding to the starting and ending points and performs the speech recognition on the detected speech sections (S750).
- In detail, the electronic device 100 may detect the speech sections including the speech signals among all the audio signals based on the indexes for the audio signals corresponding to the starting and ending points. Next, the electronic device 100 performs the preprocessing of amplifying the plurality of audio signals included in the speech sections and attenuating the remaining audio signals that are noise signals.
- According to the exemplary embodiment of the present disclosure, the electronic device 100 may perform signal processing to amplify, from the audio signal in the previously detected speech section, the audio signal in the direction corresponding to the direction information on the sound sources for the audio signals determined as the starting and ending points and attenuate the audio signals in the remaining directions. This signal processing may use at least one of a beamforming scheme including at least one of linearly constrained minimum variance (LCMV) and minimum variance distortionless response (MVDR), a geometric source separation (GSS) scheme, and a blind source extraction (BSE) scheme.
- Next, the electronic device 100 may perform the speech recognition on the uttered speech of the user using a speech recognition algorithm such as the STT algorithm for the speech sections in which the audio signals are amplified.
- Hereinafter, a method for storing direction information on a sound source for at least one audio signal detected as a speech section and an index for the at least one audio signal, so that the electronic device 100 can detect the audio signals of the starting and ending points of the speech section from the audio signal, will be described in detail. -
FIG. 9 is a first flow chart of a method for storing direction information on a sound source for at least one audio signal determined as a speech section and an index for at least one audio signal, by the electronic device according to an exemplary embodiment of the present disclosure.
- As illustrated in FIG. 9, if the plurality of signal-processed audio signals are generated from the sounds input through the plurality of microphones, the electronic device 100 calculates the maximum power value and the minimum power value from each of the plurality of audio signals (S810). Next, the electronic device 100 calculates the power ratio from the calculated maximum power value and minimum power value (S820). Next, the electronic device 100 determines at least one audio signal of which the calculated power ratio is equal to or more than the preset threshold value among the plurality of audio signals and stores the direction information on the sound source for the at least one determined audio signal and the index for the at least one audio signal (S830 and S840).
- In detail, when the number of microphones is N, the electronic device 100 calculates N*(N−1)/2 power values from the plurality of audio signals using the generalized cross-correlation with phase transform (GCC-PHAT) algorithm. Next, the electronic device 100 may determine the largest value among the calculated N*(N−1)/2 power values as the maximum power value.
- According to the exemplary embodiment of the present disclosure, the electronic device 100 may calculate the N*(N−1)/2 power values from the plurality of audio signals and the delay values for each of the plurality of audio signals using the cross-correlation function of the above <Equation 1>. Here, the delay values for each of the plurality of audio signals may be the information on the time difference with which the audio signals are input to each of the plurality of microphones depending on the distance between the plurality of microphones. Therefore, the electronic device 100 may calculate the direction information on the sound source for the plurality of audio signals from the delay values for each of the plurality of frames.
- Here, the direction information is the angle information between the sound direction of the sound sources for the plurality of audio signals and the plurality of microphones 111. Therefore, the electronic device 100 may calculate the angle information that is the direction information on the sound source for the plurality of audio signals from the delay values calculated from the above <Equation 1>.
- Meanwhile, the electronic device 100 may calculate the minimum power value from the plurality of audio signals using the minima-controlled recursive averaging (MCRA) algorithm. Therefore, the electronic device 100 may calculate the power ratio from the maximum power value, which is the largest of the power values calculated using the cross-correlation function of the above <Equation 1>, and the minimum power value calculated using the MCRA algorithm. If the power ratio is calculated, the electronic device 100 may compare the calculated power ratio with the preset threshold value and store, among the plurality of audio signals, the direction information on the sound source for at least one audio signal having the power ratio equal to or more than the preset threshold value and the index for the at least one audio signal.
- Meanwhile, the electronic device 100 may store the minimum power value calculated using the MCRA algorithm. Therefore, if the minimum power value is stored and then an audio signal is input, the electronic device 100 may compare the minimum power value calculated from the input audio signal with the pre-stored minimum power value and calculate the power ratio based on the lower of the two minimum power values.
- Hereinafter, a method for storing, by the electronic device 100, direction information on a sound source for at least one audio signal determined as a speech section and an index for the at least one audio signal determined as the speech section in the state in which the minimum power value is pre-stored will be described in detail. -
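
For a single pair of microphones, the conversion from a delay value to angle information mentioned in the description of FIG. 9 could follow the common far-field relation cos θ = c·τ/d, where τ is the delay in seconds, d is the microphone spacing, and c is the speed of sound. This relation is an assumption chosen for illustration; the disclosure derives the angle from the delay values of its <Equation 1>, and the function name, parameters, and default speed of sound below are hypothetical.

```python
import math


def delay_to_angle_deg(lag_samples, sample_rate_hz, mic_distance_m,
                       speed_of_sound_m_s=343.0):
    """Convert a delay between two microphones into an arrival angle.

    Returns the angle in degrees between the arrival direction and the
    axis through the two microphones (far-field, plane-wave assumption).
    """
    tau = lag_samples / sample_rate_hz
    cos_theta = speed_of_sound_m_s * tau / mic_distance_m
    # Clamp to [-1, 1] so that measurement noise cannot break acos().
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))
```

A delay of zero samples maps to 90 degrees, that is, a source broadside to the microphone pair, while delays at or beyond the physical maximum d/c are clamped to 0 or 180 degrees.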
FIG. 10 is a second flow chart of a method for storing, by the electronic device according to another exemplary embodiment of the present disclosure, direction information on a sound source for at least one audio signal determined as a speech section and an index for at least one audio signal determined as the speech section.
- As illustrated in FIG. 10, if the plurality of signal-processed audio signals are generated from the sounds input through the plurality of microphones, the electronic device 100 determines whether the plurality of audio signals are the predefined K-th audio signal (S910). As the determination result, if the plurality of audio signals are not the predefined K-th audio signal, the electronic device 100 calculates the maximum power value and the minimum power value from the plurality of audio signals and compares the calculated minimum power value with the previous minimum power value pre-stored in the memory, as described with reference to FIG. 9 (S920). As the comparison result, if the currently calculated minimum power value is smaller than the minimum power value pre-stored in the memory, the electronic device 100 updates the minimum power value pre-stored in the memory to the minimum power value calculated from the plurality of audio signals (S930).
- Next, the electronic device 100 calculates the power ratio and the direction information from the previously calculated maximum power value and minimum power value (S940). The method for calculating a power ratio and direction information from a plurality of audio signals is already described in detail with reference to FIG. 9 and therefore the detailed description thereof will be omitted.
- Meanwhile, as the comparison result in the foregoing step S920, if the pre-stored previous minimum power value is smaller than the calculated minimum power value, the electronic device 100 determines the previous minimum power value as the value for calculating the power ratio (S950). Next, the electronic device 100 may calculate the power ratio and the direction information from the maximum power value calculated from the plurality of audio signals and the previous minimum power value pre-stored in the memory, based on the foregoing step S940.
- As such, if the power ratio is calculated from the plurality of audio signals, the electronic device 100 may compare the calculated power ratio with the preset threshold value and store, among the plurality of audio signals, the direction information on the sound source for at least one audio signal having the power ratio equal to or more than the preset threshold value and the index for the at least one audio signal (S960 and S970).
- Meanwhile, if the plurality of audio signals are the predefined K-th audio signal in the foregoing step S910, the electronic device 100 resets the minimum power value calculated from the K-th audio signal to be the initial value and stores it in the memory (S980), and then performs the operations of the foregoing steps S940 to S970. As such, if the direction information on the sound source for at least one audio signal and the index for at least one audio signal are stored in the memory, the electronic device 100 may, as illustrated in FIG. 8, determine the starting and ending points of the speech sections included in all the audio signals based on the direction information on the sound sources for the plurality of audio signals pre-stored in the memory, and detect the speech sections included in all the audio signals based on the index information on the audio signals corresponding to the determined starting and ending points.
- Next, the electronic device 100 may perform the preprocessing of amplifying the plurality of audio signals included in the speech section and attenuating the remaining audio signals that are noise signals, and then perform the speech recognition on the uttered speech of the user using a speech recognition algorithm such as the STT algorithm in the speech section in which the audio signal is amplified. Meanwhile, the electronic device 100 according to the exemplary embodiment of the present disclosure preferably repeats each of the steps of FIGS. 8 to 10 described above until an event such as power off or deactivation of a speech recognition mode occurs.
- Meanwhile, the method for recognizing a speech by the electronic device 100 according to the exemplary embodiment of the present disclosure may be implemented by at least one execution program for performing the speech recognition as described above, in which the execution program may be stored in a non-transitory computer readable medium or storage.
- A non-transitory computer readable medium, as used herein, is not a medium that stores data therein for a short time, such as a register, a cache, a memory, or the like, but means a medium that semi-permanently stores data therein and is readable by a device. In detail, the foregoing programs may be stored in various types of recording media that are readable by a terminal, such as a random access memory (RAM), a flash memory, a read only memory (ROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), a register, a hard disk, a removable disk, a memory card, a universal serial bus (USB) memory, a compact-disk (CD) ROM, and the like.
- Hereinabove, the present disclosure has been described with reference to exemplary embodiments thereof.
 - Hereinabove, the exemplary embodiments of the present disclosure are illustrated and described, but the present disclosure is not limited to the foregoing specific exemplary embodiments. It is apparent that various modifications can be made by those skilled in the art without departing from the spirit of the present disclosure described in the appended claims, and these modifications should not be understood separately from the technical ideas or prospects of the present disclosure.
- Although a few embodiments have been shown and described, it would be appreciated by those skilled in the art that changes may be made in these embodiments without departing from the principles and spirit thereof, the scope of which is defined in the claims and their equivalents.
Claims (20)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR10-2015-0153033 | 2015-11-02 | ||
KR1020150153033A KR102444061B1 (en) | 2015-11-02 | 2015-11-02 | Electronic device and method for recognizing voice of speech |
Publications (2)
Publication Number | Publication Date |
---|---|
US20170125037A1 (en) | 2017-05-04 |
US10540995B2 (en) | 2020-01-21 |
Family
ID=58635659
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/340,528 Expired - Fee Related US10540995B2 (en) | 2015-11-02 | 2016-11-01 | Electronic device and method for recognizing speech |
Country Status (4)
Country | Link |
---|---|
US (1) | US10540995B2 (en) |
KR (1) | KR102444061B1 (en) |
CN (1) | CN108352159B (en) |
WO (1) | WO2017078361A1 (en) |
Cited By (79)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150379672A1 (en) * | 2014-06-27 | 2015-12-31 | Samsung Electronics Co., Ltd | Dynamically optimized deferred rendering pipeline |
US9772817B2 (en) | 2016-02-22 | 2017-09-26 | Sonos, Inc. | Room-corrected voice detection |
CN107742522A (en) * | 2017-10-23 | 2018-02-27 | 科大讯飞股份有限公司 | Target voice acquisition methods and device based on microphone array |
US9942678B1 (en) | 2016-09-27 | 2018-04-10 | Sonos, Inc. | Audio playback settings for voice interaction |
US9947316B2 (en) | 2016-02-22 | 2018-04-17 | Sonos, Inc. | Voice control of a media playback system |
US9965247B2 (en) | 2016-02-22 | 2018-05-08 | Sonos, Inc. | Voice controlled media playback system based on user profile |
US9978390B2 (en) | 2016-06-09 | 2018-05-22 | Sonos, Inc. | Dynamic player selection for audio signal processing |
US10021503B2 (en) | 2016-08-05 | 2018-07-10 | Sonos, Inc. | Determining direction of networked microphone device relative to audio playback device |
US10034116B2 (en) | 2016-09-22 | 2018-07-24 | Sonos, Inc. | Acoustic position measurement |
US10051366B1 (en) | 2017-09-28 | 2018-08-14 | Sonos, Inc. | Three-dimensional beam forming with a microphone array |
US10075793B2 (en) | 2016-09-30 | 2018-09-11 | Sonos, Inc. | Multi-orientation playback device microphones |
US10097939B2 (en) | 2016-02-22 | 2018-10-09 | Sonos, Inc. | Compensation for speaker nonlinearities |
US10095470B2 (en) | 2016-02-22 | 2018-10-09 | Sonos, Inc. | Audio response playback |
US10115400B2 (en) | 2016-08-05 | 2018-10-30 | Sonos, Inc. | Multiple voice services |
CN108766457A (en) * | 2018-05-30 | 2018-11-06 | 北京小米移动软件有限公司 | Acoustic signal processing method, device, electronic equipment and storage medium |
US10134399B2 (en) | 2016-07-15 | 2018-11-20 | Sonos, Inc. | Contextualization of voice inputs |
US10152969B2 (en) | 2016-07-15 | 2018-12-11 | Sonos, Inc. | Voice detection by multiple devices |
US10181323B2 (en) | 2016-10-19 | 2019-01-15 | Sonos, Inc. | Arbitration-based voice recognition |
US10264030B2 (en) | 2016-02-22 | 2019-04-16 | Sonos, Inc. | Networked microphone device control |
US10365889B2 (en) | 2016-02-22 | 2019-07-30 | Sonos, Inc. | Metadata exchange involving a networked playback system and a networked microphone system |
US20190250881A1 (en) * | 2018-02-14 | 2019-08-15 | International Business Machines Corporation | Voice command filtering |
US10445057B2 (en) | 2017-09-08 | 2019-10-15 | Sonos, Inc. | Dynamic computation of system response volume |
US10446165B2 (en) | 2017-09-27 | 2019-10-15 | Sonos, Inc. | Robust short-time fourier transform acoustic echo cancellation during audio playback |
US10466962B2 (en) | 2017-09-29 | 2019-11-05 | Sonos, Inc. | Media playback system with voice assistance |
US10475449B2 (en) | 2017-08-07 | 2019-11-12 | Sonos, Inc. | Wake-word detection suppression |
US10482868B2 (en) | 2017-09-28 | 2019-11-19 | Sonos, Inc. | Multi-channel acoustic echo cancellation |
CN110505547A (en) * | 2018-05-17 | 2019-11-26 | 深圳瑞利声学技术股份有限公司 | A kind of earphone wearing state detection method and earphone |
US10573321B1 (en) | 2018-09-25 | 2020-02-25 | Sonos, Inc. | Voice detection optimization based on selected voice assistant service |
US10587430B1 (en) | 2018-09-14 | 2020-03-10 | Sonos, Inc. | Networked devices, systems, and methods for associating playback devices based on sound codes |
US10586540B1 (en) | 2019-06-12 | 2020-03-10 | Sonos, Inc. | Network microphone device with command keyword conditioning |
US10602268B1 (en) | 2018-12-20 | 2020-03-24 | Sonos, Inc. | Optimization of network microphone devices using noise classification |
US10621981B2 (en) | 2017-09-28 | 2020-04-14 | Sonos, Inc. | Tone interference cancellation |
US10645493B2 (en) | 2018-08-21 | 2020-05-05 | Samsung Electronics Co., Ltd. | Sound direction detection sensor and electronic apparatus including the same |
CN111181949A (en) * | 2019-12-25 | 2020-05-19 | 视联动力信息技术股份有限公司 | Sound detection method, device, terminal equipment and storage medium |
US10681460B2 (en) | 2018-06-28 | 2020-06-09 | Sonos, Inc. | Systems and methods for associating playback devices with voice assistant services |
US10692518B2 (en) | 2018-09-29 | 2020-06-23 | Sonos, Inc. | Linear filtering for noise-suppressed speech detection via multiple network microphone devices |
EP3676830A4 (en) * | 2017-11-30 | 2020-09-23 | Samsung Electronics Co., Ltd. | Method of providing service based on location of sound source and speech recognition device therefor |
US10797667B2 (en) | 2018-08-28 | 2020-10-06 | Sonos, Inc. | Audio notifications |
US10818290B2 (en) | 2017-12-11 | 2020-10-27 | Sonos, Inc. | Home graph |
US10847178B2 (en) | 2018-05-18 | 2020-11-24 | Sonos, Inc. | Linear filtering for noise-suppressed speech detection |
US10867604B2 (en) | 2019-02-08 | 2020-12-15 | Sonos, Inc. | Devices, systems, and methods for distributed voice processing |
US10871943B1 (en) | 2019-07-31 | 2020-12-22 | Sonos, Inc. | Noise classification for event detection |
US10880650B2 (en) | 2017-12-10 | 2020-12-29 | Sonos, Inc. | Network microphone devices with automatic do not disturb actuation capabilities |
US10878811B2 (en) | 2018-09-14 | 2020-12-29 | Sonos, Inc. | Networked devices, systems, and methods for intelligently deactivating wake-word engines |
US10959029B2 (en) | 2018-05-25 | 2021-03-23 | Sonos, Inc. | Determining and adapting to changes in microphone performance of playback devices |
CN112578338A (en) * | 2019-09-27 | 2021-03-30 | 阿里巴巴集团控股有限公司 | Sound source positioning method, device, equipment and storage medium |
US11024331B2 (en) | 2018-09-21 | 2021-06-01 | Sonos, Inc. | Voice detection optimization using sound metadata |
US11076035B2 (en) | 2018-08-28 | 2021-07-27 | Sonos, Inc. | Do not disturb feature for audio notifications |
US11100923B2 (en) | 2018-09-28 | 2021-08-24 | Sonos, Inc. | Systems and methods for selective wake word detection using neural network models |
US11120794B2 (en) | 2019-05-03 | 2021-09-14 | Sonos, Inc. | Voice assistant persistence across multiple network microphone devices |
US11132989B2 (en) | 2018-12-13 | 2021-09-28 | Sonos, Inc. | Networked microphone devices, systems, and methods of localized arbitration |
US11138975B2 (en) | 2019-07-31 | 2021-10-05 | Sonos, Inc. | Locally distributed keyword detection |
US11138969B2 (en) | 2019-07-31 | 2021-10-05 | Sonos, Inc. | Locally distributed keyword detection |
US11175880B2 (en) | 2018-05-10 | 2021-11-16 | Sonos, Inc. | Systems and methods for voice-assisted media content selection |
US11183183B2 (en) | 2018-12-07 | 2021-11-23 | Sonos, Inc. | Systems and methods of operating media playback systems having multiple voice assistant services |
US11183181B2 (en) | 2017-03-27 | 2021-11-23 | Sonos, Inc. | Systems and methods of multiple voice services |
US11189286B2 (en) | 2019-10-22 | 2021-11-30 | Sonos, Inc. | VAS toggle based on device orientation |
US11200889B2 (en) | 2018-11-15 | 2021-12-14 | Sonos, Inc. | Dilated convolutions and gating for efficient keyword spotting |
US11200890B2 (en) | 2018-05-01 | 2021-12-14 | International Business Machines Corporation | Distinguishing voice commands |
US11200900B2 (en) | 2019-12-20 | 2021-12-14 | Sonos, Inc. | Offline voice control |
US11200894B2 (en) | 2019-06-12 | 2021-12-14 | Sonos, Inc. | Network microphone device with command keyword eventing |
US20220028377A1 (en) * | 2018-12-19 | 2022-01-27 | Samsung Electronics Co., Ltd. | Electronic device and method for controlling same |
US11238856B2 (en) | 2018-05-01 | 2022-02-01 | International Business Machines Corporation | Ignoring trigger words in streamed media content |
US11276395B1 (en) * | 2017-03-10 | 2022-03-15 | Amazon Technologies, Inc. | Voice-based parameter assignment for voice-capturing devices |
CN114268984A (en) * | 2021-11-15 | 2022-04-01 | 珠海格力电器股份有限公司 | Signal processing method, electronic device and storage medium |
US11308962B2 (en) | 2020-05-20 | 2022-04-19 | Sonos, Inc. | Input detection windowing |
US11308958B2 (en) | 2020-02-07 | 2022-04-19 | Sonos, Inc. | Localized wakeword verification |
US11315556B2 (en) | 2019-02-08 | 2022-04-26 | Sonos, Inc. | Devices, systems, and methods for distributed voice processing by transmitting sound data associated with a wake word to an appropriate device for identification |
US11343614B2 (en) | 2018-01-31 | 2022-05-24 | Sonos, Inc. | Device designation of playback and network microphone device arrangements |
US11355108B2 (en) | 2019-08-20 | 2022-06-07 | International Business Machines Corporation | Distinguishing voice commands |
US11361756B2 (en) | 2019-06-12 | 2022-06-14 | Sonos, Inc. | Conditional wake word eventing based on environment |
US11482224B2 (en) | 2020-05-20 | 2022-10-25 | Sonos, Inc. | Command keywords with input detection windowing |
US11551700B2 (en) | 2021-01-25 | 2023-01-10 | Sonos, Inc. | Systems and methods for power-efficient keyword detection |
US11556307B2 (en) | 2020-01-31 | 2023-01-17 | Sonos, Inc. | Local voice data processing |
US11562740B2 (en) | 2020-01-07 | 2023-01-24 | Sonos, Inc. | Voice verification for media playback |
US11698771B2 (en) | 2020-08-25 | 2023-07-11 | Sonos, Inc. | Vocal guidance engines for playback devices |
US11727919B2 (en) | 2020-05-20 | 2023-08-15 | Sonos, Inc. | Memory allocation for keyword spotting engines |
US11899519B2 (en) | 2018-10-23 | 2024-02-13 | Sonos, Inc. | Multiple stage network microphone device with reduced power consumption and processing load |
US11961519B2 (en) | 2022-04-18 | 2024-04-16 | Sonos, Inc. | Localized wakeword verification |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106782585B (en) * | 2017-01-26 | 2020-03-20 | 芋头科技(杭州)有限公司 | Pickup method and system based on microphone array |
KR102395013B1 (en) * | 2017-09-05 | 2022-05-04 | 엘지전자 주식회사 | Method for operating artificial intelligence home appliance and voice recognition server system |
KR102087307B1 (en) * | 2018-03-15 | 2020-03-10 | 한양대학교 산학협력단 | Method and apparatus for estimating direction of ensemble sound source based on deepening neural network for estimating direction of sound source robust to reverberation environment |
CN109256153B (en) * | 2018-08-29 | 2021-03-02 | 云知声智能科技股份有限公司 | Sound source positioning method and system |
JP2021536692A (en) * | 2018-09-13 | 2021-12-27 | アリババ グループ ホウルディング リミテッド | Human machine voice dialogue device and its operation method |
KR20200074680A (en) * | 2018-12-17 | 2020-06-25 | 삼성전자주식회사 | Terminal device and method for controlling thereof |
CN109903753B (en) * | 2018-12-28 | 2022-07-15 | 广州索答信息科技有限公司 | Multi-person sentence classification method, equipment, medium and system based on sound source angle |
CN112216303A (en) * | 2019-07-11 | 2021-01-12 | 北京声智科技有限公司 | Voice processing method and device and electronic equipment |
CN110517677B (en) * | 2019-08-27 | 2022-02-08 | 腾讯科技(深圳)有限公司 | Speech processing system, method, apparatus, speech recognition system, and storage medium |
TWI736117B (en) * | 2020-01-22 | 2021-08-11 | 瑞昱半導體股份有限公司 | Device and method for sound localization |
CN111312275B (en) * | 2020-02-13 | 2023-04-25 | 大连理工大学 | On-line sound source separation enhancement system based on sub-band decomposition |
CN112837703A (en) * | 2020-12-30 | 2021-05-25 | 深圳市联影高端医疗装备创新研究院 | Method, apparatus, device and medium for acquiring voice signal in medical imaging device |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020004837A1 (en) * | 2000-07-07 | 2002-01-10 | Mitsubishi Denki Kabushiki Kaisha | E-mail communication terminal apparatus |
US20020048376A1 (en) * | 2000-08-24 | 2002-04-25 | Masakazu Ukita | Signal processing apparatus and signal processing method |
US20020138254A1 (en) * | 1997-07-18 | 2002-09-26 | Takehiko Isaka | Method and apparatus for processing speech signals |
US20030235295A1 (en) * | 2002-06-24 | 2003-12-25 | He Perry P. | Method and apparatus for non-linear processing of an audio signal |
US20100017205A1 (en) * | 2008-07-18 | 2010-01-21 | Qualcomm Incorporated | Systems, methods, apparatus, and computer program products for enhanced intelligibility |
US20110191102A1 (en) * | 2010-01-29 | 2011-08-04 | University Of Maryland, College Park | Systems and methods for speech extraction |
US20150222996A1 (en) * | 2014-01-31 | 2015-08-06 | Malaspina Labs (Barbados), Inc. | Directional Filtering of Audible Signals |
US20160064012A1 (en) * | 2014-08-27 | 2016-03-03 | Fujitsu Limited | Voice processing device, voice processing method, and non-transitory computer readable recording medium having therein program for voice processing |
US9621984B1 (en) * | 2015-10-14 | 2017-04-11 | Amazon Technologies, Inc. | Methods to process direction data of an audio input device using azimuth values |
Family Cites Families (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5774851A (en) * | 1985-08-15 | 1998-06-30 | Canon Kabushiki Kaisha | Speech recognition apparatus utilizing utterance length information |
JP3337588B2 (en) | 1995-03-31 | 2002-10-21 | 松下電器産業株式会社 | Voice response device |
KR100198019B1 (en) | 1996-11-20 | 1999-06-15 | 정선종 | Remote speech input and its processing method using microphone array |
US5867574A (en) * | 1997-05-19 | 1999-02-02 | Lucent Technologies Inc. | Voice activity detection system and method |
JP4005203B2 (en) * | 1998-02-03 | 2007-11-07 | 富士通テン株式会社 | In-vehicle speech recognition device |
US7437286B2 (en) | 2000-12-27 | 2008-10-14 | Intel Corporation | Voice barge-in in telephony speech recognition |
AU2002363054A1 (en) * | 2001-09-12 | 2003-05-06 | Bitwave Private Limited | System and apparatus for speech communication and speech recognition |
JP4195267B2 (en) * | 2002-03-14 | 2008-12-10 | インターナショナル・ビジネス・マシーンズ・コーポレーション | Speech recognition apparatus, speech recognition method and program thereof |
JP3910898B2 (en) | 2002-09-17 | 2007-04-25 | 株式会社東芝 | Directivity setting device, directivity setting method, and directivity setting program |
US20090018828A1 (en) * | 2003-11-12 | 2009-01-15 | Honda Motor Co., Ltd. | Automatic Speech Recognition System |
JP4659556B2 (en) | 2005-08-11 | 2011-03-30 | 富士通株式会社 | Sound source direction detection device |
KR100751921B1 (en) | 2005-11-11 | 2007-08-24 | 고려대학교 산학협력단 | Method and apparatus for removing noise of multi-channel voice signal |
JP4282704B2 (en) * | 2006-09-27 | 2009-06-24 | 株式会社東芝 | Voice section detection apparatus and program |
WO2010004649A1 (en) * | 2008-07-11 | 2010-01-14 | パイオニア株式会社 | Delay amount determination device, sound image localization device, delay amount determination method, and delay amount determination processing program |
EP2339574B1 (en) * | 2009-11-20 | 2013-03-13 | Nxp B.V. | Speech detector |
JP5668553B2 (en) | 2011-03-18 | 2015-02-12 | 富士通株式会社 | Voice erroneous detection determination apparatus, voice erroneous detection determination method, and program |
DE112011105136B4 (en) * | 2011-04-08 | 2018-12-13 | Mitsubishi Electric Corporation | Speech recognition device and navigation device |
US9031259B2 (en) * | 2011-09-15 | 2015-05-12 | JVC Kenwood Corporation | Noise reduction apparatus, audio input apparatus, wireless communication apparatus, and noise reduction method |
US8942386B2 (en) * | 2011-11-30 | 2015-01-27 | Midas Technology, Inc. | Real-time quality monitoring of speech and audio signals in noisy reverberant environments for teleconferencing systems |
US9070374B2 (en) * | 2012-02-20 | 2015-06-30 | JVC Kenwood Corporation | Communication apparatus and condition notification method for notifying a used condition of communication apparatus by using a light-emitting device attached to communication apparatus |
KR20130101943A (en) | 2012-03-06 | 2013-09-16 | 삼성전자주식회사 | Endpoints detection apparatus for sound source and method thereof |
US9131295B2 (en) * | 2012-08-07 | 2015-09-08 | Microsoft Technology Licensing, Llc | Multi-microphone audio source separation based on combined statistical angle distributions |
FR3011377B1 (en) | 2013-10-01 | 2015-11-06 | Aldebaran Robotics | METHOD FOR LOCATING A SOUND SOURCE AND HUMANOID ROBOT USING SUCH A METHOD |
- 2015
  - 2015-11-02 KR KR1020150153033A patent/KR102444061B1/en active IP Right Grant
- 2016
  - 2016-11-01 WO PCT/KR2016/012427 patent/WO2017078361A1/en active Application Filing
  - 2016-11-01 CN CN201680063709.3A patent/CN108352159B/en active Active
  - 2016-11-01 US US15/340,528 patent/US10540995B2/en not active Expired - Fee Related
Cited By (188)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9842428B2 (en) * | 2014-06-27 | 2017-12-12 | Samsung Electronics Co., Ltd. | Dynamically optimized deferred rendering pipeline |
US20150379672A1 (en) * | 2014-06-27 | 2015-12-31 | Samsung Electronics Co., Ltd. | Dynamically optimized deferred rendering pipeline |
US10740065B2 (en) | 2016-02-22 | 2020-08-11 | Sonos, Inc. | Voice controlled media playback system |
US11514898B2 (en) | 2016-02-22 | 2022-11-29 | Sonos, Inc. | Voice control of a media playback system |
US10970035B2 (en) | 2016-02-22 | 2021-04-06 | Sonos, Inc. | Audio response playback |
US9947316B2 (en) | 2016-02-22 | 2018-04-17 | Sonos, Inc. | Voice control of a media playback system |
US9965247B2 (en) | 2016-02-22 | 2018-05-08 | Sonos, Inc. | Voice controlled media playback system based on user profile |
US11006214B2 (en) | 2016-02-22 | 2021-05-11 | Sonos, Inc. | Default playback device designation |
US11042355B2 (en) | 2016-02-22 | 2021-06-22 | Sonos, Inc. | Handling of loss of pairing between networked devices |
US9772817B2 (en) | 2016-02-22 | 2017-09-26 | Sonos, Inc. | Room-corrected voice detection |
US11832068B2 (en) | 2016-02-22 | 2023-11-28 | Sonos, Inc. | Music service selection |
US11863593B2 (en) | 2016-02-22 | 2024-01-02 | Sonos, Inc. | Networked microphone device control |
US10097939B2 (en) | 2016-02-22 | 2018-10-09 | Sonos, Inc. | Compensation for speaker nonlinearities |
US10097919B2 (en) | 2016-02-22 | 2018-10-09 | Sonos, Inc. | Music service selection |
US10095470B2 (en) | 2016-02-22 | 2018-10-09 | Sonos, Inc. | Audio response playback |
US11137979B2 (en) | 2016-02-22 | 2021-10-05 | Sonos, Inc. | Metadata exchange involving a networked playback system and a networked microphone system |
US10555077B2 (en) | 2016-02-22 | 2020-02-04 | Sonos, Inc. | Music service selection |
US10847143B2 (en) | 2016-02-22 | 2020-11-24 | Sonos, Inc. | Voice control of a media playback system |
US11184704B2 (en) | 2016-02-22 | 2021-11-23 | Sonos, Inc. | Music service selection |
US10142754B2 (en) | 2016-02-22 | 2018-11-27 | Sonos, Inc. | Sensor on moving component of transducer |
US11750969B2 (en) | 2016-02-22 | 2023-09-05 | Sonos, Inc. | Default playback device designation |
US11212612B2 (en) | 2016-02-22 | 2021-12-28 | Sonos, Inc. | Voice control of a media playback system |
US10212512B2 (en) | 2016-02-22 | 2019-02-19 | Sonos, Inc. | Default playback devices |
US10225651B2 (en) | 2016-02-22 | 2019-03-05 | Sonos, Inc. | Default playback device designation |
US10264030B2 (en) | 2016-02-22 | 2019-04-16 | Sonos, Inc. | Networked microphone device control |
US10764679B2 (en) | 2016-02-22 | 2020-09-01 | Sonos, Inc. | Voice control of a media playback system |
US10971139B2 (en) | 2016-02-22 | 2021-04-06 | Sonos, Inc. | Voice control of a media playback system |
US10743101B2 (en) | 2016-02-22 | 2020-08-11 | Sonos, Inc. | Content mixing |
US11736860B2 (en) | 2016-02-22 | 2023-08-22 | Sonos, Inc. | Voice control of a media playback system |
US10365889B2 (en) | 2016-02-22 | 2019-07-30 | Sonos, Inc. | Metadata exchange involving a networked playback system and a networked microphone system |
US11726742B2 (en) | 2016-02-22 | 2023-08-15 | Sonos, Inc. | Handling of loss of pairing between networked devices |
US11405430B2 (en) | 2016-02-22 | 2022-08-02 | Sonos, Inc. | Networked microphone device control |
US11556306B2 (en) | 2016-02-22 | 2023-01-17 | Sonos, Inc. | Voice controlled media playback system |
US10409549B2 (en) | 2016-02-22 | 2019-09-10 | Sonos, Inc. | Audio response playback |
US10509626B2 (en) | 2016-02-22 | 2019-12-17 | Sonos, Inc. | Handling of loss of pairing between networked devices |
US10499146B2 (en) | 2016-02-22 | 2019-12-03 | Sonos, Inc. | Voice control of a media playback system |
US11513763B2 (en) | 2016-02-22 | 2022-11-29 | Sonos, Inc. | Audio response playback |
US11133018B2 (en) | 2016-06-09 | 2021-09-28 | Sonos, Inc. | Dynamic player selection for audio signal processing |
US10332537B2 (en) | 2016-06-09 | 2019-06-25 | Sonos, Inc. | Dynamic player selection for audio signal processing |
US11545169B2 (en) | 2016-06-09 | 2023-01-03 | Sonos, Inc. | Dynamic player selection for audio signal processing |
US9978390B2 (en) | 2016-06-09 | 2018-05-22 | Sonos, Inc. | Dynamic player selection for audio signal processing |
US10714115B2 (en) | 2016-06-09 | 2020-07-14 | Sonos, Inc. | Dynamic player selection for audio signal processing |
US11184969B2 (en) | 2016-07-15 | 2021-11-23 | Sonos, Inc. | Contextualization of voice inputs |
US10699711B2 (en) | 2016-07-15 | 2020-06-30 | Sonos, Inc. | Voice detection by multiple devices |
US10593331B2 (en) | 2016-07-15 | 2020-03-17 | Sonos, Inc. | Contextualization of voice inputs |
US11664023B2 (en) | 2016-07-15 | 2023-05-30 | Sonos, Inc. | Voice detection by multiple devices |
US10152969B2 (en) | 2016-07-15 | 2018-12-11 | Sonos, Inc. | Voice detection by multiple devices |
US10134399B2 (en) | 2016-07-15 | 2018-11-20 | Sonos, Inc. | Contextualization of voice inputs |
US10297256B2 (en) | 2016-07-15 | 2019-05-21 | Sonos, Inc. | Voice detection by multiple devices |
US10021503B2 (en) | 2016-08-05 | 2018-07-10 | Sonos, Inc. | Determining direction of networked microphone device relative to audio playback device |
US10115400B2 (en) | 2016-08-05 | 2018-10-30 | Sonos, Inc. | Multiple voice services |
US10847164B2 (en) | 2016-08-05 | 2020-11-24 | Sonos, Inc. | Playback device supporting concurrent voice assistants |
US10565999B2 (en) | 2016-08-05 | 2020-02-18 | Sonos, Inc. | Playback device supporting concurrent voice assistant services |
US11531520B2 (en) | 2016-08-05 | 2022-12-20 | Sonos, Inc. | Playback device supporting concurrent voice assistants |
US10354658B2 (en) | 2016-08-05 | 2019-07-16 | Sonos, Inc. | Voice control of playback device using voice assistant service(s) |
US10565998B2 (en) | 2016-08-05 | 2020-02-18 | Sonos, Inc. | Playback device supporting concurrent voice assistant services |
US10034116B2 (en) | 2016-09-22 | 2018-07-24 | Sonos, Inc. | Acoustic position measurement |
US11641559B2 (en) | 2016-09-27 | 2023-05-02 | Sonos, Inc. | Audio playback settings for voice interaction |
US10582322B2 (en) | 2016-09-27 | 2020-03-03 | Sonos, Inc. | Audio playback settings for voice interaction |
US9942678B1 (en) | 2016-09-27 | 2018-04-10 | Sonos, Inc. | Audio playback settings for voice interaction |
US11516610B2 (en) | 2016-09-30 | 2022-11-29 | Sonos, Inc. | Orientation-based playback device microphone selection |
US10873819B2 (en) | 2016-09-30 | 2020-12-22 | Sonos, Inc. | Orientation-based playback device microphone selection |
US10313812B2 (en) | 2016-09-30 | 2019-06-04 | Sonos, Inc. | Orientation-based playback device microphone selection |
US10075793B2 (en) | 2016-09-30 | 2018-09-11 | Sonos, Inc. | Multi-orientation playback device microphones |
US10117037B2 (en) | 2016-09-30 | 2018-10-30 | Sonos, Inc. | Orientation-based playback device microphone selection |
US11727933B2 (en) | 2016-10-19 | 2023-08-15 | Sonos, Inc. | Arbitration-based voice recognition |
US11308961B2 (en) | 2016-10-19 | 2022-04-19 | Sonos, Inc. | Arbitration-based voice recognition |
US10614807B2 (en) | 2016-10-19 | 2020-04-07 | Sonos, Inc. | Arbitration-based voice recognition |
US10181323B2 (en) | 2016-10-19 | 2019-01-15 | Sonos, Inc. | Arbitration-based voice recognition |
US11276395B1 (en) * | 2017-03-10 | 2022-03-15 | Amazon Technologies, Inc. | Voice-based parameter assignment for voice-capturing devices |
US11183181B2 (en) | 2017-03-27 | 2021-11-23 | Sonos, Inc. | Systems and methods of multiple voice services |
US11900937B2 (en) | 2017-08-07 | 2024-02-13 | Sonos, Inc. | Wake-word detection suppression |
US11380322B2 (en) | 2017-08-07 | 2022-07-05 | Sonos, Inc. | Wake-word detection suppression |
US10475449B2 (en) | 2017-08-07 | 2019-11-12 | Sonos, Inc. | Wake-word detection suppression |
US11500611B2 (en) | 2017-09-08 | 2022-11-15 | Sonos, Inc. | Dynamic computation of system response volume |
US11080005B2 (en) | 2017-09-08 | 2021-08-03 | Sonos, Inc. | Dynamic computation of system response volume |
US10445057B2 (en) | 2017-09-08 | 2019-10-15 | Sonos, Inc. | Dynamic computation of system response volume |
US11646045B2 (en) | 2017-09-27 | 2023-05-09 | Sonos, Inc. | Robust short-time fourier transform acoustic echo cancellation during audio playback |
US11017789B2 (en) | 2017-09-27 | 2021-05-25 | Sonos, Inc. | Robust Short-Time Fourier Transform acoustic echo cancellation during audio playback |
US10446165B2 (en) | 2017-09-27 | 2019-10-15 | Sonos, Inc. | Robust short-time fourier transform acoustic echo cancellation during audio playback |
US10482868B2 (en) | 2017-09-28 | 2019-11-19 | Sonos, Inc. | Multi-channel acoustic echo cancellation |
US11769505B2 (en) | 2017-09-28 | 2023-09-26 | Sonos, Inc. | Echo of tone interferance cancellation using two acoustic echo cancellers |
US10880644B1 (en) | 2017-09-28 | 2020-12-29 | Sonos, Inc. | Three-dimensional beam forming with a microphone array |
US10511904B2 (en) | 2017-09-28 | 2019-12-17 | Sonos, Inc. | Three-dimensional beam forming with a microphone array |
US10891932B2 (en) | 2017-09-28 | 2021-01-12 | Sonos, Inc. | Multi-channel acoustic echo cancellation |
US10051366B1 (en) | 2017-09-28 | 2018-08-14 | Sonos, Inc. | Three-dimensional beam forming with a microphone array |
US11538451B2 (en) | 2017-09-28 | 2022-12-27 | Sonos, Inc. | Multi-channel acoustic echo cancellation |
US10621981B2 (en) | 2017-09-28 | 2020-04-14 | Sonos, Inc. | Tone interference cancellation |
US11302326B2 (en) | 2017-09-28 | 2022-04-12 | Sonos, Inc. | Tone interference cancellation |
US10466962B2 (en) | 2017-09-29 | 2019-11-05 | Sonos, Inc. | Media playback system with voice assistance |
US10606555B1 (en) | 2017-09-29 | 2020-03-31 | Sonos, Inc. | Media playback system with concurrent voice assistance |
US11175888B2 (en) | 2017-09-29 | 2021-11-16 | Sonos, Inc. | Media playback system with concurrent voice assistance |
US11893308B2 (en) | 2017-09-29 | 2024-02-06 | Sonos, Inc. | Media playback system with concurrent voice assistance |
US11288039B2 (en) | 2017-09-29 | 2022-03-29 | Sonos, Inc. | Media playback system with concurrent voice assistance |
EP3703053A4 (en) * | 2017-10-23 | 2021-07-21 | Iflytek Co., Ltd. | Microphone array-based target voice acquisition method and device |
CN107742522A (en) * | 2017-10-23 | 2018-02-27 | 科大讯飞股份有限公司 | Target voice acquisition methods and device based on microphone array |
US10984790B2 (en) | 2017-11-30 | 2021-04-20 | Samsung Electronics Co., Ltd. | Method of providing service based on location of sound source and speech recognition device therefor |
EP3676830A4 (en) * | 2017-11-30 | 2020-09-23 | Samsung Electronics Co., Ltd. | Method of providing service based on location of sound source and speech recognition device therefor |
US11451908B2 (en) | 2017-12-10 | 2022-09-20 | Sonos, Inc. | Network microphone devices with automatic do not disturb actuation capabilities |
US10880650B2 (en) | 2017-12-10 | 2020-12-29 | Sonos, Inc. | Network microphone devices with automatic do not disturb actuation capabilities |
US10818290B2 (en) | 2017-12-11 | 2020-10-27 | Sonos, Inc. | Home graph |
US11676590B2 (en) | 2017-12-11 | 2023-06-13 | Sonos, Inc. | Home graph |
US11343614B2 (en) | 2018-01-31 | 2022-05-24 | Sonos, Inc. | Device designation of playback and network microphone device arrangements |
US11689858B2 (en) | 2018-01-31 | 2023-06-27 | Sonos, Inc. | Device designation of playback and network microphone device arrangements |
US11150869B2 (en) * | 2018-02-14 | 2021-10-19 | International Business Machines Corporation | Voice command filtering |
US20190250881A1 (en) * | 2018-02-14 | 2019-08-15 | International Business Machines Corporation | Voice command filtering |
US11200890B2 (en) | 2018-05-01 | 2021-12-14 | International Business Machines Corporation | Distinguishing voice commands |
US11238856B2 (en) | 2018-05-01 | 2022-02-01 | International Business Machines Corporation | Ignoring trigger words in streamed media content |
US11175880B2 (en) | 2018-05-10 | 2021-11-16 | Sonos, Inc. | Systems and methods for voice-assisted media content selection |
US11797263B2 (en) | 2018-05-10 | 2023-10-24 | Sonos, Inc. | Systems and methods for voice-assisted media content selection |
CN110505547A (en) * | 2018-05-17 | 2019-11-26 | 深圳瑞利声学技术股份有限公司 | A kind of earphone wearing state detection method and earphone |
US10847178B2 (en) | 2018-05-18 | 2020-11-24 | Sonos, Inc. | Linear filtering for noise-suppressed speech detection |
US11715489B2 (en) | 2018-05-18 | 2023-08-01 | Sonos, Inc. | Linear filtering for noise-suppressed speech detection |
US11792590B2 (en) | 2018-05-25 | 2023-10-17 | Sonos, Inc. | Determining and adapting to changes in microphone performance of playback devices |
US10959029B2 (en) | 2018-05-25 | 2021-03-23 | Sonos, Inc. | Determining and adapting to changes in microphone performance of playback devices |
US10798483B2 (en) | 2018-05-30 | 2020-10-06 | Beijing Xiaomi Mobile Software Co., Ltd. | Audio signal processing method and device, electronic equipment and storage medium |
CN108766457A (en) * | 2018-05-30 | 2018-11-06 | 北京小米移动软件有限公司 | Acoustic signal processing method, device, electronic equipment and storage medium |
US10681460B2 (en) | 2018-06-28 | 2020-06-09 | Sonos, Inc. | Systems and methods for associating playback devices with voice assistant services |
US11197096B2 (en) | 2018-06-28 | 2021-12-07 | Sonos, Inc. | Systems and methods for associating playback devices with voice assistant services |
US11696074B2 (en) | 2018-06-28 | 2023-07-04 | Sonos, Inc. | Systems and methods for associating playback devices with voice assistant services |
US10873808B2 (en) | 2018-08-21 | 2020-12-22 | Samsung Electronics Co., Ltd. | Sound direction detection sensor and electronic apparatus including the same |
US10645493B2 (en) | 2018-08-21 | 2020-05-05 | Samsung Electronics Co., Ltd. | Sound direction detection sensor and electronic apparatus including the same |
US11563842B2 (en) | 2018-08-28 | 2023-01-24 | Sonos, Inc. | Do not disturb feature for audio notifications |
US11482978B2 (en) | 2018-08-28 | 2022-10-25 | Sonos, Inc. | Audio notifications |
US11076035B2 (en) | 2018-08-28 | 2021-07-27 | Sonos, Inc. | Do not disturb feature for audio notifications |
US10797667B2 (en) | 2018-08-28 | 2020-10-06 | Sonos, Inc. | Audio notifications |
US11432030B2 (en) | 2018-09-14 | 2022-08-30 | Sonos, Inc. | Networked devices, systems, and methods for associating playback devices based on sound codes |
US11778259B2 (en) | 2018-09-14 | 2023-10-03 | Sonos, Inc. | Networked devices, systems and methods for associating playback devices based on sound codes |
US10587430B1 (en) | 2018-09-14 | 2020-03-10 | Sonos, Inc. | Networked devices, systems, and methods for associating playback devices based on sound codes |
US11551690B2 (en) | 2018-09-14 | 2023-01-10 | Sonos, Inc. | Networked devices, systems, and methods for intelligently deactivating wake-word engines |
US10878811B2 (en) | 2018-09-14 | 2020-12-29 | Sonos, Inc. | Networked devices, systems, and methods for intelligently deactivating wake-word engines |
US11790937B2 (en) | 2018-09-21 | 2023-10-17 | Sonos, Inc. | Voice detection optimization using sound metadata |
US11024331B2 (en) | 2018-09-21 | 2021-06-01 | Sonos, Inc. | Voice detection optimization using sound metadata |
US10811015B2 (en) | 2018-09-25 | 2020-10-20 | Sonos, Inc. | Voice detection optimization based on selected voice assistant service |
US11727936B2 (en) | 2018-09-25 | 2023-08-15 | Sonos, Inc. | Voice detection optimization based on selected voice assistant service |
US11031014B2 (en) | 2018-09-25 | 2021-06-08 | Sonos, Inc. | Voice detection optimization based on selected voice assistant service |
US10573321B1 (en) | 2018-09-25 | 2020-02-25 | Sonos, Inc. | Voice detection optimization based on selected voice assistant service |
US11100923B2 (en) | 2018-09-28 | 2021-08-24 | Sonos, Inc. | Systems and methods for selective wake word detection using neural network models |
US11790911B2 (en) | 2018-09-28 | 2023-10-17 | Sonos, Inc. | Systems and methods for selective wake word detection using neural network models |
US11501795B2 (en) | 2018-09-29 | 2022-11-15 | Sonos, Inc. | Linear filtering for noise-suppressed speech detection via multiple network microphone devices |
US10692518B2 (en) | 2018-09-29 | 2020-06-23 | Sonos, Inc. | Linear filtering for noise-suppressed speech detection via multiple network microphone devices |
US11899519B2 (en) | 2018-10-23 | 2024-02-13 | Sonos, Inc. | Multiple stage network microphone device with reduced power consumption and processing load |
US11741948B2 (en) | 2018-11-15 | 2023-08-29 | Sonos Vox France Sas | Dilated convolutions and gating for efficient keyword spotting |
US11200889B2 (en) | 2018-11-15 | 2021-12-14 | Sonos, Inc. | Dilated convolutions and gating for efficient keyword spotting |
US11557294B2 (en) | 2018-12-07 | 2023-01-17 | Sonos, Inc. | Systems and methods of operating media playback systems having multiple voice assistant services |
US11183183B2 (en) | 2018-12-07 | 2021-11-23 | Sonos, Inc. | Systems and methods of operating media playback systems having multiple voice assistant services |
US11132989B2 (en) | 2018-12-13 | 2021-09-28 | Sonos, Inc. | Networked microphone devices, systems, and methods of localized arbitration |
US11538460B2 (en) | 2018-12-13 | 2022-12-27 | Sonos, Inc. | Networked microphone devices, systems, and methods of localized arbitration |
US11908464B2 (en) * | 2018-12-19 | 2024-02-20 | Samsung Electronics Co., Ltd. | Electronic device and method for controlling same |
US20220028377A1 (en) * | 2018-12-19 | 2022-01-27 | Samsung Electronics Co., Ltd. | Electronic device and method for controlling same |
US11540047B2 (en) | 2018-12-20 | 2022-12-27 | Sonos, Inc. | Optimization of network microphone devices using noise classification |
US11159880B2 (en) | 2018-12-20 | 2021-10-26 | Sonos, Inc. | Optimization of network microphone devices using noise classification |
US10602268B1 (en) | 2018-12-20 | 2020-03-24 | Sonos, Inc. | Optimization of network microphone devices using noise classification |
US10867604B2 (en) | 2019-02-08 | 2020-12-15 | Sonos, Inc. | Devices, systems, and methods for distributed voice processing |
US11315556B2 (en) | 2019-02-08 | 2022-04-26 | Sonos, Inc. | Devices, systems, and methods for distributed voice processing by transmitting sound data associated with a wake word to an appropriate device for identification |
US11646023B2 (en) | 2019-02-08 | 2023-05-09 | Sonos, Inc. | Devices, systems, and methods for distributed voice processing |
US11798553B2 (en) | 2019-05-03 | 2023-10-24 | Sonos, Inc. | Voice assistant persistence across multiple network microphone devices |
US11120794B2 (en) | 2019-05-03 | 2021-09-14 | Sonos, Inc. | Voice assistant persistence across multiple network microphone devices |
US11854547B2 (en) | 2019-06-12 | 2023-12-26 | Sonos, Inc. | Network microphone device with command keyword eventing |
US11361756B2 (en) | 2019-06-12 | 2022-06-14 | Sonos, Inc. | Conditional wake word eventing based on environment |
US11501773B2 (en) | 2019-06-12 | 2022-11-15 | Sonos, Inc. | Network microphone device with command keyword conditioning |
US10586540B1 (en) | 2019-06-12 | 2020-03-10 | Sonos, Inc. | Network microphone device with command keyword conditioning |
US11200894B2 (en) | 2019-06-12 | 2021-12-14 | Sonos, Inc. | Network microphone device with command keyword eventing |
US11354092B2 (en) | 2019-07-31 | 2022-06-07 | Sonos, Inc. | Noise classification for event detection |
US11710487B2 (en) | 2019-07-31 | 2023-07-25 | Sonos, Inc. | Locally distributed keyword detection |
US11714600B2 (en) | 2019-07-31 | 2023-08-01 | Sonos, Inc. | Noise classification for event detection |
US10871943B1 (en) | 2019-07-31 | 2020-12-22 | Sonos, Inc. | Noise classification for event detection |
US11138975B2 (en) | 2019-07-31 | 2021-10-05 | Sonos, Inc. | Locally distributed keyword detection |
US11551669B2 (en) | 2019-07-31 | 2023-01-10 | Sonos, Inc. | Locally distributed keyword detection |
US11138969B2 (en) | 2019-07-31 | 2021-10-05 | Sonos, Inc. | Locally distributed keyword detection |
US11355108B2 (en) | 2019-08-20 | 2022-06-07 | International Business Machines Corporation | Distinguishing voice commands |
CN112578338A (en) * | 2019-09-27 | 2021-03-30 | 阿里巴巴集团控股有限公司 | Sound source positioning method, device, equipment and storage medium |
US11189286B2 (en) | 2019-10-22 | 2021-11-30 | Sonos, Inc. | VAS toggle based on device orientation |
US11862161B2 (en) | 2019-10-22 | 2024-01-02 | Sonos, Inc. | VAS toggle based on device orientation |
US11200900B2 (en) | 2019-12-20 | 2021-12-14 | Sonos, Inc. | Offline voice control |
US11869503B2 (en) | 2019-12-20 | 2024-01-09 | Sonos, Inc. | Offline voice control |
CN111181949A (en) * | 2019-12-25 | 2020-05-19 | 视联动力信息技术股份有限公司 | Sound detection method, device, terminal equipment and storage medium |
US11562740B2 (en) | 2020-01-07 | 2023-01-24 | Sonos, Inc. | Voice verification for media playback |
US11556307B2 (en) | 2020-01-31 | 2023-01-17 | Sonos, Inc. | Local voice data processing |
US11308958B2 (en) | 2020-02-07 | 2022-04-19 | Sonos, Inc. | Localized wakeword verification |
US11694689B2 (en) | 2020-05-20 | 2023-07-04 | Sonos, Inc. | Input detection windowing |
US11308962B2 (en) | 2020-05-20 | 2022-04-19 | Sonos, Inc. | Input detection windowing |
US11727919B2 (en) | 2020-05-20 | 2023-08-15 | Sonos, Inc. | Memory allocation for keyword spotting engines |
US11482224B2 (en) | 2020-05-20 | 2022-10-25 | Sonos, Inc. | Command keywords with input detection windowing |
US11698771B2 (en) | 2020-08-25 | 2023-07-11 | Sonos, Inc. | Vocal guidance engines for playback devices |
US11551700B2 (en) | 2021-01-25 | 2023-01-10 | Sonos, Inc. | Systems and methods for power-efficient keyword detection |
CN114268984A (en) * | 2021-11-15 | 2022-04-01 | 珠海格力电器股份有限公司 | Signal processing method, electronic device and storage medium |
US11961519B2 (en) | 2022-04-18 | 2024-04-16 | Sonos, Inc. | Localized wakeword verification |
Also Published As
Publication number | Publication date |
---|---|
CN108352159B (en) | 2023-05-30 |
CN108352159A (en) | 2018-07-31 |
US10540995B2 (en) | 2020-01-21 |
WO2017078361A1 (en) | 2017-05-11 |
KR102444061B1 (en) | 2022-09-16 |
KR20170050908A (en) | 2017-05-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10540995B2 (en) | | Electronic device and method for recognizing speech |
US10056096B2 (en) | | Electronic device and method capable of voice recognition |
US10796693B2 (en) | | Modifying input based on determined characteristics |
US20200312335A1 (en) | | Electronic device and method of operating the same |
CN109240576B (en) | | Image processing method and device in game, electronic device and storage medium |
US10762897B2 (en) | | Method and display device for recognizing voice |
US9953647B2 (en) | | Method and apparatus for speech recognition |
KR102261552B1 (en) | | Providing Method For Voice Command and Electronic Device supporting the same |
TW202113756A (en) | | Image processing method and device, electronic equipment and storage medium |
US9665804B2 (en) | | Systems and methods for tracking an object |
US20160124497A1 (en) | | Method and apparatus for controlling screen display on electronic devices |
US10831440B2 (en) | | Coordinating input on multiple local devices |
US9571930B2 (en) | | Audio data detection with a computing device |
US9426606B2 (en) | | Electronic apparatus and method of pairing in electronic apparatus |
US11589222B2 (en) | | Electronic apparatus, user terminal, and method for controlling the electronic apparatus and the user terminal |
KR20150103586A (en) | | Method for processing voice input and electronic device using the same |
TW202036462A (en) | | Method, apparatus and electronic device for image generating and storage medium thereof |
KR20220059194A (en) | | Method and apparatus of object tracking adaptive to target object |
US20210005189A1 (en) | | Digital assistant device command performance based on category |
JP2018507494A (en) | | Feature extraction method and apparatus |
US20170243579A1 (en) | | Electronic apparatus and service providing method thereof |
KR102537781B1 (en) | | Electronic apparatus and Method for contolling the electronic apparatus thereof |
US10482151B2 (en) | | Method for providing alternative service and electronic device thereof |
CN114694661A (en) | | First terminal device, second terminal device and voice awakening method |
CN109358755B (en) | | Gesture detection method and device for mobile terminal and mobile terminal |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | AS | Assignment | Owner name: SAMSUNG ELECTRONICS CO., LTD., KOREA, REPUBLIC OF Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:SHIN, KI-HOON;REEL/FRAME:040216/0519 Effective date: 20161018 |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE AFTER FINAL ACTION FORWARDED TO EXAMINER |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: ADVISORY ACTION MAILED |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT RECEIVED |
| | STCF | Information on status: patent grant | Free format text: PATENTED CASE |
| | FEPP | Fee payment procedure | Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| | LAPS | Lapse for failure to pay maintenance fees | Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| | STCH | Information on status: patent discontinuation | Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
| | FP | Lapsed due to failure to pay maintenance fee | Effective date: 20240121 |