US20140074466A1 - Answering questions using environmental context - Google Patents

Answering questions using environmental context

Info

Publication number
US20140074466A1
Authority
US
United States
Prior art keywords
data
transcription
engine
query
identifies
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US13/626,439
Inventor
Matthew Sharifi
Gheorghe Postelnicu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google LLC
Priority to US13/626,439 (US20140074466A1)
Assigned to GOOGLE INC. Assignment of assignors' interest (see document for details). Assignors: SHARIFI, MATTHEW; POSTELNICU, GHEORGHE
Priority to PCT/US2013/035095 (WO2014039106A1)
Priority to EP20130162403 (EP2706470A1)
Priority to KR1020130037540A (KR102029276B1)
Priority to CN201610628594.XA (CN106250508B)
Priority to CN201310394518.3A (CN103714104B)
Publication of US20140074466A1
Priority to US15/224,944 (US9576576B2)
Priority to US15/410,180 (US9786279B2)
Assigned to GOOGLE LLC. Change of name (see document for details). Assignors: GOOGLE INC.
Priority to KR1020190119592A (KR102140177B1)
Priority to KR1020200092439A (KR102241972B1)
Current legal status: Abandoned

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43 Querying
    • G06F16/432 Query formulation
    • G06F16/433 Query formulation using audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/686 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1815 Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L2015/088 Word spotting
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225 Feedback of the input speech

Definitions

  • the present specification relates to identifying results of a query based on a natural language query and environmental information, for example to answer questions using environmental information as context.
  • a search query includes one or more terms that a user submits to a search engine when the user requests the search engine to execute a search.
  • a user may enter query terms of a search query by typing on a keyboard or, in the context of a voice query, by speaking the query terms into a microphone of a mobile device.
  • Voice queries may be processed using speech recognition technology.
  • environmental information such as ambient noise
  • a user may ask a question about a television program that they are viewing, such as “What actor is in this movie?”
  • the user's mobile device detects the user's utterance and environmental data, which may include the soundtrack audio of the television program.
  • the mobile computing device encodes the utterance and the environmental data as waveform data, and provides the waveform data to a server-based computing environment.
  • the computing environment separates the utterance from the environmental data of the waveform data, and then obtains a transcription of the utterance.
  • the computing environment further identifies entity data relating to the environmental data and the utterance, such as by identifying the name of the movie. From the transcription and the entity data, the computing environment can then identify one or more results, for example, results in response to the user's question.
  • the one or more results can include an answer to the user's question of “What actor is in this movie?” (e.g., the name of the actor).
  • the computing environment can provide such results to the user of the mobile computing device.
  • Generating the query includes associating the transcription with the data that identifies the entity. Associating further includes tagging the transcription with the data that identifies the entity. Associating further includes substituting a portion of the transcription with the data that identifies the entity. Substituting further includes substituting one or more words of the transcription with the data that identifies the entity.
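The two ways of associating a transcription with entity data described above (tagging and substitution) can be illustrated with a minimal Python sketch. The function names, the dictionary layout of a tagged query, and the list of placeholder phrases are assumptions made for illustration; the specification does not prescribe any of them.

```python
import re

# Hypothetical placeholder phrases that refer to the surrounding media.
PLACEHOLDERS = ("this movie", "this show", "this song")


def tag_transcription(transcription, entity):
    """Associate by tagging: keep the transcription and attach the entity data."""
    return {"query": transcription, "entity": entity}


def substitute_entity(transcription, entity, placeholders=PLACEHOLDERS):
    """Associate by substitution: replace the first placeholder phrase found
    in the transcription with the data that identifies the entity."""
    for phrase in placeholders:
        pattern = re.compile(re.escape(phrase), re.IGNORECASE)
        if pattern.search(transcription):
            # A function replacement avoids interpreting backslashes in `entity`.
            return pattern.sub(lambda match: entity, transcription, count=1)
    return transcription  # nothing to substitute; leave the query unchanged


tagged = tag_transcription("What actor is in this movie?", "Example Movie")
rewritten = substitute_entity("What actor is in this movie?", "Example Movie")
```

Tagging leaves the wording of the query intact and lets a downstream search engine decide how to use the entity, while substitution produces a self-contained query string.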
  • Receiving the environmental data further includes receiving environmental audio data, environmental image data, or both. Receiving the environmental audio data further includes receiving additional audio data that includes background noise.
  • FIG. 1 depicts an example system for identifying content item data based on environmental audio data and a spoken natural language query.
  • FIG. 2 depicts a flowchart for an example process for identifying content item data based on environmental audio data and a spoken natural language query.
  • FIGS. 3A-3B depict portions of an example system for identifying content item data.
  • FIG. 4 depicts an example system for identifying media content items based on environmental image data and a spoken natural language query.
  • FIG. 5 depicts a system for identifying one or more results based on environmental audio data and an utterance.
  • FIG. 6 depicts a flowchart for an example process for identifying one or more results based on environmental data and an utterance.
  • FIG. 7 depicts a computer device and a mobile computer device that may be used to implement the techniques described here.
  • a computing environment that answers spoken natural language queries using environmental information as context may process queries using multiple processes.
  • the computing environment can identify media content based on environmental information, such as ambient noises.
  • a computing environment can augment the spoken natural language query with context that is derived from the environmental information, such as data that identifies media content, in order to provide a more satisfying answer to the spoken natural language query.
  • FIG. 1 depicts a system 100 for identifying content item data based on environmental audio data and a spoken natural language query.
  • the system 100 can identify content item data that is based on the environmental audio data and that matches a particular content type associated with the spoken natural language query.
  • the system 100 includes a mobile computing device 102 , a disambiguation engine 104 , a speech recognition engine 106 , a keyword mapping engine 108 , and a content recognition engine 110 .
  • the mobile computing device 102 is in communication with the disambiguation engine 104 over one or more networks.
  • the mobile computing device 102 can include a microphone, a camera, or other detection mechanisms for detecting utterances from a user 112 and/or environmental data associated with the user 112 .
  • the user 112 is watching a television program.
  • the user 112 would like to know who directed the television program that is currently playing.
  • the user 112 may not know the name of the television program that is currently playing, and may therefore ask the question “Who directed this show?”
  • the mobile computing device 102 detects this utterance, as well as environmental audio data associated with the environment of the user 112 .
  • the environmental audio data associated with the environment of the user 112 can include background noise of the environment of the user 112 .
  • the environmental audio data includes the sounds of the television program.
  • the environmental audio data that is associated with the currently displayed television program can include audio of the currently displayed television program (e.g., dialogue of the currently displayed television program, soundtrack audio associated with the currently displayed television program, etc.).
  • the mobile computing device 102 detects the environmental audio data after detecting the utterance; detects the environmental audio data concurrently with detecting the utterance; or both.
  • the mobile computing device 102 processes the detected utterance and the environmental audio data to generate waveform data 114 that represents the detected utterance and the environmental audio data and transmits the waveform data 114 to the disambiguation engine 104 (e.g., over a network), during operation (A).
  • the environmental audio data is streamed from the mobile computing device 102 .
  • the disambiguation engine 104 receives the waveform data 114 from the mobile computing device 102 .
  • the disambiguation engine 104 processes the waveform data 114 , including separating (or extracting) the utterance from other portions of the waveform data 114 and transmits the utterance to the speech recognition engine 106 (e.g., over a network), during operation (B).
  • the disambiguation engine 104 separates the utterance (“Who directed this show?”) from the background noise of the environment of the user 112 (e.g., audio of the currently displayed television program).
  • the disambiguation engine 104 utilizes a voice detector to facilitate separation of the utterance from the background noise by identifying a portion of the waveform data 114 that includes voice activity, or voice activity associated with the user of the mobile computing device 102 .
  • the utterance relates to a query (e.g., a query relating to the currently displayed television program).
  • the waveform data 114 includes the detected utterance.
  • the disambiguation engine 104 can request the environmental audio data from the mobile computing device 102 relating to the utterance.
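The specification does not say how the voice detector works. As a rough illustration only, the following Python sketch separates frames by short-term energy, under the assumption that the user's speech is louder at the microphone than the background audio. The frame length, the threshold, and the function names are hypothetical, and a practical voice detector would rely on more than energy.

```python
import math


def frame_rms(samples, frame_len):
    """Root-mean-square energy of each complete, non-overlapping frame."""
    starts = range(0, len(samples) - frame_len + 1, frame_len)
    return [
        math.sqrt(sum(s * s for s in samples[i:i + frame_len]) / frame_len)
        for i in starts
    ]


def split_utterance(samples, frame_len=4, threshold=0.5):
    """Return (voiced, background) sample lists.

    Frames whose RMS energy reaches the threshold are treated as the
    utterance; all other frames are treated as environmental audio.
    A trailing partial frame is ignored.
    """
    voiced, background = [], []
    for idx, rms in enumerate(frame_rms(samples, frame_len)):
        frame = samples[idx * frame_len:(idx + 1) * frame_len]
        (voiced if rms >= threshold else background).extend(frame)
    return voiced, background


# Synthetic waveform: quiet background, a loud utterance, quiet background.
waveform = [0.1] * 8 + [0.9] * 8 + [0.1] * 4
utterance, environment = split_utterance(waveform)
```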
  • the speech recognition engine 106 receives the portion of the waveform data 114 that corresponds to the utterance from the disambiguation engine 104 .
  • the speech recognition engine 106 obtains a transcription of the utterance and provides the transcription to the keyword mapping engine 108 , during operation (C).
  • the speech recognition engine 106 processes the utterance received from the disambiguation engine 104 .
  • processing of the utterance by the speech recognition engine 106 includes generating a transcription of the utterance. Generating the transcription of the utterance can include transcribing the utterance into text or text-related data. In other words, the speech recognition engine 106 can provide a representation of language in written form of the utterance.
  • the speech recognition engine 106 transcribes the utterance to generate the transcription of “Who directed this show?”
  • the speech recognition engine 106 provides two or more transcriptions of the utterance.
  • the speech recognition engine 106 transcribes the utterance to generate the transcriptions of “Who directed this show?” and “Who directed this shoe?”
  • the keyword mapping engine 108 receives the transcription from the speech recognition engine 106 .
  • the keyword mapping engine 108 identifies one or more keywords in the transcription that are associated with a particular content type and provides the particular content type to the disambiguation engine 104 , during operation (D).
  • the one or more content types can include ‘movie’, ‘music’, ‘television show’, ‘audio podcast’, ‘image’, ‘artwork’, ‘book’, ‘magazine’, ‘trailer’, ‘video podcast’, ‘Internet video’, or ‘video game’.
  • the keyword mapping engine 108 identifies the keyword “directed” from the transcription of “Who directed this show?”
  • the keyword “directed” is associated with the ‘television show’ content type.
  • a keyword of the transcription that is identified by the keyword mapping engine 108 is associated with two or more content types.
  • the keyword “directed” is associated with the ‘television show’ and ‘movie’ content types.
  • the keyword mapping engine 108 identifies two or more keywords in the transcription that are associated with a particular content type. For example, the keyword mapping engine 108 identifies the keywords “directed” and “show” that are associated with a particular content type. In some embodiments, the identified two or more keywords are associated with the same content type. For example, the identified keywords “directed” and “show” are both associated with the ‘television show’ content type. In some embodiments, the identified two or more keywords are associated with differing content types. For example, the identified keyword “directed” is associated with the ‘movie’ content type and the identified keyword “show” is associated with the ‘television show’ content type.
  • the keyword mapping engine 108 transmits (e.g., over a network) the particular content type to the disambiguation engine 104 .
  • the keyword mapping engine 108 identifies the one or more keywords in the transcription that are associated with a particular content type using one or more databases that, for each of multiple content types, map at least one of the keywords to at least one of the multiple content types.
  • the keyword mapping engine 108 includes (or is in communication with) a database (or multiple databases).
  • the database includes, or is associated with, a mapping between keywords and content types.
  • the database provides a connection (e.g., mapping) between the keywords and the content types such that the keyword mapping engine 108 is able to identify one or more keywords in the transcription that are associated with particular content types.
  • one or more of the mappings between the keywords and the content types can include a unidirectional (e.g., one-way) mapping (i.e., a mapping from the keywords to the content types).
  • one or more of the mappings between the keywords and the content types can include a bidirectional (e.g., two-way) mapping (i.e., a mapping from the keywords to the content types and from the content types to the keywords).
  • the one or more databases map one or more of the keywords to two or more content types.
  • the keyword mapping engine 108 uses the one or more databases that map the keyword “directed” to the ‘movie’ and ‘television show’ content types.
  • the mapping between the keywords and the content types can include mappings between multiple, varying versions of a root keyword (e.g., the word family) and the content types.
  • the differing versions of the keyword can include differing grammatical categories such as tense (e.g., past, present, future) and word class (e.g., noun, verb).
  • the database can include mappings of the word family of the root word “direct” such as “directors,” “direction,” and “directed” to the one or more content types.
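The keyword-to-content-type mapping, including word families and the optional reverse (bidirectional) direction, can be sketched as two small lookup tables. This is an illustrative assumption about the data layout, not the database design of the specification; the table contents use only the examples given above, and the names `WORD_FAMILY`, `KEYWORD_TO_TYPES`, `content_types_for`, and `invert` are hypothetical.

```python
import re

# Word-family table: surface forms mapped to a root keyword.
WORD_FAMILY = {
    "directed": "direct",
    "directors": "direct",
    "direction": "direct",
    "show": "show",
}

# Unidirectional mapping from root keywords to content types.
KEYWORD_TO_TYPES = {
    "direct": {"movie", "television show"},
    "show": {"television show"},
}


def content_types_for(transcription):
    """Map each recognized keyword in the transcription to its content types."""
    found = {}
    for token in re.findall(r"[a-z]+", transcription.lower()):
        root = WORD_FAMILY.get(token)
        if root in KEYWORD_TO_TYPES:
            found[token] = KEYWORD_TO_TYPES[root]
    return found


def invert(keyword_to_types):
    """Build the reverse direction (content type to keywords) for a
    bidirectional mapping."""
    reverse = {}
    for keyword, types in keyword_to_types.items():
        for content_type in types:
            reverse.setdefault(content_type, set()).add(keyword)
    return reverse


matches = content_types_for("Who directed this show?")
types_to_keywords = invert(KEYWORD_TO_TYPES)
```

When the keywords disagree, as with “directed” and “show” here, one simple policy is to intersect the sets, which would leave ‘television show’; the specification does not state which policy is used.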
  • the disambiguation engine 104 receives data identifying the particular content type associated with the transcription of the utterance from the keyword mapping engine 108 . Furthermore, as mentioned above, the disambiguation engine 104 receives the waveform data 114 from the mobile computing device 102 that includes the environmental audio data associated with the utterance. The disambiguation engine 104 then provides the environmental audio data and the particular content type to the content recognition engine 110 , during operation (E).
  • the disambiguation engine 104 transmits the environmental audio data relating to the currently displayed television program that includes audio of the currently displayed television program (e.g., dialogue of the currently displayed television program, soundtrack audio associated with the currently displayed television program, etc.) and the particular content type of the transcription of the utterance (e.g., ‘television show’ content type) to the content recognition engine 110 .
  • the disambiguation engine 104 provides a portion of the environmental audio data to the content recognition engine 110 .
  • the portion of the environmental audio data can include background noise detected by the mobile computing device 102 after detecting the utterance.
  • the portion of the environmental audio data can include background noise detected by the mobile computing device 102 concurrently with detecting the utterance.
  • the background noise (of the waveform data 114 ) is associated with a particular content type that is associated with a keyword of the transcription.
  • the keyword “directed” of the transcription “Who directed this show?” is associated with the ‘television show’ content type, and the background noise (e.g., the environmental audio data relating to the currently displayed television program) is also associated with the ‘television show’ content type.
  • the content recognition engine 110 receives the environmental audio data and the particular content type from the disambiguation engine 104 .
  • the content recognition engine 110 identifies content item data that is based on the environmental audio data and that matches the particular content type and provides the content item data to the disambiguation engine 104 , during operation (F).
  • the content recognition engine 110 appropriately processes the environmental audio data to identify content item data that is associated with the environmental audio data (e.g., a name of a television show, a name of a song, etc.).
  • the content recognition engine 110 matches the identified content item data with the particular content type (e.g., content type of the transcription of the utterance).
  • the content recognition engine 110 transmits (e.g., over a network) the identified content item data to the disambiguation engine 104 .
  • the content recognition engine 110 identifies content item data that is based on the environmental audio data relating to the currently displayed television program, and further that matches the ‘television show’ content type. To that end, the content recognition engine 110 can identify content item data based on dialogue of the currently displayed television program, or soundtrack audio associated with the currently displayed television program, depending on the portion of the environmental audio data received by the content recognition engine 110 .
  • the content recognition engine 110 is an audio fingerprinting engine that utilizes content fingerprinting using wavelets to identify the content item data. Specifically, the content recognition engine 110 converts the waveform data 114 into a spectrogram. From the spectrogram, the content recognition engine 110 extracts spectral images. The spectral images can be represented as wavelets. For each of the spectral images that are extracted from the spectrogram, the content recognition engine 110 extracts the “top” wavelets based on the respective magnitudes of the wavelets. For each spectral image, the content recognition engine 110 computes a wavelet signature of the image. In some examples, the wavelet signature is a truncated, quantized version of the wavelet decomposition of the image.
  • m × n wavelets are returned without compression.
  • the content recognition engine 110 utilizes a subset of the wavelets that most characterize the song. Specifically, the t “top” wavelets (by magnitude) are selected, where t ≪ m × n.
  • the content recognition engine 110 creates a compact representation of the sparse wavelet-vector described above, for example, using MinHash to compute sub-fingerprints for these sparse bit vectors.
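The fingerprinting steps above (wavelet decomposition of a spectral image, selection of the t top wavelets by magnitude, and MinHash over the resulting sparse bit vector) can be sketched in plain Python. Several details are assumptions chosen for illustration and are not stated in the specification: the Haar wavelet, the two-bits-per-coefficient sign encoding, the linear hash family used for MinHash, and all function names. The image dimensions are assumed to be powers of two.

```python
import random


def haar_1d(values):
    """Full 1D Haar decomposition using pairwise averages and differences."""
    out = list(values)
    n = len(out)
    while n > 1:
        half = n // 2
        averages = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        details = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:n] = averages + details
        n = half
    return out


def haar_2d(image):
    """Standard 2D decomposition: transform every row, then every column."""
    rows = [haar_1d(row) for row in image]
    cols = [haar_1d(col) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def top_wavelet_bits(image, t):
    """Keep the t largest-magnitude coefficients and encode only their signs.

    Coefficient i sets bit 2*i if positive and bit 2*i + 1 if negative, so the
    result is the set of 'on' positions of a sparse bit vector.
    """
    coeffs = [c for row in haar_2d(image) for c in row]
    top = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)[:t]
    bits = set()
    for i in top:
        if coeffs[i] > 0:
            bits.add(2 * i)
        elif coeffs[i] < 0:
            bits.add(2 * i + 1)
    return bits


def minhash(bits, num_hashes=8, seed=0):
    """Sub-fingerprint: the minimum of each hash function over the 'on' bits."""
    if not bits:
        raise ValueError("cannot MinHash an empty bit set")
    prime = 2_147_483_647
    rng = random.Random(seed)
    params = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(num_hashes)]
    return [min((a * b + c) % prime for b in bits) for a, c in params]


spectral_image = [[1, 2, 3, 4], [2, 3, 4, 5], [9, 1, 0, 2], [4, 4, 4, 4]]
signature = minhash(top_wavelet_bits(spectral_image, t=5))
```

Two spectral images with similar top wavelets share many bit positions, so their MinHash values agree in many positions, which is what makes the compact signature usable for lookup.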
  • the content recognition engine 110 identifies content item data that is based on the soundtrack audio associated with the currently displayed television program and that also matches the ‘television show’ content type.
  • the content recognition engine 110 identifies content item data relating to a name of the currently displayed television program. For example, the content recognition engine 110 can determine that a particular content item (e.g., a specific television show) is associated with a theme song (e.g., the soundtrack audio), and that the particular content item (e.g., the specific television show) matches the particular content type (e.g., ‘television show’ content type).
  • the content recognition engine 110 can identify data (e.g., the name of the specific television show) that relates to the particular content item (e.g., the currently displayed television program) that is based on the environmental audio data (e.g., the soundtrack audio), and further that matches the particular content type (e.g., ‘television show’ content type).
  • the disambiguation engine 104 receives the identified content item data from the content recognition engine 110 .
  • the disambiguation engine 104 then provides the identified content item data to the mobile computing device 102 , at operation (G).
  • the disambiguation engine 104 transmits the identified content item data relating to the currently displayed television program (e.g., a name of the currently displayed television program) to the mobile computing device 102 .
  • one or more of the mobile computing device 102 , the disambiguation engine 104 , the speech recognition engine 106 , the keyword mapping engine 108 , and the content recognition engine 110 can be in communication with a subset (or each) of the mobile computing device 102 , the disambiguation engine 104 , the speech recognition engine 106 , the keyword mapping engine 108 , and the content recognition engine 110 .
  • one or more of the disambiguation engine 104 , the speech recognition engine 106 , the keyword mapping engine 108 , and the content recognition engine 110 can be implemented using one or more computing devices, such as one or more computing servers, a distributed computing system, or a server farm or cluster.
  • the environmental audio data is streamed from the mobile computing device 102 to the disambiguation engine 104 .
  • the above-mentioned process (e.g., operations (A)-(H)) is performed as the environmental audio data is received by the disambiguation engine 104 (i.e., performed incrementally).
  • operations (A)-(H) are performed iteratively until content item data is identified.
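Incremental, iterative processing of streamed audio can be pictured as a loop that re-runs recognition each time a new chunk arrives and stops as soon as content item data is identified. In this sketch the `recognize` callable stands in for the whole recognition pipeline and is hypothetical, as are the function name and the chunk format.

```python
def identify_incrementally(chunks, recognize):
    """Feed streamed audio chunks to `recognize` until it returns a result.

    `chunks` is any iterable of sample lists; `recognize` takes the audio
    received so far and returns content item data, or None if it cannot
    identify anything yet.
    """
    received = []
    for chunk in chunks:
        received.extend(chunk)
        result = recognize(received)
        if result is not None:
            return result  # stop consuming the stream once identified
    return None  # stream ended without an identification


def fake_recognizer(audio):
    """Stand-in recognizer: succeeds once six samples have been received."""
    return "Example Show" if len(audio) >= 6 else None


stream = iter([[1, 2], [3, 4], [5, 6], [7, 8]])
identified = identify_incrementally(stream, fake_recognizer)
```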
  • FIG. 2 depicts a flowchart of an example process 200 for identifying content item data based on environmental audio data and a spoken natural language query.
  • the example process 200 can be executed using one or more computing devices.
  • the mobile computing device 102 , the disambiguation engine 104 , the speech recognition engine 106 , the keyword mapping engine 108 , and/or the content recognition engine 110 can be used to execute the example process 200 .
  • Audio data that encodes a spoken natural language query and environmental audio data is received ( 202 ).
  • the disambiguation engine 104 receives the waveform data 114 from the mobile computing device 102 .
  • the waveform data 114 includes the spoken natural language query of the user (e.g., “Who directed this show?”) and the environmental audio data (e.g., audio of the currently displayed television program).
  • the disambiguation engine 104 separates the spoken natural language query (“Who directed this show?”) from the background noise of the environment of the user 112 (e.g., audio of the currently displayed television program).
  • a transcription of the natural language query is obtained ( 204 ).
  • the speech recognition engine 106 transcribes the natural language query to generate a transcription of the natural language query (e.g., “Who directed this show?”).
  • a particular content type that is associated with one or more keywords in the transcription is determined ( 206 ).
  • the keyword mapping engine 108 identifies one or more keywords (e.g., “directed”) in the transcription (e.g., “Who directed this show?”) that are associated with a particular content type (e.g., ‘television show’ content type).
  • the keyword mapping engine 108 determines the particular content type that is associated with one or more keywords in the transcription using one or more databases that, for each of multiple content types, map at least one of the keywords to at least one of the multiple content types.
  • the database provides a connection (e.g., mapping) between the keywords (e.g., “directed”) and the content types (e.g., ‘television show’ content type).
  • At least a portion of the environmental audio data is provided to a content recognition engine ( 208 ).
  • the disambiguation engine 104 provides at least the portion of the environmental audio data encoded by the waveform data 114 (e.g., audio of the currently displayed television program) to the content recognition engine 110 .
  • the disambiguation engine 104 also provides the particular content type (e.g. ‘television show’ content type) that is associated with the one or more keywords (e.g., “directed”) in the transcription to the content recognition engine 110 .
  • a content item is identified that is output by the content recognition engine, and that matches the particular content type ( 210 ).
  • the content recognition engine 110 identifies a content item or content item data that is based on the environmental audio data (e.g., audio of the currently displayed television program) and that matches the particular content type (e.g. ‘television show’ content type).
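The steps of process 200 can be read as a short pipeline. The sketch below wires the steps together with injected callables so that the ordering (202) through (210) is visible; every callable, the dictionary-shaped waveform, and the function name `process_200` are placeholders assumed for illustration, not components defined by the specification.

```python
def process_200(waveform, separate, transcribe, content_type_of, recognize):
    """Run the steps of the example process with stand-in components."""
    # (202) Receive audio data and split the query from the environmental audio.
    query_audio, environmental_audio = separate(waveform)
    # (204) Obtain a transcription of the natural language query.
    transcription = transcribe(query_audio)
    # (206) Determine the content type associated with keywords in the transcription.
    content_type = content_type_of(transcription)
    # (208) Provide the environmental audio and the content type to the recognizer.
    # (210) The recognizer's output is the content item matching that type.
    return recognize(environmental_audio, content_type)


# Stand-in components for a dry run.
item = process_200(
    {"speech": "spoken-query-audio", "background": "theme-audio"},
    separate=lambda w: (w["speech"], w["background"]),
    transcribe=lambda audio: "Who directed this show?",
    content_type_of=lambda text: "television show" if "show" in text else "movie",
    recognize=lambda audio, ctype: {"name": "Example Show", "content_type": ctype},
)
```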
  • FIGS. 3A and 3B depict portions 300 a and 300 b, respectively, of a system for identifying content item data.
  • FIGS. 3A and 3B include disambiguation engines 304 a and 304 b, respectively; and include content recognition engines 310 a and 310 b, respectively.
  • the disambiguation engines 304 a and 304 b are similar to the disambiguation engine 104 of system 100 depicted in FIG. 1 ; and the content recognition engines 310 a and 310 b are similar to the content recognition engine 110 of system 100 depicted in FIG. 1 .
  • FIG. 3A depicts the portion 300 a including the content recognition engine 310 a.
  • the content recognition engine 310 a is able to identify content item data based on environmental data and that matches a particular content type.
  • the content recognition engine 310 a is able to appropriately process the environmental data to identify content item data based on the environmental data, and further select one or more of the identified content item data such that the selected content item data matches the particular content type.
  • the disambiguation engine 304 a provides the environmental data and the particular content type to the content recognition engine 310 a, during operation (A). In some embodiments, the disambiguation engine 304 a provides a portion of the environmental data to the content recognition engine 310 a.
  • the content recognition engine 310 a receives the environmental data and the particular content type from the disambiguation engine 304 a. The content recognition engine 310 a then identifies content item data that is based on the environmental data and that matches the particular content type and provides the identified content item data to the disambiguation engine 304 a, during operation (B). Specifically, the content recognition engine 310 a identifies content item data (e.g., a name of a television show, a name of a song, etc.) that is based on the environmental data. The content recognition engine 310 a then selects one or more of the identified content item data that matches the particular content type. In other words, the content recognition engine 310 a filters the identified content item data based on the particular content type. The content recognition engine 310 a transmits (e.g., over a network) the identified content item data to the disambiguation engine 304 a.
  • the content recognition engine 310 a identifies content item data that is based on the soundtrack audio associated with the currently displayed television program.
  • the content recognition engine 310 a filters the identified content item data based on the ‘television show’ content type. For example, the content recognition engine 310 a identifies a ‘theme song name’ and a ‘TV show name’ associated with the soundtrack audio.
  • the content recognition engine 310 a filters the identified content item data such that the identified content item data also matches the ‘television show’ content type. For example, the content recognition engine 310 a selects the ‘TV show name’ identifying data, and transmits the ‘TV show name’ identifying data to the disambiguation engine 304 a.
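The filtering step performed by the content recognition engine 310 a can be sketched as follows. This is a minimal illustration, not the patented implementation: the `ContentItem` class, the `filter_by_content_type` function, and the item names are hypothetical and chosen only to mirror the 'theme song name' / 'TV show name' example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentItem:
    """Hypothetical record for one piece of identified content item data."""
    name: str          # e.g. a show title or a song title
    content_type: str  # e.g. "television show" or "music"


def filter_by_content_type(items, content_type):
    """Keep only the identified items whose type matches the requested one."""
    return [item for item in items if item.content_type == content_type]


# Both items were recognized from the same soundtrack audio (made-up names).
identified = [
    ContentItem("Example Theme Song", "music"),
    ContentItem("Example Show", "television show"),
]

# Only the item matching the 'television show' content type is kept.
selected = filter_by_content_type(identified, "television show")
```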
  • the content recognition engine 310 a selects a corpus (or index) based on the content type (e.g., ‘television show’ content type). Specifically, the content recognition engine 310 a can have access to a first index relating to the ‘television show’ content type and a second index relating to a ‘movie’ content type. The content recognition engine 310 a appropriately selects the first index based on the ‘television show’ content type. Thus, by selecting the first index (and not selecting the second index), the content recognition engine 310 a can more efficiently identify the content item data (e.g., a name of the television show).
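Selecting a corpus by content type can be sketched as a lookup over per-type indexes. The sketch below assumes each index maps an audio fingerprint string to a content name; the fingerprint format, the `INDEXES` table, and the `identify` function are hypothetical, since the text does not specify how the indexes are stored.

```python
# Hypothetical per-content-type indexes: fingerprint -> content name.
INDEXES = {
    "television show": {"fp-001": "Example Show"},
    "movie": {"fp-001": "Example Movie"},
}


def identify(fingerprint, content_type):
    """Search only the index that corresponds to the requested content type."""
    index = INDEXES.get(content_type)
    if index is None:
        return None  # no index exists for this content type
    return index.get(fingerprint)


# The 'movie' index is never consulted for a 'television show' lookup.
show_name = identify("fp-001", "television show")
```

Restricting the lookup to a single index is what allows the search to be cheaper than scanning every corpus.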
  • the disambiguation engine 304 a receives the content item data from the content recognition engine 310 a. For example, the disambiguation engine 304 a receives the ‘TV show name’ identifying data from the content recognition engine 310 a . The disambiguation engine 304 a then provides the identifying data to a third party (e.g., the mobile computing device 102 of FIG. 1 ), during operation (C). For example, the disambiguation engine 304 a provides the ‘TV show name’ identifying data to the third party.
  • FIG. 3B depicts the portion 300 b including the content recognition engine 310 b.
  • the content recognition engine 310 b is able to identify content item data based on environmental data.
  • the content recognition engine 310 b is able to appropriately process the environmental data to identify content item data based on the environmental data, and provide the content item data to the disambiguation engine 304 b.
  • the disambiguation engine 304 b selects one or more of the identified content item data such that the selected content item data matches the particular content type.
  • the disambiguation engine 304 b provides the environmental data to the content recognition engine 310 b, during operation (A). In some embodiments, the disambiguation engine 304 b provides a portion of the environmental data to the content recognition engine 310 b.
  • the content recognition engine 310 b receives the environmental data from the disambiguation engine 304 b.
  • the content recognition engine 310 b then identifies content item data that is based on the environmental data and provides the identified content item data to the disambiguation engine 304 b, during operation (B).
  • the content recognition engine 310 b identifies content item data associated with two or more content items (e.g., a name of a television show, a name of a song, etc.) that is based on the environmental data.
  • the content recognition engine 310 b transmits (e.g., over a network) two or more candidates representing the identified content item data to the disambiguation engine 304 b.
  • the content recognition engine 310 b identifies content item data relating to two or more content items that is based on the soundtrack audio associated with the currently displayed television program. For example, the content recognition engine 310 b identifies a ‘theme song name’ and a ‘TV show name’ associated with the soundtrack audio, and transmits the ‘theme song name’ and ‘TV show name’ identifying data to the disambiguation engine 304 b.
  • the disambiguation engine 304 b receives the two or more candidates from the content recognition engine 310 b. For example, the disambiguation engine 304 b receives the ‘theme song name’ and ‘TV show name’ candidates from the content recognition engine 310 b. The disambiguation engine 304 b then selects one of the two or more candidates based on a particular content type and provides the selected candidate to a third party (e.g., the mobile computing device 102 of FIG. 1 ), during operation (C). Specifically, the disambiguation engine 304 b has previously received the particular content type (e.g., that is associated with an utterance), as described above with respect to FIG. 1 .
  • the disambiguation engine 304 b selects a particular candidate of the two or more candidates based on the particular content type. Specifically, the disambiguation engine 304 b selects the particular candidate of the two or more candidates that matches the particular content type. For example, the disambiguation engine 304 b selects the ‘TV show name’ candidate as the ‘TV show name’ candidate matches the ‘television show’ content type.
  • the two or more candidates from the content recognition engine 310 b are associated with a ranking score.
  • the ranking score can be associated with any scoring metric as determined by the disambiguation engine 304 b .
  • the disambiguation engine 304 b can further adjust the ranking score of two or more candidates based on the particular content type. Specifically, the disambiguation engine 304 b can increase the ranking score of one or more of the candidates when the respective candidates are matched to the particular content type. For example, the ranking score of the candidate ‘TV show name’ can be increased as it matches the ‘television show’ content type. Furthermore, the disambiguation engine 304 b can decrease the ranking score of one or more of the candidates when the respective candidates are not matched to the particular content type. For example, the ranking score of the candidate ‘theme song name’ can be decreased as it does not match the ‘television show’ content type.
  • the two or more candidates can be ranked based on the respective adjusted ranking scores by the disambiguation engine 304 b.
  • the disambiguation engine 304 b can rank the ‘TV show name’ candidate above the ‘theme song name’ candidate as the ‘TV show name’ candidate has a higher adjusted ranking score as compared to the adjusted ranking score of the ‘theme song name’ candidate.
  • the disambiguation engine 304 b selects the candidate ranked highest (i.e., has the highest adjusted ranking score).
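The score adjustment and selection performed by the disambiguation engine 304 b can be sketched as below. The boost and penalty amounts, the initial scores, and the `rerank` function are assumptions made for illustration; the text only states that scores can be increased for matching candidates and decreased for non-matching ones.

```python
def rerank(candidates, content_type, boost=0.2, penalty=0.2):
    """Adjust each candidate's score by content-type match, then sort.

    candidates: list of (name, candidate_content_type, ranking_score) tuples.
    boost and penalty are hypothetical adjustment amounts.
    """
    adjusted = []
    for name, candidate_type, score in candidates:
        if candidate_type == content_type:
            score += boost    # candidate matches the particular content type
        else:
            score -= penalty  # candidate does not match
        adjusted.append((name, candidate_type, score))
    # Highest adjusted ranking score first.
    adjusted.sort(key=lambda candidate: candidate[2], reverse=True)
    return adjusted


# Made-up initial scores: the theme song starts out ranked higher.
candidates = [
    ("theme song name", "music", 0.6),
    ("TV show name", "television show", 0.5),
]
ranked = rerank(candidates, "television show")
best = ranked[0]  # the candidate with the highest adjusted ranking score
```

With these assumed numbers the 'TV show name' candidate overtakes the 'theme song name' candidate after adjustment, which matches the ordering described above.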
  • FIG. 4 depicts a system 400 for identifying content item data based on environmental image data and a spoken natural language query.
  • the system 400 can identify content item data that is based on the environmental image data and that matches a particular content type associated with the spoken natural language query.
  • the system 400 includes a mobile computing device 402 , a disambiguation engine 404 , a speech recognition engine 406 , a keyword mapping engine 408 , and a content recognition engine 410 , analogous to the mobile computing device 102 , the disambiguation engine 104 , the speech recognition engine 106 , the keyword mapping engine 108 , and the content recognition engine 110 , respectively, of system 100 illustrated in FIG. 1 .
  • the user 112 is looking at a CD album cover of a soundtrack of a movie. In the illustrated example, the user 112 would like to know what songs are on the soundtrack. In some examples, the user 112 may not know the name of the movie soundtrack, and may therefore ask the question “What songs are on this?” or “What songs play in this movie?”
  • the mobile computing device 402 detects this utterance, as well as environmental image data associated with the environment of the user 112 .
  • the environmental image data associated with the environment of the user 112 can include image data of the environment of the user 112 .
  • the environmental image data includes an image of the CD album cover that depicts images related to the movie (e.g., an image of a movie poster of the associated movie).
  • the mobile computing device 402 detects the environmental image data utilizing a camera of the mobile computing device 402 that captures an image (or video) of the CD album cover.
  • the mobile computing device 402 processes the detected utterance to generate waveform data 414 that represents the detected utterance and transmits the waveform data 414 and the environmental image data to the disambiguation engine 404 (e.g., over a network), during operation (A).
  • the disambiguation engine 404 receives the waveform data 414 and the environmental image data from the mobile computing device 402 .
  • the disambiguation engine 404 processes the waveform data 414 and transmits the utterance to the speech recognition engine 406 (e.g., over a network), during operation (B).
  • the utterance relates to a query (e.g., a query relating to the movie soundtrack).
  • the speech recognition engine 406 receives the utterance from the disambiguation engine 404 .
  • the speech recognition engine 406 obtains a transcription of the utterance and provides the transcription to the keyword mapping engine 408 , during operation (C).
  • the speech recognition engine 406 processes the utterance received from the disambiguation engine 404 by generating a transcription of the utterance.
  • the speech recognition engine 406 transcribes the utterance to generate the transcription of “What songs are on this?”
  • the speech recognition engine 406 provides two or more transcriptions of the utterance. For example, the speech recognition engine 406 transcribes the utterance to generate the transcriptions of “What songs are on this?” and “What sinks are on this?”
  • the keyword mapping engine 408 receives the transcription from the speech recognition engine 406 .
  • the keyword mapping engine 408 identifies one or more keywords in the transcription that are associated with a particular content type and provides the particular content type to the disambiguation engine 404 , during operation (D).
  • the keyword mapping engine 408 identifies the keyword “songs” from the transcription of “What songs are on this?”
  • the keyword “songs” is associated with the ‘music’ content type.
  • a keyword of the transcription that is identified by the keyword mapping engine 408 is associated with two or more content types.
  • the keyword “songs” is associated with the ‘music’ and ‘singer’ content types.
  • the keyword mapping engine 408 transmits (e.g., over a network) the particular content type to the disambiguation engine 404 .
  • the keyword mapping engine 408 identifies the one or more keywords in the transcription that are associated with a particular content type using one or more databases that, for each of multiple content types, maps at least one of the keywords to at least one of the multiple content types. For example, the keyword mapping engine 408 uses the one or more databases that maps the keyword “songs” to the ‘music’ and ‘singer’ content types.
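The keyword-to-content-type mapping can be sketched with an in-memory table standing in for the one or more databases. The table contents follow the examples in this description ("songs" and "directed"); the `KEYWORD_TO_CONTENT_TYPES` name, the tokenization rule, and the `content_types_for` function are assumptions for illustration.

```python
import re

# Hypothetical stand-in for the keyword mapping database.
KEYWORD_TO_CONTENT_TYPES = {
    "songs": ["music", "singer"],
    "directed": ["television show"],
}


def content_types_for(transcription):
    """Return the content types associated with any keyword in the text."""
    words = re.findall(r"[a-z']+", transcription.lower())
    found = []
    for word in words:
        for content_type in KEYWORD_TO_CONTENT_TYPES.get(word, []):
            if content_type not in found:  # keep first-seen order, no repeats
                found.append(content_type)
    return found


types_for_songs = content_types_for("What songs are on this?")
types_for_sinks = content_types_for("What sinks are on this?")
```

A transcription with no known keyword, such as the misrecognized "sinks" variant, yields an empty list, which a caller could use to prefer one transcription over another.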
  • the disambiguation engine 404 receives the particular content type associated with the transcription of the utterance from the keyword mapping engine 408 . Furthermore, as mentioned above, the disambiguation engine 404 receives the environmental image data associated with the utterance. The disambiguation engine 404 then provides the environmental image data and the particular content type to the content recognition engine 410 , during operation (E).
  • the disambiguation engine 404 transmits the environmental image data relating to the movie soundtrack (e.g., an image of the movie poster CD album cover) and the particular content type of the transcription of the utterance (e.g., ‘music’ content type) to the content recognition engine 410 .
  • the content recognition engine 410 receives the environmental image data and the particular content type from the disambiguation engine 404 .
  • the content recognition engine 410 identifies content item data that is based on the environmental image data and that matches the particular content type and provides the identified content item data to the disambiguation engine 404 , during operation (F).
  • the content recognition engine 410 appropriately processes the environmental image data to identify content item data (e.g., a name of a content item).
  • the content recognition engine 410 matches the identified content item with the particular content type (e.g., content type of the transcription of the utterance).
  • the content recognition engine 410 transmits (e.g., over a network) the identified content item data to the disambiguation engine 404 .
  • the content recognition engine 410 identifies data that is based on the environmental image data relating to the image of the movie poster CD album cover, and further that matches the ‘music’ content type.
  • the content recognition engine 410 identifies content item data that is based on the movie poster associated with the CD album cover and that also matches the ‘music’ content type. Thus, in some examples, the content recognition engine 410 identifies content item data relating to a name of the movie soundtrack. For example, the content recognition engine 410 can determine that a particular content item (e.g., a specific movie soundtrack) is associated with a movie poster, and that the particular content item (e.g., the specific movie soundtrack) matches the particular content type (e.g., ‘music’ content type).
  • the content recognition engine 410 can identify data (e.g., the name of the specific movie soundtrack) that relates to the particular content item (e.g., the specific movie soundtrack) that is based on the environmental image data (e.g., the image of the CD album cover), and further that matches the particular content type (e.g., ‘music’ content type).
  • the disambiguation engine 404 receives the identified content item data from the content recognition engine 410 .
  • the disambiguation engine 404 then provides the identified content item data to the mobile computing device 402 , at operation (G).
  • the disambiguation engine 404 transmits the identified content item data relating to the movie soundtrack (e.g., a name of the movie soundtrack) to the mobile computing device 402 .
  • FIGS. 1 to 4 illustrate several example processes in which the computing environment can identify media content (or other content) based on environmental information, such as ambient noises. Other processes for identifying content can also be used.
  • FIGS. 5 and 6 illustrate other example processes in which a computing environment can augment the spoken natural language query with context that is derived from the environmental information, such as data that identifies media content, in order to provide a more satisfying answer to the spoken natural language query.
  • FIG. 5 depicts a system 500 for identifying one or more results based on environmental audio data and an utterance.
  • the one or more results can represent one or more answers to a natural language query.
  • the system 500 includes a mobile computing device 502 , a coordination engine 504 , a speech recognition engine 506 , a content identification engine 508 , and a natural language query processing engine 510 .
  • the mobile computing device 502 is in communication with the coordination engine 504 over one or more networks.
  • the mobile computing device 502 can include a microphone, a camera, or other detection mechanisms for detecting utterances from a user 512 and/or environmental data associated with the user 512 .
  • the user 512 is watching a television program.
  • the user 512 would like to know who directed the television program (e.g., an entity) that is currently playing.
  • the user 512 may not know the name of the television program that is currently playing, and may therefore ask the question “Who directed this show?”
  • the mobile computing device 502 detects this utterance, as well as environmental data associated with the environment of the user 512 .
  • the environmental data associated with the environment of the user 512 can include background noise of the environment of the user 512 .
  • the environmental data includes the sounds of the television program (e.g., an entity).
  • the environmental data that is associated with the currently displayed television program can include audio of the currently displayed television program (e.g., dialogue of the currently displayed television program, soundtrack audio associated with the currently displayed television program, etc.).
  • the environmental data can include environmental audio data, environmental image data, or both.
  • the mobile computing device 502 detects the environmental audio data after detecting the utterance; detects the environmental audio data concurrently with detecting the utterance; or both.
  • the mobile computing device 502 processes the detected utterance and the environmental data to generate waveform data 514 that represents the detected utterance and detected environmental audio data (e.g., the sounds of the television program) and transmits the waveform data 514 to the coordination engine 504 (e.g., over a network), during operation (A).
  • the coordination engine 504 receives the waveform data 514 from the mobile computing device 502 .
  • the coordination engine 504 processes the waveform data 514 , including separating (or extracting) the utterance from other portions of the waveform data 514 , and transmits the portion of the waveform data 514 corresponding to the utterance to the speech recognition engine 506 (e.g., over a network), during operation (B).
  • the coordination engine 504 separates the utterance (“Who directed this show?”) from the background noise of the environment of the user 512 (e.g., audio of the currently displayed television program).
  • the coordination engine 504 utilizes a voice detector to facilitate separation of the utterance from the background noise by identifying a portion of the waveform data 514 that includes voice activity.
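One simple way to approximate a voice detector is a frame-energy threshold, sketched below. This is an assumption for illustration only: the description does not say how the voice detector works, and a production detector would need to distinguish the user's speech from dialogue in the television audio, which plain energy thresholding cannot do. The function name, frame size, and threshold are hypothetical.

```python
def split_by_voice_activity(samples, frame_size=4, threshold=0.1):
    """Split samples into (utterance, background) by mean frame energy.

    Frames whose mean squared amplitude reaches the threshold are treated
    as containing voice activity; all other frames are treated as background.
    """
    utterance, background = [], []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        energy = sum(s * s for s in frame) / len(frame)
        target = utterance if energy >= threshold else background
        target.extend(frame)
    return utterance, background


# Made-up waveform: quiet background, a loud spoken frame, quiet background.
waveform = [0.01] * 4 + [0.8, -0.7, 0.9, -0.8] + [0.02] * 4
utterance_part, background_part = split_by_voice_activity(waveform)
```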
  • the utterance relates to a query (e.g., a query relating to the currently displayed television program).
  • the speech recognition engine 506 receives the portion of the waveform data 514 corresponding to the utterance from the coordination engine 504 .
  • the speech recognition engine 506 obtains a transcription of the utterance and provides the transcription to the coordination engine 504 , during operation (C).
  • the speech recognition engine 506 appropriately processes the portion of the waveform data 514 corresponding to the utterance received from the coordination engine 504 .
  • processing of the portion of the waveform data 514 corresponding to the utterance by the speech recognition engine 506 includes generating a transcription of the utterance. Generating the transcription of the utterance can include transcribing the utterance into text or text-related data. In other words, the speech recognition engine 506 can provide a representation of language in written form of the utterance.
  • the speech recognition engine 506 transcribes the utterance to generate the transcription of “Who directed this show?”
  • the speech recognition engine 506 provides two or more transcriptions of the utterance.
  • the speech recognition engine 506 transcribes the utterance to generate the transcriptions of “Who directed this show?” and “Who directed this shoe?”
  • the coordination engine 504 receives the transcription of the utterance from the speech recognition engine 506 . Furthermore, as mentioned above, the coordination engine 504 receives the waveform data 514 from the mobile computing device 502 that includes the environmental audio data associated with the utterance. The coordination engine 504 then identifies an entity using the environmental data. Specifically, the coordination engine 504 obtains data that identifies an entity from the content identification engine 508 . To that end, the coordination engine 504 provides the environmental audio data and the portion of the waveform 514 corresponding to the utterance to the content identification engine 508 (e.g., over a network), during operation (D).
  • the coordination engine 504 transmits the environmental data relating to the currently displayed television program (e.g., the entity) that includes audio of the currently displayed television program (e.g., dialogue of the currently displayed television program, soundtrack audio associated with the currently displayed television program, etc.) and the portion of the waveform 514 corresponding to the utterance (“Who directed this show?”) to the content identification engine 508 .
  • the coordination engine 504 provides a portion of the environmental data to the content identification engine 508 .
  • the portion of the environmental data can include background noise detected by the mobile computing device 502 after detecting the utterance.
  • the portion of the environmental data can include background noise detected by the mobile computing device 502 concurrently with detecting the utterance.
  • the content identification engine 508 receives the environmental data and the portion of the waveform 514 corresponding to the utterance from the coordination engine 504 .
  • the content identification engine 508 identifies data that identifies the entity (e.g., content item data) that is based on the environmental data and the utterance and provides the data that identifies the entity to the coordination engine 504 (e.g., over a network), during operation (E).
  • the content identification engine 508 appropriately processes the environmental data and the portion of the waveform 514 corresponding to the utterance to identify data that identifies the entity (e.g., content item data) that is associated with the environmental data (e.g., a name of a television show, a name of a song, etc.).
  • the content identification engine 508 processes the environmental audio data to identify content item data that is associated with the currently displayed television program.
  • the content identification engine 508 is the system 100 of FIG. 1 .
  • the coordination engine 504 receives the data that identifies the entity (e.g., the content item data) from the content identification engine 508 . Furthermore, as mentioned above, the coordination engine 504 receives the transcription from the speech recognition engine 506 . The coordination engine 504 then provides a query including the transcription and the data that identifies the entity to the natural language query processing engine 510 (e.g., over a network), during operation (F). For example, the coordination engine 504 submits a query that includes the transcription of the utterance (“Who directed this show?”) and the content item data (‘television show name’) to the natural language query processing engine 510 .
  • the coordination engine 504 generates the query. In some examples, the coordination engine 504 obtains the query (e.g., from a third-party server). For example, the coordination engine 504 can submit the transcription of the utterance, and the data that identifies the entity to the third-party server, and receive back the query based on the transcription and the data that identifies the entity.
  • generating the query by the coordination engine 504 can include associating the transcription of the utterance with the data that identifies the entity (e.g., the content item data).
  • associating the transcription of the utterance with the content item data can include tagging the transcription with the data that identifies the entity.
  • the coordination engine 504 can tag the transcription “Who directed this show?” with the ‘television show name’ or other identifying information associated with the content item data (e.g., an identification (ID) number).
  • associating the transcription of the utterance with the data that identifies the entity can include substituting a portion of the transcription with the data that identifies the entity.
  • the coordination engine 504 can substitute a portion of the transcription “Who directed this show?” with the ‘television show name’ or data identifying the ‘television show name.’
  • substituting a portion of the transcription with the data that identifies the entity can include substituting one or more words of the transcription of the utterance with the data that identifies the entity.
  • the coordination engine 504 can substitute the ‘television show name’ or data identifying the ‘television show name’ in the transcription of “Who directed this show?”
  • the substitution can result in the transcription including “Who directed ‘television show name’?” or “Who directed ‘ID number’?”
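The two ways of associating the transcription with the entity, tagging and substitution, can be sketched as follows. The dictionary layout, the function names, the default placeholder phrase, and the "Example Show" name are hypothetical; the description does not define a query format.

```python
def tag_query(transcription, entity):
    """Tagging: leave the transcription intact and attach the entity to it."""
    return {"text": transcription, "entity": entity}


def substitute_query(transcription, entity, placeholder="this show"):
    """Substitution: replace the referring words with the entity itself.

    The placeholder is an assumed phrase; a real system would have to work
    out which words of the transcription refer to the entity.
    """
    return transcription.replace(placeholder, entity)


tagged = tag_query("Who directed this show?", "Example Show")
substituted = substitute_query("Who directed this show?", "Example Show")
```

Tagging preserves the user's original wording for the downstream engine, while substitution yields a self-contained question that needs no separate context field.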
  • the natural language query processing engine 510 receives the query that includes the transcription and the data that identifies the entity (e.g., the content item data) from the coordination engine 504 .
  • the natural language query processing engine 510 appropriately processes the query and based on the processing, provides one or more results to the coordination engine 504 (e.g., over a network), during operation (G). In other words, the coordination engine 504 obtains one or more results of the query (e.g., from the natural language query processing engine 510 ).
  • the natural language query processing engine 510 obtains information resources (from a collection of information resources) relevant to the query (the transcription of the utterance and the content item data). In some examples, the natural language query processing engine 510 matches the query against database information (e.g., text documents, images, audio, video, etc.) and a score is calculated on how well each object in the database matches the query. The natural language query processing engine 510 identifies one or more results based on the matched objects (e.g., objects having a score above a threshold score).
  • the natural language query processing engine 510 receives the query that includes the transcription of the utterance “Who directed this show” and the ‘television show name’ (or other identifying information).
  • the natural language query processing engine 510 matches the query against database information, and provides one or more results that match the query.
  • the natural language query processing engine 510 calculates a score of each of the matching objects.
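Matching a query against database objects and keeping those above a threshold score can be sketched with a simple term-overlap score. The scoring function, the threshold value, and the sample documents are assumptions for illustration; the description allows any scoring of how well each object matches the query.

```python
def match_score(query_terms, document):
    """Fraction of the query terms that appear in the document."""
    document_terms = set(document.lower().split())
    return len(query_terms & document_terms) / len(query_terms)


def search(query, documents, threshold=0.7):
    """Return (score, document) pairs scoring above the threshold, best first."""
    terms = set(query.lower().split())
    if not terms:
        return []  # nothing to match against
    scored = [(match_score(terms, doc), doc) for doc in documents]
    results = [pair for pair in scored if pair[0] > threshold]
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results


# Made-up database contents.
documents = [
    "example show was directed by a hypothetical director",
    "example show theme song lyrics",
    "unrelated cooking recipe",
]
results = search("directed example show", documents)
```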
  • the coordination engine 504 receives the one or more results from the natural language query processing engine 510 .
  • the coordination engine 504 then provides the one or more results to the mobile computing device 502 (e.g., over a network), at operation (H).
  • the coordination engine 504 transmits the one or more results (e.g., the name of the director of the television show) to the mobile computing device 502 .
  • one or more of the mobile computing device 502 , the coordination engine 504 , the speech recognition engine 506 , the content identification engine 508 , and the natural language query processing engine 510 can be in communication with a subset (or each) of the mobile computing device 502 , the coordination engine 504 , the speech recognition engine 506 , the content identification engine 508 , and the natural language query processing engine 510 .
  • one or more of the coordination engine 504 , the speech recognition engine 506 , the content identification engine 508 , and the natural language query processing engine 510 can be implemented using one or more computing devices, such as one or more computing servers, a distributed computing system, or a server farm or cluster.
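The flow of operations (A) through (H) in system 500 can be sketched as a coordinator function that calls each engine in turn. The engines are passed in as plain callables, and the `fake_*` functions below are hypothetical stand-ins that operate on strings rather than real audio; none of them reflects an actual engine interface.

```python
def answer_query(waveform, separate, transcribe, identify_entity, run_query):
    """Coordinate the engines roughly as the coordination engine 504 does."""
    utterance_audio, environmental_audio = separate(waveform)        # (B)
    transcription = transcribe(utterance_audio)                      # (C)
    entity = identify_entity(environmental_audio, utterance_audio)   # (D)-(E)
    query = {"transcription": transcription, "entity": entity}       # (F)
    return run_query(query)                                          # (G)-(H)


# Hypothetical stand-ins for the engines, using strings in place of audio.
def fake_separate(waveform):
    return waveform["speech"], waveform["background"]


def fake_transcribe(utterance_audio):
    return utterance_audio  # the stub "audio" is already text


def fake_identify(environmental_audio, utterance_audio):
    return "Example Show" if environmental_audio == "show-audio" else None


def fake_run_query(query):
    return ["director of " + str(query["entity"])]


results = answer_query(
    {"speech": "Who directed this show?", "background": "show-audio"},
    fake_separate, fake_transcribe, fake_identify, fake_run_query,
)
```

Passing the engines as parameters keeps the coordinator independent of where each engine runs, which is consistent with the engines being reachable over one or more networks.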
  • FIG. 6 depicts a flowchart of an example process 600 for identifying one or more results based on environmental data and an utterance.
  • the example process 600 can be executed using one or more computing devices.
  • the mobile computing device 502 , the coordination engine 504 , the speech recognition engine 506 , the content identification engine 508 , and/or the natural language query processing engine 510 can be used to execute the example process 600 .
  • Audio data that encodes an utterance and environmental data is received ( 602 ).
  • the coordination engine 504 receives the waveform data 514 from the mobile computing device 502 .
  • the waveform data 514 includes the utterance of the user (e.g., “Who directed this show?”) and the environmental data (e.g., audio of the currently displayed television program).
  • receiving the environmental data can include receiving environmental audio data, environmental image data, or both.
  • receiving the environmental data includes receiving additional audio data that includes background noise.
  • a transcription of the utterance is obtained ( 604 ).
  • the coordination engine 504 obtains a transcription of the utterance using the speech recognition engine 506 .
  • the speech recognition engine 506 transcribes the utterance to generate a transcription of the utterance (e.g., “Who directed this show?”).
  • An entity is identified using the environmental data ( 606 ).
  • the coordination engine 504 obtains data identifying the entity using the content identification engine 508 .
  • the content identification engine 508 can appropriately process the environmental data (e.g., the environmental audio data associated with the displayed television program) to identify data identifying the entity (e.g., content item data) that is associated with the environmental data (e.g., a name of a television show, a name of a song, etc.).
  • the content identification engine 508 can further process the waveform 514 corresponding to the utterance (concurrently or subsequently to processing the environmental data) to identify the entity.
  • the coordination engine 504 generates a query. In some examples, generating the query can include associating the transcription of the utterance with the data that identifies the entity. In some examples, associating the transcription of the utterance with the content item data can include substituting a portion of the transcription with the data that identifies the entity. In some examples, substituting a portion of the transcription with the data that identifies the entity can include substituting one or more words of the transcription of the utterance with the data that identifies the entity.
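  • The two ways of associating the transcription with the entity data, tagging and substitution, can be sketched as follows. This is a minimal illustration, not the described implementation: the `Query` type, the `DEICTIC_PATTERN` list of phrases, and the entity name "Example Show" are all hypothetical and chosen only for the example.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical phrases that point at content in the user's environment.
DEICTIC_PATTERN = re.compile(r"\bthis (show|movie|song|program)\b", re.IGNORECASE)


@dataclass
class Query:
    text: str                     # transcription, possibly rewritten
    entity: Optional[str] = None  # data that identifies the entity, if tagged


def tag_entity(transcription: str, entity_name: str) -> Query:
    """Keep the transcription as-is and attach the entity as a tag."""
    return Query(text=transcription, entity=entity_name)


def substitute_entity(transcription: str, entity_name: str) -> Query:
    """Replace the first deictic phrase with the entity name.

    Falls back to tagging when no phrase in the transcription matches.
    """
    # A function replacement avoids treating backslashes in the name as escapes.
    text, count = DEICTIC_PATTERN.subn(lambda _m: entity_name, transcription, count=1)
    if count == 0:
        return tag_entity(transcription, entity_name)
    return Query(text=text)


print(substitute_entity("Who directed this show?", "Example Show").text)
```

  • Falling back to tagging when nothing can be substituted is a design choice of this sketch; it keeps the entity data in the query either way.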
  • the query is submitted to a natural language processing engine ( 608 ).
  • the coordination engine 504 submits the query to the natural language query processing engine 510 .
  • the query can include at least a portion of the transcription and the data that identifies the entity (e.g., the content item data).
  • the coordination engine 504 submits a query that includes the transcription of the utterance (“Who directed this show?”) and the content item data (‘television show name’) to the natural language query processing engine 510.
  • One or more results of the query are obtained ( 610 ).
  • the coordination engine 504 obtains one or more results (e.g., the name of the director of the television show) of the query from the natural language query processing engine 510.
  • the coordination engine 504 then provides the one or more results to the mobile computing device 502 .
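  • The steps of example process 600 can be summarized in a short sketch. The three engines are passed in as plain callables; their signatures, the `AudioData` type, and the dictionary-shaped query are assumptions made for illustration and are not specified by the process itself.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AudioData:
    utterance: bytes      # portion of the audio data encoding the utterance
    environmental: bytes  # portion encoding the environmental data


def run_process_600(
    audio: AudioData,                                  # (602) received audio data
    transcribe: Callable[[bytes], str],                # stands in for engine 506
    identify_entity: Callable[[bytes], str],           # stands in for engine 508
    answer_query: Callable[[Dict[str, str]], List[str]],  # stands in for engine 510
) -> List[str]:
    transcription = transcribe(audio.utterance)        # (604) obtain a transcription
    entity = identify_entity(audio.environmental)      # (606) identify an entity
    # The query carries at least a portion of the transcription and the entity data.
    query = {"transcription": transcription, "entity": entity}
    return answer_query(query)                         # (608) submit, (610) obtain results


# Usage with stub engines (hypothetical values).
results = run_process_600(
    AudioData(utterance=b"u", environmental=b"e"),
    transcribe=lambda _audio: "Who directed this show?",
    identify_entity=lambda _audio: "Example Show",
    answer_query=lambda q: [f"{q['transcription']} | {q['entity']}"],
)
print(results)
```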
  • FIG. 7 shows an example of a generic computer device 700 and a generic mobile computer device 750 , which may be used with the techniques described here.
  • Computing device 700 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers.
  • Computing device 750 is intended to represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, and other similar computing devices.
  • the components shown here, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed in this document.
  • Computing device 700 includes a processor 702 , memory 704 , a storage device 706 , a high-speed interface 708 connecting to memory 704 and high-speed expansion ports 710 , and a low speed interface 712 connecting to low speed bus 714 and storage device 706 .
  • Each of the components 702 , 704 , 706 , 708 , 710 , and 712 are interconnected using various busses, and may be mounted on a common motherboard or in other manners as appropriate.
  • the processor 702 may process instructions for execution within the computing device 700 , including instructions stored in the memory 704 or on the storage device 706 to display graphical information for a GUI on an external input/output device, such as display 716 coupled to high speed interface 708 .
  • multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory.
  • multiple computing devices 700 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).
  • the memory 704 stores information within the computing device 700 .
  • the memory 704 is a volatile memory unit or units.
  • the memory 704 is a non-volatile memory unit or units.
  • the memory 704 may also be another form of computer-readable medium, such as a magnetic or optical disk.
  • the storage device 706 is capable of providing mass storage for the computing device 700 .
  • the storage device 706 may be or contain a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations.
  • a computer program product may be tangibly embodied in an information carrier.
  • the computer program product may also contain instructions that, when executed, perform one or more methods, such as those described above.
  • the information carrier is a computer- or machine-readable medium, such as the memory 704 , the storage device 706 , or a memory on processor 702 .
  • the high speed controller 708 manages bandwidth-intensive operations for the computing device 700 , while the low speed controller 712 manages lower bandwidth-intensive operations.
  • the high-speed controller 708 is coupled to memory 704 , display 716 (e.g., through a graphics processor or accelerator), and to high-speed expansion ports 710 , which may accept various expansion cards (not shown).
  • low-speed controller 712 is coupled to storage device 706 and low-speed expansion port 714 .
  • the low-speed expansion port, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.
  • the computing device 700 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 720 , or multiple times in a group of such servers. It may also be implemented as part of a rack server system 724 . In addition, it may be implemented in a personal computer such as a laptop computer 722 . Alternatively, components from computing device 700 may be combined with other components in a mobile device (not shown), such as device 750 . Each of such devices may contain one or more of computing device 700 , 750 , and an entire system may be made up of multiple computing devices 700 , 750 communicating with each other.
  • Computing device 750 includes a processor 752 , memory 764 , an input/output device such as a display 754 , a communication interface 766 , and a transceiver 768 , among other components.
  • the device 750 may also be provided with a storage device, such as a microdrive or other device, to provide additional storage.
  • Each of the components 750 , 752 , 764 , 754 , 766 , and 768 are interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate.
  • the processor 752 may execute instructions within the computing device 750, including instructions stored in the memory 764.
  • the processor may be implemented as a chipset of chips that include separate and multiple analog and digital processors.
  • the processor may provide, for example, for coordination of the other components of the device 750 , such as control of user interfaces, applications run by device 750 , and wireless communication by device 750 .
  • Processor 752 may communicate with a user through control interface 758 and display interface 756 coupled to a display 754.
  • the display 754 may be, for example, a TFT LCD (Thin-Film-Transistor Liquid Crystal Display) or an OLED (Organic Light Emitting Diode) display, or other appropriate display technology.
  • the display interface 756 may comprise appropriate circuitry for driving the display 754 to present graphical and other information to a user.
  • the control interface 758 may receive commands from a user and convert them for submission to the processor 752 .
  • an external interface 762 may be provided in communication with processor 752, so as to enable near area communication of device 750 with other devices. External interface 762 may provide, for example, for wired communication in some implementations, or for wireless communication in other implementations, and multiple interfaces may also be used.
  • the memory 764 stores information within the computing device 750 .
  • the memory 764 may be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units.
  • Expansion memory 754 may also be provided and connected to device 750 through expansion interface 752 , which may include, for example, a SIMM (Single In Line Memory Module) card interface.
  • expansion memory 754 may provide extra storage space for device 750 , or may also store applications or other information for device 750 .
  • expansion memory 754 may include instructions to carry out or supplement the processes described above, and may include secure information also.
  • expansion memory 754 may be provided as a security module for device 750, and may be programmed with instructions that permit secure use of device 750.
  • secure applications may be provided via the SIMM cards, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner.
  • the memory may include, for example, flash memory and/or NVRAM memory, as discussed below.
  • a computer program product is tangibly embodied in an information carrier.
  • the computer program product contains instructions that, when executed, perform one or more methods, such as those described above.
  • the information carrier is a computer- or machine-readable medium, such as the memory 764 , expansion memory 754 , memory on processor 752 , or a propagated signal that may be received, for example, over transceiver 768 or external interface 762 .
  • Device 750 may communicate wirelessly through communication interface 766 , which may include digital signal processing circuitry where necessary. Communication interface 766 may provide for communications under various modes or protocols, such as GSM voice calls, SMS, EMS, or MMS messaging, CDMA, TDMA, PDC, WCDMA, CDMA2000, or GPRS, among others. Such communication may occur, for example, through radio-frequency transceiver 768 . In addition, short-range communication may occur, such as using a Bluetooth, WiFi, or other such transceiver (not shown). In addition, GPS (Global Positioning System) receiver module 750 may provide additional navigation- and location-related wireless data to device 750 , which may be used as appropriate by applications running on device 750 .
  • Device 750 may also communicate audibly using audio codec 760 , which may receive spoken information from a user and convert it to usable digital information. Audio codec 760 may likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of device 750 . Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice messages, music files, etc.) and may also include sound generated by applications operating on device 750 .
  • the computing device 750 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a cellular telephone 780 . It may also be implemented as part of a smartphone 782 , personal digital assistant, or other similar mobile device.
  • implementations of the systems and techniques described here may be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof.
  • These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
  • the systems and techniques described here may be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user may provide input to the computer.
  • Other kinds of devices may be used to provide for interaction with a user as well; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
  • the systems and techniques described here may be implemented in a computing system that includes a back end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front end component (e.g., a client computer having a graphical user interface or a Web browser through which a user may interact with an implementation of the systems and techniques described here), or any combination of such back end, middleware, or front end components.
  • the components of the system may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (“LAN”), a wide area network (“WAN”), and the Internet.
  • the computing system may include clients and servers.
  • a client and server are generally remote from each other and typically interact through a communication network.
  • the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for receiving audio data encoding an utterance and environmental data, obtaining a transcription of the utterance, identifying an entity using the environmental data, submitting a query to a natural language query processing engine, wherein the query includes at least a portion of the transcription and data that identifies the entity, and obtaining one or more results of the query.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application claims the benefit of U.S. Provisional Patent Application No. 61/698,934, filed Sep. 10, 2012, the entire contents of which are hereby incorporated by reference.
  • FIELD
  • The present specification relates to identifying results of a query based on a natural language query and environmental information, for example to answer questions using environmental information as context.
  • BACKGROUND
  • In general, a search query includes one or more terms that a user submits to a search engine when the user requests the search engine to execute a search. Among other approaches, a user may enter query terms of a search query by typing on a keyboard or, in the context of a voice query, by speaking the query terms into a microphone of a mobile device. Voice queries may be processed using speech recognition technology.
  • SUMMARY
  • According to some innovative aspects of the subject matter described in this specification, environmental information, such as ambient noise, may aid a query processing system in answering a natural language query. For example, a user may ask a question about a television program that they are viewing, such as “What actor is in this movie?” The user's mobile device detects the user's utterance and environmental data, which may include the soundtrack audio of the television program. The mobile computing device encodes the utterance and the environmental data as waveform data, and provides the waveform data to a server-based computing environment.
  • The computing environment separates the utterance from the environmental data of the waveform data, and then obtains a transcription of the utterance. The computing environment further identifies entity data relating to the environmental data and the utterance, such as by identifying the name of the movie. From the transcription and the entity data, the computing environment can then identify one or more results, for example, results in response to the user's question. Specifically, the one or more results can include an answer to the user's question of “What actor is in this movie?” (e.g., the name of the actor). The computing environment can provide such results to the user of the mobile computing device.
  • Innovative aspects of the subject matter described in this specification may be embodied in methods that include the actions of receiving audio data encoding an utterance and environmental data, obtaining a transcription of the utterance, identifying an entity using the environmental data, submitting a query to a natural language query processing engine, wherein the query includes at least a portion of the transcription and data that identifies the entity, and obtaining one or more results of the query.
  • Other embodiments of these aspects include corresponding systems, apparatus, and computer programs, configured to perform the actions of the methods, encoded on computer storage devices.
  • These and other embodiments may each optionally include one or more of the following features. For instance, outputting a representation of at least one of the results. The entity is identified further using the utterance. Generating the query. Generating the query includes associating the transcription with the data that identifies the entity. Associating further includes tagging the transcription with the data that identifies the entity. Associating further includes substituting a portion of the transcription with the data that identifies the entity. Substituting further includes substituting one or more words of the transcription with the data that identifies the entity. Receiving the environmental data further includes receiving environmental audio data, environmental image data, or both. Receiving the environmental audio data further includes receiving additional audio data that includes background noise.
  • The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other potential features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
  • DESCRIPTION OF DRAWINGS
  • FIG. 1 depicts an example system for identifying content item data based on environmental audio data and a spoken natural language query.
  • FIG. 2 depicts a flowchart for an example process for identifying content item data based on environmental audio data and a spoken natural language query.
  • FIGS. 3A-3B depict portions of an example system for identifying content items.
  • FIG. 4 depicts an example system for identifying media content items based on environmental image data and a spoken natural language query.
  • FIG. 5 depicts a system for identifying one or more results based on environmental audio data and an utterance.
  • FIG. 6 depicts a flowchart for an example process for identifying one or more results based on environmental data and an utterance.
  • FIG. 7 depicts a computer device and a mobile computer device that may be used to implement the techniques described here.
  • Like reference symbols in the various drawings indicate like elements.
  • DETAILED DESCRIPTION
  • A computing environment that answers spoken natural language queries using environmental information as context may process queries using multiple processes. In an example of some processes, illustrated in FIGS. 1 to 4, the computing environment can identify media content based on environmental information, such as ambient noises. In an example of other processes, illustrated in FIGS. 5 and 6, a computing environment can augment the spoken natural language query with context that is derived from the environmental information, such as data that identifies media content, in order to provide a more satisfying answer to the spoken natural language query.
  • In more detail, FIG. 1 depicts a system 100 for identifying content item data based on environmental audio data and a spoken natural language query. Briefly, the system 100 can identify content item data that is based on the environmental audio data and that matches a particular content type associated with the spoken natural language query. The system 100 includes a mobile computing device 102, a disambiguation engine 104, a speech recognition engine 106, a keyword mapping engine 108, and a content recognition engine 110. The mobile computing device 102 is in communication with the disambiguation engine 104 over one or more networks. The mobile computing device 102 can include a microphone, a camera, or other detection mechanisms for detecting utterances from a user 112 and/or environmental data associated with the user 112.
  • In some examples, the user 112 is watching a television program. In the illustrated example, the user 112 would like to know who directed the television program that is currently playing. In some examples, the user 112 may not know the name of the television program that is currently playing, and may therefore ask the question “Who directed this show?” The mobile computing device 102 detects this utterance, as well as environmental audio data associated with the environment of the user 112.
  • In some examples, the environmental audio data associated with the environment of the user 112 can include background noise of the environment of the user 112. For example, the environmental audio data includes the sounds of the television program. In some examples, the environmental audio data that is associated with the currently displayed television program can include audio of the currently displayed television program (e.g., dialogue of the currently displayed television program, soundtrack audio associated with the currently displayed television program, etc.).
  • In some examples, the mobile computing device 102 detects the environmental audio data after detecting the utterance; detects the environmental audio data concurrently with detecting the utterance; or both. The mobile computing device 102 processes the detected utterance and the environmental audio data to generate waveform data 114 that represents the detected utterance and the environmental audio data, and transmits the waveform data 114 to the disambiguation engine 104 (e.g., over a network), during operation (A). In some examples, the environmental audio data is streamed from the mobile computing device 102.
  • The disambiguation engine 104 receives the waveform data 114 from the mobile computing device 102. The disambiguation engine 104 processes the waveform data 114, including separating (or extracting) the utterance from other portions of the waveform data 114 and transmits the utterance to the speech recognition engine 106 (e.g., over a network), during operation (B). For example, the disambiguation engine 104 separates the utterance (“Who directed this show?”) from the background noise of the environment of the user 112 (e.g., audio of the currently displayed television program).
  • In some examples, the disambiguation engine 104 utilizes a voice detector to facilitate separation of the utterance from the background noise by identifying a portion of the waveform data 114 that includes voice activity, or voice activity associated with the user of the computing device 102. In some examples, the utterance relates to a query (e.g., a query relating to the currently displayed television program). In some examples, the waveform data 114 includes the detected utterance. In response, the disambiguation engine 104 can request the environmental audio data from the mobile computing device 102 relating to the utterance.
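  • As a rough illustration of separating voice activity from background noise, the sketch below labels fixed-length frames by their mean energy. The voice detector itself is not specified here, so energy thresholding is a deliberately simple stand-in; the function name, frame length, and threshold are assumptions. A practical detector would need to distinguish the user's speech from speech in the television audio, which this sketch does not attempt.

```python
from typing import List, Tuple


def split_voice(
    samples: List[float],
    frame_len: int = 4,      # hypothetical frame length, in samples
    threshold: float = 0.1,  # hypothetical mean-energy threshold
) -> Tuple[List[float], List[float]]:
    """Split samples into (voice, background) using per-frame mean energy."""
    voice: List[float] = []
    background: List[float] = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        # Mean of squared amplitudes; the last frame may be shorter than frame_len.
        energy = sum(s * s for s in frame) / len(frame)
        (voice if energy >= threshold else background).extend(frame)
    return voice, background


# Quiet frame, loud frame, quiet frame.
waveform = [0.01] * 4 + [0.9] * 4 + [0.02] * 4
utterance, environment = split_voice(waveform)
print(len(utterance), len(environment))
```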
  • The speech recognition engine 106 receives the portion of the waveform data 114 that corresponds to the utterance from the disambiguation engine 104. The speech recognition engine 106 obtains a transcription of the utterance and provides the transcription to the keyword mapping engine 108, during operation (C). Specifically, the speech recognition engine 106 processes the utterance received from the disambiguation engine 104. In some examples, processing of the utterance by the speech recognition engine 106 includes generating a transcription of the utterance. Generating the transcription of the utterance can include transcribing the utterance into text or text-related data. In other words, the speech recognition engine 106 can provide a written-language representation of the utterance.
  • For example, the speech recognition engine 106 transcribes the utterance to generate the transcription of “Who directed this show?” In some embodiments, the speech recognition engine 106 provides two or more transcriptions of the utterance. For example, the speech recognition engine 106 transcribes the utterance to generate the transcriptions of “Who directed this show?” and “Who directed this shoe?”
  • The keyword mapping engine 108 receives the transcription from the speech recognition engine 106. The keyword mapping engine 108 identifies one or more keywords in the transcription that are associated with a particular content type and provides the particular content type to the disambiguation engine 104, during operation (D). In some embodiments, the one or more content types can include ‘movie’, ‘music’, ‘television show’, ‘audio podcast’, ‘image’, ‘artwork’, ‘book’, ‘magazine’, ‘trailer’, ‘video podcast’, ‘Internet video’, or ‘video game’.
  • For example, the keyword mapping engine 108 identifies the keyword “directed” from the transcription of “Who directed this show?” The keyword “directed” is associated with the ‘television show’ content type. In some embodiments, a keyword of the transcription that is identified by the keyword mapping engine 108 is associated with two or more content types. For example, the keyword “directed” is associated with the ‘television show’ and ‘movie’ content types.
  • In some embodiments, the keyword mapping engine 108 identifies two or more keywords in the transcription that are associated with a particular content type. For example, the keyword mapping engine 108 identifies the keywords “directed” and “show” that are associated with a particular content type. In some embodiments, the identified two or more keywords are associated with the same content type. For example, the identified keywords “directed” and “show” are both associated with the ‘television show’ content type. In some embodiments, the identified two or more keywords are associated with differing content types. For example, the identified keyword “directed” is associated with the ‘movie’ content type and the identified keyword “show” is associated with the ‘television show’ content type. The keyword mapping engine 108 transmits (e.g., over a network) the particular content type to the disambiguation engine 104.
  • In some embodiments, the keyword mapping engine 108 identifies the one or more keywords in the transcription that are associated with a particular content type using one or more databases that, for each of multiple content types, map at least one of the keywords to at least one of the multiple content types. Specifically, the keyword mapping engine 108 includes (or is in communication with) a database (or multiple databases). The database includes, or is associated with, a mapping between keywords and content types. Specifically, the database provides a connection (e.g., mapping) between the keywords and the content types such that the keyword mapping engine 108 is able to identify one or more keywords in the transcription that are associated with particular content types.
  • In some embodiments, one or more of the mappings between the keywords and the content types can include a unidirectional (e.g., one-way) mapping (i.e., a mapping from the keywords to the content types). In some embodiments, one or more of the mappings between the keywords and the content types can include a bidirectional (e.g., two-way) mapping (i.e., a mapping from the keywords to the content types and from the content types to the keywords). In some embodiments, the one or more databases map one or more of the keywords to two or more content types.
  • For example, the keyword mapping engine 108 uses the one or more databases that map the keyword “directed” to the ‘movie’ and ‘television show’ content types. In some embodiments, the mapping between the keywords and the content types can include mappings between multiple, varying versions of a root keyword (e.g., the word family) and the content types. The differing versions of the keyword can include differing grammatical categories such as tense (e.g., past, present, future) and word class (e.g., noun, verb). For example, the database can include mappings of the word family of the root word “direct” such as “directors,” “direction,” and “directed” to the one or more content types.
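  • The keyword-to-content-type mapping can be illustrated with an in-memory dictionary standing in for the one or more databases. The entries below come from the examples in the text (the word family of “direct”, and “show”); the function names are hypothetical, and the vote-counting rule used to pick a single content type when keywords disagree is an assumption of this sketch.

```python
import re
from collections import Counter
from typing import Dict, Optional, Set

# One-way mapping: keyword -> content types (stand-in for the database).
KEYWORD_TO_CONTENT_TYPES: Dict[str, Set[str]] = {
    "directed": {"movie", "television show"},
    "directors": {"movie", "television show"},
    "direction": {"movie", "television show"},
    "show": {"television show"},
}


def keywords_in(transcription: str) -> Dict[str, Set[str]]:
    """Return each mapped keyword found in the transcription with its content types."""
    words = re.findall(r"[a-z']+", transcription.lower())
    return {w: KEYWORD_TO_CONTENT_TYPES[w] for w in words if w in KEYWORD_TO_CONTENT_TYPES}


def most_supported_type(transcription: str) -> Optional[str]:
    """Pick the content type that the most keywords map to (assumed tie-break rule)."""
    votes: Counter = Counter()
    for content_types in keywords_in(transcription).values():
        votes.update(content_types)
    return votes.most_common(1)[0][0] if votes else None


def invert(mapping: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Derive the reverse direction (content type -> keywords) for a two-way mapping."""
    inverted: Dict[str, Set[str]] = {}
    for keyword, content_types in mapping.items():
        for content_type in content_types:
            inverted.setdefault(content_type, set()).add(keyword)
    return inverted


print(most_supported_type("Who directed this show?"))
```

  • In this sketch “directed” contributes to both ‘movie’ and ‘television show’, while “show” contributes only to ‘television show’, so the latter is selected.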
  • The disambiguation engine 104 receives data identifying the particular content type associated with the transcription of the utterance from the keyword mapping engine 108. Furthermore, as mentioned above, the disambiguation engine 104 receives the waveform data 114 from the mobile computing device 102 that includes the environmental audio data associated with the utterance. The disambiguation engine 104 then provides the environmental audio data and the particular content type to the content recognition engine 110, during operation (E).
  • For example, the disambiguation engine 104 transmits the environmental audio data relating to the currently displayed television program that includes audio of the currently displayed television program (e.g., dialogue of the currently displayed television program, soundtrack audio associated with the currently displayed television program, etc.) and the particular content type of the transcription of the utterance (e.g., ‘television show’ content type) to the content recognition engine 110.
  • In some embodiments, the disambiguation engine 104 provides a portion of the environmental audio data to the content recognition engine 110. In some examples, the portion of the environmental audio data can include background noise detected by the mobile computing device 102 after detecting the utterance. In some examples, the portion of the environmental audio data can include background noise detected by the mobile computing device 102 concurrently with detecting the utterance.
  • In some embodiments, the background noise (of the waveform data 114) is associated with a particular content type that is associated with a keyword of the transcription. For example, the keyword “directed” of the transcription “Who directed this show?” is associated with the ‘television show’ content type, and the background noise (e.g., the environmental audio data relating to the currently displayed television program) is also associated with the ‘television show’ content type.
  • The content recognition engine 110 receives the environmental audio data and the particular content type from the disambiguation engine 104. The content recognition engine 110 identifies content item data that is based on the environmental audio data and that matches the particular content type and provides the content item data to the disambiguation engine 104, during operation (F). Specifically, the content recognition engine 110 appropriately processes the environmental audio data to identify content item data that is associated with the environmental audio data (e.g., a name of a television show, a name of a song, etc.). Additionally, the content recognition engine 110 matches the identified content item data with the particular content type (e.g., content type of the transcription of the utterance). The content recognition engine 110 transmits (e.g., over a network) the identified content item data to the disambiguation engine 104.
  • For example, the content recognition engine 110 identifies content item data that is based on the environmental audio data relating to the currently displayed television program, and further that matches the ‘television show’ content type. To that end, the content recognition engine 110 can identify content item data based on dialogue of the currently displayed television program, or soundtrack audio associated with the currently displayed television program, depending on the portion of the environmental audio data received by the content recognition engine 110.
  • In some embodiments, the content recognition engine 110 is an audio fingerprinting engine that utilizes content fingerprinting using wavelets to identify the content item data. Specifically, the content recognition engine 110 converts the waveform data 114 into a spectrogram. From the spectrogram, the content recognition engine 110 extracts spectral images. The spectral images can be represented as wavelets. For each of the spectral images that are extracted from the spectrogram, the content recognition engine 110 extracts the “top” wavelets based on the respective magnitudes of the wavelets. For each spectral image, the content recognition engine 110 computes a wavelet signature of the image. In some examples, the wavelet signature is a truncated, quantized version of the wavelet decomposition of the image.
  • For example, to describe an m×n image with wavelets, m×n wavelets are returned without compression. Additionally, the content recognition engine 110 utilizes a subset of the wavelets that most characterize the song. Specifically, the t “top” wavelets (by magnitude) are selected, where t<<m×n. Furthermore, the content recognition engine 110 creates a compact representation of the sparse wavelet-vector described above, for example, using MinHash to compute sub-fingerprints for these sparse bit vectors.
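The fingerprinting steps above (wavelet decomposition, selection of the t largest-magnitude wavelets, and MinHash over the resulting sparse bit vector) can be sketched as follows. This is a minimal sketch, not the engine's implementation: it assumes a Haar wavelet basis, a two-bits-per-coefficient sign encoding, and seeded linear hash functions, none of which are specified in the text, and all function names are hypothetical.

```python
import random


def haar_1d(values):
    """Full 1-D Haar decomposition of a list whose length is a power of two."""
    output = list(values)
    length = len(output)
    while length > 1:
        half = length // 2
        averages = [(output[2 * i] + output[2 * i + 1]) / 2 for i in range(half)]
        details = [(output[2 * i] - output[2 * i + 1]) / 2 for i in range(half)]
        output[:length] = averages + details
        length = half
    return output


def haar_2d(image):
    """Transform every row, then every column: m x n pixels give m x n coefficients."""
    rows = [haar_1d(row) for row in image]
    columns = [haar_1d(list(column)) for column in zip(*rows)]
    return [list(row) for row in zip(*columns)]


def top_wavelet_bits(coefficients, t):
    """Keep the t largest-magnitude coefficients; store only position and sign.

    Coefficient i contributes bit 2*i if positive or bit 2*i + 1 if negative.
    The result is the set of indices of the bits that are set.
    """
    flat = [value for row in coefficients for value in row]
    ranked = sorted(range(len(flat)), key=lambda i: abs(flat[i]), reverse=True)
    return {2 * i + (0 if flat[i] > 0 else 1) for i in ranked[:t] if flat[i] != 0}


def minhash(bits, num_hashes=8, seed=0, prime=2_147_483_647):
    """MinHash of a non-empty set of bit indices using seeded linear hash functions."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(num_hashes)]
    return [min((a * bit + b) % prime for bit in bits) for a, b in params]


# Toy 4 x 4 "spectral image"; real spectral images would come from a spectrogram.
spectral_image = [
    [1.0, 1.0, 5.0, 5.0],
    [1.0, 1.0, 5.0, 5.0],
    [2.0, 2.0, 2.0, 2.0],
    [2.0, 2.0, 2.0, 2.0],
]
coefficients = haar_2d(spectral_image)
bits = top_wavelet_bits(coefficients, t=3)  # t is much smaller than m x n
sub_fingerprint = minhash(bits)
```

Because only the positions and signs of the top wavelets are kept, two noisy recordings of the same audio tend to share most set bits, and MinHash signatures of similar bit sets agree in many positions, which is what makes the compact representation usable for lookup.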
  • In some examples, when the environmental audio data includes at least the soundtrack audio associated with the currently displayed television program, the content recognition engine 110 identifies content item data that is based on the soundtrack audio associated with the currently displayed television program and that also matches the ‘television show’ content type. Thus, in some examples, the content recognition engine 110 identifies content item data relating to a name of the currently displayed television program. For example, the content recognition engine 110 can determine that a particular content item (e.g., a specific television show) is associated with a theme song (e.g., the soundtrack audio), and that the particular content item (e.g., the specific television show) matches the particular content type (e.g., ‘television show’ content type). Thus, the content recognition engine 110 can identify data (e.g., the name of the specific television show) that relates to the particular content item (e.g., the currently displayed television program) that is based on the environmental audio data (e.g., the soundtrack audio), and further that matches the particular content type (e.g., ‘television show’ content type).
  • The disambiguation engine 104 receives the identified content item data from the content recognition engine 110. The disambiguation engine 104 then provides the identified content item data to the mobile computing device 102, at operation (G). For example, the disambiguation engine 104 transmits the identified content item data relating to the currently displayed television program (e.g., a name of the currently displayed television program) to the mobile computing device 102.
  • In some examples, one or more of the mobile computing device 102, the disambiguation engine 104, the speech recognition engine 106, the keyword mapping engine 108, and the content recognition engine 110 can be in communication with a subset (or each) of the mobile computing device 102, the disambiguation engine 104, the speech recognition engine 106, the keyword mapping engine 108, and the content recognition engine 110. In some embodiments, one or more of the disambiguation engine 104, the speech recognition engine 106, the keyword mapping engine 108, and the content recognition engine 110 can be implemented using one or more computing devices, such as one or more computing servers, a distributed computing system, or a server farm or cluster.
  • In some embodiments, as mentioned above, the environmental audio data is streamed from the mobile computing device 102 to the disambiguation engine 104. When the environmental audio data is streamed, the above-mentioned process (e.g., operations (A)-(H)) is performed as the environmental audio data is received by the disambiguation engine 104 (i.e., performed incrementally). In other words, as each portion of the environmental audio data is received by (e.g., streamed to) the disambiguation engine 104, operations (A)-(H) are performed iteratively until content item data is identified.
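The incremental behavior described above can be illustrated with a loop that reruns recognition each time a new portion of audio arrives and stops as soon as a content item is identified. This is a sketch under assumptions: `identify_incrementally`, the `recognize` callable, and `toy_recognizer` are hypothetical stand-ins for the full sequence of operations.

```python
def identify_incrementally(audio_chunks, recognize):
    """Rerun recognition on a growing buffer until a content item is identified.

    `audio_chunks` is any iterable of byte strings (the streamed portions);
    `recognize` is a hypothetical callable returning an item name or None.
    Returns (result, number of chunks consumed).
    """
    buffered = bytearray()
    chunks_seen = 0
    for chunk in audio_chunks:
        chunks_seen += 1
        buffered.extend(chunk)
        result = recognize(bytes(buffered))
        if result is not None:
            return result, chunks_seen  # stop early once identified
    return None, chunks_seen


def toy_recognizer(audio):
    """Stand-in recognizer that 'succeeds' once at least 6 bytes are buffered."""
    return "Example Show" if len(audio) >= 6 else None
```

With four two-byte chunks and the toy recognizer, the loop stops after the third chunk and never reads the fourth.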
  • FIG. 2 depicts a flowchart of an example process 200 for identifying content item data based on environmental audio data and a spoken natural language query. The example process 200 can be executed using one or more computing devices. For example, the mobile computing device 102, the disambiguation engine 104, the speech recognition engine 106, the keyword mapping engine 108, and/or the content recognition engine 110 can be used to execute the example process 200.
  • Audio data that encodes a spoken natural language query and environmental audio data is received (202). For example, the disambiguation engine 104 receives the waveform data 114 from the mobile computing device 102. The waveform data 114 includes the spoken natural language query of the user (e.g., “Who directed this show?”) and the environmental audio data (e.g., audio of the currently displayed television program). The disambiguation engine 104 separates the spoken natural language query (“Who directed this show?”) from the background noise of the environment of the user 112 (e.g., audio of the currently displayed television program).
  • A transcription of the natural language query is obtained (204). For example, the speech recognition engine 106 transcribes the natural language query to generate a transcription of the natural language query (e.g., “Who directed this show?”).
  • A particular content type that is associated with one or more keywords in the transcription is determined (206). For example, the keyword mapping engine 108 identifies one or more keywords (e.g., “directed”) in the transcription (e.g., “Who directed this show?”) that are associated with a particular content type (e.g., ‘television show’ content type). In some embodiments, the keyword mapping engine 108 determines the particular content type that is associated with one or more keywords in the transcription using one or more databases that, for each of multiple content types, maps at least one of the keywords to at least one of the multiple content types. The database provides a connection (e.g., mapping) between the keywords (e.g., “directed”) and the content types (e.g., ‘television show’ content type).
  • At least a portion of the environmental audio data is provided to a content recognition engine (208). For example, the disambiguation engine 104 provides at least the portion of the environmental audio data encoded by the waveform data 114 (e.g., audio of the currently displayed television program) to the content recognition engine 110. In some examples, the disambiguation engine 104 also provides the particular content type (e.g., ‘television show’ content type) that is associated with the one or more keywords (e.g., “directed”) in the transcription to the content recognition engine 110.
  • A content item is identified that is output by the content recognition engine, and that matches the particular content type (210). For example, the content recognition engine 110 identifies a content item or content item data that is based on the environmental audio data (e.g., audio of the currently displayed television program) and that matches the particular content type (e.g., ‘television show’ content type).
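Steps 202 through 210 of the example process 200 can be sketched as a single function that wires the engines together. In this sketch each engine is passed in as a hypothetical callable, the `Candidate` type and all names are assumptions made for illustration, and the content type is applied as a filter after recognition rather than being passed to the recognizer.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    content_type: str


def process_200(audio, split_audio, transcribe, map_content_type, recognize):
    """Steps 202-210, with each engine supplied as a hypothetical callable."""
    query_audio, environmental_audio = split_audio(audio)  # 202: receive and separate
    transcription = transcribe(query_audio)                # 204: obtain transcription
    content_type = map_content_type(transcription)         # 206: determine content type
    candidates = recognize(environmental_audio)            # 208: recognize content
    for candidate in candidates:                           # 210: match content type
        if candidate.content_type == content_type:
            return candidate
    return None


def demo():
    """Run the pipeline with trivial stand-ins for every engine."""
    return process_200(
        audio=("Who directed this show?", "theme-song-audio"),
        split_audio=lambda audio: audio,
        transcribe=lambda query_audio: query_audio,
        map_content_type=lambda text: "television show" if "directed" in text else None,
        recognize=lambda env: [
            Candidate("Example Theme", "music"),
            Candidate("Example Show", "television show"),
        ],
    )
```

Here `demo()` selects the candidate whose content type is ‘television show’, because the keyword “directed” maps to that type in the stand-in mapping.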
  • FIGS. 3A and 3B depict portions 300 a and 300 b, respectively, of a system for identifying content item data. Specifically, FIGS. 3A and 3B include disambiguation engines 304 a and 304 b, respectively; and include content recognition engines 310 a and 310 b, respectively. The disambiguation engines 304 a and 304 b are similar to the disambiguation engine 104 of system 100 depicted in FIG. 1; and the content recognition engines 310 a and 310 b are similar to the content recognition engine 110 of system 100 depicted in FIG. 1.
  • FIG. 3A depicts the portion 300 a including the content recognition engine 310 a. The content recognition engine 310 a is able to identify content item data based on environmental data and that matches a particular content type. In other words, the content recognition engine 310 a is able to appropriately process the environmental data to identify content item data based on the environmental data, and further select one or more of the identified content item data such that the selected content item data matches the particular content type.
  • Specifically, the disambiguation engine 304 a provides the environmental data and the particular content type to the content recognition engine 310 a, during operation (A). In some embodiments, the disambiguation engine 304 a provides a portion of the environmental data to the content recognition engine 310 a.
  • The content recognition engine 310 a receives the environmental data and the particular content type from the disambiguation engine 304 a. The content recognition engine 310 a then identifies content item data that is based on the environmental data and that matches the particular content type and provides the identified content item data to the disambiguation engine 304 a, during operation (B). Specifically, the content recognition engine 310 a identifies content item data (e.g., a name of a television show, a name of a song, etc.) that is based on the environmental data. The content recognition engine 310 a then selects one or more of the identified content item data that matches the particular content type. In other words, the content recognition engine 310 a filters the identified content item data based on the particular content type. The content recognition engine 310 a transmits (e.g., over a network) the identified content item data to the disambiguation engine 304 a.
  • In some examples, when the environmental data includes at least soundtrack audio associated with a currently displayed television program, as mentioned above with respect to FIG. 1, the content recognition engine 310 a identifies content item data that is based on the soundtrack audio associated with the currently displayed television program. The content recognition engine 310 a then filters the identified content item data based on the ‘television show’ content type. For example, the content recognition engine 310 a identifies a ‘theme song name’ and a ‘TV show name’ associated with the soundtrack audio. The content recognition engine 310 a then filters the identified content item data such that the identified content item data also matches the ‘television show’ content type. For example, the content recognition engine 310 a selects the ‘TV show name’ identifying data, and transmits the ‘TV show name’ identifying data to the disambiguation engine 304 a.
  • In some examples, the content recognition engine 310 a selects a corpus (or index) based on the content type (e.g., ‘television show’ content type). Specifically, the content recognition engine 310 a can have access to a first index relating to the ‘television show’ content type and a second index relating to a ‘movie’ content type. The content recognition engine 310 a appropriately selects the first index based on the ‘television show’ content type. Thus, by selecting the first index (and not selecting the second index), the content recognition engine 310 a can more efficiently identify the content item data (e.g., a name of the television show).
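The two recognition-side strategies described for FIG. 3A, filtering identified content item data by content type and selecting a per-content-type index before searching, can be sketched as follows. The index contents, fingerprint keys, and function names are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical per-content-type indexes mapping a fingerprint key to an item name.
INDEXES = {
    "television show": {"fp-123": "Example Show"},
    "movie": {"fp-456": "Example Movie"},
}


def lookup(fingerprint, content_type, indexes=INDEXES):
    """Search only the index selected by the content type; None if no match."""
    index = indexes.get(content_type, {})
    return index.get(fingerprint)


def filter_by_content_type(identified, content_type):
    """Keep only identified (label, value) pairs whose label matches the content type.

    `identified` maps a content type label to the identified data.
    """
    return {label: value for label, value in identified.items() if label == content_type}
```

Selecting the index first avoids searching corpora that cannot contain a match, while filtering afterwards is the simpler option when a single shared index returns items of several content types.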
  • The disambiguation engine 304 a receives the content item data from the content recognition engine 310 a. For example, the disambiguation engine 304 a receives the ‘TV show name’ identifying data from the content recognition engine 310 a. The disambiguation engine 304 a then provides the identifying data to a third party (e.g., the mobile computing device 102 of FIG. 1), during operation (C). For example, the disambiguation engine 304 a provides the ‘TV show name’ identifying data to the third party.
  • FIG. 3B depicts the portion 300 b including the content recognition engine 310 b. The content recognition engine 310 b is able to identify content item data based on environmental data. In other words, the content recognition engine 310 b is able to appropriately process the environmental data to identify content item data based on the environmental data, and provide the content item data to the disambiguation engine 304 b. The disambiguation engine 304 b selects one or more of the identified content item data such that the selected content item data matches the particular content type.
  • Specifically, the disambiguation engine 304 b provides the environmental data to the content recognition engine 310 b, during operation (A). In some embodiments, the disambiguation engine 304 b provides a portion of the environmental data to the content recognition engine 310 b.
  • The content recognition engine 310 b receives the environmental data from the disambiguation engine 304 b. The content recognition engine 310 b then identifies content item data that is based on the environmental data and provides the identified content item data to the disambiguation engine 304 b, during operation (B). Specifically, the content recognition engine 310 b identifies content item data associated with two or more content items (e.g., a name of a television show, a name of a song, etc.) that is based on the environmental data. The content recognition engine 310 b transmits (e.g., over a network) two or more candidates representing the identified content item data to the disambiguation engine 304 b.
  • In some examples, when the environmental data includes at least soundtrack audio associated with a currently displayed television program, as mentioned above with respect to FIG. 1, the content recognition engine 310 b identifies content item data relating to two or more content items that is based on the soundtrack audio associated with the currently displayed television program. For example, the content recognition engine 310 b identifies a ‘theme song name’ and a ‘TV show name’ associated with the soundtrack audio, and transmits the ‘theme song name’ and ‘TV show name’ identifying data to the disambiguation engine 304 b.
  • The disambiguation engine 304 b receives the two or more candidates from the content recognition engine 310 b. For example, the disambiguation engine 304 b receives the ‘theme song name’ and ‘TV show name’ candidates from the content recognition engine 310 b. The disambiguation engine 304 b then selects one of the two or more candidates based on a particular content type and provides the selected candidate to a third party (e.g., the mobile computing device 102 of FIG. 1), during operation (C). Specifically, the disambiguation engine 304 b previously receives the particular content type (e.g., that is associated with an utterance), as described above with respect to FIG. 1. The disambiguation engine 304 b selects a particular candidate of the two or more candidates based on the particular content type. Specifically, the disambiguation engine 304 b selects the particular candidate of the two or more candidates that matches the particular content type. For example, the disambiguation engine 304 b selects the ‘TV show name’ candidate as the ‘TV show name’ candidate matches the ‘television show’ content type.
  • In some embodiments, the two or more candidates from the content recognition engine 310 b are associated with a ranking score. The ranking score can be associated with any scoring metric as determined by the disambiguation engine 304 b. The disambiguation engine 304 b can further adjust the ranking score of two or more candidates based on the particular content type. Specifically, the disambiguation engine 304 b can increase the ranking score of one or more of the candidates when the respective candidates are matched to the particular content type. For example, the ranking score of the candidate ‘TV show name’ can be increased as it matches the ‘television show’ content type. Furthermore, the disambiguation engine 304 b can decrease the ranking score of one or more of the candidates when the respective candidates are not matched to the particular content type. For example, the ranking score of the candidate ‘theme song name’ can be decreased as it does not match the ‘television show’ content type.
  • In some embodiments, the two or more candidates can be ranked based on the respective adjusted ranking scores by the disambiguation engine 304 b. For example, the disambiguation engine 304 b can rank the ‘TV show name’ candidate above the ‘theme song name’ candidate as the ‘TV show name’ candidate has a higher adjusted ranking score as compared to the adjusted ranking score of the ‘theme song name’ candidate. In some examples, the disambiguation engine 304 b selects the candidate ranked highest (i.e., has the highest adjusted ranking score).
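The score adjustment and ranking described above can be sketched as a small re-ranking function. The additive `boost` and `penalty` values are assumptions chosen for illustration; the text does not specify how much a score is increased or decreased, and the function name is hypothetical.

```python
def rerank(candidates, content_type, boost=0.2, penalty=0.2):
    """Adjust each candidate's score by content-type match, then sort descending.

    `candidates` is a list of (name, candidate_content_type, score) tuples.
    Matching candidates gain `boost`; non-matching candidates lose `penalty`.
    """
    adjusted = []
    for name, candidate_type, score in candidates:
        delta = boost if candidate_type == content_type else -penalty
        adjusted.append((name, candidate_type, score + delta))
    return sorted(adjusted, key=lambda candidate: candidate[2], reverse=True)


candidates = [
    ("Example Theme", "music", 0.6),
    ("Example Show", "television show", 0.5),
]
ranked = rerank(candidates, "television show")
selected = ranked[0]  # the highest adjusted ranking score
```

In this toy input the theme song starts with the higher score, but after adjustment the television show candidate ranks first, which corresponds to the selection behavior described for the disambiguation engine 304 b.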
  • FIG. 4 depicts a system 400 for identifying content item data based on environmental image data and a spoken natural language query. In short, the system 400 can identify content item data that is based on the environmental image data and that matches a particular content type associated with the spoken natural language query. The system 400 includes a mobile computing device 402, a disambiguation engine 404, a speech recognition engine 406, a keyword mapping engine 408, and a content recognition engine 410, analogous to that of the mobile computing device 102, the disambiguation engine 104, the speech recognition engine 106, the keyword mapping engine 108, and the content recognition engine 110, respectively, of system 100 illustrated in FIG. 1.
  • In some examples, the user 112 is looking at a CD album cover of a soundtrack of a movie. In the illustrated example, the user 112 would like to know what songs are on the soundtrack. In some examples, the user 112 may not know the name of the movie soundtrack, and may therefore ask the question “What songs are on this?” or “What songs play in this movie?” The mobile computing device 402 detects this utterance, as well as environmental image data associated with the environment of the user 112.
  • In some examples, the environmental image data associated with the environment of the user 112 can include image data of the environment of the user 112. For example, the environmental image data includes an image of the CD album cover that depicts images related to the movie (e.g., an image of a movie poster of the associated movie). In some examples, the mobile computing device 402 detects the environmental image data utilizing a camera of the mobile computing device 402 that captures an image (or video) of the CD album cover.
  • The mobile computing device 402 processes the detected utterance to generate waveform data 414 that represents the detected utterance and transmits the waveform data 414 and the environmental image data to the disambiguation engine 404 (e.g., over a network), during operation (A).
  • The disambiguation engine 404 receives the waveform data 414 and the environmental image data from the mobile computing device 402. The disambiguation engine 404 processes the waveform data 414 and transmits the utterance to the speech recognition engine 406 (e.g., over a network), during operation (B). In some examples, the utterance relates to a query (e.g., a query relating to the movie soundtrack).
  • The speech recognition engine 406 receives the utterance from the disambiguation engine 404. The speech recognition engine 406 obtains a transcription of the utterance and provides the transcription to the keyword mapping engine 408, during operation (C). Specifically, the speech recognition engine 406 processes the utterance received from the disambiguation engine 404 by generating a transcription of the utterance.
  • For example, the speech recognition engine 406 transcribes the utterance to generate the transcription of “What songs are on this?” In some embodiments, the speech recognition engine 406 provides two or more transcriptions of the utterance. For example, the speech recognition engine 406 transcribes the utterance to generate the transcriptions of “What songs are on this?” and “What sinks are on this?”
  • The keyword mapping engine 408 receives the transcription from the speech recognition engine 406. The keyword mapping engine 408 identifies one or more keywords in the transcription that are associated with a particular content type and provides the particular content type to the disambiguation engine 404, during operation (D).
  • For example, the keyword mapping engine 408 identifies the keyword “songs” from the transcription of “What songs are on this?” The keyword “songs” is associated with the ‘music’ content type. In some embodiments, a keyword of the transcription that is identified by the keyword mapping engine 408 is associated with two or more content types. For example, the keyword “songs” is associated with the ‘music’ and ‘singer’ content types. The keyword mapping engine 408 transmits (e.g., over a network) the particular content type to the disambiguation engine 404.
  • In some embodiments, analogous to that mentioned above, the keyword mapping engine 408 identifies the one or more keywords in the transcription that are associated with a particular content type using one or more databases that, for each of multiple content types, maps at least one of the keywords to at least one of the multiple content types. For example, the keyword mapping engine 408 uses the one or more databases that maps the keyword “songs” to the ‘music’ and ‘singer’ content types.
  • The disambiguation engine 404 receives the particular content type associated with the transcription of the utterance from the keyword mapping engine 408. Furthermore, as mentioned above, the disambiguation engine 404 receives the environmental image data associated with the utterance. The disambiguation engine 404 then provides the environmental image data and the particular content type to the content recognition engine 410, during operation (E).
  • For example, the disambiguation engine 404 transmits the environmental image data relating to the movie soundtrack (e.g., an image of the movie poster CD album cover) and the particular content type of the transcription of the utterance (e.g., ‘music’ content type) to the content recognition engine 410.
  • The content recognition engine 410 receives the environmental image data and the particular content type from the disambiguation engine 404. The content recognition engine 410 then identifies content item data that is based on the environmental image data and that matches the particular content type and provides the identified content item data to the disambiguation engine 404, during operation (F). Specifically, the content recognition engine 410 appropriately processes the environmental image data to identify content item data (e.g., a name of a content item). Additionally, the content recognition engine 410 matches the identified content item with the particular content type (e.g., content type of the transcription of the utterance). The content recognition engine 410 transmits (e.g., over a network) the identified content item data to the disambiguation engine 404.
  • For example, the content recognition engine 410 identifies data that is based on the environmental image data relating to the image of the movie poster CD album cover, and further that matches the ‘music’ content type.
  • In some examples, when the environmental image data includes at least the movie poster image associated with the CD album cover, the content recognition engine 410 identifies content item data that is based on the movie poster associated with the CD album cover and that also matches the ‘music’ content type. Thus, in some examples, the content recognition engine 410 identifies content item data relating to a name of the movie soundtrack. For example, the content recognition engine 410 can determine that a particular content item (e.g., a specific movie soundtrack) is associated with a movie poster, and that the particular content item (e.g., the specific movie soundtrack) matches the particular content type (e.g., ‘music’ content type). Thus, the content recognition engine 410 can identify data (e.g., the name of the specific movie soundtrack) that relates to the particular content item (e.g., the specific movie soundtrack) that is based on the environmental image data (e.g., the image of the CD album cover), and further that matches the particular content type (e.g., ‘music’ content type).
  • The disambiguation engine 404 receives the identified content item data from the content recognition engine 410. The disambiguation engine 404 then provides the identified content item data to the mobile computing device 402, at operation (G). For example, the disambiguation engine 404 transmits the identified content item data relating to the movie soundtrack (e.g., a name of the movie soundtrack) to the mobile computing device 402.
  • As noted above, FIGS. 1 to 4 illustrate several example processes in which the computing environment can identify media content (or other content) based on environmental information, such as ambient noises. Other processes for identifying content can also be used. Generally, FIGS. 5 and 6 illustrate other example processes in which a computing environment can augment the spoken natural language query with context that is derived from the environmental information, such as data that identifies media content, in order to provide a more satisfying answer to the spoken natural language query.
  • In more detail, FIG. 5 depicts a system 500 for identifying one or more results based on environmental audio data and an utterance. In some examples, the one or more results can represent one or more answers to a natural language query. The system 500 includes a mobile computing device 502, a coordination engine 504, a speech recognition engine 506, a content identification engine 508, and a natural language query processing engine 510. The mobile computing device 502 is in communication with the coordination engine 504 over one or more networks. The mobile computing device 502 can include a microphone, a camera, or other detection mechanisms for detecting utterances from a user 512 and/or environmental data associated with the user 512.
  • Similar to the system 100 of FIG. 1, the user 512 is watching a television program. In the illustrated example, the user 512 would like to know who directed the television program (e.g., an entity) that is currently playing. In some examples, the user 512 may not know the name of the television program that is currently playing, and may therefore ask the question “Who directed this show?” The mobile computing device 502 detects this utterance, as well as environmental data associated with the environment of the user 512.
  • In some examples, the environmental data associated with the environment of the user 512 can include background noise of the environment of the user 512. For example, the environmental data includes the sounds of the television program (e.g., an entity). In some examples, the environmental data that is associated with the currently displayed television program can include audio of the currently displayed television program (e.g., dialogue of the currently displayed television program, soundtrack audio associated with the currently displayed television program, etc.). In some examples, the environmental data can include environmental audio data, environmental image data, or both. In some examples, the mobile computing device 502 detects the environmental audio data after detecting the utterance; detects the environmental audio data concurrently with detecting the utterance; or both. The mobile computing device 502 processes the detected utterance and the environmental data to generate waveform data 514 that represents the detected utterance and detected environmental audio data (e.g., the sounds of the television program) and transmits the waveform data 514 to the coordination engine 504 (e.g., over a network), during operation (A).
  • The coordination engine 504 receives the waveform data 514 from the mobile computing device 502. The coordination engine 504 processes the waveform data 514, including separating (or extracting) the utterance from other portions of the waveform data 514 and transmits the portion of the waveform data 514 corresponding to the utterance to the speech recognition engine 506 (e.g., over a network), during operation (B). For example, the coordination engine 504 separates the utterance (“Who directed this show?”) from the background noise of the environment of the user 512 (e.g., audio of the currently displayed television program). In some examples, the coordination engine 504 utilizes a voice detector to facilitate separation of the utterance from the background noise by identifying a portion of the waveform data 514 that includes voice activity. In some examples, the utterance relates to a query (e.g., a query relating to the currently displayed television program).
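The voice detector mentioned above is not specified further. As one simple possibility, a frame-energy threshold can split a sampled waveform into frames that likely contain voice activity and frames that likely contain only background noise. This sketch swaps in that basic technique for illustration; the function name, frame size, and threshold are assumptions, and a practical detector would need to handle speech that overlaps loud background audio.

```python
def split_voice_activity(samples, frame_size=4, threshold=0.1):
    """Partition samples into (voiced, background) by mean frame energy.

    A frame counts as voiced when its mean squared amplitude exceeds
    `threshold`. Both the frame size and the threshold are illustrative.
    """
    voiced, background = [], []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        energy = sum(sample * sample for sample in frame) / len(frame)
        (voiced if energy > threshold else background).extend(frame)
    return voiced, background


# Toy waveform: quiet background, a louder burst standing in for the utterance,
# then quiet background again.
waveform = [0.01] * 4 + [0.9, -0.8, 0.7, -0.9] + [0.02] * 4
utterance_part, background_part = split_voice_activity(waveform)
```

The voiced frames would correspond to the portion of the waveform data 514 sent for speech recognition, and the remaining frames to the environmental audio data used for content identification.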
  • The speech recognition engine 506 receives the portion of the waveform data 514 corresponding to the utterance from the coordination engine 504. The speech recognition engine 506 obtains a transcription of the utterance and provides the transcription to the coordination engine 504, during operation (C). Specifically, the speech recognition engine 506 appropriately processes the portion of the waveform data 514 corresponding to the utterance received from the coordination engine 504. In some examples, processing of the portion of the waveform data 514 corresponding to the utterance by the speech recognition engine 506 includes generating a transcription of the utterance. Generating the transcription of the utterance can include transcribing the utterance into text or text-related data. In other words, the speech recognition engine 506 can provide a representation of language in written form of the utterance.
  • For example, the speech recognition engine 506 transcribes the utterance to generate the transcription of “Who directed this show?” In some embodiments, the speech recognition engine 506 provides two or more transcriptions of the utterance. For example, the speech recognition engine 506 transcribes the utterance to generate the transcriptions of “Who directed this show?” and “Who directed this shoe?”
  • The coordination engine 504 receives the transcription of the utterance from the speech recognition engine 506. Furthermore, as mentioned above, the coordination engine 504 receives the waveform data 514 from the mobile computing device 502 that includes the environmental audio data associated with the utterance. The coordination engine 504 then identifies an entity using the environmental data. Specifically, the coordination engine 504 obtains data that identifies an entity from the content identification engine 508. To that end, the coordination engine 504 provides the environmental audio data and the portion of the waveform 514 corresponding to the utterance to the content identification engine 508 (e.g., over a network), during operation (D).
  • For example, the coordination engine 504 transmits the environmental data relating to the currently displayed television program (e.g., the entity) that includes audio of the currently displayed television program (e.g., dialogue of the currently displayed television program, soundtrack audio associated with the currently displayed television program, etc.) and the portion of the waveform 514 corresponding to the utterance (“Who directed this show?”) to the content identification engine 508.
  • In some embodiments, the coordination engine 504 provides a portion of the environmental data to the content identification engine 508. In some examples, the portion of the environmental data can include background noise detected by the mobile computing device 502 after detecting the utterance. In some examples, the portion of the environmental data can include background noise detected by the mobile computing device 502 concurrently with detecting the utterance.
  • The content identification engine 508 receives the environmental data and the portion of the waveform 514 corresponding to the utterance from the coordination engine 504. The content identification engine 508 identifies data that identifies the entity (e.g., content item data) that is based on the environmental data and the utterance and provides the data that identifies the entity to the coordination engine 504 (e.g., over a network), during operation (E). Specifically, the content identification engine 508 appropriately processes the environmental data and the portion of the waveform 514 corresponding to the utterance to identify data that identifies the entity (e.g., content item data) that is associated with the environmental data (e.g., a name of a television show, a name of a song, etc.).
  • For example, the content identification engine 508 processes the environmental audio data to identify content item data that is associated with the currently displayed television program. In some embodiments, the content identification engine 508 is the system 100 of FIG. 1.
  • The coordination engine 504 receives the data that identifies the entity (e.g., the content item data) from the content identification engine 508. Furthermore, as mentioned above, the coordination engine 504 receives the transcription from the speech recognition engine 506. The coordination engine 504 then provides a query including the transcription and the data that identifies the entity to the natural language query processing engine 510 (e.g., over a network), during operation (F). For example, the coordination engine 504 submits a query that includes the transcription of the utterance (“Who directed this show?”) and the content item data (‘television show name’) to the natural language query processing engine 510.
  • In some examples, the coordination engine 504 generates the query. In some examples, the coordination engine 504 obtains the query (e.g., from a third-party server). For example, the coordination engine 504 can submit the transcription of the utterance, and the data that identifies the entity to the third-party server, and receive back the query based on the transcription and the data that identifies the entity.
  • In some embodiments, generating the query by the coordination engine 504 can include associating the transcription of the utterance with the data that identifies the entity (e.g., the content item data). In some examples, associating the transcription of the utterance with the content item data can include tagging the transcription with the data that identifies the entity. For example, the coordination engine 504 can tag the transcription “Who directed this show?” with the ‘television show name’ or other identifying information associated with the content item data (e.g., an identification (ID) number). In some examples, associating the transcription of the utterance with the data that identifies the entity can include substituting a portion of the transcription with the data that identifies the entity. For example, the coordination engine 504 can substitute a portion of the transcription “Who directed this show?” with the ‘television show name’ or data identifying the ‘television show name.’ In some examples, substituting a portion of the transcription with the data that identifies the entity can include substituting one or more words of the transcription of the utterance with the data that identifies the entity. For example, the coordination engine 504 can substitute the ‘television show name’ or data identifying the ‘television show name’ in the transcription of “Who directed this show?” For example, the substitution can result in the transcription including “Who directed ‘television show name’?” or “Who directed ‘ID number’?”
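  • The two ways of associating the transcription with the data that identifies the entity (tagging and substitution) can be sketched as follows. This is a minimal illustration, not the claimed implementation: the `Query` type, the function names, the placeholder phrase "this show", and the title "Example Show" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Query:
    text: str                     # transcription, possibly rewritten
    entity: Optional[str] = None  # tag carrying the entity data, if any


def tag_query(transcription: str, entity: str) -> Query:
    """Tagging: keep the transcription as-is and attach the entity data."""
    return Query(text=transcription, entity=entity)


def substitute_query(
    transcription: str, entity: str, placeholder: str = "this show"
) -> Query:
    """Substitution: replace one or more words of the transcription with the entity data.

    Falls back to tagging when the assumed placeholder phrase is absent.
    """
    if placeholder not in transcription:
        return tag_query(transcription, entity)
    return Query(text=transcription.replace(placeholder, entity, 1))


tagged = tag_query("Who directed this show?", "Example Show")
substituted = substitute_query("Who directed this show?", "Example Show")
by_id = substitute_query("Who directed this show?", "ID 12345")
```

  • Tagging leaves the user's wording intact and lets the downstream engine decide how to use the entity, while substitution produces a self-contained question such as "Who directed Example Show?". How the portion to substitute is located is not specified in the text; the fixed placeholder here is only a stand-in for that step.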
  • The natural language query processing engine 510 receives the query that includes the transcription and the data that identifies the entity (e.g., the content item data) from the coordination engine 504. The natural language query processing engine 510 appropriately processes the query and, based on the processing, provides one or more results to the coordination engine 504 (e.g., over a network), during operation (G). In other words, the coordination engine 504 obtains one or more results of the query (e.g., from the natural language query processing engine 510).
  • Specifically, the natural language query processing engine 510 obtains information resources (from a collection of information resources) relevant to the query (the transcription of the utterance and the content item data). In some examples, the natural language query processing engine 510 matches the query against database information (e.g., text documents, images, audio, video, etc.) and a score is calculated on how well each object in the database matches the query. The natural language query processing engine 510 identifies one or more results based on the matched objects (e.g., objects having a score above a threshold score).
  • For example, the natural language query processing engine 510 receives the query that includes the transcription of the utterance “Who directed this show?” and the ‘television show name’ (or other identifying information). The natural language query processing engine 510 matches the query against database information, and provides one or more results that match the query. The natural language query processing engine 510 calculates a score of each of the matching objects.
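  • The scoring and thresholding step can be illustrated with a toy matcher. The disclosure does not say how the score is calculated; the word-overlap score, the threshold value, and the sample database below are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple


def overlap_score(query: str, document: str) -> float:
    """Assumed score: fraction of distinct query words that appear in the document."""
    query_words = set(query.lower().split())
    document_words = set(document.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & document_words) / len(query_words)


def match(
    query: str, database: Dict[str, str], threshold: float = 0.5
) -> List[Tuple[str, float]]:
    """Score every object and keep those whose score is above the threshold."""
    scored = [(obj_id, overlap_score(query, text)) for obj_id, text in database.items()]
    kept = [(obj_id, score) for obj_id, score in scored if score > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical database objects.
database = {
    "a": "example show was directed by jane doe",
    "b": "recipe for bread",
}
results = match("who directed example show", database)
```

  • Here object "a" shares three of the four query words and is kept, while object "b" shares none and is dropped. A real natural language query processing engine would use a far richer relevance model; only the score-then-threshold shape is taken from the text.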
  • The coordination engine 504 receives the one or more results from the natural language query processing engine 510. The coordination engine 504 then provides the one or more results to the mobile computing device 502 (e.g., over a network), at operation (H). For example, the coordination engine 504 transmits the one or more results (e.g., the name of the director of the television show) to the mobile computing device 502.
  • In some examples, one or more of the mobile computing device 502, the coordination engine 504, the speech recognition engine 506, the content identification engine 508, and the natural language query processing engine 510 can be in communication with a subset (or each) of the mobile computing device 502, the coordination engine 504, the speech recognition engine 506, the content identification engine 508, and the natural language query processing engine 510. In some embodiments, one or more of the coordination engine 504, the speech recognition engine 506, the content identification engine 508, and the natural language query processing engine 510 can be implemented using one or more computing devices, such as one or more computing servers, a distributed computing system, or a server farm or cluster.
  • FIG. 6 depicts a flowchart of an example process 600 for identifying one or more results based on environmental data and an utterance. The example process 600 can be executed using one or more computing devices. For example, the mobile computing device 502, the coordination engine 504, the speech recognition engine 506, the content identification engine 508, and/or the natural language query processing engine 510 can be used to execute the example process 600.
  • Audio data that encodes an utterance and environmental data is received (602). For example, the coordination engine 504 receives the waveform data 514 from the mobile computing device 502. The waveform data 514 includes the utterance of the user (e.g., “Who directed this show?”) and the environmental data (e.g., audio of the currently displayed television program). In some examples, receiving the environmental data can include receiving environmental audio data, environmental image data, or both. In some examples, receiving the environmental data includes receiving additional audio data that includes background noise.
  • A transcription of the utterance is obtained (604). For example, the coordination engine 504 obtains a transcription of the utterance using the speech recognition engine 506. The speech recognition engine 506 transcribes the utterance to generate a transcription of the utterance (e.g., “Who directed this show?”).
  • An entity is identified using the environmental data (606). For example, the coordination engine 504 obtains data identifying the entity using the content identification engine 508. The content identification engine 508 can appropriately process the environmental data (e.g., the environmental audio data associated with the displayed television program) to identify data identifying the entity (e.g., content item data) that is associated with the environmental data (e.g., a name of a television show, a name of a song, etc.). In some examples, the content identification engine 508 can further process the waveform 514 corresponding to the utterance (concurrently or subsequently to processing the environmental data) to identify the entity.
  • In some examples, the coordination engine 504 generates a query. In some examples, generating the query by the coordination engine 504 can include associating the transcription of the utterance with the data that identifies the entity. In some examples, associating the transcription of the utterance with the content item data can include substituting a portion of the transcription with the data that identifies the entity. In some examples, substituting a portion of the transcription with the data that identifies the entity can include substituting one or more words of the transcription of the utterance with the data that identifies the entity.
  • The query is submitted to a natural language query processing engine (608). For example, the coordination engine 504 submits the query to the natural language query processing engine 510. The query can include at least a portion of the transcription and the data that identifies the entity (e.g., the content item data). For example, the coordination engine 504 submits a query that includes the transcription of the utterance (“Who directed this show?”) and the content item data (‘television show name’) to the natural language query processing engine 510.
  • One or more results of the query are obtained (610). For example, the coordination engine 504 obtains one or more results (e.g., the name of the director of the television show) of the query from the natural language query processing engine 510. In some examples, the coordination engine 504 then provides the one or more results to the mobile computing device 502.
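  • The flow of example process 600 can be summarized in a short sketch in which each engine is passed in as a plain function. The function and parameter names are hypothetical, and the stub engines in the usage lines return canned values only to show how data moves between steps (602) through (610).

```python
from typing import Callable, List, Tuple


def process_600(
    audio_data: bytes,
    separate: Callable[[bytes], Tuple[bytes, bytes]],
    transcribe: Callable[[bytes], str],
    identify_entity: Callable[[bytes], str],
    build_query: Callable[[str, str], str],
    run_query: Callable[[str], List[str]],
) -> List[str]:
    """Coordinate the engines in the order described for process 600."""
    utterance, environmental = separate(audio_data)  # (602) receive and split audio data
    transcription = transcribe(utterance)            # (604) obtain a transcription
    entity = identify_entity(environmental)          # (606) identify an entity
    query = build_query(transcription, entity)       # generate the query
    return run_query(query)                          # (608) submit, (610) obtain results


# Stub engines standing in for 506, 508, and 510 (canned values, for illustration).
results = process_600(
    b"speech|tv-audio",
    separate=lambda data: tuple(data.split(b"|", 1)),
    transcribe=lambda utterance: "Who directed this show?",
    identify_entity=lambda environmental: "Example Show",
    build_query=lambda text, entity: text.replace("this show", entity),
    run_query=lambda query: ["result for: " + query],
)
```

  • Passing the engines in as callables mirrors the statement that the components may run on separate computing devices: each callable could wrap a network request without changing the coordinating logic. Steps (604) and (606) do not depend on each other, so they could also run in parallel.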
  • FIG. 7 shows an example of a generic computer device 700 and a generic mobile computer device 750, which may be used with the techniques described here. Computing device 700 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Computing device 750 is intended to represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, and other similar computing devices. The components shown here, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed in this document.
  • Computing device 700 includes a processor 702, memory 704, a storage device 706, a high-speed interface 708 connecting to memory 704 and high-speed expansion ports 710, and a low-speed interface 712 connecting to low-speed bus 714 and storage device 706. The components 702, 704, 706, 708, 710, and 712 are interconnected using various buses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 702 may process instructions for execution within the computing device 700, including instructions stored in the memory 704 or on the storage device 706 to display graphical information for a GUI on an external input/output device, such as display 716 coupled to high-speed interface 708. In other implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices 700 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).
  • The memory 704 stores information within the computing device 700. In one implementation, the memory 704 is a volatile memory unit or units. In another implementation, the memory 704 is a non-volatile memory unit or units. The memory 704 may also be another form of computer-readable medium, such as a magnetic or optical disk.
  • The storage device 706 is capable of providing mass storage for the computing device 700. In one implementation, the storage device 706 may be or contain a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations. A computer program product may be tangibly embodied in an information carrier. The computer program product may also contain instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 704, the storage device 706, or a memory on processor 702.
  • The high speed controller 708 manages bandwidth-intensive operations for the computing device 700, while the low speed controller 712 manages lower bandwidth-intensive operations. Such allocation of functions is exemplary only. In one implementation, the high-speed controller 708 is coupled to memory 704, display 716 (e.g., through a graphics processor or accelerator), and to high-speed expansion ports 710, which may accept various expansion cards (not shown). In the implementation, low-speed controller 712 is coupled to storage device 706 and low-speed expansion port 714. The low-speed expansion port, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet) may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.
  • The computing device 700 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 720, or multiple times in a group of such servers. It may also be implemented as part of a rack server system 724. In addition, it may be implemented in a personal computer such as a laptop computer 722. Alternatively, components from computing device 700 may be combined with other components in a mobile device (not shown), such as device 750. Each of such devices may contain one or more of computing device 700, 750, and an entire system may be made up of multiple computing devices 700, 750 communicating with each other.
  • Computing device 750 includes a processor 752, memory 764, an input/output device such as a display 754, a communication interface 766, and a transceiver 768, among other components. The device 750 may also be provided with a storage device, such as a microdrive or other device, to provide additional storage. Each of the components 750, 752, 764, 754, 766, and 768, are interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate.
  • The processor 752 may execute instructions within the computing device 750, including instructions stored in the memory 764. The processor may be implemented as a chipset of chips that include separate and multiple analog and digital processors. The processor may provide, for example, for coordination of the other components of the device 750, such as control of user interfaces, applications run by device 750, and wireless communication by device 750.
  • Processor 752 may communicate with a user through control interface 758 and display interface 756 coupled to a display 754. The display 754 may be, for example, a TFT LCD (Thin-Film-Transistor Liquid Crystal Display) or an OLED (Organic Light Emitting Diode) display, or other appropriate display technology. The display interface 756 may comprise appropriate circuitry for driving the display 754 to present graphical and other information to a user. The control interface 758 may receive commands from a user and convert them for submission to the processor 752. In addition, an external interface 762 may be provided in communication with processor 752, so as to enable near area communication of device 750 with other devices. External interface 762 may provide, for example, for wired communication in some implementations, or for wireless communication in other implementations, and multiple interfaces may also be used.
  • The memory 764 stores information within the computing device 750. The memory 764 may be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units. Expansion memory 754 may also be provided and connected to device 750 through expansion interface 752, which may include, for example, a SIMM (Single In Line Memory Module) card interface. Such expansion memory 754 may provide extra storage space for device 750, or may also store applications or other information for device 750. Specifically, expansion memory 754 may include instructions to carry out or supplement the processes described above, and may include secure information also. Thus, for example, expansion memory 754 may be provided as a security module for device 750, and may be programmed with instructions that permit secure use of device 750. In addition, secure applications may be provided via the SIMM cards, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner.
  • The memory may include, for example, flash memory and/or NVRAM memory, as discussed below. In one implementation, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 764, expansion memory 754, memory on processor 752, or a propagated signal that may be received, for example, over transceiver 768 or external interface 762.
  • Device 750 may communicate wirelessly through communication interface 766, which may include digital signal processing circuitry where necessary. Communication interface 766 may provide for communications under various modes or protocols, such as GSM voice calls, SMS, EMS, or MMS messaging, CDMA, TDMA, PDC, WCDMA, CDMA2000, or GPRS, among others. Such communication may occur, for example, through radio-frequency transceiver 768. In addition, short-range communication may occur, such as using a Bluetooth, WiFi, or other such transceiver (not shown). In addition, GPS (Global Positioning System) receiver module 750 may provide additional navigation- and location-related wireless data to device 750, which may be used as appropriate by applications running on device 750.
  • Device 750 may also communicate audibly using audio codec 760, which may receive spoken information from a user and convert it to usable digital information. Audio codec 760 may likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of device 750. Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice messages, music files, etc.) and may also include sound generated by applications operating on device 750.
  • The computing device 750 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a cellular telephone 780. It may also be implemented as part of a smartphone 782, personal digital assistant, or other similar mobile device.
  • Various implementations of the systems and techniques described here may be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
  • These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms “machine-readable medium” and “computer-readable medium” refer to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor.
  • To provide for interaction with a user, the systems and techniques described here may be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user may provide input to the computer. Other kinds of devices may be used to provide for interaction with a user as well; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
  • The systems and techniques described here may be implemented in a computing system that includes a back end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front end component (e.g., a client computer having a graphical user interface or a Web browser through which a user may interact with an implementation of the systems and techniques described here), or any combination of such back end, middleware, or front end components. The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (“LAN”), a wide area network (“WAN”), and the Internet.
  • The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
  • While this disclosure includes some specifics, these should not be construed as limitations on the scope of the disclosure or of what may be claimed, but rather as descriptions of features of example implementations of the disclosure. Certain features that are described in this disclosure in the context of separate implementations can also be provided in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be provided in multiple implementations separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
  • Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the implementations described above should not be understood as requiring such separation in all implementations, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
  • Thus, particular implementations of the present disclosure have been described. Other implementations are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the disclosure. For example, various forms of the flows shown above may be used, with steps re-ordered, added, or removed. Accordingly, other implementations are within the scope of the following claims.

Claims (22)

1. A computer-implemented method comprising:
receiving audio data encoding (i) an utterance and (ii) background audio data;
obtaining a transcription of the utterance;
identifying an entity using the background audio data;
submitting a query to a natural language query processing engine, wherein the query includes at least a portion of the transcription and data that identifies the entity that is identified using the background audio data; and
obtaining one or more results of the query.
2. The computer-implemented method of claim 1, further comprising outputting a representation of at least one of the results.
3. The computer-implemented method of claim 1, wherein the entity is identified further using the utterance.
4. The computer-implemented method of claim 1, further comprising generating the query.
5. The computer-implemented method of claim 4, wherein generating the query comprises associating the transcription with the data that identifies the entity.
6. The computer-implemented method of claim 5, wherein associating further includes tagging the transcription with the data that identifies the entity.
7. The computer-implemented method of claim 5, wherein associating further includes substituting a portion of the transcription with the data that identifies the entity.
8. The computer-implemented method of claim 7, wherein substituting further includes substituting one or more words of the transcription with the data that identifies the entity.
9-10. (canceled)
11. A system comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
receiving, from a computing device, audio data encoding an utterance recorded by the computing device, and image data encoding an image captured by the computing device;
obtaining a transcription of the utterance;
identifying an entity using the image data;
submitting a query to a natural language query processing engine, wherein the query includes at least a portion of the transcription and data that identifies the entity that is identified using the image data; and
obtaining one or more results of the query.
12. The system of claim 11, the operations further include generating the query, wherein generating the query comprises associating the transcription with the data that identifies the entity.
13. The system of claim 12, wherein associating further includes tagging the transcription with the data that identifies the entity.
14. The system of claim 12, wherein associating further includes substituting a portion of the transcription with the data that identifies the entity.
15. The system of claim 14, wherein substituting further includes substituting one or more words of the transcription with the data that identifies the entity.
16-17. (canceled)
18. A computer-readable storage device storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising:
receiving (i) audio data encoding an utterance, and (ii) environmental data;
obtaining a transcription of the utterance;
identifying a title of an item of media content using the environmental data;
submitting a query to a natural language query processing engine, wherein the query includes at least a portion of the transcription and data that identifies the title of the item of media content that is identified using the environmental data; and
obtaining one or more results of the query.
19. The computer-readable storage device of claim 18, wherein the operations further comprise generating the query, wherein generating the query comprises associating the transcription with the data that identifies the entity.
20. The computer-readable storage device of claim 19, wherein associating further includes tagging the transcription with the data that identifies the entity.
21. The computer-readable storage device of claim 19, wherein associating further includes substituting a portion of the transcription with the data that identifies the entity.
22. The computer-readable storage device of claim 21, wherein substituting further includes substituting one or more words of the transcription with the data that identifies the entity.
23. The computer-implemented method of claim 1, further comprising:
providing, to a query generation engine, the transcription of the utterance and the data that identifies the entity; and
receiving, from the query generation engine, the query based on providing the transcription of the utterance and the data that identifies the entity to the query generation engine.
24. The computer-implemented method of claim 1, further comprising:
determining, for each result of the one or more results, a score based on a matching between the query and the result;
comparing, for each result of the one or more results, the score to a threshold score; and
based on comparing the score to the threshold score, identifying one or more particular results of the one or more results associated with a score that satisfied the threshold score.
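
As an illustrative, non-limiting sketch, the tagging and substituting recited in claims 6-8 and the threshold comparison recited in claim 24 might look like the following. Every name here (`Result`, `tag_transcription`, `substitute_in_transcription`, `filter_results`), the bracketed tag format, the placeholder word list, and the reading of "satisfies the threshold" as "greater than or equal to" are assumptions made for illustration and do not appear in the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    text: str
    score: float  # hypothetical score for the match between query and result


def tag_transcription(transcription: str, entity: str) -> str:
    """Tagging: keep the transcription intact and attach the entity data.

    The bracketed suffix is an assumed tag format, not one from the source.
    """
    return f"{transcription} [{entity}]"


def substitute_in_transcription(
    transcription: str,
    entity: str,
    placeholders: tuple = ("this", "that", "it"),
) -> str:
    """Substituting: replace the first placeholder word with the entity data.

    The placeholder list is hypothetical; punctuation is not handled.
    """
    words = transcription.split()
    for i, word in enumerate(words):
        if word.lower() in placeholders:
            words[i] = entity
            break
    return " ".join(words)


def filter_results(results: list, threshold: float) -> list:
    """Keep only results whose score meets the threshold (assumed to mean >=)."""
    return [r for r in results if r.score >= threshold]
```

With the transcription "who directed this" and the entity data "Example Movie", tagging yields "who directed this [Example Movie]" and substituting yields "who directed Example Movie". Either string could then be submitted as the query, and `filter_results` applied to whatever scored results come back.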
US13/626,439 2012-09-10 2012-09-25 Answering questions using environmental context Abandoned US20140074466A1 (en)

Priority Applications (10)

Application Number Publication Number Priority Date Filing Date Title
US13/626,439 US20140074466A1 (en) 2012-09-10 2012-09-25 Answering questions using environmental context
PCT/US2013/035095 WO2014039106A1 (en) 2012-09-10 2013-04-03 Answering questions using environmental context
EP20130162403 EP2706470A1 (en) 2012-09-10 2013-04-04 Answering questions using environmental context
CN201310394518.3A CN103714104B (en) 2012-09-10 2013-04-05 Answering questions using environmental context
CN201610628594.XA CN106250508B (en) 2012-09-10 2013-04-05 Answering questions using environmental context
KR1020130037540A KR102029276B1 (en) 2012-09-10 2013-04-05 Answering questions using environmental context
US15/224,944 US9576576B2 (en) 2012-09-10 2016-08-01 Answering questions using environmental context
US15/410,180 US9786279B2 (en) 2012-09-10 2017-01-19 Answering questions using environmental context
KR1020190119592A KR102140177B1 (en) 2012-09-10 2019-09-27 Answering questions using environmental context
KR1020200092439A KR102241972B1 (en) 2012-09-10 2020-07-24 Answering questions using environmental context

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201261698934P 2012-09-10 2012-09-10
US13/626,439 US20140074466A1 (en) 2012-09-10 2012-09-25 Answering questions using environmental context

Related Child Applications (1)

Application Number Relation Publication Number Priority Date Filing Date Title
US15/224,944 Continuation US9576576B2 (en) 2012-09-10 2016-08-01 Answering questions using environmental context

Publications (1)

Publication Number Publication Date
US20140074466A1 (en) 2014-03-13

Family

ID=50234196

Family Applications (3)

Application Number Status Publication Number Priority Date Filing Date Title
US13/626,439 Abandoned US20140074466A1 (en) 2012-09-10 2012-09-25 Answering questions using environmental context
US15/224,944 Active US9576576B2 (en) 2012-09-10 2016-08-01 Answering questions using environmental context
US15/410,180 Active US9786279B2 (en) 2012-09-10 2017-01-19 Answering questions using environmental context

Country Status (1)

Country Link
US (3) US20140074466A1 (en)

Cited By (134)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140195230A1 (en) * 2013-01-07 2014-07-10 Samsung Electronics Co., Ltd. Display apparatus and method for controlling the same
CN105206266A (en) * 2015-09-01 2015-12-30 重庆长安汽车股份有限公司 Vehicle-mounted voice control system and method based on user intention guess
US20160066106A1 (en) * 2014-08-27 2016-03-03 Auditory Labs, Llc Mobile audio receiver
US9286910B1 (en) * 2014-03-13 2016-03-15 Amazon Technologies, Inc. System for resolving ambiguous queries based on user context
US20160092447A1 (en) * 2014-09-30 2016-03-31 Rovi Guides, Inc. Systems and methods for searching for a media asset
US9438949B2 (en) 2009-02-12 2016-09-06 Digimarc Corporation Media processing methods and arrangements
US20170003933A1 (en) * 2014-04-22 2017-01-05 Sony Corporation Information processing device, information processing method, and computer program
WO2017090947A1 (en) * 2015-11-27 2017-06-01 Samsung Electronics Co., Ltd. Question and answer processing method and electronic device for supporting the same
CN108154881A (en) * 2017-11-28 2018-06-12 苏州市东皓计算机系统工程有限公司 Computer voice recognition method
US10043516B2 (en) * 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
US10083690B2 (en) 2014-05-30 2018-09-25 Apple Inc. Better resolution when referencing to concepts
US10108612B2 (en) 2008-07-31 2018-10-23 Apple Inc. Mobile device having human language translation capability with positional feedback
US10303715B2 (en) 2017-05-16 2019-05-28 Apple Inc. Intelligent automated assistant for media exploration
US10311144B2 (en) 2017-05-16 2019-06-04 Apple Inc. Emoji word sense disambiguation
US10311871B2 (en) 2015-03-08 2019-06-04 Apple Inc. Competing devices responding to voice triggers
US10332518B2 (en) 2017-05-09 2019-06-25 Apple Inc. User interface for correcting recognition errors
US10354652B2 (en) 2015-12-02 2019-07-16 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US20190228792A1 (en) * 2015-08-27 2019-07-25 Auditory Labs Llc Auditory interpretation device with display
US10381016B2 (en) 2008-01-03 2019-08-13 Apple Inc. Methods and apparatus for altering audio output signals
US10390213B2 (en) 2014-09-30 2019-08-20 Apple Inc. Social reminders
US10395654B2 (en) 2017-05-11 2019-08-27 Apple Inc. Text normalization based on a data-driven learning network
US10403278B2 (en) 2017-05-16 2019-09-03 Apple Inc. Methods and systems for phonetic matching in digital assistant services
US10403283B1 (en) 2018-06-01 2019-09-03 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US10417344B2 (en) 2014-05-30 2019-09-17 Apple Inc. Exemplar-based natural language processing
US10417405B2 (en) 2011-03-21 2019-09-17 Apple Inc. Device access using voice authentication
US10417266B2 (en) 2017-05-09 2019-09-17 Apple Inc. Context-aware ranking of intelligent response suggestions
US10431204B2 (en) 2014-09-11 2019-10-01 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US10438595B2 (en) 2014-09-30 2019-10-08 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
US10445429B2 (en) 2017-09-21 2019-10-15 Apple Inc. Natural language understanding using vocabularies with compressed serialized tries
US10453443B2 (en) 2014-09-30 2019-10-22 Apple Inc. Providing an indication of the suitability of speech recognition
US20190341026A1 (en) * 2018-05-04 2019-11-07 Qualcomm Incorporated Audio analytics for natural language processing
US10474753B2 (en) 2016-09-07 2019-11-12 Apple Inc. Language identification using recurrent neural networks
US10496705B1 (en) 2018-06-03 2019-12-03 Apple Inc. Accelerated task performance
US10497365B2 (en) 2014-05-30 2019-12-03 Apple Inc. Multi-command single utterance input method
US10529332B2 (en) 2015-03-08 2020-01-07 Apple Inc. Virtual assistant activation
US10580409B2 (en) 2016-06-11 2020-03-03 Apple Inc. Application integration with a digital assistant
US10592604B2 (en) 2018-03-12 2020-03-17 Apple Inc. Inverse text normalization for automatic speech recognition
US10636424B2 (en) 2017-11-30 2020-04-28 Apple Inc. Multi-turn canned dialog
US10643611B2 (en) 2008-10-02 2020-05-05 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US10657961B2 (en) 2013-06-08 2020-05-19 Apple Inc. Interpreting and acting upon commands that involve sharing information with remote devices
US10657328B2 (en) 2017-06-02 2020-05-19 Apple Inc. Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling
US10681212B2 (en) 2015-06-05 2020-06-09 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US10684703B2 (en) 2018-06-01 2020-06-16 Apple Inc. Attention aware virtual assistant dismissal
US10692504B2 (en) 2010-02-25 2020-06-23 Apple Inc. User profiling for voice input processing
US10699717B2 (en) 2014-05-30 2020-06-30 Apple Inc. Intelligent assistant for home automation
US10714117B2 (en) 2013-02-07 2020-07-14 Apple Inc. Voice trigger for a digital assistant
US10726832B2 (en) 2017-05-11 2020-07-28 Apple Inc. Maintaining privacy of personal information
US10733375B2 (en) 2018-01-31 2020-08-04 Apple Inc. Knowledge-based framework for improving natural language understanding
US10733982B2 (en) 2018-01-08 2020-08-04 Apple Inc. Multi-directional dialog
US10733993B2 (en) 2016-06-10 2020-08-04 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10741185B2 (en) 2010-01-18 2020-08-11 Apple Inc. Intelligent automated assistant
US10748546B2 (en) 2017-05-16 2020-08-18 Apple Inc. Digital assistant services based on device capabilities
US10755051B2 (en) 2017-09-29 2020-08-25 Apple Inc. Rule-based natural language processing
US10769385B2 (en) 2013-06-09 2020-09-08 Apple Inc. System and method for inferring user intent from speech inputs
US10789945B2 (en) 2017-05-12 2020-09-29 Apple Inc. Low-latency intelligent automated assistant
US10789959B2 (en) 2018-03-02 2020-09-29 Apple Inc. Training speaker recognition models for digital assistants
US10818288B2 (en) 2018-03-26 2020-10-27 Apple Inc. Natural assistant interaction
US10839159B2 (en) 2018-09-28 2020-11-17 Apple Inc. Named entity normalization in a spoken dialog system
US10839806B2 (en) 2017-07-10 2020-11-17 Samsung Electronics Co., Ltd. Voice processing method and electronic device supporting the same
US10892996B2 (en) 2018-06-01 2021-01-12 Apple Inc. Variable latency device coordination
US10909331B2 (en) 2018-03-30 2021-02-02 Apple Inc. Implicit identification of translation payload with neural machine translation
US10928918B2 (en) 2018-05-07 2021-02-23 Apple Inc. Raise to speak
US10942703B2 (en) 2015-12-23 2021-03-09 Apple Inc. Proactive assistance based on dialog communication between devices
US10942702B2 (en) 2016-06-11 2021-03-09 Apple Inc. Intelligent device arbitration and control
US10956666B2 (en) 2015-11-09 2021-03-23 Apple Inc. Unconventional virtual assistant interactions
US10984780B2 (en) 2018-05-21 2021-04-20 Apple Inc. Global semantic word embeddings using bi-directional recurrent neural networks
US11010127B2 (en) 2015-06-29 2021-05-18 Apple Inc. Virtual assistant for media playback
US11010561B2 (en) 2018-09-27 2021-05-18 Apple Inc. Sentiment prediction from textual data
US11025565B2 (en) 2015-06-07 2021-06-01 Apple Inc. Personalized prediction of responses for instant messaging
US11023513B2 (en) 2007-12-20 2021-06-01 Apple Inc. Method and apparatus for searching using an active ontology
US11048473B2 (en) 2013-06-09 2021-06-29 Apple Inc. Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant
US11069347B2 (en) 2016-06-08 2021-07-20 Apple Inc. Intelligent automated assistant for media exploration
US11070949B2 (en) 2015-05-27 2021-07-20 Apple Inc. Systems and methods for proactively identifying and surfacing relevant content on an electronic device with a touch-sensitive display
US11069336B2 (en) 2012-03-02 2021-07-20 Apple Inc. Systems and methods for name pronunciation
US20210248168A1 (en) * 2018-08-24 2021-08-12 Hewlett-Packard Development Company, L.P. Identifying digital elements
US11120372B2 (en) 2011-06-03 2021-09-14 Apple Inc. Performing actions associated with task items that represent tasks to perform
US11126400B2 (en) 2015-09-08 2021-09-21 Apple Inc. Zero latency digital assistant
US11127397B2 (en) 2015-05-27 2021-09-21 Apple Inc. Device voice control
US11133008B2 (en) 2014-05-30 2021-09-28 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US11140099B2 (en) 2019-05-21 2021-10-05 Apple Inc. Providing message response suggestions
US11145294B2 (en) 2018-05-07 2021-10-12 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US11170166B2 (en) 2018-09-28 2021-11-09 Apple Inc. Neural typographical error modeling via generative adversarial networks
US11204787B2 (en) 2017-01-09 2021-12-21 Apple Inc. Application integration with a digital assistant
US11217251B2 (en) 2019-05-06 2022-01-04 Apple Inc. Spoken notifications
US11227589B2 (en) 2016-06-06 2022-01-18 Apple Inc. Intelligent list reading
US11231904B2 (en) 2015-03-06 2022-01-25 Apple Inc. Reducing response latency of intelligent automated assistants
US11237797B2 (en) 2019-05-31 2022-02-01 Apple Inc. User activity shortcut suggestions
US11269678B2 (en) 2012-05-15 2022-03-08 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US20220084516A1 (en) * 2018-12-06 2022-03-17 Comcast Cable Communications, Llc Voice Command Trigger Words
US11281993B2 (en) 2016-12-05 2022-03-22 Apple Inc. Model and ensemble compression for metric learning
US11289073B2 (en) 2019-05-31 2022-03-29 Apple Inc. Device text to speech
US11301477B2 (en) 2017-05-12 2022-04-12 Apple Inc. Feedback analysis of a digital assistant
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
US11314370B2 (en) 2013-12-06 2022-04-26 Apple Inc. Method for extracting salient dialog usage from live data
US11350253B2 (en) 2011-06-03 2022-05-31 Apple Inc. Active transport based notifications
US11348573B2 (en) 2019-03-18 2022-05-31 Apple Inc. Multimodality in digital assistant systems
US11360641B2 (en) 2019-06-01 2022-06-14 Apple Inc. Increasing the relevance of new available information
US11386266B2 (en) 2018-06-01 2022-07-12 Apple Inc. Text correction
US11388291B2 (en) 2013-03-14 2022-07-12 Apple Inc. System and method for processing voicemail
US11405466B2 (en) 2017-05-12 2022-08-02 Apple Inc. Synchronization and task delegation of a digital assistant
US11410677B2 (en) 2020-11-24 2022-08-09 Qualcomm Incorporated Adaptive sound event classification
US11423886B2 (en) 2010-01-18 2022-08-23 Apple Inc. Task flow identification based on user intent
US11423908B2 (en) 2019-05-06 2022-08-23 Apple Inc. Interpreting spoken requests
US11462215B2 (en) 2018-09-28 2022-10-04 Apple Inc. Multi-modal inputs for voice commands
US11468282B2 (en) 2015-05-15 2022-10-11 Apple Inc. Virtual assistant in a communication session
US11467802B2 (en) 2017-05-11 2022-10-11 Apple Inc. Maintaining privacy of personal information
US11475898B2 (en) 2018-10-26 2022-10-18 Apple Inc. Low-latency multi-speaker speech recognition
US11475884B2 (en) 2019-05-06 2022-10-18 Apple Inc. Reducing digital assistant latency when a language is incorrectly determined
US11488406B2 (en) 2019-09-25 2022-11-01 Apple Inc. Text detection using global geometry estimators
US11496600B2 (en) 2019-05-31 2022-11-08 Apple Inc. Remote execution of machine-learned models
US11495218B2 (en) 2018-06-01 2022-11-08 Apple Inc. Virtual assistant operation in multi-device environments
US11500672B2 (en) 2015-09-08 2022-11-15 Apple Inc. Distributed personal assistant
US11516537B2 (en) 2014-06-30 2022-11-29 Apple Inc. Intelligent automated assistant for TV user interactions
US11526368B2 (en) 2015-11-06 2022-12-13 Apple Inc. Intelligent automated assistant in a messaging environment
EP4102501A1 (en) * 2021-06-08 2022-12-14 Comcast Cable Communications LLC Processing voice commands
US11532306B2 (en) 2017-05-16 2022-12-20 Apple Inc. Detecting a trigger of a digital assistant
US11580990B2 (en) 2017-05-12 2023-02-14 Apple Inc. User-specific acoustic models
US20230115098A1 (en) * 2021-10-11 2023-04-13 Microsoft Technology Licensing, Llc Suggested queries for transcript search
US11638059B2 (en) 2019-01-04 2023-04-25 Apple Inc. Content playback on multiple devices
CN116059646A (en) * 2023-04-06 2023-05-05 深圳尚米网络技术有限公司 Interactive expert guidance system
US11657813B2 (en) 2019-05-31 2023-05-23 Apple Inc. Voice identification in digital assistant systems
US11664044B2 (en) 2019-11-25 2023-05-30 Qualcomm Incorporated Sound event detection learning
US11671920B2 (en) 2007-04-03 2023-06-06 Apple Inc. Method and system for operating a multifunction portable electronic device using voice-activation
US11696060B2 (en) 2020-07-21 2023-07-04 Apple Inc. User identification using headphones
US11755276B2 (en) 2020-05-12 2023-09-12 Apple Inc. Reducing description length based on confidence
US11765209B2 (en) 2020-05-11 2023-09-19 Apple Inc. Digital assistant hardware abstraction
US11790914B2 (en) 2019-06-01 2023-10-17 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
US11798547B2 (en) 2013-03-15 2023-10-24 Apple Inc. Voice activated device for use with a voice-based digital assistant
US11809483B2 (en) 2015-09-08 2023-11-07 Apple Inc. Intelligent automated assistant for media search and playback
US11838734B2 (en) 2020-07-20 2023-12-05 Apple Inc. Multi-device audio adjustment coordination
US11853536B2 (en) 2015-09-08 2023-12-26 Apple Inc. Intelligent automated assistant in a media environment
US11914848B2 (en) 2020-05-11 2024-02-27 Apple Inc. Providing relevant data items based on context
US11928604B2 (en) 2005-09-08 2024-03-12 Apple Inc. Method and apparatus for building an intelligent automated assistant
US11954405B2 (en) 2022-11-07 2024-04-09 Apple Inc. Zero latency digital assistant

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150012840A1 (en) * 2013-07-02 2015-01-08 International Business Machines Corporation Identification and Sharing of Selections within Streaming Content
US20150162000A1 (en) * 2013-12-10 2015-06-11 Harman International Industries, Incorporated Context aware, proactive digital assistant
JP7020799B2 (en) * 2017-05-16 2022-02-16 ソニーグループ株式会社 Information processing equipment and information processing method
CN107463636B (en) * 2017-07-17 2021-02-19 北京小米移动软件有限公司 Voice interaction data configuration method and device and computer readable storage medium
US10381008B1 (en) * 2017-11-18 2019-08-13 Tp Lab, Inc. Voice-based interactive network monitor
US10762089B2 (en) 2017-12-05 2020-09-01 International Business Machines Corporation Open ended question identification for investigations
CN107910001A (en) * 2017-12-05 2018-04-13 杰克缝纫机股份有限公司 Online voice assistant apparatus and system for an industrial sewing machine
US11531858B2 (en) 2018-01-02 2022-12-20 International Business Machines Corporation Cognitive conversational agent for providing personalized insights on-the-fly
US11455501B2 (en) 2018-02-21 2022-09-27 Hewlett-Packard Development Company, L.P. Response based on hierarchical models
US11782962B2 (en) 2019-08-12 2023-10-10 Nec Corporation Temporal context-aware representation learning for question routing
US11188923B2 (en) * 2019-08-29 2021-11-30 Bank Of America Corporation Real-time knowledge-based widget prioritization and display
WO2021162489A1 (en) 2020-02-12 2021-08-19 Samsung Electronics Co., Ltd. Method and voice assistance apparatus for providing an intelligence response
US11657030B2 (en) 2020-11-16 2023-05-23 Bank Of America Corporation Multi-dimensional data tagging and reuse

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6012030A (en) * 1998-04-21 2000-01-04 Nortel Networks Corporation Management of speech and audio prompts in multimodal interfaces
US20060247927A1 (en) * 2005-04-29 2006-11-02 Robbins Kenneth L Controlling an output while receiving a user input
US20070010992A1 (en) * 2005-07-08 2007-01-11 Microsoft Corporation Processing collocation mistakes in documents
US20070160345A1 (en) * 2004-05-10 2007-07-12 Masaharu Sakai Multimedia reproduction device and menu screen display method
US20070168335A1 (en) * 2006-01-17 2007-07-19 Moore Dennis B Deep enterprise search
US20080256033A1 (en) * 2007-04-10 2008-10-16 Motorola, Inc. Method and apparatus for distributed voice searching
US20100223056A1 (en) * 2009-02-27 2010-09-02 Autonomy Corporation Ltd. Various apparatus and methods for a speech recognition system
US20120084312A1 (en) * 2010-10-01 2012-04-05 Google Inc. Choosing recognized text from a background environment
US8438163B1 (en) * 2010-12-07 2013-05-07 Google Inc. Automatic learning of logos for visual recognition

Family Cites Families (79)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7562392B1 (en) 1999-05-19 2009-07-14 Digimarc Corporation Methods of interacting with audio and ambient music
US6931451B1 (en) * 1996-10-03 2005-08-16 Gotuit Media Corp. Systems and methods for modifying broadcast programming
US6269331B1 (en) 1996-11-14 2001-07-31 Nokia Mobile Phones Limited Transmission of comfort noise parameters during discontinuous transmission
KR100266578B1 (en) * 1997-06-11 2000-09-15 구자홍 Automatic tone correction method and apparatus
US5970446A (en) 1997-11-25 1999-10-19 At&T Corp Selective noise/channel/coding models and recognizers for automatic speech recognition
US6185527B1 (en) 1999-01-19 2001-02-06 International Business Machines Corporation System and method for automatic audio content analysis for word spotting, indexing, classification and retrieval
US7194752B1 (en) * 1999-10-19 2007-03-20 Iceberg Industries, Llc Method and apparatus for automatically recognizing input audio and/or video streams
US6415258B1 (en) 1999-10-06 2002-07-02 Microsoft Corporation Background audio recovery system
US6941275B1 (en) * 1999-10-07 2005-09-06 Remi Swierczek Music identification system
US6442519B1 (en) 1999-11-10 2002-08-27 International Business Machines Corp. Speaker model adaptation via network of similar users
JP4438144B2 (en) 1999-11-11 2010-03-24 ソニー株式会社 Signal classification method and apparatus, descriptor generation method and apparatus, signal search method and apparatus
US6578008B1 (en) * 2000-01-12 2003-06-10 Aaron R. Chacker Method and system for an online talent business
US7444353B1 (en) 2000-01-31 2008-10-28 Chen Alexander C Apparatus for delivering music and information
US6834308B1 (en) * 2000-02-17 2004-12-21 Audible Magic Corporation Method and apparatus for identifying media content presented on a media playing device
US6785670B1 (en) 2000-03-16 2004-08-31 International Business Machines Corporation Automatically initiating an internet-based search from within a displayed document
US20010041328A1 (en) 2000-05-11 2001-11-15 Fisher Samuel Heyward Foreign language immersion simulation process and apparatus
US7343553B1 (en) * 2000-05-19 2008-03-11 Evan John Kaye Voice clip identification method
US7853664B1 (en) * 2000-07-31 2010-12-14 Landmark Digital Services Llc Method and system for purchasing pre-recorded music
US6990453B2 (en) * 2000-07-31 2006-01-24 Landmark Digital Services Llc System and methods for recognizing sound and music signals in high noise and distortion
US6876966B1 (en) 2000-10-16 2005-04-05 Microsoft Corporation Pattern recognition training method and apparatus using inserted noise followed by noise reduction
US6636590B1 (en) * 2000-10-30 2003-10-21 Ingenio, Inc. Apparatus and method for specifying and obtaining services through voice commands
US6748360B2 (en) * 2000-11-03 2004-06-08 International Business Machines Corporation System for selling a product utilizing audio content identification
US20020072982A1 (en) * 2000-12-12 2002-06-13 Shazam Entertainment Ltd. Method and system for interacting with a user in an experiential environment
US6959276B2 (en) 2001-09-27 2005-10-25 Microsoft Corporation Including the category of environmental noise when processing speech signals
US6941324B2 (en) * 2002-03-21 2005-09-06 Microsoft Corporation Methods and systems for processing playlists
US20030187953A1 (en) * 2002-03-26 2003-10-02 Pearson Jeffrey J. Method of preparing and integrating set programming for the internet
ES2312772T3 (en) * 2002-04-25 2009-03-01 Landmark Digital Services Llc Robust and invariant audio pattern matching
US7398209B2 (en) * 2002-06-03 2008-07-08 Voicebox Technologies, Inc. Systems and methods for responding to natural language speech utterance
EP1652173B1 (en) 2002-06-28 2015-12-30 Chemtron Research LLC Method and system for processing speech
US6907397B2 (en) 2002-09-16 2005-06-14 Matsushita Electric Industrial Co., Ltd. System and method of media file access and retrieval using speech recognition
JP4109063B2 (en) 2002-09-18 2008-06-25 パイオニア株式会社 Speech recognition apparatus and speech recognition method
JP4352790B2 (en) 2002-10-31 2009-10-28 セイコーエプソン株式会社 Acoustic model creation method, speech recognition device, and vehicle having speech recognition device
US7519534B2 (en) 2002-10-31 2009-04-14 Agiletv Corporation Speech controlled access to content on a presentation medium
US7457745B2 (en) 2002-12-03 2008-11-25 Hrl Laboratories, Llc Method and apparatus for fast on-line automatic speaker/environment adaptation for speech/speaker recognition in the presence of changing environments
US7617094B2 (en) 2003-02-28 2009-11-10 Palo Alto Research Center Incorporated Methods, apparatus, and products for identifying a conversation
EP1652385B1 (en) * 2003-07-25 2007-09-12 Koninklijke Philips Electronics N.V. Method and device for generating and detecting fingerprints for synchronizing audio and video
US7324943B2 (en) 2003-10-02 2008-01-29 Matsushita Electric Industrial Co., Ltd. Voice tagging, voice annotation, and speech recognition for portable devices with optional post processing
US7379875B2 (en) * 2003-10-24 2008-05-27 Microsoft Corporation Systems and methods for generating audio thumbnails
EP1531478A1 (en) 2003-11-12 2005-05-18 Sony International (Europe) GmbH Apparatus and method for classifying an audio signal
JP2005157494A (en) 2003-11-20 2005-06-16 Aruze Corp Conversation control apparatus and conversation control method
US7634095B2 (en) 2004-02-23 2009-12-15 General Motors Company Dynamic tuning of hands-free algorithm for noise and driving conditions
US7221902B2 (en) * 2004-04-07 2007-05-22 Nokia Corporation Mobile station and interface adapted for feature extraction from an input media sample
US20060041926A1 (en) 2004-04-30 2006-02-23 Vulcan Inc. Voice control of multimedia content
US7672543B2 (en) * 2005-08-23 2010-03-02 Ricoh Co., Ltd. Triggering applications based on a captured text in a mixed media environment
US7386105B2 (en) 2005-05-27 2008-06-10 Nice Systems Ltd Method and apparatus for fraud detection
US7945653B2 (en) 2006-10-11 2011-05-17 Facebook, Inc. Tagging digital media
US8271107B2 (en) 2006-01-13 2012-09-18 International Business Machines Corporation Controlling audio operation for data management and data rendering
KR100735820B1 (en) 2006-03-02 2007-07-06 삼성전자주식회사 Speech recognition method and apparatus for multimedia data retrieval in mobile device
US8019815B2 (en) * 2006-04-24 2011-09-13 Keener Jr Ellis Barlow Interactive audio/video method on the internet
US8831183B2 (en) 2006-12-22 2014-09-09 Genesys Telecommunications Laboratories, Inc Method for selecting interactive voice response modes using human voice detection analysis
US10056077B2 (en) 2007-03-07 2018-08-21 Nuance Communications, Inc. Using speech recognition results based on an unstructured language model with a music system
US8635243B2 (en) 2007-03-07 2014-01-21 Research In Motion Limited Sending a communications header with voice recording to send metadata for use in speech recognition, formatting, and search mobile search application
US20080221880A1 (en) 2007-03-07 2008-09-11 Cerra Joseph P Mobile music environment speech processing facility
US8861898B2 (en) 2007-03-16 2014-10-14 Sony Corporation Content image search
WO2009005760A2 (en) 2007-06-29 2009-01-08 Lawrence Genen Method or apparatus for purchasing one or more media based on a recommendation
US7788095B2 (en) 2007-11-18 2010-08-31 Nice Systems, Ltd. Method and apparatus for fast search in call-center monitoring
US20090157523A1 (en) 2007-12-13 2009-06-18 Chacha Search, Inc. Method and system for human assisted referral to providers of products and services
US20090240668A1 (en) 2008-03-18 2009-09-24 Yi Li System and method for embedding search capability in digital images
US8121837B2 (en) 2008-04-24 2012-02-21 Nuance Communications, Inc. Adjusting a speech engine for a mobile computing device based on background noise
TWI352970B (en) 2008-04-30 2011-11-21 Delta Electronics Inc Voice input system and voice input method
CN101751806A (en) * 2008-12-18 2010-06-23 鸿富锦精密工业(深圳)有限公司 Audio frequency play device with interactive function and interactive method thereof
CN101770705B (en) * 2009-01-05 2013-08-21 鸿富锦精密工业(深圳)有限公司 Audio playing device with interaction function and interaction method thereof
CN101807415B (en) * 2009-02-17 2013-11-06 鸿富锦精密工业(深圳)有限公司 Audio playing device with interactive function and interactive method thereof
US9858925B2 (en) 2009-06-05 2018-01-02 Apple Inc. Using context information to facilitate processing of commands in a virtual assistant
EP2339576B1 (en) 2009-12-23 2019-08-07 Google LLC Multi-modal input on an electronic device
EP2362620A1 (en) 2010-02-23 2011-08-31 Vodafone Holding GmbH Method of editing a noise-database and computer device
US8265928B2 (en) 2010-04-14 2012-09-11 Google Inc. Geotagged environmental audio for enhanced speech recognition accuracy
US9311395B2 (en) 2010-06-10 2016-04-12 Aol Inc. Systems and methods for manipulating electronic content based on speech recognition
US8234111B2 (en) 2010-06-14 2012-07-31 Google Inc. Speech and noise models for speech recognition
US9047371B2 (en) 2010-07-29 2015-06-02 Soundhound, Inc. System and method for matching a query against a broadcast stream
US8744860B2 (en) 2010-08-02 2014-06-03 At&T Intellectual Property I, L.P. Apparatus and method for providing messages in a social network
US8700392B1 (en) 2010-09-10 2014-04-15 Amazon Technologies, Inc. Speech-inclusive device interfaces
US8645132B2 (en) 2011-08-24 2014-02-04 Sensory, Inc. Truly handsfree speech recognition in high noise environments
US9800941B2 (en) 2011-01-03 2017-10-24 Curt Evans Text-synchronized media utilization and manipulation for transcripts
US20120179557A1 (en) 2011-01-12 2012-07-12 John Nicholas Gross Performance Based Internet Reward System
EP2686846A4 (en) 2011-03-18 2015-04-22 Nokia Corp Apparatus for audio signal processing
US9244984B2 (en) 2011-03-31 2016-01-26 Microsoft Technology Licensing, Llc Location based conversational understanding
US9311915B2 (en) 2013-07-31 2016-04-12 Google Inc. Context-based speech recognition
US9552816B2 (en) 2014-12-19 2017-01-24 Amazon Technologies, Inc. Application focus in speech-based systems

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6012030A (en) * 1998-04-21 2000-01-04 Nortel Networks Corporation Management of speech and audio prompts in multimodal interfaces
US20070160345A1 (en) * 2004-05-10 2007-07-12 Masaharu Sakai Multimedia reproduction device and menu screen display method
US20060247927A1 (en) * 2005-04-29 2006-11-02 Robbins Kenneth L Controlling an output while receiving a user input
US20070010992A1 (en) * 2005-07-08 2007-01-11 Microsoft Corporation Processing collocation mistakes in documents
US20070168335A1 (en) * 2006-01-17 2007-07-19 Moore Dennis B Deep enterprise search
US20080256033A1 (en) * 2007-04-10 2008-10-16 Motorola, Inc. Method and apparatus for distributed voice searching
US20100223056A1 (en) * 2009-02-27 2010-09-02 Autonomy Corporation Ltd. Various apparatus and methods for a speech recognition system
US20120084312A1 (en) * 2010-10-01 2012-04-05 Google Inc. Choosing recognized text from a background environment
US8438163B1 (en) * 2010-12-07 2013-05-07 Google Inc. Automatic learning of logos for visual recognition

Cited By (211)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11928604B2 (en) 2005-09-08 2024-03-12 Apple Inc. Method and apparatus for building an intelligent automated assistant
US11671920B2 (en) 2007-04-03 2023-06-06 Apple Inc. Method and system for operating a multifunction portable electronic device using voice-activation
US11023513B2 (en) 2007-12-20 2021-06-01 Apple Inc. Method and apparatus for searching using an active ontology
US10381016B2 (en) 2008-01-03 2019-08-13 Apple Inc. Methods and apparatus for altering audio output signals
US10108612B2 (en) 2008-07-31 2018-10-23 Apple Inc. Mobile device having human language translation capability with positional feedback
US11900936B2 (en) 2008-10-02 2024-02-13 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US10643611B2 (en) 2008-10-02 2020-05-05 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US11348582B2 (en) 2008-10-02 2022-05-31 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US9438949B2 (en) 2009-02-12 2016-09-06 Digimarc Corporation Media processing methods and arrangements
US11423886B2 (en) 2010-01-18 2022-08-23 Apple Inc. Task flow identification based on user intent
US10741185B2 (en) 2010-01-18 2020-08-11 Apple Inc. Intelligent automated assistant
US10692504B2 (en) 2010-02-25 2020-06-23 Apple Inc. User profiling for voice input processing
US10417405B2 (en) 2011-03-21 2019-09-17 Apple Inc. Device access using voice authentication
US11120372B2 (en) 2011-06-03 2021-09-14 Apple Inc. Performing actions associated with task items that represent tasks to perform
US11350253B2 (en) 2011-06-03 2022-05-31 Apple Inc. Active transport based notifications
US11069336B2 (en) 2012-03-02 2021-07-20 Apple Inc. Systems and methods for name pronunciation
US11321116B2 (en) 2012-05-15 2022-05-03 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US11269678B2 (en) 2012-05-15 2022-03-08 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US20140195230A1 (en) * 2013-01-07 2014-07-10 Samsung Electronics Co., Ltd. Display apparatus and method for controlling the same
US11557310B2 (en) 2013-02-07 2023-01-17 Apple Inc. Voice trigger for a digital assistant
US10714117B2 (en) 2013-02-07 2020-07-14 Apple Inc. Voice trigger for a digital assistant
US10978090B2 (en) 2013-02-07 2021-04-13 Apple Inc. Voice trigger for a digital assistant
US11862186B2 (en) 2013-02-07 2024-01-02 Apple Inc. Voice trigger for a digital assistant
US11636869B2 (en) 2013-02-07 2023-04-25 Apple Inc. Voice trigger for a digital assistant
US11388291B2 (en) 2013-03-14 2022-07-12 Apple Inc. System and method for processing voicemail
US11798547B2 (en) 2013-03-15 2023-10-24 Apple Inc. Voice activated device for use with a voice-based digital assistant
US10657961B2 (en) 2013-06-08 2020-05-19 Apple Inc. Interpreting and acting upon commands that involve sharing information with remote devices
US10769385B2 (en) 2013-06-09 2020-09-08 Apple Inc. System and method for inferring user intent from speech inputs
US11048473B2 (en) 2013-06-09 2021-06-29 Apple Inc. Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant
US11727219B2 (en) 2013-06-09 2023-08-15 Apple Inc. System and method for inferring user intent from speech inputs
US11314370B2 (en) 2013-12-06 2022-04-26 Apple Inc. Method for extracting salient dialog usage from live data
US9286910B1 (en) * 2014-03-13 2016-03-15 Amazon Technologies, Inc. System for resolving ambiguous queries based on user context
US10474426B2 (en) * 2014-04-22 2019-11-12 Sony Corporation Information processing device, information processing method, and computer program
US20170003933A1 (en) * 2014-04-22 2017-01-05 Sony Corporation Information processing device, information processing method, and computer program
US11133008B2 (en) 2014-05-30 2021-09-28 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US11670289B2 (en) 2014-05-30 2023-06-06 Apple Inc. Multi-command single utterance input method
US10714095B2 (en) 2014-05-30 2020-07-14 Apple Inc. Intelligent assistant for home automation
US10699717B2 (en) 2014-05-30 2020-06-30 Apple Inc. Intelligent assistant for home automation
US10417344B2 (en) 2014-05-30 2019-09-17 Apple Inc. Exemplar-based natural language processing
US10878809B2 (en) 2014-05-30 2020-12-29 Apple Inc. Multi-command single utterance input method
US11810562B2 (en) 2014-05-30 2023-11-07 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US10497365B2 (en) 2014-05-30 2019-12-03 Apple Inc. Multi-command single utterance input method
US10083690B2 (en) 2014-05-30 2018-09-25 Apple Inc. Better resolution when referencing to concepts
US11699448B2 (en) 2014-05-30 2023-07-11 Apple Inc. Intelligent assistant for home automation
US10657966B2 (en) 2014-05-30 2020-05-19 Apple Inc. Better resolution when referencing to concepts
US11257504B2 (en) 2014-05-30 2022-02-22 Apple Inc. Intelligent assistant for home automation
US11516537B2 (en) 2014-06-30 2022-11-29 Apple Inc. Intelligent automated assistant for TV user interactions
US11838579B2 (en) 2014-06-30 2023-12-05 Apple Inc. Intelligent automated assistant for TV user interactions
US10299050B2 (en) * 2014-08-27 2019-05-21 Auditory Labs, Llc Mobile audio receiver
US20160066106A1 (en) * 2014-08-27 2016-03-03 Auditory Labs, Llc Mobile audio receiver
US10431204B2 (en) 2014-09-11 2019-10-01 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US20160092447A1 (en) * 2014-09-30 2016-03-31 Rovi Guides, Inc. Systems and methods for searching for a media asset
US10390213B2 (en) 2014-09-30 2019-08-20 Apple Inc. Social reminders
US10438595B2 (en) 2014-09-30 2019-10-08 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
US11301507B2 (en) 2014-09-30 2022-04-12 Rovi Guides, Inc. Systems and methods for searching for a media asset
US11860927B2 (en) 2014-09-30 2024-01-02 Rovi Guides, Inc. Systems and methods for searching for a media asset
US10453443B2 (en) 2014-09-30 2019-10-22 Apple Inc. Providing an indication of the suitability of speech recognition
US9830321B2 (en) * 2014-09-30 2017-11-28 Rovi Guides, Inc. Systems and methods for searching for a media asset
US11231904B2 (en) 2015-03-06 2022-01-25 Apple Inc. Reducing response latency of intelligent automated assistants
US10529332B2 (en) 2015-03-08 2020-01-07 Apple Inc. Virtual assistant activation
US11842734B2 (en) 2015-03-08 2023-12-12 Apple Inc. Virtual assistant activation
US11087759B2 (en) 2015-03-08 2021-08-10 Apple Inc. Virtual assistant activation
US10311871B2 (en) 2015-03-08 2019-06-04 Apple Inc. Competing devices responding to voice triggers
US10930282B2 (en) 2015-03-08 2021-02-23 Apple Inc. Competing devices responding to voice triggers
US11468282B2 (en) 2015-05-15 2022-10-11 Apple Inc. Virtual assistant in a communication session
US11070949B2 (en) 2015-05-27 2021-07-20 Apple Inc. Systems and methods for proactively identifying and surfacing relevant content on an electronic device with a touch-sensitive display
US11127397B2 (en) 2015-05-27 2021-09-21 Apple Inc. Device voice control
US10681212B2 (en) 2015-06-05 2020-06-09 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US11025565B2 (en) 2015-06-07 2021-06-01 Apple Inc. Personalized prediction of responses for instant messaging
US11010127B2 (en) 2015-06-29 2021-05-18 Apple Inc. Virtual assistant for media playback
US11947873B2 (en) 2015-06-29 2024-04-02 Apple Inc. Virtual assistant for media playback
US10580431B2 (en) * 2015-08-27 2020-03-03 Auditory Labs Llc Auditory interpretation device with display
US20190228792A1 (en) * 2015-08-27 2019-07-25 Auditory Labs Llc Auditory interpretation device with display
CN105206266A (en) * 2015-09-01 2015-12-30 重庆长安汽车股份有限公司 Vehicle-mounted voice control system and method based on user intention guess
US11853536B2 (en) 2015-09-08 2023-12-26 Apple Inc. Intelligent automated assistant in a media environment
US11809483B2 (en) 2015-09-08 2023-11-07 Apple Inc. Intelligent automated assistant for media search and playback
US11500672B2 (en) 2015-09-08 2022-11-15 Apple Inc. Distributed personal assistant
US11550542B2 (en) 2015-09-08 2023-01-10 Apple Inc. Zero latency digital assistant
US11126400B2 (en) 2015-09-08 2021-09-21 Apple Inc. Zero latency digital assistant
US11526368B2 (en) 2015-11-06 2022-12-13 Apple Inc. Intelligent automated assistant in a messaging environment
US11809886B2 (en) 2015-11-06 2023-11-07 Apple Inc. Intelligent automated assistant in a messaging environment
US10956666B2 (en) 2015-11-09 2021-03-23 Apple Inc. Unconventional virtual assistant interactions
US11886805B2 (en) 2015-11-09 2024-01-30 Apple Inc. Unconventional virtual assistant interactions
US10446145B2 (en) 2015-11-27 2019-10-15 Samsung Electronics Co., Ltd. Question and answer processing method and electronic device for supporting the same
KR20170061923A (en) * 2015-11-27 2017-06-07 삼성전자주식회사 Method For Processing of Question and answer and electronic device supporting the same
CN108292317A (en) * 2015-11-27 2018-07-17 三星电子株式会社 Question and answer processing method and electronic device for supporting the same
EP3332338A4 (en) * 2015-11-27 2018-08-15 Samsung Electronics Co., Ltd. Question and answer processing method and electronic device for supporting the same
KR102558437B1 (en) * 2015-11-27 2023-07-24 삼성전자주식회사 Method For Processing of Question and answer and electronic device supporting the same
WO2017090947A1 (en) * 2015-11-27 2017-06-01 Samsung Electronics Co., Ltd. Question and answer processing method and electronic device for supporting the same
US10354652B2 (en) 2015-12-02 2019-07-16 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10942703B2 (en) 2015-12-23 2021-03-09 Apple Inc. Proactive assistance based on dialog communication between devices
US11853647B2 (en) 2015-12-23 2023-12-26 Apple Inc. Proactive assistance based on dialog communication between devices
US11227589B2 (en) 2016-06-06 2022-01-18 Apple Inc. Intelligent list reading
US11069347B2 (en) 2016-06-08 2021-07-20 Apple Inc. Intelligent automated assistant for media exploration
US10733993B2 (en) 2016-06-10 2020-08-04 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US11037565B2 (en) 2016-06-10 2021-06-15 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US11657820B2 (en) 2016-06-10 2023-05-23 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US11809783B2 (en) 2016-06-11 2023-11-07 Apple Inc. Intelligent device arbitration and control
US11152002B2 (en) 2016-06-11 2021-10-19 Apple Inc. Application integration with a digital assistant
US10942702B2 (en) 2016-06-11 2021-03-09 Apple Inc. Intelligent device arbitration and control
US11749275B2 (en) 2016-06-11 2023-09-05 Apple Inc. Application integration with a digital assistant
US10580409B2 (en) 2016-06-11 2020-03-03 Apple Inc. Application integration with a digital assistant
US10474753B2 (en) 2016-09-07 2019-11-12 Apple Inc. Language identification using recurrent neural networks
US20180308486A1 (en) * 2016-09-23 2018-10-25 Apple Inc. Intelligent automated assistant
US10553215B2 (en) * 2016-09-23 2020-02-04 Apple Inc. Intelligent automated assistant
US10043516B2 (en) * 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
US11281993B2 (en) 2016-12-05 2022-03-22 Apple Inc. Model and ensemble compression for metric learning
US11204787B2 (en) 2017-01-09 2021-12-21 Apple Inc. Application integration with a digital assistant
US11656884B2 (en) 2017-01-09 2023-05-23 Apple Inc. Application integration with a digital assistant
US10741181B2 (en) 2017-05-09 2020-08-11 Apple Inc. User interface for correcting recognition errors
US10332518B2 (en) 2017-05-09 2019-06-25 Apple Inc. User interface for correcting recognition errors
US10417266B2 (en) 2017-05-09 2019-09-17 Apple Inc. Context-aware ranking of intelligent response suggestions
US11599331B2 (en) 2017-05-11 2023-03-07 Apple Inc. Maintaining privacy of personal information
US10726832B2 (en) 2017-05-11 2020-07-28 Apple Inc. Maintaining privacy of personal information
US11467802B2 (en) 2017-05-11 2022-10-11 Apple Inc. Maintaining privacy of personal information
US10847142B2 (en) 2017-05-11 2020-11-24 Apple Inc. Maintaining privacy of personal information
US10395654B2 (en) 2017-05-11 2019-08-27 Apple Inc. Text normalization based on a data-driven learning network
US11837237B2 (en) 2017-05-12 2023-12-05 Apple Inc. User-specific acoustic models
US11580990B2 (en) 2017-05-12 2023-02-14 Apple Inc. User-specific acoustic models
US11538469B2 (en) 2017-05-12 2022-12-27 Apple Inc. Low-latency intelligent automated assistant
US11862151B2 (en) 2017-05-12 2024-01-02 Apple Inc. Low-latency intelligent automated assistant
US11301477B2 (en) 2017-05-12 2022-04-12 Apple Inc. Feedback analysis of a digital assistant
US11405466B2 (en) 2017-05-12 2022-08-02 Apple Inc. Synchronization and task delegation of a digital assistant
US11380310B2 (en) 2017-05-12 2022-07-05 Apple Inc. Low-latency intelligent automated assistant
US10789945B2 (en) 2017-05-12 2020-09-29 Apple Inc. Low-latency intelligent automated assistant
US11532306B2 (en) 2017-05-16 2022-12-20 Apple Inc. Detecting a trigger of a digital assistant
US10748546B2 (en) 2017-05-16 2020-08-18 Apple Inc. Digital assistant services based on device capabilities
US11675829B2 (en) 2017-05-16 2023-06-13 Apple Inc. Intelligent automated assistant for media exploration
US10403278B2 (en) 2017-05-16 2019-09-03 Apple Inc. Methods and systems for phonetic matching in digital assistant services
US10909171B2 (en) 2017-05-16 2021-02-02 Apple Inc. Intelligent automated assistant for media exploration
US10311144B2 (en) 2017-05-16 2019-06-04 Apple Inc. Emoji word sense disambiguation
US10303715B2 (en) 2017-05-16 2019-05-28 Apple Inc. Intelligent automated assistant for media exploration
US10657328B2 (en) 2017-06-02 2020-05-19 Apple Inc. Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling
US11670302B2 (en) 2017-07-10 2023-06-06 Samsung Electronics Co., Ltd. Voice processing method and electronic device supporting the same
US10839806B2 (en) 2017-07-10 2020-11-17 Samsung Electronics Co., Ltd. Voice processing method and electronic device supporting the same
US10445429B2 (en) 2017-09-21 2019-10-15 Apple Inc. Natural language understanding using vocabularies with compressed serialized tries
US10755051B2 (en) 2017-09-29 2020-08-25 Apple Inc. Rule-based natural language processing
CN108154881A (en) * 2017-11-28 2018-06-12 苏州市东皓计算机系统工程有限公司 A kind of voice recognition method of computer
US10636424B2 (en) 2017-11-30 2020-04-28 Apple Inc. Multi-turn canned dialog
US10733982B2 (en) 2018-01-08 2020-08-04 Apple Inc. Multi-directional dialog
US10733375B2 (en) 2018-01-31 2020-08-04 Apple Inc. Knowledge-based framework for improving natural language understanding
US10789959B2 (en) 2018-03-02 2020-09-29 Apple Inc. Training speaker recognition models for digital assistants
US10592604B2 (en) 2018-03-12 2020-03-17 Apple Inc. Inverse text normalization for automatic speech recognition
US10818288B2 (en) 2018-03-26 2020-10-27 Apple Inc. Natural assistant interaction
US11710482B2 (en) 2018-03-26 2023-07-25 Apple Inc. Natural assistant interaction
US10909331B2 (en) 2018-03-30 2021-02-02 Apple Inc. Implicit identification of translation payload with neural machine translation
US11094316B2 (en) * 2018-05-04 2021-08-17 Qualcomm Incorporated Audio analytics for natural language processing
US20190341026A1 (en) * 2018-05-04 2019-11-07 Qualcomm Incorporated Audio analytics for natural language processing
US11854539B2 (en) 2018-05-07 2023-12-26 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US11907436B2 (en) 2018-05-07 2024-02-20 Apple Inc. Raise to speak
US10928918B2 (en) 2018-05-07 2021-02-23 Apple Inc. Raise to speak
US11487364B2 (en) 2018-05-07 2022-11-01 Apple Inc. Raise to speak
US11145294B2 (en) 2018-05-07 2021-10-12 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US11900923B2 (en) 2018-05-07 2024-02-13 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US11169616B2 (en) 2018-05-07 2021-11-09 Apple Inc. Raise to speak
US10984780B2 (en) 2018-05-21 2021-04-20 Apple Inc. Global semantic word embeddings using bi-directional recurrent neural networks
US11009970B2 (en) 2018-06-01 2021-05-18 Apple Inc. Attention aware virtual assistant dismissal
US10403283B1 (en) 2018-06-01 2019-09-03 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US10892996B2 (en) 2018-06-01 2021-01-12 Apple Inc. Variable latency device coordination
US11630525B2 (en) 2018-06-01 2023-04-18 Apple Inc. Attention aware virtual assistant dismissal
US11495218B2 (en) 2018-06-01 2022-11-08 Apple Inc. Virtual assistant operation in multi-device environments
US11360577B2 (en) 2018-06-01 2022-06-14 Apple Inc. Attention aware virtual assistant dismissal
US10720160B2 (en) 2018-06-01 2020-07-21 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US10684703B2 (en) 2018-06-01 2020-06-16 Apple Inc. Attention aware virtual assistant dismissal
US11431642B2 (en) 2018-06-01 2022-08-30 Apple Inc. Variable latency device coordination
US10984798B2 (en) 2018-06-01 2021-04-20 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US11386266B2 (en) 2018-06-01 2022-07-12 Apple Inc. Text correction
US10944859B2 (en) 2018-06-03 2021-03-09 Apple Inc. Accelerated task performance
US10504518B1 (en) 2018-06-03 2019-12-10 Apple Inc. Accelerated task performance
US10496705B1 (en) 2018-06-03 2019-12-03 Apple Inc. Accelerated task performance
US20210248168A1 (en) * 2018-08-24 2021-08-12 Hewlett-Packard Development Company, L.P. Identifying digital elements
US11010561B2 (en) 2018-09-27 2021-05-18 Apple Inc. Sentiment prediction from textual data
US11462215B2 (en) 2018-09-28 2022-10-04 Apple Inc. Multi-modal inputs for voice commands
US11893992B2 (en) 2018-09-28 2024-02-06 Apple Inc. Multi-modal inputs for voice commands
US11170166B2 (en) 2018-09-28 2021-11-09 Apple Inc. Neural typographical error modeling via generative adversarial networks
US10839159B2 (en) 2018-09-28 2020-11-17 Apple Inc. Named entity normalization in a spoken dialog system
US11475898B2 (en) 2018-10-26 2022-10-18 Apple Inc. Low-latency multi-speaker speech recognition
US20220084516A1 (en) * 2018-12-06 2022-03-17 Comcast Cable Communications, Llc Voice Command Trigger Words
US11638059B2 (en) 2019-01-04 2023-04-25 Apple Inc. Content playback on multiple devices
US11783815B2 (en) 2019-03-18 2023-10-10 Apple Inc. Multimodality in digital assistant systems
US11348573B2 (en) 2019-03-18 2022-05-31 Apple Inc. Multimodality in digital assistant systems
US11675491B2 (en) 2019-05-06 2023-06-13 Apple Inc. User configurable task triggers
US11705130B2 (en) 2019-05-06 2023-07-18 Apple Inc. Spoken notifications
US11217251B2 (en) 2019-05-06 2022-01-04 Apple Inc. Spoken notifications
US11475884B2 (en) 2019-05-06 2022-10-18 Apple Inc. Reducing digital assistant latency when a language is incorrectly determined
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
US11423908B2 (en) 2019-05-06 2022-08-23 Apple Inc. Interpreting spoken requests
US11888791B2 (en) 2019-05-21 2024-01-30 Apple Inc. Providing message response suggestions
US11140099B2 (en) 2019-05-21 2021-10-05 Apple Inc. Providing message response suggestions
US11289073B2 (en) 2019-05-31 2022-03-29 Apple Inc. Device text to speech
US11496600B2 (en) 2019-05-31 2022-11-08 Apple Inc. Remote execution of machine-learned models
US11237797B2 (en) 2019-05-31 2022-02-01 Apple Inc. User activity shortcut suggestions
US11657813B2 (en) 2019-05-31 2023-05-23 Apple Inc. Voice identification in digital assistant systems
US11360739B2 (en) 2019-05-31 2022-06-14 Apple Inc. User activity shortcut suggestions
US11360641B2 (en) 2019-06-01 2022-06-14 Apple Inc. Increasing the relevance of new available information
US11790914B2 (en) 2019-06-01 2023-10-17 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
US11488406B2 (en) 2019-09-25 2022-11-01 Apple Inc. Text detection using global geometry estimators
US11664044B2 (en) 2019-11-25 2023-05-30 Qualcomm Incorporated Sound event detection learning
US11765209B2 (en) 2020-05-11 2023-09-19 Apple Inc. Digital assistant hardware abstraction
US11914848B2 (en) 2020-05-11 2024-02-27 Apple Inc. Providing relevant data items based on context
US11924254B2 (en) 2020-05-11 2024-03-05 Apple Inc. Digital assistant hardware abstraction
US11755276B2 (en) 2020-05-12 2023-09-12 Apple Inc. Reducing description length based on confidence
US11838734B2 (en) 2020-07-20 2023-12-05 Apple Inc. Multi-device audio adjustment coordination
US11750962B2 (en) 2020-07-21 2023-09-05 Apple Inc. User identification using headphones
US11696060B2 (en) 2020-07-21 2023-07-04 Apple Inc. User identification using headphones
US11410677B2 (en) 2020-11-24 2022-08-09 Qualcomm Incorporated Adaptive sound event classification
EP4102501A1 (en) * 2021-06-08 2022-12-14 Comcast Cable Communications LLC Processing voice commands
US11914644B2 (en) * 2021-10-11 2024-02-27 Microsoft Technology Licensing, Llc Suggested queries for transcript search
US20230115098A1 (en) * 2021-10-11 2023-04-13 Microsoft Technology Licensing, Llc Suggested queries for transcript search
US11954405B2 (en) 2022-11-07 2024-04-09 Apple Inc. Zero latency digital assistant
CN116059646A (en) * 2023-04-06 2023-05-05 深圳尚米网络技术有限公司 Interactive expert guidance system

Also Published As

Publication number Publication date
US20160343371A1 (en) 2016-11-24
US20170133014A1 (en) 2017-05-11
US9786279B2 (en) 2017-10-10
US9576576B2 (en) 2017-02-21

Similar Documents

Publication Publication Date Title
US9786279B2 (en) Answering questions using environmental context
US9031840B2 (en) Identifying media content
KR102241972B1 (en) Answering questions using environmental context
US11842727B2 (en) Natural language processing with contextual data representing displayed content
US20210056133A1 (en) Query response using media consumption history
US9123330B1 (en) Large-scale speaker identification
US9148619B2 (en) Music soundtrack recommendation engine for videos
US20130166303A1 (en) Accessing media data using metadata repository
US10565256B2 (en) Contextually disambiguating queries
US20140379346A1 (en) Video analysis based language model adaptation
EP3583514A1 (en) Contextually disambiguating queries
EP2706470A1 (en) Answering questions using environmental context
Bourlard et al. Processing and linking audio events in large multimedia archives: The EU inEvent project
US11640426B1 (en) Background audio identification for query disambiguation
US11657805B2 (en) Dynamic context-based routing of speech processing
US11830497B2 (en) Multi-domain intent handling with cross-domain contextual signals
US20220415311A1 (en) Early invocation for contextual data processing
Sonawane et al. Sand search engine
WO2022271555A1 (en) Early invocation for contextual data processing

Legal Events

Date Code Title Description
AS Assignment
Owner name: GOOGLE INC., CALIFORNIA
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SHARIFI, MATTHEW;POSTELNICU, GHEORGHE;SIGNING DATES FROM 20120919 TO 20120920;REEL/FRAME:029165/0515

STCB Information on status: application discontinuation
Free format text: ABANDONED -- AFTER EXAMINER'S ANSWER OR BOARD OF APPEALS DECISION

AS Assignment
Owner name: GOOGLE LLC, CALIFORNIA
Free format text: CHANGE OF NAME;ASSIGNOR:GOOGLE INC.;REEL/FRAME:044144/0001
Effective date: 20170929