US20140328570A1 - Identifying, describing, and sharing salient events in images and videos - Google Patents
- Publication number
- US20140328570A1
- Authority
- US
- United States
- Prior art keywords
- event
- computing system
- salient
- video
- segments
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/43—Querying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/48—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/02—Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
- G11B27/031—Electronic editing of digitised analogue information signals, e.g. audio or video signals
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/10—Indexing; Addressing; Timing or synchronising; Measuring tape travel
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/233—Processing of audio elementary streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
- H04N21/23418—Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4394—Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
- H04N21/4402—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
- H04N21/44029—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display for generating different versions
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/85—Assembly of content; Generation of multimedia applications
- H04N21/854—Content authoring
- H04N21/8549—Creating video summaries, e.g. movie trailer
Definitions
- visual content, e.g., digital photos and videos
- mobile device applications, instant messaging and electronic mail, social media services, and other electronic communication methods
- FIG. 1 is a simplified schematic diagram of an environment of at least one embodiment of a multimedia content understanding and assistance computing system including a multimedia content understanding module as disclosed herein;
- FIG. 2 is a simplified schematic diagram of an environment of at least one embodiment of the multimedia content understanding module of FIG. 1 ;
- FIG. 3 is a simplified flow diagram of at least one embodiment of a process executable by the computing system of FIG. 1 to provide multimedia content understanding and assistance as disclosed herein;
- FIG. 4 is a simplified schematic illustration of at least one embodiment of feature, concept, event, and salient activity modeling as disclosed herein;
- FIG. 5 is a simplified example of at least one embodiment of automated salient event detection in a video as disclosed herein;
- FIG. 6 is a simplified block diagram of an exemplary computing environment in connection with which at least one embodiment of the system of FIG. 1 may be implemented.
- Referring to FIG. 1, an embodiment of a multimedia content understanding and assistance computing system 100 is shown in the context of an environment that may be created during the operation of the system 100 (e.g., a physical and/or virtual execution or “runtime” environment).
- the illustrative multimedia content understanding and assistance computing system 100 is embodied as a number of machine-readable instructions, modules, data structures and/or other components, which may be implemented as computer hardware, firmware, software, or a combination thereof.
- the multimedia content understanding and assistance computing system 100 may be referred to herein as a “visual content assistant,” a “video assistant,” an “image assistant,” a “multimedia assistant,” or by similar terminology.
- the computing system 100 executes computer vision algorithms, including machine learning algorithms, semantic reasoning techniques, and/or other technologies to, among other things, in an automated fashion, identify, understand, and describe events that are depicted in multimedia input 102 .
- the illustrative computing system 100 can, among other things, help users quickly and easily locate “salient activities” in lengthy and/or large volumes of video footage, so that the most important or meaningful segments can be extracted, retained and shared.
- the computing system 100 can compile the salient event segments into a visual presentation 120 (e.g., a “highlight reel” video clip) that can be stored and/or shared over a computer network.
- the system 100 can, alternatively or in addition, generate a natural language (NL) description 122 of the multimedia input 102 , which describes the content of the input 102 in a manner that can be used for, among other things, searching, retrieval, and establishing links between the input 102 and other electronic content (such as text or voice input, advertisements, other videos, documents, or other multimedia content).
- NL natural language
- the events that can be automatically detected and described by the computing system 100 include “complex events.”
- “complex event” may refer to, among other things, an event that is made up of multiple “constituent” people, objects, scenes and/or activities.
- a birthday party is a complex event that can include the activities of singing, blowing out candles, opening presents, and eating cake.
- a child acting out an improvisation is a complex event that may include people smiling, laughing, dancing, drawing a picture, and applause.
- a group activity relating to a political issue, sports event, or music performance is a complex event that may involve a group of people walking or standing together, a person holding a sign, written words on the sign, a person wearing a t-shirt with a slogan printed on the shirt, and human voices shouting.
- complex events include human interactions with other people (e.g., conversations, meetings, presentations, etc.) and human interactions with objects (e.g., cooking, repairing a machine, conducting an experiment, building a house, etc.).
- the activities that make up a complex event are not limited to visual features.
- activities may refer to, among other things, visual, audio, and/or text features, which may be detected by the computing system 100 in an automated fashion using a number of different algorithms and feature detection techniques, as described in more detail below.
- an activity as used herein may refer to any semantic element of the multimedia input 102 that, as determined by the computing system 100 , evidences an event.
- multimedia input may refer to, among other things, a collection of digital images, a video, a collection of videos, or a collection of images and videos (where a “collection” includes two or more images and/or videos).
- References herein to a “video” may refer to, among other things, a relatively short video clip, an entire full-length video production, or different segments within a video or video clip (where a segment includes a sequence of two or more frames of the video).
- Any video of the input 102 may include or have associated therewith an audio soundtrack and/or a speech transcript, where the speech transcript may be generated by, for example, an automated speech recognition (ASR) module of the computing system 100 .
- ASR automated speech recognition
- Any video or image of the input 102 may include or have associated therewith a text transcript, where the text transcript may be generated by, for example, an optical character recognition (OCR) module of the computing system 100 .
- OCR optical character recognition
- References herein to an “image” may refer to, among other things, a still image (e.g., a digital photograph) or a frame of a video (e.g., a “key frame”).
- a multimedia content understanding module 104 of the computing system 100 is embodied as software, firmware, hardware, or a combination thereof.
- the multimedia content understanding module 104 applies a number of different feature detection algorithms 130 to the multimedia input 102 , using a multimedia content knowledge base 132 , and generates an event description 106 based on the output of the algorithms 130 .
- the multimedia knowledge base 132 is embodied as software, firmware, hardware, or a combination thereof (e.g., as a database, table, or other suitable data structure or computer programming construct).
- the illustrative multimedia content understanding module 104 executes different feature detection algorithms 130 on different parts or segments of the multimedia input 102 to detect different features, or the multimedia content understanding module 104 executes all or a subset of the feature detection algorithms 130 on all portions of the multimedia input 102 .
- Some examples of feature detection algorithms and techniques, including low-level, mid-level, and complex event detection and recognition techniques, are described in the priority application, Cheng et al., U.S. Utility patent application Ser. No. 13/737,607 (“Classification, Search, and Retrieval of Complex Video Events”); and also in Chakraborty et al., U.S. Utility patent application Ser. No. 14/021,696, filed Sep.
- the event description 106 semantically describes an event depicted by the multimedia input 102 , as determined by the multimedia content understanding module 104 .
- the event description 106 is determined algorithmically by the computing system 100 analyzing the multimedia input 102 .
- the event description 106 may be user-supplied or determined by the system 100 based on meta data or other descriptive information associated with the input 102 .
- the illustrative event description 106 generated by the understanding module 104 indicates an event type or category, such as “birthday party,” “wedding,” “soccer game,” “hiking trip,” or “family activity.”
- the event description 106 may be embodied as, for example, a natural language word or phrase that is encoded in a tag or label, which the computing system 100 associates with the multimedia input 102 (e.g., as an extensible markup language or XML tag).
- the event description 106 may be embodied as structured data, e.g., a data type or data structure including semantics, such as “Party(retirement),” “Party(birthday),” “Sports_Event(soccer),” “Performance(singing),” or “Performance(dancing).”
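The structured event-description formats mentioned above (e.g., “Party(birthday)”) could be modeled as a small data type. The sketch below is purely illustrative; the class and field names are hypothetical and not part of the patent:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EventDescription:
    """Hypothetical structured event description, e.g. Party(birthday)."""
    event_type: str  # e.g., "Party", "Sports_Event", "Performance"
    subtype: str     # e.g., "birthday", "soccer", "singing"

    def __str__(self):
        # Render in the "Type(subtype)" notation used in the text above.
        return f"{self.event_type}({self.subtype})"

print(EventDescription("Party", "birthday"))  # Party(birthday)
```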
- the illustrative multimedia content understanding module 104 accesses one or more feature models 134 and/or concept models 136 .
- the feature models 134 and the concept models 136 are embodied as software, firmware, hardware, or a combination thereof, e.g., a knowledge base, database, table, or other suitable data structure or computer programming construct.
- the models 134 , 136 correlate semantic descriptions of features and concepts with instances or combinations of output of the algorithms 130 that evidence those features and concepts.
- the feature models 134 may define relationships between sets of low level features detected by the algorithms 130 with semantic descriptions of those sets of features (e.g., “object,” “person,” “face,” “ball,” “vehicle,” etc.).
- the concept model 136 may define relationships between sets of features detected by the algorithms 130 and higher-level “concepts,” such as people, objects, actions and poses (e.g., “sitting,” “running,” “throwing,” etc.).
- the semantic descriptions of features and concepts that are maintained by the models 134 , 136 may be embodied as natural language descriptions and/or structured data.
- a mapping 140 of the knowledge base 132 indicates relationships between various combinations of features, concepts, events, and activities.
- the event description 106 can be determined using semantic reasoning in connection with the knowledge base 132 and/or the mapping 140 .
- the computing system 100 may utilize, for example, a knowledge representation language or ontology.
- the computing system 100 uses the event description 106 and the knowledge base 132 to determine one or more “salient” activities that are associated with the occurrence of the detected event. To do this, the computing system 100 may access salient event criteria 138 and/or the mapping 140 of the knowledge base 132 .
- the illustrative salient event criteria 138 indicate one or more criteria for determining whether an activity is a salient activity in relation to one or more events. For instance, the salient event criteria 138 identify salient activities and the corresponding feature detection information that the computing system 100 needs in order to algorithmically identify those salient activities in the input 102 (where the feature detection information may include, for example, parameters of computer vision algorithms 130 ).
- the salient event criteria 138 include saliency indicators 238 (FIG. 2).
- a salient event criterion 138 may be embodied as, for example, one or more pre-defined, selected, or computed data values.
- a saliency indicator 238 may be embodied as, for example, a pre-defined, selected, or computed data value, such as a priority, a weight or a rank that can be used to arrange or prioritize the salient event segments 112 .
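As a toy illustration of how saliency indicators such as weights or ranks might be used to arrange or prioritize salient event segments, consider the following sketch; the weight values and function name are hypothetical assumptions, not the patent's actual criteria:

```python
def rank_segments(segments, saliency_weights):
    """Order (segment_id, activity) pairs by the saliency weight of the
    activity each depicts, highest first; unknown activities weigh 0."""
    return sorted(segments, key=lambda s: saliency_weights.get(s[1], 0.0),
                  reverse=True)

# Hypothetical weights: blowing out candles as the most salient birthday activity.
weights = {"blowing out candles": 0.9, "eating cake": 0.6, "crowd noise": 0.2}
segs = [("seg1", "crowd noise"), ("seg2", "blowing out candles"),
        ("seg3", "eating cake")]
print(rank_segments(segs, weights)[0])  # ('seg2', 'blowing out candles')
```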
- the mapping 140 of the knowledge base 132 links activities with events, so that, once the event description 106 is determined, the understanding module 104 can determine the activities that are associated with the event description 106 and look for those activities in the input 102 .
- the mapping 140 may establish one-to-one, one-to-many, or many-to-many logical relationships between the various events and activities in the knowledge base 132 .
- mapping 140 and the various other portions of the knowledge base 132 can be configured and defined according to the requirements of a particular design of the computing system 100 (e.g., according to domain-specific requirements).
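One hedged way to picture the event-to-activity mapping 140 is as a simple lookup structure. The entries below are illustrative examples only, not the actual contents of the knowledge base, which the patent notes is configured per design and domain:

```python
# Hypothetical event-to-activity mapping (one-to-many in this toy version;
# the patent also allows one-to-one and many-to-many relationships).
EVENT_ACTIVITY_MAP = {
    "Party(birthday)": ["singing", "blowing out candles",
                        "opening presents", "eating cake"],
    "Sports_Event(soccer)": ["kicking ball", "cheering", "goal celebration"],
}

def activities_for_event(event_description):
    """Return the constituent activities linked to a detected event."""
    return EVENT_ACTIVITY_MAP.get(event_description, [])
```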
- the computing system 100 executes one or more algorithms 130 to identify particular portions or segments of the multimedia input 102 that depict those salient activities.
- the computing system 100 determines that the multimedia input 102 depicts a birthday party (the event)
- the illustrative multimedia content understanding module 104 accesses the multimedia content knowledge base 132 to determine the constituent activities that are associated with a birthday party (e.g., blowing out candles, etc.), and selects one or more of the feature detection algorithms 130 to execute on the multimedia input 102 to look for scenes in the input 102 that depict those constituent activities.
- the understanding module 104 executes the selected algorithms 130 to identify salient event segments 112 of the input 102 , such that the identified salient event segments 112 each depict one (or more) of the constituent activities that are associated with the birthday party.
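The selection loop described above — look up the event's constituent activities, then run only the corresponding detectors over the input's segments — might be sketched as follows. The detector callables stand in for the feature detection algorithms 130; all names here are hypothetical:

```python
def find_salient_segments(video_segments, event, mapping, detectors):
    """Run only the detectors for activities linked to `event`; return the
    segments that depict a constituent activity, tagged with that activity."""
    salient = []
    for seg in video_segments:
        for activity in mapping.get(event, []):
            detect = detectors.get(activity)
            if detect is not None and detect(seg):
                salient.append((seg, activity))
                break  # one activity label per segment is enough in this sketch
    return salient
```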
- an output generator module 114 of the computing system 100 can do a number of different things with the salient activity information.
- the output generator module 114 and its submodules, a visual presentation generator module 116 and a natural language generator module 118 are each embodied as software, firmware, hardware, or a combination thereof.
- the visual presentation generator module 116 of the output generator module 114 automatically extracts (e.g., removes or makes a copy of) the salient event segments 112 from the input 102 and incorporates the extracted segments 112 into a visual presentation 120 , such as a video clip (e.g., a “highlight reel”) or multimedia presentation, using a presentation template 142 .
- the visual presentation generator module 116 may select the particular presentation template 142 to use to create the presentation 120 based on a characteristic of the multimedia input 102 , the event description 106 , user input, domain-specific criteria, and/or other presentation template selection criteria.
- the natural language generator module 118 of the output generator module 114 automatically generates a natural language description 122 of the event 106 , including natural language descriptions of the salient event segments 112 and suitable transition phrases, using a natural language template 144 .
- the natural language presentation generator module 118 may select the particular natural language template 144 to use to create the NL description 122 based on a characteristic of the multimedia input 102 , the event description 106 , user input, domain-specific criteria, and/or other NL template selection criteria.
- An example of a natural language description 122 for a highlight reel of a child's birthday party, which may be output by the NL generator module 118, may include: “Child's birthday party, including children playing games followed by singing, blowing out candles, and eating cake.”
- Some examples of methods for generating the natural language description 122 (e.g., “recounting”) are described in the aforementioned priority patent application Ser. No. 13/737,607.
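A minimal sketch of template-driven recounting, assuming a simple “event, including activity list” template like the birthday-party example above; the function is an illustrative stand-in, not the patent's NL generator:

```python
def fill_nl_template(event_name, activities):
    """Join activity descriptions into one sentence with simple transition
    phrasing ("including ..., and ...") -- a toy NL template 144."""
    if not activities:
        return event_name + "."
    if len(activities) > 1:
        body = ", ".join(activities[:-1]) + ", and " + activities[-1]
    else:
        body = activities[0]
    return f"{event_name}, including {body}."
```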
- the NL generator module 118 may formulate the NL description 122 as natural language speech using, e.g., stored NL speech samples (which may be stored in, for example, data storage 620 ).
- the NL speech samples may include prepared NL descriptions of complex events and activities.
- the NL description 122 may be constructed “on the fly,” using, e.g., a natural language generator and text-to-speech (TTS) subsystem, which may be implemented as part of the computing system 100 or as external modules or systems with which the computing system 100 is in communication over a computer network (e.g., a network 646 ).
- TTS text-to-speech
- the presentation templates 142 provide the specifications that the output generator module 114 uses to select salient event segments 112 for inclusion in the visual presentation 120 , arrange the salient event segments 112 , and create the visual presentation 120 .
- a presentation template 142 specifies, for a particular event type, the type of content to include in the visual presentation 120 , the number of salient event segments 112 , the order in which to arrange the segments 112 , (e.g., chronological or by subject matter), the pace and transitions between the segments 112 , the accompanying audio or text, and/or other aspects of the visual presentation 120 .
- the presentation template 142 may further specify a maximum duration of the visual presentation 120 , which may correspond to a maximum duration permitted by a video sharing service or a social media service (in consideration of the limitations of the computer network infrastructure or for other reasons).
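To illustrate how a presentation template's specifications (segment count, ordering, maximum duration) might constrain assembly of a highlight reel, here is a hedged sketch; the field names and greedy strategy are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class PresentationTemplate:
    max_segments: int       # most segments the highlight reel may contain
    max_duration_s: float   # cap, e.g. imposed by a video sharing service
    order: str = "chronological"

def assemble(ranked_segments, template):
    """ranked_segments: (start_time_s, duration_s) pairs, best-first.
    Greedily keep segments that fit the caps, then order them for playback."""
    chosen, total = [], 0.0
    for start, dur in ranked_segments:
        if len(chosen) >= template.max_segments:
            break
        if total + dur > template.max_duration_s:
            continue  # this segment would exceed the duration cap; skip it
        chosen.append((start, dur))
        total += dur
    if template.order == "chronological":
        chosen.sort(key=lambda s: s[0])
    return chosen
```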
- Portions of the templates 142 may be embodied as an ontology or knowledge base that incorporates or accesses previously developed knowledge, such as knowledge obtained from the analysis of many inputs 102 over time, and information drawn from other data sources that are publicly available (e.g., on the Internet).
- the output generator module 114, or more specifically the visual presentation generator module 116, formulates the visual presentation 120 according to the system- or user-selected template 142 (e.g., by inserting the salient event segments 112 extracted from the multimedia input 102 into appropriate “slots” in the template 142).
- the illustrative computing system 100 includes a number of semantic content learning modules 152 , including feature learning modules 154 , a concept learning module 156 , a salient event learning module 158 , and a template learning module 160 .
- the learning modules 152 execute machine learning algorithms on samples of multimedia content (images and/or video) of an image/video collection 150 and create and/or update portions of the knowledge base 132 and/or the presentation templates 142.
- the learning modules 152 may be used to initially populate and/or periodically update portions of the knowledge base 132 and/or the templates 142 , 144 .
- the feature learning modules 154 analyze sample images and videos from the collection 150 and populate or update the feature models 134 .
- the feature learning modules 154 may, over time or as a result of analyzing portions of the collection 150 , algorithmically learn patterns of computer vision algorithm output that evidence a particular feature, and update the feature models 134 accordingly.
- the concept learning module 156 may, over time or as a result of analyzing portions of the collection 150 , algorithmically learn combinations of low level features that evidence particular concepts, and update the concept model 136 accordingly.
- the illustrative salient event learning module 158 analyzes portions of the image/video collection 150 to determine salient event criteria 138 , to identify events for inclusion in the mapping 140 , to identify activities that are associated with events, and to determine the saliency of various activities with respect to different events. For example, the salient event learning module 158 may identify a new event or activity for inclusion in the mapping 140 , or identify new salient event criteria 138 , based on the frequency of occurrence of certain features and/or concepts in the collection 150 .
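A frequency-based learning rule of the kind described — promote activities that co-occur with an event often enough across the collection — can be sketched as follows; the threshold and names are illustrative assumptions, not the patent's learning algorithm:

```python
from collections import Counter

def learn_salient_activities(observations, min_freq=0.5):
    """observations: one set of detected activities per instance of an event
    found in the collection. Activities seen in at least `min_freq` of the
    instances become candidate salient activities for that event."""
    counts = Counter(a for obs in observations for a in obs)
    n = len(observations)
    return {a for a, c in counts.items() if c / n >= min_freq}
```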
- the salient event learning module 158 can also identify multi-modal salient event markers, including “non-visual” characteristics of input videos such as object motion, changes in motion patterns, changes in camera position or camera motion, amount or direction of camera motion, camera angle, and audio features (e.g., cheering sounds or speech).
- the illustrative template learning module 160 analyzes the collection 150 to create and/or update the presentation templates 142 .
- the template learning module 160 may, after reviewing a set of professionally made videos in the collection 150 , incorporate organizational elements of those professionally made videos into one or more of the templates 142 or create a new template 142 that includes specifications gleaned from the professionally made videos. Portions of the template learning module 160 may algorithmically analyze the audio tracks of videos in the collection 150 and update the NL templates 144 , as well.
- the video collection 150 refers generally to one or more bodies of retrievable multimedia digital content that may be stored in computer memory at the computing system 100 and/or other computing systems or devices.
- the video collection 150 may include images and/or videos stored remotely at Internet sites such as YOUTUBE and INSTAGRAM, and/or images/videos that are stored in one or more local collections, such as storage media of a personal computer or mobile device (e.g., a “camera roll” of a mobile device camera application).
- images/videos in the collection 150 need not have been previously tagged with meta data or other identifying material in order to be useful to the computing system 100 .
- the computing system 100 can operate on images/videos 150 and/or multimedia input 102 whether or not they have been previously tagged or annotated in any way.
- any of the learning modules 152 can learn and apply those existing descriptions to the knowledge base 132 and/or the templates 142 , 144 .
- the output generator module 114 interfaces with an interactive storyboard module 124 to allow the end user to modify the (machine-generated) visual presentation 120 and/or the (machine-generated) NL description 122 , as desired.
- the illustrative interactive storyboard module 124 includes an editing module 126 , a sharing module 128 , and an auto-suggest module 162 .
- the interactive storyboard module 124 and its submodules 126 , 128 , 162 are each embodied as software, firmware, hardware, or a combination thereof.
- the editing module 126 displays the elements of the visual presentation 120 on a display device (e.g., a display device 642, FIG. 6).
- the interactive storyboard module 124 presents the salient event segments 112 using a storyboard format that enables the user to intuitively review, rearrange, add, and delete segments of the presentation 120 (e.g., by tapping on a touchscreen of the HCI subsystem 638).
- the interactive storyboard module 124 stores the updated version of the presentation 120 in computer memory (e.g., a data storage 620 ).
- the sharing module 128 is responsive to user interaction with the computing system 100 that indicates that the user would like to “share” the newly created or updated presentation 120 with other people, e.g., over a computer network, e-mail, a messaging service, or other electronic communication mechanism.
- the illustrative sharing module 128 can be pre-configured or user-configured to automatically enable sharing in response to the completion of a presentation 120 , or to only share the presentation 120 in response to affirmative user approval of the presentation 120 .
- the sharing module 128 is configured to automatically share (e.g., upload to an Internet-based photo or video sharing site or service) the presentation 120 in response to a single user interaction (e.g., “one click” sharing).
- the sharing module 128 interfaces with network interface technology of the computing system 100 (e.g., a communication subsystem 644 ).
- the auto-suggest module 162 leverages the information produced by other modules of the computing system 100 , including the event description 106 , the NL description 122 , and/or the visual presentation 120 , to provide an intelligent automatic image/video suggestion service.
- the auto-suggest module 162 associates, or interactively suggests, the visual presentation 120 or the multimedia input 102 to be associated, with other electronic content based on the event description 106 or the NL description 122 that the computing system 100 has automatically assigned to the multimedia input 102 or the visual presentation 120 .
- the auto-suggest module 162 includes a persistent input monitoring mechanism that monitors user inputs received by the editing module 126 or other user interface modules of the computing system 100 , including inputs received by other applications or services running on the computing system 100 .
- the auto-suggest module 162 evaluates the user inputs over time, compares the user inputs to the event descriptions 106 and/or the NL descriptions 122 (using, e.g., a matching algorithm), determines if any user inputs match any of the event descriptions 106 or NL descriptions 122 , and, if an input matches an event description 106 or an NL description 122 , generates an image suggestion, which suggests the relevant images/videos 102 , 120 in response to the user input based on the comparison of the description(s) 106 , 122 to the user input.
- the auto-suggest module 162 detects a textual description input as a wall post to a social media page or a text message, the auto-suggest module 162 looks for images/videos in the collection 150 or stored in other locations, which depict visual content relevant to the content of the wall post or text message. If the auto-suggest module 162 determines that an image/video 102 , 120 contains visual content that matches the content of the wall post or text message, the auto-suggest module 162 displays a thumbnail of the matching image/video as a suggested supplement or attachment to the wall post or text message.
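The matching idea just described can be sketched in a few lines of Python. The token-overlap scoring, threshold, and function names below are assumptions for illustration, not the system's actual matching algorithm:

```python
# Hypothetical sketch of the auto-suggest matching step: compare a user's
# text input (e.g., a wall post) against stored NL descriptions and suggest
# media whose description overlaps the input above a threshold.

def _tokens(text):
    return {t.strip(".,!?").lower() for t in text.split()}

def suggest_media(user_input, descriptions, threshold=0.3):
    """descriptions: dict mapping media id -> NL description string."""
    input_toks = _tokens(user_input)
    suggestions = []
    for media_id, desc in descriptions.items():
        desc_toks = _tokens(desc)
        if not desc_toks:
            continue
        # Jaccard-style overlap as a stand-in for the matching algorithm
        overlap = len(input_toks & desc_toks) / len(input_toks | desc_toks)
        if overlap >= threshold:
            suggestions.append((overlap, media_id))
    return [m for _, m in sorted(suggestions, reverse=True)]

descs = {
    "vid1": "birthday party with cake and candles",
    "vid2": "soccer game in the park",
}
print(suggest_media("Great birthday cake and candles today", descs, 0.2))  # ['vid1']
```

A production matcher would likely use learned embeddings rather than raw token overlap, but the control flow (monitor input, score against descriptions, surface matches) is the same.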
- the auto-suggest module 162 operates in conjunction with other modules of the computing system 100 to interactively suggest an event description 106 or NL description 122 to associate with an image/video 102 , 120 .
- the system 100 may suggest that the event description 106 and/or the NL description 122 associated with the image in the collection 150 be automatically propagated to the unlabeled image/video 102 , 120 .
- the illustrative computing system 100 also includes a user preference learning module 148 .
- the user preference learning module 148 is embodied as software, firmware, hardware, or a combination thereof.
- the user preference learning module 148 monitors implicit and/or explicit user interactions with the presentation 120 (user feedback 146 ) and executes, e.g., machine learning algorithms to learn user-specific specifications and/or preferences as to, for example, the types of activities that the user considers to be “salient” with respect to particular events, the user's specifications or preferences as to the ordering of salient events in various types of different presentations 120 , and/or other aspects of the creation of the presentation 120 and/or the NL description 122 .
- the user preference learning module 148 updates the templates 142 , 144 and/or portions of the knowledge base 132 (e.g., the salient event criteria 138 ) based on its analysis of the user feedback 146 .
- Referring now to FIG. 2 , the multimedia content understanding module 104 is shown in greater detail, in the context of an environment that may be created during the operation of the computing system 100 (e.g., a physical and/or virtual execution or “runtime” environment).
- the multimedia content understanding module 104 and each of the components shown in FIG. 2 are embodied as machine-readable instructions, modules, data structures and/or other components, in software, firmware, hardware, or a combination thereof.
- the illustrative multimedia content understanding module 104 includes a number of feature detection modules 202 , including a visual feature detection module 212 , an audio feature detection module 214 , a text feature detection module 216 , and a camera configuration feature detection module 218 .
- the feature detection modules 202 including the modules 212 , 214 , 216 , 218 , are embodied as software, firmware, hardware, or a combination thereof.
- the various feature detection modules 212 , 214 , 216 , 218 analyze different aspects of the multimedia input 102 using respective portions of the feature models 134 .
- the multimedia content understanding module 104 employs external devices, applications and services as needed in order to create, from the multimedia input 102 , one or more image/video segments 204 , an audio track 206 , and a text/speech transcript 208 .
- the image/video segment(s) 204 each include one or more digital images of the multimedia input 102 (e.g., a still image, a set of still images, a video, or a set of videos).
- the visual feature detection module 212 analyzes each segment 204 using the visual feature models 236 , and outputs a set of visual features 220 that have been detected in the segment 204 . To do this, the visual feature detection module 212 employs a number of automated feature recognition algorithms 130 to detect lower-level features of interest in the input 102 , and interfaces with the visual feature models 236 to recognize and semantically classify the detected features.
- low-level may refer to, among other things, visual features that capture characteristic shapes and motion without significant spatio-temporal variations between different instances of the features.
- Static visual features include features that are extracted from individual keyframes of a video at a defined extraction rate (e.g., 1 frame/second).
- static visual feature detectors include GIST, SIFT (Scale-Invariant Feature Transform), and colorSIFT.
- the GIST feature detector can be used to detect abstract scene and layout information, including perceptual dimensions such as naturalness, openness, roughness, etc.
- the SIFT feature detector can be used to detect the appearance of an image at particular interest points without regard to image scale, rotation, level of illumination, noise, and minor changes in viewpoint.
- the colorSIFT feature detector extends the SIFT feature detector to include color keypoints and color descriptors, such as intensity, shadow, and shading effects.
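As a rough illustration of static feature extraction at a defined keyframe rate, the toy sketch below computes a gradient-orientation histogram per sampled keyframe. It is a much simplified stand-in for real detectors such as GIST or SIFT (which also build descriptors from local gradient orientations); the array shapes, rates, and function names are illustrative assumptions:

```python
import numpy as np

# Toy static feature extractor: sample keyframes at a fixed rate (e.g.,
# 1 frame/second) and compute an 8-bin gradient-orientation histogram per
# keyframe -- a simplified stand-in for a real detector such as SIFT.

def orientation_histogram(frame, bins=8):
    gy, gx = np.gradient(frame.astype(float))
    angles = np.arctan2(gy, gx)           # orientations in [-pi, pi]
    hist, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi))
    return hist / max(hist.sum(), 1)      # normalize to a distribution

def extract_static_features(video, fps=10, rate_per_sec=1):
    """video: array of shape (num_frames, H, W); sample 1 keyframe per second."""
    step = fps // rate_per_sec
    return [orientation_histogram(video[i]) for i in range(0, len(video), step)]

video = np.random.default_rng(0).random((30, 16, 16))  # 3 "seconds" at 10 fps
feats = extract_static_features(video)
print(len(feats), feats[0].shape)  # 3 keyframes, one 8-bin histogram each
```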
- Dynamic visual features include features that are computed over x-y-t segments or windows of a video. Dynamic feature detectors can detect the appearance of actors, objects and scenes as well as their motion information. Some examples of dynamic feature detectors include MoSIFT, STIP (Spatio-Temporal Interest Point), DTF-HOG (Dense Trajectory based Histograms of Oriented Gradients), and DTF-MBH (Dense-Trajectory based Motion Boundary Histogram).
- the MoSIFT feature detector extends the SIFT feature detector to the time dimension and can collect both local appearance and local motion information, and identify interest points in the video that contain at least a minimal amount of movement.
- the STIP feature detector computes a spatio-temporal second-moment matrix at each video point using independent spatial and temporal scale values, a separable Gaussian smoothing function, and space-time gradients.
- the DTF-HoG feature detector tracks two-dimensional interest points over time rather than three-dimensional interest points in the x-y-t domain, by sampling and tracking feature points on a dense grid and extracting the dense trajectories.
- the HoGs are computed along the dense trajectories to eliminate the effects of camera motion (which may be particularly important in the context of unconstrained or “in the wild” videos).
- the DTF-MBH feature detector applies the MBH descriptors to the dense trajectories to capture object motion information.
- the MBH descriptors represent the gradient of optical flow rather than the optical flow itself.
- the MBH descriptors can suppress the effects of camera motion, as well.
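The MBH idea of histogramming the *gradient* of the optical flow, so that roughly uniform camera motion cancels out, can be sketched as follows. The synthetic flow field and names are assumptions; in practice the flow would come from a real optical-flow estimator:

```python
import numpy as np

# Sketch of the MBH principle: histogram the gradient of each optical-flow
# component rather than the flow itself. A constant (camera-like) motion
# vanishes under the gradient, leaving only relative object motion.

def mbh_descriptor(flow_x, flow_y, bins=8):
    desc = []
    for comp in (flow_x, flow_y):
        gy, gx = np.gradient(comp)
        angles = np.arctan2(gy, gx)
        hist, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi))
        desc.append(hist / max(hist.sum(), 1))
    return np.concatenate(desc)

# Constant (camera-like) motion plus a local moving blob
flow_x = np.full((16, 16), 2.0)
flow_x[4:8, 4:8] += 3.0          # object moving relative to the camera
flow_y = np.zeros((16, 16))
desc = mbh_descriptor(flow_x, flow_y)
print(desc.shape)  # (16,) = 8 bins per flow component
```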
- HoF (histograms of optical flow) descriptors may also be used.
- Additional details of the illustrative low-level feature detectors can be found in the priority application, U.S. Provisional patent application Ser. No. 61/637,196.
- the illustrative visual feature detection module 212 quantizes the extracted low-level features by feature type using the visual feature models 236 .
- the feature models 236 or portions thereof are machine-learned (e.g., from training data in the collection 150 ) using, e.g., k-means clustering techniques.
- the visual feature detection module 212 can aggregate the quantized low-level features by feature type, by using, for example, a Bag-of-Words (BoW) model in which a frequency histogram of visual words is computed over the entire length of a video.
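The BoW aggregation step can be sketched as follows, assuming a codebook of visual words has already been learned (e.g., by k-means over training descriptors). The toy descriptors and codebook values are illustrative:

```python
import numpy as np

# Sketch of Bag-of-Words aggregation: quantize each low-level descriptor to
# its nearest "visual word" and accumulate a normalized frequency histogram
# over the whole video.

def bow_histogram(descriptors, codebook):
    """descriptors: (N, D) array; codebook: (K, D) array of visual words."""
    # squared distance from every descriptor to every word
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)                 # nearest-word assignment
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
descriptors = np.array([[0.1, 0.1], [0.9, 1.0], [0.0, 0.9], [0.2, 0.0]])
print(bow_histogram(descriptors, codebook))
```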
- the visual feature detection module 212 identifies the visual features 220 to the event detection module 228 .
- Some embodiments of the computing system 100 can detect the presence of a variety of different types of multimedia features in the multimedia input 102 , including audio and text, in addition to the more typical visual features (e.g., actors, objects, scenes, actions).
- the illustrative audio feature detection module 214 analyzes the audio track of an input 102 using mathematical sound processing algorithms and uses the audio feature model 238 (e.g., an acoustic model) to detect and classify audio features 222 .
- the audio feature detection module 214 may detect an acoustic characteristic of the audio track of a certain segment of an input video 102 , and, with the audio feature model 238 , classify the acoustic characteristic as indicating a “cheering” sound or “applause.”
- Some examples of low level audio features that can be used to mathematically detect audio events in the input 102 include Mel frequency cepstral coefficients (MFCCs), spectral centroid (SC), spectral roll off (SRO), time domain zero crossing (TDZC), and spectral flux.
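Two of the named low-level audio features, time-domain zero crossings (TDZC) and spectral centroid (SC), can be computed directly with NumPy. This sketch uses a synthetic tone and is not the system's actual audio pipeline:

```python
import numpy as np

# Minimal implementations of two low-level audio features named above:
# time-domain zero-crossing rate and spectral centroid.

def zero_crossing_rate(signal):
    signs = np.sign(signal)
    return np.count_nonzero(np.diff(signs)) / len(signal)

def spectral_centroid(signal, sample_rate):
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return (freqs * spectrum).sum() / max(spectrum.sum(), 1e-12)

sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)         # one second of a 440 Hz sine
print(round(spectral_centroid(tone, sr)))  # 440
```

A classifier such as the audio feature model 238 would consume vectors of such features (MFCCs, SC, SRO, TDZC, spectral flux) rather than the raw waveform.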
- the audio feature model 238 is manually authored and/or developed using training data and machine learning techniques, in a similar fashion to the visual feature models 236 except that the audio features of the training data are analyzed rather than the visual features, in order to develop the audio feature model 238 .
- the audio feature detection module 214 identifies the detected audio features 222 to the event detection module 228 .
- the text feature detection module 216 interfaces with an automated speech recognition (ASR) system and/or a video optical character recognition (OCR) system.
- The ASR and/or OCR system may be part of the computing system 100 or in communication with the computing system 100 via a computer network (e.g., a network 646 ).
- An ASR system may identify spoken words present in the audio track of a video input 102 and provide a text translation of the spoken words (e.g., a transcript 208 ) to the text feature detection module 216 .
- An OCR system may recognize text that is present in a visual scene of an image or video, and provide the recognized text (e.g., a transcript 208 ) to the text feature detection module 216 .
- the OCR system may be used to detect words or phrases displayed on apparel, street signs, or buildings that are depicted in one or more scenes of the input images/video 102 .
- the text feature detection module 216 evaluates the transcript 208 using a text model 240 to extract portions of the transcript 208 that may be semantically meaningful. Portions of the text model 240 may be embodied as a language model or vocabulary, for example.
- the illustrative text model 240 is manually authored and/or developed using training data and machine learning techniques, in a similar fashion to the visual feature models 236 except that the text features of the training data are analyzed rather than the visual features, in order to develop the text model 240 .
- an ASR transcript 208 of a video input 102 may include the words, “he just scored” and/or an OCR transcript 208 of the video input 102 may include the phrase “GO NINERS.”
- the text feature detection module 216 identifies these textual features 224 to the event detection module 228 .
- the camera configuration feature detection module 218 detects “meta-level” features 226 in images and video input 102 .
- meta-level features 226 include camera motion, camera view angle, number of shots taken, shot composition, and shot duration (e.g., number of frames in a video input 102 ).
- the computing system 100 uses one or more of the meta-level features 226 to discern the intent of the person taking the picture or video: what were they trying to capture?
- the camera angle, the direction and speed of the motion of the camera relative to a detected event in a video input 102 can reveal people or objects of interest to the camera holder.
- the camera configuration feature detection module 218 may determine, based on information in the camera configuration model 242 , that if a set of meta-level features 226 indicates that the camera is tracking the movement of a particular person or object in the scene at a particular speed, the tracked person or object is likely to be of interest to the camera holder, and thus, salient event segments 112 should include the tracked person or object.
- the illustrative camera configuration feature model 242 is manually authored and/or developed using training data and machine learning techniques, in a similar fashion to the visual feature models 236 except that the meta-level features of the training data are analyzed rather than the visual features, in order to develop the camera configuration feature model 242 .
- the camera configuration feature module 218 identifies the meta-level features 226 to the event detection module 228 .
- the illustrative event detection module 228 uses data fusion techniques to combine the visual features 220 , the audio features 222 , the textual features 224 , and the meta-level features 226 , to the extent that each or any of these types of features are detected in the multimedia input 102 . In this way, the event detection module 228 can utilize a variety of visual and non-visual features to identify events and salient activities in the input 102 .
- the illustrative event detection module 228 applies a supervised learning model, such as Support Vector Machine (SVM) classifiers, to the visual features 220 (e.g., BoW features).
- the event detection module 228 uses data fusion strategies (e.g., early and late fusion) to identify events in the input 102 , based on the fused low-level features 220 , 222 , 224 , 226 .
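The difference between the two fusion strategies can be sketched as follows. The toy linear scoring functions stand in for trained classifiers (e.g., SVMs over BoW features) and are not the system's actual models:

```python
import numpy as np

# Early fusion: concatenate the modality features first, then score the fused
# vector with a single model. Late fusion: score each modality separately,
# then combine the per-modality decisions (here, by averaging).

def early_fusion(visual, audio, text, weights):
    fused = np.concatenate([visual, audio, text])
    return float(fused @ weights)

def late_fusion(scores):
    return float(np.mean(scores))

visual, audio, text = np.array([0.8, 0.1]), np.array([0.6]), np.array([0.9])
w = np.array([1.0, 0.5, 1.0, 1.0])          # single model over fused vector
print(early_fusion(visual, audio, text, w))  # 0.8 + 0.05 + 0.6 + 0.9 = 2.35
print(late_fusion([0.7, 0.4, 0.9]))          # mean of per-modality scores
```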
- the event detection module 228 performs concept detection based on the low-level features 220 (and/or the low-level features 222 , 224 , 226 ) and determines the events based on the detected concepts, using the concept model 136 .
- the event detection module 228 may use one or more concept classifiers to analyze the low-level features 220 , 222 , 224 , 226 and use the concept model 136 to classify the low-level features 220 , 222 , 224 , 226 as representative of certain higher-level concepts such as scenes, actions, actors, and objects.
- the illustrative concept model 136 includes data 244 (e.g., semantic elements) identifying low level features 220 , 222 , 224 , 226 , data 248 (e.g., semantic elements) identifying semantic concepts, and data 246 (e.g., semantic elements) identifying relationships between the various low level features 220 , 222 , 224 , 226 and the concepts 248 .
- the event detection module 228 may apply one or more event classifiers to the features 244 , relationships 246 , and/or concepts 248 to determine whether a combination of features 244 , relationships 246 , and/or concepts 248 evidences an event.
- the relationships 246 may include, for example, temporal relations between actions, objects, and/or audio events (e.g., “is followed by”), compositional relationships (e.g., Person X is doing Y with object Z), interaction relationships (e.g., person X is pushing an object Y or Person Y is using an object Z), state relations involving people or objects (e.g., “is performing,” “is saying”), co-occurrence relations (e.g., “is wearing,” “is carrying”), spatial relations (e.g., “is the same object as”), temporal relations between objects (e.g., “is the same object as”), and/or other types of attributed relationships (e.g., spatial, causal, procedural, etc.).
- the relationships 246 may specify a variety of different types of relationships between low level features 220 , 222 , 224 , 226 and concepts 248 , and/or between different types of concepts 248 . Maintaining the data relating to features 244 , relationships 246 , and concepts 248 allows the system 100 to detect higher level semantic concepts that tend to evidence events, including complex events.
- relationships 246 include not only relationships between different visual features, but also relationships between different types of multimedia features and concepts; for instance, relationships between audio features 222 and visual features 220 (e.g., a loud sound is followed by a bright light) or relationships between text features and sound features (e.g., a GO NINERS sign and cheering).
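One simple way such cross-modality relationships might be represented is as plain (subject, relation, object) triples. The sketch below is illustrative only; the entity names are taken loosely from the examples above, and a real knowledge base would likely use a richer ontology store:

```python
# Hypothetical triple store for the relationships 246, including
# cross-modality links (audio -> visual, text -> audio).

RELATIONSHIPS = [
    ("loud_sound",     "is_followed_by", "bright_light"),  # audio -> visual
    ("go_niners_sign", "co_occurs_with", "cheering"),      # text  -> audio
    ("person",         "is_wearing",     "jersey"),
    ("person",         "is_pushing",     "cart"),
]

def relations_for(subject):
    """Return all (relation, object) pairs attributed to a subject."""
    return [(rel, obj) for subj, rel, obj in RELATIONSHIPS if subj == subject]

print(relations_for("person"))
```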
- Each or any of the models 134 , 136 and/or the mapping 140 can maintain (e.g., probabilistic or statistical) indicators of the determined evidentiary significance of the relationships between features, concepts, events, and salient activities.
- indicators of evidentiary significance are determined using machine learning techniques.
- a machine learning analysis of training videos depicting a “person making a sandwich” may indicate that semantic descriptions such as “kitchen” (scene), “hands visible” (actor), “placing fillings on bread” (action) and “spreading creamy substance” (action) are highly likely to be associated with a person making a sandwich, while other semantic descriptions such as “outdoor event,” (scene), “vehicle moving” (action) or “person jumping” (action) are unlikely to be associated with that particular event.
- Such indicators can be used by the multimedia content understanding module 104 , or the event detection module 228 more specifically, to assess how strongly a given combination of detected features, relationships, and concepts evidences a particular event or salient activity.
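A hedged sketch of how such probabilistic indicators could be estimated from labeled training data, using simple co-occurrence counts; the training tuples below are invented for illustration and a real system would use far more data and a proper learning algorithm:

```python
from collections import Counter

# Estimate how likely each semantic description is, given an event label,
# from (event, description) pairs observed in training data.

training = [
    ("making a sandwich", "kitchen"), ("making a sandwich", "hands visible"),
    ("making a sandwich", "kitchen"), ("making a sandwich", "placing fillings"),
    ("outdoor event", "vehicle moving"),
]

def description_likelihoods(samples, event):
    counts = Counter(desc for ev, desc in samples if ev == event)
    total = sum(counts.values())
    return {desc: n / total for desc, n in counts.items()}

print(description_likelihoods(training, "making a sandwich")["kitchen"])  # 0.5
```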
- the event detection module 228 includes a description formulator 230 and an annotator 232 .
- the event detection module 228 and each of its submodules 230 , 232 are embodied as software, firmware, hardware, or a combination thereof.
- Once the event detection module 228 has determined (e.g., through semantic reasoning) an event associated with the input 102 , the description formulator 230 generates the event description 106 as described above, and the annotator 232 annotates or otherwise associates the event description 106 with the multimedia input 102 (e.g., by appending a meta tag to the input 102 or by other suitable techniques).
- the event detection module 228 identifies the event description 106 to the salient activity detector module 234 .
- the salient activity detector module 234 uses the salient event criteria 138 to evaluate the event description 106 and/or the detected features 220 , 222 , 224 , 226 of the multimedia input 102 to determine the salient activities associated with the event description 106 or with the multimedia input 102 more generally. To do this, the salient activity detector module 234 maps the event description 106 to salient activities using, e.g., the mapping 140 and/or knowledge contained in the salient event criteria 138 .
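The mapping from an event description to its salient activities can be pictured as a simple lookup. The sketch below is illustrative; the activity lists are modeled loosely on the birthday-party example later in this document, and the real mapping 140 is a richer, learned structure:

```python
# Hypothetical stand-in for the mapping 140: event description -> the
# salient activities expected for that event type.

EVENT_TO_SALIENT_ACTIVITIES = {
    "birthday party": ["blowing out candles", "singing", "opening presents",
                       "cutting the cake"],
    "football game":  ["touchdown", "cheering crowd", "field goal"],
}

def salient_activities(event_description):
    """Look up the salient activities for an event description."""
    return EVENT_TO_SALIENT_ACTIVITIES.get(event_description.lower(), [])

print(salient_activities("Birthday Party")[0])
```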
- the salient event criteria 138 can be derived from a number of different sources.
- salient event markers 250 can be determined by applying machine learning techniques to training samples (e.g., portions of the image/video collection 150 ) of meta-level features 226 .
- the computing system 100 can learn, over time, characteristic data values of meta-level features 226 or combinations of meta-level features that tend to be representative of salient events.
- the illustrative salient event criteria 138 also includes harvested salient event criteria 252 .
- the harvested salient event criteria 252 is derived by analyzing samples of training data (e.g., portions of the image/video collection 150 ) to determine, for example, the activities that appear most often in videos depicting certain types of events. Activities that frequently appear in such videos may be considered salient activities by the computing system 100 .
- the salient event criteria 138 also includes salient activity templates 254 .
- the templates 254 may include portions of the presentation templates 142 .
- a presentation template 142 may specify a list of activities that are considered to be “salient” for a particular type of video montage or other visual presentation 120 .
- the salient event criteria 138 may also include salient event criteria that is specified by or derived from user inputs 256 .
- the interactive storyboard module 124 may determine salient event criteria based on user inputs received by the editing module 126 .
- the user preference learning module 148 may derive new or updated salient event criteria based on its analysis of the user feedback 146 .
- the salient event criteria 138 identifies the salient activities associated with the event description 106 .
- the salient event criteria 138 also specifies the information that the computing system 100 needs to detect those salient activities in the input 102 using the feature detection algorithms 130 , e.g., salient event detection criteria.
- the salient event detection criteria may include data values and/or algorithm parameters that indicate a particular combination of features 220 , 222 , 224 , 226 that is associated with a salient activity.
- Each or any of the salient event criteria 138 may have associated therewith one or more saliency indicators 238 .
- the saliency indicators 238 can be used by the computing system 100 to select or prioritize the salient event segments 112 .
- the saliency indicators 238 may be embodied as attributes of the markers 250 , the harvested criteria 252 , the templates 254 and/or the user inputs 256 .
- each salient event criterion may have an associated saliency indicator 238 .
- a salient event criterion may have multiple saliency indicators 238 , as the criterion may have different degrees of saliency in relation to different events.
- the salient activity detector module 234 identifies the salient activity detection criteria 236 (e.g., the instructions or data for algorithmically detecting the salient activities in the input 102 ) and the saliency indicator(s) 238 to the salient event segment identifier module 240 .
- the salient event identifier module 240 uses the saliency indicator(s) 238 and/or the salient activity detection criteria 236 to select the appropriate feature detection algorithms 130 to execute on the input 102 , in order to algorithmically identify the salient event segments 112 , executes the selected algorithms 130 , and provides data indicating the identified salient event segments 112 and the event description 106 to the output generator module 114 .
- the process 300 may be embodied as computerized programs, routines, logic and/or instructions executed by the computing system 100 , for example by one or more of the modules and other components shown in FIGS. 1 and 2 , described above.
- the system 100 receives one or more input files, e.g. a multimedia input 102 .
- the input file(s) can be embodied as, for example, raw video footage or digital pictures captured by a smartphone or other personal electronics device.
- the input file(s) may be stored on a local computing device and/or a remote computing device (e.g., in a personal cloud, such as through a document storage application like DROPBOX).
- the input file(s) may be received by a file uploading, file transfer, or messaging capability of the end user's computing device and/or the computing system 100 (e.g., the communication subsystems 644 , 672 ).
- the computing system 100 may perform a preliminary step of filtering the input file(s) based on one or more of the saliency indicators.
- the computing system 100 may evaluate the meta-level features 226 of the input as a preliminary step and filter out any files or video frames that fall outside the scope of the saliency indicators 238 for the meta-level features.
- This pre-processing step may, for example, help eliminate low quality footage prior to execution of other feature detection algorithms 130 .
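A minimal sketch of such pre-filtering, using gradient energy as a cheap sharpness proxy for frame quality; the threshold and quality measure are assumptions, not the system's actual saliency criteria:

```python
import numpy as np

# Pre-filtering sketch: drop frames whose sharpness (gradient energy, a
# crude blur proxy) falls below a threshold, before running the heavier
# feature detection algorithms.

def sharpness(frame):
    gy, gx = np.gradient(frame.astype(float))
    return float((gx ** 2 + gy ** 2).mean())

def prefilter_frames(frames, min_sharpness=0.01):
    return [f for f in frames if sharpness(f) >= min_sharpness]

rng = np.random.default_rng(1)
sharp_frame = rng.random((8, 8))       # high-frequency content
blurry_frame = np.full((8, 8), 0.5)    # flat, no detail
kept = prefilter_frames([sharp_frame, blurry_frame])
print(len(kept))  # only the sharp frame survives
```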
- the computing system 100 executes the feature detection algorithms 130 on the input remaining after the pre-processing (if any) of block 312 .
- the feature detection algorithms 130 detect visual features 220 , audio features 222 , and/or textual features 224 as described above.
- the algorithms may also detect meta level features 226 , if not already performed at block 312 .
- the system 100 evaluates the detected features 220 , 222 , 224 , and/or 226 and based on that evaluation, determines an event that is evidenced by the detected features.
- the system 100 maps the visual, audio, and/or textual features 220 , 222 , 224 to semantic concepts (using e.g., the concept model 136 and/or the mapping 140 ) at block 318 .
- the system 100 merges the features 220 , 222 , 224 and semantic concepts at block 320 (using, e.g., the mapping 140 ) to formulate a semantic description of the event that the system 100 has determined is evidenced by the detected features.
- the semantic description may be embodied as the event description 106 as described above.
- the system 100 associates the semantic description of the event with the input file(s) (using, e.g., meta tags).
- the system 100 automatically classifies the input file(s) as depicting an event based on the output of the feature detection algorithms 130 .
- the system 100 classifies the input without executing the feature detection algorithms 130 .
- the system 100 may receive the event type (e.g., an event description 106 ) from an end user, e.g., as a text string, search query, or meta tag.
- the system 100 may extract meta tags previously associated with the input and use those meta tags to classify the input.
- feature detection algorithms may be used to determine the salient event segments as described herein, but not to determine the initial event classification.
- the computing system 100 determines the salient activities that are associated with the event determined at block 316 . To do this, the system 100 uses the salient event criteria 138 and/or the mapping 140 to evaluate the event information generated at block 316 . For example, the system 100 determines, using the mapping 140 , activities that are associated with the detected event. The system 100 may also use the saliency indicators 238 at block 326 to prioritize the salient activities so that, for example, if a template 142 specifies a limitation on the length or duration of the visual presentation 120 , segments of the input that depict the higher priority salient activities can be included in the presentation 120 and segments that depict lower priority activities may be excluded from the presentation 120 .
- the system 100 determines the salient activity detection criteria for each of the salient activities identified by the salient event criteria.
- the salient activity detection criteria are used by the system 100 to algorithmically identify the salient event segments 112 of the input file(s).
- the system 100 may utilize the salient activity detection criteria as input or parameters of one or more of the feature detection algorithms 130 .
- the system 100 identifies the salient event segments 112 in the multimedia input file(s). To do this, the system 100 executes one or more of the feature detection algorithms 130 using the salient activity detection criteria determined at block 328 . The system 100 also uses the saliency indicators 238 , if any, to filter or prioritize the salient event segments 112 (block 334 ). At block 336 , the computing system 100 generates the visual presentation 120 (e.g., a video clip or “montage”), and/or the NL description 122 , using, e.g., the templates 142 , 144 as described above (block 338 ).
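The filtering and prioritization just described (rank candidate segments by their saliency indicators, then respect a template-imposed duration limit) can be sketched as a greedy selection. The tuple layout, labels, and numbers are illustrative:

```python
# Greedy sketch: keep the highest-saliency segments until the template's
# maximum presentation duration is reached.

def select_segments(segments, max_duration):
    """segments: list of (saliency, duration_sec, label) tuples."""
    chosen, total = [], 0.0
    for sal, dur, label in sorted(segments, reverse=True):
        if total + dur <= max_duration:
            chosen.append(label)
            total += dur
    return chosen

candidates = [
    (0.9, 10.0, "blowing out candles"),
    (0.7,  8.0, "singing"),
    (0.6, 12.0, "opening presents"),
    (0.3, 20.0, "guests arriving"),
]
print(select_segments(candidates, max_duration=20.0))
```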
- the system 100 presents the visual presentation 120 and/or the NL description 122 to the end user using an interactive storyboard type of user interface mechanism.
- the interactive storyboard mechanism allows the user to review, edit, approve, and share the visual presentation 120 and/or the NL description 122 .
- the computing system 100 incorporates feedback learned or observed through the user's interactions with the interactive storyboard into one or more of the system models, templates, and/or knowledge bases.
- the templates 142 , 144 , the salient event criteria 138 (including saliency indicators 238 ), and/or other portions of the knowledge base 132 may be updated in response to the user's interactions (e.g., editing, viewing, sharing) with the visual presentation 120 and/or the NL description 122 via the interactive storyboard mechanism.
- although this disclosure refers to a “storyboard” type mechanism, other types and formats of intuitive human-computer interfaces or other mechanisms for the review and editing of multimedia files may be used as well.
- the illustrative mapping 140 and portions thereof may be embodied as one or more data structures, such as a searchable database, table, or knowledge base, in software, firmware, hardware, or a combination thereof.
- the mapping 140 establishes relationships between and/or among semantic elements of the various stored models described above (e.g., the feature models 134 and the concept model 136 ).
- the illustrative mapping 140 defines logical links or connections 420 , 422 , 424 , 426 , 428 between the various types of detected features 410 , concepts 412 , relations 414 , events 416 , and salient activities 418 .
- the system 100 can use the mapping 140 , and particularly the links 420 , 422 , 424 , 426 , 428 , in performing the semantic reasoning to determine events and salient activities based on the features 410 , concepts 412 , and relations 414 .
- the mapping 140 may be embodied as, for example, an ontology that defines the various relationships between the semantic elements shown in FIG. 4 .
- the mapping 140 may be initially developed through a manual authoring process and/or by executing machine learning algorithms on sets of training data.
- the mapping 140 may be updated in response to use of the system 100 over time using, e.g., one or more machine learning techniques.
- the mapping 140 may be stored in computer memory, e.g., as part of the stored models, knowledge base, and/or templates 626 , 666 .
- Video 510 is an input to the system 100 .
- the system 100 analyzes the video using feature detection algorithms and semantic reasoning as described above.
- Time-dependent output of the system 100 's semantic analysis of the detected features is shown by the graphics 512 , 514 , 516 , 518 .
- the portion 520 represents a salient event segment 112 of the video 510 .
- the system 100 has determined, using the techniques described above, that the salient event segment 520 depicts the salient activity of blowing out candles.
- the system 100 has identified, using the techniques described above, salient event segments 522 , 524 , each of which depicts a person singing, and a salient event segment 526 , which depicts a person opening a box.
- the system 100 did not detect any segments depicting the activity of cutting a cake in the video 510.
- the system 100 can extract the segments 520, 522, 524, 526 from the video 510 and incorporate them into a video clip that includes only the most interesting or salient portions of the video 510.
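The extraction-and-splicing step above can be sketched minimally as slicing and concatenation over frame indices. This is an illustrative assumption about one possible implementation; the function names and the (start, end) frame-range representation are hypothetical:

```python
def extract_segments(frames, segments):
    """Pull each salient event segment (start, end frame indices, inclusive)
    out of the full frame sequence."""
    return [frames[start:end + 1] for start, end in segments]

def splice_highlight(frames, segments):
    """Concatenate the salient segments, in chronological order,
    into a single highlight clip."""
    clip = []
    for part in extract_segments(frames, sorted(segments)):
        clip.extend(part)
    return clip
```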
- the service takes one or more input video files, which can be, e.g., raw footage captured by a smartphone, and identifies a most-relevant event type for the uploaded video (e.g., birthday party, wedding, football game).
- the service identifies the event using feature recognition algorithms, such as described above and in the aforementioned priority patent applications of SRI International.
- the step of event identification can be done manually, such as by the user selecting an event type from a menu or by typing in keywords.
- the event type corresponds to a stored template identifying the key activities typically associated with that event type.
- the service automatically identifies segments (sequences of frames) within the uploaded file that depict the various key activities or moments associated with the relevant event type.
- the service automatically creates a highlight clip by splicing together the automatically-identified segments. The user can review the clip and instruct the service to save the clip, download it, and/or post/share it on a desired social network or other site.
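The template-driven flow described in the preceding passages can be illustrated with a short sketch. It assumes, hypothetically, that the recognition stage has already labeled candidate segments with activity names; the template table and function name are invented for illustration:

```python
# Hypothetical event-type templates listing the key activities for each event.
EVENT_TEMPLATES = {
    "birthday_party": ["singing", "blowing out candles", "opening presents"],
}

def create_highlight(video_segments, event_type):
    """video_segments: list of (start, end, activity_label) tuples assumed to
    come from the feature-recognition stage. Returns the chronologically
    ordered frame ranges to splice into the highlight clip."""
    key_activities = EVENT_TEMPLATES.get(event_type, [])
    chosen = [(s, e) for s, e, label in video_segments
              if label in key_activities]
    return sorted(chosen)
```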
- a video creation service provides interactive storyboard editing capabilities.
- the process begins with one or more user-designated input video files.
- an event type is identified, and video segments depicting key activities/moments are algorithmically identified.
- the system may identify more salient event segments than it actually proposes to use in a highlight clip, e.g., due to limits on total clip length, uncertainty about which segments are best, redundant capture of the same event/activities by multiple video sources, or other factors.
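One plausible way to narrow a surplus of candidate segments, assuming a simple total-length budget and scalar saliency scores, is a greedy selection by saliency. This is a sketch of one such heuristic, not the disclosed method:

```python
def cap_clip_length(segments, max_frames):
    """segments: (start, end, saliency) candidates, possibly more than fit.
    Greedily keep the most salient segments until the frame budget is spent,
    then return the kept ranges in chronological order."""
    chosen, total = [], 0
    for start, end, saliency in sorted(segments, key=lambda s: s[2],
                                       reverse=True):
        length = end - start + 1
        if total + length <= max_frames:
            chosen.append((start, end))
            total += length
    return sorted(chosen)
```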
- the service displays to the user an interactive storyboard, with thumbnails or other icons representing each of the segments along a timeline, and a visual indication of the segments that the system tentatively proposes to use in the highlight clip.
- the system 100 can select salient event segments from the different videos in the group of videos taken by the family members and merge them into a highlight clip.
- the user can modify the content of the highlight clip interactively by selecting different segments to use, with the interactive storyboard interface.
- the user can also change the beginning and/or ending frames of a segment by selecting the segment for editing, previewing it along with neighboring frames from the original footage, and using interactive controls to mark desired start and end frames for the system.
- the service constructs a highlight clip by splicing segments together in accordance with the user's edits. As above, the user can preview the clip and decide to save or share the clip, for example.
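The interactive storyboard edits described above (toggling segments in or out of the clip and trimming a segment's start and end frames before the final splice) can be modeled minimally as follows; the class and function names are hypothetical:

```python
class StoryboardSegment:
    """One thumbnail on the storyboard: a frame range plus an
    include/exclude flag the user can toggle."""

    def __init__(self, start, end, selected=True):
        self.start, self.end, self.selected = start, end, selected

    def trim(self, new_start, new_end):
        """Apply the user's marked start and end frames."""
        self.start, self.end = new_start, new_end

def final_clip_ranges(segments):
    """Frame ranges the service will splice, honoring the user's edits."""
    return [(s.start, s.end) for s in segments if s.selected]
```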
- the system 100 can handle multiple input files at one time, even if the input files are from multiple different users/devices.
- the automated creation of the highlight clip can be performed by the system 100 in real time, on live video (e.g., immediately after a user has finished filming, the system 100 can initiate the highlight creation processes).
- users can add transition effects between segments, where the transition effects may be automatically selected by the service and/or chosen by the user.
- the event description 106 or the NL description 122 generated automatically by the system 100 can take the form of, e.g., meta data that can be used for indexing, search and retrieval, and/or for advertising (e.g., as ad words).
- the meta data can include keywords that are derived from the algorithmically performed complex activity recognition and other semantic video analysis (e.g. face, location, object recognition; text OCR; voice recognition), performed by the system 100 using the feature detection algorithms 130 as described above.
- the meta data can be used by query expansion and/or query augmentation mechanisms to facilitate keyword searching, clustering, or browsing of a collection 150 .
- the meta data can be used for automatic image/video suggestion in response to user input or in relation to other electronic content (e.g., text or images posted on an Internet web site) (e.g., by the auto-suggest module 162 ).
- the system 100 can use the meta data to automatically, based on the text input (which may be only partially complete), generate a list of relevant images and/or videos, which the user may want to attach to the message or share along with the post.
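A minimal sketch of such keyword-driven auto-suggestion follows, assuming the meta data has already been reduced to a keyword set per media item; the index shape and all names are hypothetical:

```python
def suggest_media(partial_text, indexed_media):
    """indexed_media: {media_id: keyword_set}, where keywords are assumed to
    be derived from the semantic video analysis. Returns media whose keywords
    overlap the user's (possibly partial) text, ranked by overlap count."""
    words = set(partial_text.lower().split())
    scored = []
    for media_id, keywords in indexed_media.items():
        overlap = len(words & {k.lower() for k in keywords})
        if overlap:
            scored.append((overlap, media_id))
    return [media_id for _, media_id in sorted(scored, reverse=True)]
```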
- the content processing (e.g., the complex event recognition) may be performed on a server computer (e.g., by a proprietary video creation service).
- captured video files are uploaded by the customer to the server, e.g., using a client application running on a personal electronic device or through an interactive website.
- Interactive aspects, such as storyboard selection and editing of clip segments, may be carried out via online interaction between the customer's capture device (e.g., camera, smartphone, etc.) or other customer local device (e.g., tablet, laptop) and the video service server computer.
- the server can assemble clip segments as desired, and redefine beginning and end frames of segments, with respect to the uploaded video content.
- Results can be streamed to the customer's device for interactive viewing.
- the video creation service can be delivered as an executable application running on the customer's device.
- Referring now to FIG. 6, a simplified block diagram of an embodiment 600 of the multimedia content understanding and assistance computing system 100 is shown. While the illustrative embodiment 600 is shown as involving multiple components and devices, it should be understood that the computing system 100 may constitute a single computing device, alone or in combination with other devices.
- the embodiment 600 includes a user computing device 610 , which embodies features and functionality of a “client-side” or “front end” portion 618 of the computing system 100 depicted in FIG. 1 , and a server computing device 650 , which embodies features and functionality of a “server-side” or “back end” portion 658 of the system 100 .
- the embodiment 600 includes a display device 680 and a camera 682 , each of which may be used alternatively or in addition to the camera 630 and display device 642 of the user computing device 610 .
- Each or any of the computing devices 610 , 650 , 680 , 682 may be in communication with one another via one or more networks 646 .
- the computing system 100 or portions thereof may be distributed across multiple computing devices that are connected to the network(s) 646 as shown. In other embodiments, however, the computing system 100 may be located entirely on, for example, the computing device 610 or one of the devices 650 , 680 , 682 . In some embodiments, portions of the system 100 may be incorporated into other systems or computer applications. Such applications or systems may include, for example, commercial off the shelf (COTS) or custom-developed virtual personal assistant applications, video montage creation applications, content sharing services such as YOUTUBE and INSTAGRAM, and social media services such as FACEBOOK and TWITTER.
- application or “computer application” may refer to, among other things, any type of computer program or group of computer programs, whether implemented in software, hardware, or a combination thereof, and includes self-contained, vertical, and/or shrink-wrapped software applications, distributed and cloud-based applications, and/or others. Portions of a computer application may be embodied as firmware, as one or more components of an operating system, a runtime library, an application programming interface (API), as a self-contained software application, or as a component of another software application, for example.
- the illustrative user computing device 610 includes at least one processor 612 (e.g. a microprocessor, microcontroller, digital signal processor, etc.), memory 614 , and an input/output (I/O) subsystem 616 .
- the computing device 610 may be embodied as any type of computing device capable of performing the functions described herein, such as a personal computer (e.g., desktop, laptop, tablet, smart phone, body-mounted device, wearable device, etc.), a server, an enterprise computer system, a network of computers, a combination of computers and other electronic devices, or other electronic devices.
- the I/O subsystem 616 typically includes, among other things, an I/O controller, a memory controller, and one or more I/O ports.
- the processor 612 and the I/O subsystem 616 are communicatively coupled to the memory 614 .
- the memory 614 may be embodied as any type of suitable computer memory device (e.g., volatile memory such as various forms of random access memory).
- the I/O subsystem 616 is communicatively coupled to a number of hardware and/or software components, including the components of the computing system shown in FIGS. 1 and 2 or portions thereof (e.g., the multimedia content assistant front end modules 618 ), the camera 630 , and the display device 642 .
- a “camera” may refer to any device that is capable of acquiring and recording two-dimensional (2D) or three-dimensional (3D) video images of portions of the real-world environment, and may include cameras with one or more fixed camera parameters and/or cameras having one or more variable parameters, fixed-location cameras (such as “stand-off” cameras that are installed in walls or ceilings), and/or mobile cameras (such as cameras that are integrated with consumer electronic devices, such as laptop computers, smart phones, tablet computers, wearable electronic devices, and/or others).
- the camera 630 , a microphone 632 , speaker(s) 640 , and the display device 642 may form part of a human-computer interface subsystem 638 , which includes one or more user input devices (e.g., a touchscreen, keyboard, virtual keypad, microphone, etc.) and one or more output devices (e.g., speakers, displays, LEDs, etc.).
- the human-computer interface device(s) 638 may include, for example, a touchscreen display, a touch-sensitive keypad, a kinetic sensor and/or other gesture-detecting device, an eye-tracking sensor, and/or other devices that are capable of detecting human interactions with a computing device.
- the devices 630 , 640 , 642 , 680 , 682 are illustrated in FIG. 6 as being in communication with the user computing device 610 , either by the I/O subsystem 616 or a network 646 . It should be understood that any or all of the devices 630 , 640 , 642 , 680 , 682 may be integrated with the computing device 610 or embodied as a separate component.
- the camera 630 and/or microphone 632 may be embodied in a wearable device, such as a head-mounted display, GOOGLE GLASS-type device or BLUETOOTH earpiece, which then communicates wirelessly with the computing device 610 .
- the devices 630 , 640 , 642 , 680 , 682 may be embodied in a single computing device, such as a smartphone or tablet computing device.
- the I/O subsystem 616 is also communicatively coupled to one or more storage media 620 , an ASR subsystem 634 , an OCR subsystem 636 , and a communication subsystem 644 . It should be understood that each of the foregoing components and/or systems may be integrated with the computing device 610 or may be a separate component or system that is in communication with the I/O subsystem 616 (e.g., over a network 646 or a bus connection).
- the illustrative ASR subsystem 634 and OCR subsystem 636 are COTS systems that are configured to interface with the computing system 100.
- the storage media 620 may include one or more hard drives or other suitable data storage devices (e.g., flash memory, memory cards, memory sticks, and/or others).
- portions of the computing system 100 (e.g., the front end modules 618 and/or the multimedia inputs 102, the visual presentation 120, the NL description 122, the algorithms 130, the knowledge base 132, the templates 142, 144, and/or other data) reside at least temporarily in the storage media 620.
- Portions of the computing system 100 may be copied to the memory 614 during operation of the computing device 610 , for faster processing or other reasons.
- the communication subsystem 644 communicatively couples the user computing device 610 to one or more other devices, systems, or communication networks, e.g., a local area network, wide area network, personal cloud, enterprise cloud, public cloud, and/or the Internet. Accordingly, the communication subsystem 644 may include one or more wired or wireless network interfaces implemented in software, firmware, or hardware, as may be needed pursuant to the specifications and/or design of the particular embodiment of the system 100.
- the display device 680 , the camera 682 , and the server computing device 650 each may be embodied as any suitable type of computing device or personal electronic device capable of performing the functions described herein, such as any of the aforementioned types of devices or other electronic devices.
- the server computing device 650 may operate a “back end” portion 658 of the multimedia content assistant computing system 100 .
- the server computing device 650 may include one or more server computers including storage media 660 , which may be used to store portions of the computing system 100 , such as the back end modules 658 and/or portions of the multimedia inputs 102 , the visual presentation 120 , the NL description 122 , the algorithms 130 , the knowledge base 132 , the templates 142 , 144 , and/or other data.
- the illustrative server computing device 650 also includes an HCI subsystem 670 , and a communication subsystem 672 . In general, components of the server computing device 650 having similar names to components of the computing device 610 described above may be embodied similarly.
- each of the devices 680 , 682 may include components similar to those described above in connection with the user computing device 610 and/or the server computing device 650 .
- the computing system 100 may include other components, sub-components, and devices not illustrated in FIG. 6 for clarity of the description.
- the components of the computing system 100 are communicatively coupled as shown in FIG. 6 by signal paths, which may be embodied as any type of wired or wireless signal paths capable of facilitating communication between the respective devices and components.
- An embodiment of the technologies disclosed herein may include any one or more, and any combination of, the examples described below.
- a video assistant for understanding content of a video is embodied in one or more non-transitory machine accessible storage media of a computing system and includes instructions executable by one or more processors to cause the computing system to: detect a plurality of different features in a plurality of different segments of a video by executing a plurality of different feature detection algorithms on the video, each of the video segments comprising one or more frames of the video; determine an event evidenced by the detected features; determine a plurality of salient activities associated with the event; and algorithmically identify a plurality of different salient event segments of the video, each of the salient event segments depicting a salient activity associated with the event.
- An example 2 includes the subject matter of example 1, and includes instructions executable to determine salient event criteria associated with the event, and determine the salient activities associated with the event based on the salient event criteria.
- An example 3 includes the subject matter of example 2, and includes instructions executable to determine the salient event criteria by one or more of: algorithmically analyzing a video collection and algorithmically learning a user specification relating to the salient event criteria.
- An example 4 includes the subject matter of any of examples 1-3, and includes instructions executable to determine a saliency indicator associated with each of the salient event segments of the video, and select a subset of the plurality of salient event segments for inclusion in a visual presentation based on the saliency indicator.
- An example 5 includes the subject matter of any of examples 1-4, and includes instructions executable to extract the salient event segments from the video and incorporate the extracted salient event segments into a video clip.
- An example 6 includes the subject matter of example 5, and includes instructions executable to, in response to user input, share the video clip with another computing device over a network.
- An example 7 includes the subject matter of example 5, and includes instructions executable to one or more of: (i) by a human-computer interface device, interactively edit the video clip and (ii) automatically edit the video clip.
- An example 8 includes the subject matter of example 7, and includes instructions executable to store data relating to the interactive editing of the video clip, execute one or more machine learning algorithms on the stored data, and, in response to the execution of the one or more machine learning algorithms on the stored data, update one or more of: the determination of salient activities associated with the event and the identification of salient event segments.
- An example 9 includes the subject matter of any of examples 1-8, and includes instructions executable to, by a human-computer interface device, output a natural language description of the one or more salient event segments.
- a computing system for understanding content of a video includes: one or more computing devices; and a plurality of processor-executable modules embodied in one or more non-transitory machine accessible storage media of the one or more computing devices, the processor-executable modules comprising: a visual content understanding module to cause the computing system to: detect a plurality of different features in a plurality of different segments of a video by executing one or more event recognition algorithms; determine a semantic description of an event evidenced by one or more of the detected features; and identify one or more salient event segments of the video, each salient event segment depicting a salient activity associated with the event; an output generator module to cause the computing system to output a video clip comprising the salient event segments; and an interactive storyboard module to cause the computing system to one or more of: interactively edit the video clip and share the video clip over a network.
- An example 11 includes the subject matter of example 10, wherein the visual content understanding module is to cause the computing system to determine a configuration of a camera used to record the video; derive, from the camera configuration, a user intent with respect to the video; and identify the salient event segments by selecting one or more segments of the video that relate to the user intent.
- An example 12 includes the subject matter of example 11, wherein in response to the user intent, the output generator module is to cause the computing system to select a template for creating the video clip.
- An example 13 includes the subject matter of example 10, wherein the visual content understanding module is to cause the computing system to determine the semantic description based on a plurality of different algorithmically-detected features comprising two or more of: a visual feature, an audio feature, a textual feature, and a meta-level feature indicative of a camera configuration.
- An example 14 includes the subject matter of any of examples 10-13, wherein the visual content understanding module is to cause the computing system to determine relationships between the detected features, map the detected features and relationships to semantic concepts, and formulate the semantic description to comprise the semantic concepts.
- An example 15 includes the subject matter of any of examples 10-14, wherein the visual content understanding module is to cause the computing system to, in an automated fashion, associate the semantic description with the video.
- An example 16 includes the subject matter of any of examples 10-15, wherein the visual content understanding module is to cause the computing system to determine a salient event criterion and, based on the salient event criterion, identify the salient activity associated with the event.
- An example 17 includes the subject matter of example 16, wherein the computing system is to learn the salient event criterion by analyzing a professionally-made video.
- An example 18 includes the subject matter of example 16, wherein the computing system is to analyze one or more of: semantic content of a video collection and user input, and determine the salient event criterion based on the analysis of the one or more of the semantic content and the user input.
- An example 19 includes the subject matter of any of examples 10-18, wherein the computing system is to determine a saliency indicator comprising data associated with one or more of the detected features, and use the saliency indicator to identify the salient event segments.
- a computing system for understanding visual content in digital images includes: one or more computing devices; and instructions embodied in one or more non-transitory machine accessible storage media of the one or more computing devices, the instructions executable by the one or more computing devices to cause the computing system to: detect a plurality of different features in a set of digital images by executing a plurality of different feature detection algorithms on the set of images; map the one or more features detected by the feature detection algorithms to an event, the event evidenced by the one or more detected features; determine a plurality of salient activities associated with the event; extract one or more salient event segments from the set of images, each of the salient event segments depicting a salient activity associated with the event; and incorporate the extracted one or more salient event segments into a visual presentation.
- An example 21 includes the subject matter of example 20, wherein the instructions cause the computing system to select at least two of: a visual feature detection algorithm, an audio feature detection algorithm, and a textual feature detection algorithm, execute the selected feature detection algorithms to detect at least two of: a visual feature, an audio feature, and a textual feature of the set of images, and determine the event evidenced by at least two of: the visual feature, the audio feature, and the textual feature.
- An example 22 includes the subject matter of example 20 or example 21, wherein the instructions cause the computing system to, in an automated fashion, generate a semantic description of the event based on the one or more features detected by the feature detection algorithms.
- An example 23 includes the subject matter of any of examples 20-22, wherein the instructions cause the computing system to determine a saliency indicator associated with each of the salient event segments, and arrange the salient event segments in the visual presentation according to the saliency indicators associated with the salient event segments.
- An example 24 includes the subject matter of example 23, wherein the instructions cause the computing system to one or more of: (i) by a human-computer interface device of the computing system, interactively rearrange the salient event segments in the visual presentation and (ii) automatically rearrange the salient event segments in the visual presentation.
- An example 25 includes the subject matter of example 23, wherein the instructions cause the computing system to select a subset of the salient event segments based on the saliency indicators associated with the salient event segments, and create a visual presentation comprising the salient event segments in the selected subset of salient event segments.
- An example 26 includes the subject matter of any of examples 20-25, wherein the instructions cause the computing system to, in an automated fashion, associate a description of the event with the images in the set of digital images.
- An example 27 includes the subject matter of example 26, wherein the instructions cause the computing system to detect user input comprising a textual description, compare the textual description to the description of the event associated with the images in the set of digital images, and, in an automated fashion, suggest one or more images having a relevancy to the text description as determined by the comparison of the textual description of the user input to the description of the event.
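Examples 4, 23, and 25 above describe ranking segments by a saliency indicator and selecting a subset for the visual presentation. A minimal sketch of that selection, assuming the saliency indicator reduces to a scalar score (the representation and names are hypothetical):

```python
def select_by_saliency(segments, max_count):
    """segments: list of (segment_id, saliency_score) pairs. Keep the
    max_count most salient segments and present them in descending
    saliency order."""
    ranked = sorted(segments, key=lambda s: s[1], reverse=True)
    return ranked[:max_count]
```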
- references in the specification to “an embodiment,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is believed to be within the knowledge of one skilled in the art to affect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly indicated.
- Embodiments in accordance with the disclosure may be implemented in hardware, firmware, software, or any combination thereof. Embodiments may also be implemented as instructions stored using one or more machine-readable media, which may be read and executed by one or more processors.
- a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device or a “virtual machine” running on one or more computing devices).
- a machine-readable medium may include any suitable form of volatile or non-volatile memory.
- Modules, data structures, blocks, and the like are referred to as such for ease of discussion, and are not intended to imply that any specific implementation details are required.
- any of the described modules and/or data structures may be combined or divided into sub-modules, sub-processes or other units of computer code or data as may be required by a particular design or implementation.
- specific arrangements or orderings of schematic elements may be shown for ease of description. However, the specific ordering or arrangement of such elements is not meant to imply that a particular order or sequence of processing, or separation of processes, is required in all embodiments.
- schematic elements used to represent instruction blocks or modules may be implemented using any suitable form of machine-readable instruction, and each such instruction may be implemented using any suitable programming language, library, application-programming interface (API), and/or other software development tools or frameworks.
- schematic elements used to represent data or information may be implemented using any suitable electronic arrangement or data structure.
- connections, relationships or associations between elements may be simplified or not shown in the drawings so as not to obscure the disclosure.
Description
- This application claims priority to and the benefit of U.S. Utility patent application Ser. No. 13/737,607, filed Jan. 9, 2013, which claims priority to and the benefit of U.S. Provisional Patent Application Ser. No. 61/637,196, filed Apr. 23, 2012, each of which is incorporated herein by this reference in its entirety.
- This invention was made in part with government support under NBC contract no. D11PC20066 awarded by the Department of the Interior. The United States Government has certain rights in this invention.
- With the integration of digital video recording technology into more and more consumer-oriented electronic devices, visual content (e.g., digital photos and videos) is frequently captured, viewed, and shared by mobile device applications, instant messaging and electronic mail, social media services, and other electronic communication methods.
- In computer vision, mathematical techniques are used to detect the presence of and recognize various elements of the visual scenes that are depicted in digital images. Localized portions of an image, known as features, may be used to analyze and classify the image. Low-level features, such as interest points and edges, may be computed from an image and used to detect, for example, people, objects, and landmarks that are depicted in the image. Machine learning algorithms are often used for image recognition.
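As a concrete illustration of such low-level features, a crude edge detector can be built from intensity gradients alone. The sketch below uses NumPy; the function name and threshold are arbitrary choices for illustration, not part of the disclosure:

```python
import numpy as np

def edge_map(image, threshold=0.5):
    """A very simple low-level feature: mark pixels whose intensity-gradient
    magnitude exceeds a threshold (a crude edge detector)."""
    gy, gx = np.gradient(image.astype(float))  # row and column derivatives
    magnitude = np.hypot(gx, gy)               # gradient magnitude per pixel
    return magnitude > threshold
```

Real systems would instead use robust interest-point and edge detectors, but the principle of computing localized features from pixel intensities is the same.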
- This disclosure is illustrated by way of example and not by way of limitation in the accompanying figures. The figures may, alone or in combination, illustrate one or more embodiments of the disclosure. Elements illustrated in the figures are not necessarily drawn to scale. Reference labels may be repeated among the figures to indicate corresponding or analogous elements.
- FIG. 1 is a simplified schematic diagram of an environment of at least one embodiment of a multimedia content understanding and assistance computing system including a multimedia content understanding module as disclosed herein;
- FIG. 2 is a simplified schematic diagram of an environment of at least one embodiment of the multimedia content understanding module of FIG. 1;
- FIG. 3 is a simplified flow diagram of at least one embodiment of a process executable by the computing system of FIG. 1 to provide multimedia content understanding and assistance as disclosed herein;
- FIG. 4 is a simplified schematic illustration of at least one embodiment of feature, concept, event, and salient activity modeling as disclosed herein;
- FIG. 5 is a simplified example of at least one embodiment of automated salient event detection in a video as disclosed herein; and
- FIG. 6 is a simplified block diagram of an exemplary computing environment in connection with which at least one embodiment of the system of FIG. 1 may be implemented.
- While the concepts of the present disclosure are susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and are described in detail below. It should be understood that there is no intent to limit the concepts of the present disclosure to the particular forms disclosed. On the contrary, the intent is to cover all modifications, equivalents, and alternatives consistent with the present disclosure and the appended claims.
- The use of visual content, e.g., digital images and video, as a communication modality is becoming increasingly common. Mobile cameras are used to capture not just special holidays and important events, but also to record other, relatively mundane yet entertaining or otherwise memorable scenes and activities that are more difficult to classify. For instance, a picture or video of a child's impromptu imitation of a family member, a trick performed by a pet, or a humorous interaction involving a group of friends may be very meaningful to the camera holder and/or worthy of sharing with others, but difficult to describe or “tag” with words in a way that would facilitate retrieval of the image at a later time. Even with visual content that is more easily categorized, such as footage of weddings and birthdays, it can be very difficult and time-consuming for the user to manually identify and locate the most interesting or important scenes, particularly when the collection of images or video is very large. Further, where the collection contains multiple images of the same scene, it can be challenging for users to discern the one or two specific images that represent the “best” depictions of the scene.
- Referring now to
FIG. 1, an embodiment of a multimedia content understanding and assistance computing system 100 is shown in the context of an environment that may be created during the operation of the system 100 (e.g., a physical and/or virtual execution or “runtime” environment). The illustrative multimedia content understanding and assistance computing system 100 is embodied as a number of machine-readable instructions, modules, data structures and/or other components, which may be implemented as computer hardware, firmware, software, or a combination thereof. For ease of discussion, the multimedia content understanding and assistance computing system 100 may be referred to herein as a “visual content assistant,” a “video assistant,” an “image assistant,” a “multimedia assistant,” or by similar terminology. - The
computing system 100 executes computer vision algorithms, including machine learning algorithms, semantic reasoning techniques, and/or other technologies to, among other things, identify, understand, and describe, in an automated fashion, events that are depicted in multimedia input 102. As described in more detail below, the illustrative computing system 100 can, among other things, help users quickly and easily locate “salient activities” in lengthy and/or large volumes of video footage, so that the most important or meaningful segments can be extracted, retained, and shared. The computing system 100 can compile the salient event segments into a visual presentation 120 (e.g., a “highlight reel” video clip) that can be stored and/or shared over a computer network. The system 100 can, alternatively or in addition, generate a natural language (NL) description 122 of the multimedia input 102, which describes the content of the input 102 in a manner that can be used for, among other things, searching, retrieval, and establishing links between the input 102 and other electronic content (such as text or voice input, advertisements, other videos, documents, or other multimedia content). - The events that can be automatically detected and described by the
computing system 100 include “complex events.” As used herein, “complex event” may refer to, among other things, an event that is made up of multiple “constituent” people, objects, scenes and/or activities. For example, a birthday party is a complex event that can include the activities of singing, blowing out candles, opening presents, and eating cake. Similarly, a child acting out an improvisation is a complex event that may include people smiling, laughing, dancing, drawing a picture, and applause. A group activity relating to a political issue, sports event, or music performance is a complex event that may involve a group of people walking or standing together, a person holding a sign, written words on the sign, a person wearing a t-shirt with a slogan printed on the shirt, and human voices shouting. Other examples of complex events include human interactions with other people (e.g., conversations, meetings, presentations, etc.) and human interactions with objects (e.g., cooking, repairing a machine, conducting an experiment, building a house, etc.). The activities that make up a complex event are not limited to visual features. Rather, “activities” as used herein may refer to, among other things, visual, audio, and/or text features, which may be detected by the computing system 100 in an automated fashion using a number of different algorithms and feature detection techniques, as described in more detail below. Stated another way, an activity as used herein may refer to any semantic element of the multimedia input 102 that, as determined by the computing system 100, evidences an event. - As used herein, “multimedia input” may refer to, among other things, a collection of digital images, a video, a collection of videos, or a collection of images and videos (where a “collection” includes two or more images and/or videos).
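The compositional structure described above — a complex event evidenced by its constituent activities — can be sketched as a simple data structure. The following is a minimal, hypothetical illustration; the event names, activity sets, and overlap-based scoring rule are assumptions for the sake of example, not the system's actual knowledge-base representation or reasoning method:

```python
# Hypothetical mapping from complex events to constituent activities.
COMPLEX_EVENTS = {
    "birthday party": {"singing", "blowing out candles",
                       "opening presents", "eating cake"},
    "music performance": {"singing", "dancing", "applause"},
}

def candidate_events(observed_activities):
    # Score each event by the fraction of its constituent
    # activities observed in the input, and return the best match.
    scores = {event: len(acts & observed_activities) / len(acts)
              for event, acts in COMPLEX_EVENTS.items()}
    best = max(scores, key=scores.get)
    return best, scores

best, scores = candidate_events({"singing", "eating cake"})
print(best)  # birthday party
```

Here an event is suggested when a large fraction of its constituent activities is observed; the system described in this disclosure instead draws on a knowledge base and semantic reasoning, of which this set-overlap rule is only a crude stand-in.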
References herein to a “video” may refer to, among other things, a relatively short video clip, an entire full-length video production, or different segments within a video or video clip (where a segment includes a sequence of two or more frames of the video). Any video of the
input 102 may include or have associated therewith an audio soundtrack and/or a speech transcript, where the speech transcript may be generated by, for example, an automated speech recognition (ASR) module of the computing system 100. Any video or image of the input 102 may include or have associated therewith a text transcript, where the text transcript may be generated by, for example, an optical character recognition (OCR) module of the computing system 100. References herein to an “image” may refer to, among other things, a still image (e.g., a digital photograph) or a frame of a video (e.g., a “key frame”). - A multimedia content understanding
module 104 of the computing system 100 is embodied as software, firmware, hardware, or a combination thereof. The multimedia content understanding module 104 applies a number of different feature detection algorithms 130 to the multimedia input 102, using a multimedia content knowledge base 132, and generates an event description 106 based on the output of the algorithms 130. The multimedia knowledge base 132 is embodied as software, firmware, hardware, or a combination thereof (e.g., as a database, table, or other suitable data structure or computer programming construct). The illustrative multimedia content understanding module 104 executes different feature detection algorithms 130 on different parts or segments of the multimedia input 102 to detect different features, or the multimedia content understanding module 104 executes all or a subset of the feature detection algorithms 130 on all portions of the multimedia input 102. Some examples of feature detection algorithms and techniques, including low-level, mid-level, and complex event detection and recognition techniques, are described in the priority application, Cheng et al., U.S. Utility patent application Ser. No. 13/737,607 (“Classification, Search, and Retrieval of Complex Video Events”); and also in Chakraborty et al., U.S. Utility patent application Ser. No. 14/021,696, filed Sep. 9, 2013 (“Recognizing Entity Interactions in Visual Media”); Chakraborty et al., U.S. Utility patent application Ser. No. 13/967,521, filed Aug. 15, 2013 (“3D Visual Proxemics: Recognizing Human Interactions in 3D from a Single Image”); Han et al., U.S. Pat. No. 8,634,638 (“Real-Time Action Detection and Classification”); and Eledath et al., U.S. Pat. No. 8,339,456 (“Apparatus for Intelligent and Autonomous Video Content and Streaming”), all of SRI International and each of which is incorporated herein by this reference. Additionally, technologies for visual feature detection and indexing are disclosed in Sawhney, Harpreet S. et al., U.S.
Utility patent application Ser. No. ______ [Ref. No. SRI-US-7096-2] (“Multi-Dimensional Realization of Visual Content of an Image Collection”). - The
event description 106 semantically describes an event depicted by the multimedia input 102, as determined by the multimedia content understanding module 104. In the illustrative embodiments, the event description 106 is determined algorithmically by the computing system 100 analyzing the multimedia input 102. In other embodiments, the event description 106 may be user-supplied or determined by the system 100 based on metadata or other descriptive information associated with the input 102. The illustrative event description 106 generated by the understanding module 104 indicates an event type or category, such as “birthday party,” “wedding,” “soccer game,” “hiking trip,” or “family activity.” The event description 106 may be embodied as, for example, a natural language word or phrase that is encoded in a tag or label, which the computing system 100 associates with the multimedia input 102 (e.g., as an extensible markup language or XML tag). Alternatively or in addition, the event description 106 may be embodied as structured data, e.g., a data type or data structure including semantics, such as “Party(retirement),” “Party(birthday),” “Sports_Event(soccer),” “Performance(singing),” or “Performance(dancing).” - To generate the
event description 106, the illustrative multimedia content understanding module 104 accesses one or more feature models 134 and/or concept models 136. The feature models 134 and the concept models 136 are embodied as software, firmware, hardware, or a combination thereof, e.g., a knowledge base, database, table, or other suitable data structure or computer programming construct. The models 134, 136 correlate the output of the algorithms 130 with semantic descriptions. For example, the feature models 134 may define relationships between sets of low-level features detected by the algorithms 130 and semantic descriptions of those sets of features (e.g., “object,” “person,” “face,” “ball,” “vehicle,” etc.). Similarly, the concept model 136 may define relationships between sets of features detected by the algorithms 130 and higher-level “concepts,” such as people, objects, actions and poses (e.g., “sitting,” “running,” “throwing,” etc.). The semantic descriptions of features and concepts that are maintained by the models 134, 136 can be linked to events and activities. For example, as shown in FIG. 4, a mapping 140 of the knowledge base 132 indicates relationships between various combinations of features, concepts, events, and activities. As described below, the event description 106 can be determined using semantic reasoning in connection with the knowledge base 132 and/or the mapping 140. To establish “relationships” and “associations” as described herein, the computing system 100 may utilize, for example, a knowledge representation language or ontology. - The
computing system 100 uses the event description 106 and the knowledge base 132 to determine one or more “salient” activities that are associated with the occurrence of the detected event. To do this, the computing system 100 may access salient event criteria 138 and/or the mapping 140 of the knowledge base 132. The illustrative salient event criteria 138 indicate one or more criteria for determining whether an activity is a salient activity in relation to one or more events. For instance, the salient event criteria 138 identify salient activities and the corresponding feature detection information that the computing system 100 needs in order to algorithmically identify those salient activities in the input 102 (where the feature detection information may include, for example, parameters of computer vision algorithms 130). In some embodiments, the salient event criteria 138 include saliency indicators 238 (FIG. 2), which indicate, for particular salient activities, a variable degree of saliency associated with the activity as it relates to a particular event. A salient event criterion 138 may be embodied as, for example, one or more pre-defined, selected, or computed data values. A saliency indicator 238 may be embodied as, for example, a pre-defined, selected, or computed data value, such as a priority, a weight, or a rank that can be used to arrange or prioritize the salient event segments 112. - The
mapping 140 of the knowledge base 132 links activities with events, so that, once the event description 106 is determined, the understanding module 104 can determine the activities that are associated with the event description 106 and look for those activities in the input 102. The mapping 140 may establish one-to-one, one-to-many, or many-to-many logical relationships between the various events and activities in the knowledge base 132. For example, the activity of “singing” may be associated with “party” events and “performance” events, while the activity of “blowing out candles” may be associated only with the event of “birthday party.” In general, the mapping 140 and the various other portions of the knowledge base 132 can be configured and defined according to the requirements of a particular design of the computing system 100 (e.g., according to domain-specific requirements). - Once the salient activities are determined, the
computing system 100 executes one or more algorithms 130 to identify particular portions or segments of the multimedia input 102 that depict those salient activities. As an example, if the computing system 100 determines that the multimedia input 102 depicts a birthday party (the event), the illustrative multimedia content understanding module 104 accesses the multimedia content knowledge base 132 to determine the constituent activities that are associated with a birthday party (e.g., blowing out candles, etc.), and selects one or more of the feature detection algorithms 130 to execute on the multimedia input 102 to look for scenes in the input 102 that depict those constituent activities. The understanding module 104 executes the selected algorithms 130 to identify salient event segments 112 of the input 102, such that the identified salient event segments 112 each depict one (or more) of the constituent activities that are associated with the birthday party. - Once the
computing system 100 has identified the portions or segments of the multimedia input 102 that depict the salient activities, an output generator module 114 of the computing system 100 can do a number of different things with the salient activity information. The output generator module 114 and its submodules, a visual presentation generator module 116 and a natural language generator module 118, are each embodied as software, firmware, hardware, or a combination thereof. - The visual presentation generator module 116 of the
output generator module 114 automatically extracts (e.g., removes or makes a copy of) the salient event segments 112 from the input 102 and incorporates the extracted segments 112 into a visual presentation 120, such as a video clip (e.g., a “highlight reel”) or multimedia presentation, using a presentation template 142. In doing so, the visual presentation generator module 116 may select the particular presentation template 142 to use to create the presentation 120 based on a characteristic of the multimedia input 102, the event description 106, user input, domain-specific criteria, and/or other presentation template selection criteria. - The natural
language generator module 118 of the output generator module 114 automatically generates a natural language description 122 of the event 106, including natural language descriptions of the salient event segments 112 and suitable transition phrases, using a natural language template 144. In doing so, the natural language presentation generator module 118 may select the particular natural language template 144 to use to create the NL description 122 based on a characteristic of the multimedia input 102, the event description 106, user input, domain-specific criteria, and/or other NL template selection criteria. An example of a natural language description 122 for a highlight reel of a child's birthday party, which may be output by the NL generator module 118, may include: “Child's birthday party, including children playing games followed by singing, blowing out candles, and eating cake.” Some examples of methods for generating the natural language description 122 (e.g., “recounting”) are described in the aforementioned priority patent application Ser. No. 13/737,607. - The
NL generator module 118 may formulate the NL description 122 as natural language speech using, e.g., stored NL speech samples (which may be stored in, for example, data storage 620). The NL speech samples may include prepared NL descriptions of complex events and activities. Alternatively or in addition, the NL description 122 may be constructed “on the fly,” using, e.g., a natural language generator and text-to-speech (TTS) subsystem, which may be implemented as part of the computing system 100 or as external modules or systems with which the computing system 100 is in communication over a computer network (e.g., a network 646). - The
presentation templates 142 provide the specifications that the output generator module 114 uses to select salient event segments 112 for inclusion in the visual presentation 120, arrange the salient event segments 112, and create the visual presentation 120. For example, a presentation template 142 specifies, for a particular event type, the type of content to include in the visual presentation 120, the number of salient event segments 112, the order in which to arrange the segments 112 (e.g., chronologically or by subject matter), the pace and transitions between the segments 112, the accompanying audio or text, and/or other aspects of the visual presentation 120. The presentation template 142 may further specify a maximum duration of the visual presentation 120, which may correspond to a maximum duration permitted by a video sharing service or a social media service (in consideration of the limitations of the computer network infrastructure or for other reasons). Portions of the templates 142 may be embodied as an ontology or knowledge base that incorporates or accesses previously developed knowledge, such as knowledge obtained from the analysis of many inputs 102 over time, and information drawn from other data sources that are publicly available (e.g., on the Internet). The output generator module 114, or more specifically the visual presentation generator module 116, formulates the visual presentation 120 according to the system- or user-selected template 142 (e.g., by inserting the salient event segments 112 extracted from the multimedia input 102 into appropriate “slots” in the template 142). - The
illustrative computing system 100 includes a number of semantic content learning modules 152, including feature learning modules 154, a concept learning module 156, a salient event learning module 158, and a template learning module 160. The learning modules 152 execute machine learning algorithms on samples of multimedia content (images and/or video) of an image/video collection 150 and create and/or update portions of the knowledge base 132 and/or the presentation templates 142. For example, the learning modules 152 may be used to initially populate and/or periodically update portions of the knowledge base 132 and/or the templates 142, 144, based on analysis of the collection 150. The feature learning modules 154 may, over time or as a result of analyzing portions of the collection 150, algorithmically learn patterns of computer vision algorithm output that evidence a particular feature, and update the feature models 134 accordingly. Similarly, the concept learning module 156 may, over time or as a result of analyzing portions of the collection 150, algorithmically learn combinations of low level features that evidence particular concepts, and update the concept model 136 accordingly. - The illustrative salient
event learning module 158 analyzes portions of the image/video collection 150 to determine salient event criteria 138, to identify events for inclusion in the mapping 140, to identify activities that are associated with events, and to determine the saliency of various activities with respect to different events. For example, the salient event learning module 158 may identify a new event or activity for inclusion in the mapping 140, or identify new salient event criteria 138, based on the frequency of occurrence of certain features and/or concepts in the collection 150. The salient event learning module 158 can also identify multi-modal salient event markers, including “non-visual” characteristics of input videos such as object motion, changes in motion patterns, changes in camera position or camera motion, amount or direction of camera motion, camera angle, and audio features (e.g., cheering sounds or speech). - The illustrative
template learning module 160 analyzes the collection 150 to create and/or update the presentation templates 142. For example, the template learning module 160 may, after reviewing a set of professionally made videos in the collection 150, incorporate organizational elements of those professionally made videos into one or more of the templates 142 or create a new template 142 that includes specifications gleaned from the professionally made videos. Portions of the template learning module 160 may algorithmically analyze the audio tracks of videos in the collection 150 and update the NL templates 144, as well. - The
video collection 150 refers generally to one or more bodies of retrievable multimedia digital content that may be stored in computer memory at the computing system 100 and/or other computing systems or devices. The video collection 150 may include images and/or videos stored remotely at Internet sites such as YOUTUBE and INSTAGRAM, and/or images/videos that are stored in one or more local collections, such as storage media of a personal computer or mobile device (e.g., a “camera roll” of a mobile device camera application). In any case, images/videos in the collection 150 need not have been previously tagged with metadata or other identifying material in order to be useful to the computing system 100. The computing system 100 can operate on images/videos of the collection 150 and/or multimedia input 102 whether or not the content has been previously tagged or annotated in any way. To the extent that any of the content in the collection 150 is already tagged with descriptions, any of the learning modules 152 can learn and apply those existing descriptions to the knowledge base 132 and/or the templates 142, 144. - The
output generator module 114 interfaces with an interactive storyboard module 124 to allow the end user to modify the (machine-generated) visual presentation 120 and/or the (machine-generated) NL description 122, as desired. The illustrative interactive storyboard module 124 includes an editing module 126, a sharing module 128, and an auto-suggest module 162. The interactive storyboard module 124 and its submodules 126, 128, 162 are embodied as software, firmware, hardware, or a combination thereof. The editing module 126 displays the elements of the visual presentation 120 on a display device (e.g., a display device 642, FIG. 6) and interactively modifies the visual presentation 120 in response to human-computer interaction (HCI) received by a human-computer interface device (e.g., a microphone 632, the display device 642, or another part of an HCI subsystem 638). The interactive storyboard module 124 presents the salient event segments 112 using a storyboard format that enables the user to intuitively review, rearrange, add, and delete segments of the presentation 120 (e.g., by tapping on a touchscreen of the HCI subsystem 638). When the user's interaction with the presentation 120 is complete, the interactive storyboard module 124 stores the updated version of the presentation 120 in computer memory (e.g., a data storage 620). - The
sharing module 128 is responsive to user interaction with the computing system 100 that indicates that the user would like to “share” the newly created or updated presentation 120 with other people, e.g., over a computer network, e-mail, a messaging service, or other electronic communication mechanism. The illustrative sharing module 128 can be pre-configured or user-configured to automatically enable sharing in response to the completion of a presentation 120, or to share the presentation 120 only in response to affirmative user approval of the presentation 120. In either case, the sharing module 128 is configured to automatically share (e.g., upload to an Internet-based photo or video sharing site or service) the presentation 120 in response to a single user interaction (e.g., “one click” sharing). To do this, the sharing module 128 interfaces with network interface technology of the computing system 100 (e.g., a communication subsystem 644). - The auto-suggest
module 162 leverages the information produced by other modules of the computing system 100, including the event description 106, the NL description 122, and/or the visual presentation 120, to provide an intelligent automatic image/video suggestion service. In some embodiments, the auto-suggest module 162 associates, or interactively suggests, the visual presentation 120 or the multimedia input 102 to be associated, with other electronic content based on the event description 106 or the NL description 122 that the computing system 100 has automatically assigned to the multimedia input 102 or the visual presentation 120. To do this, the auto-suggest module 162 includes a persistent input monitoring mechanism that monitors user inputs received by the editing module 126 or other user interface modules of the computing system 100, including inputs received by other applications or services running on the computing system 100. The auto-suggest module 162 evaluates the user inputs over time, compares the user inputs to the event descriptions 106 and/or the NL descriptions 122 (using, e.g., a matching algorithm), determines if any user inputs match any of the event descriptions 106 or NL descriptions 122, and, if an input matches an event description 106 or an NL description 122, generates an image suggestion, which suggests the relevant images/videos to the user. For example, if the auto-suggest module 162 detects a textual description input as a wall post to a social media page or a text message, the auto-suggest module 162 looks for images/videos in the collection 150 or stored in other locations, which depict visual content relevant to the content of the wall post or text message. If the auto-suggest module 162 determines that an image/video matches the input, the auto-suggest module 162 displays a thumbnail of the matching image/video as a suggested supplement or attachment to the wall post or text message. - In some embodiments, the auto-suggest
module 162 operates in conjunction with other modules of the computing system 100 to interactively suggest an event description 106 or NL description 122 to associate with an image/video. For example, if the system 100 determines that an unlabeled image/video is similar to an image in the collection 150, the system 100 may suggest that the event description 106 and/or the NL description 122 associated with the image in the collection 150 be automatically propagated to the unlabeled image/video. - The
illustrative computing system 100 also includes a user preference learning module 148. The user preference learning module 148 is embodied as software, firmware, hardware, or a combination thereof. The user preference learning module 148 monitors implicit and/or explicit user interactions with the presentation 120 (user feedback 146) and executes, e.g., machine learning algorithms to learn user-specific specifications and/or preferences as to, for example, the types of activities that the user considers to be “salient” with respect to particular events, the user's specifications or preferences as to the ordering of salient events in various types of different presentations 120, and/or other aspects of the creation of the presentation 120 and/or the NL description 122. The user preference learning module 148 updates the templates 142, 144 and/or other portions of the knowledge base 132 to reflect the learned preferences. - Referring now to
FIG. 2, an embodiment of the multimedia content understanding module 104 is shown in greater detail, in the context of an environment that may be created during the operation of the computing system 100 (e.g., a physical and/or virtual execution or “runtime” environment). The multimedia content understanding module 104 and each of the components shown in FIG. 2 are embodied as machine-readable instructions, modules, data structures and/or other components, in software, firmware, hardware, or a combination thereof. - The illustrative multimedia
content understanding module 104 includes a number of feature detection modules 202, including a visual feature detection module 212, an audio feature detection module 214, a text feature detection module 216, and a camera configuration feature detection module 218. The feature detection modules 202, including the modules 212, 214, 216, 218, are embodied as software, firmware, hardware, or a combination thereof. The feature detection modules 212, 214, 216, 218 analyze the multimedia input 102 using respective portions of the feature models 134. To enable this, the multimedia content understanding module 104 employs external devices, applications, and services as needed in order to create, from the multimedia input 102, one or more image/video segments 204, an audio track 206, and a text/speech transcript 208. - The image/video segment(s) 204 each include one or more digital images of the multimedia input 102 (e.g., a still image, a set of still images, a video, or a set of videos). The visual
feature detection module 212 analyzes each segment 204 using the visual feature models 236, and outputs a set of visual features 220 that have been detected in the segment 204. To do this, the visual feature detection module 212 employs a number of automated feature recognition algorithms 130 to detect lower-level features of interest in the input 102, and interfaces with the visual feature models 236 to recognize and semantically classify the detected features. As used herein, “low-level” may refer to, among other things, visual features that capture characteristic shapes and motion without significant spatio-temporal variations between different instances of the features. With regard to video input 102, both static and dynamic low-level visual features can be detected. Static visual features include features that are extracted from individual keyframes of a video at a defined extraction rate (e.g., 1 frame/second). Some examples of static visual feature detectors include GIST, SIFT (Scale-Invariant Feature Transform), and colorSIFT. The GIST feature detector can be used to detect abstract scene and layout information, including perceptual dimensions such as naturalness, openness, roughness, etc. The SIFT feature detector can be used to detect the appearance of an image at particular interest points without regard to image scale, rotation, level of illumination, noise, and minor changes in viewpoint. The colorSIFT feature detector extends the SIFT feature detector to include color keypoints and color descriptors, such as intensity, shadow, and shading effects. - Dynamic visual features include features that are computed over x-y-t segments or windows of a video. Dynamic feature detectors can detect the appearance of actors, objects, and scenes as well as their motion information.
Some examples of dynamic feature detectors include MoSIFT, STIP (Spatio-Temporal Interest Point), DTF-HOG (Dense Trajectory based Histograms of Oriented Gradients), and DTF-MBH (Dense-Trajectory based Motion Boundary Histogram). The MoSIFT feature detector extends the SIFT feature detector to the time dimension and can collect both local appearance and local motion information, and identify interest points in the video that contain at least a minimal amount of movement. The STIP feature detector computes a spatio-temporal second-moment matrix at each video point using independent spatial and temporal scale values, a separable Gaussian smoothing function, and space-time gradients. The DTF-HoG feature detector tracks two-dimensional interest points over time rather than three-dimensional interest points in the x-y-t domain, by sampling and tracking feature points on a dense grid and extracting the dense trajectories. The HoGs are computed along the dense trajectories to eliminate the effects of camera motion (which may be particularly important in the context of unconstrained or “in the wild” videos). The DTF-MBH feature detector applies the MBH descriptors to the dense trajectories to capture object motion information. The MBH descriptors represent the gradient of optical flow rather than the optical flow itself. Thus, the MBH descriptors can suppress the effects of camera motion, as well. However, HoF (histograms of optical flow) may be used, alternatively or in addition, in some embodiments. Additional details of the illustrative low-level feature detectors can be found in the priority application, U.S. Provisional patent application Ser. No. 61/637,196.
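As a toy illustration of the histogram-of-optical-flow (HoF) idea mentioned above, the sketch below bins flow vectors by orientation, weighting each vector by its magnitude. It assumes the dense flow field has already been estimated; the function name and the whole-field normalization are illustrative only (real HoF/MBH descriptors are computed per spatio-temporal cell along trajectories):

```python
import numpy as np

def hof_descriptor(flow_x, flow_y, bins=8):
    # Quantize flow orientations into a fixed number of angular bins,
    # weighting each vector by its magnitude (a HoF-style descriptor).
    mag = np.hypot(flow_x, flow_y)
    ang = np.arctan2(flow_y, flow_x) % (2 * np.pi)
    hist, _ = np.histogram(ang, bins=bins, range=(0, 2 * np.pi), weights=mag)
    total = hist.sum()
    return hist / total if total > 0 else hist

# Synthetic flow: everything moving to the right, so all the
# histogram mass falls in the first orientation bin.
fx, fy = np.ones((4, 4)), np.zeros((4, 4))
d = hof_descriptor(fx, fy)
print(d[0])  # 1.0
```

Because the descriptor is built from flow *directions*, uniform camera translation concentrates mass in one bin; gradient-of-flow variants such as MBH discard exactly that uniform component, which is why they suppress camera motion.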
- The illustrative visual
feature detection module 212 quantizes the extracted low-level features by feature type using the visual feature models 236. In some embodiments, the feature models 236 or portions thereof are machine-learned (e.g., from training data in the collection 150) using, e.g., k-means clustering techniques. The visual feature detection module 212 can aggregate the quantized low-level features by feature type, by using, for example, a Bag-of-Words (BoW) model in which a frequency histogram of visual words is computed over the entire length of a video. The visual feature detection module 212 identifies the visual features 220 to the event detection module 228. - Some embodiments of the
computing system 100 can detect the presence of a variety of different types of multimedia features in the multimedia input 102, including audio and text, in addition to the more typical visual features (e.g., actors, objects, scenes, actions). The illustrative audio feature detection module 214 analyzes the audio track of an input 102 using mathematical sound processing algorithms and uses the audio feature model 238 (e.g., an acoustic model) to detect and classify audio features 222. For example, the audio feature detection module 214 may detect an acoustic characteristic of the audio track of a certain segment of an input video 102, and, with the audio feature model 238, classify the acoustic characteristic as indicating a "cheering" sound or "applause." Some examples of low-level audio features that can be used to mathematically detect audio events in the input 102 include Mel frequency cepstral coefficients (MFCCs), spectral centroid (SC), spectral roll-off (SRO), time domain zero crossing (TDZC), and spectral flux. The audio feature model 238 is manually authored and/or developed using training data and machine learning techniques, in a similar fashion to the visual feature models 236, except that the audio features of the training data are analyzed rather than the visual features, in order to develop the audio feature model 238. The audio feature detection module 214 identifies the detected audio features 222 to the event detection module 228. - The text
feature detection module 216 interfaces with an automated speech recognition (ASR) system and/or a video optical character recognition (OCR) system. The ASR and/or OCR system may be a part of the computing system 100 or in communication with the computing system 100 via a computer network (e.g., a network 646). An ASR system may identify spoken words present in the audio track of a video input 102 and provide a text translation of the spoken words (e.g., a transcript 208) to the text feature detection module 216. An OCR system may recognize text that is present in a visual scene of an image or video, and provide the recognized text (e.g., a transcript 208) to the text feature detection module 216. For example, the OCR system may be used to detect words or phrases displayed on apparel, street signs, or buildings that are depicted in one or more scenes of the input images/video 102. The text feature detection module 216 evaluates the transcript 208 using a text model 240 to extract portions of the transcript 208 that may be semantically meaningful. Portions of the text model 240 may be embodied as a language model or vocabulary, for example. The illustrative text model 240 is manually authored and/or developed using training data and machine learning techniques, in a similar fashion to the visual feature models 236, except that the text features of the training data are analyzed rather than the visual features, in order to develop the text model 240. As an example of the use of the text feature detection module 216, an ASR transcript 208 of a video input 102 may include the words "he just scored," and/or an OCR transcript 208 of the video input 102 may include the phrase "GO NINERS." The text feature detection module 216 identifies these textual features 224 to the event detection module 228. - The camera configuration
feature detection module 218 detects "meta-level" features 226 in image and video inputs 102. Some examples of meta-level features 226 include camera motion, camera view angle, number of shots taken, shot composition, and shot duration (e.g., number of frames in a video input 102). The computing system 100 uses one or more of the meta-level features 226 to discern the intent of the person taking the picture or video: what were they trying to capture? The camera angle, and the direction and speed of the motion of the camera relative to a detected event in a video input 102 (e.g., tracking), can reveal people or objects of interest to the camera holder. For example, the camera configuration feature detection module 218 may determine, based on information in the camera configuration model 242, that if a set of meta-level features 226 indicates that the camera is tracking the movement of a particular person or object in the scene at a particular speed, the tracked person or object is likely to be of interest to the camera holder, and thus, salient event segments 112 should include the tracked person or object. The illustrative camera configuration feature model 242 is manually authored and/or developed using training data and machine learning techniques, in a similar fashion to the visual feature models 236, except that the meta-level features of the training data are analyzed rather than the visual features, in order to develop the camera configuration feature model 242. The camera configuration feature module 218 identifies the meta-level features 226 to the event detection module 228. - Referring now to the
event detection module 228, the illustrative event detection module 228 uses data fusion techniques to combine the visual features 220, the audio features 222, the textual features 224, and the meta-level features 226, to the extent that each or any of these types of features are detected in the multimedia input 102. In this way, the event detection module 228 can utilize a variety of visual and non-visual features to identify events and salient activities in the input 102. - The illustrative
event detection module 228 applies a supervised learning model, such as Support Vector Machine (SVM) classifiers, to the visual features 220 (e.g., BoW features). The event detection module 228 uses data fusion strategies (e.g., early and late fusion) to identify events in the input 102, based on the fused low-level features 220, 222, 224, 226. In other embodiments, the event detection module 228 performs concept detection based on the low-level features 220 (and/or the low-level features 222, 224, 226) and determines the events based on the detected concepts, using the concept model 136. For example, the event detection module 228 may use one or more concept classifiers to analyze the low-level features 220, 222, 224, 226 and use the concept model 136 to classify the low-level features 220, 222, 224, 226 as representative of certain higher-level concepts such as scenes, actions, actors, and objects. As such, the illustrative concept model 136 includes data 244 (e.g., semantic elements) identifying low-level features 220, 222, 224, 226, data 248 (e.g., semantic elements) identifying semantic concepts, and data 246 (e.g., semantic elements) identifying relationships between the various low-level features 220, 222, 224, 226 and the concepts 248. The event detection module 228 may apply one or more event classifiers to the features 244, relationships 246, and/or concepts 248 to determine whether a combination of features 244, relationships 246, and/or concepts 248 evidences an event.
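The early- and late-fusion strategies mentioned above can be sketched abstractly. The snippet below is our own toy illustration (the feature values and weights are invented, and a plain linear scorer stands in for an SVM decision function): early fusion concatenates the per-modality feature histograms before classification, while late fusion scores each modality separately and combines the scores.

```python
import numpy as np

# Toy per-modality BoW histograms for one video segment (invented values).
visual = np.array([0.5, 0.3, 0.2])
audio = np.array([0.7, 0.3])

# Early fusion: concatenate features, then apply a single classifier
# (a toy linear scorer standing in for an SVM decision function).
w_early = np.array([1.0, -0.5, 0.2, 0.8, -1.0])
early_score = w_early @ np.concatenate([visual, audio])

# Late fusion: score each modality separately, then average the scores.
w_vis, w_aud = np.array([1.0, -0.5, 0.2]), np.array([0.8, -1.0])
late_score = ((w_vis @ visual) + (w_aud @ audio)) / 2

print(round(early_score, 3), round(late_score, 3))
```

The two strategies generally produce different scores from the same features, which is why a system may combine both when deciding whether an event is present.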
The relationships 246 may include, for example, temporal relations between actions, objects, and/or audio events (e.g., "is followed by"), compositional relationships (e.g., person X is doing Y with object Z), interaction relationships (e.g., person X is pushing an object Y, or person Y is using an object Z), state relations involving people or objects (e.g., "is performing," "is saying"), co-occurrence relations (e.g., "is wearing," "is carrying"), spatial relations, temporal relations between objects (e.g., "is the same object as"), and/or other types of attributed relationships (e.g., spatial, causal, procedural, etc.). The relationships 246 may specify a variety of different types of relationships between low-level features 220, 222, 224, 226 and concepts 248, and/or between different types of concepts 248. Maintaining the data relating to features 244, relationships 246, and concepts 248 allows the system 100 to detect higher-level semantic concepts that tend to evidence events, including complex events. Some additional examples of such higher-level concepts include "crowd dancing," "person giving speech," "people drinking," and "person running." It should be noted that the relationships 246 include not only relationships between different visual features, but also relationships between different types of multimedia features and concepts; for instance, relationships between audio features 222 and visual features 220 (e.g., a loud sound is followed by a bright light) or relationships between text features and sound features (e.g., a "GO NINERS" sign and cheering). - Each or any of the
models, as well as the mapping 140, can maintain (e.g., probabilistic or statistical) indicators of the determined evidentiary significance of the relationships between features, concepts, events, and salient activities. In some embodiments, indicators of evidentiary significance are determined using machine learning techniques. For example, a machine learning analysis of training videos depicting a "person making a sandwich" (a complex event) may indicate that semantic descriptions such as "kitchen" (scene), "hands visible" (actor), "placing fillings on bread" (action), and "spreading creamy substance" (action) are highly likely to be associated with a person making a sandwich, while other semantic descriptions such as "outdoor event" (scene), "vehicle moving" (action), or "person jumping" (action) are unlikely to be associated with that particular event. Such indicators can be used by the multimedia content understanding module 104, or the event detection module 228 more specifically, to perform semantic reasoning. - The
event detection module 228 includes a description formulator 230 and an annotator 232. Once the event detection module 228 has determined (e.g., through semantic reasoning) an event associated with the input 102, the description formulator 230 generates the event description 106 as described above, and the annotator 232 annotates or otherwise associates the event description 106 with the multimedia input 102 (e.g., by appending a meta tag to the input 102 or by other suitable techniques). - The
event detection module 228 identifies the event description 106 to the salient activity detector module 234. The salient activity detector module 234 uses the salient event criteria 138 to evaluate the event description 106 and/or the detected features 220, 222, 224, 226 of the multimedia input 102 to determine the salient activities associated with the event description 106, or with the multimedia input 102 more generally. To do this, the salient activity detector module 234 maps the event description 106 to salient activities using, e.g., the mapping 140 and/or knowledge contained in the salient event criteria 138. The salient event criteria 138 can be derived from a number of different sources. For example, salient event markers 250 can be determined by applying machine learning techniques to training samples (e.g., portions of the image/video collection 150) of meta-level features 226. In other words, the computing system 100 can learn, over time, characteristic data values of meta-level features 226, or combinations of meta-level features, that tend to be representative of salient events. The illustrative salient event criteria 138 also include harvested salient event criteria 252. The harvested salient event criteria 252 are derived by analyzing samples of training data (e.g., portions of the image/video collection 150) to determine, for example, the activities that appear most often in videos depicting certain types of events. Activities that frequently appear in such videos may be considered salient activities by the computing system 100. The salient event criteria 138 also include salient activity templates 254. The templates 254 may include portions of the presentation templates 142. For example, a presentation template 142 may specify a list of activities that are considered to be "salient" for a particular type of video montage or other visual presentation 120.
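The "harvesting" of salient event criteria just described can be sketched in a few lines. This is our own toy illustration (the activity labels are assumed to come from earlier feature detection, and the 50% frequency threshold is an arbitrary choice, not one the patent specifies):

```python
from collections import Counter

# Hypothetical activity labels detected in training videos of one event type.
birthday_videos = [
    ["blowing out candles", "singing", "opening gifts"],
    ["blowing out candles", "opening gifts", "posing for pictures"],
    ["singing", "blowing out candles", "eating"],
]

def harvest_salient_activities(videos, min_fraction=0.5):
    """Activities appearing in at least min_fraction of the training videos."""
    counts = Counter(a for v in videos for a in set(v))
    cutoff = min_fraction * len(videos)
    return {a for a, c in counts.items() if c >= cutoff}

print(harvest_salient_activities(birthday_videos))
# {'blowing out candles', 'singing', 'opening gifts'} (set order may vary)
```

Rarely seen activities ("posing for pictures," "eating" in this toy data) fall below the cutoff and are not harvested as salient for the event type.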
The salient event criteria 138 may also include salient event criteria that are specified by, or derived from, user inputs 256. For example, the interactive storyboard module 124 may determine salient event criteria based on user inputs received by the editing module 126. As another example, the user preference learning module 148 may derive new or updated salient event criteria based on its analysis of the user feedback 146. - The
salient event criteria 138 identify the salient activities associated with the event description 106. The salient event criteria 138 also specify the information that the computing system 100 needs in order to detect those salient activities in the input 102 using the feature detection algorithms 130, e.g., salient event detection criteria. For example, the salient event detection criteria may include data values and/or algorithm parameters that indicate a particular combination of features that evidences a salient activity. Each of the salient event criteria 138 may have associated therewith one or more saliency indicators 238. The saliency indicators 238 can be used by the computing system 100 to select or prioritize the salient event segments 112. The saliency indicators 238 may be embodied as attributes of the markers 250, the harvested criteria 252, the templates 254, and/or the user inputs 256. For instance, each salient event criterion may have an associated saliency indicator 238. Further, a salient event criterion may have multiple saliency indicators 238, as the criterion may have different degrees of saliency in relation to different events. - The salient
activity detector module 234 identifies the salient activity detection criteria 236 (e.g., the instructions or data for algorithmically detecting the salient activities in the input 102) and the saliency indicator(s) 238 to the salient event segment identifier module 240. The salient event segment identifier module 240 uses the saliency indicator(s) 238 and/or the salient activity detection criteria 236 to select the appropriate feature detection algorithms 130 to execute on the input 102 in order to algorithmically identify the salient event segments 112, executes the selected algorithms 130, and provides data indicating the identified salient event segments 112 and the event description 106 to the output generator module 114. - Referring now to
FIG. 3 , an example of a process 300 executable by the computing system 100 to provide video content understanding and assistance services is shown. The process 300 may be embodied as computerized programs, routines, logic and/or instructions executed by the computing system 100, for example by one or more of the modules and other components shown in FIGS. 1 and 2 , described above. At block 310, the system 100 receives one or more input files, e.g., a multimedia input 102. The input file(s) can be embodied as, for example, raw video footage or digital pictures captured by a smartphone or other personal electronics device. The input file(s) may be stored on a local computing device and/or a remote computing device (e.g., in a personal cloud, such as through a document storing application like DROPBOX). Thus, the input file(s) may be received by a file uploading, file transfer, or messaging capability of the end user's computing device and/or the computing system 100 (e.g., the communication subsystems 644, 672). At block 312, the computing system 100 may perform a preliminary step of filtering the input file(s) based on one or more of the saliency indicators. For example, the computing system 100 may evaluate the meta-level features 226 of the input as a preliminary step and filter out any files or video frames that fall outside the scope of the saliency indicators 238 for the meta-level features 226. This pre-processing step may, for example, help eliminate low-quality footage prior to execution of other feature detection algorithms 130. - At
block 314, the computing system 100 executes the feature detection algorithms 130 on the input remaining after the pre-processing (if any) of block 312. The feature detection algorithms 130 detect visual features 220, audio features 222, and/or textual features 224, as described above. The algorithms may also detect meta-level features 226, if this was not already performed at block 312. At block 316, the system 100 evaluates the detected features 220, 222, 224, and/or 226 and, based on that evaluation, determines an event that is evidenced by the detected features. To do this, the system 100 maps the visual, audio, and/or textual features to semantic concepts (using, e.g., the concept model 136 and/or the mapping 140) at block 318. At block 320, the system 100 merges the features and/or concepts and generates a semantic description of an event that the system 100 has determined is evidenced by the detected features. The semantic description may be embodied as the event description 106 as described above. At block 322, the system 100 associates the semantic description of the event with the input file(s) (using, e.g., meta tags). Thus, as a result of block 316, the system 100 automatically classifies the input file(s) as depicting an event based on the output of the feature detection algorithms 130. - In other embodiments, at
block 316, the system 100 classifies the input without executing the feature detection algorithms 130. For example, the system 100 may receive the event type (e.g., an event description 106) from an end user, e.g., as a text string, search query, or meta tag. The system 100 may extract meta tags previously associated with the input and use those meta tags to classify the input. Thus, in some embodiments, feature detection algorithms may be used to determine the salient event segments as described herein, but not to determine the initial event classification. - At
block 324, the computing system 100 determines the salient activities that are associated with the event determined at block 316. To do this, the system 100 uses the salient event criteria 138 and/or the mapping 140 to evaluate the event information generated at block 316. For example, the system 100 determines, using the mapping 140, activities that are associated with the detected event. The system 100 may also use the saliency indicators 238 at block 326 to prioritize the salient activities so that, for example, if a template 142 specifies a limitation on the length or duration of the visual presentation 120, segments of the input that depict the higher-priority salient activities can be included in the presentation 120 and segments that depict lower-priority activities may be excluded from the presentation 120. At block 328, the system 100 determines the salient activity detection criteria for each of the salient activities identified by the salient event criteria. As discussed above, the salient activity detection criteria are used by the system 100 to algorithmically identify the salient event segments 112 of the input file(s). For example, the system 100 may utilize the salient activity detection criteria as input or parameters of one or more of the feature detection algorithms 130. - At
block 332, the system 100 identifies the salient event segments 112 in the multimedia input file(s). To do this, the system 100 executes one or more of the feature detection algorithms 130 using the salient activity detection criteria determined at block 328. The system 100 also uses the saliency indicators 238, if any, to filter or prioritize the salient event segments 112 (block 334). At block 336, the computing system 100 generates the visual presentation 120 (e.g., a video clip or "montage") and/or the NL description 122, using, e.g., the templates described above. At block 340, the system 100 presents the visual presentation 120 and/or the NL description 122 to the end user using an interactive storyboard type of user interface mechanism. The interactive storyboard mechanism allows the user to review, edit, approve, and share the visual presentation 120 and/or the NL description 122. At block 342, the computing system 100 incorporates feedback learned or observed through the user's interactions with the interactive storyboard into one or more of the system models, templates, and/or knowledge bases. As mentioned above, the templates and/or the knowledge base 132 may be updated in response to the user's interactions (e.g., editing, viewing, sharing) with the visual presentation 120 and/or the NL description 122 via the interactive storyboard mechanism. It should be noted that while this disclosure refers to a "storyboard" type mechanism, other types and formats of intuitive human-computer interfaces or other mechanisms for the review and editing of multimedia files may be used equally as well. - Referring now to
FIG. 4 , an embodiment of the mapping 140 is shown in greater detail. The illustrative mapping 140 and portions thereof may be embodied as one or more data structures, such as a searchable database, table, or knowledge base, in software, firmware, hardware, or a combination thereof. The mapping 140 establishes relationships between and/or among semantic elements of the various stored models described above (e.g., the feature models 134 and the concept model 136). As shown in FIG. 4 , the illustrative mapping 140 defines logical links or connections among features 410, concepts 412, relations 414, events 416, and salient activities 418. The system 100 can use the mapping 140, and particularly these links, to reason about events and salient activities based on the features 410, concepts 412, and relations 414. The mapping 140 may be embodied as, for example, an ontology that defines the various relationships between the semantic elements shown in FIG. 4 . The mapping 140 may be initially developed through a manual authoring process and/or by executing machine learning algorithms on sets of training data. The mapping 140 may be updated in response to use of the system 100 over time using, e.g., one or more machine learning techniques. The mapping 140 may be stored in computer memory, e.g., as part of the stored models, knowledge base, and/or templates 626, 666. - Referring now to
FIG. 5 , an example 500 of salient event segment identification as disclosed herein is shown. Video 510 is an input to the system 100. The system 100 analyzes the video using feature detection algorithms and semantic reasoning as described above. Time-dependent output of the system 100's semantic analysis of the detected features is shown by the graphics overlaid along the video's timeline. Graphic portion 520 represents a salient event segment 112 of the video 510. The system 100 has determined, using the techniques described above, that the salient event segment 520 depicts the salient activity of blowing out candles. Similarly, the system 100 has identified, using the techniques described above, additional salient event segments, including salient event segment 526, which depicts a person opening a box. The system 100 did not detect any segments depicting the activity of cutting a cake in the video 510. The system 100 can extract the identified segments from the video 510 and incorporate them into a visual presentation 120 of the video 510. - One usage scenario of the technology disclosed herein provides a fully-automated video creation service with a "one-click" sharing capability. In this embodiment, the service takes one or more input video files, which can be, e.g., raw footage captured by a smartphone, and identifies a most-relevant event type for the uploaded video, e.g., birthday party, wedding, football game. The service identifies the event using feature recognition algorithms, such as those described above and in the aforementioned priority patent applications of SRI International. In some instances, the step of event identification can be done manually, such as by the user selecting an event type from a menu or by typing in keywords. In this embodiment, the event type corresponds to a stored template identifying the key activities typically associated with that event type. For example, for a birthday party, associated activities would include blowing out candles, singing the Happy Birthday song, opening gifts, posing for pictures, etc.
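The event-type-to-key-activities template described in this scenario can be modeled as a simple lookup. This is our own sketch (the dictionary name and entries are invented for illustration, though the birthday activities echo the examples in the text):

```python
# Hypothetical stored templates: key activities per event type.
EVENT_TEMPLATES = {
    "birthday party": ["blowing out candles", "singing happy birthday",
                       "opening gifts", "posing for pictures"],
    "football game": ["kickoff", "touchdown", "crowd cheering"],
}

def key_activities(event_type):
    """Look up the activities the service should search for in the footage."""
    return EVENT_TEMPLATES.get(event_type.lower(), [])

assert "opening gifts" in key_activities("Birthday Party")
assert key_activities("unknown event") == []  # no template for this event type
```

Whether the event type comes from automatic recognition or from a user menu selection, the same lookup yields the activity list that drives segment identification.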
Using feature detection algorithms for complex activity recognition, such as those described above and in the aforementioned priority patent applications, the service automatically identifies segments (sequences of frames) within the uploaded file that depict the various key activities or moments associated with the relevant event type. The service automatically creates a highlight clip by splicing together the automatically-identified segments. The user can review the clip and instruct the service to save the clip, download it, and/or post/share it on a desired social network or other site.
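The splice step in this scenario can be sketched in a few lines. This is our own toy illustration (the time ranges and saliency scores are invented): filter the automatically identified segments by a saliency threshold, then order the survivors chronologically so the highlight clip preserves the original timeline.

```python
# Hypothetical salient event segments: (start_sec, end_sec, saliency).
segments = [(95, 110, 0.9), (10, 25, 0.7), (300, 320, 0.2), (40, 55, 0.8)]

def montage_plan(segments, min_saliency=0.5):
    """Filter by saliency indicator, then order chronologically for splicing."""
    keep = [s for s in segments if s[2] >= min_saliency]
    return sorted(keep, key=lambda s: s[0])

plan = montage_plan(segments)
print(plan)   # [(10, 25, 0.7), (40, 55, 0.8), (95, 110, 0.9)]
print(sum(end - start for start, end, _ in plan))  # 45 seconds of footage
```

An actual implementation would then cut the source video at these time ranges and concatenate the pieces into the highlight clip.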
- In another usage scenario, a video creation service provides interactive storyboard editing capabilities. In this embodiment, the process begins with one or more user-designated input video files. As above, an event type is identified, and video segments depicting key activities/moments are algorithmically identified. In this embodiment, the system may identify more salient event segments than it actually proposes to use in a highlight clip, e.g., due to limits on total clip length, uncertainty about which segments are best, redundant capture of the same event/activities by multiple video sources, or other factors. Then, the service displays to the user an interactive storyboard, with thumbnails or other icons representing each of the segments along a timeline, and a visual indication of the segments that the system tentatively proposes to use in the highlight clip. If there is video from multiple sources for the same activities, then multiple corresponding rows of segments can be displayed, with visual indication of which segments from each row are to be used in the edited clip. An example situation in which this may occur is a child's birthday party, where both parents and other relatives (e.g., grandparents) may be taking video of the child's party. In this case, the
system 100 can select salient event segments from the different videos in the group of videos taken by the family members and merge them into a highlight clip. In this embodiment, the user can modify the content of the highlight clip interactively by selecting different segments to use, with the interactive storyboard interface. In some embodiments, the user can also change the beginning and/or ending frames of a segment by selecting the segment for editing, previewing it along with neighboring frames from the original footage, and using interactive controls to mark desired start and end frames for the system. Once the user review/editing is complete, the service constructs a highlight clip by splicing segments together in accordance with the user's edits. As above, the user can preview the clip and decide to save or share the clip, for example. - Other embodiments include additional features, alternatively or in addition to those described above. As illustrated in the above example, the
system 100 can handle multiple input files at one time, even if the input files are from multiple different users/devices. In some embodiments, the automated creation of the highlight clip can be performed by the system 100 in real time, on live video (e.g., immediately after a user has finished filming, the system 100 can initiate the highlight creation process). In the interactive embodiments, users can add transition effects between segments, where the transition effects may be automatically selected by the service and/or chosen by the user. - In some embodiments, the
event description 106 or the NL description 122 generated automatically by the system 100 can take the form of, e.g., metadata that can be used for indexing, search and retrieval, and/or for advertising (e.g., as ad words). The metadata can include keywords that are derived from the algorithmically performed complex activity recognition and other semantic video analysis (e.g., face, location, and object recognition; text OCR; voice recognition) performed by the system 100 using the feature detection algorithms 130 as described above. The metadata can be used by query expansion and/or query augmentation mechanisms to facilitate keyword searching, clustering, or browsing of a collection 150. Alternatively or in addition, the metadata can be used for automatic image/video suggestion in response to user input or in relation to other electronic content, such as text or images posted on an Internet web site (e.g., by the auto-suggest module 162). For example, if a user begins typing text in an email, text message, or social media post, the system 100 can use the metadata to automatically, based on the text input (which may be only partially complete), generate a list of relevant images and/or videos, which the user may want to attach to the message or share along with the post. - In some embodiments, the content processing, e.g., the complex event recognition, is done on a server computer (e.g., by a proprietary video creation service), so captured video files are uploaded by the customer to the server, e.g., using a client application running on a personal electronic device or through an interactive website. Interactive aspects, such as storyboard selection and editing of clip segments, may be carried out via online interaction between the customer's capture device (e.g., camera, smartphone, etc.), or other customer local device (e.g., tablet, laptop), and the video creation service's server computer.
Responsive to local commands entered on the customer's device, the server can assemble clip segments as desired, and redefine beginning and end frames of segments, with respect to the uploaded video content. Results can be streamed to the customer's device for interactive viewing. Alternatively or in addition, computer vision algorithms (such as complex event recognition algorithms) may be implemented locally on the user's capture device and the video creation service can be delivered as an executable application running on the customer's device.
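Redefining a segment's beginning and end frames in response to user commands, as described above, can be sketched as a clamp against the original footage. This is our own illustration (the function and field names are invented):

```python
def redefine_segment(segment, new_start, new_end, total_frames):
    """Apply user-chosen start/end frames to a segment, clamped so the
    result stays inside the uploaded footage and remains non-empty."""
    start = max(0, min(new_start, total_frames - 1))
    end = max(start + 1, min(new_end, total_frames))
    return {**segment, "start": start, "end": end}

clip = {"label": "blowing out candles", "start": 120, "end": 300}
edited = redefine_segment(clip, new_start=100, new_end=5000, total_frames=4500)
print(edited)  # {'label': 'blowing out candles', 'start': 100, 'end': 4500}
```

Clamping at the server keeps a segment valid even if the user's interactive controls request frames outside the uploaded video.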
- Referring now to
FIG. 6 , a simplified block diagram of an embodiment 600 of the multimedia content understanding and assistance computing system 100 is shown. While the illustrative embodiment 600 is shown as involving multiple components and devices, it should be understood that the computing system 100 may constitute a single computing device, alone or in combination with other devices. The embodiment 600 includes a user computing device 610, which embodies features and functionality of a "client-side" or "front end" portion 618 of the computing system 100 depicted in FIG. 1 , and a server computing device 650, which embodies features and functionality of a "server-side" or "back end" portion 658 of the system 100. The embodiment 600 includes a display device 680 and a camera 682, each of which may be used alternatively or in addition to the camera 630 and display device 642 of the user computing device 610. Each or any of the computing devices may communicate with one another via one or more networks 646. - The
computing system 100 or portions thereof may be distributed across multiple computing devices that are connected to the network(s) 646 as shown. In other embodiments, however, the computing system 100 may be located entirely on, for example, the computing device 610 or one of the other devices. In some embodiments, portions of the system 100 may be incorporated into other systems or computer applications. Such applications or systems may include, for example, commercial off-the-shelf (COTS) or custom-developed virtual personal assistant applications, video montage creation applications, content sharing services such as YOUTUBE and INSTAGRAM, and social media services such as FACEBOOK and TWITTER. As used herein, "application" or "computer application" may refer to, among other things, any type of computer program or group of computer programs, whether implemented in software, hardware, or a combination thereof, and includes self-contained, vertical, and/or shrink-wrapped software applications, distributed and cloud-based applications, and/or others. Portions of a computer application may be embodied as firmware, as one or more components of an operating system, a runtime library, an application programming interface (API), as a self-contained software application, or as a component of another software application, for example. - The illustrative user computing device 610 includes at least one processor 612 (e.g., a microprocessor, microcontroller, digital signal processor, etc.),
memory 614, and an input/output (I/O) subsystem 616. The computing device 610 may be embodied as any type of computing device capable of performing the functions described herein, such as a personal computer (e.g., desktop, laptop, tablet, smart phone, body-mounted device, wearable device, etc.), a server, an enterprise computer system, a network of computers, a combination of computers and other electronic devices, or other electronic devices. Although not specifically shown, it should be understood that the I/O subsystem 616 typically includes, among other things, an I/O controller, a memory controller, and one or more I/O ports. The processor 612 and the I/O subsystem 616 are communicatively coupled to the memory 614. The memory 614 may be embodied as any type of suitable computer memory device (e.g., volatile memory such as various forms of random access memory). - The I/
O subsystem 616 is communicatively coupled to a number of hardware and/or software components, including the components of the computing system shown in FIGS. 1 and 2 or portions thereof (e.g., the multimedia content assistant front end modules 618), the camera 630, and the display device 642. As used herein, a "camera" may refer to any device that is capable of acquiring and recording two-dimensional (2D) or three-dimensional (3D) video images of portions of the real-world environment, and may include cameras with one or more fixed camera parameters and/or cameras having one or more variable parameters, fixed-location cameras (such as "stand-off" cameras that are installed in walls or ceilings), and/or mobile cameras (such as cameras that are integrated with consumer electronic devices, such as laptop computers, smart phones, tablet computers, wearable electronic devices, and/or others). - The
camera 630, a microphone 632, speaker(s) 640, and the display device 642 may form part of a human-computer interface subsystem 638, which includes one or more user input devices (e.g., a touchscreen, keyboard, virtual keypad, microphone, etc.) and one or more output devices (e.g., speakers, displays, LEDs, etc.). The human-computer interface device(s) 638 may include, for example, a touchscreen display, a touch-sensitive keypad, a kinetic sensor and/or other gesture-detecting device, an eye-tracking sensor, and/or other devices that are capable of detecting human interactions with a computing device. - The
devices 680, 682 are illustrated in FIG. 6 as being in communication with the user computing device 610, either by the I/O subsystem 616 or a network 646. It should be understood that any or all of these devices may be local to or remote from the computing device 610. For example, the camera 630 and/or microphone 632 may be embodied in a wearable device, such as a head-mounted display, GOOGLE GLASS-type device, or BLUETOOTH earpiece, which then communicates wirelessly with the computing device 610. Alternatively, these devices may be integrated with the computing device 610. - The I/
O subsystem 616 is also communicatively coupled to one or more storage media 620, an ASR subsystem 634, an OCR subsystem 636, and a communication subsystem 644. It should be understood that each of the foregoing components and/or systems may be integrated with the computing device 610 or may be a separate component or system that is in communication with the I/O subsystem 616 (e.g., over a network 646 or a bus connection). The ASR subsystem 634 and the OCR subsystem 636 are, illustratively, COTS systems that are configured to interface with the computing system 100. - The
storage media 620 may include one or more hard drives or other suitable data storage devices (e.g., flash memory, memory cards, memory sticks, and/or others). In some embodiments, portions of the computing system 100, e.g., the front end modules 618 and/or the multimedia inputs 102, the visual presentation 120, the NL description 122, the algorithms 130, the knowledge base 132, and the templates, reside at least temporarily in the storage media 620. Portions of the computing system 100, e.g., the multimedia inputs 102, the visual presentation 120, the NL description 122, the algorithms 130, the knowledge base 132, and the templates, may be copied to the memory 614 during operation of the computing device 610, for faster processing or other reasons. - The
communication subsystem 644 communicatively couples the user computing device 610 to one or more other devices, systems, or communication networks, e.g., a local area network, wide area network, personal cloud, enterprise cloud, public cloud, and/or the Internet, for example. Accordingly, the communication subsystem 644 may include one or more wired or wireless network interfaces, embodied in software, firmware, or hardware, as may be needed pursuant to the specifications and/or design of the particular embodiment of the system 100. - The
display device 680, the camera 682, and the server computing device 650 each may be embodied as any suitable type of computing device or personal electronic device capable of performing the functions described herein, such as any of the aforementioned types of devices or other electronic devices. For example, in some embodiments, the server computing device 650 may operate a "back end" portion 658 of the multimedia content assistant computing system 100. The server computing device 650 may include one or more server computers including storage media 660, which may be used to store portions of the computing system 100, such as the back end modules 658 and/or portions of the multimedia inputs 102, the visual presentation 120, the NL description 122, the algorithms 130, the knowledge base 132, and the templates. The server computing device 650 also includes an HCI subsystem 670 and a communication subsystem 672. In general, components of the server computing device 650 having similar names to components of the computing device 610 described above may be embodied similarly. Further, each of the devices 680, 682 may include components similar to those described above in connection with the user computing device 610 and/or the server computing device 650. The computing system 100 may include other components, sub-components, and devices not illustrated in FIG. 6 for clarity of the description. In general, the components of the computing system 100 are communicatively coupled as shown in FIG. 6 by signal paths, which may be embodied as any type of wired or wireless signal paths capable of facilitating communication between the respective devices and components. - Illustrative examples of the technologies disclosed herein are provided below. An embodiment of the technologies may include any one or more, and any combination of, the examples described below.
- In an example 1, a video assistant for understanding content of a video is embodied in one or more non-transitory machine accessible storage media of a computing system and includes instructions executable by one or more processors to cause the computing system to: detect a plurality of different features in a plurality of different segments of a video by executing a plurality of different feature detection algorithms on the video, each of the video segments comprising one or more frames of the video; determine an event evidenced by the detected features; determine a plurality of salient activities associated with the event; and algorithmically identify a plurality of different salient event segments of the video, each of the salient event segments depicting a salient activity associated with the event.
- An example 2 includes the subject matter of example 1, and includes instructions executable to determine salient event criteria associated with the event, and determine the salient activities associated with the event based on the salient event criteria. An example 3 includes the subject matter of example 2, and includes instructions executable to determine the salient event criteria by one or more of: algorithmically analyzing a video collection and algorithmically learning a user specification relating to the salient event criteria. An example 4 includes the subject matter of any of examples 1-3, and includes instructions executable to determine a saliency indicator associated with each of the salient event segments of the video, and select a subset of the plurality of salient event segments for inclusion in a visual presentation based on the saliency indicator. An example 5 includes the subject matter of any of examples 1-4, and includes instructions executable to extract the salient event segments from the video and incorporate the extracted salient event segments into a video clip. An example 6 includes the subject matter of example 5, and includes instructions executable to, in response to user input, share the video clip with another computing device over a network. An example 7 includes the subject matter of example 5, and includes instructions executable to one or more of: (i) by a human-computer interface device, interactively edit the video clip and (ii) automatically edit the video clip. An example 8 includes the subject matter of example 7, and includes instructions executable to store data relating to the interactive editing of the video clip, execute one or more machine learning algorithms on the stored data, and, in response to the execution of the one or more machine learning algorithms on the stored data, update one or more of: the determination of salient activities associated with the event and the identification of salient event segments.
An example 9 includes the subject matter of any of examples 1-8, and includes instructions executable to, by a human-computer interface device, output a natural language description of the one or more salient event segments.
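The claim language of examples 1-9 describes a high-level pipeline: run multiple feature detection algorithms over video segments, determine the event evidenced by the detected features, and identify the segments depicting salient activities. The Python sketch below is purely illustrative of that flow; every name and data structure here (`Segment`, the event-model dictionaries, the label-overlap saliency count) is an invented stand-in, not anything specified by the disclosure:

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    """One or more contiguous frames of a video (hypothetical stand-in)."""
    start_frame: int
    end_frame: int
    features: dict = field(default_factory=dict)  # detector name -> detected labels

def detect_features(segments, detectors):
    """Execute each feature detection algorithm on each video segment."""
    for seg in segments:
        for name, detector in detectors.items():
            seg.features[name] = detector(seg)
    return segments

def infer_event(segments, event_models):
    """Pick the event model best evidenced by the detected features."""
    def evidence_score(model):
        return sum(label in model["evidence"]
                   for seg in segments
                   for labels in seg.features.values()
                   for label in labels)
    return max(event_models, key=evidence_score)

def salient_segments(segments, event, top_k=3):
    """Rank segments by a saliency indicator (examples 4 and 19): here, simply
    how many detected labels match the event's salient activities."""
    def saliency(seg):
        return sum(label in event["salient_activities"]
                   for labels in seg.features.values()
                   for label in labels)
    ranked = sorted(segments, key=saliency, reverse=True)
    return [seg for seg in ranked[:top_k] if saliency(seg) > 0]
```

Here the saliency indicator is a bare match count for brevity; the disclosure contemplates richer indicators derived from the detected features, and learned (rather than hand-listed) salient event criteria.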
- In an example 10, a computing system for understanding content of a video includes: one or more computing devices; and a plurality of processor-executable modules embodied in one or more non-transitory machine accessible storage media of the one or more computing devices, the processor-executable modules comprising: a visual content understanding module to cause the computing system to: detect a plurality of different features in a plurality of different segments of a video by executing one or more event recognition algorithms; determine a semantic description of an event evidenced by one or more of the detected features; and identify one or more salient event segments of the video, each salient event segment depicting a salient activity associated with the event; an output generator module to cause the computing system to output a video clip comprising the salient event segments; and an interactive storyboard module to cause the computing system to one or more of: interactively edit the video clip and share the video clip over a network.
- An example 11 includes the subject matter of example 10, wherein the visual content understanding module is to cause the computing system to determine a configuration of a camera used to record the video; derive, from the camera configuration, a user intent with respect to the video; and identify the salient event segments by selecting one or more segments of the video that relate to the user intent. An example 12 includes the subject matter of example 11, wherein in response to the user intent, the output generator module is to cause the computing system to select a template for creating the video clip. An example 13 includes the subject matter of example 10, wherein the visual content understanding module is to cause the computing system to determine the semantic description based on a plurality of different algorithmically-detected features comprising two or more of: a visual feature, an audio feature, a textual feature, and a meta-level feature indicative of a camera configuration. An example 14 includes the subject matter of any of examples 10-13, wherein the visual content understanding module is to cause the computing system to determine relationships between the detected features, map the detected features and relationships to semantic concepts, and formulate the semantic description to comprise the semantic concepts. An example 15 includes the subject matter of any of examples 10-14, wherein the visual content understanding module is to cause the computing system to, in an automated fashion, associate the semantic description with the video. An example 16 includes the subject matter of any of examples 10-15, wherein the visual content understanding module is to cause the computing system to determine a salient event criterion and, based on the salient event criterion, identify the salient activity associated with the event.
An example 17 includes the subject matter of example 16, wherein the computing system is to learn the salient event criterion by analyzing a professionally-made video. An example 18 includes the subject matter of example 16, wherein the computing system is to analyze one or more of: semantic content of a video collection and user input, and determine the salient event criterion based on the analysis of the one or more of the semantic content and the user input. An example 19 includes the subject matter of any of examples 10-18, wherein the computing system is to determine a saliency indicator comprising data associated with one or more of the detected features, and use the saliency indicator to identify the salient event segments.
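Examples 13-14 and 22 describe mapping detected features to semantic concepts and formulating a semantic description from those concepts. The minimal sketch below illustrates only that mapping-and-formulation step; the `CONCEPTS` table (loosely analogous in role to the knowledge base 132 referenced above) and the sentence template are invented for illustration:

```python
# Hypothetical label-to-concept ontology; the disclosure does not specify
# how labels map to concepts, only that such a mapping is performed.
CONCEPTS = {
    "cake": "dessert",
    "candles": "celebration",
    "singing": "celebration",
}

def semantic_description(event_name, feature_labels, concepts=CONCEPTS):
    """Map detected feature labels to semantic concepts and formulate a
    short natural-language description of the event."""
    matched = sorted({concepts[label] for label in feature_labels if label in concepts})
    if not matched:
        return f"A video depicting a {event_name}."
    return (f"A video depicting a {event_name}, evidenced by "
            + ", ".join(matched) + " cues.")
```

A real system would derive concepts from relationships among multimodal features (visual, audio, textual, and meta-level), not from a flat lookup table; the flat table simply keeps the formulation step visible.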
- In an example 20, a computing system for understanding visual content in digital images includes: one or more computing devices; and instructions embodied in one or more non-transitory machine accessible storage media of the one or more computing devices, the instructions executable by the one or more computing devices to cause the computing system to: detect a plurality of different features in a set of digital images by executing a plurality of different feature detection algorithms on the set of images; map the one or more features detected by the feature detection algorithms to an event, the event evidenced by the one or more detected features; determine a plurality of salient activities associated with the event; extract one or more salient event segments from the set of images, each of the salient event segments depicting a salient activity associated with the event; and incorporate the extracted one or more salient event segments into a visual presentation.
- An example 21 includes the subject matter of example 20, wherein the instructions cause the computing system to select at least two of: a visual feature detection algorithm, an audio feature detection algorithm, and a textual feature detection algorithm, execute the selected feature detection algorithms to detect at least two of: a visual feature, an audio feature, and a textual feature of the set of images, and determine the event evidenced by at least two of: the visual feature, the audio feature, and the textual feature. An example 22 includes the subject matter of example 20 or example 21, wherein the instructions cause the computing system to, in an automated fashion, generate a semantic description of the event based on the one or more features detected by the feature detection algorithms. An example 23 includes the subject matter of any of examples 20-22, wherein the instructions cause the computing system to determine a saliency indicator associated with each of the salient event segments, and arrange the salient event segments in the visual presentation according to the saliency indicators associated with the salient event segments. An example 24 includes the subject matter of example 23, wherein the instructions cause the computing system to one or more of: (i) by a human-computer interface device of the computing system, interactively rearrange the salient event segments in the visual presentation and (ii) automatically rearrange the salient event segments in the visual presentation. An example 25 includes the subject matter of example 23, wherein the instructions cause the computing system to select a subset of the salient event segments based on the saliency indicators associated with the salient event segments, and create a visual presentation comprising the salient event segments in the selected subset of salient event segments.
An example 26 includes the subject matter of any of examples 20-25, wherein the instructions cause the computing system to, in an automated fashion, associate a description of the event with the images in the set of digital images. An example 27 includes the subject matter of example 26, wherein the instructions cause the computing system to detect user input comprising a textual description, compare the textual description to the description of the event associated with the images in the set of digital images, and, in an automated fashion, suggest one or more images having a relevancy to the textual description as determined by the comparison of the textual description of the user input to the description of the event.
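Example 27 compares a user's textual description against event descriptions previously associated with images, then suggests the most relevant images. The disclosure does not specify the comparison method; the sketch below substitutes a simple word-overlap relevancy score, with all names invented for illustration:

```python
def suggest_images(user_text, image_descriptions, top_k=2):
    """Suggest images whose associated event description is most relevant to
    the user's textual description, scored here by shared-word count (a
    hypothetical stand-in for example 27's comparison step)."""
    query = set(user_text.lower().split())

    def relevancy(description):
        return len(query & set(description.lower().split()))

    ranked = sorted(image_descriptions.items(),
                    key=lambda item: relevancy(item[1]), reverse=True)
    return [image_id for image_id, desc in ranked[:top_k] if relevancy(desc) > 0]
```

In practice the comparison would operate on the semantic event descriptions generated by the system rather than raw word overlap, but the retrieval shape (score, rank, filter, suggest) is the same.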
- In the foregoing description, numerous specific details, examples, and scenarios are set forth in order to provide a more thorough understanding of the present disclosure. It will be appreciated, however, that embodiments of the disclosure may be practiced without such specific details. Further, such examples and scenarios are provided for illustration, and are not intended to limit the disclosure in any way. Those of ordinary skill in the art, with the included descriptions, should be able to implement appropriate functionality without undue experimentation.
- References in the specification to "an embodiment," etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is believed to be within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly indicated.
- Embodiments in accordance with the disclosure may be implemented in hardware, firmware, software, or any combination thereof. Embodiments may also be implemented as instructions stored using one or more machine-readable media, which may be read and executed by one or more processors. A machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device or a “virtual machine” running on one or more computing devices). For example, a machine-readable medium may include any suitable form of volatile or non-volatile memory.
- Modules, data structures, blocks, and the like are referred to as such for ease of discussion, and are not intended to imply that any specific implementation details are required. For example, any of the described modules and/or data structures may be combined or divided into sub-modules, sub-processes or other units of computer code or data as may be required by a particular design or implementation. In the drawings, specific arrangements or orderings of schematic elements may be shown for ease of description. However, the specific ordering or arrangement of such elements is not meant to imply that a particular order or sequence of processing, or separation of processes, is required in all embodiments. In general, schematic elements used to represent instruction blocks or modules may be implemented using any suitable form of machine-readable instruction, and each such instruction may be implemented using any suitable programming language, library, application-programming interface (API), and/or other software development tools or frameworks. Similarly, schematic elements used to represent data or information may be implemented using any suitable electronic arrangement or data structure. Further, some connections, relationships or associations between elements may be simplified or not shown in the drawings so as not to obscure the disclosure. This disclosure is to be considered as exemplary and not restrictive in character, and all changes and modifications that come within the spirit of the disclosure are desired to be protected.
Claims (27)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/332,071 US20140328570A1 (en) | 2013-01-09 | 2014-07-15 | Identifying, describing, and sharing salient events in images and videos |
US14/846,318 US10679063B2 (en) | 2012-04-23 | 2015-09-04 | Recognizing salient video events through learning-based multimodal analysis of visual features and audio-based analytics |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/737,607 US9244924B2 (en) | 2012-04-23 | 2013-01-09 | Classification, search, and retrieval of complex video events |
US14/332,071 US20140328570A1 (en) | 2013-01-09 | 2014-07-15 | Identifying, describing, and sharing salient events in images and videos |
Related Parent Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/737,607 Continuation US9244924B2 (en) | 2012-04-23 | 2013-01-09 | Classification, search, and retrieval of complex video events |
US13/737,607 Continuation-In-Part US9244924B2 (en) | 2012-04-23 | 2013-01-09 | Classification, search, and retrieval of complex video events |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/846,318 Continuation US10679063B2 (en) | 2012-04-23 | 2015-09-04 | Recognizing salient video events through learning-based multimodal analysis of visual features and audio-based analytics |
Publications (1)
Publication Number | Publication Date |
---|---|
US20140328570A1 (en) | 2014-11-06 |
Family
ID=51841475
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/332,071 Abandoned US20140328570A1 (en) | 2012-04-23 | 2014-07-15 | Identifying, describing, and sharing salient events in images and videos |
US14/846,318 Active 2035-06-08 US10679063B2 (en) | 2012-04-23 | 2015-09-04 | Recognizing salient video events through learning-based multimodal analysis of visual features and audio-based analytics |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/846,318 Active 2035-06-08 US10679063B2 (en) | 2012-04-23 | 2015-09-04 | Recognizing salient video events through learning-based multimodal analysis of visual features and audio-based analytics |
Country Status (1)
Country | Link |
---|---|
US (2) | US20140328570A1 (en) |
Cited By (168)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140245153A1 (en) * | 2013-02-28 | 2014-08-28 | Nk Works Co., Ltd. | Image processing apparatus, computer-readable medium storing an image processing program, and image processing method |
US20150205771A1 (en) * | 2014-01-22 | 2015-07-23 | Panasonic Intellectual Property Corporation Of America | Terminal apparatus, server apparatus, method for supporting posting of information, and non-transitory recording medium storing computer program |
US20150373281A1 (en) * | 2014-06-19 | 2015-12-24 | BrightSky Labs, Inc. | Systems and methods for identifying media portions of interest |
US20160012597A1 (en) * | 2014-07-09 | 2016-01-14 | Nant Holdings Ip, Llc | Feature trackability ranking, systems and methods |
US20160026874A1 (en) * | 2014-07-23 | 2016-01-28 | Gopro, Inc. | Activity identification in video |
US20160199742A1 (en) * | 2014-01-31 | 2016-07-14 | Google Inc. | Automatic generation of a game replay video |
US20160217348A1 (en) * | 2015-01-27 | 2016-07-28 | Samsung Electronics Co., Ltd. | Image Processing Method and Electronic Device for Supporting the Same |
WO2016154158A1 (en) * | 2015-03-25 | 2016-09-29 | Microsoft Technology Licensing, Llc | Machine learning to recognize key moments in audio and video calls |
US9473803B2 (en) * | 2014-08-08 | 2016-10-18 | TCL Research America Inc. | Personalized channel recommendation method and system |
WO2016172379A1 (en) * | 2015-04-21 | 2016-10-27 | Stinkdigital, Ltd | Video delivery platform |
US20160360079A1 (en) * | 2014-11-18 | 2016-12-08 | Sony Corporation | Generation apparatus and method for evaluation information, electronic device and server |
US9620173B1 (en) * | 2016-04-15 | 2017-04-11 | Newblue Inc. | Automated intelligent visualization of data through text and graphics |
US9646652B2 (en) | 2014-08-20 | 2017-05-09 | Gopro, Inc. | Scene and activity identification in video summary generation based on motion detected in a video |
US20170132498A1 (en) * | 2015-11-11 | 2017-05-11 | Adobe Systems Incorporated | Structured Knowledge Modeling, Extraction and Localization from Images |
CN106682060A (en) * | 2015-11-11 | 2017-05-17 | 奥多比公司 | Structured Knowledge Modeling, Extraction and Localization from Images |
US9659218B1 (en) * | 2015-04-29 | 2017-05-23 | Google Inc. | Predicting video start times for maximizing user engagement |
US9679605B2 (en) | 2015-01-29 | 2017-06-13 | Gopro, Inc. | Variable playback speed template for video editing application |
US9721611B2 (en) | 2015-10-20 | 2017-08-01 | Gopro, Inc. | System and method of generating video from video clips based on moments of interest within the video clips |
US9734870B2 (en) | 2015-01-05 | 2017-08-15 | Gopro, Inc. | Media identifier generation for camera-captured media |
US20170243065A1 (en) * | 2016-02-19 | 2017-08-24 | Samsung Electronics Co., Ltd. | Electronic device and video recording method thereof |
US9754159B2 (en) | 2014-03-04 | 2017-09-05 | Gopro, Inc. | Automatic generation of video from spherical content using location-based metadata |
US9761278B1 (en) | 2016-01-04 | 2017-09-12 | Gopro, Inc. | Systems and methods for generating recommendations of post-capture users to edit digital media content |
CN107239801A (en) * | 2017-06-28 | 2017-10-10 | 安徽大学 | Video attribute represents that learning method and video text describe automatic generation method |
US9794632B1 (en) | 2016-04-07 | 2017-10-17 | Gopro, Inc. | Systems and methods for synchronization based on audio track changes in video editing |
US9812175B2 (en) | 2016-02-04 | 2017-11-07 | Gopro, Inc. | Systems and methods for annotating a video |
US9838731B1 (en) | 2016-04-07 | 2017-12-05 | Gopro, Inc. | Systems and methods for audio track selection in video editing with audio mixing option |
US9836853B1 (en) | 2016-09-06 | 2017-12-05 | Gopro, Inc. | Three-dimensional convolutional neural networks for video highlight detection |
US9858340B1 (en) | 2016-04-11 | 2018-01-02 | Digital Reasoning Systems, Inc. | Systems and methods for queryable graph representations of videos |
US9894393B2 (en) | 2015-08-31 | 2018-02-13 | Gopro, Inc. | Video encoding for reduced streaming latency |
US9922682B1 (en) | 2016-06-15 | 2018-03-20 | Gopro, Inc. | Systems and methods for organizing video files |
US9972066B1 (en) | 2016-03-16 | 2018-05-15 | Gopro, Inc. | Systems and methods for providing variable image projection for spherical visual content |
US9998769B1 (en) | 2016-06-15 | 2018-06-12 | Gopro, Inc. | Systems and methods for transcoding media files |
US10002641B1 (en) | 2016-10-17 | 2018-06-19 | Gopro, Inc. | Systems and methods for determining highlight segment sets |
US20180173319A1 (en) * | 2013-12-31 | 2018-06-21 | Google Llc | Systems and methods for gaze-based media selection and editing |
CN108228705A (en) * | 2016-12-09 | 2018-06-29 | 波音公司 | Automatic object and activity tracking equipment, method and medium in live video feedback |
US10037767B1 (en) * | 2017-02-01 | 2018-07-31 | Wipro Limited | Integrated system and a method of identifying and learning emotions in conversation utterances |
US10045120B2 (en) | 2016-06-20 | 2018-08-07 | Gopro, Inc. | Associating audio with three-dimensional objects in videos |
US10083718B1 (en) | 2017-03-24 | 2018-09-25 | Gopro, Inc. | Systems and methods for editing videos based on motion |
WO2018187167A1 (en) * | 2017-04-05 | 2018-10-11 | Ring Inc. | Triggering actions based on shared video footage from audio/video recording and communication devices |
US10102593B2 (en) | 2016-06-10 | 2018-10-16 | Understory, LLC | Data processing system for managing activities linked to multimedia content when the multimedia content is changed |
US10109319B2 (en) | 2016-01-08 | 2018-10-23 | Gopro, Inc. | Digital media editing |
US10127943B1 (en) | 2017-03-02 | 2018-11-13 | Gopro, Inc. | Systems and methods for modifying videos based on music |
US10152757B2 (en) | 2016-06-10 | 2018-12-11 | Understory, LLC | Data processing system for managing activities linked to multimedia content |
US10185895B1 (en) | 2017-03-23 | 2019-01-22 | Gopro, Inc. | Systems and methods for classifying activities captured within images |
US10187690B1 (en) | 2017-04-24 | 2019-01-22 | Gopro, Inc. | Systems and methods to detect and correlate user responses to media content |
US10185891B1 (en) | 2016-07-08 | 2019-01-22 | Gopro, Inc. | Systems and methods for compact convolutional neural networks |
US10186012B2 (en) | 2015-05-20 | 2019-01-22 | Gopro, Inc. | Virtual lens simulation for video and photo cropping |
US10204273B2 (en) | 2015-10-20 | 2019-02-12 | Gopro, Inc. | System and method of providing recommendations of moments of interest within video clips post capture |
US10250894B1 (en) | 2016-06-15 | 2019-04-02 | Gopro, Inc. | Systems and methods for providing transcoded portions of a video |
US10257058B1 (en) | 2018-04-27 | 2019-04-09 | Banjo, Inc. | Ingesting streaming signals |
US10255252B2 (en) | 2013-09-16 | 2019-04-09 | Arria Data2Text Limited | Method and apparatus for interactive reports |
US20190108419A1 (en) * | 2017-10-09 | 2019-04-11 | Box, Inc. | Combining outputs of data processing services in a cloud-based collaboration platform |
US10261846B1 (en) | 2018-02-09 | 2019-04-16 | Banjo, Inc. | Storing and verifying the integrity of event related data |
US10262639B1 (en) | 2016-11-08 | 2019-04-16 | Gopro, Inc. | Systems and methods for detecting musical features in audio content |
US10268898B1 (en) | 2016-09-21 | 2019-04-23 | Gopro, Inc. | Systems and methods for determining a sample frame order for analyzing a video via segments |
US10268897B2 (en) | 2017-03-24 | 2019-04-23 | International Business Machines Corporation | Determining most representative still image of a video for specific user |
US20190130192A1 (en) * | 2017-10-31 | 2019-05-02 | Google Llc | Systems and Methods for Generating a Summary Storyboard from a Plurality of Image Frames |
US10282422B2 (en) | 2013-09-16 | 2019-05-07 | Arria Data2Text Limited | Method, apparatus, and computer program product for user-directed reporting |
US10282632B1 (en) | 2016-09-21 | 2019-05-07 | Gopro, Inc. | Systems and methods for determining a sample frame order for analyzing a video |
US10284809B1 (en) | 2016-11-07 | 2019-05-07 | Gopro, Inc. | Systems and methods for intelligently synchronizing events in visual content with musical features in audio content |
US20190138812A1 (en) * | 2017-08-28 | 2019-05-09 | Nec Laboratories America, Inc. | Mobile device with activity recognition |
CN109791558A (en) * | 2016-09-23 | 2019-05-21 | 微软技术许可有限责任公司 | Fine motion figure automatically selects |
US10304232B2 (en) * | 2017-04-06 | 2019-05-28 | Microsoft Technology Licensing, Llc | Image animation in a presentation document |
US20190163436A1 (en) * | 2017-11-28 | 2019-05-30 | Lg Electronics Inc. | Electronic device and method for controlling the same |
US10313865B1 (en) | 2018-04-27 | 2019-06-04 | Banjo, Inc. | Validating and supplementing emergency call information |
US10311129B1 (en) | 2018-02-09 | 2019-06-04 | Banjo, Inc. | Detecting events from features derived from multiple ingested signals |
US10313413B2 (en) * | 2017-08-28 | 2019-06-04 | Banjo, Inc. | Detecting events from ingested communication signals |
US20190182565A1 (en) * | 2017-12-13 | 2019-06-13 | Playable Pty Ltd | System and Method for Algorithmic Editing of Video Content |
US10327116B1 (en) | 2018-04-27 | 2019-06-18 | Banjo, Inc. | Deriving signal location from signal content |
US10324948B1 (en) | 2018-04-27 | 2019-06-18 | Banjo, Inc. | Normalizing ingested signals |
US10324935B1 (en) | 2018-02-09 | 2019-06-18 | Banjo, Inc. | Presenting event intelligence and trends tailored per geographic area granularity |
US10341712B2 (en) | 2016-04-07 | 2019-07-02 | Gopro, Inc. | Systems and methods for audio track selection in video editing |
US10339443B1 (en) | 2017-02-24 | 2019-07-02 | Gopro, Inc. | Systems and methods for processing convolutional neural network operations using textures |
US20190207946A1 (en) * | 2016-12-20 | 2019-07-04 | Google Inc. | Conditional provision of access by interactive assistant modules |
US10353934B1 (en) | 2018-04-27 | 2019-07-16 | Banjo, Inc. | Detecting an event from signals in a listening area |
US10360942B1 (en) * | 2017-07-13 | 2019-07-23 | Gopro, Inc. | Systems and methods for changing storage of videos |
US10360945B2 (en) | 2011-08-09 | 2019-07-23 | Gopro, Inc. | User interface for editing digital media objects |
US10382372B1 (en) * | 2017-04-27 | 2019-08-13 | Snap Inc. | Processing media content based on original context |
US10395122B1 (en) | 2017-05-12 | 2019-08-27 | Gopro, Inc. | Systems and methods for identifying moments in videos |
US10395119B1 (en) | 2016-08-10 | 2019-08-27 | Gopro, Inc. | Systems and methods for determining activities performed during video capture |
US10402698B1 (en) | 2017-07-10 | 2019-09-03 | Gopro, Inc. | Systems and methods for identifying interesting moments within videos |
US10402938B1 (en) | 2016-03-31 | 2019-09-03 | Gopro, Inc. | Systems and methods for modifying image distortion (curvature) for viewing distance in post capture |
US10402656B1 (en) | 2017-07-13 | 2019-09-03 | Gopro, Inc. | Systems and methods for accelerating video analysis |
US10469909B1 (en) | 2016-07-14 | 2019-11-05 | Gopro, Inc. | Systems and methods for providing access to still images derived from a video |
US10467347B1 (en) | 2016-10-31 | 2019-11-05 | Arria Data2Text Limited | Method and apparatus for natural language document orchestrator |
US20190378351A1 (en) * | 2018-06-11 | 2019-12-12 | International Business Machines Corporation | Cognitive learning for vehicle sensor monitoring and problem detection |
US10534966B1 (en) | 2017-02-02 | 2020-01-14 | Gopro, Inc. | Systems and methods for identifying activities and/or events represented in a video |
US10582343B1 (en) | 2019-07-29 | 2020-03-03 | Banjo, Inc. | Validating and supplementing emergency call information |
US10581945B2 (en) | 2017-08-28 | 2020-03-03 | Banjo, Inc. | Detecting an event from signal data |
US10595098B2 (en) * | 2018-01-09 | 2020-03-17 | Nbcuniversal Media, Llc | Derivative media content systems and methods |
US10614114B1 (en) | 2017-07-10 | 2020-04-07 | Gopro, Inc. | Systems and methods for creating compilations based on hierarchical clustering |
CN111078902A (en) * | 2018-10-22 | 2020-04-28 | 三星电子株式会社 | Display device and operation method thereof |
US20200151585A1 (en) * | 2018-11-09 | 2020-05-14 | Fujitsu Limited | Information processing apparatus and rule generation method |
US10664558B2 (en) * | 2014-04-18 | 2020-05-26 | Arria Data2Text Limited | Method and apparatus for document planning |
US10671815B2 (en) | 2013-08-29 | 2020-06-02 | Arria Data2Text Limited | Text generation from correlated alerts |
US10679063B2 (en) * | 2012-04-23 | 2020-06-09 | Sri International | Recognizing salient video events through learning-based multimodal analysis of visual features and audio-based analytics |
US20200186897A1 (en) * | 2018-12-05 | 2020-06-11 | Sony Interactive Entertainment Inc. | Method and system for generating a recording of video game gameplay |
US10685060B2 (en) | 2016-02-26 | 2020-06-16 | Amazon Technologies, Inc. | Searching shared video footage from audio/video recording and communication devices |
US10685187B2 (en) | 2017-05-15 | 2020-06-16 | Google Llc | Providing access to user-controlled resources by automated assistants |
US10691749B2 (en) | 2016-06-10 | 2020-06-23 | Understory, LLC | Data processing system for managing activities linked to multimedia content |
CN111368142A (en) * | 2020-04-15 | 2020-07-03 | 华中科技大学 | Video intensive event description method based on generation countermeasure network |
US10748414B2 (en) | 2016-02-26 | 2020-08-18 | A9.Com, Inc. | Augmenting and sharing data from audio/video recording and communication devices |
US10762754B2 (en) | 2016-02-26 | 2020-09-01 | Amazon Technologies, Inc. | Sharing video footage from audio/video recording and communication devices for parcel theft deterrence |
US10762646B2 (en) | 2016-02-26 | 2020-09-01 | A9.Com, Inc. | Neighborhood alert mode for triggering multi-device recording, multi-camera locating, and multi-camera event stitching for audio/video recording and communication devices |
US10776561B2 (en) | 2013-01-15 | 2020-09-15 | Arria Data2Text Limited | Method and apparatus for generating a linguistic representation of raw input data |
US10841542B2 (en) | 2016-02-26 | 2020-11-17 | A9.Com, Inc. | Locating a person of interest using shared video footage from audio/video recording and communication devices |
ES2797825A1 (en) * | 2019-06-03 | 2020-12-03 | Blue Quality S L | System and method of generating customized videos for karting races (machine translation by Google Translate, not legally binding) |
US10904720B2 (en) | 2018-04-27 | 2021-01-26 | safeXai, Inc. | Deriving signal location information and removing private information from it |
US10917618B2 (en) | 2016-02-26 | 2021-02-09 | Amazon Technologies, Inc. | Providing status information for secondary devices with video footage from audio/video recording and communication devices |
US10917704B1 (en) * | 2019-11-12 | 2021-02-09 | Amazon Technologies, Inc. | Automated video preview generation |
US20210056173A1 (en) * | 2019-08-21 | 2021-02-25 | International Business Machines Corporation | Extracting meaning representation from text |
US20210093973A1 (en) * | 2019-10-01 | 2021-04-01 | Sony Interactive Entertainment Inc. | Apparatus and method for generating a recording |
US10970184B2 (en) | 2018-02-09 | 2021-04-06 | Banjo, Inc. | Event detection removing private information |
US10977097B2 (en) | 2018-04-13 | 2021-04-13 | Banjo, Inc. | Notifying entities of relevant events |
WO2021069989A1 (en) * | 2019-10-06 | 2021-04-15 | International Business Machines Corporation | Filtering group messages |
AU2016225820B2 (en) * | 2015-11-11 | 2021-04-15 | Adobe Inc. | Structured knowledge modeling, extraction and localization from images |
US11004471B1 (en) * | 2020-03-24 | 2021-05-11 | Facebook, Inc. | Editing portions of videos in a series of video portions |
CN112818955A (en) * | 2021-03-19 | 2021-05-18 | 北京市商汤科技开发有限公司 | Image segmentation method and device, computer equipment and storage medium |
US11025693B2 (en) | 2017-08-28 | 2021-06-01 | Banjo, Inc. | Event detection from signal data removing private information |
CN112905829A (en) * | 2021-03-25 | 2021-06-04 | 王芳 | Cross-modal artificial intelligence information processing system and retrieval method |
CN112906649A (en) * | 2018-05-10 | 2021-06-04 | 北京影谱科技股份有限公司 | Video segmentation method, device, computer device and medium |
US11042274B2 (en) * | 2013-12-04 | 2021-06-22 | Autodesk, Inc. | Extracting demonstrations from in-situ video content |
US11048397B2 (en) * | 2015-06-14 | 2021-06-29 | Google Llc | Methods and systems for presenting alert event indicators |
CN113076286A (en) * | 2021-03-09 | 2021-07-06 | 北京梧桐车联科技有限责任公司 | Method, device and equipment for acquiring multimedia file and readable storage medium |
US11057457B2 (en) * | 2014-02-21 | 2021-07-06 | Twitter, Inc. | Television key phrase detection |
US11087023B2 (en) | 2018-08-07 | 2021-08-10 | Google Llc | Threshold-based assembly of automated assistant responses |
US11107503B2 (en) | 2019-10-08 | 2021-08-31 | WeMovie Technologies | Pre-production systems for making movies, TV shows and multimedia contents |
US11120835B2 (en) * | 2016-06-24 | 2021-09-14 | Google Llc | Collage of interesting moments in a video |
US11157524B2 (en) | 2018-05-18 | 2021-10-26 | At&T Intellectual Property I, L.P. | Automated learning of anomalies in media streams with external feed labels |
US11158353B2 (en) * | 2015-08-03 | 2021-10-26 | Sony Corporation | Information processing system, information processing method, and recording medium |
US11166086B1 (en) * | 2020-10-28 | 2021-11-02 | WeMovie Technologies | Automated post-production editing for user-generated multimedia contents |
US11205103B2 (en) | 2016-12-09 | 2021-12-21 | The Research Foundation for the State University | Semisupervised autoencoder for sentiment analysis |
US11257171B2 (en) | 2016-06-10 | 2022-02-22 | Understory, LLC | Data processing system for managing activities linked to multimedia content |
US11270067B1 (en) * | 2018-12-26 | 2022-03-08 | Snap Inc. | Structured activity templates for social media content |
US11281943B2 (en) * | 2017-07-25 | 2022-03-22 | Cloudminds Robotics Co., Ltd. | Method for generating training data, image semantic segmentation method and electronic device |
US11295455B2 (en) * | 2017-11-16 | 2022-04-05 | Sony Corporation | Information processing apparatus, information processing method, and program |
US11315602B2 (en) | 2020-05-08 | 2022-04-26 | WeMovie Technologies | Fully automated post-production editing for movies, TV shows and multimedia contents |
US20220132223A1 (en) * | 2020-10-28 | 2022-04-28 | WeMovie Technologies | Automated post-production editing for user-generated multimedia contents |
US11321639B1 (en) | 2021-12-13 | 2022-05-03 | WeMovie Technologies | Automated evaluation of acting performance using cloud services |
US11330154B1 (en) | 2021-07-23 | 2022-05-10 | WeMovie Technologies | Automated coordination in multimedia content production |
US11328510B2 (en) * | 2019-03-19 | 2022-05-10 | The Boeing Company | Intelligent video analysis |
US11334752B2 (en) * | 2019-11-19 | 2022-05-17 | Netflix, Inc. | Techniques for automatically extracting compelling portions of a media content item |
US11341185B1 (en) * | 2018-06-19 | 2022-05-24 | Amazon Technologies, Inc. | Systems and methods for content-based indexing of videos at web-scale |
US11354901B2 (en) * | 2017-03-10 | 2022-06-07 | Turing Video | Activity recognition method and system |
US11386700B2 (en) * | 2020-02-28 | 2022-07-12 | Panasonic I-Pro Sensing Solutions Co., Ltd. | Face detection system |
US11393108B1 (en) | 2016-02-26 | 2022-07-19 | Amazon Technologies, Inc. | Neighborhood alert mode for triggering multi-device recording, multi-camera locating, and multi-camera event stitching for audio/video recording and communication devices |
US11438639B2 (en) * | 2020-03-03 | 2022-09-06 | Microsoft Technology Licensing, Llc | Partial-video near-duplicate detection |
US11436417B2 (en) | 2017-05-15 | 2022-09-06 | Google Llc | Providing access to user-controlled resources by automated assistants |
US11443513B2 (en) * | 2020-01-29 | 2022-09-13 | Prashanth Iyengar | Systems and methods for resource analysis, optimization, or visualization |
US11445273B1 (en) | 2021-05-11 | 2022-09-13 | CLIPr Co. | System and method for creating a video summary based on video relevancy |
US20220293133A1 (en) * | 2021-03-12 | 2022-09-15 | Snap Inc. | Automated video editing |
US20220301307A1 (en) * | 2021-03-19 | 2022-09-22 | Alibaba (China) Co., Ltd. | Video Generation Method and Apparatus, and Promotional Video Generation Method and Apparatus |
US20220319549A1 (en) * | 2021-04-06 | 2022-10-06 | Smsystems Co., Ltd | Video automatic editing method and system based on machine learning |
WO2022240409A1 (en) * | 2021-05-11 | 2022-11-17 | CLIPr Co. | System and method for crowdsourcing a video summary for creating an enhanced video summary |
US11514244B2 (en) * | 2015-11-11 | 2022-11-29 | Adobe Inc. | Structured knowledge modeling and extraction from images |
US20220382811A1 (en) * | 2021-06-01 | 2022-12-01 | Apple Inc. | Inclusive Holidays |
US11564014B2 (en) | 2020-08-27 | 2023-01-24 | WeMovie Technologies | Content structure aware multimedia streaming service for movies, TV shows and multimedia contents |
US11570525B2 (en) | 2019-08-07 | 2023-01-31 | WeMovie Technologies | Adaptive marketing in cloud-based content production |
US11589086B2 (en) * | 2019-08-21 | 2023-02-21 | Dish Network L.L.C. | Systems and methods for targeted advertisement insertion into a program content stream |
WO2023050295A1 (en) * | 2021-09-30 | 2023-04-06 | 中远海运科技股份有限公司 | Multimodal heterogeneous feature fusion-based compact video event description method |
US11641500B2 (en) * | 2018-01-11 | 2023-05-02 | Editorji Technologies Private Limited | Method and system for customized content |
US20230169770A1 (en) * | 2021-11-30 | 2023-06-01 | Kwai Inc. | Methods and device for video data analysis |
US11669743B2 (en) * | 2019-05-15 | 2023-06-06 | Huawei Technologies Co., Ltd. | Adaptive action recognizer for video |
WO2023121840A1 (en) * | 2021-12-21 | 2023-06-29 | Sri International | Video processsor capable of in-pixel processing |
US11698927B2 (en) * | 2018-05-16 | 2023-07-11 | Sony Interactive Entertainment LLC | Contextual digital media processing systems and methods |
US11736654B2 (en) | 2019-06-11 | 2023-08-22 | WeMovie Technologies | Systems and methods for producing digital multimedia contents including movies and tv shows |
WO2023160241A1 (en) * | 2022-02-28 | 2023-08-31 | 荣耀终端有限公司 | Video processing method and related device |
US11800201B2 (en) * | 2018-08-31 | 2023-10-24 | Beijing Bytedance Network Technology Co., Ltd. | Method and apparatus for outputting information |
Families Citing this family (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2510424A (en) * | 2013-02-05 | 2014-08-06 | British Broadcasting Corp | Processing audio-video (AV) metadata relating to general and individual user parameters |
WO2017070427A1 (en) * | 2015-10-23 | 2017-04-27 | Spotify Ab | Automatic prediction of acoustic attributes from an audio signal |
CN109691124B (en) | 2016-06-20 | 2021-07-27 | 皮克索洛特公司 | Method and system for automatically generating video highlights |
US11250947B2 (en) | 2017-02-24 | 2022-02-15 | General Electric Company | Providing auxiliary information regarding healthcare procedure and system performance using augmented reality |
US20190028766A1 (en) * | 2017-07-18 | 2019-01-24 | Audible Magic Corporation | Media classification for media identification and licensing |
US20190205450A1 (en) * | 2018-01-03 | 2019-07-04 | Getac Technology Corporation | Method of configuring information capturing device |
CN110475129B (en) * | 2018-03-05 | 2021-05-28 | 腾讯科技(深圳)有限公司 | Video processing method, medium, and server |
US10747500B2 (en) | 2018-04-03 | 2020-08-18 | International Business Machines Corporation | Aural delivery of environmental visual information |
US10825481B2 (en) | 2018-05-16 | 2020-11-03 | At&T Intellectual Property I, L.P. | Video curation service for personal streaming |
US10963697B2 (en) * | 2018-06-05 | 2021-03-30 | Philip Martin Meier | Systems and methods for generating composite media using distributed networks |
CN108846343B (en) * | 2018-06-05 | 2022-05-13 | 北京邮电大学 | Multi-task collaborative analysis method based on three-dimensional video |
US10679626B2 (en) * | 2018-07-24 | 2020-06-09 | Pegah AARABI | Generating interactive audio-visual representations of individuals |
JP2020035086A (en) * | 2018-08-28 | 2020-03-05 | 富士ゼロックス株式会社 | Information processing system, information processing apparatus and program |
US11282259B2 (en) | 2018-11-26 | 2022-03-22 | International Business Machines Corporation | Non-visual environment mapping |
CN109743624B (en) * | 2018-12-14 | 2021-08-17 | 深圳壹账通智能科技有限公司 | Video cutting method and device, computer equipment and storage medium |
US10860860B1 (en) * | 2019-01-03 | 2020-12-08 | Amazon Technologies, Inc. | Matching videos to titles using artificial intelligence |
CN111464881B (en) * | 2019-01-18 | 2021-08-13 | 复旦大学 | Full-convolution video description generation method based on self-optimization mechanism |
CN113748468B (en) | 2019-02-21 | 2023-03-10 | 剧院公司 | System and method for filling out postoperative report of surgical operation, computer readable medium |
US11426255B2 (en) | 2019-02-21 | 2022-08-30 | Theator inc. | Complexity analysis and cataloging of surgical footage |
US11157542B2 (en) | 2019-06-12 | 2021-10-26 | Spotify Ab | Systems, methods and computer program products for associating media content having different modalities |
US11341186B2 (en) * | 2019-06-19 | 2022-05-24 | International Business Machines Corporation | Cognitive video and audio search aggregation |
US11170271B2 (en) * | 2019-06-26 | 2021-11-09 | Dallas Limetree, LLC | Method and system for classifying content using scoring for identifying psychological factors employed by consumers to take action |
US11636117B2 (en) | 2019-06-26 | 2023-04-25 | Dallas Limetree, LLC | Content selection using psychological factor vectors |
CN110602554B (en) * | 2019-08-16 | 2021-01-29 | 华为技术有限公司 | Cover image determining method, device and equipment |
CN113132753A (en) | 2019-12-30 | 2021-07-16 | 阿里巴巴集团控股有限公司 | Data processing method and device and video cover generation method and device |
CN111259215B (en) * | 2020-02-14 | 2023-06-27 | 北京百度网讯科技有限公司 | Multi-mode-based topic classification method, device, equipment and storage medium |
US11417330B2 (en) * | 2020-02-21 | 2022-08-16 | BetterUp, Inc. | Determining conversation analysis indicators for a multiparty conversation |
US20210312949A1 (en) | 2020-04-05 | 2021-10-07 | Theator inc. | Systems and methods for intraoperative video review |
CN112101182B (en) * | 2020-09-10 | 2021-05-07 | 哈尔滨市科佳通用机电股份有限公司 | Railway wagon floor damage fault identification method based on improved SLIC method |
WO2022177894A1 (en) * | 2021-02-16 | 2022-08-25 | Tree Goat Media, Inc. | Systems and methods for transforming digital audio content |
WO2021184026A1 (en) * | 2021-04-08 | 2021-09-16 | Innopeak Technology, Inc. | Audio-visual fusion with cross-modal attention for video action recognition |
US20230298615A1 (en) * | 2022-03-18 | 2023-09-21 | Capital One Services, Llc | System and method for extracting hidden cues in interactive communications |
US11910073B1 (en) * | 2022-08-15 | 2024-02-20 | Amazon Technologies, Inc. | Automated preview generation for video entertainment content |
CN116091984B (en) * | 2023-04-12 | 2023-07-18 | 中国科学院深圳先进技术研究院 | Video object segmentation method, device, electronic equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070101266A1 (en) * | 1999-10-11 | 2007-05-03 | Electronics And Telecommunications Research Institute | Video summary description scheme and method and system of video summary description data generation for efficient overview and browsing |
US20140161354A1 (en) * | 2012-12-06 | 2014-06-12 | Nokia Corporation | Method and apparatus for semantic extraction and video remix creation |
US20150302894A1 (en) * | 2010-03-08 | 2015-10-22 | Sightera Technologies Ltd. | System and method for semi-automatic video editing |
Family Cites Families (51)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6564263B1 (en) * | 1998-12-04 | 2003-05-13 | International Business Machines Corporation | Multimedia content description framework |
US7194687B2 (en) * | 1999-09-16 | 2007-03-20 | Sharp Laboratories Of America, Inc. | Audiovisual information management system with user identification |
US8051446B1 (en) * | 1999-12-06 | 2011-11-01 | Sharp Laboratories Of America, Inc. | Method of creating a semantic video summary using information from secondary sources |
AU2003265318A1 (en) * | 2002-08-02 | 2004-02-23 | University Of Rochester | Automatic soccer video analysis and summarization |
US7421455B2 (en) * | 2006-02-27 | 2008-09-02 | Microsoft Corporation | Video search and services |
EP1999753B1 (en) * | 2006-03-30 | 2018-01-10 | British Telecommunications public limited company | Video abstraction |
US7869658B2 (en) * | 2006-10-06 | 2011-01-11 | Eastman Kodak Company | Representative image selection based on hierarchical clustering |
US20080306995A1 (en) * | 2007-06-05 | 2008-12-11 | Newell Catherine D | Automatic story creation using semantic classifiers for images and associated meta data |
US8934717B2 (en) * | 2007-06-05 | 2015-01-13 | Intellectual Ventures Fund 83 Llc | Automatic story creation using semantic classifiers for digital assets and associated metadata |
US8411935B2 (en) * | 2007-07-11 | 2013-04-02 | Behavioral Recognition Systems, Inc. | Semantic representation module of a machine-learning engine in a video analysis system |
US9111146B2 (en) * | 2008-02-15 | 2015-08-18 | Tivo Inc. | Systems and methods for semantically classifying and normalizing shots in video |
US10867133B2 (en) * | 2008-05-01 | 2020-12-15 | Primal Fusion Inc. | System and method for using a knowledge representation to provide information based on environmental inputs |
US8339456B2 (en) * | 2008-05-15 | 2012-12-25 | Sri International | Apparatus for intelligent and autonomous video content generation and streaming |
US8634638B2 (en) * | 2008-06-20 | 2014-01-21 | Sri International | Real-time action detection and classification |
US8364660B2 (en) * | 2008-07-11 | 2013-01-29 | Videosurf, Inc. | Apparatus and software system for and method of performing a visual-relevance-rank subsequent search |
WO2010006334A1 (en) * | 2008-07-11 | 2010-01-14 | Videosurf, Inc. | Apparatus and software system for and method of performing a visual-relevance-rank subsequent search |
US8611677B2 (en) * | 2008-11-19 | 2013-12-17 | Intellectual Ventures Fund 83 Llc | Method for event-based semantic classification |
US8218859B2 (en) * | 2008-12-05 | 2012-07-10 | Microsoft Corporation | Transductive multi-label learning for video concept detection |
US20100272187A1 (en) * | 2009-04-24 | 2010-10-28 | Delta Vidyo, Inc. | Efficient video skimmer |
US8068677B2 (en) * | 2009-08-25 | 2011-11-29 | Satyam Computer Services Limited | System and method for hierarchical image processing |
US8285060B2 (en) * | 2009-08-31 | 2012-10-09 | Behavioral Recognition Systems, Inc. | Detecting anomalous trajectories in a video surveillance system |
US9111287B2 (en) * | 2009-09-30 | 2015-08-18 | Microsoft Technology Licensing, Llc | Video content-aware advertisement placement |
US8121618B2 (en) * | 2009-10-28 | 2012-02-21 | Digimarc Corporation | Intuitive computing methods and systems |
US9179102B2 (en) * | 2009-12-29 | 2015-11-03 | Kodak Alaris Inc. | Group display system |
US8874584B1 (en) * | 2010-02-24 | 2014-10-28 | Hrl Laboratories, Llc | Hierarchical video search and recognition system |
JP2011217197A (en) * | 2010-03-31 | 2011-10-27 | Sony Corp | Electronic apparatus, reproduction control system, reproduction control method, and program thereof |
JP2011217209A (en) * | 2010-03-31 | 2011-10-27 | Sony Corp | Electronic apparatus, content recommendation method, and program |
US9710760B2 (en) * | 2010-06-29 | 2017-07-18 | International Business Machines Corporation | Multi-facet classification scheme for cataloging of information artifacts |
US8532390B2 (en) * | 2010-07-28 | 2013-09-10 | International Business Machines Corporation | Semantic parsing of objects in video |
US9171578B2 (en) * | 2010-08-06 | 2015-10-27 | Futurewei Technologies, Inc. | Video skimming methods and systems |
US9317598B2 (en) * | 2010-09-08 | 2016-04-19 | Nokia Technologies Oy | Method and apparatus for generating a compilation of media items |
US8874538B2 (en) * | 2010-09-08 | 2014-10-28 | Nokia Corporation | Method and apparatus for video synthesis |
WO2012064976A1 (en) * | 2010-11-11 | 2012-05-18 | Google Inc. | Learning tags for video annotation using latent subtags |
EP2641401B1 (en) * | 2010-11-15 | 2017-04-05 | Huawei Technologies Co., Ltd. | Method and system for video summarization |
US8930959B2 (en) * | 2011-05-13 | 2015-01-06 | Orions Digital Systems, Inc. | Generating event definitions based on spatial and relational relationships |
US9298816B2 (en) * | 2011-07-22 | 2016-03-29 | Open Text S.A. | Methods, systems, and computer-readable media for semantically enriching content and for semantic navigation |
US10068024B2 (en) * | 2012-02-01 | 2018-09-04 | Sri International | Method and apparatus for correlating and viewing disparate data |
US9129158B1 (en) * | 2012-03-05 | 2015-09-08 | Hrl Laboratories, Llc | Method and system for embedding visual intelligence |
US20130335635A1 (en) * | 2012-03-22 | 2013-12-19 | Bernard Ghanem | Video Analysis Based on Sparse Registration and Multiple Domain Tracking |
US9406020B2 (en) * | 2012-04-02 | 2016-08-02 | Taiger Spain Sl | System and method for natural language querying |
US9244924B2 (en) * | 2012-04-23 | 2016-01-26 | Sri International | Classification, search, and retrieval of complex video events |
US20140328570A1 (en) * | 2013-01-09 | 2014-11-06 | Sri International | Identifying, describing, and sharing salient events in images and videos |
US10691743B2 (en) * | 2014-08-05 | 2020-06-23 | Sri International | Multi-dimensional realization of visual content of an image collection |
EP2763077B1 (en) * | 2013-01-30 | 2023-11-15 | Nokia Technologies Oy | Method and apparatus for sensor aided extraction of spatio-temporal features |
US9141866B2 (en) * | 2013-01-30 | 2015-09-22 | International Business Machines Corporation | Summarizing salient events in unmanned aerial videos |
US10642891B2 (en) * | 2013-04-12 | 2020-05-05 | Avigilon Fortress Corporation | Graph matching by sub-graph grouping and indexing |
CA2924065C (en) * | 2013-09-13 | 2018-05-15 | Arris Enterprises, Inc. | Content based video content segmentation |
US9384400B2 (en) * | 2014-07-08 | 2016-07-05 | Nokia Technologies Oy | Method and apparatus for identifying salient events by analyzing salient video segments identified by sensor information |
US9685194B2 (en) * | 2014-07-23 | 2017-06-20 | Gopro, Inc. | Voice-based video tagging |
US9646227B2 (en) * | 2014-07-29 | 2017-05-09 | Microsoft Technology Licensing, Llc | Computerized machine learning of interesting video sections |
US9740963B2 (en) * | 2014-08-05 | 2017-08-22 | Sri International | Multi-dimensional realization of visual content of an image collection |
- 2014
  - 2014-07-15 US US14/332,071 patent/US20140328570A1/en not_active Abandoned
- 2015
  - 2015-09-04 US US14/846,318 patent/US10679063B2/en active Active
Cited By (297)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10360945B2 (en) | 2011-08-09 | 2019-07-23 | Gopro, Inc. | User interface for editing digital media objects |
US10679063B2 (en) * | 2012-04-23 | 2020-06-09 | Sri International | Recognizing salient video events through learning-based multimodal analysis of visual features and audio-based analytics |
US10776561B2 (en) | 2013-01-15 | 2020-09-15 | Arria Data2Text Limited | Method and apparatus for generating a linguistic representation of raw input data |
US20140245153A1 (en) * | 2013-02-28 | 2014-08-28 | Nk Works Co., Ltd. | Image processing apparatus, computer-readable medium storing an image processing program, and image processing method |
US10671815B2 (en) | 2013-08-29 | 2020-06-02 | Arria Data2Text Limited | Text generation from correlated alerts |
US10860812B2 (en) | 2013-09-16 | 2020-12-08 | Arria Data2Text Limited | Method, apparatus, and computer program product for user-directed reporting |
US10255252B2 (en) | 2013-09-16 | 2019-04-09 | Arria Data2Text Limited | Method and apparatus for interactive reports |
US10282422B2 (en) | 2013-09-16 | 2019-05-07 | Arria Data2Text Limited | Method, apparatus, and computer program product for user-directed reporting |
US11144709B2 (en) * | 2013-09-16 | 2021-10-12 | Arria Data2Text Limited | Method and apparatus for interactive reports |
US11042274B2 (en) * | 2013-12-04 | 2021-06-22 | Autodesk, Inc. | Extracting demonstrations from in-situ video content |
US20180173319A1 (en) * | 2013-12-31 | 2018-06-21 | Google Llc | Systems and methods for gaze-based media selection and editing |
US10915180B2 (en) | 2013-12-31 | 2021-02-09 | Google Llc | Systems and methods for monitoring a user's eye |
US20150205771A1 (en) * | 2014-01-22 | 2015-07-23 | Panasonic Intellectual Property Corporation Of America | Terminal apparatus, server apparatus, method for supporting posting of information, and non-transitory recording medium storing computer program |
US9781190B2 (en) * | 2014-01-22 | 2017-10-03 | Panasonic Intellectual Property Corporation Of America | Apparatus and method for supporting selection of an image to be posted on a website |
US20160199742A1 (en) * | 2014-01-31 | 2016-07-14 | Google Inc. | Automatic generation of a game replay video |
US11057457B2 (en) * | 2014-02-21 | 2021-07-06 | Twitter, Inc. | Television key phrase detection |
US9760768B2 (en) | 2014-03-04 | 2017-09-12 | Gopro, Inc. | Generation of video from spherical content using edit maps |
US9754159B2 (en) | 2014-03-04 | 2017-09-05 | Gopro, Inc. | Automatic generation of video from spherical content using location-based metadata |
US10084961B2 (en) | 2014-03-04 | 2018-09-25 | Gopro, Inc. | Automatic generation of video from spherical content using audio/visual analysis |
US10664558B2 (en) * | 2014-04-18 | 2020-05-26 | Arria Data2Text Limited | Method and apparatus for document planning |
US20150373281A1 (en) * | 2014-06-19 | 2015-12-24 | BrightSky Labs, Inc. | Systems and methods for identifying media portions of interest |
US9626103B2 (en) * | 2014-06-19 | 2017-04-18 | BrightSky Labs, Inc. | Systems and methods for identifying media portions of interest |
US20160012597A1 (en) * | 2014-07-09 | 2016-01-14 | Nant Holdings Ip, Llc | Feature trackability ranking, systems and methods |
US10217227B2 (en) * | 2014-07-09 | 2019-02-26 | Nant Holdings Ip, Llc | Feature trackability ranking, systems and methods |
US9984473B2 (en) * | 2014-07-09 | 2018-05-29 | Nant Holdings Ip, Llc | Feature trackability ranking, systems and methods |
US10540772B2 (en) | 2014-07-09 | 2020-01-21 | Nant Holdings Ip, Llc | Feature trackability ranking, systems and methods |
US11776579B2 (en) | 2014-07-23 | 2023-10-03 | Gopro, Inc. | Scene and activity identification in video summary generation |
US9685194B2 (en) | 2014-07-23 | 2017-06-20 | Gopro, Inc. | Voice-based video tagging |
US11069380B2 (en) | 2014-07-23 | 2021-07-20 | Gopro, Inc. | Scene and activity identification in video summary generation |
US9792502B2 (en) * | 2014-07-23 | 2017-10-17 | Gopro, Inc. | Generating video summaries for a video using video summary templates |
US10339975B2 (en) | 2014-07-23 | 2019-07-02 | Gopro, Inc. | Voice-based video tagging |
US9984293B2 (en) | 2014-07-23 | 2018-05-29 | Gopro, Inc. | Video scene classification by activity |
US20160029105A1 (en) * | 2014-07-23 | 2016-01-28 | Gopro, Inc. | Generating video summaries for a video using video summary templates |
US10074013B2 (en) | 2014-07-23 | 2018-09-11 | Gopro, Inc. | Scene and activity identification in video summary generation |
US10776629B2 (en) | 2014-07-23 | 2020-09-15 | Gopro, Inc. | Scene and activity identification in video summary generation |
US20160026874A1 (en) * | 2014-07-23 | 2016-01-28 | Gopro, Inc. | Activity identification in video |
US9473803B2 (en) * | 2014-08-08 | 2016-10-18 | TCL Research America Inc. | Personalized channel recommendation method and system |
US9646652B2 (en) | 2014-08-20 | 2017-05-09 | Gopro, Inc. | Scene and activity identification in video summary generation based on motion detected in a video |
US10643663B2 (en) | 2014-08-20 | 2020-05-05 | Gopro, Inc. | Scene and activity identification in video summary generation based on motion detected in a video |
US10192585B1 (en) | 2014-08-20 | 2019-01-29 | Gopro, Inc. | Scene and activity identification in video summary generation based on motion detected in a video |
US20160360079A1 (en) * | 2014-11-18 | 2016-12-08 | Sony Corporation | Generation apparatus and method for evaluation information, electronic device and server |
US9888161B2 (en) * | 2014-11-18 | 2018-02-06 | Sony Mobile Communications Inc. | Generation apparatus and method for evaluation information, electronic device and server |
US10096341B2 (en) | 2015-01-05 | 2018-10-09 | Gopro, Inc. | Media identifier generation for camera-captured media |
US10559324B2 (en) | 2015-01-05 | 2020-02-11 | Gopro, Inc. | Media identifier generation for camera-captured media |
US9734870B2 (en) | 2015-01-05 | 2017-08-15 | Gopro, Inc. | Media identifier generation for camera-captured media |
EP3051463A1 (en) * | 2015-01-27 | 2016-08-03 | Samsung Electronics Co., Ltd. | Image processing method and electronic device for supporting the same |
US9886454B2 (en) * | 2015-01-27 | 2018-02-06 | Samsung Electronics Co., Ltd. | Image processing, method and electronic device for generating a highlight content |
US20160217348A1 (en) * | 2015-01-27 | 2016-07-28 | Samsung Electronics Co., Ltd. | Image Processing Method and Electronic Device for Supporting the Same |
US9966108B1 (en) | 2015-01-29 | 2018-05-08 | Gopro, Inc. | Variable playback speed template for video editing application |
US9679605B2 (en) | 2015-01-29 | 2017-06-13 | Gopro, Inc. | Variable playback speed template for video editing application |
WO2016154158A1 (en) * | 2015-03-25 | 2016-09-29 | Microsoft Technology Licensing, Llc | Machine learning to recognize key moments in audio and video calls |
WO2016172379A1 (en) * | 2015-04-21 | 2016-10-27 | Stinkdigital, Ltd | Video delivery platform |
US9659218B1 (en) * | 2015-04-29 | 2017-05-23 | Google Inc. | Predicting video start times for maximizing user engagement |
US10390067B1 (en) | 2015-04-29 | 2019-08-20 | Google Llc | Predicting video start times for maximizing user engagement |
US10535115B2 (en) | 2015-05-20 | 2020-01-14 | Gopro, Inc. | Virtual lens simulation for video and photo cropping |
US10817977B2 (en) | 2015-05-20 | 2020-10-27 | Gopro, Inc. | Virtual lens simulation for video and photo cropping |
US10529051B2 (en) | 2015-05-20 | 2020-01-07 | Gopro, Inc. | Virtual lens simulation for video and photo cropping |
US11688034B2 (en) | 2015-05-20 | 2023-06-27 | Gopro, Inc. | Virtual lens simulation for video and photo cropping |
US11164282B2 (en) | 2015-05-20 | 2021-11-02 | Gopro, Inc. | Virtual lens simulation for video and photo cropping |
US10529052B2 (en) | 2015-05-20 | 2020-01-07 | Gopro, Inc. | Virtual lens simulation for video and photo cropping |
US10679323B2 (en) | 2015-05-20 | 2020-06-09 | Gopro, Inc. | Virtual lens simulation for video and photo cropping |
US10395338B2 (en) | 2015-05-20 | 2019-08-27 | Gopro, Inc. | Virtual lens simulation for video and photo cropping |
US10186012B2 (en) | 2015-05-20 | 2019-01-22 | Gopro, Inc. | Virtual lens simulation for video and photo cropping |
US11599259B2 (en) | 2015-06-14 | 2023-03-07 | Google Llc | Methods and systems for presenting alert event indicators |
US11048397B2 (en) * | 2015-06-14 | 2021-06-29 | Google Llc | Methods and systems for presenting alert event indicators |
US11158353B2 (en) * | 2015-08-03 | 2021-10-26 | Sony Corporation | Information processing system, information processing method, and recording medium |
US9894393B2 (en) | 2015-08-31 | 2018-02-13 | Gopro, Inc. | Video encoding for reduced streaming latency |
US10186298B1 (en) | 2015-10-20 | 2019-01-22 | Gopro, Inc. | System and method of generating video from video clips based on moments of interest within the video clips |
US9721611B2 (en) | 2015-10-20 | 2017-08-01 | Gopro, Inc. | System and method of generating video from video clips based on moments of interest within the video clips |
US10748577B2 (en) | 2015-10-20 | 2020-08-18 | Gopro, Inc. | System and method of generating video from video clips based on moments of interest within the video clips |
US10789478B2 (en) | 2015-10-20 | 2020-09-29 | Gopro, Inc. | System and method of providing recommendations of moments of interest within video clips post capture |
US11468914B2 (en) | 2015-10-20 | 2022-10-11 | Gopro, Inc. | System and method of generating video from video clips based on moments of interest within the video clips |
US10204273B2 (en) | 2015-10-20 | 2019-02-12 | Gopro, Inc. | System and method of providing recommendations of moments of interest within video clips post capture |
US10460033B2 (en) * | 2015-11-11 | 2019-10-29 | Adobe Inc. | Structured knowledge modeling, extraction and localization from images |
US11514244B2 (en) * | 2015-11-11 | 2022-11-29 | Adobe Inc. | Structured knowledge modeling and extraction from images |
CN106682060A (en) * | 2015-11-11 | 2017-05-17 | 奥多比公司 | Structured Knowledge Modeling, Extraction and Localization from Images |
GB2544379A (en) * | 2015-11-11 | 2017-05-17 | Adobe Systems Inc | Structured knowledge modeling, extraction and localization from images |
US20170132498A1 (en) * | 2015-11-11 | 2017-05-11 | Adobe Systems Incorporated | Structured Knowledge Modeling, Extraction and Localization from Images |
GB2544379B (en) * | 2015-11-11 | 2019-09-11 | Adobe Inc | Structured knowledge modeling, extraction and localization from images |
AU2016225820B2 (en) * | 2015-11-11 | 2021-04-15 | Adobe Inc. | Structured knowledge modeling, extraction and localization from images |
US11238520B2 (en) | 2016-01-04 | 2022-02-01 | Gopro, Inc. | Systems and methods for generating recommendations of post-capture users to edit digital media content |
US10423941B1 (en) | 2016-01-04 | 2019-09-24 | Gopro, Inc. | Systems and methods for generating recommendations of post-capture users to edit digital media content |
US10095696B1 (en) | 2016-01-04 | 2018-10-09 | Gopro, Inc. | Systems and methods for generating recommendations of post-capture users to edit digital media content field |
US9761278B1 (en) | 2016-01-04 | 2017-09-12 | Gopro, Inc. | Systems and methods for generating recommendations of post-capture users to edit digital media content |
US10109319B2 (en) | 2016-01-08 | 2018-10-23 | Gopro, Inc. | Digital media editing |
US10607651B2 (en) | 2016-01-08 | 2020-03-31 | Gopro, Inc. | Digital media editing |
US11049522B2 (en) | 2016-01-08 | 2021-06-29 | Gopro, Inc. | Digital media editing |
US10424102B2 (en) | 2016-02-04 | 2019-09-24 | Gopro, Inc. | Digital media editing |
US10565769B2 (en) | 2016-02-04 | 2020-02-18 | Gopro, Inc. | Systems and methods for adding visual elements to video content |
US10769834B2 (en) | 2016-02-04 | 2020-09-08 | Gopro, Inc. | Digital media editing |
US11238635B2 (en) | 2016-02-04 | 2022-02-01 | Gopro, Inc. | Digital media editing |
US10083537B1 (en) | 2016-02-04 | 2018-09-25 | Gopro, Inc. | Systems and methods for adding a moving visual element to a video |
US9812175B2 (en) | 2016-02-04 | 2017-11-07 | Gopro, Inc. | Systems and methods for annotating a video |
US20170243065A1 (en) * | 2016-02-19 | 2017-08-24 | Samsung Electronics Co., Ltd. | Electronic device and video recording method thereof |
US10748414B2 (en) | 2016-02-26 | 2020-08-18 | A9.Com, Inc. | Augmenting and sharing data from audio/video recording and communication devices |
US10979636B2 (en) | 2016-02-26 | 2021-04-13 | Amazon Technologies, Inc. | Triggering actions based on shared video footage from audio/video recording and communication devices |
US10917618B2 (en) | 2016-02-26 | 2021-02-09 | Amazon Technologies, Inc. | Providing status information for secondary devices with video footage from audio/video recording and communication devices |
US11393108B1 (en) | 2016-02-26 | 2022-07-19 | Amazon Technologies, Inc. | Neighborhood alert mode for triggering multi-device recording, multi-camera locating, and multi-camera event stitching for audio/video recording and communication devices |
US11335172B1 (en) | 2016-02-26 | 2022-05-17 | Amazon Technologies, Inc. | Sharing video footage from audio/video recording and communication devices for parcel theft deterrence |
US10796440B2 (en) | 2016-02-26 | 2020-10-06 | Amazon Technologies, Inc. | Sharing video footage from audio/video recording and communication devices |
US11240431B1 (en) | 2016-02-26 | 2022-02-01 | Amazon Technologies, Inc. | Sharing video footage from audio/video recording and communication devices |
US10841542B2 (en) | 2016-02-26 | 2020-11-17 | A9.Com, Inc. | Locating a person of interest using shared video footage from audio/video recording and communication devices |
US10685060B2 (en) | 2016-02-26 | 2020-06-16 | Amazon Technologies, Inc. | Searching shared video footage from audio/video recording and communication devices |
US10762754B2 (en) | 2016-02-26 | 2020-09-01 | Amazon Technologies, Inc. | Sharing video footage from audio/video recording and communication devices for parcel theft deterrence |
US10762646B2 (en) | 2016-02-26 | 2020-09-01 | A9.Com, Inc. | Neighborhood alert mode for triggering multi-device recording, multi-camera locating, and multi-camera event stitching for audio/video recording and communication devices |
US11158067B1 (en) | 2016-02-26 | 2021-10-26 | Amazon Technologies, Inc. | Neighborhood alert mode for triggering multi-device recording, multi-camera locating, and multi-camera event stitching for audio/video recording and communication devices |
US11399157B2 (en) | 2016-02-26 | 2022-07-26 | Amazon Technologies, Inc. | Augmenting and sharing data from audio/video recording and communication devices |
US9972066B1 (en) | 2016-03-16 | 2018-05-15 | Gopro, Inc. | Systems and methods for providing variable image projection for spherical visual content |
US10740869B2 (en) | 2016-03-16 | 2020-08-11 | Gopro, Inc. | Systems and methods for providing variable image projection for spherical visual content |
US10817976B2 (en) | 2016-03-31 | 2020-10-27 | Gopro, Inc. | Systems and methods for modifying image distortion (curvature) for viewing distance in post capture |
US10402938B1 (en) | 2016-03-31 | 2019-09-03 | Gopro, Inc. | Systems and methods for modifying image distortion (curvature) for viewing distance in post capture |
US11398008B2 (en) | 2016-03-31 | 2022-07-26 | Gopro, Inc. | Systems and methods for modifying image distortion (curvature) for viewing distance in post capture |
US9794632B1 (en) | 2016-04-07 | 2017-10-17 | Gopro, Inc. | Systems and methods for synchronization based on audio track changes in video editing |
US10341712B2 (en) | 2016-04-07 | 2019-07-02 | Gopro, Inc. | Systems and methods for audio track selection in video editing |
US9838731B1 (en) | 2016-04-07 | 2017-12-05 | Gopro, Inc. | Systems and methods for audio track selection in video editing with audio mixing option |
US9858340B1 (en) | 2016-04-11 | 2018-01-02 | Digital Reasoning Systems, Inc. | Systems and methods for queryable graph representations of videos |
US10108709B1 (en) | 2016-04-11 | 2018-10-23 | Digital Reasoning Systems, Inc. | Systems and methods for queryable graph representations of videos |
US9620173B1 (en) * | 2016-04-15 | 2017-04-11 | Newblue Inc. | Automated intelligent visualization of data through text and graphics |
US9830948B1 (en) * | 2016-04-15 | 2017-11-28 | Newblue Inc. | Automated intelligent visualization of data through text and graphics |
US11257171B2 (en) | 2016-06-10 | 2022-02-22 | Understory, LLC | Data processing system for managing activities linked to multimedia content |
US10152758B2 (en) | 2016-06-10 | 2018-12-11 | Understory, LLC | Data processing system for managing activities linked to multimedia content |
US10691749B2 (en) | 2016-06-10 | 2020-06-23 | Understory, LLC | Data processing system for managing activities linked to multimedia content |
US10102593B2 (en) | 2016-06-10 | 2018-10-16 | Understory, LLC | Data processing system for managing activities linked to multimedia content when the multimedia content is changed |
US10402918B2 (en) | 2016-06-10 | 2019-09-03 | Understory, LLC | Data processing system for managing activities linked to multimedia content |
US10157431B2 (en) | 2016-06-10 | 2018-12-18 | Understory, LLC | Data processing system for managing activities linked to multimedia content |
US10152757B2 (en) | 2016-06-10 | 2018-12-11 | Understory, LLC | Data processing system for managing activities linked to multimedia content |
US11645725B2 (en) | 2016-06-10 | 2023-05-09 | Rali Solutions, Llc | Data processing system for managing activities linked to multimedia content |
US11470335B2 (en) | 2016-06-15 | 2022-10-11 | Gopro, Inc. | Systems and methods for providing transcoded portions of a video |
US10250894B1 (en) | 2016-06-15 | 2019-04-02 | Gopro, Inc. | Systems and methods for providing transcoded portions of a video |
US9998769B1 (en) | 2016-06-15 | 2018-06-12 | Gopro, Inc. | Systems and methods for transcoding media files |
US9922682B1 (en) | 2016-06-15 | 2018-03-20 | Gopro, Inc. | Systems and methods for organizing video files |
US10645407B2 (en) | 2016-06-15 | 2020-05-05 | Gopro, Inc. | Systems and methods for providing transcoded portions of a video |
US10045120B2 (en) | 2016-06-20 | 2018-08-07 | Gopro, Inc. | Associating audio with three-dimensional objects in videos |
US11120835B2 (en) * | 2016-06-24 | 2021-09-14 | Google Llc | Collage of interesting moments in a video |
US10185891B1 (en) | 2016-07-08 | 2019-01-22 | Gopro, Inc. | Systems and methods for compact convolutional neural networks |
US10469909B1 (en) | 2016-07-14 | 2019-11-05 | Gopro, Inc. | Systems and methods for providing access to still images derived from a video |
US10812861B2 (en) | 2016-07-14 | 2020-10-20 | Gopro, Inc. | Systems and methods for providing access to still images derived from a video |
US11057681B2 (en) | 2016-07-14 | 2021-07-06 | Gopro, Inc. | Systems and methods for providing access to still images derived from a video |
US10395119B1 (en) | 2016-08-10 | 2019-08-27 | Gopro, Inc. | Systems and methods for determining activities performed during video capture |
US9836853B1 (en) | 2016-09-06 | 2017-12-05 | Gopro, Inc. | Three-dimensional convolutional neural networks for video highlight detection |
US10282632B1 (en) | 2016-09-21 | 2019-05-07 | Gopro, Inc. | Systems and methods for determining a sample frame order for analyzing a video |
US10268898B1 (en) | 2016-09-21 | 2019-04-23 | Gopro, Inc. | Systems and methods for determining a sample frame order for analyzing a video via segments |
CN109791558A (en) * | 2016-09-23 | 2019-05-21 | Microsoft Technology Licensing, Llc | Automatic selection of cinemagraphs |
US10002641B1 (en) | 2016-10-17 | 2018-06-19 | Gopro, Inc. | Systems and methods for determining highlight segment sets |
US10643661B2 (en) | 2016-10-17 | 2020-05-05 | Gopro, Inc. | Systems and methods for determining highlight segment sets |
US10923154B2 (en) | 2016-10-17 | 2021-02-16 | Gopro, Inc. | Systems and methods for determining highlight segment sets |
US11727222B2 (en) | 2016-10-31 | 2023-08-15 | Arria Data2Text Limited | Method and apparatus for natural language document orchestrator |
US10467347B1 (en) | 2016-10-31 | 2019-11-05 | Arria Data2Text Limited | Method and apparatus for natural language document orchestrator |
US10963650B2 (en) | 2016-10-31 | 2021-03-30 | Arria Data2Text Limited | Method and apparatus for natural language document orchestrator |
US10284809B1 (en) | 2016-11-07 | 2019-05-07 | Gopro, Inc. | Systems and methods for intelligently synchronizing events in visual content with musical features in audio content |
US10560657B2 (en) | 2016-11-07 | 2020-02-11 | Gopro, Inc. | Systems and methods for intelligently synchronizing events in visual content with musical features in audio content |
US10262639B1 (en) | 2016-11-08 | 2019-04-16 | Gopro, Inc. | Systems and methods for detecting musical features in audio content |
US10546566B2 (en) | 2016-11-08 | 2020-01-28 | Gopro, Inc. | Systems and methods for detecting musical features in audio content |
US11205103B2 (en) | 2016-12-09 | 2021-12-21 | The Research Foundation for The State University of New York | Semisupervised autoencoder for sentiment analysis |
CN108228705A (en) * | 2016-12-09 | 2018-06-29 | The Boeing Company | Automatic object and activity tracking apparatus, method and medium in a live video feed |
US20190207946A1 (en) * | 2016-12-20 | 2019-07-04 | Google Inc. | Conditional provision of access by interactive assistant modules |
US20180218750A1 (en) * | 2017-02-01 | 2018-08-02 | Wipro Limited | Integrated system and a method of identifying and learning emotions in conversation utterances |
US10037767B1 (en) * | 2017-02-01 | 2018-07-31 | Wipro Limited | Integrated system and a method of identifying and learning emotions in conversation utterances |
US10534966B1 (en) | 2017-02-02 | 2020-01-14 | Gopro, Inc. | Systems and methods for identifying activities and/or events represented in a video |
US10339443B1 (en) | 2017-02-24 | 2019-07-02 | Gopro, Inc. | Systems and methods for processing convolutional neural network operations using textures |
US10776689B2 (en) | 2017-02-24 | 2020-09-15 | Gopro, Inc. | Systems and methods for processing convolutional neural network operations using textures |
US10679670B2 (en) | 2017-03-02 | 2020-06-09 | Gopro, Inc. | Systems and methods for modifying videos based on music |
US10127943B1 (en) | 2017-03-02 | 2018-11-13 | Gopro, Inc. | Systems and methods for modifying videos based on music |
US10991396B2 (en) * | 2017-03-02 | 2021-04-27 | Gopro, Inc. | Systems and methods for modifying videos based on music |
US11443771B2 (en) | 2017-03-02 | 2022-09-13 | Gopro, Inc. | Systems and methods for modifying videos based on music |
US11354901B2 (en) * | 2017-03-10 | 2022-06-07 | Turing Video | Activity recognition method and system |
US10185895B1 (en) | 2017-03-23 | 2019-01-22 | Gopro, Inc. | Systems and methods for classifying activities captured within images |
US10268897B2 (en) | 2017-03-24 | 2019-04-23 | International Business Machines Corporation | Determining most representative still image of a video for specific user |
US10789985B2 (en) | 2017-03-24 | 2020-09-29 | Gopro, Inc. | Systems and methods for editing videos based on motion |
US11282544B2 (en) | 2017-03-24 | 2022-03-22 | Gopro, Inc. | Systems and methods for editing videos based on motion |
US10083718B1 (en) | 2017-03-24 | 2018-09-25 | Gopro, Inc. | Systems and methods for editing videos based on motion |
WO2018187167A1 (en) * | 2017-04-05 | 2018-10-11 | Ring Inc. | Triggering actions based on shared video footage from audio/video recording and communication devices |
US10304232B2 (en) * | 2017-04-06 | 2019-05-28 | Microsoft Technology Licensing, Llc | Image animation in a presentation document |
US10187690B1 (en) | 2017-04-24 | 2019-01-22 | Gopro, Inc. | Systems and methods to detect and correlate user responses to media content |
US10382372B1 (en) * | 2017-04-27 | 2019-08-13 | Snap Inc. | Processing media content based on original context |
US11108715B1 (en) | 2017-04-27 | 2021-08-31 | Snap Inc. | Processing media content based on original context |
US10614315B2 (en) | 2017-05-12 | 2020-04-07 | Gopro, Inc. | Systems and methods for identifying moments in videos |
US10395122B1 (en) | 2017-05-12 | 2019-08-27 | Gopro, Inc. | Systems and methods for identifying moments in videos |
US10817726B2 (en) | 2017-05-12 | 2020-10-27 | Gopro, Inc. | Systems and methods for identifying moments in videos |
US11436417B2 (en) | 2017-05-15 | 2022-09-06 | Google Llc | Providing access to user-controlled resources by automated assistants |
US10685187B2 (en) | 2017-05-15 | 2020-06-16 | Google Llc | Providing access to user-controlled resources by automated assistants |
CN107239801A (en) * | 2017-06-28 | 2017-10-10 | Anhui University | Video attribute representation learning method and automatic video text description generation method |
US10402698B1 (en) | 2017-07-10 | 2019-09-03 | Gopro, Inc. | Systems and methods for identifying interesting moments within videos |
US10614114B1 (en) | 2017-07-10 | 2020-04-07 | Gopro, Inc. | Systems and methods for creating compilations based on hierarchical clustering |
US10402656B1 (en) | 2017-07-13 | 2019-09-03 | Gopro, Inc. | Systems and methods for accelerating video analysis |
US10360942B1 (en) * | 2017-07-13 | 2019-07-23 | Gopro, Inc. | Systems and methods for changing storage of videos |
US11281943B2 (en) * | 2017-07-25 | 2022-03-22 | Cloudminds Robotics Co., Ltd. | Method for generating training data, image semantic segmentation method and electronic device |
US10853655B2 (en) * | 2017-08-28 | 2020-12-01 | Nec Corporation | Mobile device with activity recognition |
US10581945B2 (en) | 2017-08-28 | 2020-03-03 | Banjo, Inc. | Detecting an event from signal data |
US10853656B2 (en) * | 2017-08-28 | 2020-12-01 | Nec Corporation | Surveillance system with activity recognition |
US20190138812A1 (en) * | 2017-08-28 | 2019-05-09 | Nec Laboratories America, Inc. | Mobile device with activity recognition |
US20190138855A1 (en) * | 2017-08-28 | 2019-05-09 | Nec Laboratories America, Inc. | Video representation of first-person videos for activity recognition without labels |
US11122100B2 (en) | 2017-08-28 | 2021-09-14 | Banjo, Inc. | Detecting events from ingested data |
US11025693B2 (en) | 2017-08-28 | 2021-06-01 | Banjo, Inc. | Event detection from signal data removing private information |
US10313413B2 (en) * | 2017-08-28 | 2019-06-04 | Banjo, Inc. | Detecting events from ingested communication signals |
US10867209B2 (en) * | 2017-10-09 | 2020-12-15 | Box, Inc. | Combining outputs of data processing services in a cloud-based collaboration platform |
US11379686B2 (en) | 2017-10-09 | 2022-07-05 | Box, Inc. | Deploying data processing service plug-ins into a cloud-based collaboration platform |
US20190108419A1 (en) * | 2017-10-09 | 2019-04-11 | Box, Inc. | Combining outputs of data processing services in a cloud-based collaboration platform |
US11074475B2 (en) * | 2017-10-09 | 2021-07-27 | Box, Inc. | Integrating external data processing technologies with a cloud-based collaboration platform |
US20190130192A1 (en) * | 2017-10-31 | 2019-05-02 | Google Llc | Systems and Methods for Generating a Summary Storyboard from a Plurality of Image Frames |
US10452920B2 (en) * | 2017-10-31 | 2019-10-22 | Google Llc | Systems and methods for generating a summary storyboard from a plurality of image frames |
US11295455B2 (en) * | 2017-11-16 | 2022-04-05 | Sony Corporation | Information processing apparatus, information processing method, and program |
US20190163436A1 (en) * | 2017-11-28 | 2019-05-30 | Lg Electronics Inc. | Electronic device and method for controlling the same |
US20190182565A1 (en) * | 2017-12-13 | 2019-06-13 | Playable Pty Ltd | System and Method for Algorithmic Editing of Video Content |
US11729478B2 (en) * | 2017-12-13 | 2023-08-15 | Playable Pty Ltd | System and method for algorithmic editing of video content |
US10595098B2 (en) * | 2018-01-09 | 2020-03-17 | Nbcuniversal Media, Llc | Derivative media content systems and methods |
US11641500B2 (en) * | 2018-01-11 | 2023-05-02 | Editorji Technologies Private Limited | Method and system for customized content |
US10467067B2 (en) | 2018-02-09 | 2019-11-05 | Banjo, Inc. | Storing and verifying the integrity of event related data |
US10261846B1 (en) | 2018-02-09 | 2019-04-16 | Banjo, Inc. | Storing and verifying the integrity of event related data |
US10311129B1 (en) | 2018-02-09 | 2019-06-04 | Banjo, Inc. | Detecting events from features derived from multiple ingested signals |
US10970184B2 (en) | 2018-02-09 | 2021-04-06 | Banjo, Inc. | Event detection removing private information |
US10324935B1 (en) | 2018-02-09 | 2019-06-18 | Banjo, Inc. | Presenting event intelligence and trends tailored per geographic area granularity |
US10977097B2 (en) | 2018-04-13 | 2021-04-13 | Banjo, Inc. | Notifying entities of relevant events |
US10904720B2 (en) | 2018-04-27 | 2021-01-26 | safeXai, Inc. | Deriving signal location information and removing private information from it |
US10327116B1 (en) | 2018-04-27 | 2019-06-18 | Banjo, Inc. | Deriving signal location from signal content |
US10623937B2 (en) | 2018-04-27 | 2020-04-14 | Banjo, Inc. | Validating and supplementing emergency call information |
US10353934B1 (en) | 2018-04-27 | 2019-07-16 | Banjo, Inc. | Detecting an event from signals in a listening area |
US10324948B1 (en) | 2018-04-27 | 2019-06-18 | Banjo, Inc. | Normalizing ingested signals |
US10313865B1 (en) | 2018-04-27 | 2019-06-04 | Banjo, Inc. | Validating and supplementing emergency call information |
US10257058B1 (en) | 2018-04-27 | 2019-04-09 | Banjo, Inc. | Ingesting streaming signals |
CN112906649A (en) * | 2018-05-10 | 2021-06-04 | Beijing Moviebook Technology Co., Ltd. | Video segmentation method, device, computer device and medium |
US11698927B2 (en) * | 2018-05-16 | 2023-07-11 | Sony Interactive Entertainment LLC | Contextual digital media processing systems and methods |
US11157524B2 (en) | 2018-05-18 | 2021-10-26 | At&T Intellectual Property I, L.P. | Automated learning of anomalies in media streams with external feed labels |
US20190378351A1 (en) * | 2018-06-11 | 2019-12-12 | International Business Machines Corporation | Cognitive learning for vehicle sensor monitoring and problem detection |
US10977874B2 (en) * | 2018-06-11 | 2021-04-13 | International Business Machines Corporation | Cognitive learning for vehicle sensor monitoring and problem detection |
US11341185B1 (en) * | 2018-06-19 | 2022-05-24 | Amazon Technologies, Inc. | Systems and methods for content-based indexing of videos at web-scale |
US11455418B2 (en) | 2018-08-07 | 2022-09-27 | Google Llc | Assembling and evaluating automated assistant responses for privacy concerns |
US11790114B2 (en) | 2018-08-07 | 2023-10-17 | Google Llc | Threshold-based assembly of automated assistant responses |
US11087023B2 (en) | 2018-08-07 | 2021-08-10 | Google Llc | Threshold-based assembly of automated assistant responses |
US11966494B2 (en) | 2018-08-07 | 2024-04-23 | Google Llc | Threshold-based assembly of remote automated assistant responses |
US11822695B2 (en) | 2018-08-07 | 2023-11-21 | Google Llc | Assembling and evaluating automated assistant responses for privacy concerns |
US11314890B2 (en) | 2018-08-07 | 2022-04-26 | Google Llc | Threshold-based assembly of remote automated assistant responses |
US20220083687A1 (en) | 2018-08-07 | 2022-03-17 | Google Llc | Threshold-based assembly of remote automated assistant responses |
US11800201B2 (en) * | 2018-08-31 | 2023-10-24 | Beijing Bytedance Network Technology Co., Ltd. | Method and apparatus for outputting information |
CN111078902A (en) * | 2018-10-22 | 2020-04-28 | Samsung Electronics Co., Ltd. | Display device and operation method thereof |
US20200151585A1 (en) * | 2018-11-09 | 2020-05-14 | Fujitsu Limited | Information processing apparatus and rule generation method |
US11663502B2 (en) * | 2018-11-09 | 2023-05-30 | Fujitsu Limited | Information processing apparatus and rule generation method |
US20200186897A1 (en) * | 2018-12-05 | 2020-06-11 | Sony Interactive Entertainment Inc. | Method and system for generating a recording of video game gameplay |
US11640497B2 (en) | 2018-12-26 | 2023-05-02 | Snap Inc. | Structured activity templates for social media content |
US11270067B1 (en) * | 2018-12-26 | 2022-03-08 | Snap Inc. | Structured activity templates for social media content |
US11328510B2 (en) * | 2019-03-19 | 2022-05-10 | The Boeing Company | Intelligent video analysis |
US11669743B2 (en) * | 2019-05-15 | 2023-06-06 | Huawei Technologies Co., Ltd. | Adaptive action recognizer for video |
ES2797825A1 (en) * | 2019-06-03 | 2020-12-03 | Blue Quality S L | SYSTEM AND METHOD OF GENERATING CUSTOMIZED VIDEOS FOR KARTING RACES |
US11736654B2 (en) | 2019-06-11 | 2023-08-22 | WeMovie Technologies | Systems and methods for producing digital multimedia contents including movies and tv shows |
US10582343B1 (en) | 2019-07-29 | 2020-03-03 | Banjo, Inc. | Validating and supplementing emergency call information |
US11570525B2 (en) | 2019-08-07 | 2023-01-31 | WeMovie Technologies | Adaptive marketing in cloud-based content production |
US11910036B2 (en) | 2019-08-21 | 2024-02-20 | Dish Network L.L.C. | Systems and methods for targeted advertisement insertion into a program content stream |
US11138383B2 (en) * | 2019-08-21 | 2021-10-05 | International Business Machines Corporation | Extracting meaning representation from text |
US11589086B2 (en) * | 2019-08-21 | 2023-02-21 | Dish Network L.L.C. | Systems and methods for targeted advertisement insertion into a program content stream |
US20210056173A1 (en) * | 2019-08-21 | 2021-02-25 | International Business Machines Corporation | Extracting meaning representation from text |
US20210093973A1 (en) * | 2019-10-01 | 2021-04-01 | Sony Interactive Entertainment Inc. | Apparatus and method for generating a recording |
US11783007B2 (en) * | 2019-10-01 | 2023-10-10 | Sony Interactive Entertainment Inc. | Apparatus and method for generating a recording |
US11843569B2 (en) | 2019-10-06 | 2023-12-12 | International Business Machines Corporation | Filtering group messages |
WO2021069989A1 (en) * | 2019-10-06 | 2021-04-15 | International Business Machines Corporation | Filtering group messages |
GB2604772A (en) * | 2019-10-06 | 2022-09-14 | Ibm | Filtering group messages |
US11552914B2 (en) | 2019-10-06 | 2023-01-10 | International Business Machines Corporation | Filtering group messages |
US11783860B2 (en) | 2019-10-08 | 2023-10-10 | WeMovie Technologies | Pre-production systems for making movies, tv shows and multimedia contents |
US11107503B2 (en) | 2019-10-08 | 2021-08-31 | WeMovie Technologies | Pre-production systems for making movies, TV shows and multimedia contents |
US10917704B1 (en) * | 2019-11-12 | 2021-02-09 | Amazon Technologies, Inc. | Automated video preview generation |
US11336972B1 (en) * | 2019-11-12 | 2022-05-17 | Amazon Technologies, Inc. | Automated video preview generation |
AU2020388552B2 (en) * | 2019-11-19 | 2023-08-03 | Netflix, Inc. | Techniques for automatically extracting compelling portions of a media content item |
US11334752B2 (en) * | 2019-11-19 | 2022-05-17 | Netflix, Inc. | Techniques for automatically extracting compelling portions of a media content item |
US20220277564A1 (en) * | 2019-11-19 | 2022-09-01 | Netflix, Inc. | Techniques for automatically extracting compelling portions of a media content item |
US11443513B2 (en) * | 2020-01-29 | 2022-09-13 | Prashanth Iyengar | Systems and methods for resource analysis, optimization, or visualization |
US11386700B2 (en) * | 2020-02-28 | 2022-07-12 | Panasonic I-Pro Sensing Solutions Co., Ltd. | Face detection system |
US11438639B2 (en) * | 2020-03-03 | 2022-09-06 | Microsoft Technology Licensing, Llc | Partial-video near-duplicate detection |
US11004471B1 (en) * | 2020-03-24 | 2021-05-11 | Facebook, Inc. | Editing portions of videos in a series of video portions |
CN111368142A (en) * | 2020-04-15 | 2020-07-03 | Huazhong University of Science and Technology | Dense video event description method based on a generative adversarial network |
US11315602B2 (en) | 2020-05-08 | 2022-04-26 | WeMovie Technologies | Fully automated post-production editing for movies, TV shows and multimedia contents |
US11564014B2 (en) | 2020-08-27 | 2023-01-24 | WeMovie Technologies | Content structure aware multimedia streaming service for movies, TV shows and multimedia contents |
US11943512B2 (en) | 2020-08-27 | 2024-03-26 | WeMovie Technologies | Content structure aware multimedia streaming service for movies, TV shows and multimedia contents |
US11166086B1 (en) * | 2020-10-28 | 2021-11-02 | WeMovie Technologies | Automated post-production editing for user-generated multimedia contents |
US11812121B2 (en) * | 2020-10-28 | 2023-11-07 | WeMovie Technologies | Automated post-production editing for user-generated multimedia contents |
US20220132223A1 (en) * | 2020-10-28 | 2022-04-28 | WeMovie Technologies | Automated post-production editing for user-generated multimedia contents |
CN113076286A (en) * | 2021-03-09 | 2021-07-06 | 北京梧桐车联科技有限责任公司 | Method, device and equipment for acquiring multimedia file and readable storage medium |
US11967343B2 (en) | 2021-03-12 | 2024-04-23 | Snap Inc. | Automated video editing |
US20220293133A1 (en) * | 2021-03-12 | 2022-09-15 | Snap Inc. | Automated video editing |
US11581019B2 (en) * | 2021-03-12 | 2023-02-14 | Snap Inc. | Automated video editing |
US20220301307A1 (en) * | 2021-03-19 | 2022-09-22 | Alibaba (China) Co., Ltd. | Video Generation Method and Apparatus, and Promotional Video Generation Method and Apparatus |
CN112818955A (en) * | 2021-03-19 | 2021-05-18 | 北京市商汤科技开发有限公司 | Image segmentation method and device, computer equipment and storage medium |
CN112905829A (en) * | 2021-03-25 | 2021-06-04 | 王芳 | Cross-modal artificial intelligence information processing system and retrieval method |
US11615814B2 (en) * | 2021-04-06 | 2023-03-28 | Smsystems Co., Ltd | Video automatic editing method and system based on machine learning |
US20220319549A1 (en) * | 2021-04-06 | 2022-10-06 | Smsystems Co., Ltd | Video automatic editing method and system based on machine learning |
WO2022240409A1 (en) * | 2021-05-11 | 2022-11-17 | CLIPr Co. | System and method for crowdsourcing a video summary for creating an enhanced video summary |
US11610402B2 (en) | 2021-05-11 | 2023-03-21 | CLIPr Co. | System and method for crowdsourcing a video summary for creating an enhanced video summary |
US11445273B1 (en) | 2021-05-11 | 2022-09-13 | CLIPr Co. | System and method for creating a video summary based on video relevancy |
WO2022240408A1 (en) * | 2021-05-11 | 2022-11-17 | CLIPr Co. | System and method for creating a video summary based on video relevancy |
US20220382811A1 (en) * | 2021-06-01 | 2022-12-01 | Apple Inc. | Inclusive Holidays |
US11330154B1 (en) | 2021-07-23 | 2022-05-10 | WeMovie Technologies | Automated coordination in multimedia content production |
US11924574B2 (en) | 2021-07-23 | 2024-03-05 | WeMovie Technologies | Automated coordination in multimedia content production |
WO2023050295A1 (en) * | 2021-09-30 | 2023-04-06 | 中远海运科技股份有限公司 | Multimodal heterogeneous feature fusion-based compact video event description method |
US11682210B1 (en) * | 2021-11-30 | 2023-06-20 | Kwai Inc. | Methods and device for video data analysis |
US20230169770A1 (en) * | 2021-11-30 | 2023-06-01 | Kwai Inc. | Methods and device for video data analysis |
US11790271B2 (en) | 2021-12-13 | 2023-10-17 | WeMovie Technologies | Automated evaluation of acting performance using cloud services |
US11321639B1 (en) | 2021-12-13 | 2022-05-03 | WeMovie Technologies | Automated evaluation of acting performance using cloud services |
WO2023121840A1 (en) * | 2021-12-21 | 2023-06-29 | Sri International | Video processor capable of in-pixel processing |
WO2023160241A1 (en) * | 2022-02-28 | 2023-08-31 | 荣耀终端有限公司 | Video processing method and related device |
Also Published As
Publication number | Publication date |
---|---|
US10679063B2 (en) | 2020-06-09 |
US20160004911A1 (en) | 2016-01-07 |
Similar Documents
Publication | Title |
---|---|
US10679063B2 (en) | Recognizing salient video events through learning-based multimodal analysis of visual features and audio-based analytics |
US11328013B2 (en) | Generating theme-based videos |
US9870798B2 (en) | Interactive real-time video editor and recorder |
US10198509B2 (en) | Classification, search and retrieval of complex video events |
US9570107B2 (en) | System and method for semi-automatic video editing |
US8717367B2 (en) | Automatically generating audiovisual works |
US8948515B2 (en) | Method and system for classifying one or more images |
KR102290419B1 (en) | Method and Appratus For Creating Photo Story based on Visual Context Analysis of Digital Contents |
CN113709561B (en) | Video editing method, device, equipment and storage medium |
US9554111B2 (en) | System and method for semi-automatic video editing |
CN113569088B (en) | Music recommendation method and device and readable storage medium |
JP2011215963A (en) | Electronic apparatus, image processing method, and program |
US20210117471A1 (en) | Method and system for automatically generating a video from an online product representation |
JP2019185738A (en) | System and method for associating textual summary with content media, program, and computer device |
US20230140369A1 (en) | Customizable framework to extract moments of interest |
Wu et al. | Monet: A system for reliving your memories by theme-based photo storytelling |
US9667886B2 (en) | Apparatus and method for editing video data according to common video content attributes |
US20230274481A1 (en) | Digital image annotation and retrieval systems and methods |
TWI780333B (en) | Method for dynamically processing and playing multimedia files and multimedia play apparatus |
Podlesnyy | Automatic Video Editing |
Maybury | Multimedia information extraction: History and state of the art |
Berka et al. | Flexible Approach to Documenting and Presenting Multimedia Performances Using Motion Capture Data |
CN115619901A (en) | Material editing method and device, electronic equipment and storage medium |
Legal Events
Code | Title | Description |
---|---|---|
AS | Assignment | Owner name: SRI INTERNATIONAL, CALIFORNIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: CHENG, HUI; SAWHNEY, HARPREET SINGH; LIU, JINGEN; AND OTHERS. REEL/FRAME: 033321/0615. Effective date: 20140714 |
STCV | Information on status: appeal procedure | Free format text: NOTICE OF APPEAL FILED |
STCV | Information on status: appeal procedure | Free format text: APPEAL BRIEF (OR SUPPLEMENTAL BRIEF) ENTERED AND FORWARDED TO EXAMINER |
STCV | Information on status: appeal procedure | Free format text: EXAMINER'S ANSWER TO APPEAL BRIEF MAILED |
STCV | Information on status: appeal procedure | Free format text: APPEAL READY FOR REVIEW |
STCV | Information on status: appeal procedure | Free format text: ON APPEAL -- AWAITING DECISION BY THE BOARD OF APPEALS |
STCV | Information on status: appeal procedure | Free format text: BOARD OF APPEALS DECISION RENDERED |
STCB | Information on status: application discontinuation | Free format text: ABANDONED -- AFTER EXAMINER'S ANSWER OR BOARD OF APPEALS DECISION |