WO2009113102A2 - Content based visual information retrieval systems - Google Patents

Content based visual information retrieval systems Download PDF

Info

Publication number
WO2009113102A2
Authority
WO
WIPO (PCT)
Prior art keywords
frames
database
shots
video
videos
Prior art date
Application number
PCT/IN2009/000120
Other languages
French (fr)
Other versions
WO2009113102A3 (en)
Inventor
Chattopadhyay Tanushyam
Pal Arpan
Chaki Ayan
Bhowmick Brojeshwar
Original Assignee
Tata Consultancy Services Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tata Consultancy Services Limited filed Critical Tata Consultancy Services Limited
Publication of WO2009113102A2 publication Critical patent/WO2009113102A2/en
Publication of WO2009113102A3 publication Critical patent/WO2009113102A3/en

Links

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/61 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G06F16/732 Query formulation
    • G06F16/7328 Query by example, e.g. a complete video frame or video sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • G06F16/785 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content using colour or luminescence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • G06F16/7854 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content using shape
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/48 Extraction of image or video features by mapping characteristic values of the pattern into a parameter space, e.g. Hough transformation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H20/00 ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
    • G16H20/40 ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance relating to mechanical, radiation or invasive therapies, e.g. surgery, laser therapy, dialysis or acupuncture
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H30/00 ICT specially adapted for the handling or processing of medical images
    • G16H30/40 ICT specially adapted for the handling or processing of medical images for processing medical images, e.g. editing

Definitions

  • the present invention relates to the field of content based visual information retrieval systems.
  • Bhattacharya distance Bhattacharya distance is used for measuring the similarity between two probability distributions.
  • Circularity parameter A compactness measure, called the circularity ratio, is the ratio of the area of the shape to the area of a circle having the same perimeter.
  • Color histogram is a representation of the distribution of colors in an image, derived by counting the number of pixels in each of a given set of color ranges in a typically two-dimensional (2D) or three-dimensional (3D) color space.
  • Connected component labeling is used to detect connected regions in binary digital images. Connected components labeling scans an image and groups its pixels into components based on pixel connectivity, i.e. all pixels in a connected component share similar pixel intensity values and are in some way connected with each other.
  • Entropy is the measure of information content in an image.
  • Gaussian mask is applied as a pre-processing step before edge detection to reduce high frequency noise.
  • Hough transform is a technique which is used to isolate features of a particular shape within an image.
  • Images include both still and moving images.
  • Input query is the example image which is to be searched and retrieved from the image/video database.
  • K-means The k-means algorithm clusters objects into k partitions based on their attributes, so that objects in the same cluster are more similar to each other than to objects in different clusters.
  • Laplacian of Gaussian filter is typically used for edge detection in image processing.
  • Normalization is a process that changes the range of pixel intensity values to a desired value.
  • Shots Moving images contain multiple frames. Shots are a collection of similar frames i.e. frames that are acquired through a continuous camera recording. They form a smaller and manageable part of a video.
  • Sobel operator The Sobel operator is an operator used in image processing, particularly within edge detection algorithms. It is named after Dr. Irwin Sobel. The operator calculates the gradient of the image intensity at each point, giving the direction of the largest possible increase of intensity from light to dark and the rate of change in that direction.
  • Threshold is the pixel value of an image corresponding to a non-zero frequency.
  • Thumbnail Thumbnails are reduced-size versions of the pictures used to help in recognizing and organizing the pictures.
  • Video is a sequence of still images representing scenes in motion.
  • Content based image retrieval systems are systems that search for objects of interest based on actual contents of the image.
  • the term content refers to features including color, shape, texture and the like that can be extracted from an image.
  • Content based retrieval systems are typically implemented by using Query techniques.
  • 'Query by example' is a query technique that involves providing the content based image retrieval systems with an example image that can be used as a base for searching for similar images.
  • This query technique eliminates the shortcomings of annotating each image with keywords. Using this technique one can provide as a query a digital image and ask for all electronic documents that contain similar pictures.
  • Content-based access of videos benefits large domains including teaching, research, training, surgical instruments manufacturing industry and the like.
  • US Patent 6584221 discloses a method for representing an image for the purpose of indexing, searching and retrieving images in a multi-media database.
  • the invention allows users to specify "content" to be searched along with "regions-of-interest".
  • the image retrieval and representation is based on two key image components.
  • the first component is a set of image features and the second component is a similarity metric used to compare the image features.
  • the invention retrieves the images from a still image database and not from videos.
  • the output is also provided in the form of static images.
  • the functioning requires storing joint distributions of features along with the image, which increases the size of the database and, in turn, the complexity of the content-based analysis.
  • the invention also does not have the provision for user feedback.
  • a content based video indexing and/or retrieval system comprises:
  • first extraction means for extracting features of pre-determined objects from the frames
  • a user interface adapted to receive an image of at least one object from the predetermined objects
  • a processor for processing the received images comprising: a) a second extraction means for extracting features of the received image; and b) a matching engine adapted to match the extracted features of the received image with the stored extracted features in the second database to identify annotated frames containing an object having the extracted features;
  • locating means for locating at least one video of the matched annotated frame in the first database and further adapted to retrieve relevant matched frames from the video and convert the retrieved frames to a compatible format for display purposes;
  • the second database is adapted to store the annotated frames of pre-determined objects in the XML format.
  • the first database comprises updating means for:
  • a third database is used for storing still pictures of pre-determined objects and also for transferring selected images of pre-determined objects to the user interface.
  • the segmentation means is adapted to split still pictures and videos into contiguous frames and group them into one or more shots and further identify one or more key frames in the shots.
  • the segmentation means includes a shot detection means to index video data intelligently and automatically.
  • the segmentation means include a shot transition detection means for sensing the transition from one shot to the subsequent one and further improving the shots by reducing the number of places erroneously declared as shot breaks.
  • the shot transition detection means is adapted to detect and improve hard and soft cuts in videos.
  • shot transition detection is adapted to detect and improve soft cuts including fade-in, fade-out, dissolves and wipes.
  • the annotating means is adapted to extract at least one feature of the image of an object selected from a group of features consisting of color, linear shape, edge gradient and compactness.
  • the first database is adapted to store the still pictures in JPEG/BMP/TIFF formats and videos in AVI/WMV/MPEG/H.264 formats.
  • the displaying means consists of a media player for playing the matched frames of the marked objects.
  • the system includes feedback means adapted to receive relevancy feedback from users.
  • the system comprises means for generating thumbnails of the annotated key frames.
  • segmentation includes detection of shots in the videos, sensing the transition from one shot to another, sensing the boundaries between the shots and improving the shots by reducing the number of places erroneously declared as shot breaks using block color histogram.
  • detecting of shots comprises: a) detection of the object within the frames based on the linear shape of the image of pre-determined objects using Hough transform; b) constructing an 8x8 block of every frame, each block having pixels in the RGB color space; c) converting the pixels from RGB to HSV color space for every block and constructing a color histogram; d) quantizing the color histogram for each of the blocks.
  • constructing color histogram further includes: a) computing the distance for every block of the frame with respect to the previous frame using Bhattacharya distance; b) finding all similar elements in blocks of frames to determine isolated blocks not containing connected elements using connected component labeling; c) removing the isolated blocks in all frames using thresholding; d) filtering the frames to remove noise using Gaussian mask over LOG (laplacian of Gaussian) filter; e) normalizing the filtered frames; f) measuring the differences between the filtered frames; and g) detecting shots based on the number of differences calculated between frames.
  • the feature based matching comprises the following steps:
  • thumbnails of the video sequence for the matched frames are displayed to the users.
  • users on seeing the thumbnails can provide relevance feedback concerning the selected video sequence.
  • Figure 1 is the first level block diagram describing the overview of the system
  • Figure 2 is a schematic of the content based video sequencing system
  • Figure 3 is a schematic diagram of the segmentation means
  • Figure 4 shows detection of lower and upper threshold from the histogram of query image
  • Figure 5 shows a homogeneity histogram difference plot
  • Figure 6 shows a homogeneity histogram difference plot for a very smooth dissolving effect
  • Figure 7 is a flowchart of typical steps for the video annotation process.
  • FIG. 8 is a flowchart of typical steps for the video retrieval process.
  • this invention will be useful for tracking surgical instruments in surgical video sequences and further for retrieving the frames where the particular instrument is being used.
  • the present invention will be useful in effectively and optimally retrieving and tracking medical instruments including scalpel, surgical short stem and long stem scissors, stitching needles, cautery and the like from video recordings of operations.
  • the image of the instrument to be tracked will be provided by the users and the system will find and display all the video sequences from the video recordings where the instrument was seen.
  • the retrieval of the video sequences is typically carried out in at least two phases.
  • first phase typically image of the instrument to be tracked is taken and its features are extracted.
  • second phase the video is typically split into segments to find out the region of interest (ROI) for reducing the region of search.
  • the color and shape based features of the video and the object are matched to track and retrieve the required instrument.
  • the system recognizes the instrument in 91% of cases and does not give any false alarms.
  • this invention envisages a method by which both an automatic and a semi-automatic approach for query image based video retrieval are made available.
  • retrieval of video images based on the image of the object (query image) can be performed automatically, or with manual intervention by taking the user's feedback into account.
  • thumbnails are generated for users for making search of the desired instrument faster.
  • This invention also provides a tool which can split the video sequence into smaller units that are taken in a continuous camera recording and contain similar information. These smaller and manageable units are called shots.
  • users can provide relevancy feedback by seeing the thumbnails of key frames of the shots for the retrieved videos and accordingly select the relevant video.
  • the invention can specifically be used for retrieval of video sequences which can be a part of compressed or un-compressed videos.
  • Video annotation as represented by block 100 in Figure 1;
  • the first database can store videos in AVI/WMV/MPEG/H.264 formats.
  • the first database 104 acts as a raw database for storing all the video recordings. These recordings are used later for retrieving the video sequences matching the image of the desired objects, typically medical instruments. Users will also be given the provision for adding new videos in the first database 104.
  • the still pictures uploaded by users will be pictures of the objects like surgical instruments that users want to track and retrieve.
  • Still pictures in JPEG/BMP/TIFF formats will be accepted by the system. These images will enable users to select the image of the object they wish to track in the videos.
  • third database 116 acts as a database containing still pictures of the instruments and objects.
  • third database will typically contain images of surgical instruments including scalpel, cautery, surgical tongs, long stem scissors, short stem scissors, stitching needle, gauze, swab and cotton, gloves and the like.
  • These images of pre-determined objects stored in the third database 116 are provided to the user interface represented by block 118 in Figure 2 and also to the first extraction means represented by block 110 in Figure 2. Provision has also been given to update the third database to add new images of objects.
  • the still pictures and the videos stored in the first and the third databases will be indexed using an indexing means as represented by block 106 in Figure 2. These indexed videos and still pictures are further passed for performing video annotation 100.
  • Video Annotation During video annotation 100, still pictures and/or videos are received as input. Still pictures and/or videos in both compressed and uncompressed form are accepted for video annotation 100. Typically, the still pictures and/or videos can be uploaded, provided real-time through a camera feed and the like. The uploaded still pictures are typically used for extracting images of objects that will be stored as template. These templates can be used for selecting the objects to be searched.
  • the process of video annotation 100 can typically be performed offline at a pre-processing stage, real-time or in parallel with the video retrieval process.
  • the first step in video annotation includes segmentation wherein segmentation means represented by block 108 in Figure 2, receives the indexed videos and splits them into contiguous frames grouped into one or more shots. Splitting videos in shots enables dividing the whole video into groups of similar frames or video sequences from one camera recording.
  • the segmentation means 108 includes: i) Shot transition detection means as represented by block 132 in Figure 3; and ii) Shot detection means as represented by block 134 in Figure 3.
  • Shot transition detection means 132 senses the transition from one shot to the subsequent one and improves the quality of shot by reducing the number of places erroneously declared as shot breaks. In accordance with an embodiment of the present invention, shot transition detection performs the below functions:
  • the present invention improves the shot transition by reducing the number of places erroneously declared as shot breaks using block colour histogram.
  • Cut is a sharp boundary between shots. This generally implies a peak in the difference between color or motion histograms corresponding to the two frames surrounding the cut.
  • Dissolve is a boundary between shots where the content of the last images of the first shot is continuously mixed with that of the first images of the second shot. Fade-in and fade-out effects are special cases of dissolve transitions where the first or the second scene, respectively, is a dark frame.
  • Wipe is a boundary between shots and implies that the images of the second shot continuously cover, or push out of the display (coming from a given direction), those of the first shot.
  • the invention uses a multi-factorial approach to determine the shot boundary, making the decision to identify the shot boundary from multiple parameters.
  • the above mentioned parameter list contains:
  • the invention further involves identifying the key frames in the shots by performing the below steps:
  • the shots within the videos are indexed using a shot detection system 134.
  • Shot detection means typically is used to index video data intelligently and automatically.
  • the segmented videos are further passed to a first extraction means.
  • the first extraction means 110 receives the segmented videos from the segmentation means and the images of pre-determined objects from the third database 116 and extracts the features of predetermined objects from the key frames.
  • Features including color, edge gradient and compactness of the pre-determined objects are extracted from the key frames and passed to the annotating means as represented by block 112 in Figure 2.
  • the annotating means 112 receives the extracted features from the first extraction means 110 and annotates the extracted features with respect to the key frames. Features of the annotated frames are converted to the XML format and stored using a storing means represented by block 130 in Figure 2 in a second database as represented by block 114 in Figure 2.
  • the second database 114 is an XML database which stores the extracted features of the key frames with respect to the pre-determined objects.
  • the second database is the working database of the system.
  • the contents of the second database 114 and the third database 116 are provided for the process of video retrieval based on query image.
  • Video retrieval means based on query image Video retrieval means based on query image (VRBQI) as represented by block 102 in Figure 1, according to an embodiment of the present invention, receives the query image provided by users, extracts the image features, compares the extracted features with the stored extracted features in the second database 114 to retrieve frames matching the features of the query image.
  • VRBQI 102 presents the output in the form of ranked output based on fuzzy matching score.
  • the first step in the video retrieval process typically includes receiving the users input in the form of a query image.
  • Query image is the image of an object which the user wishes to track in the videos.
  • the user input is typically provided by a user interface 118.
  • user interface 118 receives the input query image from users. Users can provide the query image to the user interface 118 by drag and drop, by browsing through the images of pre-determined objects in the third database 116, by upload and the like. User interface 118 provides the query image to a second extraction means, represented by block 120 in Figure 2. Typically, the second extraction means extracts at least one feature, including color and compactness, of the image received from the user interface 118 and passes the extracted features to a matching engine represented by block 122 in Figure 2.
  • Matching engine, 122 matches the extracted features of the query image with the features of the key frames stored in the XML files.
  • the matched key frames are further passed to a locating means represented by block 124 in Figure 2.
  • Locating means 124 typically receives the matched frames from the matching engine 122 and locates the video for the frames in the first database 104. It then retrieves the matched frames from the video and converts the retrieved frames to a compatible format for display. The matched frames are also sent to a thumbnail generator represented by block 126 in Figure 2.
  • the thumbnail generator 126 in accordance with an embodiment of the present invention typically receives the matched frames from the matching engine 122 and generates thumbnails for the respective frames (representative frames) and passes them to a displaying means represented by block 128 in Figure 2.
  • Displaying means 128 displays thumbnails of the key frame in the video sequences. These thumbnails help users to identify whether the retrieved frames are relevant and thus he/she can play only the video sequences which are found relevant. Displaying means typically consists of a media player for playing the video sequences selected by users.
  • the invention will also prove useful to medical instrument manufacturers. Typically, on providing the image of a particular surgical instrument to the present invention, they can track and retrieve all the video sequences where the medical instrument, for example a 'scalpel', was used. This will enable manufacturers to see the usage of the instrument and further carry out research for improving the instruments.
  • the circularity parameter is selected as a feature as it is invariant under scaling and translation. The boundary of the query image is extracted (b) and the number of pixels containing information about the actual instrument to search is counted (c). As the video may contain the instrument in any orientation, the compactness parameter is used as a feature in this application.
  • Edge gradient based Noise Removal using K-means algorithm In the next phase the edges are obtained from the intensity video frame (pi) using the Sobel operator. Sobel gives the edge gradients stored in an array (pe). Among these points, some lead to sharp edges and some represent weak edges. In order to reduce the edges arising from noise in the input video or poor lighting conditions during recording of the video, the weak edges are discarded. The K-means algorithm is used to obtain the threshold: the edge-gradient values are clustered into two groups and edges whose gradient falls into the weaker group are removed.
  • the object has a high compactness and thus gives a high component density.
  • the last step involves detection of the object of interest based on the linear shape of query image as described in observation section.
  • Hough transform has been applied in the boundary of each component.
  • Let Ni be the number of points in the boundary of the i-th component and Hi be the number of boundary points belonging to the line produced by the Hough transform.
  • n is the total number of components in p
  • The ratio Ri = Hi / Ni should ideally be 1. Some tolerance is imposed for noise: a threshold T is taken so that a component is treated as noise if Ri is below T.
  • T = 0.5 is taken, i.e. if Hi is less than 50% of Ni then the component is noise.
  • Shot transitions can be categorized in two ways: as hard cuts or as soft cuts.
  • Soft cuts include fade-in, fade-out, dissolves etc.
  • a hierarchical system is designed to detect the type of cut and also locate the position of cut in one pass.
  • Every color frame is divided into 8x8 blocks. For every block the pixels are converted from RGB to HSV color space to obtain separate information on color, intensity and impurities. A histogram is constructed for each block. As the ranges of the color, intensity and impurity values are very large, a quantization scheme is used.
  • the quantization scheme can be summarized as follows:
  • Let h, s, and v be the HSV-domain color values for a particular R, G, and B value, with s and v normalized to [0, 1], and let index be the quantized bin index.
  • the distance between two probability distributions typically can be computed using Bhattacharya's distance function.
  • the distance metric used is given in equation (1), where P and Q are probability distribution functions.
  • the proof that equation (1) is a metric can be found in ()
  • the distance for every block of the image with the previous frame is computed.
  • the block difference image has many blocks that differ from the previous frame, and these differing blocks should form an object of considerable area.
  • the connected component algorithm is used to find all connected components in the difference image.
  • Let Ai be the area of the i-th component of the difference image.
  • area thresholding is used to remove the isolated differing blocks. An object having an area of no more than one macro block can be removed.
  • Ai is set to 0 if Ai ≤ 64 (one macro block), i.e. when only a single block is different and not a single neighbour is different.
  • An alternative way to use the LOG filter is to apply a Gaussian mask over the image before applying the edge detection operator. This pre-processing step reduces the high frequency noise components prior to the differentiation step.
  • T = 0.8 is taken, i.e. roughness less than 80% is removed.
  • the associativity of the normalized edge strength and the grayness of frame F is defined by a histogram Hist.
  • the method of video annotation comprises the following steps as seen in Figure 7:
  • the method of video retrieval comprises the following steps as seen in Figure 8:
  • the technical advancements of the present invention include:

Abstract

A system and a method for content based retrieval and indexing of video sequences based on the image of a pre-determined object. The system performs operations typically in two stages: video annotation, and video retrieval using the image of a pre-determined object. Annotated data is stored in an XML database.

Description

CONTENT BASED VISUAL INFORMATION RETRIEVAL SYSTEMS
The present invention relates to the field of content based visual information retrieval systems.
DEFINITIONS OF TERMS USED IN THE SPECIFICATION
In this specification, the following terms have the definitions given alongside. These are additions to the usual definitions expressed in the art.
• Bhattacharya distance: Bhattacharya distance is used for measuring the similarity between two probability distributions (a worked form of this distance and of the circularity ratio below is given at the end of these definitions).
• Circularity parameter: A compactness measure, called the circularity ratio, is the ratio of the area of the shape to the area of a circle having the same perimeter.
• Color histogram: A color histogram is a representation of the distribution of colors in an image, derived by counting the number of pixels in each of a given set of color ranges in a typically two-dimensional (2D) or three-dimensional (3D) color space.
• Connected Component Labeling: Connected component labeling is used to detect connected regions in binary digital images. Connected components labeling scans an image and groups its pixels into components based on pixel connectivity, i.e. all pixels in a connected component share similar pixel intensity values and are in some way connected with each other.
• Entropy: Entropy is the measure of information content in an image.
• Gaussian Mask: A Gaussian mask is applied as a pre-processing step before edge detection to reduce high frequency noise.
• Hough transform: The Hough transform is a technique which is used to isolate features of a particular shape within an image.
• Images: Images include both still and moving images.
• Input query: Input query is the example image which is to be searched and retrieved from the image/video database.
• Key Frames: Key frames are frames within shots which contain the relevant information.
• K-means: The k-means algorithm clusters objects into k partitions based on their attributes, so that objects in the same cluster are more similar to each other than to objects in different clusters.
• Laplacian of Gaussian filter: The Laplacian of Gaussian filter is typically used for edge detection in image processing.
• Normalization: Normalization is a process that changes the range of pixel intensity values to a desired value.
• Relevance feedback: Feedback provided by users for refining the search results.
• Shots: Moving images contain multiple frames. Shots are a collection of similar frames i.e. frames that are acquired through a continuous camera recording. They form a smaller and manageable part of a video.
• Sobel operator: The Sobel operator is an operator used in image processing, particularly within edge detection algorithms. It is named after Dr. Irwin Sobel. The operator calculates the gradient of the image intensity at each point, giving the direction of the largest possible increase of intensity from light to dark and the rate of change in that direction.
• Threshold: Threshold is the pixel value of an image corresponding to a non-zero frequency.
• Thumbnail: Thumbnails are reduced-size versions of the pictures used to help in recognizing and organizing the pictures.
• Video: Video is a sequence of still images representing scenes in motion.
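As a worked form of two of the definitions above (an illustrative restatement, not text reproduced from the specification): for a shape of area A and perimeter P, and for two normalized histograms h1 and h2, the circularity ratio and a commonly used Bhattacharyya-based comparison can be written as

\[ \text{circularity ratio} \;=\; \frac{A}{\pi\,(P/2\pi)^{2}} \;=\; \frac{4\pi A}{P^{2}} \]
\[ BC(h_{1},h_{2}) \;=\; \sum_{i}\sqrt{h_{1}(i)\,h_{2}(i)}, \qquad d(h_{1},h_{2}) \;=\; \sqrt{1 - BC(h_{1},h_{2})} \]

The distance d shown here is one common Bhattacharyya-based form that satisfies the metric properties; the specification's own equation (1) is not reproduced in this extract.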
BACKGROUND AND PRIOR ART
The emergence of multimedia content within electronic publications and the World Wide Web raises the critical issue of searching for and retrieving objects of interest from a huge chunk of image and video data. Browsing a digital video library in search of an object of interest can be very tedious, especially with an ever expanding collection of multimedia material. Moreover, organizing the video and locating the required information effectively and efficiently presents an even greater challenge. The rapid emergence of image processing and pattern recognition applications, together with the rapidly expanding image and video collections on the World Wide Web, has attracted significant research effort in effective retrieval and management of visual information. There is an old saying, "a picture is worth a thousand words". If each video sequence is considered a set of still images, then the number of images in such a set will typically be in the hundreds or thousands or even more, making it very difficult to find certain information in video documents. The sheer volume of video data available nowadays poses a real challenge to researchers: how to organize the data effectively and make use of all these video documents efficiently. There is a need to locate meaningful information and extract knowledge from video documents. The better the representation scheme of the video contents, the faster and more accurate is the retrieval of data. One good representation is to index the video data and find informative images from the indexed data. It is clearly impractical to index video documents manually because it is too time consuming. However, using various image processing techniques, computer science researchers have come up with the idea of shot detection, which is used to group frames into shots. Thus, a shot designates a contiguous sequence of video frames recorded by an uninterrupted camera operation.
In the last few years, huge progress has been made in the field of multimedia CODECs, enabling users to store more data in less space. Exploiting this advancement, medical practitioners are recording the video sequences of critical operations. These sequences are highly interesting for medical instrument manufacturers, who are usually keen to see how doctors use the instruments in real-life operation theaters. But they face difficulties in locating the portion of video where the actual instrument was used.
Most of the existing solutions available in the market for image retrieval are typically based on annotating each image with a set of keywords. A major disadvantage of these applications is manual classification, which is time-consuming and potentially error-prone. Moreover, manual classification becomes impractical for huge databases and increases the chances of missing an image during search, as the image may have been described using a synonym of the search term. Additionally, the retrieval in these cases is done mainly from a still image database and not from videos. Therefore, there is a need for content based video retrieval.
Content based image retrieval systems are systems that search for objects of interest based on actual contents of the image. The term content refers to features including color, shape, texture and the like that can be extracted from an image.
Content based retrieval systems are typically implemented by using Query techniques. 'Query by example' is a query technique that involves providing the content based image retrieval systems with an example image that can be used as a base for searching for similar images. This query technique eliminates the shortcomings of annotating each image with keywords. Using this technique one can provide as a query a digital image and ask for all electronic documents that contain similar pictures.
Content-based access of videos benefits large domains including teaching, research, training, surgical instruments manufacturing industry and the like.
In the medical field, images, and especially digital images, are produced in ever increasing quantities and used for diagnostics and therapy. Potential online applications of endoscopic video understanding involve indexing, retrieval and analysis of surgery. Despite widespread interest in content-based visual information retrieval, there is a dearth of tools for retrieval of video specialized to the medical domain. In particular, there are limited tools available to automatically annotate laparoscopic surgery videos.
Some of the approaches used are color based methods and shape models to track the instruments. In addition, some research groups use instruments marked with a special color for tracking. Zhang and Payandeh proposed a marker design on the instrument tips to recover robot controlled parameters. In most of these applications image recognition and analysis is done based on color thresholding and shape modeling. These methods are not successful due to reflections from the instruments and varying illumination. Krupa et al. solved a related problem by using a light emitting diode on the instrument tip to project laser dots on the surface, whose pattern is analyzed to guide the endoscope. Instead of marking the instruments, Climent and Mares proposed a method using the Hough Transform, based on the assumption that most of the instruments are structured objects. However, most of these attempts have been made for automatic guidance of a robot so that it can act as an assistant for laparoscopic surgeries.
Thus, existing solutions for surgical instrument tracking involve modification of the instruments with visual markers. There is a need to focus on detection of particular surgical instruments from a large chunk of video data, without any modification of the desired instrument, so that the use of the instrument can be studied and necessary upgrades can be made.
• US Patent 6584221 discloses a method for representing an image for the purpose of indexing, searching and retrieving images in a multi-media database. The invention allows users to specify "content" to be searched along with "regions-of-interest". The image retrieval and representation is based on two key image components. The first component is a set of image features and the second component is a similarity metric used to compare the image features. However, the invention retrieves the images from a still image database and not from videos. The output is also provided in the form of static images. The functioning requires storing joint distributions of features along with the image, which increases the size of the database and, in turn, the complexity of the content-based analysis. The invention also does not have a provision for user feedback.
Therefore, there is a need for a system for:
• Retrieval of video sequences from video files.
• Reducing the space complexity involved in storing blocks of the extracted features of the raw image files.
• Providing a cross platform system that can work with a variety of image and video file formats.
• Reducing the time taken for retrieving the images from the databases.
• Providing and playing the sequences of video files which contain the object of interest.
• Providing a system which can particularly be used for detection of surgical instruments from a large chunk of video data, without any modification of the desired instrument.
• Providing a system for accepting user feedback. By taking into account a user's feedback, it is possible to make the system more precise in the search of relevant images.
OBJECT OF THE INVENTION
It is an object of the present invention to provide a user friendly system for content based retrieval of video sequences.
It is another object of the present invention to provide a fast and accurate system for retrieval of video sequences.
It is still another object of the present invention to provide a system for content based retrieval of video sequences which can be applied to both compressed and uncompressed image data.
It is still another object of the present invention to provide a platform independent system which can be implemented on different operating systems.
It is yet another object of the present invention to provide a user-centric system which enables human intervention in the form of 'relevance feedback'.
SUMMARY OF THE INVENTION
In accordance with the present invention, a content based video indexing and/or retrieval system comprises:
• at least a first database for uploading still pictures and/or videos;
• indexing means for indexing the still pictures and/or videos in said first database;
• segmentation means for segmentation of indexed videos into frames;
• first extraction means for extracting features of pre-determined objects from the frames;
• annotating means for annotating the extracted features with respect to the frames;
• storing means for storing information with respect to extracted features and the annotated frames in a second database;
• a user interface adapted to receive an image of at least one object from the predetermined objects;
• a processor for processing the received images, the processor comprising: a) a second extraction means for extracting features of the received image; and b) a matching engine adapted to match the extracted features of the received image with the stored extracted features in the second database to identify annotated frames containing an object having the extracted features;
• locating means for locating at least one video of the matched annotated frame in the first database and further adapted to retrieve relevant matched frames from the video and convert the retrieved frames to a compatible format for display purposes; and
• displaying means for displaying the video sequence of the matched frames.
Typically, the second database is adapted to store the annotated frames of pre-determined objects in the XML format.
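Purely as an illustration of such XML storage (the element and attribute names below are assumptions, since the specification's schema is not reproduced in this extract), one annotated key frame could be serialised with the Python standard library:

import xml.etree.ElementTree as ET

def annotation_to_xml(video_id, frame_no, object_name, features):
    """Serialise one annotated key frame for the second database.
    Element and attribute names are illustrative, not the patent's schema."""
    frame = ET.Element("annotated_frame", video=video_id, frame=str(frame_no))
    obj = ET.SubElement(frame, "object", name=object_name)
    for feature_name, value in features.items():
        ET.SubElement(obj, "feature", name=feature_name).text = str(value)
    return ET.tostring(frame, encoding="unicode")

# Example record: a scalpel seen in key frame 1024 of video "op_0001"
print(annotation_to_xml("op_0001", 1024, "scalpel",
                        {"compactness": 0.18, "dominant_hue": 35}))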
Preferably, the first database comprises updating means for:
1) adding new still pictures and/or videos in the first database; and
2) means adapted, on an update to the first database, to extract, annotate and store the pre-determined objects in the videos in the second database.
Typically, a third database is used for storing still pictures of pre-determined objects and also for transferring selected images of pre-determined objects to the user interface. Specifically, the segmentation means is adapted to split still pictures and videos into contiguous frames and group them into one or more shots and further identify one or more key frames in the shots.
Preferably, the segmentation means includes a shot detection means to index video data intelligently and automatically.
Typically, the segmentation means include a shot transition detection means for sensing the transition from one shot to the subsequent one and further improving the shots by reducing the number of places erroneously declared as shot breaks.
Typically, the shot transition detection means is adapted to detect and improve hard and soft cuts in videos. Particularly, shot transition detection is adapted to detect and improve soft cuts including fade-in, fade-out, dissolves and wipes.
Specifically, the annotating means is adapted to extract at least one feature of the image of an object selected from a group of features consisting of color, linear shape, edge gradient and compactness.
Specifically, the first database is adapted to store the still pictures in JPEG/BMP/TIFF formats and videos in AVI/WMV/MPEG/H.264 formats.
Typically, the displaying means consists of a media player for playing the matched frames of the marked objects.
According to an embodiment of the present invention, the system includes feedback means adapted to receive relevancy feedback from users.
Typically, the system comprises means for generating thumbnails of the annotated key frames.
According to the present invention, there is provided a method for content based retrieval of video sequences comprising the following steps:
• uploading still pictures and/or videos in a first database;
• indexing the still pictures and/or videos in the first database;
• segmentation of said indexed videos into frames;
• extracting features of pre-determined objects from the key frames;
• annotating the extracted features with respect to the key frames;
• storing information with respect to extracted features and the annotated frames in a second database;
• receiving an image of at least one object from the pre-determined objects from the users;
• extracting the features of the received image;
• matching the extracted features of the received image with the stored extracted features in the second database to identify annotated frames containing an object having the extracted features;
• locating at least one video of the matched annotated frame in the first database; and
• retrieving relevant matched frames from the video and converting the retrieved frames to a compatible format for display.
Typically, segmentation includes detection of shots in the videos, sensing the transition from one shot to another, sensing the boundaries between the shots and improving the shots by reducing the number of places erroneously declared as shot breaks using block color histogram.
Particularly, detecting of shots comprises: a) detection of the object within the frames based on the linear shape of the image of pre-determined objects using Hough transform; b) constructing an 8x8 block of every frame, each block having pixels in the RGB color space; c) converting the pixels from RGB to HSV color space for every block and constructing a color histogram; d) quantizing the color histogram for each of the blocks. In accordance with still another embodiment of the present invention, constructing color histogram further includes: a) computing the distance for every block of the frame with respect to the previous frame using Bhattacharya distance; b) finding all similar elements in blocks of frames to determine isolated blocks not containing connected elements using connected component labeling; c) removing the isolated blocks in all frames using thresholding; d) filtering the frames to remove noise using Gaussian mask over LOG (laplacian of Gaussian) filter; e) normalizing the filtered frames; f) measuring the differences between the filtered frames; and g) detecting shots based on the number of differences calculated between frames.
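An illustrative sketch of the block-histogram portion of these steps is given below. The per-channel bin counts, the Bhattacharyya threshold and the use of OpenCV helpers are assumptions made for the sketch; the specification's exact quantization scheme and thresholds are not reproduced here.

import cv2
import numpy as np

BLOCK = 8  # 8x8 pixel blocks, as in step (b) above

def block_histograms(frame_bgr, bins=8):
    # Quantized HSV colour histogram for every 8x8 block of a frame.
    # The 'bins' per channel are an assumed quantization, not the patent's scheme.
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    h, w = hsv.shape[:2]
    hists = {}
    for y in range(0, h - BLOCK + 1, BLOCK):
        for x in range(0, w - BLOCK + 1, BLOCK):
            block = hsv[y:y + BLOCK, x:x + BLOCK]
            hist = cv2.calcHist([block], [0, 1, 2], None,
                                [bins, bins, bins], [0, 180, 0, 256, 0, 256])
            hists[(y // BLOCK, x // BLOCK)] = cv2.normalize(hist, hist).flatten()
    return hists

def differing_blocks(prev_hists, cur_hists, dist_threshold=0.3):
    # Mark blocks whose histogram differs strongly from the previous frame
    # (Bhattacharyya distance), then label connected components so that
    # isolated differing blocks can be discarded as noise.
    rows = max(k[0] for k in cur_hists) + 1
    cols = max(k[1] for k in cur_hists) + 1
    diff = np.zeros((rows, cols), dtype=np.uint8)
    for k, hist in cur_hists.items():
        d = cv2.compareHist(prev_hists[k], hist, cv2.HISTCMP_BHATTACHARYYA)
        diff[k] = 255 if d > dist_threshold else 0
    n, labels, stats, _ = cv2.connectedComponentsWithStats(diff)
    # keep only connected groups larger than a single block
    large = [i for i in range(1, n) if stats[i, cv2.CC_STAT_AREA] > 1]
    return diff, len(large)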
According to yet another embodiment of the present invention, the feature based matching comprises the following steps:
• receiving the extracted features of an image of at least one object input by the users;
• retrieving subsequent annotated frames from the second database;
• matching the color based features of the received image with the retrieved frames and providing the matched frames;
• determining the edge points of the matched frames using Sobel operator and further removing weak edges from the matched frames using K-means;
• determining the pixel density of the matched frames and removing objects based on the pixel density from the matched frames;
• determining the compactness of the matched frames and removing undesirable objects based on the compactness ratio from the matched frames using circularity parameter; and
• further retrieving and tracking pre-determined objects in the matched frames satisfying the shape feature of the received images of objects.
According to still another embodiment of the invention, thumbnails of the video sequence for the matched frames are displayed to the users. Typically, users, on seeing the thumbnails, can provide relevance feedback concerning the selected video sequence.
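The shape-related steps in the matching list above can be sketched as follows. This is an illustrative sketch only: the two-cluster K-means threshold, the density cut-off and the circularity tolerance are assumed values, not parameters taken from the specification (Python with OpenCV and scikit-learn assumed).

import cv2
import numpy as np
from sklearn.cluster import KMeans

def strong_edges(gray):
    """Sobel edge gradients; weak edges (likely noise) are discarded using a
    threshold assumed here to be the midpoint of a two-cluster K-means."""
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    mag = np.sqrt(gx ** 2 + gy ** 2)
    centres = KMeans(n_clusters=2, n_init=10).fit(mag.reshape(-1, 1)).cluster_centers_
    return (mag >= centres.mean()).astype(np.uint8)

def circularity(contour):
    """Compactness (circularity ratio): area of the shape over the area of a
    circle with the same perimeter, i.e. 4*pi*A / P^2."""
    area = cv2.contourArea(contour)
    perimeter = cv2.arcLength(contour, True)
    return 0.0 if perimeter == 0 else 4.0 * np.pi * area / (perimeter ** 2)

def candidate_components(gray, query_circularity, density_min=0.3, circ_tol=0.25):
    """Keep edge components that are dense enough and whose circularity is close
    to that of the query object; density_min and circ_tol are illustrative values."""
    edges = strong_edges(gray) * 255
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    kept = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        density = cv2.contourArea(c) / float(w * h) if w * h else 0.0
        if density >= density_min and abs(circularity(c) - query_circularity) <= circ_tol:
            kept.append(c)
    return kept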
BRIEF DESCRIPTION OF THE DRAWINGS
Other aspects of the invention will become apparent from the accompanying drawings and the descriptions stated below, which are merely illustrative of a preferred embodiment of the invention and do not limit in any way the nature and scope of the invention, in which: Figure 1 is the first level block diagram describing the overview of the system;
Figure 2 is a schematic of the content based video sequencing system;
Figure 3 is a schematic diagram of the segmentation means;
Figure 4 shows detection of lower and upper threshold from the histogram of query image;
Figure 5 shows a homogeneity histogram difference plot;
Figure 6 shows a homogeneity histogram difference plot for a very smooth dissolving effect;
Figure 7 is a flowchart of typical steps for the video annotation process; and
Figure 8 is a flowchart of typical steps for the video retrieval process.
DETAILED DESCRIPTION
In accordance with this invention, there is provided a system for retrieving and indexing video sequences based on the image of an object provided by users.
Particularly, this invention will be useful for tracking surgical instruments in surgical video sequences and further for retrieving the frames where the particular instrument is being used.
Typically, medical operations are recorded using three cameras. One of the cameras is placed on top of the operation table, one is held by hand and used to record the laparoscopic view displayed on a screen, and the third is placed on a tripod. From this set of images of the medical instruments and the available surgical videos, the following observations can be made:
• All the instruments have a linear pattern;
• All the instruments have two distinct linear edges;
• More than 20% of the instrument is visible during operation; and
• Instruments visible in the laparoscopic view are monochromatic in nature. This leads to a single peak in the histogram of the Y, U and V components of the query image.
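The single-peak observation above can be checked on a query image along the following lines; the peak-dominance criterion is an assumed heuristic rather than a rule stated in the specification.

import cv2
import numpy as np

def is_monochromatic(query_bgr, bins=64, dominance=0.5):
    """Return True if each of the Y, U and V histograms of the query image is
    dominated by a single peak; 'dominance' is an assumed heuristic, i.e. the
    tallest bin plus its two neighbours must hold this fraction of all pixels."""
    yuv = cv2.cvtColor(query_bgr, cv2.COLOR_BGR2YUV)
    total = float(yuv.shape[0] * yuv.shape[1])
    for channel in range(3):
        hist = cv2.calcHist([yuv], [channel], None, [bins], [0, 256]).ravel()
        peak = int(np.argmax(hist))
        mass = hist[max(0, peak - 1):peak + 2].sum() / total
        if mass < dominance:
            return False
    return True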
In accordance with the above observations, the present invention will be useful in effectively and optimally retrieving and tracking medical instruments including scalpel, surgical short stem and long stem scissors, stitching needles, cautery and the like from video recordings of operations. Typically, the image of the instrument to be tracked will be provided by the users and the system will find and display all the video sequences from the video recordings where the instrument was seen.
The retrieval of the video sequences is typically carried out in at least two phases. In the first phase, an image of the instrument to be tracked is typically taken and its features are extracted. In the second phase, the video is typically split into segments to find the region of interest (ROI) and thereby reduce the region of search. Finally, the color and shape based features of the video and the object are matched to track and retrieve the required instrument. Typically, the system recognizes the instrument in 91% of cases and does not give any false alarms.
Still further, this invention envisages a method by which both an automatic and a semi-automatic approach for query-image-based video retrieval are made available. Particularly, retrieval of video images based on the image of the object (query image) can be performed automatically, or with manual intervention by taking the user's feedback into account.
Also, within a shot, thumbnails (representative frames) are generated for users for making search of the desired instrument faster. This invention also provides a tool which can split the video sequence into smaller units that are taken in a continuous camera recording and contain similar information. These smaller and manageable units are called shots. Typically, users can provide relevancy feedback by seeing the thumbnails of key frames of the shots for the retrieved videos and accordingly select the relevant video.
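Generating a fixed-width thumbnail for a key frame is straightforward; a minimal sketch follows (the 160-pixel target width is an arbitrary choice, not a value from the specification).

import cv2

def make_thumbnail(key_frame_bgr, width=160):
    """Reduced-size version of a key frame, used for browsing and relevance feedback."""
    h, w = key_frame_bgr.shape[:2]
    height = max(1, int(h * width / w))
    return cv2.resize(key_frame_bgr, (width, height), interpolation=cv2.INTER_AREA)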
The invention can specifically be used for retrieval of video sequences which can be a part of compressed or un-compressed videos.
Referring to the figures, the system of video sequencing can be broadly classified into two parts as seen in Figure 1 :
1. Video annotation as represented by block 100 in Figure 1; and
2. Video retrieval means based on query image as represented by block 102 in Figure 1.
Typically, users will upload the videos of all the surgeries and operations and these will be stored in a first database as represented by 104 in Figure 2. The first database can store videos in AVI/WMV/MPEG/H.264 formats. The first database 104 acts as a raw database for storing all the video recordings. These recordings are used later for retrieving the video sequences matching the image of the desired objects, typically medical instruments. Users will also be given the provision for adding new videos in the first database 104.
Typically, the still pictures uploaded by users will be pictures of the objects like surgical instruments that users want to track and retrieve. Still pictures in JPEG/BMP/TIFF formats will be accepted by the system. These images will enable users to select the image of the object they wish to track in the videos.
These still pictures are stored in a third database represented by block 116 in Figure 2. Thus, the third database 116 acts as a database containing still pictures of the instruments and objects. In accordance with the invention, the third database will typically contain images of surgical instruments including scalpel, cautery, surgical tongs, long stem scissors, short stem scissors, stitching needle, gauze, swab and cotton, gloves and the like. These images of pre-determined objects stored in the third database 116 are provided to the user interface represented by block 118 in Figure 2 and also to the first extraction means represented by block 110 in Figure 2. Provision has also been given to update the third database to add new images of objects.
The still pictures and the videos stored in the first and the third databases will be indexed using an indexing means as represented by block 106 in Figure 2. These indexed videos and still pictures are further passed for performing video annotation 100.
Video Annotation: During video annotation 100, still pictures and/or videos are received as input. Still pictures and/or videos in both compressed and uncompressed form are accepted for video annotation 100. Typically, the still pictures and/or videos can be uploaded, provided in real time through a camera feed, and the like. The uploaded still pictures are typically used for extracting images of objects that will be stored as templates. These templates can be used for selecting the objects to be searched.
According to an embodiment of the present invention, the process of video annotation 100 can typically be performed offline at a pre-processing stage, real-time or in parallel with the video retrieval process.
The first step in video annotation includes segmentation, wherein the segmentation means represented by block 108 in Figure 2 receives the indexed videos and splits them into contiguous frames grouped into one or more shots. Splitting videos into shots enables dividing the whole video into groups of similar frames or video sequences from one camera recording.
The segmentation means 108 includes: i) Shot transition detection means as represented by block 132 in Figure 3; and ii) Shot detection means as represented by block 134 in Figure 3.
i) Shot Transition detection: Shot transition detection means 132 senses the transition from one shot to the subsequent one and improves the quality of the shots by reducing the number of places erroneously declared as shot breaks. In accordance with an embodiment of the present invention, shot transition detection performs the following functions:
• receives the indexed still pictures and/or videos from the first database 104, and further organises the still pictures and/or videos into a hierarchical structure;
• splits the video sequences into shots;
• senses the transition from one shot to the subsequent one. A transition occurs if there is a significant difference between two frames.
• The present invention improves the shot transition by reducing the number of places erroneously declared as shot breaks using block colour histogram.
Typically, three types of shot boundaries are recognized:
• Cut: A cut or hard cut is a sharp boundary between shots. This generally implies a peak in the difference between the color or motion histograms corresponding to the two frames surrounding the cut.
• Dissolve: A dissolve is a boundary between shots where the content of the last images of the first shot is continuously mixed with that of the first images of the second shot. Fade-in and fade-out effects are special cases of dissolve transitions where the first or the second scene, respectively, is a dark frame.
• Wipe: A wipe is a boundary between shots in which the images of the second shot continuously cover or push out of the display (coming from a given direction) those of the first shot.
Typically, the invention uses a multi-factorial approach to determine the shot boundary, which makes a decision to identify the shot boundary from multiple parameters. The parameter list contains:
• Formation of one dimensional quantized block feature in HSV domain;
• Creating a color histogram;
• Computing similarity between blocks using Bhattacharya's distance or any other probabilistic distance;
• Edge detection and filtering using Laplacian of Gaussian or Gaussian;
• Normalization of the edge strength;
• Computation of the homogeneity factor in terms of edge;
• Normalization of the histogram;
• Calculating the runtime mean and variance of histogram difference; and
• Declaring the shot boundary based on the number of differing blocks, or depending on the runtime mean and standard deviation using di ≥ m + p * σ, where di is the histogram distance.
The invention further involves identifying the key frames in the shots by performing the below steps:
• Calculating the Shannon entropy of every frame within the shot; and
• Determining the peaks of the set of entropy values within the shot to identify representative frames that contain the most information within the shot.
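By way of illustration only, a minimal sketch of this key-frame selection follows. It assumes grayscale frames supplied as NumPy arrays and uses a 256-bin intensity histogram as the distribution for the Shannon entropy; both choices are illustrative assumptions, as the description does not fix them.

```python
import numpy as np

def shannon_entropy(frame):
    """Shannon entropy of a grayscale frame, from its 256-bin intensity histogram."""
    hist, _ = np.histogram(frame, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]                                 # empty bins contribute nothing
    return float(-np.sum(p * np.log2(p)))

def key_frames(shot_frames, num_keys=1):
    """Indices of the frames whose entropy peaks within a shot."""
    entropies = np.array([shannon_entropy(f) for f in shot_frames])
    return list(np.argsort(entropies)[::-1][:num_keys])
```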
After the shots have been improved and the shot boundaries and key frames have been accurately identified, the shots within the videos are indexed using a shot detection system 134.
ii) Shot detection means: Shot detection system 134, typically is used to index video data intelligently and automatically.
The segmented videos are further passed to a first extraction means. The first extraction means 110 receives the segmented videos from the segmentation means and the images of pre-determined objects from the third database 116 and extracts the features of predetermined objects from the key frames. Features including color, edge gradient and compactness of the pre-determined objects are extracted from the key frames and passed to the annotating means as represented by block 112 in Figure 2.
The annotating means 112 receives the extracted features from the first extraction means 110 and annotates the extracted features with respect to the key frames. The features of the annotated frames are converted to XML format and stored, using a storing means represented by block 130 in Figure 2, in a second database as represented by block 114 in Figure 2.
The second database 114 is an XML database which stores the extracted features of the key frames with respect to the pre-determined objects. The second database is the working database of the system.
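By way of illustration only, since the filing does not publish the XML schema of the second database, the following sketch uses made-up element and attribute names to show how annotated key-frame features might be serialized:

```python
import xml.etree.ElementTree as ET

def store_annotation(video_id, shot_id, frame_no, features, path):
    """Write one annotated key frame to an XML file (hypothetical schema)."""
    root = ET.Element("annotation", video=video_id, shot=str(shot_id))
    frame = ET.SubElement(root, "keyframe", number=str(frame_no))
    for obj_name, feats in features.items():              # e.g. {"scalpel": {...}}
        obj = ET.SubElement(frame, "object", name=obj_name)
        for fname, fvalue in feats.items():                # e.g. color, edge gradient, compactness
            ET.SubElement(obj, "feature", name=fname).text = str(fvalue)
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

# Example: store_annotation("surgery_01", 3, 1287, {"scalpel": {"compactness": 0.82}}, "frame_1287.xml")
```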
The contents of the second database 114 and the third database 116 are provided for the process of video retrieval based on query image.
Video retrieval means based on query image: Video retrieval means based on query image (VRBQI) as represented by block 102 in Figure 1, according to an embodiment of the present invention, receives the query image provided by users, extracts the image features, compares the extracted features with the stored extracted features in the second database 114 to retrieve frames matching the features of the query image.
Typically, the tracking and searching of an object in the video sequences of a video is performed using Fuzzy C-Means or probabilistic tracking techniques such as a Bayesian classifier or condensation algorithms. VRBQI 102 presents the output in the form of a ranked output based on a fuzzy matching score.
The first step in the video retrieval process typically includes receiving the user's input in the form of a query image. The query image is the image of an object which the user wishes to track in the videos. The user input is typically provided through the user interface 118.
In accordance with an embodiment of the present invention, the user interface 118 receives the input query image from users. Users can provide the query image to the user interface 118 by drag and drop, by browsing through the images of pre-determined objects in the third database 116, by upload and the like. The user interface 118 provides the query image to a second extraction means, represented by block 120 in Figure 2. Typically, the second extraction means extracts at least one of the features, including color and compactness, of the image received from the user interface 118 and passes the extracted features to a matching engine represented by block 122 in Figure 2.
Matching engine, 122, matches the extracted features of the query image with the features of the key frames stored in the XML files. The matched key frames are further passed to a locating means represented by block 124 in Figure 2.
Locating means 124 typically receives the matched frames from the matching engine 122 and locates the video for the frames in the first database 104. It then retrieves the matched frames from the video and converts the retrieved frames to a compatible format for display. The matched frames are also sent to a thumbnail generator represented by block 126 in Figure 2. The thumbnail generator 126, in accordance with an embodiment of the present invention typically receives the matched frames from the matching engine 122 and generates thumbnails for the respective frames (representative frames) and passes them to a displaying means represented by block 128 in Figure 2.
Displaying means 128 displays thumbnails of the key frames in the video sequences. These thumbnails help users to identify whether the retrieved frames are relevant so that they can play only the video sequences which are found relevant. The displaying means typically consists of a media player for playing the video sequences selected by users.
Typically, during laparoscopic surgery there are cameras recording the movement of the instruments inside the abdomen of the patient. These live camera recordings are fed to the present invention. If the surgeon wishes to track the position of the scissors, in accordance with this invention he will select the image of the scissors from the pre-determined objects and provide the image of the scissors as an input to the system. The system will then match the features of the image of the scissors provided by the surgeon with the stored annotated frames and show the thumbnails of the retrieved frames on the display. On clicking on the relevant thumbnail, the surgeon can play the video sequence where the scissors were last seen and accordingly proceed with the surgery.
Similarly, the invention will also prove useful to medical instrument manufacturers. Typically, on providing the image of a particular surgical instrument to the present invention, they can track and retrieve all the video sequences where that instrument, for example a 'scalpel', was used. This will enable manufacturers to see the usage of the instrument and further carry out research for improving the instruments.
The various techniques used for shot detection and feature extraction are explained below:
I. Color based feature extraction from query image: In this step, typically, three separate histograms for the y, u and v components are plotted. Based on the plotted histograms, the points with non-zero frequency in the neighborhood of the peak are located. We denote them as the lower threshold (lt) and the upper threshold (ut), as shown in Figure 4.
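By way of illustration only, a minimal sketch of this threshold extraction follows. It assumes the query image is available as separate Y, U and V planes and interprets the "neighborhood of the peak" as the contiguous run of non-zero histogram bins around the maximum; that interpretation and the 256-bin histograms are assumptions.

```python
import numpy as np

def channel_thresholds(channel):
    """Lower (lt) and upper (ut) thresholds around the histogram peak of one plane."""
    hist, _ = np.histogram(channel, bins=256, range=(0, 256))
    peak = int(np.argmax(hist))
    lt = peak
    while lt > 0 and hist[lt - 1] > 0:       # walk left over non-zero bins
        lt -= 1
    ut = peak
    while ut < 255 and hist[ut + 1] > 0:     # walk right over non-zero bins
        ut += 1
    return lt, ut

def query_thresholds(y, u, v):
    """Per-channel (lt, ut) pairs for the query object."""
    return {name: channel_thresholds(plane) for name, plane in (("y", y), ("u", u), ("v", v))}
```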
II. Determination of compactness threshold from query image:
For compactness measurement the circularity parameter is selected as a feature, as it is invariant under scaling and translation. The boundary of the query image is extracted (giving the boundary pixel count b) and the number of pixels containing information about the actual instrument to be searched is counted (giving c). As the video may contain the instrument in any orientation, the compactness parameter is used as a feature in this application. The threshold value (tc) for compactness is determined as tc = 0.5 * b * b.
III. Segmentation of input video based on lt and ut: The input video stream comes in compressed H.264 format. Initially that compressed video is decoded into YUV 4:2:0 format (y, u and v components per pixel). Then each pixel of each frame of the video is labeled 0 or 1 depending upon the following criterion, thus obtaining a binary video frame p:
p(i, j) = 1 if the y, u and v components of the pixel lie within their respective [lt, ut] thresholds
= 0 otherwise
where i and j represent the row and column position respectively.
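By way of illustration only, a minimal sketch of this labelling step is given below. It assumes the decoded Y, U and V planes have already been brought to the same resolution (YUV 4:2:0 stores U and V at quarter resolution) and that the per-channel thresholds come from the query-image step above; testing all three channels against their thresholds is an interpretation of the criterion, since the original equation image is not reproduced here.

```python
import numpy as np

def binarize_frame(y, u, v, thresholds):
    """Label a pixel 1 when every YUV component falls inside its [lt, ut] range.

    thresholds: dict like {"y": (lt, ut), "u": (lt, ut), "v": (lt, ut)}.
    Returns the binary frame p with the same shape as the Y plane.
    """
    mask = np.ones_like(y, dtype=np.uint8)
    for name, plane in (("y", y), ("u", u), ("v", v)):
        lt, ut = thresholds[name]
        mask &= ((plane >= lt) & (plane <= ut)).astype(np.uint8)
    return mask
```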
IV. Edge gradient based Noise Removal using K-means algorithm: In the next phase the edges are obtained from the intensity video frame (pi) using the Sobel operator. Sobel gives the edge gradients, which are stored in an array (pe). Now among these points, some lead to sharp edges and some represent weak edges. In order to reduce the edges arising due to noise in the input video or poor lighting conditions during recording of the video, we discard the weak edges. The K-means algorithm is used to obtain the threshold. The pseudo code of this module is as follows:
Apply the K-means algorithm with k = 2 on (pe), which in turn classifies each pixel of (pe) into one of the two clusters C1 and C2. Reassign the values of (pe) as:
p'(i, j) = 1 if pe belongs to the cluster whose centroid is MAX(cn1, cn2)
= 0 if pe belongs to the cluster whose centroid is MIN(cn1, cn2)
where cn1 and cn2 are the centroids of C1 and C2 respectively.
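By way of illustration only, a minimal sketch of this step follows, using OpenCV's Sobel operator and a two-cluster k-means on the gradient magnitudes; the iteration count and termination criteria are illustrative assumptions.

```python
import numpy as np
import cv2

def strong_edges(gray):
    """Keep only the Sobel edge pixels assigned to the higher-centroid k-means cluster."""
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    pe = cv2.magnitude(gx, gy)                          # edge gradient array
    samples = pe.reshape(-1, 1).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, centers = cv2.kmeans(samples, 2, None, criteria, 3,
                                    cv2.KMEANS_RANDOM_CENTERS)
    strong = int(np.argmax(centers))                    # cluster with the larger centroid
    return (labels.reshape(pe.shape) == strong).astype(np.uint8)
```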
V. Removal of undesirable components using Component Labeling:
In the next step the following operations are performed on the binary frame to reduce the search complexity. Apply connected component analysis to label each pixel of the binary frame, which gives the labelled frame pl. For each component (comp) find the pixel count (c_comp). Compute the bounding box (b_comp) for each component. Find the component density (d_comp), which gives the number of pixels per unit area inside the bounding box for that particular component:
d_comp = c_comp / b_comp
From observation it is found that the object has a high compactness and thus gives a high component density.
Now we filter the labelled array pl based on d_comp and tc:
p''(i, j) = 1 if d_comp > tc
= 0 otherwise
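By way of illustration only, a minimal sketch of this component filter follows, using OpenCV's connected-component statistics; the connectivity value and the way the density threshold tc is applied are assumptions consistent with the description above.

```python
import numpy as np
import cv2

def filter_by_density(binary, tc):
    """Keep only components whose pixel count per bounding-box area exceeds tc."""
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary.astype(np.uint8),
                                                           connectivity=8)
    keep = np.zeros_like(binary, dtype=np.uint8)
    for comp in range(1, n):                                     # label 0 is the background
        c_comp = stats[comp, cv2.CC_STAT_AREA]                   # pixel count
        b_comp = stats[comp, cv2.CC_STAT_WIDTH] * stats[comp, cv2.CC_STAT_HEIGHT]
        if b_comp and c_comp / b_comp > tc:                      # component density d_comp
            keep[labels == comp] = 1
    return keep
```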
VI. Detection of Object of Interest based on linear shape feature: The last step involves detection of the object of interest based on the linear shape of the query image as described in the observation section. To obtain the degree of linearity of the object boundary, the Hough transform is applied to the boundary of each component.
Let Ni be the number of points in the boundary of the i-th component and Hi be the number of boundary points belonging to the line produced by the Hough transform. We compute
Ri = Hi / Ni, for i = 1, 2, ..., n
where n is the total number of components in p''.
As all the instruments are typically linearly shaped, Ri should ideally be 1. Some tolerance is imposed for noise. A threshold T is taken so that a component is retained if Ri ≥ T and rejected (set to 0) otherwise. In this experiment T = 0.5 is taken, i.e. if Hi is less than 50% of Ni then the component is treated as noise.
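By way of illustration only, a minimal sketch of this linearity test follows. It assumes each component's boundary is available as a binary image, takes the first (highest-vote) line returned by OpenCV's standard Hough transform as the dominant line, and counts boundary points lying within a small distance tolerance of it; the vote threshold and distance tolerance are illustrative assumptions.

```python
import numpy as np
import cv2

def linearity_ratio(boundary, dist_tol=1.5):
    """R = (# boundary points lying on the dominant Hough line) / (# boundary points)."""
    pts = np.column_stack(np.nonzero(boundary))        # (row, col) boundary points
    if len(pts) == 0:
        return 0.0
    lines = cv2.HoughLines(boundary.astype(np.uint8), 1, np.pi / 180, threshold=10)
    if lines is None:
        return 0.0
    rho, theta = lines[0][0]                           # first (strongest) line
    # distance of each point (x = col, y = row) from the line x*cos(t) + y*sin(t) = rho
    d = np.abs(pts[:, 1] * np.cos(theta) + pts[:, 0] * np.sin(theta) - rho)
    return float(np.sum(d <= dist_tol)) / len(pts)

def is_instrument_shaped(boundary, T=0.5):
    """Components with linearity ratio below T are treated as noise."""
    return linearity_ratio(boundary) >= T
```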
Shots can be categorized in two ways:
• Hard cut, which is a drastic change between two frames; and
• Soft cut, which includes fade-in, fade-out, dissolves, etc. In this approach a hierarchical system is designed to detect the type of cut and also locate the position of the cut in one pass.
An 8x8 block grid is taken over every color frame. For every block we convert the pixels from RGB to HSV color space to have separate information on color, intensity and impurity. We construct a histogram for each block. As the color, intensity and impurity ranges are very large, we use a quantization scheme. The quantization scheme can be summarized as follows:
Let h, s and v be the HSV domain color values, with s and v normalized to [0, 1], for a particular R, G and B value, and let index be the quantized bin index. Then:
• a pure black area is found where v ∈ [0, 0.2], giving index = 0;
• a gray area is found where s ∈ [0, 0.2] and v ∈ [0.2, 0.8], giving index = ⌊(v − 0.2) × 10⌋ + 1;
• a white area is found where s ∈ [0, 0.2] and v ∈ [0.8, 1.0], giving index = 7;
• the color area is found where s ∈ [0.2, 1.0] and v ∈ [0.2, 1.0], for the different h values.
The quantization of the h, s and v values in the color area is done as follows. Let Hindex, Sindex and Vindex be the quantized indices for the h, s and v values respectively (the level boundaries defining Hindex, Sindex and Vindex are given in the corresponding quantization tables of the filing). Finally the histogram bin index is given by the equation
index = 4 * Hindex + 2 * Sindex + Vindex + 8
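By way of illustration only, a minimal sketch of constructing such a quantized block histogram follows. The exact level boundaries for Hindex, Sindex and Vindex are not reproduced above, so the hue/saturation/value splits below (8, 2 and 2 levels) and the resulting 40-bin histogram size are purely illustrative assumptions.

```python
import numpy as np
import cv2

def quantized_index(h_deg, s, v):
    """Bin index for one pixel: black/gray/white bins 0-7, color bins from 8 upward.
    The 8/2/2 level splits for hue, saturation and value are illustrative assumptions."""
    if v < 0.2:                                   # pure black area
        return 0
    if s < 0.2:
        if v <= 0.8:                              # gray area
            return int((v - 0.2) * 10) + 1
        return 7                                  # white area
    h_index = int(h_deg / 45) % 8                 # assumed 8 hue levels
    s_index = 0 if s < 0.65 else 1                # assumed 2 saturation levels
    v_index = 0 if v < 0.7 else 1                 # assumed 2 value levels
    return 4 * h_index + 2 * s_index + v_index + 8

def block_histogram(bgr_block):
    """Quantized color histogram of one 8x8 block (normalized to sum to 1)."""
    hsv = cv2.cvtColor(bgr_block, cv2.COLOR_BGR2HSV).reshape(-1, 3).astype(np.float32)
    hist = np.zeros(40)
    for h, s, v in hsv:
        # OpenCV stores H in [0, 180) and S, V in [0, 255] for 8-bit images
        hist[quantized_index(h * 2.0, s / 255.0, v / 255.0)] += 1
    return hist / hist.sum()
```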
In the above way we find the quantized color histogram of each 8x8 block. In videos there will be motion between consecutive frames, so we cannot compare a current frame block with the previous frame block at the same position. Instead, we search for the correspondence of a present frame block among the previous frame block and its 8 neighbouring blocks.
The distance between two probability distributions can typically be computed using Bhattacharya's distance function. The distance metric we have used is
d(P, Q) = sqrt(1 − Σi sqrt(P(i) * Q(i)))    (1)
where P and Q are probability distribution functions. The proof that equation (1) is a metric can be found in ().
If the distance between the two histograms is less than T then we say that the blocks are identical; otherwise they are different.
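By way of illustration only, a minimal sketch of this distance on two normalized block histograms follows, using the metric form of equation (1) as reconstructed above.

```python
import numpy as np

def bhattacharyya_distance(p, q):
    """Metric distance between two discrete probability distributions p and q."""
    bc = np.sum(np.sqrt(p * q))             # Bhattacharyya coefficient
    return np.sqrt(max(0.0, 1.0 - bc))      # clamp guards against floating-point round-off

# Two blocks are treated as identical when bhattacharyya_distance(p, q) < T.
```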
The distance for every block of the image with respect to the previous frame is computed. For a hard cut, the block difference image has many blocks differing from the previous frame, and these differing blocks should form objects of considerable area. The connected component algorithm is used to find all connected components in the difference image.
Let Ai be the area of the i-th component of the difference image. We apply area thresholding to remove isolated differing blocks. An object having an area of less than one macro-block can be removed:
Ai = 0 if Ai < 64, i.e. only a single block differs and not a single neighbour differs. We declare a hard cut if
∑Ai > T
i.e. if a considerable number of connected macro-blocks differ between two consecutive frames, we declare a hard cut.
For a soft cut the above method becomes difficult, as there is no major difference between the blocks of the previous and current frames. By soft cut we mainly mean fade-in, fade-out and dissolve.
During investigation we noticed that, during a soft cut, two important characteristics of the image differ between the previous and current frames:
• Edge strength
• Gray value
We combine these two features to capture the soft cut.
To find the edge strength, the image should first be de-noised, as noise can introduce erroneous results. Typically, a LoG (Laplacian of Gaussian) filter can be applied to detect edges in the image. The equation describing a LoG filter is:
LoG(x, y) = −(1 / (π * σ^4)) * [1 − (x^2 + y^2) / (2 * σ^2)] * exp(−(x^2 + y^2) / (2 * σ^2))
An alternative way to use the LoG filter is to apply a Gaussian mask over the image before applying the edge detection (Laplacian) operator. This pre-processing step reduces the high-frequency noise components prior to the differentiation step.
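By way of illustration only, a minimal sketch of this Gaussian-then-Laplacian edge-strength computation follows; the value of sigma is an illustrative assumption.

```python
import numpy as np
import cv2

def edge_strength(gray, sigma=1.4):
    """Smooth with a Gaussian, then apply the Laplacian; return |response| as the edge image I."""
    smoothed = cv2.GaussianBlur(gray.astype(np.float32), (0, 0), sigma)
    return np.abs(cv2.Laplacian(smoothed, cv2.CV_32F))

def normalized_edge_strength(gray):
    """I' = I / max(I), the normalized edge strength used for the homogeneity image."""
    I = edge_strength(gray)
    m = I.max()
    return I / m if m > 0 else I
```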
The above procedure gives us the edge image I of the gray frame F.
Now we normalize the edge image I to I' with the maximum edge value. So the normalized edge strength I' is
I'(i, j) = I(i, j) / L
where
L = Max(I(i, j)), i = 0, 1, 2, ..., M; j = 0, 1, 2, ..., N
We compute the homogeneity factor at every I'(i, j). Let H be the homogeneity image of I'; then
H(i, j) = 1 − I'(i, j), i = 0, 1, 2, ..., M; j = 0, 1, 2, ..., N    (2)
One thing to notice in equation (2) is that if I'(i, j) is large, i.e. the roughness is large, the homogeneity is low. As our interest lies in the roughness of the image we neglect the homogeneous region, i.e.
H(i, j) = 0 if H(i, j) < T
In our experiment we have taken T = 0.8, i.e. roughness less than 80% is removed. The association of the normalized edge strength with the grayness of frame F is defined by a histogram Hist as follows:
Hist(g) = Σi,j H(i, j) * δ(F(i, j), g)
where δ(i, j) = 1 if i = j, and δ(i, j) = 0 if i ≠ j. Then the histogram is normalized to [0, 1]. The normalized histogram NHist(i) is
NHist(i) = Hist(i) / Σk Hist(k)
These normalized histograms of the previous and current frames are calculated, and Bhattacharya's distance function as in equation (1) is used to measure the difference between the two frames.
As a soft cut occurs very slowly we need to follow the difference in every frame. In every frame the running average m and the standard deviation σ are calculated:
m = (1 / n) * Σi di,    σ = sqrt((1 / n) * Σi (di − m)^2)
where di is the distance in the i-th frame. A soft cut occurs if
di ≥ m + p * σ
In our experiment p = 2.5 is taken.
Once a soft cut occurs we re-initialize the system, put m = 0 and σ = 0, and store the difference value at the soft cut in mdiff, i.e. mdiff = di. The m and σ calculation then starts afresh after the soft cut, but it is restarted only once the mean value m < mdiff. We put in this constraint because after a soft cut the system usually remains unstable and there is a fluctuation in the difference values. To avoid that, we compute the parameters again only when the system is becoming stable, i.e. when the difference between frames is coming down. Figure 5 and Figure 6 describe the difference (di) values in different frames.
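By way of illustration only, a minimal sketch of this running-statistics soft-cut rule follows, with p = 2.5 as in the experiment. The warm-up length before m and σ are trusted, and the exact way the calculation is paused until the distances fall back below mdiff, are interpretations of the description rather than details fixed by it.

```python
import numpy as np

class SoftCutDetector:
    """Declare a soft cut when the current histogram distance exceeds m + p * sigma."""

    def __init__(self, p=2.5, warmup=10):
        self.p = p
        self.warmup = warmup          # assumed number of frames before m, sigma are trusted
        self.distances = []
        self.mdiff = None             # distance value stored at the last detected cut

    def update(self, di):
        # after a cut, wait until the distances settle below mdiff before restarting
        if self.mdiff is not None:
            if di >= self.mdiff:
                return False
            self.mdiff = None
        if len(self.distances) >= self.warmup:
            m = np.mean(self.distances)
            sigma = np.std(self.distances)
            if di >= m + self.p * sigma:
                self.mdiff = di       # remember the difference value at the cut
                self.distances = []   # re-initialize m and sigma
                return True
        self.distances.append(di)
        return False
```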
Typically, the method of video annotation comprises the following steps as seen in Figure 7:
• Uploading videos in a first database and uploading still pictures in a third database, represented by block 1000;
• indexing the still pictures and/or videos in said first and third databases, represented by block 1002;
• segmentation of videos into frames, represented by block 1004;
• extracting features of pre-determined objects from the frames, represented by block 1006;
• annotating the extracted features with respect to the frames, represented by block 1008;
• storing information with respect to extracted features and the annotated frames in a second database, represented by block 1010;
Typically, the method of video retrieval comprises the following steps as seen in Figure 8:
• Receiving the image of at least one object from said pre-determined objects from users, represented by block 1012;
• extracting the features of the received image, represented by block 1014;
• matching the extracted features of the received image with the stored extracted features in said second database to identify annotated frames containing an object having said extracted features, represented by block 1016;
• locating at least one video of the matched annotated frame in the first database, represented by block 1018;
• retrieving relevant matched frames from the video and converting the retrieved frames to a compatible format, represented by block 1020;
• displaying the thumbnail of the video sequence for the matched frames, represented by block 1022; and
• receiving users' feedback on the relevance of the displayed thumbnail of the video sequence, represented by block 1024.
The technical advancements of the present invention include:
• Providing a process for content based retrieval of video sequences from an image and video database.
• Providing fast and accurate retrieval of video sequences.
• Providing a process for content based image retrieval which can be applied to both compressed and uncompressed images alike.
• Providing a means for user interaction by allowing users to filter the response of the system.
While considerable emphasis has been placed herein on the components and component parts of the preferred embodiments, it will be appreciated that many embodiments can be made and that many changes can be made in the preferred embodiments without departing from the principles of the invention. These and other changes in the preferred embodiment as well as other embodiments of the invention will be apparent to those skilled in the art from the disclosure herein, whereby it is to be distinctly understood that the foregoing descriptive matter is to be interpreted merely as illustrative of the invention and not as a limitation.

Claims:
1. A system for content based retrieval of video sequences, said system comprising:
• at least a first database for uploading still pictures and/or videos;
• a user interface adapted to receive an image of at least one object from the predetermined objects; and
• a processor for processing said received image, still pictures and/or videos, said processor comprising: a) indexing means for indexing the still pictures and videos in said first database; b) segmentation means for segmentation of said indexed videos into frames; c) first extraction means for extracting features of pre-determined objects from the frames; d) annotating means for annotating the extracted features with respect to said frames; e) storing means for storing information with respect to extracted features and said annotated frames in a second database; f) a second extraction means for extracting features of said received image; g) a matching engine adapted to match the extracted features of said received image with the stored extracted features in said second database to identify annotated frames containing an object having said extracted features; and h) locating means for locating at least one video of said matched annotated frame in said first database and further adapted to retrieve relevant matched frames from said video and convert the retrieved frames to a compatible format for display purposes;
• displaying means for displaying the video sequence of the matched frames.
2. A system as claimed in claim 1 wherein said second database is adapted to store the annotated frames of pre-determined objects in XML format.
3. A system as claimed in claim 1 wherein said first database comprises updating means for:
1) adding new still pictures and videos in the first database; and
2) on an update to the first database, means adapted to extract, annotate and store the pre-determined objects in said videos in said second database.
4. A system as claimed in claim 1 wherein said system comprises a third database for storing still pictures of pre-determined objects and further adapted to transfer the images of pre-determined objects to the user interface.
5. A system as claimed in claim 1 wherein said segmentation means is adapted to split still pictures and videos into contiguous frames grouped into one or more shots and is further adapted to identify one or more key frames in the shots.
6. A system as claimed in claim 1 wherein said segmentation means includes a shot detection means to index video data intelligently and automatically.
7. A system as claimed in claim 1 wherein said segmentation means has a shot transition detection means for sensing the transition from one shot to the subsequent one and is adapted to improve the shots by reducing the number of places erroneously declared as shot breaks.
8. A system as claimed in claim 1 wherein said segmentation means has a shot transition detection means further adapted to detect and improve hard and soft cuts in said videos.
9. A system as claimed in claim 1 wherein said segmentation means has a shot transition detection means still further adapted to detect and improve soft cuts including fade-in, fade-out, dissolves and wipes in said videos.
10. A system as claimed in claim 1 wherein the first extraction means is adapted to extract at least one feature from the image of a pre-determined object selected from a group of features consisting of color, linear shape, edge gradient and compactness.
11. A system as claimed in claim 1 wherein the first database is adapted to store said still pictures in JPEG/BMP/TIFF formats.
12. A system as claimed in claim 1 wherein the first database is adapted to store said videos in AVI/WMV/MPEG/H.264 formats.
13. A system as claimed in claim 1 wherein said displaying means consists of a media player for playing the matched frames of the marked objects.
14. A system as claimed in claim 1 wherein said system includes feedback means adapted to receive relevancy feedback from users.
15. A system as claimed in claim 1 wherein said system comprises means for generating thumbnails of the annotated key frames.
16. A method for content based retrieval of video sequences comprising the following steps:
• uploading still pictures and/or videos in a first database;
• indexing the still pictures and/or videos in said first database;
• segmentation of said indexed videos into frames;
• extracting features of pre-determined objects from said frames;
• annotating the extracted features with respect to said frames;
• storing information with respect to extracted features and said annotated frames in a second database;
• receiving an image of at least one object from said pre-determined objects from the users;
• extracting the features of the received image;
• matching the extracted features of the received image with the stored extracted features in said second database to identify annotated frames containing an object having said extracted features;
• locating at least one video of the matched annotated frame in the first database;
• retrieving relevant matched frames from the video and converting the retrieved frames to a compatible format for display;
17. A method as claimed in claim 16 wherein the step of segmentation comprises the following steps: a) detection of shots in said videos; b) sensing the transition from one shot to another; c) sensing the boundaries between the shots; and d) improving the shots by reducing the number of places erroneously declared as shot breaks.
18. A method as claimed in claim 16 wherein the step of segmentation includes the step of detection of shots which comprises the following steps: a) detection of the object within the frames based on the linear shape of the image of pre-determined objects; b) constructing an 8x8 block of every frame, each block having pixels in the RGB color space; c) converting said pixels from RGB to HSV color space for every block and constructing a color histogram; and d) quantizing said color histogram for each of the blocks.
19. A method as claimed in claim 16 wherein the step of segmentation includes the step of detection of shots wherein the step of constructing color histogram further comprises the following steps: a) computing the distance for every block of said frame with respect to the previous frame; b) finding all similar elements in blocks of frames to determine isolated blocks not containing connected elements; c) removing the isolated blocks in all frames; d) filtering the frames to remove noise; e) normalizing the filtered frames; f) measuring the differences between the filtered frames; and g) detecting shots based on the number of differences calculated between frames.
20. A method as claimed in claim 16 wherein the step of segmentation includes the step of detection of shots wherein the step of detection of the object is performed using Hough transform.
21. A method as claimed in claim 16 wherein the step of segmentation includes the step of detection of shots wherein step of computing distance between blocks is calculated using Bhattacharya distance.
22. A method as claimed in claim 16 wherein the step of segmentation includes the step of detection of shots wherein step of finding all similar elements is performed using connected component labeling.
23. A method as claimed in claim 16 wherein the step of segmentation includes the step of detection of shots wherein step of removing isolated blocks is performed using thresholding.
24. A method as claimed in claim 16 wherein the step of segmentation includes the step of detection of shots wherein step of filtering is performed using Gaussian mask over LOG (laplacian of Gaussian) filter.
25. A method as claimed in claim 16 wherein the step of segmentation includes the step of detection of shots wherein the step of improving shots by reducing the number of places erroneously declared as shot breaks is performed using block color histogram.
26. A method as claimed in claim 16 wherein the step of matching comprises the following steps:
• receiving the extracted features of an image of at least one object input by the users;
• retrieving subsequent annotated frames from said second database
• matching the color based features of the received image with said retrieved frames and providing the matched frames;
• determining the edge points of said matched frames and further removing weak edges from said matched frames;
• determining the pixel density of the matched frames and removing objects based on said pixel density from the matched frames;
• determining the compactness of said matched frames and removing undesirable objects based on the compactness ratio from the matched frames; and
• further retrieving and tracking pre-determined objects in the matched frames satisfying the shape feature of said received images of objects.
27. A method as claimed in claim 16 wherein the step of matching includes the step of determining the edge points of the matched frames by using Sobel operator.
28. A method as claimed in claim 16 wherein the step of matching includes the step of removing weak edges by using K-means.
29. A method as claimed in claim 16 wherein the step of matching includes the step of determining the compactness of said matched frames by using circularity parameter.
30. A method as claimed in claim 16 wherein the step of matching includes the step of removing undesirable objects by using connected component labeling.
31. A method as claimed in claim 16, wherein the step of display includes the step of showing the thumbnail of the video sequence for the matched frames.
32. A method as claimed in claim 31, wherein the steps of displaying thumbnails includes step of accepting relevancy feedback concerning the displayed thumbnail of the video sequence.
PCT/IN2009/000120 2008-02-27 2009-02-24 Content based visual information retrieval systems WO2009113102A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN399/MUM/2008 2008-02-27
IN399MU2008 2008-02-27

Publications (2)

Publication Number Publication Date
WO2009113102A2 true WO2009113102A2 (en) 2009-09-17
WO2009113102A3 WO2009113102A3 (en) 2010-06-10

Family

ID=41065641

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IN2009/000120 WO2009113102A2 (en) 2008-02-27 2009-02-24 Content based visual information retrieval systems

Country Status (1)

Country Link
WO (1) WO2009113102A2 (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8868199B2 (en) 2012-08-31 2014-10-21 Greatbatch Ltd. System and method of compressing medical maps for pulse generator or database storage
US8903496B2 (en) 2012-08-31 2014-12-02 Greatbatch Ltd. Clinician programming system and method
US8983616B2 (en) 2012-09-05 2015-03-17 Greatbatch Ltd. Method and system for associating patient records with pulse generators
US9180302B2 (en) 2012-08-31 2015-11-10 Greatbatch Ltd. Touch screen finger position indicator for a spinal cord stimulation programming device
US9259577B2 (en) 2012-08-31 2016-02-16 Greatbatch Ltd. Method and system of quick neurostimulation electrode configuration and positioning
US9375582B2 (en) 2012-08-31 2016-06-28 Nuvectra Corporation Touch screen safety controls for clinician programmer
US9471753B2 (en) 2012-08-31 2016-10-18 Nuvectra Corporation Programming and virtual reality representation of stimulation parameter Groups
US9507912B2 (en) 2012-08-31 2016-11-29 Nuvectra Corporation Method and system of simulating a pulse generator on a clinician programmer
US9594877B2 (en) 2012-08-31 2017-03-14 Nuvectra Corporation Virtual reality representation of medical devices
US9615788B2 (en) 2012-08-31 2017-04-11 Nuvectra Corporation Method and system of producing 2D representations of 3D pain and stimulation maps and implant models on a clinician programmer
US9767255B2 (en) 2012-09-05 2017-09-19 Nuvectra Corporation Predefined input for clinician programmer data entry
US10007848B2 (en) 2015-06-02 2018-06-26 Hewlett-Packard Development Company, L.P. Keyframe annotation
WO2018156911A1 (en) * 2017-02-24 2018-08-30 Alibaba Group Holding Limited Determining recommended object
CN109804367A (en) * 2016-08-08 2019-05-24 内特拉戴因股份有限公司 Use the distributed video storage and search of edge calculations
WO2019172974A1 (en) * 2018-03-06 2019-09-12 Xanadu Big Data, Llc Methods and systems for content-based image retrieval
US10668276B2 (en) 2012-08-31 2020-06-02 Cirtec Medical Corp. Method and system of bracketing stimulation parameters on clinician programmers
US11657087B2 (en) 2018-03-19 2023-05-23 Verily Life Sciences Llc Surgical video retrieval based on preoperative images

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108271049B (en) * 2017-12-01 2020-09-04 阿里巴巴(中国)有限公司 Resource downloading method and device and computer equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000031653A2 (en) * 1998-11-20 2000-06-02 Koninklijke Philips Electronics N.V. System for retrieving images using a database
US6774917B1 (en) * 1999-03-11 2004-08-10 Fuji Xerox Co., Ltd. Methods and apparatuses for interactive similarity searching, retrieval, and browsing of video
US20070239683A1 (en) * 2006-04-07 2007-10-11 Eastman Kodak Company Identifying unique objects in multiple image collections

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000031653A2 (en) * 1998-11-20 2000-06-02 Koninklijke Philips Electronics N.V. System for retrieving images using a database
US6774917B1 (en) * 1999-03-11 2004-08-10 Fuji Xerox Co., Ltd. Methods and apparatuses for interactive similarity searching, retrieval, and browsing of video
US20070239683A1 (en) * 2006-04-07 2007-10-11 Eastman Kodak Company Identifying unique objects in multiple image collections

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
'Content-Based Access of Image and Video Libraries, 1997. Proceedings. IEEE Workshop on , vol., no., pp.82-89, 20 Jun 1997', article YONG RUI ET AL.: 'A relevance feedback architecture for content-based multimedia information retrieval systems' *
YONG RUI ET AL.: 'Relevance feedback: a power tool for interactive content-based image retrieval' CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, IEEE, TRANSACTIONS ON, [Online] vol. 8, no. 5, September 1998, pages 644 - 655 Retrieved from the Internet: <URL:http://ieeexplore.ieee.org/stamp/stamp .jsp?tp=&arnumber=71851 0&isnumber=15530> *

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9901740B2 (en) 2012-08-31 2018-02-27 Nuvectra Corporation Clinician programming system and method
US10376701B2 (en) 2012-08-31 2019-08-13 Nuvectra Corporation Touch screen safety controls for clinician programmer
US8868199B2 (en) 2012-08-31 2014-10-21 Greatbatch Ltd. System and method of compressing medical maps for pulse generator or database storage
US9180302B2 (en) 2012-08-31 2015-11-10 Greatbatch Ltd. Touch screen finger position indicator for a spinal cord stimulation programming device
US9259577B2 (en) 2012-08-31 2016-02-16 Greatbatch Ltd. Method and system of quick neurostimulation electrode configuration and positioning
US9314640B2 (en) 2012-08-31 2016-04-19 Greatbatch Ltd. Touch screen finger position indicator for a spinal cord stimulation programming device
US9375582B2 (en) 2012-08-31 2016-06-28 Nuvectra Corporation Touch screen safety controls for clinician programmer
US9471753B2 (en) 2012-08-31 2016-10-18 Nuvectra Corporation Programming and virtual reality representation of stimulation parameter Groups
US9507912B2 (en) 2012-08-31 2016-11-29 Nuvectra Corporation Method and system of simulating a pulse generator on a clinician programmer
US9555255B2 (en) 2012-08-31 2017-01-31 Nuvectra Corporation Touch screen finger position indicator for a spinal cord stimulation programming device
US9594877B2 (en) 2012-08-31 2017-03-14 Nuvectra Corporation Virtual reality representation of medical devices
US9615788B2 (en) 2012-08-31 2017-04-11 Nuvectra Corporation Method and system of producing 2D representations of 3D pain and stimulation maps and implant models on a clinician programmer
US10668276B2 (en) 2012-08-31 2020-06-02 Cirtec Medical Corp. Method and system of bracketing stimulation parameters on clinician programmers
US9776007B2 (en) 2012-08-31 2017-10-03 Nuvectra Corporation Method and system of quick neurostimulation electrode configuration and positioning
US8903496B2 (en) 2012-08-31 2014-12-02 Greatbatch Ltd. Clinician programming system and method
US10347381B2 (en) 2012-08-31 2019-07-09 Nuvectra Corporation Programming and virtual reality representation of stimulation parameter groups
US10141076B2 (en) 2012-08-31 2018-11-27 Nuvectra Corporation Programming and virtual reality representation of stimulation parameter groups
US10083261B2 (en) 2012-08-31 2018-09-25 Nuvectra Corporation Method and system of simulating a pulse generator on a clinician programmer
US8983616B2 (en) 2012-09-05 2015-03-17 Greatbatch Ltd. Method and system for associating patient records with pulse generators
US9767255B2 (en) 2012-09-05 2017-09-19 Nuvectra Corporation Predefined input for clinician programmer data entry
US10007848B2 (en) 2015-06-02 2018-06-26 Hewlett-Packard Development Company, L.P. Keyframe annotation
CN109804367A (en) * 2016-08-08 2019-05-24 内特拉戴因股份有限公司 Use the distributed video storage and search of edge calculations
CN109804367B (en) * 2016-08-08 2023-07-04 内特拉戴因股份有限公司 Distributed video storage and search using edge computation
CN108509436A (en) * 2017-02-24 2018-09-07 阿里巴巴集团控股有限公司 A kind of method, apparatus and computer storage media of determining recommended
WO2018156911A1 (en) * 2017-02-24 2018-08-30 Alibaba Group Holding Limited Determining recommended object
US10671851B2 (en) 2017-02-24 2020-06-02 Alibaba Group Holding Limited Determining recommended object
CN108509436B (en) * 2017-02-24 2022-02-18 阿里巴巴集团控股有限公司 Method and device for determining recommended object and computer storage medium
TWI760381B (en) * 2017-02-24 2022-04-11 香港商阿里巴巴集團服務有限公司 Method, device and computer storage medium for determining recommended objects
WO2019172974A1 (en) * 2018-03-06 2019-09-12 Xanadu Big Data, Llc Methods and systems for content-based image retrieval
US11657087B2 (en) 2018-03-19 2023-05-23 Verily Life Sciences Llc Surgical video retrieval based on preoperative images

Also Published As

Publication number Publication date
WO2009113102A3 (en) 2010-06-10

Similar Documents

Publication Publication Date Title
WO2009113102A2 (en) Content based visual information retrieval systems
US11386284B2 (en) System and method for improving speed of similarity based searches
Kurzhals et al. Visual analytics for mobile eye tracking
Borji What is a salient object? A dataset and a baseline model for salient object detection
Li et al. Automatic text detection and tracking in digital video
US6594386B1 (en) Method for computerized indexing and retrieval of digital images based on spatial color distribution
US8326029B1 (en) Background color driven content retrieval
Singh et al. A Machine Learning Model for Content-Based Image Retrieval
Münzer et al. EndoXplore: A Web-Based Video Explorer for Endoscopic Videos
Duan et al. Video shot boundary detection based on feature fusion and clustering technique
Marques et al. An attention-driven model for grouping similar images with image retrieval applications
Devareddi et al. Review on content-based image retrieval models for efficient feature extraction for data analysis
Antani et al. Exploring use of images in clinical articles for decision support in evidence-based medicine
Srivastava et al. An extension of local mesh peak valley edge based feature descriptor for image retrieval in bio-medical images
Wadhai et al. Techniques of content based image retrieval: a review
JP2011053952A (en) Image-retrieving device and image-retrieving method
e Souza et al. Survey on visual rhythms: A spatio-temporal representation for video sequences
Kletz et al. Surgical action retrieval for assisting video review of laparoscopic skills
Seth et al. A review on content based image retrieval
Khokher et al. Image retrieval: A state of the art approach for CBIR
Madhu et al. Depth motion map based human action recognition using adaptive threshold technique
Anh et al. Video retrieval using histogram and sift combined with graph-based image segmentation
Chatur et al. A simple review on content based video images retrieval
Gál et al. Multi-disciplinary modality classification for medical images
Moghaddam et al. Image retrieval with local and spatial queries

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09719948

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 09719948

Country of ref document: EP

Kind code of ref document: A2