WO2012090033A1 - A system and a method for visually aided telephone calls - Google Patents

A system and a method for visually aided telephone calls Download PDF

Info

Publication number
WO2012090033A1
WO2012090033A1 (PCT/IB2010/056151)
Authority
WO
WIPO (PCT)
Prior art keywords
video call
video
call
server
mobile phone
Prior art date
Application number
PCT/IB2010/056151
Other languages
French (fr)
Inventor
Oguz Demirci
Original Assignee
Turkcell Teknoloji Arastirma Ve Gelistirme Anonim Sirketi
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Turkcell Teknoloji Arastirma Ve Gelistirme Anonim Sirketi filed Critical Turkcell Teknoloji Arastirma Ve Gelistirme Anonim Sirketi
Priority to PCT/IB2010/056151 priority Critical patent/WO2012090033A1/en
Publication of WO2012090033A1 publication Critical patent/WO2012090033A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M1/00Substation equipment, e.g. for use by subscribers
    • H04M1/72Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/142Image acquisition using hand-held instruments; Constructional details of the instruments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M2250/00Details of telephonic subscriber devices
    • H04M2250/52Details of telephonic subscriber devices including functional features of a camera

Abstract

This invention relates to a system and a method for visually aided telephone calls using OCR. The system (1) comprises at least one mobile phone (2) which includes at least one camera (21) and a specific button (22) and which initiates a video call, at least one mobile phone (3) which takes a call coming from the other mobile phone (2), at least one carrier network (4) for providing wireless communication between the mobile phones (2 and 3), and at least one server (5) which takes a video call coming from the mobile phone (2), employs an optical character recognition (OCR) algorithm on the transferred video frames using principal component analysis to discriminate characters and to extract, non-limiting, digits in the phone number format, and directs the video call as an audio call to another mobile phone (3) according to the recognized phone number of the video call content.

Description

DESCRIPTION
A SYSTEM AND A METHOD FOR VISUALLY AIDED TELEPHONE
CALLS
Field of the invention
This invention relates to a system and a method for visually aided telephone calls using OCR (Optical character recognition).
Prior art
Pattern recognition is the assignment of some sort of output value to a given input value. In classification, pattern recognition attempts to assign each input value to one of a given set of classes (for example, determining whether a given email is "spam" or "non-spam").
Pattern recognition, however, encompasses other types of output as well. One example is regression, which assigns a real-valued output to each input. Another is sequence labeling, which assigns a class to each member of a sequence of values (for example, part-of-speech tagging, which assigns a part of speech to each word in an input sentence). Another example is parsing, which assigns a parse tree to an input sentence, describing the syntactic structure of the sentence.
Pattern recognition methods generally aim to provide a reasonable answer for all possible inputs and to perform "fuzzy" matching of inputs. Pattern matching methods, in contrast to recognition methods, look for exact matches between the input and pre-existing patterns. A common example of a pattern-matching method is regular expression matching, which looks for patterns of a given sort in textual data and is widely used in text editors and word processors. The difference between pattern recognition and pattern matching is that pattern matching is generally not considered a type of machine learning, although pattern-matching methods can sometimes provide output of similar quality to that of pattern-recognition methods.
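To make the pattern-matching side concrete, here is a minimal sketch that extracts digit sequences in a phone-number-like format from text with a regular expression; the pattern and the assumed number format (optional country code plus ten digits with optional separators) are illustrative assumptions, not part of the invention.

```python
import re

# Hypothetical phone-number pattern: optional country code, then ten digits,
# allowing spaces, dots or dashes as separators (an assumed format).
PHONE_PATTERN = re.compile(r"(?:\+\d{1,3}[\s.-]?)?(?:\d[\s.-]?){10}")

def extract_phone_numbers(text: str) -> list[str]:
    """Return candidate phone numbers found in recognized text."""
    return [m.group().strip() for m in PHONE_PATTERN.finditer(text)]

print(extract_phone_numbers("Call us at +90 532 123 45 67 today"))
# ['+90 532 123 45 67']
```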
Many fields, including psychology, psychiatry, cognitive science and computer science, have studied various recognition methods.
Optical character recognition, described below, is an example that combines pattern recognition and matching methods:
Optical character recognition (OCR) is the mechanical or electronic translation of scanned images of handwritten or printed text into machine-encoded text. OCR is usually used to convert books and documents into electronic files, to computerize a record-keeping system in an office, or to publish the text. OCR makes it possible to edit the text, search for a word or phrase, store it more compactly, display or print a copy free of scanning artifacts, and apply techniques such as machine translation, text-to-speech and text mining to it. OCR is a field of research in pattern recognition.
Microsoft Tag, 2D barcode readers, GetFugu, Kooba, Upcode, Layar, Junaio and Google Goggles are applications that try to recognize barcodes and icons/images, usually in specific formats, and help users access digital content effortlessly. However, none of these products processes video transferred over 3G networks; they perform the processing on the mobile platform instead.
US2006182311A1, US2003012410A1, US2010008265A1, US6700990B1, WO2009114039A1, US7227893B1 and US2006233423A1 are some of the patents examined. Some of these patent documents were written to solve problems like watermarking, surveillance or pose estimation, or describe pattern recognition systems without giving details of the algorithms used and the specific purposes they were designed for. None of these patent documents mentions a telecommunication application connecting a caller that initiated a video call to the other party based on a telephone number or pattern recognized in the scene.
The United States patent number US2008233999 discloses a mobile station device that includes a camera and a calling module that uses information regarding images captured by the camera for automatically placing a call from the mobile station device. An exemplary mobile station device includes a camera that is configured to capture an image of something in the camera's field of vision. A calling module determines a number to call based on the captured image and automatically calls the number. An exemplary method of communicating using a mobile station device equipped with a camera includes capturing an image with the camera. The captured image is used to determine a number and to automatically call that number.
The United States patent number US6594503 discloses a communication device, such as a cellular mobile phone or a cordless phone, that has a dial unit for communication with a base station. An input device for dialing a phone number, which forwards a coded dial signal to the dial unit, is implemented as an optical character recognition (OCR) scanner that reads phone numbers from a printed or hand-written original. Optionally, especially for recognizing hand-writing, the scanner sends graphical representations of the input data to the base station, which obtains the coded dial signals via an external processor.
The Japan patent number JP6311220 discloses a device which can recognize an image showing the shape of a user's lips and thereby makes dialing possible. An image pickup part, a feature extraction part, a memory and a shape recognition part are controlled by a CPU; features are extracted from the image showing the shape of the lips using the image pickup part, feature extraction part, memory and dictionary and are recognized as the corresponding character data; a database is searched with the character data by a telephone number retrieval part, a telephone number corresponding to the character data is read out, and the telephone number is sent out by a sending part. When the image cannot be recognized, the user is notified by a recognition disable output part, and when the telephone number cannot be retrieved from the character data, the user is notified by a retrieval disable output part.
Summary of the invention
The object of the invention is to provide a system and a method that use the OCR ability to recognize phone numbers from the calling subscriber's video call content.
A further object of the invention is to provide a system and a method that direct phone calls to another subscriber according to the phone number recognized in the video call content.
Detailed description of the invention
"A system and a method for visually aided telephone calls" designed to fulfill the objects of the present invention is illustrated in the attached figures, where: Figure 1 - is the schematic view of the system.
Figure 2 - is the flow diagram of the method.
Figure 3 - is the flow diagram of the "extracting phone number to be called from video frames of the video call by using OCR algorithm" step of the method. The parts in the figure are each given a reference numeral where the numerals refer to the following:
1. System
2. Mobile phone
21. Camera
22. Button
3. Mobile phone
4. Carrier network
5. Server
1000. Method
U1. Caller subscriber
U2. Calling subscriber
A system (1) for visually aided telephone calls comprises;
- at least one mobile phone (2) which includes at least one camera (21) and a specific button (22) and which initiates a video call,
- at least one mobile phone (3) which takes a call coming from the other mobile phone (2),
- at least one carrier network (4) for providing wireless communication between the mobile phones (2 and 3),
- at least one server (5) which takes a video call coming from the mobile phone (2), employs an optical character recognition (OCR) algorithm on the transferred video frames using principal component analysis to discriminate characters and to extract, non-limiting, digits in the phone number format and directs the video call as an audio call to another mobile phone (3) according to the recognized phone number of the video call content.
The system (1) applies image processing tools and then connects calling parties, non-limiting, via audio calls to endpoints defined by printed phone numbers or patterns. This is achieved by processing the video call content using image processing algorithms.
The server (5) employs an optical character recognition (OCR) algorithm on the transferred video frames using principal component analysis to discriminate characters and to extract, non-limiting, digits in the phone number format.
Mobile phone (2) has a specific button (22) for initiating a video call towards server (5).
Therefore, the system (1) enables a mobile phone (2) user (U1) to initiate a video phone call to a certain service number using a specific button (22); the call is then directed to a certain mobile phone (3) based on the findings of the pattern recognition and optical character recognition operations.
Using the system (1) and a specific button (22), even blind users (U1) can initiate phone calls without having to read and dial the telephone numbers or perceive the figures printed in posters. The user (U1) can easily initiate a video call using the previously defined button (22) and direct his camera (21) to a scene containing either a figure or a printed telephone number. In the first scenario, the figure is recognized using the pattern recognition algorithm and an audio call is set up between the calling subscriber (U2) and the number associated with the identified figure. In the second scenario, the telephone number is recognized using the optical character recognition algorithm that uses principal component analysis and an audio call is set up between the two ends.
In the preferred embodiment of the invention, the carrier network (4) is a UMTS (Universal Mobile Telecommunications System) network. Mobile phone (2) users (U1) dial various sets of numbers to connect their calls to different endpoints. Many companies try to obtain numbers that are easy to remember and access so that customers can reach them more easily. As an alternative method, without having to deal with the number to dial, a certain video call number can be used to initiate a video call, and the video content transferred from the subscriber to the service provider's servers (5) can be processed to establish the connection. The same video call number can be used for multiple companies. Some merchants may not want to declare their telephone number but just use figures to be called, and certain patterns can be directed to different numbers at different times with modifications only on the service provider side.
In another scenario, the same video call number can be used with a scene including a visually noticeable telephone number. The video content provided by the user (U1) is transferred to the service provider's servers (5) and processed with optical character recognition (OCR). The telephone number to be called is extracted, and the audio call between the user (U1) and the extracted number is established afterwards. The system (1) provides effortless connection between the users (U1 and U2), and even visually impaired people can initiate phone calls without having to deal with the telephone number. Besides, a dedicated button (22) can be defined on mobile phones (2) to be used with the video call service.
One of the optimized and preferred ways of dialing numbers is clicking on them on touch screen phones when they are available in websites, emails or text messages. The designed system (1) extends this ability to printed telephone numbers in the environment.
Video call services built on 3G networks can also be used unilaterally for a mobile phone (2) user (U1) to communicate with the telephone service provider and its servers (5). After the user (U1) initiates the video call, the video of the scene is transferred over the network (4) to the service provider's servers (5). In the first of this invention's scenarios, where the pattern in the scene is matched to one of the figures in the server's (5) library, the system (1) uses the pattern recognition algorithm. After it has been decided that one of the figures in the library exists in the scene, an audio call between the caller and the number associated with the pattern is established. The following explains the optical character recognition algorithm the system (1) uses to detect the phrases or the telephone numbers in the scene.
A method (1000) for visually aided telephone calls comprises the steps of;
- initiating a video call towards a server (5) (1100),
- directing a mobile phone's (2) camera (21) to a scene (1200),
- establishing a connection (1300),
- directing the video call to the server (5) (1400),
- extracting a phone number to be called from video frames of the video call by using OCR algorithm (1500),
- establishing an audio call between mobile phones (2 and 3) (1600) (Figure 2).
The system (1) for visually aided telephone calls comprises;
- a mobile phone (2) which is adapted to initiate a video call towards a server (5),
- a carrier network (4) which is adapted to establish a connection,
- a carrier network (4) which is adapted to direct the video call to the server (5),
- a server (5) which is adapted to extract a phone number to be called from the video frames of the video call by using the OCR algorithm,
- a server (5) which establishes an audio call between the mobile phones (2 and 3).
In the method (1000), firstly the mobile phone (2) user (U1) initiates a video call towards a server (5) (1100). After initiating the video call, the user (U1) directs his mobile phone's (2) camera (21) to a scene (1200). This scene can be anything in the environment, such as a newspaper, a picture or a guide. To initiate the video call, the user (U1) only presses the specific button (22) on the mobile phone (2), which starts a video call towards the server (5).
After this, the carrier network (4) establishes a connection (1300) between the mobile phone (2) and the server (5) and directs the video call to the server (5) (1400). After the video call reaches the server (5), the server (5) starts to detect a telephone number in the video frames of the video call. To achieve this, the server (5) compares the video frames with the images stored in its library.
By comparing the video frames with the images stored in the library, the server (5) extracts a phone number to be called from the video frames of the video call by using the OCR algorithm (1500).
Extracting a phone number to be called from video frames of the video call by using the OCR algorithm (1500) comprises the sub-steps of (a code sketch of this loop follows the list);
- assigning a zero value to "N", where "N" is the number of matches achieved so far (1501),
- obtaining a video frame from the video call (1502),
- checking whether a text is recognized (1503),
- if the text is recognized, keeping that video frame (1504),
- obtaining the phone number to be called (1505),
- increasing the value of "N" (1506),
- checking whether the value of "N" is bigger than the value of "k", where "k" is a predetermined number of repeated matches required for a correct match (1507),
- if "N" is not bigger than "k", obtaining a new video frame from the video call (1502) (in other words, going to step 1502),
- if "N" is bigger than "k", finishing the operation (1508),
- if the text is not recognized, checking whether a pattern is matched (1509),
- if the pattern is matched, keeping that video frame (1504) (in other words, going to step 1504),
- if the pattern is not matched, deleting that frame (1510),
- obtaining a new video frame from the video call (1502) (in other words, going to step 1502).
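As referenced above, a minimal sketch of this loop in code, with the OCR and pattern-matching steps injected as hypothetical helpers; all names here are illustrative assumptions, not from the patent.

```python
from typing import Callable, Iterator, Optional, TypeVar

F = TypeVar("F")  # a video frame, in whatever representation the server uses

def extract_number(
    frames: Iterator[F],                           # step 1502: frames from the video call
    recognize_text: Callable[[F], Optional[str]],  # step 1503: OCR, number or None
    match_pattern: Callable[[F], Optional[str]],   # step 1509: mapped number or None
    k: int,                                        # repetitions required for a correct match
) -> Optional[str]:
    n = 0                                          # step 1501: achieved match count
    number = None
    for frame in frames:
        text = recognize_text(frame)
        if text is not None:
            number = text                          # steps 1504-1505: keep frame, take number
        else:
            mapped = match_pattern(frame)
            if mapped is None:
                continue                           # step 1510: delete the frame, fetch next
            number = mapped                        # steps 1504-1505 via the pattern branch
        n += 1                                     # step 1506
        if n > k:                                  # step 1507
            return number                          # step 1508: finish the operation
    return None
```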
The system (1) for visually aided telephone calls comprises a server (5) which is adapted to perform;
- assigning a zero value to "N", where "N" is the number of matches achieved so far,
- obtaining a video frame from the video call,
- checking whether a text is recognized,
- if the text is recognized, keeping that video frame,
- obtaining the phone number to be called,
- increasing the value of "N",
- checking whether the value of "N" is bigger than the value of "k", where "k" is a predetermined number of repeated matches required for a correct match,
- if "N" is not bigger than "k", obtaining a new video frame from the video call,
- if "N" is bigger than "k", finishing the operation,
- if the text is not recognized, checking whether a pattern is matched,
- if the pattern is matched, keeping that video frame,
- if the pattern is not matched, deleting that frame,
- obtaining a new video frame from the video call.
To extract a phone number to be called from the video frames of the video call, the server (5) firstly assigns a zero value to "N", where "N" is the number of matches achieved so far (1501). After that, the server (5) obtains a first video frame from the video call (1502). After obtaining the video frame (1502), the server (5) checks whether any text is recognized (1503). If a text is recognized, the server keeps that video frame (1504).
After the text is recognized, the server (5) obtains the phone number to be called (1505). The server (5) increases the value of "N" (1506) and checks whether the value of "N" is bigger than the value of "k" (1507). If "N" is not bigger than "k", the server (5) obtains a new video frame from the video call (1502).
If "N" is bigger than "k", the server (5) finishes the operation (1508). If the text is not recognized, the server checks whether any pattern is matched (1509). If the pattern is matched, the server (5) keeps that video frame (1504). After the pattern is matched, the server (5) obtains the phone number to be called (1505). The server (5) increases the value of "N" (1506) and checks whether the value of "N" is bigger than the value of "k" (1507). If "N" is not bigger than "k", the server (5) obtains a new video frame from the video call (1502). If "N" is bigger than "k", the server (5) finishes the operation (1508).
If the pattern is not matched, the server deletes that frame (1510) and obtains a new video frame from the video call (1502).
The text recognition operation consists of two stages, Training and Execution. The training stage is carried out offline, and the principal component space is generated to be used later, during the call. During the training stage, the server (5) is adapted to perform;
- applying median filtering,
- applying normalization,
- detecting lines,
- extracting characters and generating character images,
- applying Radon transform,
- generating observation matrix, X, using the character image and its Radon transform,
- applying eigen-decomposition on the covariance matrix,
- evaluating the first p eigenvalues that include at least 95% of the variance,
- generating the principal component space using the first p eigenvectors.
During the line detection stage of the training, the server (5) applies Otsu thresholding, searching for the threshold that minimizes the intra-class variance, defined as a weighted sum of the variances of the two classes (black and white pixels in the binary image). Then, a skew correction is applied if necessary. On the corrected image, the server (5) obtains a histogram of the black pixels using a horizontal projection of the image, and uses the ends of the histogram to detect the top and bottom of each line, as shown in Picture 1. A similar algorithm is applied to extract the characters at each line of the image.
Picture 1 - An example of detecting the top and bottom of the lines
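A minimal sketch of this line-detection step, assuming OpenCV and numpy as stand-ins for the server's implementation; skew correction is omitted.

```python
import cv2
import numpy as np

def detect_lines(gray: np.ndarray) -> list[tuple[int, int]]:
    # Otsu thresholding: picks the threshold that minimizes the intra-class
    # variance (inverted so that text pixels become non-zero foreground).
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Horizontal projection: which rows contain any text pixels at all.
    has_ink = binary.sum(axis=1) > 0
    lines, top = [], None
    for i, ink in enumerate(has_ink):
        if ink and top is None:
            top = i                          # top of a new line
        elif not ink and top is not None:
            lines.append((top, i - 1))       # bottom of the current line
            top = None
    if top is not None:
        lines.append((top, len(has_ink) - 1))
    return lines
```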
The server (5) then generates a sample set of character images, each with 16x16 pixels. These character images are used in the Radon transform to generate corresponding Radon transform images, and the transformed images are also interpolated to 16x16 images. It can be assumed that the total number of character samples is n and that there are k different characters to be differentiated from each other (0, 1, 2, ..., 9, a, b, c, ..., z), each providing n_i samples in the total set of n character images.
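A sketch of building one 512-element training vector from a character image and its Radon transform; scikit-image is an assumed stand-in for the server's implementation.

```python
import numpy as np
from skimage.transform import radon, resize

def character_sample(char_img: np.ndarray) -> np.ndarray:
    img16 = resize(char_img, (16, 16))                  # 16x16 character image
    theta = np.linspace(0.0, 180.0, 16, endpoint=False)
    sinogram = radon(img16, theta=theta, circle=False)  # Radon transform
    radon16 = resize(sinogram, (16, 16))                # interpolate to 16x16
    # Vectorize and stack both representations: 256 + 256 = 512 values (m = 512).
    return np.concatenate([img16.ravel(), radon16.ravel()])
```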
$$n = \sum_{i=1}^{k} n_i$$
n1: number of images that correspond to 0,
n2: number of images that correspond to 1,
n3: number of images that correspond to 2,
...
n11: number of images that correspond to a,
n12: number of images that correspond to b, etc.
The server (5) then vectorizes each of these n sample images and their Radon transforms to generate vectors of 256x1x2 (m = 512). The average of the n character vectors is evaluated as:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
The server (5) can combine the n vectorized character images to generate the observation matrix, X, which will be used to generate the covariance matrix:

$$X = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}, \quad X \in \mathbb{R}^{m \times n}$$
The mean-removed observation matrix is evaluated as follows:

$$\tilde{X} = \begin{bmatrix} x_1 - \bar{x} & x_2 - \bar{x} & \cdots & x_n - \bar{x} \end{bmatrix}$$
This matrix, which is composed of the individual vectorized, mean-removed character images and their Radon transforms, is used to generate the covariance matrix, C (here $C = \frac{1}{n} \tilde{X} \tilde{X}^T$). The covariance matrix is then used in the eigen-decomposition $C = Q \Lambda Q^T$ to obtain the eigenvalues and eigenvectors.
Q includes the eigenvectors (combinations of the regular and Radon transformed images) in its columns, and Λ includes the eigenvalues on its diagonal. The first p columns of Q are used to generate the eigenspace, where p is determined as the number of leading eigenvalues (on the diagonal of Λ) that together contain at least 95% of the variance. These p directions, the eigenvectors, are indeed the directions that maximize the variance of the distribution.
Each of the n vectors (character images in the training set) is then projected onto the p eigenvectors with dot product operations, and the character images are represented as points in the principal component (PC) space. These n points belong to k different groups, each representing a different character (a or b or 5, etc.). The centroids of these k groups can be evaluated to represent the average of each character in the principal component space, as shown in Graphic 1.
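Putting the training formulas together, a minimal numpy sketch; the 1/n covariance convention, the function name and the data layout are assumptions.

```python
import numpy as np

# X is the m x n observation matrix whose columns are the 512-element
# character vectors; labels[i] names the character of column i.
def train_pc_space(X: np.ndarray, labels: list, var_keep: float = 0.95):
    mean = X.mean(axis=1, keepdims=True)          # average of the n vectors
    X_tilde = X - mean                            # mean-removed observation matrix
    C = (X_tilde @ X_tilde.T) / X.shape[1]        # covariance matrix C
    eigvals, Q = np.linalg.eigh(C)                # eigen-decomposition (ascending)
    eigvals, Q = eigvals[::-1], Q[:, ::-1]        # reorder to descending
    # first p eigenvalues covering at least 95% of the variance
    p = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), var_keep)) + 1
    pcs = Q[:, :p]                                # the principal component space
    points = pcs.T @ X_tilde                      # project training samples (p x n)
    centroids = {c: points[:, [l == c for l in labels]].mean(axis=1)
                 for c in set(labels)}            # centroid of each character group
    return mean, pcs, centroids
```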
During the execution stage, the server (5) repeats the line detection and character extraction steps as in the training stage. After the transfer of the video content including the text for analysis, the lines and the characters are extracted. The character images are put into vectors of 256x1 format. A particular character image is then projected onto the selected p eigenvectors. This represents the character to be determined as a point in the principal component space.
Graphic 1 - Representation of n character images in the PC space.
The server (5) can then measure the distance from this point to the centroid of each group (candidate character) and determine the group it belongs to by picking the closest distance. The new character image is recognized as the character with the closest centroid.
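Continuing the training sketch above, the execution-stage decision reduces to a nearest-centroid rule in the PC space; this sketch reuses mean, pcs and centroids from train_pc_space.

```python
import numpy as np

def classify_character(char_vec: np.ndarray, mean: np.ndarray,
                       pcs: np.ndarray, centroids: dict) -> str:
    # Project the vectorized character image into the PC space.
    point = (pcs.T @ (char_vec.reshape(-1, 1) - mean)).ravel()
    # Pick the candidate character whose group centroid is closest.
    return min(centroids, key=lambda c: np.linalg.norm(point - centroids[c]))
```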
The same operation is repeated for every extracted character, and the decision is based on the projection of the character image and where it falls in the PC space. The overall processing stages of the method (1000) are summarized in Figure 3. As indicated in this diagram, the same operation can be repeated on the neighboring frames until the same decision is repeated k times (e.g. 3) and it is certain that the character is predicted correctly.
Based on the extracted text phrase or the telephone number, the audio call connection between the caller and the number is established using the Parlay X system. After extracting a phone number to be called from video frames of the video call by using OCR algorithm (1500), server (5) establishes an audio call between mobile phones (2 and 3) (1600).
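The description does not detail the Parlay X interaction. As one hedged illustration only: the Parlay X Third Party Call web service defines a makeCall operation that a server could invoke to bridge the two parties; the WSDL endpoint, the party URIs and the use of the zeep SOAP client below are illustrative assumptions.

```python
from zeep import Client

# Hedged sketch of setting up the audio leg via Parlay X Third Party Call;
# the service URL and telephone numbers are placeholders, not real values.
client = Client("https://operator.example.com/parlayx/third_party_call?wsdl")
call_id = client.service.makeCall(
    callingParty="tel:+905321112233",   # the caller (U1)
    calledParty="tel:+902121234567",    # the phone number recognized by OCR
)
print("Parlay X call identifier:", call_id)
```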
Within the scope of this basic concept, it is possible to develop various embodiments of the inventive system (1) and method (1000) for visually aided telephone calls. The invention cannot be limited to the examples described herein; it is essentially as defined in the claims.

Claims

1. The method (1000) for visually aided telephone calls characterized by the steps of;
- initiating a video call towards a server (5) (1100),
- directing a mobile phone's (2) camera (21) to a scene (1200),
- establishing a connection (1300),
- directing the video call to the server (5) (1400),
- extracting a phone number to be called from video frames of the video call by using OCR algorithm (1500),
- establishing an audio call between mobile phones (2 and 3) (1600).
2. The method (1000) according to claim 1, characterized by the step of extracting a phone number to be called from video frames of the video call by using OCR algorithm (1500) comprising the sub-steps of;
- assigning a zero value to "N", where "N" is the number of matches achieved so far (1501),
- obtaining a video frame from the video call (1502),
- checking whether a text is recognized (1503),
- if the text is recognized, keeping that video frame (1504),
- obtaining the phone number to be called (1505),
- increasing the value of "N" (1506),
- checking whether the value of "N" is bigger than the value of "k", where "k" is a predetermined number of repeated matches required for a correct match (1507),
- if "N" is not bigger than "k", obtaining a new video frame from the video call (1502) (in other words, going to step 1502),
- if "N" is bigger than "k", finishing the operation (1508),
- if the text is not recognized, checking whether a pattern is matched (1509),
- if the pattern is matched, keeping that video frame (1504) (in other words, going to step 1504),
- if the pattern is not matched, deleting that frame (1510),
- obtaining a new video frame from the video call (1502) (in other words, going to step 1502).
3. The method (1000) according to claim 1 or claim 2, characterized in that the text recognition operation consists of two stages, called Training and Execution, wherein the training stage is carried out offline and the principal component space is generated to be used later, during the call.
4. The method (1000) according to claim 3, characterized by training stage which comprises the steps of;
- applying median filtering,
- applying normalization,
- detecting lines,
- extracting characters and generating character images,
- applying Radon transform,
- generating observation matrix, X, using the character image and its Radon transform,
- applying eigen-decomposition on the covariance matrix,
- evaluating the first p eigenvalues that include at least 95% of the variance,
- generating the principal component space using the first p eigenvectors.
5. The method (1000) according to claim 4, characterized by an execution stage which repeats the line detection and character extraction steps as in the training stage.
6. A system (1) for visually aided telephone calls comprising;
- at least one mobile phone (2) which includes at least one camera (21) and which initiates a video call, - at least one mobile phone (3) which takes a call coming from the other mobile phone (2),
- at least one carrier network (4) for providing wireless communication between the mobile phones (2 and 3) and characterized by
- at least one server (5) which takes a video call coming from the mobile phone (2), employs an optical character recognition (OCR) algorithm on the transferred video frames using principal component analysis to discriminate characters and to extract, non-limiting, digits in the phone number format and directs the video call as an audio call to another mobile phone (3) according to the recognized phone number of the video call content.
7. The system (1) according to claim 6, characterized by the mobile phone (2) which has a specific button (22) for initiating a video call towards server (5).
8. The system (1) according to claim 6 or claim 7, characterized by the carrier network (4) which is a UMTS (Universal Mobile Telecommunications System) network.
9. The system (1) according to any one of claims 6 to 8, characterized by
- the mobile phone (2) which is adapted to initiate a video call towards a server,
- the carrier network (4) which is adapted to establish a connection and to direct the video call to the server (5),
- the server (5) which is adapted to extract a phone number to be called from the video frames of the video call by using the OCR algorithm and to establish an audio call between the mobile phones (2 and 3).
10. The system (1) according to any one of claims 6 to 9, characterized by the server (5) which is adapted to perform;
- assigning a zero value to "N", where "N" is the number of matches achieved so far,
- obtaining a video frame from the video call,
- checking whether a text is recognized,
- if the text is recognized, keeping that video frame,
- obtaining the phone number to be called,
- increasing the value of "N",
- checking whether the value of "N" is bigger than the value of "k", where "k" is a predetermined number of repeated matches required for a correct match,
- if "N" is not bigger than "k", obtaining a new video frame from the video call,
- if "N" is bigger than "k", finishing the operation,
- if the text is not recognized, checking whether a pattern is matched,
- if the pattern is matched, keeping that video frame,
- if the pattern is not matched, deleting that frame,
- obtaining a new video frame from the video call.
11. The system (1) according to any one of claims 6 to 10, characterized by the server (5) which is adapted to perform;
- detecting lines,
- extracting characters,
- generating observation matrix, X,
- applying eigen-decomposition on the covariance matrix,
- evaluating the first p eigenvalues that include at least 95% of the variance,
- generating the principal component space using the first p eigenvectors.
PCT/IB2010/056151 2010-12-31 2010-12-31 A system and a method for visually aided telephone calls WO2012090033A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/IB2010/056151 WO2012090033A1 (en) 2010-12-31 2010-12-31 A system and a method for visually aided telephone calls

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/IB2010/056151 WO2012090033A1 (en) 2010-12-31 2010-12-31 A system and a method for visually aided telephone calls

Publications (1)

Publication Number Publication Date
WO2012090033A1 true WO2012090033A1 (en) 2012-07-05

Family

ID=44114529

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2010/056151 WO2012090033A1 (en) 2010-12-31 2010-12-31 A system and a method for visually aided telephone calls

Country Status (1)

Country Link
WO (1) WO2012090033A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2821934A1 (en) * 2013-07-03 2015-01-07 Open Text S.A. System and method for optical character recognition and document searching based on optical character recognition
US9342533B2 (en) 2013-07-02 2016-05-17 Open Text S.A. System and method for feature recognition and document searching based on feature recognition

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH06311220A (en) 1993-04-21 1994-11-04 Kyocera Corp Image recognizing dialer
US20030012410A1 (en) 2001-07-10 2003-01-16 Nassir Navab Tracking and pose estimation for augmented reality using real features
US6594503B1 (en) 2000-02-02 2003-07-15 Motorola, Inc. Communication device with dial function using optical character recognition, and method
US6700990B1 (en) 1993-11-18 2004-03-02 Digimarc Corporation Digital watermark decoding method
WO2006002706A1 (en) * 2004-06-25 2006-01-12 Sony Ericsson Mobile Communications Ab Mobile terminals, methods, and program products that generate communication information based on characters recognized in image data
US20060142054A1 (en) * 2004-12-27 2006-06-29 Kongqiao Wang Mobile communications terminal and method therefor
US20060182311A1 (en) 2005-02-15 2006-08-17 Dvpv, Ltd. System and method of user interface and data entry from a video call
US20060233423A1 (en) 2005-04-19 2006-10-19 Hesam Najafi Fast object detection for augmented reality systems
US7227893B1 (en) 2002-08-22 2007-06-05 Xlabs Holdings, Llc Application-specific object-based segmentation and recognition system
US20080094496A1 (en) * 2006-10-24 2008-04-24 Kong Qiao Wang Mobile communication terminal
US20080233999A1 (en) 2007-03-21 2008-09-25 Willigenburg Willem Van Image recognition for placing a call
WO2009114039A1 (en) 2008-03-14 2009-09-17 Sony Ericsson Mobile Communications Ab Enhanced video telephony through augmented reality
US20100008265A1 (en) 2008-07-14 2010-01-14 Carl Johan Freer Augmented reality method and system using logo recognition, wireless application protocol browsing and voice over internet protocol technology

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MIROSLAW MICIAK: "RADON TRANSFORMATION AND PRINCIPAL COMPONENT ANALYSIS METHOD APPLIED IN POSTAL ADDRESS RECOGNITION TASK", INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND APPLICATIONS, vol. 7, no. 3, 2010, pages 33 - 44, XP055000545, Retrieved from the Internet <URL:http://www.tmrfindia.org/ijcsa/v7i34.pdf> [retrieved on 20110615] *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9342533B2 (en) 2013-07-02 2016-05-17 Open Text S.A. System and method for feature recognition and document searching based on feature recognition
US9563690B2 (en) 2013-07-02 2017-02-07 Open Text Sa Ulc System and method for feature recognition and document searching based on feature recognition
US10031924B2 (en) 2013-07-02 2018-07-24 Open Text Sa Ulc System and method for feature recognition and document searching based on feature recognition
US10282374B2 (en) 2013-07-02 2019-05-07 Open Text Sa Ulc System and method for feature recognition and document searching based on feature recognition
EP2821934A1 (en) * 2013-07-03 2015-01-07 Open Text S.A. System and method for optical character recognition and document searching based on optical character recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10809190

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2013/07813

Country of ref document: TR

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 10809190

Country of ref document: EP

Kind code of ref document: A1