WO2004012144A1 - System and method for displaying digital images linked together to enable navigation through views

System and method for displaying digital images linked together to enable navigation through views

Info

Publication number
WO2004012144A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
transformation parameters
camera
scenery
images
Application number
PCT/SE2003/001237
Other languages
French (fr)
Inventor
Sami Niemi
Mikael Persson
Karl-Anders Johansson
Original Assignee
Scalado Ab
Application filed by Scalado Ab
Priority to AU2003247316A1
Publication of WO2004012144A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/10 Geometric effects
    • G06T15/20 Perspective computation
    • G06T15/205 Image-based rendering
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60 Control of cameras or camera modules
    • H04N23/66 Remote control of cameras or camera parts, e.g. by remote control devices
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60 Control of cameras or camera modules
    • H04N23/698 Control of cameras or camera modules for achieving an enlarged field of view, e.g. panoramic image capture
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N5/2624 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects for obtaining an image which is composed of whole input images, e.g. splitscreen

Definitions

  • This invention relates to a system and method for displaying digital images linked together and for enabling a user to navigate through the linked digital images.
  • this invention relates to a system and method for generating a novel view from digital images or video sequences obtained from one or more fixed camera views during a user's navigation through said digital images or video sequences.
  • a digital image of a particular object such as an object presented in a digital image on a computer, may be established through linking a series of digital pictures together so as to achieve a wide or panoramic representation of the object.
  • the digital representation comprises vertices, each representing a digital image, and edges binding together a first vertex and a second vertex.
  • An edge represents information on the transition between a first digital image and a second digital image.
  • the first and second digital image comprise a first and second area, respectively, wherein the depictions in the first area substantially correspond to the depictions in the second area.
  • the transition information defines how at least one of the digital images is to be manipulated in order to provide a smooth boundary between the two digital images.
  • the disclosed image-based representation and method requires a user to manually assist in identifying the substantially corresponding areas prior to linking the series of digital images. Hence in representations including a plurality of digital images it is considerably time consuming to link each of the digital images together.
  • American patent US 6,337,688 discloses a method and system enabling a user lacking specialized programming skills and training to produce a realistic simulation of a real environment.
  • the simulation is constructed from a series of recorded frames that each contain an image of a real environment.
  • each frame comprises data specifying the associated position and orientation within the real environment.
  • the associated positions are recorded in a camera utilising a position and orientation sensor implemented with inertial sensors that permit the camera to sense positional changes without directly receiving externally generated position information.
  • Canadian patent application CA 2,323,462 discloses a method and system for processing images into a mosaic.
  • the method and system convert both the input image and the mosaic into Laplacian image pyramids and a real-time alignment process is applied to the levels within the respective pyramids.
  • the method and system uses a coarse to fine image alignment approach.
  • the result of the alignment process is alignment information that defines the required transformations to achieve alignment.
  • the method and system disclosed in the Canadian patent application, however, applies two-dimensional displacements as the image relation, which are inaccurate for images captured from a fixed camera position. This becomes especially noticeable when working with live video image streams.
  • International patent application WO 98/54674 describes a method, instruction set and apparatus for combining related source images, each represented by a set of digital data, by determining three-dimensional relationships between data sets representing related source images and creating a data set representing an output image by combining the data sets representing the source images in accordance with the determined three-dimensional relationships.
  • a motorized device controlling the pan, tilt and zoom of the camera is a straightforward way to solve the navigation problem.
  • the camera is mounted on a rotation device of which the pan and tilt can be remote controlled. This enables, if a suitable user interface exists, remote-control navigation of the camera.
  • FIGS. 2a and 2b show typical extensive user interfaces for creating navigation enabled panoramic images with high quality.
  • Some tools for creating panoramic images exist which do not apply relatively complicated mathematical operations. This results in an image of lower quality, but it greatly reduces the need for a complicated user interface.
  • These tools are often seen bundled with digital cameras, as shown in figures 3a and 3b.
  • FIGS. 4a and 4b show the result of the reversion process needed to convert the image captured by a camera equipped with a fisheye lens into a panoramic image. Since only one camera is used, the entire field of view, which is often more than 180 degrees, is compressed onto the image plane of the camera. This implies that an enormous resolution is required.
  • Images captured by a fish eye lens are compressed in a spherical manner, i.e. the image is thus not compressed equally over the image plane but compressed more at the edges.
  • the above referenced prior art technologies perform a single estimation of a displacement between corresponding areas of two images and assume that the single estimation is the correct one. Hence the statistical hit rate of the prior art technology is low.
  • the object of the present invention to clearly display the physical relation of the digital images acquired by cameras in relation to each other. This gives the user a good overview of the scenery to be captured.
  • temporal synchronization errors are identified and eliminated so as to avoid correlation of image points in two images not representing a projection of the same scenery point.
  • a particular feature of the present invention is the provision of a self-optimization procedure enabling an automatic and continuous recalibration of any number of cameras utilised for providing digital images.
  • a particular feature of the present invention is the provision of a clustering technique performing a number of correlations originating from different initial displacements.
  • a first aspect of the present invention obtained by a method for generating a view of at least part of a scenery from a plurality of images showing said scenery, comprising:
  • navigation is in this context to be construed as a tool for moving through a series of linked images. During navigation it appears as if a camera is moving although all cameras are actually static, which implies that no moving parts are required.
  • image stream is in this context to be construed as a representation of a continuous flow of images, for instance from a network camera. At each instant a single image is available for retrieval.
  • projective transformation is in this context to be construed as the process of projecting points in three-dimensional space onto a plane or, as mostly used in our case, projecting points from one plane onto another. It is further described in "Multiple view geometry in computer vision" by Richard Hartley and Andrew Zisserman, Cambridge university press 2000.
  • panoramic images is in this context to be construed as referring to images covering a very wide field of view, usually more than 180 degrees. If only parts of the image are shown at one time navigation such as pan, tilt, rotation and zoom are made possible without actually moving the cameras capturing the images.
  • image stitching is in this context to be construed as referring to the process of creating a panoramic image out of many images captured with narrow field of view. All images are captured from the same point of view and differ only by viewing direction; this implies that the images are related by a projective transform or a so-called homography, further described in "Multiple view geometry in computer vision" by Richard Hartley and Andrew Zisserman, Cambridge university press 2000. If this projective transform can be found the images can be stitched and blended together into one large mosaic of images making a panoramic image.
  • transition data is in this context to be construed as referring to the computed parameters, in the case of stitching panoramic images the projective transform relating the images, needed to enable navigation.
  • a or “an” is in this context to be construed as “one”, “one or more”, “at least one”.
  • a plurality of digital images are linked together according to a geometrical interrelationship between the digital images such as a series of still digital photographs of an object or objects from various angles, video sequences of an object or objects from various angles, or a combination thereof .
  • the method according to the first aspect of the present invention provides means for generating transformation parameters linking a plurality of images forming a scenery and means for enabling a user to view any particular part of the scenery.
  • a particular advantage of the method according to the first aspect of the present invention is the fact that the images do not need to be relocated from the cameras to the processor unit in order for the processor unit to generate transformation parameters.
  • a user may, through a viewer directly connected to the processor unit or connected through a computer network communicating with said viewer, generate a specific view in a scenery captured by the cameras without having to communicate entire images from the cameras but only the particular views.
  • This provides a method which significantly increases applicability of the present invention since the transmission data is reduced.
  • a second aspect of the present invention obtained by a method for generating one or more transformation parameters interrelating a plurality of images each showing at least a part of a scenery and for displaying said images in accordance with the transformation parameters, and said method comprising:
  • the method according to the second aspect of the present invention enables a user to auto-configure how a plurality of images are to be linked together to form a scenery.
  • the method enables a user to re-compute the automatically generated proposal if the user finds that any camera's field of view needs to be adjusted.
  • the method according to the second aspect of the present invention is particularly advantageous since the configuration of the images is performed automatically. Hence the production time is significantly reduced and the operations needed to produce the appropriate links are simplified.
  • a system for generating one or more transformation parameters interrelating a plurality of images each showing at least a part of a scenery comprising: a) a first camera for capturing a first image of a first part of said scenery; b) a second camera for capturing a second image of a second part of said scenery, said first part and said
  • the first and second camera according to the third aspect of the present invention may comprise a digital still camera, a network camera, cell phone, mobile phone, a digital video camera, any other device capable of generating said views of said scenery, or any combination thereof.
  • since the variety of digital imaging devices has multiplied in recent years, the system according to the third aspect of the present invention may comprise any type known to the person skilled in the art.
  • the communication lines according to the third aspect of the present invention may comprise a wired or wireless dedicated line, computer network, television network, telecommunications network or any combinations thereof.
  • the communication lines may in fact enable communication not only between the cameras and the processor unit but also between the cameras, the processor unit and further peripherals connecting to the cameras or processor unit.
  • the processor unit may comprise a plurality of processor devices communicating with one another through a communications network.
  • the processor unit is in this context to be construed as any number of processors inter-connected so as to communicate with one another.
  • the cameras may include processors for preliminary handling of the images, and the system may further comprise processors for performing mathematical operations on the images and processors for communicating with a plurality of various clients.
  • the processor unit according to the third aspect of the present invention may comprise a viewer adapted to enable a first user to navigate through the scenery and a camera configuration display adapted to calculate the transformation parameters and enable a second user to store the transformation parameters in a storage device.
  • the processor unit may comprise a server communicating with the first and second camera through the communication lines and adapted to establish a database for the first and second image in a storage device.
  • the server may be adapted to calculate said transformation parameters and/or communicate with a camera configuration display for calculating said transformation parameters and to enable a second user to store the transformation parameters in the storage device.
  • the server may further be adapted to communicate with a viewer for determining a view of at least part of the scenery in accordance with user interaction navigating in the scenery.
  • the processor unit may comprise a processor device in at least one of the first and second cameras, and a viewer communicating through the communication lines with the first and second camera, the viewer determining a view of at least part of the scenery in accordance with user interaction navigating in the scenery.
  • the above listed three alternatives provide solutions fulfilling the requirements of various systems.
  • the first alternative presents a processor unit performing both the viewing operation and the calculation operation needed in order to enable a user to navigate in a scenery consisting of a plurality of images.
  • the processor unit may be implemented on a server communicating with further processor devices in the system and enabling clients (e.g. other processor devices) connecting to the server to utilise the information generated by the server.
  • the second alternative is particularly advantageous for implementation on a computer network such as the Internet.
  • the server may include a capability to connect to mobile processor devices such as mobile or cell phones having multimedia displaying means.
  • the third alternative provides a processor unit integrated as processor devices in the cameras, eliminating the need for a server and thus enabling a system which is mobile.
  • navigating software may be embedded in a network camera.
  • the system according to the third aspect of the present invention may further comprise a display for displaying the first and second image in accordance with the transformation parameters.
  • the display may be implemented as a multimedia display of a mobile or cell phone, or may be implemented as any monitor communicating with the processor unit.
  • the storage device may be adapted to store the transformation parameters and/or the first image and/or the second image and may be established according to any techniques known to a person skilled in the art.
  • the system according to the third aspect of the present invention may further incorporate any features of the method according to a first aspect and any features of the method according to a second aspect of the present invention.
  • a fourth and fifth aspect of the present invention obtained by a computer program comprising code adapted to perform the method according to the first aspect of the present invention and a computer program comprising code adapted to perform the method according to the second aspect of the present invention.
  • the computer program according to the fourth and fifth aspect of the present invention may incorporate features of the method according to the first aspect of the present invention, features of the method according to the second aspect of the present invention, and features of the system according to the third aspect of the present invention.
  • the system and methods according to the first, second and third aspect of the present invention may comprise a user interface used in conjunction with an automatic computation of transformation parameters requiring very little or no user interaction and may easily be embedded in processor devices having limited graphical and memory storage capabilities.
  • the present invention remedies this by enabling multiple inexpensive consumer targeted cameras to be used to allow navigation such as pan, tilt, rotation and zoom. The requirement of an extensive user interface is eliminated as described in the following section.
  • figure 1 shows a photograph of a prior art pan/tilt camera toolkit.
  • the camera is mounted on a device enabling motorized remote control of pan, tilt and zoom;
  • figures 2a and 2b show two screen shots of a prior art user interface for panoramic image creation software;
  • FIGS. 3a and 3b show a prior art user interface of a digital camera and a merge preview window
  • FIGS. 4a and 4b show an image captured by a camera equipped with a fisheye lens and a panoramic image created by reversing the fisheye effect
  • figure 5 shows a camera configuration display according to a first embodiment of the present invention
  • figure 6 shows a camera configuration display according to a second embodiment of the present invention.
  • figure 7 shows a camera configuration display according to a third embodiment of the present invention.
  • figure 8 shows a first graphical user interface of the camera configuration display according to the first, second and third embodiments of the present invention
  • figure 9 shows a second graphical user interface of the camera configuration display illustrating image streams presented in three dimensions
  • figure 10 shows a flow chart of the method for linking image streams according to a fourth embodiment of the present invention
  • figure 11 shows flow and components of an automatic computation of transformation parameters
  • figure 12 shows flow and components of a phase correlation method
  • figure 13 shows an example of displacement of peaks as applied in the phase correlation method
  • FIGS. 14a and 14b show image “A” and image “B”, being slightly displaced relative to one another;
  • figure 15 shows a correlation surface having an arrow indicating one candidate displacement vector
  • figure 16 shows a correlation surface as a periodic signal
  • FIGS. 17a, 17b, 17c, and 17d show first displacement vectors
  • FIGS. 18a, 18b, 18c, and 18d show second displacement vectors
  • figure 19 shows a graph of sites of initial displacement vectors for phase correlation.
  • Figure 5 shows a system according to a first embodiment of the present invention designated in entirety by reference numeral 10 and comprising a number of vital components.
  • a plurality of cameras 12, being any type of device capable of delivering images, provide the system 10 with image streams. There are no special requirements on the cameras; any ordinary off-the-shelf consumer targeted camera can be used.
  • the system 10 handles ordinary still images as well as live video streams. This requires the system 10 to operate well in real-time, since the content of video image streams can continuously change.
  • a database 14 stores parameters and other information required to enable navigation through a mosaic of images.
  • the database 14 may be incorporated in one of the cameras 12, if the cameras 12 are equipped with a memory storage unit.
  • a camera configuration display 16 is the component responsible for computing the parameters required for the navigation and responsible for showing the current camera configuration.
  • the parameters are computed from the camera image streams and user input, and the result is stored in the database 14.
  • a viewer 18, the component performing the actual navigation, can use the pre-computed parameters and camera image streams to enable navigation.
  • the viewer 18 presents the result on for instance a computer screen and enables navigation through images with for instance an attached computer mouse or other pointing device.
  • the system 10 enables a user to open any number of viewers simultaneously and the user can view different parts of the scenery independent of each other.
  • the system 10 operates by having the plurality of cameras 12 provide one or more streams of images to the camera configuration display 16 and the viewer 18.
  • the camera configuration display 16 computes transformation parameters and stores these in the database 14.
  • the viewer 18 receives the stream of images from the plurality of cameras 12, displays them for a user, and enables the user to navigate through the stream of images.
  • the system 10 uses the database 14 as a storing device for the computed parameters .
  • the database 14 is resident on one of the cameras 12, eliminating the use of a server component.
  • the system 10 may be configured in a number of different set-ups.
  • Figure 6 shows a system according to a second embodiment of the present invention, which system is designated in entirety by reference numeral 30.
  • the system 30 comprises a plurality of cameras 32 and an actual server 34 containing a database.
  • the server 34 has the capability to store and manipulate the camera image streams received from the plurality of cameras 32.
  • the camera image streams are uploaded to the server 34.
  • the server component 34 accesses the images and computes the required transformation parameters.
  • a camera configuration display 36 downloads the images and transformation parameters and displays the camera set-up.
  • the viewer 38 downloads the image streams and in addition the transformation parameters.
  • the server 34 is intelligent so as to enable automatic calibration as well as minimizing the amount of downloaded data.
  • the server can automatically detect and mask out motion in the streams of images. This implies that parts of an image that include motion are detected, masked out and excluded from the computation of the transformation parameters. If pixels containing movement in the images are used it will result in errors in the transformation parameters. This implies that the transformation parameters can be re-calibrated even though the image streams contain motion. This is simply implemented by comparing a number of temporal neighbouring images in the image stream; the areas that are constant are included in the mask and used when computing transformation parameters, as in the sketch below. The parts that differ are excluded from the mask.
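A minimal sketch of this temporal masking, assuming greyscale frames held as NumPy arrays; the function name, the threshold value and the max-minus-min variation measure are illustrative choices, not the patent's implementation.

```python
# Hedged sketch: mark pixels as static when their intensity varies little
# across a few temporally neighbouring frames; only static pixels would be
# used when computing the transformation parameters.
import numpy as np

def static_pixel_mask(frames, threshold=8.0):
    """frames: sequence of 2-D greyscale images from one camera stream.
    Returns a boolean mask that is True where the scene appears static."""
    stack = np.stack([np.asarray(f, dtype=np.float64) for f in frames])
    variation = stack.max(axis=0) - stack.min(axis=0)  # per-pixel temporal spread
    return variation < threshold
```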
  • a server component can apply additional image processing such as correction for lens distortions, colour adjustment and additional compression.
  • Figure 7 shows a system according to a third embodiment of the present invention, which system is designated in entirety by reference numeral 50.
  • the system 50 is the "black box"-case where no user interfaces are required.
  • a plurality of cameras 52 transmit image streams to a viewer 54 that computes all relevant parameters in real-time and no storing capacity is needed. This is ideal for off-the-shelf consumer targeted cameras, where no processing or storing capacity exists.
  • the camera images are fetched by the viewer 54, which initially performs the automatic computation of transformation parameters.
  • the parameters are only stored locally and used for enabling navigation. The parameters can be continuously recomputed, to ensure that the link between the images is correct if the cameras 52 would for instance be knocked out of place.
  • the camera configuration display user interface is designed to be as effective as possible regarding ease of use, speed and the memory requirements of the platform hosting the interface.
  • the platform may be hosted on a digital camera (video and/or still) , a cell phone, a personal computer, and/or a server accessible through a computer network or a dedicated communication line.
  • a central part of the camera configuration display user interface is the use of a method for automatic linking of images, described below. This implies that all steps required to compute the transition data for the case where the involved cameras approximately share focal point location are performed automatically.
  • the software automatically estimates the region where two images substantially correspond.
  • substantially corresponding regions should in this context be construed as the two areas depicting or showing the same or mainly the same part of the same or at least almost identical objects or sceneries.
  • a first area depicting a front view of a painting substantially corresponds to a second area depicting the painting from a side angle.
  • the second area will show more of the frame of the painting, but the information in the first and the second area will still be corresponding.
  • the purpose of the camera configuration display user interface is to show the placements of the cameras and give the user the opportunity to, while the image streams are continuously updated, interactively move the cameras capturing the images and thus update the transformation parameters, i.e. the link between images.
  • the interface also works as a backup if an automatic computation of transformation parameters fails.
  • the camera configuration display user interface is designed to effectively and quickly create navigable content from live video image streams.
  • FIG 8 shows a camera configuration display user interface designated in entirety by reference numeral 60.
  • the camera configuration display user interface 60 is shown working with only a first image stream 62 and a second image stream 64 for illustrative purposes only.
  • the camera configuration display user interface 60 may include any number of image streams. If the user is not content with the result he may order the camera configuration display through the camera configuration display user interface 60 to re-link by activating button 66. When the user is satisfied with the result he may order the camera configuration display through the camera configuration display user interface 60 to store the transformation parameters for image streams by activating a button 68.
  • An alternative set-up is to present the image relations in three dimensions as the cameras are actually physically placed. This implies that a three-dimensional presentation is required. The presentation can be rotated, translated and moved to view different parts of the camera set-up. This form of interface is required for future extensions, if the cameras were to be allowed to be placed arbitrarily, i.e. not required to share focal point.
  • the workflow of the three-dimensional user interface differs from the two-dimensional one. The difference is the possibility to view the camera set-ups from different angles and positions.
  • the 3D user interface also requires that the complete computation process is executed before the image streams can be related in three dimensions.
  • the set-up process is completed in two steps, initial two-dimensional placement and final optimization. This cannot be done in the three-dimensional case.
  • the three-dimensional user interface is used only to show the relative location and orientation of the cameras .
  • Figure 9 shows a 3D camera configuration display user interface designated in entirety by reference numeral 70.
  • a first live video stream 72 and a second live video stream 74 are projected according to camera position and orientation.
  • the camera configuration display user interface 70 is shown working with only two video streams for illustrative purposes only.
  • the camera configuration display user interface 70 may include any number of image streams.
  • Marker 76 symbolises the camera centre, dotted lines 78 illustrate the projection of the first live video stream 72 and dotted lines 80 illustrate the projection of the second video stream 74.
  • the user may order the camera configuration display through the camera configuration display user interface 70 to re-compute by activating button 82.
  • the user may order the camera configuration display through the camera configuration display user interface 70 to store the transition data linking the image streams by activating a button 84.
  • Figure 10 shows a flowchart of the method for linking image streams according to a fourth embodiment of the present invention.
  • the method is designated in entirety by reference numeral 90.
  • the object of the camera configuration display and thus the method 90 is to define the relative two-dimensional placement among image streams. This placement is required to perform a final step 98 of the method 90.
  • the final step 98 can be quite time consuming and is performed only when the user is satisfied with the camera set-up and the transformation parameters are stored.
  • the relative placement of the image streams is achieved automatically through the first step 92, the initial placement, using a phase correlation step of the automatic computation of transformation parameters. This process is only performed at start-up or when requested by the user and can therefore be computationally expensive, which increases the reliability of the method.
  • This placement consists only of relative two-dimensional displacements.
  • the user may in a second step 94 physically adjust the camera or cameras and view the result.
  • the user may in a third step 96 click, drag and drop the camera images relative to one another. When the user is satisfied with the camera set-up, the user initiates the final step 98, performing the non-linear optimization, and stores the transformation parameters.
  • a stripped down automatic placement method computes new placements relative to the previously computed displacements. This enables the user to adjust the cameras and view the result at interactive frame rates. If the user is satisfied with the camera set-up, the user can initiate the nonlinear optimization and store the parameters. If for some reason the user is unsatisfied and wants to start over, this can be achieved through ordering a re-computation.
  • if the first step 92 of the method 90 fails to perform the initial placement, a texture snap function is available.
  • the user can click-drag the image of a video stream and approximately place it at the correct position and the method 90 will use the texture content of the two video streams to align them correctly. In other words, the user supplies a guess displacement defining the overlapping area that substantially corresponds between the images.
  • the method 90 then computes the best match in texture originating from the user guess. This greatly reduces the time spent on defining the transition data even if the automatic procedure fails. If the user is satisfied with the camera set-up, the user can initiate the non-linear optimization and store the parameters. However, if for some reason the user is unsatisfied and wants to start over, this can be achieved by ordering a re-computation.
  • the quality of the result can vary, and the correctness of the placement is not always immediately obvious to the user.
  • the borders of the video stream images are indicated, using a colouring system, to visualise quality of the resulting video stream mosaic e.g. correctness of the placement.
  • the quality is computed as the inverse of error. If the match error is below a certain threshold the indication system indicates a match, e.g. green border. If the error is above a certain upper threshold the indication system indicates a mismatch, e.g. a red border. This can occur if the images have no overlapping areas and the cameras need to be adjusted. Anything in-between is indicated by a gradient between for instance two colours or shades of one colour.
  • This function is not required in the 3D user interface case since the camera set-up can be viewed from any direction.
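As a rough illustration of the colouring system described above; the threshold values and the green-to-red RGB scheme are assumptions for the sketch, not values taken from the patent.

```python
# Hedged sketch: map a placement error onto a border colour, green for a
# match, red for a mismatch, and a gradient in between.
def border_colour(error, low=0.05, high=0.5):
    if error <= low:
        return (0, 255, 0)                      # match: green border
    if error >= high:
        return (255, 0, 0)                      # mismatch: red border
    t = (error - low) / (high - low)            # position within the gradient
    return (int(255 * t), int(255 * (1 - t)), 0)
```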
  • a re-computation function is available through the button 66 or 82 in the user interfaces 60 and 70, respectively.
  • a usual case where this is required is if the images have no overlap when the first initial placement was performed and no computation could be performed.
  • the method 90 performs an optimization that uses the two-dimensional displacements to compute the homographies relating the images, i.e. enabling the generation of the transformation parameters needed to enable navigation.
  • the parameters needed by the viewer to enable navigation are stored in the database, or used directly to view the result.
  • One of the main features of the present invention is the use of a method for automatic linking images according to a fifth embodiment of the present invention.
  • the fact that a fully automatic method is available enables a number of interesting cases .
  • the method for automatic linking images works completely without requiring user intervention. This enables integration of the method in units where no graphical user interface is possible, for instance in digital cameras and mobile units.
  • the method simply requires a number of image streams and computes the image relations enabling linking of images . Obviously the above described navigation through linked images may be similarly integrated where no user interface is available.
  • the method for automatic linking images requires a number of image streams and fully automatically computes the transformation parameters needed to perform the navigation that is the nature of the invention. This could for instance be used to publish navigation-enabled content directly from the digital camera or other mobile unit responsible for acquiring the source image streams.
  • the method for automatic linking of images can be extended to work with images related by different degrees of zoom instead of images related by different orientations. This can be used to create zoomable images with no graphical user interface required.
  • the method requires a number of images related by different degree of zoom and produces a zoomable image.
  • since the method for automatically linking images works without user intervention, it can be used to automatically and continuously recalibrate the cameras by continuously re-computing the transformation parameters. This prevents the resulting link from becoming distorted if the cameras were to be physically affected, for instance knocked out of place.
  • the transformation parameters may be pre-computed and stored as a part of the image storing structure and may comprise transition data.
  • the transition data are continuously recomputed as the content of the images change (i.e. live video streams) .
  • a coarse version of the transition data are stored prior to display and used to obtain, in real-time, an optimized version of the data suitable for use when displaying the transitions, i.e. self-optimization.
  • the database/server can continuously update the parameters previously computed. This will ensure that the parameters are up to date although the camera positions or orientations are altered.
  • the cameras capturing the images should, at least in theory, share focal point. This is of course almost never possible in practice and thus some parallax distortion will occur as a result of the focal point approximation. Using different projective transformations for different parts of the image streams, these distortions can be reduced.
  • the method for automatic linking of images comprises automatic calibration on segments of the image individually in order to accomplish this.
  • the purpose of the method for automatic linking of images is, given two images of a scenery captured by two cameras that, often only approximately, share focal point but differ in orientation, to automatically compute the projective transformation (homography) relating the two images.
  • the projective transform is used to relate two image points in two images representing a projection of the same scenery point.
  • a three by three matrix called a homography can describe this relation.
  • An image point in one image (u, v) and the homography matrix (m0..7) can be used to find the corresponding image point in the second image (u', v') according to equation 1:
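Equation 1 itself is not legible in this extraction; the standard form of the mapping it refers to, with the homography elements written m0 to m7 and the ninth element normalised to one, is:

```latex
u' = \frac{m_0 u + m_1 v + m_2}{m_6 u + m_7 v + 1}, \qquad
v' = \frac{m_3 u + m_4 v + m_5}{m_6 u + m_7 v + 1}
```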
  • the homography matrix or other similar representations relating image pairs, such as two-dimensional displacements, is part of the transformation parameters used to enable navigation, i.e. the novel view generation.
  • the projection of a scenery point in the first image is related to the projection of the same scenery point in the second image by a projective transformation (homography) that can be described by a three by three matrix. If this projective transformation can be found, navigation in the form of change of view direction from the view direction used when capturing the first image to the view direction used when capturing the second image can be performed.
  • a projective transformation (homography)
  • phase correlation 102 is used to find the estimated two-dimensional motion from a first image 104 to a second image 106.
  • the phase correlation 102 is performed multiple times originating from different displacements of the input images 104 and 106.
  • the second step 108 clusters the displacement estimates obtained in the phase correlation to find the statistically most likely displacement .
  • the two-dimensional displacement found in the two steps works as a starting estimate for the final non-linear optimization process 110.
  • the following sections describe each of the three steps in the method 100.
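The three steps can be summarised in a short outline. This is a hedged sketch only: the helper functions are placeholders standing in for the routines described in the following sections, and none of the names come from the patent.

```python
# Hedged outline of method 100: phase correlation from several initial
# displacements (102), clustering of the candidates (108), then non-linear
# refinement of the homography (110). All three helpers are placeholders.
import numpy as np

def phase_correlate(img_a, img_b, initial_displacement):
    return np.asarray(initial_displacement, dtype=float)    # placeholder for step 102

def cluster_candidates(candidates):
    return np.mean(candidates, axis=0)                      # placeholder for step 108

def nonlinear_refine(img_a, img_b, displacement):
    h = np.eye(3)                                           # placeholder for step 110:
    h[0, 2], h[1, 2] = displacement                         # translation-only homography
    return h

def link_images(img_a, img_b, sites=((0, 0), (40, 0), (0, 40), (-40, 0), (0, -40))):
    candidates = [phase_correlate(img_a, img_b, s) for s in sites]
    return nonlinear_refine(img_a, img_b, cluster_candidates(candidates))
```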
  • the purpose of phase correlation 102 is to find the displacement of one signal in relation to another, most commonly a displaced version of the first signal. Below the one-dimensional case is described, but the process is identical for the two-dimensional case.
  • the signals represent two-dimensional intensity images.
  • a signal is composed of the sum of its frequency components; each frequency has an amplitude and a phase displacement.
  • the standardized tool for extracting frequency and phase information from a signal is the Fourier transform.
  • Figure 12 shows basic components and workflow of the phase correlation 102.
  • the frequency and phase information are extracted from a first 112 and second signal 114 using the Fourier transform 116.
  • the Fourier transform 116 is preferably the Fast Fourier Transform (FFT).
  • the phases from the first 112 and second signal 114 are computed during step 118 and fed to the phase difference step 120, which provides the phase information for the output correlation signal. Then, the amplitudes from the first 112 and second signal 114 are processed in a normalization step 122.
  • the correlation signal 124 is obtained by inverse Fourier transform 126.
  • the result is a set of frequency components that all have normalized amplitude, but have phases corresponding to the difference between the phases of the input signals.
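A minimal sketch of this workflow for the two-dimensional case, assuming NumPy; the function names and the peak-wrapping convention are illustrative only, not the patent's implementation.

```python
# Hedged sketch of phase correlation (cf. steps 116-126): FFT both image
# blocks, keep only the phase difference (normalised cross-power spectrum),
# inverse-transform, and read the displacement off the strongest peak.
import numpy as np

def phase_correlation_surface(block_a, block_b, eps=1e-9):
    fa = np.fft.fft2(block_a)
    fb = np.fft.fft2(block_b)
    cross = fa * np.conj(fb)
    cross /= np.abs(cross) + eps          # normalise amplitudes, keep phase difference
    return np.real(np.fft.ifft2(cross))   # correlation surface with a peak at the shift

def peak_displacement(surface):
    dy, dx = np.unravel_index(np.argmax(surface), surface.shape)
    h, w = surface.shape
    if dy > h // 2: dy -= h               # peaks past the middle correspond to
    if dx > w // 2: dx -= w               # negative shifts (periodic surface)
    return dx, dy
```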
  • Figure 13 shows a first signal 130 consisting of three frequency components 132, 134, 136 and a second signal 138 consisting of three frequency components 140, 142 and 144.
  • the phase differences are constantly zero.
  • the frequency components will be aligned such that the position of their global maximum indicates the sought displacement.
  • the peak is central which indicates no displacement, that is along axis 148. In other words, if the two signals are not displaced the normalized frequency components are added with zero phase and produce a single peak in the centre of the inverse transform.
  • the correlation signal will contain phase information since it is obtained from the difference 150 of the phase of the first 130 and second signal 138.
  • the normalized signals 148 of the correlation signal will now reach their local maximum simultaneously at a peak shifted to the left, as illustrated in the bottom right of figure 13. The peak is shifted by the difference 150 moved from the first 130 to the second signal 138.
  • the two-dimensional case is analogous with the one-dimensional case.
  • the two-dimensional case is used when working with intensity images .
  • the correlation signal is represented by an intensity image, where a peak represents a detected motion.
  • a condition, such as a thresholding operation, is used to find the motion estimate peak representing the largest object, preferably a displacement of the entire image. The local maximum of the selected peak can be found with sub-pixel accuracy, thus ensuring sub-pixel displacement accuracy as well.
  • the phase correlation is performed from a specific origin, i.e. a predefined displacement of the images.
  • the area of overlap obtained when displacing the images is used when performing the phase correlation.
  • the two image blocks from the overlapping area are modulated with a window function that ensures there is no high frequency activity near the edges of the image blocks when performing the Fourier transform. This is done to prevent frequency peaks that are not related to the source images, but rather to the forced discontinuity of the signal as the Fourier transform treats the signal as periodic.
  • the window function chosen is shown below as equation 2, where "x" and "y” define a position within a window having height "H” and width " W":
  • the window function preserves the amplitude of relevant features near the edges of the image blocks but still removes the periodicity artefacts.
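The exact window of equation 2 is not reproduced legibly in this text, so the sketch below uses a separable 2-D Hann window as an illustrative stand-in; it likewise suppresses high-frequency activity at the block edges before the Fourier transform, but it is an assumption, not the patent's equation.

```python
# Hedged stand-in for the edge window: a separable 2-D Hann window applied
# to an image block before the FFT to avoid periodicity artefacts.
import numpy as np

def edge_window(height, width):
    return np.outer(np.hanning(height), np.hanning(width))

# Usage: windowed_block = block * edge_window(*block.shape)
```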
  • the phase correlation is performed as described earlier and a two-dimensional correlation surface is obtained.
  • the initial displacement is decided to be the zero vector since the two images are not very displaced. In practice, many different initial displacements are tested and combined. By applying phase correlation to the two images above, the correlation surface as seen in figure 15 is obtained.
  • since the correlation surface is the result of two forward Fourier transforms and one inverse transform, the correlation surface must also be interpreted as having a periodic nature. Thus, for every peak found as many as three other displacements must be considered, as shown in figure 16. Only four of the infinite number of repeated peaks are valid since all other peaks represent displacements that would lead to no image overlap.
  • any vector that has a too large magnitude can be discarded.
  • the number of phase correlation steps performed and the placement of the initial displacements determine the deviation threshold. A large number of phase correlations should lead to a low tolerance.
  • a candidate vector is assigned to each pixel in the correlation surface having a relative value in a top level defined relative to a maximum in the ranges 5 to 50% or 10 to 40% level or such as ranges top 5 to 14% level, 15 to 24% level, 25 to 34% level, or 35 to 50% level.
  • the correlation surface is scanned for the maximum level, and then it is scanned a second time, creating displacement vectors for each point having a level above the peak level multiplied by 0.7. This makes the method exceptionally stable for level variations. If a displacement vector would result in an overlap area less than 10% of the original image area, or if its magnitude is too large, the vector is immediately discarded.
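A hedged sketch of this scan, assuming a NumPy correlation surface; the wrapping of peaks past the half-way point reflects the periodic interpretation discussed above, and all names and the maximum-magnitude parameter are placeholders.

```python
# Hedged sketch of candidate extraction: keep a displacement vector for every
# surface point above 0.7 times the peak, discarding vectors that would give
# less than 10% overlap or whose magnitude is too large.
import numpy as np

def candidate_vectors(surface, image_shape, max_magnitude):
    h, w = surface.shape
    peak = surface.max()
    candidates = []
    for dy, dx in zip(*np.where(surface >= 0.7 * peak)):
        if dy > h // 2: dy -= h                      # wrap periodic peaks to signed shifts
        if dx > w // 2: dx -= w
        overlap = (image_shape[0] - abs(dy)) * (image_shape[1] - abs(dx))
        if overlap < 0.1 * image_shape[0] * image_shape[1]:
            continue                                 # less than 10% overlap: discard
        if np.hypot(dx, dy) > max_magnitude:
            continue                                 # magnitude too large: discard
        candidates.append((dx, dy))
    return candidates
```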
  • the next step is to measure the error when using a certain displacement vector.
  • the error metric used is a weighted mean square error where each pixel error is multiplied with a weight map before the summation.
  • the weight maps can be obtained by first applying a Sobel filter to the images and then filtering them using a box filter. The actual weight used for each pixel difference is the largest weight from the two weight maps. By doing this, edges and areas with high contrast are required to be in approximately the same places in the two images, but large uniform areas do not affect the error.
  • Figures 17a-d and 18a-d show the visualization of two candidate vectors and the weight maps used to calculate the error.
  • the vector with the least error is chosen as the best match.
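A minimal sketch of this weighted error metric, assuming SciPy and NumPy; the box size, the normalisation by the weight sum and the function names are assumptions made for the sketch.

```python
# Hedged sketch of the weighted error: Sobel-filtered, box-smoothed weight
# maps emphasise edges, and for each pixel the larger of the two weights
# scales the squared difference over the overlapping region.
import numpy as np
from scipy.ndimage import sobel, uniform_filter

def weight_map(image, box=5):
    grad = np.hypot(sobel(image, axis=0), sobel(image, axis=1))
    return uniform_filter(grad, size=box)

def weighted_error(overlap_a, overlap_b):
    w = np.maximum(weight_map(overlap_a), weight_map(overlap_b))
    diff = (np.asarray(overlap_a, float) - np.asarray(overlap_b, float)) ** 2
    return float(np.sum(w * diff) / (np.sum(w) + 1e-9))
```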
  • the single candidate obtained by one phase correlation step is not sufficient for a precise approximation of the final displacement by itself. Instead a number of phase correlation candidates are collected from a number of origins, called sites, and sorted into clusters. The cluster with the most candidates is statistically the correct displacement.
  • the sites' initial displacement vectors for the phase correlation step can be seen in figure 19. Each one of these sites will provide its candidate for the final displacement.
  • Each site's candidate vector is compared to every other site's vector and merged if they are supporting the same displacement within a certain tolerance.
  • their scores are added.
  • the score is the inverse of the vector's error.
  • the score is again converted to an error by taking its inverse.
  • the vector that now has the least error is determined to be the best overall displacement vector for the image pair and is used to create an initial homography matrix for the next step of the method.
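A hedged sketch of this merging and scoring, assuming each site has already produced a candidate vector and an associated error; the tolerance value and all names are illustrative.

```python
# Hedged sketch of the clustering step: candidates agreeing within a tolerance
# are merged, their scores (inverse errors) are added, and the merged vector
# with the lowest resulting error wins.
import numpy as np

def best_displacement(candidates, errors, tolerance=2.0):
    """candidates: list of (dx, dy); errors: matching list of error values."""
    scores = [1.0 / (e + 1e-9) for e in errors]
    merged = []                                   # entries: [vector, accumulated score]
    for vec, score in zip(map(np.asarray, candidates), scores):
        for entry in merged:
            if np.linalg.norm(entry[0] - vec) <= tolerance:
                entry[1] += score                 # supports the same displacement
                break
        else:
            merged.append([vec.astype(float), score])
    best = max(merged, key=lambda entry: entry[1])
    return tuple(best[0]), 1.0 / best[1]          # winning vector and its combined error
```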
  • m0..7 represents the elements of the homography matrix
  • (x'i, y'i) are the image point coordinates in image I'
  • (xi, yi) are the image point coordinates of image I
  • the weighting functions for image I' and image I weight edges and discontinuities in the respective images so that regions of similar intensities do not affect the error function as much as the edges do
  • the weighted images are images I' and I modified with the respective weighting factors as previously described
  • the partial derivatives, the Hessian matrix A and the weighted gradient vector b are continuously updated for each pixel of overlap between the two images.
  • the motion vector for the homography matrix is computed (equation 7):
  • the homography is updated from iteration j to the next iteration j +1 .
  • the factor λ is a time-varying stabilization parameter.
  • the factor λ is a stabilization parameter used to slow the descent of the error minimization and reduce the influence of pixel noise. If the global error has decreased everything is fine, m is updated and another iteration is begun. If not, λ is increased by a factor of 10 and Δm is recomputed.
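Equation 7 is not reproduced legibly here; in the usual Levenberg-Marquardt form that the surrounding description follows (Hessian A, weighted gradient vector b, stabilization factor λ), the update and the iteration step would read:

```latex
\Delta m = (A + \lambda I)^{-1} b, \qquad m^{(j+1)} = m^{(j)} + \Delta m
```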
  • the complete non-linear optimization consists of the following steps:

Abstract

This invention relates to a system and method for displaying digital images linked together and for enabling a user to navigate through the linked digital images. In particular, this invention relates to a system and method for generating a novel view from digital images or video sequences obtained from one or more fixed camera views during a user's navigation through said digital images or video sequences.

Description

SYSTEM AND METHOD FOR DISPLAYING DIGITAL IMAGES LINKED TOGETHER TO ENABLE NAVIGATION THROUGH VIEWS
Field of Invention
This invention relates to a system and method for displaying digital images linked together and for enabling a user to navigate through the linked digital images. In particular, this invention relates to a system and method for generating a novel view from digital images or video sequences obtained from one or more fixed camera views during a user's navigation through said digital images or video sequences.
Background of Invention
A digital image of a particular object, such as an object presented in a digital image on a computer, may be established through linking a series of digital pictures together so as to achieve a wide or panoramic representation of the object.
International patent application PCT/SE02/00097 discloses an image-based digital representation of a scenery and an image information storing structure for use in presentation of or navigation in a scenery represented by digital images. This application further discloses a method for improving presentation of information, while providing short downloading times and at the same time presenting a substantial amount of information. The digital representation comprises vertices, each representing a digital image, and edges binding together a first vertex and a second vertex. An edge represents information on the transition between a first digital image and a second digital image. The first and second digital image comprise a first and second area, respectively, wherein the depictions in the first area substantially correspond to the depictions in the second area. The transition information defines how at least one of the digital images is to be manipulated in order to provide a smooth boundary between the two digital images. The disclosed image-based representation and method requires a user to manually assist in identifying the substantially corresponding areas prior to linking the series of digital images. Hence in representations including a plurality of digital images it is considerably time consuming to link each of the digital images together.
American patent US 6,337,688 discloses a method and system enabling a user lacking specialized programming skills and training to produce a realistic simulation of a real environment. The simulation is constructed from a series of recorded frames that each contain an image of a real environment. In addition to the series of recorded frames, each frame comprises data specifying the associated position and orientation within the real environment. The associated positions are recorded in a camera utilising a position and orientation sensor implemented with inertial sensors that permit the camera to sense positional changes without directly receiving externally generated position information.
Canadian patent application CA 2,323,462 discloses a method and system for processing images into a mosaic. The method and system convert both the input image and the mosaic into Laplacian image pyramids and a real-time alignment process is applied to the levels within the respective pyramids. Hence the method and system uses a coarse to fine image alignment approach. The result of the alignment process is alignment information that defines the required transformations to achieve alignment. The method and system disclosed in the Canadian patent application, however, applies two-dimensional displacements as the image relation, which are inaccurate for images captured from a fixed camera position. This becomes especially noticeable when working with live video image streams.
International patent application WO 98/54674 describes a method, instruction set and apparatus for combining related source images, each represented by a set of digital data, by determining three-dimensional relationships between data sets representing related source images and creating a data set representing an output image by combining the data sets representing the source images in accordance with the determined three-dimensional relationships.
Both CA 2,323,462 and WO 98/54674 describe a Gaussian pyramid system of images, where the image at one resolution level is a Gaussian filtered and scaled version of the image at the resolution level below. This implies that a rough displacement can be obtained by correlating the highest level, and that the estimate can be improved by moving to the level below. This process will further improve the accuracy of the estimated displacement, but a good error estimate is still required to verify that the estimated solution is the correct one. In the Canadian and International patent application a simple square pixel intensity error is used as a metric, although it is widely known within the art that this error metric provides an insufficient correction.
In addition to the above mentioned patents and patent application disclosing prior art techniques, a motorized device controlling the pan, tilt and zoom of the camera, such as shown in figure 1, is a straightforward way to solve the navigation problem. The camera is mounted on a rotation device of which the pan, tilt can be remote controlled. This enables, if a suitable user interface exists, remote control navigation of the camera.
There are a number of downsides to this mechanical solution, the most serious being the instability of a motorized solution. Moving parts result in mechanical instability, requiring substantial maintenance. Further, mechanical systems are often noisy, which is a disadvantage in security applications. The motorized movement is, in addition, slower and more limited in movement than what is achieved by the digital version. Furthermore, the navigation is restricted to one user at a time.
Most panorama editors today, such as PTGui and Canon PhotoStitch, are focused on producing one large panoramic image mainly using one of two methods. Either stitching a number of still images captured from the same point of view but with different viewing directions into one large image or using large fisheye photographs.
Most tools using the first approach require a complex user interface making them cumbersome to embed in limited mobile units such as network cameras. These user interfaces require the user to input an often quite large number of parameters to be able to perform the stitching. The result using this approach is often one of high quality but at the price of a cumbersome interface. Figures 2a and 2b show typical extensive user interfaces for creating navigation enabled panoramic images with high quality. Some tools for creating panoramic images exist, which do not apply relatively complicated mathematical operations. This results in an image of lower quality, but it greatly reduces the need for a complicated user interface. These tools are often seen bundled with digital cameras, as shown in figures 3a and 3b. This is, however, not feasible when working with live video cameras since lack in quality will be very noticeable when looking at a dynamic scenery consisting of many moving objects. In these tools the placement of the images is also often assumed to be known in advance, which is not always possible when working with live video network cameras. These shortcomings are solved by the present invention.
Substantial prior art exists for fisheye lens systems enabling navigation such as pan, tilt, rotation and zoom using just one camera equipped with a fisheye lens. Since only one camera is required it becomes a very attractive system for the professional market, but the requirements on the camera in question are very high. The resolution has to be extremely high even by today's standards, and the lenses used are often very expensive and hard to manufacture. The fisheye lens system thus becomes too cumbersome and expensive to target the consumer market. Figures 4a and 4b show the result of the reversion process needed to convert the image captured by a camera equipped with a fisheye lens into a panoramic image. Since only one camera is used, the entire field of view, which is often more than 180 degrees, is compressed onto the image plane of the camera. This implies that an enormous resolution is required. Images captured by a fisheye lens are compressed in a spherical manner, i.e. the image is not compressed equally over the image plane but compressed more at the edges. Generally, the above referenced prior art technologies perform a single estimation of a displacement between corresponding areas of two images and assume that the single estimation is the correct one. Hence the statistical hit rate of the prior art technology is low.
Summary of the Invention
It is therefore an overall object of the present invention to provide a method and system compensating shortcomings and solving problems induced by the prior art technology.
In particular, it is an object of the present invention to provide a method and system for enabling a user to navigate through one or more digital images without having to change configuration of any associated digital camera.
Further, it is the object of the present invention to provide a method and system for automatic identification of corresponding areas of digital images on the basis of the digital images themselves so as to configure a digital representation of any particular scenery.
Further, it is the object of the present invention to provide a method and system for eliminating pixels representing objects in motion thus reducing the effect of the moving objects on estimation of the relations between images.
Further, it is the object of the present invention to clearly display the physical relation of the digital images acquired by cameras in relation to each other. This gives the user a good overview of the scenery to be captured. In addition, it is the object of the present invention to provide a method and system simplifying the creation of transition data between linked digital images.
It is a particular advantage of the present invention that by automating the creation of transition data the complex interlinking handling of a plurality of digital images is significantly improved since the user interaction, which is time consuming, has been reduced.
Further, it is a particular advantage of the present invention that temporal synchronization errors are identified and eliminated so as to avoid correlation of image points in two images not representing a projection of the same scenery point.
A particular feature of the present invention is the provision of a self-optimization procedure enabling an automatic and continuous recalibration of any number of cameras utilised for providing digital images.
Further, a particular feature of the present invention is the provision of a clustering technique performing a number of correlations originating from different initial displacements.
The above objects, advantages and features together with numerous other objects, advantages and features, which will become evident from the below detailed description, are according to a first aspect of the present invention obtained by a method for generating a view of at least part of a scenery from a plurality of images showing said scenery, comprising:
(a) capturing a first image of a first part of said scenery by means of a first camera; (b) capturing a second image of a second part of said scenery by means of a second camera, said first part and said second part at least partly overlapping thereby defining a common area; (c) estimating said common area comprising image data and computing from said image data transformation parameters interrelating said first and second image by means of a processor unit;
(d) performing an evaluation process of said transformation parameters using an error metric by means of said processor unit; and
(e) generating said view by means of said processor unit utilising said transformation parameters, said first image and/or said second image.
The term navigation is in this context to be construed as a tool for moving through a series of linked images. During navigation it appears as if a camera is moving although all cameras are actually static, which implies that no moving parts are required.
The term image stream is in this context to be construed as a representation of a continuous flow of images, for instance from a network camera. At each instant a single image is available for retrieval.
The term projective transformation is in this context to be construed as the process of projecting points in three-dimensional space onto a plane or, as mostly used in our case, projecting points from one plane onto another. It is further described in "Multiple view geometry in computer vision" by Richard Hartley and Andrew Zisserman, Cambridge University Press 2000. The term panoramic images is in this context to be construed as referring to images covering a very wide field of view, usually more than 180 degrees. If only parts of the image are shown at one time, navigation such as pan, tilt, rotation and zoom is made possible without actually moving the cameras capturing the images.
The term image stitching is in this context to be construed as referring to the process of creating a panoramic image out of many images captured with a narrow field of view. All images are captured from the same point of view and only differ in viewing direction; this implies that the images are related by a projective transform or a so-called homography, further described in "Multiple view geometry in computer vision" by Richard Hartley and Andrew Zisserman, Cambridge University Press 2000. If this projective transform can be found, the images can be stitched and blended together into one large mosaic of images making a panoramic image.
The term automatic computation of transformation parameters is in this context to be construed as referring to a method able to compute the parameters required to perform image linking without requiring any user interaction.
The term transition data is in this context to be construed as referring to the computed parameters, in the case of stitching panoramic images the projective transform relating the images, needed to enable navigation.
The term "a" or "an" is in this context to be construed as "one", "one or more", "at least one". In particular, a plurality of digital images are linked together according to a geometrical interrelationship between the digital images such as a series of still digital photographs of an object or objects from various angles, video sequences of an object or objects from various angles, or a combination thereof .
The method according to the first aspect of the present invention provides means for generating transformation parameters linking a plurality of images forming a scenery and means for enabling a user to view any particular part of the scenery.
A particular advantage of the method according to the first aspect of the present invention is the fact that the images do not need to be relocated from the cameras to the processor unit in order for the processor unit to generate transformation parameters. Hence a user may, through a viewer directly connected to the processor unit or connected through a computer network communicating with said viewer, generate a specific view in a scenery captured by the cameras without having to communicate entire images from the cameras but only the particular views. This provides a method which significantly increases the applicability of the present invention since the amount of transmitted data is reduced.
Further embodiments of the method according to the first aspect of the present invention may be obtained by features according to any of dependent claims 1 through 26.
The above objects, advantages and features together with numerous other objects, advantages and features, which will become evident from the below detailed description, are according to a second aspect of the present invention obtained by a method for generating one or more transformation parameters interrelating a plurality of images each showing at least a part of a scenery and for displaying said images in accordance with the transformation parameters, said method comprising:
(a) capturing a first image of a first part of said scenery by means of a first camera;
(b) capturing a second image of a second part of said scenery by means of a second camera, said first part and said second part at least partly overlapping thereby defining a common area;
(c) estimating said common area and computing transformation parameters interrelating said first and second image by means of a processor unit; (d) performing an evaluation process of said transformation parameters using an error metric by means of said processor unit;
(e) displaying said first and second image in accordance with said transformation parameters by means of a camera configuration display communicating with said processor unit; and
(f) storing said transformation parameters in a memory communicating with said processor unit.
The method according to the second aspect of the present invention enables a user to auto-configure how a plurality of images are to be linked together to form a scenery. The method also enables a user to re-compute the automatically generated proposal if the user finds that any of the cameras' fields of view needs to be adjusted.
The method according to the second aspect of the present invention is particularly advantageous since the configuration of the images is performed automatically. Hence the production time is significantly reduced and the operation needed to produce the appropriate links is simplified.
Further embodiments of the method according to the second aspect of the present invention may be obtained by features according to any of dependent claims 28 through 34.
The above objects, advantages and features together with numerous other objects, advantages and features, which will become evident from the below detailed description, are according to a third aspect of the present invention obtained by a system for generating one or more transformation parameters interrelating a plurality of images each showing at least a part of a scenery, comprising: a) a first camera for capturing a first image of a first part of said scenery; b) a second camera for capturing a second image of a second part of said scenery, said first part and said second part at least partly overlapping thereby defining a common area; and c) a processor unit for estimating said common area and computing transformation parameters interrelating said first and second image, said processor being adapted to perform an evaluation process of said transformation parameters by using an error metric.
The first and second camera according to the third aspect of the present invention may comprise a digital still camera, a network camera, a cell phone, a mobile phone, a digital video camera, any other device capable of generating said views of said scenery, or any combination thereof. The variety of digital imaging devices has multiplied in recent years, and the system according to the third aspect of the present invention may comprise any type known to the person skilled in the art.
It is particularly advantageous that all cameras are used; although they individually perhaps have low resolution, when the cameras are linked a scenery having a large resolution may be obtained.
The communication lines according to the third aspect of the present invention may comprise a wire or wireless dedicated line, computer network, television network, telecommunications network or any combinations thereof. The communication lines may in fact enable communication not only between the cameras and the processor unit but also between the cameras, the processor unit and further peripherals connecting to the cameras or processor unit.
The processor unit according to the third aspect of the present invention may comprise a plurality of processor devices communicating with one another through a communications network. The processor unit is in this context to be construed as any number of processors inter-connected so as to communicate with one another. For example, the cameras may include processors for preliminary handling of the images, and the system may further comprise processors for performing mathematical operations on the images and processors for communicating with a plurality of various clients.
The processor unit according to the third aspect of the present invention may comprise a viewer adapted to enable a first user to navigate through the scenery and a camera configuration display adapted to calculate the transformation parameters and enable a second user to store the transformation parameters in a storage device.
Alternatively, the processor unit may comprise a server communicating with the first and second camera through the communication lines and adapted to establish a database for the first and second image in a storage device. The server may be adapted to calculate said transformation parameters and/or communicate with a camera configuration display for calculating said transformation parameters and to enable a second user to store the transformation parameters in the storage device. The server may further be adapted to communicate with a viewer for determining a view of at least part of the scenery in accordance with user interaction navigating in the scenery.
Further alternatively, the processor unit may comprise a processor device in at least one of the first and second cameras, and a viewer communicating through the communication lines with the first and second camera, the viewer determining a view of at least part of the scenery in accordance with user interaction navigating in the scenery.
The above listed three alternatives provide solutions fulfilling the requirements of various systems. The first alternative presents a processor unit performing both the viewing operation and the calculation operation needed in order to enable a user to navigate in a scenery consisting of a plurality of images. However, as the second alternative describes, the processor unit may be implemented on a server communicating with further processor devices in the system and enabling clients (e.g. other processor devices) connecting to the server to utilise the information generated by the server. The second alternative is particularly advantageous for implementation on a computer network such as the Internet. In addition, the server may include a capability to connect to mobile processor devices such as mobile or cell phones having multimedia displaying means. The third alternative provides a processor unit integrated as processor devices in the cameras, thus eliminating the need for a server and enabling a system which is mobile. Thus, according to this alternative, navigating software may be embedded in a network camera.
The system according to the third aspect of the present invention may further comprise a display for displaying the first and second image in accordance with the transformation parameters. As mentioned above the display may be implemented as a multimedia display of a mobile or cell phone, or may be implemented as any monitor communicating with the processor unit. The storage device may be adapted to store the transformation parameters and/or the first image and/or the second image and may be established according to any techniques known to a person skilled in the art.
The system according to the third aspect of the present invention may further incorporate any features of the method according to a first aspect and any features of the method according to a second aspect of the present invention.
The above objects, advantages and features together with numerous other objects, advantages and features, which will become evident from the below detailed description, are according to a fourth and fifth aspect of the present invention obtained by a computer program comprising code adapted to perform the method according to the first aspect of the present invention and a computer program comprising code adapted to perform the method according to the second aspect of the present invention. The computer programs according to the fourth and fifth aspect of the present invention may incorporate features of the method according to the first aspect of the present invention, features of the method according to the second aspect of the present invention, and features of the system according to the third aspect of the present invention.
The system and methods according to the first, second and third aspects of the present invention may comprise a user interface used in conjunction with an automatic computation of transformation parameters requiring very little or no user interaction, and may easily be embedded in processor devices having limited graphical and memory storage capabilities. The present invention remedies the shortcomings of the prior art by enabling the use of multiple inexpensive consumer targeted cameras to allow navigation such as pan, tilt, rotation and zoom. The requirement of an extensive user interface is eliminated as described in the following section.
Brief Description of the Drawings
The above, as well as additional features and advantages of the present invention, will be better understood through the following illustrative and non-limiting detailed description of preferred embodiments of the present invention, with reference to the appended drawings, wherein:
figure 1, shows a photograph of a prior art pan/tilt camera toolkit. The camera is mounted on a device enabling motorized remote control of pan, tilt and zoom; figures 2a and 2b, show two screen shots of prior art user interface for a panoramic image creation software;
figures 3a and 3b, show a prior art user interface of a digital camera and a merge preview window;
figures 4a and 4b, show an image captured by a camera equipped with a fisheye lens and a panoramic image created by reversing the fisheye effect;
figure 5, shows a camera configuration display according to a first embodiment of the present invention;
figure 6, shows a camera configuration display according to a second embodiment of the present invention;
figure 7, shows a camera configuration display according to a third embodiment of the present invention;
figure 8, shows a first graphical user interface of the camera configuration display according to the first, second and third embodiments of the present invention;
figure 9, shows a second graphical user interface of the camera configuration display illustrating image streams presented in three dimensions;
figure 10, shows a flow chart of the method for linking image streams according to a fourth embodiment of the present invention;
figure 11, shows flow and components of an automatic computation of transformation parameters;
figure 12, shows flow and components of a phase correlation method;
figure 13, shows an example of displacement of peaks as applied in the phase correlation method;
figures 14a and 14b, show image "A" and image "B", being slightly displaced relative to one another;
figure 15, shows a correlation surface having an arrow indicating one candidate displacement vector;
figure 16, shows a correlation surface as a periodic signal;
figures 17a, 17b, 17c, and 17d, show first displacement vectors;
figures 18a, 18b, 18c, and 18d, show second displacement vectors; and
figure 19, shows a graph of sites of initial displacement vectors for phase correlation.
Detailed Description of Preferred Embodiments
Figure 5, shows a system according to a first embodiment of the present invention designated in entirety by reference numeral 10 and comprising a number of vital components.
Firstly, a plurality of cameras 12, being any type of device capable of delivering images, provide the system 10 with image streams. There are no special requirements on the cameras
12, any ordinary off-the-shelf consumer targeted camera can be used. The system 10 handles ordinary still images as well as live video streams. This requires the system 10 to operate well in real-time, since the content of video image streams can continuously change.
Secondly, a database 14 stores parameters and other information required to enable navigation through a mosaic of images. The database 14 may be incorporated in one of the cameras 12, if the cameras 12 are equipped with a memory storage unit.
Thirdly, a camera configuration display 16 (CCD) is the component responsible for computing the parameters required for the navigation and responsible for showing the current camera configuration. The parameters are computed from the camera image streams and user input, and the result is stored in the database 14.
Finally, a viewer 18, the component performing the actual navigation, can use the pre-computed parameters and camera image streams to enable navigation. The viewer 18 presents the result on for instance a computer screen and enables navigation through images with for instance an attached computer mouse or other pointing device. The system 10 enables a user to open any number of viewers simultaneously and the user can view different parts of the scenery independent of each other.
The system 10 operates by having the plurality of cameras 12 provide one or more streams of images to the camera configuration display 16 and the viewer 18. The camera configuration display 16 computes transformation parameters and stores these in the database 14. The viewer 18 receives the stream of images from the plurality of cameras 12, displays them for a user, and enables the user to navigate through the stream of images.
Arrows 20 in figure 5 illustrate the information paths of the system 10.
The system 10 uses the database 14 as a storing device for the computed parameters. In an alternative embodiment the database 14 is resident on one of the cameras 12, eliminating the use of a server component.
The system 10 may be configured in a number of different set-ups. Figure 6, shows a system according to a second embodiment of the present invention, which system is designated in entirety by reference numeral 30.
The system 30 comprises a plurality of cameras 32 and an actual server 34 containing a database. The server 34 has the capability to store and manipulate the camera image streams received from the plurality of cameras 32. The camera image streams are uploaded to the server 34.
The server component 34 accesses the images and computes the required transformation parameters. A camera configuration display 36 downloads the images and transformation parameters and displays the camera set-up. Similarly, the viewer 38 downloads the image streams and in addition the transformation parameters. In a specific embodiment the server 34 is intelligent so as to enable automatic calibration as well as minimizing the amount of downloaded data.
When working with video it is important to consider temporal synchronization. The two images of views used to generate the novel view are often separated in time, resulting in ghosting artefacts. If a server component is used it can bundle the image pairs so as to decrease the effect of temporal displacements. This will ensure that the artefacts due to temporal asynchronization are minimized.
The server can automatically detect and mask out motion in the streams of images. This implies that parts of an image that include motion are detected, masked out and excluded from the computation of the transformation parameters. If pixels containing movement in the images are used, errors will result in the transformation parameters. The masking implies that the transformation parameters can be re-calibrated even though the image streams contain motion. This is simply implemented by comparing a number of temporally neighbouring images in the image stream; the areas that are constant are included in the mask and used when computing transformation parameters, while the parts that differ are excluded from the mask.
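As a rough sketch of how such a mask could be built (not taken from the patent; the function name and the intensity tolerance below are illustrative assumptions), temporally neighbouring frames can be compared and only pixels with essentially constant intensity kept:

```python
import numpy as np

def static_region_mask(frames, threshold=8.0):
    """Return a boolean mask that is True where a stack of temporally
    neighbouring greyscale frames stays essentially constant, so that
    moving objects can be excluded from the transformation-parameter
    computation. `threshold` is an illustrative intensity tolerance."""
    stack = np.stack([f.astype(np.float64) for f in frames], axis=0)
    # A pixel is considered static if its intensity range over the
    # neighbouring frames is below the tolerance.
    variation = stack.max(axis=0) - stack.min(axis=0)
    return variation < threshold

# Usage sketch: mask = static_region_mask([frame_t0, frame_t1, frame_t2])
# Only pixels where the mask is True would then be fed to the estimation step.
```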
If a server component is used to channel the images from the cameras to the viewer, it can apply additional image processing such as correction for lens distortions, colour adjustment and additional compression.
Arrows 40 in figure 6 illustrate the information paths of the system 30. Note that an arrow 41 is punctured, thereby illustrating that this information path is optional. Figure 7, shows a system according to a third embodiment of the present invention, which system is designated in entirety by reference numeral 50.
The system 50 is the "black box"-case where no user interfaces are required. A plurality of cameras 52 transmit image streams to a viewer 54 that computes all relevant parameters in real-time and no storing capacity is needed. This is ideal for off-the-shelf consumer targeted cameras, where no processing or storing capacity exists. The camera images are fetched by the viewer 54, which initially performs the automatic computation of transformation parameters. The parameters are only stored locally and used for enabling navigation. The parameters can be continuously recomputed, to ensure that the link between the images is correct if the cameras 52 would for instance be knocked out of place.
Previously a certain amount of interaction was required by the user to create panoramic images or a transition consisting only of camera rotation or zoom. Great efforts have been made to limit the interactivity required by automating the procedure to create the required transition data. This also renders a self-optimization procedure possible.
The camera configuration display user interface is designed to be as effective as possible regarding ease of use, speed and the memory requirements of the platform hosting the interface. The platform may be hosted on a digital camera (video and/or still), a cell phone, a personal computer, and/or a server accessible through a computer network or a dedicated communication line. A central part of the camera configuration display user interface is the use of a method for automatic linking of images, described below. This implies that all steps required to compute the transition data for the case where the involved cameras approximately share focal point location are performed automatically. The software automatically estimates the region where two images substantially correspond. The term substantially corresponding regions should in this context be construed as the two areas depicting or showing the same or mainly the same part of the same or at least almost identical objects or sceneries. For example, a first area depicting a front view of a painting substantially corresponds to a second area depicting the painting from a side angle. The second area will show more of the frame of the painting, but the information in the first and the second area will still be corresponding.
The purpose of the camera configuration display user interface is to show the placements of the cameras and give the user the opportunity to, while the image streams are continuously updated, interactively move the cameras capturing the images and thus update the transformation parameters, i.e. the link between images. The interface also works as a backup if an automatic computation of transformation parameters fails.
The camera configuration display user interface is designed to effectively and quickly create navigation-enabled content from live video image streams. There are a number of different set-ups of the interface. In the basic set-up the images are presented as flat two-dimensional thumbnails two-dimensionally displaced in relation to each other.
Figure 8, shows a camera configuration display user interface designated in entirety by reference numeral 60. The camera configuration display user interface 60 is shown working with only a first image stream 62 and a second image stream 64 for illustrative purposes only. The camera configuration display user interface 60 may include any number of image streams. If the user is not content with the result he may order the camera configuration display through the camera configuration display user interface 60 to re-link by activating button 66. When the user is satisfied with the result he may order the camera configuration display through the camera configuration display user interface 60 to store the transformation parameters for image streams by activating a button 68.
An alternative set-up is to present the image relations in three dimensions as the cameras are actually physically placed. This implies that a three-dimensional presentation is required. The presentation can be rotated, translated and moved to view different parts of the camera set-up. This form of interface is required for future extensions, if the cameras were to be allowed to be placed arbitrarily, i.e. not required to share focal point. The workflow of the three-dimensional user interface differs from the two-dimensional one. The difference is the possibility to view the camera set-ups from different angles and positions. The 3D user interface also requires that the complete computation process is executed before the image streams can be related in three dimensions. In the two-dimensional user interface, described with reference to figure 8, the set-up process is completed in two steps, initial two-dimensional placement and final optimization. This cannot be done in the three-dimensional case. The three-dimensional user interface is used only to show the relative location and orientation of the cameras.
Figure 9, shows a 3D camera configuration display user interface designated in entirety by reference numeral 70. A first live video stream 72 and a second live video stream 74 are projected according to camera position and orientation. The camera configuration display user interface 70 is shown working with only two video streams for illustrative purposes only. The camera configuration display user interface 70 may include any number of image streams. Marker 76 symbolises the camera centre and punctured lines 78 illustrate the projection of the first live video stream 72 and punctured lines 80 illustrate the projection of the second video stream 74.
If the user is not content with the automatically computed result he may order the camera configuration display through the camera configuration display user interface 70 to re-compute by activating button 82. When the user is satisfied with the automatically computed result he may order the camera configuration display through the camera configuration display user interface 70 to store the transition data linking the image streams by activating a button 84.
Figure 10, shows a flowchart of the method for linking image streams according to a fourth embodiment of the present invention. The method is designated in entirety by reference numeral 90. The object of the camera configuration display and thus the method 90 is to define the relative two-dimensional placement among image streams. This placement is required to perform a final step 98 of the method 90. The final step 98 can be quite time consuming and is performed only when the user is satisfied with the camera set-up and the transformation parameters are stored.
The relative placement of the image streams is achieved automatically through the first step 92, the initial placement, using a phase correlation step of the automatic computation of transformation parameters. This process is only performed at start-up or when requested by the user and can therefore be computationally expensive, which increases the reliability of the method. This placement consists only of relative two-dimensional displacements. The user may in a second step 94 physically adjust the camera or cameras and view the result. The user may in a third step 96 click, drag and drop the camera images relative to one another. When the user is satisfied with the camera set-up, the user initiates the final step 98, performing the non-linear optimization, and the transformation parameters are stored.
Since the content of the live video streams is continuously changing, a stripped-down automatic placement method computes new placements relative to the previously computed displacements. This enables the user to adjust the cameras and view the result at interactive frame rates. If the user is satisfied with the camera set-up, the user can initiate the non-linear optimization and store the parameters. If for some reason the user is unsatisfied and wants to start over, this can be achieved through ordering a re-computation.
If the first step 92 of the method 90 fails to perform the initial placement, a texture snap function is available. The user can click-drag the image of a video stream and approximately place it at the correct position, and the method 90 will use the texture content of the two video streams to align them correctly. In other words, the user supplies a guess displacement defining the overlapping area that substantially corresponds between the images. The method 90 then computes the best match in texture originating from the user guess. This greatly reduces the time spent on defining the transition data even if the automatic procedure fails. If the user is satisfied with the camera set-up, the user can initiate the non-linear optimization and store the parameters. However, if for some reason the user is unsatisfied and wants to start over, this can be achieved by ordering a re-computation.
When the initial placements are computed automatically the quality of the result can vary, and the correctness of the placement is not always immediately obvious to the user. The borders of the video stream images are indicated, using a colouring system, to visualise the quality of the resulting video stream mosaic, i.e. the correctness of the placement. The quality is computed as the inverse of the error. If the match error is below a certain threshold the indication system indicates a match, e.g. a green border. If the error is above a certain upper threshold the indication system indicates a mismatch, e.g. a red border. This can occur if the images have no overlapping areas and the cameras need to be adjusted. Anything in-between is indicated by a gradient between for instance two colours or shades of one colour.
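A minimal sketch of such a border-colouring scheme is given below; the numeric thresholds and the linear gradient are illustrative assumptions, not values given in the patent:

```python
def border_colour(match_error, good_threshold=0.1, bad_threshold=0.5):
    """Map a match error to an RGB border colour: green for a match,
    red for a mismatch, and a linear green-to-red gradient in between.
    The thresholds are illustrative, not values from the patent."""
    if match_error <= good_threshold:
        return (0, 255, 0)            # match: green border
    if match_error >= bad_threshold:
        return (255, 0, 0)            # mismatch: red border
    # Gradient between the two thresholds.
    t = (match_error - good_threshold) / (bad_threshold - good_threshold)
    return (int(255 * t), int(255 * (1 - t)), 0)
```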
Originally a specific image is centred in a camera configuration display area of the user interface. By double-clicking on an image on the camera configuration display area the image becomes centred and more images may be added. This can of course be achieved in a number of different ways known to a person skilled in the art.
This function is not required in the 3D user interface case since the camera set-up can be viewed from any direction.
If for some reason the user requires a new initial placement a re-computation function is available through the button 66 or 82 in the user interfaces 60 and 70, respectively. A usual case where this is required is if the images have no overlap when the first initial placement was performed and no computation could be performed.
When the user is satisfied with the placements of the image streams the method 90 performs an optimization that uses the two-dimensional displacements to compute the homographies relating the images, i.e. enabling the generation of the transformation parameters needed to enable navigation. When the final optimizations are done, the parameters needed by the viewer to enable navigation are stored in the database, or used directly to view the result.
One of the main features of the present invention is the use of a method for automatic linking of images according to a fifth embodiment of the present invention. The fact that a fully automatic method is available enables a number of interesting cases.
The method for automatic linking of images works completely without requiring user intervention. This enables integration of the method in units where no graphical user interface is possible, for instance in digital cameras and mobile units. The method simply requires a number of image streams and computes the image relations enabling linking of images. Obviously the above described navigation through linked images may be similarly integrated where no user interface is available.
The method for automatic linking of images requires a number of image streams and fully automatically computes the transformation parameters needed to perform the navigation that is the nature of the invention. This could for instance be used to publish navigation-enabled content directly from the digital camera or other mobile unit responsible for acquiring the source image streams.
The method for automatic linking of images can be extended to work with images related by different degrees of zoom instead of images related by different orientations. This can be used to create zoomable images with no graphical user interface required. The method requires a number of images related by different degree of zoom and produces a zoomable image.
Since the method for automatic linking of images works without user intervention it can be used to automatically and continuously recalibrate the cameras by continuously re-computing the transformation parameters. This prevents the resulting link from becoming distorted if the cameras were to be physically affected, for instance knocked out of place.
The transformation parameters may be pre-computed and stored as a part of the image storing structure and may comprise transition data. In the case of live video streams the transition data is continuously recomputed as the content of the images changes. Often a coarse version of the transition data is stored prior to display and used to obtain, in real-time, an optimized version of the data suitable for use when displaying the transitions, i.e. self-optimization.
In the set-up containing the database/server 34, as shown in figure 6, the database/server can continuously update the parameters previously computed. This will ensure that the parameters are up to date even though the camera positions or orientations are altered. In the pan/tilt navigation case, the cameras capturing the images should, at least in theory, share focal point. This is of course almost never possible in practice and thus some parallax distortion will occur as a result of the focal point approximation. Using different projective transformations for different parts of the image streams, these distortions can be reduced. The method for automatic linking of images comprises automatic calibration on segments of the image individually in order to accomplish this.
The purpose of the method for automatic linking of images is, given two images of a scenery captured by two cameras that, often only approximately, share focal point but differ in orientation, to automatically compute the projective transformation (homography) relating the two images.
The projective transform is used to relate two image points in two images representing a projection of the same scenery point. A three by three matrix called a homography can describe this relation. An image point in one image (u, v) and the homography matrix (m0..m7) can be used to find the corresponding image point in the second image (u', v') according to equation 1:
$$[w u' \quad w v' \quad w] = [u \quad v \quad 1]\begin{bmatrix} m_0 & m_1 & m_2 \\ m_3 & m_4 & m_5 \\ m_6 & m_7 & 1 \end{bmatrix}$$

Equation 1
The homography matrix or other similar representations relating image pairs, such as two-dimensional displacements, is part of the transformation parameters used to enable navigation, i.e. the novel view generation. The projection of a scenery point in the first image is related to the projection of the same scenery point in the second image by a projective transformation (homography) that can be described by a three by three matrix. If this projective transformation can be found, navigation in the form of change of view direction from the view direction used when capturing the first image to the view direction used when capturing the second image can be performed. For further description of the projective transformation see "Multiple view geometry in computer vision".
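For illustration, the point mapping can be sketched as follows; the sketch assumes the convention in which the denominator is m6·u + m7·v + 1, as used in equation 3 later in this description:

```python
import numpy as np

def map_point(m, u, v):
    """Map an image point (u, v) in the first image to (u', v') in the
    second image using the homography parameters m[0]..m[7] (the ninth
    matrix element is fixed to 1). Column-vector convention assumed."""
    H = np.array([[m[0], m[1], m[2]],
                  [m[3], m[4], m[5]],
                  [m[6], m[7], 1.0]])
    wu, wv, w = H @ np.array([u, v, 1.0])
    return wu / w, wv / w

# Example: an identity homography maps every point onto itself.
identity = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
assert map_point(identity, 12.0, 34.0) == (12.0, 34.0)
```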
The projective transformation is, as shown in figure 11, derived by a method 100 for automatic computation of transformation parameters comprising three steps. First, phase correlation 102 is used to find the estimated two-dimensional motion from a first image 104 to a second image 106. The phase correlation 102 is performed multiple times originating from different displacements of the input images 104 and 106. The second step 108 clusters the displacement estimates obtained in the phase correlation to find the statistically most likely displacement.
The two-dimensional displacement found in the two steps works as a starting estimate for the final non-linear optimization process 110. The following sections describe each of the three steps in the method 100.
The purpose of phase correlation 102 is to find the displacement of one signal in relation to another, most commonly a displaced version of the first signal. Below the one-dimensional case is described, but the process is identical for the two-dimensional case. In the present invention the signals represent two-dimensional intensity images.
A signal is composed of the sum of its frequency components; each frequency has an amplitude and a phase displacement. The standardized tool for extracting frequency and phase information from a signal is the Fourier transform.
Figure 12 shows basic components and workflow of the phase correlation 102.
Firstly, the frequency and phase information are extracted from a first 112 and second signal 114 using the Fourier transform 116, preferably the Fast Fourier Transform (FFT).
Secondly, the phase differences between the first 112 and second signal 114 are computed during step 118 and fed as phase information to step 120 for the output correlation signal. Then, the amplitudes from the first 112 and second signal 114 are processed in a normalization step 122.
Every frequency containing information, i.e. having an amplitude greater than a pre-defined threshold, will be normalized. This implies that the frequency information of the final correlation signal will consist only of frequency amplitudes of one and zero. The phase information of the final correlation signal will consist of the phase difference of the two input signals. Finally the correlation signal 124 is obtained by the inverse Fourier transform 126. In other words, the result is a set of frequency components that all have normalized amplitude, but have phases corresponding to the difference between the phases of the input signals. These coefficients are fed to the inverse Fourier transform 126 and the correlation signal 124 is obtained. The nature of the correlation signal 124 is such that a peak indicates the amount of displacement between the two input signals. Figure 13 shows a first signal 130 consisting of three frequency components 132, 134, 136 and a second signal 138 consisting of three frequency components 140, 142 and 144. In the first case no displacement exists, and therefore the phase differences are constantly zero. When all frequency components are normalized, shown in figure 13 as normalized signals 146, and affected by the phase differences (no effect in this particular example since the first 130 and second signal 138 are in phase), the frequency components will be aligned such that the position of their global maximum indicates the sought displacement. In this case, the peak is central, along axis 148, which indicates no displacement. In other words, if the two signals are not displaced the normalized frequency components are added with zero phase and produce a single peak in the centre of the inverse transform.
If on the other hand the first 130 and the second signal 138 are displaced, as presented on the right hand side of figure 13, the correlation signal will contain phase information since it is obtained from the difference 150 of the phase of the first 130 and second signal 138. The normalized signals 148 of the correlation signal will now reach their local maximum simultaneously at a peak shifted to the left, as illustrated in the bottom right of figure 13. The peak is shifted by the difference 150 moved from the first 130 to the second signal 138.
If the entire signals are displaced in relation to each other only one peak will arise, as illustrated by the synthetic example in figure 13. If, however, the signal contains different parts with individual displacements, for instance as the result of a parallax effect, there will be one peak in the correlation signal for each displacement. The size of the peak is proportional to the size of the displaced object. This implies that finding the largest peak gives the most likely overall displacement of the signal.
The two-dimensional case is analogous with the one-dimensional case. The two-dimensional case is used when working with intensity images. In the two-dimensional case the correlation signal is represented by an intensity image, where a peak represents a detected motion. A condition, such as a thresholding operation, is used to find the motion estimate peak representing the largest object, preferably a displacement of the entire image. The local maximum of the selected peak can be found with sub-pixel accuracy, thus ensuring sub-pixel displacement accuracy as well.
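A condensed sketch of the two-dimensional phase correlation described above is given below; the helper name is our own, and the pre-defined amplitude threshold of the patent is replaced here by a small epsilon guard:

```python
import numpy as np

def phase_correlate(block_a, block_b, eps=1e-8):
    """Phase-correlate two equally sized greyscale blocks and return the
    estimated shift of block_b relative to block_a together with the
    correlation surface. The surface is periodic, so peak positions past
    the midpoint are wrapped to negative displacements."""
    A = np.fft.fft2(block_a.astype(np.float64))
    B = np.fft.fft2(block_b.astype(np.float64))
    cross = np.conj(A) * B
    # Normalize the amplitudes so that only the phase differences remain
    # (a small epsilon replaces the pre-defined amplitude threshold).
    cross /= np.maximum(np.abs(cross), eps)
    surface = np.real(np.fft.ifft2(cross))
    dy, dx = np.unravel_index(np.argmax(surface), surface.shape)
    h, w = surface.shape
    if dy > h // 2:
        dy -= h
    if dx > w // 2:
        dx -= w
    return dx, dy, surface
```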
The phase correlation is performed from a specific origin, i.e. a predefined displacement of the images. The area of overlap obtained when displacing the images is used when performing the phase correlation. The two image blocks from the overlapping area are modulated with a window function that ensures there is no high frequency activity near the edges of the image blocks when performing the Fourier transform. This is done to prevent frequency peaks that are not related to the source images, but rather to the forced discontinuity of the signal, as the Fourier transform treats the signal as periodic. The window function chosen is shown below as equation 2, where "x" and "y" define a position within a window having height "H" and width "W":
Equation 2
The window function preserves the amplitude of relevant features near the edges of the image blocks but still removes the periodicity artefacts. The phase correlation is performed as described earlier and a two-dimensional correlation surface is obtained.
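Since equation 2 is not reproduced in this text, the following sketch uses a tapered-cosine window as an illustrative stand-in with the same qualitative properties: it leaves most of the block at full amplitude while rolling off smoothly towards the edges. The exact window of the patent may differ.

```python
import numpy as np

def tapered_window(length, taper_fraction=0.2):
    """One-dimensional tapered-cosine (Tukey-style) window: flat over the
    centre, raised-cosine roll-off over `taper_fraction` of each end."""
    w = np.ones(length)
    n_taper = max(1, int(taper_fraction * length))
    ramp = 0.5 - 0.5 * np.cos(np.pi * np.arange(n_taper) / n_taper)
    w[:n_taper] = ramp
    w[-n_taper:] = ramp[::-1]
    return w

def block_window(height, width, taper_fraction=0.2):
    """Separable two-dimensional window built from the 1-D taper; it keeps
    most of the block at full amplitude while removing the forced
    discontinuity at the block borders."""
    return np.outer(tapered_window(height, taper_fraction),
                    tapered_window(width, taper_fraction))

# Usage sketch: windowed_block = block * block_window(*block.shape)
```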
The following example describes the process. The images used are shown in figure 14.
The initial displacement is decided to be the zero vector since the two images are not very displaced. In practice, many different initial displacements are tested and combined. By applying phase correlation to the two images above, the correlation surface as seen in figure 15 is obtained.
Since the correlation surface is the result of two forward Fourier transforms and one inverse transform, the correlation surface must also be interpreted as having a periodic nature. Thus, for every peak found as many as three other displacements must be considered, as shown in figure 16. Only four of the infinite number of repeated peaks are valid since all other peaks represent displacements that would lead to no image overlap.
Since the method for automatic linking of images has been fed with a rough estimate of the displacement, any vector that has a too large magnitude can be discarded. The number of phase correlation steps performed and the placement of the initial displacements determine the deviation threshold. A large number of phase correlations should lead to a low tolerance.
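The periodic interpretation and the deviation threshold can be sketched as follows, for a peak found at column dx and row dy of a width x height correlation surface (indices in the ranges 0..width-1 and 0..height-1); the function name and parameters are illustrative:

```python
def candidate_displacements(dx, dy, width, height, max_magnitude):
    """Enumerate the four displacement interpretations of a correlation
    peak at (dx, dy) in a periodic width x height surface, discarding
    candidates whose magnitude exceeds the deviation threshold."""
    candidates = []
    for cx in (dx, dx - width):
        for cy in (dy, dy - height):
            if (cx * cx + cy * cy) ** 0.5 <= max_magnitude:
                candidates.append((cx, cy))
    return candidates
```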
A candidate vector is assigned to each pixel in the correlation surface having a relative value in a top level defined relative to a maximum, in ranges such as the top 5 to 50% or 10 to 40% level, or such as the top 5 to 14% level, 15 to 24% level, 25 to 34% level, or 35 to 50% level. First the correlation surface is scanned for the maximum level, and then it is scanned a second time, creating displacement vectors for each point having a level above the peak level multiplied by 0.7. This makes the method exceptionally stable against level variations. If a displacement vector would result in an overlap area less than 10% of the original image area, or if its magnitude is too large, the vector is immediately discarded.
The next step is to measure the error when using a certain displacement vector. The error metric used is a weighted mean square error where each pixel error is multiplied with a weight map before the summation. The weight maps can be obtained by first applying a Sobel filter to the images and then filtering them using a box filter. The actual weight used for each pixel difference is the largest weight from the two weight maps. By doing this, edges and areas with high contrast are required to be in approximately the same places in the two images, but large uniform areas do not affect the error.
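A sketch of this weighted error metric, using scipy for the Sobel and box filters (the box size is an illustrative assumption):

```python
import numpy as np
from scipy import ndimage

def weight_map(image, box_size=9):
    """Weight map as described above: Sobel gradient magnitude smoothed
    with a box filter, so edges and high-contrast areas get high weight."""
    img = image.astype(np.float64)
    gx = ndimage.sobel(img, axis=1)
    gy = ndimage.sobel(img, axis=0)
    return ndimage.uniform_filter(np.hypot(gx, gy), size=box_size)

def weighted_match_error(patch_a, patch_b, weights_a, weights_b):
    """Weighted mean square error over the overlapping area: each pixel
    difference is weighted by the larger of the two weight-map values."""
    diff = patch_a.astype(np.float64) - patch_b.astype(np.float64)
    w = np.maximum(weights_a, weights_b)
    return np.mean(w * diff * diff)
```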
Figures 17a-d and 18a-d, show the visualization of two candidate vectors and the weight maps used to calculate the error.
When the errors have been calculated for all displacement candidates, the vector with the least error is chosen as the best match. The single candidate obtained by one phase correlation step is not sufficient for a precise approximation of the final displacement by itself. Instead a number of phase correlation candidates are collected from a number of origins, called sites, and sorted into clusters. The cluster with the most candidates is statistically the correct displacement.
The sites' initial displacement vectors for the phase correlation step can be seen in figure 19. Each one of these sites will provide its candidate for the final displacement.
Each site's candidate vector is compared to every other site's vector and merged if they are supporting the same displacement within a certain tolerance. When merging two vectors, their scores are added. The score is the inverse of the vector's error. When all sites have been compared, the score is again converted to an error by taking its inverse. The vector that now has the least error is determined to be the best overall displacement vector for the image pair and is used to create an initial homography matrix for the next step of the method.
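A simplified, greedy sketch of this merging step is shown below; the tolerance value and the data layout are illustrative assumptions:

```python
def merge_site_candidates(candidates, tolerance=2.0):
    """Merge per-site candidate displacement vectors that support the same
    displacement within `tolerance` pixels, adding their scores (score =
    1 / error), and return the best supported displacement together with
    its merged error. `candidates` is a list of ((dx, dy), error) pairs."""
    merged = []  # entries are [dx, dy, accumulated_score]
    for (dx, dy), error in candidates:
        score = 1.0 / error if error > 0 else float("inf")
        for entry in merged:
            if abs(entry[0] - dx) <= tolerance and abs(entry[1] - dy) <= tolerance:
                entry[2] += score          # supporting vectors add their scores
                break
        else:
            merged.append([dx, dy, score])
    if not merged:
        return None
    best = max(merged, key=lambda entry: entry[2])
    return (best[0], best[1]), 1.0 / best[2]
```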
Once the initial two-dimensional displacement is found it can be used to find the projective transform relating the two images. A modified Levenberg-Marquardt non-linear optimization is used with the initial two-dimensional displacement as a starting value. Any gradient descent method may be used to accomplish the optimization; for a more detailed description of the Levenberg-Marquardt optimization see "Image Mosaicing for Tele-Reality Applications", by Richard Szeliski, Cambridge Research Laboratory. Two image points representing the same scenery point are related by a homography (projective transform) as shown in equation 3 below; see "Multiple view geometry in computer vision" for a further detailed description:
$$x_i' = \frac{m_0 x_i + m_1 y_i + m_2}{m_6 x_i + m_7 y_i + 1}, \qquad y_i' = \frac{m_3 x_i + m_4 y_i + m_5}{m_6 x_i + m_7 y_i + 1}$$

Equation 3
The purpose of the optimisation is to minimize in accordance with equation 4:
$$E = \sum_i e_i^2 = \sum_i \left[\mu'(x_i', y_i')\, I'(x_i', y_i') - \mu(x_i, y_i)\, I(x_i, y_i)\right]^2$$

Equation 4
i.e. the square intensity error over all pixels i. Above, m0..m7 represent the elements of the homography matrix, (x'i, y'i) are the image point coordinates in image I', (xi, yi) are the image point coordinates of image I, and μ and μ' are weighting functions weighting edges and discontinuities in the respective images so that regions of similar intensities do not affect the error function as much as the edges. I'μ and Iμ represent image I' and I modified with the respective weighting factor as previously described. To perform the minimization the partial derivatives of ei with respect to the homography elements m0..m7 have to be computed. These can be solved algebraically by equation 5:
$$
\begin{aligned}
\frac{\partial e_i}{\partial m_0} &= \frac{x_i}{D_i}\frac{\partial I'}{\partial x'}, &
\frac{\partial e_i}{\partial m_1} &= \frac{y_i}{D_i}\frac{\partial I'}{\partial x'}, &
\frac{\partial e_i}{\partial m_2} &= \frac{1}{D_i}\frac{\partial I'}{\partial x'}, \\
\frac{\partial e_i}{\partial m_3} &= \frac{x_i}{D_i}\frac{\partial I'}{\partial y'}, &
\frac{\partial e_i}{\partial m_4} &= \frac{y_i}{D_i}\frac{\partial I'}{\partial y'}, &
\frac{\partial e_i}{\partial m_5} &= \frac{1}{D_i}\frac{\partial I'}{\partial y'}, \\
\frac{\partial e_i}{\partial m_6} &= -\frac{x_i}{D_i}\left(x_i'\frac{\partial I'}{\partial x'} + y_i'\frac{\partial I'}{\partial y'}\right), &
\frac{\partial e_i}{\partial m_7} &= -\frac{y_i}{D_i}\left(x_i'\frac{\partial I'}{\partial x'} + y_i'\frac{\partial I'}{\partial y'}\right) &&
\end{aligned}
$$

Equation 5
where Di = m6 xi + m7 yi + 1 and (∂I'/∂x', ∂I'/∂y') are the image intensity gradients at I'(x'i, y'i). These partial derivatives are used to compute the approximate Hessian matrix A and the weighted gradient vector b (equation 6):
$$A_{kl} = \sum_i \frac{\partial e_i}{\partial m_k}\,\frac{\partial e_i}{\partial m_l}, \qquad b_k = -2 \sum_i e_i\,\frac{\partial e_i}{\partial m_k}$$

Equation 6
The partial derivatives, the Hessian matrix A and the weighted gradient vector b are continuously updated for each pixel of overlap between the two images. When each pixel of overlap has contributed to the Hessian matrix and the weighted gradient vector, the motion vector for the homography matrix is computed (equation 7):
$$\Delta m = \mu\,(A + \lambda I)^{-1}\, b, \qquad m^{j+1} = m^{j} + \Delta m$$

Equation 7
The homography is updated from iteration j to the next iteration j+1. The factor λ is a time-varying stabilization parameter. The factor μ is a constant stabilization parameter used to slow the descent of the error minimization and reduce the influence of pixel noise. If the global error has decreased everything is fine, m is updated and another iteration is begun. If not, λ is increased by a factor of 10 and Δm is recomputed. The complete non-linear optimization consists of the following steps:
1. For each pixel i at location (xi, yi):
• Compute its corresponding position (x'i, y'i) in the other image using the above described projective transform relation.
• Compute the error in intensity between the corresponding pixels; the implementation uses bilinear interpolation to avoid jittering due to noise in the images:

$$e_i = \mu'(x_i', y_i')\, I'(x_i', y_i') - \mu(x_i, y_i)\, I(x_i, y_i), \qquad E = \sum_i e_i^2$$

• Compute the intensity gradients (∂I'/∂x', ∂I'/∂y') at pixel i using for instance the Sobel operator.
• Compute the partial derivatives and add the pixel's contribution to the Hessian matrix and the weighted gradient vector.
2. Compute the motion vector, Δm = μ(A + λI)^(-1) b, and update the homography matrix.
3. Re-compute the pixel error with the new homography matrix; if the error has decreased, continue iterating. If the error has increased, increase λ by a factor of 10 and recompute the pixel error.
4. Continue iterating for a fixed number of iterations or until the pixel error is below a certain threshold.
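A compact sketch of this iteration is given below. It follows equations 3 to 7 but omits the weighting maps μ and μ' for brevity, uses numpy gradients instead of the Sobel operator, and relies on scipy for the bilinear interpolation; all parameter defaults are illustrative:

```python
import numpy as np
from scipy import ndimage

def refine_homography(I, Ip, m, iterations=20, mu=0.5, lam=1e-3):
    """Levenberg-Marquardt refinement sketch following equations 3-7.
    I and Ip are greyscale images; points in I are mapped into Ip by the
    homography parameters m[0]..m[7]. Weighting maps are omitted, and
    mu = 0.5 recovers the usual Gauss-Newton step length."""
    m = np.asarray(m, dtype=np.float64)
    I = I.astype(np.float64)
    Ip = Ip.astype(np.float64)
    h, w = I.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xs = xs.ravel().astype(np.float64)
    ys = ys.ravel().astype(np.float64)
    I_flat = I.ravel()
    grad_y, grad_x = np.gradient(Ip)          # intensity gradients of I'

    def residuals_and_jacobian(m):
        D = m[6] * xs + m[7] * ys + 1.0       # denominator of equation 3
        xp = (m[0] * xs + m[1] * ys + m[2]) / D
        yp = (m[3] * xs + m[4] * ys + m[5]) / D
        inside = (xp >= 0) & (xp <= w - 1) & (yp >= 0) & (yp <= h - 1)
        xi, yi, Di = xs[inside], ys[inside], D[inside]
        xpi, ypi = xp[inside], yp[inside]
        coords = np.vstack([ypi, xpi])
        # Bilinear interpolation of I' and its gradients at the warped points.
        Ip_val = ndimage.map_coordinates(Ip, coords, order=1)
        gx = ndimage.map_coordinates(grad_x, coords, order=1)
        gy = ndimage.map_coordinates(grad_y, coords, order=1)
        e = Ip_val - I_flat[inside]           # pixel error e_i (equation 4)
        J = np.empty((e.size, 8))             # partial derivatives (equation 5)
        J[:, 0] = xi / Di * gx
        J[:, 1] = yi / Di * gx
        J[:, 2] = 1.0 / Di * gx
        J[:, 3] = xi / Di * gy
        J[:, 4] = yi / Di * gy
        J[:, 5] = 1.0 / Di * gy
        J[:, 6] = -xi / Di * (xpi * gx + ypi * gy)
        J[:, 7] = -yi / Di * (xpi * gx + ypi * gy)
        return e, J

    e, J = residuals_and_jacobian(m)
    error = np.sum(e * e)
    for _ in range(iterations):
        A = J.T @ J                           # approximate Hessian (equation 6)
        b = -2.0 * J.T @ e                    # weighted gradient vector (equation 6)
        dm = mu * np.linalg.solve(A + lam * np.eye(8), b)   # step (equation 7)
        e_new, J_new = residuals_and_jacobian(m + dm)
        error_new = np.sum(e_new * e_new)
        if error_new < error:                 # error decreased: accept and iterate
            m, e, J, error = m + dm, e_new, J_new, error_new
        else:                                 # error increased: reject, raise lambda
            lam *= 10.0
    return m
```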

Claims
1. A method for generating a view of at least part of a scenery from a plurality of images showing said scenery, comprising:
(a) capturing a first image of a first part of said scenery by means of a first camera;
(b) capturing a second image of a second part of said scenery by means of a second camera, said first part and said second part at least partly overlapping thereby defining a common area;
(c) estimating said common area comprising image data and computing from said image data transformation parameters interrelating said first and second image by means of a processor unit;
(d) performing an evaluation process of said transformation parameters using an error metric by means of said processor unit; and
(e) generating said view by means of said processor unit utilising said transformation parameters, said first image and/or said second image.
2. A method according to claim 1 further comprises storing said transformation parameters by means of a memory.
3. A method according to any of claims 1 or 2 further comprises displaying said view by means of a viewer using said transformation parameters.
4. A method according to any of claims 1 to 3 further comprising iterating steps (c) and (d) to find the best set of transformation parameters.
5. A method according to any of claims 1 to 4, wherein said first and second camera have in relation to each other fixed positions and/or changeable fields of view and/or changeable viewing directions.
6. A method according to any of claims 1 to 5, wherein said first and second image comprises a first stream of images and a second stream of images, respectively.
7. A method according to any of claims 1 to 6, wherein said capturing of said first and second image comprises digitisation of said first and second image so as to generate a first digital image of said first image and a second digital image of said second image.
8. A method according to any of claims 1 to 7, wherein said first and second camera comprises a digital still camera, a network camera, cell phone, mobile phone, a digital video camera, any other device capable of generating said views of said scenery, or any combination thereof.
9. A method according to any of claims 1 to 8 further comprises communication of said first and second image to said processor unit by utilising a communication line such as a wire or wireless dedicated line, computer network, television network, telecommunications network or any combinations thereof.
10. A method according to claim 9, wherein said communication comprises data compressing and data packaging of said first image, said second image and/or said transformation parameters.
11. A method according to any of claims 1 to 10, wherein said processor unit comprises a plurality of processor devices communicating with one another.
12. A method according to claim 11, wherein said first camera, said second camera and/or said viewer each comprise one or more of said plurality of processor devices.
13. A method according to any of claims 1 to 12, wherein said estimating comprises performing a correlation operation for estimating displacements between said first and second image.
14. A method according to claim 13, wherein said estimating said common area and computing transformation parameters comprises an estimation clustering operation for clustering said displacements for calculating a statistical displacement.
15. A method according to claim 14, wherein said estimating said common area and computing transformation parameters comprises an optimisation operation utilising said statistical displacement for computing said transformation parameters.
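As a rough, non-authoritative sketch of the correlation, clustering and statistical-displacement steps of claims 13 to 15: each block of the first image is correlated against the second image, and the resulting candidate displacements are reduced to a single statistical displacement. The median is used here as a simple stand-in for a fuller clustering operation, and the block size and confidence threshold are illustrative choices.

```python
import cv2
import numpy as np

def estimate_statistical_displacement(img1, img2, block=64, step=64, min_score=0.7):
    """Correlate blocks of the first image against the second image and reduce
    the per-block displacements to a single statistical displacement."""
    h, w = img1.shape[:2]
    displacements = []
    for y in range(0, h - block, step):
        for x in range(0, w - block, step):
            patch = img1[y:y + block, x:x + block]
            # normalised cross-correlation of the block against the other image
            res = cv2.matchTemplate(img2, patch, cv2.TM_CCOEFF_NORMED)
            _, score, _, (bx, by) = cv2.minMaxLoc(res)
            if score > min_score:                 # keep only confident matches
                displacements.append((bx - x, by - y))
    if not displacements:
        return None
    # crude "clustering": the median is robust against outlying displacements
    return np.median(np.array(displacements, dtype=float), axis=0)
```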
16. A method according to any of claims 1 to 15 further comprising determining said view by means of user interaction navigating in said scenery and registering said user interaction by means of said processor unit.
17. A method according to any of claims 1 to 16, wherein using said error metric comprises measuring an error value for each pair of overlapping pixels in said common area, weighting each of said error values for each pair of overlapping pixels in said common area in accordance with a weight map of said common area, and performing a summation of said measured and weighted error values to obtain a total error value.
18. A method according to claim 17, wherein said weight map is obtained by applying an edge detection filter to said first and second images.
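A minimal sketch of the weighted error metric of claims 17 and 18, assuming the weight map is derived from a Sobel edge filter applied (for brevity) to the first image only, whereas claim 18 applies the filter to both images; the weight normalisation is an illustrative choice.

```python
import cv2
import numpy as np

def weighted_overlap_error(img1, img2_warped, overlap_mask):
    """Sum of per-pixel squared intensity errors over the common area,
    each weighted by an edge-detection based weight map."""
    # weight map from a Sobel edge filter, normalised to [0, 1]
    edges = (np.abs(cv2.Sobel(img1, cv2.CV_32F, 1, 0)) +
             np.abs(cv2.Sobel(img1, cv2.CV_32F, 0, 1)))
    weights = edges / (edges.max() + 1e-6)
    diff = img1.astype(np.float32) - img2_warped.astype(np.float32)
    return float(np.sum(weights[overlap_mask] * diff[overlap_mask] ** 2))
```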
19. A method according to any of claims 1 to 18 further comprises updating said transformation parameters so as to ensure said transformation parameters between said first and second image are updated when said first part and/or said second part changes.
20. A method according to claim 19, wherein said updating is provided in accordance with a schedule, detection of changes in said first part and/or said second part, request of a user or combinations thereof.
21. A method according to any of claims 1 to 20, wherein a server incorporates said processor unit, memory and viewer.
22. A method according to claim 21, wherein said server communicates with an agent and enables said agent to execute said viewer, said processor unit, and to read and write to said memory.
23. A method according to claim 22, wherein said agent comprises a user, a client, a software, a hardware, or any combinations thereof.
24. A method according to any of claims 6 to 23 further comprising detecting objects moving in said common area of said first and second stream of images and masking out said objects from said image data prior to computing said transformation parameters.
25. A method according to any of claims 1 to 24, wherein said generating said view comprises correcting lens distortion and/or adjusting colours of said first and second image and/or compressing said first and second image.
26. A method according to any of claims 1 to 25 further comprising compensating parallax distortion by means of said processor unit generating separate transformation parameters for each of a plurality of pre-defined areas of said first and/or second image.
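The parallax compensation of claim 26 might, in very schematic form, look like the following sketch: the image is divided into a grid of pre-defined areas and separate transformation parameters are estimated for each. Here estimate_region_parameters is a hypothetical helper standing in for the estimation of claims 13 to 15, and the regular grid is only one possible choice of pre-defined areas.

```python
def per_region_transformations(img1, img2, rows, cols, estimate_region_parameters):
    """Divide the first image into a rows x cols grid of pre-defined areas and
    estimate separate transformation parameters for each area."""
    h, w = img1.shape[:2]
    regions = {}
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * h // rows, (r + 1) * h // rows
            x0, x1 = c * w // cols, (c + 1) * w // cols
            # each pre-defined area gets its own parameters, so depth-dependent
            # (parallax) offsets can be absorbed locally rather than globally
            regions[(r, c)] = estimate_region_parameters(
                img1[y0:y1, x0:x1], img2, (x0, y0))
    return regions
```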
27. A method for generating one or more transformation parameters interrelating a plurality of images each showing at least a part of a scenery and for displaying said images in accordance with the transformation parameters, said method comprising:
(a) capturing a first image of a first part of said scenery by means of a first camera;
(b) capturing a second image of a second part of said scenery by means of a second camera, said first part and said second part at least partly overlapping thereby defining a common area;
(c) estimating said common area and computing transformation parameters interrelating said first and second image by means of a processor unit;
(d) performing an evaluation process of said transformation parameters using an error metric by means of said processor unit;
(e) displaying said first and second image in accordance with said transformation parameters by means of a camera configuration display communicating with said processor unit; and
(f) storing said transformation parameters in a memory communicating with said processor unit.
28. A method according to claim 27, wherein said displaying comprises indicating borders of said first and second image by means of said processor unit providing a specific colour to the border.
29. A method according to claim 28, wherein said border is provided in a first colour when said error metric is below a first pre-defined threshold and in a second colour when said error metric is above a second pre-defined threshold so as to indicate when said transformation parameters provide an acceptable link between said first and second images and when said transformation parameters provide an unacceptable link between said first and second images.
30. A method according to claim 29, wherein said border is provided in a third colour when said error metric is between said first pre-defined threshold and said second pre-defined threshold.
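The colour coding of claims 28 to 30 reduces, in essence, to a mapping like the following fragment; the particular colours and threshold values are arbitrary illustrations.

```python
def border_colour(error_value, low_threshold, high_threshold):
    """Map the error metric of a linked image pair to a border colour:
    one colour for an acceptable link, another for an unacceptable link,
    and a third for values between the two thresholds."""
    if error_value < low_threshold:
        return (0, 255, 0)        # e.g. green: acceptable transformation parameters
    if error_value > high_threshold:
        return (255, 0, 0)        # e.g. red: unacceptable transformation parameters
    return (255, 255, 0)          # e.g. yellow: between the two thresholds
```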
31. A method according to any of claims 27 to 30 further comprising updating of said transformation parameters so as to ensure said transformation parameters between said first and second image are updated when said first part and/or said second part changes.
32. A method according to claim 31, wherein said updating is provided in accordance with a schedule, detection of changes in said first part and/or said second part, request of a user or combinations thereof.
33. A method according to any of claims 27 to 32, wherein said estimating said common area and computing transformation parameters further comprises enabling a user to drag and drop said first and second image relative to one another.
34. A method according to any of claims 27 to 33, incorporating features of said method according to any of claims 1 to 26.
35. A system for generating one or more transformation parameters interrelating a plurality of images each showing at least a part of a scenery comprising:
a) a first camera for capturing a first image of a first part of said scenery;
b) a second camera for capturing a second image of a second part of said scenery, said first part and said second part at least partly overlapping thereby defining a common area; and
c) a processor unit for estimating said common area and computing transformation parameters interrelating said first and second image, said processor being adapted to perform an evaluation process of said transformation parameters by using an error metric.
36. A system according to claim 35, wherein said first and second camera comprises a digital still camera, a network camera, cell phone, mobile phone, a digital video camera, any other device capable of generating said views of said scenery, or any combination thereof.
37. A system according to any of claims 35 or 36 further comprising communication lines for enabling communication between said first and second camera and said processor unit, said communication lines comprising a wire or wireless dedicated line, computer network, television network, telecommunications network or any combinations thereof.
38. A system according to any of claims 35 to 37, wherein said processor unit comprises a plurality of processor devices communicating with one another through a communications network.
39. A system according to any of claims 35 to 38, wherein said processor unit is adapted to register user interaction navigating in said scenery determining said view of at least part of said scenery.
40. A system according to any of claims 35 to 39, wherein said processor unit comprises a viewer adapted to enable a first agent to navigate through said scenery, a camera configuration display adapted to calculate said transformation parameters, and a storage device adapted to enable a second agent to store said transformation parameters.
41. A system according to claim 40, wherein said first and second agent comprises a user, a client, a software, a hardware, or any combinations thereof.
42. A system according to any of claims 35 to 41, wherein said processor unit comprises a server communicating with said first and second camera through said communication lines and adapted to establish a database for said first and second image in a storage device.
43. A system according to claim 42, wherein said server is adapted to calculate said transformation parameters and to enable a user to retrieve said transformation parameters from said storage device.
44. A system according to any of claims 42 or 43, wherein said server is adapted to communicate with a viewer for determining a view of at least part of said scenery in accordance with user interaction navigating in said scenery.
45. A system according to any of claims 35 to 44, wherein said processor unit comprises a processor device in at least one of said first and second cameras, and a viewer communicating through said communication lines with said processor device in said at least one of said first and second cameras, said viewer determining a view of at least part of said scenery in accordance with user interaction navigating in said scenery.
46. A system according to any of claims 35 to 45 further comprising a display for displaying said first and second image in accordance with said transformation parameters, said display communicating with said processor unit.
47. A system according to any of claims 35 to 46, wherein said storage device is adapted to store said transformation parameters and/or said first image and/or said second image.
48. A system according to any of claims 35 to 47, wherein said system incorporates features of said method according to any of claims 27 to 34, and incorporates features of said method according to any of claims 1 to 26.
49. A computer program comprising code adapted to perform said method according to claim 1.
50. A computer program according to claim 49, wherein said computer program incorporates features of said method according to any of claims 2 to 26, features of said method according to any of claims 27 to 34, and features of said system according to any of claims 35 to 48.
51. A computer program comprising code adapted to perform said method according to claim 27.
52. A computer program according to claim 51, wherein said computer program incorporates features of said method according to any of claims 1 to 26, features of said method according to any of claims 28 to 34, and features of said system according to any of claims 35 to 48.
PCT/SE2003/001237 2002-07-31 2003-07-24 System and method for displaying digital images linked together to enable navigation through views WO2004012144A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2003247316A AU2003247316A1 (en) 2002-07-31 2003-07-24 System and method for displaying digital images linked together to enable navigation through views

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SE0202342-2 2002-07-31
SE0202342A SE0202342D0 (en) 2002-07-31 2002-07-31 System and method for displaying digital images linked together to enable navigation through views

Publications (1)

Publication Number Publication Date
WO2004012144A1 true WO2004012144A1 (en) 2004-02-05

Family

ID=20288657

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SE2003/001237 WO2004012144A1 (en) 2002-07-31 2003-07-24 System and method for displaying digital images linked together to enable navigation through views

Country Status (3)

Country Link
AU (1) AU2003247316A1 (en)
SE (1) SE0202342D0 (en)
WO (1) WO2004012144A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5650814A (en) * 1993-10-20 1997-07-22 U.S. Philips Corporation Image processing system comprising fixed cameras and a system simulating a mobile camera
US6075905A (en) * 1996-07-17 2000-06-13 Sarnoff Corporation Method and apparatus for mosaic image construction
US6173087B1 (en) * 1996-11-13 2001-01-09 Sarnoff Corporation Multi-view image registration with application to mosaicing and lens distortion correction
US6304284B1 (en) * 1998-03-31 2001-10-16 Intel Corporation Method of and apparatus for creating panoramic or surround images using a motion sensor equipped camera

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7262736B2 (en) 2004-05-25 2007-08-28 Nec Corporation Mobile communication terminal
US8203597B2 (en) * 2007-10-26 2012-06-19 Hon Hai Precision Industry Co., Ltd. Panoramic camera
CN102103457A (en) * 2009-12-18 2011-06-22 深圳富泰宏精密工业有限公司 Briefing operating system and method
CN102103457B (en) * 2009-12-18 2013-11-20 深圳富泰宏精密工业有限公司 Briefing operating system and method

Also Published As

Publication number Publication date
AU2003247316A1 (en) 2004-02-16
SE0202342D0 (en) 2002-07-31


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP