CN101075351A - Method for restoring human-body video three-dimensional movement based on silhouette and endpoint node - Google Patents


Info

Publication number
CN101075351A
CN101075351A CNA2006100534057A CN200610053405A
Authority
CN
China
Prior art keywords
frame
silhouette
point
penalty
curvature
Prior art date
Legal status
Granted
Application number
CNA2006100534057A
Other languages
Chinese (zh)
Other versions
CN100541540C (en)
Inventor
庄越挺 (Yueting Zhuang)
肖俊 (Jun Xiao)
陈成 (Cheng Chen)
吴飞 (Fei Wu)
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CNB2006100534057A
Publication of CN101075351A
Application granted
Publication of CN100541540C
Legal status: Expired - Fee Related

Abstract

A method for recovering 3-D human motion from video based on silhouettes and endpoint nodes: extract a silhouette from each video frame, detect the endpoint nodes on the silhouette, use a simulated annealing algorithm to find the optimal pose that best explains the positions of the silhouette and endpoint nodes in the image, and, after post-processing, connect the per-frame poses into a continuous 3-D motion sequence.

Description

Video human three-dimensional motion restoration method based on silhouette and endpoint node
Technical field
The present invention relates to the field of multimedia three-dimensional human animation, and in particular to a video human three-dimensional motion restoration method based on silhouettes and endpoint nodes.
Background technology
Video-based 3-D reconstruction of human motion has a wide range of applications. In many settings it can play the role of optical 3-D human motion capture, and it also remains usable in many situations where optical capture systems cannot be deployed.
Currently, video/image-sequence-based methods for 3-D reconstruction of human pose and motion fall into two general directions: the learning-based direction and the model-based direction.
The first is the learning-based direction. Methods in this direction take as training input a series of pose vectors together with their corresponding 2-D images (in practice, usually features extracted from the 2-D images), and learn a mapping from 2-D images to pose vectors. Afterwards, a new 2-D image outside the training set is presented to the system, which uses the learned knowledge to generate a pose vector corresponding to that image. In this way the system can generate a pose vector for every video frame and then connect the pose vectors into a 3-D motion sequence. The article "Recovering 3D Human Pose from Monocular Images," published in IEEE Transactions on Pattern Analysis and Machine Intelligence in 2006, belongs to this category: shape contexts are extracted from training silhouette samples, a relevance vector machine is trained to fit the complex function from silhouette shape contexts to pose vectors, and that function is then used to predict the pose vector for a new silhouette sample, thereby recovering the 3-D pose. A common difficulty of learning-based methods is that, because the function from 2-D images to pose vectors is very complex, they generally assume the pose vector belongs to a particular category; that is, they are typically effective only for one specific kind of motion such as walking, running or playing ball. They also generally require a prior database for that specific motion, and often a training process with a huge computational cost.
The other direction is the model-based direction. Methods in this direction maintain an internal 3-D human model, and the system searches for the pose vector that best explains the given 2-D image. In many cases such methods can be cast as an optimization problem. The article "Markerless Monocular Motion Capture Using Image Features and Physical Constraints," published at Computer Graphics International in 2005, belongs to this category: a preset 3-D human model is rendered in various poses, the rendering results are compared with information extracted from the video frames, and an appropriate fitness function describes the degree of match between the two, so the task again reduces to an optimization problem. Model-based methods often face two difficulties. First, most of them project a complete human patch model onto the 2-D plane and compare it with the 2-D image, which makes the objective function computationally expensive and the 3-D reconstruction very inefficient. Second, because the function from pose vectors to 2-D images is very complex, the optimization process easily gets trapped in local minima.
In addition, most 3-D human motion recovery methods have difficulty with error accumulation and propagation: once the reconstruction of one frame goes wrong, the reconstructions of all subsequent frames go wrong as well, and it is hard to recover from the error automatically.
Summary of the invention
The object of the present invention is to provide a video human three-dimensional motion restoration method based on silhouettes and endpoint nodes.
The method comprises the following steps:
(1) Based on a background model and color information, extract a silhouette for every video frame using a color model that separates chromaticity from brightness in RGB space;
(2) On each frame's silhouette, detect and label the endpoint nodes, i.e. the positions of the head, both hands and both feet, according to curvature information;
(3) Based on the color information at the detected hand positions, build a dynamic model of the hand skin color, then use the dynamically constructed color classifier to additionally detect hands missed by the curvature detector of step (2);
(4) For every frame, use its silhouette and the endpoint nodes detected on it to construct an objective function whose argument is the pose vector and whose value is the degree of match; the objective function measures the match between the pose vector and the silhouette and endpoint nodes through three penalties: the core-area penalty, the endpoint-node penalty and the smoothness penalty;
(5) According to the number of detected endpoint nodes, segment the video into reliable regions and unreliable regions; then, treating the different regions with different methods, apply simulated annealing to carry out 3-D recovery and obtain the 3-D pose vector of every frame.
Extracting a silhouette for every video frame based on the background model and color information, using a color model that separates chromaticity from brightness in RGB space: the basic idea is to compare, pixel by pixel, the color difference between the current frame and the background model, and thereby decide whether each pixel belongs to the foreground or the background. In the comparison, a color model that separates chromaticity from brightness in RGB space is used: each color is a vector from the origin in RGB space; its chromaticity is characterized by the direction of this vector, and its brightness by the length of this vector. Thus, when comparing two colors, we do not directly compare their distance in the three-dimensional RGB space, but compare their chromaticity and brightness separately. The chromaticity and brightness differences are computed as:
D1 = cos^-1( (C1 · C2) / (||C1|| · ||C2||) )    (1)

D2 = | ||C1|| - ||C2|| | / max(||C1||, ||C2||)    (2)
where C1 and C2 are the two colors to be compared, as vectors from the origin in RGB space; D1 is the chromaticity difference of the two colors and D2 is their brightness difference. When deciding whether a pixel is foreground or background: if the D1 between the pixel's color and the corresponding background-model pixel's color exceeds a certain threshold, the pixel is considered a foreground pixel; otherwise it is considered background, and if in that case D2 also exceeds a certain threshold, the pixel is a shadow or highlight part of the background.
Detecting and labeling the endpoint nodes, i.e. the positions of the head, both hands and both feet, on each frame's silhouette according to curvature information: after the silhouette is extracted, curvature information is used to detect the positions of the endpoint nodes on it. First the curvature at every point of the silhouette contour is computed; it is determined by the angle formed by the point and the two points obtained by sliding n points along the contour to either side. The curvature is computed as:
c = ( (P1 - P) · (P2 - P) / (||P1 - P|| · ||P2 - P||) + 1 ) / 2    (3)
where P is the coordinate of the point whose curvature is being computed, and P1 and P2 are the coordinates of the two points obtained from P by sliding n points along the contour to either side. The curvature c computed here lies between 0 and 1 and is always positive; to reflect whether the silhouette is convex or concave at P, we additionally check whether the midpoint of the segment P1P2 lies on the silhouette foreground, and if not, c is negated to serve as the final curvature. After the curvature of every contour point has been computed, all points whose curvature exceeds a certain threshold are marked and then clustered; clusters containing too few points are discarded, and the centroids of the remaining clusters are the positions of the endpoint nodes. For labeling, the user labels the first frame, and subsequent frames are labeled automatically by a method combining temporal smoothness with Kalman filtering prediction.
Dynamically modeling the hand skin color from the color information at the detected hand positions, then using the dynamically constructed color classifier to additionally detect hands missed by the curvature detector of step (2): the curvature-based endpoint detection yields accurate but incomplete hand positions; from these a Gaussian model of the hand skin color is built dynamically, and this color model is then used for a second detection pass on the original frame image, finding the hands omitted by the curvature-based detection.
Constructing, for every frame, from its silhouette and the endpoint nodes detected on it, an objective function whose argument is the pose vector and whose value is the degree of match, the objective function imposing three penalties on the match between the pose vector and the silhouette and endpoint nodes: for every frame an objective function E(p) is constructed whose argument is the pose vector p and whose value is the degree of match E between the pose vector and the current frame's silhouette and endpoint positions. The function comprises three penalty terms, each examining one aspect of the match. The first is the core-area penalty E_core-area: the 2-D projection of the articulated skeleton model in the pose given by the pose vector should fall within the core area of the frame's silhouette, where the core area is emphasized by the Euclidean distance transform of the silhouette. The second is the coverage penalty E_coverage: the endpoint node positions of the skeleton model's 2-D projection under the pose vector should be close to the corresponding endpoint nodes detected on the silhouette. The third is the smoothness penalty E_smoothness: the current pose vector should stay smooth with respect to the previous frame's pose vector, avoiding sudden changes. The three penalty terms are computed respectively as:
E_core-area = - Σ_m s'_t( point(m) · T(t, p) ) / M    (4)
In computing the core-area penalty E_core-area, a series of points is uniformly sampled on the human skeleton in pose p. In the formula above, s'_t is the distance transform of the silhouette of the current frame t, point(m) is the m-th uniformly sampled point on the skeleton, p is the pose vector, T(t, p) is the transformation matrix determined by t and p, which transforms the coordinates of the sampled points from the three-dimensional local coordinate system to the silhouette plane, and M is the total number of sampled points.
E_coverage = Σ_i ||ps(i) - pc(p, i)|| / I, summed over every detected endpoint node    (5)
In the formula above, E_coverage is the coverage penalty, ps(i) is the position of the i-th endpoint node detected on the current frame's silhouette, pc(p, i) is the position of the i-th endpoint node after the human skeleton in pose p is projected onto the silhouette plane, and I is the number of endpoint nodes detected in the current frame.
E_smoothness = ||p(t-1) - 2p + p(t-2)||    (6)
In the formula above, E_smoothness is the smoothness penalty, and p(t-1) and p(t-2) are the pose vectors of the frame before the current frame and of the frame before that, respectively.
Adding the three penalties together yields the objective function:
E(p) = α·E_core-area + β·E_coverage + γ·E_smoothness    (7)
Segmenting the video into reliable regions and unreliable regions according to the number of detected endpoint nodes, then treating the different regions with different methods and applying simulated annealing to carry out 3-D recovery, obtaining the 3-D pose vector of every frame: every frame is classified by the number of endpoint nodes detected in it. A frame in which all 5 endpoint nodes are detected is a reliable frame; the rest are unreliable frames. Reliable and unreliable frames tend to cluster, so the whole video splits into interleaved reliable regions and unreliable regions. The pose vector of every frame is then reconstructed as follows: a frame is selected from each reliable region as a reference frame; for each reference frame, a complete simulated annealing optimizes the objective function E(p) to find the best-matching pose vector p; then, starting from each reference frame, the frames before and after it are optimized in turn with incomplete simulated annealing to find their pose vectors, the frames recovered from each reference frame being limited to the reliable region containing the reference frame and the two adjacent unreliable regions before and after it. For a frame in a reliable region, this reconstruction yields one pose vector as the frame's 3-D pose result; for a frame in an unreliable region, two pose vectors are obtained, and they are mixed into the frame's 3-D pose result with weights given by the reciprocals of the frame's distances to the preceding and following reliable-region boundaries. The per-frame 3-D pose results are connected and, after post-processing, form a coherent 3-D motion sequence.
The beneficial effects of the present invention are: silhouette extraction with the color model that separates chromaticity from brightness effectively avoids extraction errors caused by illumination changes and shadows, in which shadow and highlight parts of the background would be misunderstood and mistakenly treated as foreground. The secondary detection based on the dynamic color model gives hand detection very high precision and recall. Because the objective function E(p) is computationally cheap, the comparatively expensive simulated annealing becomes usable, which in turn solves the problem of getting trapped in local minima. The algorithm automatically restarts by selecting a reference frame in each reliable region, escaping the problem of error accumulation and propagation. Experiments show that, with a small amount of user interaction, the method can robustly and accurately recover long human motion videos, achieving satisfactory results.
Apart from anatomical and physical constraints, the method makes no a priori assumption about the type of human motion and needs no support from a prior database. The method belongs to the model-based direction described in the background, but by solving its two problems, the large computational cost of the objective function and the trap of local minima, it avoids the drawbacks of that class of methods and achieves outstanding results.
Description of drawings
Fig. 1 is the flow chart of the video human three-dimensional motion restoration method based on silhouettes and endpoint nodes;
Fig. 2 is a schematic diagram of the model of the present invention that separates chromaticity from brightness in RGB color space;
Fig. 3 is a schematic diagram of the articulated human skeleton model used by the present invention;
Fig. 4 is a schematic diagram of the video segmentation of the present invention;
Fig. 5 (a) is a schematic silhouette for the core-area penalty term of the objective function of the present invention;
Fig. 5 (b) is a schematic diagram of the core-area penalty term of the objective function of the present invention;
Fig. 6 is a schematic diagram of a fencing-match video recovery embodiment of the present invention.
Embodiment
Before the concrete implementation of the video human three-dimensional motion restoration method based on silhouettes and endpoint nodes proposed by the present invention, the skeleton model must be determined. We adopt the skeleton model shown in Fig. 3. It comprises 16 rotational joints; except for the root node (joint 0 in the figure), each joint can rotate with up to 3 degrees of freedom, while the root node has 3 rotational and 3 translational degrees of freedom. During pose recovery we do not consider the translational degrees of freedom of the root node, because they describe the displacement of the whole model in the world coordinate system rather than the pose of the model itself; handling of the root node's translation is postponed to post-processing. Thus each pose vector p is a 48-dimensional vector.
The concrete technical scheme and implementation steps of the present invention are introduced below:
1. Silhouette extraction
A silhouette is extracted for each video frame. The extraction scheme is based on pixel-by-pixel comparison. The method assumes that the camera is fixed and the background color model is known (generally obtainable by shooting a few frames of pure background). For a frame whose silhouette is to be extracted, each pixel's color is compared with the color of the corresponding pixel of the background model; if the two differ by more than a certain threshold, the pixel is considered foreground in this frame. Note that when comparing two colors we adopt the model that separates chromaticity from brightness in RGB space. Formally, for two colors C1 and C2, the chromaticity and brightness differences are computed as:
D1 = cos^-1( (C1 · C2) / (||C1|| · ||C2||) )

D2 = | ||C1|| - ||C2|| | / max(||C1||, ||C2||)
where C1 and C2 are the two colors to be compared, as vectors from the origin in RGB space; D1 is the chromaticity difference of the two colors and D2 is their brightness difference.
For instance, as shown in Fig. 2, to compare two colors Ei and Ii in RGB space, connect OEi and drop a perpendicular from Ii to OEi; then the ratio of the distance CDi from Ii to the foot of the perpendicular over the length OIi characterizes the chromaticity difference between Ei and Ii, and the position (as a ratio) of the foot of the perpendicular along OEi characterizes their brightness difference. In the comparison, if the chromaticities of the two colors differ greatly, the pixel is considered foreground in any case; if the chromaticities differ very little but the brightnesses differ, the pixel is still regarded as background, just taken to be the result of shadow or highlight. This model handles well the problem of illumination changes and shadows causing background to be misclassified as foreground.
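As a sketch, the pixel decision rule described above (chromaticity difference D1 decides foreground; small D1 with large brightness difference D2 means shadow or highlight, i.e. background) can be written as follows. The threshold values `t1` and `t2` are illustrative placeholders, not values from the patent:

```python
import numpy as np

def color_diff(c1, c2):
    """Chromaticity difference D1 (angle between the RGB vectors) and
    brightness difference D2 (relative difference of vector lengths)."""
    c1, c2 = np.asarray(c1, float), np.asarray(c2, float)
    n1, n2 = np.linalg.norm(c1), np.linalg.norm(c2)
    d1 = np.arccos(np.clip(np.dot(c1, c2) / (n1 * n2), -1.0, 1.0))
    d2 = abs(n1 - n2) / max(n1, n2)
    return d1, d2

def classify_pixel(pixel, background, t1=0.1, t2=0.3):
    """Foreground if chromaticity differs enough; same chromaticity but
    different brightness is treated as background shadow/highlight."""
    d1, d2 = color_diff(pixel, background)
    if d1 > t1:
        return "foreground"
    if d2 > t2:
        return "shadow_or_highlight"
    return "background"
```

A per-frame extractor would simply apply `classify_pixel` to every pixel against the background model.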
2. Curvature-based endpoint node detection and labeling
On the contour of each frame's silhouette, the curvature of every point is computed. The curvature is approximated by the angle formed by the point and the two points obtained by sliding n points along the contour to either side, computed as:
c = ( (P1 - P) · (P2 - P) / (||P1 - P|| · ||P2 - P||) + 1 ) / 2
where P is the coordinate of the point whose curvature is being computed, and P1 and P2 are the coordinates of the two points obtained from P by sliding n points along the contour to either side. The curvature c computed here lies between 0 and 1 and is positive. Note that the c given by the formula is actually the absolute value of the true curvature: whether the silhouette is convex or concave at the point, the computed c is positive, whereas in fact the curvature should be set negative where the silhouette is concave. The criterion is to connect P1 and P2 and take the midpoint of the segment: if the midpoint lies on the silhouette foreground, the silhouette is convex here and the curvature stays positive; otherwise a negative sign is added to the computed c, making it negative.
We then set a threshold, mark all contour points whose curvature is positive and above this threshold, and run a simple unsupervised clustering on the marked points. Since the marked points are in general clearly gathered into several groups, a very simple distance-based clustering method suffices. After clustering, clusters containing too few points are discarded, and the centroid of each remaining cluster is taken as the position of an endpoint node.
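The curvature formula above, applied around a closed contour, can be sketched as follows. The concavity sign correction (checking the midpoint of P1P2 against the silhouette foreground) is noted in the comments but omitted, since it needs the silhouette mask:

```python
import numpy as np

def contour_curvature(contour, n=5):
    """Approximate curvature at each point of a closed contour as the
    normalized angle with the points n steps away on either side:
    c = ((P1-P)·(P2-P) / (||P1-P|| ||P2-P||) + 1) / 2, in [0, 1].
    The sign flip for concave points (midpoint of P1P2 outside the
    silhouette foreground) is omitted in this sketch."""
    contour = np.asarray(contour, float)
    N = len(contour)
    curv = np.empty(N)
    for i in range(N):
        p = contour[i]
        p1 = contour[(i - n) % N]   # contour treated as a closed loop
        p2 = contour[(i + n) % N]
        v1, v2 = p1 - p, p2 - p
        cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        curv[i] = (cosang + 1.0) / 2.0
    return curv
```

On a square contour, corner points score 0.5 (right angle) and straight-edge points score 0; a sharp fingertip-like spike would score close to 1.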
So far just find out the position of endpoint node, also they have not been marked, that is, distinguished hand and pin, distinguished a left side and right.We adopt the endpoint node mask method based on sequential flatness and Kalman filtering prediction.Promptly, marking each endpoint node that detects of first frame by hand by the user is any endpoint node, because the continuity and the flatness of human motion, system is automatically according to time sequence information, utilize the position of each endpoint node in the Kalman filtering prediction next frame, confirm with the actual endpoint node position that detects of next frame then, thus the endpoint node that detects on other the frame of progressive mark.Meet the situation of ambiguity, then hand over the user to make decision.
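A minimal sketch of carrying labels forward from frame to frame. For brevity it replaces the full Kalman filter described above with a constant-velocity prediction plus greedy nearest-neighbour matching, so it illustrates the idea rather than the patent's exact procedure:

```python
import numpy as np

def propagate_labels(prev_pos, prev_vel, detections):
    """Carry endpoint labels (head, hands, feet) forward one frame:
    predict each labelled endpoint with a constant-velocity model (a
    simplification of the Kalman prediction), then assign each label to
    the nearest unclaimed detection, smallest distances first."""
    labels = list(prev_pos)
    preds = {k: np.asarray(prev_pos[k], float) + np.asarray(prev_vel[k], float)
             for k in labels}
    dets = [np.asarray(d, float) for d in detections]
    pairs = sorted(((np.linalg.norm(preds[k] - d), k, j)
                    for k in labels for j, d in enumerate(dets)),
                   key=lambda t: t[0])
    assignment, taken = {}, set()
    for _, k, j in pairs:
        if k not in assignment and j not in taken:
            assignment[k] = j
            taken.add(j)
    return assignment   # label -> index into detections
```

An ambiguous frame (two predictions nearly equidistant from one detection) would be the case handed back to the user.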
3. Hand detection based on a dynamic color model
The curvature-based endpoint detection of the previous step works very well for the head and the feet, but for the hands its precision is very high while its recall is mediocre: sometimes the hands lie between the main body and the camera, so in the silhouette they are contained within the body region and cannot be distinguished. We solve this with hand detection based on a dynamic color model. Since the accuracy of the curvature-based hand detection is very high, we collect the colors at the hand locations it returns, train a Gaussian model of the hand skin color at run time, and then use this color model to search the original frame image for the missed hands. Note that this dynamic-color-model hand detection runs on the original frame image, but the search range is restricted to the foreground corresponding to the silhouette, which both reduces the search space and reduces potential robustness problems brought by background colors.
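The run-time Gaussian skin model can be sketched as follows; the Mahalanobis threshold is an illustrative choice, not a value from the patent:

```python
import numpy as np

class SkinColorModel:
    """Per-sequence Gaussian model of hand skin colour, fitted at run
    time from pixels at the reliably curvature-detected hand positions,
    then used to rescan the silhouette foreground for missed hands."""

    def fit(self, samples):
        x = np.asarray(samples, float)              # N x 3 RGB samples
        self.mean = x.mean(axis=0)
        # small ridge keeps the covariance invertible
        self.cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(3)
        self.icov = np.linalg.inv(self.cov)
        return self

    def is_skin(self, pixel, max_mahalanobis=3.0):
        d = np.asarray(pixel, float) - self.mean
        return float(d @ self.icov @ d) <= max_mahalanobis ** 2
```

In use, `fit` would receive the hand-region pixels of the frames where curvature detection succeeded, and `is_skin` would be evaluated only over foreground pixels of the silhouette.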
4. Objective function design
The objective function takes the pose vector p as its argument; its value is the degree of match E between p and the silhouette, with smaller E being better. By minimizing the objective function, the optimal pose vector can be found. The objective function measures the match through 3 penalty terms, which are respectively:
Core-area penalty E_core-area: after the skeleton model in pose p is projected onto the 2-D plane, translated and scaled, the limbs should lie within the core area of the silhouette. The core area is simply the region near the axes of the limbs; see Fig. 5. To measure this, we first apply the Euclidean distance transform to the silhouette; the limbs of the skeleton should lie as much as possible in the parts with larger values after the distance transform. To quantify this, we uniformly sample some points on the skeleton model and examine the average of the distance-transform values at their projected positions in the silhouette. This penalty can be expressed mathematically as:
E_core-area = - Σ_m s'_t( point(m) · T(t, p) ) / M
where s'_t is the distance transform of the silhouette of the current frame t, point(m) is the m-th uniformly sampled point on the skeleton, T(t, p) is the transformation matrix determined by t and p, which transforms the coordinates of the sampled points from the 3-D local coordinate system to the silhouette plane, and M is the total number of sampled points.
Endpoint node position penalty E_coverage: after the skeleton model in pose p is projected onto the 2-D plane, translated and scaled, the positions of the head, hand and foot endpoint nodes on the skeleton should be as close as possible to the endpoint node positions detected on the silhouette, i.e.
E_coverage = Σ_i ||ps(i) - pc(p, i)|| / I, summed over every detected endpoint node, where ps(i) is the position of the i-th endpoint node detected on the current frame's silhouette, pc(p, i) is the position of the i-th endpoint node after the skeleton in pose p is projected onto the silhouette plane, and I is the number of endpoint nodes detected in the current frame.
Smoothness penalty E_smoothness: the pose p of the current frame should keep the motion smooth. We quantify smoothness using the two previously recovered frames, i.e.
E_smoothness = ||p(t-1) - 2p + p(t-2)||
where p(t-1) and p(t-2) are the pose vectors of the frame before the current frame and of the frame before that, respectively.
Adding the three penalties together yields the objective function:
E(p) = α·E_core-area + β·E_coverage + γ·E_smoothness
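Assembling the three penalties into E(p) can be sketched as follows, under the assumption that the per-frame data (distance transform, projection functions, detected endpoints, previous poses) is supplied by the caller. The `project` callables stand in for the articulated skeleton projection, and the weights are illustrative:

```python
import numpy as np

def objective(p, frame, weights=(1.0, 1.0, 0.1)):
    """E(p) = a*E_core_area + b*E_coverage + c*E_smoothness.
    `frame` is a dict bundling the per-frame data; the skeleton
    projections are stubbed callables supplied by the caller."""
    a, b, c = weights
    # core-area: negated mean distance-transform value at the sampled
    # skeleton points projected onto the silhouette plane (formula 4)
    dt = frame["dist_transform"]
    pts = frame["project"](p)                    # list of (x, y) samples
    e_core = -np.mean([dt[int(y)][int(x)] for x, y in pts])
    # coverage: mean distance between projected and detected endpoints (5)
    proj_ends = frame["project_endpoints"](p)
    dets = frame["detected_endpoints"]
    e_cov = np.mean([np.linalg.norm(np.subtract(proj_ends[i], dets[i]))
                     for i in range(len(dets))])
    # smoothness: penalty against the two previously recovered frames (6)
    e_smooth = np.linalg.norm(np.asarray(frame["p_prev1"])
                              - 2.0 * np.asarray(p)
                              + np.asarray(frame["p_prev2"]))
    return a * e_core + b * e_cov + c * e_smooth
```

Because every term is a small sum over sampled points rather than a full rendered-model comparison, one evaluation stays cheap, which is what makes the annealing affordable.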
5. Video segmentation and segment-wise 3-D recovery
Frames are classified by the number of detected endpoint nodes: a frame in which all 5 endpoint nodes are detected is called a reliable frame; the others are unreliable frames. Consecutive reliable frames form a reliable region and consecutive unreliable frames form an unreliable region. Because of the continuity of motion, reliable and unreliable frames cluster together, so the video is divided into a series of interleaved reliable and unreliable regions, as shown in Fig. 4 (the lower part of the figure marks with numbers the frame indices of the region boundaries of a real segmented video). Since unreliable frames lack information, temporal information carries a larger weight in their reconstruction; and because of this use of temporal information, frames in an unreliable region close to a region boundary, being nearer to reliable frames, are more likely to be reconstructed correctly than frames far from the boundary. We therefore set a threshold l: if the length of an unreliable region exceeds l, it becomes a dangerous region, meaning that the reconstruction reliability of its central frames is rather low. In that case the user must manually specify the positions of all joints in the frame at the midpoint of the dangerous region; this manually specified frame is then promoted to a reliable frame, the original dangerous region is split, and the video is re-segmented, until no dangerous region remains.
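The reliable/unreliable segmentation, together with the dangerous-region flag, can be sketched as follows; `danger_len` corresponds to the threshold l above:

```python
def segment_video(endpoint_counts, danger_len=None):
    """Split frames into alternating reliable / unreliable runs.
    A frame is reliable iff all 5 endpoints (head, hands, feet) were
    detected. Returns (is_reliable, start, end_exclusive) runs; with
    danger_len set, a fourth field flags unreliable runs longer than
    the threshold (the 'dangerous regions' needing a manual mid-frame)."""
    runs, start = [], 0
    for i in range(1, len(endpoint_counts) + 1):
        if i == len(endpoint_counts) or \
           (endpoint_counts[i] == 5) != (endpoint_counts[start] == 5):
            runs.append((endpoint_counts[start] == 5, start, i))
            start = i
    if danger_len is None:
        return runs
    return [(rel, s, e, (not rel) and (e - s > danger_len))
            for rel, s, e in runs]
```

After the user labels the midpoint frame of a dangerous region, that frame's count becomes 5 and rerunning `segment_video` splits the region, as described above.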
Segment-wise recovery steps: first, the system chooses one frame in each reliable region as a reference frame. For each reference frame, complete simulated annealing is used, without any temporal information (i.e. the third, smoothness penalty of the objective function is set to 0), to find the best-matching pose vector. Then the 3-D reconstruction proceeds progressively around each reference frame: starting from each reference frame, the frames in its reliable region and in its two adjacent unreliable regions are reconstructed. In this way every frame in a reliable region has one reconstruction result, while every frame in an unreliable region has two results p1 and p2; p1 and p2 are mixed, with the reciprocals of the frame's distances to the two neighbouring reliable regions as weights, to obtain the final pose vector p.
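A minimal simulated annealing loop of the kind used on the reference frames is sketched below. All schedule parameters are illustrative, and the distinction between "complete" and "incomplete" annealing (schedule length and warm start) is not modelled here:

```python
import numpy as np

def simulated_annealing(f, p0, sigma=0.5, t0=1.0, cooling=0.95,
                        iters=2000, seed=0):
    """Minimise f over a pose vector: Gaussian perturbations,
    Metropolis acceptance, geometric cooling. Parameters are
    illustrative, not the patent's settings."""
    rng = np.random.default_rng(seed)
    p, e = np.asarray(p0, float), f(p0)
    best_p, best_e = p.copy(), e
    t = t0
    for _ in range(iters):
        q = p + rng.normal(0.0, sigma, size=p.shape)
        eq = f(q)
        # accept improvements always; worse moves with Metropolis prob.
        if eq < e or rng.random() < np.exp((e - eq) / max(t, 1e-12)):
            p, e = q, eq
            if e < best_e:
                best_p, best_e = p.copy(), e
        t *= cooling
    return best_p, best_e
```

For non-reference frames the loop would start from the neighbouring frame's pose with a shorter, cooler schedule, which is one plausible reading of "incomplete" annealing.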
6. post-processed
So far, each frame of video all has the orientation vector of a reconstruction.Post-processed connects these orientation vectors.Make two post-processed then.The one, level and smooth, the orientation vector sequence is made Gauss's low-pass filtering reach smooth effect.The 2nd, the position of root node in definite every frame.Owing to all do not consider the displacement of root node in aforementioned each step,, make the pin that lands fix on the ground non-skidly so will determine the root node displacement in this step.If any moment can not be lifted and do not had slip in the known motion, can utilize the height of the both feet that recover to come out to determine in every frame that every pin lands or liftoff so, thereby definite root node displacement can automatically be carried out.If certain moment both feet all can be liftoff in the motion, perhaps there be slide (such as jump, skating), the displacement of root node is definite by hand by the user so.
After post-processing, a continuous, smooth 3-D motion sequence consistent with the video is generated.
Embodiment
A segment of fencing video from the 2004 Athens Olympic Games is reconstructed in 3-D; Figure 6 shows one frame of the video together with the pose of the recovered 3-D motion at that frame. The steps of this example are described below with reference to the concrete technical scheme above, as follows:
(1) Using the previously described color model that separates chromaticity from brightness, extract the silhouette of each frame of the video. Specifically, assume the camera is fixed and the background color model is known (it can generally be obtained by capturing a few frames of pure background). For a frame whose silhouette is to be extracted, compare the color of each pixel with the color of the corresponding pixel in the background model; if the two differ by more than a threshold, the pixel is considered foreground. Note that when comparing two colors we use a model that separates chromaticity from brightness in RGB space. Formally, for two colors C1 and C2, the chromaticity and brightness differences are computed as:
D1 = cos⁻¹( |C1 · C2| / (‖C1‖ · ‖C2‖) )

D2 = | ‖C1‖ − ‖C2‖ | / max(‖C1‖, ‖C2‖)
where C1 and C2 are the two colors to compare, as vectors from the origin in RGB space; D1 is their chromaticity difference and D2 their brightness difference. During comparison, if D1 is large, the pixel is considered foreground in any case; if D1 is small but D2 differs noticeably, the pixel is still treated as background, on the assumption that the difference is merely the result of shadow or highlight.
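The two difference measures and the foreground decision rule can be sketched as follows. The threshold values are illustrative assumptions (the patent leaves them unspecified), and the cosine argument is clamped to guard against floating-point values slightly outside [−1, 1].

```python
import math

def color_difference(c1, c2):
    """Chromaticity (D1) and brightness (D2) difference of two RGB
    colors treated as vectors from the origin, per the equations
    above.  D1 is the angle between the vectors; D2 is the relative
    difference of their lengths."""
    dot = sum(a * b for a, b in zip(c1, c2))
    n1 = math.sqrt(sum(a * a for a in c1))
    n2 = math.sqrt(sum(b * b for b in c2))
    d1 = math.acos(min(1.0, max(-1.0, abs(dot) / (n1 * n2))))
    d2 = abs(n1 - n2) / max(n1, n2)
    return d1, d2

def is_foreground(pixel, background, t1=0.1):
    """Foreground only if the chromaticity departs from the background;
    a pure brightness change (large D2, tiny D1) is classified as
    shadow/highlight and therefore stays background.  t1 is an
    illustrative threshold."""
    d1, _ = color_difference(pixel, background)
    return d1 > t1
```

A pixel that is a darker copy of the background (same direction in RGB space, shorter vector) thus stays background, which is exactly the shadow case the model is designed to reject.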
(2) On the contour of each frame's silhouette, compute the curvature of every point. The curvature is approximated by the angle formed at the point by the two points obtained by sliding n points along the silhouette edge in each direction, using the formula:
c = ( (P1 − P) · (P2 − P) / (‖P1 − P‖ · ‖P2 − P‖) + 1 ) / 2
where P is the coordinate of the point whose curvature is being computed, and P1 and P2 are the points obtained by sliding n points from P in each direction along the silhouette edge (in this example the curvature parameter n is 20). The curvature c computed this way lies between 0 and 1 and is always positive. Note that c is really the absolute value of the true curvature: whether the silhouette is convex or concave at P, the computed c is positive, whereas the curvature should in fact be negative where the silhouette is concave. The sign is decided as follows: connect P1 and P2 and take the midpoint of the segment; if the midpoint lies on the silhouette foreground, the silhouette is convex there and the curvature is positive; otherwise a negative sign must be applied to the computed c, making it negative.
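The signed curvature computation can be sketched directly from the formula and the midpoint test above. The function signature is illustrative: the contour is assumed closed (indices wrap), and `inside` stands in for the silhouette-foreground lookup.

```python
import math

def signed_curvature(contour, i, n, inside):
    """Approximate signed curvature at contour[i] from the angle it
    forms with the points n steps away along the closed contour.
    `inside(point)` reports whether a 2-D point lies on the silhouette
    foreground; concave points get a negative sign, per the midpoint
    test described above."""
    p = contour[i]
    p1 = contour[(i - n) % len(contour)]
    p2 = contour[(i + n) % len(contour)]
    v1 = (p1[0] - p[0], p1[1] - p[1])
    v2 = (p2[0] - p[0], p2[1] - p[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    c = (dot / norm + 1.0) / 2.0          # unsigned curvature in [0, 1]
    mid = ((p1[0] + p2[0]) / 2.0, (p1[1] + p2[1]) / 2.0)
    return c if inside(mid) else -c       # convex -> +, concave -> -
```

A straight stretch of contour gives c = 0 (the two arms are opposite, cosine −1), while a sharp fingertip-like protrusion drives c toward 1, which is why thresholding c isolates endpoint candidates.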
We then set a threshold (this example uses a curvature threshold of 0.8), mark every contour point whose curvature is positive and exceeds the threshold, and perform a simple unsupervised clustering of the marked points. Because the marked points are usually clearly grouped into a few classes, a very simple distance-based clustering method suffices. After clustering, clusters containing too few points are discarded, and the centroid of each remaining cluster is taken as the position of an endpoint node.
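A "very simple distance-based clustering" of the marked points might look like the greedy scheme below. This is one plausible reading, not the patent's method; the linkage rule, `max_dist`, and `min_size` are assumptions.

```python
def cluster_points(points, max_dist, min_size):
    """Greedy single-linkage clustering of the marked high-curvature
    points.  A point joins the first existing cluster containing a
    point within max_dist, else starts a new cluster.  Clusters with
    fewer than min_size points are discarded; each survivor is reduced
    to its centroid, an endpoint-node candidate."""
    clusters = []
    for p in points:
        for c in clusters:
            if any((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= max_dist ** 2
                   for q in c):
                c.append(p)
                break
        else:
            clusters.append([p])
    centroids = []
    for c in clusters:
        if len(c) >= min_size:
            centroids.append((sum(q[0] for q in c) / len(c),
                              sum(q[1] for q in c) / len(c)))
    return centroids
```

Discarding small clusters filters out isolated noisy contour points, so only compact groups of high-curvature points, i.e. actual extremities, survive as endpoint candidates.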
This locates the positions of the endpoint nodes but does not yet label them, i.e. distinguish hands from feet and left from right. We adopt an endpoint-node labeling method based on temporal smoothness and Kalman-filter prediction. That is, the user labels each detected endpoint node of the first frame by hand; then, exploiting the continuity and smoothness of human motion, the system automatically uses temporal information and Kalman filtering to predict the position of each endpoint node in the next frame, confirms the prediction against the positions actually detected in that frame, and thereby progressively labels the endpoint nodes detected in the remaining frames. Ambiguous cases are handed to the user to decide.
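The confirmation step, matching unlabeled detections against predicted positions, can be sketched as a nearest-prediction assignment. This is a simplified stand-in: the predictions would come from a Kalman filter in the actual method, and the ambiguity test and its ratio are illustrative assumptions.

```python
def label_endpoints(predictions, detections, ambiguity_ratio=0.5):
    """predictions: {label: predicted (x, y)} for the next frame, e.g.
    from per-endpoint Kalman filters.  detections: unlabeled (x, y)
    positions actually detected in that frame.  Each detection takes
    the label of the nearest prediction; if the runner-up is nearly as
    close the case is ambiguous and returned as None (handed to the
    user, as described above)."""
    result = []
    for d in detections:
        dists = sorted(
            (((d[0] - p[0]) ** 2 + (d[1] - p[1]) ** 2) ** 0.5, name)
            for name, p in predictions.items()
        )
        if len(dists) > 1 and dists[0][0] > ambiguity_ratio * dists[1][0]:
            result.append(None)   # ambiguous: ask the user
        else:
            result.append(dists[0][1])
    return result
```

The design mirrors the text: labeling is automatic while motion continuity keeps the prediction close to exactly one detection, and degrades to a user decision when two candidates are comparably plausible.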
(3) The curvature-based endpoint detection of the previous step works very well for the head and feet, but for the hands its precision is high while its recall is only moderate. The reason is that the hands sometimes lie between the camera and the bulk of the body, so in the silhouette they are absorbed by the body region and cannot be distinguished. We solve this with a hand detector based on a dynamic color model. Since the curvature-based hand detection is highly accurate when it does fire, we collect the colors at the detected hand positions according to the curvature results, train a Gaussian skin-color model of the hands at run time, and then use this color model to search the original frame image for the hands that were missed. Note that this dynamic-color-model hand detection runs on the original frame image, but the search range is restricted to the foreground corresponding to the silhouette, which both shrinks the search space and reduces the potential robustness problems introduced by background colors.
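The run-time Gaussian skin-color model can be sketched as a per-channel Gaussian fitted to the colors sampled at confidently detected hand positions. The independent-channel simplification and the variance floor are assumptions; the patent only specifies "a Gaussian model of hand skin color trained at run time".

```python
import math

def fit_gaussian(samples):
    """Fit an independent per-channel Gaussian to RGB skin samples
    collected where curvature-based detection did find the hands."""
    n = len(samples)
    mean = [sum(s[c] for s in samples) / n for c in range(3)]
    var = [max(1e-6, sum((s[c] - mean[c]) ** 2 for s in samples) / n)
           for c in range(3)]
    return mean, var

def skin_likelihood(pixel, model):
    """Unnormalised Gaussian likelihood that a pixel is hand skin;
    thresholding this inside the silhouette foreground recovers the
    hands missed by the curvature detector."""
    mean, var = model
    return math.exp(-sum((pixel[c] - mean[c]) ** 2 / (2.0 * var[c])
                         for c in range(3)))
```

Because the model is refit from the current video's own hand pixels, it automatically adapts to the athlete's skin tone and the scene's lighting, which is the point of making the color model dynamic.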
(4) The objective function takes the pose vector p as its argument and returns the degree of match E between p and the silhouette; the smaller E, the better. Minimizing the objective function yields the optimal pose vector. The objective function measures the match with three penalty terms, which are:
Core-area penalty E_core-area: after the skeleton model in pose p is projected onto the 2-D plane, translated and scaled, the limbs should lie in the core area of the silhouette. The core area is the region near the axes of the limbs; see Fig. 5. To measure this, we first apply the Euclidean distance transform to the silhouette; the limbs of the skeleton should lie, as far as possible, where the transformed values are large. To quantify this, we uniformly sample a set of points on the skeleton model and average the distance-transform values at their positions inside the silhouette after the projective transformation. Mathematically, this penalty is expressed as:
E_core-area = − Σ_m s′(t)( point(m) · T(t, p) ) / M
where s′(t) is the distance transform of the silhouette of the current frame t, point(m) is the m-th point uniformly sampled on the skeleton, T(t, p) is the transformation matrix, depending on t and p, that maps the coordinates of a sampled point from the 3-D local coordinate system to the silhouette plane, and M is the total number of sampled points.
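Evaluating E_core-area reduces to sampling the distance-transform grid at the projected skeleton points and negating the mean. In the sketch below the projection T(t, p) is assumed to have been applied already, and points projecting outside the image contribute the worst possible value, 0; both are illustrative simplifications.

```python
def core_area_penalty(dist_transform, projected_points):
    """E_core-area: negative mean of the silhouette distance transform
    sampled at the 2-D projections of the points sampled uniformly on
    the skeleton.  dist_transform is a row-major 2-D grid; large
    values mark the core area (limb axes), so poses whose limbs lie
    there receive a lower (better) penalty."""
    h, w = len(dist_transform), len(dist_transform[0])
    total = 0.0
    for x, y in projected_points:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < h and 0 <= xi < w:
            total += dist_transform[yi][xi]
    return -total / len(projected_points)
```

The minus sign converts "deep inside the silhouette" into a low penalty, so the annealing optimizer, which minimizes, is drawn toward poses whose limbs track the silhouette's medial axes.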
Endpoint-node position penalty E_coverage: after the skeleton model in pose p is projected onto the 2-D plane, translated and scaled, the positions of the head, hand and foot endpoint nodes of the skeleton should be as close as possible to the endpoint-node positions detected on the silhouette, i.e.
E_coverage = Σ ‖ps(i) − pc(p, i)‖ / I, summed over each detected endpoint joint i, where ps(i) is the position of the i-th endpoint joint detected on the silhouette of the current frame, pc(p, i) is the position of the i-th endpoint joint after the human skeleton in pose p is projected onto the silhouette plane, and I is the number of endpoint joints detected in the current frame.
Smoothness penalty E_smoothness: the pose p of the current frame should preserve the smoothness of the motion. We quantify smoothness using the two previously recovered frames:
E_smoothness = ‖p(t−1) − 2p + p(t−2)‖
where p(t−1) and p(t−2) are the pose vectors of the frames one and two positions before the current frame.
Adding the three penalties gives the objective function:
E(p) = αE_core-area + βE_coverage + γE_smoothness
In this embodiment the parameters of the objective function E(p) are α = 0.4, β = 0.4, γ = 0.2.
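Assembling the weighted objective is mechanical once the first two penalties are evaluated; a sketch with the embodiment's weights follows. Passing E_core-area and E_coverage in as precomputed numbers is a simplification for illustration.

```python
def objective(p, e_core, e_coverage, p_prev1, p_prev2,
              alpha=0.4, beta=0.4, gamma=0.2):
    """E(p) = alpha*E_core-area + beta*E_coverage + gamma*E_smoothness
    with the embodiment's weights (0.4, 0.4, 0.2).  e_core and
    e_coverage are the already-evaluated first two penalties; the
    smoothness term is the norm of the second difference
    p(t-1) - 2p + p(t-2) of the pose vectors, per the equation above."""
    e_smooth = sum((a - 2.0 * b + c) ** 2
                   for a, b, c in zip(p_prev1, p, p_prev2)) ** 0.5
    return alpha * e_core + beta * e_coverage + gamma * e_smooth
```

Setting gamma = 0 recovers the temporal-information-free objective used for the reference frames in the segmented recovery step.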
(5) Segment the video as described in the technical scheme and perform segmented 3-D recovery. Specifically: frames are classified by the number of detected endpoint nodes (a frame with all 5 endpoint nodes detected is a reliable frame, the rest are unreliable frames); consecutive reliable and unreliable frames form reliable and unreliable regions, dividing the video into a series of alternating regions as shown in Figure 4; an unreliable region longer than the threshold l becomes a dangerous region, whose midpoint frame the user annotates by hand, after which that frame is promoted to a reliable frame and the video is re-segmented, until no dangerous region remains.
Segmented recovery then proceeds as described above: one reference frame is chosen in each reliable region and optimized by full simulated annealing with the smoothness penalty set to 0; reconstruction then proceeds progressively outward from each reference frame over its reliable region and the two adjacent unreliable regions. Frames in reliable regions thus receive one reconstruction, while frames in unreliable regions receive two reconstructions p1 and p2, which are blended, using the inverses of the frame's distances to the two neighboring reliable regions as weights, to obtain the final pose vector p.
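The optimizer itself, simulated annealing over the pose vector, can be sketched generically. The proposal distribution, cooling schedule, and iteration count below are illustrative choices; the patent specifies only that full annealing is used for reference frames and an "incomplete" (cheaper) variant for the frames propagated outward from them.

```python
import math
import random

def simulated_annealing(energy, p0, step=0.1, t0=1.0, cooling=0.95,
                        iters=2000, seed=0):
    """Minimise an objective E(p) over a pose vector by simulated
    annealing.  A Gaussian perturbation of the current pose is always
    accepted when it lowers the energy, and accepted with probability
    exp((e - e_new) / t) otherwise; the temperature t decays
    geometrically.  Returns the best pose and energy seen."""
    rng = random.Random(seed)
    p, e = list(p0), energy(p0)
    best_p, best_e = p, e
    t = t0
    for _ in range(iters):
        q = [x + rng.gauss(0.0, step) for x in p]
        eq = energy(q)
        if eq < e or rng.random() < math.exp((e - eq) / t):
            p, e = q, eq
            if e < best_e:
                best_p, best_e = p, e
        t *= cooling
    return best_p, best_e
```

The "incomplete" variant for non-reference frames could simply run fewer iterations from a warm start (the neighboring frame's pose), which is why temporal propagation makes those optimizations much cheaper.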
(6) In the post-processing stage, because both feet may leave the ground during fencing, the user manually sets the position of the root node in the world coordinate system over the course of the motion.
Through the above steps, this embodiment obtains a smooth, continuous 3-D fencing sequence. Figure 6 shows the result for one frame: the top of the figure is the original video frame, and below it are the 3-D poses of the two athletes recovered by this method at that frame.

Claims (6)

1. A video-based human 3-D motion recovery method based on silhouettes and endpoint nodes, characterized in that it comprises the following steps:
(1) based on a background model and color information, extract the silhouette of every frame of the video using a color model that separates chromaticity from brightness in the RGB color space;
(2) detect and label the endpoint joints, including the positions of the head, both hands and both feet, on the silhouette of every frame according to curvature information;
(3) according to the color information at the detected hand positions, dynamically model the hand skin color, then use the dynamically constructed color discriminator to additionally detect the hands missed by the curvature detector of step (2);
(4) for every frame, using the silhouette and the endpoint-node positions detected on it, construct an objective function whose argument is the pose vector and whose value is the degree of match; the objective function applies three penalties to the match between the pose vector and the silhouette and endpoint nodes: a core-area penalty, an endpoint-joint penalty and a smoothness penalty;
(5) segment the video into reliable and unreliable regions according to the number of detected endpoint nodes, then apply different strategies to the different regions, using simulated annealing to perform 3-D recovery and obtain the 3-D pose vector of every frame.
2. The video-based human 3-D motion recovery method based on silhouettes and endpoint nodes according to claim 1, characterized in that said extracting, based on a background model and color information, the silhouette of every frame of the video using a color model that separates chromaticity from brightness in the RGB color space comprises: using the chromaticity-brightness-separating color model to extract a silhouette for every frame of the video; the basic idea of silhouette extraction is to compare, pixel by pixel, the color difference between the current frame and the background model, and thereby determine whether each pixel belongs to the foreground or the background; the comparison uses a color model that separates chromaticity from brightness in the RGB color space, in which each color is a vector from the origin of RGB space, its chromaticity characterized by the direction of the vector and its brightness by the length of the vector; thus, when comparing two colors, instead of directly comparing their distance in the 3-D RGB space, their chromaticity and brightness are compared separately, the chromaticity and brightness differences being computed as:
D1 = cos⁻¹( |C1 · C2| / (‖C1‖ · ‖C2‖) )    (1)

D2 = | ‖C1‖ − ‖C2‖ | / max(‖C1‖, ‖C2‖)    (2)
where C1 and C2 are the two colors to compare, as vectors from the origin of RGB space; D1 is their chromaticity difference and D2 their brightness difference. To determine whether a pixel is foreground or background: if the D1 between the pixel's color and the color of the corresponding background-model pixel exceeds a threshold, the pixel is considered a foreground pixel; otherwise it is considered background, and if in that case D2 also exceeds a threshold, the pixel is a shadow or highlight part of the background.
3. The video-based human 3-D motion recovery method based on silhouettes and endpoint nodes according to claim 1, characterized in that said detecting and labeling the endpoint joints, including the positions of the head, both hands and both feet, on the silhouette of every frame according to curvature information comprises: after the silhouette is extracted, using curvature information to detect the positions of the endpoint joints on it; first computing the curvature of every point on the silhouette edge, determined by the angle formed at the point by the two points n points away from it in each direction along the silhouette edge, using the curvature formula:
c = ( (P1 − P) · (P2 − P) / (‖P1 − P‖ · ‖P2 − P‖) + 1 ) / 2    (3)
where P is the coordinate of the point whose curvature is being computed and P1 and P2 are the points obtained by sliding n points from P in each direction along the silhouette edge; the curvature c computed this way is a positive value between 0 and 1; to reflect whether the silhouette is convex or concave at P, one also checks whether the midpoint of the segment P1P2 lies on the silhouette foreground, and if not, c is negated to give the final curvature; after the curvature of every point on the silhouette edge has been computed, all points whose curvature exceeds a threshold are marked and the marked points are clustered; classes containing too few points are discarded, and the centroids of the remaining classes are the positions of the endpoint joints; for labeling the endpoint nodes, the user labels the first frame, and subsequent frames are labeled automatically by a method combining temporal smoothness with Kalman-filter prediction.
4. The video-based human 3-D motion recovery method based on silhouettes and endpoint nodes according to claim 1, characterized in that said dynamically modeling the hand skin color according to the color information at the detected hand positions, then using the dynamically constructed color discriminator to additionally detect the hands missed by the curvature detector of step (2), comprises: using the accurate but incomplete hand positions found by the curvature-based endpoint detection to dynamically build a Gaussian model of hand skin color, then using this color model to run a second detection pass on the original frame image, detecting the hands missed by the curvature-based hand detection.
5. The video-based human 3-D motion recovery method based on silhouettes and endpoint nodes according to claim 1, characterized in that said constructing, for every frame, using the silhouette and the endpoint-node positions detected on it, an objective function whose argument is the pose vector and whose value is the degree of match, the objective function applying a core-area penalty, an endpoint-joint penalty and a smoothness penalty to the match between the pose vector and the silhouette and endpoint nodes, comprises: constructing for every frame an objective function E(p) whose argument is the pose vector p and whose value is the degree of match E between the pose vector and the silhouette and endpoint-node positions of the current frame; this function comprises 3 penalty terms, each examining one aspect of the match: first, the core-area penalty E_core-area, i.e. the 2-D projection of the articulated skeleton model in the pose given by the pose vector should fall within the core area of the frame's silhouette, where the core area is emphasized by the Euclidean distance transform of the silhouette; second, the coverage penalty E_coverage, i.e. the endpoint-node positions of the 2-D projection of the articulated skeleton model in that pose should be close to the corresponding endpoint nodes detected on the silhouette; third, the smoothness penalty E_smoothness, i.e. the current pose vector should remain smooth with respect to the pose vector of the preceding frame, avoiding sudden changes; the three penalty terms are computed respectively as:
E_core-area = − Σ_m s′(t)( point(m) · T(t, p) ) / M    (4)
In the computation of the core-area penalty E_core-area, a series of points is uniformly sampled on the human skeleton in pose p; in the above formula, s′(t) is the distance transform of the silhouette of the current frame t, point(m) is the m-th point uniformly sampled on the skeleton, p is the pose vector, T(t, p) is the transformation matrix, depending on t and p, that maps the coordinates of a sampled point from the three-dimensional local coordinate system to the silhouette plane, and M is the total number of sampled points;
E_coverage = Σ ‖ps(i) − pc(p, i)‖ / I, summed over each detected endpoint joint i    (5)
In the above formula E_coverage is the coverage penalty, ps(i) is the position of the i-th endpoint joint detected on the silhouette of the current frame, pc(p, i) is the position of the i-th endpoint joint after the human skeleton in pose p is projected onto the silhouette plane, and I is the number of endpoint joints detected in the current frame;
E_smoothness = ‖p(t−1) − 2p + p(t−2)‖    (6)
In the above formula E_smoothness is the smoothness penalty, and p(t−1) and p(t−2) are the pose vectors of the frames one and two positions before the current frame;
Adding the three penalties gives the objective function:
E(p) = αE_core-area + βE_coverage + γE_smoothness    (7)
6. The video-based human 3-D motion recovery method based on silhouettes and endpoint nodes according to claim 1, characterized in that said segmenting the video into reliable and unreliable regions according to the number of detected endpoint nodes, then applying different strategies to the different regions, using simulated annealing to perform 3-D recovery and obtain the 3-D pose vector of every frame, comprises: classifying every frame by the number of detected endpoint nodes, i.e. a frame in which all 5 endpoint nodes are detected is a reliable frame and the remaining frames are unreliable frames; reliable and unreliable frames tend to cluster, so the whole video is split into interleaved reliable and unreliable regions, and the pose vector of every frame is reconstructed in 3-D as follows: select one frame in each reliable region as a reference frame; for the reference frame, use full simulated annealing to optimize the objective function E(p) and find the best-matching pose vector p; then, starting from the reference frame, optimize the pose vectors of the preceding and following frames in turn using incomplete simulated annealing, the frames recovered from each reference frame being limited to the reliable region containing it and the two adjacent unreliable regions; for a frame in a reliable region, this 3-D reconstruction yields one pose vector as the frame's reconstructed pose; for a frame in an unreliable region, it yields two pose vectors, which are blended, using the inverses of the frame's distances to the reliable region boundaries before and after it as weights, into the frame's reconstructed pose; the per-frame reconstruction results are connected and, after post-processing, form a coherent 3-D motion sequence.
CNB2006100534057A 2006-09-14 2006-09-14 Video human three-dimensional motion restoration method based on silhouette and endpoint node Expired - Fee Related CN100541540C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2006100534057A CN100541540C (en) 2006-09-14 2006-09-14 Video human three-dimensional motion restoration method based on silhouette and endpoint node

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2006100534057A CN100541540C (en) 2006-09-14 2006-09-14 Video human three-dimensional motion restoration method based on silhouette and endpoint node

Publications (2)

Publication Number Publication Date
CN101075351A true CN101075351A (en) 2007-11-21
CN100541540C CN100541540C (en) 2009-09-16

Family

ID=38976385

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2006100534057A Expired - Fee Related CN100541540C (en) 2006-09-14 2006-09-14 Video human three-dimensional motion restoration method based on silhouette and endpoint node

Country Status (1)

Country Link
CN (1) CN100541540C (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101976439A (en) * 2010-11-02 2011-02-16 上海海事大学 Visual attention model with combination of motion information in visual system of maritime search and rescue machine
CN102262783A (en) * 2011-08-16 2011-11-30 清华大学 Method and system for restructuring motion of three-dimensional gesture
CN102576466A (en) * 2009-11-18 2012-07-11 微软公司 Systems and methods for tracking a model
CN102682477A (en) * 2012-05-16 2012-09-19 南京邮电大学 Regular scene three-dimensional information extracting method based on structure prior
CN103116895A (en) * 2013-03-06 2013-05-22 清华大学 Method and device of gesture tracking calculation based on three-dimensional model
US8963829B2 (en) 2009-10-07 2015-02-24 Microsoft Corporation Methods and systems for determining and tracking extremities of a target
US8970487B2 (en) 2009-10-07 2015-03-03 Microsoft Technology Licensing, Llc Human tracking system
CN108664896A (en) * 2018-04-16 2018-10-16 彭友 Fencing action acquisition methods based on OpenPose and computer storage media
CN109543576A (en) * 2018-11-09 2019-03-29 石家庄铁道大学 Train driver detection method based on bone detection and three-dimensional reconstruction
CN112580582A (en) * 2020-12-28 2021-03-30 达闼机器人有限公司 Action learning method, action learning device, action learning medium and electronic equipment
CN116688494A (en) * 2023-08-04 2023-09-05 荣耀终端有限公司 Method and electronic device for generating game prediction frame

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9659377B2 (en) 2009-10-07 2017-05-23 Microsoft Technology Licensing, Llc Methods and systems for determining and tracking extremities of a target
US9821226B2 (en) 2009-10-07 2017-11-21 Microsoft Technology Licensing, Llc Human tracking system
US8970487B2 (en) 2009-10-07 2015-03-03 Microsoft Technology Licensing, Llc Human tracking system
US9522328B2 (en) 2009-10-07 2016-12-20 Microsoft Technology Licensing, Llc Human tracking system
US9582717B2 (en) 2009-10-07 2017-02-28 Microsoft Technology Licensing, Llc Systems and methods for tracking a model
US8891827B2 (en) 2009-10-07 2014-11-18 Microsoft Corporation Systems and methods for tracking a model
US8897495B2 (en) 2009-10-07 2014-11-25 Microsoft Corporation Systems and methods for tracking a model
US8963829B2 (en) 2009-10-07 2015-02-24 Microsoft Corporation Methods and systems for determining and tracking extremities of a target
CN102576466A (en) * 2009-11-18 2012-07-11 微软公司 Systems and methods for tracking a model
CN101976439A (en) * 2010-11-02 2011-02-16 上海海事大学 Visual attention model with combination of motion information in visual system of maritime search and rescue machine
CN102262783A (en) * 2011-08-16 2011-11-30 清华大学 Method and system for restructuring motion of three-dimensional gesture
CN102682477A (en) * 2012-05-16 2012-09-19 南京邮电大学 Regular scene three-dimensional information extracting method based on structure prior
CN102682477B (en) * 2012-05-16 2015-04-08 南京邮电大学 Regular scene three-dimensional information extracting method based on structure prior
CN103116895A (en) * 2013-03-06 2013-05-22 清华大学 Method and device of gesture tracking calculation based on three-dimensional model
CN108664896A (en) * 2018-04-16 2018-10-16 彭友 Fencing action acquisition methods based on OpenPose and computer storage media
CN109543576A (en) * 2018-11-09 2019-03-29 石家庄铁道大学 Train driver detection method based on bone detection and three-dimensional reconstruction
CN112580582A (en) * 2020-12-28 2021-03-30 达闼机器人有限公司 Action learning method, action learning device, action learning medium and electronic equipment
CN116688494B (en) * 2023-08-04 2023-10-20 荣耀终端有限公司 Method and electronic device for generating game prediction frame
CN116688494A (en) * 2023-08-04 2023-09-05 荣耀终端有限公司 Method and electronic device for generating game prediction frame

Also Published As

Publication number Publication date
CN100541540C (en) 2009-09-16

Similar Documents

Publication Publication Date Title
CN101075351A (en) Method for restoring human-body videothree-dimensional movement based on sided shadow and end node
CN109544636B (en) Rapid monocular vision odometer navigation positioning method integrating feature point method and direct method
CN102364497B (en) Image semantic extraction method applied in electronic guidance system
CN103745218B (en) Gesture identification method and device in depth image
CN109758756B (en) Gymnastics video analysis method and system based on 3D camera
CN102800126A (en) Method for recovering real-time three-dimensional body posture based on multimodal fusion
CN101154289A (en) Method for tracing three-dimensional human body movement based on multi-camera
CN102682452A (en) Human movement tracking method based on combination of production and discriminant
CN104298968A (en) Target tracking method under complex scene based on superpixel
CN106815855A (en) Based on the human body motion tracking method that production and discriminate combine
CN110889844A (en) Coral distribution and health condition assessment method based on deep clustering analysis
Xu et al. Integrated approach of skin-color detection and depth information for hand and face localization
CN106778510B (en) Method for matching high-rise building characteristic points in ultrahigh-resolution remote sensing image
CN1758283A (en) Nerve network of simulating multi-scale crossover receptive field and its forming method and application
Chen et al. A stereo visual-inertial SLAM approach for indoor mobile robots in unknown environments without occlusions
CN111709301A (en) Method for estimating motion state of curling ball
CN104268592A (en) Multi-view combined movement dictionary learning method based on collaboration expression and judgment criterion
Zhao et al. Boundary regularized building footprint extraction from satellite images using deep neural network
CN103617631A (en) Tracking method based on center detection
Zheng et al. Self-supervised monocular depth estimation based on combining convolution and multilayer perceptron
Liu et al. Video based human animation technique
CN1766929A (en) A kind of motion object motion reconstructing method based on three-dimensional data base
Liu et al. Key algorithm for human motion recognition in virtual reality video sequences based on hidden markov model
CN111833249A (en) UAV image registration and splicing method based on bidirectional point characteristics
Le et al. Three for one and one for three: Flow, Segmentation, and Surface Normals

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090916

Termination date: 20120914