CN102768659A - Method and system for identifying repeated account - Google Patents
- Publication number
- CN102768659A (application number CN201110113252A)
- Authority
- CN
- China
- Prior art keywords
- account
- characteristic
- information
- similarity
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention provides a method and a system for identifying duplicate accounts. The method includes: acquiring feature information of a first account and a second account saved by a website server; calculating the similarity between each feature in the feature information of the first account and the corresponding feature in the feature information of the second account; using the obtained similarities as input parameters of a preset identification model, calculating the similarity between the feature information of the first account and that of the second account according to the model, and judging from the result whether the first account and the second account are duplicate accounts. The method and the system address the inability of the prior art to identify duplicate accounts correctly, so that duplicate accounts are identified accurately and operating speed is improved.
Description
Technical field
The present application relates to the field of Internet information, and in particular to a method and system for automatically identifying duplicate (repeated) accounts.
Background art
In current Internet use, duplicate information is one of the problems that most degrades the user's search experience and most increases the search load on search engine servers. Taking an e-commerce website as an example, duplicate accounts make buyers repeat work when contacting sellers, and they can keep the information of some good sellers from being exposed. At the same time, a large number of duplicate accounts increases the load on the search engine whenever users query information, and slows the search down.
In the prior art, duplicate accounts are generally identified through the following steps:
S1: the server obtains an account to be identified;
S2: the server compares the name of the account to be identified, one by one, with the names of a predetermined number of accounts in a database, as follows:
using preset word-segmentation dictionaries for different parts of speech, the name of the account to be identified and the account names in the database are segmented into words and the part of speech of each word is determined;
the segmented and tagged business name corresponding to the account to be identified, and the segmented and tagged physical-store names in the database, are each inserted into a predetermined template;
a score for the account-name comparison is obtained by checking whether the words of the business name corresponding to the account to be identified and the words of the physical-store name in the database that occupy the same part-of-speech slot of the template are identical;
S3: the server compares the score with a preset standard score to judge whether the account to be identified duplicates the compared account in the database;
S4: the server adds to the database any account to be identified that is judged not to be a duplicate.
The above method identifies duplicate accounts by judging whether account names are identical. However, those skilled in the art will understand that in e-commerce a seller's account generally comprises multiple pieces of feature information, for example the account name, the name of the company corresponding to the account, the company introduction, contact details, access behavior and so on. Whether the account names are identical cannot establish accurately whether the accounts are duplicates. For example, the account name of account A is Apple, and that company mainly sells apples of various kinds, such as red Fuji apples; the account name of account B is also Apple, and that company mainly sells electronic products such as the iPhone and iPad. The feature information of account A and account B is clearly different, yet if only the account names are compared, account A and account B are treated as duplicate accounts, which is an identification error. Because duplicate accounts are identified inaccurately, large numbers of them persist and the search load on the search engine server is not relieved. A scheme is therefore urgently needed that improves the accuracy of account identification, thereby relieving the search load on the search engine server and speeding up searches.
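The Apple example can be reproduced with a few lines of Python. This is only an illustrative sketch: the `Account` class, its fields and the `is_duplicate_by_name` helper are hypothetical names introduced here, not part of the prior-art method as filed.

```python
from dataclasses import dataclass


@dataclass
class Account:
    # Hypothetical minimal account record, for illustration only.
    name: str
    products: tuple


def is_duplicate_by_name(a: Account, b: Account) -> bool:
    """Name-only check in the spirit of the prior art: ignores all other features."""
    return a.name.strip().lower() == b.name.strip().lower()


account_a = Account("Apple", ("red Fuji apples",))
account_b = Account("Apple", ("iPhone", "iPad"))

# The name-only check reports a duplicate even though the products differ.
name_only_verdict = is_duplicate_by_name(account_a, account_b)
print(name_only_verdict)
```

The verdict is a false positive, which is the failure the multi-feature comparison is meant to avoid.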
Summary of the invention
The present application aims to provide a method and system for automatically identifying duplicate accounts, to solve the problem in the prior art that duplicate accounts cannot be identified correctly, which increases the search load on the search engine server.
According to one aspect of the present application, a method for automatically identifying duplicate accounts is provided, comprising: acquiring feature information of a first account and a second account saved by a server of a website; calculating the similarity between each feature parameter of a feature in the feature information of the first account and each feature parameter of the corresponding feature in the feature information of the second account; fitting the similarities between the feature parameters according to pre-assigned weight parameters to obtain the similarity between each feature of the first account and the corresponding feature of the second account; and judging, according to the similarity between each feature of the first account and the corresponding feature of the second account, whether the first account and the second account are duplicate accounts.
According to another aspect of the present application, a system for automatically identifying duplicate accounts is provided, comprising: an acquiring unit, configured to acquire feature information of a first account and a second account saved by a server of a website, where the feature information comprises one of the following features or a combination of them: a basic-information feature of the account, a product-information feature of the products published by the account, and a behavior-information feature of the account; a computing unit, configured to calculate the similarity between each feature parameter of a feature in the feature information of the first account and each feature parameter of the corresponding feature in the feature information of the second account, and to fit the similarities between the feature parameters according to pre-assigned weight parameters to obtain the similarity between each feature of the first account and the corresponding feature of the second account; and a judging unit, configured to judge, according to the similarity between each feature of the first account and the corresponding feature of the second account, whether the first account and the second account are duplicate accounts.
The present application has the following beneficial effects:
1) Whether two accounts are duplicates is judged by fitting the similarities of multiple features between the two accounts. This effectively avoids presenting erroneous duplicate information to users as a result of inaccurate judgment, achieves accurate identification of duplicate accounts, relieves the processing load on the search engine server when it handles user query requests, and improves search speed;
2) The feature information comprises multiple features, for example the basic-information feature of the account, the product-information feature of the products published by the account, and the behavior-information feature of the account. Similarity is calculated from several dimensions using this feature information, which avoids relying on a single dimension when identifying duplicate accounts and improves identification accuracy;
3) Training the identification model reduces the number of calculation loops, which raises the system's operating speed when identifying duplicate accounts and saves computing time.
Description of drawings
The accompanying drawings described here provide a further understanding of the application and form part of it. The illustrative embodiments of the application and their description are used to explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a schematic diagram of one preferred structure of the duplicate-account automatic identification system according to an embodiment of the application;
Fig. 2 is a schematic diagram of another preferred structure of the duplicate-account automatic identification system according to an embodiment of the application;
Fig. 3 is one preferred flowchart of the duplicate-account automatic identification method according to an embodiment of the application;
Fig. 4 is another preferred flowchart of the duplicate-account automatic identification method according to an embodiment of the application.
Detailed description
The application is described in detail below with reference to the drawings and in conjunction with embodiments. It should be noted that, where they do not conflict, the embodiments of the application and the features of the embodiments may be combined with one another.
Before further details of the embodiments of the present application are described, a suitable computing architecture that can be used to implement the principles of the application will be described with reference to Fig. 1. In the following description, unless otherwise indicated, the embodiments are described with reference to acts and symbolic representations of operations performed by one or more computers. It will thus be understood that such acts and operations, which are at times referred to as being computer-executed, include the manipulation by the processing unit of the computer of electrical signals representing data in a structured form. This manipulation transforms the data or maintains it at locations in the memory system of the computer, which reconfigures or otherwise alters the operation of the computer in a manner well understood by those skilled in the art. The data structures in which data is maintained are physical locations of the memory that have particular properties defined by the format of the data. However, although the application is described in this context, this is not meant to be limiting: as those skilled in the art will appreciate, aspects of the acts and operations described below may also be implemented in hardware.
Turning to the drawings, in which like reference numerals refer to like elements, the principles of the application are shown implemented in a suitable computing environment. The following description is based on the described embodiments of the application and should not be taken as limiting the application with regard to alternative embodiments not explicitly described here.
Fig. 1 shows a schematic diagram of an example computer architecture that can be used for these devices. For purposes of illustration, the architecture depicted is only one example of a suitable environment and is not intended to suggest any limitation on the scope of use or functionality of the application. Nor should the computing system be interpreted as having any dependency or requirement relating to any one or combination of the components shown in Fig. 1.
The principles of the application can operate with other general-purpose or special-purpose computing or communication environments or configurations. Examples of well-known computing systems, environments and configurations suitable for the application include, but are not limited to, personal computers, servers, multiprocessor systems, microprocessor-based systems, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
In its most basic configuration, the duplicate-account automatic identification system 100 in Fig. 1 includes at least one processing unit 102 and a memory 104. The processing unit 102 may be, but is not limited to, a microprocessor (MCU), a programmable logic device (FPGA) or the like; the memory 104 may be volatile (such as RAM), non-volatile (such as ROM or flash memory), or some combination of the two. In this specification and the claims, a "duplicate-account automatic identification system" is defined as any hardware component, or combination of hardware components, that can execute software, firmware or microcode to realize a function. The duplicate-account automatic identification system 100 may even be distributed, to realize distributed functions.
As used in this application, the terms "module", "component" and "unit" may refer to software objects or routines executed on the duplicate-account automatic identification system 100. The different components, modules, units, engines and services described here may be implemented as objects or processes executed on the system 100 (for example, as separate threads). Although the systems and methods described here are preferably implemented in software, implementations in hardware or in a combination of software and hardware are also possible and contemplated.
As used in this application, "word segmentation" and "part-of-speech tagging" are common natural-language-processing methods. Word segmentation divides a sequence of Chinese text into meaningful words. Part-of-speech tagging assigns a suitable part of speech, such as verb or noun, to each word obtained by segmentation. In e-commerce, commonly used categories include product words, model words and brand words. In this application, word segmentation and part-of-speech tagging are performed by the system. The application is of course not limited to this: the operations may also be performed manually, or by a combination of manual work and the system.
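Word segmentation and part-of-speech tagging can be sketched with a toy dictionary-based segmenter. The application does not say which segmentation algorithm the system uses; forward maximum matching is assumed here purely for illustration, and the `LEXICON` entries and the `segment_and_tag` name are hypothetical.

```python
# Hypothetical lexicon mapping words to e-commerce parts of speech.
LEXICON = {"红富士": "brand", "苹果": "product", "批发": "verb"}


def segment_and_tag(text, lexicon=LEXICON):
    """Forward maximum matching: at each position take the longest lexicon word."""
    max_len = max(len(word) for word in lexicon)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            if word in lexicon:
                tokens.append((word, lexicon[word]))
                i += size
                break
        else:
            # No lexicon word starts here: emit one character tagged as unknown.
            tokens.append((text[i], "unknown"))
            i += 1
    return tokens


print(segment_and_tag("红富士苹果批发"))
```

A production system would normally rely on a trained segmenter and tagger; the dictionary lookup only shows the shape of the output, a list of (word, part of speech) pairs.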
The duplicate-account automatic identification system 100 may also include a communication unit 106 that allows the host to communicate with other systems and devices over a network 108. The communication unit 106 may be a wired transmission device, such as a wired network communication interface and chip, or a wireless transmission device, such as an RF, infrared or Bluetooth device.
Embodiment 1
Fig. 2 is a schematic diagram of another preferred structure of the duplicate-account automatic identification system according to an embodiment of the application. Preferably, each component shown in Fig. 2 may be, but is not limited to being, implemented by the processing unit 102 shown in Fig. 1. As shown in Fig. 2, the system comprises: an acquiring unit 202, configured to acquire feature information of a first account and a second account saved by a server of a website, where the feature information comprises one of the following features or a combination of them: a basic-information feature of the account, a product-information feature of the products published by the account, and a behavior-information feature of the account; a computing unit 204, configured to calculate the similarity between each feature parameter of a feature in the feature information of the first account and each feature parameter of the corresponding feature in the feature information of the second account, and to fit the similarities between the feature parameters according to pre-assigned weight parameters to obtain the similarity between each feature of the first account and the corresponding feature of the second account; and a judging unit 206, configured to judge, according to the similarity between each feature of the first account and the corresponding feature of the second account, whether the first account and the second account are duplicate accounts.
In a preferred embodiment of the application, whether two accounts are duplicates is judged by fitting the similarities of multiple features between the two accounts. This effectively avoids presenting erroneous duplicate information to users as a result of inaccurate judgment and achieves accurate identification of duplicate accounts. It further improves the user's experience of web search, e-commerce and similar services, relieves the processing load on the search engine server when it handles query requests, and raises query speed. In addition, the feature information comprises multiple features, for example the basic-information feature of the account, the product-information feature of the products published by the account, and the behavior-information feature of the account. Similarity is calculated from several dimensions using this feature information, which avoids relying on a single dimension when identifying duplicate accounts and improves identification accuracy.
Preferably, the computing unit 204 comprises a first acquisition module 2041, a second acquisition module 2042, a selection module 2043 and a first computing module 2044, connected in sequence. In a preferred embodiment of the application, these modules use the cosine-angle method to calculate the similarity between feature parameters, as described below:
When calculating the similarity between each feature parameter of a feature in the feature information of the first account and each feature parameter of the corresponding feature in the feature information of the second account, the first acquisition module 2041 obtains a first group of keywords $A_1, A_2, \ldots, A_M$, produced by segmenting a first feature parameter into words, and a first group of weights $W_{A1}, W_{A2}, \ldots, W_{AM}$, produced by tagging the first group of keywords with parts of speech and assigning a weight to each keyword according to its part of speech, where the first feature parameter is one feature parameter of a feature in the feature information of the first account. The second acquisition module 2042 obtains a second group of keywords $B_1, B_2, \ldots, B_N$, produced by segmenting a second feature parameter into words, and a second group of weights $W_{B1}, W_{B2}, \ldots, W_{BN}$, produced by tagging the second group of keywords with parts of speech and assigning a weight to each keyword according to its part of speech, where the second feature parameter is one feature parameter of a feature in the feature information of the second account.
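The weight-allocation step can be sketched as a lookup from part of speech to weight. The application does not give concrete weights; the `POS_WEIGHTS` table, the `DEFAULT_WEIGHT` value and the `weight_keywords` name below are hypothetical and chosen only for illustration.

```python
# Hypothetical weights per part of speech (not taken from the application).
POS_WEIGHTS = {"brand": 3.0, "product": 2.0, "model": 2.0}
DEFAULT_WEIGHT = 1.0


def weight_keywords(tagged_keywords):
    """Map each (keyword, part_of_speech) pair to a keyword -> weight entry."""
    weights = {}
    for word, pos in tagged_keywords:
        weights[word] = POS_WEIGHTS.get(pos, DEFAULT_WEIGHT)
    return weights


first_group = weight_keywords(
    [("apple", "brand"), ("iphone", "product"), ("sell", "verb")]
)
print(first_group)
```

The resulting dictionary plays the role of one keyword group together with its weight group, for example $A_1 \ldots A_M$ with $W_{A1} \ldots W_{AM}$.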
After the above are obtained, the selection module 2043 selects the keywords $C_1, \ldots, C_H$ ($H \ge 1$) that the first group of keywords and the second group of keywords have in common, together with their weights $W_{C1}, \ldots, W_{CH}$. The first computing module 2044 then calculates the similarity df between the first feature parameter and the second feature parameter by the cosine-angle formula, in which:
$$d1 = W_{C1} \times W_{C1} + \cdots + W_{CH} \times W_{CH}$$
$$da = W_{A1} \times W_{A1} + \cdots + W_{AM} \times W_{AM}$$
$$db = W_{B1} \times W_{B1} + \cdots + W_{BN} \times W_{BN}$$
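The quantities d1, da and db can be combined in code as follows. The combining formula itself is not reproduced in the text above, so this sketch assumes the standard cosine form df = d1 / (sqrt(da) × sqrt(db)); that form, the dictionary representation of the keyword groups and the `cosine_similarity` name are all assumptions.

```python
import math


def cosine_similarity(weights_a, weights_b):
    """Cosine of the angle between two keyword -> weight mappings.

    Assumes df = d1 / (sqrt(da) * sqrt(db)), the usual cosine form.
    """
    common = weights_a.keys() & weights_b.keys()
    if not common:
        return 0.0  # no shared keyword (H = 0): treat as completely dissimilar
    # d1: products of the weights of the shared keywords. When a weight depends
    # only on part of speech, both sides agree and this equals W_C * W_C.
    d1 = sum(weights_a[k] * weights_b[k] for k in common)
    da = sum(w * w for w in weights_a.values())
    db = sum(w * w for w in weights_b.values())
    return d1 / (math.sqrt(da) * math.sqrt(db))


a = {"apple": 3.0, "iphone": 2.0}
b = {"apple": 3.0, "ipad": 2.0}
df = cosine_similarity(a, b)
print(round(df, 4))
```

With these hypothetical weights, d1 = 9 and da = db = 13, so df = 9/13. Identical keyword groups give 1 and disjoint groups give 0.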
The cosine-angle method above uses different weights to calculate the similarity between feature parameters, rather than a single unweighted comparison, and therefore yields the similarity between two feature parameters accurately. The cosine-angle method is of course only one example; the application is not limited to it, and the similarity may also be calculated by other similar methods.
As shown in Fig. 2, the computing unit 204 also comprises a second computing module 2045. When fitting the similarities between the feature parameters of a first feature of the first account and the feature parameters of the corresponding second feature of the second account, the second computing module 2045 may use a linear fit, that is, the fit may be carried out through the following formula:
$$d = c_1 \times W_{c1} + c_2 \times W_{c2} + \cdots + c_q \times W_{cq}, \quad q \ge 1$$
where d is the similarity between the first feature of the first account and the corresponding second feature of the second account;
$c_1, c_2, \ldots, c_q$ are the similarities between the feature parameters of the first feature and the feature parameters of the second feature;
$W_{c1}, W_{c2}, \ldots, W_{cq}$ are pre-assigned weights.
The linear fit above is of course only one possible approach, and the application is not limited to it.
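The linear fit is a weighted sum and can be sketched in a few lines. The function name `fit_feature_similarity` and the choice of raising `ValueError` on mismatched input are assumptions made for this example only.

```python
def fit_feature_similarity(param_similarities, weights):
    """Weighted sum d = c1*W_c1 + ... + cq*W_cq over q >= 1 feature parameters."""
    if not param_similarities or len(param_similarities) != len(weights):
        raise ValueError("need q >= 1 similarities and exactly q weights")
    return sum(c * w for c, w in zip(param_similarities, weights))
```

If the pre-assigned weights sum to 1 and each parameter similarity lies in [0, 1], the fitted similarity d also lies in [0, 1]. The application does not state such a normalization, so it is only a convenient convention.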
For instance, the basic-information feature of the first account comprises the feature parameters company address (A1), company introduction (A2) and company phone number (A3), and the basic-information feature of the second account comprises the feature parameters company address (B1), company introduction (B2) and company phone number (B3). To calculate the similarity between the basic-information feature of the first account and that of the second account, the first computing module 2044 first calculates the similarity C1 between A1 and B1, the similarity C2 between A2 and B2, and the similarity C3 between A3 and B3; the second computing module 2045 then obtains the similarity between the two basic-information features by linearly fitting C1, C2 and C3. In a concrete implementation, the cosine-angle method may be used to calculate the similarity between each parameter in the basic-information feature of the first account and the corresponding parameter in the basic-information feature of the second account; for the detailed procedure, see the calculation relating to Tables 1 to 4 in Embodiment 3. For the fitting procedure itself, see likewise the calculation relating to Tables 1 to 4 in Embodiment 3.
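As a worked illustration of the fitting step in this example, assume purely hypothetical values: similarities C1 = 0.9 (address), C2 = 0.5 (introduction) and C3 = 1.0 (phone number), with pre-assigned weights 0.3, 0.3 and 0.4. None of these numbers comes from the application. The linear fit then gives:

$$d = 0.9 \times 0.3 + 0.5 \times 0.3 + 1.0 \times 0.4 = 0.27 + 0.15 + 0.40 = 0.82$$

Under these assumed weights an identical phone number contributes the most to the basic-information similarity, while a loosely matching introduction lowers it only moderately.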
In the above preferred embodiment, the similarity between a pair of features is obtained by fitting the similarities of their feature parameters, which ensures that the similarity between the pair of features is calculated accurately.
Further, the judging unit 206 comprises a third computing module 2061 and a judging module 2062, connected in sequence. When judging, according to the similarity between each feature of the first account and the corresponding feature of the second account, whether the first account and the second account are duplicate accounts, the third computing module 2061 takes the similarity between each feature of the first account and the corresponding feature of the second account as an input parameter of a predetermined identification model, and calculates through that model the similarity between the feature information of the first account and the feature information of the second account; the judging module 2062 judges from the resulting similarity whether the first account and the second account are duplicate accounts.
Preferably, the third computing module 2061 comprises a training submodule and a calculating submodule, connected in sequence. When the similarity between the feature information of the first account and that of the second account is calculated through the predetermined identification model, the training submodule first trains the model with a predetermined number of training parameters. Each training parameter comprises the similarities between the features of two accounts, as input parameters, and a similarity between those two accounts that is set in advance, as the output parameter. The calculating submodule then takes the similarity between each feature in the feature information of the first account and the corresponding feature in the feature information of the second account as input parameters, and obtains the similarity between the feature information of the two accounts from the trained model. Training the identification model reduces the number of calculation loops, which raises the system's operating speed when identifying duplicate accounts and saves computing time. For the concrete training procedure in this preferred embodiment, see the calculation relating to Tables 1 to 4 in Embodiment 3.
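The train-then-calculate flow can be sketched with a small model. The application does not specify the type of the predetermined identification model in this section; logistic regression trained by gradient descent is assumed here only for illustration, and the function names, training samples and hyperparameters are hypothetical.

```python
import math


def predict(model, feature_sims):
    """Map per-feature similarities to an account-level similarity in (0, 1)."""
    weights, bias = model
    z = bias + sum(w * x for w, x in zip(weights, feature_sims))
    return 1.0 / (1.0 + math.exp(-z))


def train_model(samples, epochs=2000, learning_rate=0.5):
    """Fit weights and bias from (feature similarities, preset similarity) pairs."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = predict((weights, bias), inputs) - target
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
    return weights, bias


# Hypothetical training parameters: (basic, product, behavior) similarities
# as input, and a similarity set in advance as output.
training = [
    ([0.90, 0.80, 1.0], 1.0),
    ([0.95, 0.90, 1.0], 1.0),
    ([0.20, 0.10, 0.0], 0.0),
    ([0.60, 0.10, 0.0], 0.0),
]
model = train_model(training)
print(predict(model, [0.92, 0.84, 1.0]) > 0.5)
```

Once trained, the model is evaluated with a single weighted sum per account pair, which is one plausible reading of how training saves calculation at identification time.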
In addition, the judging module 2062 comprises a judging submodule, configured to judge whether the similarity between the feature information of the first account and the feature information of the second account is greater than a predetermined threshold, and to judge that the first account and the second account are duplicate accounts when that similarity is greater than the predetermined threshold. In a preferred embodiment of the application, judging against a threshold identifies duplicate accounts effectively. The manner of judgment is of course not limited to this.
Preferably, the acquiring unit 202 comprises at least one of the following. A first acquisition module 2021, configured to acquire the basic information of the first account and of the second account; to perform word segmentation and part-of-speech tagging on the basic information of the first account and assign a weight, according to the tagged part of speech, to each keyword obtained by segmenting that basic information, so as to obtain the basic-information feature of the first account; and to perform word segmentation and part-of-speech tagging on the basic information of the second account and assign a weight, according to the tagged part of speech, to each keyword obtained by segmenting that basic information, so as to obtain the basic-information feature of the second account. A second acquisition module 2022, configured to acquire the product information of the first account and of the second account; to perform word segmentation and part-of-speech tagging on the product information of the first account, compute percentage statistics, according to the tagged part of speech, over the keywords obtained by segmenting that product information, and use the statistics as the product-information feature of the products published by the first account; and to perform word segmentation and part-of-speech tagging on the product information of the second account, compute percentage statistics, according to the tagged part of speech, over the keywords obtained by segmenting that product information, and use the statistics as the product-information feature of the products published by the second account. Or a third acquisition module 2023, configured to acquire the identification information (Cookie ID) used when the first account and the second account log in to the website, and to use the acquired Cookie ID of the first account as the behavior-information feature of the first account and the acquired Cookie ID of the second account as the behavior-information feature of the second account. In a preferred embodiment of the application, these steps yield useful feature information and make the similarity judgment more accurate.
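The percentage statistics for the product-information feature can be sketched as follows. The text does not spell out exactly what is counted; this example assumes the statistic is each keyword's share among the keywords tagged with a given part of speech, and the `product_feature` name and sample data are hypothetical.

```python
from collections import Counter


def product_feature(tagged_keywords, pos="product"):
    """Share of each keyword among all keywords tagged with the given part of speech."""
    counts = Counter(word for word, tag in tagged_keywords if tag == pos)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


listing = [
    ("apple", "product"),
    ("apple", "product"),
    ("pear", "product"),
    ("fresh", "adjective"),
]
feature = product_feature(listing)
print(feature)
```

An account that mostly publishes apples and one that mostly publishes phones then have clearly different product-information features even if their account names match.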
Preferably, the duplicate-account automatic identification system also comprises a communication unit 208, configured to send indication information to the user after the first account and the second account are judged to be duplicate accounts, where the indication information indicates that the first account and the second account are duplicate accounts. In a preferred embodiment of the application, this notification lets the user manage accounts flexibly and improves the user experience.
Embodiment 2
Based on the duplicate account automatic identification system shown in Fig. 1 and Fig. 2, the application also provides a duplicate account automatic identification method. As shown in Fig. 3, the method in this embodiment comprises:
S302: obtain the feature information of a first account and a second account stored by the server of a website. Preferably, step S302 may be, but is not limited to being, performed by the processing unit 102 in Fig. 1 or the acquiring unit 202 in Fig. 2.
S304: calculate the similarity between each feature parameter of a feature in the feature information of the first account and each feature parameter of the corresponding feature in the feature information of the second account. Preferably, step S304 may be, but is not limited to being, performed by the processing unit 102 in Fig. 1 or the computing unit 204 in Fig. 2.
S306: fit the similarities between the feature parameters according to pre-assigned weight parameters to obtain the similarity between each feature of the first account and the corresponding feature of the second account. Preferably, step S306 may be, but is not limited to being, performed by the processing unit 102 in Fig. 1 or the computing unit 204 in Fig. 2.
S308: judge, according to the similarity between each feature of the first account and the corresponding feature of the second account, whether the first account and the second account are duplicate accounts. Preferably, step S308 may be, but is not limited to being, performed by the processing unit 102 in Fig. 1 or the judging unit 206 in Fig. 2.
In this preferred embodiment of the application, whether two accounts are duplicates is judged by fitting the similarities of multiple features between them. This effectively avoids presenting wrong duplicate information to the user because of an inaccurate judgment, so duplicate accounts are identified accurately and the user experience with web search, e-commerce and similar services is further improved.
Preferably, the feature information comprises at least one of the following features: the basic information feature of the account, the product information feature of the products published by the account, or the behavioral information feature of the account. Because the feature information in the application covers several features, similarity is calculated from multiple dimensions; this avoids relying on a single dimension when identifying duplicate accounts and improves identification accuracy.
Preferably, the first acquisition module 2041, the second acquisition module 2042, the selection module 2043 and the first computing module 2044 in Fig. 2 use the cosine-angle method to calculate the similarity between feature parameters. That is, the similarity between a first feature parameter of a feature in the feature information of the first account and a second feature parameter of the corresponding feature in the feature information of the second account is calculated through the following steps:
S1: obtain a first group of keywords $A_1, A_2, \ldots, A_M$ produced by word segmentation of the first feature parameter, and obtain a first group of weights $W_{A1}, W_{A2}, \ldots, W_{AM}$ produced by part-of-speech tagging of the first group of keywords and weight allocation to each keyword according to its part of speech.
S2: obtain a second group of keywords $B_1, B_2, \ldots, B_N$ produced by word segmentation of the second feature parameter, and obtain a second group of weights $W_{B1}, W_{B2}, \ldots, W_{BN}$ produced by part-of-speech tagging of the second group of keywords and weight allocation to each keyword according to its part of speech.
S3: select the keywords $C_1, \ldots, C_H$ ($H \ge 1$) that are identical between the first group and the second group, together with their corresponding weights $W_{C1}, \ldots, W_{CH}$.
S4: calculate the similarity $\mathrm{df}$ between the first feature parameter and the second feature parameter through the following formula:
$\mathrm{df} = \dfrac{\mathrm{d1}}{\sqrt{\mathrm{da}} \times \sqrt{\mathrm{db}}}$
where
$\mathrm{d1} = W_{C1} \times W_{C1} + \cdots + W_{CH} \times W_{CH}$;
$\mathrm{da} = W_{A1} \times W_{A1} + \cdots + W_{AM} \times W_{AM}$;
$\mathrm{db} = W_{B1} \times W_{B1} + \cdots + W_{BN} \times W_{BN}$.
The cosine-angle method uses different weights for different keywords when calculating the similarity between feature parameters, rather than treating every keyword alike, and therefore yields the similarity between two feature parameters accurately. Of course, the cosine-angle method in the application is only an example; the application is not limited to it, and the similarity may also be calculated by other similar methods.
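Steps S1 to S4 can be sketched in Python as follows. This is a minimal illustration, not the application's implementation: the function name `weighted_cosine` and the dictionary representation (keyword mapped to weight) are assumptions, and a common keyword is scored with the product of its two weights, which reduces to the squared weight used in S4 whenever both accounts assign the same weight to that keyword. The sample weights are the company-name weights quoted in Embodiment 3.

```python
import math


def weighted_cosine(weights_a, weights_b):
    """Similarity df between two keyword -> weight mappings (steps S1-S4)."""
    # S3: keywords that appear in both groups.
    common = weights_a.keys() & weights_b.keys()
    # d1: weight products of the common keywords (W_C * W_C when weights agree).
    d1 = sum(weights_a[k] * weights_b[k] for k in common)
    # da, db: sums of squared weights over all keywords of each group.
    da = sum(w * w for w in weights_a.values())
    db = sum(w * w for w in weights_b.values())
    if da == 0 or db == 0:
        return 0.0  # nothing to compare
    # S4: df = d1 / (sqrt(da) * sqrt(db))
    return d1 / (math.sqrt(da) * math.sqrt(db))


# Company-name weights of accounts A and B as listed in Embodiment 3.
name_a = {"Hangzhou": 1.95, "Jia Hua": 3.1, "technology": 0.8,
          "limited": 0.4, "company": 0.2}
name_b = dict(name_a, **{"sales department": 0.6})

print(round(weighted_cosine(name_a, name_b), 3))
```

With exactly these weights the sketch evaluates to roughly 0.988, while the text of Embodiment 3 quotes 0.96 for the same pair, so the sketch may not reproduce every detail of the weighting used there.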
Preferably, the second computing module 2045 may use linear fitting to fit the similarities between the feature parameters of a first feature of the first account and the feature parameters of the corresponding second feature of the second account, using the following formula:
$d = c_1 \times W_{c1} + c_2 \times W_{c2} + \cdots + c_q \times W_{cq}, \quad q \ge 1$
where $d$ is the similarity between the first feature of the first account and the corresponding second feature of the second account;
$c_1, c_2, \ldots, c_q$ are the similarities between each feature parameter of the first feature and the corresponding feature parameter of the second feature;
$W_{c1}, W_{c2}, \ldots, W_{cq}$ are the pre-assigned weights.
Of course, linear fitting is only one possible approach, and the application is not limited to it.
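The linear fit is a plain weighted sum, which a short Python sketch makes concrete. The function name `linear_fit` is an assumption. The weights 0.55, 0.35 and 0.1 are the ones quoted for company name, company profile and contact information in Embodiment 3; the parameter similarities 0.96, 0.92 and 0.98 are illustrative values, of which only 0.96 appears in the text.

```python
def linear_fit(similarities, weights):
    """d = c1*W_c1 + c2*W_c2 + ... + cq*W_cq, with q >= 1."""
    if not similarities or len(similarities) != len(weights):
        raise ValueError("need one pre-assigned weight per similarity, q >= 1")
    return sum(c * w for c, w in zip(similarities, weights))


# c1: company name, c2: company profile, c3: contact information.
# 0.92 and 0.98 are made-up values chosen for illustration.
d = linear_fit([0.96, 0.92, 0.98], [0.55, 0.35, 0.1])
print(round(d, 3))
```

Because the pre-assigned weights in this example sum to 1, the fitted similarity stays in the same range as the individual similarities.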
For example, the basic information feature of the first account (the first feature) comprises the parameters company address (A1), company profile (A2) and company telephone (A3), and the basic information feature of the second account (the second feature) comprises the parameters company address (B1), company profile (B2) and company telephone (B3). When calculating the similarity between the basic information feature of the first account and that of the second account, the first computing module 2041 first calculates the similarity c1 between A1 and B1, the similarity c2 between A2 and B2, and the similarity c3 between A3 and B3; the fitting module 2042 then fits c1, c2 and c3 to obtain the similarity between the basic information features of the two accounts. In a concrete implementation, the cosine-angle method may be used to calculate the similarity between each parameter in the basic information feature of the first account and the corresponding parameter in the basic information feature of the second account; for the detailed process, see the calculation relating to Tables 1-4 in Embodiment 3. For the concrete fitting process, see the same calculation in Embodiment 3.
In the above preferred embodiment, the similarity between a pair of features is obtained by fitting the similarities of their individual parameters, which ensures that the similarity between the pair of features is calculated accurately.
Preferably, the step of judging, according to the similarity between each feature of the first account and the corresponding feature of the second account, whether the first account and the second account are duplicate accounts comprises: using the similarity between each feature of the first account and the corresponding feature of the second account as the input parameters of a predetermined recognition model, and calculating through the predetermined recognition model the similarity between the feature information of the first account and the feature information of the second account; and judging, according to the resulting similarity, whether the first account and the second account are duplicate accounts.
Preferably, the step of calculating the similarity between the feature information of the first account and the feature information of the second account through the predetermined recognition model comprises: training the predetermined recognition model with a predetermined number of training parameters, where each training parameter comprises the similarities between the features of two accounts as input parameters and a preset similarity between those two accounts as the output parameter; and using the similarities between each feature in the feature information of the first account and the corresponding feature in the feature information of the second account as input parameters, obtaining the similarity between the feature information of the first account and that of the second account through the trained recognition model. By training the recognition model, the application reduces the number of calculation loops, which speeds up the system and saves computing time when identifying duplicate accounts. For the concrete training process in this preferred embodiment, see the calculation relating to Tables 1-4 in Embodiment 3.
Preferably, the step of judging, according to the resulting similarity, whether the first account and the second account are duplicate accounts comprises: judging whether the similarity between the feature information of the first account and the feature information of the second account is greater than a predetermined threshold; and, if it is greater than the predetermined threshold, judging that the first account and the second account are duplicate accounts.
Preferably, the basic information features of the first account and the second account may be, but are not limited to being, obtained by the first acquisition module 2021 through the following method: obtain the basic information of the first account and of the second account; perform word segmentation and part-of-speech tagging on the basic information of the first account, and assign a weight to each resulting keyword according to its tagged part of speech, to obtain the basic information feature of the first account; perform word segmentation and part-of-speech tagging on the basic information of the second account, and assign a weight to each resulting keyword according to its tagged part of speech, to obtain the basic information feature of the second account.
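The weight-allocation step can be illustrated with a small Python sketch. It assumes that word segmentation and part-of-speech tagging have already been done by some external tool, so the input is a list of (keyword, tag) pairs. The function name `basic_info_feature`, the tag names and the tag-to-weight table are all hypothetical; the weight values are borrowed from the company-name example in Embodiment 3, where the text says the weights may be supplied in advance by the user.

```python
# Hypothetical tag -> weight table; values follow the Embodiment 3 example.
POS_WEIGHTS = {
    "region": 1.95,      # administrative region, e.g. "Hangzhou"
    "core_name": 3.1,    # core organization name
    "industry": 0.8,
    "generic": 0.4,
    "common": 0.2,
}


def basic_info_feature(tagged_keywords, pos_weights, default_weight=0.1):
    """Map each keyword to a weight chosen by its part-of-speech tag.

    A keyword that occurs several times accumulates weight, so both the
    tag and the frequency influence the result (an assumption of this sketch).
    """
    feature = {}
    for keyword, tag in tagged_keywords:
        weight = pos_weights.get(tag, default_weight)
        feature[keyword] = feature.get(keyword, 0.0) + weight
    return feature


# Already segmented and tagged company name of account A.
tagged_name = [("Hangzhou", "region"), ("Jia Hua", "core_name"),
               ("technology", "industry"), ("limited", "generic"),
               ("company", "common")]
print(basic_info_feature(tagged_name, POS_WEIGHTS))
```

The resulting keyword-to-weight mapping is the kind of input the cosine-angle calculation operates on.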
Preferably, the product information features of the products published by the first account and the second account may be, but are not limited to being, obtained by the processing unit 102 in Fig. 1 or the second acquisition module 2022 in Fig. 2 through the following method: obtain the product information of the first account and of the second account; perform word segmentation and part-of-speech tagging on the product information of the first account, compute percentage statistics over the resulting keywords according to their tagged parts of speech, and take the statistics as the product information feature of the products published by the first account; perform word segmentation and part-of-speech tagging on the product information of the second account, compute percentage statistics over the resulting keywords according to their tagged parts of speech, and take the statistics as the product information feature of the products published by the second account. In this preferred embodiment of the application, the above steps yield useful feature information, which makes the similarity judgment more accurate.
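The percentage statistics can be sketched with a counter over product keywords. This is only an illustration under stated assumptions: the function name `product_info_feature` is hypothetical, the keywords are assumed to have been extracted already by segmentation and tagging, and the sample listing counts are invented so that the shares match the 40%, 35% and 25% quoted for account A in Embodiment 3.

```python
from collections import Counter


def product_info_feature(product_keywords):
    """Return each product keyword's share of all published products."""
    counts = Counter(product_keywords)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {keyword: n / total for keyword, n in counts.items()}


# Invented listing: 8 mobile phones, 7 MP3 players, 5 digital cameras.
listing = ["mobile phone"] * 8 + ["MP3"] * 7 + ["digital camera"] * 5
shares = product_info_feature(listing)
print(shares)
```

The shares can then serve as the weights of a product information feature sequence.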
Preferably, the behavioral information features of the first account and the second account may be, but are not limited to being, obtained by the processing unit 102 in Fig. 1 or the third acquisition module 2023 in Fig. 2 through the following method: obtain the identification information (Cookie ID) used when the first account and the second account log in to the website, take the obtained Cookie ID of the first account as the behavioral information feature of the first account, and take the obtained Cookie ID of the second account as the behavioral information feature of the second account. In this preferred embodiment of the application, the above steps yield useful feature information, which makes the similarity judgment more accurate.
Preferably, after the first account and the second account are judged to be duplicate accounts, the duplicate account automatic identification method further comprises: sending indication information to the user, which may be, but is not limited to being, done by the communication unit 106 in Fig. 1 or the communication unit 208 in Fig. 2, where the indication information indicates that the first account and the second account are duplicate accounts. In this preferred embodiment of the application, this notification lets the user manage accounts flexibly and improves the user experience.
Embodiment 3
Based on the duplicate account automatic identification system shown in Fig. 1 and Fig. 2, the application also provides another duplicate account automatic identification method. As shown in Fig. 4, the method in this embodiment comprises:
S402-S406: obtain the account's basic information, user historical behavior information, product information and so on (this stage may be called the information collection and processing stage). Preferably, steps S402-S406 may be, but are not limited to being, performed by the processing unit 102 in Fig. 1 or the acquiring unit 202 in Fig. 2.
Preferably, the basic information of an account includes, but is not limited to, company name, profile, contact information and geographic location.
Preferably, the product information corresponding to an account is obtained by extracting the offer information published by that account.
Preferably, the user historical behavior information of an account is obtained from the Cookie ID used when the account logs in to the website.
S408-S414: extract the basic information feature of the account from its basic information, extract the behavioral information feature of the account from the user historical behavior information, and extract the product information feature of the products published by the account from the product information (this stage may be called the information featurization stage). Preferably, steps S408-S414 may be, but are not limited to being, performed by the processing unit 102 in Fig. 1 or the computing unit 204 in Fig. 2.
Preferably, after the basic information has been collected, it is processed by text processing, namely word segmentation followed by part-of-speech tagging, to form the required basic information feature.
Preferably, word segmentation and part-of-speech tagging are performed on the product information, and statistics are computed over the tagged information to obtain the product information feature.
Preferably, the obtained Cookie ID of an account is taken as the behavioral information feature of that account. In this way, the relations between accounts are analyzed through the users' historical behavior, which gives the behavioral information feature of the account.
S416: automatically identify whether accounts are duplicates by means of machine learning; according to the machine learning result, all duplicate accounts can be identified. Preferably, step S416 may be, but is not limited to being, performed by the processing unit 102 in Fig. 1 or by the computing unit 204 and the judging unit 206 in Fig. 2.
Preferably, the three kinds of features obtained by featurization describe an account from several dimensions; the next step is to calculate the similarity between corresponding features. The specific methods are as follows:
1) Calculate the similarities between the basic information features by the cosine-angle method, then fit these similarity values by machine learning to obtain the final similarity between the basic information features.
Specifically, featurizing the basic information yields a basic information feature sequence comprising feature ids and the weight corresponding to each id, where the weight is calculated from the frequency with which the id occurs and from its part of speech. The cosine-angle algorithm is then applied to the feature sequences to calculate a similarity for each basic information feature. Fitting the similarities of the individual basic information features gives the final similarity between the basic information features. For the concrete operations, see the example described below with Tables 1-4.
2) Compute the share of the products common to the two accounts among the products each account has published, and calculate the similarity of the product distributions over the common products; the product of the distribution similarity and the product share gives the similarity between the product information features.
Preferably, the similarity between the product information features may also be calculated with the cosine-angle algorithm. Specifically, first obtain the id of each kind of product; the share of that product in the total quantity serves as the weight of the id, where the share is obtained by statistics. The product ids and their weights form a product information feature sequence, and the cosine-angle algorithm is then used to calculate the similarity. For the concrete operations, see the example described below with Tables 1-4.
3) Using historical behavior information, contact information and similar data, determine whether multiple accounts are associated, and thereby obtain the similarity between the behavioral information features of those accounts.
After obtaining the above three similarities, the application uses an SVM (Support Vector Machine) recognition model to fit the features and obtain the similarity between two accounts. For example, a subset of accounts is first extracted and labeled pairwise; the three kinds of features described above are extracted for these accounts, the labels entered by the user are received, and an SVM recognition model for duplicate accounts is learned. At classification time, the three feature similarities of two accounts are input and the SVM recognition model outputs a similarity value that represents how likely the two accounts are to be duplicates; pairs above a certain threshold are identified as duplicates. Using a clustering method (class-head vector clustering), all accounts can then be grouped to obtain the final result, which can be used by each product line. Of course, the application is not limited to feature recognition with an SVM recognition model; other recognition models may also be used.
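The text does not spell out how pairwise results are turned into groups of accounts. As one possible stand-in, and not the clustering method of the application, the sketch below merges accounts transitively with a union-find structure: if two pairs judged to be duplicates share an account, all of their accounts end up in one group. The function name `group_duplicates` and the account ids are hypothetical.

```python
def group_duplicates(accounts, duplicate_pairs):
    """Group accounts so that every pair judged duplicate shares a group."""
    parent = {account: account for account in accounts}

    def find(x):
        # Follow parent links to the group representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for account in accounts:
        groups.setdefault(find(account), []).append(account)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical account ids and pairwise results.
accounts = ["a1", "a2", "a3", "a4", "a5"]
pairs = [("a1", "a2"), ("a2", "a3"), ("a4", "a5")]
print(group_duplicates(accounts, pairs))
```

Transitive merging is a strong assumption: it can join two accounts that were never directly judged to be duplicates, so a real system might prefer a clustering step that checks each member against a group representative.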
By identifying duplicate accounts registered by the same company or individual, the preferred embodiment of the application makes it easier for users and the platform to manage multiple accounts. After duplicate accounts are identified, the website platform can notify the user, state clearly which accounts are duplicates, remind the user to revise and manage them, and accept the user's feedback. Further, if the feedback instructs that the duplicate accounts be merged but the merge instruction is incorrect, the website platform can correct the merge instruction through a preset program, so that the merge requested by the user is carried out properly.
Based on the duplicate account automatic identification method and system described in the above embodiments, a concrete example of automatic duplicate account identification is described below.
Suppose to have 4 companies, specifying information is respectively shown in following table 1-4:
Table 1
Table 2
Table 3
Table 4
For the above four accounts, the basic information features, behavioral information features and product information features are obtained by the method described above; then, according to these three kinds of features, the pairwise similarity between accounts is calculated through the SVM recognition model. In this process, labels entered by the user can be received, for example the similarity relations of accounts A, B, C and D entered by the user, as follows: A B 1; A C 1; A D 0; B D 0; C D 1 (where 0 means not duplicate and 1 means duplicate). Before SVM training, the feature information of the four accounts A, B, C and D is extracted.
The featurization of the basic information is described below, taking account A as an example.
1) For the basic information feature of an account, word segmentation and part-of-speech tagging are first performed on the basic information of each account, and weights are assigned. Taking the company name as an example, segmenting the company name of account A, "Hangzhou Jia Hua Science and Technology Ltd.", gives: Hangzhou, Jia Hua, science and technology, limited, company. The part-of-speech tags are: Hangzhou (administrative region), Jia Hua (core organization name), science and technology (industry), limited (generic word), company (common word). Each word is then given a weight according to factors such as its part of speech (the weight information may be entered in advance by the user); suppose the result is: Hangzhou=1.95, Jia Hua=3.1, science and technology=0.8, limited=0.4, company=0.2. The other dimensions of the basic information, such as company profile and contact information, can be featurized in the same way. In addition, for the product information feature of the products published by this account, the same text processing extracts the products of A as mobile phone, MP3, digital camera and so on, whose shares are 40%, 35% and 25% respectively. From these statistics the product information feature is: mobile phone=0.4, MP3=0.35, digital camera=0.25. The behavioral information feature of this account comprises the userid of the account, its commonly used cookieid, and so on.
2) After featurization, the similarity of the corresponding features between two accounts is calculated. Taking accounts A and B (similarity relation A B 1) as an example, the following describes how the cosine-angle algorithm is used to calculate the similarity between the company names of A and B. Specifically, after featurization the company name feature of A is: Hangzhou=1.95, Jia Hua=3.1, science and technology=0.8, limited=0.4, company=0.2; the company name feature of B is: Hangzhou=1.95, Jia Hua=3.1, science and technology=0.8, limited=0.4, company=0.2, sales department=0.6.
Here, the company name is used as an example to describe the cosine-angle calculation. From the above, the features common to the company names of accounts A and B are: Hangzhou=1.95, Jia Hua=3.1, science and technology=0.8, limited=0.4, company=0.2. First, the score of the common features is calculated as the sum of the products of their weights, that is, d1 = 1.95*1.95 + 3.1*3.1 + 0.8*0.8 + 0.4*0.4 + 0.2*0.2. Then the scores of A and B are calculated separately as the sum of the squared weights of all their features: da = 1.95*1.95 + 3.1*3.1 + 0.8*0.8 + 0.4*0.4 + 0.2*0.2 and db = 1.95*1.95 + 3.1*3.1 + 0.8*0.8 + 0.4*0.4 + 0.2*0.2 + 0.6*0.6. The final score is df = d1 / (sqrt(da) * sqrt(db)), where sqrt(da) denotes the square root of da.
With the cosine-angle algorithm, the similarity between the company names of A and B is obtained as 0.96. In the same way, the similarities between the other basic information features of A and B, such as company profile and contact information, can be calculated. Finally, the similarities of the individual basic information features of accounts A and B are fitted with weight parameters to obtain the final similarity between their basic information features. In this embodiment linear fitting may be used. Specifically, suppose the weight of the company name similarity c1 is 0.55, the weight of the company profile similarity c2 is 0.35, and the weight of the contact information similarity c3 is 0.1; the similarity d of the basic information features is then d = c1*0.55 + c2*0.35 + c3*0.1, for example 0.948. Further, if the contact information is identical, the two accounts are more likely to be duplicates, and the similarity d can be processed further; for example, the final score of the basic information feature similarity becomes d = d*0.73 + 0.27.
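The extra step for identical contact information is a simple affine rescaling, sketched below. The function name `adjust_for_same_contact` is an assumption; the constants 0.73 and 0.27 and the input 0.948 come from the example in the text.

```python
def adjust_for_same_contact(d, same_contact, scale=0.73, offset=0.27):
    """Raise the basic-information similarity when contact details match."""
    if not same_contact:
        return d
    return d * scale + offset


adjusted = adjust_for_same_contact(0.948, same_contact=True)
print(round(adjusted, 5))
```

Because 0.73 + 0.27 = 1, the adjusted value never exceeds 1 for d in [0, 1], and it is never lower than the original d, so matching contact details can only raise the similarity.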
In the same way, the cosine-angle calculation and the fitting process described above can be used to calculate the similarity of the other corresponding features of accounts A and B, namely the similarity between the product information features and the similarity between the behavioral information features. Finally the similarities of the three features are obtained; for example, the three feature similarities of accounts A and B are 0.948, 0.87 and 0.95 respectively.
After the similarities between the features of all labeled pairs have been calculated, the SVM model is trained. For example, the learning content corresponding to the similarity relation A B 1 is (0.948, 0.87, 0.95, 1); that is, (0.948, 0.87, 0.95) is the input when training the SVM model and 1 is the desired output. The internal parameters of the SVM model are adjusted using these inputs and outputs, which accomplishes the training. In the same way, the SVM model can be trained further with the learning content of the similarity relations A C 1, A D 0, B D 0 and C D 1. The more training parameters are used, the more accurately the internal parameters of the SVM model can be adjusted.
Once the SVM model has been trained, pairs of accounts can be judged. For example, to judge whether accounts B and C are duplicates, the three kinds of feature information of B and C are first extracted by the method described above, and the similarities of their corresponding features are calculated, for example (0.927, 0.865, 0.94). These three values are passed to the SVM model, which returns a value, for example 0.97. This return value is compared with the configured threshold; if it is greater, accounts B and C are judged to be duplicate accounts.
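The train-then-judge flow can be sketched without any library. The code below is not an SVM implementation: it is a tiny linear classifier trained with subgradient steps on the hinge loss (the loss used by linear SVMs), chosen only so that the example is self-contained. All function names are assumptions. Of the training vectors, only (0.948, 0.87, 0.95) for the pair A-B comes from the text; the vectors attached to the other labeled pairs are invented for illustration. The returned score is a signed distance-like value compared against a threshold of 0, not a similarity in [0, 1] such as the 0.97 mentioned in the text.

```python
def train_linear_classifier(samples, labels, lr=0.1, epochs=500):
    """Learn weights w and bias b by subgradient steps on the hinge loss."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        updated = False
        for x, label in zip(samples, labels):
            y = 1.0 if label == 1 else -1.0        # labels 1/0 -> +1/-1
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1.0:                       # inside the margin: update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                updated = True
        if not updated:                            # every sample has margin >= 1
            break
    return w, b


def decision_score(model, x):
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def is_duplicate(model, x, threshold=0.0):
    return decision_score(model, x) > threshold


# (basic, product, behavior) similarities per labeled pair.
# Only the A-B vector is taken from the text; the rest are made up.
X = [(0.948, 0.87, 0.95),   # A B 1
     (0.93, 0.88, 0.90),    # A C 1 (hypothetical values)
     (0.91, 0.80, 0.97),    # C D 1 (hypothetical values)
     (0.20, 0.10, 0.05),    # A D 0 (hypothetical values)
     (0.30, 0.25, 0.10)]    # B D 0 (hypothetical values)
Y = [1, 1, 1, 0, 0]

model = train_linear_classifier(X, Y)
print(is_duplicate(model, (0.927, 0.865, 0.94)))   # the B-C pair from the text
```

In practice a library SVM would replace the hand-written training loop, and far more labeled pairs would be needed than the five shown here.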
The above is only an example; in a real project a large number of labeled account samples can be used for learning.
Of course, multiple accounts can also be identified by simple matching on member information, or manually, but the identification efficiency is very low and neither precision nor recall is high.
In view of the current technical challenges and the need to optimize resource allocation and improve the search experience, the application develops a model that automatically identifies duplicate accounts. Through automatic identification with high precision and high recall, multiple duplicate accounts registered by the same company or individual are identified, and the identification result can be applied to each product line.
Obviously, those skilled in the art should understand that each module or step of the application described above can be implemented with a general-purpose computing device. The modules or steps can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Alternatively, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they can each be made into a separate integrated circuit module, or several of them can be made into a single integrated circuit module. The application is therefore not restricted to any specific combination of hardware and software.
The above is only a preferred embodiment of the application and is not intended to limit the application; for those skilled in the art, the application may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the application shall fall within the protection scope of the application.
Claims (17)
1. A method for automatically identifying duplicate accounts, characterized by comprising:
obtaining feature information of a first account and a second account stored by a server of a website;
calculating the similarity between each feature parameter of a feature in the feature information of the first account and each feature parameter of the corresponding feature in the feature information of the second account;
fitting the similarities between the feature parameters according to pre-assigned weight parameters to obtain the similarity between each feature of the first account and the corresponding feature of the second account;
judging, according to the similarity between each feature of the first account and the corresponding feature of the second account, whether the first account and the second account are duplicate accounts.
2. The method according to claim 1, characterized in that the similarity between a first feature parameter of a feature in the feature information of the first account and a second feature parameter of the corresponding feature in the feature information of the second account is calculated through the following steps:
obtaining a first group of keywords $A_1, A_2, \ldots, A_M$ obtained by performing word segmentation on the first feature parameter, and obtaining a first group of weights $W_{A1}, W_{A2}, \ldots, W_{AM}$ obtained by performing part-of-speech tagging on the first group of keywords and assigning a weight to each keyword in the first group according to its part of speech;
obtaining a second group of keywords $B_1, B_2, \ldots, B_N$ obtained by performing word segmentation on the second feature parameter, and obtaining a second group of weights $W_{B1}, W_{B2}, \ldots, W_{BN}$ obtained by performing part-of-speech tagging on the second group of keywords and assigning a weight to each keyword in the second group according to its part of speech;
selecting the keywords $C_1, \ldots, C_H$, $H \ge 1$, that are identical between the first group of keywords and the second group of keywords, together with their corresponding weights $W_{C1}, \ldots, W_{CH}$;
calculating the similarity $df$ between the first feature parameter and the second feature parameter through a formula (the formula for $df$ is not reproduced in this text),
wherein
$$d1 = W_{C1} \times W_{C1} + \cdots + W_{CH} \times W_{CH};$$
$$da = W_{A1} \times W_{A1} + \cdots + W_{AM} \times W_{AM};$$
$$db = W_{B1} \times W_{B1} + \cdots + W_{BN} \times W_{BN}.$$
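The quantities d1, da and db of claim 2 can be sketched in code as follows. Because the formula for df itself is not reproduced in this text, the sketch assumes a cosine-style normalisation, df = d1 / (sqrt(da) × sqrt(db)); that choice, the function name `keyword_similarity`, the sample keywords and weights, and the return value of 0.0 when no keyword is shared are all assumptions made for illustration. The sketch also assumes that a shared keyword carries the same weight in both groups, so that the product of its two weights equals W_C × W_C.

```python
import math
from typing import Dict


def keyword_similarity(weights_a: Dict[str, float],
                       weights_b: Dict[str, float]) -> float:
    """Similarity of two weighted keyword groups (assumed cosine-style df)."""
    common = weights_a.keys() & weights_b.keys()
    if not common:
        # The claim requires H >= 1; returning 0.0 here is an assumption.
        return 0.0
    # d1: sum over shared keywords (equals W_C * W_C when weights agree).
    d1 = sum(weights_a[k] * weights_b[k] for k in common)
    # da, db: sums of squared weights over each full keyword group.
    da = sum(w * w for w in weights_a.values())
    db = sum(w * w for w in weights_b.values())
    return d1 / (math.sqrt(da) * math.sqrt(db))


# Hypothetical keyword weights for one feature parameter of two accounts.
group_a = {"shenzhen": 1.0, "electronics": 2.0, "company": 0.5}
group_b = {"shenzhen": 1.0, "electronics": 2.0, "factory": 0.5}
print(keyword_similarity(group_a, group_b))
```

With this normalisation, two identical keyword groups score 1 (up to floating-point rounding) and groups with no shared keyword score 0.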
3. The method according to claim 1, characterized in that the similarities between the feature parameters of a first feature of the first account and the feature parameters of the corresponding second feature of the second account are fitted through the following formula:
$$d = c1 \times W_{c1} + c2 \times W_{c2} + \cdots + cq \times W_{cq}, \quad q \ge 1$$
wherein $d$ is the similarity between the first feature of the first account and the corresponding second feature of the second account; $c1, c2, \ldots, cq$ are the similarities between the feature parameters of the first feature and the feature parameters of the second feature;
and $W_{c1}, W_{c2}, \ldots, W_{cq}$ are pre-assigned weights.
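The weighted fitting of claim 3 amounts to a weighted sum, which might be sketched as below. The function name `fit_feature_similarity` and the sample values are hypothetical; the claim does not state how the weights are chosen.

```python
from typing import Sequence


def fit_feature_similarity(param_sims: Sequence[float],
                           weights: Sequence[float]) -> float:
    """Compute d = c1*W_c1 + c2*W_c2 + ... + cq*W_cq with q >= 1."""
    if not param_sims or len(param_sims) != len(weights):
        raise ValueError("need q >= 1 similarities and exactly q weights")
    return sum(c * w for c, w in zip(param_sims, weights))


# Hypothetical example: two feature parameters of one feature,
# with assumed weights 0.6 and 0.4.
d = fit_feature_similarity([0.9, 0.8], [0.6, 0.4])
print(d)
```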
4. The method according to claim 1, characterized in that the step of judging, according to the similarity between each feature of the first account and the corresponding feature of the second account, whether the first account and the second account are duplicate accounts comprises:
taking the similarity between each feature of the first account and the corresponding feature of the second account as input parameters of a predetermined recognition model, and calculating the similarity between the feature information of the first account and the feature information of the second account through the predetermined recognition model;
judging, according to the obtained similarity, whether the first account and the second account are duplicate accounts.
5. The method according to claim 4, characterized in that the step of calculating the similarity between the feature information of the first account and the feature information of the second account through the predetermined recognition model comprises:
training the predetermined recognition model with a predetermined number of training parameters, wherein each training parameter comprises: the similarities between the features of two accounts, as input parameters, and a preset similarity between the two accounts, as an output parameter;
taking the similarity between each feature in the feature information of the first account and the corresponding feature in the feature information of the second account as input parameters, and obtaining the similarity between the feature information of the first account and the feature information of the second account through the trained predetermined recognition model.
6. The method according to claim 4, characterized in that the step of judging, according to the obtained similarity, whether the first account and the second account are duplicate accounts comprises:
judging whether the similarity between the feature information of the first account and the feature information of the second account is greater than a predetermined threshold;
if the similarity between the feature information of the first account and the feature information of the second account is greater than the predetermined threshold, judging that the first account and the second account are duplicate accounts.
7. The method according to any one of claims 1 to 6, characterized in that the feature information comprises one of the following features or a combination thereof: a basic information feature of the account, a product information feature of the products published by the account, and a behavior information feature of the account.
8. The method according to claim 7, characterized in that the basic information features of the first account and the second account are obtained through the following method:
obtaining the basic information of the first account and the second account;
performing word segmentation and part-of-speech tagging on the basic information of the first account, and assigning a weight, according to the tagged part of speech, to each keyword obtained by segmenting the basic information of the first account, to obtain the basic information feature of the first account;
performing word segmentation and part-of-speech tagging on the basic information of the second account, and assigning a weight, according to the tagged part of speech, to each keyword obtained by segmenting the basic information of the second account, to obtain the basic information feature of the second account.
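The weight-assignment step of claim 8 might look like the sketch below. It assumes the basic information has already been segmented and tagged, so the input is a list of (keyword, part-of-speech) pairs; the tag names, the weight table `POS_WEIGHTS`, and the fallback weight are hypothetical, since the claim gives no concrete values.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical part-of-speech weight table; the claim does not give values.
POS_WEIGHTS: Dict[str, float] = {"place": 1.0, "brand": 2.0, "generic": 0.5}
DEFAULT_WEIGHT = 1.0  # assumed fallback for tags missing from the table


def basic_info_feature(tagged: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Map each tagged keyword to the weight assigned to its part of speech."""
    return {keyword: POS_WEIGHTS.get(pos, DEFAULT_WEIGHT)
            for keyword, pos in tagged}


# Hypothetical output of segmenting and tagging one account's company name.
tagged_name = [("shenzhen", "place"), ("acme", "brand"), ("company", "generic")]
print(basic_info_feature(tagged_name))
```

The resulting keyword-to-weight mapping is the kind of input that a keyword-level similarity calculation between two accounts would consume.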
9. The method according to claim 7, characterized in that the product information features of the products published by the first account and the second account are obtained through the following method:
obtaining the product information of the first account and the second account;
performing word segmentation and part-of-speech tagging on the product information of the first account, performing percentage statistics, according to the tagged part of speech, on each keyword obtained by segmenting the product information of the first account, and taking the statistical result as the product information feature of the products published by the first account;
performing word segmentation and part-of-speech tagging on the product information of the second account, performing percentage statistics, according to the tagged part of speech, on each keyword obtained by segmenting the product information of the second account, and taking the statistical result as the product information feature of the products published by the second account.
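One possible reading of the percentage statistics in claim 9 is sketched below: keywords whose part of speech is in a retained set are counted, and each keyword's share of the retained total becomes the feature. This interpretation, the tag names, and the function name `product_info_feature` are assumptions; the claim does not spell out how the percentages are formed.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def product_info_feature(tagged: Iterable[Tuple[str, str]],
                         keep_pos: Set[str]) -> Dict[str, float]:
    """Share of each retained keyword among all retained keywords."""
    counts = Counter(word for word, pos in tagged if pos in keep_pos)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


# Hypothetical tagged keywords from the product titles of one account.
tagged_products = [("phone", "noun"), ("case", "noun"),
                   ("phone", "noun"), ("cheap", "adjective")]
print(product_info_feature(tagged_products, keep_pos={"noun"}))
```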
10. The method according to claim 7, characterized in that the behavior information features of the first account and the second account are obtained through the following method:
obtaining the identification information (Cookie ID) used when the first account and the second account log in to the website;
taking the obtained Cookie ID of the first account as the behavior information feature of the first account, and taking the obtained Cookie ID of the second account as the behavior information feature of the second account.
11. A system for automatically identifying duplicate accounts, characterized by comprising:
an acquiring unit, configured to obtain feature information of a first account and a second account stored by a server of a website, wherein the feature information comprises one of the following features or a combination thereof: a basic information feature of the account, a product information feature of the products published by the account, and a behavior information feature of the account;
a computing unit, configured to calculate the similarity between each feature parameter of a feature in the feature information of the first account and each feature parameter of the corresponding feature in the feature information of the second account, and to fit the similarities between the feature parameters according to pre-assigned weight parameters to obtain the similarity between each feature of the first account and the corresponding feature of the second account;
a judging unit, configured to judge, according to the similarity between each feature of the first account and the corresponding feature of the second account, whether the first account and the second account are duplicate accounts.
12. The system according to claim 11, characterized in that the computing unit comprises:
a first acquisition module, configured to obtain a first group of keywords $A_1, A_2, \ldots, A_M$ obtained by performing word segmentation on a first feature parameter, and to obtain a first group of weights $W_{A1}, W_{A2}, \ldots, W_{AM}$ obtained by performing part-of-speech tagging on the first group of keywords and assigning a weight to each keyword in the first group according to its part of speech, wherein the first feature parameter is a feature parameter of a feature in the feature information of the first account;
a second acquisition module, configured to obtain a second group of keywords $B_1, B_2, \ldots, B_N$ obtained by performing word segmentation on a second feature parameter, and to obtain a second group of weights $W_{B1}, W_{B2}, \ldots, W_{BN}$ obtained by performing part-of-speech tagging on the second group of keywords and assigning a weight to each keyword in the second group according to its part of speech, wherein the second feature parameter is a feature parameter of a feature in the feature information of the second account;
a selection module, configured to select the keywords $C_1, \ldots, C_H$, $H \ge 1$, that are identical between the first group of keywords and the second group of keywords, together with their corresponding weights $W_{C1}, \ldots, W_{CH}$;
a first computing module, configured to calculate the similarity $df$ between the first feature parameter and the second feature parameter through a formula (the formula for $df$ is not reproduced in this text),
wherein
$$d1 = W_{C1} \times W_{C1} + \cdots + W_{CH} \times W_{CH};$$
$$da = W_{A1} \times W_{A1} + \cdots + W_{AM} \times W_{AM};$$
$$db = W_{B1} \times W_{B1} + \cdots + W_{BN} \times W_{BN}.$$
13. The system according to claim 11, characterized in that the computing unit further comprises: a second computing module, configured to fit the similarities between the feature parameters of a first feature of the first account and the feature parameters of the corresponding second feature of the second account through the following formula:
$$d = c1 \times W_{c1} + c2 \times W_{c2} + \cdots + cq \times W_{cq}, \quad q \ge 1$$
wherein $d$ is the similarity between the first feature of the first account and the corresponding second feature of the second account;
$c1, c2, \ldots, cq$ are the similarities between the feature parameters of the first feature and the feature parameters of the second feature;
and $W_{c1}, W_{c2}, \ldots, W_{cq}$ are pre-assigned weights.
14. The system according to claim 11, characterized in that the judging unit comprises:
a third computing module, configured to take the similarity between each feature of the first account and the corresponding feature of the second account as input parameters of a predetermined recognition model, and to calculate the similarity between the feature information of the first account and the feature information of the second account through the predetermined recognition model;
a judging module, configured to judge, according to the obtained similarity, whether the first account and the second account are duplicate accounts.
15. The system according to claim 14, characterized in that the third computing module comprises:
a training submodule, configured to train the predetermined recognition model with a predetermined number of training parameters, wherein each training parameter comprises: the similarities between the features of two accounts, as input parameters, and a preset similarity between the two accounts, as an output parameter;
a computing submodule, configured to take the similarity between each feature in the feature information of the first account and the corresponding feature in the feature information of the second account as input parameters, and to obtain the similarity between the feature information of the first account and the feature information of the second account through the trained predetermined recognition model.
16. The system according to claim 14, characterized in that the judging module comprises:
a judging submodule, configured to judge whether the similarity between the feature information of the first account and the feature information of the second account is greater than a predetermined threshold, and, when the similarity between the feature information of the first account and the feature information of the second account is greater than the predetermined threshold, to judge that the first account and the second account are duplicate accounts.
17. The system according to any one of claims 11 to 16, characterized in that the acquiring unit comprises at least one of the following:
a first acquisition module, configured to obtain the basic information of the first account and the second account; to perform word segmentation and part-of-speech tagging on the basic information of the first account, and to assign a weight, according to the tagged part of speech, to each keyword obtained by segmenting the basic information of the first account, to obtain the basic information feature of the first account; and to perform word segmentation and part-of-speech tagging on the basic information of the second account, and to assign a weight, according to the tagged part of speech, to each keyword obtained by segmenting the basic information of the second account, to obtain the basic information feature of the second account;
a second acquisition module, configured to obtain the product information of the first account and the second account; to perform word segmentation and part-of-speech tagging on the product information of the first account, to perform percentage statistics, according to the tagged part of speech, on each keyword obtained by segmenting the product information of the first account, and to take the statistical result as the product information feature of the products published by the first account; and to perform word segmentation and part-of-speech tagging on the product information of the second account, to perform percentage statistics, according to the tagged part of speech, on each keyword obtained by segmenting the product information of the second account, and to take the statistical result as the product information feature of the products published by the second account; or
a third acquisition module, configured to obtain the identification information (Cookie ID) used when the first account and the second account log in to the website, to take the obtained Cookie ID of the first account as the behavior information feature of the first account, and to take the obtained Cookie ID of the second account as the behavior information feature of the second account.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110113252.1A CN102768659B (en) | 2011-05-03 | 2011-05-03 | Method and system for identifying repeated account |
HK12113367.4A HK1172706A1 (en) | 2011-05-03 | 2012-12-25 | Method and system for automatically identifying repeated account |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110113252.1A CN102768659B (en) | 2011-05-03 | 2011-05-03 | Method and system for identifying repeated account |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102768659A (en) | 2012-11-07 |
CN102768659B (en) | 2015-06-24 |
Family ID: 47096063
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110113252.1A (Active; published as CN102768659B) | Method and system for identifying repeated account | 2011-05-03 | 2011-05-03 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN102768659B (en) |
HK (1) | HK1172706A1 (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101316262A (en) * | 2007-05-31 | 2008-12-03 | 中兴通讯股份有限公司 | Method for controlling repeated registration of the same account terminal |
US7725421B1 (en) * | 2006-07-26 | 2010-05-25 | Google Inc. | Duplicate account identification and scoring |
CN101727487A (en) * | 2009-12-04 | 2010-06-09 | 中国人民解放军信息工程大学 | Network criticism oriented viewpoint subject identifying method and system |
KR101022373B1 (en) * | 2004-01-29 | 2011-03-22 | 주식회사 케이티 | Log-in system allowing duplicated user account and method for registering of user account and method for authentication of user |
Also Published As
Publication number | Publication date |
---|---|
CN102768659B (en) | 2015-06-24 |
HK1172706A1 (en) | 2013-04-26 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
REG | Reference to a national code | Ref country code: HK; Ref legal event code: GR; Ref document number: 1172706; Country of ref document: HK |