CN102768659A - Method and system for identifying repeated account - Google Patents


Info

Publication number
CN102768659A
CN102768659A CN2011101132521A CN201110113252A CN102768659A CN 102768659 A CN102768659 A CN 102768659A CN 2011101132521 A CN2011101132521 A CN 2011101132521A CN 201110113252 A CN201110113252 A CN 201110113252A CN 102768659 A CN102768659 A CN 102768659A
Authority
CN
China
Prior art keywords
account
characteristic
information
similarity
speech
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011101132521A
Other languages
Chinese (zh)
Other versions
CN102768659B (en)
Inventor
冯景华
陈超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd
Priority to CN201110113252.1A (CN102768659B)
Publication of CN102768659A
Priority to HK12113367.4A (HK1172706A1)
Application granted
Publication of CN102768659B
Current legal status: Active

Abstract

The invention provides a method and a system for identifying repeated (duplicate) accounts. The method includes: acquiring feature information of a first account and a second account saved by a website server; calculating the similarity between the features in the feature information of the first account and the corresponding features in the feature information of the second account; using the obtained similarities as input parameters of a preset identification model, calculating the similarity between the feature information of the first account and that of the second account according to the model; and judging from the resulting similarity whether the first account and the second account are duplicates. The method and the system address the inability of the prior art to identify duplicate accounts correctly, so that duplicate accounts are identified accurately and operating speed is improved.

Description

Method and system for automatically identifying duplicate accounts
Technical field
The application relates to the field of internet information, and in particular to a method and a system for automatically identifying duplicate accounts.
Background
In current internet use, duplicate information is one of the problems that most degrades the user's search experience and most increases the search load on search engine servers. Taking an e-commerce website as an example, duplicate accounts make buyers repeat work when contacting sellers and can keep the information of some good sellers from being displayed. In addition, the large number of duplicate accounts increases the load on the search engine when users query information and slows the search down.
In the prior art, duplicate accounts are generally identified by the following steps:
S1: the server obtains an account to be identified;
S2: the server compares the name of the account to be identified, one by one, with the names of a predetermined number of accounts in a database, as follows:
Using preset word-segmentation dictionaries for different parts of speech, the name of the account to be identified and the account names in the database are segmented into words, and the part of speech of each word is determined;
After segmentation and part-of-speech determination, the merchant name corresponding to the account to be identified and the physical merchant names in the database are each filled into a predetermined template;
A score for the account-name comparison is obtained by checking whether the words occupying the same part-of-speech slot of the template are identical in the merchant name of the account to be identified and in the physical merchant name in the database;
S3: the server compares the score with a predetermined standard score to judge whether the account to be identified duplicates the compared account in the database;
S4: the server adds any account to be identified that is judged not to be a duplicate to the database.
The above method identifies duplicate accounts by judging whether the account names are identical. However, those skilled in the art will understand that in e-commerce a seller's account generally comprises several kinds of feature information, for example the account name, the name of the company corresponding to the account, the company introduction, contact details, and visiting behavior. Whether the account names are identical cannot by itself determine accurately whether the accounts are duplicates. For example, account A is named Apple and its company mainly sells apples such as Red Fuji, while account B is also named Apple but its company mainly sells electronic products such as the iPhone and iPad. The feature information of account A and account B is clearly different, yet if only the account names are compared, A and B will be regarded as duplicate accounts, which is an identification error. Because duplicate accounts are identified inaccurately, a large number of them remain and the search load on the search engine server is not relieved. A scheme is therefore urgently needed that improves the accuracy of account identification, thereby reducing the search load on the search engine server and speeding up search.
Summary of the invention
The application aims to provide a method and a system for automatically identifying duplicate accounts, to solve the problem that the prior art cannot correctly identify duplicate accounts and thereby increases the search load on search engine servers.
According to one aspect of the application, a method for automatically identifying duplicate accounts is provided, comprising: obtaining feature information of a first account and a second account stored by a server of a website; calculating the similarity between each feature parameter of a feature in the feature information of the first account and the corresponding feature parameter of the corresponding feature in the feature information of the second account; fitting the similarities between the feature parameters according to pre-assigned weight parameters to obtain the similarity between each feature of the first account and the corresponding feature of the second account; and judging, according to the similarity between each feature of the first account and the corresponding feature of the second account, whether the first account and the second account are duplicate accounts.
According to another aspect of the application, a system for automatically identifying duplicate accounts is provided, comprising: an acquiring unit, configured to obtain feature information of a first account and a second account stored by a server of a website, wherein the feature information comprises one or a combination of the following features: the basic information feature of the account, the product information feature of the products published by the account, and the behavior information feature of the account; a computing unit, configured to calculate the similarity between each feature parameter of a feature in the feature information of the first account and the corresponding feature parameter of the corresponding feature in the feature information of the second account, and to fit the similarities between the feature parameters according to pre-assigned weight parameters to obtain the similarity between each feature of the first account and the corresponding feature of the second account; and a judging unit, configured to judge, according to the similarity between each feature of the first account and the corresponding feature of the second account, whether the first account and the second account are duplicate accounts.
The application has the following beneficial effects:
1) The application judges whether two accounts are duplicates by fitting the similarities of multiple features of the two accounts. This effectively avoids presenting erroneous duplicate information to users because of inaccurate judgment, so that duplicate accounts are identified accurately, the processing load on the search engine server when handling user queries is reduced, and search speed is improved;
2) The feature information in the application comprises multiple features, for example the basic information feature of an account, the product information feature of the products published by the account, and the behavior information feature of the account. Calculating similarity over these features from several dimensions avoids relying on a single dimension and improves the accuracy of duplicate-account identification;
3) By training the identification model, the application reduces the number of calculation iterations, so that the system runs faster and computing time is saved when duplicate accounts are identified.
Brief description of the drawings
The accompanying drawings described here provide a further understanding of the application and form part of it. The illustrative embodiments of the application and their description are used to explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a schematic diagram of one preferred structure of the duplicate-account automatic identification system according to an embodiment of the application;
Fig. 2 is a schematic diagram of another preferred structure of the duplicate-account automatic identification system according to an embodiment of the application;
Fig. 3 is one preferred flowchart of the duplicate-account automatic identification method according to an embodiment of the application;
Fig. 4 is another preferred flowchart of the duplicate-account automatic identification method according to an embodiment of the application.
Detailed description of the embodiments
The application is described in detail below with reference to the drawings and in combination with the embodiments. It should be noted that, provided there is no conflict, the embodiments of the application and the features in the embodiments may be combined with one another.
Before further details of the embodiments are described, a suitable computing architecture that can be used to implement the principles of the application is described with reference to Fig. 1. In the following description, unless otherwise indicated, the embodiments are described with reference to symbolic representations of acts and operations performed by one or more computers. Such acts and operations, which are at times referred to as computer-executed, include the manipulation by the processing unit of the computer of electrical signals representing data in a structured form. This manipulation transforms the data or maintains it at locations in the memory system of the computer, which reconfigures or otherwise alters the operation of the computer in a manner well understood by those skilled in the art. The data structures in which data is maintained are physical locations of the memory that have particular properties defined by the format of the data. However, although the application is described in this context, this is not meant to be limiting; those skilled in the art will appreciate that aspects of the acts and operations described below may also be implemented in hardware.
Turning to the drawings, in which like reference numerals refer to like elements, the principles of the application are illustrated as being implemented in a suitable computing environment. The following description is based on the described embodiments of the application and should not be taken as limiting the application with regard to alternative embodiments that are not explicitly described here.
Fig. 1 shows a schematic diagram of an example computer architecture usable for these devices. For descriptive purposes, the architecture depicted is only one example of a suitable environment and is not intended to suggest any limitation as to the scope of use or functionality of the application. Nor should the computing system be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in Fig. 1.
The principles of the application are operational with other general-purpose or special-purpose computing or communication environments or configurations. Examples of well-known computing systems, environments, and configurations suitable for the application include, but are not limited to, personal computers, servers, multiprocessor systems, microprocessor-based systems, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
In its most basic configuration, the duplicate-account automatic identification system 100 in Fig. 1 typically includes at least one processing unit 102 and memory 104. The processing unit 102 may be, but is not limited to, a microprocessor (MCU) or a programmable logic device (FPGA); the memory 104 may be volatile (such as RAM), non-volatile (such as ROM or flash memory), or some combination of the two. In this specification and the claims, a "duplicate-account automatic identification system" is defined as any hardware component or combination of hardware components capable of executing software, firmware, or microcode to perform a function. The system 100 may even be distributed to implement distributed functions.
As used in the application, the terms "module", "component", or "unit" may refer to software objects or routines that execute on the system 100. The different components, modules, units, engines, and services described here may be implemented as objects or processes that execute on the system 100 (for example, as separate threads). Although the systems and methods described here are preferably implemented in software, implementations in hardware or in a combination of software and hardware are also possible and contemplated.
As used in the application, "word segmentation" and "part-of-speech tagging" are common natural language processing methods. Word segmentation divides a sequence of Chinese text into meaningful words. Part-of-speech tagging assigns each word obtained by segmentation a suitable part of speech, such as verb or noun. In e-commerce, commonly used parts of speech include product words, model words, and brand words. In the application, word segmentation and part-of-speech tagging are performed by the system. The application is of course not limited to this; these operations may also be performed manually, or manually in combination with the system.
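The segmentation and tagging steps described above can be illustrated with a minimal sketch. It assumes a tiny hand-made lexicon and the classic forward-maximum-matching strategy; the lexicon entries, the part-of-speech labels, and the weights in `POS_WEIGHTS` are hypothetical and are not taken from the application, which does not say which segmenter it uses.

```python
# Hypothetical lexicon mapping a word to its part of speech (illustrative only).
LEXICON = {
    "红富士": "brand",    # "Red Fuji"
    "苹果": "product",    # "apple"
    "批发": "verb",       # "wholesale"
}
# Hypothetical weights per part of speech (illustrative only).
POS_WEIGHTS = {"product": 3.0, "brand": 2.0, "verb": 1.0, "unknown": 0.5}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def segment_and_tag(text):
    """Forward maximum matching: at each position take the longest lexicon word."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            word = text[i:i + size]
            if word in LEXICON:
                tokens.append((word, LEXICON[word]))
                i += size
                break
        else:
            # No lexicon word starts here: keep the single character as unknown.
            tokens.append((text[i], "unknown"))
            i += 1
    return tokens


def keyword_weights(text):
    """Map each keyword of the text to the weight of its part of speech."""
    return {word: POS_WEIGHTS[pos] for word, pos in segment_and_tag(text)}
```

A production system would more plausibly rely on an established segmenter and tagger; the sketch only shows the shape of the data (keyword, part of speech, weight) that the later similarity calculation consumes.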
The duplicate-account automatic identification system 100 may also include a communication unit 106 that allows the host to communicate with other systems and devices over a network 108. The communication unit 106 may be a wired transmission device, such as a wired network communication interface and chip, or a wireless transmission device, such as an RF, infrared, or Bluetooth device.
Embodiment 1
Fig. 2 is a schematic diagram of another preferred structure of the duplicate-account automatic identification system according to an embodiment of the application. Preferably, each component shown in Fig. 2 may be, but is not limited to being, implemented by the processing unit 102 shown in Fig. 1. As shown in Fig. 2, the system comprises: an acquiring unit 202, configured to obtain feature information of a first account and a second account stored by a server of a website, wherein the feature information comprises one or a combination of the following features: the basic information feature of the account, the product information feature of the products published by the account, and the behavior information feature of the account; a computing unit 204, configured to calculate the similarity between each feature parameter of a feature in the feature information of the first account and the corresponding feature parameter of the corresponding feature in the feature information of the second account, and to fit the similarities between the feature parameters according to pre-assigned weight parameters to obtain the similarity between each feature of the first account and the corresponding feature of the second account; and a judging unit 206, configured to judge, according to the similarity between each feature of the first account and the corresponding feature of the second account, whether the first account and the second account are duplicate accounts.
In a preferred embodiment of the application, whether two accounts are duplicates is judged by fitting the similarities of multiple features of the two accounts. This effectively avoids presenting erroneous duplicate information to users because of inaccurate judgment, so that duplicate accounts are identified accurately; it further improves the user's experience of web search, e-commerce, and similar services, reduces the processing load on the search engine server when handling queries, and increases query speed. In addition, the feature information in the application comprises multiple features, for example the basic information feature of an account, the product information feature of the products published by the account, and the behavior information feature of the account. Calculating similarity over these features from several dimensions avoids relying on a single dimension and improves the accuracy of duplicate-account identification.
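As a rough illustration of the three kinds of feature named above, the feature information of one account might be held in a record like the following. The class name `AccountFeatures` and its field layout are assumptions made for this sketch; the application does not prescribe a data structure.

```python
from dataclasses import dataclass, field


@dataclass
class AccountFeatures:
    """Hypothetical container for the feature information of one account."""
    # Basic information: parameter name (e.g. "address") -> {keyword: weight}.
    basic_info: dict = field(default_factory=dict)
    # Product information: keyword -> share of that keyword in the listings.
    product_info: dict = field(default_factory=dict)
    # Behavior information: Cookie IDs seen when the account logged in.
    cookie_ids: set = field(default_factory=set)


first = AccountFeatures(
    basic_info={"address": {"Hangzhou": 2.0, "Road": 1.0}},
    product_info={"apple": 0.75, "pear": 0.25},
    cookie_ids={"cookie-123"},
)
second = AccountFeatures()  # every field defaults to an empty container
```

Keeping the three features in separate fields mirrors the text: each feature is compared with its counterpart in the other account before the per-feature similarities are combined.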
Preferably, the computing unit 204 comprises a first acquisition module 2041, a second acquisition module 2042, a selection module 2043, and a first computing module 2044, connected in sequence. In a preferred embodiment of the application, these modules use the cosine (included-angle cosine) method to calculate the similarity between feature parameters, as described below:
When calculating the similarity between a feature parameter of a feature in the feature information of the first account and the corresponding feature parameter in the feature information of the second account, the first acquisition module 2041 obtains a first group of keywords $A_1, A_2, \ldots, A_M$, produced by segmenting a first feature parameter into words, and a first group of weights $W_{A1}, W_{A2}, \ldots, W_{AM}$, produced by tagging the first group of keywords with parts of speech and assigning each keyword a weight according to its part of speech, where the first feature parameter is a feature parameter of a feature in the feature information of the first account. The second acquisition module 2042 obtains a second group of keywords $B_1, B_2, \ldots, B_N$, produced by segmenting a second feature parameter into words, and a second group of weights $W_{B1}, W_{B2}, \ldots, W_{BN}$, produced by tagging the second group of keywords with parts of speech and assigning each keyword a weight according to its part of speech, where the second feature parameter is a feature parameter of a feature in the feature information of the second account.
After these parameters are obtained, the selection module 2043 selects the keywords $C_1, \ldots, C_H$ ($H \ge 1$) that are common to the first and second groups of keywords, together with their weights $W_{C1}, \ldots, W_{CH}$. The first computing module 2044 then calculates the similarity df between the first feature parameter and the second feature parameter by the following formula:
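$$df = \frac{d1}{\sqrt{da \times db}}$$
where
$d1 = W_{C1} \times W_{C1} + \cdots + W_{CH} \times W_{CH}$
$da = W_{A1} \times W_{A1} + \cdots + W_{AM} \times W_{AM}$
$db = W_{B1} \times W_{B1} + \cdots + W_{BN} \times W_{BN}$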
The cosine method above uses different weights to calculate the similarity between feature parameters instead of a single unweighted comparison, and so obtains the similarity between two feature parameters accurately. The cosine method is of course only an example; the application is not limited to it, and similarity may also be calculated by other similar methods.
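A minimal sketch of the weighted cosine calculation follows. It assumes the standard cosine normalization, that is, the shared-keyword sum d1 divided by the square root of da × db, and it assumes that each side is given as a mapping from keyword to part-of-speech weight. The function name and the sample keywords and weights are hypothetical. The text requires at least one common keyword (H ≥ 1); returning 0.0 when there is none is an assumption of this sketch.

```python
import math


def keyword_similarity(weights_a, weights_b):
    """Cosine-style similarity df between two weighted keyword groups.

    weights_a and weights_b map keyword -> part-of-speech weight.
    """
    common = set(weights_a) & set(weights_b)
    # Sum over shared keywords; when both sides assign a keyword the same
    # weight W_C, each term equals W_C * W_C as in the text.
    d1 = sum(weights_a[k] * weights_b[k] for k in common)
    da = sum(w * w for w in weights_a.values())
    db = sum(w * w for w in weights_b.values())
    if not common or da == 0 or db == 0:
        return 0.0  # assumption: no overlap means no similarity
    return d1 / math.sqrt(da * db)


# Hypothetical keyword weights for two company introductions.
intro_a = {"apple": 3.0, "wholesale": 1.0}
intro_b = {"apple": 3.0, "phone": 2.0}
df = keyword_similarity(intro_a, intro_b)  # 9 / sqrt(10 * 13)
```

Because the weights depend on part of speech, a shared product word contributes more to df than a shared low-weight word, which is the point the text makes about weighted rather than uniform comparison.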
As shown in Fig. 2, the computing unit 204 further comprises a second computing module 2045. When fitting the similarities between the feature parameters of a first feature of the first account and the corresponding feature parameters of the corresponding second feature of the second account, the second computing module 2045 may use linear fitting, that is, it may fit by the following formula:
$$d = c1 \times W_{c1} + c2 \times W_{c2} + \cdots + cq \times W_{cq}, \quad q \ge 1$$
where d is the similarity between the first feature of the first account and the corresponding second feature of the second account;
c1, c2, …, cq are the similarities between the feature parameters of the first feature and the corresponding feature parameters of the second feature;
$W_{c1}, W_{c2}, \ldots, W_{cq}$ are the pre-assigned weights.
Linear fitting is of course only one way; the application is not limited to it.
For example, the basic information feature of the first account comprises the feature parameters company address (A1), company introduction (A2), and company telephone (A3), and the basic information feature of the second account comprises the feature parameters company address (B1), company introduction (B2), and company telephone (B3). To calculate the similarity between the two basic information features, the first computing module 2044 first calculates the similarity C1 between A1 and B1, the similarity C2 between A2 and B2, and the similarity C3 between A3 and B3; the second computing module 2045 then obtains the similarity between the basic information feature of the first account and that of the second account by linearly fitting C1, C2, and C3. In a concrete implementation, the cosine method may be used to calculate the similarity between each parameter in the basic information feature of the first account and the corresponding parameter in that of the second account; for the detailed process, see the calculation relating to Tables 1 to 4 in Embodiment 3. For the fitting process as well, see the calculation relating to Tables 1 to 4 in Embodiment 3.
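The address, introduction, and telephone example can be worked through with a short sketch of the linear fit, a weighted sum of the per-parameter similarities. The similarity values and the weights below are invented for illustration; the application only says that the weights are pre-assigned.

```python
def fit_feature_similarity(param_sims, param_weights):
    """Linear fit d = c1*W_c1 + ... + cq*W_cq over named feature parameters."""
    if set(param_sims) != set(param_weights):
        raise ValueError("every feature parameter needs exactly one weight")
    return sum(param_sims[name] * param_weights[name] for name in param_sims)


# Hypothetical per-parameter similarities C1, C2, C3 for two accounts.
sims = {"address": 0.9, "introduction": 0.4, "phone": 1.0}
# Hypothetical pre-assigned weights (chosen here to sum to 1).
weights = {"address": 0.3, "introduction": 0.5, "phone": 0.2}

d = fit_feature_similarity(sims, weights)  # 0.27 + 0.20 + 0.20
```

Choosing weights that sum to 1 keeps d in the same 0 to 1 range as the inputs, which makes a later threshold easier to interpret; the text itself does not require this.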
In the preferred embodiment above, the similarity between a pair of features is obtained by fitting the similarities of their feature parameters, which ensures that the similarity between the pair of features is calculated accurately.
Further, the judging unit 206 comprises a third computing module 2061 and a judging module 2062, connected in sequence. When judging, according to the similarity between each feature of the first account and the corresponding feature of the second account, whether the first account and the second account are duplicate accounts, the third computing module 2061 takes the similarities between each feature of the first account and the corresponding feature of the second account as input parameters of a predetermined identification model and calculates, through the model, the similarity between the feature information of the first account and that of the second account; the judging module 2062 then judges from the resulting similarity whether the first account and the second account are duplicate accounts.
Preferably, the third computing module 2061 comprises a training submodule and a calculating submodule, connected in sequence. To calculate the similarity between the feature information of the first account and that of the second account through the predetermined identification model, the training submodule first trains the model with a predetermined number of training parameters, each of which comprises, as input parameters, the similarities between the features of two accounts and, as the output parameter, a preset similarity between those two accounts. The calculating submodule then takes the similarities between each feature in the feature information of the first account and the corresponding feature in that of the second account as input parameters and obtains, through the trained model, the similarity between the feature information of the first account and that of the second account. By training the identification model, the application reduces the number of calculation iterations, so that the system runs faster and computing time is saved when duplicate accounts are identified. For the specific training process in this preferred embodiment, see the calculation relating to Tables 1 to 4 in Embodiment 3.
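The training and calculating submodules can be sketched as follows. This passage does not name the identification model, so a small logistic regression trained by stochastic gradient descent is swapped in purely as an assumption; the training pairs, learning rate, and epoch count are likewise hypothetical.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, epochs=500, lr=0.5):
    """Fit weights and a bias from (feature_similarities, preset_similarity) pairs."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            predicted = _sigmoid(bias + sum(w * x for w, x in zip(weights, inputs)))
            error = predicted - target
            # Gradient step for the logistic loss.
            weights = [w - lr * error * x for w, x in zip(weights, inputs)]
            bias -= lr * error
    return weights, bias


def account_similarity(model, inputs):
    """Similarity between two accounts according to the trained model."""
    weights, bias = model
    return _sigmoid(bias + sum(w * x for w, x in zip(weights, inputs)))


# Hypothetical training parameters: similarities of the (basic, product,
# behavior) features of two accounts, and the preset similarity of the pair.
training = [
    ([0.90, 0.80, 1.0], 1.0),
    ([0.95, 0.70, 1.0], 1.0),
    ([0.20, 0.10, 0.0], 0.0),
    ([0.40, 0.30, 0.0], 0.0),
]
model = train(training)
likely_duplicate = account_similarity(model, [0.85, 0.90, 1.0])
likely_distinct = account_similarity(model, [0.30, 0.20, 0.0])
```

Once trained, the model is applied to each new account pair with a single forward pass, which fits the text's point that training up front saves computation at identification time.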
In addition, the judging module 2062 comprises a judging submodule, configured to judge whether the similarity between the feature information of the first account and that of the second account is greater than a predetermined threshold and, when it is, to judge that the first account and the second account are duplicate accounts. In a preferred embodiment of the application, judging by threshold identifies duplicate accounts effectively. The way of judging in the application is of course not limited to this.
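The threshold judgment amounts to a single comparison, sketched below. The default threshold of 0.8 is a hypothetical value; the application only says that the threshold is predetermined. Note that the comparison is strict, so a similarity exactly equal to the threshold is not treated as a duplicate.

```python
def is_duplicate(account_similarity, threshold=0.8):
    """Judge two accounts to be duplicates when similarity exceeds the threshold."""
    return account_similarity > threshold


decisions = [is_duplicate(s) for s in (0.95, 0.80, 0.42)]
```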
Preferably, the acquiring unit 202 comprises at least one of the following. A first acquisition module 2021, configured to: obtain the basic information of the first account and the second account; segment the basic information of the first account into words and tag their parts of speech, and assign a weight to each resulting keyword according to its tagged part of speech, to obtain the basic information feature of the first account; and segment the basic information of the second account into words and tag their parts of speech, and assign a weight to each resulting keyword according to its tagged part of speech, to obtain the basic information feature of the second account. A second acquisition module 2022, configured to: obtain the product information of the first account and the second account; segment the product information of the first account into words and tag their parts of speech, compute percentage statistics over the resulting keywords according to their tagged parts of speech, and take the statistics as the product information feature of the products published by the first account; and segment the product information of the second account into words and tag their parts of speech, compute percentage statistics over the resulting keywords according to their tagged parts of speech, and take the statistics as the product information feature of the products published by the second account. A third acquisition module 2023, configured to obtain the identification information (Cookie ID) used when the first account and the second account log in to the website, taking the obtained Cookie ID of the first account as the behavior information feature of the first account and the obtained Cookie ID of the second account as the behavior information feature of the second account. In a preferred embodiment of the application, these steps obtain useful feature information and make the similarity judgment more accurate.
Preferably, the automatic duplicate account identification system further includes a communication unit 208 configured to send indication information to the user after the first account and the second account have been determined to be duplicate accounts, where the indication information indicates that the first account and the second account are duplicate accounts. In a preferred embodiment of the present application, this notification lets the user manage accounts flexibly and improves the user experience.
Embodiment 2
Based on the automatic duplicate account identification system shown in Fig. 1 and Fig. 2, the present application also provides an automatic duplicate account identification method. As shown in Fig. 3, the method in this embodiment includes:
S302: Obtain the feature information of a first account and a second account stored by the server of a website. Preferably, step S302 may be, but is not limited to being, performed by the processing unit 102 in Fig. 1 or the acquiring unit 202 in Fig. 2.
S304: Calculate the similarity between each feature parameter of a feature in the feature information of the first account and each feature parameter of the corresponding feature in the feature information of the second account. Preferably, step S304 may be, but is not limited to being, performed by the processing unit 102 in Fig. 1 or the computing unit 204 in Fig. 2.
S306: Fit the similarities between the feature parameters according to pre-assigned weight parameters to obtain the similarity between each feature of the first account and the corresponding feature of the second account. Preferably, step S306 may be, but is not limited to being, performed by the processing unit 102 in Fig. 1 or the computing unit 204 in Fig. 2.
S308: Determine, according to the similarity between each feature of the first account and the corresponding feature of the second account, whether the first account and the second account are duplicate accounts. Preferably, step S308 may be, but is not limited to being, performed by the processing unit 102 in Fig. 1 or the judging unit 206 in Fig. 2.
In a preferred embodiment of the present application, whether two accounts are duplicates is determined by fitting the similarities of multiple features between them. This effectively avoids presenting wrong duplicate information to the user because of an inaccurate judgment, achieves accurate identification of duplicate accounts, and further improves the user experience with services such as web search and e-commerce.
Preferably, the feature information includes at least one of the following features: the basic information feature of the account, the product information feature of the products published by the account, or the behavior information feature of the account. Because the feature information in the present application covers several features, similarity is calculated from multiple dimensions rather than from a single dimension, which improves the accuracy of duplicate account identification.
Preferably, the first acquisition module 2041, the second acquisition module 2042, the selection module 2043 and the first computing module 2044 in Fig. 2 use the cosine (cosine angle) method to calculate the similarity between feature parameters. That is, the similarity between a first feature parameter of a feature in the feature information of the first account and a second feature parameter of the corresponding feature in the feature information of the second account is calculated through the following steps:
S1: Obtain a first group of keywords A_1, A_2, ..., A_M produced by word segmentation of the first feature parameter, and obtain a first group of weights W_{A1}, W_{A2}, ..., W_{AM} produced by part-of-speech tagging of the first group of keywords and assigning a weight to each keyword according to its part of speech.
S2: Obtain a second group of keywords B_1, B_2, ..., B_N produced by word segmentation of the second feature parameter, and obtain a second group of weights W_{B1}, W_{B2}, ..., W_{BN} produced by part-of-speech tagging of the second group of keywords and assigning a weight to each keyword according to its part of speech.
S3: Select the keywords C_1, ..., C_H (H ≥ 1) that are common to the first group of keywords and the second group of keywords, together with their corresponding weights W_{C1}, ..., W_{CH}.
S4: Calculate the similarity df between the first feature parameter and the second feature parameter using the following formula:
$$df = \frac{d1}{\sqrt{da \times db}}$$
where
$$d1 = W_{C1} \times W_{C1} + \dots + W_{CH} \times W_{CH}$$
$$da = W_{A1} \times W_{A1} + \dots + W_{AM} \times W_{AM}$$
$$db = W_{B1} \times W_{B1} + \dots + W_{BN} \times W_{BN}$$
The cosine method above uses a different weight for each keyword when calculating the similarity between feature parameters, instead of treating every keyword identically, and therefore obtains the similarity between two feature parameters accurately. Of course, the cosine method is only an example; the present application is not limited to it, and other similar methods may also be used to calculate the similarity.
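Steps S1 to S4 can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function name `keyword_similarity` and the dictionary representation (keyword mapped to weight) are assumptions, and word segmentation and part-of-speech tagging are assumed to have been done already. For a shared keyword the sketch multiplies the two sides' weights, which equals W_C × W_C whenever both sides assign the keyword the same weight, as the text presumes.

```python
import math


def keyword_similarity(weights_a, weights_b):
    """Similarity df between two feature parameters (steps S3 and S4).

    weights_a and weights_b map each keyword to the weight that was
    assigned to it from its part of speech (steps S1 and S2).
    """
    common = weights_a.keys() & weights_b.keys()  # S3: shared keywords
    d1 = sum(weights_a[k] * weights_b[k] for k in common)
    da = sum(w * w for w in weights_a.values())
    db = sum(w * w for w in weights_b.values())
    if da == 0 or db == 0:
        return 0.0  # an empty keyword group shares nothing with anything
    return d1 / math.sqrt(da * db)  # S4


# Hypothetical weights: d1 = 9, da = 25, db = 9, so df = 9 / 15
print(keyword_similarity({"x": 3.0, "y": 4.0}, {"x": 3.0}))
```

Keeping the weights in dictionaries makes step S3 a simple key intersection, so the two keyword groups need not be in the same order or of the same length.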
Preferably, the second computing module 2045 may use linear fitting to fit the similarities between the feature parameters of a first feature of the first account and the feature parameters of the corresponding second feature of the second account, using the following formula:
$$d = c1 \times W_{c1} + c2 \times W_{c2} + \dots + cq \times W_{cq}, \quad q \ge 1$$
where d is the similarity between the first feature of the first account and the corresponding second feature of the second account;
c1, c2, ..., cq are the similarities between the feature parameters of the first feature and the corresponding feature parameters of the second feature;
W_{c1}, W_{c2}, ..., W_{cq} are pre-assigned weights.
Of course, linear fitting is only one option, and the present application is not limited to it.
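The linear fit is a weighted sum, which the following sketch makes concrete. The function name `fit_feature_similarity` and the numeric values in the usage line are illustrative assumptions; the weights are supplied by the caller because the text only says they are pre-assigned.

```python
def fit_feature_similarity(param_similarities, weights):
    """Linear fit d = c1*W_c1 + c2*W_c2 + ... + cq*W_cq."""
    if len(param_similarities) != len(weights):
        raise ValueError("one weight is required per parameter similarity")
    return sum(c * w for c, w in zip(param_similarities, weights))


# Hypothetical parameter similarities c1..c3 with weights that sum to 1
d = fit_feature_similarity([0.9, 0.8, 1.0], [0.55, 0.35, 0.10])
```

When the weights sum to 1 and every parameter similarity lies in [0, 1], the fitted value d also stays in [0, 1], which keeps it comparable with a fixed threshold.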
For example, the basic information feature of the first account (the first feature) includes the parameters company address (A1), company introduction (A2) and company telephone (A3), and the basic information feature of the second account (the second feature) includes the parameters company address (B1), company introduction (B2) and company telephone (B3). To calculate the similarity between the two basic information features, the first computing module 2041 first calculates the similarity C1 between A1 and B1, the similarity C2 between A2 and B2, and the similarity C3 between A3 and B3; the fitting module 2042 then fits C1, C2 and C3 to obtain the similarity between the basic information feature of the first account and that of the second account. In a concrete implementation, the cosine method may be used to calculate the similarity between each parameter of the first account's basic information feature and the corresponding parameter of the second account's basic information feature. The details of this calculation, and of the fitting process, can be found in the calculation relating to Tables 1-4 in Embodiment 3.
In the preferred embodiment above, the similarity between a pair of features is obtained by fitting the similarities of their individual parameters, which ensures the accuracy of the feature-level similarity.
Preferably, the step of determining, according to the similarity between each feature of the first account and the corresponding feature of the second account, whether the first account and the second account are duplicate accounts includes: using the similarity between each feature of the first account and the corresponding feature of the second account as the input parameters of a predetermined recognition model; calculating, through the predetermined recognition model, the similarity between the feature information of the first account and the feature information of the second account; and determining, according to the resulting similarity, whether the first account and the second account are duplicate accounts.
Preferably, the step of calculating, through the predetermined recognition model, the similarity between the feature information of the first account and the feature information of the second account includes: training the predetermined recognition model with a predetermined number of training parameters, where each training parameter includes the similarities between the features of two accounts as input parameters and a preset similarity between those two accounts as the output parameter; and then using the similarity between each feature in the feature information of the first account and the corresponding feature in the feature information of the second account as input parameters, and obtaining the similarity between the feature information of the two accounts from the trained recognition model. Training the recognition model reduces the number of calculation loops, so the system runs faster and computation time is saved when identifying duplicate accounts. For the concrete training process in this preferred embodiment, see the calculation relating to Tables 1-4 in Embodiment 3.
Preferably, the step of determining, according to the resulting similarity, whether the first account and the second account are duplicate accounts includes: determining whether the similarity between the feature information of the first account and the feature information of the second account is greater than a predetermined threshold; and, if it is, determining that the first account and the second account are duplicate accounts.
Preferably, the basic information features of the first account and the second account may be obtained, for example but not exclusively by the first acquisition module 2021, as follows: obtain the basic information of the first account and the second account; perform word segmentation and part-of-speech tagging on the basic information of the first account, and assign a weight to each resulting keyword according to its tagged part of speech, to obtain the basic information feature of the first account; and process the basic information of the second account in the same way to obtain the basic information feature of the second account.
Preferably, the product information features of the products published by the first account and the second account may be obtained, for example but not exclusively by the processing unit 102 in Fig. 1 or the second acquisition module 2022 in Fig. 2, as follows: obtain the product information of the first account and the second account; perform word segmentation and part-of-speech tagging on the product information of the first account, compute percentage statistics over the resulting keywords according to their tagged parts of speech, and use the statistics as the product information feature of the products published by the first account; and process the product information of the second account in the same way to obtain the product information feature of the products published by the second account. In a preferred embodiment of the present application, these steps yield useful feature information and make the similarity judgment more accurate.
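The percentage statistics can be pictured as a normalized keyword count. The sketch below is an assumption about how such statistics might be computed: it takes product keywords that have already been segmented and filtered by part of speech, and the function name `product_share_feature` and the listing counts are hypothetical.

```python
from collections import Counter


def product_share_feature(product_keywords):
    """Map each product keyword to its share of all published listings."""
    counts = Counter(product_keywords)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {keyword: count / total for keyword, count in counts.items()}


# Hypothetical account with 20 listings: 8 phones, 7 MP3 players, 5 cameras
listings = ["mobile phone"] * 8 + ["MP3"] * 7 + ["digital camera"] * 5
shares = product_share_feature(listings)
```

Because the shares of one account sum to 1, two accounts with very different listing volumes can still be compared on the shape of their product mix.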
Preferably, the behavior information features of the first account and the second account may be obtained, for example but not exclusively by the processing unit 102 in Fig. 1 or the third acquisition module 2023 in Fig. 2, as follows: obtain the identification information (Cookie ID) used when the first account and the second account log in to the website; use the obtained Cookie ID of the first account as the behavior information feature of the first account, and the obtained Cookie ID of the second account as the behavior information feature of the second account. In a preferred embodiment of the present application, these steps yield useful feature information and make the similarity judgment more accurate.
Preferably, after the first account and the second account have been determined to be duplicate accounts, the automatic duplicate account identification method further includes sending indication information to the user, for example but not exclusively through the communication unit 106 in Fig. 1 or the communication unit 208 in Fig. 2, where the indication information indicates that the first account and the second account are duplicate accounts. In a preferred embodiment of the present application, this notification lets the user manage accounts flexibly and improves the user experience.
Embodiment 3
Based on the automatic duplicate account identification system shown in Fig. 1 and Fig. 2, the present application also provides another automatic duplicate account identification method. As shown in Fig. 4, the method in this embodiment includes:
S402-S406: Obtain the basic account information, the user's historical behavior information, the product information and so on (this stage may be called the information collection and processing stage). Preferably, steps S402-S406 may be, but are not limited to being, performed by the processing unit 102 in Fig. 1 or the acquiring unit 202 in Fig. 2.
Preferably, the basic information of an account includes, but is not limited to, the company name, introduction, contact method and geographic location.
Preferably, the product information corresponding to an account is obtained by extracting the offer information published by the account.
Preferably, the user's historical behavior information for an account is obtained from the Cookie ID used when the account logs in to the website.
S408-S414: Extract the basic information feature of the account from the basic account information, the behavior information feature of the account from the user's historical behavior information, and the product information feature of the products published by the account from the product information (this stage may be called the feature extraction stage). Preferably, steps S408-S414 may be, but are not limited to being, performed by the processing unit 102 in Fig. 1 or the computing unit 204 in Fig. 2.
Preferably, after the basic information has been collected, it is processed with text processing methods and then segmented into words and tagged with parts of speech to form the required basic information feature.
Preferably, the product information is segmented into words and tagged with parts of speech, and statistics over the tagged information yield the product information feature.
Preferably, the obtained Cookie ID of the account is used as the behavior information feature of the account. In this way, the links between accounts are analyzed through the user's historical behavior, which yields the behavior information feature of the account.
S416: Use machine learning to identify automatically whether accounts are duplicates; based on the machine learning result, all duplicate accounts can be identified. Preferably, step S416 may be, but is not limited to being, performed by the processing unit 102 in Fig. 1, or by the computing unit 204 and the judging unit 206 in Fig. 2.
Preferably, the three kinds of features obtained by feature extraction describe an account from multiple dimensions; the next step is to calculate the similarity between corresponding features. The methods are as follows:
1) Calculate the similarities between the basic information features with the cosine method, then fit these similarity values with a machine learning method to obtain the final similarity between the basic information features.
Specifically, feature extraction on the basic information yields a basic information feature sequence consisting of feature ids and the weight corresponding to each id, where the weight is calculated from the frequency with which the id occurs and from its part of speech. The cosine algorithm is then applied to the feature sequences to calculate a similarity for each basic information feature. Fitting the similarities of the individual basic information features gives the final similarity between the basic information features. For the concrete operations, see the embodiment described with Tables 1-4 below.
2) Compute the proportion of each account's published products that are the same as products of the other account, and calculate the similarity of the product distribution over those shared products; the product of the distribution similarity and the product proportion gives the similarity between the product information features.
Preferably, the similarity between product information features can also be calculated with the cosine algorithm. Specifically, first obtain the id of each kind of product and use that product's quantity share, obtained by statistics, as the weight of the id. The product ids and their weights form the product information feature sequence, and the cosine algorithm is then used to calculate the similarity. For the concrete operations, see the embodiment described with Tables 1-4 below.
3) Using information such as the historical behavior information and contact method, determine whether multiple accounts are associated, and thereby obtain the similarity between the behavior information features of those accounts.
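Item 2) above, the product of a distribution similarity and a product proportion, can be sketched as follows. Several details are assumptions rather than statements from the text: each account is represented by a dictionary of product shares, the distribution similarity is taken as the cosine of the two share vectors restricted to the shared products, and, since the text does not say which account the proportion is measured against, the smaller of the two proportions is used. The name `product_similarity` is hypothetical.

```python
import math


def product_similarity(shares_a, shares_b):
    """Distribution similarity over shared products times their share."""
    common = shares_a.keys() & shares_b.keys()
    if not common:
        return 0.0
    # Proportion of each account's listings covered by the shared products.
    overlap_a = sum(shares_a[p] for p in common)
    overlap_b = sum(shares_b[p] for p in common)
    overlap = min(overlap_a, overlap_b)  # assumption: use the smaller one
    # Cosine similarity of the two distributions over the shared products.
    dot = sum(shares_a[p] * shares_b[p] for p in common)
    norm_a = math.sqrt(sum(shares_a[p] ** 2 for p in common))
    norm_b = math.sqrt(sum(shares_b[p] ** 2 for p in common))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return (dot / (norm_a * norm_b)) * overlap


account_a = {"mobile phone": 0.4, "MP3": 0.35, "digital camera": 0.25}
account_b = {"mobile phone": 0.5, "MP3": 0.5}
score = product_similarity(account_a, account_b)
```

Multiplying by the proportion penalizes a pair whose shared products look alike but make up only a small part of what either account sells.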
After obtaining these three similarities, the present application uses an SVM (Support Vector Machine) recognition model to fit the features and obtain the similarity between two accounts. For example, a subset of accounts is first extracted and labeled pairwise; the three kinds of features described above are extracted for these accounts, the labeling information entered by the user is received, and an SVM recognition model for duplicate accounts is learned. At classification time, the three feature similarities of two accounts are input, and the SVM recognition model outputs a similarity value that expresses the degree to which the two accounts are duplicates; pairs above a certain threshold are identified as duplicates. A clustering method based on class-element vectors can then be used to classify all accounts and obtain the final result, which can be used by every product line. Of course, the present application is not limited to feature recognition with the SVM recognition model; other recognition models may also be used.
By identifying duplicate accounts registered by the same company or individual, the preferred embodiment of the present application makes it easier for users and the platform to manage multiple accounts. After duplicate accounts are identified, the website platform can notify the user, state clearly which accounts are duplicates, remind the user to modify and manage them, and accept the user's feedback. Further, if the feedback instructs the platform to merge the duplicate accounts but the merge instruction is incorrect, the website platform can correct the merge instruction through a preset program so as to carry out the user's merge command better.
Based on the automatic duplicate account identification methods and systems described in the embodiments above, a concrete example of automatic duplicate account identification is described below.
Suppose there are four companies, whose details are shown in Tables 1-4 below:
Table 1
Table 2
Table 3
Table 4
For these four accounts, the basic information feature, behavior information feature and product information feature of each account are obtained by the methods described above, and the pairwise similarity between accounts is then calculated by the SVM recognition model from these three kinds of features. During this process, labeling information entered by the user may be received, for example the similarity relations between accounts A, B, C and D: A B 1; A C 1; A D 0; B D 0; C D 1 (where 0 means not duplicate and 1 means duplicate). Before SVM training, the feature information of the four accounts A, B, C and D is extracted.
Taking account A as an example, the feature extraction process is described below.
1) For the basic information feature of an account, the basic information of each account is first segmented into words, tagged with parts of speech and given weights. Taking the company name as an example, segmenting the company name of account A, "Hangzhou Jia Hua Science and Technology Ltd.", gives: Hangzhou, Jia Hua, Science and Technology, Limited, Company. The part-of-speech tags are Hangzhou (administrative region), Jia Hua (core organization name), Science and Technology (industry), Limited (generic word), Company (common word). Each word is then given a weight according to factors such as its part of speech (this weight information may be entered in advance by the user); suppose the result is: Hangzhou=1.95, Jia Hua=3.1, Science and Technology=0.8, Limited=0.4, Company=0.2. The other dimensions of the basic information, such as the company introduction and contact method, can be turned into features in the same way. In addition, for the product information feature of the products published by this account, the same text techniques extract the products of A as mobile phone, MP3, digital camera and so on, with statistical shares of 40%, 35% and 25% respectively. From these statistics the product information feature is: mobile phone=0.4, MP3=0.35, digital camera=0.25. Finally, the behavior information feature of this account includes the userid of the account, its commonly used cookieid and so on.
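The weighting step for the company name can be sketched as a lookup from part-of-speech tag to weight. This is a simplification made for illustration: the text says weights depend on factors such as part of speech, so a pure tag-to-weight table is an assumption, as are the names `POS_WEIGHTS` and `weighted_keywords`. The tokens are supplied already segmented and tagged, since the text does not name a segmenter.

```python
# Hypothetical tag-to-weight table, filled with the example's values.
POS_WEIGHTS = {
    "administrative region": 1.95,
    "core organization name": 3.1,
    "industry": 0.8,
    "generic word": 0.4,
    "common word": 0.2,
}


def weighted_keywords(tagged_tokens, pos_weights, default=0.0):
    """Turn (keyword, part_of_speech) pairs into a keyword -> weight map."""
    return {
        keyword: pos_weights.get(pos, default)  # unknown tags get `default`
        for keyword, pos in tagged_tokens
    }


# Company name of account A after segmentation and tagging
name_a = [
    ("Hangzhou", "administrative region"),
    ("Jia Hua", "core organization name"),
    ("Science and Technology", "industry"),
    ("Limited", "generic word"),
    ("Company", "common word"),
]
features_a = weighted_keywords(name_a, POS_WEIGHTS)
```

Giving the core organization name the largest weight means that two names differing only in generic words such as Limited or Company still score as highly similar.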
2) After feature extraction, calculate the similarity of the corresponding features of two accounts. Taking accounts A and B (similarity relation A B 1) as an example, the following describes how the cosine algorithm calculates the similarity between the company names of A and B. Specifically, after feature extraction the company name feature of A is: Hangzhou=1.95, Jia Hua=3.1, Science and Technology=0.8, Limited=0.4, Company=0.2; the company name feature of B is: Hangzhou=1.95, Jia Hua=3.1, Science and Technology=0.8, Limited=0.4, Company=0.2, Sales Department=0.6.
Here the company name is used as an example to describe the cosine method. From the above, the features shared by the company names of accounts A and B are: Hangzhou=1.95, Jia Hua=3.1, Science and Technology=0.8, Limited=0.4, Company=0.2. First, the score of the shared features is calculated as the sum of the products of their corresponding weights, that is, d1=1.95*1.95+3.1*3.1+0.8*0.8+0.4*0.4+0.2*0.2. Next, the scores of A and B are calculated separately as the sum of the squared weights of all of their features: da=1.95*1.95+3.1*3.1+0.8*0.8+0.4*0.4+0.2*0.2 and db=1.95*1.95+3.1*3.1+0.8*0.8+0.4*0.4+0.2*0.2+0.6*0.6. The final score is df=d1/(sqrt(da)*sqrt(db)), where sqrt(da) denotes the square root of da.
With the cosine algorithm, the similarity between the company names of A and B is obtained as 0.96. In the same way, the similarities between the other basic information features of A and B, such as the company introduction and contact method, can be calculated. Finally, the similarities of the individual basic information features of A and B are fitted with weight parameters to obtain the final similarity between their basic information features. In this embodiment the fitting may be linear: assuming the weight of the company name similarity c1 is 0.55, the weight of the company introduction similarity c2 is 0.35 and the weight of the contact method similarity c3 is 0.1, the similarity d of the basic information feature is d=c1*0.55+c2*0.35+c3*0.1, for example 0.948. Further, if the contact methods are identical, the two accounts are more likely to be duplicates, and d can be processed further; for example, the final score of the basic information feature similarity is d=d*0.73+0.27.
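As a check on the arithmetic, the formula can be evaluated with the company name weights listed for accounts A and B. This worked calculation is an addition for illustration. With these weights the result comes to about 0.988, somewhat higher than the 0.96 quoted above, so the quoted figure may rest on weights or rounding that are not shown. The last line applies the contact method adjustment on the assumption that 0.948 is the value of d before the adjustment.

```latex
\begin{aligned}
d1 &= 1.95^2 + 3.1^2 + 0.8^2 + 0.4^2 + 0.2^2 = 14.2525 \\
da &= d1 = 14.2525 \\
db &= 14.2525 + 0.6^2 = 14.6125 \\
df &= \frac{d1}{\sqrt{da}\,\sqrt{db}}
    = \frac{14.2525}{\sqrt{14.2525 \times 14.6125}} \approx 0.988 \\
d' &= 0.73 \times 0.948 + 0.27 \approx 0.962
\end{aligned}
```

The adjustment d' = 0.73 d + 0.27 leaves d = 1 unchanged and raises every smaller value, which matches the stated intent of treating identical contact methods as extra evidence of duplication.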
Similarly, the cosine method and the fitting process above can be used to calculate the similarity of the other corresponding features of accounts A and B, namely the similarity between the product information features and the similarity between the behavior information features. In the end the similarities of the three features are obtained; for example, for accounts A and B they are 0.948, 0.87 and 0.95 respectively.
After the similarities between the features of all labeled pairs have been calculated, the SVM model is trained. For example, the learning content corresponding to similarity relation A B 1 is (0.948, 0.87, 0.95, 1); that is, (0.948, 0.87, 0.95) are the input parameters for training the SVM model and 1 is the output value expected from it. The internal parameters of the SVM model are adjusted using these inputs and outputs, which achieves the purpose of training. In the same way, the SVM model can be trained further with the learning content of similarity relations A C 1, A D 0, B D 0 and C D 1. The more training parameters are used, the more accurately the internal parameters of the SVM model can be adjusted.
Once the SVM model has been trained, it can be used to judge two accounts. For example, to determine whether accounts B and C are duplicates, the three kinds of feature information of B and C are first extracted as described above, and the similarities of the corresponding features are then calculated, for example (0.927, 0.865, 0.94). These three values are passed to the SVM model, which returns a value, for example 0.97. If this return value is greater than the set threshold, accounts B and C are determined to be duplicate accounts.
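The train-then-judge flow can be illustrated with a small linear SVM written in plain Python. This is a sketch under stated assumptions, not the model used in the application: the kernel, training procedure and output scaling are not given in the text, so a linear SVM trained by subgradient descent on the hinge loss is used instead, labels 1 and 0 are mapped to +1 and -1, and the raw decision value is compared with a threshold of 0 rather than returned as a similarity such as 0.97. Only the A-B triple comes from the text; the other four training triples are hypothetical.

```python
def train_linear_svm(samples, labels, lr=0.05, lam=0.01, epochs=2000):
    """Fit (w, b) by full-batch subgradient descent on the hinge loss.

    labels must be +1 (duplicate) or -1 (not duplicate).
    """
    dim = len(samples[0])
    n = len(samples)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score < 1.0:  # inside the margin: hinge loss is active
                for i in range(dim):
                    grad_w[i] -= y * x[i] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def decision_value(model, x):
    """Signed distance-like score; positive means 'duplicate'."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# (basic info, product info, behavior info) similarities per labeled pair
TRAINING = [
    ((0.948, 0.87, 0.95), +1),  # A-B, labeled 1 (values from the text)
    ((0.91, 0.85, 0.90), +1),   # A-C, labeled 1 (hypothetical values)
    ((0.20, 0.10, 0.00), -1),   # A-D, labeled 0 (hypothetical values)
    ((0.15, 0.30, 0.00), -1),   # B-D, labeled 0 (hypothetical values)
    ((0.93, 0.88, 0.95), +1),   # C-D, labeled 1 (hypothetical values)
]

model = train_linear_svm([x for x, _ in TRAINING], [y for _, y in TRAINING])
b_c_is_duplicate = decision_value(model, (0.927, 0.865, 0.94)) > 0.0
```

Five samples are far too few for a real model; they only show how labeled feature similarities become training inputs and how a new pair is scored afterwards.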
The above is only an example; in a real project, a large number of labeled account samples can be used for learning.
Of course, multiple accounts can also be identified by simple matching of member information or by manual review, but the recognition efficiency is very low, and the accuracy and recall are not high.
In response to the technical challenges currently faced and to the need to optimize resource allocation and improve the search experience, the present application develops a model for automatically identifying repeated accounts. Through an automatic identification technique with high accuracy and high recall, multiple repeated accounts registered by the same company or individual are identified, and the identification results can be applied to each product line.
Obviously, those skilled in the art should understand that each module or step of the present application described above can be implemented with a general-purpose computing device. The modules or steps can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The above are merely preferred embodiments of the present application and are not intended to limit the present application; for those skilled in the art, the present application may have various modifications and changes. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of the present application shall fall within the scope of protection of the present application.

Claims (17)

1. A method for automatically identifying repeated accounts, characterized by comprising:
obtaining feature information of a first account and a second account saved by a server of a website;
calculating the similarity between each feature parameter of a feature in the feature information of the first account and each feature parameter of the corresponding feature in the feature information of the second account;
fitting the similarities between the feature parameters according to pre-assigned weight parameters to obtain the similarity between each feature of the first account and the corresponding feature of the second account; and
judging, according to the similarity between each feature of the first account and the corresponding feature of the second account, whether the first account and the second account are repeated accounts.
2. The method according to claim 1, characterized in that the similarity between a first feature parameter of a feature in the feature information of the first account and a second feature parameter of the corresponding feature in the feature information of the second account is calculated through the following steps:
obtaining a first group of keywords $A_1, A_2, \ldots, A_M$ produced by performing word segmentation on the first feature parameter, and obtaining a first group of weights $W_{A1}, W_{A2}, \ldots, W_{AM}$ produced by performing part-of-speech tagging on the first group of keywords and assigning a weight to each keyword in the first group of keywords according to its part of speech;
obtaining a second group of keywords $B_1, B_2, \ldots, B_N$ produced by performing word segmentation on the second feature parameter, and obtaining a second group of weights $W_{B1}, W_{B2}, \ldots, W_{BN}$ produced by performing part-of-speech tagging on the second group of keywords and assigning a weight to each keyword in the second group of keywords according to its part of speech;
selecting the keywords $C_1, \ldots, C_H$, $H \ge 1$, that are identical between the first group of keywords and the second group of keywords, and the corresponding weights $W_{C1}, \ldots, W_{CH}$; and
calculating the similarity $df$ between the first feature parameter and the second feature parameter by the following formula:
$$df = \frac{d1}{\sqrt{da \times db}}$$
where $d1 = W_{C1} \times W_{C1} + \cdots + W_{CH} \times W_{CH}$;
$da = W_{A1} \times W_{A1} + \cdots + W_{AM} \times W_{AM}$;
$db = W_{B1} \times W_{B1} + \cdots + W_{BN} \times W_{BN}$.
3. The method according to claim 1, characterized in that the similarities between the feature parameters of a first feature of the first account and the feature parameters of the corresponding second feature of the second account are fitted as follows:
$$d = c1 \times W_{c1} + c2 \times W_{c2} + \cdots + cq \times W_{cq}, \quad q \ge 1$$
where $d$ is the similarity between the first feature of the first account and the corresponding second feature of the second account; $c1, c2, \ldots, cq$ are the similarities between the feature parameters of the first feature and the feature parameters of the second feature; and
$W_{c1}, W_{c2}, \ldots, W_{cq}$ are pre-assigned weights.
4. The method according to claim 1, characterized in that the step of judging, according to the similarity between each feature of the first account and the corresponding feature of the second account, whether the first account and the second account are repeated accounts comprises:
taking the similarity between each feature of the first account and the corresponding feature of the second account as an input parameter of a predetermined recognition model, and calculating the similarity between the feature information of the first account and the feature information of the second account through the predetermined recognition model; and
judging, according to the resulting similarity, whether the first account and the second account are repeated accounts.
5. The method according to claim 4, characterized in that the step of calculating the similarity between the feature information of the first account and the feature information of the second account through the predetermined recognition model comprises:
training the predetermined recognition model with a predetermined number of training parameters, wherein each training parameter comprises: the similarities between the features of two accounts, as input parameters, and a preset similarity between the two accounts, as an output parameter; and
taking the similarity between each feature in the feature information of the first account and the corresponding feature in the feature information of the second account as input parameters, and obtaining the similarity between the feature information of the first account and the feature information of the second account through the trained predetermined recognition model.
6. The method according to claim 4, characterized in that the step of judging, according to the resulting similarity, whether the first account and the second account are repeated accounts comprises:
judging whether the similarity between the feature information of the first account and the feature information of the second account is greater than a predetermined threshold; and
if the similarity between the feature information of the first account and the feature information of the second account is greater than the predetermined threshold, judging that the first account and the second account are repeated accounts.
7. The method according to any one of claims 1 to 6, characterized in that the feature information comprises one of the following features or a combination thereof: a basic information feature of the account, a product information feature of the products released by the account, and a behavior information feature of the account.
8. The method according to claim 7, characterized in that the basic information features of the first account and the second account are obtained by the following method:
obtaining the basic information of the first account and the second account;
performing word segmentation and part-of-speech tagging on the basic information of the first account, and assigning a weight, according to the tagged part of speech, to each keyword obtained by segmenting the basic information of the first account, to obtain the basic information feature of the first account; and
performing word segmentation and part-of-speech tagging on the basic information of the second account, and assigning a weight, according to the tagged part of speech, to each keyword obtained by segmenting the basic information of the second account, to obtain the basic information feature of the second account.
9. The method according to claim 7, characterized in that the product information features of the products released by the first account and the second account are obtained by the following method:
obtaining the product information of the first account and the second account;
performing word segmentation and part-of-speech tagging on the product information of the first account, performing percentage statistics, according to the tagged part of speech, on each keyword obtained by segmenting the product information of the first account, and taking the statistical result as the product information feature of the products released by the first account; and
performing word segmentation and part-of-speech tagging on the product information of the second account, performing percentage statistics, according to the tagged part of speech, on each keyword obtained by segmenting the product information of the second account, and taking the statistical result as the product information feature of the products released by the second account.
10. The method according to claim 7, characterized in that the behavior information features of the first account and the second account are obtained by the following method:
obtaining the identification information Cookie ID used when the first account and the second account log in to the website; and
taking the obtained Cookie ID of the first account as the behavior information feature of the first account, and taking the obtained Cookie ID of the second account as the behavior information feature of the second account.
11. A system for automatically identifying repeated accounts, characterized by comprising:
an acquiring unit, configured to obtain feature information of a first account and a second account saved by a server of a website, wherein the feature information comprises one of the following features or a combination thereof: a basic information feature of the account, a product information feature of the products released by the account, and a behavior information feature of the account;
a computing unit, configured to calculate the similarity between each feature parameter of a feature in the feature information of the first account and each feature parameter of the corresponding feature in the feature information of the second account, and to fit the similarities between the feature parameters according to pre-assigned weight parameters to obtain the similarity between each feature of the first account and the corresponding feature of the second account; and
a judging unit, configured to judge, according to the similarity between each feature of the first account and the corresponding feature of the second account, whether the first account and the second account are repeated accounts.
12. The system according to claim 11, characterized in that the computing unit comprises:
a first acquisition module, configured to obtain a first group of keywords $A_1, A_2, \ldots, A_M$ produced by performing word segmentation on a first feature parameter, and to obtain a first group of weights $W_{A1}, W_{A2}, \ldots, W_{AM}$ produced by performing part-of-speech tagging on the first group of keywords and assigning a weight to each keyword in the first group of keywords according to its part of speech, wherein the first feature parameter is a feature parameter of a feature in the feature information of the first account;
a second acquisition module, configured to obtain a second group of keywords $B_1, B_2, \ldots, B_N$ produced by performing word segmentation on a second feature parameter, and to obtain a second group of weights $W_{B1}, W_{B2}, \ldots, W_{BN}$ produced by performing part-of-speech tagging on the second group of keywords and assigning a weight to each keyword in the second group of keywords according to its part of speech, wherein the second feature parameter is a feature parameter of a feature in the feature information of the second account;
a selection module, configured to select the keywords $C_1, \ldots, C_H$, $H \ge 1$, that are identical between the first group of keywords and the second group of keywords, and the corresponding weights $W_{C1}, \ldots, W_{CH}$; and
a first computing module, configured to calculate the similarity $df$ between the first feature parameter and the second feature parameter by the following formula:
$$df = \frac{d1}{\sqrt{da \times db}}$$
where $d1 = W_{C1} \times W_{C1} + \cdots + W_{CH} \times W_{CH}$;
$da = W_{A1} \times W_{A1} + \cdots + W_{AM} \times W_{AM}$;
$db = W_{B1} \times W_{B1} + \cdots + W_{BN} \times W_{BN}$.
13. The system according to claim 11, characterized in that the computing unit further comprises: a second computing module, configured to fit the similarities between the feature parameters of a first feature of the first account and the feature parameters of the corresponding second feature of the second account as follows:
$$d = c1 \times W_{c1} + c2 \times W_{c2} + \cdots + cq \times W_{cq}, \quad q \ge 1$$
where $d$ is the similarity between the first feature of the first account and the corresponding second feature of the second account;
$c1, c2, \ldots, cq$ are the similarities between the feature parameters of the first feature and the feature parameters of the second feature; and
$W_{c1}, W_{c2}, \ldots, W_{cq}$ are pre-assigned weights.
14. The system according to claim 11, characterized in that the judging unit comprises:
a third computing module, configured to take the similarity between each feature of the first account and the corresponding feature of the second account as an input parameter of a predetermined recognition model, and to calculate the similarity between the feature information of the first account and the feature information of the second account through the predetermined recognition model; and
a judging module, configured to judge, according to the resulting similarity, whether the first account and the second account are repeated accounts.
15. The system according to claim 14, characterized in that the third computing module comprises:
a training submodule, configured to train the predetermined recognition model with a predetermined number of training parameters, wherein each training parameter comprises: the similarities between the features of two accounts, as input parameters, and a preset similarity between the two accounts, as an output parameter; and
a computing submodule, configured to take the similarity between each feature in the feature information of the first account and the corresponding feature in the feature information of the second account as input parameters, and to obtain the similarity between the feature information of the first account and the feature information of the second account through the trained predetermined recognition model.
16. The system according to claim 14, characterized in that the judging module comprises:
a judging submodule, configured to judge whether the similarity between the feature information of the first account and the feature information of the second account is greater than a predetermined threshold, and, when the similarity between the feature information of the first account and the feature information of the second account is greater than the predetermined threshold, to judge that the first account and the second account are repeated accounts.
17. The system according to any one of claims 11 to 16, characterized in that the acquiring unit comprises at least one of the following:
a first acquisition module, configured to obtain the basic information of the first account and the second account; to perform word segmentation and part-of-speech tagging on the basic information of the first account, and to assign a weight, according to the tagged part of speech, to each keyword obtained by segmenting the basic information of the first account, to obtain the basic information feature of the first account; and to perform word segmentation and part-of-speech tagging on the basic information of the second account, and to assign a weight, according to the tagged part of speech, to each keyword obtained by segmenting the basic information of the second account, to obtain the basic information feature of the second account;
a second acquisition module, configured to obtain the product information of the first account and the second account; to perform word segmentation and part-of-speech tagging on the product information of the first account, to perform percentage statistics, according to the tagged part of speech, on each keyword obtained by segmenting the product information of the first account, and to take the statistical result as the product information feature of the products released by the first account; and to perform word segmentation and part-of-speech tagging on the product information of the second account, to perform percentage statistics, according to the tagged part of speech, on each keyword obtained by segmenting the product information of the second account, and to take the statistical result as the product information feature of the products released by the second account; or
a third acquisition module, configured to obtain the identification information Cookie ID used when the first account and the second account log in to the website, to take the obtained Cookie ID of the first account as the behavior information feature of the first account, and to take the obtained Cookie ID of the second account as the behavior information feature of the second account.
CN201110113252.1A 2011-05-03 2011-05-03 Method and system for identifying repeated account Active CN102768659B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201110113252.1A CN102768659B (en) 2011-05-03 2011-05-03 Method and system for identifying repeated account
HK12113367.4A HK1172706A1 (en) 2011-05-03 2012-12-25 Method and system for automatically identifying repeated account

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110113252.1A CN102768659B (en) 2011-05-03 2011-05-03 Method and system for identifying repeated account

Publications (2)

Publication Number Publication Date
CN102768659A true CN102768659A (en) 2012-11-07
CN102768659B CN102768659B (en) 2015-06-24

Family

ID=47096063

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110113252.1A Active CN102768659B (en) 2011-05-03 2011-05-03 Method and system for identifying repeated account

Country Status (2)

Country Link
CN (1) CN102768659B (en)
HK (1) HK1172706A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101316262A (en) * 2007-05-31 2008-12-03 中兴通讯股份有限公司 Method for controlling repeated registration of the same account terminal
US7725421B1 (en) * 2006-07-26 2010-05-25 Google Inc. Duplicate account identification and scoring
CN101727487A (en) * 2009-12-04 2010-06-09 中国人民解放军信息工程大学 Network criticism oriented viewpoint subject identifying method and system
KR101022373B1 (en) * 2004-01-29 2011-03-22 주식회사 케이티 Log-in system allowing duplicated user account and method for registering of user account and method for authentication of user

Also Published As

Publication number Publication date
CN102768659B (en) 2015-06-24
HK1172706A1 (en) 2013-04-26

Similar Documents

Publication Publication Date Title
CN102768659B (en) Method and system for identifying repeated account
CN103679462B (en) Comment data processing method and apparatus, and searching method and system
CN103870507B (en) Method and device of searching based on category
CN111506721B (en) Question-answering system and construction method for domain knowledge graph
CN105893533A (en) Text matching method and device
CN102810117A (en) Method and equipment for supplying search result
CN105975453A (en) Method and device for comment label extraction
CN103473317A (en) Method and equipment for extracting keywords
CN112257419A (en) Intelligent retrieval method and device for calculating patent document similarity based on word frequency and semantics, electronic equipment and storage medium thereof
CN106776901A (en) Data extraction method, apparatus and system
CN105989001A (en) Image searching method and device, and image searching system
CN104462554A (en) Method and device for recommending question and answer page related questions
CN111737494A (en) Knowledge graph generation method of intelligent learning system
CN106919588A (en) Application program search system and method
CN104715063A (en) Search ranking method and search ranking device
CN108182182A (en) Document matching method, device and computer readable storage medium in translation database
CN109902157A (en) Training sample validity checking method and device
CN107679186A (en) Method and device for entity search based on an entity library
CN113792084A (en) Data heat analysis method, device, equipment and storage medium
CN108959289B (en) Website category acquisition method and device
CN111523798A (en) Automatic modeling method, device and system and electronic equipment thereof
CN104462556A (en) Method and device for recommending question and answer page related questions
CN116628162A (en) Semantic question-answering method, device, equipment and storage medium
CN103279549A (en) Method and device for acquiring target data of target objects
CN105095385A (en) Method and device for outputting retrieval result

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
REG Reference to a national code Ref country code: HK; Ref legal event code: GR; Ref document number: 1172706; Country of ref document: HK