CN104281603A

CN104281603A - Word frequency grading statistical method and system

Info

Publication number: CN104281603A
Application number: CN201310282492.3A
Authority: CN
Inventors: 高玉军; 刘昉
Original assignee: Founder Information Industry Holdings Co Ltd; Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd
Current assignee: Founder Information Industry Holdings Co Ltd; Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd
Priority date: 2013-07-05
Filing date: 2013-07-05
Publication date: 2015-01-14
Anticipated expiration: 2033-07-05
Also published as: CN104281603B

Abstract

The invention relates to the technical field of computer information processing, and discloses a word frequency grading statistical method and system. The method comprises the steps of extracting attribute information of each original document; classifying the original documents according to the attribute information, and building a document attribute table of different categories; digitizing the original documents of the different categories in a one-by-one mode, and generating digitized documents; carrying out primary word frequency statistics and word statistics by taking the digitized documents as units according to the attribute information of words, and storing a statistical result to an electronic directory corresponding to the document attribute table and the digitized documents; carrying out step-by-step merging type word frequency statistics on the words within various statistical ranges by taking the word frequency statistical record of each document as a foundation statistical unit. According to the word frequency grading statistical method and system, the statistical speed, efficiency and the statistical accuracy can be greatly improved.

Description

Word frequency grading statistical method and system

Technical field

The present invention relates to the technical field of computer information processing, and in particular to a word frequency grading statistical method and system.

Background technology

The invention of writing is an important milestone of human civilization, and also the principal means by which a nation's traditions and culture are passed on.

Chinese writing has a long history and has taken many forms. Over thousands of years, from pictorial symbols and pictographs to oracle bone script, bronze inscriptions, ancient script, small seal script, clerical script (lishu), cursive script, running script, and regular script, the forms of Chinese characters have changed, yet they descend in one continuous line and faithfully record the course of Chinese civilization.

Character dictionaries from ancient to modern times, as the concrete carriers of writing, form a huge body of resources. From the Eastern Han dynasty's "Shuowen Jiezi" ("origin of Chinese character") to the contemporary "Chinese big dictionary", character dictionaries and rhyme dictionaries of all kinds record large numbers of characters together with their form, pronunciation, and meaning attributes, and are the basic resources for building a character research platform. Once these vast textual resources have been digitized, being able to carry out word frequency statistics efficiently over various statistical ranges and statistical levels will greatly advance research on the scripts of all of China's ethnic groups, and in turn accelerate the internationalization and standardization of Chinese text processing.

Existing word frequency statistical methods usually first extract the source materials of ancient and modern character dictionaries, digitize them, and build basic character-set databases. These basic databases include an ancient-and-modern dictionary resource library, an ancient script attribute library, a modern Chinese character attribute library, a minority-language script attribute library, and so on. Word frequency is then counted one character at a time, traversing all of the digital resource databases for each single character. With large data volumes this approach is inefficient and slow: statistics over millions or tens of millions of records usually involve a long wait. Even with optimization measures on a large database server, the speed of on-demand statistics remains unsatisfactory.

Summary of the invention

The invention provides a word frequency grading statistical method and system, so as to improve statistical speed and accuracy.

To this end, the invention provides the following technical solutions:

A word frequency grading statistical method, comprising:

extracting the attribute information of each original document;

classifying the original documents according to the attribute information, and building document attribute tables for the different categories;

digitizing the original documents of each category one by one to generate digitized documents;

carrying out, according to the attribute information of characters, primary word frequency statistics and character counting with each digitized document as the unit, and saving the statistical results to the electronic directory corresponding to the document attribute table and the digitized document;

taking the word frequency statistical record of each document as the basic statistical unit, carrying out step-by-step merging word frequency statistics of characters within various statistical ranges.

Preferably, the attribute information comprises file information and content information;

the features of the file information comprise: document time information and file name;

the features of the content information comprise: classification information, classification number, author, dynasty information, font type information, excavation information, publication information, and sample name.

Preferably, digitizing the original documents of each category one by one to generate digitized documents comprises:

converting the images of the original documents of each category, one by one, into digitized documents that can be edited and searched.

Preferably, the attribute information of a character comprises any one or more of the following: glyph, Unicode code point, stroke order, strokes, radical, and glyph structure.

Preferably, carrying out primary word frequency statistics and character counting with each digitized document as the unit according to the attribute information of characters comprises:

according to the attribute information of characters, carrying out word frequency statistics and character counting for each character, with each document as the unit.

Preferably, carrying out step-by-step merging word frequency statistics of characters within various statistical ranges comprises:

based on the content attribute information of the digitized documents, carrying out fast merging word frequency statistics by document attribute information; and/or

based on the attribute information of characters, carrying out fast merging word frequency statistics by character attribute information.

A word frequency grading statistical system, comprising:

an extraction unit, configured to extract the attribute information of each original document;

a classification unit, configured to classify the original documents according to the attribute information, and to build document attribute tables for the different categories;

a digitizing unit, configured to digitize the original documents of each category one by one to generate digitized documents;

a primary statistics unit, configured to carry out, according to the attribute information of characters, primary word frequency statistics and character counting with each digitized document as the unit, and to save the statistical results to the electronic directory corresponding to the document attribute table and the digitized document;

a comprehensive statistics unit, configured to take the word frequency statistical record of each document as the basic statistical unit and carry out step-by-step merging word frequency statistics of characters within various statistical ranges.

Preferably, the digitizing unit is specifically configured to convert the images of the original documents of each category, one by one, into digitized documents that can be edited and searched.

Preferably, the primary statistics unit is specifically configured to carry out, according to the attribute information of characters, word frequency statistics and character counting for each character, with each document as the unit.

Preferably, the comprehensive statistics unit comprises:

a first statistics subunit, configured to carry out fast merging word frequency statistics by document attribute information, based on the content attribute information of the digitized documents; and/or

a second statistics subunit, configured to carry out fast merging word frequency statistics by character attribute information, based on the attribute information of characters.

The word frequency grading statistical method and system provided by the embodiments of the invention complete the primary word frequency statistics of each single document in advance, at the same time as that document is digitized. Afterwards, for any statistical condition, the primary word frequency data of the relevant documents are combined according to the attribute conditions, and simple arithmetic accumulation quickly yields the required full-range word frequency statistics. Compared with traditional word frequency statistical methods, statistical speed, efficiency, and accuracy are greatly improved. Furthermore, because the attribute records made during digitization are associated with the word frequency statistics, all original document information related to a statistical result can be located quickly from that result, providing a fast and convenient tracing function for character research.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present application or of the prior art more clearly, the drawings used in the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them.

Fig. 1 is a flowchart of the word frequency grading statistical method of an embodiment of the invention;

Fig. 2 is a schematic diagram of the word frequency statistical data of a single document in an embodiment of the invention;

Fig. 3 is a structural diagram of the word frequency grading statistical system of an embodiment of the invention.

Detailed description of the embodiments

To help those skilled in the art better understand the solutions of the embodiments of the invention, the embodiments are described in further detail below with reference to the drawings and implementations.

To address the low efficiency and slow speed of word frequency statistics when compiling the very large body of ancient and modern character dictionary resources, the embodiments of the invention provide a word frequency grading statistical method and system that quickly carry out merging word frequency statistics of characters across those resources. Statistical speed and accuracy are greatly improved, and every word frequency result can be conveniently traced back to, and quickly located in, the original documents.

Fig. 1 is a flowchart of the word frequency grading statistical method of an embodiment of the invention, which comprises the following steps:

Step 101: extract the attribute information of each original document.

The original document refers to the text information that corresponds exactly to the characters in a document image. It comes from document samples of all kinds, including but not limited to samples of ancient and modern dictionary resources containing large numbers of characters, samples of source materials such as ancient books, rare editions, and unearthed artifacts, and samples of modern printed publications. All of these samples need to be digitized and entered as an image together with the corresponding modern text.

The attribute information of the original document comprises file information and content information. The file information features include document time information, file name, and so on. The content information features include classification information, classification number, author, dynasty information (from ancient times to the present), font type information (such as Song typeface, regular script, clerical script, small seal script, oracle bone script, and so on), excavation information, publication information, sample name, and so on.

For example, for Republic-era newspapers, the document attribute information should include general information such as the newspaper date, page, typeface used, entries, body text, and newspaper category table.

Step 102: classify the original documents according to the attribute information, and build document attribute tables for the different categories.

The content of a document attribute table usually indicates the content attribute features of the documents.

Original documents need to be classified according to the series to which they belong. In terms of script structure, the first level can be divided into ancient Chinese characters, modern Chinese characters, calligraphic scripts, and so on. The second level consists of sub-categories under each first-level node: for example, ancient characters can be further divided into bronze inscriptions, regular script, small seal script, and so on, and modern characters into regular script, clerical script, cursive script, and so on. The third level subdivides the second level further; for example, regular script is divided into newspapers, stone inscriptions, periodicals, and so on. Fourth and fifth levels follow in the same way, as needed.

Take the main edition of People's Daily from 1949 as an example. Its category is "modern Chinese character/regular script/report class/People's Daily/1946-1970". On this basis, "194902010102 word frequency" is created, denoting the 2nd news article on page 1 of the newspaper of February 1, 1949. Combining the two gives "modern Chinese character/regular script/report class/People's Daily/1946-1970/1949/02/01/01/02", that is, the 2nd news article on page 1 of the February 1, 1949 issue, within the 1946-1970 group of People's Daily, under the newspaper category ("report class") of regular script within modern Chinese characters. This is the document attribute table that is established.

The attributes of the original document correspond to the document attributes; the two can be completely identical.
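The naming scheme in the People's Daily example can be sketched in a few lines. This is only an illustration under assumptions: the class name `ArticleKey`, its field names, and the two-digit zero padding are hypothetical choices inferred from the single example "194902010102", not something the text specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArticleKey:
    """Hypothetical key for one news article; all names are illustrative."""
    category: str  # classification path taken from the document attribute table
    year: int
    month: int
    day: int
    page: int
    article: int

    def record_id(self) -> str:
        # Assumed layout: YYYYMMDD + 2-digit page + 2-digit article number.
        return (f"{self.year:04d}{self.month:02d}{self.day:02d}"
                f"{self.page:02d}{self.article:02d}")

    def attribute_path(self) -> str:
        # Category path extended by year/month/day/page/article.
        return (f"{self.category}/{self.year:04d}/{self.month:02d}/"
                f"{self.day:02d}/{self.page:02d}/{self.article:02d}")


key = ArticleKey(
    "modern Chinese character/regular script/report class/People's Daily/1946-1970",
    1949, 2, 1, 1, 2,
)
print(key.record_id())
print(key.attribute_path())
```

Keeping the identifier and the path derivable from the same fields means the record name and its place in the classification cannot drift apart.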

Step 103: digitize the original documents of each category one by one to generate digitized documents.

Original documents come from document samples of all kinds; before these samples are digitized, no corresponding electronic documents exist. A digitized document can therefore be formed only after the character sequence corresponding to the document image or physical document has been entered. In other words, digitizing an original document mainly means converting its image into a digitized document that can be edited and searched: after scanning and then Chinese character recognition, the original text is entered into the computer to form the digitized document. A character that does not exist in the character set or cannot be recognized can be marked with a unique number. This unique number records the association for that character, so that the original glyph image can be traced.
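One possible way to mark unrecognized characters with unique numbers is sketched below. The token format `[UNK0001]`, the function name `assign_placeholders`, and the image references are all hypothetical; the text only says that such characters receive a unique number that links back to the original glyph image.

```python
def assign_placeholders(glyphs):
    """Replace unrecognized glyphs with unique placeholder tokens.

    `glyphs` is a list of (character_or_None, image_reference) pairs, where
    None means recognition failed. Returns the token sequence and a mapping
    from each placeholder back to the image it came from.
    """
    tokens = []
    placeholder_to_image = {}
    for char, image_ref in glyphs:
        if char is None:
            # Assumed numbering: one new number per unrecognized occurrence.
            token = f"[UNK{len(placeholder_to_image) + 1:04d}]"
            placeholder_to_image[token] = image_ref
            tokens.append(token)
        else:
            tokens.append(char)
    return tokens, placeholder_to_image


tokens, mapping = assign_placeholders(
    [("分", "page1.png#1"), (None, "page1.png#2"), ("田", "page1.png#3")]
)
print(tokens)
print(mapping)
```

Whether two occurrences of the same unknown glyph should share one number is left open by the text; this sketch gives each occurrence its own.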

Step 104: according to the attribute information of characters, carry out primary word frequency statistics and character counting with each digitized document as the unit, and save the statistical results to the electronic directory corresponding to the document attribute table and the digitized document.

The attribute information of a character can comprise any one or more of the following: glyph, Unicode code point, stroke order, strokes, radical, glyph structure, and so on.

For each digitized document generated in step 103, a corresponding electronic directory is created according to the document attribute table, for example the directory "modern Chinese character/regular script/report class/People's Daily/1946-1970/". Under this directory, the digitized document is named "194902010102.TXT" and its word frequency file is named "194902010102 word frequency.TXT"; both are stored in that directory. "194902010102.TXT" holds all the characters of the body of the 2nd article on page 1 of the original newspaper of February 1, 1949, obtained by manual entry. "194902010102 word frequency.TXT" records the word frequency of each character in that article, that is, how many times each character occurs in the body of the 2nd article on page 1 of the February 1, 1949 newspaper, and in what proportion. This completes the word frequency statistics for every character contained in the most basic unit. For example, how many times the character "divide" occurs in that article, and in what proportion, is recorded.
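Storing the two files side by side in the category directory might look like the sketch below. The function name `save_primary_result`, the tab-separated line layout of the frequency file, and the sample text are assumptions for illustration; the text fixes only the directory and the idea of a paired text file and word frequency file.

```python
import tempfile
from pathlib import Path


def save_primary_result(root, category, record_id, text, freq_lines):
    """Write the digitized text and its frequency record into the category directory."""
    directory = Path(root) / category
    directory.mkdir(parents=True, exist_ok=True)
    text_path = directory / f"{record_id}.TXT"
    # Assumed file name for the frequency record of the same article.
    freq_path = directory / f"{record_id} word frequency.TXT"
    text_path.write_text(text, encoding="utf-8")
    freq_path.write_text("\n".join(freq_lines) + "\n", encoding="utf-8")
    return text_path, freq_path


with tempfile.TemporaryDirectory() as root:
    text_path, freq_path = save_primary_result(
        root,
        "modern Chinese character/regular script/report class/People's Daily/1946-1970",
        "194902010102",
        "分田分地",  # stand-in text, not the real article
        ["分\t2\t500", "田\t1\t250", "地\t1\t250"],  # assumed layout: char, count, per mille
    )
    saved_names = sorted(p.name for p in text_path.parent.iterdir())

print(saved_names)
```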

The attribute information of a character can include its glyph, Unicode code point, stroke order, strokes, radical, glyph structure, and its position coordinates (X, Y) in the document in which it appears. Word frequency can be counted by glyph and Unicode code point, or by information such as stroke order and strokes. For example, one can count how often a given character or a given Unicode code point occurs, or how many characters have a given stroke order or stroke count. The position coordinates (X, Y) record the exact position of the character in its document, so that its original position can be located quickly for later research and analysis. In addition, every item of character attribute information can serve as a query criterion.
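A per-occurrence record carrying the glyph, its code point, and its (X, Y) position could be modeled as below. The class `CharOccurrence`, its fields, and the coordinate values are hypothetical, and the sample characters are stand-ins; only the code point derivation is standard Python.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CharOccurrence:
    """Hypothetical record of one character occurrence in a document."""
    glyph: str
    x: int  # assumed: horizontal position in the source image
    y: int  # assumed: vertical position in the source image

    @property
    def code_point(self) -> str:
        return f"U+{ord(self.glyph):04X}"


occurrences = [
    CharOccurrence("分", 10, 20),
    CharOccurrence("田", 30, 20),
    CharOccurrence("分", 50, 20),
]

# Count by Unicode code point, and look up where one character occurs.
count_by_code_point = Counter(o.code_point for o in occurrences)
positions_of = [(o.x, o.y) for o in occurrences if o.glyph == "分"]
print(count_by_code_point)
print(positions_of)
```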

In practice, while an original document is being digitized, word frequency statistics and character counting can be carried out for each character, document by document, according to the attribute information of characters. Character counting means taking each document as the unit and making a primary count of all the characters in it. Word frequency is the number of occurrences of a given character in the document divided by the total number of characters in the document, that is, word frequency = occurrences of the current character in the document / total number of characters in the document. The result can be expressed in per mille, that is, accurate to three decimal places.
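The per-document count and frequency defined above can be sketched as follows. The function name `primary_statistics` and the choice to skip whitespace are assumptions; the sample text is a synthetic stand-in built to have 31 characters with one character occurring 4 times.

```python
from collections import Counter


def primary_statistics(text):
    """Return (total characters, count per character, per-mille frequency per character)."""
    chars = [c for c in text if not c.isspace()]  # assumption: whitespace is not counted
    total = len(chars)
    counts = Counter(chars)
    # Frequency = occurrences / total, expressed in per mille and rounded.
    per_mille = {c: round(n * 1000 / total) for c, n in counts.items()}
    return total, counts, per_mille


sample = "分" * 4 + "田" * 27  # synthetic 31-character document
total, counts, per_mille = primary_statistics(sample)
print(total, counts["分"], per_mille["分"])
```

With 4 occurrences out of 31 characters the frequency is 4 / 31 × 1000 ≈ 129‰, and the counts themselves are kept so that later merges can work from exact numbers.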

Specifically, the full text of a single digitized document is scanned, and the count and word frequency are computed one by one for each Chinese character in it, finally producing the per-document character counts and word frequencies shown in Fig. 2.

As shown in Fig. 2, this digitized document is named "194902010102 word frequency", denoting the 2nd news article on page 1 of the newspaper of February 1, 1949. It has 31 characters in total, of which the character "divide" occurs 4 times, so its word frequency in this article is 4 / 31 × 1000 ≈ 129‰. The statistical unit shown in the figure is thus one news item, and one news item is one article.

This continues until the primary word frequencies, character counts, and position information of all digitized documents in all categories have been computed and recorded.

Taking Fig. 2 as an example, all news items on all pages of all newspaper issues are processed in turn. Similarly:

"194902010101 word frequency" denotes the word frequency statistics of the body of the 1st article on page 1 of the newspaper of February 1, 1949;

"194902010103 word frequency" denotes the word frequency statistics of the body of the 3rd article on page 1 of the newspaper of February 1, 1949;

"194902010104 word frequency" denotes the word frequency statistics of the body of the 4th article on page 1 of the newspaper of February 1, 1949;

"194902010201 word frequency" denotes the word frequency statistics of the body of the 1st article on page 2 of the newspaper of February 1, 1949;

"194902020101 word frequency" denotes the word frequency statistics of the body of the 1st article on page 1 of the newspaper of February 2, 1949;

……

Step 105: taking the word frequency statistical record of each document as the basic statistical unit, carry out step-by-step merging word frequency statistics of characters within various statistical ranges.

Based on the primary, per-document word frequency statistics and character counts above, fast merging word frequency statistics by document attribute information can be carried out in combination with the content attribute information of each digitized document; fast merging word frequency statistics by character attribute information can be carried out in combination with the character attribute information.

Merging statistics can be a vertical cumulative merge for a character over time, from ancient times to the present, or a horizontal cumulative merge of a character across the various usages within a given period. Cumulative statistics means simple addition and subtraction of numbers; in general, accumulating the word frequency means adding up the word frequency of a given character across the individual documents.
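A minimal sketch of the cumulative merge follows, under one assumption worth stating plainly: it adds occurrence counts and document totals and then recomputes the per-mille value, because per-mille values of documents of different lengths do not add up to a meaningful ratio. The function names and sample numbers are illustrative.

```python
from collections import Counter


def merge_records(records):
    """Merge per-document records of the form (total characters, Counter of counts)."""
    merged_total = 0
    merged_counts = Counter()
    for total, counts in records:
        merged_total += total
        merged_counts.update(counts)  # Counter.update adds counts
    return merged_total, merged_counts


def per_mille(count, total):
    return round(count * 1000 / total) if total else 0


records = [
    (31, Counter({"分": 4, "田": 27})),
    (69, Counter({"分": 6, "田": 63})),
]
merged_total, merged_counts = merge_records(records)
print(merged_total, merged_counts["分"], per_mille(merged_counts["分"], merged_total))
```

No document text is rescanned here; the merge touches only the small per-document records.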

For example, "modern Chinese character/regular script/report class/People's Daily/1946-1970/1949/02/01/01/02" denotes the content attributes of the body of the 2nd article on page 1 of the newspaper of February 1, 1949. On this basis, fast merging word frequency statistics by document attribute information can be carried out, merging the word frequency data of multiple records like "194902010102 word frequency". Specifically, word frequencies can be accumulated vertically, level by level upward, through all the categories under "modern Chinese character/regular script/report class/People's Daily/1946-1970/1949/02/01/01/02"; they can also be accumulated horizontally across categories at a given classification level.

As another example, the attribute information of a character (such as its glyph or its unique Unicode code point) can be used to carry out fast merging word frequency statistics based on that attribute information. This includes vertical merging for a character over time, from ancient times to the present, and horizontal word frequency statistics for a character across the various usages within a given period. Examples are the cumulative word frequency of the character "divide" in all documents, its cumulative word frequency across multiple records of the same kind as "194902010102 word frequency", its word frequency in the single document "194902010102 word frequency", and its cumulative word frequency across documents of several different categories.
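Merging by a character attribute instead of by the character itself can be sketched by regrouping already merged counts. Stroke count is used as the example attribute; the small `STROKES` table, the function name, and the sample counts are illustrative assumptions, and a real system would read attributes from its own character attribute library.

```python
from collections import Counter

# Illustrative attribute table: character -> stroke count.
STROKES = {"分": 4, "田": 5, "人": 2}


def merge_by_attribute(counts, attribute_of):
    """Regroup character counts under the value of one character attribute."""
    grouped = Counter()
    for ch, n in counts.items():
        # Characters missing from the table are grouped under None.
        grouped[attribute_of.get(ch)] += n
    return grouped


merged_counts = Counter({"分": 10, "田": 3, "人": 7, "□": 1})
by_strokes = merge_by_attribute(merged_counts, STROKES)
print(by_strokes)
```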

Conversely, because the attribute records made during digitization are associated with the word frequency statistics, all original document information related to a word frequency result can be located quickly from that result, providing a fast and convenient tracing function for character research.
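The tracing direction, from a character back to the documents that contribute to its count, can be sketched as an inverted index over the per-document records. The function name `build_trace_index`, the shortened paths, and the counts are hypothetical.

```python
from collections import Counter, defaultdict


def build_trace_index(per_document_counts):
    """Map each character to the (document path, count) pairs where it occurs."""
    index = defaultdict(list)
    for path in sorted(per_document_counts):
        for ch, n in per_document_counts[path].items():
            index[ch].append((path, n))
    return index


per_document_counts = {
    "People's Daily/1949/02/01/01/02": Counter({"分": 4, "田": 1}),
    "People's Daily/1949/02/01/01/03": Counter({"分": 1}),
}
index = build_trace_index(per_document_counts)
print(index["分"])
```

Because every entry keeps the document path, a merged figure can be broken down again into the articles it came from.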

Take the newspaper news statistics shown in Fig. 2 as an example. The primary statistics, that is, the word frequency statistics of the 2nd news article on page 1 of the newspaper of February 1, 1949, already contain the word frequency of the character "divide" in the single document "194902010102 word frequency".

On this basis, direct level-by-level accumulation gives the word frequency of "divide" in all news articles on page 1 of the February 1, 1949 newspaper.

One level up, direct accumulation gives the word frequency of "divide" on all pages of the February 1, 1949 newspaper.

One level up again, direct accumulation gives the word frequency of "divide" in all issues of this newspaper in February 1949.

One level up again, direct accumulation gives the word frequency of "divide" in all issues of this newspaper in 1949.

One level up again, direct accumulation gives the word frequency of "divide" in the whole Republic-era newspaper category.

One level up again, direct accumulation gives the word frequency of "divide" in the whole newspaper category.

One level up again, direct accumulation gives the word frequency of "divide" in the whole regular script category.

One level up again, direct accumulation gives the word frequency of "divide" in the whole modern Chinese character category.

One level up again, at the top level, direct accumulation gives the word frequency of "divide" in all documents (covering all categories, such as modern Chinese characters, ancient Chinese characters, and brush calligraphy).

These levels follow the classification determined earlier: "modern Chinese character/regular script/report class/People's Daily/1946-1970/1949/02/01/01/02/".
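The level-by-level accumulation walked through above can be sketched as a roll-up over path prefixes: each article's count is added to every ancestor level of its path. The function name `roll_up`, the shortened paths, and the counts are illustrative assumptions.

```python
from collections import Counter


def roll_up(leaf_counts):
    """Accumulate one character's per-article counts into every ancestor level."""
    totals = Counter()
    for path, n in leaf_counts.items():
        parts = path.split("/")
        for depth in range(1, len(parts) + 1):
            totals["/".join(parts[:depth])] += n
    return totals


# Hypothetical counts of one character per article (path: .../year/month/day/page/article).
leaf_counts = {
    "People's Daily/1949/02/01/01/02": 4,
    "People's Daily/1949/02/01/01/03": 1,
    "People's Daily/1949/02/01/02/01": 2,
    "People's Daily/1949/02/02/01/01": 3,
}
totals = roll_up(leaf_counts)
print(totals["People's Daily/1949/02/01/01"])  # page level
print(totals["People's Daily/1949/02/01"])     # issue level
print(totals["People's Daily/1949/02"])        # month level
```

Changing the statistical range then amounts to reading a different key, which matches the point that only the level of addition changes.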

Accumulating word frequencies level by level over a chosen statistical range is simple and fast: when the range changes, only the level at which the numbers are added or subtracted changes, and addition and subtraction are very quick. Carrying out per-document word frequency statistics at the same time as digitizing the original documents therefore greatly speeds up later word frequency statistics over any customizable range.

The traditional approach, by contrast, does not compute per-document word frequencies in advance. Only when statistics are needed does it scan each document one by one, count and record the word frequency after each scan, and then accumulate, which is clearly slow. If the statistical range changes, the next run has to rescan and recount, which is very time-consuming.

In the embodiments of the invention, the primary word frequency statistics of each single document are completed in advance, at the same time as that document is digitized. Afterwards, for any statistical condition, the primary word frequency data of the relevant documents are combined according to the attribute conditions, and simple arithmetic accumulation quickly yields the required full-range word frequency statistics. Compared with traditional word frequency statistical methods, statistical speed, efficiency, and accuracy are greatly improved. Furthermore, because the attribute records made during digitization are associated with the word frequency statistics, all original document information related to a statistical result can be located quickly from that result, providing a fast and convenient tracing function for character research.

The method of the embodiments of the invention can be widely applied in computer information processing, text processing, and the digital preservation and dissemination of cultural heritage.

Correspondingly, an embodiment of the invention also provides a word frequency grading statistical system; Fig. 3 is a structural diagram of this system.

In this embodiment, the word frequency grading statistical system comprises:

an extraction unit 301, configured to extract the attribute information of each original document;

a classification unit 302, configured to classify the original documents according to the attribute information, and to build document attribute tables for the different categories;

a digitizing unit 303, configured to digitize the original documents of each category one by one to generate digitized documents;

a primary statistics unit 304, configured to carry out, according to the attribute information of characters, primary word frequency statistics and character counting with each digitized document as the unit, and to save the statistical results to the electronic directory corresponding to the document attribute table and the digitized document;

a comprehensive statistics unit 305, configured to take the word frequency statistical record of each document as the basic statistical unit and carry out step-by-step merging word frequency statistics of characters within various statistical ranges.

The digitizing unit 303 is specifically configured to convert the images of the original documents of each category, one by one, into digitized documents that can be edited and searched.

The primary statistics unit 304 is specifically configured to carry out, according to the attribute information of characters, word frequency statistics and character counting for each character, with each document as the unit.

One embodiment of the comprehensive statistics unit 305 can comprise a first statistics subunit and a second statistics subunit (not shown), wherein:

the first statistics subunit is configured to carry out fast merging word frequency statistics by document attribute information, based on the content attribute information of the digitized documents; and/or

the second statistics subunit is configured to carry out fast merging word frequency statistics by character attribute information, based on the attribute information of characters.

For the detailed process of carrying out word frequency statistics with this system, refer to the description of the word frequency grading statistical method of the embodiment above, which is not repeated here.

The word frequency grading statistical system of the embodiment of the present invention completes the primary word frequency statistics of each individual document in advance, at the same time as that document is digitized. Afterwards, in combination with the attribute condition information, for any given statistical condition it merges the primary word frequency data of the relevant documents, and simple arithmetic accumulation is enough to quickly obtain the final word frequency statistics over the whole required range. Compared with traditional word frequency statistical methods, this greatly improves statistical speed, efficiency and accuracy. Furthermore, because the attribute records made in advance during digitization are associated with the word frequency statistics, all original document information relating to a word frequency result can be quickly located from that result, which provides a fast and convenient tracing function for research on the words.
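The accumulation and tracing described above can be sketched as follows, assuming three levels (document, category, whole collection) and an index from each character back to the documents that contain it. The level names, the data, and the variable names are hypothetical; the specification does not fix a particular hierarchy.

```python
from collections import Counter, defaultdict

# Level 1: primary per-document tables (hypothetical data).
doc_freq = {
    "d1": Counter({"天": 2, "地": 1}),
    "d2": Counter({"天": 1, "人": 3}),
    "d3": Counter({"人": 2}),
}
doc_category = {"d1": "scripture", "d2": "scripture", "d3": "letter"}

# Level 2: category totals are sums of document tables.
category_freq = defaultdict(Counter)
for doc_id, freq in doc_freq.items():
    category_freq[doc_category[doc_id]].update(freq)

# Level 3: the collection total is the sum of category totals, not a recount of text.
corpus_freq = sum(category_freq.values(), Counter())

# Tracing index: for each character, the documents in which it occurs.
sources = defaultdict(set)
for doc_id, freq in doc_freq.items():
    for ch in freq:
        sources[ch].add(doc_id)
```

Each higher level is built only by addition over the level below it, and `sources["人"]` leads from a frequency result back to documents `d2` and `d3`.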

The system of the embodiment of the present invention can be widely applied in the fields of computer information processing technology and text processing, and in the informatized protection and dissemination of cultural heritage.

The embodiments in this specification are described in a progressive manner; for identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, see the description of the method embodiment. The system embodiment described above is merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

The embodiments of the present invention have been described in detail above, and specific embodiments have been used herein to illustrate the invention; the description of the above embodiments is intended only to help in understanding the method and apparatus of the present invention. Meanwhile, for those of ordinary skill in the art, changes may be made to the specific implementations and the scope of application in accordance with the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A word frequency grading statistical method, characterized by comprising:

extracting the attribute information of each original document;

2. The method according to claim 1, characterized in that the attribute information comprises: file information and content information;

3. The method according to claim 1, characterized in that digitizing the original documents of each category one by one to generate digitized documents comprises:

4. The method according to claim 1, characterized in that the attribute information of the word comprises any one or more of the following: the glyph form of the word, Unicode encoding, stroke order, strokes, radical, and glyph structure.

5. The method according to claim 1, characterized in that performing primary word frequency statistics and word counting according to the attribute information of the words, taking each digitized document as a unit, comprises:

6. The method according to any one of claims 1 to 5, characterized in that performing step-by-step merging word frequency statistics on the words within various statistical ranges comprises:

7. A word frequency grading statistical system, characterized by comprising:

8. The system according to claim 7, characterized in that

the digitization unit is specifically configured to convert the images of the original documents of each category, one by one, into editable and searchable digitized documents.

9. The system according to claim 7, characterized in that

the initial statistics unit is specifically configured to perform, according to the attribute information of the words, word frequency statistics and word counting for each character, taking each document as a unit.

10. The system according to any one of claims 7 to 9, characterized in that the comprehensive statistics unit comprises: