CN104281603A - Word frequency grading statistical method and system - Google Patents

Info

Publication number
CN104281603A
Authority
CN
China
Prior art keywords
word
document
word frequency
attribute information
information
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310282492.3A
Other languages
Chinese (zh)
Other versions
CN104281603B (en)
Inventor
高玉军
刘昉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Founder Information Industry Holdings Co Ltd
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Original Assignee
Founder Information Industry Holdings Co Ltd
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Application filed by Founder Information Industry Holdings Co Ltd, Peking University Founder Group Co Ltd, Beijing Founder Electronics Co Ltd
Priority to CN201310282492.3A
Publication of CN104281603A
Application granted
Publication of CN104281603B
Legal status: Expired - Fee Related

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/93: Document management systems

Abstract

The invention relates to the technical field of computer information processing, and discloses a word frequency grading statistical method and system. The method comprises the steps of extracting attribute information of each original document; classifying the original documents according to the attribute information, and building a document attribute table of different categories; digitizing the original documents of the different categories in a one-by-one mode, and generating digitized documents; carrying out primary word frequency statistics and word statistics by taking the digitized documents as units according to the attribute information of words, and storing a statistical result to an electronic directory corresponding to the document attribute table and the digitized documents; carrying out step-by-step merging type word frequency statistics on the words within various statistical ranges by taking the word frequency statistical record of each document as a foundation statistical unit. According to the word frequency grading statistical method and system, the statistical speed, efficiency and the statistical accuracy can be greatly improved.

Description

Word frequency grading statistical method and system
Technical field
The present invention relates to the technical field of computer information processing, and in particular to a word frequency grading statistical method and system.
Background
The invention of writing is an important milestone of human civilization, and writing is the main means by which a nation's traditions and culture are passed on.
Chinese writing has a long history and has taken many forms. Over thousands of years, from pictographic symbols and picture writing to oracle bone script, bronze inscriptions, large seal script, small seal script, clerical script (lishu), cursive script, running script, and regular script, the form of Chinese characters has changed, yet the tradition has remained continuous and has faithfully recorded the course of Chinese civilization.
Ancient and modern character dictionaries, the concrete carriers of written characters, exist in huge numbers. From the Eastern Han Dynasty's "Origin of Chinese Characters" to the contemporary "Chinese Big Dictionary", the many character dictionaries and rhyme dictionaries record a large number of characters together with their form, pronunciation, and meaning attributes, and they are the basic resources of a character research platform. Once these vast character resources have been digitized, the ability to carry out word frequency statistics efficiently over any statistical range and at any statistical level will greatly advance research on the scripts of China's ethnic groups and, in turn, accelerate the internationalization and standardization of Chinese character processing.
Existing word frequency statistical methods usually first collect source material from ancient and modern character dictionaries, digitize it, and build basic character-set databases. These basic databases include an ancient and modern character dictionary resource library, an ancient script attribute library, a modern Chinese character attribute library, a minority-language script attribute library, and so on. Then, taking each single character as the unit, the statistics traverse every digitized resource database character by character. With large volumes of data this approach is inefficient and slow: statistics over millions or tens of millions of records usually involve a long wait. Even when optimizations are applied on the large database server, the speed of on-demand statistics remains unsatisfactory.
Summary of the invention
The invention provides a word frequency grading statistical method and system to improve statistical speed and accuracy.
To this end, the invention provides the following technical solutions:
A word frequency grading statistical method, comprising:
extracting the attribute information of each original document;
classifying the original documents according to the attribute information, and building document attribute tables for the different categories;
digitizing the original documents of each category one by one to generate digitized documents;
carrying out primary word frequency statistics and word count statistics, with each digitized document as the unit, according to the attribute information of the words, and saving the statistical results under the electronic directory corresponding to the document attribute table and the digitized document;
taking the word frequency statistical record of each document as the basic statistical unit, carrying out step-by-step merging word frequency statistics of words within various statistical ranges.
Preferably, the attribute information comprises file information and content information;
the features of the file information comprise: document time information and file name;
the features of the content information comprise: classification information, classification number, author, dynasty information, font type information, excavation information, publication information, and sample name.
Preferably, digitizing the original documents of each category one by one to generate digitized documents comprises:
converting the images of the original documents of each category, one by one, into digitized documents that can be edited and searched.
Preferably, the attribute information of a word comprises any one or more of the following: the glyph of the word, its Unicode code, stroke order, strokes, radical, and glyph structure.
Preferably, carrying out primary word frequency statistics and word count statistics, with each digitized document as the unit, according to the attribute information of the words comprises:
carrying out, according to the attribute information of the words and with each document as the unit, word frequency statistics and word count statistics for each character.
Preferably, carrying out step-by-step merging word frequency statistics of words within various statistical ranges comprises:
carrying out fast merging word frequency statistics by document attribute information, based on the content attribute information of the digitized documents; and/or
carrying out fast merging word frequency statistics by word attribute information, based on the attribute information of the words.
A word frequency grading statistical system, comprising:
an extraction unit, configured to extract the attribute information of each original document;
a classification unit, configured to classify the original documents according to the attribute information and to build document attribute tables for the different categories;
a digitization unit, configured to digitize the original documents of each category one by one and generate digitized documents;
a primary statistics unit, configured to carry out primary word frequency statistics and word count statistics, with each digitized document as the unit, according to the attribute information of the words, and to save the statistical results under the electronic directory corresponding to the document attribute table and the digitized document;
a comprehensive statistics unit, configured to take the word frequency statistical record of each document as the basic statistical unit and carry out step-by-step merging word frequency statistics of words within various statistical ranges.
Preferably, the digitization unit is specifically configured to convert the images of the original documents of each category, one by one, into digitized documents that can be edited and searched.
Preferably, the primary statistics unit is specifically configured to carry out, according to the attribute information of the words and with each document as the unit, word frequency statistics and word count statistics for each character.
Preferably, the comprehensive statistics unit comprises:
a first statistics subunit, configured to carry out fast merging word frequency statistics by document attribute information, based on the content attribute information of the digitized documents; and/or
a second statistics subunit, configured to carry out fast merging word frequency statistics by word attribute information, based on the attribute information of the words.
In the word frequency grading statistical method and system provided by the embodiments of the invention, the primary word frequency statistics of each single document are completed in advance, at the same time as that document is digitized. Afterwards, for any statistical condition, the primary word frequency data of the relevant documents are combined according to the attribute condition information, and a simple arithmetic accumulation quickly yields the word frequency statistics for the full range required. Compared with traditional word frequency statistical methods, statistical speed, efficiency, and accuracy are greatly improved. Furthermore, because the attribute records made during digitization are associated with the word frequency statistics, a word frequency result can be traced quickly back to all the original document information it involves, which gives character research a fast and convenient tracing function.
Brief description of the drawings
In order to illustrate the technical solutions of the embodiments of the present application or of the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. The drawings described below are only some embodiments of the invention, and a person of ordinary skill in the art can obtain other drawings from them.
Fig. 1 is a flowchart of the word frequency grading statistical method of an embodiment of the invention;
Fig. 2 is a schematic diagram of the word frequency statistical data of a single document in an embodiment of the invention;
Fig. 3 is a structural diagram of the word frequency grading statistical system of an embodiment of the invention.
Detailed description of the embodiments
To help those skilled in the art better understand the solutions of the embodiments of the invention, the embodiments are described in further detail below with reference to the drawings.
To address the low efficiency and slow speed of word frequency statistics when compiling very large ancient and modern character dictionary resources, the embodiments of the invention provide a word frequency grading statistical method and system that quickly carry out merging word frequency statistics of words within any statistical range of those resources. Statistical speed and accuracy are greatly improved, and every word frequency result can easily be traced back and quickly located in the original documents.
Fig. 1 is a flowchart of the word frequency grading statistical method of an embodiment of the invention, which comprises the following steps:
Step 101: extract the attribute information of each original document.
The original document refers to the text that corresponds exactly to the characters in a document image. Original documents come from all kinds of document samples, which include, but are not limited to, samples of ancient and modern character dictionary resources containing large numbers of characters: samples of source materials such as ancient books, rare editions, and unearthed cultural relics, as well as samples of modern printed publications. All of these samples need to be digitized and entered as images together with the corresponding modern text.
The attribute information of the original document comprises file information and content information. The file information features include document time information, file name, and so on. The content information features include classification information, classification number, author, dynasty information (from ancient times to the present), font type information (such as Song typeface, regular script, clerical script, small seal script, or oracle bone script), excavation information, publication information, sample name, and so on.
For example, for a Republican-era newspaper, the document attribute information should include general information such as the newspaper date, page, typeface used, headline text, body text, and news category table.
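The attribute information extracted in step 101 can be pictured as a small record type. The following is only a minimal sketch: the class names, field names, and the `extra` dictionary are assumptions chosen for illustration, and the patent does not prescribe any storage layout.

```python
from dataclasses import dataclass, field


@dataclass
class FileInfo:
    file_name: str
    document_time: str  # document time information, free-form text here


@dataclass
class ContentInfo:
    classification: str = ""
    classification_number: str = ""
    author: str = ""
    dynasty: str = ""
    font_type: str = ""
    excavation_info: str = ""
    publication_info: str = ""
    sample_name: str = ""


@dataclass
class OriginalDocumentAttributes:
    file_info: FileInfo
    content_info: ContentInfo
    # Sample-specific extras such as newspaper page or article number
    # (hypothetical field, not named in the source text).
    extra: dict = field(default_factory=dict)


attrs = OriginalDocumentAttributes(
    FileInfo(file_name="194902010102.TXT", document_time="1949-02-01"),
    ContentInfo(font_type="regular script", sample_name="People's Daily"),
    extra={"page": 1, "article": 2},
)
print(attrs.content_info.font_type)
```

Keeping file information and content information in separate records mirrors the two groups of features listed above; unused content fields simply stay empty.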
Step 102: classify the original documents according to the attribute information, and build document attribute tables for the different categories.
The content of the document attribute table usually indicates the content attribute features of the document.
Which series an original document belongs to has to be determined by classification. In terms of script structure, the first level can be divided into ancient Chinese characters, modern Chinese characters, calligraphic fonts, and so on. The second level is the subcategory under each first-level node: for example, ancient characters can be further divided into bronze inscriptions, regular script, small seal script, and so on, and modern characters into regular script, clerical script, cursive script, and so on. The third level subdivides the second level further: for example, regular script can be divided into newspapers (the "report class"), stone inscriptions, periodicals, and so on. The fourth, fifth, and further levels are added in the same way as needed.
Taking the People's Daily of 1949 as an example, its category is "modern Chinese character/regular script/report class/People's Daily/1946-1970". On this basis, "194902010102 word frequency" is set up, denoting the 2nd news article on page 1 of the newspaper of February 1, 1949. Combining the two gives "modern Chinese character/regular script/report class/People's Daily/1946-1970/1949/02/01/01/02", which denotes the 2nd news article on page 1 of the February 1, 1949 issue, within the 1946-1970 range of the People's Daily, under the report class of the regular script category of modern Chinese characters. This is the document attribute table that is established.
The attributes of the original document correspond to the document attributes; the two may be identical.
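The naming scheme in this example (date, page, and article number concatenated, then appended to the category path) can be sketched as follows. The function names are hypothetical, and the two-digit zero padding is inferred from the single example "194902010102" rather than stated as a rule in the source.

```python
from datetime import date


def document_id(day: date, page: int, article: int) -> str:
    """Concatenate date, page and article number, e.g. 194902010102."""
    return f"{day:%Y%m%d}{page:02d}{article:02d}"


def attribute_path(category: str, day: date, page: int, article: int) -> str:
    """Extend a category path down to one article of one page of one issue."""
    return f"{category}/{day:%Y/%m/%d}/{page:02d}/{article:02d}"


CATEGORY = "modern Chinese character/regular script/report class/People's Daily/1946-1970"
issue = date(1949, 2, 1)
doc_id = document_id(issue, page=1, article=2)
path = attribute_path(CATEGORY, issue, page=1, article=2)
print(doc_id)
print(path)
```

Because every level of the classification is a path segment, any prefix of `path` names a statistical range (a page, a day, a month, a category) that later merging steps can address directly.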
Step 103: digitize the original documents of each category one by one to generate digitized documents.
Original documents come from all kinds of document samples, and before these samples are digitized there is no corresponding electronic document. Only after the character sequence has been entered from the document photograph or the physical document can the corresponding digitized document be formed. In other words, digitizing an original document mainly means converting the image of the original document into a digitized document that can be edited and searched: the document is scanned, Chinese character recognition is performed, and the original text is entered into the computer to form the digitized document. A character that does not exist in the character set or cannot be recognized can be marked with a unique number. The unique number records the association with that character so that the original glyph image can be traced.
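The handling of unrecognized characters can be sketched as below. This is an illustration under stated assumptions: recognition itself is out of scope, the input is assumed to be a sequence of (character or `None`, glyph image reference) pairs, and the `[U0001]` tag format is invented for the example.

```python
from itertools import count


def digitize(recognized, start=1):
    """Turn recognition output into text tokens plus a table of unknown glyphs.

    recognized: iterable of (char_or_None, glyph_image_ref) pairs, where None
    means the character is missing from the character set or unreadable.
    """
    ids = count(start)
    tokens, unknown = [], {}
    for ch, image_ref in recognized:
        if ch is None:
            tag = f"[U{next(ids):04d}]"   # unique number for this glyph
            unknown[tag] = image_ref      # keeps the link to the original image
            tokens.append(tag)
        else:
            tokens.append(ch)
    return tokens, unknown


tokens, unknown = digitize([("分", "g1.png"), (None, "g2.png"), ("田", "g3.png")])
print(tokens, unknown)
```

The `unknown` table is what makes tracing possible: a tag met later in any statistic leads straight back to the stored glyph image.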
Step 104: carry out primary word frequency statistics and word count statistics, with each digitized document as the unit, according to the attribute information of the words, and save the statistical results under the electronic directory corresponding to the document attribute table and the digitized document.
The attribute information of a word can include any one or more of the following: the glyph of the word, its Unicode code, stroke order, strokes, radical, glyph structure, and so on.
For each digitized document generated in step 103, a corresponding electronic directory is set up according to the document attribute table, for example the electronic directory "modern Chinese character/regular script/report class/People's Daily/1946-1970/". Under this directory, the corresponding digitized document is named "194902010102.TXT" and its word frequency file is named "194902010102 word frequency.TXT"; both are stored under the electronic directory "modern Chinese character/regular script/report class/People's Daily/1946-1970/". "194902010102.TXT" holds all the words in the body of the 2nd article on page 1 of the original newspaper of February 1, 1949, obtained by manual entry. "194902010102 word frequency.TXT" records the word frequency of each word in the body of that article, that is, how many times and with what frequency each word occurs in the body of the 2nd article on page 1 of the newspaper of February 1, 1949. This completes the word frequency statistics for every character contained in the most basic unit. For example, how many times and with what frequency the word "divide" occurs in the body of that article is recorded.
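A minimal sketch of saving the text file and its word frequency file side by side in the category directory is shown below. The JSON layout of the frequency file, the function name, and the shortened category path are assumptions; the source only states that both files live under the same electronic directory.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def save_primary_statistics(root: Path, category: str, doc_id: str, text: str) -> Path:
    """Write '<doc_id>.TXT' and '<doc_id> word frequency.TXT' under root/category."""
    directory = root / category
    directory.mkdir(parents=True, exist_ok=True)
    (directory / f"{doc_id}.TXT").write_text(text, encoding="utf-8")
    counts = Counter(ch for ch in text if not ch.isspace())
    record = {"total": sum(counts.values()), "counts": dict(counts)}
    freq_file = directory / f"{doc_id} word frequency.TXT"
    freq_file.write_text(json.dumps(record, ensure_ascii=False), encoding="utf-8")
    return freq_file


with tempfile.TemporaryDirectory() as tmp:
    freq_path = save_primary_statistics(
        Path(tmp), "modern Chinese character/regular script", "194902010102", "分田分地"
    )
    saved = json.loads(freq_path.read_text(encoding="utf-8"))
    file_name = freq_path.name
print(file_name, saved)
```

Storing raw counts together with the document total, rather than only a rate, is a design choice of this sketch: it lets later merging steps add numbers without reopening the text.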
The attribute information of a word can include its glyph, Unicode code, stroke order, strokes, radical, glyph structure, and its position coordinates (X, Y) in the source document. Word frequency can be counted by glyph and Unicode code, or by information such as stroke order and strokes. For example, one can count how often a given word, or the word with a given Unicode code, occurs, and one can also count how often words with a given stroke order or stroke count occur. The position coordinates (X, Y) record exactly where the word appears in the source document, so that its original position can be located quickly later for research and analysis. In addition, every item of word attribute information can serve as a query criterion.
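Counting by a word attribute instead of by the character itself, and recording positions, could look like the sketch below. The `STROKES` table is a tiny hand-made stand-in for a character attribute library, and treating (X, Y) as (column, row) within the entered text is an assumption; the source does not define the coordinate system.

```python
from collections import Counter, defaultdict

# Hypothetical attribute table: stroke counts for three sample characters.
STROKES = {"分": 4, "田": 5, "地": 6}


def count_by_attribute(counts, attribute):
    """Regroup per-character counts under an attribute value (e.g. stroke count)."""
    merged = Counter()
    for ch, n in counts.items():
        merged[attribute.get(ch, "unknown")] += n
    return merged


def positions(lines):
    """Record (x, y) = (column, row) for every character occurrence."""
    where = defaultdict(list)
    for y, line in enumerate(lines):
        for x, ch in enumerate(line):
            where[ch].append((x, y))
    return where


by_strokes = count_by_attribute(Counter("分田分地"), STROKES)
where = positions(["分田", "分地"])
print(dict(by_strokes), where["分"])
```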
In practice, during the digitization of the original documents, word frequency statistics and word count statistics for each character can be carried out, according to the attribute information of the words, with each document as the unit. Word count statistics means counting, with each document as the unit, the total number of words in that document. Word frequency statistics means dividing the number of occurrences of a given word in the document by the total number of words in the document, that is, word frequency = occurrences of the current word in the document / total number of all words in the document. The result can be expressed in per mille (‰), accurate to three decimal places.
Specifically, the full text of a single digitized document is scanned, and the count and word frequency are computed one by one for each Chinese character in that document, finally giving the word count and word frequency results for a single digitized document shown in Fig. 2.
As shown in Fig. 2, the file name of this digitized document is "194902010102 word frequency", denoting the 2nd news article on page 1 of the newspaper of February 1, 1949. The total word count is 31, and the word "divide" occurs 4 times in the body of this article, so its word frequency in this news article is 4/31 × 1000 ≈ 129‰. In other words, the statistical unit in the method of Fig. 1 is one news item, and one news item is one article.
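The per-document calculation can be sketched as follows. The function name is hypothetical, whitespace is ignored as an assumption, and the 31-character sample text is synthetic: it only reproduces the totals of the example (31 characters, one of them occurring 4 times), with "分" chosen arbitrarily as the sample character.

```python
from collections import Counter


def primary_statistics(text: str):
    """Return (total characters, per-character counts, per-mille frequencies)."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    per_mille = {ch: round(n * 1000 / total, 3) for ch, n in counts.items()}
    return total, counts, per_mille


# Synthetic article: 4 occurrences of one character among 31 characters.
sample = "分" * 4 + "一" * 27
total, counts, per_mille = primary_statistics(sample)
print(total, counts["分"], per_mille["分"])  # 31 4 129.032
```

Rounding to three decimal places gives 129.032‰, which matches the figure of about 129‰ quoted for the example.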
This continues until the primary word frequencies, word counts, and position information of all digitized documents in all categories have been computed and recorded.
Taking Fig. 2 as an example, all news articles on all pages of all newspapers are counted in turn, giving records such as the following:
"194902010101 word frequency" denotes the word frequency statistics for the body of the 1st article on page 1 of the newspaper of February 1, 1949;
"194902010103 word frequency" denotes the word frequency statistics for the body of the 3rd article on page 1 of the newspaper of February 1, 1949;
"194902010104 word frequency" denotes the word frequency statistics for the body of the 4th article on page 1 of the newspaper of February 1, 1949;
"194902010201 word frequency" denotes the word frequency statistics for the body of the 1st article on page 2 of the newspaper of February 1, 1949;
"194902020101 word frequency" denotes the word frequency statistics for the body of the 1st article on page 1 of the newspaper of February 2, 1949;
and so on.
Step 105: taking the word frequency statistical record of each document as the basic statistical unit, carry out step-by-step merging word frequency statistics of words within various statistical ranges.
Building on the primary word frequency statistics and word count statistics made per digitized document, fast merging word frequency statistics by document attribute information can be carried out using the content attribute information of each digitized document, and fast merging word frequency statistics by word attribute information can be carried out using the attribute information of the words.
Merging statistics can be a vertical merge, accumulating the counts of a character over time, from top to bottom and from ancient times to the present, or a horizontal merge, accumulating the counts of a character across the different applications within a given era. Cumulative statistics here means simple addition and subtraction of numbers: in general, accumulating the word frequency of a given word over the individual documents completes the word frequency accumulation.
For example, "modern Chinese character/regular script/report class/People's Daily/1946-1970/1949/02/01/01/02" denotes the content attribute of the body of the 2nd article on page 1 of the newspaper of February 1, 1949. On this basis, fast merging word frequency statistics by document attribute information can be carried out by merging the word frequency data of multiple records like "194902010102 word frequency". Specifically, the word frequencies can be merged and accumulated vertically, level by level upward, through all the classification levels of "modern Chinese character/regular script/report class/People's Daily/1946-1970/1949/02/01/01/02", or accumulated horizontally across the categories of a given classification level.
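Merging by document attribute can be sketched as a prefix match on the attribute path followed by addition. In this sketch the records are assumed to hold raw per-document counts keyed by attribute path, the paths are shortened for readability, and the sample numbers are invented.

```python
from collections import Counter


def merge_by_prefix(records, prefix):
    """Add up the per-document counts of every record under an attribute path."""
    base = prefix.rstrip("/")
    merged = Counter()
    for path, counts in records.items():
        if path == base or path.startswith(base + "/"):
            merged.update(counts)  # Counter.update adds counts together
    return merged


# Hypothetical per-article records with shortened paths.
records = {
    "People's Daily/1949/02/01/01/01": Counter({"分": 1, "田": 2}),
    "People's Daily/1949/02/01/01/02": Counter({"分": 4}),
    "People's Daily/1949/02/02/01/01": Counter({"分": 3}),
}
page_total = merge_by_prefix(records, "People's Daily/1949/02/01/01")
month_total = merge_by_prefix(records, "People's Daily/1949/02")
print(page_total["分"], month_total["分"])
```

Changing the statistical range only changes the prefix; no document text is read again.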
As another example, the attribute information of a given word (such as its glyph or its unique Unicode code) can be used to carry out fast merging word frequency statistics based on that attribute information. Such merging statistics include vertically merged statistics of a character over time, from top to bottom and from ancient times to the present, and horizontal statistics of a character across the different applications within a given era: for example, the cumulative word frequency of the word "divide" in all documents, its cumulative word frequency over multiple records similar to "194902010102 word frequency", its word frequency in the single document "194902010102 word frequency", and its cumulative word frequency in documents of several different categories.
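A merged frequency for one character across several documents can be sketched as below. Summing raw occurrences and document totals before dividing is an assumption of this sketch, chosen so that the merged per-mille value stays well defined; the source itself only speaks of simple accumulation. The sample figures are invented.

```python
from typing import Mapping, Sequence, Tuple


def merged_frequency(char: str, records: Sequence[Tuple[int, Mapping[str, int]]]):
    """records: one (total_characters, counts) pair per document.

    Returns (accumulated occurrences, merged frequency in per mille).
    """
    hits = sum(counts.get(char, 0) for _, counts in records)
    total = sum(doc_total for doc_total, _ in records)
    rate = round(hits * 1000 / total, 3) if total else 0.0
    return hits, rate


# Three hypothetical documents, possibly from different categories.
docs = [(31, {"分": 4}), (19, {"分": 1}), (50, {"田": 2})]
hits, rate = merged_frequency("分", docs)
print(hits, rate)  # 5 50.0
```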
Conversely, starting from a word frequency result, and because the attribute records made during digitization are associated with the word frequency statistics, all the original document information involved in that result can be located quickly, which gives character research a fast and convenient tracing function.
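The tracing direction can be sketched as an inverted index from each character to the attribute paths of the documents that contain it. The index structure and the shortened paths are assumptions for illustration; the source only states that results can be traced back through the recorded attributes.

```python
from collections import defaultdict


def build_trace_index(records):
    """Map every character to the set of document paths in which it occurs."""
    index = defaultdict(set)
    for path, counts in records.items():
        for ch in counts:
            index[ch].add(path)
    return index


# Hypothetical per-article counts keyed by shortened attribute paths.
records = {
    "People's Daily/1949/02/01/01/01": {"分": 1, "田": 2},
    "People's Daily/1949/02/01/01/02": {"分": 4},
}
index = build_trace_index(records)
print(sorted(index["分"]))
```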
Take the news statistics of the newspaper shown in Fig. 2 as an example. The primary statistics, that is, the word frequency statistics for the 2nd news article on page 1 of this newspaper of February 1, 1949, already contain the word frequency of the word "divide" in the single document "194902010102 word frequency".
On this basis, the word frequency of "divide" in all news articles on page 1 of this newspaper of February 1, 1949 can be obtained by direct level-by-level accumulation.
One level up, the word frequency of "divide" on all pages of this newspaper of February 1, 1949 can be obtained by direct accumulation.
One level up again, the word frequency of "divide" in all issues of this newspaper in February 1949 can be obtained by direct accumulation.
One level up again, the word frequency of "divide" in all issues of this newspaper in 1949 can be obtained by direct accumulation.
One level up again, the word frequency of "divide" in the whole Republican-era newspaper category can be obtained by direct accumulation.
One level up again, the word frequency of "divide" in the whole newspaper category (report class) can be obtained by direct accumulation.
One level up again, the word frequency of "divide" in the whole regular script category can be obtained by direct accumulation.
One level up again, the word frequency of "divide" in the whole modern Chinese character category can be obtained by direct accumulation.
One level up again, at the top level, the word frequency of "divide" in all documents (including all categories such as modern Chinese characters, ancient Chinese characters, and brush-written characters) can be obtained by direct accumulation.
That is, the levels follow the classification categories determined earlier: "modern Chinese character/regular script/report class/People's Daily/1946-1970/1949/02/01/01/02/".
Accumulating word frequencies level by level over a chosen statistical range is simple and fast: when the range changes, only the level at which the numbers are added or subtracted changes, and addition and subtraction are very quick. Carrying out the word frequency statistics for each single document while the original documents are being digitized therefore greatly speeds up later word frequency statistics over any customizable range.
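The level-by-level accumulation can also be precomputed for every level in one pass, as in the sketch below. This is one possible arrangement rather than the patent's stated implementation; the function name, the shortened paths, and the sample counts are assumptions.

```python
from collections import Counter, defaultdict


def roll_up(records):
    """Add each document's counts to every ancestor level of its attribute path."""
    levels = defaultdict(Counter)
    for path, counts in records.items():
        parts = path.split("/")
        for depth in range(1, len(parts) + 1):
            levels["/".join(parts[:depth])].update(counts)
    return levels


# Hypothetical records: newspaper/year/month/day/page/article.
records = {
    "PD/1949/02/01/01/02": {"分": 4},
    "PD/1949/02/01/01/03": {"分": 1},
    "PD/1949/02/02/01/01": {"分": 2},
}
levels = roll_up(records)
print(levels["PD/1949/02/01/01"]["分"], levels["PD/1949/02"]["分"], levels["PD"]["分"])
```

Each document is visited once and contributes one addition per level, so a query for any range becomes a single lookup in `levels`.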
The traditional approach, by contrast, does not compute the word frequency statistics of single documents in advance. Only when statistics are needed does it scan the documents one by one, count and record the word frequencies after each file is scanned, and then accumulate them, which is clearly slow. If the statistical range changes, the next round of statistics has to rescan everything, which is very time-consuming.
In the embodiments of the invention, the primary word frequency statistics of each single document are completed in advance, at the same time as that document is digitized. Afterwards, for any statistical condition, the primary word frequency data of the relevant documents are combined according to the attribute condition information, and a simple arithmetic accumulation quickly yields the word frequency statistics for the full range required. Compared with traditional word frequency statistical methods, statistical speed, efficiency, and accuracy are greatly improved. Furthermore, because the attribute records made during digitization are associated with the word frequency statistics, a word frequency result can be traced quickly back to all the original document information it involves, which gives character research a fast and convenient tracing function.
The method of the embodiments of the invention can be widely applied in computer information processing and text processing, and in the information-based protection and dissemination of cultural heritage.
Correspondingly, an embodiment of the present invention also provides a word frequency grading statistical system; Figure 3 is a schematic structural diagram of this system.
In this embodiment, the word frequency grading statistical system comprises:
Extraction unit 301, configured to extract the attribute information of each original document;
Classification unit 302, configured to classify the original documents according to the attribute information, and to establish document attribute tables for the different categories;
Digitization unit 303, configured to digitize the original documents of each category one by one to generate digitized documents;
Preliminary statistics unit 304, configured to perform, according to the attribute information of the characters, preliminary word frequency statistics and word counting in units of the digitized document, and to save the statistical results to the document attribute table, under the electronic directory corresponding to the digitized document;
Comprehensive statistics unit 305, configured to take the word frequency statistics record of each document as the basic statistical unit, and to perform stepwise combined word frequency statistics of characters over various statistical scopes.
The digitization unit 303 is specifically configured to convert the images of the original documents of each category, one by one, into editable and searchable digitized documents.
The preliminary statistics unit 304 is specifically configured to perform, according to the attribute information of the characters, word frequency statistics and word counting for each character in units of each document.
One implementation of the comprehensive statistics unit 305 may comprise a first statistics subunit and a second statistics subunit (not shown), wherein:
The first statistics subunit is configured to perform fast combined word frequency statistics by document attribute information, based on the content attribute information of the digitized documents; and/or
The second statistics subunit is configured to perform fast combined word frequency statistics by character attribute information, based on the attribute information of the characters.
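As a rough illustration of how the two statistics subunits might differ, the sketch below groups precomputed per-document counts by a document attribute (first subunit) and then rolls character counts up to a character attribute (second subunit). The function names, the record layout, and the small `CHAR_ATTRS` table are hypothetical and are not taken from the patent; a real system would presumably draw character attributes from a full character database.

```python
from collections import Counter, defaultdict

# Hypothetical attribute table for a few characters (radical and stroke count).
CHAR_ATTRS = {
    "江": {"radical": "水", "strokes": 6},
    "河": {"radical": "水", "strokes": 8},
    "山": {"radical": "山", "strokes": 3},
    "峰": {"radical": "山", "strokes": 10},
}

def by_document_attribute(records, key):
    """First subunit (sketch): one combined counter per value of a document attribute."""
    groups = defaultdict(Counter)
    for rec in records:
        groups[rec["attrs"][key]].update(rec["counts"])
    return dict(groups)

def by_character_attribute(counts, key):
    """Second subunit (sketch): roll character counts up to a character attribute."""
    totals = Counter()
    for ch, n in counts.items():
        attrs = CHAR_ATTRS.get(ch)
        if attrs is not None:  # characters missing from the table are skipped
            totals[attrs[key]] += n
    return totals

# Hypothetical per-document records with precomputed counts.
records = [
    {"attrs": {"font_type": "regular script"}, "counts": Counter({"江": 2, "山": 1})},
    {"attrs": {"font_type": "seal script"}, "counts": Counter({"河": 1, "峰": 3})},
    {"attrs": {"font_type": "regular script"}, "counts": Counter({"河": 4})},
]

per_font = by_document_attribute(records, "font_type")
per_radical = by_character_attribute(per_font["regular script"], "radical")
```

The two groupings compose: a scope is first selected by document attribute, and the resulting counter is then summarized by character attribute, which mirrors the "and/or" relationship between the subunits.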
For the detailed process of performing word frequency statistics with the word frequency grading statistical system, refer to the description of the word frequency grading statistical method in the embodiments above; it is not repeated here.
In the word frequency grading statistical system of the embodiment of the present invention, the preliminary word frequency statistics of each individual document are completed in advance, at the time that document is digitized. Afterwards, for any given statistical condition, the preliminary word frequency data of the relevant documents are combined according to the attribute condition information, and a simple arithmetic summation quickly yields the word frequency statistics for the full required scope. Compared with traditional word frequency statistical methods, this greatly improves the speed, efficiency and accuracy of the statistics. Furthermore, because the attribute records made in advance during digitization are associated with the word frequency statistics, all original documents involved in a word frequency result can be quickly located from that result, which provides a fast and convenient tracing function for research on characters.
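The tracing function mentioned above could, under simple assumptions, be served by an index from each character to the documents that contain it. The sketch below is one possible reading rather than the patent's own design; `build_trace_index`, `trace`, and the `doc_id` values are hypothetical names introduced for illustration.

```python
from collections import Counter, defaultdict

def build_trace_index(records):
    """Map each character to the documents it occurs in, with per-document counts."""
    index = defaultdict(dict)
    for rec in records:
        for ch, n in rec["counts"].items():
            index[ch][rec["doc_id"]] = n
    return index

def trace(index, ch):
    """Return (doc_id, count) pairs for ch, most frequent document first."""
    return sorted(index.get(ch, {}).items(), key=lambda item: (-item[1], item[0]))

# Hypothetical per-document records; counts are computed from short sample strings.
records = [
    {"doc_id": "doc-001", "counts": Counter("山山水")},
    {"doc_id": "doc-002", "counts": Counter("水月")},
    {"doc_id": "doc-003", "counts": Counter("山月月")},
]

index = build_trace_index(records)
hits = trace(index, "山")
```

Since the index is built from the same per-document records used for the statistics, a frequency result and its source documents stay consistent without re-reading the original text.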
The system of the embodiment of the present invention can be widely applied in the fields of computer information processing technology and text extraction, and in the field of informatized protection and dissemination of cultural heritage.
The embodiments in this specification are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiment is described relatively briefly because it is substantially similar to the method embodiment; for the relevant parts, refer to the description of the method embodiment. The system embodiment described above is merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement it without creative effort.
The embodiments of the present invention have been described in detail above, and specific examples have been used herein to explain the invention; the description of the above embodiments is only intended to help in understanding the method and apparatus of the present invention. Meanwhile, a person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A word frequency grading statistical method, characterized by comprising:
extracting the attribute information of each original document;
classifying the original documents according to the attribute information, and establishing document attribute tables for the different categories;
digitizing the original documents of each category one by one to generate digitized documents;
performing, according to the attribute information of the characters, preliminary word frequency statistics and word counting in units of the digitized document, and saving the statistical results to the document attribute table, under the electronic directory corresponding to the digitized document;
taking the word frequency statistics record of each document as the basic statistical unit, and performing stepwise combined word frequency statistics of characters over various statistical scopes.
2. The method according to claim 1, characterized in that the attribute information comprises file information and content information;
the features of the file information comprise: document time information and file name;
the features of the content information comprise: classification information, classification number, author, dynasty information, font type information, excavation information, publication information, and sample name.
3. The method according to claim 1, characterized in that digitizing the original documents of each category one by one to generate digitized documents comprises:
converting the images of the original documents of each category, one by one, into editable and searchable digitized documents.
4. The method according to claim 1, characterized in that the attribute information of the characters comprises any one or more of the following: the font of the character, Unicode code, stroke order, strokes, radical, and glyph structure.
5. The method according to claim 1, characterized in that performing, according to the attribute information of the characters, preliminary word frequency statistics and word counting in units of the digitized document comprises:
performing, according to the attribute information of the characters, word frequency statistics and word counting for each character in units of each document.
6. The method according to any one of claims 1 to 5, characterized in that performing stepwise combined word frequency statistics of characters over various statistical scopes comprises:
performing fast combined word frequency statistics by document attribute information, based on the content attribute information of the digitized documents; and/or
performing fast combined word frequency statistics by character attribute information, based on the attribute information of the characters.
7. A word frequency grading statistical system, characterized by comprising:
an extraction unit, configured to extract the attribute information of each original document;
a classification unit, configured to classify the original documents according to the attribute information, and to establish document attribute tables for the different categories;
a digitization unit, configured to digitize the original documents of each category one by one to generate digitized documents;
a preliminary statistics unit, configured to perform, according to the attribute information of the characters, preliminary word frequency statistics and word counting in units of the digitized document, and to save the statistical results to the document attribute table, under the electronic directory corresponding to the digitized document;
a comprehensive statistics unit, configured to take the word frequency statistics record of each document as the basic statistical unit, and to perform stepwise combined word frequency statistics of characters over various statistical scopes.
8. The system according to claim 7, characterized in that
the digitization unit is specifically configured to convert the images of the original documents of each category, one by one, into editable and searchable digitized documents.
9. The system according to claim 7, characterized in that
the preliminary statistics unit is specifically configured to perform, according to the attribute information of the characters, word frequency statistics and word counting for each character in units of each document.
10. The system according to any one of claims 7 to 9, characterized in that the comprehensive statistics unit comprises:
a first statistics subunit, configured to perform fast combined word frequency statistics by document attribute information, based on the content attribute information of the digitized documents; and/or
a second statistics subunit, configured to perform fast combined word frequency statistics by character attribute information, based on the attribute information of the characters.
CN201310282492.3A 2013-07-05 2013-07-05 Word frequency grading statistical method and system Expired - Fee Related CN104281603B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310282492.3A CN104281603B (en) 2013-07-05 2013-07-05 Word frequency grading statistical method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310282492.3A CN104281603B (en) 2013-07-05 2013-07-05 Word frequency grading statistical method and system

Publications (2)

Publication Number Publication Date
CN104281603A true CN104281603A (en) 2015-01-14
CN104281603B CN104281603B (en) 2018-01-19

Family

ID=52256479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310282492.3A Expired - Fee Related CN104281603B (en) 2013-07-05 2013-07-05 Word frequency grading statistical method and system

Country Status (1)

Country Link
CN (1) CN104281603B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1206158A (en) * 1997-07-02 1999-01-27 松下电器产业株式会社 Keyword extracting system and text retrieval system using the same
US20070112756A1 (en) * 2005-11-15 2007-05-17 Microsoft Corporation Information classification paradigm
CN101055581A (en) * 2006-04-13 2007-10-17 Lg电子株式会社 Document management system and method

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105608074A (en) * 2016-01-15 2016-05-25 中译语通科技(北京)有限公司 Word counting method and device
CN105608074B (en) * 2016-01-15 2018-06-29 中译语通科技股份有限公司 A kind of word counting method and device

Also Published As

Publication number Publication date
CN104281603B (en) 2018-01-19

Similar Documents

Publication Publication Date Title
CN110334346B (en) Information extraction method and device of PDF (Portable document Format) file
CN101770446B (en) Method and system for identifying form in layout file
JP4343213B2 (en) Document processing apparatus and document processing method
CN109933796B (en) Method and device for extracting key information of bulletin text
Singh et al. OCR++: a robust framework for information extraction from scholarly articles
CN110704570A (en) Continuous page layout document structured information extraction method
CN102270206A (en) Method and device for capturing valid web page contents
CN106502991B (en) Publication treating method and apparatus
Choudhary et al. A four-tier annotated Urdu handwritten text image dataset for multidisciplinary research on Urdu script
CN105488471A (en) Character pattern recognition method and device
CN110990539A (en) Manuscript internal duplicate checking method and device, storage medium and electronic equipment
Long An agent-based approach to table recognition and interpretation
JP2005043990A (en) Document processor and document processing method
CN105740355A (en) Aggregated text density based webpage body text extraction method and apparatus
JP2013016036A (en) Document component generation method and computer system
CN110162684B (en) Machine reading understanding data set construction and evaluation method based on deep learning
CN109902148B (en) Automatic enterprise name completion method for address book contacts
CN104281603B (en) Word frequency grading statistical method and system
CN106156121A (en) Copybook recommends method and copybook commending system
CN112784040B (en) Vertical industry text classification method based on corpus
CN107145947A (en) A kind of information processing method, device and electronic equipment
CN113806311A (en) Deep learning-based file classification method and device, electronic equipment and medium
Bataineh A Printed PAW Image Database of Arabic Language for Document Analysis and Recognition.
Kozlova et al. The methodological foundations of standardization in the field of library and information support of science
Arnold et al. Transforming Data Silos into Knowledge: Early Chinese Periodicals Online (ECPO)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180119

Termination date: 20190705
