CN100568225C - Words symbolization processing method and system for numeral and special symbol strings in text - Google Patents


Publication number
CN100568225C
Authority
CN
China
Prior art keywords
numeral
string
symbol string
words
symbolization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CNB2006101656333A
Other languages
Chinese (zh)
Other versions
CN101196881A (en
Inventor
郭庆
片江伸之
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Priority to CNB2006101656333A (patent CN100568225C)
Priority to JP2007318985A (patent JP5130892B2)
Publication of CN101196881A
Application granted
Publication of CN100568225C
Legal status: Active
Anticipated expiration

Abstract

The present invention provides a Words symbolization processing method and system for numeral and symbol strings in natural language text. The method comprises the following steps: inputting natural language text; extracting the numeral and symbol strings in the natural language text one by one; matching the current numeral and symbol string against pre-stored templates to obtain the template type to which it belongs; recording the template types and related information of historical numeral and symbol strings; and performing Words symbolization on the current numeral and symbol string according to its template type and the template types and related information of the historical numeral and symbol strings adjacent to it. The invention improves the accuracy and efficiency with which numerals and special symbols in text are recognized.

Description

Words symbolization processing method and system for numeral and special symbol strings in text
Technical field
The present invention relates to techniques for performing Words symbolization on numerals and special symbols of complex form in natural language text, and specifically to a Words symbolization processing method and system for numeral and special symbol strings in text.
Background art
In natural language text, numerals and special symbols (including foreign-language characters, for example English letters in Chinese text) occur widely and in large numbers as basic symbols of the natural language system. Taking Chinese as an example, in the 1998 People's Daily corpus about 25% of sentences contain numerals or special symbols. In the field of information processing, many applications related to natural language processing, such as natural language understanding, machine translation and speech synthesis, need to understand accurately the numeral and special symbol strings that may occur in natural language text and, on the basis of that understanding, to perform Words symbolization on them, that is, to convert the numerals or special symbols into equivalent words. In a speech synthesis system, after Words symbolization the words must also be converted from characters to sounds, and suitable word boundaries or higher-level prosodic word boundaries must be added according to the structure of the numeral or special symbol string, so that the synthesized speech sounds more natural. An effective Words symbolization processing system for numerals and special symbols is therefore essential in many applications related to natural language processing.
Because numerals and special symbols occur so widely in natural language text, many commonly used fixed formats have formed. A numeral has two possible pronunciations depending on its context or on the structure in which it is used: the numerical value pronunciation and the telegram (digit-by-digit) pronunciation. Taking Chinese as an example, when "130" describes a quantity, as in "this high-speed printer can print 130 pages of paper in one minute", it is read as "one hundred thirty", whereas in contexts such as "Hospital 130" or "Drilling Team 130" it is read as "one three zero". As another example, "70 years" taken as an independent syntactic unit is itself ambiguous: it can be read "seven zero year" (meaning the year 1970) or "seventy years" (meaning a time span). In such cases accurate Words symbolization often requires analysis at a wider and deeper level, for example paragraph- or discourse-level context analysis or semantic understanding.
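The two pronunciations described above can be illustrated with a minimal sketch. It uses English number words as a stand-in for the Chinese readings, and the function names `numerical_reading` and `telegram_reading` are hypothetical names chosen for this illustration; they do not appear in the patent. The numerical reading is limited to 0-999 to keep the example short.

```python
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty",
        "fifty", "sixty", "seventy", "eighty", "ninety"]


def numerical_reading(digits: str) -> str:
    """Read a digit string as a quantity (this sketch covers 0-999 only)."""
    n = int(digits)
    if not 0 <= n <= 999:
        raise ValueError("this sketch only covers 0-999")
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    head = ONES[hundreds] + " hundred"
    return head if rest == 0 else head + " " + numerical_reading(str(rest))


def telegram_reading(digits: str) -> str:
    """Read a digit string one digit at a time."""
    return " ".join(ONES[int(d)] for d in digits)
```

For the string "130", the first function gives the quantity reading and the second gives the digit-by-digit reading; choosing between them is the context-dependent part that the rest of the description addresses.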
The usage forms of special symbols are even more varied, and their Words symbolization raises two problems: the diversity of their usage, and the ambiguity to which that diversity can give rise. Taking Chinese as an example, "-", "/" and ":" are three commonly used symbols. Because they serve different functions and appear in many fixed formats, they are difficult for a computer to understand correctly. These special symbols often occur together with numerals and, furthermore, Chinese characters are sometimes interspersed among the special symbols and numerals, all of which combine into one large syntactic unit. Examples include "2000 yuan/month", "16th-19th", "3 months-6 months", "Boeing-747" and "phone: 65992238 65993388-1826,1828". The ambiguity of special symbols is also a problem that must be solved. For example, ":" plays different roles in the three sentences below: in example sentences 1 and 3 it should be symbolized as "to" (a ratio), and in example sentence 2 as "o'clock" (a time).
Example sentence 1: implement a 6:2:2 composite wage system
Example sentence 2: 19:30 on the evening of July 21
Example sentence 3: the scores by which she beat her opponent were 6:2, 5:7 and 7:5
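The ambiguity of ":" in these three sentences can be made concrete with a toy classifier. This is only a sketch under stated assumptions: the cue-word list `TIME_CUES`, the two-digit-minute test and the function name `classify_colon` are hypothetical heuristics invented for this illustration and are not rules given in the patent.

```python
# Hypothetical context cues that suggest a clock time (illustrative only).
TIME_CUES = ("morning", "afternoon", "evening", "night")


def classify_colon(sentence: str, expr: str) -> str:
    """Return 'time' or 'ratio' for a colon expression found in sentence."""
    fields = [f.strip() for f in expr.split(":")]
    if len(fields) == 2 and all(f.isdigit() for f in fields):
        hour, minute = int(fields[0]), int(fields[1])
        # A clock time has an hour below 24 and a two-digit minute below 60.
        clock_shaped = hour < 24 and minute < 60 and len(fields[1]) == 2
        has_cue = any(cue in sentence.lower() for cue in TIME_CUES)
        if clock_shaped and has_cue:
            return "time"
    # Three or more fields, or no supporting context: treat as a ratio.
    return "ratio"
```

The point of the sketch is that the decision needs information from outside the colon expression itself, which is why purely local matching is not enough.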
Many existing documents address the Words symbolization of numerals and special symbols of complex form in natural language text, in particular:
U.S. Patent 6,721,697 (Duan, Lei; Franz, Alexander; Horiguchi, Keiko; April 13, 2004; Method and system for reducing lexical ambiguity);
U.S. Patent 6,266,642 (Franz, Alexander M.; Horiguchi, Keiko; July 24, 2001; Method and portable apparatus for performing spoken language translation);
U.S. Patent 6,826,568 (Bernstein, Philip A.; Madhavan, Jayant; November 30, 2004; Methods and system for model matching);
U.S. Patent 5,930,756 (Mackie, Andrew William; Miller, Corey Andrew; Karaali, Orhan; June 23, 1997; Method, device and system for a memory-efficient random-access pronunciation lexicon for text-to-speech synthesis);
U.S. Patent 6,182,028 (Karaali, Orhan; Mackie, Andrew William; November 7, 1997; Method, device and system for part-of-speech disambiguation).
The disclosures of these documents are incorporated herein as prior art for the present application.
In general, a numeral and special symbol Words symbolization system uses contextual knowledge and is implemented by writing targeted rules for numeral and special symbol strings of different formats. Take "this high-speed printer can print 130 pages of paper in one minute": by examining the numeral string "130" and the measure word "pages" that follows it, the system recognizes "130 pages" as a "number + measure word" structure and applies the corresponding Words symbolization rule (numerical value pronunciation), yielding the result "this high-speed printer can print one hundred thirty pages of paper in one minute".
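A rule of the "number + measure word" kind can be sketched as a lookup on the token that follows the numeral. The measure-word set `MEASURE_WORDS` and the function name `choose_reading` are hypothetical and far smaller than a real rule base would be; the sketch only shows the shape of such a local rule.

```python
from typing import List

# Hypothetical, deliberately tiny set of measure words (illustrative only).
MEASURE_WORDS = {"pages", "sheets", "people", "yuan"}


def choose_reading(tokens: List[str], index: int) -> str:
    """Pick a reading mode for the numeral token at tokens[index].

    If the next token is a known measure word, the numeral is treated as a
    quantity (numerical value pronunciation); otherwise this sketch falls
    back to the telegram (digit-by-digit) pronunciation.
    """
    following = tokens[index + 1] if index + 1 < len(tokens) else ""
    if following.lower() in MEASURE_WORDS:
        return "numerical"
    return "telegram"
```

A rule like this looks only one token ahead, which is exactly the locality that the following paragraphs identify as a weakness.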
In the prior art, a numeral and special symbol Words symbolization system scans the input text sequentially, extracts the numeral and special symbol strings one by one, and then performs template matching on each. Such systems have two main shortcomings.
The first is that the scope examined is often smaller than the actual span of the semantic unit. In some cases a complete semantic understanding therefore cannot be reached, and sometimes the understanding is simply wrong, which yields an incorrect Words symbolization result. Earlier systems typically scan the input text sequentially; on finding a numeral or special symbol they immediately decide how it can be symbolized, and on encountering a special symbol they invoke the processing rule for that symbol. The weakness of this scanning mechanism is the locality of the scope it examines. The simple "number + measure word" case above is generally handled without difficulty, but for slightly more complex cases the scope examined is too isolated. For example, "9:30 on January 1, 1970" may be split into four regions, "1970", "January", "1st" and "9:30", which are processed separately. In fact these four regions form one complete semantic unit, which should be treated as one chunk in natural language understanding post-processing and as one prosodic phrase in speech synthesis post-processing. Worse, in some complex cases certain ambiguities, especially those of special symbols, cannot be resolved because the whole semantic unit cannot be examined globally. The three example sentences below can be symbolized correctly only on the basis of a global understanding.
Example sentence 1: Most people newly infected with the AIDS virus are young people aged 15 years old-24 years old.
Example sentence 2: Phone: 65992238 65993388-1826,1828.
Example sentence 3: The Suizhong 36-1 oil field is located in Liaodong Bay in the northern Bohai Sea.
The second is a further class of cases in which correct Words symbolization is likewise possible only on the basis of a global understanding; two example sentences follow. The meaning of "1996" in example sentence 4 (a year) can be determined only after the whole of "1996, 1997 two years" has been examined, so the numeral string "1996" should be symbolized using the telegram pronunciation. In other words, the Words symbolization rule for "1996" is inherited from that of "1997". This class of problem is therefore referred to here as the reverse inheritance problem. The use of reverse inheritance must of course be strictly restricted. In the two Chinese example sentences below, for instance, the enumeration comma (the Chinese pause mark) between the numerals is one of the preconditions for reverse inheritance.
Example sentence 4: In the two years 1996, 1997 alone, 10 were bred and survived.
Example sentence 5: According to data provided by the alkali-drainage directorate: grain and cotton output in 1985, 1986, 1987 increased year after year.
As can be seen, the prior art processes numeral and special symbol strings by sequentially scanning the input text, extracting the strings one by one and then performing template matching on each. It cannot examine a whole semantic unit globally, nor does it handle numeral and special symbol strings by reverse inheritance, and it therefore cannot resolve certain ambiguities in text, especially those of special symbols.
Summary of the invention
The object of the present invention is to provide a Words symbolization processing method and system for numeral and special symbol strings in text. Based on the compositional regularities of numerals and special symbols in text, the invention examines the context of the current numeral or special symbol string while the strings in the text are being template-matched one by one; that is, it also examines the template types of the numeral and special symbol strings that may be adjacent before and after it. For a numeral or special symbol string of complex form, the complete semantic unit to which it corresponds can thus be found and the accurate template for that semantic unit determined, so that the string can be symbolized accurately.
The invention provides a Words symbolization processing method for numeral and symbol strings in natural language text, the method comprising the following steps: inputting natural language text; extracting the numeral and symbol strings in the natural language text one by one; matching the current numeral and symbol string against pre-stored templates to obtain the template type to which it belongs; recording the template types and related information of historical numeral and symbol strings; and performing Words symbolization on the current numeral and symbol string according to its template type and the template types and related information of the historical numeral and symbol strings adjacent to it.
If a relevant context is found in the template types and related information of the historical numeral and symbol strings adjacent to the current numeral and symbol string, the current string and the adjacent historical strings are combined into one semantic unit, the template corresponding to that semantic unit is generated, and the numeral and symbol string information corresponding to that semantic unit is recorded.
Other required annotations are applied to the semantic unit.
The context includes the template type, the interval range, the Words symbolization rule and the like.
The template types and related information of the historical numeral and symbol strings are traversed. If a numeral and symbol string that has not yet been symbolized is found, it is determined whether a reverse-inherited Words symbolization rule should be applied to it; if so, the string is processed according to the reverse-inherited Words symbolization rule.
The new words added during Words symbolization are post-processed.
The symbols referred to are non-natural-language symbols.
The present invention also provides a Words symbolization processing system for numeral and symbol strings in natural language text, the system comprising: an input section for inputting natural language text; a numeral and symbol string extraction section for extracting the numeral and symbol strings in the natural language text one by one; a template matching section for matching the current numeral and symbol string against pre-stored templates to obtain the template type to which it belongs; a history recording section for recording the template types and related information of historical numeral and symbol strings; and a Words symbolization rule generation section for generating the Words symbolization rule of the current numeral and symbol string, and thereby symbolizing it, according to its template type and the template types and related information of the historical numeral and symbol strings adjacent to it.
The system further comprises: a context investigation section for examining the template types and related information of the historical numeral and symbol strings adjacent to the current numeral and symbol string; and a semantic unit determination section which, if a relevant context is found in those template types and that related information, combines the current string and the adjacent historical strings into one semantic unit and generates the template corresponding to that semantic unit. The history recording section records the numeral and symbol string information corresponding to that semantic unit.
The system further comprises: a semantic unit annotation section for applying other required annotations to the semantic unit.
The system further comprises: a reverse inheritance section for traversing the template types and related information of the historical numeral and symbol strings and, if a numeral and symbol string that has not yet been symbolized is found, determining whether a reverse-inherited Words symbolization rule should be applied to it and, if so, processing the string according to that rule.
The system further comprises: a post-processing section for post-processing the new words added during Words symbolization.
The present invention also provides a Words symbolization processing program for numeral and symbol strings in natural language text, the program comprising: inputting natural language text; extracting the numeral and symbol strings in the natural language text one by one; matching the current numeral and symbol string against pre-stored templates to obtain the template type to which it belongs; recording the template types and related information of historical numeral and symbol strings; and performing Words symbolization on the current numeral and symbol string according to its template type and the template types and related information of the historical numeral and symbol strings adjacent to it.
The present invention also provides a readable storage medium storing a Words symbolization processing program for numeral and symbol strings in natural language text, the stored program comprising: inputting natural language text; extracting the numeral and symbol strings in the natural language text one by one; matching the current numeral and symbol string against pre-stored templates to obtain the template type to which it belongs; recording the template types and related information of historical numeral and symbol strings; and performing Words symbolization on the current numeral and symbol string according to its template type and the template types and related information of the historical numeral and symbol strings adjacent to it.
The beneficial effects of the present invention are as follows. Based on the compositional regularities of numerals and special symbols in text, the invention examines the context of the current numeral or special symbol string while the strings in the text are being template-matched one by one; that is, it also examines the template types of the numeral and special symbol strings that may be adjacent before and after it. For a numeral or special symbol string of complex form, the complete semantic unit to which it corresponds can thus be found and the accurate template for that semantic unit determined, so that the string can be symbolized accurately. At the same time, other required annotations can be applied to the large semantic unit. For example, the current large semantic unit can be marked as one chunk in natural language understanding post-processing, or as one prosodic phrase in speech synthesis post-processing, which may in turn involve division into prosodic words and so on. Finally, the method provides a reverse inheritance mechanism. The accuracy and efficiency with which numerals and special symbols in text are recognized are thereby improved.
Description of drawings
Fig. 1 is a structural block diagram of the system of the present invention;
Fig. 2 is a flow block diagram of a system embodiment of the present invention;
Fig. 3 is a flow block diagram of the association processing of the present invention;
Fig. 4 is a flow block diagram of the reverse inheritance of the present invention;
Fig. 5 is a flow block diagram of Words symbolization processing in a specific embodiment of the invention;
Fig. 6 is a flow block diagram of reverse inheritance in Words symbolization processing in a specific embodiment of the invention;
Fig. 7 is a schematic diagram of the template context investigation knowledge base of a specific embodiment of the invention;
Fig. 8 is a schematic diagram of the numeral/special symbol string history database of a specific embodiment of the invention;
Fig. 9 is a schematic diagram of the other-annotation knowledge base of a specific embodiment of the invention;
Fig. 10 is a schematic diagram of the template Words symbolization rule knowledge base of a specific embodiment of the invention;
Fig. 11 is a schematic diagram of the template reverse inheritance knowledge base of a specific embodiment of the invention.
Embodiments
Specific embodiments of the present invention are described below with reference to the accompanying drawings.
As shown in Fig. 1, the present invention is a Words symbolization processing system for numeral and symbol strings in natural language text, the system comprising: an input section for inputting natural language text; a numeral and symbol string extraction section for extracting the numeral and symbol strings in the natural language text one by one; a template matching section for matching the current numeral and symbol string against pre-stored templates to obtain the template type to which it belongs; a history recording section for recording the template types and related information of historical numeral and symbol strings; and a Words symbolization rule generation section for generating the Words symbolization rule of the current numeral and symbol string, and thereby symbolizing it, according to its template type and the template types and related information of the historical numeral and symbol strings adjacent to it.
The system of the present invention can be implemented on a computer, a server, or a network composed of servers and terminals. The input section may be a keyboard, a mouse, voice input, a communication interface, or any combination of these; output may be through a screen, a printer, a communication interface, voice, or any combination of these.
The basic principle of an embodiment of the Words symbolization processing system for numerals and special symbols in natural language text is shown in Fig. 2. In this system, module 101 is the input, which may be arbitrary text.
The text preprocessing section (module 102) normalizes the input text. This includes processing punctuation marks, usage symbols and text in other languages, and normalizing the character encoding format (in Chinese applications, converting full-width characters to half-width characters or vice versa).
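The full-width to half-width normalization mentioned above can be sketched as follows. It relies on the fact that the Unicode full-width forms U+FF01 to U+FF5E are offset from their ASCII counterparts by 0xFEE0, and that the ideographic space is U+3000. The function name `to_halfwidth` is a hypothetical name for this illustration, and the patent does not specify how its preprocessing is implemented.

```python
def to_halfwidth(text: str) -> str:
    """Convert full-width ASCII variants and the ideographic space to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            # Ideographic (full-width) space becomes an ordinary space.
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            # Full-width '!' .. '~' sit exactly 0xFEE0 above ASCII '!' .. '~'.
            out.append(chr(code - 0xFEE0))
        else:
            # Chinese characters and everything else pass through unchanged.
            out.append(ch)
    return "".join(out)
```

Normalizing first means that the templates applied later only need to be written for one form of each digit and symbol.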
The numeral/special symbol string template matching section (module 103) matches, one by one, the numeral/special symbol strings that may occur in the input text. This module scans the input text sequentially, extracts the numeral/special symbol strings one by one, and matches the current numeral/special symbol string against the templates defined in the template library to obtain the template type to which it belongs.
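Sequential scanning with template matching can be sketched with ordered regular expressions. Only two templates are shown; their names echo template types mentioned later in the description, but the regular expressions, the `Match` record and the `scan` function are assumptions made for this illustration, not the patent's template library.

```python
import re
from typing import List, NamedTuple


class Match(NamedTuple):
    start: int      # offset of the first character of the string
    end: int        # offset one past the last character
    text: str       # the extracted numeral/special symbol string
    template: str   # name of the template that matched


# Ordered from most to least specific (hypothetical patterns).
TEMPLATES = [
    ("number led by a single minus sign", re.compile(r"-\d+")),
    ("general positive integer", re.compile(r"\d+")),
]


def scan(text: str) -> List[Match]:
    """Scan left to right and return every template match in order."""
    matches = []
    pos = 0
    while pos < len(text):
        for name, pattern in TEMPLATES:
            m = pattern.match(text, pos)
            if m:
                matches.append(Match(m.start(), m.end(), m.group(), name))
                pos = m.end()
                break
        else:
            # No template starts here; move on by one character.
            pos += 1
    return matches
```

Applied to "aged 15 years old-44 years old", this sketch yields two separate local matches, which illustrates why a further context step is needed to recognize them as one interval.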
The context processing section (module 104) is the main part of the present invention. This module analyzes the context of the current numeral/special symbol string so that the string can be symbolized correctly on the basis of a global understanding. Specifically, the context processing section (module 104) consists of two submodules: the association processing section (module 1041) and the reverse inheritance section (module 1042). The former first records the related information of historical numeral/special symbol strings, such as interval range, matched template type and Words symbolization rule; it then investigates the context according to the template type of the current numeral/special symbol string, and finally symbolizes the current string more accurately from a global perspective. The latter handles the reverse inheritance problem.
The numeral/special symbol string Words symbolization section (module 105) performs Words symbolization on numeral/special symbol strings. According to the template type to which the current numeral/special symbol string belongs, and taking any relevant context into account, this module symbolizes the current string.
Module 106 is the post-processing section, which post-processes the new words added during Words symbolization. In natural language processing this may include introducing word boundaries, chunk boundaries and the like. In speech synthesis it also includes adding phonetic transcription to the new words and assigning prosodic boundary levels.
Module 107 is the final analysis result.
Fig. 3 shows the association processing section (module 1041) in detail.
Module 202 is the context investigation section. According to the current template type (obtained by module 103, the numeral/special symbol string template matching section), it retrieves the relevant context investigation knowledge stored in the template context investigation knowledge base of module 201 and investigates the context of the current numeral/special symbol string. The context of the current string is obtained from the numeral/special symbol string history database stored in module 203.
Module 204 is the accurate template generation section, that is, the semantic unit determination section mentioned above. Based on the result from the context investigation section, if a relevant context is found, which means that a large semantic unit has been found, it generates the accurate template corresponding to that large semantic unit. The newly obtained accurate template type is stored in module 205.
Module 206 is the numeral/special symbol string history recording section, which records the history information of numeral/special symbol strings. If the accurate template generation section has found a large semantic unit, module 206 also updates the earlier history information, overwriting the earlier local numeral/special symbol string information with that of the large semantic unit. In other words, it records the history information of the numeral/special symbol string corresponding to the complete semantic unit.
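The history update described above, in which a local entry is overwritten by the entry for the larger semantic unit, can be sketched with a small record type. The `HistoryEntry` fields follow the kinds of information the description lists (interval range, template type, Words symbolization rule), but the field names and the `cover_with_unit` function are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HistoryEntry:
    start: int                  # interval range: start offset in the text
    end: int                    # interval range: end offset in the text
    template: str               # matched template type
    rule: Optional[str] = None  # Words symbolization rule; None if not yet decided


def cover_with_unit(history: List[HistoryEntry], current: HistoryEntry,
                    unit_template: str, unit_rule: str) -> HistoryEntry:
    """Replace the most recent local entry with one entry for the whole unit.

    The new entry spans from the start of the previous string to the end of
    the current string and carries the accurate template and its rule.
    """
    if not history:
        raise ValueError("no previous entry to combine with")
    previous = history.pop()
    unit = HistoryEntry(previous.start, current.end, unit_template, unit_rule)
    history.append(unit)
    return unit
```

After the call, later strings that look back at the history see the complete semantic unit rather than its isolated parts.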
Module 208 is the other-annotation section, which applies any other applicable annotations to the current large semantic unit according to the other-annotation knowledge base in module 207. For example, the current large semantic unit can be marked as one chunk in natural language understanding post-processing, or as one prosodic phrase in speech synthesis post-processing, which may in turn involve division into prosodic words and so on.
Module 210 is the Words symbolization rule generation section. It retrieves the Words symbolization rule of the relevant accurate template stored in the template Words symbolization rule knowledge base of module 209 and generates the Words symbolization rule for the current numeral/special symbol string. The analysis result is stored in module 211, the Words symbolization rule analysis result.
Fig. 4 shows the processing flow of the reverse inheritance section (module 1042) in detail.
Module 301 traverses the numeral/special symbol string history records.
Module 302 checks whether the history records contain any numeral/special symbol string that has not yet been symbolized. If all numeral/special symbol strings have been symbolized, processing ends. If a string that has not yet been symbolized is found, control passes to module 304, the reverse inheritance check section, for further processing.
Module 304 is the reverse inheritance check section. According to the current template type, it retrieves the relevant reverse inheritance restrictions stored in the template reverse inheritance knowledge base of module 303 and checks whether the current numeral/special symbol string may reverse-inherit a Words symbolization rule. If it cannot inherit, control returns to module 301. If it can inherit, the inherited Words symbolization result for the current string is filled in (module 305), and control then returns to module 301.
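The reverse inheritance flow of Fig. 4 can be sketched as a pass over the history that fills in missing rules from the following string when a restriction is satisfied. The only restriction modeled here is the one the background section names, an enumeration comma between the two strings; the `Entry` fields, the right-to-left traversal order and the function name are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Entry:
    text: str              # the numeral/special symbol string
    rule: Optional[str]    # Words symbolization rule; None if not yet symbolized
    separator_after: str   # text between this string and the next one


def apply_reverse_inheritance(history: List[Entry],
                              allowed_separators: Sequence[str] = ("、",)) -> None:
    """Let unsymbolized strings inherit the rule of the string that follows.

    Walking right to left lets one rule propagate through a whole
    enumeration such as "1985、1986、1987".
    """
    for i in range(len(history) - 2, -1, -1):
        entry, following = history[i], history[i + 1]
        if entry.rule is not None:
            continue  # already symbolized, nothing to inherit
        can_inherit = (following.rule is not None
                       and entry.separator_after in allowed_separators)
        if can_inherit:
            entry.rule = following.rule
```

Strings that fail the restriction keep `rule` as `None`, reflecting the requirement that reverse inheritance be strictly limited.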
For numeral and symbol (special symbol) strings of complex form, the present invention can find the complete semantic unit to which the string corresponds and then determine the accurate template for that semantic unit, so that the string can be symbolized accurately. On this basis, other required annotations can be applied to the large semantic unit. For example, the current large semantic unit can be marked as one chunk in natural language understanding post-processing, or as one prosodic phrase in speech synthesis post-processing, which may in turn involve division into prosodic words and so on. The invention also provides a reverse inheritance mechanism.
Natural language includes many languages such as Chinese, Japanese and English. Taking Chinese as an example, a processing method and device for performing Words symbolization on numerals and special symbols in Chinese text, implemented in a speech synthesis system, can correctly symbolize the numeral and special symbol strings that may occur in the text. It is particularly suited to complex numeral and special symbol strings such as telephone numbers and quantity intervals that contain special symbols.
Fig. 5 shows a concrete example of the Words symbolization processing of numerals and special symbols in Chinese text. The numeral/special symbol strings that may exist in the input text are matched one by one in the numeral/special symbol string template matching portion (module 103). This module scans the input text sequentially, extracts the numeral/special symbol strings one by one, matches each current string against the templates defined in the template base, and obtains the template type to which it belongs. In the example sentence there are thus two numeral/special symbol strings, "15" and "44", whereas "15 years old-44 years old" is in fact one complete semantic unit. Under a simple mechanism of sequential scanning and piece-by-piece matching, the span examined for a numeral/special symbol string is often smaller than the actual span of the semantic unit. In cases like this example sentence, the semantics therefore cannot be understood completely and are sometimes even misunderstood, which yields a wrong Words symbolization result.
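The sequential scan and piece-by-piece template matching described above can be sketched as follows. This is only an illustration under stated assumptions: the two regular-expression templates, the `Match` type, and the `scan` function are hypothetical and are not taken from the patent's template base, and an English rendering of the example sentence stands in for the Chinese text.

```python
import re
from typing import List, NamedTuple


class Match(NamedTuple):
    text: str      # the numeral/special symbol string as it appears in the input
    start: int     # start offset in the input text
    end: int       # end offset (exclusive)
    template: str  # name of the first template that matched


# Hypothetical template base: (template name, pattern), tried in this order.
TEMPLATES = [
    ("number led by a single minus sign", re.compile(r"-\d+")),
    ("general positive integer", re.compile(r"\d+")),
]


def scan(text: str) -> List[Match]:
    """Scan left to right and return each numeral/symbol string with its template."""
    matches = []
    pos = 0
    while pos < len(text):
        for name, pattern in TEMPLATES:
            m = pattern.match(text, pos)
            if m:
                matches.append(Match(m.group(), m.start(), m.end(), name))
                pos = m.end()
                break
        else:
            # No template matches at this offset; move on by one character.
            pos += 1
    return matches


found = scan("aged 15 years old-44 years old")
```

Each string is matched in isolation here, which is exactly the limitation the paragraph points out: nothing in this scan knows that the two strings form one interval.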
In Fig. 5, the strings in bold boxes are the numeral/special symbol strings obtained by the sequential scanning and piece-by-piece matching of module 103, the numeral/special symbol string template matching portion. The first numeral/special symbol string, "15", matches the template "general positive integer"; combined with the following measure word "years old", its Words symbolization method is determined to be the numeric-value reading. The second numeral/special symbol string, "44", matches the template "number led by a single minus sign". This matching template is passed to module 1041, the association processing portion, which investigates the linguistic context. Taking into account the following measure word "years old" and the template type of the previous numeral/special symbol string, it determines that the current string should be combined with the previous string into one larger semantic unit, "15 years old-44 years old", whose exact template is "quantity interval"; its Words symbolization method is therefore the quantity-interval, numeric-value reading. Finally, the input text "AIDS has become the leading cause of death among young and middle-aged people aged 15 years old-44 years old." is Words-symbolized as "AIDS has become the leading cause of death among young and middle-aged people aged fifteen years old to forty-four years old." In addition, in natural language understanding applications, the other-mark portion can mark "fifteen years old to forty-four years old" as one complete chunk. In speech synthesis applications, the other-mark portion can mark "fifteen years old" and "to forty-four years old" as two prosodic words, and may consider marking "fifteen years old to forty-four years old" as one prosodic phrase.
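A minimal sketch of the association step for this example follows, assuming English wording in place of the Chinese reading. The `Item` type, the `associate` function, and the helper `read_value` (limited to 0 through 99) are hypothetical names introduced for illustration; the patent does not specify them.

```python
from dataclasses import dataclass
from typing import Optional

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def read_value(n: int) -> str:
    """Numeric-value reading for 0..99 (illustrative range only)."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


@dataclass
class Item:
    digits: str             # digits only, e.g. "15"
    template: str           # template type from the matching step
    keyword: Optional[str]  # following measure word, if any


def associate(prev: Item, cur: Item) -> Optional[str]:
    """Merge prev and cur into one quantity interval when the context allows it."""
    if (prev.template == "general positive integer"
            and cur.template == "number led by a single minus sign"
            and cur.keyword is not None
            and prev.keyword == cur.keyword):
        kw = cur.keyword
        return f"{read_value(int(prev.digits))} {kw} to {read_value(int(cur.digits))} {kw}"
    return None  # no association: each string keeps its own reading


merged = associate(
    Item("15", "general positive integer", "years old"),
    Item("44", "number led by a single minus sign", "years old"),
)
```

The check on matching measure words is one possible stand-in for the context investigation; a real knowledge base would hold many such conditions per template type.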
Fig. 6 shows a concrete example of reverse inheritance in the Words symbolization processing of numerals and special symbols in text.
In general, the numeral/special symbol strings that may exist in the input text are matched one by one in the numeral/special symbol string template matching portion (module 103). This module scans the input text sequentially, extracts the numeral/special symbol strings one by one, matches each current string against the templates defined in the template base, and obtains the template type to which it belongs. In this example sentence there are thus three numeral/special symbol strings: "1985", "1986", and "1987". When "1985" or "1986" is processed, the analysis proceeds from left to right and only local understanding is available, so a correct Words symbolization cannot yet be made. Only on the basis of global understanding, that is, after the "1987 (year)" part has been examined and the reverse inheritance portion of the present invention has processed the strings, can all three numeral/special symbol strings be Words-symbolized correctly.
In Fig. 6, the strings in bold boxes are the numeral/special symbol strings obtained by the sequential scanning and piece-by-piece matching of module 103, the numeral/special symbol string template matching portion. The first numeral/special symbol string, "1985", matches the template "general positive integer (four digits)". Because no context keyword is found, its Words symbolization method cannot be determined and is provisionally set to the default. The same applies to the second numeral/special symbol string, "1986". The third numeral/special symbol string, "1987", also matches the template "general positive integer (four digits)"; combined with the following special word "year", the template of the current string is determined to be "year date", and the Words symbolization method of "1987" is the digit-by-digit (telegram) reading. Module 1042, the reverse inheritance portion, then checks whether reverse inheritance is possible. The list separators immediately before and after "1986" determine that "1986" reverse-inherits the Words symbolization method of the numeral/special symbol string that follows it, namely the method of the "year date" template. In the same way, "1985" also reverse-inherits. Finally, the input text "According to the data provided by the alkali-drainage directorate: grain and cotton output in 1985, 1986, 1987 increased year after year." is Words-symbolized as "According to the data provided by the alkali-drainage directorate: grain and cotton output in one nine eight five, one nine eight six, one nine eight seven increased year after year."
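The reverse inheritance pass in this example can be sketched as a right-to-left walk over the history records. The `Record` type, its field names, and the separator set are assumptions made for illustration; the patent's actual restrictions are stored in the template reverse inheritance knowledge base and are richer than this single condition.

```python
from dataclasses import dataclass
from typing import List, Optional

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


@dataclass
class Record:
    text: str                     # the numeral string, e.g. "1985"
    separator_after: str          # text between this string and the next one
    method: Optional[str] = None  # None means not yet Words-symbolized


def digit_by_digit(s: str) -> str:
    """Digit-by-digit (telegram) reading of a digit string."""
    return " ".join(DIGITS[int(c)] for c in s)


def reverse_inherit(history: List[Record], list_separators=(",", "、")) -> None:
    """Let undecided records inherit the method of their right-hand neighbour.

    Walking right to left means a decision made for "1986" is already
    available when "1985" is examined.
    """
    for i in range(len(history) - 2, -1, -1):
        cur, nxt = history[i], history[i + 1]
        if (cur.method is None
                and nxt.method is not None
                and cur.separator_after.strip() in list_separators):
            cur.method = nxt.method


history = [
    Record("1985", ", "),
    Record("1986", ", "),
    Record("1987", "", method="digit-by-digit"),  # decided by the keyword "year"
]
reverse_inherit(history)
```

After the pass, every record whose method is "digit-by-digit" can be rendered with `digit_by_digit`, which gives the readings shown in the Fig. 6 example.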
Fig. 7 is a specific implementation example of the template context investigation knowledge base (module 201).
Fig. 7 shows the data structure and an example of the template context investigation knowledge base. This knowledge base stores at least: the template type of the current numeral/special symbol string, the template type of the previous numeral/special symbol string, the end position of the previous string, the extended end position of the previous string, the keyword type of the previous string, the keyword type of the current string, and the exact template type.
Fig. 8 is a specific implementation example of the numeral/special symbol string history database (module 203).
Fig. 8 shows the data structure and an example ("15 (years old)" from the Fig. 5 example sentence) of the numeral/special symbol string history database. This database stores at least: the template type of the numeral/special symbol string, its start position, its end position, its extended start position, its extended end position, its keyword type, and its keyword content.
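One way to picture such a history record is the dataclass below. The field names are illustrative renderings of the stored items listed above. Two things are assumptions and are not stated in the text: that the extended span is the string's own span widened to cover a keyword directly following it, and that the Chinese source of the example reads "15岁-44岁".

```python
from dataclasses import dataclass


@dataclass
class HistoryRecord:
    template_type: str    # template type of the numeral/special symbol string
    start: int            # start position of the string
    end: int              # end position (exclusive)
    ext_start: int        # extended start position
    ext_end: int          # extended end position (assumed to include the keyword)
    keyword_type: str     # e.g. "measure word"
    keyword_content: str  # the keyword text itself


def make_record(source: str, start: int, end: int, template_type: str,
                keyword_type: str, keyword: str) -> HistoryRecord:
    """Build a record; widen the span if the keyword directly follows the string."""
    follows = bool(keyword) and source.startswith(keyword, end)
    ext_end = end + len(keyword) if follows else end
    return HistoryRecord(template_type, start, end, start, ext_end,
                         keyword_type, keyword)


# Assumed Chinese source of the Fig. 5 fragment, used only as sample input.
record = make_record("15岁-44岁", 0, 2, "general positive integer",
                     "measure word", "岁")
```

Keeping both the plain span and the extended span lets a later step decide whether two neighbouring strings are adjacent once their keywords are taken into account.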
Fig. 9 is a specific implementation example of the other-mark knowledge base (module 207).
Fig. 9 shows the data structure and two examples of the other-mark knowledge base. This knowledge base stores at least: the template type of the current numeral/special symbol string, the prosodic word marking rule, and the prosodic phrase marking rule.
Fig. 10 is a specific implementation example of the template Words symbolization rule knowledge base (module 209).
Fig. 10 shows the data structure and two examples of the template Words symbolization rule knowledge base. This knowledge base stores at least: the template type, the context rule, the keyword rule, and the symbolization rule.
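A rule knowledge base with these four fields could be queried roughly as shown below. The three rows are hypothetical: they are modelled on the Fig. 5 and Fig. 6 examples rather than on the actual content of Fig. 10, and the first-match lookup order is an assumption.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    template_type: str
    context_rule: Optional[str]  # required template of the previous string; None = any
    keyword_rule: Optional[str]  # required following keyword; None = any
    symbolization: str           # the Words symbolization method to apply


# Hypothetical rows, modelled on the worked examples in the description.
RULES = [
    Rule("general positive integer (four digits)", None, "year",
         "digit-by-digit reading"),
    Rule("general positive integer", None, "years old",
         "numeric-value reading"),
    Rule("number led by a single minus sign", "general positive integer",
         "years old", "quantity interval, numeric-value reading"),
]


def lookup(template_type: str, prev_template: Optional[str],
           keyword: Optional[str]) -> Optional[str]:
    """Return the method of the first rule whose conditions all hold, else None."""
    for rule in RULES:
        if rule.template_type != template_type:
            continue
        if rule.context_rule is not None and rule.context_rule != prev_template:
            continue
        if rule.keyword_rule is not None and rule.keyword_rule != keyword:
            continue
        return rule.symbolization
    return None  # undecided: left at the default, a candidate for reverse inheritance


year_method = lookup("general positive integer (four digits)", None, "year")
```

A `None` result corresponds to the "1985" and "1986" situation above, where no keyword is present and the decision is deferred.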
Fig. 11 is a specific implementation example of the template reverse inheritance knowledge base (module 303).
Fig. 11 shows the data structure and an example of the template reverse inheritance knowledge base. This knowledge base stores at least: the template type of the current numeral/special symbol string, the template type of the next numeral/special symbol string, the start position of the next string, the list separator between the current string and the next string, the keyword type of the current string, the separating Chinese character between the current string and the next string, and the reverse inheritance rule.
The above embodiments merely illustrate the present invention and do not limit it.

Claims (11)

1. A Words symbolization processing method for symbol strings and numerals in natural language text, characterized in that the method comprises the following steps:
inputting natural language text;
extracting the symbol strings and numerals in the natural language text one by one;
matching the current symbol string and numeral against pre-stored templates to obtain the template type to which the current symbol string and numeral belong;
recording the template types and related information of historical symbol strings and numerals, wherein the related information comprises: interval range, matching template type, and Words symbolization rule;
performing Words symbolization processing on the current symbol string and numeral according to the template type to which they belong and the template type and related information of the historical symbol string and numeral adjacent to them, wherein Words symbolization processing converts a numeral or special symbol into the text equivalent to it,
characterized in that, if linguistic context related to the current symbol string and numeral is found in the template type and related information of the adjacent historical symbol string and numeral, the current symbol string and numeral are combined with the adjacent historical symbol string and numeral into one semantic unit, the template corresponding to this semantic unit is generated, and the symbol string and numeral information corresponding to this semantic unit is recorded.
2. The method according to claim 1, characterized in that required marks are applied to the semantic unit.
3. The method according to claim 1, characterized in that the linguistic context comprises: template type, interval range, and Words symbolization rule.
4. The method according to claim 1, characterized in that, after the above Words symbolization processing step, the template types and related information of the historical symbol strings and numerals are traversed; if a symbol string and numeral that have not been Words-symbolized are found, it is judged whether they are to be processed according to a reverse-inherited Words symbolization rule, and if so, they are processed according to the reverse-inherited Words symbolization rule.
5. The method according to claim 1, characterized in that post-processing is performed on the new text added during Words symbolization processing.
6. The method according to claim 1, characterized in that the symbol is a non-natural-language symbol.
7. A Words symbolization processing system for symbol strings and numerals in natural language text, characterized in that the system comprises:
an input portion, for inputting natural language text;
a symbol string and numeral extraction portion, for extracting the symbol strings and numerals in the natural language text one by one;
a template matching portion, for matching the current symbol string and numeral against pre-stored templates to obtain the template type to which the current symbol string and numeral belong;
a history information recording portion, for recording the template types and related information of historical symbol strings and numerals, wherein the related information comprises: interval range, matching template type, and Words symbolization rule;
a Words symbolization rule generating portion, for performing Words symbolization processing on the current symbol string and numeral according to the template type to which they belong and the template type and related information of the historical symbol string and numeral adjacent to them, and for generating the Words symbolization rule of the current symbol string and numeral, wherein Words symbolization processing converts a numeral or special symbol into the text equivalent to it,
characterized in that the system further comprises:
a linguistic context investigation portion, for investigating the template type and related information of the historical symbol string and numeral adjacent to the current symbol string and numeral;
a semantic unit determination portion, which, if linguistic context related to the current symbol string and numeral is found in the template type and related information of the adjacent historical symbol string and numeral, combines the current symbol string and numeral with the adjacent historical symbol string and numeral into one semantic unit and generates the template corresponding to this semantic unit;
wherein the history information recording portion records the symbol string and numeral information corresponding to this semantic unit.
8. The system according to claim 7, characterized in that the system further comprises:
a semantic unit marking portion, for applying required marks to the semantic unit.
9. The system according to claim 7, characterized in that the linguistic context comprises: template type, interval range, and Words symbolization rule.
10. The system according to claim 7, characterized in that the system further comprises:
a reverse inheritance portion, for traversing the template types and related information of the historical symbol strings and numerals; if a symbol string and numeral that have not been Words-symbolized are found, it judges whether they are to be processed according to a reverse-inherited Words symbolization rule, and if so, processes them according to the reverse-inherited Words symbolization rule.
11. The system according to claim 7, characterized in that the system further comprises:
a post-processing portion, for performing post-processing on the new text added during Words symbolization processing.
CNB2006101656333A 2006-12-08 2006-12-08 The Words symbolization processing method and the system of numeral and special symbol string in the text Active CN100568225C (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CNB2006101656333A CN100568225C (en) 2006-12-08 2006-12-08 The Words symbolization processing method and the system of numeral and special symbol string in the text
JP2007318985A JP5130892B2 (en) 2006-12-08 2007-12-10 Character encoding processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2006101656333A CN100568225C (en) 2006-12-08 2006-12-08 The Words symbolization processing method and the system of numeral and special symbol string in the text

Publications (2)

Publication Number Publication Date
CN101196881A CN101196881A (en) 2008-06-11
CN100568225C true CN100568225C (en) 2009-12-09

Family

ID=39547308

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2006101656333A Active CN100568225C (en) 2006-12-08 2006-12-08 The Words symbolization processing method and the system of numeral and special symbol string in the text

Country Status (2)

Country Link
JP (1) JP5130892B2 (en)
CN (1) CN100568225C (en)

Also Published As

Publication number Publication date
JP2008148322A (en) 2008-06-26
CN101196881A (en) 2008-06-11
JP5130892B2 (en) 2013-01-30

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant