WO2000065486A2 - A method of mapping semantic context to enable interoperability among disparate sources - Google Patents

A method of mapping semantic context to enable interoperability among disparate sources Download PDF

Info

Publication number
WO2000065486A2
WO2000065486A2 PCT/US2000/009203 US0009203W WO0065486A2 WO 2000065486 A2 WO2000065486 A2 WO 2000065486A2 US 0009203 W US0009203 W US 0009203W WO 0065486 A2 WO0065486 A2 WO 0065486A2
Authority
WO
WIPO (PCT)
Prior art keywords
vocabulary
source
common
query
local
Prior art date
Application number
PCT/US2000/009203
Other languages
French (fr)
Other versions
WO2000065486A3 (en
Inventor
Eric Lai
Elisa F. Kendall
James S. Wong
Benson T. Wong
Original Assignee
Sandpiper Software, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sandpiper Software, Inc. filed Critical Sandpiper Software, Inc.
Priority to AU63340/00A priority Critical patent/AU6334000A/en
Publication of WO2000065486A2 publication Critical patent/WO2000065486A2/en
Publication of WO2000065486A3 publication Critical patent/WO2000065486A3/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/252Integrating or interfacing systems involving database management systems between a Database Management System and a front-end application
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2452Query translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471Distributed queries

Definitions

  • the invention was made with Government support under DAAL01-93-C-3380 awarded by the Defense Advanced Research Project Agency (DARPA). The Government has certain rights in the invention.
  • DARPA Defense Advanced Research Project Agency
  • the invention relates to the field of data retrieval from databases (or other electronic sources), database or file management, and information processing technology. More specifically, it allows data to be accessed and /or exchanged between independently developed databases and applications, and it facilitates the integration of information, processes and applications.
  • the structural and syntactic characteristics of an electronic data source may describe its data types, layout, composition, and some constraints on its access.
  • the semantic properties of an application or repository define how that particular source should be interpreted. Semantic properties may reflect functional, behavioral, contextual, environmental, or content-specific characteristics of a process, application, or repository. An attribute name, for example, may have both syntactic and semantic properties. Other semantic (or context-related) characteristics may include units-of- measure, storage and precision for numeric data, definitions of terms, relationships among terms and values, rules for expression interpretation and evaluation, or localization characteristics. Structure and syntax may be explicitly defined through a database schema, data dictionary, or in source code. Semantic characteristics may be specified explicitly (e.g., in an attribute name), implicitly (e.g., units-of-measure, relationships among terms or objects, constraints on values), or not at all.
  • Figure 1 is a block diagram showing a number of point-to-point integrations between two client applications 10, and a number of data sources 12. In each case, a custom adapter is provided between each client application 10 and each data source 12.
  • Migrating data from one environment to another is typically expensive, time- consuming, disruptive, and error-prone, frequently resulting in loss of data or functionality.
  • Up to 80 percent of the cost of development of a data warehouse is expended in data extraction, cleaning and loading.
  • the migration process includes modeling existing processes and information sources, designing new processes and information models, designing the new schema, developing data conversion tools, and creating, converting, loading, testing, and validating the new repository or application. It often requires service outages for both anticipated tasks and unanticipated problems.
  • Converted repositories are frequently incompatible with existing sources; database synchronization issues often arise during transition intervals, where both the original and new systems operate in parallel. Conversion processes may require six to eighteen months to complete, though it is not uncommon for enterprise-scale projects to take more than two years.
  • a method of querying a plurality of data sources A common query is received, the common query being constructed utilizing a common vocabulary that is mapped to a plurality of source vocabularies associated with respective data sources of the plurality of data sources.
  • a plurality of source queries are generated to the plurality of data sources, each source query being derived from the common query and being constructed utilizing a respective source of vocabulary.
  • Figure 1 is a block diagram illustrating a prior art point-to-point integration scheme between a plurality of data consumers and a plurality of data sources.
  • Figure 2 is a block diagram illustrating the use of a common vocabulary to provide a two-level integration between a number of data consumers and a number of data sources, according to one embodiment of the present invention.
  • Figure 3 is a block diagram illustrating mappings between two common vocabularies and a plurality of source vocabularies, according to one embodiment of the present invention.
  • Figure 4A is a diagrammatic representation of an exemplary client-server (or two-tiered) environment within which the present invention may be deployed.
  • Figure 4B is a diagrammatic representation of an exemplary three-tier environment within which the present invention may be deployed.
  • Figure 5 is a block diagram illustrating architectural details of an exemplary data server constructed to implement one embodiment of the present invention.
  • Figure 6 is a flow chart illustrating the primary operations of an exemplary method of querying a plurality of data sources, and receiving a response to such a query, according to one embodiment of the present invention.
  • Figure 7 is a flow chart illustrating a method of generating a source query expression for an applicable source, the source query expression being derived from a common query expression, according to one embodiment of the present invention.
  • Figure 8 is a flow chart illustrating an exemplary method, according to one embodiment of the present invention, of performing attribute mapping function.
  • Figure 9 is a flow chart illustrating an exemplary method, according to one embodiment of the present invention, of performing a semantic mapping for algorithmic conversion of attribute values.
  • Figure 10 is a flow chart illustrating an exemplary method, according to one embodiment of the present invention, of performing semantic mapping for lookup- based mapping of attribute values.
  • Figure 11 is a diagrammatic representation of an exemplary attribute mapping table, according to one embodiment of the present invention.
  • Figure 12 is a block diagram illustrating an exemplary machine, in the form of a computer system, that may execute a sequence of instructions embodying the present invention and that includes a machine-readable medium for storing such a sequence of instructions.
  • the invention in one embodiment, enables a distributed collection of applications and repositories with dissimilar syntactic and semantic characteristics to be accessed as though they are a single entity. Queries are performed transparently, without requiring the user to understand the characteristics of any individual source, through the use of a common vocabulary and semantic mappings to the source repositories. A common vocabulary is used to provide the desired transparency.
  • Mappings establish the correspondence between the common vocabulary and each of the source-specific vocabularies which are to be integrated.
  • the common vocabulary may be tailored to the needs of an individual or group of users and /or applications and forms the basis for resolution of syntactic and semantic conflicts and ambiguities within a federation of information resources.
  • the terms and constraints defined by the common vocabulary are then used to construct the query.
  • the terms in the common vocabulary are mapped to each applicable source vocabulary for query and retrieval purposes.
  • a method of using semantic vocabularies and mappings is described. This approach allows data systems to be integrated or accessed transparently, from a single point of entry, despite differences in the semantic characteristics of the applications or repositories.
  • the requisite vocabularies and semantic mappings may be prepared and information sources may be integrated using the following operations:
  • Each data source to be integrated is modeled to create a source vocabulary.
  • the vocabulary is essentially a structural, syntactic and semantic schema for the source.
  • a common vocabulary containing any combination of terms and attributes from the source(s), and the relationship and constraints that formally define those terms, is created.
  • the characteristics of the common vocabulary need not be the same [in structure, syntax, or semantics] as any of the individual source vocabularies. It should be noted that the terminology, formats and semantics of the common vocabulary may be determined by one or more domain-specific standards, organizational requirements, or the user's preference.
  • mappings are converted to a form that is usable by the run-time engine. This process will take the mapping between the names in the source vocabulary and variables in the common vocabulary (created in operation 3) and generate the contents of the mapping tables. The details for each vocabulary item are retrieved from the source models and the common models created in operations 1 and 2. The structural, syntactic and semantic correlations between a particular pair of vocabularies will be encoded. In the case of mappings performed by algorithms (e.g., units-of- measure), it will be necessary determine if a mapping exists between the two forms. If not, this fact will be flagged for the knowledge engineer.
  • algorithms e.g., units-of- measure
  • An exemplary use of the common vocabulary may be the semantic integration of legacy sources.
  • a common vocabulary may also serve as a starting point from which new source systems are developed.
  • a pre-existing common vocabulary is selected based on the problem domain, relevant business processes and requirements, the available source repositories and applications, and user preferences, as appropriate.
  • the user (or application) may also select a subset of the available sources for searching through the use of a directory services mechanism.
  • each server maps the parameters specified in the request from the common vocabulary to the source-specific vocabulary. This process may include term renaming, unit conversion, translation, or mapping attributes and values to those used by the source.
  • the source application, database management system, file system, or other information processing system acts on the request and returns the result set (if any).
  • FIG. 1 illustrates the complexity of integration using the traditional, custom approach.
  • the present invention is advantageous in that it may substantially reduce the number of point-to-point integrations needed among applications and repositories that comprise the user's environment.
  • Figure 3 illustrates the use of a common vocabulary 14 to resolve attribute names, data types and units between a number of sources 12 and a number of client (user) applications 10.
  • Figure 3 illustrates the use of two independent, common vocabularies 14 which map combinations of attributes from three source vocabularies 16, each source vocabulary 12 being associated with a respective source 12.
  • each of the respective sources may display dissimilar structural syntax, semantics or access protocols from other source vocabularies 16.
  • the mapping component may be relocated to an application server servicing a cluster of sources 12 with minimal loss of flexibility. Whether vocabulary mapping functionality is performed by the information source server or a middle-tier application server, client applications (and therefore users) 10 are shielded from the specifics of the individual source vocabularies 16.
  • Exemplary Embodiment Figure 4A illustrates an exemplary client-server environment 20 in which the present invention may be used.
  • the environment 20 includes one or more client applications 10 which may utilize individual or shared vocabularies, and multiple data sources in the exemplary form of data servers 22 which may utilize individual or shared vocabularies.
  • Server vocabularies of the data servers 22 may overlap with the client vocabularies of the client applications 10 although this need not be the case.
  • the data servers 22 may have dissimilar structure and syntax, dissimilar semantics and terminology, and dissimilar access protocols from the client applications 10.
  • at least one data server 22 supports one client vocabulary, and the client applications 10 and servers 22 are connected by a network 24 or other communication medium.
  • the servers 22 (or a subset thereof) may share at least one common vocabulary 14. This common vocabulary 14 is utilized by at least one client application. There may be multiple common vocabularies.
  • the term "common" is with respect to some set of data servers or sources.
  • Requests 26 are sent from the client application 10 to the server 22 and responses 28 from the servers 22 to the client application 10.
  • the client application 10 generates requests 26 and receives responses 28 in the user's terminology and semantics (i.e., the common vocabulary).
  • Each data server 22 may receive requests 26 and generate responses 28 in the user terminology and semantics.
  • the internal repositories utilizes the source-specific local terminology and semantics. It should be noted that a user terminology and semantics set is not arbitrary, rather it is a set of corresponding terms, concepts and preferences which have been tailored to one or more users.
  • Each data server 22 sharing the common vocabulary 14 (or a user selected subset of the data servers 22) receives an identical request 26.
  • the exemplary architecture, illustrated in Figure 4A, is client-server.
  • the present invention is equally applicable to a three-tiered architecture such as that shown in Figure 4B, where the mapping functionality is performed on (one or more) intermediate application servers 30.
  • the application server 30 receives client requests, maps the requests and queries the appropriate data sources (e.g., the servers 22), then remaps and returns the results to the client.
  • An application server 30 is also capable of manipulating the results, if necessary, to sort, merge or filter the results as well as performing report generation and analysis functions.
  • the application server 30 understands the mappings between each common vocabulary and each source vocabulary accessible in the federation.
  • each server 22 is required to understand only those mappings between supported common vocabularies and its local vocabulary.
  • the application server 30 also supports all of the access protocols needed to access any of the data servers 22.
  • the client-server architecture takes advantage of a common access protocol. Since the application server 30 intermediates each transaction, it may require relatively powerful hardware and may potentially become a bottleneck (depending on the number of users or the tasks performed on behalf of users) or single-point-of-failure.
  • An application server 30 does, however, allow the mapping information and business applications to be centralized which may reduce the maintenance to clients and back- end servers.
  • the three-tier approach does not require a server-resident mapping process be added to each data server 22.
  • the concept of integrating dissimilar information sources using a common vocabulary remains unchanged. It will be appreciated that, in the three-tier architecture illustrated in Figure 4B, certain of these components will be deployed on the application server 30.
  • Figure 5 shows the primary components of an exemplary data server 22 as displayed within the two-tier (client-server) architecture shown in Figure 4A.
  • the illustrated primary components of the exemplary data server 22 include a communication interface 32, a mapper 34 and a source interface 36 (e.g., a user API, a DBMS) to the information repositories 38.
  • the communication interface 32 is arbitrary and supports whatever communication methodologies (e.g., TCP/IP, CORBA, ATM, Ethernet) is required to communicate over the network 24.
  • the communication interface 32 also provides services to format /extract information to/from messages. As previously discussed, these messages utilize the users' terminology and semantics set.
  • the mapper 34 transforms the users' terminology and semantics embodied in a request 26 into the local terminology and semantics (and vice versa). Once translated into the source-specific local terminology, a query is made to the appropriate data source 38 (e.g., database, file or other data management service). In order for the mapper 34 to perform the necessary translations, a number of mapping tables 40 are made available to it.
  • the appropriate data source 38 e.g., database, file or other data management service.
  • the client application 10 Since the client application 10 has knowledge of only the common vocabulary 14, the capability of bi-directional mapping is provided.
  • the top half of Figure 5 shows the mapping requests 26 from common to source vocabulary, while the lower half shows the mapping responses 28 from source to common vocabulary.
  • a variable length string may be used to transfer the requests and responses between the client application 10 and the data server 22.
  • lexical tokens representing attribute names, operators and values may be extracted by a lexical analyzer 42, after which the mapping appropriate to the token (e.g., name or value) is performed.
  • a syntax analyzer 44 determines whether a token is an attribute name, an operator, a value or something else. An alternative is to remember what the last token was (assuming the use of ⁇ attribute_name, operator, attribute_value> triples).
  • the mapping functions may be built as actions to the lexical analyzer 42 and generation of the query expression may be performed by actions in the syntax analyzer 44.
  • Relational expressions may be chained together in compound expressions as shown in Figure 7.
  • the input and output grammars are chosen by the user according to the application.
  • the lexical analyzer 42 and syntax analyzer 44 skeletons may be generated with commercially available tools used for that purpose (e.g., AT&T lex and yacc). These skeletons may then augmented with the appropriate set of actions.
  • Figure 6 is a flow chart illustrating an overview of an exemplary method 50 of performing a mapping of a request 26, expressed in a common vocabulary, to a source vocabulary 16.
  • the method 50 commences with lexical extraction operation 52 by the lexical analyzer 42, followed by a semantic mapping operation 54 performed by the mapper 34 and a syntax analysis operation 56 performed by the syntax analyzer 44.
  • Attribute name mapping 60 resolves naming differences between the attributes in the vocabularies.
  • Attribute value mapping 62 resolves differences between the contents of the attributes. Attribute value mapping 62 may be performed in one of two exemplary methods: algorithmically, as illustrated in Figure 9, or by lookup, as illustrated in Figure 10.
  • Figure 7 is a flow chart illustrating further details of a method 66, according to one embodiment of the present invention, of expression generation. It should be noted that Figure 7 does not illustrate lexical, syntactic or semantical analysis performed through compiler functionality, nor symbol table management or generation provided by compiler technology or techniques.
  • the method 66 illustrates the chaining together of relational expressions into a compound expression.
  • the method 66 commences with a mapping operation 54, followed by a determination at decision block 68 as to whether a logical operation has been detected. If so, the method 66 continues with the logical expression formation by proceeding via block 70 to perform a further mapping operation 54. On the other hand, should no logical operation be detected at decision box 68, a compound expression, comprising a chained series of relation expressions, is submitted at block 72.
  • Figure 8 is a flow chart illustrating a method 71, according to an exemplary embodiment of the present invention, of performing an attribute name mapping 60.
  • the method 71 commences at block 72 with a lookup attribute name operation. Specifically, a global name 74 is utilized to perform a lookup with respect to an attribute name table 78, and to return a local name 76.
  • attribute name table 78 there should be at least one entry within the attribute name table 78 for each attribute which stores the mapped global name 74, the mapped data type and a string which specifies any required algorithmic conversion(s).
  • a key consisting of the vocabulary name, the class name and the attribute (or local) name 76 concatenated together is used as a key to a database lookup. This key structure allows multiple vocabularies to be stored in a single database or table. It also allows the same attribute name 76 to exist in different classes as well as for the same class name to exist in different vocabularies.
  • the attribute (local) name 76 found in the attribute name table 78 is written to an output string followed by the detected relational operator.
  • the attribute data type (e.g., floating point number, string or integer) is retrieved.
  • formatting of the expression commences.
  • the lookup attribute value mapping 62 also includes two components, namely (1) a semantic mapping for algorithmic conversion component and (2) a semantic mapping for lookup-based mapping of attribute values component.
  • Figure 9 is a flow chart illustrating an exemplary method 90, according to one embodiment of the present invention, of performing a semantic mapping of algorithmic conversions on attribute values.
  • the lexical type for the relevant value token is determined.
  • an algorithmic conversion if indicated by the mapping structure, is performed at block 94.
  • an algorithmic conversion if indicated by the mapping structure, is performed at block 96.
  • the mapper 34 has access to a library of supported conversions integrated into it.
  • Examples of algorithmic conversions that may be performed at blocks 94 and 96 include unit-of-measure conversions (e.g., centimeters to inches) or any other conversions of values that may be algorithmically performed, as opposed to requiring a lookup operation.
  • unit-of-measure conversions e.g., centimeters to inches
  • Figure 10 is a flow chart illustrating an exemplary method 100, according to one exemplary embodiment of the present invention, of performing a semantic mapping for lookup-based mapping of attribute values.
  • a value is typed as a floating point number at block 92 in Figure 9, is subject to an algorithm conversion at block 94.
  • Figure 10 reveals that the value receives no further value mapping, and is accordingly inserted into an output string and a next token value is located.
  • additional lookups may be performed.
  • a key (not shown) consisting of a vocabulary name, a class name, an attribute name and the value are concatenated together and utilized as a key to a database lookup in the name table 78 or a value table 79.
  • a lookup is made in the attribute name table 78 to determine whether or not the value is in fact another attribute (local) name. If the lookup at block 102 succeeds, as determined at decision block 104, the local name 76 substitutes the global name 74, the result is placed in an output string, and the next value token is sorted.
  • the global name 74 is retained at block 108 and the key is utilized to perform a second lookup at block 110 in an attribute value table 79 to determine if the key is to an enumerated value.
  • the value token is determined at block 92 to comprise an integer type value
  • the actions described above with reference to blocks 110 - 116 are then performed with respect to such an integer type value.
  • the two stage mapping of string values is only needed for applications involving query generation (since the value of two attributes may be compared with one another).
  • only a single lookup may be used. Specifically, if a value token representing a logical operator is located, the logical operator will be added to an output string, and the process repeated. When the end of a valid expression is detected, the output string is used to query the data source. The results returned from the data source are then mapped back into the common vocabulary 14.
  • the reverse mapping process is substantially as described above, with two exceptions. First, the mapping tables (e.g., the name table 78 and the value table 79) are from the source (or local) vocabulary 16 to the common vocabulary 14. Secondly, the input and output are ⁇ attribute_name, attribute_value> tuples.
  • Figure 11 shows the structure of exemplary mapping tables 64 within the context of an object-oriented database.
  • the same structure may be implemented for the attribute name and the attribute value tables 78 and 79.
  • the attribute name table 78 all three fields are needed.
  • the attribute value table 79 only the field for the mapped value field is needed.
  • a mapped name may furthermore be an attribute name or an attribute value.
  • the tables 64 may be implemented through the object-oriented database using a hashed dictionary structure 130. Another data structure may be used so long as it has a look-up or search capability.
  • the data structures, itself, need not be persistent, if it can be loaded from a persistent source (e.g., a file) when required.
  • a database does, however, provide built-in mechanisms to more easily manage, store, retrieve and update the data. It is clear that the choice of data structures may affect the performance of the mapping functions.
  • the hashed dictionary structure 130 generally provides fast lookup performance relative to other commonly used methods if an effective hash function is selected. Since these tables 64 are of the same format and the key string allows sufficient uniqueness, the two tables may be merged into one. This approach, however, may not significantly reduce the number of lookups, the aggregate table size, or the maintenance requirements but may result in some loss of understandability.
  • the mapper 34 has a library of supported conversions integrated into it.
  • One embodiment of the present invention links a library (static or dynamic) of conversion algorithms into the mapper 34.
  • the appropriate conversion function is selected based on a code in the conversion specifier string, as shown in Figure 11.
  • These may be packaged as remote procedures or distributed functions such as the Object Management Group's Common Object Request Broker Architecture (CORBA). This integration allows new functions to be easily added.
  • CORBA Common Object Request Broker Architecture
  • FIG 12 is a block diagram illustrating a machine, in the exemplary form of a computer system 250, within which a set of instructions for causing the computer system 250 to perform any one of the methodologies discussed above may be executed.
  • the computer system 250 includes a processor 252, a main memory 254, and a static memory 255, which communicate with each other via a bus 256.
  • the computer system 250 further includes a video display unit 258 (e.g., a liquid crystal display (LCD) or a cathode ray tube (CTR)).
  • LCD liquid crystal display
  • CTR cathode ray tube
  • the computer system 250 further includes an alpha-numeric input device 260 (e.g., a keyboard), a cursor control device 262 (e.g., a mouse), a storage medium in the exemplary form of a disk drive unit 264, a signal generation device 266 (e.g., a speaker) and a network interface device 268.
  • an alpha-numeric input device 260 e.g., a keyboard
  • a cursor control device 262 e.g., a mouse
  • storage medium in the exemplary form of a disk drive unit 264 e.g., a disk drive unit
  • a signal generation device 266 e.g., a speaker
  • the disk drive unit 264 includes a machine-readable medium 265 on which is stored a set of instructions (i.e., software 270) embodying any one, or all, of the methodologies described above.
  • the software 270 is also shown to reside, completely or at least partially, within the main memory 254 and /or within the processor 252.
  • the software 270 may furthermore be transmitted or received via the network interface device 268.
  • the term "machine-readable medium” shall be taken to include any medium that is capable of storing and encoding a sequence of instructions for execution by the machine, and that causes the machine to perform any one of the methodologies of the present invention.
  • the term “machine- readable medium” shall accordingly be taken to include, but not limited to, solid-state memories, optical and magnetic disks, and carrier wave signals.
  • the present application is useful for developing and integrating applications and repositories wherein the data is accessed using terminology, methodologies, or other mechanisms that may differ from the source implementation, format, or context. It may be used to resolve both semantic and syntactic differences among applications and repositories.
  • databases are used as examples throughout this disclosure, the approach is independent of the source implementation or underlying technology. This invention is also applicable to information brokering and mediation.
  • the present invention is also advantageous in that data migration is not required to customize the vocabulary or transparently to query multiple data sources as a single set. This alone results in a significant reduction of cost, operational disruption and risk. As the integrity of the various source repositories and applications is preserved, the operation of existing systems remains uninterrupted while new applications can access and make use of the information maintained by those systems.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A distributed collection of applications and repositories with dissimilar syntactic and semantic characteristics may be accessed as though they are a single entity. Queries are performed transparently, without requiring the user to understand or be aware of the characteristics of any individual source. This is performed through the use of a common vocabulary and semantic mappings to the source repositories. A common vocabulary is used to provide the desired transparency. Mappings establish the correspondence between the common vocabulary and each of the source-specific vocabularies which are to be integrated. The common vocabulary may be tailored to the needs of an individual or group of users and/or applications and forms the basis for resolution of syntactic and semantic conflicts and ambiguities within a federation of information resources. The terms and constraints defined by the common vocabulary are then used to construct the query. The terms in the common vocabulary are mapped to each applicable sources vocabulary for query and retrieval purposes. By mapping the semantic characteristics of the various sources to the common vocabularly, the integrity of the individual source repositories, file systems, processes, and applications is maintained, while new applications and access methodologies can make use of the integrated resources.

Description

A METHOD OF MAPPING SEMANTIC CONTEXT TO ENABLE INTEROPERABILITY
AMONG DISPARATE SOURCES
The present application claims the benefit of the filing date of U.S. provisional patent application serial no. 60/128,466 entitled "A METHOD OF MAPPING SEMANΗC CONTEXT TO ENABLE INTEROPERABILITY AMONG DISPARATE SOURCES", filed April 9, 1999.
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH AND DEVELOPMENT
The invention was made with Government support under DAAL01-93-C-3380 awarded by the Defense Advanced Research Project Agency (DARPA). The Government has certain rights in the invention.
FIELD OF THE INVENTION
The invention relates to the field of data retrieval from databases (or other electronic sources), database or file management, and information processing technology. More specifically, it allows data to be accessed and /or exchanged between independently developed databases and applications, and it facilitates the integration of information, processes and applications.
BACKGROUND OF THE INVENTION
The structural and syntactic characteristics of an electronic data source may describe its data types, layout, composition, and some constraints on its access. The semantic properties of an application or repository define how that particular source should be interpreted. Semantic properties may reflect functional, behavioral, contextual, environmental, or content-specific characteristics of a process, application, or repository. An attribute name, for example, may have both syntactic and semantic properties. Other semantic (or context-related) characteristics may include units-of- measure, storage and precision for numeric data, definitions of terms, relationships among terms and values, rules for expression interpretation and evaluation, or localization characteristics. Structure and syntax may be explicitly defined through a database schema, data dictionary, or in source code. Semantic characteristics may be specified explicitly (e.g., in an attribute name), implicitly (e.g., units-of-measure, relationships among terms or objects, constraints on values), or not at all.
Many information systems, data warehouses, databases, and applications are developed independently to solve particular business problems. As business requirements change and new functionality is required, the traditional methods for providing the requisite features have included:
augmenting an existing system, or in some cases, new development that is consistent with existing formats;
migrating source repositories to completely new environments or systems; and /or
integrating multiple sources through the use of custom adapters that transform the data for each unique information producer, consumer, and repository. Figure 1 is a block diagram showing a number of point-to-point integrations between two client applications 10, and a number of data sources 12. In each case, a custom adapter is provided between each client application 10 and each data source 12.
These tasks become increasingly difficult and more costly as the number and types of new and existing sources to be integrated increases. Even in cases where syntactic differences can be resolved through gateway or middleware technologies, for example, semantic conflicts and ambiguities remain. While many syntactic or structural details can be extracted from a database schema or through reverse engineering tools, semantic details that are implicitly specified or insufficiently documented are lost.
Additionally, individuals and information systems may use specific terminology to refer to objects and concepts. Cultural and organizational heritage, domain expertise, and educational and professional background all affect the language that an individual or group uses on a daily basis. The use of synonymous terms and terms whose definition is similar but not identical are common among geographically distributed organizations. Examples of terms whose usage reflects these issues are included in Table 1.
Table 1 Examples of Semantic Ambiguities - Terminology
Figure imgf000005_0001
In other cases, the interpretation of numerical data may be left to the user's implicit understanding of the data source. Some examples appear in Table 2.
Table 2 Examples of Semantic Ambiguities - Units and Normalization
Figure imgf000005_0002
As resources are increasingly accessible across organizational and national boundaries, users of those resources may find it difficult to determine the correct units or semantic context appropriate for a particular data element. For example, a search of two independently-developed databases for pricing information such as price greater then $100.00 with results merged for comparison purposes, may yield results that are inconsistent or inaccurate, or that lead to incorrect interpretation. In cases where the units are labeled, there is no mechanism available to the user to easily compare the results without manual translation.
Migrating data from one environment to another is typically expensive, time- consuming, disruptive, and error-prone, frequently resulting in loss of data or functionality. Up to 80 percent of the cost of development of a data warehouse is expended in data extraction, cleaning and loading. The migration process includes modeling existing processes and information sources, designing new processes and information models, designing the new schema, developing data conversion tools, and creating, converting, loading, testing, and validating the new repository or application. It often requires service outages for both anticipated tasks and unanticipated problems. Converted repositories are frequently incompatible with existing sources; database synchronization issues often arise during transition intervals, where both the original and new systems operate in parallel. Conversion processes may require six to eighteen months to complete, though it is not uncommon for enterprise-scale projects to take more than two years.
Various methods are employed by commercial software integration tools to resolve syntactic and structural differences among information sources, including database gateways, message-oriented middleware, and application-specific adapters. These methods fail to resolve semantic conflicts, inconsistencies, and ambiguities that naturally occur among independently developed information systems, however. Semantic integration issues are likely to arise when resources use differing naming conventions, where there may be distinctions in terminology and taxonomies, where the same terms may be used inconsistently or have differing properties (such as different units of measure, enumerated values, or constraints). These mismatches are far more difficult to resolve than syntactic and structural incompatibilities, and necessitate costly, custom solutions at best. The approach promoted by most systems integrators today is to define a single, monolithic data model for use throughout an organization, which, while effectively eliminating semantic conflicts, either forces the organization to migrate all of its information to the new model or prohibits integration with autonomously developed sources. This approach is not always feasible due to cost, resource constraints, or business requirements to maintain legacy systems. Semantic characteristics are often implicitly represented in the data and are seldom well documented. Failing to address semantic or context- or content-specific integration issues may produce unsatisfactory results and may have a negative impact on systems integrity. Natural language processing methods for the purposes of understanding and retrieving information from various static, dynamic, and interactive sources exist. These methods represent attempts to generalize (or refine) a set of query terms, to expand a set of query terms, or to construct computable queries from expressions used in normal human discourse. While these approaches support customization from the user's perspective, they do not support transparent querying or integration of disparate sources without requiring the user to have knowledge of the source specific terminology.
SUMMARY OF THE INVENTION
According to the present invention, there is provided a method of querying a plurality of data sources. A common query is received, the common query being constructed utilizing a common vocabulary that is mapped to a plurality of source vocabularies associated with respective data sources of the plurality of data sources. A plurality of source queries are generated to the plurality of data sources, each source query being derived from the common query and being constructed utilizing a respective source of vocabulary.
Other features of the present invention will be apparent from the accompanying drawings and from the detailed description that follows.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements and in which:
Figure 1 is a block diagram illustrating a prior art point-to-point integration scheme between a plurality of data consumers and a plurality of data sources.
Figure 2 is a block diagram illustrating the use of a common vocabulary to provide a two-level integration between a number of data consumers and a number of data sources, according to one embodiment of the present invention.
Figure 3 is a block diagram illustrating mappings between two common vocabularies and a plurality of source vocabularies, according to one embodiment of the present invention.
Figure 4A is a diagrammatic representation of an exemplary client-server (or two-tiered) environment within which the present invention may be deployed. Figure 4B is a diagrammatic representation of an exemplary three-tier environment within which the present invention may be deployed.
Figure 5 is a block diagram illustrating architectural details of an exemplary data server constructed to implement one embodiment of the present invention.
Figure 6 is a flow chart illustrating the primary operations of an exemplary method of querying a plurality of data sources, and receiving a response to such a query, according to one embodiment of the present invention.
Figure 7 is a flow chart illustrating a method of generating a source query expression for an applicable source, the source query expression being derived from a common query expression, according to one embodiment of the present invention.
Figure 8 is a flow chart illustrating an exemplary method, according to one embodiment of the present invention, of performing attribute mapping function.
Figure 9 is a flow chart illustrating an exemplary method, according to one embodiment of the present invention, of performing a semantic mapping for algorithmic conversion of attribute values.
Figure 10 is a flow chart illustrating an exemplary method, according to one embodiment of the present invention, of performing semantic mapping for lookup- based mapping of attribute values.
Figure 11 is a diagrammatic representation of an exemplary attribute mapping table, according to one embodiment of the present invention.
Figure 12 is a block diagram illustrating an exemplary machine, in the form of a computer system, that may execute a sequence of instructions embodying the present invention and that includes a machine-readable medium for storing such a sequence of instructions.
DETAILED DESCRIPTION
Overview A method and apparatus for querying a plurality of data sources are described. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be evident, however, to one skilled in the art that the present invention may be practiced without these specific details. The invention, in one embodiment, enables a distributed collection of applications and repositories with dissimilar syntactic and semantic characteristics to be accessed as though they are a single entity. Queries are performed transparently, without requiring the user to understand the characteristics of any individual source, through the use of a common vocabulary and semantic mappings to the source repositories. A common vocabulary is used to provide the desired transparency. Mappings establish the correspondence between the common vocabulary and each of the source-specific vocabularies which are to be integrated. The common vocabulary may be tailored to the needs of an individual or group of users and /or applications and forms the basis for resolution of syntactic and semantic conflicts and ambiguities within a federation of information resources. The terms and constraints defined by the common vocabulary are then used to construct the query. The terms in the common vocabulary are mapped to each applicable source vocabulary for query and retrieval purposes. By mapping the semantic characteristics of the various sources to the common vocabulary, the integrity of the individual source repositories, file systems, processes, and applications is maintained, while new applications and access methodologies can make use of the integrated resources.
A method of using semantic vocabularies and mappings is described. This approach allows data systems to be integrated or accessed transparently, from a single point of entry, despite differences in the semantic characteristics of the applications or repositories.
In one exemplary embodiment, the requisite vocabularies and semantic mappings may be prepared and information sources may be integrated using the following operations:
(1) Each data source to be integrated is modeled to create a source vocabulary. The vocabulary is essentially a structural, syntactic and semantic schema for the source.
(2) A common vocabulary, containing any combination of terms and attributes from the source(s), and the relationship and constraints that formally define those terms, is created. The characteristics of the common vocabulary need not be the same [in structure, syntax, or semantics] as any of the individual source vocabularies. It should be noted that the terminology, formats and semantics of the common vocabulary may be determined by one or more domain-specific standards, organizational requirements, or the user's preference.
(3) The common vocabulary is mapped to each of the source vocabularies. (4) The mappings are converted to a form that is usable by the run-time engine. This process will take the mapping between the names in the source vocabulary and variables in the common vocabulary (created in operation 3) and generate the contents of the mapping tables. The details for each vocabulary item are retrieved from the source models and the common models created in operations 1 and 2. The structural, syntactic and semantic correlations between a particular pair of vocabularies will be encoded. In the case of mappings performed by algorithms (e.g., units-of- measure), it will be necessary determine if a mapping exists between the two forms. If not, this fact will be flagged for the knowledge engineer. In this case it is the responsibility of the knowledge engineer to create a mapping and update the list of mappings available to the knowledge base used by the generation process. Allowing the vocabulary to be tailored enables the user to access and evaluate the information using familiar terminology and conventions. The use of a common vocabulary addresses the need for each unique client application to be integrated with each unique source. Furthermore, it allows the set of sources sharing this vocabulary to be queried transparently through a single query which need not be reformulated for each target repository. The user (and the client application) need not understand the specific details required to access each source. This approach allows sources to be integrated incrementally. Once a common vocabulary has been established, sources may be integrated at a pace determined by business needs. The common vocabulary can evolve over time as additional sources are integrated or new business requirements arise. Multiple common vocabularies, and thus multiple federated collections of applications and repositories, may be integrated using the same approach.
An exemplary use of the common vocabulary may be the semantic integration of legacy sources. A common vocabulary may also serve as a starting point from which new source systems are developed.
At run-time, the following operations are performed:
(1) A pre-existing common vocabulary is selected based on the problem domain, relevant business processes and requirements, the available source repositories and applications, and user preferences, as appropriate. The user (or application) may also select a subset of the available sources for searching through the use of a directory services mechanism.
(2) A query is formulated using the terms and constraints of the common vocabulary.
(3) Identical queries are distributed, in parallel, to each applicable source repository. (4) Upon receipt, each server maps the parameters specified in the request from the common vocabulary to the source-specific vocabulary. This process may include term renaming, unit conversion, translation, or mapping attributes and values to those used by the source.
(5) The source application, database management system, file system, or other information processing system acts on the request and returns the result set (if any).
(6) The terms of the result set are mapped from the local vocabulary to the common vocabulary, and
(7) the result set is returned to the client for further processing.
As a mapping exists from each individual source vocabulary to the common vocabulary, queries may be performed transparently across the set of integrated sources. This arrangement allows modular, plug-in, incremental integration of new sources into an existing environment. By creating a common vocabulary for the existing system, new sources may be mapped to the common vocabulary without requiring modification of those sources. No changes are required to existing applications to allow a new source to be utilized. Figure 1 illustrates the complexity of integration using the traditional, custom approach. The present invention is advantageous in that it may substantially reduce the number of point-to-point integrations needed among applications and repositories that comprise the user's environment. Figure 3 illustrates the use of a common vocabulary 14 to resolve attribute names, data types and units between a number of sources 12 and a number of client (user) applications 10. Specifically, Figure 3 illustrates the use of two independent, common vocabularies 14 which map combinations of attributes from three source vocabularies 16, each source vocabulary 12 being associated with a respective source 12. In one embodiment, each of the respective sources may display dissimilar structural syntax, semantics or access protocols from other source vocabularies 16.
The mapping component may be relocated to an application server servicing a cluster of sources 12 with minimal loss of flexibility. Whether vocabulary mapping functionality is performed by the information source server or a middle-tier application server, client applications (and therefore users) 10 are shielded from the specifics of the individual source vocabularies 16.
Exemplary Embodiment Figure 4A illustrates an exemplary client-server environment 20 in which the present invention may be used. The environment 20 includes one or more client applications 10 which may utilize individual or shared vocabularies, and multiple data sources in the exemplary form of data servers 22 which may utilize individual or shared vocabularies. Server vocabularies of the data servers 22 may overlap with the client vocabularies of the client applications 10 although this need not be the case. The data servers 22 may have dissimilar structure and syntax, dissimilar semantics and terminology, and dissimilar access protocols from the client applications 10. In one embodiment, at least one data server 22 supports one client vocabulary, and the client applications 10 and servers 22 are connected by a network 24 or other communication medium. The servers 22 (or a subset thereof) may share at least one common vocabulary 14. This common vocabulary 14 is utilized by at least one client application. There may be multiple common vocabularies. The term "common" is with respect to some set of data servers or sources.
Requests 26 are sent from the client application 10 to the server 22 and responses 28 from the servers 22 to the client application 10. The client application 10 generates requests 26 and receives responses 28 in the user's terminology and semantics (i.e., the common vocabulary). Each data server 22 may receive requests 26 and generate responses 28 in the user terminology and semantics. However, the internal repositories utilizes the source-specific local terminology and semantics. It should be noted that a user terminology and semantics set is not arbitrary, rather it is a set of corresponding terms, concepts and preferences which have been tailored to one or more users. Each data server 22 sharing the common vocabulary 14 (or a user selected subset of the data servers 22) receives an identical request 26.
The exemplary architecture, illustrated in Figure 4A, is client-server. The present invention is equally applicable to a three-tiered architecture such as that shown in Figure 4B, where the mapping functionality is performed on (one or more) intermediate application servers 30. The application server 30 receives client requests, maps the requests and queries the appropriate data sources (e.g., the servers 22), then remaps and returns the results to the client. An application server 30 is also capable of manipulating the results, if necessary, to sort, merge or filter the results as well as performing report generation and analysis functions. The application server 30 understands the mappings between each common vocabulary and each source vocabulary accessible in the federation. In the client-server case illustrated in Figure 4A, each server 22 is required to understand only those mappings between supported common vocabularies and its local vocabulary. The application server 30 also supports all of the access protocols needed to access any of the data servers 22. By using a server- based process, the client-server architecture takes advantage of a common access protocol. Since the application server 30 intermediates each transaction, it may require relatively powerful hardware and may potentially become a bottleneck (depending on the number of users or the tasks performed on behalf of users) or single-point-of-failure. An application server 30 does, however, allow the mapping information and business applications to be centralized which may reduce the maintenance to clients and back- end servers. Unlike the client-server architecture, the three-tier approach does not require a server-resident mapping process be added to each data server 22. Using a client-server or three-tiered architecture, the concept of integrating dissimilar information sources using a common vocabulary remains unchanged. It will be appreciated that, in the three-tier architecture illustrated in Figure 4B, certain of these components will be deployed on the application server 30.
Figure 5 shows the primary components of an exemplary data server 22 as displayed within the two-tier (client-server) architecture shown in Figure 4A. The illustrated primary components of the exemplary data server 22 include a communication interface 32, a mapper 34 and a source interface 36 (e.g., a user API, a DBMS) to the information repositories 38. The communication interface 32 is arbitrary and supports whatever communication methodologies (e.g., TCP/IP, CORBA, ATM, Ethernet) is required to communicate over the network 24. The communication interface 32 also provides services to format /extract information to/from messages. As previously discussed, these messages utilize the users' terminology and semantics set. As illustrated in Figure 5, the mapper 34 transforms the users' terminology and semantics embodied in a request 26 into the local terminology and semantics (and vice versa). Once translated into the source-specific local terminology, a query is made to the appropriate data source 38 (e.g., database, file or other data management service). In order for the mapper 34 to perform the necessary translations, a number of mapping tables 40 are made available to it.
Since the client application 10 has knowledge of only the common vocabulary 14, the capability of bi-directional mapping is provided. The top half of Figure 5 shows the mapping requests 26 from common to source vocabulary, while the lower half shows the mapping responses 28 from source to common vocabulary.
While not required for the basic functionality, a variable length string may be used to transfer the requests and responses between the client application 10 and the data server 22. In the data server 22, lexical tokens, representing attribute names, operators and values may be extracted by a lexical analyzer 42, after which the mapping appropriate to the token (e.g., name or value) is performed. A syntax analyzer 44 determines whether a token is an attribute name, an operator, a value or something else. An alternative is to remember what the last token was (assuming the use of <attribute_name, operator, attribute_value> triples). The mapping functions may be built as actions to the lexical analyzer 42 and generation of the query expression may be performed by actions in the syntax analyzer 44. Relational expressions may be chained together in compound expressions as shown in Figure 7. The input and output grammars are chosen by the user according to the application. The lexical analyzer 42 and syntax analyzer 44 skeletons may be generated with commercially available tools used for that purpose (e.g., AT&T lex and yacc). These skeletons may then augmented with the appropriate set of actions.
Figure 6 is a flow chart illustrating an overview of an exemplary method 50 of performing a mapping of a request 26, expressed in a common vocabulary, to a source vocabulary 16. The method 50 commences with lexical extraction operation 52 by the lexical analyzer 42, followed by a semantic mapping operation 54 performed by the mapper 34 and a syntax analysis operation 56 performed by the syntax analyzer 44.
Two types of semantic mapping are performed as part of the semantic mapping operation 54 in order to provide the desired transparency, namely attribute name mapping 60 and attribute value mapping 62. Attribute name mapping 60, further illustrated in Figure 8, resolves naming differences between the attributes in the vocabularies. Attribute value mapping 62 resolves differences between the contents of the attributes. Attribute value mapping 62 may be performed in one of two exemplary methods: algorithmically, as illustrated in Figure 9, or by lookup, as illustrated in Figure 10.
Figure 7 is a flow chart illustrating further details of a method 66, according to one embodiment of the present invention, of expression generation. It should be noted that Figure 7 does not illustrate lexical, syntactic or semantical analysis performed through compiler functionality, nor symbol table management or generation provided by compiler technology or techniques.
The method 66 illustrates the chaining together of relational expressions into a compound expression. The method 66 commences with a mapping operation 54, followed by a determination at decision block 68 as to whether a logical operation has been detected. If so, the method 66 continues with the logical expression formation by proceeding via block 70 to perform a further mapping operation 54. On the other hand, should no logical operation be detected at decision box 68, a compound expression, comprising a chained series of relation expressions, is submitted at block 72.
As stated above, two types of semantic mapping are performed as part of the semantic mapping operation. Figure 8 is a flow chart illustrating a method 71, according to an exemplary embodiment of the present invention, of performing an attribute name mapping 60.
The method 71 commences at block 72 with a lookup attribute name operation. Specifically, a global name 74 is utilized to perform a lookup with respect to an attribute name table 78, and to return a local name 76.
As shown in Figure 11, there should be at least one entry within the attribute name table 78 for each attribute which stores the mapped global name 74, the mapped data type and a string which specifies any required algorithmic conversion(s). A key consisting of the vocabulary name, the class name and the attribute (or local) name 76 concatenated together is used as a key to a database lookup. This key structure allows multiple vocabularies to be stored in a single database or table. It also allows the same attribute name 76 to exist in different classes as well as for the same class name to exist in different vocabularies. Once the mapping information is retrieved, the attribute (local) name 76 found in the attribute name table 78 is written to an output string followed by the detected relational operator.
Returning to Figure 8, at decision box 80, a determination is made as to whether the local name is valid for a relevant source vocabulary 16 (i.e., does the local name correspond to an actual name within the source vocabulary 16). If not, an error is generated at block 82. Alternatively, at block 84, the local name 76 is inserted into the query expression.
At block 86, the attribute data type (e.g., floating point number, string or integer) is retrieved. At block 88, formatting of the expression commences.
The lookup attribute value mapping 62 also includes two components, namely (1) a semantic mapping for algorithmic conversion component and (2) a semantic mapping for lookup-based mapping of attribute values component.
Figure 9 is a flow chart illustrating an exemplary method 90, according to one embodiment of the present invention, of performing a semantic mapping of algorithmic conversions on attribute values.
At block 92, upon the detection of a value token (or operand), the lexical type for the relevant value token is determined.
In a first instance, if the value associated with the value token is typed as a floating point number, an algorithmic conversion, if indicated by the mapping structure, is performed at block 94. Similarly, if the value associated with the value token is typed as an integer, an algorithmic conversion, if indicated by the mapping structure, is performed at block 96. To perform the algorithmic conversions at blocks 94 and 96, the mapper 34 has access to a library of supported conversions integrated into it.
Examples of algorithmic conversions that may be performed at blocks 94 and 96 include unit-of-measure conversions (e.g., centimeters to inches) or any other conversions of values that may be algorithmically performed, as opposed to requiring a lookup operation.
Figure 10 is a flow chart illustrating an exemplary method 100, according to one exemplary embodiment of the present invention, of performing a semantic mapping for lookup-based mapping of attribute values.
As noted, if a value is typed as a floating point number at block 92 in Figure 9, is subject to an algorithm conversion at block 94. Figure 10 reveals that the value receives no further value mapping, and is accordingly inserted into an output string and a next token value is located. On the other hand, for string or integer typed values, additional lookups may be performed. A key (not shown) consisting of a vocabulary name, a class name, an attribute name and the value are concatenated together and utilized as a key to a database lookup in the name table 78 or a value table 79.
Dealing first with a string typed value, two attempts may be made to map such a string value. First, at block 102, a lookup is made in the attribute name table 78 to determine whether or not the value is in fact another attribute (local) name. If the lookup at block 102 succeeds, as determined at decision block 104, the local name 76 substitutes the global name 74, the result is placed in an output string, and the next value token is sorted.
On the other hand, should it be determined at decision block 104 that the lookup at block 102 has failed, the global name 74 is retained at block 108 and the key is utilized to perform a second lookup at block 110 in an attribute value table 79 to determine if the key is to an enumerated value.
If the lookup operation at block 110 succeeds, as determined at decision block 112, then a the local value retrieved from the value table 79 is substituted for the global value, written to an output string, whereafter the next value token is sought.
On the other hand, should it be determined at decision block 112 that the lookup operation at block 110 has failed, then the original, global value is retained at block 116, and written into an output string. Thereafter, the next value token is sought.
In the case where the value token is determined at block 92 to comprise an integer type value, following any required algorithmic conversions at block 96, the actions described above with reference to blocks 110 - 116 are then performed with respect to such an integer type value.
It should be noted that the two stage mapping of string values is only needed for applications involving query generation (since the value of two attributes may be compared with one another). In an alternative embodiment, only a single lookup may be used. Specifically, if a value token representing a logical operator is located, the logical operator will be added to an output string, and the process repeated. When the end of a valid expression is detected, the output string is used to query the data source. The results returned from the data source are then mapped back into the common vocabulary 14. The reverse mapping process is substantially as described above, with two exceptions. First, the mapping tables (e.g., the name table 78 and the value table 79) are from the source (or local) vocabulary 16 to the common vocabulary 14. Secondly, the input and output are <attribute_name, attribute_value> tuples.
Figure 11 shows the structure of exemplary mapping tables 64 within the context of an object-oriented database. The same structure may be implemented for the attribute name and the attribute value tables 78 and 79. In the attribute name table 78, all three fields are needed. For the attribute value table 79, only the field for the mapped value field is needed. A mapped name may furthermore be an attribute name or an attribute value. The tables 64 may be implemented through the object-oriented database using a hashed dictionary structure 130. Another data structure may be used so long as it has a look-up or search capability. The data structures, itself, need not be persistent, if it can be loaded from a persistent source (e.g., a file) when required. A database does, however, provide built-in mechanisms to more easily manage, store, retrieve and update the data. It is clear that the choice of data structures may affect the performance of the mapping functions. The hashed dictionary structure 130 generally provides fast lookup performance relative to other commonly used methods if an effective hash function is selected. Since these tables 64 are of the same format and the key string allows sufficient uniqueness, the two tables may be merged into one. This approach, however, may not significantly reduce the number of lookups, the aggregate table size, or the maintenance requirements but may result in some loss of understandability.
As noted above, in order to perform the required algorithmic conversions, the mapper 34 has a library of supported conversions integrated into it. One embodiment of the present invention links a library (static or dynamic) of conversion algorithms into the mapper 34. The appropriate conversion function is selected based on a code in the conversion specifier string, as shown in Figure 11. These may be packaged as remote procedures or distributed functions such as the Object Management Group's Common Object Request Broker Architecture (CORBA). This integration allows new functions to be easily added.
Figure 12 is a block diagram illustrating a machine, in the exemplary form of a computer system 250, within which a set of instructions for causing the computer system 250 to perform any one of the methodologies discussed above may be executed. The computer system 250 includes a processor 252, a main memory 254, and a static memory 255, which communicate with each other via a bus 256. The computer system 250 further includes a video display unit 258 (e.g., a liquid crystal display (LCD) or a cathode ray tube (CTR)). The computer system 250 further includes an alpha-numeric input device 260 (e.g., a keyboard), a cursor control device 262 (e.g., a mouse), a storage medium in the exemplary form of a disk drive unit 264, a signal generation device 266 (e.g., a speaker) and a network interface device 268.
The disk drive unit 264 includes a machine-readable medium 265 on which is stored a set of instructions (i.e., software 270) embodying any one, or all, of the methodologies described above. The software 270 is also shown to reside, completely or at least partially, within the main memory 254 and /or within the processor 252. The software 270 may furthermore be transmitted or received via the network interface device 268. For the purposes of the present specification, the term "machine-readable medium" shall be taken to include any medium that is capable of storing and encoding a sequence of instructions for execution by the machine, and that causes the machine to perform any one of the methodologies of the present invention. The term "machine- readable medium" shall accordingly be taken to include, but not limited to, solid-state memories, optical and magnetic disks, and carrier wave signals.
The present application is useful for developing and integrating applications and repositories wherein the data is accessed using terminology, methodologies, or other mechanisms that may differ from the source implementation, format, or context. It may be used to resolve both semantic and syntactic differences among applications and repositories. Although databases are used as examples throughout this disclosure, the approach is independent of the source implementation or underlying technology. This invention is also applicable to information brokering and mediation.
The present invention is also advantageous in that data migration is not required to customize the vocabulary or transparently to query multiple data sources as a single set. This alone results in a significant reduction of cost, operational disruption and risk. As the integrity of the various source repositories and applications is preserved, the operation of existing systems remains uninterrupted while new applications can access and make use of the information maintained by those systems.
While an exemplary embodiment of the present invention has been described above in the context of a query that is targeted to a set of structurally, syntactically or semantically diverse set of information sources, it will be appreciated that the invention extends to the integration of information sources through the use of a common vocabulary to resolve structural, syntactic or semantic differences between subsources and the customization and resolving of an interface to a set of structurally, syntactically or semantically diverse set of data sources utilizing terminology particular to a user's preference or task area.
Thus, a method and apparatus for querying a plurality of data sources have been described. Although the present invention has been described with reference to specific exemplary embodiments, it will be evident that various modifications and changes may be made to these embodiments without departing from the broader spirit and scope of the invention. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.

Claims

CLAIMSWhat is claimed is:
1. A method of querying a plurality of data sources, the method including:
receiving a common query constructed utilizing a first common vocabulary, the first common vocabulary being mapped to a plurality of source vocabularies associated with respective data sources of the plurality of data source; and
generating a plurality of source queries to the plurality of data sources, each source query being derived from the common query and being constructed utilizing a respective source vocabulary.
2. The method of claim 1 wherein the common query is constructed utilizing a first term of the first common vocabulary, the first term of the first common vocabulary being mapped by at least one mapping structure to corresponding first terms of the plurality of source vocabularies associated with the respective data sources.
3. The method of claim 2 wherein the generating of the plurality of source queries includes identifying the corresponding first terms of the plurality of source vocabularies utilizing the first term of the first common vocabulary and the at least one mapping structure, and including the first terms of the plurality of source vocabularies within the respective source queries directed to the plurality of data sources.
4. The method of claim 1 wherein the first common vocabulary is semantically mapped to the plurality of source vocabularies so as to address semantic differences across the plurality of source vocabularies.
5. The method of claim 1 wherein the common vocabulary is syntactically mapped to the plurality of source vocabularies so as to address syntactic differences across the plurality of source vocabularies.
6. The method of claim 1 wherein the common vocabulary is structurally mapped to the plurality of source vocabularies so as to address structural differences across the plurality of source vocabularies.
7. The method of claim 1 wherein the generating of the plurality of source queries comprises performing a name translation for the first term between a common name and a source name.
8. The method of claim 1 wherein the generating of the plurality of source queries comprises performing a value translation for a value associated with the first term between a common value and a source value.
9. The method of claim 8 wherein the value translation comprises a mapping of enumerated variables.
10. The method of claim 8 wherein the value translation comprises an algorithmic conversion of the common value to the source value.
11. The method of claim 1 wherein the first common vocabulary comprises terms specific to a first entity and a second common vocabulary comprises terms specific to a second entity.
12. Apparatus for distributing a common query across a plurality of data sources, the apparatus including:
a communication interface to receive a global query constructed utilizing a global vocabulary, the global vocabulary being mapped to a plurality of local vocabularies associated with respective data sources of the plurality of data sources; and
a mapper to generate a plurality of local queries to the plurality of data sources, each local query being derived from the global query and being constructed utilizing a respective local vocabulary.
13. The apparatus of claim 12 wherein the global query is constructed utilizing a first term of the global vocabulary, and wherein the mapper is to map the first term of the global vocabulary to at least one corresponding first term of a local vocabulary associated with a data source.
14. The apparatus of claim 13 wherein the mapper is to identify the first term of the source vocabulary utilizing the first term of the global vocabulary, and includes the first term of the local vocabulary within a respective local query directed to the data source.
15. The method of claim 12 wherein the mapper is to map the global vocabulary to at least one local vocabulary so as to address semantic specifics of the at least one local vocabulary.
16. The apparatus of claim 12 wherein the mapper is to map the global vocabulary to at least one local vocabulary so as to address syntactic specifics of the at least one local vocabulary.
17. The apparatus of claim 12 wherein the mapper is to map the global vocabulary to at least one local vocabulary so as to address structural specifics of the at least one local vocabulary.
18. The apparatus of claim 13 wherein the mapper is to perform a name translation of the first term between a global name and a local name.
19. The apparatus of claim 13 wherein the mapper is to perform a value translation for a value associated with the first term between a common value and a source value.
20. The apparatus of claim 19 wherein the mapper is to perform a mapping of enumerated variables.
21. The apparatus of claim 19 wherein the mapper is to perform an algorithmic conversion between a global value and a local value.
22. Apparatus for distributing a common query across a plurality of data sources, the apparatus including:
first means for receiving a global query constructed utilizing a global vocabulary, the global vocabulary being mapped to a plurality of local vocabularies associated with respective data sources of the plurality of data sources; and
second means for generating at least one local query to a data source, the local query being derived from the global query and being constructed utilizing the respective local vocabulary.
23. A machine-readable medium storing a sequence of instructions that, when executed by a machine, cause the machine to:
receive a common query constructed utilizing a first common vocabulary, the first common vocabulary being mapped to a plurality of source vocabularies associated with respective data sources of the plurality of data source; and generate a plurality of source queries to the plurality of data sources, each source query being derived from the common query and being constructed utilizing a respective source vocabulary.
PCT/US2000/009203 1999-04-09 2000-04-07 A method of mapping semantic context to enable interoperability among disparate sources WO2000065486A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU63340/00A AU6334000A (en) 1999-04-09 2000-04-07 A method of mapping semantic context to enable interoperability among disparate sources

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US12846699P 1999-04-09 1999-04-09
US60/128,466 1999-04-09

Publications (2)

Publication Number Publication Date
WO2000065486A2 true WO2000065486A2 (en) 2000-11-02
WO2000065486A3 WO2000065486A3 (en) 2003-12-04

Family

ID=22435511

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2000/009203 WO2000065486A2 (en) 1999-04-09 2000-04-07 A method of mapping semantic context to enable interoperability among disparate sources

Country Status (2)

Country Link
AU (1) AU6334000A (en)
WO (1) WO2000065486A2 (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2368680A (en) * 2000-05-11 2002-05-08 Charteris Plc Mapping an XML logical structure to a business model to aid translation, querying and API programming
WO2002075594A2 (en) * 2001-03-20 2002-09-26 Interx Plc Information integration system
WO2002080026A1 (en) * 2001-03-30 2002-10-10 British Telecommunications Public Limited Company Global database management system integrating heterogeneous data resources
WO2002080028A1 (en) * 2001-03-30 2002-10-10 British Telecommunications Public Limited Company Global database management system integrating heterogeneous data resources
WO2003030025A1 (en) * 2001-09-28 2003-04-10 British Telecommunications Public Limited Company Database management system
EP1381972A2 (en) * 2000-12-06 2004-01-21 Biosentients, Inc. System, method, software architecture and business model for an intelligent object based information technology platform
WO2004077215A2 (en) * 2003-01-30 2004-09-10 Vaman Technologies (R & D) Limited System and method for data migration and conversion
WO2005096174A1 (en) * 2004-04-02 2005-10-13 Health Communication Network Limited Method, apparatus and computer program for searching multiple information sources
EP1631924A2 (en) * 2003-05-19 2006-03-08 Business Objects Americas Apparatus and method for accessing diverse native data sources through a metadata interface
CN100409190C (en) * 2005-02-14 2008-08-06 三星电机株式会社 Method and system of managing conflicts between applications using semantics of abstract services for group context management
US7673330B2 (en) 2006-01-05 2010-03-02 Microsoft Corporation Ad-hoc creation of group based on contextual information
US7702639B2 (en) 2000-12-06 2010-04-20 Io Informatics, Inc. System, method, software architecture, and business model for an intelligent object based information technology platform
WO2013013942A1 (en) * 2011-07-28 2013-01-31 International Business Machines Corporation Entity resolution
WO2013181588A2 (en) * 2012-06-01 2013-12-05 Staples, Inc. Defining and mapping application interface semantics
CN109829010A (en) * 2017-08-28 2019-05-31 西安中兴新软件有限责任公司 A kind of entry amending method and entry modify device
EP3511843A1 (en) * 2018-01-10 2019-07-17 Siemens Aktiengesellschaft Method and system for integrating data received from one or more data sources

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4751635A (en) * 1986-04-16 1988-06-14 Bell Communications Research, Inc. Distributed management support system for software managers
WO1990008360A1 (en) * 1989-01-12 1990-07-26 Telebase Systems, Inc. System and method for retrieving information from a plurality of databases
US5634053A (en) * 1995-08-29 1997-05-27 Hughes Aircraft Company Federated information management (FIM) system and method for providing data site filtering and translation for heterogeneous databases
US5701423A (en) * 1992-04-10 1997-12-23 Puma Technology, Inc. Method for mapping, translating, and dynamically reconciling data between disparate computer platforms

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4751635A (en) * 1986-04-16 1988-06-14 Bell Communications Research, Inc. Distributed management support system for software managers
WO1990008360A1 (en) * 1989-01-12 1990-07-26 Telebase Systems, Inc. System and method for retrieving information from a plurality of databases
US5701423A (en) * 1992-04-10 1997-12-23 Puma Technology, Inc. Method for mapping, translating, and dynamically reconciling data between disparate computer platforms
US5634053A (en) * 1995-08-29 1997-05-27 Hughes Aircraft Company Federated information management (FIM) system and method for providing data site filtering and translation for heterogeneous databases

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2368680A (en) * 2000-05-11 2002-05-08 Charteris Plc Mapping an XML logical structure to a business model to aid translation, querying and API programming
EP1381972A2 (en) * 2000-12-06 2004-01-21 Biosentients, Inc. System, method, software architecture and business model for an intelligent object based information technology platform
US7702639B2 (en) 2000-12-06 2010-04-20 Io Informatics, Inc. System, method, software architecture, and business model for an intelligent object based information technology platform
EP1381972A4 (en) * 2000-12-06 2007-03-14 Io Informatics Inc System, method, software architecture and business model for an intelligent object based information technology platform
WO2002075594A2 (en) * 2001-03-20 2002-09-26 Interx Plc Information integration system
WO2002075594A3 (en) * 2001-03-20 2003-06-05 Interx Plc Information integration system
WO2002080028A1 (en) * 2001-03-30 2002-10-10 British Telecommunications Public Limited Company Global database management system integrating heterogeneous data resources
WO2002080026A1 (en) * 2001-03-30 2002-10-10 British Telecommunications Public Limited Company Global database management system integrating heterogeneous data resources
WO2003030025A1 (en) * 2001-09-28 2003-04-10 British Telecommunications Public Limited Company Database management system
WO2004077215A2 (en) * 2003-01-30 2004-09-10 Vaman Technologies (R & D) Limited System and method for data migration and conversion
WO2004077215A3 (en) * 2003-01-30 2005-05-26 Vaman Technologies R & D Ltd System and method for data migration and conversion
EP1631924A2 (en) * 2003-05-19 2006-03-08 Business Objects Americas Apparatus and method for accessing diverse native data sources through a metadata interface
EP1631924A4 (en) * 2003-05-19 2009-12-30 Business Objects Americas Apparatus and method for accessing diverse native data sources through a metadata interface
WO2005096174A1 (en) * 2004-04-02 2005-10-13 Health Communication Network Limited Method, apparatus and computer program for searching multiple information sources
CN100409190C (en) * 2005-02-14 2008-08-06 三星电机株式会社 Method and system of managing conflicts between applications using semantics of abstract services for group context management
US7673330B2 (en) 2006-01-05 2010-03-02 Microsoft Corporation Ad-hoc creation of group based on contextual information
WO2013013942A1 (en) * 2011-07-28 2013-01-31 International Business Machines Corporation Entity resolution
US8972387B2 (en) 2011-07-28 2015-03-03 International Business Machines Corporation Smarter search
WO2013181588A2 (en) * 2012-06-01 2013-12-05 Staples, Inc. Defining and mapping application interface semantics
WO2013181588A3 (en) * 2012-06-01 2014-02-13 Staples, Inc. Defining and mapping application interface semantics
US9552400B2 (en) 2012-06-01 2017-01-24 Staples, Inc. Defining and mapping application interface semantics
CN109829010A (en) * 2017-08-28 2019-05-31 西安中兴新软件有限责任公司 A kind of entry amending method and entry modify device
CN109829010B (en) * 2017-08-28 2023-11-14 西安中兴新软件有限责任公司 Entry modification method and entry modification device
EP3511843A1 (en) * 2018-01-10 2019-07-17 Siemens Aktiengesellschaft Method and system for integrating data received from one or more data sources
WO2019137706A1 (en) * 2018-01-10 2019-07-18 Siemens Aktiengesellschaft Method and system for integrating data received from one or more data sources

Also Published As

Publication number Publication date
WO2000065486A3 (en) 2003-12-04
AU6334000A (en) 2000-11-10

Similar Documents

Publication Publication Date Title
US7877726B2 (en) Semantic system for integrating software components
US6418448B1 (en) Method and apparatus for processing markup language specifications for data and metadata used inside multiple related internet documents to navigate, query and manipulate information from a plurality of object relational databases over the web
US5913214A (en) Data extraction from world wide web pages
US7823123B2 (en) Semantic system for integrating software components
US8612468B2 (en) System and method for retrieving data from a relational database management system
US7779386B2 (en) Method and system to automatically regenerate software code
JP4264118B2 (en) How to configure information from different sources on the network
AU2006200383B2 (en) Mapping of a file system model to a database object
US8983931B2 (en) Index-based evaluation of path-based queries
US20150269223A1 (en) System for Accessing a Relational Database Using Semantic Queries
US20050234889A1 (en) Method and system for federated querying of data sources
US20080168420A1 (en) Semantic system for integrating software components
US7552151B2 (en) System, method and program product for adding, updating and removing RDF statements stored on a server
US20060282482A1 (en) Method and system for model-based replication of data
US9684699B2 (en) System to convert semantic layer metadata to support database conversion
WO2012054860A1 (en) Accessing relational databases as resource description framework databases
WO2000065486A2 (en) A method of mapping semantic context to enable interoperability among disparate sources
Braga et al. The use of mediation and ontology technologies for software component information retrieval
US20170011128A1 (en) Dynamic domain query and query translation
Indulska et al. A Type Management System for an ODP Trader.
US7970867B2 (en) Hypermedia management system
US20060265387A1 (en) Method and apparatus for loading artifacts
Barbara et al. Information brokers: Sharing knowledge in a heterogeneous distributed system
Weal et al. Open hypermedia as a navigational interface to ontological information spaces
US20050119871A1 (en) System and method for multiple model object sharing

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: COMMUNICATION PURSUANT TO RULE 69 EPC (EPO FORM 2524 OF 060203)

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase in:

Ref country code: JP