WO2006003351A1

WO2006003351A1 - Analysing data formats and translating to common data format

Info

Publication number: WO2006003351A1
Application number: PCT/GB2004/002889
Authority: WO
Inventors: Brian Bolam; Stephen Byrne
Original assignee: Omprompt Limited
Priority date: 2004-07-02
Filing date: 2004-07-02
Publication date: 2006-01-12

Abstract

An interface (100) allows two shippers S1, S2 to send messages formatted according to their preferences to any or all of a number of truckers T1, T2, T3 and have the truckers T1, T2, T3 receive the messages in their own preferred format. The truckers T1, T2, T3 can thus be coordinated such that whichever of them is available may come and collect a load from one of the shippers S1, S2 as soon as the load has landed. In order to achieve this the interface translates incoming messages from the format they are received in to the format that the intended recipient desires to receive messages in.

Description

ANALYSING DATA FORMATS AND TRANSLATING TO COMMON DATA FORMAT

The present invention relates to a data interface and in particular a data interface capable of translating data input into it in a first format into output data in a second format.

Most contemporary businesses use computerised systems to keep information about their clients and to manage incoming or outgoing orders or information requests. However, different businesses use disparate systems, whether packaged or bespoke, which struggle to exchange data. Accordingly data transfer between businesses is often problematic.

This problem has long been recognised and has resulted in the proliferation of 'data and message standards' such as EDI, EDIFACT, XML etc. These 'standards' have themselves been modified by individual users and user communities to the point at which it becomes necessary to translate even 'standard' message exchange between systems. Further, change is the reality of business and it follows therefore that data exchange is in a constant state of flux. As a consequence, extensive effort is required to integrate, deploy and support linked systems.

In order to ease these problems companies may attempt to coordinate into a community which uses a common system, replacing their earlier individualised systems. This however can be expensive and may not always be practicable.

Alternatively businesses may attempt to provide interfaces for many different formats or attempt to map one format into another. Software has been developed to reduce this effort and packages and services are available to undertake this work. However, these packages are not automated and thus many hours of skilled labour are required, using drag and drop technology, to set up the package so that inbound messages in a particular format may be translated into another specified format. Each time the system is adapted to translate messages in another form, much time and expense is required. Such systems however still break down where there is a mistake in the inbound data and either fail to accept the message or translate the message incorrectly.

It is thus an object of the present invention to provide an improved method of transferring messages from one form into another.

According to a first aspect of the present invention there is provided a method for interfacing between an input data set in a first format and an output data set in a second format comprising the steps of: analysing the syntax of the input data set; separating the individual data fields in the input data set; extracting the data in each individual field; arranging the extracted data in accordance with the second format and thereby generating an output data set.
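By way of illustration only, the four steps of the first aspect may be sketched in Python as follows. The delimiters, field names and output layout used here are hypothetical and are not taken from the specification; the sketch assumes that the syntax of the input is already known to be a simple delimited record.

```python
def translate(record, in_delim, in_fields, out_fields, out_delim):
    """Translate one delimited record from a first format into a second format."""
    # Steps 1-2: analyse the syntax (here, a known delimiter) and separate the fields.
    parts = record.split(in_delim)
    if len(parts) != len(in_fields):
        raise ValueError(f"expected {len(in_fields)} fields, got {len(parts)}")
    # Step 3: extract the data held in each individual field.
    data = {name: value.strip() for name, value in zip(in_fields, parts)}
    # Step 4: arrange the extracted data in accordance with the second format.
    return out_delim.join(data[name] for name in out_fields)


# Hypothetical example: reorder the fields and change the delimiter.
print(translate("ORD123+FOXTRANS+2004-07-02", "+",
                ["order", "carrier", "date"],
                ["carrier", "order", "date"], "|"))
```

In this toy case the second format differs from the first only in field order and delimiter; a practical second format would typically also differ in field names, lengths and encoding.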

According to a second aspect of the present invention there is provided an interface according to the first aspect of the present invention having means for receiving an input data set, means for processing the inbound data set in order to generate an output data set and means for exporting the output data set. Such an interface allows a system to deal with input data sets in a format that is not supported by the system by translating them into a format that is supported by the system.

Preferably, the ontology for analysing the message syntax, separating the individual data fields and thus extracting the data in each individual field is generated by Intelligent Agent/Knowledge Management (IA/KM) technology. Most preferably the IA/KM technology is similar to that available from MagentA Corp Ltd.

Additionally and or alternatively, the ontology for analysing the message syntax, separating the individual data fields and thus extracting data in each individual field is Automated Message Profiling (AMP).

Preferably the interface may be adapted to analyse and translate input data sets in any desired format. Most preferably the interface is additionally adapted to export an output data set in any desired format. In order to translate from an input data set in a first format to an output data set in a second format, the interfacing method may include the steps of translating the input data in the first format into a common format and then translating the data in the common format into the second format.
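The two-stage translation through a common format may be sketched as below. The two toy formats ("pipe" and "kv"), the field names and the registry dictionaries are assumptions made purely for illustration.

```python
# Hypothetical toy formats, for illustration only:
#   "pipe": ORD123|FOXTRANS        "kv": order=ORD123;carrier=FOXTRANS
def parse_pipe(text):
    order, carrier = text.split("|")
    return {"order": order, "carrier": carrier}

def parse_kv(text):
    return dict(pair.split("=", 1) for pair in text.split(";"))

def emit_pipe(common):
    return f"{common['order']}|{common['carrier']}"

def emit_kv(common):
    return ";".join(f"{key}={common[key]}" for key in ("order", "carrier"))

# One parser and one emitter per supported format.
PARSERS = {"pipe": parse_pipe, "kv": parse_kv}
EMITTERS = {"pipe": emit_pipe, "kv": emit_kv}

def translate(text, first_format, second_format):
    common = PARSERS[first_format](text)      # first format -> common format
    return EMITTERS[second_format](common)    # common format -> second format

print(translate("ORD123|FOXTRANS", "pipe", "kv"))
```

With this arrangement, supporting N formats requires N parsers and N emitters rather than a separate map for every ordered pair of formats.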

Particular formats that may be translated using such a system include but are not limited to EDI, XML, ASCII, SMS, Voice, VML, Fax and similar. The particular formats mentioned herein include any related formats and particular variations upon these formats. In particular EDI as defined herein covers a number of sub formats including but not limited to ANSI X12, HL7, SPEC 2000, UN/EDIFACT, EDIBUILD, EDICON, EMEDI and similar; XML as defined herein includes any such sub formats as may exist including but not limited to ASC X12 and UN/CEFACT.

The interface of the present invention may also be adapted to work with a number of database formats for instance to provide a means for databases in two different formats to be merged into a single database either in the same format as one of the original databases or in a third format. Alternatively, the interface can be positioned between a user and the two databases in order that user queries can be translated by each database.

Preferably in addition to such data mapping that the interface is able to perform as described above, it is additionally operative to provide message mapping. In particular it is adapted so as to be able to receive an inbound message in any format and translate the inbound message into an output message in a desired format.

Additionally and preferably, the interface is adapted to provide message switching; that is to receive an inbound message from a sender in a first format, translate the latter into a second format and forward the message to a desired recipient. There may be one or more desired recipients and the interface may act to translate the message into different formats for each individual recipient if desired. Messages received from one sender may always be sent to one or more particular recipients or may alternatively only be sent to recipients specified in the message. Messages may be sent to recipients only in a particular specified format or may be sent in one of a variety of specified formats depending on the data in the message or other criteria. The interface may further act to send messages to the sender confirming translation or receipt of the sender's initial message. Preferably the interface translates all inbound messages into a common format and then translates messages from the common format into the desired output message format. The interface preferably stores copies of inbound messages either in their original format or in the common format or in both formats.

Preferably the method of interfacing additionally includes the step of verifying that the content of the input data set or message conforms to standard or expected criteria. If the message does not conform to the criteria an error message may be sent to the sender and or another authority. The error message may request that the sender or the authority check and or resend the message. The error message may additionally include information relating to why the message was rejected.

Alternatively or additionally if a message fails to conform to the criteria, the interface may attempt to correct minor errors and if successful may notify the sender of the correction and or some other authority. The interface may be required to await acceptance of the correction from the sender or other authority before forwarding the corrected message to the desired recipients.

This verification and or correction is useful in situations wherein documents are intended to contain identical information but for some reason do not. For instance in the transport industry a bank will not clear a letter of credit if its details do not exactly match those on the associated bill of lading. The verification and or correction described above can be used to either flag up any problems as early as possible or to correct minor errors to avoid the occurrence of such problems.

Preferably records are generated and stored relating to each message received from a sender and forwarded to a recipient. Preferably the records include some or all of the following: the format of the inbound message and or the outbound message, the time and date of receipt of a message, a log of any errors identified in the message, identification of the sender, identification of the recipient or recipients.

Preferably in order to accept a new sender of messages, the new sender is required to take part in an automated registration process. They must provide a standard set of details for the interface and then send a message to the interface in their preferred format. The message can then be stored as a calibration standard to be used to aid future translation of messages from the sender.

Preferably the above described interface may be adapted to provide a messaging interface for a group of businesses. In particular, the above may be used to form a messaging interface for the logistics industry. Preferably such an interface may be embodied by a plurality of securely connected network service modules. Each network service module may be a peer-to-peer service module comprising a hub linking two communications servers, a database server and a processing server.

According to a third aspect of the present invention there is provided a method of determining the structure of messages in a particular format by analysing a plurality of such messages comprising the steps of: identifying individual segments within each message to thereby determine the segment structure of the particular message format; identifying individual elements within each segment, thereby determining the element structure of each segment; and analysing corresponding individual elements in each segment over all the analysed messages to determine the structure of each element in the particular data format.

Preferably, the segment ID is used to identify the start and end positions of each segment. Preferably, the identified segments are counted so that the mean number of segments per message and the standard deviation in the number of segments per message can be determined. Most preferably, the segment structure of each individual message is analysed to determine which segments appear in every message either once or more than once in every message and which segments appear once or more than once in only some messages and the number of messages with identical segment structures.

Preferably, a user is provided with a list of all segments used in each message in the plurality of messages. Preferably said list also contains statistical information relating to the number of times that the segment has occurred over all messages and information relating to the number of occurrences of a segment as a percentage of the number of messages.

Preferably the method may be applied to identifying elements in segments where the elements are delimited and/or to identifying elements in segments where the elements have a fixed length. If the segments to be analysed have delimited elements, preferably the element delimiters or separators may be used to identify the start and end positions of individual elements. In such cases, as each element can be readily identified, preferably only one segment at a time is analysed. If the segments to be analysed have fixed lengths, each like segment from the plurality of messages is analysed at the same time to identify the individual elements.
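The two cases may be sketched as follows. The specification does not state how fixed-length element boundaries are found; the blank-column heuristic used here (a column belongs to an element if any sample has a non-blank character there) is an assumption for illustration, and it only works where elements are space-padded and separated by at least one column that is blank in every sample.

```python
def split_delimited(segment, delimiter="+"):
    """Delimited case: one segment at a time, split on the element separator."""
    return segment.split(delimiter)

def infer_fixed_spans(segments):
    """Fixed-length case: analyse like segments together and return the
    (start, end) column spans of the elements (hypothetical heuristic)."""
    width = max(len(s) for s in segments)
    padded = [s.ljust(width) for s in segments]
    # A column is "used" if at least one sample holds a non-blank character there.
    used = [any(s[i] != " " for s in padded) for i in range(width)]
    spans, start = [], None
    for i, flag in enumerate(used):
        if flag and start is None:
            start = i                      # an element begins
        elif not flag and start is not None:
            spans.append((start, i))       # an element ends
            start = None
    if start is not None:
        spans.append((start, width))
    return spans

# Two hypothetical like segments with space-padded fixed-length elements.
samples = ["R521 " + "FOX".ljust(9) + "0012",
           "R521 " + "TRANSCO".ljust(9) + "0340"]
spans = infer_fixed_spans(samples)
print(spans)
print([samples[0][a:b] for a, b in spans])
```

The more varied the sample, the more likely it is that every column belonging to an element is occupied in at least one segment, which is consistent with the requirement that like segments from many messages be analysed together.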

Preferably, if the individual elements of each segment contain sub-elements, either above process may be used to identify the sub-element structure of each element as appropriate.

Preferably, to analyse the elements, the data in each like segment is entered into a table or similar such that corresponding elements in each segment are entered into the same column of the table. Preferably, the elements are then analysed to determine the amount of element variation between segments. Most preferably, said elements are classified into a plurality of different classes by the amount of variation, wherein variation is defined as the number of unique values that occur in a column as a percentage of the number of cells in the column.
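The variation measure defined above may be computed as in the following sketch. The class names and the thresholds separating them are hypothetical; the specification states only that elements are classified by their amount of variation.

```python
def column_variation(rows):
    """rows: one list of elements per like segment, all of equal length.
    Returns, per column, unique values as a percentage of the number of cells."""
    columns = list(zip(*rows))
    return [100.0 * len(set(col)) / len(col) for col in columns]

def classify(variation, low=25.0, high=75.0):
    """Hypothetical three-way classification by amount of variation."""
    if variation <= low:
        return "static"
    if variation >= high:
        return "volatile"
    return "semi-static"

# Four like segments entered into a table, one row per segment.
table = [
    ["R521", "FOX",  "0012"],
    ["R521", "FOX",  "0340"],
    ["R521", "ACME", "0077"],
    ["R521", "FOX",  "0101"],
]
variations = column_variation(table)
print(variations)                          # [25.0, 50.0, 100.0]
print([classify(v) for v in variations])   # ['static', 'semi-static', 'volatile']
```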

Preferably the elements in the table are also analysed to determine whether they: are mandatory or optional; include date or time information; are alpha, alphanumeric or numeric; require a zero fill, if numeric; have a maximum or minimum length.
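A sketch of such element profiling is given below. It assumes that an absent element is represented by an empty string, that dates appear in the form YYYYMMDD, and that a numeric element is zero filled when all values share one length and at least one begins with a zero; none of these conventions is fixed by the specification.

```python
from datetime import datetime

def looks_like_date(value):
    """True if value is an eight-character date in the assumed YYYYMMDD form."""
    if len(value) != 8:
        return False
    try:
        datetime.strptime(value, "%Y%m%d")
        return True
    except ValueError:
        return False

def profile_element(values):
    """Profile one table column; an empty string marks an absent element."""
    present = [v for v in values if v != ""]
    if not present:
        kind = "unknown"
    elif all(v.isdigit() for v in present):
        kind = "numeric"
    elif all(v.isalpha() for v in present):
        kind = "alpha"
    else:
        kind = "alphanumeric"
    lengths = [len(v) for v in present]
    return {
        "mandatory": len(present) == len(values),
        "kind": kind,
        "date": bool(present) and all(looks_like_date(v) for v in present),
        "zero_fill": (kind == "numeric" and len(set(lengths)) == 1
                      and any(v.startswith("0") for v in present)),
        "min_length": min(lengths, default=0),
        "max_length": max(lengths, default=0),
    }

print(profile_element(["0012", "0340", "", "0101"]))
print(profile_element(["20040702", "20060112"]))
```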

Preferably, the method includes the step of providing a suggested map between the determined structure of the message and a common data format. Preferably, the common data format contains syntax and extensive descriptive/reference details. Most preferably, the common data format may be extended to cope with new data segments, elements or formats. Preferably, the suggested map is generated by comparing each element in the particular format with each element in the common format. Preferably, this will provide a list of elements in the particular format that match elements in the common format and elements in the particular format that do not match elements in the common format. The elements that do not match may be matched to existing elements in the common format manually or may be matched to new elements created for the common format.
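The generation of a suggested map may be sketched as below. The specification does not state the criterion by which an element in the particular format is judged to match an element in the common format; comparison of normalised element names is assumed here purely for illustration, and all element names are hypothetical.

```python
def normalise(name):
    """Lower-case a name and drop everything except letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def suggest_map(new_elements, common_elements):
    """Compare each element of the new format with the common format.
    Returns (matched, unmatched): a dict new -> common, and a list of
    elements left for manual matching or for extension of the common format."""
    index = {normalise(c): c for c in common_elements}
    matched, unmatched = {}, []
    for element in new_elements:
        target = index.get(normalise(element))
        if target is None:
            unmatched.append(element)
        else:
            matched[element] = target
    return matched, unmatched

matched, unmatched = suggest_map(
    ["Order-No", "carrier_code", "ETA"],
    ["OrderNo", "CarrierCode", "DeliveryDate"],
)
print(matched)
print(unmatched)
```

The unmatched list corresponds to the elements that would be matched manually or given new elements in the common format.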

Alternatively or additionally, intelligent syntax generation may be used to compare a particular format against messages of other formats that have previously been mapped, and a map may be created by adapting a map generated for a previous format. A further possibility is that messages may be mapped automatically by analysing their overall ID, their segment ID and their element ID to find matches in a database of previous messages in previous formats.

Preferably, the method may be further adapted to provide the steps of detecting variations in known message formats. Most preferably, the detectable variations include: the addition of segments, elements or sub-elements; the omission of segments, elements or sub-elements; and the transposition of segments, elements or sub-elements.
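Detection of these three kinds of variation at segment level may be sketched as follows. The sketch assumes, for simplicity, that each segment ID occurs at most once in a message; the segment IDs are hypothetical.

```python
def detect_variations(known, observed):
    """Compare an observed segment sequence with the known structure.
    Assumes each segment ID appears at most once per message."""
    added = [s for s in observed if s not in known]
    omitted = [s for s in known if s not in observed]
    # Segments present in both: a different relative order means transposition.
    shared_known = [s for s in known if s in observed]
    shared_observed = [s for s in observed if s in known]
    return {
        "added": added,
        "omitted": omitted,
        "transposed": shared_known != shared_observed,
    }

known = ["R521", "R515", "R529", "R999"]
print(detect_variations(known, ["R521", "R529", "R515", "R600", "R999"]))
print(detect_variations(known, ["R521", "R515", "R999"]))
```

The same comparison could be applied to the elements of a segment or the sub-elements of an element.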

In particular cases where a variation is detected, the method may further include the step of cleansing the message. This may be achieved by: replacing data stored in an element; adding segments, elements or sub-elements; deleting segments, elements or sub-elements; or transposing segments, elements or sub-elements. The third aspect of the present invention may be implemented in conjunction with the first and/or second aspects of the invention as desired or appropriate.

In order that the invention be more clearly understood one embodiment is described further herein, with reference to the accompanying drawings in which:-

Figure 1 shows a network interface according to the present invention for interfacing messages between a number of truckers and a number of shippers;

Figure 2 shows a schematic block diagram of the interface of figure 1;

Figure 3 shows a flow diagram demonstrating the use of the interface of figure 1;

Figure 4 shows a list of segments occurring in a sample of messages;

Figure 5 shows a list of standard messages built up from the segments listed in figure 4;

Figure 6 shows an example of a message having delimited elements;

Figure 7 shows an example of a message having fixed length elements;

Figure 8 shows steps in the analysis of a message segment having delimited elements;

Figure 9 shows steps in the analysis of a message segment having fixed length elements;

Figure 10 shows a segment of a message in the format analysed in figure 9 illustrating the volatility of the various elements in the segment;

Figure 11 shows how elements in a new format may be mapped to elements in a common format using declarative mapping;

Figure 12 shows how elements in a new format may be mapped to elements in a common format using automatic mapping; and

Figure 13 shows examples of message statistics retrievable by a user.

Referring now to Figure 1 an interface 100 according to the present invention is in use in the transport logistics industry. The interface allows two shippers S1, S2 to send messages formatted according to their preferences to any or all of a number of truckers T1, T2, T3 and have the truckers T1, T2, T3 receive the messages in their own preferred format. The truckers T1, T2, T3 can thus be coordinated such that whichever of them is available may come and collect a load from one of the shippers S1, S2 as soon as the load has landed. In order to achieve this the interface translates incoming messages from the format they are received in to the format that the intended recipient desires to receive messages in. The interface 100 in figure 1 is shown acting to translate messages between five different formats (ANSI X12, EDIFACT, SMS, XML, Fax) but may however be adapted to translate messages input into it in any readable format into any other readable format for output. In particular for the logistics industry the interface 100 is capable of translating between many variations upon standard EDI or XML and voice messages. The interface 100 is typically embodied by a network of securely connected network service modules 200, which may be peer-to-peer service modules. Each peer-to-peer service module 200 typically comprises a hub 210 linking two or more communications servers 202, 204, a database server 206, and a processing server 208.

Referring now to figure 2, a schematic block diagram of the interface 100 is shown. Incoming messages 102 are held in a buffer 104 until sufficient processing space is available. Once space is available, the incoming message 102 is passed to a translation engine 106. The translation engine 106 translates the message using IA/KM technology in conjunction with stored protocols and formats 108 and stored knowledge 110 into a single common data format. The translation is generated by analysing the message syntax in order to separate the individual data fields and then extract the data stored in each individual field in order to thus generate a message in the common format containing all the data of the incoming message 102.

The incoming message is stored in its original format in a historical message database 112 and in the common format in a common message database 116. Each common format message is verified by a verification engine 114 to ensure that the data contained therein is in accordance with expected values. In the event that data inaccuracies are detected, the verification engine 114 may attempt to correct these by analysing previous stored messages from the sender to obtain previous or expected values. If this is not possible, the verification engine 114 may request user intervention to correct the data. In the event of any detected data anomaly, the system will create a log which may be viewed by an authorised user and which may be used to generate outbound messages back to the sender of the incoming message 102. If the information in the incoming message 102 is successfully verified, the common format message is output to the outbound message generator 118. The outbound message generator determines to which recipients the message is to be sent and looks up the recipients' addresses in the recipients address database 122 and the recipients' preferred message format in the recipients preferred formats database 120. The outbound message generator then generates a correctly addressed outbound message 124 in the recipient's preferred format which is then transmitted to the recipient. Typically the recipient will then confirm receipt of the message.
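The behaviour of the verification engine 114 may be sketched as below. The representation of expected values as per-field predicates, and the choice of the most frequent valid value in the sender's previous messages as the correction, are assumptions made for illustration; the specification says only that previous or expected values are obtained from stored messages.

```python
from collections import Counter

def verify(message, history, expected):
    """Verify a common-format message (a dict) against expected values.
    expected: dict mapping field name -> predicate returning True if valid.
    history:  previous messages (dicts) from the same sender.
    Returns (corrected_message, log); a log entry whose correction is None
    marks an anomaly that needs user intervention."""
    corrected, log = dict(message), []
    for field, is_ok in expected.items():
        value = message.get(field)
        if is_ok(value):
            continue
        previous = [m[field] for m in history if field in m and is_ok(m[field])]
        if previous:
            # Assumed correction rule: the sender's most frequent valid value.
            corrected[field] = Counter(previous).most_common(1)[0][0]
            log.append((field, value, corrected[field]))
        else:
            log.append((field, value, None))
    return corrected, log

expected = {"quantity": lambda v: isinstance(v, str) and v.isdigit()}
history = [{"quantity": "40"}, {"quantity": "40"}, {"quantity": "12"}]
message = {"order": "ORD123", "quantity": "4O"}   # letter O typed for zero

print(verify(message, history, expected))
print(verify(message, [], expected))
```

In practice a correction of this kind might be held until the sender or another authority has accepted it, as described earlier in relation to minor error correction.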

Many-to-one, one-to-one and one-to-many relationships between incoming 102 and outgoing messages 124 can be established between particular senders and recipients if desired. The interface ontology contains a section which codifies these relationships and which is used to generate the appropriate outbound messages 124. Outbound messages 124 will be triggered only when the full set of clean data required for that process has been received by the system.

Multiple outbound messages 124 are only triggered when the data for the complete set has been received and verified. The interface 100 will not generate outbound messages 124 on an incomplete data set, even though the subset of data required for that outbound message 124 has been received and verified.
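The rule that outbound messages 124 are triggered only on a complete, verified data set may be sketched as follows. The class name, the message kinds and the boolean verification flag are hypothetical.

```python
class OutboundGate:
    """Holds verified inbound data until the complete required set is present."""

    def __init__(self, required_kinds):
        self.required = set(required_kinds)
        self.verified = {}

    def receive(self, kind, data, is_verified):
        """Record an inbound message; return True once outbound
        messages may be generated."""
        if is_verified:
            self.verified[kind] = data
        return self.ready()

    def ready(self):
        # Nothing is released on a partial set, even if the subset needed
        # for one particular outbound message is already present.
        return self.required.issubset(self.verified)

gate = OutboundGate({"order", "dispatch"})
print(gate.receive("order", {"id": "ORD123"}, True))      # False: set incomplete
print(gate.receive("dispatch", {"id": "ORD123"}, False))  # False: not verified
print(gate.receive("dispatch", {"id": "ORD123"}, True))   # True: complete and clean
```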

The interface 100 automatically generates outbound messages 124 for known recipients based on their known preferences. However, the interface 100 may generate messages in an alternative format and protocol for any user in the event that a receipt is not received for the first outbound message 124 or for any other requested reason. Once a receipt has been received for a complete set of outgoing messages 124, then the original incoming message may be removed from the local databases 112, 116 if desired.

In order to enable users to send and receive messages via the interface they must go through a registration process and provide a standard set of company name and contact details. They will then have to specify which messages, of which type, are sent to specific trading partners. Next they will be required to send a test message in their preferred format(s) and protocol(s). This is used as a calibration device to establish the standard for all future communications. If the receiving trading partner is also new to the service then they too must register for the service and send a sample message in the format they wish to receive. This is used to help create the translation 'map'. Once the sender has successfully sent a test message it is translated through the common data format into the recipient's format and forwarded to the recipient. Successful receipt and validation must be confirmed before the 'map' is declared operational and the sender can begin 'live' transmissions.

If a sender attempts to send a message to a new recipient trading partner who is not registered, the interface 100 sends an error notification to the sender. The interface 100 will additionally check to see if there is a registered recipient with similar details and may suggest to the sender that they intended to send the message to this recipient instead. Alternatively if the recipient details are correct but the recipient is not yet registered then the e-mail may also contain a hot link to a registration screen. The data required for registration typically comprises the following:

Field                Length    Type
Code                 10        Alphanumeric (A/N) (e.g. FOXTRANS)
Validation Code 1    4         A/N (e.g. 6950)
Validation Code 2    4         A/N (e.g. 1074)
Name/Description     40        A/N (e.g. Fox International Transport)
Address              40 x 5    A/N
Post/Zip Code        12        A/N
Telephone            20        A/N
Interface Type 1               Voice (code linked to standing data)
Interface Type 2               XML (code linked to standing data)
Interface Type ...             EDI etc.

When incoming messages 102 arrive they are validated against these details by the verification engine 114 and then passed to the outbound message generator 118 along with data related to the intended destination of the message.

The interface 100 may additionally be provided with means for storing and cross-referencing all messages passed through the interface 100. In this manner, if an order message and a subsequent dispatch message are sent via the interface they can be cross-referenced and verified. A further notice of collection or receipt will also be cross-referenced by the interface 100.

The data stored by the interface may be stored in any suitable database format such as csv and managed by any suitable system such as SQL. Other formats or systems may however be substituted if desired.

Messages may be stored in a message log database if desired, which records all messages sent via the interface. The contents of the message log may be exported to an external text file for analysis if desired. Additionally or alternatively reports of activity may be compiled and exported to suitable external software such as Microsoft Excel (registered trade mark).

An example of the use of this interface for communicating an update of the progress of a shipment being transported by a user is shown in figure 3. In particular this example relates to voice recognition over a telephone connection (cellular or landline) but other messages in other formats may alternatively be used.

At s300 the user must give an identification code. If the code is valid he progresses to s301; if the code is invalid he is prompted once more to give a valid identification code. At s301 the user is prompted to enter an order number. If this is valid the user progresses to s302, wherein the interface can identify a matching order. If this number is not valid the user is prompted to try again.

At s302 the user may then enter an update of the position of the order and depending on the choice of update may be asked for supplementary information at s303. For instance, if the order is late (s303c), the user is prompted to give an estimated time of arrival; if the order is delivered (s303b), the user is asked to confirm the condition of the delivery (clear/short/damaged/refused). If however the order has just been collected (s303a), a confirmation message to this effect is generated. The user then progresses to s304.

At s304 the user is asked whether they wish to update any further transactions. If they do they return to s301. If they do not, the connection is terminated.
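The flow of steps s300 to s304 may be sketched as a simple loop over the user's successive answers. The answer vocabulary ("late", "delivered", "collected", "yes") and the tuple recorded for each update are hypothetical; in the described embodiment the answers would come from voice recognition rather than from a list.

```python
def run_dialogue(answers, valid_codes, valid_orders):
    """Walk the s300-s304 flow over a sequence of user answers and
    return the list of recorded updates as (order, status, detail)."""
    answers = iter(answers)
    updates = []
    # s300: repeat until a valid identification code is given.
    while next(answers) not in valid_codes:
        pass
    while True:
        # s301: repeat until a valid order number is given.
        order = next(answers)
        while order not in valid_orders:
            order = next(answers)
        # s302: status update for the matched order.
        status = next(answers)
        # s303: supplementary information depends on the status.
        if status == "late":            # s303c: estimated time of arrival
            detail = next(answers)
        elif status == "delivered":     # s303b: clear/short/damaged/refused
            detail = next(answers)
        elif status == "collected":     # s303a: confirmation only
            detail = "confirmed"
        else:
            raise ValueError(f"unknown status: {status}")
        updates.append((order, status, detail))
        # s304: update another transaction, or terminate the connection.
        if next(answers) != "yes":
            return updates

calls = ["bad", "FOXTRANS", "X1", "ORD123", "late", "14:30", "yes",
         "ORD124", "delivered", "damaged", "no"]
print(run_dialogue(calls, {"FOXTRANS"}, {"ORD123", "ORD124"}))
```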

In a preferred embodiment, messages are translated into or out of a common format using Automated Message Profiling (AMP) to either identify the message as belonging to a known format and thus map the message to the common format or to determine that the message is not in a known format and generate a new map for translating the message into the common format. Automated Message Profiling (AMP) identifies messages in known formats by looking at their message IDs. If the message ID is unknown, a new map must be generated.

In order to generate an accurate new map a plurality of messages in the new format must be analysed in parallel. Typically a new format is detected because a new user wishes to use the interface and to do so the new user is requested to provide a sample of a plurality of messages for analysis.

When generating a new map, automated message profiling (AMP) operates by comparing a plurality of messages in the new format, said messages containing a wide variety of typical variations. By comparing like and unlike parts of the messages and statistically analysing the recurrence of such like and unlike parts of each message, the segment structure of the message format may be determined. Similarly by comparing like and unlike parts of each segment, and by statistically analysing such like and unlike parts of each segment, the structure of individual fields, elements or records within each identifiable segment may be determined.

If a sufficiently wide variation in messages is provided for analysis, then identification of the individual segments in a message format and the individual elements within each format can be done automatically and thus a map of a message format can be generated automatically. If the plurality of messages provided are insufficient to generate a complete map, then a user may access the data, and manually determine the segment structure within a message or the element structure within a segment. If the messages are unstructured, this method is of course ineffective.

Automated message profiling (AMP) is used where the format of a message is not recognised, in order to generate a map allowing the message to be mapped to the common format. Once the message structure has been determined, the structure can be compared to previous message formats that have been mapped by the interface. Where the message has a similar but non-identical structure to a previous message format, the map generated for the previous message format may be adapted to generate a map for the new message format. In this way the interface may become better at determining maps for new message formats over time. If the message has a completely unknown structure a completely new map must be generated. All messages which pass through the translation process will be held in a data repository at detail and summary level. Information from the repository will be used as part of the mapping process where actual data from sample messages can be used to search how and where data of this type and value have been used before. The analysis of the repository data is similar to data warehousing but with the subtle difference that nothing is known about the source data other than its value.

When generating a new data map, a plurality of typical messages are required. These messages are analysed to determine their syntax or structure, firstly in terms of their segment structure, secondly in terms of the element structure of each segment and thirdly in terms of any sub-element structure of each element. Automatic message profiling (AMP) may be used to determine the structure of any message in a defined, structured format including but not limited to: XML (including voice); EDI (EDIFACT, ANSI X.12, HL7, HIPAA); Fixed length, Flat File including NSF (National Standard Format); and SAP iDocs.

Typically automatic message profiling (AMP) will be implemented with an Application Programming Interface (API) such that the processing of messages can be monitored manually and if necessary, the map created may be manually adapted. The parameters used by and or displayed by such an API are set out in table 1 below.

The first step in analysis of a plurality of messages to determine their structure and hence to provide a map from their format to a common format is to identify the segment structure of the messages and the individual segments in each message. Typically, the first element of each segment contains a segment ID; these segment IDs may be used to identify the start and end positions of successive segments, particularly in a format wherein the segment length is fixed. Once the segments are identified, all the segments in all the messages are analysed to provide the following statistics:

- Message Count. Count the number of messages within the sample. The end of message can be identified by an empty segment or an end of message marker.

- Average Segments. Calculate the average number of segments per message.

- Standard Deviation of Segments. Calculate standard deviation for the number of segments across all messages. This will indicate the variation in the message set. The lower the number, the more standard the structure of the message.

- Primary Required Segments. Count and identify each segment type which occurs only once in every message.

- Secondary Required Segments. Count and identify each segment type which occurs at least once in every message and sometimes twice or more in some messages.

- Primary Optional Segments. Count and identify each segment type which occurs only once within a message but not in every message.

- Secondary Optional Segments. Count and identify each segment type which occurs once or more within a message but not in every message.

- Structure Count. Count and identify each message where the structure of the message is duplicated. If there were 60 messages in the sample and each structure was replicated six times the result would be ten counts with a value of six.
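The statistics listed above may be computed as in the following sketch, in which each message is represented simply as a list of segment IDs. The sample messages are hypothetical, and the use of the population standard deviation is an assumption, since the specification does not say which form is intended. A segment is treated as "secondary" if it repeats within at least one message.

```python
from collections import Counter
from statistics import mean, pstdev

def segment_statistics(messages):
    """messages: list of messages, each a list of segment IDs in order."""
    per_message = [Counter(m) for m in messages]
    lengths = [len(m) for m in messages]
    classes = {}
    for seg in sorted(set().union(*messages)):
        occurrences = [c[seg] for c in per_message]   # 0 where absent
        in_every = all(n >= 1 for n in occurrences)
        repeats = any(n > 1 for n in occurrences)
        classes[seg] = (("Secondary" if repeats else "Primary") + " "
                        + ("Required" if in_every else "Optional"))
    return {
        "message_count": len(messages),
        "average_segments": mean(lengths),
        "std_dev_segments": pstdev(lengths),
        "segment_classes": classes,
        # Number of messages sharing each identical segment structure.
        "structure_counts": Counter(tuple(m) for m in messages),
    }

sample = [
    ["R521", "R515", "R590", "R999"],
    ["R521", "R515", "R590", "R590", "R999"],
    ["R521", "R515", "R550", "R590", "R999"],
    ["R521", "R515", "R590", "R999"],
]
stats = segment_statistics(sample)
for name, value in stats.items():
    print(name, value)
```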

The next step is to analyse the above statistics to determine a definition of the message structure. This does not take account of message standards or syntax but does provide a view of the message structure. This process generates a segment list and a standard message as detailed below.

Segment List. All the segments used are listed in the correct sequence. This includes all segments that are optional. Each segment listed is listed alongside associated information from the statistical analysis above such as segment type (Primary Required), the number of occurrences and the number of occurrences as a percentage of all messages. This list indicates the largest message that could be created based on the sample data. A simple example of such a list is shown in figure 4. This shows that from the sample set of figure 4, the largest message that could be constructed consists of: R521 (always 4), R515 (always 1), R529 (always 3), R550 (always 1), R590 (always 1 and sometimes 2), R999 (always 1).

Standard Message. One message is selected from each message group that falls within each structure count determined above. The messages are listed in descending order starting with a message from the group with the highest number of segments conforming to a common structure, followed by a message from the group containing the next highest number, etc. Each message is listed alongside information obtained from the statistical analysis above such as the group count. A typical example of such a list is shown in figure 5. This provides an output which may be viewed by a user, if desired. The user may thus see information indicating the full range of variation within the sample set and providing statistical details indicating how often a structure occurred within the sample message set, the message count, how many other structures occurred and the related percentage. This information may be of use to the user, if any manual input into the mapping process is required. The user will thus be provided with an API having the parameters shown in table 2.
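The segment list and the standard message list described above might be derived as sketched below. The function names are hypothetical, and two simplifications are assumed: segments are ordered by first appearance in the sample (which will not always reproduce the correct sequence when optional segments are present), and the figures reported per segment are the largest repeat count in any one message and the percentage of messages using the segment.

```python
from collections import Counter


def segment_list(messages):
    """List every segment ID once, in first-seen order, with usage figures."""
    order = []
    for msg in messages:
        for seg in msg:
            if seg not in order:
                order.append(seg)
    rows = []
    for seg in order:
        max_repeat = max(msg.count(seg) for msg in messages)
        used_in = sum(1 for msg in messages if seg in msg)
        rows.append((seg, max_repeat, 100.0 * used_in / len(messages)))
    return rows


def standard_messages(messages):
    """One representative per distinct structure, most frequent structure first."""
    groups = Counter(tuple(msg) for msg in messages)
    total = len(messages)
    return [(list(structure), count, 100.0 * count / total)
            for structure, count in groups.most_common()]
```

Each row returned by `standard_messages` carries the representative structure, its group count and the related percentage, which is the information a user would need when reviewing the range of variation in the sample set.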

The next step in this method is to determine the structure of elements within each segment of a message. There are two types of element structure dealt with by AMP: variable length elements (as shown in the message of figure 6), where each field is terminated by a special character (in figure 6, '*') called a field delimiter or field separator and each element varies in size from message to message; and fixed length elements (as shown in the message of figure 7, in which segments are indented to illustrate structure and loops), where there are no field delimiters but each element starts and ends in a specific position within the segment and is in the same place and has the same length in all segments of the same type. In order to deal with elements of these different types, two different methods of identifying and isolating the elements within a segment are used.

In processing the elements within a segment, automated message profiling (AMP) provides an API having the parameters shown in table 3.

In the case of variable length elements, only one segment at a time is analysed throughout the sample message set before moving on to the next segment. Figure 8a shows a typical field delimited segment taken from an ANSI X.12 (214) Shipment Status message providing shipment details such as weight, quantity, etc.

The objective is to isolate each element ready for analysis. This segment is made up of eight elements, the first of which identifies the segment. When an element value is not used, as in the case of the sixth element, a field separator is still present so that element positioning is maintained. Figure 8b shows the segment parsed into individual elements. Once this has been done, the next step is to establish whether there are any sub-elements, using the same technique. In the example of figure 8, there are no sub-elements. The next step is to isolate the elements of the AT8 segments in the rest of the messages in the same way, to produce a table as shown in figure 8c. Each element can now be analysed to establish the fundamental definition of the AT8 segment within the message set.
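The isolation of variable length elements described above can be sketched as follows. The sample segment used in the usage note is invented for illustration and is not the content of figure 8; the function names and the optional sub-element separator parameter are likewise assumptions.

```python
def split_segment(segment, element_sep="*", sub_sep=None):
    """Split one delimited segment into elements, keeping empty positions."""
    elements = segment.split(element_sep)
    if sub_sep is None:
        return elements
    # Apply the same technique one level down to expose any sub-elements.
    return [e.split(sub_sep) if sub_sep in e else e for e in elements]


def element_table(messages, segment_id, element_sep="*"):
    """Collect every segment with the given ID, one row of elements per segment."""
    rows = []
    for message in messages:          # each message is a list of raw segments
        for segment in message:
            parts = split_segment(segment, element_sep)
            if parts[0] == segment_id:
                rows.append(parts)
    return rows
```

For a hypothetical segment such as `AT8*G*L*1200*10**E*40`, `split_segment` yields eight elements with an empty string in the sixth position, so element positioning is preserved. The rows returned by `element_table` can be transposed with `zip(*rows)` to obtain one column per element.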

In the case of fixed length elements, if no similar message structure has been analysed before, some manual input may be required to determine the message structure. As for variable length messages, the aim is to isolate each element within a segment. However, where a segment is highly populated, it can be difficult to separate elements that neighbour each other when there is little or no change from one message to another within the specific segment. Only one segment at a time is analysed throughout the sample message set before moving on to the next segment. An example of a segment with fixed length elements is shown in figure 9a.

Unlike field delimited formats, fixed length formats require analysis of each specific segment throughout the whole sample message set in order to isolate each element. The first step is to examine a specific segment across all messages within the sample message set. When a common start point is established for an element it will be stored as an element start point. This is repeated until all elements in the segment have been identified. Figure 9b shows an example of this in action, vertical lines indicating the start position of each element. The elements have been identified by the following criteria:

1. The first element start was established from the interface data which stated the start position for the segment ID.

2. The second element starts where the segment ID ends.

3. The third element starts where change has occurred prior to a static pattern.

4. The fourth and fifth elements start where white space has occurred in the same position across all sample messages for this segment.

There are eight elements in the example shown in figure 9 but there are not sufficient segments to provide a more complete analysis. To identify further elements, either more samples are required or a user must perform some manual analysis.
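A simplified sketch of the start point detection for fixed length elements is given below. It implements only criteria 1, 2 and 4 above (the segment ID position supplied as interface data, the end of the segment ID, and white space common to all samples); criterion 3, which relies on change preceding a static pattern, is not attempted. The function names and default parameters are hypothetical.

```python
def element_starts(segments, id_start=0, id_length=3):
    """Estimate element start positions from samples of one segment type."""
    width = max(len(s) for s in segments)
    padded = [s.ljust(width) for s in segments]
    # A column is blank only if every sample has a space in that position.
    blank = [all(s[i] == " " for s in padded) for i in range(width)]

    starts = {id_start, id_start + id_length}
    for i in range(id_start + id_length + 1, width):
        # White space shared by all samples, followed by data, marks a start.
        if blank[i - 1] and not blank[i]:
            starts.add(i)
    return sorted(starts)


def slice_elements(segment, starts):
    """Cut a segment at the given start positions and trim the padding."""
    bounds = list(starts) + [len(segment)]
    return [segment[a:b].strip() for a, b in zip(bounds, bounds[1:])]
```

With few samples, an embedded space that happens to fall in the same position in every sample will be reported as a start point, which is one reason why more samples or some manual analysis may be needed.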

Once all elements within a segment have been identified, the analysis conducted on the individual elements is the same whether they are fixed length or variable length. Firstly, the elements in each segment are listed in table form, then a series of analyses is conducted on the table as described below.

Static Values. Each column where the value is identical throughout is identified. Although these values may not be constants in the true sense, they can be treated as such. If this message were being defined for output, the static values can be set in place without being concerned with the actual mapping for those elements.

Usage Grading. The number of unique values that occur in a column is counted and then expressed as a percentage of the number of cells in the column.

This can be displayed to the user as various colour grades, as is shown in grey scale in figure 10. A particular example of this is to colour code red where static values exist, dark blue where one percent change occurs, mid-blue where five percent or less change occurs and light blue where ten percent or less change occurs. Columns where changes are greater than ten percent are viewed as volatile.

Mandatory/Optional. The columns are analysed to identify blank cells. If a blank cell is found in a column that has some non-blank values, then this element is optional. If no blank cell is found in a column then this element is assumed to be mandatory.

Date Definition. Each cell within each column is analysed for date conformity. Dates may be held in a number of ways and each column must be examined to discover any date structure. If a date is discovered, and the date is present in all cells of the column, then the element is classified as a date element.

Time Definition. Each cell within each column is analysed for time conformity. Times may be held in a number of ways and each column must be examined to discover any time structure. If a time is discovered, and the time is present in all cells of the column, then the element is classified as a time element. In certain formats, time can be an integral part of a date element, for example in a UN/EDIFACT DTM segment. If this is the case, the element will be defined using a function which applies the SEF structure for syntax.

Element Type. Where an element has not been classified as a date or time, each cell within the column is analysed to determine whether the element is alpha, alphanumeric or numeric.

Zero Fill. If an element is numeric, it is determined whether the element is required to be zero filled.

Minimum Size. Each cell within the column is analysed to define the element minimum size. The number of characters in the cell containing the shortest value is the minimum size.

Maximum Size. Each cell within the column is analysed to define the element maximum size. The number of characters in the cell containing the longest value is the maximum size.
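The column analyses described above could be combined into a single profiling routine, sketched here under several assumptions: the list of candidate date layouts is illustrative only, the colour grade thresholds are one interpretation of the percentages given, sizes are measured over non-blank cells, and zero fill is inferred heuristically from equal width numeric values with a leading zero. Time detection is omitted for brevity. The function and key names are hypothetical.

```python
from datetime import datetime

# Candidate date layouts are an assumption; real data would need many more.
DATE_FORMATS = ("%Y%m%d", "%Y-%m-%d", "%d/%m/%Y")


def detect_date_format(values):
    """Return the first format that parses every value, or None."""
    for fmt in DATE_FORMATS:
        try:
            for value in values:
                datetime.strptime(value, fmt)
        except ValueError:
            continue
        return fmt
    return None


def profile_column(cells):
    """Profile one element (a non-empty table column) across all samples."""
    filled = [c for c in cells if c.strip()]
    unique = set(cells)
    usage_pct = 100.0 * len(unique) / len(cells)

    if len(unique) == 1:
        grade = "static"
    elif usage_pct <= 1:
        grade = "dark blue"
    elif usage_pct <= 5:
        grade = "mid-blue"
    elif usage_pct <= 10:
        grade = "light blue"
    else:
        grade = "volatile"

    # A date must be present in every cell for the element to be a date.
    complete = bool(filled) and len(filled) == len(cells)
    date_format = detect_date_format(cells) if complete else None
    if not filled:
        element_type = "unknown"
    elif date_format:
        element_type = "date"
    elif all(v.isdigit() for v in filled):
        element_type = "numeric"
    elif all(v.isalpha() for v in filled):
        element_type = "alpha"
    else:
        element_type = "alphanumeric"

    sizes = [len(v) for v in filled]
    return {
        "static": len(unique) == 1,
        "usage_pct": usage_pct,
        "grade": grade,
        "mandatory": len(filled) == len(cells),
        "type": element_type,
        "date_format": date_format,
        # Heuristic: equal-width numeric values with at least one leading zero.
        "zero_fill": element_type == "numeric" and len(set(sizes)) == 1
                     and any(len(v) > 1 and v.startswith("0") for v in filled),
        "min_size": min(sizes) if sizes else 0,
        "max_size": max(sizes) if sizes else 0,
    }
```

One dictionary of this kind per element corresponds loosely to one row of the Cache table described in the text, although the actual parameters stored are those of table 4.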

When the above processing is complete, the details are stored and may additionally be passed to another application or to a user for verification. Providing the message details in this format allows a user to focus easily on the business process for which the message is being used rather than being concerned with redundant segments. The results of the processing are stored in a Cache table having a row corresponding to each element of the analysed segment. The parameters stored in the Cache table will typically be those shown in table 4.

Once all the segments, elements and sub-elements (if any) of a message format have been identified and defined, then a map for translating messages in the new format into a common format can be generated. The common format must contain sufficient fields to contain all the information in any input message broken down into its logical sub-structure. The common format may be set out in a table containing syntax and extensive descriptive/reference details. By grouping in this manner, a better view of the business process is provided which helps to indicate any further information that may be required in the mapping process. A simplified example of a common format group is shown in table 5. In the case that a new format containing new information is encountered, a new group may be added to the common format. In this way the common format will grow over time.

Table 5

Industry   Group     Sub Group          Description  Type  Size   Format
Logistics  Shipment  Shipper            Identifier   A/N   3      "SH"
Logistics  Shipment  Shipper            Name         A/N   30-50
Logistics  Shipment  Shipper            Address 1    A/N   30-50
Logistics  Shipment  Shipper            Address 2    A/N   30-50
Logistics  Shipment  Shipper            City         A/N   25
Logistics  Shipment  Shipper            Post Code    A/N   5-12   AANN NAA
Logistics  Shipment  Consignee          Identifier   A/N   3      "CN"
Logistics  Shipment  Consignee          Name         A/N   30-50
Logistics  Shipment  Consignee          Address 1    A/N   30-50
Logistics  Shipment  Consignee          Address 2    A/N   30-50
Logistics  Shipment  Consignee          City         A/N   25
Logistics  Shipment  Consignee          Post Code    A/N   5-12   AANN NAA
Logistics  Shipment  Account Reference  Identifier   A/N   2      AA
Logistics  Shipment  Account Reference  Code         A/N   30
Logistics  Shipment  Flight Reference   Identifier   A/N   2      AA
Logistics  Shipment  Flight Reference   Code         A/N   30

In order to generate a map from a new format to the common format, the segment structure and the Cache table for each element are required. The parameters shown in table 6 may also be of relevance.

A Cache knowledge database will be the main source of information for establishing mapping to the common format. Initially the mapping process may require manual intervention to complete the mapping but as more messages in new formats are processed, the knowledge database will contain more information eventually allowing full automation to occur. In order to access information from many different directions, the Cache database will incorporate OLAP (online analytical processing) systems. The design of the database is specific and targets fewer users to allow much larger amounts of data to be recovered. For instance, the knowledgebase structures put related data into physical proximity so that it can be accessed in the minimum number of reads and messages are stored in specific Cubes to allow any element to be summarised or specific values to be searched in order to obtain the message container.

Three different processes for map generation are now described below: Declarative Mapping, Intelligent Syntax Generation and Automatic Mapping.

Declarative Mapping. This is used in circumstances wherein a new message format is described but additional information is required at message level in order to align the message with an associated industry and function. When the new format has been defined both physically and logically (assigned its place in the business process) it can then be simply mapped to the common format.

This process compares, at the syntax level, each element in the new format with the elements in the common format. This results in the generation of an initial list, for each element within the new message, of those elements within the common format that match or fall within the range limits of the syntax of the particular element of the new format. A list is also generated of those elements where a match cannot be found. A mapping tool can then be used to refine and correct the mapping to the common format. In the example shown in figure 11, an ANSI(214) L11 segment has three elements, two of which are automatically mapped to elements in the common format on the right.

This process is known as declarative mapping because matching is performed using the declared syntax details and extensive reference details and the actual sample user messages are not examined. Using this method, mapping is semi-automated and if a message has a simple structure and limited coding, this semi-automatic mapping will yield relatively high results, with up to 75% accuracy.
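The syntax level comparison used in declarative mapping might be sketched as follows. The `ElementSyntax` record, the dotted element names and the matching rule (same type, and the new element's maximum size no greater than that of the common format element) are assumptions made for illustration; the text states only that candidates match or fall within the range limits of the declared syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementSyntax:
    name: str
    etype: str      # declared type, e.g. "A/N" or "N"
    min_size: int
    max_size: int


def declarative_map(new_elements, common_elements):
    """Return ({new name: [candidate common names]}, [unmatched new names])."""
    matches, unmatched = {}, []
    for new in new_elements:
        found = [c.name for c in common_elements
                 if c.etype == new.etype and new.max_size <= c.max_size]
        if found:
            matches[new.name] = found
        else:
            unmatched.append(new.name)
    return matches, unmatched
```

The two results correspond to the initial list of candidate matches and the list of elements for which no match can be found, both of which would then be refined with a mapping tool.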

Intelligent Syntax Generation. When sufficient historic user and common format messages have been created, intelligent syntax generation can be used as the first step in mapping a new format. In this process, sample messages of the new format are analysed and compared with the repository of historic messages of varying structure. Where the new format is similar to standard formats or previously mapped formats, creating the syntax for the new format will be relatively easy. Typically in such cases the variations between client standard formats relate only to the profile of the message for that client. For instance, one client may only use 20 segments per message but another client may use 100 segments. In cases where a similar message has been mapped before, the mapping can be created automatically and simply verified. If, however, the message includes new segments, then the message must be fully analysed.

Automatic Mapping. This process involves cross-referencing with the knowledge database to determine if the message in its entirety, a single segment or a single element matching those in the new format can be found in the database. If a match is found, then the same mapping may be used. This process increases in efficiency as more message format maps are stored in the database. There are three levels at which cross-referencing with the knowledge database is required: message level, segment level and element level, as described below.

Message Level. At this level, the Message ID (e.g. ANSI/X.12/214/004010) can be used to check for existing messages of the same type, business function and industry as an existing format for another customer. Where a match is found, the mapping references on the matched format are duplicated for the new format.

Segment Level. At this level the Message ID (e.g. ANSI/X.12/Segment) can be used to check for existing segments of the same type, business function and industry as an existing format for another customer. The match may be from a different format to the new format but still within the same industry. Where a match is found, the mapping references on the matched format are duplicated for the new format.

Element Level. At this level, the sample element may be matched to an element stored in the database to obtain a conclusive map to the common format. As described above, the Cache database is organised for fast cross-referencing. The sample containing the element to be matched is organised into rows and columns, each column representing one element of a segment. The column is then sorted so as to display the most common value for the element. The most common value is then used to perform a saturation search for the value within the database. The result of the search may yield many mapping suggestions which are sorted into order of most suggested to least suggested. This process is repeated for each element within the new format until every element has a result, either positive (mapping suggestions) or negative (no suggestions). A pictorial example of this process is shown in figure 12.

In the example of figure 12, a new ANSI(214) B10 segment has an element where a match has been found within the knowledge database. By referring to the mapping reference for the matched EDIFACT data, it is suggested that the third element of the ANSI(214) B10 is a carrier code and its mapping reference is 'Logistics.ShipmentStatus.Carrier'.
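The element level search described above can be illustrated with the following sketch. The knowledge database is stood in for by a plain list of (value, mapping reference) pairs, which is an assumption of this example and not the structure of the Cache database; the function name is hypothetical.

```python
from collections import Counter


def suggest_mappings(column, knowledge):
    """Rank mapping references for the most common value in one element column.

    column    -- values of a single element across all sample segments
    knowledge -- iterable of (value, mapping_reference) pairs seen previously
    """
    filled = [v for v in column if v]
    if not filled:
        return []
    most_common_value, _ = Counter(filled).most_common(1)[0]
    hits = Counter(ref for value, ref in knowledge if value == most_common_value)
    # Most suggested first; an empty list is a negative result.
    return [ref for ref, _ in hits.most_common()]
```

Repeating the call for each column of each segment gives every element either a ranked list of suggestions or an empty result, mirroring the positive and negative outcomes described above.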

The results of the mapping process are stored in a transient table for each specific message. The transient table is provided with a row for each element of each segment of the record. The parameters stored for each element are shown in table 7.

Information at summary and detail level on the message and the individual segments and elements can be selected and analysed from the Cache database. This functionality can be achieved by using integrated SQL (structured query language) capabilities. This provides a user with the vital ability to search and summarise any data at any level. The typical parameters for searching are set out in table 8.

The first step of the searching process is for the user to select a specific message format, by format ID (e.g. ANSI/X.12/214/004010), by customer or by any other searchable parameter. A list of historic messages is then generated with status information such as date, time, etc. The user can scroll up/down, left/right and select a specific message to examine. This process is illustrated by figures 13a and 13b.

The messages are displayed with their original statistical data. The user may request a segment list or may select any segment from the segments shown and request a detailed analysis of the specific message. For example, if the user selected line 15 in figure 13a (the AT7 segment), all AT7 segments from the sample data would be displayed as shown in figure 13b. The order of display can then be sorted by any element within the segment. This allows these tasks to be performed automatically rather than by a user copying the sample data into a spreadsheet. The results of any user analysis can be provided in a transient table detailing a minimum of a single segment. In such a table there will typically be provided a row for each element of each segment of the message. The parameters provided are shown in table 9.

The interface may also be adapted to undertake variation processing/error correction of messages in addition to translating messages. This type of processing can be viewed as identifying and correcting an acceptable error. The process works by identifying a variation from standard and, if it is one of a particular set of known variations, replacing it with a desired element. Variations can only be processed for formats that have fully defined maps.

The typical situations wherein variations may occur are:

New segment. The customer uses a segment that was not within the sample data set.

New element. The customer has implemented a minor format change.

Missing element or segment. The customer has implemented a minor format change.

Element or segment Transposition. The customer has implemented a minor format change.

If the messages in this format are not flagged by the system to be handled using variation processing, any messages with variation will be treated as errors. If the format is however flagged for variation processing the message will undergo variation processing to identify and if desired correct the variation. For variation processing to function, the interface is required to have the message format defined in terms of at least the parameters listed in table 10.

Return value Out 0 = success, 1 = fail

The processing of variations may typically take the following form in the following cases.

New Segment. The new segment can be automatically mapped as described previously. Alternatively, a suggested mapping can be generated which may be confirmed by a user. In either case, if no suggested mapping can be generated, further sample messages can be requested for additional processing.

New Element. The new element is automatically detected based on syntax and contents. From analysis of previous messages which contain similar values, except for the variation, the segment and the details of the new element can either be mapped automatically or alternatively a suggested mapping can be generated which may be confirmed by a user. In either case, if no suggested mapping can be generated, further sample messages can be requested for additional processing.

Missing Element. The missing element is automatically detected based on syntax and contents. From analysis of previous messages which contain similar values, except for the variation, the segment and the details of the missing element can either be mapped automatically or alternatively a suggested mapping can be generated which may be confirmed by a user. In either case, if no suggested mapping can be generated, further sample messages can be requested for additional processing. A similar process may be used in the case of a missing segment.

Transposed Elements. The transposed elements are automatically detected based on syntax and contents. From analysis of previous messages which contain similar values, except for the variation, the segment and the details of the transposed elements can either be mapped automatically or alternatively a suggested mapping can be generated which may be confirmed by a user. In either case, if no suggested mapping can be generated, further sample messages can be requested for additional processing. A similar process may be used in the case of transposed segments.

The results of the above processes are provided in a transient table representing the specific message with details of the variation (not shown).
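At segment level, detection of the variations described above might look like the following sketch. It assumes a flat message definition given as an ordered list of segment IDs plus a set of required IDs, and it does not handle looping structures, in which repeated groups of segments would otherwise be reported as transpositions. The function name and the result labels are hypothetical.

```python
def detect_variations(defined_order, required, message):
    """Compare a message's segment IDs with its defined format.

    defined_order -- segment IDs in the order the format defines them
    required      -- set of segment IDs that must appear in every message
    message       -- segment IDs of the received message, in order
    """
    known = set(defined_order)
    present = set(message)
    variations = []

    for seg in dict.fromkeys(message):            # preserve order, drop repeats
        if seg not in known:
            variations.append(("new segment", seg))

    for seg in defined_order:
        if seg in required and seg not in present:
            variations.append(("missing segment", seg))

    # Known segments should appear in non-decreasing definition order.
    rank = {seg: i for i, seg in enumerate(defined_order)}
    ranks = [rank[seg] for seg in message if seg in known]
    if ranks != sorted(ranks):
        variations.append(("transposition", None))

    return variations
```

An empty result means the message conforms to the defined format; any other result would be passed to the variation processing described above, or treated as an error if the format is not flagged for variation processing.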

In addition to variations in the message format being detected, errors or variations in the message data may be detected and corrected or cleansed during the translation process. In such cases, there are different levels of priority depending on the type of data to be cleansed. Textual data such as name, address, product description, etc., can easily be cleansed due to the low business impact (e.g. changing 'Jhon' to 'John'). Cleansing quantity or value data requires more rigorous control, due to the potential impact (e.g. changing IO to 10 when the actual value was 1). As with variation handling, message cleansing will only work on fully defined formats with sample data for the original format.

In order to provide this function, at least the parameters in table 11 must be provided.

Because data cleansing implies validation and correction, the implication is that when a bad message arrives it should be cleansed and translated in situ. This could have enormous performance implications and it is for this reason that the Cache database is organised to provide summary data which will enable cleansing to occur on a minimum database read basis. When a message is selected for cleansing, before it becomes operational, the summary data ready for cleansing is generated.

The method of cleansing varies depending on the type of data to be cleansed. A numeric element may for instance require a range check whereas an alphanumeric element may require pattern matching. Examples of the types of data error that may be cleansed include: missing data; partial data; out of range; transposition; table lookup; invalid data.

A number of parameters must be set for all elements where cleansing is required such as: replace missing data (yes/no); incomplete data (yes/no); range checking (yes/no); range start; range end; transposition (yes/no); table lookup; validate (yes/no). The particular processing of errors will be dependent on the settings of the above parameters.

Replace missing data. This can only be corrected within limited boundaries such as where an element is static, that is, the element always appears in a message of this type or where the element can be calculated or can be obtained from the values of other elements.

Transposition and Incomplete data. This requires the characters of the value within the element to be sorted ("fox jumped over the moon" = "deeefhjmmnooooprtuvx") and compared with the values within the database. If the value was used before, sorting it in this way will find transposition errors ("fox jumped the moon over"). This method can also correct incomplete data provided only a small percentage of data is missing. This is achieved by using a percentage pattern match where the sorted fields are sorted and resorted in descending order to find a start or finish pattern and then allowing a 5% variation in match. This percentage could be increased if desired.

Range Checking. This is effectively validation; however, if range checking and not validation is selected, calculations can be performed to reconstitute the value from other values where possible.

Table lookup. This is also effectively validation using a table lookup.
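The sorted value comparison described above under Transposition and Incomplete data can be sketched as follows. The signature is formed by removing spaces, lower-casing and sorting the characters. For incomplete data, this sketch substitutes a simple character count difference for the start or finish pattern search described in the text, with the 5% tolerance applied to that difference; the substitution and the function names are assumptions.

```python
from collections import Counter


def signature(value):
    """Sort the characters of a value, ignoring spaces and case."""
    return "".join(sorted(value.replace(" ", "").lower()))


def find_known_value(value, known_values, tolerance=0.05):
    """Return the first known value whose signature matches within tolerance."""
    sig = Counter(signature(value))
    for known in known_values:
        known_sig = Counter(signature(known))
        if not known_sig:
            continue
        # Characters missing from, or extra to, the received value.
        diff = sum((known_sig - sig).values()) + sum((sig - known_sig).values())
        if diff / sum(known_sig.values()) <= tolerance:
            return known
    return None
```

Because both word orders of the example sentence share the same signature, a transposed value is matched to the stored value exactly, while a value with one character missing out of twenty falls just within the 5% tolerance.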

The results of this process are provided in a transient table representing the specific message and an indicator to show success or fail. If successful, the message translation process can be completed. If failed, the message will be flagged for error analysis.

It should of course be understood that any techniques for generating maps for translating message formats from a new format to the common format described herein can also be applied to generating maps to translate messages in the common format into a new format. Also any of the variation detection or error cleansing methods described above in relation to translation of messages into the common format may also be applied to translation of messages out of the common format.

It is of course to be understood that the invention is not to be restricted to the details of the above embodiment which is described by way of example only.

Claims

1. A method for interfacing between an input data set in a first format and an output data set in a second format comprising the steps of: analysing the syntax of the input data set; separating the individual data fields in the input data set; extracting the data in each individual field; arranging the extracted data in accordance with the second format and thereby generating an output data set.

2. A method as claimed in claim 1 wherein, the ontology for analysing the message syntax, separating the individual data fields and thus extracting the data in each individual field is generated by Intelligent Agent/Knowledge Management (IA/KM) technology.

3. A method as claimed in claim 1 wherein, the ontology for analysing the message syntax, separating the individual data fields and thus extracting data in each individual field is Automated Message Profiling (AMP).

4. A method as claimed in any one of claims 1 to 3 wherein, the method is used to translate messages in EDI, XML, ASCII, SMS, Voice, VML and Fax formats.

5. A method as claimed in any one of claims 1 to 4 wherein, the method is used to translate messages in ANSI 12, HL7, SPEC 2000, UN/EDIFACT, EDIBUILD, EDICON, EMEDI, ASCI X12 and UN/CEFACT formats.

6. A method as claimed in any one of claims 1 to 5 wherein, the method is adapted to analyse and translate input data sets in any desired format.

7. A method as claimed in any one of claims 1 to 6 wherein, the method is additionally adapted to export an output data set in any desired format.

8. A method as claimed in any one of claims 1 to 7 wherein, in order to translate from an input data set in a first format to an output data set in a second format, the interfacing method may include the steps of translating the input data in the first format into a common format and then translating the data in the common format into the second format.

9. A method as claimed in any preceding claim wherein, the method includes the steps of receiving an inbound message in any format and translating the inbound message into an output message in a desired format.

10. A method as claimed in claim 9 wherein, all inbound messages are translated into a common format.

11. A method as claimed in claim 10 wherein, all output messages are translated from the common format into the desired output message format.

12. A method as claimed in claim 9 or claim 10 wherein, the method includes the step of storing copies of inbound messages either in their original format or in the common format or in both formats.

13. A method as claimed in any one of claims 9 to 12 wherein, the method of interfacing additionally includes the step of verifying that the content of the input data set or message conforms to standard or expected criteria.

14. A method as claimed in claim 13 wherein, the method of interfacing additionally includes the step that if the message does not conform to the criteria an error message is sent to the sender and or another authority.

15. A method as claimed in claim 14 wherein, the error message requests that the sender or the authority check and or resend the message.

16. A method as claimed in claim 15 wherein, the error message additionally includes information relating to why the message was rejected.

17. A method as claimed in any one of claims 9 to 16 wherein, the method includes the step of if a message fails to conform to the criteria, correcting minor errors and if successful notifying the sender and or some other authority of the correction.

18. A method as claimed in claim 17 wherein, acceptance of the correction from the sender or other authority must be received before forwarding the corrected message to the desired recipients.

19. A method as claimed in any one of claims 9 to 18 wherein, a new sender of messages is required to take part in an automated registration process.

20. A method as claimed in claim 19 wherein, the new sender must provide a standard set of details for the interface and then send a message to the interface in their preferred format.

21. A method as claimed in claim 20 wherein, the message is stored as a calibration standard to aid future translation of messages from the sender.

22. A method as claimed in any one of claims 9 to 21 wherein, the output message is forwarded to a desired recipient.

23. A method as claimed in claim 22 wherein, there are one or more desired recipients.

24. A method as claimed in claim 23 wherein, the message is translated into different formats for each individual recipient.

25. A method as claimed in claim 23 or claim 24 wherein, messages received from one sender are always sent to one or more particular recipients.

26. A method as claimed in claim 23 or claim 24 wherein, messages received from one sender are only sent to recipients specified in the message.

27. A method as claimed in claim 25 or claim 26 wherein, messages are sent to recipients only in a particular specified format or are sent in one of a variety of specified formats depending on the data in the message.

28. A method as claimed in any one of claims 9 to 27 wherein, the method further includes the step of sending messages to the sender confirming translation or receipt of the sender's initial message.

29. A method as claimed in any one of claims 9 to 28 wherein, records are generated and stored relating to each message received from a sender and forwarded to a recipient.

30. A method as claimed in claim 29 wherein, the records include some or all of the following information: the format of the inbound message and/or the outbound message, the time and date of receipt of a message, a log of any errors identified in the message, identification of the sender, identification of the recipient or recipients.

31. A method as claimed in claim 1 wherein, the method is used to merge databases in two different formats into a single database either in the same format as one of the original databases or in a third format.

32. A method as claimed in claim 31 wherein, the method is used to translate user queries into the correct format for a database.

33. An interface operating in accordance with the method of any one of claims 1 to 32 having means for receiving an input data set, means for processing the inbound data set in order to generate an output data set and means for exporting the output data set.

34. An interface as claimed in claim 33 wherein, the interface is adapted to provide a messaging interface for a group of businesses.

35. An interface as claimed in claim 34 wherein, the interface is used to form a messaging interface for the logistics industry.

36. An interface as claimed in any one of claims 33 to 35 wherein, the interface is embodied by a plurality of securely connected network service modules.

37. An interface as claimed in claim 36 wherein, each network service module is a peer-to-peer service module comprising a hub linking two communications servers, a database server and a processing server.

38. An interface as claimed in claim 33 wherein, the interface is positioned between a user and the two databases in order that user queries can be translated by each database.

39. A method of determining the structure of messages in a particular format by analysing a plurality of such messages comprising the steps of: identifying individual segments within each message to thereby determine the segment structure of the particular message format; identifying individual elements within each segment, thereby determining the element structure of each segment; and analysing corresponding individual elements in each segment over all the analysed messages to determine the structure of each element in the particular data format.

40. A method as claimed in claim 39 wherein, the segment ID is used to identify the start and end positions of each segment.

41. A method as claimed in claim 39 or claim 40 wherein, the identified segments are counted so that the mean number of segments per message and the standard deviation in the number of segments per message can be determined.

42. A method as claimed in any one of claims 39 to 41 wherein, the segment structure of each individual message is analysed to determine which segments appear once or more than once in every message and which segments appear once or more than once in only some messages.

43. A method as claimed in any one of claims 39 to 42 wherein, the segment structure of each individual message is analysed to determine the number of messages with identical segment structures.

44. A method as claimed in any one of claims 39 to 43 wherein, a user is provided with a list of all segments used in each message in the plurality of messages.

45. A method as claimed in claim 44 wherein, said list also contains statistical information relating to the number of times that the segment has occurred over all messages.

46. A method as claimed in claim 44 or claim 45 wherein, said list also contains statistical information relating to the number of occurrences of a segment as a percentage of the number of messages.
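By way of illustration only, the segment statistics of claims 41 to 46 can be sketched as follows. The sketch is not the claimed implementation: the function name `segment_statistics`, the apostrophe segment terminator, the three-character segment ID and the use of the population standard deviation are all assumptions made for the example.

```python
from collections import Counter
from statistics import mean, pstdev


def segment_statistics(messages, segment_terminator="'", id_length=3):
    """Summarise the segment structure of a list of raw message strings.

    Assumptions (not taken from the claims): segments end with
    `segment_terminator`, and the first `id_length` characters of a
    segment are its segment ID.
    """
    # Reduce every message to the ordered list of its segment IDs.
    per_message = []
    for raw in messages:
        segments = [s.strip() for s in raw.split(segment_terminator) if s.strip()]
        per_message.append([s[:id_length] for s in segments])

    n = len(per_message)
    counts = [len(ids) for ids in per_message]

    # Total occurrences of each segment ID over all messages.
    occurrences = Counter(seg for ids in per_message for seg in ids)
    # Number of messages in which each segment ID appears at least once.
    containing = Counter(seg for ids in per_message for seg in set(ids))

    listing = {
        seg: {
            "occurrences": occurrences[seg],
            # Occurrences as a percentage of the number of messages.
            "percent_of_messages": 100.0 * occurrences[seg] / n,
            "in_every_message": containing[seg] == n,
        }
        for seg in occurrences
    }

    return {
        "mean_segments": mean(counts),
        # Population standard deviation; a sample deviation could be used instead.
        "std_dev_segments": pstdev(counts),
        "segments": listing,
        # How many messages share each exact segment structure.
        "structures": Counter(tuple(ids) for ids in per_message),
    }
```

Because the percentage is computed from total occurrences rather than from the number of messages containing the segment, a repeating segment can exceed 100 per cent.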

47. A method as claimed in any one of claims 39 to 46 wherein, the method is applied to identifying elements in segments where the elements are delimited.

48. A method as claimed in claim 47 wherein, the element delimiters or separators are used to identify the start and end positions of individual elements.

49. A method as claimed in claim 48 wherein, only one segment at a time is analysed.

50. A method as claimed in any one of claims 39 to 46 wherein, the method is applied to identifying elements in segments where the elements have a fixed length.

51. A method as claimed in claim 50 wherein, each like segment from the plurality of messages is analysed at the same time to identify the individual elements.

52. A method as claimed in any one of claims 47 to 51 wherein, the data in each like segment is entered into a table or similar such that corresponding elements in each segment are entered into the same column of the table.

53. A method as claimed in claim 52 wherein, the elements are then analysed to determine the amount of element variation between segments.

54. A method as claimed in claim 53 wherein, said elements are classified into a plurality of different classes by the amount of variation.

55. A method as claimed in claim 54 wherein, variation is defined as the number of unique values that occur in a column as a percentage of the number of cells in the column.

56. A method as claimed in any one of claims 52 to 55 wherein, the elements in the table are analysed to determine whether they: are mandatory or optional; include date or time information; are alpha, alphanumeric or numeric; require a zero fill, if numeric; have a maximum or minimum length.
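By way of illustration only, the tabulation and element analysis of claims 52 to 56 for delimited segments can be sketched as follows. The function name `analyse_columns`, the `+` delimiter, the treatment of an empty cell as an absent (optional) element, and the zero-fill heuristic are assumptions made for the example; date and time detection is omitted.

```python
def analyse_columns(segments, delimiter="+"):
    """Tabulate like segments and report per-column element properties.

    Assumptions (not taken from the claims): elements are separated by
    `delimiter`, and an empty cell means the element was not supplied.
    """
    rows = [s.split(delimiter) for s in segments]
    width = max(len(r) for r in rows)
    # Pad short rows so corresponding elements share a column.
    table = [r + [""] * (width - len(r)) for r in rows]

    report = []
    for col in range(width):
        cells = [row[col] for row in table]
        present = [c for c in cells if c != ""]

        if present and all(c.isdigit() for c in present):
            kind = "numeric"
        elif present and all(c.isalpha() for c in present):
            kind = "alpha"
        else:
            # Anything else is treated as alphanumeric in this sketch.
            kind = "alphanumeric"

        report.append({
            # Unique values as a percentage of the cells in the column;
            # an empty cell counts as a value here.
            "variation_percent": 100.0 * len(set(cells)) / len(cells),
            "mandatory": len(present) == len(cells),
            "kind": kind,
            "min_length": min((len(c) for c in present), default=0),
            "max_length": max((len(c) for c in present), default=0),
            # Heuristic: a numeric column with leading zeros suggests zero fill.
            "zero_filled": kind == "numeric"
            and any(len(c) > 1 and c.startswith("0") for c in present),
        })
    return report
```

Columns with low variation are candidates for identifiers or qualifiers, while columns approaching 100 per cent variation are more likely to carry free data such as reference numbers.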

57. A method as claimed in claim 56 wherein, the method includes the step of providing a suggested map between the determined structure of the message and a common data format.

58. A method as claimed in claim 57 wherein, the common data format contains syntax and extensive descriptive/reference details.

59. A method as claimed in claim 57 or claim 58 wherein, the common data format may be extended to cope with new data segments, elements or formats.

60. A method as claimed in any one of claims 57 to 59 wherein, the suggested map is generated by comparing each element in the particular format with each element in the common format.

61. A method as claimed in claim 60 wherein, this provides a list of elements in the particular format that match elements in the common format and elements in the particular format that do not match elements in the common format.

62. A method as claimed in claim 61 wherein, the elements in the particular format that do not match elements in the common format are matched to existing elements in the common format manually or are matched to new elements created for the common format.

63. A method as claimed in any one of claims 57 to 59 wherein, intelligent syntax generation is used to compare a particular format against messages of other formats that have previously been mapped, and a map is created by adapting a map generated for a previous format.

64. A method as claimed in any one of claims 57 to 59 wherein, messages are mapped automatically by analysing their overall ID, their segment ID and their element ID to find matches in a database of previous messages in previous formats.

65. A method as claimed in claim 64 wherein, the method includes the step of detecting variations in known message formats.

66. A method as claimed in claim 65 wherein, the detectable variations include: the addition of segments, elements or sub-elements; the omission of segments, elements or sub-elements; and the transposition of segments, elements or sub-elements.

67. A method as claimed in claim 65 or claim 66 wherein, the method includes the step of cleansing the message.

68. A method as claimed in claim 67 wherein, cleansing the message is achieved by: replacing data stored in an element; adding segments, elements or sub-elements; deleting segments, elements or sub-elements; or transposing segments, elements or sub-elements.

69. A method as claimed in any one of claims 39 to 68 implemented in conjunction with the method of claims 1 to 32 and/or the interface of claims 33 to 38.