Qyan's Brave New World

Semantic Integration Research in the Database Community: A Brief Survey

This survey first presents the applications that require semantic integration and the difficulties that arise during the integration process. It then focuses on schema matching and data matching, which play a central role in semantic integration. Finally, it discusses some other open issues.

Applications mentioned in the survey include:

Schema integration: merging a set of given schemas into a single global schema. The basic procedure is match, then merge.
Data translation: translating data between multiple databases. Data coming from multiple sources must be transformed into data that conforms to a single target schema.
Data integration systems: provide the user with a uniform query interface (a mediated schema) to a multitude of data sources. This is very inspiring, because our work on KADIS belongs to this category.
Peer data management (P2P): allow peers to query and retrieve data directly from each other, without building a mediated schema.
Model management: create tools for easily manipulating models of data.
Many factors increase the need for the applications above, such as the rapid development of the Internet, the widespread adoption of XML as a standard syntax for sharing data, and the growth of the Semantic Web.

However, semantic integration is an extremely difficult problem, because of the challenges below:

It is hard to infer the semantics of the involved elements unless the creators of the data, or experts, are available.
The clues used for matching schema elements are often unreliable, and sometimes incomplete.
The global nature of matching makes it costly to find the best-matching elements across sources, since every other candidate has to be ruled out first.
Matching is often subjective and depends on the application: the criteria for a match may change from one application to another.

Next, the survey discusses the progress in schema matching and data matching in turn. Three aspects are covered:

Matching techniques:

Rule-based solutions: hand-crafted rules to match schemas (either domain-independent or domain-specific)

Benefits: inexpensive; fast, because they typically operate only on schemas; work very well in certain types of applications; capture valuable user knowledge about the domain.
Drawbacks: cannot exploit data instances effectively; cannot exploit previous matching efforts to assist in the current matching.

Learning-based solutions: techniques such as neural networks and Naive Bayes are used in this category; external evidence, such as past matches, is exploited, along with domain-specific schemas and matches, and even input from the users.
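To make the learning-based idea concrete for myself, here is a minimal sketch of a Naive Bayes matcher that learns from past matches and then guesses which mediated-schema element a new source column corresponds to. Everything in it is my own assumption for illustration: the class name `NaiveBayesMatcher`, the tiny training set and the element labels `address` and `description` are hypothetical and do not come from the survey.

```python
import math
from collections import Counter, defaultdict


def tokenize(value):
    """Split a data value into lowercase word tokens."""
    return value.lower().replace(",", " ").split()


class NaiveBayesMatcher:
    """Multinomial Naive Bayes over tokens of data instances (Laplace smoothing)."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token frequencies
        self.label_counts = Counter()             # label -> number of training values
        self.vocabulary = set()

    def train(self, labeled_values):
        """labeled_values: iterable of (data value, mediated-schema element)."""
        for value, label in labeled_values:
            self.label_counts[label] += 1
            for token in tokenize(value):
                self.token_counts[label][token] += 1
                self.vocabulary.add(token)

    def log_score(self, value, label):
        """Log of prior times smoothed token likelihoods for one data value."""
        total_values = sum(self.label_counts.values())
        score = math.log(self.label_counts[label] / total_values)
        label_total = sum(self.token_counts[label].values())
        for token in tokenize(value):
            count = self.token_counts[label][token]
            score += math.log((count + 1) / (label_total + len(self.vocabulary)))
        return score

    def predict_column(self, values):
        """Pick the label with the highest summed log score over a column's values."""
        return max(
            self.label_counts,
            key=lambda label: sum(self.log_score(v, label) for v in values),
        )


# Hypothetical past matches: data values already mapped to mediated-schema elements.
past_matches = [
    ("Seattle WA", "address"),
    ("Madison WI", "address"),
    ("Miami FL", "address"),
    ("fantastic house great view", "description"),
    ("great location nice yard", "description"),
    ("beautiful house nice view", "description"),
]

matcher = NaiveBayesMatcher()
matcher.train(past_matches)

# Two columns from a new, unseen source.
comments_column = matcher.predict_column(["nice house", "great yard"])
location_column = matcher.predict_column(["Madison WI", "Seattle WA"])
print(comments_column, location_column)
```

Unlike a hand-written rule, this matcher looks at the data instances and reuses earlier matching effort, which are exactly the two things the rule-based solutions cannot do.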

Architecture of matching solutions: a module-based multi-matcher architecture, in which each module exploits one type of information well to predict matches.

On the other hand, domain-specific information is treated as global information and handled after the matching phase.
There is also related work in knowledge-intensive domains.
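A rough sketch of how I picture the multi-matcher architecture: two independent modules each score a candidate pair, a combiner averages their scores, and a global constraint is applied only after the matching phase. The two modules, the equal weights and the greedy one-to-one constraint are my own assumptions for illustration, not something prescribed by the survey; the schema elements are made up.

```python
from difflib import SequenceMatcher


def name_matcher(source, target):
    """Rule-style module: string similarity of the element names."""
    return SequenceMatcher(None, source["name"].lower(), target["name"].lower()).ratio()


def numeric_fraction(values):
    """Share of sample values that consist only of digits."""
    return sum(v.isdigit() for v in values) / len(values) if values else 0.0


def instance_matcher(source, target):
    """Instance-based module: compare how numeric the sample values are."""
    return 1.0 - abs(numeric_fraction(source["values"]) - numeric_fraction(target["values"]))


def combine(source, target, matchers):
    """Combiner: weighted average of the module scores."""
    total_weight = sum(weight for _, weight in matchers)
    return sum(weight * m(source, target) for m, weight in matchers) / total_weight


def match_schemas(sources, targets, matchers):
    """Score all pairs, then enforce a one-to-one constraint greedily."""
    scored = sorted(
        ((combine(s, t, matchers), s["name"], t["name"]) for s in sources for t in targets),
        reverse=True,
    )
    used_sources, used_targets, result = set(), set(), {}
    for _, s_name, t_name in scored:
        # Global constraint, applied after scoring: each element is used at most once.
        if s_name not in used_sources and t_name not in used_targets:
            result[s_name] = t_name
            used_sources.add(s_name)
            used_targets.add(t_name)
    return result


# Hypothetical schema elements with a few sample values each.
source_schema = [
    {"name": "phone", "values": ["5551234", "5559876"]},
    {"name": "contact_name", "values": ["Ann Lee", "Bob Ray"]},
]
target_schema = [
    {"name": "telephone", "values": ["5550000"]},
    {"name": "name", "values": ["Carol Wu"]},
]

matchers = [(name_matcher, 0.5), (instance_matcher, 0.5)]
mapping = match_schemas(source_schema, target_schema, matchers)
print(mapping)
```

The nice property of this layout is that a new module (say, a learner trained on past matches) can be added to the `matchers` list without touching the combiner or the constraint step.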

Types of semantic matches:

One-to-one matches
Complex matches: these require domain knowledge; the correct complex match is often not the top-ranked match, but lies somewhere among the top few predicted matches.

Data matching is discussed next. This part is important, because the main goal of KADIS is to match and query data from multiple sources. The techniques used in data matching are similar to those in schema matching, although the focus is on matching tuples.
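As a small illustration of tuple matching, the sketch below decides whether two tuples from different sources refer to the same real-world entity by averaging per-attribute string similarity and comparing it with a threshold. The attribute names, the sample rows and the threshold of 0.8 are assumptions of mine; real systems described in the survey use far richer evidence.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    """String similarity of two attribute values, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def tuple_similarity(left, right, attributes):
    """Average similarity over the attributes the two tuples share."""
    return sum(field_similarity(left[attr], right[attr]) for attr in attributes) / len(attributes)


def match_tuples(left_rows, right_rows, attributes, threshold=0.8):
    """Return every pair of tuples whose similarity reaches the threshold."""
    return [
        (left, right)
        for left in left_rows
        for right in right_rows
        if tuple_similarity(left, right, attributes) >= threshold
    ]


# Hypothetical tuples from two sources describing people.
source_a = [{"name": "Mike Williams", "city": "Atlanta"}]
source_b = [
    {"name": "M. Williams", "city": "Atlanta"},
    {"name": "Sara Jones", "city": "Boston"},
]

matches = match_tuples(source_a, source_b, ["name", "city"])
for left, right in matches:
    print(left["name"], "<->", right["name"])
```

Comparing every pair is quadratic in the number of tuples, so this only works as a toy; it is the tuple-level analogue of the name matcher used for schema elements.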

Finally, the paper also discusses open issues related to the whole semantic integration process, beyond schema and data matching. These issues include:

User interaction: the key point is how to reduce the user's burden when interacting with the system.
Formal foundations: develop formal semantics of matching and try to explain the matching mechanism formally.
Industrial-strength schema matching: apply the algorithms in real-world settings, which helps to better understand the applicability of current research and suggests future directions.
Mapping maintenance: dynamically maintain the semantic mappings between different sources as the sources change.
Reasoning with imprecise matches on a large scale: use and evaluate systems in which parts of the mappings always remain unverified and potentially incorrect, because of the overwhelming amount of information.
Schema integration: construct a global schema by matching and merging schemas from multiple data sources; this involves high-level operations such as those studied in model management.
Data translation: elaborate matches into mappings, to enable the translation of queries and data across schemas.
Peer-to-Peer data management:


Saturday, December 02, 2006
