Saturday, December 02, 2006

A Survey on Information Systems Interoperability
This survey presents some basic concepts, and it is worth going over them in this paper. Its structure is clear and most of the content is easy to understand. The author has rich experience applying information systems techniques to agriculture; I think the result is a system that manages multiple databases containing agricultural information. It might therefore be quite helpful for our work, since our goal is to construct a data-integration system to manage multiple archaeological databases.

Now, let's take a look at the paper section by section (the easiest way to review :-))
  1. In the first section, the motivation and the basic approach are discussed. 
    1. The goal is the construction of a data warehouse (or materialized view) integrating several kinds of data sources, particularly for scientific applications in agriculture. This is interesting, because it is quite similar to our goal in KADIS; the only difference is that our application is about archaeology. 
    2. The background of the application is that distinct data sources may be maintained independently; the paper focuses on semantic data heterogeneity and suggests an incremental, modularized approach to deal with the issues of data integration. 
  2. Information System Interoperability
    1. The paper suggests that the only way to reach interoperability is by publishing the interfaces, schemas and formats used for information exchange, making their semantics as explicit as possible, so that they can be properly handled by the cooperative systems. 
    2. Three viewpoints must be considered for information system interoperability: the application domain, the conceptual design, and the software system technology. Interoperability should be achieved for each viewpoint.
  3. Data System Interoperability
    1. In this section, the definitions of centralized database systems and heterogeneous database systems are presented first.
    2. Two categories of approaches to enable integrated access to multiple physical databases: schema integration and the federated approach. 
    3. Web databases are also mentioned in this section; the challenge in Web database querying research is the construction of a unified, simple interface.
  4. Data Integration
    1. The basic procedure of data integration considered here includes the resolution of heterogeneity conflicts and the transformation of source data to accommodate them in the integrated view.
    2. The kinds of data to be integrated and the heterogeneity conflicts should first be categorized. 
    3. The structure of the data is discussed: structured data and semi-structured data. 
    4. Data heterogeneity, or conflict, is summarized. Two ways are used to classify data conflicts: 
      1. representational conflicts and semantic conflicts;
      2. by level of abstraction (instance, schema, data model): data conflicts, schema conflicts, data-versus-schema conflicts, and data model conflicts. 
    5. Some proposals are brought up to resolve these conflicts in relational databases and semi-structured data. Some of the surveys cited here are worth reading. 
    6. Another way to resolve the conflicts is to construct a standard for describing the semantics (a common semantics).
    7. A series of procedures is suggested in the paper to generate a unified view of heterogeneous data. It's inspiring for our work, I think (a small sketch of conflict resolution follows this outline).
  5. Building blocks to integrate data in cooperative systems
    1. In this section, the author describes the software framework, modules, and techniques that could be used to construct the integrated data views.
  6. The semantic web
    1. A simple digest is presented here, giving a general idea of the Semantic Web standards and technologies: 
      1. Character Encoding + URI 
      2. XML + Schema
      3. RDF + RDFS
      4. Ontology
  7. Web services

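To make the conflict discussion in section 4 concrete, here is a minimal sketch in Java (the language I am using for the project) of resolving a representational conflict, a different field name and unit, when loading two hypothetical sources into one unified view. The class names, fields, and the agricultural example are my own invention, not taken from the paper.

```java
// Minimal sketch: resolving a representational conflict (different field
// names and units) between two hypothetical sources when building a
// unified view. All names here are invented for illustration.
public class UnifiedViewSketch {

    // Source A stores crop weights in kilograms under "weightKg".
    static class SourceA {
        final String cropName;
        final double weightKg;
        SourceA(String cropName, double weightKg) {
            this.cropName = cropName;
            this.weightKg = weightKg;
        }
    }

    // Source B stores the same concept in pounds under "massLb"
    // (a naming conflict plus a unit conflict).
    static class SourceB {
        final String crop;
        final double massLb;
        SourceB(String crop, double massLb) {
            this.crop = crop;
            this.massLb = massLb;
        }
    }

    // The unified view fixes one representation: name plus weight in kg.
    static class UnifiedRecord {
        final String name;
        final double weightKg;
        UnifiedRecord(String name, double weightKg) {
            this.name = name;
            this.weightKg = weightKg;
        }
        public String toString() { return name + ": " + weightKg + " kg"; }
    }

    static UnifiedRecord fromA(SourceA a) {
        return new UnifiedRecord(a.cropName, a.weightKg);
    }

    static UnifiedRecord fromB(SourceB b) {
        // Resolve the unit conflict: 1 lb = 0.45359237 kg.
        return new UnifiedRecord(b.crop, b.massLb * 0.45359237);
    }

    public static void main(String[] args) {
        System.out.println(fromA(new SourceA("wheat", 120.0)));
        System.out.println(fromB(new SourceB("wheat", 264.55)));
    }
}
```
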
Semantic Integration Research in the Database Community: A Brief Survey

This survey first presents the applications that require semantic integration and brings up the difficulties of the integration process. It then focuses on schema matching and data matching, which play a very important role in semantic integration. Finally, some other open issues are discussed. 
Applications mentioned in the survey include:
  1. Schema integration: merging a set of given schemas into a single global schema. The basic procedure is match, then merge.
  2. Data translation: data coming from multiple sources must be transformed into data conforming to a single target schema.
  3. Data integration systems: provide the user with a uniform query interface (a mediated schema) over a multitude of data sources. This is very inspiring, because our work on KADIS will belong to this category (see the sketch after this list).
  4. P2P (peer data management): allow peers to query and retrieve data directly from each other, without building a mediated schema.
  5. Model management: create tools for easily manipulating models of data. 
  6. Many factors increase the need for the applications above, such as the rapid development of the Internet, the widespread adoption of XML as a standard syntax for sharing data, the growth of the Semantic Web, etc. 

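As a small illustration of item 3, here is a minimal sketch, under my own assumptions, of the mediated-schema idea: a single uniform attribute name is rewritten into each source's local attribute name via a mapping table. The source names (siteDB, findsDB) and the mappings are invented for illustration, not taken from the survey.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a mediated schema: one uniform attribute name is
// rewritten into each source's local attribute name. All names and
// mappings here are invented for illustration.
public class MediatedSchemaSketch {

    // mappings.get(source).get(mediatedAttr) -> the source's local attribute
    private final Map<String, Map<String, String>> mappings = new HashMap<>();

    public void addMapping(String source, String mediatedAttr, String sourceAttr) {
        mappings.computeIfAbsent(source, s -> new HashMap<>())
                .put(mediatedAttr, sourceAttr);
    }

    // Rewrite a trivial "SELECT attr" query for one source.
    public String rewrite(String source, String mediatedAttr) {
        Map<String, String> m = mappings.get(source);
        String local = (m == null) ? null : m.get(mediatedAttr);
        if (local == null) {
            throw new IllegalArgumentException(
                "No mapping for " + mediatedAttr + " in " + source);
        }
        return "SELECT " + local + " FROM " + source;
    }

    public static void main(String[] args) {
        MediatedSchemaSketch m = new MediatedSchemaSketch();
        m.addMapping("siteDB", "location", "site_coord");
        m.addMapping("findsDB", "location", "findspot");
        System.out.println(m.rewrite("siteDB", "location"));
        System.out.println(m.rewrite("findsDB", "location"));
    }
}
```
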
However, semantic integration is an extremely difficult problem, because of the challenges below:
  1. It is hard to infer the semantics of the involved elements unless the creators of the data, or domain experts, are available.
  2. The clues used for matching schema elements are often unreliable, and sometimes incomplete.
  3. The global nature of matching makes it costly to find the BEST-matching elements across different sources.
  4. Matching is often subjective; the matching criteria may change from one application to another.

Next, the survey discusses the progress in schema matching and data matching, respectively. Three aspects are considered:
  1. Matching techniques:
    1. Rule-based solutions: hand-crafted rules to match schemas (domain-independent and domain-specific).
      1. Benefits: inexpensive; fast, because they typically operate only on schemas; work very well in certain types of applications; capture valuable user knowledge about the domain.
      2. Drawbacks: cannot exploit data instances effectively; cannot exploit previous matching efforts to assist in the current matching. 
    2. Learning-based solutions: neural networks, Naive Bayes, etc. are used in this category; external evidence, such as past matches, is exploited; domain-specific schemas and matches, and even the users, are involved in these solutions.
  2. Architecture of matching solutions: a module-based multi-matcher architecture, in which each module exploits a certain type of information well to predict matches (see the sketch after this list). 
    1. On the other hand, domain-specific information is taken as global information to be handled after the matching phase. 
    2. Related work in knowledge-intensive domains.
  3. Types of semantic matches:
    1. One-to-one matches;
    2. Complex matches: need domain knowledge; the correct complex match is often not the top-ranked match, but somewhere among the top few matches predicted.

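To make the multi-matcher architecture concrete, here is a minimal sketch: two toy rule-based modules (a crude name-similarity matcher and a type-equality matcher) each score a candidate column pair, and the combiner simply averages their scores. The modules, the column names, and the equal weighting are my own toy choices, not the survey's.

```java
import java.util.List;

// Minimal sketch of a module-based multi-matcher: each module scores a
// candidate column pair on [0,1], and the combiner averages the scores.
// The two modules and the equal weighting are toy choices.
public class MultiMatcherSketch {

    static class Column {
        final String name;
        final String type;
        Column(String name, String type) { this.name = name; this.type = type; }
    }

    interface Matcher {
        double score(Column a, Column b);
    }

    // Module 1: crude name similarity (shared lowercase prefix length,
    // normalized by the longer name).
    static final Matcher NAME = (a, b) -> {
        String x = a.name.toLowerCase(), y = b.name.toLowerCase();
        int n = Math.min(x.length(), y.length()), i = 0;
        while (i < n && x.charAt(i) == y.charAt(i)) i++;
        return (double) i / Math.max(x.length(), y.length());
    };

    // Module 2: the data types must agree.
    static final Matcher TYPE = (a, b) -> a.type.equals(b.type) ? 1.0 : 0.0;

    static double combined(Column a, Column b, List<Matcher> modules) {
        return modules.stream().mapToDouble(m -> m.score(a, b)).average().orElse(0.0);
    }

    public static void main(String[] args) {
        Column a = new Column("siteName", "string");
        Column b = new Column("site_nm", "string");
        System.out.println(combined(a, b, List.of(NAME, TYPE)));  // 0.75
    }
}
```
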
Data matching is discussed next. This part is important, because the main goal of KADIS is to match and query data from multiple sources. The techniques used in data matching are similar to those in schema matching, even though the focus is on tuple matching. 
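
Since data matching matters so much for KADIS, here is a minimal sketch of one simple tuple-matching approach: declare two tuples to refer to the same real-world entity when the Jaccard similarity of their word tokens exceeds a threshold. The threshold and the example tuples are invented; real systems use far richer evidence.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of tuple matching: two tuples are declared the same
// real-world entity if the Jaccard similarity of their word tokens
// exceeds a threshold. The threshold (0.5) is an arbitrary toy choice.
public class TupleMatchSketch {

    static Set<String> tokens(String s) {
        return new HashSet<>(Arrays.asList(s.toLowerCase().split("\\W+")));
    }

    static double jaccard(String a, String b) {
        Set<String> ta = tokens(a), tb = tokens(b);
        Set<String> inter = new HashSet<>(ta);
        inter.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        String t1 = "Bronze Age settlement, Site 12, layer 3";
        String t2 = "site 12 bronze age settlement layer 3";
        System.out.println(jaccard(t1, t2) > 0.5);  // true: likely a match
    }
}
```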

Finally, the paper also discusses the open issues related to the whole semantic integration process, beyond matching schemas and data. These issues include: 
  1. User interaction: the key point is how to reduce the user's burden when interacting with the system;
  2. Formal foundations: develop formal semantics of matching and try to formally explain the matching mechanism;
  3. Industrial-strength schema matching: apply algorithms in real-world settings, to better understand the applicability of current research and to suggest future directions;
  4. Mapping maintenance: dynamically maintain the semantic mappings between different sources as the source data change;
  5. Reasoning with imprecise matches on a large scale: use and evaluate systems where parts of the mapping always remain unverified and potentially incorrect because of the overwhelming size of the information;
  6. Schema integration: construct a global schema by matching and merging schemas from multiple data sources, which involves high-level operations such as model management;
  7. Data translation: elaborate matches into mappings, to enable the translation of queries and data across schemas;
  8. Peer-to-Peer data management: 

Wednesday, November 29, 2006

Some questions about the Curve Segmentation
Last night I finished implementing the curve segmentation algorithm in Java. I am now thinking about some questions concerning the application of this algorithm to topic segmentation. 
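
The post above doesn't spell out which segmentation algorithm I implemented, so the following is only a minimal sketch of one common top-down approach (split the curve at the point of maximum deviation from the chord between the endpoints, and recurse until the deviation falls below a tolerance), not necessarily my exact implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of top-down curve segmentation: split the curve at the
// point farthest from the chord between the endpoints, and recurse until
// the deviation falls below a tolerance. This is one common approach
// (Douglas-Peucker style), not necessarily the exact algorithm used.
public class CurveSegmentSketch {

    // Perpendicular distance from point (px,py) to the line through
    // (x1,y1) and (x2,y2).
    static double deviation(double px, double py,
                            double x1, double y1, double x2, double y2) {
        double dx = x2 - x1, dy = y2 - y1;
        double len = Math.hypot(dx, dy);
        if (len == 0) return Math.hypot(px - x1, py - y1);
        return Math.abs(dy * (px - x1) - dx * (py - y1)) / len;
    }

    // Collect breakpoint indices strictly between lo and hi, in order.
    static void segment(double[] x, double[] y, int lo, int hi,
                        double tol, List<Integer> breaks) {
        int worst = -1;
        double worstDev = tol;
        for (int i = lo + 1; i < hi; i++) {
            double d = deviation(x[i], y[i], x[lo], y[lo], x[hi], y[hi]);
            if (d > worstDev) { worstDev = d; worst = i; }
        }
        if (worst >= 0) {
            segment(x, y, lo, worst, tol, breaks);
            breaks.add(worst);
            segment(x, y, worst, hi, tol, breaks);
        }
    }

    public static void main(String[] args) {
        double[] x = {0, 1, 2, 3, 4, 5, 6};
        double[] y = {0, 0, 0, 3, 3, 3, 3};   // a step between two flat parts
        List<Integer> breaks = new ArrayList<>();
        segment(x, y, 0, x.length - 1, 0.5, breaks);
        System.out.println(breaks);            // prints [2, 3]: the step
    }
}
```
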
1. Now we use MDS to extract the most important feature of the input text stream, and some information is lost during this extraction. The amount of lost information depends on the complexity of the data. That is, much information can be lost when MDS cannot find the MOST important feature: either no dominant feature exists at all, or more than one feature is important, etc. 
Since the subsequent curve segmentation procedure is based on the dominant features extracted by MDS, it is not clear that we can get accurate segments of the information stream using features that do not reflect the whole graph. 
Thus, the question is how to keep the representative features accurate. One solution is to use a method other than MDS to analyze the dominant features of the input information stream; the other is to use complex (higher-dimensional) features as the representative of the data. 
MDS is a kind of dimensionality-reduction method, which tries to capture the common features in the data and compress the representation without losing much accuracy. In theory, it is not very effective to try other dimensionality-reduction methods if MDS has already done a good job. Maybe we had better try the second solution: adapt the curve segmentation to the high-dimensional space. 

2. Our goal in this project is to analyze an information stream, which changes dynamically over time. However, the method we use for curve segmentation requires the whole graph. That is, we have to analyze the data from scratch whenever changes happen, in order to exploit MDS to reduce the dimensionality of the feature space. One consequence is that the patterns we get are only provisional: the analysis result can change as more events occur in the stream. That does not sound good, although it is reasonable; usually, people are easily lost in a changing world! Therefore, the question is: how to segment the curve incrementally while keeping the result as consistent as possible? I don't know the answer now. One idea is to construct different segmentations based on the sizes of the windows containing information in the stream along the time line (a rough sketch of this windowed idea is below). 
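
Here is a rough sketch of that windowed idea, under my own assumptions: keep only the most recent points in a fixed-size window and look for a boundary inside the window only, so boundaries that slide out of the window are frozen and never revised. The window size, tolerance, and single-split policy are first guesses, not a worked-out design.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Rough sketch of the windowed idea: as events arrive, keep only the most
// recent `window` points and look for a segment boundary inside that
// window (here: the point of maximum deviation from the window's chord).
// Boundaries that slide out of the window are frozen and never revised.
public class WindowedSegmenterSketch {

    private final int window;
    private final double tol;
    private final Deque<double[]> recent = new ArrayDeque<>();  // points (x, y)

    public WindowedSegmenterSketch(int window, double tol) {
        this.window = window;
        this.tol = tol;
    }

    // Add a point; return the boundary point inside the current window,
    // or null if the window is still well described by a single segment.
    public double[] addPoint(double x, double y) {
        recent.addLast(new double[]{x, y});
        if (recent.size() > window) recent.removeFirst();  // freeze old data

        double[][] p = recent.toArray(new double[0][]);
        if (p.length < 3) return null;
        double[] a = p[0], b = p[p.length - 1];
        double dx = b[0] - a[0], dy = b[1] - a[1];
        double len = Math.hypot(dx, dy);
        double best = tol;
        double[] boundary = null;
        for (int i = 1; i < p.length - 1; i++) {
            double d = (len == 0)
                    ? Math.hypot(p[i][0] - a[0], p[i][1] - a[1])
                    : Math.abs(dy * (p[i][0] - a[0]) - dx * (p[i][1] - a[1])) / len;
            if (d > best) { best = d; boundary = p[i]; }
        }
        return boundary;
    }

    public static void main(String[] args) {
        WindowedSegmenterSketch s = new WindowedSegmenterSketch(5, 0.5);
        double[][] pts = {{0,0},{1,0},{2,0},{3,3},{4,3},{5,3}};
        for (double[] q : pts) {
            double[] bnd = s.addPoint(q[0], q[1]);
            if (bnd != null)
                System.out.println("boundary near (" + bnd[0] + ", " + bnd[1] + ")");
        }
    }
}
```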
