Thursday, December 07, 2006

Some Issues of Concern in Our Research
Next month's work has two parts: one is research, and the other is the implementation of the Demo. Let's take a close look at the research work first. (The other part will be checked soon.)
  1. Query extension - XPath & XQuery: we could introduce the evaluation of complex queries into our work. We have two choices:
    1. Define an intermediate representation of the query; then a transformation from XPath or XQuery to that intermediate form has to be defined (see the sketch below).
    2. Alternatively, equivalently transform an XQuery into a single XPath expression, or into a set of XPath queries, etc.
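To make the first option concrete, here is a minimal sketch of one possible intermediate form: a simple XPath parsed into a flat list of (axis, tag) steps. The representation and the helper name are my own assumptions for illustration, not a settled design.

    def parse_simple_xpath(expr):
        """Parse a simple XPath like '/a//b/c' into a list of
        (axis, tag) steps: 'child' for '/', 'descendant' for '//'."""
        steps = []
        i = 0
        while i < len(expr):
            if expr.startswith('//', i):
                axis, i = 'descendant', i + 2
            elif expr.startswith('/', i):
                axis, i = 'child', i + 1
            else:
                raise ValueError("expected '/' or '//' at position %d" % i)
            j = i
            while j < len(expr) and expr[j] != '/':
                j += 1
            steps.append((axis, expr[i:j]))
            i = j
        return steps

    # parse_simple_xpath('/a//b/c')
    # -> [('child', 'a'), ('descendant', 'b'), ('child', 'c')]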
  2. Dealing with branch queries (necessary, since we want to handle more complex XPath queries):
    1. In our current work, we can process a simple path query by means of a k-shortest-paths algorithm on the directed graph.
    2. Where branch queries are concerned, we could reuse existing algorithms, such as k-shortest paths, to carry out the evaluation.
      1. The assumption is that we have a way to measure the 'agreement' of a branch, analogous to that of a path (a more general definition of agreement).
      2. Another question is how to obtain the instances satisfying a branch query from the directed graph. Possible solutions include:
        1. JOIN method: split the branch query into path queries, evaluate each path query, and finally join the instances of each path query (see the sketch below).
        2. Some holistic query-evaluation algorithm could be used to process branch queries; for instance, the Prufer sequence offers a way to evaluate a branch query holistically.
        One problem is that most holistic algorithms require a tree-like data model, not a directed graph. Therefore, algorithms for finding spanning trees in the directed graph may first have to be exploited in this case.
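A minimal sketch of the JOIN method, assuming each path instance is a tuple of node ids and that the branch query forks at its first node; both assumptions are mine, and the splitting step itself is still to be worked out.

    def join_on_branch_node(paths_a, paths_b):
        """JOIN method: given the instances of the two path queries a
        branch query was split into, keep the pairs that agree on the
        (assumed first) branching node, via a simple hash join."""
        index = {}
        for p in paths_b:
            index.setdefault(p[0], []).append(p)
        joined = []
        for p in paths_a:
            for q in index.get(p[0], []):
                joined.append((p, q))
        return joined

    # join_on_branch_node([('r', 'a')], [('r', 'b'), ('s', 'b')])
    # -> [(('r', 'a'), ('r', 'b'))]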
  3. The last issue is the definition of conflicts in our paper. It might need improving in the new situation where complex queries, as opposed to plain path queries, are involved.
  4. In sum, we (actually, I) should try to solve the problems above. I have a very simple plan, as follows:
    1. XPath and XQuery 
      1. write a report on the basic knowledge of these two concepts.
    2. Path query --> Branch query
      1. implement the JOIN algorithm to evaluate branch queries.
      2. implement a holistic query-evaluation algorithm to evaluate branch queries (the Prufer encoding sketched below is the likely starting point).
    3. Understand the current definition of conflicts in our paper, and summarize the properties of conflicts. It should be able to handle more complex cases, like conflicts among path instances that have different sources or sinks.
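As background for plan item 2.2, here is the classic Prufer encoding itself (for a tree on nodes 0..n-1); the holistic matching built on top of it is still to be studied, so this is only the starting point.

    import heapq
    from collections import defaultdict

    def prufer_sequence(edges, n):
        """Encode a labeled tree on nodes 0..n-1 (given as an edge list)
        into its Prufer sequence of length n - 2: repeatedly remove the
        smallest-labeled leaf and record its neighbor."""
        adj = defaultdict(set)
        for u, v in edges:
            adj[u].add(v)
            adj[v].add(u)
        leaves = [u for u in range(n) if len(adj[u]) == 1]
        heapq.heapify(leaves)
        seq = []
        for _ in range(n - 2):
            leaf = heapq.heappop(leaves)
            parent = adj[leaf].pop()
            seq.append(parent)
            adj[parent].discard(leaf)
            if len(adj[parent]) == 1:
                heapq.heappush(leaves, parent)
        return seq

    # prufer_sequence([(0, 1), (1, 2), (1, 3)], 4) -> [1, 1]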

Wednesday, December 06, 2006

In the morning, we had a meeting about the progress of the research. In the upcoming work, we have two concentrations: one is the Demo, and the other is the journal paper for VLDB.
  1. Research paper
    1. Based on the CleanDB paper, a journal paper should be compiled. Since the CleanDB paper is relatively terse because of the space limit, more details should be provided in the journal paper, such as:
      1. Motivation
      2. Extensions
      3. Expressive power
      4. Evaluation
    2. Some new aspects need noticing, such as:
      1. In order to deal with more complex queries (not only path queries, but branch queries ...), we should improve the query processing by introducing algorithms that find ranked candidates for a branch query. (One way is to exploit spanning trees to handle branch queries; the other is to use a join operation on the basis of the path results.)
      2. For now, we can only deal with part of XPath. In the future, full XPath, and even XQuery, should be considered.
      3. The results of one path query could have different sources or sinks, since many nodes can share the same tag or value. This is not very hard if virtual nodes are used as the common source and sink (see the sketch below). One problem in this case might be the identification of conflicts. (??? I am still considering what it is ???)
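A minimal sketch of the virtual-node trick, assuming the data graph is a networkx DiGraph whose nodes carry a 'tag' attribute; the library choice and the attribute name are assumptions for illustration.

    import networkx as nx

    def add_virtual_endpoints(g, first_tag, last_tag):
        """Connect a virtual source to every node carrying the query's
        first tag, and every node carrying the last tag to a virtual
        sink, so all candidate instances share one source and sink."""
        h = g.copy()
        h.add_node('SRC')
        h.add_node('SINK')
        for n, data in g.nodes(data=True):
            if data.get('tag') == first_tag:
                h.add_edge('SRC', n)
            if data.get('tag') == last_tag:
                h.add_edge(n, 'SINK')
        return h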
  2. Implementation of the Demo: three main parts make up the whole process: matching, merging and query evaluation. Furthermore, some components could be defined for each part. We also have some questions about the implementation.
    1. What data model will we use in our Demo? In other words, what is the input of the system? The candidates include OWL, which could be used for ontologies from multiple sources.
    2. Matching: 
      1. Text-similarity-based matching (please find the right algorithm ...; a placeholder sketch follows this list);
      2. Users can provide additional information to assist the matching among nodes;
      3. Structure-based matching, where we could make use of Kim's work (?? how).
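Until the right algorithm is found, a placeholder sketch of text-similarity matching using Python's standard difflib; the 0.8 threshold and the string-labeled nodes are my assumptions.

    from difflib import SequenceMatcher

    def match_by_label(labels_a, labels_b, threshold=0.8):
        """Pair node labels from two ontologies whose surface strings
        are sufficiently similar; returns (a, b, score) triples sorted
        by decreasing similarity."""
        matches = []
        for a in labels_a:
            for b in labels_b:
                score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
                if score >= threshold:
                    matches.append((a, b, score))
        return sorted(matches, key=lambda m: -m[2])

    # match_by_label(['Title', 'Author'], ['title', 'name'])
    # -> [('Title', 'title', 1.0)]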
    3. Merging algorithm:
      1. for the moment, we use the scheme presented in the paper (submitted to SIGMOD) to merge matched nodes.
    4. Zone identification and agreement computation
      1. From the integrated view of the input ontologies, the zone graph should be constructed.
      2. Based on the zone graph, zones are identified and the agreements of zonal choices are computed.
      3. An edge-labeled directed graph is then generated as the agreement-based graph.
    5. Path instance enumeration for XPath
      1. Based on the user's XPath query, candidate results are enumerated and ranked by their path agreements (see the sketch below). (Here, we could introduce a k-spanning-tree search algorithm to extend the complexity of the queries from path to branch ...)
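A sketch of how the ranked enumeration might be set up, assuming edge agreements lie in (0, 1] and are stored on an 'agreement' attribute: maximizing the product of agreements along a path becomes a shortest-path problem under the weight -log(agreement), so the k best candidates come straight out of a k-shortest-paths routine (networkx and the attribute name are assumptions).

    import math
    from itertools import islice
    import networkx as nx

    def top_k_path_instances(g, source, sink, k=10):
        """Enumerate the k path instances with the highest product of
        edge agreements, via k shortest simple paths on -log weights."""
        w = nx.DiGraph()
        for u, v, data in g.edges(data=True):
            w.add_edge(u, v, weight=-math.log(data['agreement']))
        gen = nx.shortest_simple_paths(w, source, sink, weight='weight')
        return list(islice(gen, k))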
    6. Constraint evaluation and fuzzy program optimization
      1. this part is domain specific.
      2. the degree of conflict needs measuring.
The first three components above and the last one are domain specific; the rest are relatively independent of the application domain.


Tuesday, December 05, 2006

Semantic Heterogeneity in Global Information Systems: The Role of Metadata, Context and Ontologies
This paper concentrates on approaches to tackling the semantic heterogeneity problem in the context of Global Information Systems (GIS). At first I didn't think it would be very useful to me as a survey, because it looked ordinary and hardly offered much; that impression was largely confirmed in the end, although some of the content is still helpful for our research. Now let's take a close look at this survey paper.

1. At the beginning of the paper, the authors establish the need for approaches to semantic heterogeneity. The problem of semantic heterogeneity is defined as the identification of semantically related objects in different databases and the resolution of the schematic differences among them. The key objective of the approaches brought up in the paper is to reduce the problem of knowing the contents and structure of each of a huge number of information repositories to the significantly smaller problem of knowing the contents of domain-specific ontologies, which a user familiar with the domain is likely to know or can easily understand.

2. What is Metadata?
The basic idea of the paper is to construct metadata from the original database. Metadata is data or information about data. In the paper, it is categorized as follows:
    1. Content-independent metadata: does not depend on the content of the document it is associated with; examples include location, modification date, etc.
    2. Content-dependent metadata: relates to the content of the document it is associated with, such as the size of the document, the number of rows, etc. This category can be divided into smaller classes:
      1. Direct content-based metadata: based directly on the content; examples include full-text indices, inverted indices, document vectors, etc.
      2. Content-descriptive metadata: describes the contents without directly utilizing the contents of the document. Furthermore, it can be divided into:
        1. Domain-independent metadata: captures information present in the document independent of the application or subject domain of the information. The HTML document type definition is a good example in this category.
        2. Domain-specific metadata: described in a manner specific to the application or subject domain of the information. Thus the issue of vocabulary is important, because terms have to be chosen in a domain-specific manner.
According to the paper, domain-specific metadata is the most relevant for dealing with issues of semantic heterogeneity, since it can capture information about a specific application or domain, and it can be viewed as a vocabulary of terms for the construction of more domain-specific metadata descriptions.

In the paper, two kinds of contexts are introduced: metadata contexts and conceptual contexts (c-contexts).
    1. Metadata contexts mainly abstract away the representational details;
    2. Conceptual contexts capture domain knowledge. Together with the structured data, the paper uses this metadata to capture the information content.

3. Constructing c-contexts from ontological terms. 
The method used in the paper is to take terms from domain-specific ontologies as the vocabulary for characterizing the information. In other words, the metadata of a document can be represented as a vector of attribute-value pairs. Based on this definition, mechanisms for reasoning and manipulation are also discussed. In order to map contextual descriptions to the database schema, a set of projection rules is defined and used.
Finally, some issues about language and ontology in context representation are presented. 
    1. About the values of context attributes: a context here is a collection of contextual coordinates and their values. There are some basic requirements on the definition of these values: (1) they should be declarative in nature, helping to perform inferences on the context; (2) they should express the context as a collection of contextual coordinates, each describing a specific aspect of the information present in the database or requested by a query, consistent with the representation of the c-context; (3) they should have primitives in the model world and the ontology.
    2. About the ontology: the scalability of the ontology is the main concern at this step. The paper presents two approaches to combining various ontologies.
      1. The common approaches:
        1. build an extensive global ontology;
        2. exploit the semantics of a single problem domain.
      2. Re-use existing ontologies/classifications:
        1. combine different existing ontologies, though some problems should be noted, like the overlap between different ontologies.

4. Semantic interoperability using terminological relationships, such as synonyms, hyponyms and hypernyms.
    1. When discussing the use of synonyms to interoperate across ontologies, the paper takes OBSERVER as an example.
      1. An architecture for interoperation is presented, mainly composed of a query processor, an ontology server, an interontology relationships manager (IRM) and the ontologies. (This structure may be useful for our work; at the very least we could compare it with our own architecture. In some sense, I don't think it could work very well, because the IRM component takes on too much work and is complex enough to intimidate any designer, although it does help a lot with the scalability of the whole procedure.)
    2. Using hyponyms and hypernyms to interoperate across ontologies: this section seems very useful because it describes problems we have in our project. In reality, hierarchical relationships like hypernymy and hyponymy are more common than synonymy. The solution introduced in the paper is to substitute a non-translated term with the intersection of its immediate parents or the union of its immediate children (see the sketch below). The point is that the synonyms, hyponyms and hypernyms must first be identified within and between the user and target ontologies.
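A minimal sketch of that substitution rule, assuming the ontology is available as simple term-to-parents and term-to-children maps; the data structures and names are mine, not the paper's.

    def substitute_term(term, parents, children):
        """Rewrite a non-translated term as the intersection (AND) of
        its immediate parents (hypernyms) or, failing that, the union
        (OR) of its immediate children (hyponyms)."""
        if parents.get(term):
            return ('AND', sorted(parents[term]))
        if children.get(term):
            return ('OR', sorted(children[term]))
        return None  # no safe substitution available

    # substitute_term('sedan', {'sedan': {'car'}}, {}) -> ('AND', ['car'])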