Monday, October 30, 2006

Some notes about XBench
First, we'd better clarify some definitions used by XBench. 
  1. Data structure of the source
       Text Centric (Single Document), TC/SD: documents in this class are typically a single large, text-dominated document with repeated similar entries, deep nesting and possible references between entries.
       Text Centric (Multiple Documents), TC/MD: this class consists of numerous relatively small text-centric XML documents with references between documents, a loose schema and possibly recursive elements.
       Data Centric (Single Document), DC/SD: documents in this class are similar to TC/SD in structure but contain less text. However, their schemas tend to be stricter, with less irregularity than in TC/SD, since most DC/SD documents are translated directly from relations.
       Data Centric (Multiple Documents), DC/MD: data-centric multiple documents are transactional and are primarily used for data exchange. Thus, the tags are more descriptive and the documents contain less text content. Usually, the structure is more restricted (less irregular) and flatter (less deep), since most of the data originates in relational databases.



Data generation
We use XML data as the source for the evaluation part of our work. In the last meeting, Dr. Candan suggested one potential method. The basic idea is as follows:
1.        Select one DTD or schema as the template of some XML generator;
a.        We use a schema from XBench that belongs to the text-centric/multiple-documents category. (XML documents in this class are numerous, relatively small and text-centric, with references between documents, a loose schema and possibly recursive elements.)

2.        Run the XML generator to produce some XML documents; make sure that different XML documents share some identical parts by adjusting the parameters of the generator;
a.        ToXGene is selected as the XML document generator.
b.        Based on the schema from XBench, we can get a set of XML documents, which have the same schema but different values for entities and attributes. 

3.        Merge all of the XML documents into a single integrated view;
a.        The rules for merging documents need to be defined.
i.        The combination is based on path matching; that is:
                               For each XML document in the collection do
                                       For each leaf in that document do
                                               Find the path corresponding to that leaf;
                                               Update the graph matrix based on that path.
b.        The basic idea is illustrated by the following figure:
       
       
c.        end
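
The path-matching merge in step 3 can be sketched in Python as follows. This is only a rough illustration under several assumptions that the notes above do not spell out: the "graph matrix" is taken to be an adjacency matrix whose nodes are distinct label paths (tuples of tag names from the root), two nodes from different documents are merged when their label paths are identical, and each matrix entry counts how many leaf paths pass through that parent-child edge. The names leaf_paths and merge_documents are hypothetical helpers, not part of XBench or ToXGene.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def leaf_paths(root):
    """Yield the label path (tuple of tag names) from the root to each leaf element."""
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        children = list(node)
        if not children:
            yield path
        for child in children:
            stack.append((child, path + (child.tag,)))


def merge_documents(xml_strings):
    """Merge documents by path matching (assumed interpretation of step 3).

    Returns (index, matrix): index maps each distinct label path to a
    row/column number, and matrix[i][j] counts the leaf paths in which
    path i is the parent of path j.
    """
    index = {}                 # label path -> row/column number
    edges = defaultdict(int)   # (parent number, child number) -> count

    def node_id(path):
        return index.setdefault(path, len(index))

    for text in xml_strings:
        root = ET.fromstring(text)
        for path in leaf_paths(root):
            node_id(path[:1])  # register the root even if it is itself a leaf
            for depth in range(1, len(path)):
                parent = node_id(path[:depth])
                child = node_id(path[:depth + 1])
                edges[(parent, child)] += 1

    size = len(index)
    matrix = [[0] * size for _ in range(size)]
    for (i, j), count in edges.items():
        matrix[i][j] = count
    return index, matrix


if __name__ == "__main__":
    docs = [
        "<a><b>x</b><c><d>y</d></c></a>",
        "<a><b>z</b><e/></a>",
    ]
    index, matrix = merge_documents(docs)
    for path, number in sorted(index.items(), key=lambda item: item[1]):
        print(number, "/".join(path), matrix[number])
```

Because nodes are keyed by their full label path, parts that are shared between generated documents collapse into the same rows and columns, while document-specific branches add new ones. A different merge rule (for example, keying on tag names alone) would give a different matrix.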