Thursday, January 19, 2006

A symbolic representation of NVX

I am trying to find a way to give a formal definition of NVX, which requires a symbolic description. I read through the first part of Selcuk's report and got some useful ideas. At this stage I need a clear understanding of NVX, and also a formal representation of each node in an XML document in which NVX may occur.
Borrowing ideas from Selcuk's report, we have 5 different types of null values.
  1. Existential Null (ex_mar);
  2. Maybe Null (ma_mar);
  3. Place holder Null (pl_mar);
  4. Partial Null (pa_mar);
  5. Partial Maybe Null (pm_mar).

Two categories of NVX are possible in the XML document, one is tag-related, and the other is structure-related. For each category, we use ??_mar<?> as the mark of the respective null value. For example, in the case of tag-related NVX, the possible marks include ex_mar<tag>, ma_mar<tag>, pl_mar<tag>, pa_mar<tag>, and pm_mar<tag>. On the other hand, for structure-related NVX, we use ex_mar<str>, ma_mar<str>, pl_mar<str>, pa_mar<str>, and pm_mar<str>.
Next, let's look at what these marks mean. Before that, we need a symbolic representation of nodes, NVX, etc. Let a node (or a set of nodes) in the XML document be a 3-tuple (id, tag, parent_id), where
  1. id is the identifier of the node; it is structure-related information.
    1. id could be a value that uniquely distinguishes this node from all others. We use the mark va_mar<str> to represent such a value.
    2. id could be pl_mar<str>, meaning that this is a dummy node.
    3. id could be pa_mar<str>, representing a set of nodes which all satisfy some constraints.
    4. id could be ex_mar<str>, meaning that the node exists somewhere in the XML documents.
    5. id could be ma_mar<str>, meaning that the node might exist somewhere, but this is not certain.
    6. id could be pm_mar<str>, meaning that the node could be one of the values in the set, or might not exist at all.
  2. tag describes the content of the node; it is tag-related information.
    1. tag could be a value in a predefined domain, for which we could use the mark va_mar<tag>. One open question is how to define the domain of va_mar<tag>!
    2. tag could be any of the tag-related NVX marks: ex_mar<tag>, ma_mar<tag>, pl_mar<tag>, pa_mar<tag>, and pm_mar<tag>.
  3. parent_id is the identifier of the node's parent; it is also structure-related information.
    1. parent_id could be a value unique to a node in the XML document, namely the parent of the node concerned (va_mar<str>).
    2. parent_id could be one of the marks for NVX, such as pl_mar<str>, ex_mar<str>, ma_mar<str>, pa_mar<str>.
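
To check that the notation hangs together, here is a minimal Python sketch of the 3-tuple. The class names (NullKind, Category, Mark, Value, Node) are my own placeholders, not from Selcuk's report, and the only rule it enforces is the one stated above: id and parent_id take structure-related (<str>) entries, tag takes tag-related (<tag>) entries.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union


class NullKind(Enum):
    """The five null-value types listed above."""
    EX = "ex_mar"  # existential null
    MA = "ma_mar"  # maybe null
    PL = "pl_mar"  # place holder null
    PA = "pa_mar"  # partial null
    PM = "pm_mar"  # partial maybe null


class Category(Enum):
    """The two categories of NVX: tag-related and structure-related."""
    TAG = "tag"
    STR = "str"


@dataclass(frozen=True)
class Mark:
    """A null mark such as ex_mar<tag> or pl_mar<str>."""
    kind: NullKind
    category: Category

    def __str__(self) -> str:
        return f"{self.kind.value}<{self.category.value}>"


@dataclass(frozen=True)
class Value:
    """Stands for va_mar<...>: an ordinary, known value (hypothetical encoding)."""
    raw: str
    category: Category

    def __str__(self) -> str:
        return f"va_mar<{self.category.value}>={self.raw}"


Slot = Union[Value, Mark]


@dataclass(frozen=True)
class Node:
    """A node as the 3-tuple (id, tag, parent_id)."""
    id: Slot
    tag: Slot
    parent_id: Slot

    def __post_init__(self) -> None:
        # id and parent_id are structure-related, tag is tag-related.
        expected = {"id": Category.STR, "tag": Category.TAG, "parent_id": Category.STR}
        for field_name, category in expected.items():
            if getattr(self, field_name).category is not category:
                raise ValueError(f"{field_name} must be {category.value}-related")


# A node with a known id, a "maybe" tag, and a parent that exists somewhere.
example = Node(
    id=Value("n7", Category.STR),
    tag=Mark(NullKind.MA, Category.TAG),
    parent_id=Mark(NullKind.EX, Category.STR),
)
print(example.id, example.tag, example.parent_id)
```

The sketch does not say what each mark means operationally; it only records which mark sits in which position of the tuple.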

Wednesday, January 18, 2006

A description for a simple definition of NVX

If we simply treat an XML document as a tree, without considering any features introduced by XML itself, each node can be taken as a 4-tuple (id, value, pointer, constraints).
  1. id: the identifier of the node; it should be unique to each node, such as a number or an id string. Usually it should not be a null value, but it possibly could be.
    1. When id is a meaningful identifier, it corresponds to an existing node in the XML document;
    2. When id is one of the null values, the situation is a little more complicated:
      1. it must be something; leaving it empty is not permitted here.
      2. it could be a mark, representing a set of nodes satisfying some conditions.
  2. value: the content held in the node; it is tag-related information. The value could be an integer, a real number, a string, etc. Tag-related Null values can occur here.
    1. The definition of tag-related Null values could reuse those in Selcuk's report. Marks such as ex_mar, ma_mar, pl_mar, pa_mar, pm_mar can be used to represent null values here.
  3. pointer: structure-related information about the node, that is, it points to the nodes directly connected with the node concerned. In practice, we put the ids of those connected nodes here. Structure-related Null values can occur here.
    1. Usually the pointer should point to one node. The only exception is when the node concerned is the root of the whole XML document.
    2. A mark for one kind of null value is also possible here.
      1. it could be null, which means the linked node doesn't exist.
      2. it could be a mark for a set of nodes which satisfy a certain condition.
  4. constraints: this item gives an 'explanation' of the Null values used for this node.
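
As a rough illustration of how the 4-tuple might be used, here is a small Python sketch. It assumes one particular reading that is not settled above: the constraints item is a predicate over candidate nodes, and it 'explains' a set mark in the pointer by saying which nodes belong to the set. The names TreeNode, SET_MARK and resolve_pointer are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Union

# Hypothetical mark: the pointer stands for a set of nodes satisfying a condition.
SET_MARK = "pa_mar"


@dataclass
class TreeNode:
    id: Union[int, str]
    value: Union[int, float, str]
    # id of the connected node, a mark, or None (root, or the linked node doesn't exist)
    pointer: Union[int, str, None]
    # Assumed reading of "constraints": a predicate describing the set behind a mark.
    constraints: Optional[Callable[["TreeNode"], bool]] = None


def resolve_pointer(node: TreeNode, nodes: List[TreeNode]) -> List[TreeNode]:
    """Return the nodes that node.pointer may refer to."""
    if node.pointer is None:
        return []
    if node.pointer == SET_MARK:
        if node.constraints is None:
            raise ValueError("a set mark needs constraints to explain it")
        return [n for n in nodes if n is not node and node.constraints(n)]
    return [n for n in nodes if n.id == node.pointer]


nodes = [
    TreeNode(1, "library", None),                   # root: points to nothing
    TreeNode(2, "shelf", 1),
    TreeNode(3, "shelf", 1),
    # The book is connected to one of the shelves, but we don't know which.
    TreeNode(4, "book", SET_MARK, lambda n: n.value == "shelf"),
]

candidates = [n.id for n in resolve_pointer(nodes[3], nodes)]
print(candidates)
```

Note that this sketch uses None both for the root and for a link whose target doesn't exist; a real model would probably need to tell these two cases apart.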

Next, a more formal representation of the node in the XML document will be presented.

Make a selection between two different ways

In the morning, I discussed the research with Dr. Candan. I found that there is a gap between our ideas about the definition of Null Values in the XML documents (NVX). According to him, we should try to find a formal model to represent the concept of NVX and shed light on the definition itself. I was too attached to specific features. For instance, I considered the difference between leaf nodes and non-leaf nodes in XML. That's not bad, but as Candan said, I jumped too far ahead.
Furthermore, Candan and I approach this problem in different ways. Selcuk thinks we'd better take a general approach: any XML document can be treated as a simple tree structure, and every conclusion so far comes from such a structure. It's unnecessary to consider specific features such as DTDs. My methodology is different from his. I believe we should put everything into the context of XML when defining the concept of NVX; the definition of NVX ought to have something to do with XML itself.
Following Selcuk's idea, we could get a simple model first, and then add more features step by step. It's not a bad idea. Mine could lead to a rather complex (though maybe not so hard) model, given the complexity of XML documents. I haven't yet considered the difference mentioned above carefully, and I hope it's right, because I will follow Selcuk's way in future research.


Tuesday, January 17, 2006

Definition of Null Values in the XML documents

According to Dr. Candan's report, A Unified Treatment of Null Values using Constraints, there are at least 5 types of null values in the relational database: existential null (ex_mar), maybe null (ma_mar), place holder null (pl_mar), partial null (pa_mar), and partial maybe null (pm_mar). All of these null values are tag-related; that is, they don't deal with the potential relationships among these values.
As far as an XML document is concerned, null values fall into two kinds at a high level: tag-related null values and structure-related null values. In that case, we can directly reuse the results obtained in the report to process the tag-related null values. (Is it true? I need more consideration and discussion!) For structure-related null values, nothing is decided yet. We have to define them first of all.
Usually there is a DTD file associated with a set of XML documents, giving a definition of the structure of these documents. So we can call such a DTD a structural description of a class of XML documents. All possible non-leaf nodes should be provided in the DTD file, and the basic components making up the tree are also presented.
From this point of view, nodes in an XML document can be divided into two categories: non-leaf nodes and leaf nodes. Leaf nodes are value nodes, which change constantly. Tag-related null values can occur as leaf nodes. Non-leaf nodes are function nodes, which are almost constant. The relationships among these nodes give a general idea of the structure of the XML document. As a result, structure-related null values should occur in these nodes. But if we treat non-leaf and leaf nodes equally as nodes in the XML document, we can also take advantage of the methods for tag-related null values when processing this kind of structure-related null value. (Hope it works as expected.)
On the other hand, we have to consider the relationships among different non-leaf nodes (such as parent-child, ancestor-descendant, etc.). Another category of structure-related null values arises at this point. It's not close to tag-related null values at all. Some questions follow at once: how to define these null values, how they influence the common operations on XML documents, ...

A starting point for the research on Null Value in the XML documents

In the morning, I went to Prof. Kintigh's office. We discussed some problems in the implementation. Everything looks ok so far. We will prepare for the workshop next week. I think that I'd better attend it.
At noon, I had a meeting with Dr. Candan to talk about our research on Null Value in the XML document. The discussion made the basic problems and some potential solutions a little clearer. I don't believe these ideas are mature enough, because I still have some questions. At least it's a start. The following are some conclusions we reached in the short meeting.

  1. One direction of the research is to convert the XML documents into the relational database. Then use suitable marks to represent the Null Values from the XML documents.
  2. We could represent each node in an XML tree structure to be a 3-tuple (node-id, tag, parent-id).

1) node-id is an identifier for each node in the XML documents. It may be a simple number without any meaning, which scales easily. A range/path code containing the structure information of the XML tree may also be used, but it is costly to maintain when nodes are added or deleted. Alternatively, node-id might be replaced by a query (such as an XQuery) that yields a set of node-ids, or by one of the marks that capture the different senses of Null Values.

2) tag describes the value contained in the node. In most cases, it's a string. It may also be a mark defined in Dr. Candan's report, A Unified Treatment of Null Values using Constraints.

3) parent-id gives information about the parent of the concerned node. This part covers the structure information in the XML documents related to that node. As far as we know, a constraint composed of Boolean operators and XQuery expressions might be used to represent a Null Value occurring at this point. But the question is what to define and how to define it. We should be careful not to make things too complex.

Honestly, we don't think it's the final version of the representation for the node. Moreover, the current definition of the Null Value in the XML document is not clear enough.
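
To get a feel for the first direction (XML to relational), here is a minimal Python sketch using the standard xml.etree.ElementTree and sqlite3 modules. The table name node, the sample document, and the choice to store a mark as plain text in the tag column are all my own assumptions for illustration; node-ids are simple pre-order numbers without any meaning, and only element names are kept (text content is ignored).

```python
import sqlite3
import xml.etree.ElementTree as ET
from itertools import count


def xml_to_rows(xml_text):
    """Flatten an XML tree into (node_id, tag, parent_id) rows."""
    root = ET.fromstring(xml_text)
    ids = count(1)  # simple numbers without any meaning
    rows = []

    def visit(elem, parent_id):
        node_id = next(ids)
        rows.append((node_id, elem.tag, parent_id))
        for child in elem:
            visit(child, node_id)

    visit(root, None)  # the root has no parent
    return rows


DOC = "<book><title/><author><name/></author></book>"
rows = xml_to_rows(DOC)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (node_id INTEGER PRIMARY KEY, tag TEXT, parent_id INTEGER)"
)
conn.executemany("INSERT INTO node VALUES (?, ?, ?)", rows)

# Assumption: a child of <author> is known to exist but its tag is unknown,
# so the mark ex_mar is stored in the tag column instead of SQL NULL.
conn.execute("INSERT INTO node VALUES (?, ?, ?)", (5, "ex_mar", 3))

children = [
    row[0]
    for row in conn.execute(
        "SELECT tag FROM node WHERE parent_id = ? ORDER BY node_id", (3,)
    )
]
print(children)
```

This only covers marks in the tag column. A mark or an XQuery-style constraint in node-id or parent-id would not fit an integer column, which is part of the open question about how to represent structure-related Null Values.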


From points 1 and 2, some concerns follow:
  1. What is the definition of Null Value in the XML document?
  2. How to represent the Null Value in the XML documents?
  3. Is it necessary to convert an XML document containing Null values into a relational database in order to process those null values?
  4. If the answer to 3 is yes, how to do the conversion? What is the influence on the XML operations?
  5. If the answer to 3 is no, how do we deal with Null Values, and how do we give a formal representation that makes it easy to process Null Values in the XML documents?