Monday, October 10, 2005

Reconsideration of the application of the curve segmentation algorithm in our approach

The curve segmentation algorithm is usually used to approximate a curve in a multi-dimensional space with a series of segments, including straight lines and arcs. In such a space, the dimensions mostly share the same geometric properties. For example, you can use the same 'ruler' to measure the distance between any two points in this kind of space.
In our project, we exploit curve segmentation to assist topic segmentation. We map all entries into a one-dimensional space and plot them in a two-dimensional plane. However, in the two-dimensional plane we use, each dimension has a different meaning: X is the sequence number of an entry and Y is its estimated content value. So we must justify applying the curve segmentation algorithm to our project.
In fact, when applying the curve segmentation algorithm in our project, all I did was normalize each dimension into the range [0, 1]. The problem is whether we can use the result directly, without any further processing. It is really a question!
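The normalization step could look roughly like the sketch below. It assumes plain min-max scaling, which this post does not actually specify; the function name and the sample values are made up for illustration only.

```python
def min_max_normalize(values):
    """Rescale a sequence of numbers linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical data: X is the entry sequence number, Y the estimated content value.
sequence_numbers = [1, 2, 3, 4, 5]
content_values = [10.0, 12.0, 20.0, 18.0, 30.0]

xs = min_max_normalize(sequence_numbers)
ys = min_max_normalize(content_values)
```

After this step both axes span [0, 1], but that only makes the ranges equal. It does not by itself make one unit along X mean the same thing as one unit along Y.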

A new way to find the boundaries among topic segments

When I ran the new algorithm (including curve segment combination and topic segment identification) on the same dataset, I found it almost impossible to find any cases of the 'interrupted' pattern. In the morning, I started trying larger thresholds. That still didn't work. However, I ran into a situation very different from before, and suddenly I had a new idea for deciding the boundaries among topic segments.

When I used a small threshold value for the 'dominated/dominant' pattern, the resulting patterns formed an irregular sequence; that is, the 'dominated/dominant' patterns and the 'drifting' patterns interleaved in a random manner. If I followed my previous scheme to get the boundaries in the sequence, I would have to analyze each segment one by one. But when I increased the threshold, some interesting changes happened. Some 'dominated/dominant' patterns came out consecutively, which means they could be combined. Judging from the plot of all entries, it looks impossible or meaningless to classify them as 'dominated/dominant' patterns, since they obviously LOOK like a kind of 'drifting' pattern. But the result is so regular that it could produce much better boundaries if we combine the consecutive segments. It is time to separate the pattern analysis from the plot, although not completely.
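The combination of consecutive segments could be sketched as follows. This is only an illustration: the (start, end, label) tuple layout and the label strings 'dominant' and 'drifting' are assumptions of mine, not the actual data structures of the project.

```python
from itertools import groupby


def merge_consecutive(segments, label="dominant"):
    """Merge runs of adjacent segments that carry `label` into one segment.

    Each segment is a (start, end, label) tuple, and the list is assumed
    to be ordered by position. Segments with other labels are kept as-is.
    """
    merged = []
    for key, run in groupby(segments, key=lambda seg: seg[2]):
        run = list(run)
        if key == label:
            # One segment spanning from the first start to the last end.
            merged.append((run[0][0], run[-1][1], key))
        else:
            merged.extend(run)
    return merged


# Hypothetical result for a larger threshold: two 'dominant' segments in a row.
segments = [
    (0, 4, "dominant"),
    (5, 9, "dominant"),
    (10, 12, "drifting"),
    (13, 20, "dominant"),
]
combined = merge_consecutive(segments)
```

The boundaries then fall only where the label changes, so there is no need to analyze each small segment one by one.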

If the new approach to deciding boundaries proves useful, we get at least two benefits. On one hand, the combined 'dominated/dominant' topic segments make it possible to construct a hierarchical structure over the content of the entries, although that might not be realistic for the moment. On the other hand, according to current results, combining topic segments could improve the topic segmentation itself. Furthermore, it could identify the main tendency of the entries, i.e., the general topic development patterns in them.

By the way, this result made me reconsider the role of the plots we got from the curve segmentation. We draw the curve in a two-dimensional plane, but the two dimensions differ in scale. The X axis describes the sequence number of the entries, which is measured in discrete numbers, whereas the Y axis records the value of each entry, which is a kind of measure of the 'content' in that entry. When we use the curve segmentation algorithm to approximate the entries in the plane, we treat the two dimensions equally without considering their difference, but we should take that difference into account. :(
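A tiny numeric sketch of why the scale matters. It assumes the segmentation decides where to split by comparing a point's perpendicular distance from a chord against a fixed tolerance; that is a common scheme, but it is my assumption here, not a description of the algorithm we actually use. All coordinates and the tolerance are made up.

```python
import math


def perpendicular_distance(p, a, b):
    """Distance from point p to the straight line through a and b."""
    (x, y), (x0, y0), (x1, y1) = p, a, b
    numerator = abs((x1 - x0) * (y0 - y) - (x0 - x) * (y1 - y0))
    return numerator / math.hypot(x1 - x0, y1 - y0)


def rescale(point, sx, sy):
    """Stretch or shrink a point along each axis independently."""
    return (point[0] * sx, point[1] * sy)


# Hypothetical chord from a to b, with one entry p lying off the chord.
a, p, b = (0.0, 0.0), (5.0, 2.0), (10.0, 0.0)
tolerance = 1.0

d_raw = perpendicular_distance(p, a, b)

# Same three points, but with the Y axis shrunk by a factor of 10.
p2, a2, b2 = (rescale(q, 1.0, 0.1) for q in (p, a, b))
d_scaled = perpendicular_distance(p2, a2, b2)

split_raw = d_raw > tolerance        # True: p would become a breakpoint
split_scaled = d_scaled > tolerance  # False: p would be absorbed into the chord
```

The same entry is a breakpoint under one scaling of Y and not under the other, so the choice of scale (or of normalization) is effectively part of the threshold.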

Some questions that need considering

These days, I keep thinking over some problems in our new paper. There are some really tough questions there, and I must figure out solutions soon. The questions are as follows.
  1. Why do we use curve segmentation?
  2. Do we use the curve segmentation algorithm in a correct way?
  3. What is the relationship between our approach to obtaining topic segments (especially the topic development patterns) and the construction of a hierarchical structure of the weblog?
  4. What is a reasonable method to decide the boundaries among segments?
  5. How do we select suitable thresholds for the experiment, and why?
  6. The experiment should use a good dataset, one that covers most of the possible cases in our approach.