Saturday, October 08, 2005

Choose a good dataset to make the experiment of topic segmentation of Weblog

I spent the whole day on checking and improving my matlab code for my paper (topic segmentation). After some modifications, especially of the definition of 'interrupted' pattern, I found it's hard to identify a pattern like that with currently used dataset. So I decide to find a good one which is able to produce such a pattern as interrupted. I tried the entries in the TalkingPoint on 2004, and I found there were at least 2 interrupted patterns generating with our algorithm. Even though I need be careful about the result, it's a good news anyway.
:)

Friday, October 07, 2005

Comparison of different increasing rate

In our project, we need compare different slopes of lines in order to decide whether to combine them. It looks an easy job! But the problem is that the slope is not as we usually define. The dataset used is composed of a series of values, i.e., a one-dimension data stream. When we plot all points in a figure, we take the horizontal axis to represent the sequence number of these points. Whereas, we use the vertical axis to measure the value of each point. The 'slope' of a line in such a plot, means kind of increasing rate. You can also call its slope, but it's impossible to associate it with the tangent of a degree. (Because the scales of two axis are different, the degree here is not meaningful!) When calculating the difference between two consecutive lines, I firstly try a conversion method. That is, the value of 'slope' here is normalized in the beginning by means of normalizing scales of both axis. Then we can use a degree to represent each 'slope'. Next, I use tangent function to calculate the difference of two 'slope'. Finally, translating the difference back to the axis system we use originally.
But after a second thought, I don't believe it is a good solution to calculate the difference of slopes. The main reason is that it is no point using a two-dimensional method to calculate one-dimension difference.
At last, I return for the old approach: just use the arithmatical difference between two 'slope' we get in the current coordinative system. (only minus is concerned!)

Thursday, October 06, 2005

The determination of boundaries among topic segments

In order to get the information about boundaries of topic segments, we need simplify the method. That is:
  1. if the segment currently concerned is a dominated/dominant pattern, all of points in the segment can not be taken as the starting point of other segments.
  2. if there are not only one points in the segment line, the boundary might be the starting point when there are no any segments before it. Or else, the point next to the starting point of the segment is taken as the boundary.
  3. if there are only two nodes in the segment which are end points of the line. The peak point would be a boundary which is also the single point segment. The point next to the peak would be the boundary for the following topic segment.

The definition of topic segments

In terms of curve analysis, a topic segment is a point, a straight line or a series of consecutive straight lines. Topic segments should not have shared points. The boundary between two consecutive topic segments is defined as the starting point of the second topic segment.

  • The topic segment is a point: see (b) in the figure above. When the line is only composed of two points, say p1.p2 and p2.p3, the point p2 is taken as a single point topic segment. The boundary is itself. The point p3 will be a boundary for the following topic segment.
  • The topic segment is a line: there are some different cases in this situation. (a) has two topic segments, one is 'dominated/dominant' pattern, and the other is 'drifting'. For 'dominant/dominated' pattern topic segments, the point at the end of line can not be boundaris except the starting point. So p2 can not be a boundary. The boundary in the case (a) is p3, which is the starting point of the following 'drifting' segment. (c) has two 'drifting' segments, each of which is composed of more than one points, we use the peak point p2 as the boundary of two topic segments.
  • The last case (d) is a topic segment of 'interrupted' pattern. The boundary should be p1.

Steps to get the topic segments and patterns

After we got the plot of entries by means of MDS (Multiple Dimensional Scaling) method, there are at least four more different steps towards the achievement of topic segments and topic development patterns. They are:
  1. Curve Segmentation Algorithm on the plot of all entries concerned.
  2. Combining lines with similar behavior.
  3. Identify topic segments.
  4. Classify topic segments based on the development pattern insides.

If necessary, the step 2 to step 4 could be used on the result in an iterated way, until no changes happen any longer.

We exploit a curve segmentation algorithm from Lowe in the first step to obtain a series of lines approximating the original points.

In the second step, we decide whether to combine two consecutive curve segments (i.e., lines) through making a comparison of slopes and average variances between them. The key point is the combined lines should have similar slope and average variance. How to define the 'similarity'? We decide to use increasing rate to calculate the difference between two consecutive lines. We believe the approach with rate will be better than absolute value for its indenpendance of the specific application.

Because I just redefine the topic semgents (might be a point, a line or a series of lines), I must give more details about that and plan a new experiment to prove it.

In fact, we can classify each topic segments at the step three. When a point is a topic segment, it belongs to 'dominated/dominant' pattern. When the topic segment is composed of one line, it would be 'drifting' or 'dominated/dominant' pattern. When a topic segment is made up of not only one lines, it's probably 'interrupted' pattern. For the topic segments of 'interrupted' pattern, we use one line over its components to replace original lines. Such a replacement will possibly change the development patterns around, leading to a new iteration from step 2 to step 4.

Another idea about how to decide the boundaries of topics

After finishing the curve segmentation of original points for the pages, I should decide the boundaries of different topics according to the series of lines we got in the prior step. Some cases need considering. See figure in the following:

1. In the case (a), the line composed of p1 and p2 is a 'dominant' pattern segment whereas that of p2 and p3 is a 'drifting' pattern segment. The boundary between the two consecutive segments should be 'p2' which is located in the middle. But 'p2' represents a page whose value is similar to p1 instead of p3, so that p2 should belong to the segment starting at p1. Thus, the boundary should be p3. The same principle is used to the case (b), where the boundary should be p4 instead of p3.
2. As far as the case (c) and (d) are concerned, both of them are composed of drifting segments. The point in the middle (or in the peak) looks a good candidate for the boundary of topic patterns. But be carefully here, we have to reconsider the definition of topic segments!

Before, I didn't carefully tell the difference between topic segments and curve segments. Now it's just time. Curve segments are simple to define. They are just lines, more accurately, they are components composing into a curve. Each pair of consecutive lines shares the same connection point. Topic segments are a bit complex. A topic segment should be a page or a series of pages in A line. That is, a topic segment is a point, a line or some consecutive lines, but a line may not necessarily be a topic segment. So when deciding the boundaries of topics, we need consider more than shared points among lines.