Saturday, October 08, 2005
Choosing a good dataset for the topic segmentation experiment on weblogs
:)
Friday, October 07, 2005
Comparison of different rates of increase
Thursday, October 06, 2005
The determination of boundaries among topic segments
- If the segment under consideration follows a dominated/dominant pattern, none of the points in the segment can be taken as the starting point of another segment.
- If the segment line contains more than one point, the boundary may be its starting point when no segment precedes it. Otherwise, the point after the starting point of the segment is taken as the boundary.
- If the segment contains only two nodes, namely the end points of the line, the peak point is a boundary and also forms a single-point segment. The point after the peak is then the boundary of the following topic segment.
The definition of topic segments

Steps to get the topic segments and patterns
- Apply a curve segmentation algorithm to the plot of all entries concerned.
- Combine lines with similar behavior.
- Identify topic segments.
- Classify topic segments by the development pattern inside them.
If necessary, steps 2 to 4 can be applied to the result iteratively until no further changes occur.
In the first step, we use a curve segmentation algorithm from Lowe to obtain a series of lines approximating the original points.
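To make the first step concrete, here is a minimal sketch of turning a series of points into approximating lines. It is not Lowe's algorithm itself: a simpler and widely used recursive-split method (Ramer-Douglas-Peucker) is swapped in as a stand-in, and the names `split_curve`, `point_line_distance` and the `tolerance` parameter are hypothetical.

```python
import math

def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the straight line through a and b."""
    (x, y), (x1, y1), (x2, y2) = p, a, b
    dx, dy = x2 - x1, y2 - y1
    length = math.hypot(dx, dy)
    if length == 0:
        # a and b coincide, so fall back to the distance to that single point
        return math.hypot(x - x1, y - y1)
    return abs(dy * (x - x1) - dx * (y - y1)) / length

def split_curve(points, tolerance):
    """Return the breakpoints of a piecewise-linear approximation.

    Stand-in for the curve segmentation step (Ramer-Douglas-Peucker,
    not Lowe's method). `tolerance` is an assumed parameter: the largest
    deviation allowed before a line is split in two.
    """
    if len(points) < 3:
        return list(points)
    a, b = points[0], points[-1]
    # find the interior point that deviates most from the line a-b
    idx, dmax = max(
        ((i, point_line_distance(points[i], a, b))
         for i in range(1, len(points) - 1)),
        key=lambda t: t[1],
    )
    if dmax <= tolerance:
        return [a, b]
    left = split_curve(points[:idx + 1], tolerance)
    right = split_curve(points[idx:], tolerance)
    # the split point ends `left` and starts `right`; keep it only once
    return left[:-1] + right
```

Consecutive breakpoints in the result define the lines, so each pair of neighbouring lines shares one connection point.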
In the second step, we decide whether to combine two consecutive curve segments (i.e., lines) by comparing their slopes and average variances. The key point is that combined lines should have similar slopes and average variances. How do we define 'similarity'? We decided to use the rate of increase to measure the difference between two consecutive lines. We believe a rate-based approach is better than one based on absolute values because it is independent of the specific application.
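A rough sketch of this similarity test might look as follows. The `Line` record, the function names and the `max_rate` threshold of 0.2 are all assumptions made for illustration; the notes do not fix a threshold or an exact formula for the rate.

```python
from dataclasses import dataclass

@dataclass
class Line:
    """Hypothetical record for one curve segment."""
    slope: float
    avg_variance: float

def increase_rate(old: float, new: float, eps: float = 1e-9) -> float:
    """Relative change from `old` to `new`, as a fraction of |old|.

    `eps` guards against division by zero when `old` is 0.
    """
    return abs(new - old) / max(abs(old), eps)

def similar(a: Line, b: Line, max_rate: float = 0.2) -> bool:
    """Two consecutive lines may be combined when both the slope and the
    average variance change by at most `max_rate` (assumed threshold)."""
    return (increase_rate(a.slope, b.slope) <= max_rate
            and increase_rate(a.avg_variance, b.avg_variance) <= max_rate)
```

Because the test uses a ratio, `similar(Line(1.0, 0.5), Line(1.1, 0.55))` and `similar(Line(100.0, 50.0), Line(110.0, 55.0))` give the same answer, which is the application independence that a rate offers over an absolute difference.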
Because I have just redefined topic segments (a topic segment might be a point, a line, or a series of lines), I must give more details about the definition and plan a new experiment to validate it.
In fact, we can classify each topic segment at step three. When a topic segment is a single point, it belongs to the 'dominated/dominant' pattern. When a topic segment is composed of one line, it is either a 'drifting' or a 'dominated/dominant' pattern. When a topic segment is made up of more than one line, it is probably an 'interrupted' pattern. For topic segments of the 'interrupted' pattern, we replace the original lines with one line spanning its components. Such a replacement may change the development patterns around it, leading to a new iteration from step 2 to step 4.
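The classification rule can be sketched as a small function. How to tell 'drifting' from 'dominated/dominant' for a single line is not spelled out here, so the sketch assumes that a nearly flat line is 'dominated/dominant' and a sloped one is 'drifting'; the `flat_slope` threshold and the function name are hypothetical.

```python
def classify_topic_segment(num_lines: int, slope: float = 0.0,
                           flat_slope: float = 0.1) -> str:
    """Classify a topic segment by how many lines it is made of.

    num_lines == 0 means the segment is a single point.
    `slope` is only used for one-line segments; `flat_slope` is an
    assumed cut-off below which a line counts as flat.
    """
    if num_lines == 0:
        return "dominated/dominant"
    if num_lines == 1:
        # assumption: flat line -> dominated/dominant, sloped line -> drifting
        return "dominated/dominant" if abs(slope) <= flat_slope else "drifting"
    # more than one line: probably an interrupted pattern
    return "interrupted"
```

A segment classified as 'interrupted' would then be replaced by one line over its components before the next iteration.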
Another idea about how to decide the boundaries of topics

1. In case (a), the line from p1 to p2 is a 'dominant' pattern segment, whereas the line from p2 to p3 is a 'drifting' pattern segment. At first sight, the boundary between the two consecutive segments should be p2, the point they share. But p2 represents a page whose value is similar to that of p1 rather than p3, so p2 should belong to the segment starting at p1. Thus, the boundary should be p3. The same principle applies to case (b), where the boundary should be p4 instead of p3.
2. Cases (c) and (d) are both composed of drifting segments. The point in the middle (or at the peak) looks like a good candidate for the boundary of topic patterns. But be careful here: we have to reconsider the definition of topic segments!
Previously, I did not carefully distinguish between topic segments and curve segments. Now is the time. Curve segments are simple to define. They are just lines or, more precisely, the components that make up a curve. Each pair of consecutive lines shares a connection point. Topic segments are a bit more complex. A topic segment should be a page or a series of pages in a line. That is, a topic segment is a point, a line, or several consecutive lines, but a line is not necessarily a topic segment. So when deciding the boundaries of topics, we need to consider more than the points shared between lines.