Saturday, December 24, 2005

In the afternoon, I planned to make some improvements to my work on KADIS: more datasets should be added to the current framework. The first problem is that the data in the existing database (in fact, Excel files) are composed of codes, not the corresponding values. That means some conversion work is required. When I started programming the conversion, I ran into more questions (a sketch of the conversion follows the list).
1. Is the code system the same across the different datasets?
2. Is there any rule used to assign these codes?
3. In the integrated dataset, some values in the SITE column don't exist in the code system. Where do they come from?
4. In the other dataset, there are no FUI values at all. What should be done in such a situation?
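The conversion itself is mechanical once the code tables are known. A minimal sketch, in Python rather than anything KADIS actually uses, and with a made-up code table, since the real code system is exactly what the questions above are about:

```python
# Minimal sketch of code-to-value conversion (the code table is hypothetical;
# the real KADIS code systems are what the questions above ask about).
import pandas as pd

# Hypothetical lookup: code -> value for the SITE column.
SITE_CODES = {1: "Site A", 2: "Site B", 3: "Site C"}

def decode_column(df: pd.DataFrame, column: str, codes: dict) -> pd.DataFrame:
    """Replace codes with values; keep unknown codes and report them."""
    unknown = set(df[column]) - set(codes)
    if unknown:
        # Question 3 above: some SITE values don't exist in the code system.
        print(f"{column}: codes not in the code table: {sorted(unknown)}")
    df[column] = df[column].map(codes).fillna(df[column])
    return df

df = pd.read_excel("dataset.xls")   # the existing 'database' is Excel files
df = decode_column(df, "SITE", SITE_CODES)
```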

Friday, December 23, 2005

I started working on solutions to the conflicts that arise when merging different ontology files. Before he left, my advisor suggested that I read some materials on null-value processing in databases. I read the first sections of a paper he wrote some years ago. I have to stop and think about what the point of this paper is. Is it useful when applied to the resolution of conflicts? I have my doubts.
In his paper, the method presented deals with null values of different semantic meanings in a relational database in a general way. At this point I suddenly realized that I didn't understand the method completely, so I'd better read the whole paper first.
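For reference, the standard background distinction (this is not the method from his paper, just the textbook idea it starts from): a null can mean "value unknown" or "value inapplicable", and the two must be processed differently. A toy sketch:

```python
# Sketch of the standard distinction between null values with different
# semantics ("unknown" vs. "inapplicable"); not the paper's method, only
# the background idea it builds on.
from enum import Enum

class Null(Enum):
    UNKNOWN = "a value exists but is not known"
    INAPPLICABLE = "no value applies to this attribute"

row = {"name": "specimen 42", "weight": Null.UNKNOWN, "wingspan": Null.INAPPLICABLE}

for attr, val in row.items():
    if isinstance(val, Null):
        print(f"{attr}: NULL ({val.value})")
    else:
        print(f"{attr}: {val}")
```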
The first step is to get some basic background knowledge of this topic. In the afternoon, I found some papers, mainly related to XML. I believe our work will rely on the processing of XML documents. The subject is focused on formalized representation, logical structure, inconsistency handling, etc. For now, I cannot be sure that anything I am doing is useful, but it is necessary.
On the other hand, I have some other tasks. The ones I can identify right now are the following:
  1. KADIS: ontology document and database for two datasets;
  2. Join component: going over the old work and adding new functions.

Thursday, December 15, 2005

  1. Sarver, ashen, leans against a wall.
  2. As he sits on the military plane that will take him home, the Bronze Star he’s been awarded is stowed away with the rest of his gear.
  3. Staff Sergeant Jeffrey S. Sarver is at home in the nation he has sworn to protect—and a long way from the loneliest spot on earth.
  4. Many settled in the barrios of Los Angeles, where they were preyed upon by the city’s turf-conscious Mexican
  5. .....says the gang has steadily encroached on the neighborhood
  6. Young kids see the gang members as role models
  7. In exchange for leniency, Paz gave prosecutors firsthand information about armed robberies, stabbings and shootings stretching from California to Texas to North Carolina.
  8. But the strictures and the isolation became too much for Paz.
  9. When it was over, she felt like she finally belonged.
  10. It turns into a driveway up the block and comes back, prowling slowly, watching her.

Wednesday, December 14, 2005

  1. A bad beginning makes a bad ending.
  2. A bad thing never dies.
  3. A bad workman always blames his tools.
  4. A bird in the hand is worth two in the bush.
  5. A boaster and a liar are cousins-german.
  6. A bully is always a coward.
  7. A burden of one's choice is not felt.
  8. A candle lights others and consumes itself.
  9. A cat has 9 lives.
  10. A cat may look at a king.

Studying English - from Reader's Digest

  1. Now, at the Baghdad intersection, Sarver’s team kneel in the dirt, and, like squires attending a knight, adjust his armor.
  2. At 10 feet out, the point of no return, he gets the adrenaline surge he calls The Morbid Thrill.
  3. “It’s a numbing, sobering time, it’s the loneliest spot on earth.”
  4. As the Humvee rattles down the road, Sarver, lost in thought, stares out the window at the blazing Iraqi sunset. I like what I do, he thinks to himself.
  5. Soon it will be dark, curfew time.
  6. With a glance he could suss out any bomb’s architecture.
  7. making him one of ten Army bomb techs to die in the field as of November 2005.
  8. when fatigue, distraction and homesickness can dull a soldier’s instincts.
  9. “When you’re 10 feet away from it,” he says, “you get comfortable because you’re at the point of no return.”

Wednesday, November 16, 2005

The basic phase of the demo of KADIS

The task in this phase is mainly to collect the selections from the user and compose them into a query, i.e., to figure out what the user wants. Based on the document provided by Keith, there are mainly two steps: in the first step, the user makes some selections that make up the final query; in the second step, the user makes some decisions about how to display the results.
In the first step, there are five different selections for the query at the moment. All of these selections should be presented as concepts at the ontological level; that is, they should not have much to do with the details of each dataset. Because different researchers might use different ways to describe the same concept, it is hard for one user to understand another's notation for a concept without a detailed explanation. We plan to use XML files to organize all the ontological information about the datasets concerned. (Note: the ontological information is a kind of representation of concepts that can be understood by most people in the field of archaeology.) A sketch of composing the selections into a query follows the list.
  1. Dataset selection: all datasets existing in the collections are displayed in a checkbox list.
  2. Taxonomic level: a tree view with checkboxes is used to display the hierarchical structure of the taxonomic categories.
  3. FUI selection: this depends on the selections made in 2.
  4. Element selection: a list box may be used to display all elements in the datasets.
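A rough sketch of how the selections listed above might be collected into one query object (all names and the SQL composition are hypothetical placeholders, not the real KADIS design):

```python
# Rough sketch of collecting the first-step selections into one query
# (field names and SQL layout are made up; the real list comes from
# Keith's document).
from dataclasses import dataclass, field

@dataclass
class KadisQuery:
    datasets: list = field(default_factory=list)   # 1. dataset selection
    taxa: list = field(default_factory=list)       # 2. taxonomic levels (tree)
    fui: list = field(default_factory=list)        # 3. FUI, depends on the taxa
    elements: list = field(default_factory=list)   # 4. element selection

    def to_sql(self) -> str:
        """Compose a naive SQL string; the real system must first map these
        ontology-level terms to each dataset's own columns and codes."""
        clauses = []
        for col, vals in (("taxon", self.taxa), ("fui", self.fui),
                          ("element", self.elements)):
            if vals:
                clauses.append("%s IN (%s)" % (col, ", ".join(repr(v) for v in vals)))
        table = self.datasets[0] if self.datasets else "all_datasets"
        return "SELECT * FROM %s%s" % (
            table, " WHERE " + " AND ".join(clauses) if clauses else "")

q = KadisQuery(datasets=["dataset_a"], taxa=["Odocoileus"], elements=["femur"])
print(q.to_sql())   # SELECT * FROM dataset_a WHERE taxon IN (...) AND ...
```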

Design for KADIS demo

The goal of our work at this time is to create a website supporting the integration and querying of archaeological databases. There are at least two focuses: one is to satisfy the user's basic query demands, and the other is to find a good way to display the conflicts generated when integrating different databases and to ask the user to resolve them through interaction.
So it's composed of two phases corresponding to the two focuses: the basic phase and the conflict-resolving phase.
  1. Basic phase: from the user's point of view, s/he makes a series of selections for the query and then gets the result in some format (e.g., txt or Excel files). On the system (web application) side, it displays the necessary information about the query (such as the datasets) to the user in a friendly way and outputs the results according to the user's requirements.
  2. Conflict-resolving phase: when a new dataset is integrated into the existing database, conflicts may arise because of the different definitions used by different researchers. While a query is processed, it is the user who should provide resolutions to the conflicts concerned in order to produce the right result. It is an interactive procedure.
Honestly, I don't believe there is a clear-cut boundary between these two phases: when displaying the integrated database to the user, some conflicts must be resolved first. The processing of the query should be the kernel of the project, including ontology extraction, dataset integration, query processing, etc. But the demo just visualizes the idea of our project/proposal, so it is enough to put the emphasis on visualization and interaction.
For the implementation, we use a classical client-server architecture. The basic idea is as follows: the archaeological datasets are stored on the server; for the moment we use a MySQL database to store all of the data. The ontological information about the datasets is organized in an XML format, in OWL or XML files. When the user logs into the website and starts to make selections, the client on the user's side fetches the relevant information from the server (one possible way is to read data from the XML files or the MySQL database, generate suitable HTML pages, and send them back to the client browser).
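A toy sketch of that server side, assuming a hypothetical `<concept name="..."/>` layout for the ontology XML and placeholder MySQL credentials:

```python
# Toy sketch of the server side described above: read ontology info from an
# XML file and data from MySQL (table names and credentials are placeholders).
import xml.etree.ElementTree as ET
import mysql.connector  # assumes the MySQL Connector/Python package

def load_ontology_concepts(path: str) -> list:
    """Collect concept names from a hypothetical <concept name="..."/> layout."""
    tree = ET.parse(path)
    return [el.get("name") for el in tree.getroot().iter("concept")]

def fetch_rows(sql: str) -> list:
    conn = mysql.connector.connect(host="localhost", user="kadis",
                                   password="...", database="kadis")
    try:
        cur = conn.cursor()
        cur.execute(sql)
        return cur.fetchall()
    finally:
        conn.close()

concepts = load_ontology_concepts("ontology.xml")
rows = fetch_rows("SELECT * FROM dataset_a")   # placeholder query
```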

Friday, October 21, 2005

About MDS (Multi-Dimensional Scaling) algorithm

[Figures omitted: MDS plots of the first 20 entries and of the first 28 entries of the blog]
We use MDS to map the original space into a new low-dimensional space. But when a new dimension is introduced into the original space, the resulting low-dimensional space may change a lot. See the figures above: the first is the plot depicting the first 20 entries in a blog; the second depicts the first 28 entries. Even though the difference between them should be small, the resulting plots are very different.
MDS analyzes the dataset as a whole, so it is impossible to make a prediction based only on information about the past. That might be a problem.
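The instability is easy to reproduce. A sketch with scikit-learn's MDS, using random data as a stand-in for the entry dissimilarities (our real input would be the precomputed dissimilarities between blog entries):

```python
# Sketch of the instability: re-running MDS after adding entries can
# rearrange the whole low-dimensional layout (random data stands in for
# the real entry dissimilarities).
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.random((28, 50))                              # stand-in for 28 entries
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances

mds = MDS(n_components=1, dissimilarity="precomputed", random_state=0)
y20 = mds.fit_transform(D[:20, :20])                  # embed first 20 entries
y28 = mds.fit_transform(D)                            # embed all 28 entries

# The first 20 coordinates of y28 need not resemble y20 at all:
print(np.corrcoef(y20.ravel(), y28[:20].ravel())[0, 1])
```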

Man-made interrupted topic patterns

The basic idea for producing an interrupted pattern is as follows:

  1. First, identify some rather long dominated or drifting patterns in the original sequence.
  2. Add some noise into these patterns.

It is relatively easy to find some dominated or drifting patterns in the sequence of entries. We should first prepare a suitable dataset that contains the patterns we want.

But the second step is not so simple. Noise is the important issue we need to consider.

What is the noise? There are at least two choices. One is to introduce new entries from other, irrelevant blogs. The other is simply to adjust the original sequence so that some entry suddenly appears where it should not be.

  • Introducing new elements into the entry space will modify the new space generated by MDS. Thus the original patterns cannot be maintained, and the result is unpredictable.
  • Simply adjusting the order of the original sequence won't change the existing patterns, so it is safe.

So I believe the second choice is better.
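A sketch of the second choice with toy values; the real adjustment is done on the entry sequence before running MDS:

```python
# Sketch of the second choice: make an interrupted pattern by moving one
# entry to where it "should not be", without adding any new entries.
def move_entry(seq: list, src: int, dest: int) -> list:
    """Move seq[src] so that it ends up at index dest (0-based)."""
    out = list(seq)
    item = out.pop(src)
    out.insert(dest, item)
    return out

values = [0.10, 0.12, 0.11, 0.50, 0.52, 0.90]   # toy content values
print(move_entry(values, src=5, dest=2))        # 0.90 now interrupts the 0.1x run
```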

Another issue worth noting concerns the curve segmentation algorithm. When we change the order of entries in the sequence, the patterns might change too, because of the curve segmentation algorithm. That means that even if we use the second choice to generate the interrupted patterns, it is not safe enough: the original patterns may be changed, so the interrupted pattern cannot appear as expected. See the figures below.

The figure above is the original plot describing all entries in a sequence. The label of each point is the sequence number of the entry. The segment from No.8 to No.15 is a topic-dominated pattern, and we decide to move point No.19 or No.20 into this segment in order to produce the interrupted pattern.

We move point No.19 into the position between No.10 and No.11 and get the figure above. The adjustment changes the original pattern so much that the dominated segment from #8 to #10 no longer exists. I believe this is because the value of #11 is so close to that of the original segment.
The figure above describes the case when we move point #19 to the place between points #10 and #11. An interrupted pattern is generated in this case: the value of #19 is far different from the others in the original topic segment (from #8 to #15 in the first figure).

In the last figure, we move both #19 and #20 to the place between #10 and #11 in the original (first) figure. As a result, we get an interrupted pattern too.

Tuesday, October 18, 2005

Study of the interrupted pattern

After we changed the formula that decides the interrupted pattern, we found it almost impossible to find a case of the interrupted pattern in our current dataset. So we have a problem! If we cannot find an interrupted-pattern case, how can we verify our approach? On the other hand, when we created the formulas to identify the topic development patterns, we did not construct a direct relationship between our patterns and the content. The foundation of these formulas is just some observations about the 'curve'. It is not persuasive enough.
We need to produce some "interrupted" patterns on purpose, and find a good way to represent them with some kind of formula. But it should be asked how much these 'man-made' patterns reflect the real world! That is another problem.

Monday, October 10, 2005

Reconsideration of the application of the curve segmentation algorithm in our approach

The curve segmentation algorithm is usually used to approximate a curve in a multi-dimensional space with a series of lines, including straight lines and arcs. In such a space, the different dimensions are mostly equivalent in their geometric properties: for example, you can use the same 'ruler' to measure the distance between any two points.
In our project, we exploit curve segmentation to assist topic segmentation. We map all entries into a one-dimensional space and plot them in a two-dimensional plane. However, in the two-dimensional plane we use, each dimension has a different meaning: X is the sequence number of the entries and Y is the estimated content value. So we must justify the application of the curve segmentation algorithm to our project.
In fact, when exploiting the curve segmentation algorithm in our project, all I did was normalize each dimension into the range [0, 1]. The problem is whether we can use the result directly, without any further processing. It really is a question!
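Concretely, what I did amounts to this (a Python sketch; the original code is MATLAB, and the values are toy data):

```python
# Min-max normalization of each dimension to [0, 1] before running the
# curve segmentation (a sketch of what was done, not the original MATLAB).
import numpy as np

def normalize01(a: np.ndarray) -> np.ndarray:
    return (a - a.min()) / (a.max() - a.min())

x = np.arange(1, 29)                       # sequence numbers of the entries
y = np.random.default_rng(1).random(28)    # toy estimated content values
points = np.column_stack([normalize01(x), normalize01(y)])
```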

A new way to find the boundaries among topic segments

When I used the new algorithm (including curve segment combination and topic segment identification) to run experiments on the same dataset, I found that it is almost impossible to find cases related to the 'interrupted' pattern. In the morning, I started to try larger thresholds. It still didn't work. However, I encountered a situation very different from the previous ones, and suddenly I had a new idea for deciding the boundaries among topic segments.

When I used a small threshold value for the 'dominated/dominant' pattern, the resulting patterns formed an irregular sequence; that is, the 'dominated/dominant' patterns and the 'drifting' patterns interleaved in a random manner. If I followed my previous scheme to get the boundaries in the sequence, I would have to analyze each segment one by one. But when I increased the threshold value, some interesting changes happened: some 'dominant/dominated' patterns came out consecutively, which means they could be combined together. Judging from the plot of all entries, it looks impossible or meaningless to classify them as 'dominated/dominant' patterns, since they obviously LOOK like a kind of 'drifting' pattern. But the result is so regular that it could produce much better boundaries if we combined the consecutive segments together. It is time to separate the pattern analysis from the plot, although not completely.
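The combination step could look like this (a sketch: the labels are assumed to come from the pattern classification, and only consecutive 'dominant' runs are merged):

```python
# Sketch of the new boundary idea: merge runs of consecutive
# 'dominated/dominant' segments into one topic segment.
def merge_consecutive(segments: list) -> list:
    """segments: list of (label, start, end); merge adjacent 'dominant' runs."""
    merged = []
    for label, start, end in segments:
        if merged and merged[-1][0] == label == "dominant":
            merged[-1] = (label, merged[-1][1], end)   # extend the run
        else:
            merged.append((label, start, end))
    return merged

segs = [("dominant", 0, 4), ("dominant", 4, 7), ("drifting", 7, 10)]
print(merge_consecutive(segs))   # [('dominant', 0, 7), ('drifting', 7, 10)]
```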

If the new approach to deciding boundaries proves useful, we get at least two benefits. On one hand, the combined 'dominated/dominant'-pattern topic segments make it possible to construct a hierarchical structure over the content of the entries, although that might not be realistic for the moment. On the other hand, according to the current results, the combination of topic segments could improve the topic segmentation. Furthermore, it could identify the main tendency of the entries, i.e., the general topic development patterns in the entries.

By the way, this result made me reconsider the role of the plots we get from the curve segmentation. We draw the curve in a two-dimensional plane, but the two dimensions differ in their scales. The X axis describes the sequence number of the entries, measured in discrete numbers, whereas the Y axis records the value of each entry, which is a kind of measure of the 'content' of the entry. When we use the curve segmentation algorithm to approximate the entries in the plane, we treat the two dimensions equally without considering their difference, though we should. :(

Some questions need considering

These days, I am constantly thinking over some problems in our new paper. There are some really tough questions there, and I must figure out some solutions soon. The questions are as follows.
  1. The reason that we use curve segmentation;
  2. Do we use the curve segmentation algorithm in a correct way?
  3. The relationship between our approach to obtaining topic segments (especially the topic development patterns) and the construction of a hierarchical structure of the weblog.
  4. A reasonable method to decide the boundaries among segments.
  5. How to select suitable thresholds for the experiment, and why.
  6. The experiment should use a good dataset that covers most of the possible cases in our approach.

Saturday, October 08, 2005

Choosing a good dataset for the weblog topic segmentation experiment

I spent the whole day checking and improving my MATLAB code for my paper (topic segmentation). After some modifications, especially to the definition of the 'interrupted' pattern, I found it hard to identify such a pattern with the currently used dataset. So I decided to find a good dataset that is able to produce an interrupted pattern. I tried the entries in TalkingPoint from 2004, and I found at least 2 interrupted patterns generated by our algorithm. Even though I need to be careful about the result, it's good news anyway.
:)

Friday, October 07, 2005

Comparison of different increase rates

In our project, we need to compare different slopes of lines in order to decide whether to combine them. It looks like an easy job! But the problem is that the slope is not what we usually mean by slope. The dataset is composed of a series of values, i.e., a one-dimensional data stream. When we plot all the points in a figure, we take the horizontal axis to represent the sequence number of the points, whereas we use the vertical axis to measure the value of each point. The 'slope' of a line in such a plot means a kind of increase rate. You can still call it a slope, but it's impossible to associate it with the tangent of an angle. (Because the scales of the two axes are different, an angle here is not meaningful!) To calculate the difference between two consecutive lines, I first tried a conversion method: normalize the scales of both axes so that each 'slope' can be represented by an angle, use the tangent function to calculate the difference between two 'slopes', and finally translate the difference back to the original axis system.
But on second thought, I don't believe this is a good way to calculate the difference of slopes. The main reason is that there is no point in using a two-dimensional method to calculate a one-dimensional difference.
In the end, I returned to the old approach: just use the arithmetic difference between the two 'slopes' in the current coordinate system (simple subtraction only!).
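So the comparison reduces to this (a sketch; the threshold value is made up):

```python
# The approach settled on above: compare two 'slopes' (increase rates) by
# plain subtraction in the current coordinate system, instead of converting
# to angles and back.
def slope(p1, p2):
    """Increase rate between two points (x is the entry sequence number)."""
    return (p2[1] - p1[1]) / (p2[0] - p1[0])

def similar(s1, s2, threshold=0.05):
    return abs(s1 - s2) < threshold   # 'only minus is concerned'

print(similar(slope((1, 0.10), (5, 0.30)), slope((5, 0.30), (9, 0.52))))
```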

Thursday, October 06, 2005

The determination of boundaries among topic segments

In order to get the information about the boundaries of topic segments, we need to simplify the method. That is:
  1. If the currently concerned segment is a dominated/dominant pattern, none of the points in the segment can be taken as the starting point of another segment.
  2. If there is more than one point on the segment line, the boundary is the starting point when there is no segment before it; otherwise, the point next to the starting point of the segment is taken as the boundary.
  3. If there are only two nodes in the segment, namely the end points of the line, the peak point is a boundary which is also a single-point segment. The point next to the peak is the boundary of the following topic segment.

The definition of topic segments

In terms of curve analysis, a topic segment is a point, a straight line, or a series of consecutive straight lines. Topic segments should not share points. The boundary between two consecutive topic segments is defined as the starting point of the second topic segment.

  • The topic segment is a point: see (b) in the figure above. When the lines consist of only two points each, say p1-p2 and p2-p3, the point p2 is taken as a single-point topic segment. The boundary is the point itself. The point p3 will be a boundary for the following topic segment.
  • The topic segment is a line: there are several different cases here. (a) has two topic segments: one is a 'dominated/dominant' pattern and the other is 'drifting'. For 'dominant/dominated'-pattern topic segments, no point on the line can be a boundary except the starting point, so p2 cannot be a boundary; the boundary in case (a) is p3, which is the starting point of the following 'drifting' segment. (c) has two 'drifting' segments, each composed of more than one point; we use the peak point p2 as the boundary between the two topic segments.
  • The last case (d) is a topic segment of the 'interrupted' pattern. The boundary should be p1.

Steps to get the topic segments and patterns

After we get the plot of entries by means of the MDS (Multi-Dimensional Scaling) method, there are at least four more steps toward obtaining the topic segments and topic development patterns. They are:
  1. Apply the curve segmentation algorithm to the plot of all entries concerned.
  2. Combine lines with similar behavior.
  3. Identify topic segments.
  4. Classify topic segments based on the development patterns inside them.

If necessary, steps 2 to 4 can be applied to the result iteratively, until no changes happen any longer.
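The iteration can be sketched as a fixed-point loop (the three step functions here are identity stubs standing in for the real algorithms):

```python
# Skeleton of the iteration: steps 2-4 are re-applied until nothing changes.
# The step functions are identity stubs, not the real combine/identify/classify.
def combine_similar_lines(lines): return lines      # step 2 (stub)
def identify_topic_segments(lines): return lines    # step 3 (stub)
def classify_and_replace(lines): return lines       # step 4 (stub)

def run_pipeline(lines):
    while True:
        new = classify_and_replace(identify_topic_segments(combine_similar_lines(lines)))
        if new == lines:          # fixed point: no changes happen any longer
            return new
        lines = new
```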

In the first step, we exploit a curve segmentation algorithm from Lowe to obtain a series of lines approximating the original points.

In the second step, we decide whether to combine two consecutive curve segments (i.e., lines) by comparing their slopes and average variances. The key point is that the combined lines should have similar slope and average variance. How do we define 'similarity'? We decided to use the increase rate to calculate the difference between two consecutive lines. We believe the rate-based approach is better than an absolute value because of its independence from the specific application.
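The 'similarity' test might be sketched like this ('rate' here is read as relative difference, and the tolerance is made up):

```python
# Sketch of the step-2 test: two consecutive lines are combinable when both
# their slopes and their average variances are similar, measured as a
# relative (rate) difference rather than an absolute one.
def rel_diff(a: float, b: float) -> float:
    return abs(a - b) / max(abs(a), abs(b), 1e-12)

def combinable(line1, line2, tol=0.2):
    """line = (slope, avg_variance); the relative difference keeps the test
    independent of the particular dataset's scale."""
    return (rel_diff(line1[0], line2[0]) < tol and
            rel_diff(line1[1], line2[1]) < tol)

print(combinable((0.50, 0.020), (0.55, 0.021)))   # True: similar behavior
```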

Because I have just redefined topic segments (a topic segment might be a point, a line, or a series of lines), I must give more details about that and plan a new experiment to validate it.

In fact, we can classify each topic segment at step three. When a point is a topic segment, it belongs to the 'dominated/dominant' pattern. When the topic segment is composed of one line, it is either a 'drifting' or a 'dominated/dominant' pattern. When a topic segment is made up of more than one line, it is probably an 'interrupted' pattern. For topic segments of the 'interrupted' pattern, we fit one line over the components to replace the original lines. Such a replacement will possibly change the development patterns around it, leading to a new iteration from step 2 to step 4.
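These rules can be sketched directly (the flatness threshold for the one-line case is my own placeholder; the real dominant/drifting criterion is one of the formulas mentioned earlier):

```python
# Sketch of the classification rules above: a point is 'dominated/dominant',
# one line is 'dominant' or 'drifting' (decided here by a made-up flatness
# threshold), and several lines make an 'interrupted' candidate.
def classify(lines: list) -> str:
    """lines: slopes of the lines composing one topic segment;
    [] represents a single-point segment."""
    if not lines:
        return "dominated/dominant"          # a point: one topic dominates
    if len(lines) == 1:
        return "dominated/dominant" if abs(lines[0]) < 0.05 else "drifting"
    return "interrupted"                     # multi-line segment

print(classify([]), classify([0.01]), classify([0.4]), classify([0.4, -0.3]))
```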

Another idea about how to decide the boundaries of topics

After finishing the curve segmentation of the original points for the pages, I should decide the boundaries of the different topics according to the series of lines obtained in the prior step. Some cases need to be considered; see the figure below:

1. In case (a), the line composed of p1 and p2 is a 'dominant'-pattern segment, whereas that of p2 and p3 is a 'drifting'-pattern segment. The boundary between the two consecutive segments should be p2, which is located in the middle. But p2 represents a page whose value is similar to p1 rather than p3, so p2 should belong to the segment starting at p1. Thus, the boundary should be p3. The same principle applies to case (b), where the boundary should be p4 instead of p3.
2. As far as cases (c) and (d) are concerned, both are composed of drifting segments. The point in the middle (or at the peak) looks like a good candidate for the boundary of the topic patterns. But be careful here: we have to reconsider the definition of topic segments!

Before, I didn't carefully distinguish topic segments from curve segments. Now is the time. Curve segments are simple to define: they are just lines, or more accurately, the components that compose a curve; each pair of consecutive lines shares a connection point. Topic segments are a bit more complex. A topic segment is a page or a series of pages on a line. That is, a topic segment is a point, a line, or some consecutive lines, but a line is not necessarily a topic segment. So when deciding the boundaries of topics, we need to consider more than the shared points among lines.
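Just to pin the distinction down as data structures (nothing beyond the definitions above; no algorithm implied):

```python
# Curve segments vs. topic segments, as plain data structures: curve
# segments are lines sharing connection points; a topic segment groups a
# point, one line, or several consecutive lines.
from dataclasses import dataclass

@dataclass
class CurveSegment:
    start: int          # index of the first page/entry on the line
    end: int            # index of the last page/entry (shared with the next line)
    slope: float

@dataclass
class TopicSegment:
    lines: list         # [] = single point, one CurveSegment, or several
    pattern: str        # 'dominated/dominant', 'drifting', or 'interrupted'
    boundary: int       # starting point of this topic segment
```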