Tuesday, December 13, 2005

Learning the structural metadata of books

  • Introduction
Structural metadata can be an important component of the metadata of a book in a digital library.
But adding the structural tags manually is time consuming. Is there a way of doing it automatically? In particular, when we have a large amount of annotated data (by annotated I mean example data containing structural metadata), can we somehow learn from it and use it to assign the corresponding structural metadata to a given new page?

  • Some questions to think about:
  1. Is the problem do-able?
  2. How easy or hard is it to do?
  3. If yes, what kind of assumptions should we be making?
  4. What kind of results should we be expecting?
  5. What is the related work?
  6. Besides machine learning approaches, what other approaches exist?
  7. What are their results and observations?
  8. Should I use the images or the textual content of the book? What are the advantages and disadvantages of each?
  • A Rudimentary Approach:
As a first step, we assume that the structural metadata indicates the type of a page: the first page of the book, an index page, the preface, the cover page, a normal page, and so on.
Can I then view this problem of assigning structural metadata as a classification problem? The formulation is as follows:
given a large amount of annotated data containing the structural information, I should be able to learn from it and use it to assign structural information to any given page
with some accuracy.

Convinced to approach the problem as a classification problem, the question still remains whether the image or the textual content should be used. It is not yet clear.
Whichever we choose, the next important phase in approaching the problem is extracting appropriate features (this has to be done depending on what we want to use, i.e., image or text).
What machine learning techniques should we use? The same old famous neural networks with n hidden layers?
Still to think..................
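As a rough illustration of the classification framing above, here is a minimal sketch assuming we use the textual content with simple bag-of-words features and a naive Bayes classifier. The page texts and labels are made-up toy data, not from any real corpus.

```python
from collections import Counter, defaultdict
import math

# Toy annotated data: (page text, structural label). In reality these
# would come from the large annotated corpus mentioned above.
TRAIN = [
    ("contents chapter one chapter two index of figures", "index"),
    ("index a b c page references", "index"),
    ("preface this book grew out of lectures", "preface"),
    ("preface acknowledgements thanks to colleagues", "preface"),
    ("the theory of widgets is developed in this chapter", "normal"),
    ("we now prove the main theorem of the chapter", "normal"),
]

def train_nb(examples):
    """Estimate per-label word counts and label priors (multinomial naive Bayes)."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the most likely structural label, with add-one smoothing."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(label_counts.values())
    best, best_lp = None, float("-inf")
    for label in label_counts:
        n = sum(word_counts[label].values())
        lp = math.log(label_counts[label] / total)
        for w in text.split():
            lp += math.log((word_counts[label][w] + 1) / (n + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

wc, lc = train_nb(TRAIN)
print(classify("index of terms page", wc, lc))      # index
print(classify("in this chapter we show", wc, lc))  # normal
```

This only shows the shape of the problem; real pages would need far richer features (layout, position in the book, image features) than raw word counts.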

Personalized Search - Overview of Approaches - (Trying to complete a survey)

I am beginning to write a survey of approaches to personalized search. In this post, I present a categorization of the approaches to personalized search. It is as follows ....

Categorization of Personalized Search Approaches

First of all, search is not a solved problem. Moreover, with the tremendous growth in the information available on the web, personalized search is increasingly becoming an active research area.
There is actually a variety of approaches, and a growing literature, on personalized search. The approaches can be categorized as

1) Link-based approaches using the graph structure of the web, primarily extending PageRank (what Google uses) and Hubs and Authorities.
2) Domain-specific personalization, e.g. based on ontologies.
3) Content-based approaches (based on the vector model in information retrieval).
4) Machine-learning-based approaches.
5) Approaches based on linear algebra.
6) Recommendation-based personalized search (using collaborative filtering and content-based filtering).
7) Approaches based on the long-term and short-term history of the user, from web logs, etc.

All the existing approaches to personalized search in the literature can more or less be placed in one or more of these categories, and each approach can fall into several.
For example, a machine-learning-based approach may also use the content of the page, and so on.

This categorization can be visualized in terms of sets. Each of the seven categories
can be represented as a set; certain sets contain other sets, and there are small or large overlaps accordingly. The approaches belonging to each category are the elements of the respective sets.
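The set view above can be sketched directly with Python sets; the approach and category names below are hypothetical placeholders, just to show membership and overlap queries.

```python
# Hypothetical approaches assigned to (possibly several) categories.
categories = {
    "link_based":       {"personalized_pagerank", "topic_sensitive_pagerank"},
    "machine_learning": {"click_model_ranker", "topic_sensitive_pagerank"},
    "content_based":    {"vector_model_profiles", "click_model_ranker"},
}

# Overlap between two categories: approaches that fall in both.
overlap = categories["link_based"] & categories["machine_learning"]
print(overlap)  # {'topic_sensitive_pagerank'}

# An approach's categories: every set that contains it.
member_of = [c for c, s in categories.items() if "click_model_ranker" in s]
print(sorted(member_of))  # ['content_based', 'machine_learning']
```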

Link Analysis


I am posting some information I know about link analysis.

Link analysis, as far as I know, is making use of the hyperlinks on the web for various applications. The applications include finding authoritative, important, or significant pages on the web [1], computing web page rankings for search (Google) [2], finding web user communities [3][4], finding similar pages [5], web page clustering, web site classification, recommendation systems, etc.

You know Google's PageRank algorithm, which uses the back links
and out links of a given page to calculate the popularity of the page.
The basic idea in the PageRank algorithm is calculating the
popularity of a web page based on the back links it has. It is believed that a
good/popular page has links from good/popular pages. For example, compare my home
page and Yahoo: my home page has few back links whereas Yahoo has many,
so Yahoo is a more popular page than my home page.
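The idea can be sketched as a small power-iteration computation of PageRank on a toy link graph; the graph itself and the damping factor of 0.85 are illustrative assumptions here, not details from the post.

```python
# Minimal PageRank power-iteration sketch on a made-up link graph.
links = {               # page -> pages it links out to
    "home": ["yahoo"],
    "blog": ["yahoo", "home"],
    "news": ["yahoo"],
    "yahoo": ["news"],
}

def pagerank(links, d=0.85, iters=50):
    """Iteratively spread rank from each page to its out links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            for q in outs:
                new[q] += d * rank[p] / len(outs)
        rank = new
    return rank

ranks = pagerank(links)
# yahoo has three back links while home has only one,
# so yahoo ends up with the higher rank.
print(ranks["yahoo"] > ranks["home"])  # True
```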

Basically, they look at how a page is connected in the web: to what
pages it gives links (out links), and from what pages it has links (back
links). For example, clustering of web pages can be done by observing the
links of a page. It is believed that similar pages will have similar out
links and back links, so by comparing the out links and back links we can see how
two pages are related.
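One simple way to make "similar out links" concrete is a Jaccard overlap between the out-link sets of two pages; the pages and links below are made up for illustration.

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two link sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

out_links = {
    "page1": {"yahoo", "google", "cnn"},
    "page2": {"yahoo", "google", "bbc"},
    "page3": {"slashdot"},
}

# page1 and page2 share most of their out links, so they look related;
# page3 shares none with page1.
print(jaccard(out_links["page1"], out_links["page2"]))  # 0.5
print(jaccard(out_links["page1"], out_links["page3"]))  # 0.0
```

The same measure could be applied to back-link sets, which is essentially the co-citation idea.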

In this process, the content of the page is usually not used,
except for the anchor text. (When we give a hyperlink in running text, we
tend to put a small description of the hyperlink; that is
called anchor text.) The content is not used much because most of the
research in this area is done by database people, and for some reasons they tend
not to use the content of pages.

[1] http://iiit.net/~pkreddy/wdm03/wdm/auth.pdf
[2] http://iiit.net/~pkreddy/wdm03/wdm/page98pagerank.pdf

Monday, December 12, 2005

design pattern conformance

Developers realize design patterns in various forms. Though an architect might be continuing his analysis assuming it is design pattern "X", in reality the implementation may not reflect the same.

If we could dynamically discover the design pattern from the running system and match it with a standard template, we could probably infer that..

Look in terms of this paper

Probabilistic state machines could help, maybe?

Store the design pattern as a state machine, and compare it against an inferred state machine?
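A minimal sketch of that idea, assuming the stored pattern is a small deterministic state machine over observed calls; the Observer-style attach/notify/detach protocol here is a made-up template, not from any paper.

```python
# Design-pattern protocol as a state machine: (state, call) -> next state.
# A conforming implementation must attach before notifying, and detach last.
TEMPLATE = {
    ("start", "attach"): "attached",
    ("attached", "notify"): "attached",
    ("attached", "detach"): "done",
}

def conforms(trace, template, start="start", accept=frozenset({"done"})):
    """Run an observed call trace through the template state machine."""
    state = start
    for call in trace:
        if (state, call) not in template:
            return False          # call not allowed in the current state
        state = template[(state, call)]
    return state in accept

print(conforms(["attach", "notify", "notify", "detach"], TEMPLATE))  # True
print(conforms(["notify", "attach"], TEMPLATE))                      # False
```

A probabilistic variant would replace the hard reject with transition probabilities learned from many runs, scoring how likely a trace is under the template.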

Saturday, December 10, 2005

Inferring constraints of usage

Most proper usages of frameworks require and enforce a certain sequence or order in which you perform your activities. A simple example could be: call routine A before you call routine B. Subscribe before you publish; open before you close.
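The kind of ordering rule described above could in principle be mined from call traces of client code. A minimal sketch, with made-up traces and API names, that keeps an "A before B" rule only when it holds in every trace mentioning both calls:

```python
from itertools import combinations
from collections import Counter

# Made-up call traces from programs using a hypothetical file-like API.
traces = [
    ["open", "read", "read", "close"],
    ["open", "write", "close"],
    ["open", "read", "close"],
]

def mine_before_rules(traces):
    """Return pairs (a, b) where a's first use precedes b's in every
    trace that mentions both calls."""
    support = Counter()
    for t in traces:
        first = {}
        for i, call in enumerate(t):
            first.setdefault(call, i)   # position of first use
        for a, b in combinations(first, 2):
            if first[a] < first[b]:
                support[(a, b)] += 1
    return {rule for rule, n in support.items()
            if n == sum(1 for t in traces
                        if rule[0] in t and rule[1] in t)}

rules = mine_before_rules(traces)
print(("open", "close") in rules)  # True
print(("close", "open") in rules)  # False
```

A real miner would work over many client programs and use statistical support/confidence thresholds instead of requiring the rule to hold in every trace.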

Has any research been done on mining and automatically inferring such rules for using a framework?

Please update me if you do know of any. If there isn't any, throw some light on how one knows the right order in which to perform them.

"Mining Specifications" - Look at this paper and George's thesis at CMU ISRI.