Tuesday, December 13, 2005

Learning the structural metadata of books

  • Introduction
Structural metadata can be an important component of the metadata of a book in a digital library.
But adding the structural tags manually is time consuming. Is there a way of doing it automatically? Especially when we have a large amount of annotated data (by annotated I mean example data containing structural metadata), can we somehow learn from it and use it to assign the corresponding structural metadata to a given new page?

  • Some questions to think about:
  1. Is the problem doable?
  2. How easy or hard is it to do?
  3. If yes, what kind of assumptions should we be making?
  4. What kind of results should we expect?
  5. What is the related work?
  6. Are there any machine learning approaches? What other approaches exist?
  7. What are their results and observations?
  8. Should I use the images or the textual content of the book? What are the advantages and disadvantages of each?
  • A Rudimentary Approach:
As a first step, we assume that the structural metadata says whether the page is the first page of the book, an index page, the preface, the cover page, a normal page, etc.
Can I then view this problem of assigning structural metadata as a classification problem? The formulation is as follows:
given a large amount of annotated data containing the structural information, I should be able to learn from it and use it to assign structural information to any given page
with some accuracy.

Convinced to approach the problem as a classification problem, the question still remains whether the image or the textual content should be used. It is not yet clear.
Whatever the case may be, the next important phase in approaching the problem is extracting appropriate features (this has to be done depending on what we want to use, i.e. image or text).
What machine learning techniques should we use? The same old famous neural networks with n hidden layers?
Still to think...
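To make the classification idea concrete, here is a minimal sketch assuming we go with the textual content of a page: a tiny multinomial naive Bayes classifier over bag-of-words features. The training pages and labels below are made up for illustration; a real experiment would draw thousands of labelled pages from the annotated corpus.

```python
import math
from collections import Counter, defaultdict

# Toy annotated data: (page text, structural label). Invented examples,
# standing in for the real annotated corpus.
TRAIN = [
    ("contents chapter one chapter two chapter three", "index"),
    ("table of contents list of figures", "index"),
    ("preface this book grew out of lectures", "preface"),
    ("preface acknowledgements the author thanks", "preface"),
    ("the quick brown fox jumps over the lazy dog", "normal"),
    ("in this chapter we discuss the main results", "normal"),
]

def train(pairs):
    """Multinomial naive Bayes with add-one smoothing."""
    word_counts = defaultdict(Counter)   # label -> word frequencies
    label_counts = Counter()             # label -> number of pages
    vocab = set()
    for text, label in pairs:
        words = text.split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify(text, model):
    word_counts, label_counts, vocab = model
    total_pages = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # log prior + smoothed log likelihood of each word
        score = math.log(label_counts[label] / total_pages)
        total = sum(word_counts[label].values())
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) /
                              (total + len(vocab) + 1))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAIN)
print(classify("contents chapter four chapter five", model))  # index
```

An image based variant would swap the bag-of-words features for layout features (text density, whitespace, positions of text blocks) while keeping the same classifier formulation.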

Personalized Search - Overview of Approaches (trying to complete a survey)

I am beginning to write a survey of approaches to personalized search. In this post, I present a categorization of the approaches to personalized search. It is as follows...

Categorization of Personalized Search Approaches

First of all, search is not a solved problem. Moreover, with the tremendous growth in the available information on the web, personalized search is increasingly becoming an active research area.
There is a growing literature and a variety of approaches proposed for personalized search. One categorization of the approaches can be

1) Link based approaches using the graph structure of the web, primarily extending PageRank (what Google uses) and Hubs and Authorities
2) Domain specific personalization based on ontologies, etc.
3) Content based approaches (based on the vector model in information retrieval)
4) Machine learning based approaches
5) Approaches based on linear algebra
6) Recommendation based personalized search (using collaborative filtering and content based filtering)
7) Approaches based on the long term and short term history of the user, from web logs, etc.

All the existing approaches to personalized search in the literature can more or less be placed in this categorization, and each approach can fall into one or more categories.
For example, a machine learning based approach may also use the content of the page, and so on.

This categorization can be better visualized in terms of sets. Each of the 7 categories
can be represented as a set; certain sets contain other sets, and there are small and big overlaps accordingly. The approaches belonging to each category are the elements of the respective sets.
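The set view can be mocked up directly with Python sets; the approach names below are hypothetical placeholders, not real systems from the literature.

```python
# Hypothetical approach names grouped into three of the categories.
link_based       = {"pagerank-ext", "hits-ext"}
machine_learning = {"click-learner", "hits-ext"}
content_based    = {"click-learner", "profile-vsm"}

# An approach falling into more than one category sits in the intersection:
print(machine_learning & content_based)   # {'click-learner'}
print(link_based & machine_learning)      # {'hits-ext'}
```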

Link Analysis


I am posting some info I know about link analysis.

Link analysis, as far as I know, is making use of the hyperlinks on the web for various applications. The applications include finding authoritative or important pages having significance on the web [1], computing web page rankings for searches (Google) [2], finding web user communities [3][4], finding similar pages [5], web page clustering, web site classification, recommendation systems, etc.

You know Google's PageRank algorithm, which uses the back links
and out links of a given page to calculate the popularity of the page.
The basic idea in the PageRank algorithm is calculating the
popularity of a web page based on the back links it has. It is believed that a
good/popular page has links from good/popular pages. For example, compare my home
page and Yahoo: my home page has few back links whereas Yahoo has many,
so Yahoo is a more popular page than my home page.
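The back-link intuition can be sketched with a few lines of power iteration. The toy graph and the damping factor 0.85 are illustrative choices here, not details of Google's actual implementation.

```python
# Minimal power-iteration PageRank on a toy link graph.
def pagerank(links, d=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - d) / n for v in nodes}
        for v, outs in links.items():
            if not outs:                  # dangling page: spread rank evenly
                for u in nodes:
                    new[u] += d * rank[v] / n
            else:                         # share rank equally among out links
                for u in outs:
                    new[u] += d * rank[v] / len(outs)
        rank = new
    return rank

# "yahoo" has many back links, "homepage" has one:
graph = {
    "homepage": ["yahoo"],
    "a": ["yahoo"],
    "b": ["yahoo", "homepage"],
    "yahoo": ["a"],
}
ranks = pagerank(graph)
print(ranks["yahoo"] > ranks["homepage"])  # True: more back links, higher rank
```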

Basically, these methods look at how a page is connected in the web: to what
pages it gives links (out links), and from what pages it has links (back
links). For example, clustering of web pages can be done by observing the
links of a page. It is believed that similar pages will have similar out
links and back links, so by looking at the out links and back links we can see how
two pages are related, and similar such things.
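For instance, a crude link-based similarity between two pages could compare their out-link sets with the Jaccard coefficient (the page names are made up):

```python
# Bibliographic-coupling style similarity: fraction of shared out links.
def link_similarity(out_a, out_b):
    a, b = set(out_a), set(out_b)
    if not a | b:
        return 0.0
    return len(a & b) / len(a | b)

# Two pages sharing 2 of 4 distinct out links:
print(link_similarity(["x", "y", "z"], ["x", "y", "w"]))  # 0.5
```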

In this process, usually the content of the page is not used,
except for the anchor text. (In running text, when we give a
hyperlink we tend to put a small description of the hyperlink; that is
called anchor text.) The content is not much used because most of the
research in this area is done by database people, and for various reasons they tend not
to use the content of pages.

[1] http://iiit.net/~pkreddy/wdm03/wdm/auth.pdf
[2] http://iiit.net/~pkreddy/wdm03/wdm/page98pagerank.pdf

Monday, December 12, 2005

design pattern conformance

Developers realize design patterns in various forms. Though an architect might continue his analysis assuming the implementation follows design pattern "X", in reality the implementation may not reflect it.

If we could dynamically discover the design pattern from the running system and match it against a standard template, we could probably infer whether the implementation actually conforms.

Look at this in terms of this paper.

Probabilistic state machines could help, maybe?

Store the design pattern as a state machine, and compare it against an inferred state machine?
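As a toy sketch of that last idea (non-probabilistic, for simplicity): encode the expected pattern as a state machine over observed calls, then check whether a trace from the running system conforms. The Observer-style event names here are hypothetical.

```python
# Expected pattern as a deterministic state machine:
# state -> {observed event: next state}
TEMPLATE = {
    "start":      {"subscribe": "subscribed"},
    "subscribed": {"notify": "subscribed", "unsubscribe": "start"},
}

def conforms(trace, machine, state="start"):
    """Walk the trace through the machine; any illegal event means
    the implementation does not conform to the template."""
    for event in trace:
        if event not in machine.get(state, {}):
            return False
        state = machine[state][event]
    return True

print(conforms(["subscribe", "notify", "notify", "unsubscribe"], TEMPLATE))  # True
print(conforms(["notify"], TEMPLATE))  # False: notified before subscribing
```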

Saturday, December 10, 2005

Inferring constraints of usage

Most frameworks, to be used properly, require and enforce a certain sequence or order in which you perform your activities. A simple example could be: call routine A before you call routine B. Subscribe before you publish; open before you close.

Is there research done on mining and automatically inferring such rules for using a framework?

Please update me if you know of any. If there isn't any, throw some light on how one knows the right order in which to perform those calls.

"Mining Specifications" - Look at this paper and George's thesis at CMU ISRI.
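A naive version of such mining can be sketched as follows: for every pair of calls (A, B), keep the rule "A before B" if, in every trace that contains both, the first A precedes the first B. The traces below are invented; real ones would come from instrumented runs, and serious specification-mining work adds statistical ranking on top.

```python
from itertools import permutations

# Invented call traces from hypothetical program runs.
TRACES = [
    ["open", "read", "read", "close"],
    ["open", "write", "close"],
    ["open", "close"],
]

def mine_before_rules(traces):
    """Return (a, b) pairs where the first a precedes the first b
    in every trace containing both calls."""
    calls = {c for t in traces for c in t}
    rules = set()
    for a, b in permutations(calls, 2):
        seen_together = [t for t in traces if a in t and b in t]
        if seen_together and all(t.index(a) < t.index(b) for t in seen_together):
            rules.add((a, b))
    return rules

rules = mine_before_rules(TRACES)
print(("open", "close") in rules)   # True: open always precedes close
print(("close", "open") in rules)   # False
```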

Saturday, November 19, 2005

MSR 2006 potential projects

1) Find line-to-line correspondence. Get a mapping of each line to its author.

Run FindBugs + update the Bugzilla reports with the corresponding author information.

2) JDepend gives you metrics on code complexity and design quality.

Find check-in relationships and patterns from CVS mining, then decide about coupling and cohesion. Could there be a complementary analysis that helps better evaluate the code?

3) BIRT is a cool project that helps you access, format, and create reports from Bugzilla.

Integration of Bugzilla with other analysis and reporting techniques, and grouping and observing change patterns, can be done.

Tuesday, October 18, 2005

Invariant detection in CVS Code Repositories

Some lines are not altered by developers over a period of time, across 'n' checkins. These lines could contain extractable patterns or programmatic equations, which could be extracted and called "invariants". If these suddenly change, then maybe we need to alert the user.
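A crude sketch of the idea: treat each checkin as a list of lines and flag the lines that survive verbatim through the last n revisions. The revision contents here are invented.

```python
# Invented snapshots of one file across three checkins.
N = 3
revisions = [
    ["MAX_USERS = 100", "timeout = 30", "debug = False"],
    ["MAX_USERS = 100", "timeout = 60", "debug = False"],
    ["MAX_USERS = 100", "timeout = 60", "debug = True"],
]

def stable_lines(revs, n):
    """Lines present verbatim in each of the last n revisions:
    candidate 'invariants' whose later change should raise an alert."""
    recent = revs[-n:]
    common = set(recent[0])
    for rev in recent[1:]:
        common &= set(rev)
    return common

print(stable_lines(revisions, N))  # {'MAX_USERS = 100'}
```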

Friday, August 26, 2005

Search and Browse information not data

Currently, when I search for some information on the net, search engines definitely do a good job and show me numerous results (37999 results matched!).

However, when I start reading each of the links in the top-10 list, I see that most of them have overlapping information. I spend 10 minutes reading each page, only to find at the end of the 10 links that the information I have really gained is maybe 1.75 pages.

Now, search engines hog the bandwidth of the websites and download the complete data. However, the real power of this complete data is not harnessed.

If there were a search engine with a reader attached to it, which could show me snippets or excerpts of the information on a page, or at least cluster results based on content overlap, that would be cool.

Now, the challenges for this:
What exactly is information, and how do you find its overlap?
Is this computationally feasible?
Adding another layer (a reader layer) between the search engine and the documents - will that be usable?
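One rough stand-in for "information overlap" is comparing the word 3-shingles of two pages with the Jaccard coefficient; the page texts below are made up.

```python
# Word k-shingles: all runs of k consecutive words in the text.
def shingles(text, k=3):
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def overlap(a, b):
    """Jaccard coefficient over word 3-shingles: 0 = disjoint, 1 = same."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

page1 = "the war of the worlds was directed by steven spielberg"
page2 = "steven spielberg directed the war of the worlds in 2005"
page3 = "pittsburgh has many bridges across three rivers"

print(overlap(page1, page2) > overlap(page1, page3))  # True
```

Results whose pairwise overlap exceeds a threshold could then be clustered together, so the reader layer shows one representative excerpt per cluster instead of ten near-duplicates.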

Friday, August 19, 2005

CVS DB tool + Software Inspection ?

I saw this today: http://www.cwi.nl/projects/renovate/javaQA/intro.html
The jCosmo code smell browser detects code smells in Java source code; these can be used to review the quality of the analyzed code and to indicate regions that could benefit from refactoring.

Questions for ME:
Can inspection be improved by combining empirical analysis techniques with static and dynamic software analysis?

More generally, can knowledge from "Mining Software Repositories" help inspection and suggest potential spots for refactoring?

Cyclomatic Complexity
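As a note to self: McCabe's cyclomatic complexity for structured code works out to (number of decision points) + 1. Real tools compute it from the control-flow graph or AST; the keyword-counting sketch below is only a crude approximation.

```python
# Crude approximation: count branching keywords in the source text and
# add one. A real tool would build the control-flow graph instead.
def rough_complexity(source):
    tokens = source.split()
    decisions = sum(tokens.count(k) for k in ("if", "elif", "for", "while"))
    return decisions + 1

src = "def f(x): if x: for i in range(x): if i: pass"
print(rough_complexity(src))  # 4  (three decision points + 1)
```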



Saturday, August 13, 2005

Friday, August 12, 2005

Architecture based Analysis

As software engineering tries to redefine itself more and more as "reuse engineering", we feel the necessity of modularization and component based development.

Estimating the quality attributes of a system, like performance, reliability, security and others, in advance of actually building the system is important. This could be based on the individual components that compose the system. The architecture could be the best place to introduce such analysis and then reason about the system further.

Models include reliability based on state models, Markov models, Poisson distributions of faults in modules and in the interfaces of module interactions, and queueing network theory for performance, etc.

There is a lot of exploring to do here, but these analyses are what strengthen the necessity and significance of architectures, and also make architectural formalism worth it.
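A tiny illustration of the state-model flavour of reliability analysis: weight each usage path by its probability and multiply the reliabilities of the components it exercises. All numbers and component names are invented; real architecture-based models (e.g. Cheung-style Markov models) work on the full transition matrix rather than enumerated paths.

```python
# Invented usage profile: (probability of path, components it exercises).
paths = [
    (0.7, ["ui", "logic"]),
    (0.3, ["ui", "logic", "db"]),
]
# Invented per-component reliabilities.
reliability = {"ui": 0.99, "logic": 0.98, "db": 0.95}

def system_reliability(paths, rel):
    """Expected probability that a run completes without a component failure."""
    total = 0.0
    for p, comps in paths:
        r = p
        for c in comps:
            r *= rel[c]
        total += r
    return total

print(round(system_reliability(paths, reliability), 4))  # 0.9556
```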

Architecture based analysis of performance, reliability and security of software systems: Vibhu Saujanya Sharma and Kishor Trivedi
Architecture based performance analysis: Bridget Spitznagel and David Garlan

Tuesday, August 09, 2005

Monday, August 08, 2005

Problem Frames

Context: Software Requirements & Specifications, Software Design, Problem Analysis

When you analyse a problem you see what kind of problem it is, and identify the concerns and difficulties you will have to deal with to solve it. Problem analysis takes you from the level of identifying the problem to the level of making the descriptions needed to solve it. But most realistic problems are too big and complex to handle in just two levels like this. Another level is needed: structuring the problem as a collection of interacting subproblems. If your structuring is successful, the subproblems will be smaller and simpler than the problem you started with, and their interactions will be clear and understandable. Then you can analyse each subproblem separately, as a simple problem on its own, and make the descriptions it needs.

Problem frames help in problem analysis and structure.
- They help you by defining different simple problem classes. When you analyse a subproblem, you see what concerns it raises according to the problem frame that fits it. The frame shows you what you must do to solve it.
- Problem frames help you to focus on the problem, instead of drifting into inventing solutions. They do this by emphasising the world outside the computer, the effects that are required there, and the relationships among things and events of the world.

Problem frames share much of the spirit of design patterns. Design patterns look inwards towards the computer and its software, while problem frames look outwards to the world where the problem is found. But they both identify and describe recurring situations. They provide a taxonomy within which each increment of experience and knowledge you acquire can be assigned to its proper place in a larger scheme, and can be shared and accessed more effectively. So just as in object-oriented design a familiarity with design patterns allows you to say 'we need an instance of the decorator pattern here', so in problem decomposition a familiarity with problem frames allows you to say 'this part of the problem is a workpieces problem'. Having identified a part of a problem with a recognised problem frame, you can draw on experience associated with the frame.




CRC cards

Context: Software Design, Analysis




Wednesday, July 13, 2005

War of the Worlds

Had lots of leftover tasks to do, but a Steven Spielberg movie with Cruise in the lead is something that definitely no one can resist. And the time I spent watching it, putting aside my paper work for the conference, was indeed worth it, every minute.

I should say it was a master's play. He is one helluva director. I have watched all of his movies, but this one just leads the way at the front! A family story and sentiments amongst horrendous and breathtaking scenes with those 'awfully cute' creatures... boy! You need to be him to make movies of that kind!

Well, I bet Hindi/Telugu/Tamil movie directors would have done quite better! With all those 'revenge against the aliens' scenes and the 'saviour of the world' invincible, mid-50s protagonist. Well, that's a totally different league and I can talk at length about them in a different post - but for now, Steven Spielberg rocks!

Thursday, June 23, 2005

journey to Pittsburgh

I reached the US yesterday. It was a very, very long journey, the worst of the
journeys I've ever had. Firstly, the stay at Dubai was 14 hrs and that turned into a
nightmare. Spending 14 hrs aimlessly isn't an easy job. I was roaming around
the airport, then slept for 2 hrs, then talked to a fellow passenger, then
ate lunch, then read for 2 hrs, then roamed around again, then talked again,
then ate dinner and finally slept off... By God's grace I got up just 1
hr before the flight departure, and it was really amazing, because I would have
definitely slept longer. Finally got on the plane and came to London. There it
was OK; the next flight was in 2 hrs, so not much waiting. So I reached
New York. Here was another nightmare. I had to wait 3 hrs for the flight to
Pittsburgh, and then half an hour before departure it started raining and
thunderstorming, so the flight was delayed by 1 hr. It seems the rains
ruptured the complete schedule, and so after 1 hr of waiting they said the
flight was cancelled. So I went back to the service centre and took a ticket for the
flight at 8:30. Then I waited till 8:30 and saw that the flight was again
cancelled. So I went back and waited in the long queue at the service centre; by
then I had not even a single ounce of energy to argue with them. But good
souls, they gave me accommodation in a Holiday Inn hotel for that night. I
reached the hotel at 11, took a nice bath, ate, and slept by 12. Got up at 6
again and caught a flight to Pittsburgh at 9:20. That got delayed and
finally took off at 10:30. Reached here and settled down at Kishore's place.
Looking out for a place to live; might find one before July, hopefully.

BTW, the jet lag has struck me badly. Since I used to sleep at 2 or
so in the night when I was at IIIT, I feel very sleepy at about 4:30 or so
in the afternoon here. So I am still living on Indian timings. Let's see when I
get back to normal.

Wednesday, May 11, 2005

Books after a long time!

We were off on a family trip to Shirdi. Our family only goes on holy trips! Not much fun, but it's good that we all get to meet once in a while. I must have committed some kind of ominous act (I do those often, but I think I had some chicken on the way to the holy place) and even the good God was annoyed with me - the result: a severe, painful pharyngitis caused by a streptococcal infection. Well, well... to put it simply, it's a bad throat infection with an accompanying fever.

I decided to fight it myself by ignoring it and moving on. That worked for 2 days, but since I had this seminar at the DLI workshop (that's the project I work on), I had to finish it first, and then I went to a doc. That didn't work either... So after the seminar was done, I decided to spend a few days at my cousin's place, just to have some good home food, some quality time with the TV, and some rest above all. My cousin was dead against me watching TV, as that apparently strains oneself. So fine... what else do I do? With no other option, I was looking all over the room just to find some time-killers. Finally I found some books! Yes, BOOKS. I was never a fan of reading, as I had many other things to do; I never felt I had time to read them. But I guess this was the time. I had some "Reader's Digest" monthly editions by my side and started out. It was great! At least it was a close and cool encounter with books after a long time.

So, my fellow baversites (you won't find that in any Oxford dictionary - don't even try) and friends, try reading once in a while! It's good for health :). I've started contemplating why watching television strains you whereas a book doesn't. Will put that in soon...