Business Management Consultant - Stuntdubl Search and Marketing Consulting

Information Retrieval Research - IDF/TVT

Inverse Document Frequency/Term Vector Theory

Well, I haven’t had as much chance as I would like to do research on this latest update, but I did see Jake mention IDF in this post at SEW, which also has a ton of other good information in it (msg #50 from xan, among others). I’ve also heard it mentioned a handful of times, and figured it was high time to sit down and do some dedicated research on at least one of the speculative new technologies (LSI, Hilltop, and all the incredible information orion is bombarding us with these days). While everything seems to point back to “quality relevant links”, I think it’s good to broaden one’s horizons and understand what determines “quality” in a changing environment.


Inverse Document Frequency
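- a measure of how rare a term is across a collection of documents, used to help determine the position of a term in a vector space model.
Formula for IDF:
IDF = log(D/d), where D = collection size (total number of documents) and d = number of documents containing a given term.

Weight of a term: w = tf*IDF, where tf is the number of times the term occurs in a document.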

- see also Term Vector Theory
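To put some numbers on the formula, here is a minimal Python sketch. The function names `idf` and `term_weight` are made up for illustration, and the base-10 log is an assumption (the formula above doesn't say which base to use).

```python
import math


def idf(collection_size: int, docs_with_term: int) -> float:
    """IDF = log(D/d). Base-10 log is an assumption; the formula names no base."""
    if docs_with_term <= 0 or docs_with_term > collection_size:
        raise ValueError("need 0 < d <= D")
    return math.log10(collection_size / docs_with_term)


def term_weight(tf: int, collection_size: int, docs_with_term: int) -> float:
    """w = tf * IDF for one term in one document."""
    return tf * idf(collection_size, docs_with_term)


# Hypothetical collection of 1000 documents.
rare = idf(1000, 10)      # term appears in only 10 documents
common = idf(1000, 1000)  # term appears in every document
weight = term_weight(3, 1000, 10)  # the rare term used 3 times in a page
```

The takeaway: a term that shows up in every document gets an IDF of zero, so repeating it adds nothing to the weight, while a rare term gets a high IDF and each occurrence counts for more.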

According to orion in the above-mentioned TVT thread, the formula for term vector theory is as follows:
w(i) = tf(i)*IDF = tf(i)*log[D/df(i)]

where

tf(i) = term frequency, number of times a term i occurs in a document
IDF = Inverse document frequency = log[D/df(i)]
D = database size or number of documents available
df(i) = number of documents containing term i
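For anyone who reads code more easily than formulas, here is a rough sketch of how those definitions could turn a handful of pages into weighted term vectors. The three sample "documents", the whitespace tokenizer, the `term_vectors` name, and the base-10 log are all assumptions for illustration; nothing here is claimed to be what any search engine actually runs.

```python
import math
from collections import Counter


def term_vectors(docs):
    """Return one {term: w(i)} dict per document, with w(i) = tf(i) * log10(D / df(i))."""
    D = len(docs)  # database size
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)

    # df(i): number of documents containing term i (count each term once per doc)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # tf(i): occurrences of term i in this document
        vectors.append({t: tf[t] * math.log10(D / df[t]) for t in tf})
    return vectors


# Hypothetical three-page "database".
docs = [
    "search engine marketing",
    "search engine links links",
    "quality links",
]
vectors = term_vectors(docs)
```

In this toy set, "marketing" appears in only one page, so it ends up with a higher weight in the first vector than "search", which appears in two of the three.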

I wish they’d do more pictures of this stuff for the slower people in the crowd:
Term Vector Theory Chart
More on Term Vector Theory at Webmasterworld, and an Art vs. Science discussion on the term weight formula from HighRankings.

Not sure if I digested all this, but at least now I have some good bookmarks for later. My take is that you may start seeing more pages returned in the SERPs (if you haven’t already ;) that don’t contain the actual keyphrase you searched for.

2 Comments

Brandon
February 10th, 2005,
3:39 pm

And I thought logs only occurred naturally… in woods…

*whew that joke is funny on so many levels

Soren
February 10th, 2005,
4:08 pm

We need someone to convert this theory into words without the math. Any volunteers?
