A Peek Into the Google Algorithm


Last week, a gent by the name of Ruslan Abuzant got a rare peek at a portion of Google's algorithm, stumbling across it while looking at the cached version of a multi-language page. He was kind enough to post his findings on the Digital Point forums, which I found via Threadwatch.

Perhaps it's because it happened over the holiday weekend, but I thought it was a bit odd that more SEOs weren't as excited by this as I was. No, there's probably not A LOT that can be learned from this, but there is some, and it was finally like being "through the looking glass" to get a rare glimpse of how Google really ranks pages.

pacemaker-alarm-delay-in-ms-overall-sum 2341989
pacemaker-alarm-delay-in-ms-total-count 7776761
cpu-utilization 1.28
cpu-speed 2800000000
timedout-queries_total 14227
num-docinfo_total 10680907
avg-latency-ms_total 3545152552
num-docinfo_total 10680907
num-docinfo-disk_total 2200918
queries_total 1229799558
e_supplemental=150000 --pagerank_cutoff_decrease_per_round=100 --pagerank_cutoff_increase_per_round=500 --parents=12,13,14,15,16,17,18,19,20,21,22,23 --pass_country_to_leaves --phil_max_doc_activation=0.5 --port_base=32311 --production --rewrite_noncompositional_compounds --rpc_resolve_unreachable_servers --scale_prvec4_to_prvec --sections_to_retrieve=body+url+compactanchors --servlets=ascorer --supplemental_tier_section=body+url+compactanchors --threaded_logging --nouse_compressed_urls --use_domain_match --nouse_experimental_indyrank --use_experimental_spamscore --use_gwd --use_query_classifier --use_spamscore --using_borg

While this isn't EXTREMELY telling, there are some things we can take a look at here that are potentially useful. Perhaps the other reason SEOs weren't too excited is that, as you break this down, you tend to see a lot of the variables we often speculate about anyhow. TallTroll (hey Brendon - I'd link to ya if I knew any of your sites;)) mentioned on Threadwatch a while back:

The joke is that even if they published a definitive version of the algo, the kind of people who moan about Google still wouldn’t be any better off, since they STILL wouldn’t have any clue what to do with the information. Those who do know what to do with it already have a good idea of what the algo looks like, at least in broad terms, and so will gain little themselves.

I guess most SEOs don't NEED to know the algorithm, because they have adapted best practices to suit their process for the most part. They may be able to adapt their process a bit if they knew the EXACT algo, but many folks have a pretty good guess of where the knobs are dialed to, although I'm certain it's far from a comprehensive understanding of exactly what the mountain of Ph.D.s at G, Y, and MSN have up their sleeves.

So without further ado, here's a bit of my speculation on what I thought was one of the coolest developments in a long time. It's only a piece of what is a much bigger thing, but I thought it was definitely worth a look when Matt confirmed it was real (and also that we will most likely NEVER see something like this again).

**Note:** This is pure speculation and 99% of it may be pure trash.


pacemaker-alarm-delay-in-ms-overall-sum 2341989


Best guess: Could be about anything I suppose – potentially a metric for spidering frequency to the specific page


pacemaker-alarm-delay-in-ms-total-count 7776761


Best guess: spidering frequency to entire site?


cpu-utilization 1.28


Best guess: Metric for how CPU intensive site spidering was


cpu-speed 2800000000

Best guess: Perhaps how fast to spider the website based on server performance


timedout-queries_total 14227


Best guess: How many times the web site has timed out to requests over time

num-docinfo_total 10680907


Best guess: File size of the document – last time requested


avg-latency-ms_total 3545152552

Best guess: Latency speed of the webserver serving the document requested


num-docinfo_total 10680907

Best guess: File size of the document – current request


num-docinfo-disk_total 2200918

Best guess: Total stored site size


queries_total 1229799558


Best guess: Total queries for the site category, or perhaps the specific site. Perhaps "navigational" queries are used to measure the popularity of a site?


e_supplemental=150000


Best guess: Threshold for placing results into the supplemental index
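
If that reading is right, the logic could be as simple as the sketch below (Python, purely illustrative; the leak gives no units for the 150000, so treating it as a score threshold is entirely my assumption):

```python
# Toy sketch: route a document to the main or supplemental index based
# on a threshold. What e_supplemental=150000 actually measures is unknown.

SUPPLEMENTAL_THRESHOLD = 150000  # value from the leak, units unknown

def choose_index(doc_score):
    return "main" if doc_score >= SUPPLEMENTAL_THRESHOLD else "supplemental"

print(choose_index(200000))  # main
print(choose_index(12000))   # supplemental
```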

--pagerank_cutoff_decrease_per_round=100


Best guess: Some cutoff point for figuring link popularity - perhaps an incorporated trust filter that decreases link popularity by several multiples until it's found trustworthy


--pagerank_cutoff_increase_per_round=500


Best guess: Some cutoff point for figuring link popularity - see above
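
To make the "per round" idea concrete, here's a minimal sketch of power-iteration PageRank with a cutoff that drifts each round. Only the flag names come from the leak; the pruning behavior, the damping factor, and every value here are my own guesses:

```python
# Toy sketch: power-iteration PageRank where pages that fall below a
# cutoff stop passing PageRank, and the cutoff itself shifts per round
# (a guess at --pagerank_cutoff_increase/decrease_per_round).

DAMPING = 0.85

def toy_pagerank(links, rounds=20, cutoff=0.0, cutoff_step=0.001):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {t for ts in links.values() for t in ts}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(rounds):
        new_rank = {p: (1.0 - DAMPING) / len(pages) for p in pages}
        for page, targets in links.items():
            if not targets:
                continue
            share = DAMPING * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
        # Pages under the cutoff are pruned from the next round.
        links = {p: ts for p, ts in links.items() if rank[p] >= cutoff}
        cutoff += cutoff_step
    return rank

print(toy_pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
```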

--parents=12,13,14,15,16,17,18,19,20,21,22,23


Best guess: Parent topical categories (think DMOZ) - or parent pages within the site (think SE theme pyramids or virtual site hierarchy)


--pass_country_to_leaves


Best guess: Choose primary country of origin for website or page


--phil_max_doc_activation=0.5


Best guess: Threshold for maximum spidering of website


--port_base=32311


Best guess: An indicator of filetype, or of which datacenters the data is distributed across


--production


Not much to go on here…


--rewrite_noncompositional_compounds


From "Automatic Discovery of Non-Compositional Compounds":
Spaces in texts of languages like English offer an easy first approximation to minimal content-bearing units. However, this approximation mis-analyzes non-compositional compounds (NCCs) such as “kick the bucket” and “hot dog.” NCCs are compound words whose meanings are a matter of convention and cannot be synthesized from the meanings of their space-delimited components.

Best guess: Sounds like some implementation of LSA/LSI to create meaning from non-standard language. Perhaps some type of language AI.
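
As a rough illustration of what "rewriting" a non-compositional compound could mean in practice, here's a toy sketch. The compound list and the merging approach are invented for illustration, not anything from the leak:

```python
# Toy sketch: merge known non-compositional compounds (NCCs) into single
# tokens before scoring, so "hot dog" isn't treated as hot + dog.
# The NCC list here is invented for illustration.

NCCS = {
    ("hot", "dog"): "hot_dog",
    ("kick", "the", "bucket"): "kick_the_bucket",
}
MAX_NCC_LEN = max(len(k) for k in NCCS)

def rewrite_compounds(text):
    words = text.lower().split()
    out, i = [], 0
    while i < len(words):
        # Try the longest possible compound starting at position i.
        for n in range(min(MAX_NCC_LEN, len(words) - i), 1, -1):
            if tuple(words[i:i + n]) in NCCS:
                out.append(NCCS[tuple(words[i:i + n])])
                i += n
                break
        else:
            out.append(words[i])
            i += 1
    return out

print(rewrite_compounds("He ate a hot dog"))
# ['he', 'ate', 'a', 'hot_dog']
```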


--rpc_resolve_unreachable_servers


Best guess: Have Googlebot revisit unreachable servers


--scale_prvec4_to_prvec


Best guess: Adjustments on PR algo


--sections_to_retrieve=body+url+compactanchors


Best guess: Disregard navigation that is consistent throughout the website - some type of block-level analysis
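
If the block-level reading is right, the simplest version might look like this sketch (my own toy take: drop any block that repeats verbatim on every page of a site before scoring):

```python
# Toy sketch: treat blocks that appear on every page of a site
# (e.g., navigation) as boilerplate and keep only page-specific blocks.

from collections import Counter

def site_specific_blocks(pages):
    """pages: list of pages, each a list of text blocks."""
    counts = Counter(block for page in pages for block in set(page))
    n = len(pages)
    return [[b for b in page if counts[b] < n] for page in pages]

pages = [
    ["HOME | ABOUT | CONTACT", "Welcome to my site"],
    ["HOME | ABOUT | CONTACT", "My SEO article"],
]
print(site_specific_blocks(pages))
# [['Welcome to my site'], ['My SEO article']]
```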


--servlets=ascorer


Best guess: Who the hell knows…not much to go on here…I’m grasping at straws already if you got this far and didn’t realize it;)


--supplemental_tier_section=body+url+compactanchors


Best guess: Additional block-level analysis, perhaps some duplicate content detection


--threaded_logging


Best guess: Log more in depth information (links, clickthrough rates, etc.) for this page


--nouse_compressed_urls


Best guess: Perhaps a fix for SIDs in URLs, or otherwise disregarding types of URLs that create infinite loops - disregarding any variables after the question mark in a URL
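
If the session-ID guess is right, the behavior might resemble this sketch (my own toy normalization; whether this flag means anything like this is pure assumption):

```python
# Toy sketch: strip query strings so crawl traps like
# page.php?sid=abc123 and page.php?sid=def456 collapse to one URL.
# Which parameters (if any) Google actually strips is unknown.

from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

print(normalize_url("http://example.com/page.php?sid=abc123"))
# http://example.com/page.php
```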


--use_domain_match


Best guess: Some type of canonicalization fix
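
If this is about canonicalization, the simplest version would be collapsing host variants onto one canonical domain, something like this toy sketch (the www-stripping rule is my own example):

```python
# Toy sketch: treat www and bare hostnames as the same site so their
# link popularity isn't split between two "domains".

def canonical_host(host):
    host = host.lower()
    return host[4:] if host.startswith("www.") else host

print(canonical_host("WWW.Stuntdubl.com"))  # stuntdubl.com
```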


--nouse_experimental_indyrank


Best guess: Dunno, but it sounds like a good thing to start tryin' to figure out - perhaps they finally ARE going to roll toolbar or user data into the algo. Perhaps personalization is finally making its way in.


--use_experimental_spamscore


Best guess: Newer version of the spamscore below - a number of filters that give an indication of how likely a page is to be spam.


--use_gwd


Best guess: not much to go on here – I’ll go with “google word database”
Other guesses have included “google web directory” or “google world domination”


--use_query_classifier


Best guess: Something as simple as:

• navigational
• informational
• transactional

Similar to Yahoo Mindset. Or, more likely, a deeper extension of the above: query-specific variables for certain verticals. Think "transactional real estate" (new york real estate agent) vs. "informational real estate" (new york real estate news). These criteria would also help decipher which queries to serve "onebox" results for: Froogle/Google Base/Google Local/Google Maps/etc. A toy sketch of the idea follows below.
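
Here's a minimal sketch of that idea: a rule-based navigational/informational/transactional classifier. The cue word lists are invented for illustration and are certainly nothing like whatever Google actually uses:

```python
# Toy sketch: bucket queries by intent using keyword cues.
# The cue lists below are invented for illustration.

NAVIGATIONAL_CUES = {"login", "homepage", "official"}
TRANSACTIONAL_CUES = {"buy", "agent", "price", "cheap", "sale"}
INFORMATIONAL_CUES = {"news", "what", "how", "why", "history"}

def classify_query(query):
    words = set(query.lower().split())
    if words & NAVIGATIONAL_CUES:
        return "navigational"
    if words & TRANSACTIONAL_CUES:
        return "transactional"
    return "informational"  # default bucket

print(classify_query("new york real estate agent"))  # transactional
print(classify_query("new york real estate news"))   # informational
```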


--use_spamscore


Best guess: The "non-beta" or working version of the above-mentioned spam score, which is a constant work in progress. Things like multiple dashes in a domain are good indicators of a high likelihood of a page being spam. Domain names over a certain length, and probably many other things, would fall into what could be used to evaluate a site's "spamscore".
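
As a toy illustration of that kind of scoring, using only the two signals mentioned above (the weights and the length threshold are invented):

```python
# Toy sketch: a naive domain "spamscore" from dash count and length.
# The weights and the length threshold are invented for illustration.

def spamscore(domain):
    score = 0.2 * domain.count("-")  # multiple dashes look spammy
    if len(domain) > 30:             # unusually long domain name
        score += 0.3
    return min(score, 1.0)

print(spamscore("buy-cheap-discount-meds-online-now.example"))  # 1.0
print(spamscore("stuntdubl.com"))                               # 0.0
```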


--using_borg


Best guess: A. Some technology or systems developed by Anita Borg (time for some homework) - or B. Google really *IS* trying to take over the world, and we're all being added to a massive database. I'm going with A as my best guess though;)

People sometimes have a hard time understanding that algorithm variables are not necessarily good or bad, fair or unfair… they are only effective or ineffective in judging quality. People evaluate search results subjectively, but a search algo is objective to the many different criteria that make up the final result. A webmaster may think that tracking the number of times a site goes down is "unfair", but on a massive scale it is an accurate indication of the quality of a website.

I'm sure the boys at the 'plex are getting a nice chuckle out of some of my wild speculation, so I'd like to be my normal Google-nitpicking self and add my own two cents to Matt's super beta-algo (I like where it's going:):

--initial_time_travel_wormhole="Wednesday, December 31 1969 11:11 pm"
--use_googlepray=false
--docid_size=more-than-four-bytes
--SETI_alien_communication_port=31337
--skynet_sentience=0.33
--plane_load=snakes
--pigeonrank_seed=42
--use_mentalplex=true
--unicorn_versus_werewolf=its-on-now

You may be better off with:

--initialize_flux_capacitor="November 5, 1955, 0600 AM" (stop Doc Brown!)
--docid_size=return_to_1985
--use_googlekarma=true
--reveal_matrix=red_pill
--use_googledance=tango
--use_men-in-black-flashy=true
--toolbar_phone_home=ET
--tinfoil_hat_wearer=true
--source_code_level=hello_world
--ninjas_riding_unicorns_vs_pirates_with_werewolves

Hope this helps spice things up a bit:)

We know there are hundreds if not thousands of variables and combinations, so you have pretty good odds that you can pick SOMETHING that is in the secret sauce SOMEWHERE. This could of course be just another ploy to keep SEOs busy and wondering rather than actually WORKING on creating more websites;) Anyone else care to toss out their best guesses on what some of this stuff may or may not mean? Wasn't anyone else excited to get a brief little peek at the code we all so diligently try to reverse engineer?


    • DazzlinDonna

      I tried to get something going over at seorefugee – http://www.seorefugee.com/forums/showthread.php?t=2770

      but it didn’t get too far. A few interesting thoughts, however.

    • phaithful

      Great write-up. Very interesting assumptions.

    • http://www.mr-seo.com Mr SEO

      Todd, bravo! Nice breakdown. I think you hit the nail on the head here. The only problem I have is, Google tells us what they want us to know. Sometimes they do things to turn us away from the truth. So things aren't always what they seem.

    • http://ben.wilks.net/ Xenith

      Nice speculation Todd, funny shit!

    • General Public

      All these Google algorithms seem to be failing, looking at the crap we see in the SERPs. Don't worry, Google is not going to dominate the world; they are simply filling up the disk space on the Big Daddy datacenter with crap pages that AdSense spammers spew out.

      Time for a next-generation search engine.

    • http://chovy.com chovy

      excellent ideas…I’m sure a lot of the performance stuff has an effect. Perhaps the faster, more reliable servers rank higher, although that is a gross generalization on my part.


    • http://www.searchrevenues.com Search Revenues

      Oh, I feel much better now. LOL!

    • http://www.neutralize.com Teddie

      Duh? For goodness sake, this is from a server in the query handling process, so why on earth would you expect it to feed back any details about spidering?

      You also seem to have skipped over how the eval data gets factored into this, relevant/not relevant etc. I'd put money on that being the borg value.

    • Michael Martinez

      You apparently haven’t done much meta/macro programming, Todd. Anything with a hyphen (-) in front of it is probably an instruction to a processing application to NOT use that module or value. Anything in the form of “name=value” usually sets either a limit (to cap off calculated values) or a fixed value to be used to replace macro names embedded in code. While I don’t know any more about what this stuff is than you do, it looks very much like Job Control Language statements, or macro definitions, as are often used in meta coding environments where you have code that requires run-time data values be passed to it.

      It seems highly doubtful that anything passed to Google’s cache display application would have instructions that affect spidering.

    • http://www.stuntdubl.com Stuntdubl SEO

      Yep. I think you’re both right on the spidering speculation. This was all the first stuff that popped to mind (think word association). It was done in haste, without much research into the programmatic nature of the code, and without much previous understanding of it.

      Mainly, I was just hoping to get some other folks to comment on the different variables (if only to tell me I was wrong).

      Perhaps at some point I’ll go back and review the variables with those concepts in mind. Thanks for explaining that a bit Michael.

    • http://www.neutralize.com Teddie

      Todd – just run them through Google :-)

      The first bunch are just server status details, and the rest are settings for that particular production server; nothing there is site-specific at all, although the command names themselves are interesting.

    • http://www.searchreturn.com Detlev Johnson

      Hello everyone,

      I am afraid the focus on a site and Google crawling may not be accurate for assuming such things. Remember this is a Google error message and not information for publisher consumption. Flip your thinking.

      SearchReturn posted (the same day as this post) our thoughts on what some of these mean. We left out things which are terribly obvious references to internal CPU usage and document counting.

      These are *not* site related but are error messages about Google server performance and settings. The hint is in the numbers, which suggest server performance and look too large (even as bytes) for things like file size.

      Port is a port number – see? These are references for Google employees to figure out what went wrong in case of such an error. There were, however, one or two interesting things that can be speculated about and we listed what we think they mean.

      http://www.searchreturn.com/digest/076.shtml#id348

      *cheers*
      -d

    • http://www.piedmontdesign.com/ Piedmont

      I really appreciate all your hard work on this. I agree that it's a really interesting peek into Google's secrets. I also agree with your assessment that it's not a major development for good SEO types, as they should have a broad-level understanding already.

      I still don't have any understanding as to why something affecting search algo results could be in the cache. Is the thinking that this data was intended for a DB to be used by the algo but was saved in the cache for some unknown reason?

    • http://ofb.net/~wtanaka/ Wesley Tanaka

      You may have misinterpreted some of the abbreviations. But your post inspired me to speculate what it all means myself.

    • http://seside.net/ JohnM

      “-use_gwd” — “Google Web Directory” might not be so far off considering the new “NOODP” attribute.

      I assume that some of the values here are status variables from the server itself, while others are current settings. I think we can ignore the status variables and concentrate on the settings. Why should *these* settings be here? Are they non-default settings? Are all other "settings" statically linked into the algorithm? Could these be the settings that differentiate the different datacenters? Or are they just current "state" variables that change with each server state (my guess)? Depending on what they really are, their importance can be high or almost nothing (my guess).


    • Kevsh

      Did the person who first posted this ever find out what site this may have been pertaining to? I’m assuming the result he got was a dataset from one specific site for the query he used.

      Anyhow, I couldn’t find the original thread but I suspect not. Perhaps it wouldn’t make any difference, but for a few of the items it may have been useful to analyze the page this result set was for?
