
Scraper Sites and SE Ambiguity: What is Your Site’s Reading Level?

Online marketing information can change quickly. This article is 15 years and 148 days old, and the facts and opinions contained in it may be out of date.

What is a scraper site?

Question from client:
What do you make of this site?

http:// goes here

What’s the purpose?

My response:
It’s what we call a “scraper site,” designed specifically to game AdSense and make money
from clicks. It’s an evolution of the escalating information arms race between spammers and search engineers.

They “scrape” the titles and descriptions of the top 10, 20, or 50 search results and spit them out onto a webpage. There are LOTS of variations of this technique, and it’s one of the reasons I worry so much about duplicate content, as that’s one of the biggest problems the search engines currently face.
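To make the mechanics concrete, here is a toy sketch of that “scrape and spit out” pattern in Python. The HTML structure and class names (`title`, `desc`) are invented for illustration; real results pages differ and change constantly, and this is a sketch of the pattern, not any particular scraper.

```python
# Toy sketch of the scraper pattern: pull (title, description) pairs out of
# a results-style page, then re-emit them as "new" content.
# The markup below is hypothetical, not from any real search engine.
from html.parser import HTMLParser

class ResultParser(HTMLParser):
    """Collects (title, description) pairs from a toy results page."""
    def __init__(self):
        super().__init__()
        self.results = []     # list of [title, description]
        self._field = None    # "title" or "desc" while inside a tagged element

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "")
        if tag == "h3" and "title" in classes:
            self._field = "title"
            self.results.append(["", ""])
        elif tag == "p" and "desc" in classes:
            self._field = "desc"

    def handle_data(self, data):
        if self._field == "title":
            self.results[-1][0] += data
        elif self._field == "desc":
            self.results[-1][1] += data

    def handle_endtag(self, tag):
        self._field = None

page = """
<h3 class="title">First Result</h3><p class="desc">Snippet one.</p>
<h3 class="title">Second Result</h3><p class="desc">Snippet two.</p>
"""
parser = ResultParser()
parser.feed(page)

# "Spit it out" as a new page -- exactly the duplicated snippet text
# that the engines' filters are hunting for.
scraped = "\n".join(f"<h2>{t}</h2><p>{d}</p>" for t, d in parser.results)
```

Scale that loop across thousands of queries and you have a scraper site: zero original content, all of it already sitting in the engines’ own indexes.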

The real question is why Google allows them as much as it does, and doesn’t shut down or penalize accounts that use them. The tin foil hat theory says it’s to pollute Yahoo and MSN and to make extra money from advertisers using content targeting. That, and it’s still a very new gray area.

This is the darker side of SEO that includes heavy automation of webpage creation for gaming the engines. I’ve heard it joked before that a couple dozen SEOs are responsible for three quarters of the engines’ indexes…the funny part is that it is probably nearly true (I’ve met a couple of them).



The Clone Wars

“DMOZ clones” were among the first scraper sites. They rendered common backlink tools much less useful, since any site listed in DMOZ would show up on many of these clone sites. They also pollute the search engines. The engines have managed to get rid of most DMOZ clones, and are doing better at ridding themselves of scraper sites, but it is still a hotly debated issue that probably isn’t going to go away completely anytime soon.

I think G has realized that some things that seem detrimental may have certain benefits associated with them as well. I’m sure the web spam team doesn’t like scrapers (and I agree with them), but if the investors (or the board, for that matter) knew about them, would they really care, or would scrapers be seen as a nice added short-term revenue stream? I won’t add fuel to the tin hat fire, but I don’t think scrapers are all that different from click fraud in the eyes of someone outside the search space (especially if they see THEIR ad on a sh*t site). It’s going to take a long time for advertisers’ ROI to dip low enough for them to realize that all the trash they are paying for shouldn’t be considered a cost of doing business.

The Rules of SE Ambiguity

Duplicate content and scrapers are going to be ongoing concerns for the search engines for quite some time. As aggregation becomes simpler, so does abusing it. Button pushers are testing the limits all the time to establish thresholds. This is also why you won’t get black and white answers from webmaster guidelines. They are not going to tell you anything even remotely close to the variable thresholds. Google and Yahoo are not going to tell you that you can copy up to 20% of a page’s content and not get kicked; then nearly every page on the net (or at least in SEO-land) would have 19.9999% dupe content. Most likely, this is not how the content is filtered anyhow. It’s more likely that “overlays” and pattern matching are used.
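The engines don’t publish how their filters actually work, but one textbook approach to near-duplicate detection, word shingling with Jaccard resemblance, gives a feel for the kind of pattern matching involved. This is a generic sketch of the academic technique, not a claim about what Google or Yahoo really run:

```python
# Near-duplicate detection via word shingles: each document becomes a set of
# k-word windows, and similarity is the Jaccard overlap of those sets.
# A generic illustration of the technique, not any engine's actual filter.

def shingles(text, k=3):
    """Set of k-word windows, lowercased. Order-sensitive fingerprints."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a, b, k=3):
    """Jaccard similarity of shingle sets: 1.0 = identical, 0.0 = disjoint."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

original = "the quick brown fox jumps over the lazy dog"
scraped  = "the quick brown fox jumps over a sleepy dog"
score = resemblance(original, scraped)   # high overlap despite edited tail
```

Notice that a shingle threshold behaves nothing like a simple “X% copied” rule: swapping a few words breaks every shingle that spans them, which is one reason a published percentage cutoff would be both gameable and misleading.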

Don’t be collateral damage

Understanding how duplicate content filtering works is important to those that are legitimately aggregating portions of content on their site. You don’t want to create a problem in the engines, just so your users can read CNN headlines on your website. Try to keep the content as unique as possible, and you won’t end up as collateral damage.

Unique content also poses an interesting question: what is “quality content”? Do you need to be Harvard educated to write a piece on the sociological implications of search on the political structure of our country? If I mention Harvard and use big three- and four-syllable words, will the assumption be that my content is fit for college-graduate readership, and thus deemed higher “quality”?

What’s Your Site’s Reading Level?

My guess is that as the benefits of being more relevant improve for the search engines (there are certainly arguments for why efficiency is not currently in the engines’ best interest), they will get increasingly better at determining the “reading levels” of sites. Someone searching for information on investing in a stock portfolio or 401k plan will not want to read an article written by a high school dropout, or click through a strange looking page with ads at the top to get to what they are really looking for. Whenever you are pondering a subject like duplicate content, you have to assume that, at the very least, G is three moves ahead of you. What are their next three moves AFTER duplicate content detection? Think ahead, and you won’t be worrying about increasing your site’s reading grade level in a year, or talking about the “11th grade penalty vs. filter”.
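If you want a rough sense of your own copy’s grade level today, the classic Flesch-Kincaid formula is easy to compute yourself. The syllable counter below is a crude vowel-group heuristic (real readability tools use pronunciation dictionaries), so treat the output as a ballpark signal, not a score any engine uses:

```python
# Back-of-the-envelope Flesch-Kincaid grade level.
# Syllables are estimated by counting vowel groups, with a rough silent-e
# adjustment -- crude, but enough to separate plain prose from jargon.
import re

def count_syllables(word):
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1          # rough silent-e adjustment
    return max(n, 1)

def grade_level(text):
    """Flesch-Kincaid grade: 0.39*(words/sentence) + 11.8*(syllables/word) - 15.59"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words) - 15.59)
```

Run it on a plain paragraph and a jargon-heavy one and the gap is obvious; whether that gap ever becomes a ranking signal is exactly the “three moves ahead” question.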


More information about Todd Malicoat aka stuntdubl.
