
How to Remedy Duplicate Content and Magical % Thinking


Unique content is a valuable commodity. There was a discussion on duplicate content in the WMW supporters forum a few days ago that I thought was worthy of a post for those who aren’t subscribed there (you should be, though!).

Duplicate content has become a big area of misinformation, with everyone concerned that they have hit a “duplicate content filter” or been penalized for duplicate content. Chances are you haven’t been banned or penalized unless you really have very little unique content on your entire site. For this reason, I thought I’d dig a little further into dupe content and remedies so I have a reference document for later.

Duplicate Content is not a Magical Percentage

If it were as easy as saying that any page with more than 42% duplicate content will be filtered from the search results, then all site owners and SEOs would probably use 40% duplicate content as filler for every page. It IS NOT a percentage. There may be percentage variables that apply, but the first step to understanding duplicate content is to get out of the “magical percentage” line of thinking.

From this paper on Finding near-replicas of documents on the web, there are a few clues about the way SEs may handle duplicate content:

Clustering exact copies by checksum

Comparing doc size for exact or near-exact web pages
This is generally how people think of duplicate content detection in terms of the “magical percentage.” As long as you have 20% unique content you’ll be fine…Riiiight. This overly simplistic method sits at the core of dupe content detection, but it ignores the other techniques that may be layered on top of it. Many people never consider duplicate content detection beyond this method, and thus get stuck in the “magical percentage” line of thinking. (The checksum and size checks are sketched in code after this list.)

Computing all-pairs document similarity


“Chunking” documents and searching for similarities and flagging them for a “second look”

The resulting document is then chunked into smaller units…

Understanding some of the methods for filtering duplicate content is the first step in getting beyond the “magical percentage” thinking (from here on referred to as MP thinking). Imagine 10 different documents that all pull 5 lines of text from 3 documents containing 20 lines of text. The 10 different documents are most likely “unique” if the randomization settings are done well. They will all, however, have different levels of percentage similarity. Now, before reverting to the line of thinking that says “which level will get me penalized?”, consider other options for scoring the relevance of these pages. Consider also that it takes multiple iterations of processing to determine the similarities between ALL documents. Now, as a relevance engineer…how would YOU handle that mess?

Sort-based approach

Sorting and finding overlap.

Probabilistic-counting-based approach

Comparing the probability of dupe content based on footprint “sets” of different types when there is overlap between documents
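To make the first two techniques in the list above concrete, here is a minimal sketch (my own Python, not anything from the paper) of clustering exact copies by checksum and flagging similar-sized documents for a closer comparison. The document text, the normalization step, and the 5% size tolerance are all made-up assumptions for illustration.

```python
import hashlib

# Hypothetical example documents (IDs and text are made up for illustration).
docs = {
    "page-a": "Widgets are great. Buy widgets here.",
    "page-b": "Widgets are great.  Buy widgets here. ",   # same text, extra whitespace
    "page-c": "Widgets are great. Buy blue widgets here.",
}

def normalize(text):
    # Collapse whitespace and lowercase before hashing (an assumed normalization step).
    return " ".join(text.lower().split())

# 1. Cluster exact copies by checksum of the normalized text.
clusters = {}
for doc_id, text in docs.items():
    digest = hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()
    clusters.setdefault(digest, []).append(doc_id)

exact_dupes = [ids for ids in clusters.values() if len(ids) > 1]
print("Exact-copy clusters:", exact_dupes)   # [['page-a', 'page-b']]

# 2. Flag documents whose sizes fall within a small tolerance for a "second look".
SIZE_TOLERANCE = 0.05  # assumed 5% threshold, purely illustrative
sizes = {doc_id: len(normalize(text)) for doc_id, text in docs.items()}
ids = list(sizes)
near_size_pairs = [
    (a, b)
    for i, a in enumerate(ids)
    for b in ids[i + 1:]
    if abs(sizes[a] - sizes[b]) <= SIZE_TOLERANCE * max(sizes[a], sizes[b])
]
print("Size-based candidates for a closer comparison:", near_size_pairs)
```

Neither check says anything about how *similar* two non-identical pages are; they only narrow down which pairs deserve the more expensive comparisons described next.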

Understanding Duplicate Content Filtering

Okay, you’re no longer an MP thinker. You’ve moved beyond wishing for the percentage you can push the limit to, and have agreed that you probably need a content writer to put something worthwhile on your site.

My other favorite whitepaper on duplicate content is:
Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content

From this we get some ideas on different levels of severity for dupe content:

Level 1 — Structural and content identity.
Every page on host A with relative path P, (i.e., a URL of the form http://A/P) is represented by a byte-wise identical page on host B, at location http://B/P, and vice versa.

Level 2 — Structural identity. Content equivalence.
Every page on host A with relative path P, is represented by an equivalent content page on host B, at location http://B/P, and vice versa.

Level 3 — Structural identity. Content similarity.
Every page on host A with relative path P, is represented by a highly similar page on host B, at location http://B/P, and vice versa.

Level 4 — Partial structural match. Content similarity.
Some pages on host A with relative path P, are represented by a page on host B, at location http://B/P, and vice versa, and these pairs of pages are highly similar.

Level 5 — Structural identity. Related content.
Every page on host A with relative path P, is represented by a page on host B, at location http://B/P, and vice versa. The pages are pair-wise related (e.g., every page is a translation of its counterpart) but in general are not syntactically similar.

Mismatch — None of the above.

Duplicate Content Penalties, Filters, and Bannings

It should be noted that this is my own gut feel on the topic, and there’s a very good chance I may be wrong. Take it with a grain of salt. In case you missed it, I’ve ranted about them before.

Using the levels from above:
Level 1 – Banned
Example: DMOZ/wiki clones.
Level 2 – Banned
Example: a page scraped from a site using relative URLs.
Level 3 – Partial penalization and/or filtering on some content depending on severity of duplication.
Example: same CMS system with some dupe content, i.e., osCommerce and stock product manufacturer descriptions.
Level 4 – Possible penalization and/or filtering on some content depending on severity of duplication.
Example: similar to Level 3 – similar content and CMS system. Two widget forums both ran on phpBB or vBulletin, had similar categories of content, and allowed people to post the same content in both places (creating some exact dupe content) or aggregate RSS feeds.
Level 5 – Not much to worry about.
Example: two widget forums both ran on vBulletin and had similar categories of content.
Mismatch – Best case scenario. This is what you’re striving for. Having NO duplicate content indexed is the ideal. Your best bet is to keep all duplicate content from being indexed at all, and to make sure that if you use out-of-the-box solutions, you change up the “footprints” a bit.

Filter – You have some duplicate content within your own site or on an external site, or you have a lack of unique content. You’ll most likely end up with these PAGES in the supplemental index. Filters are generally page-level problems that decrease rankings.

Penalty – You’ve served duplicate content one too many times. You may have served the spiders the same content so many times that they won’t come around as often (calendar software or session IDs are good examples of this). With a penalty, you may get your website spidered on a less frequent or more superficial basis (meaning you won’t get deep crawled). Penalties can be page- or site-level issues with varying degrees of severity that decrease rankings.

Bannings – Chances are you’ll KNOW when you’re banned; otherwise it’s most likely a penalty or filter. Chances are the only ways to get outright BANNED for dupe content are cloaking others’ content, violating the DMCA, or other severely egregious offenses where you KNOW what you did. If someone did this with your branded site, you’d better start practicing your grovelling and develop the story of how it’s all your shady SEO’s fault. Bannings are a pleasant way of someone telling you that you’re f*cked.

Okay Smart Guy, What the Hell are Shingles?

From yet another fantastic PDF, we get some insight on shingles (why in the world do smart people insist on using PDFs?).

Shingles
A k-shingle is a sequence of k consecutive words. For example, the 3-shingles of “The quick brown fox jumped over” are:

  • The quick brown
  • quick brown fox
  • brown fox jumped
  • fox jumped over

I think TRUST plays a big role in determining which sites/pages are flagged for resemblance checks of shingles. Meaning not all sites are held to the same standards – which also makes it impossible to ever predict the magical percentage. Basing reviews on trust is probably one of the biggest helps in reducing the sample size, which, in turn, reduces the processing power required for such a massive amount of data relevance interpretation. All documents will build a set of associated shingles over time. This is why delivering content strategically based on what is the unique content of your site is so important.
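To ground the definition, here is a small sketch (my own Python, not taken from the PDF) that builds the 3-shingles of two short texts and compares them with the Jaccard coefficient; real systems compare hashed samples of shingles (“sketches”) rather than raw word tuples, so treat this purely as an illustration of the idea.

```python
def shingles(text, k=3):
    """Return the set of k-shingles (k consecutive words) for a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a, b, k=3):
    """Jaccard coefficient of the two shingle sets: |A & B| / |A | B|."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

# The example sentence from above, plus a lightly reworded copy (made up for illustration).
original = "The quick brown fox jumped over the lazy dog"
reworded = "The quick brown fox leaped over the lazy dog"

print(sorted(shingles(original)))              # seven 3-shingles, e.g. ('the', 'quick', 'brown')
print(round(resemblance(original, reworded), 2))  # 0.4: one changed word still leaves 4 of 10 distinct shingles shared
```

The point of the example is that changing a single word only removes the handful of shingles that contain it; most of the shingle set survives light rewording, which is exactly what makes resemblance checks more robust than a raw percentage comparison.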

Jake’s Top 6 Duplicate Content Mistakes
Courtesy of Mr. Baked’s duplicate content presentation, and Barry’s and Mike’s coverage of the session, here are the top 6 duplicate content mistakes:

  1. Circular Navigation
  2. Print-Friendly Pages
  3. Inconsistent Linking
  4. Product Only Pages
  5. Transparent Serving
  6. Bad Cloaking

To get Jake’s fixes, you’ll have to attend his Duplicate content session at SES San Jose.

Techniques for Remedying Duplicate Content

Don’t have the same content indexed in two places!!! Be consistent with your linking structure!
Robots.txt – Site Level
Don’t let the bots near your folders of duplicate content – keep it all in one place for users, and block the bots from it (a quick sanity check for these rules is sketched below, after the rel=nofollow note).

Meta robots tag – Page Level
Use variations of the robots meta tag to allow spiders to index, noindex, follow, or nofollow the given pages.

Rel=Nofollow – Link/Block Level beta
Okay, this is what I’m talking about as a potential positive for rel=nofollow. The trouble is that support for it from the engines is shaky at best, and I’ve been shown examples where it flat out doesn’t work. I wouldn’t rely on it at this point, but it’s probably worthwhile to use it as a failsafe to keep spiders from getting to certain areas of your site.
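As a rough sanity check for the robots.txt remedy above, the sketch below uses Python’s standard urllib.robotparser to confirm which URLs a well-behaved bot would skip. The /print/ and /archive/ folders and the example.com URLs are hypothetical stand-ins for wherever your duplicate content lives, not a recommendation of specific paths.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt blocking folders that hold duplicate versions of content
# (print-friendly pages, date archives). The paths are made up for the example.
robots_txt = """\
User-agent: *
Disallow: /print/
Disallow: /archive/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

for url in [
    "http://www.example.com/widgets/blue-widget.html",  # unique content: crawlable
    "http://www.example.com/print/blue-widget.html",    # print-friendly duplicate: blocked
    "http://www.example.com/archive/2006/07/",           # archive duplicate: blocked
]:
    allowed = parser.can_fetch("*", url)
    print(f"{url} -> {'crawl' if allowed else 'blocked'}")
```

This only tells you what a rule-abiding crawler would do; it does nothing for duplicate URLs that are already indexed, which is where consistent internal linking (and, where supported, the meta robots tag) comes in.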

Other Remedies for Duplicate Content
I-frames with 0 border – put the duplicate content on a separate page and use noindex, nofollow.

Text to image – Thanks to Web Professor.

Invisitext – Another brilliant script from Web Professor.

Just like reciprocal links, poor titles, run-of-site links, and a multitude of other SE variables, duplicate content sometimes occurs naturally on the web. It is not inherently a bad thing that will hurt your site. What CAN hurt your site is not understanding how to handle duplicate content, and having spiders spend time indexing your duplicate content when they could be grabbing your good, unique, value-added content.

To truly understand duplicate content issues, you need to learn what the problems associated with duplicate content are from an SE perspective, as well as HOW they are trying to remedy those issues. Understanding the strategies the SEs are using to improve relevance (in this case, trying to de-dupe their index) is important to developing strategies for new and existing sites in the months and years to come.

More duplicate content reading and resources

More Whitepapers on Duplicate Content (PDFs and required registration)

More information about Todd Malicoat aka stuntdubl.


  • chris

    I find this very interesting. I am wondering what happens with sites like [site edited]
    where they simply write one line, then take a paragraph or reword what someone else has written. He seems to be doing alright with over 220,000 page views a month for the last year. Could this be a penalty site?

    **admin note
    Site edited…not big on giving specific site examples.

  • http://www.whatsmyip.com.au/ Lea

    It’s very hard to determine when duplicates are a problem. A company I host some sites with has their main domain, their support domain, and several domains that seem to be the physical servers – all serving the same content. They aren’t banned; they appear in the SERPs for various phrases, but I’ve mentioned to them they should 301 all those superfluous domains to minimise the dups – they are effectively competing with themselves :(

  • http://www.betterwayz.com Dfasdy

    Thanks for sharing a rather in-depth explanation of the dup content issue. Hopefully, SEs will continue getting better at ranking duplicated content. I have found many instances where the originating site is nowhere in sight, whereas the copying site continues to move up the ladder.

  • http://www.isulongseoph.org/ Isulong Seoph

    I just wonder how these SEs consider websites using a CMS or those very common blog software packages like WordPress, which have a default template used by so many.

    I also wonder about sites I came across that are obviously using duplicated content and are still topping the SERPs for a long time now.

  • http://www.my-myspace-layouts.com J-Weezy

    Great blog… excellent writing as always!

  • Pingback: Interviewed, and Some Other Random Musings » SEO by the SEA

  • Pingback: Unofficial SEO Blog » Duplicate Content Issues - SEO Information much before its official!

  • http://www.macalua.com Marc Macalua

    Great post Todd. Read some parts of it from the supporter forum but I think I missed these golden nuggets:

    “All documents will build a set of associated shingles over time.”

    and

    “This is why delivering content strategically based on what is the unique content of your site is so important.”

    Before I used to ask myself why anyone would want to block bots via robots.txt, but after seeing it done for one ecommerce mega site, I think it’s the next best anti-duplicate content fix to do after deep trusted links and fresh/unique content generation.

    Marc

  • Pingback: blogHelper » Duplicate Content in Blogs: The Problem

  • http://studio9.ws al toman

    In testing SEO theory and Google, I’ve purposely created dupe pages (very obvious and annoying to anyone who actually lands on these pages). I have two topic-sets of them. They’ve been up for a bit and all outrank (PR-wise) the relevant, good, down-to-earth, meaty web pages, all of which are exhibiting lower PR. Consequently, I’ve removed the GOOD low-ranking stuff to see whether or not the high-PR junk dupe pages will bring up the site’s PR overall.

    I previously tested Google’s response to cloaking, same color font on same color background, etc, and I’d probably be into the 8th of my 9 lives before Google actually does anything about it. It takes them about 2 years. Consequently, I feel fairly safe from the Google Girls.

    =====

    Considering that the no-follow isn’t necessarily always no-followed, I’m working on an
    easy way to “no-follow” in-line links using a bit of PHP script, robots.txt, .htaccess, and redirects. I should have that in working order for reality testing shortly. It appears transparent and seamless to the link clicker duders.

    In the meanwhile, dupe your content smartly; Google won’t even notice, and will assign it all high PR as a bonus.

    Kind Regards,
    Al Toman

  • Pingback: Well you thought there was a duplicate content filter huh? - Jaan Kanellis Search Marketing Blog

  • http://www.mr-seo.com Mr SEO

    Supplemental index will be the result.

  • Pingback: » Duplicate Content Misunderstandings

  • Pingback: 5 Ways To Reduce The Chances Your Pages Won’t Get The Duplicate Content Penalty. · Marty Fiegl

  • http://www.ragepank.com/articles/43/duplicate-content/ Duplicate Content

    I have a fairly simplified approach to duplicate content, which holds true most of the time.

    “If your snippet is the same as another page’s snippet, then Google considers it duplicate content”.

    So have a think about how snippets are created and you are on the right track. Keep those meta descriptions unique, or don’t use them at all. Make sure the text around your desired search phrase is unique, and this will help too.

    And NEVER submit your site to a directory using text off your own site – if the directory has stronger links than you, then it may well rank instead of you. I have seen this happen on a number of sites.

    My 2c, explained further at http://www.ragepank.com/articles/43/duplicate-content/

  • Pingback: BrianStocker.org » Duplicate Content and Similar Content

  • Pingback: The Truth About Duplicate Content Penalization » TP

  • Pingback: The Organic SEO

  • Pingback: Nashua Broadband - Duplicate Content Escapees « musings

  • Pingback: » How to avoid duplicate content in search engine promotion

  • http://www.fineshoppingnetwork.com Wan

    Ever wonder why ezinearticles.com which is probably the biggest source of duplicate content on the web never gets punished or penalized but gets rewarded with high ranking for many keywords? It’s all about trust.

  • Pingback: Natpal, Doppleganger Sites And Other Counter-Productive Marketing Techniques | Website Promotion is not Voodoo

  • http://lunchpauze.blogspot.com/ Robin

    I’ve created a very, very simple little test to measure the resemblance between two URLs. It uses 3-shingles and the Jaccard measure.
    It just illustrates some of the principles mentioned in this article.
    Here it is: http://www.wetrade.nl/duptest/

  • Pingback: Skip SEO For A Minute, Work On Your Writing Skills - Daily Blogging Tips and Web 2.0 Development

  • http://www.searchtempo.com Brisbane SEO Guy

    A very interesting discussion. How much weight does the order of paragraphs have? Is it possible to avoid the penalty by:

    Changing the title
    Changing the article description or snippet
    Reordering the paragraphs
    Using synonyms for key words

  • Pingback: Aaron Wall's SEO Book.com

  • Pingback: Web Site Optimization » [Video] Creating Your Site’s Internal Link Structure for Google and Searchers

  • Pingback: [Video] Creating Your Site’s Internal Link Structure for Google and Searchers


  • Pingback: When is Duplicate Content a Good Thing? | Seo Design Solutions Blog

  • Pingback: How search engines work : QuickstartSEO.com

  • http://www.realstudio.ro/ RealStudio

    As time passes, it’s more and more difficult to come up with original content. And as there are more sites with “how-to”s or “why”s, it’s very likely that they have one or more common sources: the top-ranked pages when searching for that word.

  • James MacFarlane

    Never use an apostrophe when making a word plural.

    “Using the level’s from above” and “PDF’s”

    It’s just “levels” and “PDFs”. An apostrophe makes it possessive.

  • Pingback: Finding Duplicate Content with Free Tools - Shimon Sandler - SEO Consultant

  • Pingback: How Search Engines Work: Search Engine Relevancy Reviewed - Entertane.com – Tech News
