Stuntdubl Business Search Marketing Consulting

How to Remedy Duplicate Content and Magical % Thinking

Unique content is a valuable commodity. There was a discussion on duplicate content in the WMW supporters forum a few days ago that I thought was worthy of a post for those who aren’t subscribed there (you should be, though!).

Duplicate content has become a big area of misinformation, with everyone concerned that they have hit a “duplicate content filter” or been penalized for duplicate content. Chances are you haven’t been banned or penalized unless you really have very little unique content on your entire site. For this reason, I thought I’d dig a little bit further into dupe content and its remedies so I have a reference document for later.

Duplicate Content is not a Magical Percentage

If it were as easy as saying that any page with more than 42% duplicate content will be filtered from the search results, then all site owners and SEOs would probably grab 40% duplicate content as filler for every page. It IS NOT a percentage. There may be percentage variables that apply, but the first step to understanding duplicate content is to get out of the “magical percentage” line of thinking.

From this paper on Finding near-replicas of documents on the web there are a few clues into the way SE’s may handle duplicate content:

Clustering exact copies by checksum
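
To make the checksum idea concrete, here is a minimal sketch of grouping byte-identical pages. The URLs and page text are made up for illustration, and SHA-256 is simply my choice of checksum; the paper does not tell us which one any search engine actually uses.

```python
import hashlib
from collections import defaultdict

def cluster_by_checksum(pages):
    """Group URLs whose page bodies are byte-for-byte identical."""
    clusters = defaultdict(list)
    for url, body in pages.items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        clusters[digest].append(url)
    # Only groups holding more than one URL are exact duplicates.
    return [sorted(urls) for urls in clusters.values() if len(urls) > 1]

# Hypothetical pages: the third differs by a single character.
pages = {
    "http://a.example/widgets": "Blue widgets are great.",
    "http://b.example/widgets": "Blue widgets are great.",
    "http://c.example/widgets": "Blue widgets are great!",
}
duplicates = cluster_by_checksum(pages)
print(duplicates)
```

Note that one changed character is enough to escape a checksum match, which is why exact-copy clustering can only ever be a first pass.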

Comparing doc size for exact or near exact webpages
This is generally how people think of duplicate content detection in terms of the “magical percentage”: as long as you have 20% unique content you’ll be fine… Riiiight. Size comparison sits at the core of dupe content detection, but on its own it is overly simplistic and ignores the other techniques that may be applied. Many people never consider detection methods beyond this one, and thus get stuck in the “magical percentage” line of thinking.

Computing all-pairs document


“Chunking” documents and searching for similarities and flagging them for a “second look”

The resulting document is then chunked into smaller units…

Understanding some of the methods for filtering duplicate content is the first step in getting beyond the “magical percentage” thinking (from here on referred to as MP thinking). Imagine 10 different documents that all pull 5 lines of text from 3 documents containing 20 lines of text. The 10 different documents are most likely “unique” if the randomization settings are done well. They will all, however, have different levels of percentage similarity. Now, before reverting to the line of thinking that asks “which level will get me penalized?”, consider other options for scoring the relevance of these pages. Consider also that it takes multiple iterations of processing to determine the similarities between ALL documents. Now, as a relevance engineer… how would YOU handle that mess?
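
The thought experiment is easy to try for yourself. The sketch below is only an illustration under my own assumptions: each page takes 5 random lines from each of the 3 source documents, the source text is placeholder strings, and overlap is measured as shared lines divided by all lines across the pair.

```python
import random
from itertools import combinations

random.seed(0)  # fixed seed so the sketch is repeatable

# Three source documents of 20 placeholder lines each.
sources = [[f"source {s} line {n}" for n in range(20)] for s in range(3)]

def build_page(lines_per_source=5):
    """Assemble a page from randomly chosen lines of every source."""
    page = []
    for doc in sources:
        page.extend(random.sample(doc, lines_per_source))
    return set(page)

pages = [build_page() for _ in range(10)]

def overlap(a, b):
    """Share of lines two pages have in common."""
    return len(a & b) / len(a | b)

# Every page has to be compared against every other page.
scores = [overlap(a, b) for a, b in combinations(pages, 2)]
print(f"{len(scores)} pairs, overlap from {min(scores):.0%} to {max(scores):.0%}")
```

Ten pages already need 45 comparisons, and the scores are spread over a range rather than sitting on one number. That is the mess the relevance engineer has to sort out, and there is no single percentage in it to aim for.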

Sort-based approach

Sorting and finding overlap.

Probabilistic-counting based approach

Estimating the probability of dupe content from the overlap between documents’ footprint “sets” of different types.

Understanding Duplicate Content Filtering

Okay, you’re no longer an MP thinker. You’ve moved beyond wishing for the percentage you can push the limit to, and have agreed that you probably need a content writer to put something worthwhile on your site.

My other favorite whitepaper on duplicate content is:
Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content

From this we get some ideas on different levels of severity for dupe content:

Level 1 — Structural and content identity.
Every page on host A with relative path P (i.e., a URL of the form http://A/P) is represented by a byte-wise identical page on host B, at location http://B/P, and vice versa.

Level 2 — Structural identity. Content equivalence.
Every page on host A with relative path P is represented by an equivalent content page on host B, at location http://B/P, and vice versa.

Level 3 — Structural identity. Content similarity.
Every page on host A with relative path P is represented by a highly similar page on host B, at location http://B/P, and vice versa.

Level 4 — Partial structural match. Content similarity.
Some pages on host A with relative path P are represented by a page on host B, at location http://B/P, and vice versa, and these pairs of pages are highly similar.

Level 5 — Structural identity. Related content.
Every page on host A with relative path P is represented by a page on host B, at location http://B/P, and vice versa. The pages are pair-wise related (e.g., every page is a translation of its counterpart) but in general are not syntactically similar.

Mismatch — None of the above.

Duplicate Content Penalties, Filters, and Bannings

It should be noted that this is my own gut feel on the topic, and there’s a very good chance I may be wrong. Take it with a grain of salt. In case you missed it, I’ve ranted about them before.

Using the levels from above:
Level 1 - Banned
example: dmoz/wiki clones
Level 2 - Banned
example: page scraped from a site using relative URLs
Level 3 - Partial penalization and/or filtering on some content depending on severity of duplication.
example: same CMS, some dupe content, e.g. osCommerce and stock product manufacturer descriptions.
Level 4 - Possible penalization and/or filtering on some content depending on severity of duplication.
example: similar to #3 - similar content and CMS. Two widget forums both run on phpBB or vBulletin, have similar categories of content, and allow people to post the same content in both places (creating some exact dupe content) or aggregate RSS feeds.
Level 5 - Not much to worry about
example: two widget forums both run on vBulletin and have similar categories of content

Mismatch = best case scenario - This is what you’re striving for. Having NO duplicate content indexed is the ideal. Your best bet is to keep all duplicate content from being indexed at all, and make sure if you use out of the box solutions that you change up the “footprints” a bit.

Filter - You have some duplicate content within your own site or an external site, or you have a lack of unique content - You’ll most likely end up with these PAGES in the supplemental index. Filters are generally page-level problems that decrease rankings.

Penalty - You’ve served duplicate content one too many times. You may have served the spiders the same content so many times that they won’t come around as often (calendar software or session IDs are good examples of this). With a penalty, you may get your website spidered on a less frequent or more superficial basis (meaning you won’t get deep crawled). Penalties can be page- or site-level issues with varying degrees of severity that decrease rankings.

Bannings - Chances are you’ll probably KNOW when you’re banned - Otherwise it’s most likely a penalty or filter. Chances are the only ways to get outright BANNED for dupe content are cloaking others’ content, violating the DMCA, or other severely egregious offenses where you KNOW what you did. If someone did this with your branded site, you better start practicing your grovelling, and develop the story of how it’s all your shady SEO’s fault. Bannings are a pleasant way of someone telling you that you’re f*cked.

Okay Smart Guy, What the Hell are Shingles?

From yet another fantastic PDF, we get some insight on shingles (why in the world do smart people insist on using PDFs?).

Shingles
A k-shingle is a sequence of k consecutive words

  • The quick brown
  • quick brown fox
  • brown fox jumped
  • fox jumped over
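
The shingles listed above are the k = 3 case, and they take only a few lines to reproduce. The sketch below also scores how alike two texts are by dividing the shingles they share by all the shingles they contain between them (the Jaccard measure). That scoring choice and the sample rewrite are my own assumptions for illustration, not a claim about how any engine scores pages.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles found in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a, b, k=3):
    """Shared shingles divided by all shingles across both texts."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 0.0  # nothing to compare
    return len(sa & sb) / len(sa | sb)

original = "The quick brown fox jumped over"
rewrite = "The quick brown fox leaped over"  # one word swapped

print(sorted(shingles(original)))
print(resemblance(original, rewrite))
```

Swapping a single word out of six changes two of the four shingles, so the two texts share only 2 of 6 distinct shingles. Word-level percentages and shingle-level resemblance are very different numbers, which is one more reason the magical percentage doesn’t exist.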

I think TRUST plays a big role in determining which sites/pages are flagged for resemblance checks of shingles. Meaning not all sites are held to the same standards - which also makes it impossible to ever predict the magical percentage. Basing reviews on trust is probably one of the biggest helps in reducing the sample size, which in turn reduces the processing power required for such a massive amount of data relevance interpretation. All documents will build a set of associated shingles over time. This is why delivering content strategically based on what is the unique content of your site is so important.

Jake’s Top 6 Duplicate Content Mistakes
Courtesy of Mr. Baked’s duplicate content presentation, and Barry’s and Mike’s coverage of the session, come the top 6 duplicate content mistakes:

  1. Circular Navigation
  2. Print-Friendly Pages
  3. Inconsistent Linking
  4. Product Only Pages
  5. Transparent Serving
  6. Bad Cloaking

To get Jake’s fixes, you’ll have to attend his Duplicate content session at SES San Jose.

Techniques for Remedying Duplicate Content

Don’t have the same content indexed in two places!!! Be consistent with your linking structure!
Robots.txt - Site Level
Keep your duplicate content in one set of folders for users, and don’t let the bots near it.
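
Here is a rough sketch of what that looks like in practice, checked with Python’s standard robots.txt parser. The /print/ folder and the example.com URLs are hypothetical stand-ins for wherever your duplicate content lives.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: /print/ holds the duplicate print-friendly pages.
robots_txt = """\
User-agent: *
Disallow: /print/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether a compliant bot may fetch each URL.
print_ok = parser.can_fetch("*", "http://www.example.com/print/widget.html")
page_ok = parser.can_fetch("*", "http://www.example.com/widget.html")
print(print_ok, page_ok)
```

This only tells you what a well-behaved spider should do with those rules. Testing them before you publish is cheap insurance against blocking the wrong folder.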

Meta robots tag - Page Level
Use variations of the meta robots tag to tell spiders whether to index or noindex, and follow or nofollow, a given page.

Rel=Nofollow - Link/Block Level beta
Okay, this is what I’m talking about as a potential positive for rel=nofollow. The trouble is that support for it from the engines is shaky at best, and I’ve been shown examples where it flat out doesn’t work. I wouldn’t rely on it at this point, but it’s probably worthwhile to use it as a failsafe to keep spiders from getting to certain areas of your site.

Other Remedies for Duplicate Content
Iframes with 0 border - put the duplicate content on a separate page and apply noindex, nofollow to it.

Text to image - Thanks to Web Professor.

Invisitext - Another brilliant script from Web Professor.

Just like reciprocal links, poor titles, run-of-site links, and a multitude of other SE variables, duplicate content occurs naturally on the web sometimes. It is not inherently a bad thing that can hurt your site. What CAN hurt your site is not understanding how to handle duplicate content, and having spiders spend time indexing your duplicate content when they could be grabbing your good, unique, value-added content.

To truly understand duplicate content issues, you need to learn what the problems associated with duplicate content are from a SE perspective, as well as HOW they are trying to remedy those issues. Understanding the strategies the SE’s are using to improve relevance (in this case, trying to de-dupe their index), is important to developing strategies for new and existing sites in the months and years to come.

More duplicate content reading and resources

More Whitepapers on Duplicate Content (PDFs and required registration)

31 Comments

chris
June 12th, 2006,
5:49 pm

I find this very interesting. I am wondering what happens with sites like [site edited],
where they simply write one line and then take a paragraph or reword what someone else has written. He seems to be doing alright, with over 220,000 page views a month for the last year. Could this be a penalty site?

**admin note
Site edited…not big on giving specific site examples.

Lea
June 12th, 2006,
7:52 pm

It’s very hard to determine when duplicates are a problem. A company I host some sites with has their main domain, their support domain, and several domains that seem to be the physical servers - all serving the same content. They aren’t banned; they appear in the SERPs for various phrases, but I’ve mentioned to them they should 301 all those superfluous domains to minimise the dups - they are effectively competing with themselves :(

Dfasdy
June 13th, 2006,
4:49 am

Thanks for sharing a rather in-depth explanation of the dup content issue. Hopefully, SEs will continue getting better at ranking duplicated content. I have found many instances where the originating site is nowhere in sight, whereas the copying site continues to move up the ladder.

Isulong Seoph
June 13th, 2006,
7:24 am

I just wonder how these SEs consider websites using a CMS or those very common blog software packages like WordPress, which have a default template that is used by so many.

I also wonder about sites I came across that are obviously using duplicated content and have still been topping the SERPs for a long time now?

J-Weezy
June 13th, 2006,
9:35 am

Great blog… excellent writing as always!

Interviewed, and Some Other Random Musings » SEO by the SEA
June 13th, 2006,
11:55 am

[…] Todd Malicoat came out with fine article on duplicate content a couple of days ago that I highly recommend - How to Remedy Duplicate Content and Magical % Thinking Email author | […]

Unofficial SEO Blog » Duplicate Content Issues - SEO Information much before its official!
June 14th, 2006,
2:59 am

[…] There is also a post by Todd Malicoat (aka Stuntdubl) which is one of the comprehensive guides when it comes to duplicate content. These two papers are more than worth a read. […]

Marc Macalua
June 15th, 2006,
12:04 pm

Great post Todd. Read some parts of it from the supporter forum but I think I missed these golden nuggets:

“All documents will build a set of associated shingles over time.”

and

“This is why delivering content strategically based on what is the unique content of your site is so important..”

Before I used to ask myself why anyone would want to block bots via robots.txt, but after seeing it done for one ecommerce mega site, I think it’s the next best anti-duplicate content fix to do after deep trusted links and fresh/unique content generation.

Marc

blogHelper » Duplicate Content in Blogs: The Problem
June 17th, 2006,
11:57 pm

[…] If you need deeper understanding of the duplicate content issue, check out recently written detailed articles on the subject over at Stuntdubl and SEO by the Sea. After reading those, come back here for the implications of the duplicate content issue on WordPress. […]

al toman
July 5th, 2006,
8:10 am

In testing SEO theory and Google, I’ve purposely created dupe pages (very obvious and annoying to anyone who actually lands on these pages). I have two topic-sets of them. They’ve been up for a bit and all out rank (PR) the relevant content, good down-to-earth, meaty web pages, all of which are exhibiting lower PR. Consequently, I’ve removed the GOOD low ranking stuff to see whether or not the high PR junk dupe pages will bring up the site’s PR, overall.

I previously tested Google’s response to cloaking, same color font on same color background, etc, and I’d probably be into the 8th of my 9 lives before Google actually does anything about it. It takes them about 2 years. Consequently, I feel fairly safe from the Google Girls.

=====

Considering that the no-follow isn’t necessarily always no-followed, I’m working on an
easy way to “no-follow” in-line links using a bit of php script, robot.txt, .htaccess, and re-directs. I should have that in working order for reality testing shortly. It appears transparent and seamless to the link clicker duders.

In the meanwhile, dupe your content smartly Google won’t even notice and assign it all high PR as a bonus.

Kind Regards,
Al Toman

Well you thought there was a duplicate content filter huh? - Jaan Kanellis Search Marketing Blog
July 19th, 2006,
11:45 pm

[…] Duplicate Content Issues and Search Engines by Bill Slawski How to Remedy Duplicate Content and Magical % Thinking by Todd Malicoat […]

Mr SEO
July 24th, 2006,
7:28 pm

Supplemental index will be the result.

» Duplicate Content Misunderstandings
August 1st, 2006,
9:14 am

[…] I think the subject of duplicity is fully covered here: http://www.stuntdubl.com/2006/06/12/dupe-content/ […]

5 Ways To Reduce The Chances Your Pages Won’t Get The Duplicate Content Penalty. · Marty Fiegl
September 23rd, 2006,
4:44 pm

[…] Want to learn more about duplicate content? Here’s a few helpful resources: Stuntdubl Profitbooks Profitpapers […]

Duplicate Content
September 30th, 2006,
6:08 am

I have a fairly simplified approach to duplicate content, which holds true most of the time.

“If your snippet is the same as another page’s snippet, then Google considers it duplicate content”.

So have a think about how snippets are created and you are on the right track. Keep those meta descriptions unique, or don’t use them at all. Make sure the text around your desired search phrase is unique, and this will help too.

And NEVER submit your site to a directory using text off your own site - if the directory has stronger links than you, then it may well rank instead of you. I have seen this happen on a number of sites.

My 2c, explained further at http://www.ragepank.com/articles/43/duplicate-content/

BrianStocker.org » Duplicate Content and Similar Content
October 26th, 2006,
8:25 pm

[…] The details of duplicate content, and similar content, are complex and interesting. A full discussion is here at studtdubl.com. […]

The Truth About Duplicate Content Penalization » TP
November 5th, 2006,
9:10 pm

[…] That sort of a slap can be dynasty-destroying. You’ll return to square one in terms of search engine visibility and be trying to get your old job at the liquor store back. Know what duplicate content is, what it isn’t - and then don’t do it. For more detail on the science of DC, including a break down of the different levels of severity, read this fantastic article by Todd Malicoat. Content Press Releases Uncategorized […]

The Organic SEO
December 29th, 2006,
6:25 pm

The State of Duplicate Content…

According to a number of search engine optimization experts, duplicate content (having essentially identical content in more than one online location) is a bad thing. My attempt here will not be to give my opinion or to conclude the argument……

Nashua Broadband - Duplicate Content Escapees « musings
March 16th, 2007,
6:54 am

[…] If you do any reading at all about search engines and website design/hosting, etc., then you MUST have noticed that almost EVERYBODY is talking about duplicate content issues. […]

» How to avoid duplicate content in search engine promotion
April 27th, 2007,
10:53 am

[…] See also How to Remedy Duplicate Content and Magical % Thinking (Stuntdubl) […]

Wan
June 17th, 2007,
12:29 am

Ever wonder why ezinearticles.com which is probably the biggest source of duplicate content on the web never gets punished or penalized but gets rewarded with high ranking for many keywords? It’s all about trust.

Natpal, Doppleganger Sites And Other Counter-Productive Marketing Techniques | Website Promotion is not Voodoo
June 17th, 2007,
8:21 am

[…] And if you’d like a longer more technical description, you can read this post by Todd Malicoat, which includes a link to a PDF describing the concept of “shingles” relative to duplicated content (this is more technical than I typically recommend and not for the faint of heart). In short, if you’re going to make a duplicate of a site you need to block the search engines from “indexing” (collecting data from) the duplicate - otherwise you run the risk of seriously hurting the real site. […]

Robin
September 11th, 2007,
6:19 pm

I’ve created a very very simple little test to measure the resemblance between two URLs. It uses 3-Shingles and the Jaccard measure.
It just illustrates some of the principles mentioned in this article.
Here it is: http://www.wetrade.nl/duptest/

Skip SEO For A Minute, Work On Your Writing Skills - Daily Blogging Tips and Web 2.0 Development
September 24th, 2007,
5:27 pm

[…] 1. Resurrect Your Writing, Redeem Your Soul 2. How to Remedy Duplicate Content and Magical % Thinking 3. Ten Tips for writing a blog post 4. 5 Simple Ways to Open Your Blog Post With a Bang 5. 7 Steps to Being Recognized as an Expert 6. How to improve writing skills with writing exercises 7. Scannable Content 8. Converting One off Visitors to your Blog into Regular Readers 9. 10 Tips on Writing the Living Web 10. Good Blog Writing Style 11. 11 Ways to Improve Your Writing and Your Business 12. Better Writing Through Design 13. 50 Tools that can Improve your Writing Skills 14. Writing Skills - Before You Write It Down, Know This 15. Everything You Need to Know About Writing Successfully - in Ten Minutes 16. Top Ten Writing Tips To Help You To Write More 17. An Introduction to Copywriting 18. Ten Ways to Improve Your Technical Writing 19. You don’t need permission to create 20. Improving your Writing (a great resource) Author: Daniel Vukadinovic […]

Brisbane SEO Guy
October 12th, 2007,
11:55 pm

A very interesting discussion. How much weight does the order of paragraphs have? Is it possible to avoid the penalty by:

Changing the title
Changing the article description or snippet
Reordering the paragraphs
Using synonyms for key words

Aaron Wall's SEO Book.com
November 2nd, 2007,
3:07 am

[Video] Creating Your Site’s Internal Link Structure for Google and Searchers…

Web Site Optimization » [Video] Creating Your Site’s Internal Link Structure for Google and Searchers
November 9th, 2007,
6:33 am

[…] Duplicate content: Google likes webmasters to believe that Google has duplicate content figured out, but if they have multiple similar pages indexed you are splitting your PageRank and they may rank the wrong version. Make sure you do not place the same (or exceptionally similar) content on multiple pages. Stuntdubl has a good list of resources for dealing with duplicate content. […]

Robin
November 20th, 2007,
12:00 pm

I wonder about sites I came across that are obviously using duplicated content and have still been topping the SERPs for a long time now?

When is Duplicate Content a Good Thing? | Seo Design Solutions Blog
November 20th, 2007,
8:41 pm

[…] The moral behind the story is, if you have duplicate content on your pages (check StuntDubl’s page on duplicate content for a more detailed explanation) the severity of the penalty varies on the conditions that evoked it. […]

How search engines work : QuickstartSEO.com
November 29th, 2007,
2:07 pm

[…] Duplicate content detection is not just based on some magical percentage of similar content on a page, but is based on a variety of factors. Both Bill Slawski and Todd Malicoat offer great posts about duplicate content detection. This shingles PDF explains some duplicate content detection techniques. […]
