Online marketing information can change quickly. This article is 9 years and 168 days old, and the facts and opinions contained in it may be out of date.
Unique content is a valuable commodity. There was a discussion on duplicate content in the WMW supporters forum a few days ago that I thought was worthy of a post for those who aren’t subscribed there (you should be, though!).
Duplicate content has become a big area of misinformation, with everyone concerned that they have hit a “duplicate content filter” or been penalized for duplicate content. Chances are you haven’t been banned or penalized unless you really have very little unique content on your entire site. For this reason, I thought I’d dig a little further into dupe content and its remedies so I have a reference document for later.
Duplicate Content is not a Magical Percentage
If it were as easy as saying that any page with more than 42% duplicate content will be filtered from the search results, then all site owners and SEOs would probably grab 40% duplicate content as filler for every page. It IS NOT a percentage. There may be percentage variables that apply, but the first step to understanding duplicate content is to get out of the “magical percentage” line of thinking.
This paper, Finding near-replicas of documents on the web, offers a few clues about the way SEs may handle duplicate content:
Clustering exact copies by checksum
Comparing doc size for exact or near exact webpages
This is generally how people think of duplicate content detection in terms of the “magical percentage”. As long as you have 20% unique content you’ll be fine…Riiiight. Checksums and size comparisons sit at the core of dupe content detection, but on their own they are an overly simplistic method, and they say nothing about the other techniques that may be applied. Many people do not consider the ways of detecting duplicate content much further than this, and thus get stuck in the “magical percentage” line of thinking.
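To see why checksum clustering only catches exact copies, here is a minimal sketch in Python. The URLs, the page text, and the choice of SHA-1 are my own assumptions for illustration; the paper does not say which checksum any engine uses.

```python
import hashlib
from collections import defaultdict


def cluster_exact_copies(pages):
    """Group URLs whose page text is identical.

    `pages` is a hypothetical dict mapping a URL to its page text.
    """
    clusters = defaultdict(list)
    for url, body in pages.items():
        # Identical text always produces the identical digest.
        digest = hashlib.sha1(body.encode("utf-8")).hexdigest()
        clusters[digest].append(url)
    # Only groups holding more than one URL are duplicates.
    return [sorted(urls) for urls in clusters.values() if len(urls) > 1]


pages = {
    "http://a.example/widgets": "Blue widgets for sale.",
    "http://b.example/widgets": "Blue widgets for sale.",
    "http://c.example/widgets": "Blue widgets for sale!",
}
print(cluster_exact_copies(pages))
# [['http://a.example/widgets', 'http://b.example/widgets']]
```

The third page differs by a single character and escapes the cluster entirely, which is why a checksum alone cannot be the whole story.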
Computing all-pairs document
“Chunking” documents and searching for similarities and flagging them for a “second look”
The resulting document is then chunked into smaller units…
Understanding some of the methods for filtering duplicate content is the first step in getting beyond the “magical percentage” thinking (from here on referred to as MP thinking). Imagine 10 different documents that all pull 5 lines of text from 3 documents containing 20 lines of text. These 10 documents are most likely “unique” if the randomization settings are done well. They will all, however, have different levels of percentage similarity. Now, before reverting to the line of thinking that says “which level will get me penalized?”, consider other options for scoring the relevance of these pages. Consider also that it takes multiple iterations of processing to determine the similarities between ALL documents. Now, as a relevance engineer…how would YOU handle that mess?
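The 10-document thought experiment is easy to simulate. This is a rough sketch, not anything an engine is known to run: I am assuming each document pulls its 5 lines from the pooled 60 lines of the 3 sources, and I am scoring overlap as shared lines divided by total distinct lines. All function names are made up for the example.

```python
import random
from itertools import combinations


def build_documents(n_docs=10, lines_per_doc=5, n_sources=3,
                    lines_per_source=20, seed=0):
    """Build documents by sampling lines from a pool of source lines."""
    rng = random.Random(seed)
    pool = [f"source {s} line {i}"
            for s in range(n_sources) for i in range(lines_per_source)]
    return [rng.sample(pool, lines_per_doc) for _ in range(n_docs)]


def line_overlap(doc_a, doc_b):
    """Shared lines divided by all distinct lines across both documents."""
    a, b = set(doc_a), set(doc_b)
    return len(a & b) / len(a | b)


def all_pairs_overlap(docs):
    """Score every pair of documents; the pair count grows quadratically."""
    return {(i, j): line_overlap(docs[i], docs[j])
            for i, j in combinations(range(len(docs)), 2)}


docs = build_documents()
scores = all_pairs_overlap(docs)
print(len(scores), "pairs compared")
for (i, j), score in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(f"doc {i} vs doc {j}: {score:.0%} shared lines")
```

Ten documents already need 45 comparisons, and every pair lands on its own percentage. Scale that to a whole index and it is clear why a single cutoff number is not the interesting part of the problem.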
Sorting and finding overlap.
Probabilistic-counting based approach
Comparing the probability of dupe content based on footprint “sets” of different types if there is overlap between documents
Understanding Duplicate Content Filtering
Okay, you’re no longer an MP thinker. You’ve moved beyond wishing for the percentage you can push the limit to, and have agreed that you probably need a content writer to put something worthwhile on your site.
My other favorite whitepaper on duplicate content is:
Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content
From this we get some ideas on different levels of severity for dupe content:
Level 1 — Structural and content identity.
Every page on host A with relative path P (i.e., a URL of the form http://A/P) is represented by a byte-wise identical page on host B, at location http://B/P, and vice versa.
Level 2 — Structural identity. Content equivalence
Every page on host A with relative path P is represented by an equivalent content page on host B, at location http://B/P, and vice versa.
Level 3 — Structural identity. Content similarity.
Every page on host A with relative path P is represented by a highly similar page on host B, at location http://B/P, and vice versa.
Level 4 — Partial structural match. Content similarity.
Some pages on host A with relative path P are represented by a page on host B, at location http://B/P, and vice versa, and these pairs of pages are highly similar.
Level 5 — Structural identity. Related content.
Every page on host A with relative path P is represented by a page on host B, at location http://B/P, and vice versa. The pages are pair-wise related (e.g., every page is a translation of its counterpart) but in general are not syntactically similar.
Mismatch — None of the above.
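To make the levels a bit more concrete, here is a rough sketch that sorts a pair of hosts into Level 1, Level 3, Level 4, or Mismatch. It is only my own approximation: the 0.9 threshold, the use of `difflib` as the similarity measure, and the host data are all assumptions, and it does not attempt Level 2 (content equivalence) or Level 5 (related content such as translations), since neither can be judged from raw text similarity alone.

```python
from difflib import SequenceMatcher


def similarity(text_a, text_b):
    """Character-level similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, text_a, text_b).ratio()


def classify_host_pair(host_a, host_b, threshold=0.9):
    """Classify two hosts, each a dict of relative path -> page text."""
    paths_a, paths_b = set(host_a), set(host_b)
    shared = paths_a & paths_b
    if not shared:
        return "Mismatch"
    same_structure = paths_a == paths_b
    if same_structure and all(host_a[p] == host_b[p] for p in shared):
        return "Level 1"  # same paths, identical pages
    scores = [similarity(host_a[p], host_b[p]) for p in shared]
    if same_structure and all(s >= threshold for s in scores):
        return "Level 3"  # same paths, highly similar pages
    if any(s >= threshold for s in scores):
        return "Level 4"  # only some paths line up with similar pages
    return "Mismatch"


host_a = {"/": "Welcome to widgets", "/about": "About our widget shop"}
clone = dict(host_a)
near_clone = {"/": "Welcome to widgets!", "/about": "About our widget shop."}
partial = {"/": "Welcome to widgets", "/contact": "Call us"}

print(classify_host_pair(host_a, clone))       # Level 1
print(classify_host_pair(host_a, near_clone))  # Level 3
print(classify_host_pair(host_a, partial))     # Level 4
```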
Duplicate Content Penalties, Filters, and Bannings
It should be noted that this is my own gut feel on the topic, and there’s a very good chance I may be wrong. Take it with a grain of salt. In case you missed it, I’ve ranted about them before.
Using the levels from above:
Level 1 – Banned
example: dmoz/wiki clones
Level 2 – Banned
example: page scraped from a site using relative urls
Level 3 – Partial penalization and/or filtering on some content depending on severity of duplication.
example: same CMS, some dupe content
e.g., osCommerce with stock manufacturer product descriptions.
Level 4 – Possible penalization and/or filtering on some content depending on severity of duplication.
example: similar to #3 – similar content and CMS
Two widget forums both run on phpBB or vBulletin, have similar categories of content, and allow people to post the same content in both places (creating some exact dupe content) or aggregate RSS feeds.
Level 5 – Not much to worry about
example: two widget forums both run on vBulletin and have similar categories of content
Mismatch – Best case scenario. This is what you’re striving for. Having NO duplicate content indexed is the ideal. Your best bet is to keep all duplicate content from being indexed at all, and if you use out of the box solutions, make sure you change up the “footprints” a bit.
Filter – You have some duplicate content within your own site or on an external site, or you lack unique content. You’ll most likely end up with these PAGES in the supplemental index. Filters are generally page-level problems that decrease rankings.
Penalty – You’ve served duplicate content one too many times. You may have served the spiders the same content so many times that they won’t come around as often (calendar software and session IDs are good examples of this). With a penalty, your website may get spidered on a less frequent or more superficial basis (meaning you won’t get deep crawled). Penalties can be page- or site-level issues of varying severity that decrease rankings.
Bannings – Chances are you’ll KNOW when you’re banned. Otherwise it’s most likely a penalty or filter. Chances are the only ways to get outright BANNED for dupe content are cloaking others’ content, violating the DMCA, or other severely egregious offenses where you KNOW what you did. If someone did this with your branded site, you’d better start practicing your grovelling and develop the story of how it’s all your shady SEO’s fault. Bannings are a pleasant way of someone telling you that you’re f*cked.
Okay Smart Guy, What the Hell are Shingles?
From yet another fantastic PDF, we get some insight on shingles (why in the world do smart people insist on using PDFs?).
A k-shingle is a sequence of k consecutive words
- The quick brown
- quick brown fox
- brown fox jumped
- fox jumped over
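Here is a small Python sketch of 3-shingles that reproduces the list above, plus a simple resemblance score between two texts. The resemblance formula (shared shingles divided by all distinct shingles) is the commonly cited set-overlap definition; I am assuming it here for illustration, not claiming it is what any engine runs. The function names are my own.

```python
def k_shingles(text, k=3):
    """Return every run of k consecutive words in the text."""
    words = text.split()
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def resemblance(text_a, text_b, k=3):
    """Shared shingles divided by all distinct shingles in both texts."""
    a, b = set(k_shingles(text_a, k)), set(k_shingles(text_b, k))
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


print(k_shingles("The quick brown fox jumped over"))
# ['The quick brown', 'quick brown fox', 'brown fox jumped', 'fox jumped over']

print(resemblance("The quick brown fox jumped over",
                  "The quick brown fox leaped over"))
```

Swapping one word out of six knocks out two of the four shingles on each side, so a small edit moves the score a lot more than a word-count percentage would suggest.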
I think TRUST plays a big role in determining which sites/pages are flagged for resemblance checks of shingles. That means not all sites are held to the same standards, which also makes it impossible to ever predict the magical percentage. Basing reviews on trust is probably one of the biggest helps in reducing the sample size, which in turn reduces the processing power required to interpret such a massive amount of data for relevance. All documents will build a set of associated shingles over time. This is why delivering content strategically, based on what the unique content of your site is, is so important.
- Circular Navigation
- Print-Friendly Pages
- Inconsistent Linking
- Product Only Pages
- Transparent Serving
- Bad Cloaking
To get Jake’s fixes, you’ll have to attend his Duplicate content session at SES San Jose.
Techniques for Remedying Duplicate Content
Don’t have the same content indexed in two places!!! Be consistent with your linking structure!
Robots.txt – Site Level
Don’t let the bots near your folders of duplicate content – keep it all in one place for users, and don’t let the bots near it.
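As a quick sanity check, you can test robots.txt rules before you rely on them. The sketch below uses Python's standard `urllib.robotparser`; the `/print/` and `/archive/` folders and the example.com URLs are hypothetical stand-ins for wherever your duplicate content lives.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep all bots out of the duplicate-content folders.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
Disallow: /archive/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def crawlable(url, agent="*"):
    """Return True if the rules above let this agent fetch the URL."""
    return parser.can_fetch(agent, url)


print(crawlable("http://www.example.com/widgets/blue.html"))  # True
print(crawlable("http://www.example.com/print/blue.html"))    # False
```

Keep in mind this only tells you what a well-behaved spider should do with your rules; it says nothing about whether a given engine actually honors them.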
Meta robots tag – Page Level
Use variations of the robots meta tag to tell spiders whether to index or noindex, and follow or nofollow, a given page.
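For example, a print-friendly copy of a page might carry `noindex, follow` so the copy stays out of the index while its links still get crawled. The sketch below reads the directives back out of a page with Python's standard `html.parser`; the sample page and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip())


def robots_directives(html):
    """Return the set of robots directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


PRINT_PAGE = (
    '<html><head><meta name="robots" content="noindex, follow"></head>'
    "<body>Printer-friendly copy</body></html>"
)
directives = robots_directives(PRINT_PAGE)
print("noindex" in directives, "nofollow" in directives)  # True False
```

Running something like this across your duplicate pages is a cheap way to confirm that the tag you meant to add is actually in the served HTML.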
Rel=Nofollow – Link/Block Level beta
Okay, this is what I’m talking about as a potential positive for rel=nofollow. The trouble is that support for it from the engines is shaky at best, and I’ve been shown examples where it flat out doesn’t work. I wouldn’t rely on it at this point, but it’s probably worthwhile to use it as a failsafe to keep spiders from getting to certain areas of your site.
Other Remedies for Duplicate Content
Iframes with 0 border – put the duplicate content on a separate page and use noindex, nofollow on it.
Text to image – Thanks to Web Professor.
Invisitext – Another brilliant script from Web Professor.
Just like reciprocal links, poor titles, run of site links, and a multitude of other SE variables, duplicate content occurs naturally on the web sometimes. It is not inherently a bad thing that can hurt your site. What CAN hurt your site is not having an understanding of how to handle duplicate content, and having spiders spend time indexing your duplicate content when they could be grabbing your good unique value added content.
To truly understand duplicate content issues, you need to learn what problems duplicate content causes from an SE perspective, as well as HOW the SEs are trying to remedy those issues. Understanding the strategies the SEs are using to improve relevance (in this case, trying to de-dupe their index) is important to developing strategies for new and existing sites in the months and years to come.
More duplicate content reading and resources
- Thanks to Marcia who showed me these papers
- Discussion at WMW
- Duplicate content checker tool
- Rae on dupe content
- DaveN on dupe content
- Dupe content penalty – seomoz.org
- Bill Slawski talks dupe content
- SmartIT consulting
- Topic Ideas for content pages
- Duplicate content observation from caveman
More Whitepapers on Duplicate Content (PDFs and required registration)
- Really cool presentation on “shingles” that is almost in English (pdf)
- A comparison of techniques to find mirrored hosts on the WWW
- Detecting phrase level duplication on the world wide web (pdf)
- Spam, damn spam, and statistics (pdf)
- On the evolution of clusters of near duplicate webpages (pdf)