In a past job, I spent a great deal of time debating ways to prevent Google from seeing similar pages within the same website. The thinking behind this debate was that duplicate and thin content was bad (this was post-Panda), so if you had lists of things that had a lot of overlap with other lists of similar things you published, it would hurt your site’s results in search.
If you boarded this train of thought and rode it for all it’s worth, you’d do the following:
- Allow fear of duplicate content to alter your publishing strategy and lead you to edit or eliminate certain pages.
- Tag your pages into oblivion, using rel=canonical in a desperate attempt to educate Google about how these pages relate to one another.
- Use Ajax or robots.txt to make faceted versions of the pages invisible to Google.
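That last tactic usually looks something like the following – a purely hypothetical sketch, with invented paths, of a robots.txt that hides faceted views of category pages:

```text
# Hypothetical robots.txt sketch – paths are invented for illustration.
# Blocks faceted/filtered views of listing pages from being crawled,
# while leaving the base category pages visible.
User-agent: *
Disallow: /cameras/digital/?color=
Disallow: /cameras/digital/?sort=
Disallow: /*?sessionid=
```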
Intuitively, this never made sense to me. It’s perfectly normal to have multiple views of things on a site, and those views may substantially duplicate other things on your site. Amazon is a great (and oft-cited) example – they have products, and products have sizes, colors, brands and a host of other variables. Amazon also has lists of products based on categories, and some of those categories overlap. Do I really believe that Google – in all its sophistication – would penalize a site for having a page dedicated to listing “digital cameras,” another for listing “digital point and shoot cameras” and another listing “digital slr cameras?”
I’ve also always viewed robots.txt as the nuclear option of search optimization: I want my content to be visible to the maximum extent possible. Call it transparency. Call it a marketer’s paranoia about not being available when someone – anyone – comes a’crawlin’.
(There are other factors that might lead you to go to some of these lengths, specifically crawl efficiency. If similar pages create infinite loops that trap the search bots or if they spend so much time crawling pages with little importance that they don’t make it to pages that are highly relevant to your audience, that’s a problem worth fixing. For now, I’m focusing solely on the perception that similar or duplicate content within the same site is a problem.)
To a great extent, my intuition has been confirmed and re-confirmed by WebMasterHelp videos. When I listen to the way Matt Cutts describes duplicate content, the context is never around penalty – it’s clarity. The more you do to annotate your duplicate or similar content, the easier it is for the search engines to avoid confusion. But two things seem true:
- Even if you don’t, they’ll do their best to figure out the right page for the right query (and most of the time, they’ll get it right).
- Even if you do, they may still figure out something different (which is good, in case you make a mistake in your canonicalization of pages).
The following videos in particular shaped my perception of this:
The video announcing the introduction of rel=canonical provides a really helpful overview of the topic, along with good insight into the fact that its value lies in helping the search engines avoid confusion. It also makes clear the extent to which the search engines interpret data independently.
In a later video, Cutts goes even further in explaining the distinction between treatment of normal duplicate content and penalty cases. He urges people not to stress out about it.
In a separate video (for which I haven’t been able to find the citation…yet), Cutts talks about affiliate sites and retail product pages – highly competitive duplicate content. His remedy? Figure out how to stand out, do something unique with it. It’s good advice, and really, in this regard, the search engines are like normal consumers, trying to figure out: why should I care about this provider over that provider? That’s not a question of tactical SEO so much as it is one of strategy, and by and large those are far more important (and interesting) questions than how to set up rel=canonical.
The lessons I take away from this?
- Don’t confuse “similar content” with “duplicate content.” Penalties come in when there’s a clear pattern of scraping or re-publishing without adding value. It’s easy to tie yourself in knots thinking about this stuff, and like any factor Google takes into consideration, it’s highly contextual.
- Don’t change your publishing strategy out of fear of duplicate content. If there’s a valid editorial reason for duplicate or similar content within your site, then let there be duplicate content. I wouldn’t give duplicate content within the same site a second thought, and I wouldn’t worry too much about similar content across the internet, either. Look, there’s rarely anything new under the sun. Chances are that whatever you’re talking about, someone else is talking about it, too. Competitive analysis is good, but at the end of the day, put out content about whatever your area of expertise covers and do it as distinctively as you can.
- Have a distinctive voice and identity. In a world where everything is written about and discussed ad infinitum in real time, you can’t avoid duplicating content altogether, so style matters.
- Use rel=canonical for related content. Even if Google doesn’t need it per se to understand which page to treat as the master page, it’s a good idea to take advantage of any markup that can make the structure of your content clear.
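Concretely, the tag is a single line in a page’s head. A hypothetical sketch, with an invented URL, showing a filtered view of a listing page pointing back at the page you consider the master copy:

```html
<!-- Hypothetical example – the URL is invented for illustration. -->
<!-- Placed in the <head> of a filtered/faceted view of a listing page, -->
<!-- it tells search engines which page to treat as the master copy. -->
<link rel="canonical" href="https://www.example.com/cameras/digital/" />
```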
- Use 301 redirects. One URL per page is a good principle to live by. Eliminating duplicate URLs only makes your content data cleaner.
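What a 301 looks like depends on your server. As one sketch – Apache mod_rewrite, with an invented domain – here are rules that collapse the bare hostname onto a single canonical one, so each page lives at exactly one URL:

```apache
# Hypothetical .htaccess sketch – the domain is invented for illustration.
# Permanently (301) redirects example.com/... to www.example.com/...
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]
```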
- Be incredibly consistent across your URLs, sitemaps and internal links. The cleaner you can make your data, the better off you are.
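One place that consistency shows up is the sitemap: list each page once, at the exact canonical URL you use everywhere else. A hypothetical fragment, with an invented URL:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sitemap fragment; the URL is invented for illustration.
     Each page appears once, at the same canonical URL used in internal
     links and in rel=canonical tags. -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/cameras/digital/</loc>
  </url>
</urlset>
```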