Sources of Duplicate Content
Content can be duplicated either intentionally or by accident. Imitation may be the sincerest form of flattery, but when it comes to your website, you can do without this kind of flattery. No matter what the copycat’s motive is, you don’t want your content indexed as theirs if you can avoid it.
There are two basic types of duplicate content:
- Duplicate content outside your domain. This occurs when two different websites have the same text indexed in the search engines.
- Duplicate content within your domain. This occurs when a website creates duplicate content within its own domain (the root of the site’s unique URL, such as www.domain.com).
Many sites have duplicate content on their own domains because of faulty internal linking procedures, and webmasters aren’t even aware of the problem.
When pages on your own site duplicate each other, one or both of them become less likely to be found in search results. Some webmasters inadvertently cause search engines to disregard more than half of a site’s pages due to duplicate content issues.
There are various reasons you may end up with duplicate content on your own site, such as having multiple URLs all containing the same content; printer-friendly pages; pages created on the fly with session IDs in the URL; using syndicated or third-party content; problems caused by localization, minor content variations, or an unfriendly Content Management System; and archives.
How To Find Duplicate Content
A good place to start looking at duplicate content is to find out how many of your web pages are currently indexed, versus how many the search engines consider to be duplicated. Here’s how:
- On Google, type [site:domain.com] in the search text box (leaving out the square brackets and using your domain), and then press Enter.
- When the results page comes up, scroll to the bottom and click the highest page number that shows (usually 10). Notice the total number of results shown on “Page 1 of about ### results” at the top of the page. The “of about ###” number represents the approximate total number of indexed pages on the site.
- Now navigate to the very last page of the results. The count shown there represents the filtered results. The difference between these two numbers represents the number of duplicates as well as pages the search engine considers low-quality. Typical examples are outdated product pages or those with parameters in the URL.
For performance reasons, Google doesn’t display all the indexed pages and omits the ones that seem most like duplicates. If you truly want to see all the listings for a site, you can navigate to the very last results page of your [site:] query and click the option to “repeat the search with the omitted results included” at the bottom of the page. (Even then, Google only shows up to a maximum of 1,000 listings.)
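The arithmetic behind this check is simple enough to sketch. The function name and the sample numbers below are hypothetical; plug in the two counts you read off your own [site:] query. Because the “of about ###” figure is only an approximation, treat the result as a rough estimate.

```python
def estimate_filtered_pages(approx_indexed: int, shown_after_filtering: int) -> int:
    """Estimate how many indexed pages were filtered out of the visible results.

    approx_indexed:        the "of about ###" total from the first results page
    shown_after_filtering: the count shown on the very last results page
    """
    if shown_after_filtering > approx_indexed:
        # The "about" total is an estimate, so never report a negative gap.
        return 0
    return approx_indexed - shown_after_filtering


# Hypothetical counts, for illustration only.
print(estimate_filtered_pages(1200, 450))
```

A large gap suggests that many pages are being treated as duplicates or low-quality pages and deserves a closer look.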
Google Search Console (the free communication hub and toolset that Google makes available to every website owner) can give you a hand discovering duplicate content on your site. Just run the Duplicate title tags or Duplicate meta descriptions reports (located under Search Appearance ➪ HTML Improvements) to identify any pages indexed with the same tags.
How To Avoid Multiple URLs With The Same Content?
Even with websites under your own control, you may have duplicate content resulting from any of the following sources:
- Similar pages on the same website (www.yourdomain.com)
- Similar pages in different domains that you own (www.yourdomain.com and www.yourotherdomain.com)
- Similar pages in your www-prefixed site and the non-www version (www.yourdomain.com and yourdomain.com)
When search engines find two pages with nearly the same content, they may include both pages in their index. However, they take only one of these pages into consideration for search results. The search engines do this because they want to show users a variety of listings — not several that are the same.
To make the entire body of your website count, you want to ensure that each of your web pages is unique.
But here’s the rub: The search engines may consider two pages the same even if only part of the page is duplicated. Just the headings, the first paragraph, the Title tag, or any other portion being the same can trigger “duplicate” status.
For instance, if you use the same Title tag on multiple pages, the search engines might see them as duplicate pages just because they share that single, but important, line of HTML code. To avoid issues with duplicate content, you always need to write unique headings, tags, and content for each page on your website.
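If you have the HTML of your pages on hand, you can look for shared Title tags yourself. The following is a minimal sketch, assuming you already hold each page’s HTML in a dictionary keyed by URL; the class name, function name, and sample pages are hypothetical. It only compares titles and says nothing about how a search engine actually judges duplication.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def find_duplicate_titles(pages):
    """Return {normalized title: sorted URLs} for titles used on 2+ pages."""
    by_title = defaultdict(list)
    for url, html in pages.items():
        parser = TitleParser()
        parser.feed(html)
        parser.close()
        # Ignore case and extra whitespace when comparing titles.
        normalized = " ".join(parser.title.split()).lower()
        by_title[normalized].append(url)
    return {t: sorted(urls) for t, urls in by_title.items() if len(urls) > 1}


# Hypothetical pages, for illustration only.
pages = {
    "/a.html": "<html><head><title>Classic Car Restoration</title></head></html>",
    "/b.html": "<html><head><title>classic car  restoration</title></head></html>",
    "/c.html": "<html><head><title>Contact Us</title></head></html>",
}
print(find_duplicate_titles(pages))
```

Any title that shows up in the result is a candidate for rewriting so that each page carries its own.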
How To Avoid Duplicate Content on Your Own Site
When it comes to cleaning up duplicate pages on your own website, which you have control over, after all, don’t spend time wondering, “How similar can they be?” As much as possible, just make the content different, and you’ll follow the best practice of having unique, original content throughout your site. In some situations, duplication may be unavoidable, so read on for recommendations. But stay away from the edges of what might be all right with the search engines.
To keep your site in a safe harbor, here are some ways you can avoid or handle duplicate content within your own website:
1. Title tags, Meta description, and keywords tags
Make sure each page has a unique Title tag, Meta description tag, and Meta keywords tag in the HTML code.
2. Heading tags
Within the body copy, make sure the heading tags (labeled H#) differ from those on other pages. By keeping in mind that all of your headings should contain meaningful, non-generic words, this becomes a little easier.
3. Repeated text
If you need to repeat a sentence or paragraph throughout your site, such as a company slogan, consider putting it in an image on most pages. Choose one web page where that repeated content should rank and leave it as text there so search engine spiders can crawl it.
That way, the search engines can find this unique content on your selected page if anyone searches for it. For example, a classic car customization website with the slogan “We restore the rumble to your classic car” would probably want to feature it throughout the site.
The search engines should not see the repetition, however. Just include the slogan as HTML text on one page, such as your home page or About Us page. Everywhere else, simply create a nifty graphic that shows the slogan to users but not to search engines.
4. Sitemap
Make sure that your HTML sitemap links to the preferred URL for each page, for example, http://www.sitename.com/silo1/ rather than http://www.sitename.com/silo1/index.html (in cases where you have similar versions). Sitemaps help search engines determine which version of a page is the preferred one.
5. Product pages
Make sure to include unique text content on all product pages, not just the manufacturer-supplied descriptions. Yes, it is a lot of extra work to write original descriptive text and/or incorporate user-generated text such as reviews on each product page.
But if you want to rank for “Acme X19,” you have to differentiate your Acme X19 page from all the other retail sites carrying that product.
On the other end of the sales spectrum, if you produce a product that is sold on other people’s sites, determine whether you want your own page to rank for that product. If so, you need to keep your page unique.
Consider distributing different descriptive text to your resellers than what you show on your site. Product variations, such as different colors or sizes, ideally should not be on separate pages (though you may have a highly sought-after variation that merits its own page).
Use drop-down lists or other web design elements to let users choose the size, color, and other options so that you can keep a product consolidated on one page.
6. Canonical tag
Canonical tags can be used in cases where duplicate pages are unavoidable. You can insert this HTML link element (rel="canonical") in the head section of each duplicate page to indicate which URL should be considered the canonical (best or original) version of the page. Google, Yahoo, and Microsoft came together and created the canonical tag in 2009 as a last-resort solution for webmasters.
While the tag is treated as a suggestion rather than a directive by the search engines, it can often resolve duplicate content and prevent penalties. Canonical link elements appear to be observed by Google most of the time, whereas Bing does so only occasionally.
You should only use canonical tags when you are an experienced webmaster who understands both the risks and how to use them. That’s why we provide no specific instructions here. More information can be found in Google Search Console Help by searching for [canonical tag].
7. Block indexing
You may have duplicate pages that you need to keep on your site. For example, a site may need to retain previous years’ terms of service pages for legal reasons, even though they’ve been superseded by a slightly altered current version.
In cases like that, you could block the search engines from indexing the page either by specifying it in your site’s robots.txt file or by inserting the following tag into the Head section of a particular page: <meta name="robots" content="noindex">.
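Once the tag is in place, it is worth confirming that it really appears in the page’s HTML. Here is a minimal sketch of such a check using Python’s standard html.parser module; the class and function names are hypothetical, and the check only looks for a Meta robots tag in the markup you pass in. It does not consult robots.txt or any HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives listed in any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            # content may hold several comma-separated directives.
            for part in attr.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def is_noindexed(html):
    """Return True if the page markup carries a Meta robots noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return "noindex" in parser.directives


print(is_noindexed('<head><meta name="robots" content="noindex"></head>'))
```

You could run a check like this against the old terms of service pages mentioned above to make sure none of them slipped through without the tag.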
8. Consolidate similar pages
You may be able to combine several pages with similar or identical content and edit the content accordingly. Determine why the duplicate content occurred and take the appropriate action. Canonical tags may be your answer if you have a legitimate reason for wanting an additional version of the page.
The additional page can be trimmed if it is not needed. There are a few precautions to take if you need to combine a number of pages into one. Do not accidentally wipe out any link equity you may have accumulated.
A site’s link equity refers to the value of all inbound links that point to it. Similarly, you don’t want to cause people’s links and bookmarks to suddenly break, which will result in them receiving an error message.
Here are some precautions to take when consolidating two pages into one main version:
- Check for inbound links. Find out who has already linked to your different URLs for a page by doing a [link:domain.com/yourpage.html] search in Google or using a third-party backlink checker. If one version has 15 links and the other has 4,000, keep the one with 4,000.
- Update your internal links. Ensure that all other pages of your site and your site map no longer link to the page you decided to remove.
- Set up a 301 Redirect. Take down the content of the removed page, and replace it with a 301 Redirect, which reroutes any incoming links to the URL with the content you want to keep.
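The decision in these precautions can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, URLs, and link counts are hypothetical, and a real 301 Redirect is configured on your web server rather than in a script like this.

```python
def plan_consolidation(inbound_links):
    """Keep the URL with the most inbound links and redirect the others to it.

    inbound_links maps each duplicate URL to its inbound link count.
    Returns (url_to_keep, {removed_url: url_to_keep}).
    """
    if not inbound_links:
        raise ValueError("need at least one URL")
    keep = max(inbound_links, key=inbound_links.get)
    redirects = {url: keep for url in inbound_links if url != keep}
    return keep, redirects


def resolve(path, redirects):
    """Model the (status, location) a server would answer for a request path."""
    if path in redirects:
        return 301, redirects[path]
    return 200, path


# Hypothetical link counts, mirroring the 15 versus 4,000 example.
counts = {"/widgets.html": 15, "/products/widgets.html": 4000}
keep, redirects = plan_consolidation(counts)
print(keep, redirects)
print(resolve("/widgets.html", redirects))
```

The redirect map doubles as a checklist: every key in it is a URL that your internal links and site map should stop pointing to.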
How To Avoid Duplications Between Your Different Domains?
Most websites today operate both the domain that begins with www and the domain without this prefix (such as www.yourdomain.com and yourdomain.com). You do not want to duplicate your site content between these two domains.
Instead, set up a 301 Redirect from one root domain to the other and keep only one set of your web documents in production. (Note: It doesn’t matter whether you redirect the www or the non-www version, although it may be more common to make the www version the main site.) Users coming to either URL can get to the same content.
If you own multiple domains with the same content, you can solve the problem of duplicate content with the same technique. Decide which domain you want to rank well for your keywords and redirect the other domains to that one.
Or if you truly need separate sites with duplicate content, know that you’re going to pay a price when the search engines pick the one version that they decide is authoritative and ignore the others.
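To make the redirect rule concrete, here is a minimal sketch of the mapping it performs: any request that arrives on a non-preferred host is sent to the same path on the preferred one. The function name and host names are hypothetical, the sketch assumes default ports, and in practice you would express this rule in your web server’s configuration.

```python
from urllib.parse import urlsplit, urlunsplit


def redirect_target(url, preferred_host, alias_hosts):
    """Return the URL a 301 Redirect should point to, or None if none is needed.

    alias_hosts is the set of lowercase host names that should redirect
    to preferred_host (for example, the non-www version of your domain).
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host not in alias_hosts:
        return None
    # Keep the scheme, path, query string, and fragment; swap only the host.
    return urlunsplit(
        (parts.scheme, preferred_host, parts.path, parts.query, parts.fragment)
    )


aliases = {"yourdomain.com", "www.yourotherdomain.com"}
print(redirect_target("http://yourdomain.com/silo1/", "www.yourdomain.com", aliases))
```

Because the path and query string are preserved, existing links and bookmarks on the redirected domain keep landing on the matching page.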
1. Printer-friendly pages
A common practice that inadvertently creates duplicate content within a website involves printer-friendly pages. Printer-friendly pages are separate pages designed for printing, without the heavy images, navigation elements, and advertisements that eat up a lot of printer ink.
Recipe sites are notorious for these pages, but many sites offer both an HTML version and a simplified version of each page that their users can easily print. The printer-friendly page has its own URL, so it’s actually a twin of the HTML page.
You don’t need to have separate text-only pages for printing. The best way to allow easy printing, keep your users happy, and follow SEO best practices is to use CSS (Cascading Style Sheets, an efficient way to control the look of your site by defining styles in one place).
A print style sheet within your CSS can reformat your HTML pages on the fly so that they can be easily printed. Inside your CSS file, you can specify how a page should automatically change when a user chooses to print it.
Your CSS can control print formatting such as page width, margins, font substitutions, and which images to print and which to omit. Creating a print style sheet within your CSS file is a much easier and search-friendly solution than duplicating your pages with printer-friendly versions.
2. Dynamic pages with session IDs
Many websites track the user’s session — the current time period the user is active on the website since logging on. Sometimes these sites add a session ID code to each page’s URL as the user travels the site.
This is a really bad way to handle passing a session ID from page to page because it creates what looks like duplicate content. Even though the page itself has not changed, the varying parameters showing at the end of each URL cause search engines to think they are separate pages.
In fact, you don’t want to put any type of variables directly into your URL strings except for ones that actually correspond with changed page content. You need unique content for every URL. If your site passes session IDs through the URL string, here are two ways to fix it:
- Tell search engines how to handle parameters. In Google Search Console, you can use the URL Parameters tool to specify what parameters are used on your site, what they refer to, and whether Google should crawl them. Bing has a similar webmaster tool. However, it’s better if you can stop adding codes to a user’s URL.
- Show spiders friendly URLs. This is a more advanced solution, but you could consider using “user agent sniffing” to detect search engine spiders. User-agent sniffing occurs when websites show different content to different users depending on the browser. If the page detects it’s a search engine spider, it could deliver parameter-free URLs to that spider. This sniffing is invisible to the spider and, with a 301 Redirect on the old URL pointing to the rewritten URL, the spiders have a URL that they can index.
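Whichever fix you choose, the goal is the same: the URL a spider sees should not carry the session code. The sketch below shows that rewrite with Python’s standard urllib.parse module. The parameter names in SESSION_PARAMS are assumptions chosen for illustration; substitute whatever names your own site uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed session parameter names; adjust to match your own site.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def strip_session_params(url):
    """Return the URL with session parameters removed from its query string."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


print(strip_session_params("http://www.yourdomain.com/shoes?color=red&sid=abc123"))
```

Parameters that really do change the page content, such as the color in this example, are left alone, which matches the rule that every distinct URL should carry distinct content.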
3. Content syndication
Content syndication simply means sending out your website content to others. The big upside of syndicating your content is having more people read your stuff, which in turn can lead to increased traffic coming back to your website.
The potential downside of syndication is duplicating your content. In essence, you are trading potential search engine ranking for direct links and the traffic that your content syndication brings in.
1. Blog index pages
Blog posts are not isolated articles; they exist among many other posts accumulating over time in a blog. Although each post has a distinct page URL to identify it, the blog as a whole also has an index page and usually category pages that list multiple blog posts.
When a blogger publishes a new post, the blog software automatically adds the new post’s title and an excerpt to these index and category pages, which is super convenient for readers and bloggers alike. But blog index and category pages can inadvertently create duplicate content issues for a website if too much of the article is reused.
To avoid this problem, include only a portion, not the full text of the article, in the excerpt. Even better, if bloggers manually edit the excerpt slightly, they can ensure that it isn’t duplicated and also word it to entice someone who’s reading the excerpt to click through for the full story.
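Most blog software has a setting for excerpt length, but the idea is easy to sketch. The function name and the 40-word default below are hypothetical. An automatically trimmed excerpt still repeats the opening words of the post, so a manual edit remains the better option.

```python
def make_excerpt(text, max_words=40):
    """Return at most max_words words of text, marking any cut with an ellipsis."""
    words = text.split()
    if len(words) <= max_words:
        return " ".join(words)
    return " ".join(words[:max_words]) + "..."


print(make_excerpt("Blog posts are not isolated articles; they exist among many others.", 5))
```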
2. Press releases
Then there is the problem of press releases. Press releases are posted on wire services for the express purpose of being picked up and duplicated on as many sites as possible. If your company puts out press releases, don’t stop.
Even though they do become duplicate content that may not bring any unique ranking to your website, you distribute a press release for other reasons than search engine ranking, such as for branding, public relations, investor relations, and so on. These are legitimate goals, too.
Be sure to make the text content of your press release unique. In other words, don’t copy your announcement word for word from your website’s static pages. And include a brief “About” section for the press release that doesn’t exactly match company information found on your website.
3. Social media sharing
Bloggers and other content producers routinely turn to social media to share any new content they publish. To avoid creating duplicate content with your social shares, post a snippet or summary, not the entire text of your website content or blog post, on Google+, Facebook, LinkedIn, Tumblr, or another social media site. Or rewrite the content enough so that you’re posting a new version of it on social media, linked back to the original on your site, if appropriate.
Because we’re talking about avoiding duplicate content, it seems fitting to mention curation, which can be a highly effective form of content generation when done right. When you curate content, you pull together several existing articles or other types of content from various sources and then create a new article that presents that content to readers with your own added commentary.
Top Ten lists are a good example. The curated article gives a short description of each item with a link to the source. When curating content, you can quote small bits, but be careful not to duplicate the original sources. And always include your own fresh text to add value.
Many websites repurpose the same content for various locations. For example, a large real estate company with brokerage offices throughout the state offers the same template to all their brokers, customized only with a different city name and local property listings.
Or a national cosmetics company gives local representatives their “own” sites, but all of them have the same standard template and content. For local searching, a template site may do all right.
For instance, if a person located in Poughkeepsie searches for [real estate for sale], or if anyone searches with a specific geographic term such as [real estate listings Poughkeepsie], that searcher will probably find brokerage sites that are located in Poughkeepsie, including the sites built within a national company’s template for that location, which may include duplicated content.
However, if the Poughkeepsie broker himself would like to rank for a broader search query like [New York properties], he’s probably out of luck. His website won’t have enough unique content about that keyword phrase to rank in the search results. Doing a quick find-and-replace for a keyword to create “new” targeted pages (such as changing Poughkeepsie to Albany) is not enough to create unique content and is in fact considered spam by the search engines.
You must have truly unique content on unique pages in order to survive and thrive in the search engines. You can use a template to create many web pages or sites, but to do it without creating duplicate content, you need to do more than just find and replace a few terms.
Customize the content for each location, including the headings, Title tags, Meta tags, body content, and so on. Unless you’re located in a highly competitive demographic market area such as Chicago or New York City, your template-based site may be sufficient if you’re only after local search business.
If you want to rank for non-local search queries, you absolutely need to customize your site content and make it unique. Add information that shows you know about that local area. Depending on what kind of traffic you hope to attract through the search engines, you probably need to make changes to your content focused on improving both the quality and quantity.
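To see why swapping a city name is not enough, you can measure how much two pages overlap. The sketch below compares pages by their shared three-word sequences (often called shingles) and reports the Jaccard similarity, which is the share of sequences the two pages have in common. This is a rough illustration only; we are not claiming that any search engine uses this exact method, and the function names and sample text are hypothetical.

```python
import re


def shingles(text, size=3):
    """Return the set of consecutive word sequences of the given size."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(a, b, size=3):
    """Return the shared shingles divided by all shingles, from 0.0 to 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Hypothetical template text with only the city name swapped.
poughkeepsie = (
    "Find homes for sale in Poughkeepsie. Our brokers know the local "
    "market and can help you buy or sell quickly."
)
albany = poughkeepsie.replace("Poughkeepsie", "Albany")
print(round(jaccard_similarity(poughkeepsie, albany), 2))  # roughly 0.71
```

Even in this very short sample, one swapped word leaves most of the text identical, and on a full-length template page the overlap would be higher still. Customized headings, body content, and local information are what bring the score down.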
On the web, a mirror refers to a full copy of a web page or site. The mirrored version is an exact replica of the original page. Yes, this is blatantly duplicate content, but there are a few legitimate reasons to mirror a web page.
- You may need to mirror a web page for user convenience, such as when multiple websites offer copies of a downloadable file so that users can access it at each location.
- You may be testing different versions of a page. A common type of web test, known as a split test or A/B test, involves directing half of the visitors coming to a page to the normal page, while the other half sees a slightly different version of the page. The server splits the traffic randomly, and data collected over time shows whether Page A or Page B performs better. We recommend that you run limited tests on specific variables (that is, minor variations between pages A and B). Use the Google Analytics Content Experiments tool (part of the free Google Analytics software) so that the search engine understands you are just testing.
- You may want to display a backup version of a page that’s temporarily down.
- Many websites have a separate mobile site (usually set up as a sub-domain m.yourdomain.com) that contains nearly the same content as the desktop site. This is not really a problem, but use canonical tags to ensure that the search engines know which is which. Set up the tags to indicate that the desktop (www.yourdomain.com) version of each page should be indexed as canonical.
Mirroring should never be done deceptively. Hackers and pornography sites are notorious for mirroring sites, having content in 20, 30, 40, or more locations because of how frequently their sites are discovered and taken down.
You want to ensure that search engines consider you a legitimate company with original content. Unless you have a need to put up a temporary page such as the ones mentioned above, try to avoid using mirrors on the web.
How To Avoid Duplication by Outsiders
There is no excuse for taking someone else’s page intact, adding a different façade, making a few top-of-page cosmetic changes, and then uploading it to another site for indexing with the search engines. Sometimes it even still contains the original displayed text and links! Unfortunately, there is no foolproof defense against someone taking your content from the web.
To deter others from copying your content, we recommend that you take the following steps:
- Display a copyright notice on your website.
- Register for federal copyright.
These two proactive steps can help you defend your website against intentional spam. It’s a good idea for you to register for federal copyright of your website as software. This is a low-cost and important step in your anti-theft effort.
Even though all content carries copyright naturally, you want to actually file for a copyright registration because only federal copyright has enough teeth in it to help you fight violations of your copyrights in court, if it ever comes down to that.
With federal copyright on file, you have legal recourse if things get ugly. You also carry a lot more weight when you tell people your work is copyrighted with the U.S. government and then ask them to remove it from their site.
The federal copyright can be enforced throughout the United States and internationally. In the following sections, we list different types of intentional spam, with tips for what you can do to protect your website.
1. Scrapers
Scrapers are people who send a robot to your website to copy (or “scrape”) the entire site and then republish it as their own. Sometimes they don’t even bother to replace your company name throughout the content.
Scraping a site is a copyright violation, but it’s also more than that: It is theft, and if the content is protected by federal copyright, the thief can be sued in federal court. If your website has been scraped, you need to decide what your objective is.
Will you be satisfied simply to get the content pulled down? Or do you feel that the other party’s actions are so serious and malicious that you want to sue for damages? You need to decide how much money and effort you’re willing to spend and what outcome you’re really after.
If your site is scraped, your first step can be a simple email requesting that the site stop using your content. Often this is enough to get it removed. You can also report the site to the search engines or the ISP (Internet service provider) that hosts the site domain. If you notify the ISP that the site has been scraped and provide some proof, that ISP may shut the site down.
Because scraping is a crime, you may choose to file a police report for theft. You should have printouts and other evidence that the text is yours and that it has been stolen to back you up. You can even hire a lawyer and serve the scrapers with a cease and desist order, demanding that they take down the offending web pages or face legal action. As a last resort, you can file a lawsuit and fight it out in court.
2. Clueless newbies
Clueless newbies are what we call people who take someone else’s website content but don’t realize they’ve done anything wrong. They may be under the mistaken impression that everything on the Internet is fair game and free for the taking.
They may not realize that intellectual property laws apply to the Internet just as they do everywhere else. If your content has been stolen by one or more clueless newbies, we suggest you email them.
Tell them that it’s copyrighted material and kindly ask them to take it down. If you’re feeling generous, as an alternative, you might suggest that they include only an excerpt or summary of your content, link to your site, and put a Meta robots “noindex” tag on their page so that the search engines won’t index it.
The newbie site owner may comply, and you have taught him or her a lesson in Internet etiquette. But even if they don’t comply, the duplicated page is probably a low risk to you. A new site generally doesn’t have much authority in the search engine’s eyes, so their site may not hurt your rankings. They have no right to your content for their own commercial use, however, so you don’t have to let them use it.
3. Stolen content
When you work hard to create unique, engaging content for your own website, it can be frustrating for you or even damaging to your search engine rankings when that content gets stolen and duplicated on some other site. We suggest that you regularly check to see if your website content has been copied and used somewhere else. You have two ways to check this:
- Exact-match search: Copy a long snippet of text (a sentence or two) from one of your web pages. Then paste it within quotation marks (“ ”) in a search box to find any indexed web pages containing that exact text.
- Copyscape: Another method uses the free service at Copyscape. It is straightforward to use Copyscape; you just type in your page’s URL in the text box and click Go. If the page has been scraped, you see the offending URL in the results.
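If you check many pages regularly, building the exact-match query by hand gets tedious. Here is a minimal sketch that wraps a snippet in quotation marks and turns it into a Google search URL you can open in a browser. The function name is hypothetical, and the sketch only builds the URL; it does not run the search or read any results.

```python
from urllib.parse import quote_plus


def exact_match_query_url(snippet):
    """Build a search URL for an exact-match (quoted) search of a snippet."""
    # Collapse whitespace and drop embedded quotes so the phrase stays intact.
    phrase = " ".join(snippet.split()).replace('"', "")
    return "https://www.google.com/search?q=" + quote_plus(f'"{phrase}"')


print(exact_match_query_url("We restore the rumble to your classic car"))
```

A sentence or two of distinctive text from the middle of a page tends to work better than a headline, which other sites may legitimately quote.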
When your content is stolen, you may see it appearing everywhere. Like playing the Whack-A-Mole arcade game, you might succeed in getting one site to remove your stolen content, only to find it popping up on another.
If you’re in the Whack-A-Mole situation and lots of other sites now have your content, hopefully, you have federal copyright and can follow some of the recommendations we give earlier.
If you don’t have federal copyright, you may have only one recourse: changing your content. It’s unfair, it’s a pain, but if you don’t have a registered copyright, you can’t do much to stop people from stealing your stuff.
Being unique on the web is more important to your search engine rankings than playing Whack-a-Mole, trying to stop thieves from taking your content; so rewrite your own text to be different from theirs. Enforce your copyright when you find people ripping you off, but don’t think that it will solve your stolen content problem.