Hiding Site Content from Search Engines

Search is undoubtedly the heart of today’s World Wide Web. Everybody loves, or at least can’t live without, Google.

Still, at times the need arises to hide parts of a web site from search engines. More often than not it is private content (customers’ account information, contact information for individuals, etc.) that the client does not want visible to the public. It could also be third-party content such as ads. Yet another example is content duplicated across the site, which can negatively impact page rankings in search engines.

In this article we will first go through the technique that allows us to exclude entire pages and directories from the search crawler. Then we will look into the trickier topic of hiding only part of the page content from the search engine spider.

Hiding entire pages and directories using robots.txt

The de facto standard on the web for excluding pages and directories from being crawled is to create and configure a special file named robots.txt at the root of the web site. This file indicates those parts of the site that should not be accessed by search engine crawlers. It utilizes a small set of commands following the Robots Exclusion Standard protocol.

The User-agent directive specifies the search robot at which the restrictions are aimed; the wildcard * stands for all robots. The Disallow directive then lists the directories or individual files to be excluded from the crawl. So if we wanted to exclude, say, the directory that contains all JavaScript files in a web site, the robots.txt file would look like this:

User-agent: *
Disallow: /Scripts/
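
Rules can also target a single crawler by naming it in User-agent. Here is a minimal sketch along those lines (Googlebot is a real crawler name, but the /private/ path is only a placeholder for illustration):

User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /Scripts/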

Here is a jump start on creating robots.txt files. A good online syntax checker is available here.

Hiding part of a page

Hiding entire pages and directories from search engines is quite straightforward, as described above. Hiding only part of the content on a page is trickier and requires us to get a bit more resourceful. In fact, there are solutions out there that get as resourceful as putting the content in an image. What we will do instead is a lot simpler and a lot more flexible.

First, we include a simple HTML placeholder on the page where the content should appear:

<div id="dynamicContent"></div>

Then, we utilize a simple jQuery statement to dynamically insert the content once the page has finished loading:

$(document).ready(function () {
    // Populate the placeholder only after the page has loaded;
    // crawlers that do not execute the script never see this text.
    $("#dynamicContent").text("This content is populated dynamically with script and will not be indexed by search engines.");
});

The code above should not be placed inline in the page. Instead, we put it in a separate file – named dynamicContent.js in the sample solution – and place that file in the directory containing all script files. This directory should be excluded from search indexing using a robots.txt file, as outlined in the first part of the article.
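
To sum up the pieces so far, the layout assumed here looks roughly like this (the page name is just an example; the script file name comes from the sample solution):

/robots.txt                  – disallows crawling of /Scripts/
/Scripts/dynamicContent.js   – the script that injects the hidden content
/index.html                  – the page containing the dynamicContent placeholder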

Last, we should of course reference the external script file in the page header:

<script type="text/javascript" src="Scripts/dynamicContent.js"></script>
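
Since dynamicContent.js uses jQuery, the page must also reference jQuery before it. A minimal sketch of the relevant part of the header, assuming jQuery is kept in the same Scripts directory (the exact file name and version are just an example):

<script type="text/javascript" src="Scripts/jquery-1.11.1.min.js"></script>
<script type="text/javascript" src="Scripts/dynamicContent.js"></script>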

That’s it! The dynamically populated content will now be excluded from the search index. The approach is simple, flexible and extensible: we could, for example, use further jQuery code to load the content from the server, as sketched below. The entire sample can be downloaded from here.
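
For instance, a variation could fetch the hidden markup from the server with jQuery's load() method. The /PrivateContent/details.html URL below is purely hypothetical and should itself sit under a path disallowed in robots.txt:

$(document).ready(function () {
    // Fetch the markup from the server and inject it into the placeholder.
    // The URL is hypothetical; keep it under a disallowed path as well.
    $("#dynamicContent").load("/PrivateContent/details.html");
});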

To top it off, there is a really useful online tool that lets you test how Google crawls and renders a URL on your site. You can use it to double-check your work once the site is public. It is available here (you need a Google account in order to use it).
