Respectfully Request to Be Unlisted with robots.txt

By default, web search engines crawl (like a “spider”) over every site they can find in order to build a comprehensive and, hopefully, useful search index. If you have a site that you would like excluded from these public search indexes, you can “respectfully request” to be unlisted using a “robots.txt” file. Friendly search engine spiders that obey this well-established convention look for a robots.txt file on your server and, if one is present, read and honor your properly formatted directives to ignore all (or part) of your site when crawling and indexing. If you request that your entire site be ignored, a well-behaved web spider (or “robot”) will politely crawl on by without a second glance.

In addition to giving you a little privacy, a robots.txt file politely saves the robot the extra work of reading and indexing a site that should not appear in search results. This is good etiquette for website administrators who prefer their Internet-accessible site to be unlisted.

Some popular content management systems include an option or setting to exclude your site from search results. These easy settings make sure that web-crawling robotic spiders find an appropriately formatted robots.txt file when visiting your site.

If you manage the actual back-end website files yourself, place a plain text file called “robots.txt” in the root of your web content directory with contents similar to the following:

User-agent: *
Disallow: /
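
For anyone curious how a friendly robot might interpret those two lines, here is a minimal Python sketch using the standard library's urllib.robotparser module. The helper name is_allowed, the bot name “ExampleBot”, and the example.com URL are made up for illustration; real search engine crawlers use their own parsers and may differ in the details.

```python
from urllib.robotparser import RobotFileParser

# The same two-line file shown above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" and the URL are hypothetical placeholders.
print(is_allowed(ROBOTS_TXT, "ExampleBot", "http://www.example.com/any/page.html"))  # False
```

With the blanket rule above, every path on the site is reported as off-limits to any robot that honors the convention.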

Please NOTE that this does *NOT* add any security to your site. It does not block any access, and it does not prevent misbehaving web spiders from reading your site and adding it to their search index.

To test whether your “robots.txt” is being served properly, append “/robots.txt” to your server's base URL and load it from a computer on the public Internet. You should see the contents of your robots.txt file in the browser (or under “view source” if you get a blank page). For a real-world example, see the main Google robots.txt file at http://www.google.com/robots.txt (note that the robot directives in the Google file are much more complex than the simple example given here).
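
The same check can be scripted instead of done in a browser. The following is only a rough sketch: the function names robots_url and fetch_robots_txt are my own, and it assumes the machine running it can reach the server over the public Internet.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def robots_url(site_url):
    """Build the robots.txt URL for the server that hosts site_url."""
    parts = urlsplit(site_url)
    # Keep only scheme and host; robots.txt always lives at the server root.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def fetch_robots_txt(site_url, timeout=10):
    """Download robots.txt and return its text (requires network access)."""
    with urlopen(robots_url(site_url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling fetch_robots_txt("http://www.google.com/") should return the Google file mentioned above, while pointing it at your own base URL shows what visiting robots will see. If the server has no robots.txt, urlopen raises an HTTPError rather than returning text.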

As you may have deduced, there is a lot of flexibility to pick and choose which sections of a site you would like to remain unlisted (and uncrawled). For starters, see the Wikipedia article on the “Robots exclusion standard”.
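
As a hedged illustration of that flexibility, the Python sketch below parses a robots.txt that hides only two directories from most robots while turning one specific robot away entirely. The directory names (/private/, /drafts/), the bot names, and the example.com URLs are all invented for this example, and Python's urllib.robotparser is just one interpretation of the rules; individual search engines may handle edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: most robots skip two directories,
# while "ExampleBot" is asked to stay away from everything.
SELECTIVE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/

User-agent: ExampleBot
Disallow: /
"""

selective = RobotFileParser()
selective.parse(SELECTIVE_ROBOTS_TXT.splitlines())

# Paths to try against the rules above (all made up).
checks = [
    ("OtherBot", "http://www.example.com/blog/post.html"),
    ("OtherBot", "http://www.example.com/private/notes.html"),
    ("ExampleBot", "http://www.example.com/blog/post.html"),
]
for agent, url in checks:
    verdict = "allowed" if selective.can_fetch(agent, url) else "disallowed"
    print(agent, url, verdict)
```

Here only the first check comes back as allowed: the second path falls under a disallowed directory, and the third request comes from the robot that was asked to skip the whole site.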

About notesbytom

Keeping technology notes on WordPress.com to free up my mind to solve new problems rather than figuring out the same ones repeatedly :-).
This entry was posted in System Administration.
