Web search engines, by default, crawl (like a “spider”) every visible and interesting site to build a comprehensive and hopefully useful search index. If you would like your site excluded from these public indexes, you can “respectfully request” to be unlisted with a “robots.txt” file. Friendly search engine spiders that follow the well-established convention look for a robots.txt file on your server and, if one is present, read and honor your properly formatted directives to skip all (or part) of your site while crawling and indexing. If you ask that your entire site be ignored, a well-behaved spider (or “robot”) will politely crawl on by without a second glance.
Beyond a little privacy, a robots.txt file politely saves the robot the extra work of reading and indexing a site that should not appear in search results. It is good etiquette for web site administrators who prefer their Internet-accessible site to remain unlisted.
Some popular content management systems include an option or setting to exclude your site from search results; these settings ensure that web-crawling robotic spiders find an appropriately formatted robots.txt file when they visit.
For those who manage the actual back-end website files, place a plain-text file named “robots.txt” in the root of your web content directory with contents similar to the following:
User-agent: *
Disallow: /
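The steps above can be sketched from a shell. This writes the file to the current directory; the actual web content root varies by server (e.g. /var/www/html on many Linux hosts), so copy it there yourself:

```shell
# Generate a catch-all robots.txt asking every crawler to skip the site.
# Written to the current directory; move it to your web content root.
printf 'User-agent: *\nDisallow: /\n' > robots.txt
cat robots.txt
```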
Please NOTE that this does *NOT* add any security to your site: it blocks no access, and it does not prevent misbehaving web spiders from reading your site and adding it to their search index. To test whether your “robots.txt” is functioning properly, append “/robots.txt” to your server base URL while testing from a public computer. You should see the contents of your robots.txt file in the browser (or via “view source” if you get a blank page). A good example is the Google main robots.txt file at http://www.google.com/robots.txt (note that the robot directives in the Google file are much more complex than the simple example given here).
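The same check can be done from the command line instead of a browser. This sketch uses curl with a file:// URL to simulate the round trip locally; against a real server you would use its public base URL plus “/robots.txt”:

```shell
# Simulate fetching robots.txt locally with curl's file:// support.
# In practice, replace the URL with e.g. http://yourserver.example/robots.txt
printf 'User-agent: *\nDisallow: /\n' > /tmp/robots.txt
curl -s "file:///tmp/robots.txt"
```

If the fetch succeeds, curl prints the file contents exactly as a well-behaved spider would receive them.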
As you may have deduced, there is a lot of flexibility in picking and choosing which sections of a site remain unlisted (and un-crawled). For starters, see the Wikipedia article on the “Robots exclusion standard”.
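As a hedged illustration of that flexibility (the directory names and the agent name here are hypothetical, not from any real site), a robots.txt can exclude only certain paths for most crawlers while turning one specific crawler away entirely:

```
User-agent: *
Disallow: /private/
Disallow: /drafts/

User-agent: BadBot
Disallow: /
```

Each “User-agent” line starts a record; the “Disallow” lines beneath it apply to any crawler whose name matches, with “*” matching all crawlers not named elsewhere.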