Importance of the Robots.txt File

Despite the importance of the robots.txt file in getting your website indexed by the major search engines, many webmasters don’t provide one on their site. What is the robots.txt file, you ask? If you don’t know, you are far from alone. The robots.txt file is a simple text file (plain text, not HTML) that is placed in your website’s root directory to tell the search engines which pages they may crawl and index and which to skip.

When a search engine sends its webcrawler to your site, one of the first things the webcrawler will do is request the robots.txt file from the root directory. A correctly formatted robots.txt file consists of one or more records, each providing instructions for a particular search bot. A record generally has two components. The first is the User-agent line, which names the search bot the record applies to. The second consists of one or more “Disallow” lines. These lines tell the webcrawler which files or folders should not be crawled (e.g., a cgi-bin folder).
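
Because the file always sits in the root directory, its address can be worked out from any page on the site. Here is a minimal sketch in Python; the function name `robots_url` and the example.com address are made up for illustration.

```python
from urllib.parse import urljoin


def robots_url(page_url):
    """Return the robots.txt address for the site that serves page_url."""
    # Joining with an absolute path keeps the scheme and host of the page
    # and replaces the path and query with /robots.txt.
    return urljoin(page_url, "/robots.txt")


print(robots_url("https://www.example.com/shop/item.html?id=3"))
```

A crawler would request this address first and only then decide which of the site's pages to visit.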

If you currently have a website and do not have a robots.txt file, you can create one easily. As mentioned earlier, the file is plain text, so just open up Notepad and save the file as robots.txt. Most webmasters can use one record that will apply to all of the search engine crawlers. Once you have opened Notepad, enter the following:

User-agent: *
Disallow:

The “*” applies this rule to all bots. In this example, nothing is listed on the Disallow line, which tells the robot that it may index the entire site. You can also enter a folder path here, such as “/private”, if there is a folder that shouldn’t be indexed. This can be very useful if you are still testing a portion of your website or if a section is still under construction.
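
If you want to see how a well-behaved robot might read these two versions of the file, Python's standard `urllib.robotparser` module can evaluate them. This is only a sketch: the bot name "ExampleBot" and the page paths are invented, and real search engine crawlers use their own parsers.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Parse robots.txt text held in a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Empty Disallow line: nothing is off limits.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

# Everything under /private is off limits.
BLOCK_PRIVATE = """\
User-agent: *
Disallow: /private
"""

open_site = build_parser(ALLOW_ALL)
restricted = build_parser(BLOCK_PRIVATE)

print(open_site.can_fetch("ExampleBot", "/private/draft.html"))   # True
print(restricted.can_fetch("ExampleBot", "/private/draft.html"))  # False
print(restricted.can_fetch("ExampleBot", "/index.html"))          # True
```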

Now that you know what should go into your robots.txt file, there are several common mistakes to avoid when creating one. Comments are allowed, but only when they begin with a “#” character; any other stray notes in the file can confuse the webcrawler. Also, each record should always start with the User-agent line, followed by the Disallow line(s). Do not reverse the order. Another common mistake involves using the incorrect case. If the disallowed folder is /private, make sure your robots.txt file does not list the folder as /Private. It seems like a very minor issue, but paths are case-sensitive, so the rule will not match the folder. Finally, the original robots.txt standard had no Allow command: you could not tell the webcrawler what to look at, only what not to look at. The major search engines now recognize an Allow line, but older or simpler bots may ignore it, so do not rely on it alone.
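
Two of these mistakes, a Disallow line that comes before any User-agent line and a folder name in the wrong case, are easy to check for automatically. The sketch below is a hypothetical helper, not a standard tool; `lint_robots` and the `site_dirs` list of real folder names are assumptions made for the example.

```python
def lint_robots(text, site_dirs):
    """Return warnings for misplaced Disallow lines and wrong-case paths."""
    warnings = []
    seen_agent = False
    # Map lowercase folder names to their real spelling on the server.
    real_names = {d.lower(): d for d in site_dirs}
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            seen_agent = True
        elif field == "disallow":
            if not seen_agent:
                warnings.append(
                    f"line {number}: Disallow appears before any User-agent"
                )
            path = value.rstrip("/")
            actual = real_names.get(path.lower())
            if actual is not None and actual != path:
                warnings.append(
                    f"line {number}: {value} does not match the case of {actual}"
                )
    return warnings


bad_file = "Disallow: /Private\nUser-agent: *\nDisallow: /Private\n"
for warning in lint_robots(bad_file, ["/private"]):
    print(warning)
```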

If you are still curious about the robots.txt file, you can find many more complex examples online. Just pick one of your favorite websites and look for its robots.txt file. For example, you can go to www.cnn.com/robots.txt. If you need help creating a robots.txt file for your site, there are plenty of places online that will create the file for you for free. One example is www.seochat.com/seo-tools/robots-generator/. Despite its apparent simplicity, this file can make or break your site’s chances with the search engines. Make sure you have your robots.txt file in place and correctly formatted today.

Justin Scarborough is the founder of the Affiliate Marketing Linx internet marketing directory. His goal with this website is to create a very selective, human-edited directory that will help others find quality links and information relating to affiliate and internet marketing.

How To Keep Robots Out Of Your Web Site

THE ROBOTS.TXT FILE


Search engines were created to help people find information quickly on the Internet. They acquire much of their information through robots (also known as spiders or crawlers) that look for web pages on their behalf.


These robots explore the web, looking for and recording all kinds of information. They usually start from URLs submitted by users, from links they find on other web sites, from sitemap files, or from the top level of a site.


Once a robot accesses the home page, it recursively accesses every page linked from that page. A robot may also try to check every page it can find on a particular server.
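

To picture how a robot works its way through a site, here is a small sketch in Python. It walks a made-up, in-memory map of links rather than a real web site, and it uses a queue instead of true recursion; `SITE_LINKS`, `crawl`, and the page names are all invented for illustration.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
SITE_LINKS = {
    "/": ["/about", "/products"],
    "/about": ["/"],
    "/products": ["/products/widget", "/private/draft"],
    "/products/widget": [],
    "/private/draft": [],
}


def crawl(links, start="/", disallowed_prefixes=()):
    """Visit every page reachable from start, skipping disallowed paths."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue:
        page = queue.popleft()
        visited.append(page)
        for target in links.get(page, []):
            blocked = any(target.startswith(p) for p in disallowed_prefixes)
            if target not in seen and not blocked:
                seen.add(target)
                queue.append(target)
    return visited


print(crawl(SITE_LINKS))
print(crawl(SITE_LINKS, disallowed_prefixes=("/private",)))
```

The second call shows what a polite robot would do when told to stay out of one folder: the page under /private is never visited.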


After the robot finds a web page, it indexes the title, the keywords, the text, and so on. Sometimes, though, you might want to prevent search engines from indexing some of your web pages, such as news postings and specially marked pages (for example, affiliate pages). Keep in mind that compliance with these conventions is purely voluntary on the part of each robot.


ROBOTS EXCLUSION PROTOCOL


So if you want robots to keep out of some of your web pages, you can ask them to ignore the pages that you don’t want indexed. To do that, place a robots.txt file in the root directory of your web site.


For example, if you have a directory called e-books and you want to ask robots to keep out of it, your robots.txt file should read:


User-agent: *
Disallow: /e-books/
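

If you have several directories to protect, you can generate the record instead of typing it. The sketch below is a hypothetical helper (`build_record` is not part of any library). It puts each instruction on its own line and adds the leading slash, since paths in robots.txt are written relative to the site root.

```python
def build_record(user_agent, disallowed_paths):
    """Return one robots.txt record as text, one instruction per line."""
    lines = [f"User-agent: {user_agent}"]
    if not disallowed_paths:
        # An empty Disallow line means nothing is off limits.
        lines.append("Disallow:")
    for path in disallowed_paths:
        if not path.startswith("/"):
            path = "/" + path
        lines.append(f"Disallow: {path}")
    return "\n".join(lines) + "\n"


print(build_record("*", ["e-books/", "/cgi-bin/"]))
```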


If you don’t have enough control over your server to set up a robots.txt file, you can try adding a META tag to the head section of an HTML document instead.


For example, a tag like the following tells robots not to index a particular page and not to follow its links:


<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">


Support for the META tag among robots is not as widespread as support for the Robots Exclusion Protocol, but most major web indexes currently support it.
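

To check whether one of your pages actually carries the tag, you can read it the way a robot might. This is a minimal sketch using Python's standard `html.parser` module; the class name `RobotsMetaParser` and the function `robots_directives` are invented for the example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower()
                for part in content.split(",")
                if part.strip()
            )


def robots_directives(html):
    """Return the set of robots directives declared in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="ROBOTS" content="NOINDEX, NOFOLLOW"></head></html>'
print(sorted(robots_directives(page)))  # ['nofollow', 'noindex']
```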


NEWS POSTINGS


If you want to keep the search engines out of your news postings, you can add an “X-no-archive” line to your postings’ headers:


X-no-archive: yes


Most common news clients allow you to add an X-no-archive line to the headers of your news postings, but some of them don’t.
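

If you are not sure whether your news client really added the header, you can inspect a saved copy of the posting. The sketch below uses Python's standard `email` module; the function `is_no_archive` and the sample posting are made up, and how a particular archive treats the header is up to that archive.

```python
from email import message_from_string


def is_no_archive(raw_posting):
    """Return True if the posting asks archives not to keep it."""
    message = message_from_string(raw_posting)
    # Header names are matched without regard to case.
    return (message.get("X-No-Archive") or "").strip().lower() == "yes"


POSTING = """\
From: poster@example.com
Newsgroups: misc.test
Subject: Test post
X-no-archive: yes

Body text.
"""

print(is_no_archive(POSTING))  # True
```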


The problem is that most search engines assume that all information they find is public unless marked otherwise.


So be careful: although the robot and archive exclusion standards may help keep your material out of the major search engines, there are others that respect no such rules.


If you’re highly concerned about the privacy of your e-mail and Usenet postings, you should use anonymous remailers and PGP. You can read about them here:
www dot well dot com/user/abacard/remail.html
www dot io dot com/~combs/htmls/crypto.html
world dot std dot com/~franl/pgp/


Even if you are not particularly concerned about privacy, remember that anything you write may be indexed and archived somewhere for eternity, so use the robots.txt file wherever you need it.


Written by Dr. Roberto A. Bonomi

Dr. Roberto Bonomi is a successful e-book writer who shares his home business experience at www.easy-home-business.com. If you already have, or are looking for, an Internet home business, you can’t miss the free knowledge that you’ll receive at his site, and you can post your own articles for free at
