businessvartha: CRAWL ERRORS FACED BY BLOGGER

Wednesday, April 20, 2011

A blogger emailed me to ask about the crawl errors he saw on the Webmaster Tools page for his Blogger account. He saw many crawl errors and was unsure how important they were. Many bloggers who use Google Webmaster Tools run into these crawl errors frequently. In this post I will give a short explanation of such errors in Google Webmaster Tools.

After you add your blog to Google Webmaster Tools, Google's bots will crawl your blog at certain intervals. This crawling is very important for getting indexed in search engines, especially Google. In the Google Webmaster Tools Dashboard you can see your blog name; when you click on it, all the data about your site is displayed, including performance, keywords, links, and crawl errors. The Crawl errors page provides details about the URLs in your blog or website that Google tried to crawl but could not access. The Mobile crawl errors page provides details about problems that Googlebot encountered while crawling URLs on your mobile website.

To view crawl errors, log in to the Webmaster Tools home page, click the site you want, look under Diagnostics, and then click Crawl errors.

Different Crawl Errors and Details

The most commonly reported crawl errors are “Not found”, “URLs not followed”, “URLs restricted by robots.txt”, “URLs timed out”, “HTTP errors”, “URL unreachable”, and “Soft 404s”.

1. Not found : This crawl error indicates that, while crawling, the bot could not find the specific URL. Usually the server reported that the requested page doesn't exist (an HTTP 404 response), for example because the page was deleted or renamed, or because another page links to a mistyped address. If a page that does exist shows up here, the server that hosts your blog may have had a temporary technical problem. Usually this is not a very important crawl error.

2. URLs not followed : This crawl error lists URLs that Google's bots were unable to follow completely. This issue has some importance because, if a spider cannot follow a URL, it cannot collect the content and link data that search engines use to rank the page, which may keep the page from appearing in search results.

To overcome this error, reduce your site's reliance on features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash. Also, if you use dynamic pages (for instance, the URL contains a ? character), be aware that not every search engine spider crawls dynamic pages as well as static pages. Finally, use absolute links, that is, full URLs.
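
As a rough illustration of the last two points, the Python sketch below resolves relative links into full URLs and flags URLs that carry a ? query part. The blog address, the sample links, and the helper names (to_absolute, has_query_string) are made up for this example; only urljoin and urlparse come from Python's standard library.

```python
from urllib.parse import urljoin, urlparse


def to_absolute(base_url, href):
    """Resolve a link found on the page at base_url into a full URL."""
    return urljoin(base_url, href)


def has_query_string(url):
    """Return True if the URL has a '?' query part (a dynamic URL)."""
    return bool(urlparse(url).query)


# Hypothetical page address and links, used only for illustration.
PAGE = "http://example.blogspot.com/2011/04/crawl-errors.html"
links = [
    "../03/older-post.html",         # relative to the current folder
    "/search/label/seo",             # relative to the site root
    "http://example.com/page?id=7",  # already absolute, and dynamic
]

absolute_links = [to_absolute(PAGE, href) for href in links]

for url in absolute_links:
    note = " (dynamic URL)" if has_query_string(url) else ""
    print(url + note)
```

Writing the resolved form directly into your templates removes any guesswork about which folder a relative link starts from.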

3. URLs restricted by robots.txt : This crawl error is caused by a robots.txt restriction. Check whether your robots.txt file prohibits Googlebot entirely, prohibits access to the directory in which the URL is located, or prohibits access to the URL specifically. Also, if a URL redirects to a URL that is blocked by a robots.txt file, the first URL will be reported as being blocked by robots.txt (even if the URL is listed as Allowed in the robots.txt analysis tool).

Google respects the robots.txt file, which site owners use to prevent crawling of a URL for their own reasons. If you blocked the URL on purpose, there is no need to worry about this error.
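
If you are not sure whether a rule blocks a given URL, you can test it offline. The sketch below uses Python's standard urllib.robotparser module. The robots.txt content, the blog address, and the is_allowed helper are assumptions made up for this example, and Python's parser may not resolve conflicting Allow and Disallow rules exactly the way Googlebot does, so treat the result as a hint rather than a verdict.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs on an example blog.
blocked = is_allowed(ROBOTS_TXT, "Googlebot",
                     "http://example.blogspot.com/search/label/seo")
open_page = is_allowed(ROBOTS_TXT, "Googlebot",
                       "http://example.blogspot.com/2011/04/post.html")

print("label page allowed:", blocked)
print("post page allowed:", open_page)
```

With these sample rules the label page under /search is blocked while the ordinary post is allowed.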

4. URLs timed out: This crawl error happens when Google's spiders receive a timeout while trying to access or crawl the page. Check the page and make sure that the URL is accessible and loads in a reasonable time.

5. HTTP errors: This crawl error is reported when a request is made to your server for a page on your site and your server returns an HTTP error status code in response. All the HTTP status codes are listed here: http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
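
To make the status codes easier to read, the small Python sketch below groups a code into a broad category and looks up its standard reason phrase using the standard library's http.HTTPStatus. The describe_status helper and the category names are my own labels for this example, not anything shown in Webmaster Tools.

```python
from http import HTTPStatus


def describe_status(code):
    """Return (category, reason phrase) for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Code is not one of the registered HTTP status codes.
        phrase = "Unknown"

    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"   # e.g. the page is missing or forbidden
    elif 500 <= code < 600:
        category = "server error"   # e.g. the server failed or is overloaded
    else:
        category = "other"
    return category, phrase


for code in (200, 301, 404, 503):
    print(code, describe_status(code))
```

Codes in the 400 and 500 ranges are the ones that count as errors.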

6. URL unreachable: This crawl error is reported when Google encountered an error while trying to access a URL. This may be due to any of the reasons mentioned above. Apart from page loading time and robots.txt issues, the other causes are normal.

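7. Soft 404s : Normally, when someone requests a page that doesn’t exist, the server returns a 404 (not found) error. This HTTP response code informs both browsers and search engines that the page doesn’t exist. As a result, the content of the page will not be crawled or indexed by search engines. A soft 404 is reported when the server instead shows a "not found" style page (or redirects to an unrelated page such as the home page) but sends a success code such as 200. Because the response code claims the page exists, search engines may still crawl and index it.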
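
A soft 404, as the term is generally used, is a page whose text says "not found" while the HTTP status code says 200 (success). The Python sketch below is a very rough heuristic for spotting that mismatch once you already have a page's status code and HTML. The phrase list and the looks_like_soft_404 helper are assumptions made up for this example; this is not how Google detects soft 404s.

```python
# Hypothetical phrases that often appear on "not found" pages.
NOT_FOUND_PHRASES = (
    "page not found",
    "does not exist",
    "no longer available",
)


def looks_like_soft_404(status_code, body, phrases=NOT_FOUND_PHRASES):
    """Guess whether a response is a soft 404.

    A real 404 (or any non-200 code) is not a soft 404, because the
    status code already tells crawlers the truth.
    """
    if status_code != 200:
        return False
    text = body.lower()
    return any(phrase in text for phrase in phrases)


# Sample responses, written by hand for illustration.
print(looks_like_soft_404(200, "<h1>Sorry, Page Not Found</h1>"))
print(looks_like_soft_404(404, "<h1>Sorry, Page Not Found</h1>"))
print(looks_like_soft_404(200, "<h1>Crawl errors explained</h1>"))
```

The usual fix is to make the server send a real 404 status code for pages that don't exist.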

I hope this has given you a general idea about these crawl error notifications. I will publish more articles covering these subjects in more detail. Kindly subscribe to our feed with your email address.

2 comments:

Admin said...: Thanks so much for this information. I find it very useful and informative. Keep up the excellent work..!; April 20, 2011 at 11:34 PM
Best hosting service said...: Really great post and useful for me.; April 25, 2011 at 2:58 AM