Web crawlers, Robots.txt and Googlebot - Webner Blogs

What are web crawlers?

Web crawlers are programs or scripts that search engines use to gather data from websites, check links, and update their indexes so that search results stay up to date. Most search engines, such as Google and Yahoo, use crawlers to build and update their indexes.

Google’s main crawler (user-agent) is called Googlebot.

Use of robots.txt

User-agents like Googlebot read the robots.txt file to decide whether a link may be followed and crawled.
If you want all of your website’s pages to be crawled by Google, you don’t need a robots.txt file. But if you want to block Google’s crawlers from some of your pages, or explicitly allow them, you can do this in robots.txt by specifying Googlebot as the user-agent.

Example 1. When you want to fully block the Google web crawler:
All you have to do is create a robots.txt file in the web root directory of your website.
Step 1. touch /var/www/html/robots.txt
Step 2. vi /var/www/html/robots.txt (paste the content below and save the file):

User-agent: Googlebot
Disallow: /

# Note: "/" blocks the whole web root directory.
# Note: "*" is a wildcard; use it as the user-agent to block all crawlers (user-agents).
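
One way to sanity-check a rule set like this before deploying it is Python's standard-library urllib.robotparser module. The following is only a minimal sketch: the site www.example.com and the crawler name "OtherBot" are hypothetical, and the standard-library parser is not guaranteed to interpret every rule exactly as Googlebot does.

```python
from urllib.robotparser import RobotFileParser

# Rules from Example 1: block Googlebot from the whole site.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # www.example.com and OtherBot are hypothetical names used for illustration.
    page = "https://www.example.com/page.html"
    print("Googlebot:", is_allowed(ROBOTS_TXT, "Googlebot", page))  # False
    print("OtherBot:", is_allowed(ROBOTS_TXT, "OtherBot", page))    # True
```

Crawlers that are not named in the file, and for which no "*" group exists, remain unrestricted, which is why the second check reports True.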

Example 2. When you want to partially block the Google web crawler:

User-agent: Googlebot
Disallow: /important-docs
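
A Disallow value is matched as a path prefix, so the rule above covers more than one directory name. The sketch below illustrates this with Python's standard-library urllib.robotparser; the example paths are hypothetical and only meant to show which URLs share the prefix.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: Googlebot",
    "Disallow: /important-docs",
])

# Hypothetical paths: the first three all start with "/important-docs".
paths = [
    "/important-docs",
    "/important-docs/plan.pdf",
    "/important-docs-old/a.html",
    "/index.html",
]

results = {path: parser.can_fetch("Googlebot", path) for path in paths}

for path, allowed in results.items():
    print(f"{path}: {'allowed' if allowed else 'blocked'}")
```

If only the directory itself should be blocked, writing the rule with a trailing slash (Disallow: /important-docs/) keeps paths such as /important-docs-old/ crawlable.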

Example 3. When you want to block the Google web crawler from multiple directories:

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /logs/
Disallow: /temp/
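
When the list of blocked directories grows, it may be easier to generate the group than to type it by hand. The helper below is a hypothetical sketch (build_group is not part of any library) that produces the same text as Example 3 from a list of paths.

```python
def build_group(user_agent: str, disallow: list[str]) -> str:
    """Build one robots.txt group: a User-agent line plus one Disallow line per path."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in disallow]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Same directories as in Example 3.
    print(build_group("Googlebot", ["/cgi-bin/", "/logs/", "/temp/"]), end="")
```

The output can be redirected into /var/www/html/robots.txt, the web root path used in Example 1.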

Example 4. When you want to fully block the Google web crawler while allowing a specific directory:

User-agent: Googlebot
Disallow: /
Allow: /webapp/reports
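
Example 4 works because, when several rules match a URL, Google applies the most specific one (the longest matching path), and on a tie the least restrictive one. The sketch below is a simplified model of that precedence rule and is not Google's implementation: it handles plain path prefixes only, ignores the * and $ wildcards, and the function name is_allowed is hypothetical.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Decide access for path from (directive, prefix) rules using longest-match precedence."""
    best_len = -1
    best_allow = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        # An empty prefix matches nothing; skip rules that do not cover this path.
        if not prefix or not path.startswith(prefix):
            continue
        allow = directive.lower() == "allow"
        # A longer prefix wins; on equal length, prefer the allow rule.
        if len(prefix) > best_len or (len(prefix) == best_len and allow):
            best_len = len(prefix)
            best_allow = allow
    return best_allow


# Rules from Example 4.
RULES = [("disallow", "/"), ("allow", "/webapp/reports")]

if __name__ == "__main__":
    for path in ("/webapp/reports/q1.html", "/webapp/admin", "/index.html"):
        print(path, "->", "allowed" if is_allowed(path, RULES) else "blocked")
```

Only the first path is reported as allowed, because "/webapp/reports" is a longer match than "/".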

Example 5. When you want ads on all your pages, but you don’t want those pages to appear in Google Search. Here you block Googlebot but allow Mediapartners-Google:

User-agent: Googlebot
Disallow: /

User-agent: Mediapartners-Google
Disallow:
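
The intent of Example 5 (Googlebot blocked, Mediapartners-Google allowed) can be checked with Python's standard-library urllib.robotparser. In this sketch the Googlebot group uses "Disallow: /" to block everything, while the empty Disallow for Mediapartners-Google blocks nothing; the URL on www.example.com is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# "Disallow: /" blocks everything; an empty "Disallow:" blocks nothing.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: Mediapartners-Google
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://www.example.com/article.html"  # hypothetical page
access = {
    agent: parser.can_fetch(agent, url)
    for agent in ("Googlebot", "Mediapartners-Google")
}
print(access)  # {'Googlebot': False, 'Mediapartners-Google': True}
```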

Example 6. When you want all your pages to appear in Google Search, but you don’t want images from your private directory to be crawled:

User-agent: Googlebot
Disallow:

User-agent: Googlebot-Image
Disallow: /private
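
Example 6 relies on each Google crawler following the most specific group that names it: Googlebot-Image uses its own group when one exists and otherwise falls back to the Googlebot group. The sketch below is a simplified model of that selection, not Google's implementation; the functions select_group and is_blocked and the preference lists passed to them are assumptions made for illustration.

```python
def select_group(tokens: list[str], groups: dict[str, list[str]]) -> list[str]:
    """Return the Disallow prefixes of the first group matching a token, in preference order."""
    by_name = {name.lower(): rules for name, rules in groups.items()}
    for token in tokens:
        rules = by_name.get(token.lower())
        if rules is not None:
            return rules
    # Fall back to the wildcard group, or to no restrictions at all.
    return by_name.get("*", [])


def is_blocked(path: str, disallow: list[str]) -> bool:
    """A path is blocked if it starts with any non-empty Disallow prefix."""
    return any(prefix != "" and path.startswith(prefix) for prefix in disallow)


# Groups from Example 6, as Disallow prefixes ("" means nothing is blocked).
GROUPS = {"Googlebot": [""], "Googlebot-Image": ["/private"]}

if __name__ == "__main__":
    image_rules = select_group(["Googlebot-Image", "Googlebot"], GROUPS)
    web_rules = select_group(["Googlebot"], GROUPS)
    photo = "/private/photo.jpg"  # hypothetical image path
    print("Googlebot-Image blocked:", is_blocked(photo, image_rules))  # True
    print("Googlebot blocked:", is_blocked(photo, web_rules))          # False
```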
