You created a website, and that website is crawlable by search engines. But what if you don't want a search engine to crawl some parts of the website? How can you block those parts from search engines? With the robots.txt file.
So you'd like to understand the robots.txt file? Think of it as a traffic inspector for crawling bots: it lets bots onto some roads and blocks them from others. In other words, this file tells search engine bots which sections of your website they may crawl and which they may not.
Let's understand the robots.txt file of a website: what it is, and what functions it performs.
Search Engine Optimization and Robots.txt
What are search engine bots?: These are the bots that read data from a website or webpage and transfer it to the database of a search engine such as Google, Bing, or Yandex. For example, suppose you create a website and continuously publish articles on it.
Internal backlinks and robots.txt: Suppose you're writing an article and, within it, you link to an internal page (a category, a label, or anything else) that is blocked by robots.txt. The internal link tells the crawler to follow that page, while at the same time the rule in the robots.txt file forbids crawling it. So the best practice for internal pages that should be crawled but not indexed is to mark them noindex, not to disallow them in the robots.txt file.
What pages should be blocked by the robots.txt file?: It should block sensitive pages, such as the admin section of your website or blog. All other pages that produce junk or duplicate content should be kept out of the index with a noindex directive, using the robots meta tag or the X-Robots-Tag header.
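For instance, a page that should stay crawlable but out of the index can carry a robots meta tag in its head section (a minimal illustration; note that the page itself must not be disallowed in robots.txt, or crawlers will never see the tag):

```html
<!-- Keep this page out of the index, but still follow its links -->
<meta name="robots" content="noindex, follow">
```

For non-HTML resources such as PDFs, the same effect can be achieved with the HTTP response header `X-Robots-Tag: noindex`.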
Links from external resources will be blocked: Suppose someone links to the category section of your website. The search engine will try to crawl that page, but robots.txt blocks it from crawling, and a hard-earned backlink goes to waste.
Robots.txt syntax used
User-agent: This declares the bot or web crawler that the following instructions apply to; the sections it may or may not visit are then controlled with the Allow and Disallow directives.
Disallow: This declares the page sections that should not be crawled by search engines. Using this directive, you can save the crawl budget that the search engine dedicates to your website.
Allow: This is usually used for Googlebot to permit crawling of specific sections of a website. For example, it can allow a subfolder whose parent folder is disallowed.
Sitemap: This declares the location of the XML sitemap of the website or blog. Search engines like Google, Bing, and Yandex support this directive.
robots.txt file example:
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /*/junk/*
Disallow: /search
Allow: /

Sitemap: https://example.com/sitemap.xml
In the above robots.txt example:
- User-agent: Mediapartners-Google declares instructions for Google AdSense's crawler; the Disallow: directive that follows is empty, which means AdSense can crawl your whole website and display ads.
- The next block starts with User-agent: *, which means the instructions apply to all bots and crawlers other than Google AdSense.
- Disallow: /*/junk/* blocks any "junk" subfolder under any parent folder, and Disallow: /search blocks the top-level "search" folder. The Allow: / directive permits crawling of the rest of the website. In such Disallow rules you can include private sections of the website that should not appear in search results.
- Sitemap: https://example.com/sitemap.xml declares the location of the sitemap for the domain.
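To check how a given crawler is treated by rules like these, you can parse the file with Python's standard `urllib.robotparser`. A quick sketch; note that Python's parser follows the original robots.txt convention and does not understand Google's `*` wildcard extension, so the wildcard rule is left out here:

```python
from urllib import robotparser

ROBOTS_TXT = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# The generic rules block /search for every other crawler...
print(rp.can_fetch("*", "https://example.com/search"))        # False
# ...while the rest of the site stays allowed...
print(rp.can_fetch("*", "https://example.com/blog/post"))     # True
# ...and the empty Disallow lets AdSense crawl everything.
print(rp.can_fetch("Mediapartners-Google",
                   "https://example.com/search"))             # True
```

This is handy for verifying a robots.txt file before deploying it, instead of waiting for a search engine to misread it.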
Check how you can set it up for WordPress and Blogger.
Can the robots.txt file be placed in another directory or renamed?: No, the robots.txt file is always located in the root directory with the name robots.txt. You can't even change its name.
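Because the file must live at the root, its URL can always be derived from the site's origin alone. A small sketch (the helper name `robots_url` is mine, not a standard function):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Return the only valid robots.txt location for the site hosting page_url."""
    parts = urlsplit(page_url)
    # Path, query, and fragment are irrelevant: only scheme + host matter.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://example.com/blog/post?id=7"))
# https://example.com/robots.txt
```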
Can I provide multiple sitemaps in the robots.txt file?: Yes, you can provide multiple sitemaps in the robots.txt file.
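For example, a site that splits its sitemap by content type (the file names here are hypothetical) could list all of them:

```
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-images.xml
```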
Can robots.txt keep a page out of the search index?: No; for that, you have to use the robots meta tag or an X-Robots-Tag in the header response.