One of the main ways of telling a search engine where it can or can’t go on your website is through a robots.txt file. Search engines generally support the basic rules in a robots.txt file, and some respond to additional directives that let you get even more out of it.
This article will cover all there is to know about robots.txt – from how you can use it to how to avoid some significant mistakes that may end up harming your site. The key is to read through this guide and understand it entirely before integrating robots.txt into your website.
- What is robots.txt?
- How does the robots.txt standard work?
- The technicalities of robots.txt
- Examples
- What does it mean for you?
- The advantages and disadvantages of robots.txt
- Best practices for robots.txt files
- How to block potentially nefarious bots and scrapers in your robots.txt file
- How to validate your robots.txt file?
What is robots.txt?
Let’s start with the very basics: what is robots.txt? Essentially, a robots.txt file is a text file that is read by search engine spiders – more commonly called robots, hence the name robots.txt. The file follows a strict, specified syntax, mainly because it has to be computer-readable. That strictness makes the whole process predictable – there is very little room for ambiguity!
The robots.txt file is the result of a mutual agreement among early search engine spider developers. It is by no means an official standard set by a standards organization – yet all leading search engines adhere to it and behave accordingly.
For SEO purposes, sending the right signals to search engines is essential, and that is exactly what robots.txt files do. They tell search engines what your website’s rules of engagement are, which makes them an indispensable part of your SEO strategy.
How does the robots.txt standard work?
Here’s how it all works: search engines index the web by spidering pages, following links to go from site A to site B to site C. However, before a search engine spider lands on a page of a domain it hasn’t interacted with before, it will open that domain’s robots.txt file. Because the robots.txt file tells the search engine which URLs on that site it is allowed to crawl, the search engine then knows how to interact with and navigate your site.
Typically, the search engine will cache the contents of the robots.txt file, but it will usually refresh it several times a day, so changes are picked up fairly quickly.
The technicalities of robots.txt
A robots.txt file consists of one or more blocks of directives. Each block starts with a user-agent line, which names the particular spider the block addresses. You can have a single block for all search engines, using a wildcard for the user-agent, or separate blocks for specific search engines. However, keep in mind that a search engine spider will always pick the block that best matches its name!
Example directives:
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow:
User-agent: bingbot
Disallow: /not-for-bing/
*One thing to keep in mind when creating directives is that directive names aren’t case sensitive, so whether you write Allow/Disallow in lower or upper case is up to you. The values, however, such as /photos/, are case sensitive, so make sure you use the right case for those.
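For example, here is a minimal sketch using a hypothetical /Photos/ directory:

User-agent: *
disallow: /Photos/

Writing disallow in lower case works fine, but the rule only covers /Photos/ – a directory named /photos/ would not be matched, because the values are case sensitive.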
As intimidating as these look, once you get the hang of it, it will all make sense!
Examples
To help make integrating robots.txt into your site easier, here is a list of common user-agents for the most widely used search engines.
| Search engine | Field | User-agent |
|---|---|---|
| Baidu | General | baiduspider |
| Baidu | Images | baiduspider-image |
| Baidu | Mobile | baiduspider-mobile |
| Baidu | News | baiduspider-news |
| Baidu | Video | baiduspider-video |
| Bing | General | bingbot |
| Bing | General | msnbot |
| Bing | Images & Video | msnbot-media |
| Bing | Ads | adidxbot |
| Google | General | Googlebot |
| Google | Images | Googlebot-Image |
| Google | Mobile | Googlebot-Mobile |
| Google | News | Googlebot-News |
| Google | Video | Googlebot-Video |
| Google | AdSense | Mediapartners-Google |
| Google | AdWords | AdsBot-Google |
| Yahoo! | General | slurp |
| Yandex | General | yandex |
*For robots.txt files, Google enforces a file size limit of 500 kibibytes (512 kilobytes). Keep in mind that any content beyond this maximum file size may be ignored by Google!
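As a quick, hypothetical sketch of how these user-agents are used (the /photos/ path is just an example), you could keep Google’s image crawler out of one directory while leaving the general Googlebot unrestricted:

User-agent: Googlebot-Image
Disallow: /photos/

User-agent: Googlebot
Disallow:

Remember that each spider picks the block that best matches its name, so Googlebot-Image follows its own block here rather than the Googlebot one.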
What does it mean for you?
If you’re reading this article, chances are you want to know where to put your robots.txt file. That’s simple: the robots.txt file should always sit at the root of your domain. So, if your domain is www.something.com, crawlers will look for it at https://www.something.com/robots.txt.
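As a quick illustration with the same hypothetical domain:

https://www.something.com/robots.txt – this is where crawlers look for the file
https://www.something.com/some-folder/robots.txt – a file placed here will simply be ignored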
The advantages and disadvantages of robots.txt
Like everything, robots.txt files have their pros and cons. To understand them better, let’s now turn to those.
Advantage: managing your crawl budget
The most significant advantage of robots.txt files is that they allow you to manage your crawl budget. It is believed that when a search spider arrives at your site, it comes with a pre-determined “allowance” for how many pages it will crawl. In SEO jargon, this is called the crawl budget. What this means for you is that you can restrict sections of your site from the search engine spider and ultimately let your crawl budget be spent on other sections.
Blocking search engines from crawling troublesome sections of your site can be considerably beneficial, especially for sites where a lot of SEO clean-up has to be done. Then, once you clean things up, you can let the spiders back in whenever you require.
Additionally, it can also help you with the following (see the example after this list):
- It can help you prevent the appearance of duplicate content.
- If some parts of your website are under construction, you can use robots.txt to keep crawlers away from the unfinished pages!
- You can keep crawlers away from pages that aren’t meant for the general public (though, as discussed below, this alone won’t keep them out of search results).
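As a rough sketch of what this can look like in practice – the paths below are hypothetical – you could keep spiders out of your internal search results and an unfinished section while leaving the rest of the site crawlable:

User-agent: *
Disallow: /search/
Disallow: /under-construction/

Compliant crawlers will then spend their crawl budget on the sections you actually want them to visit.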
Disadvantages: search results and link value
Not removing the page from search results
One of the more noticeable disadvantages of robots.txt files is that although they let you hide specific pages of your website from crawlers, they don’t remove those pages from the search results.
Back in the day, you could do exactly that by using the “noindex” directive: if you added a noindex directive to your robots.txt, you could remove URLs from Google’s search results. However, this is no longer supported.
Basically, just because you can use the robots.txt file to tell a spider where it can’t go on your website doesn’t mean you can use it to stop a search engine from displaying the URL of a blocked page in its results. Simply put: blocking a page with robots.txt will not stop it from being indexed.
Not spreading link value
Another disadvantage of robots.txt is that if the search engine can’t crawl a page, it can’t spread link value through the links on that page. If you block a page with robots.txt, you need to understand that it becomes pretty much a dead end: any link value that could flow through that page is lost.
Some other risks it poses are:
- It reveals to attackers the layout of your site’s directory structure and the location of potentially private areas (not a serious issue if the web server’s security is properly set up).
- If the file is configured incorrectly – for example, by disallowing the entire site – it can cause search engines to drop your pages from their index or stop crawling your site properly.
Best practices for robots.txt files
Now, let’s look at some of the best practices you can adopt for robots.txt files:
- The robots.txt file should be placed in the root of a website (this is the top-level directory of the host) and carry the filename robots.txt. This is case sensitive!
- Use only one group of directives per robot.
- Be as specific as possible, especially when defining the Disallow directive, because it also triggers on partial matches (see the example after this list).
- Make sure you have a block of directives for all bots (User-agent: *) as well as blocks for the specific bots that need different rules.
- Monitor your robots.txt file for any changes.
- Don’t use noindex in your robots.txt.
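To illustrate the point about partial matches, here is a small sketch using hypothetical paths. A Disallow value is matched as a prefix, so being specific keeps you from blocking more than you intended.

Too broad – this also matches /administration/ and /admin-tools.html:

User-agent: *
Disallow: /admin

More specific – this only matches URLs under the /admin/ directory:

User-agent: *
Disallow: /admin/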
How to block potentially nefarious bots and scrapers in your robots.txt file
You can find an example robots.txt configuration created by mitchellkrogza featuring bots you can safely disallow.
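As a rough sketch of what such a configuration looks like – the user-agent names below are placeholders, so check the list you use for the real names of the bots you want to block:

User-agent: BadBot
Disallow: /

User-agent: AnotherScraperBot
Disallow: /

A Disallow: / rule under a bot’s user-agent asks that bot to stay away from your entire site. Keep in mind that truly malicious scrapers often ignore robots.txt, so this only deters the well-behaved ones.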
How to validate your robots.txt file?
Hexometer monitors your robots.txt file as standard and alerts you if any issues are detected. This is by far the easiest and most convenient way of ensuring that a configuration issue in your robots.txt is not causing havoc on your website.
Alternatively, there are various tools out there that can assist you, including Google’s robots.txt testing tool in Google Search Console.