A robots.txt file is a mechanism by which a webmaster indicates to a search engine crawler which pages can and cannot be requested from a website. These rules are not limited to HTML pages and can cover all types of files, including images and video. Although the concept may seem counterintuitive, since the goal is usually for a search engine to discover everything on a website, a site with many unimportant resources and heavy crawler traffic can overwhelm a server, and a crawler may end up indexing files that do not help with ranking while ignoring important content. A robots.txt file can therefore steer crawlers toward the right information while blocking other resources, such as duplicate content or anything else that might negatively affect the site. The file follows the Robots Exclusion Protocol, a convention developed in 1994.
How do these files work, and what is the format of robots.txt?
The resource is a simple plain-text (ASCII) file, so anyone creating one can use a basic editor such as Notepad. More complex word processors like Microsoft Word add proprietary formatting information to the file, which makes them unsuitable for the task. The file must be placed in the root directory of the website and named robots.txt for bots to find and read it. Websites that use subdomains need a separate robots.txt in the root of each subdomain. After the file is created, webmasters can use tools such as a robots.txt validator to check the file's format, contents, and syntax.
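Beyond a dedicated validator, the rules in a robots.txt file can also be checked programmatically. The sketch below uses Python's standard urllib.robotparser module; the example.com URLs and the /private/ path are placeholders, not taken from any real site:

```python
# Minimal sketch: parsing robots.txt rules and testing URLs against them.
# All URLs and paths here are hypothetical examples.
from urllib.robotparser import RobotFileParser

# Parse the rules from a list of lines rather than fetching them over HTTP,
# which keeps the check fully offline.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(rules)

# can_fetch(user_agent, url) returns True if the rules permit crawling.
print(rp.can_fetch("*", "https://example.com/private/page.html"))
print(rp.can_fetch("*", "https://example.com/public/page.html"))
```

The first check fails because /private/ is disallowed for every user agent, while the second succeeds because anything not explicitly restricted remains crawlable by default.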
The format of the file follows a basic pattern of User-agent and Disallow directives that tell a bot which resources are off limits to crawling. A file can be as short as two lines with this basic pair, or it can contain many directives. If the file has more than one set of instructions, the directives are grouped into records, and each record is separated from the next by a blank line. This allows webmasters to permit certain resources while restricting others, and to give different bots different rules. When a bot reads the robots.txt resource, it does so sequentially, starting from the top and finishing at the end of the file. Users can add comments by placing a pound sign (#) before them; these are useful as reminders or to explain the purpose of a restriction for future reference. Webmasters should be aware that the paths in the file are case sensitive and that all files are allowed for crawling unless explicitly restricted.
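As an illustration of this format, a hypothetical robots.txt might look like the following; the bot name and directory paths are invented for the example:

```text
# Keep ExampleBot out of the image directory
User-agent: ExampleBot
Disallow: /images/

# All other bots: everything is crawlable except the tmp folder
User-agent: *
Disallow: /tmp/
```

Note the blank line separating the two records, and the pound-sign comments explaining each restriction.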
Why are these files important?
There are many reasons to use robots.txt to keep a search engine from crawling every file on a website. One is that search engines like Google do not like duplicate content. In some cases, duplicating information is necessary to give visitors access to it; to avoid being penalized by a search engine, webmasters can restrict the duplicate resource. Additionally, larger websites may have countless files, and when a bot crawls a site it may work within a limit called a crawl budget. The goal of search engine optimization (SEO) is to rank a page as high as possible by optimizing its content for relevance and visibility. However, if the bot spends its crawl budget reviewing low-quality or unimportant URLs, any SEO performed on the site is wasted. As a countermeasure, webmasters can target specific content by allowing only pages with high-value information and restricting everything else, ensuring that the right content is indexed by search engines.
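One way to express such a strategy, assuming the crawler honors the Allow directive (a widely adopted extension supported by Google and other major engines, though not part of the original 1994 protocol), is to disallow everything and then re-allow only the high-value sections. The /blog/ path here is a made-up example:

```text
User-agent: *
Disallow: /
Allow: /blog/
```

Crawlers that support Allow generally let the more specific rule win, so /blog/ stays crawlable while the rest of the site is restricted and the crawl budget is spent where it matters.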