Robots.txt: What It Is And How To Create It (Complete Guide)

If you own a website or manage its content, you’ve likely heard of robots.txt. It’s a file that instructs search engine robots on how to crawl and index your website’s pages. Despite its importance in search engine optimization (SEO), many website owners overlook the significance of a well-designed robots.txt file.

In this complete guide, we’ll explore what robots.txt is, why it’s important for SEO, and how to create a robots.txt file for your website.

What Is a Robots.txt File?

A robots.txt file tells search engine robots (also known as crawlers or spiders) which pages or sections of a website they may or may not crawl. It is a plain text file located in the root directory of a website, and it typically lists the directories, files, or URLs that the webmaster wants to keep search engines away from.

This is what a robots.txt file looks like:

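For illustration, a very simple file might contain something like the following (the blocked path, allowed path, and sitemap URL here are placeholders; a real file reflects your own site's structure):

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://www.example.com/sitemap.xml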

Why Is Robots.txt Important?

There are three main reasons why robots.txt is important for your website:

1. Maximize Crawl Budget

“Crawl budget” refers to the number of pages Google will crawl on your site in a given period. That number depends on your site’s size, health, and number of backlinks.

The crawl budget matters because if the number of pages on your site exceeds it, some pages will not be crawled or indexed.

Furthermore, pages that are not indexed will not rank for anything.

By using robots.txt to block low-value pages, you let Googlebot (Google’s web crawler) spend more of your crawl budget on the pages that matter.
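For instance, rules like the following might keep crawlers away from thin tag archives and parameter-based URLs (the path and parameter below are placeholders; adapt them to the low-value URLs on your own site):

User-agent: *
Disallow: /tag/
Disallow: /*?ref=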

2. Block Non-Public Pages

You likely have pages on your site that you do not want indexed.

For example, you might have an internal search results page or a login page. These pages need to exist. However, you don’t want random people to land on them.

In this case, you’d use robots.txt to prevent search engine crawlers and bots from accessing certain pages.
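A minimal sketch for this scenario, assuming the login page lives at /login/ and internal search results at /search/, could look like this:

User-agent: *
Disallow: /login/
Disallow: /search/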

3. Prevent Indexing of Resources

Sometimes you will want Google to exclude resources such as PDFs, videos, and images from search results.

Perhaps you want to keep those resources private, or you want Google to focus on your more important content.

In such cases, robots.txt is a simple way to keep them from being crawled and showing up in search results.
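As an illustration, the following rules would keep crawlers away from all PDF files and from an /images/ directory (both patterns are placeholders):

User-agent: *
Disallow: /*.pdf$
Disallow: /images/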

How Does a Robots.txt File Work?

A robots.txt file tells search engine bots which pages or directories of a website they should or should not crawl.

While crawling, search engine bots find and follow links. This process leads them from site X to site Y to site Z over billions of links and websites.

When a bot visits a site, the first thing it does is look for a robots.txt file.

If it detects one, it will read the file before doing anything else.

For example, suppose you want to allow every bot except DuckDuckGo’s crawler (DuckDuckBot) to crawl your site. Since crawlers are allowed by default, you only need a rule blocking DuckDuckBot:

User-agent: DuckDuckBot
Disallow: /

Note: A robots.txt file can only give instructions; it cannot enforce them. It’s similar to a code of conduct. Good bots (such as search engine bots) will follow the rules, whereas bad bots (such as spam bots) will ignore them.

How to Find a Robots.txt File?

The robots.txt file, like any other file on your website, is hosted on your server.

You can access the robots.txt file of any website by entering the complete URL of the homepage and then adding /robots.txt at the end, such as https://pickupwp.com/robots.txt.


However, if the website does not have a robots.txt file, you will receive a “404 Not Found” error message.

How to Create a Robots.txt File?

Before showing how to create a robots.txt file, let’s first look at the robots.txt syntax.

The syntax of a robots.txt file can be broken down into the following components:

  • User-agent: This specifies the robot or crawler that the record applies to. For example, “User-agent: Googlebot” would apply only to Google’s search crawler, while “User-agent: *” would apply to all crawlers.
  • Disallow: This specifies the pages or directories that the robot should not crawl. For example, “Disallow: /private/” would prevent robots from crawling any pages within the “private” directory.
  • Allow: This specifies the pages or directories that the robot is allowed to crawl, even if the parent directory is disallowed. For example, “Allow: /public/” lets robots crawl pages within the “public” directory even when its parent directory is blocked.
  • Crawl-delay: This specifies how many seconds a crawler should wait between requests. For example, “Crawl-delay: 10” asks the bot to wait 10 seconds between successive requests. Keep in mind that Googlebot ignores this directive.
  • Sitemap: This specifies the location of the website’s sitemap. For example, “Sitemap: https://www.example.com/sitemap.xml” would inform the robot of the location of the website’s sitemap.

Here is an example of a robots.txt file:

User-agent: Googlebot
Disallow: /private/
Allow: /public/
Crawl-delay: 10
Sitemap: https://www.example.com/sitemap.xml

Note: Robots.txt rules are case-sensitive, so use the correct case when specifying URLs.

For example, /public/ is not the same as /Public/.

On the other hand, directives like “Allow” and “Disallow” are not case-sensitive, so it’s up to you whether to capitalize them.
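For example, the following two rules target different directories even though only the capitalization differs:

Disallow: /private/   # blocks /private/ but not /Private/
Disallow: /Private/   # blocks /Private/ but not /private/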

After learning about robots.txt syntax, you can create a robots.txt file using a robots.txt generator tool or create one yourself.

Here is how to create a robots.txt file in just four steps:

1. Create a New File and Name It Robots.txt

Simply open a .txt document with any text editor or web browser.

Next, give the document the name robots.txt (all lowercase). To work, it must be named exactly robots.txt.

Once done, you can now start typing directives.

2. Add Directives to the Robots.txt File

A robots.txt file contains one or more groups of directives, each with multiple lines of instructions.

Each group starts with a “User-agent” and contains the following data:

  • Who the group applies to (the user-agent)
  • Which directories (pages) or files the agent can access
  • Which directories (pages) or files the agent cannot access
  • A sitemap (optional) to tell search engines which pages and files you consider important

Lines that do not match any of these directives are ignored by crawlers.

For example, suppose you want to prevent Google from crawling your /private/ directory.

It would look like this:

User-agent: Googlebot
Disallow: /private/

If you had further instructions for Google, you’d put each one on a separate line directly below, like this:

User-agent: Googlebot
Disallow: /private/
Disallow: /not-for-google

Once you’re done with Google’s specific instructions, you can start a new group of directives.

For example, if you wanted to prevent all search engines from crawling your /archive/ and /support/ directories, it would look like this:

User-agent: Googlebot
Disallow: /private/
Disallow: /not-for-google

User-agent: *
Disallow: /archive/
Disallow: /support/

When you’re finished, you can add your sitemap.

Your completed robots.txt file should look like this:

User-agent: Googlebot
Disallow: /private/
Disallow: /not-for-google

User-agent: *
Disallow: /archive/
Disallow: /support/

Sitemap: https://www.example.com/sitemap.xml

Next, save your robots.txt file. Remember, it must be named robots.txt.

For more useful robots.txt rules, check out this helpful guide from Google.

3. Upload the Robots.txt File

After saving your robots.txt file to your computer, upload it to your website and make it available for search engines to crawl.

Unfortunately, there is no universal tool for this step, because how you upload the file depends on your site’s file structure and web hosting. Whatever method you use, the file must end up in your site’s root directory so that it is reachable at a URL like https://www.example.com/robots.txt.

For instructions on how to upload your robots.txt file, search online or contact your hosting provider.

4. Test Your Robots.txt

After you’ve uploaded the robots.txt file, check whether it is publicly visible and whether Google can read it.

Simply open a new tab in your browser and navigate to your robots.txt URL.

For example, https://pickupwp.com/robots.txt.


If you can see your robots.txt file, you’re ready to test its directives.

For this, you can use Google’s robots.txt Tester.


Note: You need a Search Console account set up to test your robots.txt file using the robots.txt Tester.

The robots.txt tester will find any syntax warnings or logic errors and highlight them.

It also lists the warnings and errors below the editor.


You can edit errors or warnings on the page and retest as often as necessary.

Just keep in mind that changes made on the page aren’t saved to your site.

To apply any changes, copy the corrected rules and paste them into the robots.txt file on your site.

Robots.txt Best Practices

Keep these best practices in mind while creating your robots.txt file to avoid some common mistakes.

1. Use New Lines for Each Directive

To prevent confusion for search engine crawlers, add each directive to a new line in your robots.txt file. This applies to both Allow and Disallow rules.

For example, if you don’t want a web crawler to crawl your blog or contact page, add the following rules:

Disallow: /blog/
Disallow: /contact/

2. Use Each User Agent Only Once

Bots won’t have a problem if you list the same user agent more than once.

However, using it just once keeps things organized and reduces the chance of human error.
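For instance, instead of repeating the same user agent across separate groups, you can collect the rules under a single group (the paths below are placeholders):

# Repeated user agent: works, but harder to maintain
User-agent: Googlebot
Disallow: /private/

User-agent: Googlebot
Disallow: /archive/

# Better: one group per user agent
User-agent: Googlebot
Disallow: /private/
Disallow: /archive/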

3. Use Wildcards To Simplify Instructions

If you have a large number of pages to block, adding a rule for each one might be time-consuming. Fortunately, you may use wildcards to simplify your instructions.

A wildcard is a character that can represent one or more characters. The most commonly used wildcard is the asterisk (*).

For instance, if you want to block all files that end in .jpg, you would add the following rule:

Disallow: /*.jpg

4. Use “$” To Specify the End of a URL

The dollar sign ($) is another wildcard that can be used to mark the end of a URL. This is useful if you want to block a specific page but not other URLs that start with the same path.

Suppose you want to block the contact page but not the contact-success page. You would add the following rule:

Disallow: /contact$

5. Use the Hash (#) To Add Comments

Anything that follows a hash (#) on a line is ignored by crawlers.

As a result, developers often use the hash to add comments to the robots.txt file. It keeps the document organized and readable.

For example, if you block all files ending in .jpg, you may add a comment above the rule:

# Block all files that end in .jpg
Disallow: /*.jpg

This helps anyone understand what the rule is for and why it’s there.

6. Use Separate Robots.txt Files for Each Subdomain

If your website has multiple subdomains, create an individual robots.txt file for each one. Crawlers treat each subdomain as a separate site, so this keeps things organized and makes your rules easier to follow.
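For example, assuming a main site and a blog subdomain, each host serves its own file, and the rules in one do not apply to the other:

https://www.example.com/robots.txt (rules for the main site only)
https://blog.example.com/robots.txt (rules for the blog subdomain only)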

Wrapping Up!

The robots.txt file is a useful SEO tool since it tells search engine bots what to crawl and what to skip.

However, it is important to use it with caution, since a misconfiguration (e.g., Disallow: / for all user agents) can result in your entire website disappearing from search results.

Generally, the best approach is to let search engines crawl as much of your site as possible while keeping sensitive pages off-limits and avoiding duplicate content. For example, you can use the Disallow directive to block specific pages or directories, or the Allow directive to override a Disallow rule for a particular page.
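As a quick sketch (the paths are placeholders), a Disallow rule can block a whole directory while a more specific Allow rule re-opens a single page inside it:

User-agent: *
Disallow: /private/
Allow: /private/press-kit.html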

It’s also worth mentioning that not all bots follow the rules provided in the robots.txt file, so it’s not a perfect method for controlling what gets indexed. But it’s still a valuable tool to have in your SEO strategy.

We hope this guide helps you learn what a robots.txt file is and how to create one.


Lastly, follow us on Facebook and Twitter for regular updates on new articles.