Robots.txt File Testing

Testing robots.txt with the SEO Spider

A robots.txt file is used to issue instructions to robots on what URLs can be crawled on a website. All major search engine bots conform to the robots exclusion standard and will read and obey the instructions of the robots.txt file, before fetching any other URLs from the website.

Commands can be set up to apply to specific robots according to their user-agent (such as ‘Googlebot’), and the most common directive used within a robots.txt is a ‘disallow’, which tells the robot not to access a URL path.

You can view a site’s robots.txt in a browser, by simply adding /robots.txt to the end of the domain name (www.insightbeforeaction.com/robots.txt for example).

Robots.txt file examined under a chosen domain name
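
If you prefer to check this from the command line, the short sketch below (a minimal example using Python's standard library, with a placeholder domain) simply fetches and prints a site's robots.txt file.

# Minimal sketch: fetch and print a site's robots.txt.
# The domain is a placeholder - swap in the site you want to check.
from urllib.request import urlopen

with urlopen("https://www.example.com/robots.txt") as response:
    print(response.read().decode("utf-8", errors="replace"))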

While robots.txt files are generally fairly simple to interpret, when there are lots of lines, user-agents, directives and thousands of pages, it can be difficult to identify which URLs are blocked and which are allowed to be crawled.

Blocking URLs by mistake can have a huge impact on visibility in the search results, so caution is advised.

To see what is blocked or controlled via the robots.txt file, select the Response Codes tab in the top menu and then filter to ‘Blocked by Robots.txt’.

Using this feature within Screaming Frog helps to identify and validate the contents of the robots.txt file.

From here you can download the robots.txt file and add it to Google’s robots.txt Tester tool. This will analyse the file to see if there are any errors or warnings associated with it.

Google’s robots.txt Tester tool lets you view and amend the robots.txt file of a chosen web property
Check the settings of your robots.txt file within Screaming Frog

How the SEO Spider obeys robots.txt

The Screaming Frog SEO Spider obeys robots.txt in the same way as Google. It will check the robots.txt of the subdomain(s) being crawled and follow (allow/disallow) directives specifically for the ‘Screaming Frog SEO Spider’ user-agent; if none exist, it will follow directives for Googlebot, and failing that, those for ALL robots.
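
As a rough illustration of this kind of check, the sketch below uses Python’s standard urllib.robotparser to ask whether a given user-agent may fetch a URL. It is a generic check against a placeholder domain, not the SEO Spider’s exact fallback logic.

# Sketch: check whether different user-agents may fetch a URL, based on the
# live robots.txt. The URLs are placeholders for illustration only.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()  # fetches and parses the robots.txt file

for user_agent in ("Screaming Frog SEO Spider", "Googlebot", "*"):
    allowed = rp.can_fetch(user_agent, "https://www.example.com/some-page/")
    print(user_agent, "-", "allowed" if allowed else "blocked")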

URLs that are disallowed in robots.txt will still appear and be ‘indexed’ within the user interface with a status of ‘Blocked by Robots.txt’; they simply will not be crawled, so the content and outlinks of the page will not be seen.

Showing internal or external links blocked by robots.txt in the user interface can be switched off in the robots.txt settings.

It’s important to remember that URLs blocked in robots.txt can still be indexed in the search engines if they are linked either internally or externally.

A robots.txt disallow merely stops the search engines from crawling the content of the page. A ‘noindex’ meta tag (or X-Robots-Tag HTTP header) is a better option for removing content from the index.

The tool also supports URL matching of file values (wildcards * and $), just like Googlebot.

Typical Robots.txt examples

An asterisk alongside the ‘User-agent’ command (User-agent: *) indicates that the directives apply to ALL robots, while commands can also be targeted at a specific bot by naming its user-agent (such as User-agent: Googlebot).

If commands are used for both all and specific user-agents, then the ‘all’ commands will be ignored by the specific user-agent bot and only its own directives will be obeyed.

If you want the global directives to be obeyed, then you will have to include those lines under the specific User-agent section as well.
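
To illustrate the point, the short sketch below parses an invented robots.txt with Python’s standard urllib.robotparser and shows that, once Googlebot has its own section, the global (*) rules no longer apply to it.

# Sketch: a specific user-agent section overrides the global (*) section.
# The robots.txt content and URLs below are invented for the example.
from urllib import robotparser

robots_txt = """
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /googlebot-only/
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("SomeOtherBot", "https://www.example.com/private/page"))  # False - global rule applies
print(rp.can_fetch("Googlebot", "https://www.example.com/private/page"))     # True - Googlebot ignores the * section
print(rp.can_fetch("Googlebot", "https://www.example.com/googlebot-only/"))  # False - its own rule applies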

Below are some typical examples of directives used within a robots.txt file:

Block all Robots from all URLs

User-agent: *
Disallow: /

Block all Robots from a folder

User-agent: *
Disallow: /folder/

Block all Robots from a specific URL

User-agent: *
Disallow: /a-specific-url.html

Block Googlebot from all URLs

User-agent: Googlebot
Disallow: /

Block and allow commands together

User-agent: Googlebot
Disallow: /
Allow: /crawl-this/

If you have conflicting directives (i.e. an allow and a disallow applying to the same file path), then a matching allow directive beats a matching disallow when it contains an equal or greater number of characters.
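
As a simplified illustration of that precedence (ignoring wildcards, which are covered in the next section), the sketch below treats each rule as a plain path prefix: the longest matching rule wins, and an allow beats a disallow of equal length. It is not the Spider’s own code.

# Sketch: precedence for plain path prefixes - the most specific (longest)
# matching rule wins, and "allow" beats "disallow" on a tie.
def is_allowed(path, rules):
    """rules is a list of (directive, rule_path) tuples, e.g. ("allow", "/crawl-this/")."""
    matches = [(len(rule_path), directive == "allow")
               for directive, rule_path in rules
               if path.startswith(rule_path)]
    if not matches:
        return True  # no matching rule, so crawling is allowed by default
    _, allowed = max(matches)  # longest path wins; "allow" wins ties
    return allowed

# The "block and allow commands together" example above:
rules = [("disallow", "/"), ("allow", "/crawl-this/")]
print(is_allowed("/crawl-this/page.html", rules))  # True - the longer allow rule wins
print(is_allowed("/somewhere-else/", rules))       # False - only "Disallow: /" matches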

Robots.txt URL wildcard matching

Google and Bing allow the use of wildcards in the robots.txt file. For example, to block all crawlers’ access to all URLs that include a question mark (?):

User-agent: *
Disallow: /*?

You can use the dollar ($) character to match the end of the URL. For example, to block all crawlers’ access to the .html file extension:

User-agent: *
Disallow: /*.html$
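
As a rough sketch of how such patterns can be evaluated (a simplified illustration, not the Spider’s or Google’s implementation), the example below translates a rule path containing * and $ into a regular expression and tests it against sample URL paths.

# Sketch: match robots.txt rule paths containing the * and $ wildcards
# against URL paths by translating them into regular expressions.
import re

def rule_matches(rule_path, url_path):
    anchored = rule_path.endswith("$")  # $ anchors the rule to the end of the URL
    if anchored:
        rule_path = rule_path[:-1]
    # Escape regex metacharacters, then turn each escaped * back into ".*"
    pattern = "^" + re.escape(rule_path).replace(r"\*", ".*")
    if anchored:
        pattern += "$"
    return re.match(pattern, url_path) is not None

# The two examples above:
print(rule_matches("/*?", "/category?page=2"))        # True - URL contains a query string
print(rule_matches("/*.html$", "/folder/page.html"))  # True - URL ends in .html
print(rule_matches("/*.html$", "/page.html?x=1"))     # False - $ requires .html at the very end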

Finding blocked images

Please note: some images might be blocked by the robots.txt file, so you may wish to find them and see what they are within your crawl. To do this, select Configuration > Robots.txt > Settings, choose ‘Ignore Robots.txt’ and press OK.


Further support

The guide above should help illustrate the simple steps required to get started with the Screaming Frog SEO Spider.

You can read more about URL matching based on path values in Google’s robots.txt specifications guide.

For more information, check out the videos on YouTube.

Likewise, if you have any further questions, then please get in touch via our contact page.