The robots.txt file is a text file that guides search engine crawlers on which pages of a website they are allowed to access and crawl. It specifies the URLs that crawlers can and cannot access, follows the Robots Exclusion Protocol standard, and is generally placed in the website's root directory. The robots.txt file helps with search engine optimization by allowing webmasters to exclude internal or private pages from being indexed, and by limiting crawling to make better use of crawl budget and crawl demand.
Introduces the robots.txt file, a UTF-8 encoded text file that guides crawlers on which pages to access. It specifies URLs for crawling and is a crucial part of website–crawler communication.
Explains the protocol that governs communication between websites and crawlers, and describes the standard format of robots.txt files and their placement in the website hierarchy.
Details the syntax for specifying which user agents can crawl which pages, highlights accessibility rules, and references Google Search Central for documentation.
Discusses how robots.txt helps SEO by excluding non-public pages from indexing and addressing crawl budget issues for better site optimization.
Lists important rules for robots.txt file creation, including having a single file, its placement, parsing requirements, and essential fields such as user-agent and sitemap.
Robots.txt file:
A text file (UTF-8 encoded).
User-agent: the crawler that has come to crawl the pages. It may be Googlebot or another search engine's bot.
Allow: the pages listed after this directive are allowed to be crawled.
Disallow: the pages listed after this directive are excluded from crawling.
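As a quick illustration of how these three directives appear together, a minimal robots.txt (with /private/ used purely as a placeholder path) could look like this:
User-agent: Googlebot
Allow: /
Disallow: /private/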
3.
It is a text file, or a set of instructions, that guides the crawler and tells web robots which pages on your website to crawl and which not to crawl.
It specifies the URLs that the search engine crawler can access on your website.
Whenever web robots or crawlers visit your website, the robots.txt file guides them on preferential crawling.
4.
It follows the Robots Exclusion Protocol, the protocol set for communication between websites and crawlers or web robots.
This standard was proposed by Martijn Koster, a Dutch software engineer, and it quickly gained acceptance as a de facto standard on the World Wide Web.
The robots.txt file is generally placed at the root of the website hierarchy. (Refer: Moz)
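For example, if a site is served at https://www.example.com (a placeholder domain used here only for illustration), crawlers will look for the file at https://www.example.com/robots.txt; a robots.txt file placed in a subdirectory is not consulted.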
5.
Well, without going too much into the technicalities, which may be difficult to follow at first, let's first understand the format of a robots.txt file.
The general format is
User-agent: *
Allow: /
The asterisk after User-agent denotes that the rule applies to all web robots visiting the website.
The slash after Allow tells the robots that they may visit any page on the website.
User-agent: [user-agent name]
Disallow: [URL that should not be crawled]
In this case, the specific URL that should not be crawled is listed.
User-agent: *
Disallow: /folder/
Here the entire /folder/ directory is excluded from crawling.
User-agent: *
Disallow: /file.html
Here only the single file /file.html is excluded from crawling.
8.
In this case, partial access is provided. Refer to the documentation on the robots.txt file at Google Search Central to know more.
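One common way to express such partial access (with /folder/ and /folder/public-page.html used as placeholder paths) is to disallow a directory while explicitly allowing one page inside it:
User-agent: *
Disallow: /folder/
Allow: /folder/public-page.html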
9.
How does a robots.txt file help SEO?
• If you have pages on your website that you don't want to be public, perhaps pages meant only for internal viewing by your employees or pages intended for specific merchants or users, robots.txt helps you exclude them from being indexed by Google (see the sketch after this list).
• If not all of the pages on your site are being indexed by Google, you already have a crawl budget problem. You can reduce this problem by limiting the number of pages to be crawled.
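As an illustration, with /internal/ and /merchant-portal/ as hypothetical paths, such pages can be kept out of crawling like this:
User-agent: *
Disallow: /internal/
Disallow: /merchant-portal/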
10.
Crawl Budget and Crawl Limit
Search engines have limited resources, and they have billions of web pages to crawl.
So, they assign a crawl budget to prioritize their crawling effort based on…
• Crawl Limit – the amount of crawling a website can afford, together with the webmaster's or website owner's preferences.
• Crawl Demand – the URLs that should be re-crawled, based on their popularity and freshness (arising from page updates).
• Refer: Content King App
12.
SOME RULES:
Your site can have only one robots.txt file.
The robots.txt file must be present at the root of the website host to which it applies.
Before crawling, the search bots download the robots.txt file and parse it to extract its rules.
Each rule in the robots.txt file must contain (see the example after this list):
A field: User-agent, Allow, Disallow, or Sitemap
A colon (:)
A value
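To sketch the anatomy of one such rule (the /drafts/ path is only a placeholder): in the line
Disallow: /drafts/
the field is Disallow, it is followed by a colon, and the value is the path /drafts/.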
13.
• User-agent: identifies which crawler the rules apply to.
• Allow: a URL path that may be crawled.
• Disallow: a URL path that may not be crawled.
• Sitemap: the complete URL of a sitemap.
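Putting all four fields together, a hedged example (Googlebot, the /private/ path, and the example.com sitemap URL are illustrative placeholders) might read:
User-agent: Googlebot
Disallow: /private/
Allow: /
Sitemap: https://www.example.com/sitemap.xml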