The robots.txt file is a text file that guides search engine crawlers on which pages of a website they are allowed to access and crawl. It specifies the URLs that crawlers can and cannot access, follows the Robots Exclusion Protocol standard, and is generally placed in the website's root directory. The robots.txt file helps with search engine optimization by allowing webmasters to exclude internal or private pages from being indexed, and by limiting crawling to make better use of crawl budget and crawl demand.
Introduces the robots.txt file, a UTF-8 encoded text file that guides crawlers on which pages to access. It specifies URLs for crawling and is a crucial part of website–crawler communication.
Explains the protocol that governs communication between websites and crawlers, and describes the standard format of robots.txt files and their placement in the website hierarchy.
Details the syntax for specifying which user agents can crawl which pages, highlights accessibility rules, and references Google Search Central for documentation.
Discusses how robots.txt helps SEO by excluding non-public pages from indexing and addressing crawl budget issues for better site optimization.
Lists important rules for robots.txt file creation, including having a single file, its placement, parsing requirements, and essential fields such as user-agent and sitemap.
Robots.txt file:
A text file (UTF-8 encoded).
User-agent: the crawler that has come to crawl the pages. It may be Googlebot or another search engine's bot.
Allow: the pages listed after this directive are allowed to be crawled.
Disallow: the pages listed after this directive are excluded from crawling.
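As a quick illustration of how these three directives appear together, a minimal robots.txt (with /private/ used purely as a placeholder path) could look like this:
User-agent: Googlebot
Allow: /
Disallow: /private/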
3.
It is a text file, or a set of instructions, that guides the crawler and tells web robots which pages on your website to crawl and which not to crawl.
It specifies the URLs that the search engine crawler can access on your website.
Whenever web robots or crawlers visit your website, the robots.txt file guides them on preferential crawling.
4.
It follows the Robots Exclusion Protocol, the protocol set for communication between websites and crawlers or web robots.
This standard was proposed by Martijn Koster, a Dutch software engineer, and it quickly gained acceptance as a de facto standard on the World Wide Web.
The robots.txt file is generally placed at the root of the website hierarchy. (Refer: Moz)
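For example, if a site is served at https://www.example.com (a placeholder domain used here only for illustration), crawlers will look for the file at https://www.example.com/robots.txt; a robots.txt file placed in a subdirectory is not consulted.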
5.
Well, without going too much into the technicalities, which may be difficult to follow at first, let's first understand the format of a robots.txt file.
The general format is
User-agent: *
Allow: /
The asterisk after User-agent denotes that the rule applies to all web robots visiting the website.
The slash after Allow tells the robots that they may visit any page on the website.
User-agent: [user-agent name]
Disallow: [URL that should not be crawled]
In this case, the specific URL that should not be crawled is listed.
User-agent: *
Disallow: /folder/
Here the entire /folder/ directory is excluded from crawling.
User-agent: *
Disallow: /file.html
Here only the single file /file.html is excluded from crawling.
8.
In this case, partial access is provided. Refer to the documentation on the robots.txt file at Google Search Central to know more.
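One common way to express such partial access (with /folder/ and /folder/public-page.html used as placeholder paths) is to disallow a directory while explicitly allowing one page inside it:
User-agent: *
Disallow: /folder/
Allow: /folder/public-page.html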
9.
How does a robots.txt file help SEO?
• If you have pages on your website that you don't want to be public, perhaps pages meant only for internal viewing by your employees or pages intended for specific merchants or users, robots.txt helps you exclude them from being indexed by Google (see the sketch after this list).
• If not all of the pages on your site are being indexed by Google, you already have a crawl budget problem. You can reduce this problem by limiting the number of pages to be crawled.
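As an illustration, with /internal/ and /merchant-portal/ as hypothetical paths, such pages can be kept out of crawling like this:
User-agent: *
Disallow: /internal/
Disallow: /merchant-portal/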
10.
Crawl Budget and Crawl Limit
Search engines have limited resources, and they have billions of web pages to crawl.
So, they assign a crawl budget to prioritize their crawling effort based on…
• Crawl Limit – the amount of crawling a website can afford, together with the webmaster's or website owner's preferences.
• Crawl Demand – the URLs that should be re-crawled, based on their popularity and freshness (arising from page updates).
• Refer: Content King App
12.
SOME RULES:
Your site can have only one robots.txt file.
The robots.txt file must be present at the root of the website host to which it applies.
Before crawling, the search bots download the robots.txt file and parse it to extract its rules.
Each rule in the robots.txt file must contain (see the example after this list):
A field: User-agent, Allow, Disallow, or Sitemap
A colon (:)
A value
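To sketch the anatomy of one such rule (the /drafts/ path is only a placeholder): in the line
Disallow: /drafts/
the field is Disallow, it is followed by a colon, and the value is the path /drafts/.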
13.
• User-agent: identifies which crawler the rules apply to.
• Allow: a URL path that may be crawled.
• Disallow: a URL path that may not be crawled.
• Sitemap: the complete URL of a sitemap.
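Putting all four fields together, a hedged example (Googlebot, the /private/ path, and the example.com sitemap URL are illustrative placeholders) might read:
User-agent: Googlebot
Disallow: /private/
Allow: /
Sitemap: https://www.example.com/sitemap.xml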