@patrickstox @ahrefs #pubcon
How Search Works
Presented by:
Patrick Stox
@patrickstox @ahrefs #pubcon
Product Advisor, Technical SEO, &
Brand Ambassador at
• I write for Ahrefs blog but have written for many industry
publications in the past
• I speak at some conferences like SMX, Pubcon, UnGagged, DMO
Advanced, TechSEO Boost, BrightonSEO
• Organizer for the Raleigh SEO Meetup (most successful in US) and
the Beer & SEO Meetup
• We also run a conference, the Raleigh SEO Conference
• Founder Technical SEO Slack Group
• Moderator /r/TechSEO on Reddit
• Helped define the role of Search Marketing Strategist for the US
Department of Labor
• Lead author for the SEO Chapter of the 2021 Web Almanac, reviewer
for the 2022 Chapter
• Technical Review Editor for The Art of SEO 4th Edition
Who is Patrick Stox?
@patrickstox @ahrefs #pubcon
Disclaimer
This is my understanding of systems and is based on a lot of public statements
from Google and my own knowledge.
Warning: It’s not going to be 100% complete or accurate.
@patrickstox @ahrefs #pubcon
How Many Domains Exist?
Q3 2022 according to Verisign: 349.9 million registered
January 2023 according to Netcraft: 270.9 million unique domains responded
Ahrefs: 213.1 million (after removing spam domains)
@patrickstox @ahrefs #pubcon
How Many Pages?
Google in 2016: 130T known
@patrickstox @ahrefs #pubcon
How Big Is The Index?
Google: hundreds of billions of pages indexed
100 PB in size
Ahrefs: ~380B pages
@patrickstox @ahrefs #pubcon
A Fraction Of The Web Is Useful Content
Rough math:
(400B / 130T) * 100 = 0.3%
@patrickstox @ahrefs #pubcon
https://twitter.com/lilyraynyc/status/1509176261884747781
@patrickstox @ahrefs #pubcon
Spam
Google 2021: “every day, we discover 40 billion spammy pages”
That’s 14.6T spam pages a year.
@patrickstox @ahrefs #pubcon
Googlebot
Googlebot is a lot of systems (1000+) and there are multiple Googlebots.
• Googlebot Image
• Googlebot News
• Googlebot Video
• Googlebot Desktop
• Googlebot Mobile
• +Ads and more
https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers
@patrickstox @ahrefs #pubcon
Googlebot Is A Protocol Buffer
It stores structured data.
Similar to JSON, but smaller and faster.
@patrickstox @ahrefs #pubcon
Googlebot Rendering Pipeline (Simplified)
@patrickstox @ahrefs #pubcon
URL Sources
• Links on pages, or anything that even looks like a link
• Sitemaps
• Request indexing in GSC
• Indexing API (limited use cases)
• RSS Feeds
• WebSub (formerly PubSubHubbub)
@patrickstox @ahrefs #pubcon
Crawler Queue / Scheduler
Determines what URLs to crawl and when.
2 main purposes:
• Discovery
• Refresh
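To make the scheduler idea concrete, here's a minimal sketch (my own illustration with made-up numbers, not Google's actual system) of a priority queue that mixes discovery and refresh:

import heapq, time

def crawl_priority(pagerank, last_crawled=None, change_rate=0.0):
    # Toy priority: never-seen URLs (discovery) and stale, frequently-changing,
    # high-PageRank URLs (refresh) float to the top of the heap.
    if last_crawled is None:                      # discovery: never crawled before
        staleness = 1.0
    else:                                         # refresh: how stale is our copy likely to be?
        staleness = min(1.0, (time.time() - last_crawled) / 86400 * change_rate)
    return -(pagerank + staleness)                # heapq is a min-heap, so negate

queue = []
heapq.heappush(queue, (crawl_priority(0.9, time.time() - 7 * 86400, 0.5), "https://example.com/"))
heapq.heappush(queue, (crawl_priority(0.1), "https://example.com/brand-new-page"))
while queue:
    _, url = heapq.heappop(queue)
    print("crawl:", url)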
@patrickstox @ahrefs #pubcon
What SEOs Call Crawl Budget, Google Calls:
• Crawl demand – how much Google wants to crawl your site.
• Crawl rate limit – how much crawling your website can support.
@patrickstox @ahrefs #pubcon
What Counts Against Your Crawl Budget?
All URLs and requests including:
• Pages/files
• Alternate URLs like AMP or m-dot pages, hreflang
• CSS
• JavaScript, including XHR requests
• Embedded content
***All Googlebots share the same crawl budget, including the ones for Ads,
images, etc.
@patrickstox @ahrefs #pubcon
Crawl Demand Factors
• PageRank
• How often pages change (freshness/staleness)
• When it was last crawled
• Any major changes
@patrickstox @ahrefs #pubcon
Crawl Rate Factors
• Stability / crawl health
• Slow responses
• Errors – 5xx (server errors) or 429 (too many requests) HTTP status codes.
They don’t want to crash the sites and the crawlers will generally back down if
they start seeing issues.
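As a rough sketch of that back-off behavior (my assumption of the general pattern, using the third-party requests library; not Google's code):

import time
import requests  # assumed to be installed; any HTTP client works the same way

def polite_fetch(url, delay=1.0, max_delay=300.0):
    # Fetch a URL and adjust the per-host delay based on how the server responds.
    resp = requests.get(url, timeout=30)
    if resp.status_code == 429 or resp.status_code >= 500:
        delay = min(delay * 2, max_delay)    # errors: back way off
    elif resp.elapsed.total_seconds() > 5:
        delay = min(delay * 1.5, max_delay)  # slow responses: crawl more gently
    else:
        delay = max(delay * 0.9, 0.1)        # healthy host: slowly speed back up
    time.sleep(delay)
    return resp, delay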
@patrickstox @ahrefs #pubcon
Crawl Rate Settings GSC
@patrickstox @ahrefs #pubcon
Crawling
The little spider is named Crawley.
@patrickstox @ahrefs #pubcon
Crawling
Mostly from Mountain View, CA, USA.
Every request needs to respect robots.txt.
15MB max HTML size.
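For illustration, a crawler honoring robots.txt and a 15 MB HTML cap might look like this sketch using Python's standard library (the example.com URLs are placeholders):

from urllib import robotparser, request

MAX_HTML_BYTES = 15 * 1024 * 1024  # only the first 15 MB of HTML gets processed

rp = robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

url = "https://example.com/some-page"
if rp.can_fetch("Googlebot", url):
    with request.urlopen(url, timeout=30) as resp:
        html = resp.read(MAX_HTML_BYTES)   # anything past the cap is ignored
else:
    print("Blocked by robots.txt, so the page is not fetched")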
@patrickstox @ahrefs #pubcon
Google Doesn’t Navigate Like Users
Sends requests for the files individually, doesn’t navigate between pages like a
user.
@patrickstox @ahrefs #pubcon
Caching Files
They crawl more than HTML:
• Pages and other file types
• JavaScript
• CSS
@patrickstox @ahrefs #pubcon
Caching Files
Files are stored for use in rendering.
Google will ignore your cache timings and fetch a new copy when they want to.
(Diagram: crawled HTML, JS, and CSS files are stored in a cache.)
@patrickstox @ahrefs #pubcon
Processing – We’ll Cover This Shortly
@patrickstox @ahrefs #pubcon
Web Rendering Service (WRS)
Needed to process JavaScript
Evergreen (up-to-date) Googlebot
Headless (no Graphical User Interface)
@patrickstox @ahrefs #pubcon
Web Rendering Service (WRS)
• Stateless (storage and cookies cleared between loads)
• Denies Permissions
• Flattens light DOM and shadow DOM
• Date / Time functions adjusted
• Service workers rejected
• Animations may differ
• Random may not be random
@patrickstox @ahrefs #pubcon
Myth: 5 Second Limit
I think this started with a test from Max Prin that measured when the testing tools
took a screenshot. Google needs reasonable time limits for its testing tools.
https://maxprin.com/tests/js-timer/
@patrickstox @ahrefs #pubcon
No 5 Second Limit
They’ll try to wait for pages to finish, something like networkidle0 (no more
activity).
Eventually cuts off in case something gets stuck or someone is trying to mine
bitcoin.
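To illustrate waiting for network idle instead of a fixed timer, here's a sketch using Playwright as a stand-in headless browser (Google's WRS is its own Chromium-based system, so this is only an analogy):

from playwright.sync_api import sync_playwright  # stand-in headless browser, not WRS

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)    # headless: no GUI
    page = browser.new_page()
    # Wait for network activity to settle instead of a fixed 5-second timer,
    # but still cap the wait so a stuck page can't hang the renderer forever.
    page.goto("https://example.com", wait_until="networkidle", timeout=60_000)
    rendered_html = page.content()                # the post-JS DOM
    browser.close()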
@patrickstox @ahrefs #pubcon
It Doesn’t Even Make Sense
They’re basically loading a page with everything cached already.
(Diagram: the WRS renders the page from already-cached HTML, JS, and CSS files.)
@patrickstox @ahrefs #pubcon
This System Causes Other Issues
Impossible states – previous file versions used when rendering.
File versioning/fingerprinting should help.
XHR requests are done in real time.
@patrickstox @ahrefs #pubcon
Myth: Weeks To Render
All pages go through the renderer.
The average wait time is 5 seconds according to Google’s Martin Splitt.
The 90th percentile is only minutes, not weeks.
Probably comes from pages not being prioritized
for crawling.
@patrickstox @ahrefs #pubcon
Rendering At Web Scale
The 8th wonder of the world.
@patrickstox @ahrefs #pubcon
They Use Some Hacks
“In Google search we don’t really care about the pixels because we don’t
really want to show it to someone. We want to process the information
and the semantic information so we need something in the intermediate
state. We don’t have to actually paint the pixels.” – Martin Splitt
@patrickstox @ahrefs #pubcon
What That Looks Like
Gray = downloads
Blue = HTML
Yellow = JavaScript
Purple = Layout
Green = Painting
@patrickstox @ahrefs #pubcon
They Won’t Render Noindexed Pages
<meta name="robots" content="noindex">
<meta name="robots" content="none">
None = noindex, nofollow
@patrickstox @ahrefs #pubcon
They’re Not Taking Actions
They don’t scroll.
They generally don’t click.
@patrickstox @ahrefs #pubcon
(Image: mobile vs. desktop comparison.)
@patrickstox @ahrefs #pubcon
They Don’t Click
Load content into the Document Object Model (DOM) by default. They won’t see
content that requires a click to fire an XHR request and pull it in.
DOM Tree and CSS Object Model (CSSOM) form the Render Tree. That’s what
gets indexed.
@patrickstox @ahrefs #pubcon
DOM Tree (pictured)
CSSOM (not pictured) would add info
like font size, weight, color, etc. to
each element.
Render Tree
@patrickstox @ahrefs #pubcon
Collapser
• Error handling
• Retries
• Soft 404s
@patrickstox @ahrefs #pubcon
Processing – Now We’ll Talk About It
@patrickstox @ahrefs #pubcon
Processing - Duplicates
Duplicate detection - content hashes or checksum
They’ll remove boilerplate content (nav, footer) for the checksum.
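A toy version of the checksum idea (my own sketch with a crude boilerplate-stripping heuristic; Google's dedup system is far more sophisticated):

import hashlib
import re

def content_checksum(html):
    # Drop nav/footer/script/style blocks before hashing (a very rough heuristic).
    main = re.sub(r"<(nav|footer|script|style)[^>]*>.*?</\1>", " ", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", main)           # strip remaining tags
    text = " ".join(text.split()).lower()          # normalize whitespace and case
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

a = "<html><nav>menu</nav><p>Same article text.</p><footer>(c) 2023</footer></html>"
b = "<html><nav>other menu</nav><p>Same   article text.</p></html>"
print(content_checksum(a) == content_checksum(b))  # True – treated as duplicates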
@patrickstox @ahrefs #pubcon
Near Duplicates
@patrickstox @ahrefs #pubcon
Processing – Duplicate Elimination
Canonicalization
@patrickstox @ahrefs #pubcon
~20 Canonicalization Signals
• Duplicates
• Redirects (high weight)
• Canonical link elements - multiple will be ignored
• Sitemap URLs
• Links (Internal/External, PageRank)
• Alternates – mobile, AMP, print, Hreflang
• HTTPS pages over HTTP
• Shorter URLs over longer URLs
• Where content was first published / seen
• Site level signals like a history of scraped content
• Pages over PDFs
Machine learning system
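Purely to show the shape of "many weighted signals feed one decision" (the weights below are invented for illustration; the real system is a trained ML model):

# Invented weights – the real system is a trained model, not hand-tuned numbers.
WEIGHTS = {"redirect_target": 3.0, "rel_canonical": 2.0, "in_sitemap": 1.0,
           "https": 0.5, "shorter_url": 0.5, "internal_links": 0.01}

def canonical_score(signals):
    return sum(WEIGHTS[name] * value for name, value in signals.items())

candidates = {
    "http://example.com/page?ref=abc": {"internal_links": 2},
    "https://example.com/page": {"rel_canonical": 1, "in_sitemap": 1, "https": 1,
                                 "shorter_url": 1, "internal_links": 40},
}
canonical = max(candidates, key=lambda url: canonical_score(candidates[url]))
print("canonical:", canonical)  # https://example.com/page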
@patrickstox @ahrefs #pubcon
301 = Permanent, 302 = Temporary
The same holds true for other permanent and temporary redirect types.
@patrickstox @ahrefs #pubcon
Warning! Speculation
@patrickstox @ahrefs #pubcon
Processing – Link Parser
Good:
<a> tag with an href attribute.
<a href="/page">simple is good</a>
<a href="/page" onclick="goTo('page')">still okay</a>
@patrickstox @ahrefs #pubcon
Processing – Link Parser
Bad (but may be parsed):
<a routerLink="products/category">no href</a>
<a onclick=”goTo(‘page’)”>no href</a>
<a href=”javascript:goTo(‘page’)”>kind of nested</a>
<a href=”javascript:void(0)”>missing link</a>
<span onclick=”goTo(‘page’)”>not the right HTML element or href</span>
<span href=“page">not the right HTML element</span>
<option value="page">not the right HTML element</option>
<a href=”#”>no link</a>
Button, ng-click, there are many more ways this can be done incorrectly.
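A minimal sketch of why the href-less patterns fail: a simple parser (not Google's) only collects links where an <a> tag actually carries a usable href:

from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    # Collects href values from <a> tags only – the "good" patterns above.
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # onclick-only anchors, spans, buttons, etc. never reach this branch
            if href and not href.startswith(("javascript:", "#")):
                self.links.append(href)

parser = LinkExtractor()
parser.feed('<a href="/page">good</a> <span onclick="goTo(\'page\')">ignored</span> '
            '<a href="javascript:void(0)">ignored</a>')
print(parser.links)  # ['/page']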
@patrickstox @ahrefs #pubcon
Processing – Link Parser
• Link location, where it goes
• Anchor text
• Surrounding text
• …
@patrickstox @ahrefs #pubcon
Link Tagging
• Penguin
• Location on page (footer, main content)
• Disavow
• …
@patrickstox @ahrefs #pubcon
Processing – Content Parser
• Content – tokenized, vectorized. Words become numbers.
• Content language
• Content location
• Extract meta tags
• Extract Schema
• HTML Lexer – normalize the HTML
• Topic analysis. Content on other topics may be weighted less in ranking.
• Semantic analysis. Linguistic, knowledge graph, address extraction
• …
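"Words become numbers" in the simplest possible form – a bag-of-words sketch (real systems use learned embeddings, not raw counts):

import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

docs = ["How search works", "How a search engine crawls and indexes the web"]
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
token_ids = {tok: i for i, tok in enumerate(vocab)}   # every word gets a number
print(token_ids)

for doc in docs:
    counts = Counter(tokenize(doc))
    vector = [counts.get(tok, 0) for tok in vocab]    # one count per vocabulary word
    print(vector)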
@patrickstox @ahrefs #pubcon
Content Tagging
• YMYL
• Adult / safe search
• Mobile-friendly
• …
@patrickstox @ahrefs #pubcon
Signal Collectors
• PageRank
• Spam
• Page Experience
• Freshness
• …
@patrickstox @ahrefs #pubcon
A Lot More Happens In Processing, Like:
• Drop anything after # in URLs (with some exceptions).
• Most restrictive directives win: index + noindex + index = noindex (see the sketch below).
• They’ll drop low-quality content.
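The most-restrictive-directive rule as a sketch:

def effective_robots_directive(directives):
    # If any source says noindex (or none), that wins: index + noindex + index = noindex.
    return "noindex" if "noindex" in directives or "none" in directives else "index"

print(effective_robots_directive(["index", "noindex", "index"]))  # noindex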
@patrickstox @ahrefs #pubcon
Other Files May Be Processed Differently
• Adobe Portable Document Format (.pdf)
• Adobe PostScript (.ps)
• Google Earth (.kml, .kmz)
• GPS eXchange Format (.gpx)
• Hancom Hanword (.hwp)
• HTML (.htm, .html, other file extensions)
• Lotus
• Microsoft Excel (.xls, .xlsx)
• Microsoft PowerPoint (.ppt, .pptx)
• Microsoft Word (.doc, .docx)
• OpenOffice presentation (.odp)
• OpenOffice spreadsheet (.ods)
• OpenOffice text (.odt)
• Rich Text Format (.rtf)
• Scalable Vector Graphics (.svg)
• TeX/LaTeX (.tex)
• Text (.txt, .text, other file extensions), including source code in common programming languages:
  • Basic source code (.bas)
  • C/C++ source code (.c, .cc, .cpp, .cxx, .h, .hpp)
  • C# source code (.cs)
  • Java source code (.java)
  • Perl source code (.pl)
  • Python source code (.py)
• Wireless Markup Language (.wml, .wap)
• XML (.xml)
@patrickstox @ahrefs #pubcon
Image Processing
• Text around the image
• Content of the image. They tag what is in the image. Not super reliable.
• Alt attribute
• Image name (minimal weight)
• Webpage title and description
Photo from a Gary Illyes Presentation
at Pubcon.
@patrickstox @ahrefs #pubcon
Robots.txt for Images
Blocking Googlebot Image from crawling means that your images will not
be indexed.
@patrickstox @ahrefs #pubcon
Video Processing
• OCR to get text
• Objects identified from visuals
• Speech converted to text
• Structured data
• Text and other signals from the page, URL, title, description
@patrickstox @ahrefs #pubcon
PDFs
• PDFs are converted and indexed as HTML
• OCR to get text
• Images get indexed
• Links get picked up
• Title
• File name
• …
@patrickstox @ahrefs #pubcon
Google Index
Named Caffeine
@patrickstox @ahrefs #pubcon
Data Infrastructure
Many data centers around the world.
Each has a copy of the index.
Millions of servers and hard drives.
Index is an inverted index.
Maps things like words to documents.
Index shards are split into words and phrases.
Other shards for metadata.
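An inverted index in miniature (a sketch; the real index is sharded across data centers and split into tiers):

from collections import defaultdict

docs = {
    1: "how search works",
    2: "how googlebot crawls the web",
    3: "search engine indexing explained",
}

inverted_index = defaultdict(set)        # term -> set of doc IDs (the posting list)
for doc_id, text in docs.items():
    for term in text.split():
        inverted_index[term].add(doc_id)

print(sorted(inverted_index["search"]))  # [1, 3]
print(sorted(inverted_index["how"]))     # [1, 2]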
@patrickstox @ahrefs #pubcon
Indexing Tiers – Based On Doc Popularity
• RAM (fastest)
• SSD (fast)
• Hard drives (slowest)
@patrickstox @ahrefs #pubcon
Mobile Version Is Indexed (Mostly)
Some sites may remain on desktop-only indexing.
These are sites that don’t work on mobile.
@patrickstox @ahrefs #pubcon
Life Of A Query
@patrickstox @ahrefs #pubcon
Fun Fact
15% of queries have never been seen before
@patrickstox @ahrefs #pubcon
Start Typing - Autocomplete
Powered by real search data and patterns across the web, plus:
• The language of the query
• The location a query is coming from
• Trending interest in a query
• Your past searches
Probably reduces misspellings
@patrickstox @ahrefs #pubcon
Query parsing and understanding
BERT (DeepRank) – combinations of words express different
meanings and intents. They won’t drop important words from
the queries.
Neural matching – words to searches.
“For example, neural matching helps Google understand that a
search for “why does my TV look strange” is related to the
concept of “the soap opera effect.” We can then return pages
about the soap opera effect, even if the exact words aren’t used.”
@patrickstox @ahrefs #pubcon
Misspelling
1 in 10 searches is misspelled
@patrickstox @ahrefs #pubcon
Google Training Misspelling Example
Over 600 ways people misspelled Britney Spears.
http://archive.google.com/jobs/britney.html
@patrickstox @ahrefs #pubcon
Spelling Old Vs New
Old way:
How often terms were searched + the probability of typos from neighboring keys
New way:
Deep neural net with 680M parameters
@patrickstox @ahrefs #pubcon
Query Expansion
When the query is sent, it’s going to also pull pages with terms that include:
• Synonyms
• Antonyms
• Acronyms
• Plural/singular
• Stemming – root words
• Diacritical expansion – versions with and without accent characters
These will mostly get lower weights in scoring than the main term used.
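A toy version of "expanded terms score lower than the original term" (the expansion table and weights are invented for illustration):

# Invented expansion table and weights – just to show the shape of the idea.
EXPANSIONS = {"tv": ["television"], "strange": ["weird", "odd"]}

def expand_query(terms):
    weighted = [(term, 1.0) for term in terms]    # original terms get full weight
    for term in terms:
        weighted += [(alt, 0.5) for alt in EXPANSIONS.get(term, [])]  # expansions score lower
    return weighted

print(expand_query(["tv", "looks", "strange"]))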
@patrickstox @ahrefs #pubcon
Concepts & Entities
People, places, things
“RankBrain helps Google better relate pages to concepts – This
means Google can better return relevant pages even if they
don’t contain the exact words used in a search, by
understanding the page is related to other words and
concepts.”
@patrickstox @ahrefs #pubcon
Speculation
All the query expansion things may not
be necessary anymore. They may just
pull close terms in vector space.
@patrickstox @ahrefs #pubcon
Stop Words
The, is, and, of, a, are, an, if, etc.
Removed for some queries.
Used for other queries, like when it matches a concept.
@patrickstox @ahrefs #pubcon
Segmenter
Splits up strings (for languages written without spaces between words).
'上海浦东开发与建设同步' → ['上海', '浦东', '开发', '与', '建设', '同步']
@patrickstox @ahrefs #pubcon
Retrieval – Posting List
Remember that inverted index?
Map of terms to pages that contain those terms. Get all those.
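Retrieval against a toy posting list: look up each query term and combine the document sets (a sketch, not the production system):

# A tiny hard-coded posting list, like the inverted index sketch earlier.
index = {"search": {1, 3}, "how": {1, 2}, "works": {1}}

def retrieve(query_terms):
    postings = [index.get(term, set()) for term in query_terms]
    return set.union(*postings) if postings else set()   # candidate docs to score next

print(sorted(retrieve(["how", "search"])))  # [1, 2, 3]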
@patrickstox @ahrefs #pubcon
Sum Of The Total Pages From All Shards
@patrickstox @ahrefs #pubcon
Popular Queries Are Cached
@patrickstox @ahrefs #pubcon
Make A Smaller List - Ranking
Google is going to cut all those results down to the top 1000 by ranking them.
@patrickstox @ahrefs #pubcon
Ranking / Scoring – Query Dependent
Feature of a page & query
• Keyword hits
• All those other versions from the query expansion like synonyms
• Proximity
• Content relevance, topicality
• …
@patrickstox @ahrefs #pubcon
Ranking / Scoring – Query Independent
Feature of a page
• PageRank, site queries, mentions,
& other E-E-A-T signals
• Language
• Mobile-friendliness
• Page experience
• …
Numbers multiplied by other numbers in the scoring
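"Numbers multiplied by other numbers" in sketch form – a query-dependent relevance score scaled by query-independent multipliers (all values invented for illustration):

def final_score(relevance, page_signals):
    # relevance: query-dependent score; page_signals: query-independent multipliers.
    score = relevance
    for multiplier in page_signals.values():
        score *= multiplier
    return score

print(final_score(relevance=2.4,
                  page_signals={"pagerank": 1.3, "mobile_friendly": 1.0,
                                "page_experience": 1.05}))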
@patrickstox @ahrefs #pubcon
They’re Like Nah, We Can Do Better
@patrickstox @ahrefs #pubcon
Reranking / Post-Retrieval Adjustments
Works with the smaller set of results (the top 1,000).
With the smaller number, they can run more intelligent but resource intensive
systems to re-order the results.
@patrickstox @ahrefs #pubcon
RankBrain & BERT - Again
“Based on its complex language understanding, BERT can very quickly rank
documents for relevance.”
Depending on the search, Google’s algorithm can use either RankBrain, BERT,
or both.
@patrickstox @ahrefs #pubcon
Host Clustering
Limits the results you see from the same domain.
Add &filter=0 to your search URL to see unfiltered results.
@patrickstox @ahrefs #pubcon
Hreflang
Tries to swap to the most relevant country/language version of a page.
@patrickstox @ahrefs #pubcon
DMCA, Privacy Removals, URL Removal Tool
@patrickstox @ahrefs #pubcon
Spelling Corrections
@patrickstox @ahrefs #pubcon
Trending Topics Are Promoted
@patrickstox @ahrefs #pubcon
Spam
Spam demotions
Manual actions
@patrickstox @ahrefs #pubcon
Query Other Systems - Universal results
News, Maps, Images, Videos, etc.
Results are bidding for their position
@patrickstox @ahrefs #pubcon