@patrickstox @ahrefs #pubcon
How Search Works
Presented by:
Patrick Stox
@patrickstox @ahrefs #pubcon
Product Advisor, Technical SEO, &
Brand Ambassador at
• I write for Ahrefs blog but have written for many industry
publications in the past
• I speak at some conferences like SMX, Pubcon, UnGagged, DMO
Advanced, TechSEO Boost, BrightonSEO
• Organizer for the Raleigh SEO Meetup (most successful in US) and
the Beer & SEO Meetup
• We also run a conference, the Raleigh SEO Conference
• Founder Technical SEO Slack Group
• Moderator /r/TechSEO on Reddit
• Helped define the role of Search Marketing Strategist for the US
Department of Labor
• Lead author for the SEO Chapter of the 2021 Web Almanac, reviewer
for the 2022 Chapter
• Technical Review Editor for The Art of SEO 4th Edition
Who is Patrick Stox?
@patrickstox @ahrefs #pubcon
Disclaimer
This is my understanding of systems and is based on a lot of public statements
from Google and my own knowledge.
Warning: It’s not going to be 100% complete or accurate.
@patrickstox @ahrefs #pubcon
How Many Domains Exist?
Q3 2022 according to Verisign: 349.9 million registered
January 2023 according to Netcraft: 270.9 million unique domains responded
Ahrefs: 213.1 million (after removing spam domains)
@patrickstox @ahrefs #pubcon
How Many Pages?
Google in 2016: 130T known
@patrickstox @ahrefs #pubcon
How Big Is The Index?
Google: hundreds of billions of pages indexed
100 PB in size
Ahrefs: ~380B pages
@patrickstox @ahrefs #pubcon
A Fraction Of The Web Is Useful Content
Rough math:
(400B / 130T) * 100 = 0.3%
@patrickstox @ahrefs #pubcon
https://twitter.com/lilyraynyc/status/1509176261884747781
@patrickstox @ahrefs #pubcon
Spam
Google 2021: “every day, we discover 40 billion spammy pages”
That’s 14.6T spam pages a year.
@patrickstox @ahrefs #pubcon
Googlebot
Googlebot is a lot of systems (1000+) and there are multiple Googlebots.
• Googlebot Image
• Googlebot News
• Googlebot Video
• Googlebot Desktop
• Googlebot Mobile
• +Ads and more
https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers
@patrickstox @ahrefs #pubcon
Googlebot Is A Protocol Buffer
It stores structured data.
Similar to JSON, but smaller and faster.
@patrickstox @ahrefs #pubcon
Googlebot Rendering Pipeline (Simplified)
@patrickstox @ahrefs #pubcon
URL Sources
• Links on pages, or anything that even looks like a link
• Sitemaps
• Request indexing in GSC
• Indexing API (limited use cases)
• RSS Feeds
• WebSub (formerly PubSubHubbub)
@patrickstox @ahrefs #pubcon
Crawler Queue / Scheduler
Determines what URLs to crawl and when.
2 main purposes:
• Discovery
• Refresh
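To make the scheduler idea concrete, here's a minimal sketch (my own illustration with made-up numbers, not Google's actual system) of a priority queue that mixes discovery and refresh:

import heapq, time

def crawl_priority(pagerank, last_crawled=None, change_rate=0.0):
    # Toy priority: never-seen URLs (discovery) and stale, frequently-changing,
    # high-PageRank URLs (refresh) float to the top of the heap.
    if last_crawled is None:                      # discovery: never crawled before
        staleness = 1.0
    else:                                         # refresh: how stale is our copy likely to be?
        staleness = min(1.0, (time.time() - last_crawled) / 86400 * change_rate)
    return -(pagerank + staleness)                # heapq is a min-heap, so negate

queue = []
heapq.heappush(queue, (crawl_priority(0.9, time.time() - 7 * 86400, 0.5), "https://example.com/"))
heapq.heappush(queue, (crawl_priority(0.1), "https://example.com/brand-new-page"))
while queue:
    _, url = heapq.heappop(queue)
    print("crawl:", url)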
@patrickstox @ahrefs #pubcon
What SEOs Call Crawl Budget, Google Calls:
• Crawl demand – how much Google wants to crawl your site.
• Crawl rate limit – how much crawling your website can support.
@patrickstox @ahrefs #pubcon
What Counts Against Your Crawl Budget?
All URLs and requests including:
• Pages/files
• Alternate URLs like AMP or m-dot pages, hreflang
• CSS
• JavaScript, including XHR requests
• Embedded content
***All Googlebots share the same crawl budget, including the ones for Ads,
images, etc.
@patrickstox @ahrefs #pubcon
Crawl Demand Factors
• PageRank
• How often pages change (freshness/staleness)
• When it was last crawled
• Any major changes
@patrickstox @ahrefs #pubcon
Crawl Rate Factors
• Stability / crawl health
• Slow responses
• Errors – 5xx (server errors) or 429 (too many requests) HTTP status codes.
They don’t want to crash the sites and the crawlers will generally back down if
they start seeing issues.
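As a rough sketch of that back-off behavior (my assumption of the general pattern, using the third-party requests library; not Google's code):

import time
import requests  # assumed to be installed; any HTTP client works the same way

def polite_fetch(url, delay=1.0, max_delay=300.0):
    # Fetch a URL and adjust the per-host delay based on how the server responds.
    resp = requests.get(url, timeout=30)
    if resp.status_code == 429 or resp.status_code >= 500:
        delay = min(delay * 2, max_delay)    # errors: back way off
    elif resp.elapsed.total_seconds() > 5:
        delay = min(delay * 1.5, max_delay)  # slow responses: crawl more gently
    else:
        delay = max(delay * 0.9, 0.1)        # healthy host: slowly speed back up
    time.sleep(delay)
    return resp, delay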
@patrickstox @ahrefs #pubcon
Crawl Rate Settings GSC
@patrickstox @ahrefs #pubcon
Crawling
The little spider is named Crawley.
@patrickstox @ahrefs #pubcon
Crawling
Mostly from Mountain View, CA, USA.
Every request needs to respect robots.txt.
15MB max HTML size.
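For illustration, a crawler honoring robots.txt and a 15 MB HTML cap might look like this sketch using Python's standard library (the example.com URLs are placeholders):

from urllib import robotparser, request

MAX_HTML_BYTES = 15 * 1024 * 1024  # only the first 15 MB of HTML gets processed

rp = robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

url = "https://example.com/some-page"
if rp.can_fetch("Googlebot", url):
    with request.urlopen(url, timeout=30) as resp:
        html = resp.read(MAX_HTML_BYTES)   # anything past the cap is ignored
else:
    print("Blocked by robots.txt, so the page is not fetched")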
@patrickstox @ahrefs #pubcon
Google Doesn’t Navigate Like Users
Sends requests for the files individually, doesn’t navigate between pages like a
user.
@patrickstox @ahrefs #pubcon
Caching Files
They crawl more than HTML:
• Pages and other file types
• JavaScript
• CSS
@patrickstox @ahrefs #pubcon
Caching Files
Files are stored for use in rendering.
Google will ignore your cache timings and fetch a new copy when they want to.
(Diagram: crawled HTML, JS, and CSS files are stored in a cache.)
@patrickstox @ahrefs #pubcon
Processing – We’ll Cover This Shortly
@patrickstox @ahrefs #pubcon
Web Rendering Service (WRS)
Needed to process JavaScript
Evergreen (up-to-date) Googlebot
Headless (no Graphical User Interface)
@patrickstox @ahrefs #pubcon
Web Rendering Service (WRS)
• Stateless (storage and cookies cleared between loads)
• Denies Permissions
• Flattens light DOM and shadow DOM
• Date / Time functions adjusted
• Service workers rejected
• Animations may differ
• Random may not be random
@patrickstox @ahrefs #pubcon
Myth: 5 Second Limit
I think this started with a test from Max Prin that measured when the testing tools
took a screenshot. Google needs reasonable time limits for its testing tools.
https://maxprin.com/tests/js-timer/
@patrickstox @ahrefs #pubcon
No 5 Second Limit
They’ll try to wait for pages to finish, something like networkidle0 (no more
activity).
Eventually cuts off in case something gets stuck or someone is trying to mine
bitcoin.
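To illustrate waiting for network idle instead of a fixed timer, here's a sketch using Playwright as a stand-in headless browser (Google's WRS is its own Chromium-based system, so this is only an analogy):

from playwright.sync_api import sync_playwright  # stand-in headless browser, not WRS

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)    # headless: no GUI
    page = browser.new_page()
    # Wait for network activity to settle instead of a fixed 5-second timer,
    # but still cap the wait so a stuck page can't hang the renderer forever.
    page.goto("https://example.com", wait_until="networkidle", timeout=60_000)
    rendered_html = page.content()                # the post-JS DOM
    browser.close()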
@patrickstox @ahrefs #pubcon
It Doesn’t Even Make Sense
They’re basically loading a page with everything cached already.
(Diagram: the WRS renders the page from already-cached HTML, JS, and CSS files.)
@patrickstox @ahrefs #pubcon
This System Causes Other Issues
Impossible states – previous file versions used when rendering.
File versioning/fingerprinting should help.
XHR requests are done in real time.
@patrickstox @ahrefs #pubcon
Myth: Weeks To Render
All pages go through the renderer.
The average wait time is 5 seconds according to Google’s Martin Splitt.
The 90th percentile is only minutes, not weeks.
Probably comes from pages not being prioritized
for crawling.
@patrickstox @ahrefs #pubcon
Rendering At Web Scale
The 8th wonder of the world.
@patrickstox @ahrefs #pubcon
They Use Some Hacks
“In Google search we don’t really care about the pixels because we don’t
really want to show it to someone. We want to process the information
and the semantic information so we need something in the intermediate
state. We don’t have to actually paint the pixels.” – Martin Splitt
@patrickstox @ahrefs #pubcon
What That Looks Like
Gray = downloads
Blue = HTML
Yellow = JavaScript
Purple = Layout
Green = Painting
@patrickstox @ahrefs #pubcon
They Won’t Render Noindexed Pages
<meta name="robots" content="noindex">
<meta name="robots" content="none">
None = noindex, nofollow
@patrickstox @ahrefs #pubcon
They’re Not Taking Actions
They don’t scroll.
They generally don’t click.
@patrickstox @ahrefs #pubcon
(Image: mobile vs. desktop comparison.)
@patrickstox @ahrefs #pubcon
They Don’t Click
Load content into the Document Object Model (DOM) by default. They won’t see
content that requires a click to fire an XHR request and pull it in.
DOM Tree and CSS Object Model (CSSOM) form the Render Tree. That’s what
gets indexed.
@patrickstox @ahrefs #pubcon
DOM Tree (pictured)
CSSOM (not pictured) would add info
like font size, weight, color, etc. to
each element.
Render Tree
@patrickstox @ahrefs #pubcon
Collapser
• Error handling
• Retries
• Soft 404s
@patrickstox @ahrefs #pubcon
Processing – Now We’ll Talk About It
@patrickstox @ahrefs #pubcon
Processing - Duplicates
Duplicate detection - content hashes or checksum
They’ll remove boilerplate content (nav, footer) for the checksum.
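A toy version of the checksum idea (my own sketch with a crude boilerplate-stripping heuristic; Google's dedup system is far more sophisticated):

import hashlib
import re

def content_checksum(html):
    # Drop nav/footer/script/style blocks before hashing (a very rough heuristic).
    main = re.sub(r"<(nav|footer|script|style)[^>]*>.*?</\1>", " ", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", main)           # strip remaining tags
    text = " ".join(text.split()).lower()          # normalize whitespace and case
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

a = "<html><nav>menu</nav><p>Same article text.</p><footer>(c) 2023</footer></html>"
b = "<html><nav>other menu</nav><p>Same   article text.</p></html>"
print(content_checksum(a) == content_checksum(b))  # True – treated as duplicates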
@patrickstox @ahrefs #pubcon
Near Duplicates
@patrickstox @ahrefs #pubcon
Processing – Duplicate Elimination
Canonicalization
@patrickstox @ahrefs #pubcon
~20 Canonicalization Signals
• Duplicates
• Redirects (high weight)
• Canonical link elements - multiple will be ignored
• Sitemap URLs
• Links (Internal/External, PageRank)
• Alternates – mobile, AMP, print, Hreflang
• HTTPS pages over HTTP
• Shorter URLs over longer URLs
• Where content was first published / seen
• Site level signals like a history of scraped content
• Pages over PDFs
Machine learning system
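Purely to show the shape of "many weighted signals feed one decision" (the weights below are invented for illustration; the real system is a trained ML model):

# Invented weights – the real system is a trained model, not hand-tuned numbers.
WEIGHTS = {"redirect_target": 3.0, "rel_canonical": 2.0, "in_sitemap": 1.0,
           "https": 0.5, "shorter_url": 0.5, "internal_links": 0.01}

def canonical_score(signals):
    return sum(WEIGHTS[name] * value for name, value in signals.items())

candidates = {
    "http://example.com/page?ref=abc": {"internal_links": 2},
    "https://example.com/page": {"rel_canonical": 1, "in_sitemap": 1, "https": 1,
                                 "shorter_url": 1, "internal_links": 40},
}
canonical = max(candidates, key=lambda url: canonical_score(candidates[url]))
print("canonical:", canonical)  # https://example.com/page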
@patrickstox @ahrefs #pubcon
301 = Permanent, 302 = Temporary
The same holds true for other permanent and temporary redirect types.
@patrickstox @ahrefs #pubcon
Warning! Speculation
@patrickstox @ahrefs #pubcon
Processing – Link Parser
Good:
<a> tag with an href attribute.
<a href="/page">simple is good</a>
<a href="/page" onclick="goTo('page')">still okay</a>
@patrickstox @ahrefs #pubcon
Processing – Link Parser
Bad (but may be parsed):
<a routerLink="products/category">no href</a>
<a onclick=”goTo(‘page’)”>no href</a>
<a href=”javascript:goTo(‘page’)”>kind of nested</a>
<a href=”javascript:void(0)”>missing link</a>
<span onclick=”goTo(‘page’)”>not the right HTML element or href</span>
<span href=“page">not the right HTML element</span>
<option value="page">not the right HTML element</option>
<a href=”#”>no link</a>
Button, ng-click, there are many more ways this can be done incorrectly.
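A minimal sketch of why the href-less patterns fail: a simple parser (not Google's) only collects links where an <a> tag actually carries a usable href:

from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    # Collects href values from <a> tags only – the "good" patterns above.
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # onclick-only anchors, spans, buttons, etc. never reach this branch
            if href and not href.startswith(("javascript:", "#")):
                self.links.append(href)

parser = LinkExtractor()
parser.feed('<a href="/page">good</a> <span onclick="goTo(\'page\')">ignored</span> '
            '<a href="javascript:void(0)">ignored</a>')
print(parser.links)  # ['/page']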
@patrickstox @ahrefs #pubcon
Processing – Link Parser
• Link location, where it goes
• Anchor text
• Surrounding text
• …
@patrickstox @ahrefs #pubcon
Link Tagging
• Penguin
• Location on page (footer, main content)
• Disavow
• …
@patrickstox @ahrefs #pubcon
Processing – Content Parser
• Content – tokenized, vectorized. Words become numbers.
• Content language
• Content location
• Extract meta tags
• Extract Schema
• HTML Lexer – normalize the HTML
• Topic analysis. Content on other topics may be weighted less in ranking.
• Semantic analysis. Linguistic, knowledge graph, address extraction
• …
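"Words become numbers" in the simplest possible form – a bag-of-words sketch (real systems use learned embeddings, not raw counts):

import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

docs = ["How search works", "How a search engine crawls and indexes the web"]
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
token_ids = {tok: i for i, tok in enumerate(vocab)}   # every word gets a number
print(token_ids)

for doc in docs:
    counts = Counter(tokenize(doc))
    vector = [counts.get(tok, 0) for tok in vocab]    # one count per vocabulary word
    print(vector)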
@patrickstox @ahrefs #pubcon
Content Tagging
• YMYL
• Adult / safe search
• Mobile-friendly
• …
@patrickstox @ahrefs #pubcon
Signal Collectors
• PageRank
• Spam
• Page Experience
• Freshness
• …
@patrickstox @ahrefs #pubcon
A Lot More Happens In Processing, Like:
• Drop anything after # in URLs (with some exceptions).
• Most restrictive directives win: index + noindex + index = noindex (see the sketch below).
• They’ll drop low-quality content.
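The most-restrictive-directive rule as a sketch:

def effective_robots_directive(directives):
    # If any source says noindex (or none), that wins: index + noindex + index = noindex.
    return "noindex" if "noindex" in directives or "none" in directives else "index"

print(effective_robots_directive(["index", "noindex", "index"]))  # noindex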
@patrickstox @ahrefs #pubcon
Other Files May Be Processed Differently
• Adobe Portable Document Format (.pdf)
• Adobe PostScript (.ps)
• Google Earth (.kml, .kmz)
• GPS eXchange Format (.gpx)
• Hancom Hanword (.hwp)
• HTML (.htm, .html, other file extensions)
• Lotus
• Microsoft Excel (.xls, .xlsx)
• Microsoft PowerPoint (.ppt, .pptx)
• Microsoft Word (.doc, .docx)
• OpenOffice presentation (.odp)
• OpenOffice spreadsheet (.ods)
• OpenOffice text (.odt)
• Rich Text Format (.rtf)
• Scalable Vector Graphics (.svg)
• TeX/LaTeX (.tex)
• Text (.txt, .text, other file extensions), including source code in common programming languages:
  • Basic source code (.bas)
  • C/C++ source code (.c, .cc, .cpp, .cxx, .h, .hpp)
  • C# source code (.cs)
  • Java source code (.java)
  • Perl source code (.pl)
  • Python source code (.py)
• Wireless Markup Language (.wml, .wap)
• XML (.xml)
@patrickstox @ahrefs #pubcon
Image Processing
• Text around the image
• Content of the image. They tag what is in the image. Not super reliable.
• Alt attribute
• Image name (minimal weight)
• Webpage title and description
Photo from a Gary Illyes Presentation
at Pubcon.
@patrickstox @ahrefs #pubcon
Robots.txt for Images
Blocking Googlebot Image from crawling means that your images will not
be indexed.
@patrickstox @ahrefs #pubcon
Video Processing
• OCR to get text
• Objects identified from visuals
• Speech converted to text
• Structured data
• Text and other signals from the page, URL, title, description
@patrickstox @ahrefs #pubcon
PDFs
• PDFs are converted and indexed as HTML
• OCR to get text
• Images get indexed
• Links get picked up
• Title
• File name
• …
@patrickstox @ahrefs #pubcon
Google Index
Named Caffeine
@patrickstox @ahrefs #pubcon
Data Infrastructure
Many data centers around the world.
Each has a copy of the index.
Millions of servers and hard drives.
Index is an inverted index.
Maps things like words to documents.
Index shards are split into words and phrases.
Other shards for metadata.
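An inverted index in miniature (a sketch; the real index is sharded across data centers and split into tiers):

from collections import defaultdict

docs = {
    1: "how search works",
    2: "how googlebot crawls the web",
    3: "search engine indexing explained",
}

inverted_index = defaultdict(set)        # term -> set of doc IDs (the posting list)
for doc_id, text in docs.items():
    for term in text.split():
        inverted_index[term].add(doc_id)

print(sorted(inverted_index["search"]))  # [1, 3]
print(sorted(inverted_index["how"]))     # [1, 2]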
@patrickstox @ahrefs #pubcon
Indexing Tiers – Based On Doc Popularity
• RAM (fastest)
• SSD (fast)
• Hard drives (slowest)
@patrickstox @ahrefs #pubcon
Mobile Version Is Indexed (Mostly)
Some sites may remain on desktop-only indexing.
These are sites that don’t work on mobile.
@patrickstox @ahrefs #pubcon
Life Of A Query
@patrickstox @ahrefs #pubcon
Fun Fact
15% of queries have never been seen before
@patrickstox @ahrefs #pubcon
Start Typing - Autocomplete
Powered by real search data and patterns across the web, plus:
• The language of the query
• The location a query is coming from
• Trending interest in a query
• Your past searches
Probably reduces misspellings
@patrickstox @ahrefs #pubcon
Query parsing and understanding
BERT (DeepRank) – combinations of words express different
meanings and intents. They won’t drop important words from
the queries.
Neural matching – words to searches.
“For example, neural matching helps Google understand that a
search for “why does my TV look strange” is related to the
concept of “the soap opera effect.” We can then return pages
about the soap opera effect, even if the exact words aren’t used.”
@patrickstox @ahrefs #pubcon
Misspelling
1 in 10 searches is misspelled
@patrickstox @ahrefs #pubcon
Google Training Misspelling Example
Over 600 ways people misspelled Britney Spears.
http://archive.google.com/jobs/britney.html
@patrickstox @ahrefs #pubcon
Spelling Old Vs New
Old way:
How often terms were searched + the probability of typos from neighboring keys
New way:
Deep neural net with 680M parameters
@patrickstox @ahrefs #pubcon
Query Expansion
When the query is sent, it’s going to also pull pages with terms that include:
• Synonyms
• Antonyms
• Acronyms
• Plural/singular
• Stemming – root words
• Diacritical expansion – versions with and without accent characters
These will mostly get lower weights in scoring than the main term used.
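A toy version of "expanded terms score lower than the original term" (the expansion table and weights are invented for illustration):

# Invented expansion table and weights – just to show the shape of the idea.
EXPANSIONS = {"tv": ["television"], "strange": ["weird", "odd"]}

def expand_query(terms):
    weighted = [(term, 1.0) for term in terms]    # original terms get full weight
    for term in terms:
        weighted += [(alt, 0.5) for alt in EXPANSIONS.get(term, [])]  # expansions score lower
    return weighted

print(expand_query(["tv", "looks", "strange"]))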
@patrickstox @ahrefs #pubcon
Concepts & Entities
People, places, things
“RankBrain helps Google better relate pages to concepts – This
means Google can better return relevant pages even if they
don’t contain the exact words used in a search, by
understanding the page is related to other words and
concepts.”
@patrickstox @ahrefs #pubcon
Speculation
All the query expansion things may not
be necessary anymore. They may just
pull close terms in vector space.
@patrickstox @ahrefs #pubcon
Stop Words
The, is, and, of, a, are, an, if, etc.
Removed for some queries.
Used for other queries, like when it matches a concept.
@patrickstox @ahrefs #pubcon
Segmenter
Splits up strings (for languages written without spaces between words).
'上海浦东开发与建设同步' → ['上海', '浦东', '开发', '与', '建设', '同步']
@patrickstox @ahrefs #pubcon
Retrieval – Posting List
Remember that inverted index?
Map of terms to pages that contain those terms. Get all those.
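Retrieval against a toy posting list: look up each query term and combine the document sets (a sketch, not the production system):

# A tiny hard-coded posting list, like the inverted index sketch earlier.
index = {"search": {1, 3}, "how": {1, 2}, "works": {1}}

def retrieve(query_terms):
    postings = [index.get(term, set()) for term in query_terms]
    return set.union(*postings) if postings else set()   # candidate docs to score next

print(sorted(retrieve(["how", "search"])))  # [1, 2, 3]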
@patrickstox @ahrefs #pubcon
Sum Of The Total Pages From All Shards
@patrickstox @ahrefs #pubcon
Popular Queries Are Cached
@patrickstox @ahrefs #pubcon
Make A Smaller List - Ranking
Google is going to cut all those results down to the top 1000 by ranking them.
@patrickstox @ahrefs #pubcon
Ranking / Scoring – Query Dependent
Feature of a page & query
• Keyword hits
• All those other versions from the query expansion like synonyms
• Proximity
• Content relevance, topicality
• …
@patrickstox @ahrefs #pubcon
Ranking / Scoring – Query Independent
Feature of a page
• PageRank, site queries, mentions,
& other E-E-A-T signals
• Language
• Mobile-friendliness
• Page experience
• …
Numbers multiplied by other numbers in the scoring
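"Numbers multiplied by other numbers" in sketch form – a query-dependent relevance score scaled by query-independent multipliers (all values invented for illustration):

def final_score(relevance, page_signals):
    # relevance: query-dependent score; page_signals: query-independent multipliers.
    score = relevance
    for multiplier in page_signals.values():
        score *= multiplier
    return score

print(final_score(relevance=2.4,
                  page_signals={"pagerank": 1.3, "mobile_friendly": 1.0,
                                "page_experience": 1.05}))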
@patrickstox @ahrefs #pubcon
They’re Like Nah, We Can Do Better
@patrickstox @ahrefs #pubcon
Reranking / Post-Retrieval Adjustments
Works with the smaller set of results (the top 1,000).
With the smaller number, they can run more intelligent but resource intensive
systems to re-order the results.
@patrickstox @ahrefs #pubcon
RankBrain & BERT - Again
“Based on its complex language understanding, BERT can very quickly rank
documents for relevance.”
Depending on the search, Google’s algorithm can use either RankBrain, BERT,
or both.
@patrickstox @ahrefs #pubcon
Host Clustering
Limits the results you see from the same domain.
Add &filter=0 to your search URL to see unfiltered results.
@patrickstox @ahrefs #pubcon
Hreflang
Tries to swap to the most relevant country/language version of a page.
@patrickstox @ahrefs #pubcon
DMCA, Privacy Removals, URL Removal Tool
@patrickstox @ahrefs #pubcon
Spelling Corrections
@patrickstox @ahrefs #pubcon
Trending Topics Are Promoted
@patrickstox @ahrefs #pubcon
Spam
Spam demotions
Manual actions
@patrickstox @ahrefs #pubcon
Query Other Systems - Universal results
News, Maps, Images, Videos, etc.
Results are bidding for their position
@patrickstox @ahrefs #pubcon