Genome in a Bottle: So you’ve sequenced a genome – how well did you do?
February 2015
Justin Zook, Marc Salit, and the Genome in a Bottle Consortium
Whole genome sequencing technologies disagree about hundreds of thousands of variants
[Venn diagram: SNP calls from Platform #1, Platform #2, and Platform #3. Each region shows # SNPs (% of SNPs detected by any platform).]
• 3,198,316 (80.05%)
• 125,574 (3.14%)
• 230,311 (5.76%)
• 121,440 (3.04%)
• 208,038 (5.21%)
• 71,944 (1.80%)
• 39,604 (0.99%)
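The percentages on this slide can be reproduced from the seven region counts alone, since each is a share of the SNPs detected by any platform. The short sketch below shows that arithmetic. Which count belongs to which region of the Venn diagram is not recoverable from the slide text, so the counts are kept as an unlabeled list; the function name `percent_of_union` is purely illustrative.

```python
# Region counts as printed on the slide. The mapping of counts to
# specific Venn regions is not recoverable from the text, so no
# region labels are assumed here.
region_counts = [3_198_316, 125_574, 230_311, 121_440, 208_038, 71_944, 39_604]

# Every SNP falls in exactly one region, so the sum is the number of
# SNPs detected by any platform.
total = sum(region_counts)


def percent_of_union(count, union_size):
    """Share of the union of all call sets, in percent, to 2 decimals."""
    return round(100.0 * count / union_size, 2)


percentages = [percent_of_union(c, total) for c in region_counts]

# SNPs that fall outside the largest region of the diagram.
outside_largest = total - max(region_counts)

for count, pct in zip(region_counts, percentages):
    print(f"{count:>9,}  {pct:5.2f}%")
print(f"total: {total:,}; outside the largest region: {outside_largest:,}")
```

The number outside the largest region is what the slide title refers to: several hundred thousand SNPs are not shared by the region that holds most calls.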
Bioinformatics programs also disagree
O’Rawe et al. Genome Medicine 2013, 5:28
NIST-hosted
Genome in a Bottle Consortium
• Infrastructure for performance
assessment of NGS
– support science-based regulatory
oversight
• No widely accepted set of metrics
to characterize the fidelity of
variant calls from NGS…
• Genome in a Bottle Consortium is
developing standards to address
this…
– well-characterized human genomes
as Reference Materials (RMs)
• characterized and disseminated by NIST
– tools and methods to use these RMs
• Global Alliance for Genomics and
Health Benchmarking Team
http://genomeinabottle.org
Genome in a Bottle
Consortium Development
• NIST met with sequencing
technology developers to assess
standards needs
– Stanford, June 2011
• Open, exploratory workshop
– ASHG, Montreal, Canada
– October 2011
• Small, invitational workshop at
NIST to develop consortium for
human genome reference
materials
– FDA, NCBI, NHGRI, NCI, CDC, Wash
U, Broad, technology developers,
clinical labs, CAP, PGP, Partners,
ABRF, others
– developed draft work plan
– April 2012
• Open, public meetings of GIAB
– August 2012 at NIST
– March 2013 at Xgen
– August 2013 at NIST
– January 2014 at Stanford
– August 2014 at NIST
– January 2015 at Stanford
• Website
– www.genomeinabottle.org
Others working in this space…
Well-characterized genomes
• Illumina Platinum Genomes
• CDC GeT-RM
• Korean Genome Project
• Human Longevity, Inc.
• Hydatidiform mole haploid
cell line
• Genome Reference
Consortium
Performance Metrics
• Global Alliance for
Genomics and Health
Benchmarking Team
• NCBI/CDC GeT-RM Browser
• GCAT website
NIST Plays a Role in the First FDA Authorization for
Next-Generation Sequencer
November 20, 2013
Measurement Process
Sample → gDNA isolation → Library Prep → Sequencing → Alignment/Mapping → Variant Calling → Confidence Estimates → Downstream Analysis
[Figure labels on the generic measurement process: Pre-Analytical steps; Analytical steps; Clinical Interpretation]
• gDNA reference materials will be developed to characterize performance of a part of the process
– materials will be certified for their variants against a reference sequence, with confidence estimates
Putting “Genomes” in Bottles
• NIST worked with GIAB to select genomes
• Current genomes
– NA12878 HapMap sample as Pilot sample
• part of 17-member pedigree
– 2 trios from PGP
• Ashkenazim
• Asian
[Figure: CEPH Utah Pedigree 1463. Grandparents: 12889, 12890, 12891, 12892. Parents: 12877, 12878. 11 children: 12879, 12880, 12881, 12882, 12883, 12884, 12885, 12887, 12886, 12888, 12893.]
NIST Human Genome RMs in the
pipeline
• All are 10 µg samples of DNA
isolated from multistage large-growth
cell cultures
– all are intended to act as stable,
homogeneous references
suitable for use in regulated
applications
– all genomes also available from
Coriell repository
• Pilot Genome
– ~8400 tubes
• Ashkenazim Jewish Trio
– ~10000 son; ~2500 each parent
• Asian Trio
– ~10000 son; parents not yet
planned as NIST RM
Goals for Data to Accompany RM
• ~0 false positive AND false negative calls in
confident regions
• Include as much of the genome as possible in
the confident regions (i.e., don’t just take the
intersection)
• Avoid bias towards any particular platform
– take advantage of strengths of each platform
• Avoid bias towards any particular
bioinformatics algorithms
Pilot Genome: Integrate 14 Datasets from 5 platforms
Integration Methods to Establish Benchmark Variant Calls
[Figure: histograms of Annotation #1 (e.g., coverage) and Annotation #2 (e.g., strand bias) for Dataset #1, Dataset #2, and Dataset #3, with Site A, Site B, and Site C marked and a “Potential Bias” region highlighted.]
Dataset | Site A | Site B | Site C
Dataset #1 | 0/0 | 0/0 | 1/1
Dataset #2 | 0/1 | 0/1 | 1/1
Dataset #3 | 0/0 | 0/1 | 1/1
Integration | 0/0 | 0/1 | Uncertain
Steps: Candidate variants → Concordant variants → Find characteristics of bias → Arbitrate using evidence of bias → Confidence Level
Zook et al., Nature Biotechnology, 2014.
Assigning confidence to genotypes
High-confidence sites
• Sequencing/bioinformatics
methods agree or we
understand the biases
causing disagreement
• At least some methods have
no evidence of bias
• Inherited as expected
Less confident sites
• In a region known to be
difficult for current
technologies
• State reasons for lower
confidence
• If a site is near a low
confidence site, make it low
confidence
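The arbitration rules above can be illustrated with a small sketch. It is not the published integration method, only a simplified reading of the rules on these slides: drop datasets that show evidence of bias at a site, accept the genotype if the remaining datasets agree, and otherwise mark the site uncertain. The genotypes come from the integration table; the per-site bias flags are hypothetical, chosen so that the toy result matches the table's Integration row.

```python
from collections import Counter


def integrate_site(genotypes, biased):
    """Arbitrate a single site.

    genotypes: dict mapping dataset name -> genotype string, e.g. "0/1"
    biased:    set of dataset names with evidence of bias at this site
    Returns (genotype or None, reason). None means "uncertain".
    """
    unbiased = {d: g for d, g in genotypes.items() if d not in biased}
    if not unbiased:
        # No method is free of bias evidence, so agreement is not trusted.
        return None, "uncertain: every dataset shows evidence of bias"
    calls = Counter(unbiased.values())
    if len(calls) == 1:
        return next(iter(calls)), "datasets without evidence of bias agree"
    return None, "uncertain: datasets without evidence of bias disagree"


# Genotypes from the slide's table.
sites = {
    "Site A": {"Dataset #1": "0/0", "Dataset #2": "0/1", "Dataset #3": "0/0"},
    "Site B": {"Dataset #1": "0/0", "Dataset #2": "0/1", "Dataset #3": "0/1"},
    "Site C": {"Dataset #1": "1/1", "Dataset #2": "1/1", "Dataset #3": "1/1"},
}

# Hypothetical bias flags (an assumption for illustration only).
bias_flags = {
    "Site A": {"Dataset #2"},
    "Site B": {"Dataset #1"},
    "Site C": {"Dataset #1", "Dataset #2", "Dataset #3"},
}

results = {
    name: integrate_site(gts, bias_flags[name])[0] for name, gts in sites.items()
}
print(results)
```

Site C shows why agreement alone is not enough in this reading: all three datasets call 1/1, but with no dataset free of bias evidence the site stays uncertain.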
Challenges with assessing
performance
• All variant types are not
equal
• All regions of the genome
are not equal
• Labeling difficult variants
as uncertain leads to
higher apparent accuracy
when assessing
performance
• Genotypes fall in 3+
categories (not
positive/negative)
– standard diagnostic
accuracy measures not
well posed
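The last point can be made concrete with a toy example. Because a genotype is hom-ref, het, or hom-var, the same comparison yields different TP/FP/FN counts depending on whether a wrong genotype at a true variant site counts as a match. The sketch below uses hypothetical site names and genotypes, and the rule that a genotype mismatch counts as both a false positive and a false negative is one possible convention, stated here as an assumption.

```python
from collections import Counter


def genotype_confusion(benchmark, query):
    """Count (benchmark genotype, query genotype) pairs over all sites.

    Sites missing from either dict are treated as homozygous reference.
    """
    sites = set(benchmark) | set(query)
    return Counter((benchmark.get(s, "0/0"), query.get(s, "0/0")) for s in sites)


def summarize(confusion, require_genotype_match):
    """Collapse the genotype confusion counts into (TP, FP, FN)."""
    tp = fp = fn = 0
    for (truth, call), n in confusion.items():
        truth_is_variant = truth != "0/0"
        call_is_variant = call != "0/0"
        if truth_is_variant and call_is_variant:
            if truth == call or not require_genotype_match:
                tp += n
            else:
                # Assumed convention: a wrong genotype is both a missed
                # benchmark call and a false query call.
                fp += n
                fn += n
        elif call_is_variant:
            fp += n
        elif truth_is_variant:
            fn += n
    return tp, fp, fn


# Hypothetical toy data.
benchmark = {"chr1:100": "0/1", "chr1:200": "1/1", "chr1:300": "0/1"}
query = {"chr1:100": "0/1", "chr1:200": "0/1", "chr1:400": "0/1"}

confusion = genotype_confusion(benchmark, query)
allele_level = summarize(confusion, require_genotype_match=False)
genotype_level = summarize(confusion, require_genotype_match=True)
print("allele-level   TP, FP, FN:", allele_level)
print("genotype-level TP, FP, FN:", genotype_level)
```

Reporting which convention was used matters as much as the numbers themselves.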
Challenge in variant comparison: Complex
variants have multiple correct representations
[Figure: the same reads aligned by BWA, ssaha2, CGTools, and Novoalign against the reference (Ref). Labels: T insertion; TCTCT insertion.]
Comparison | FP SNPs | FP MNPs | FP indels
Traditional comparison | 0.38% (610) | 100% (915) | 6.5% (733)
Comparison with realignment | 0.15% (249) | 4.2% (38) | 2.6% (298)
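One way to see why representation matters is to compare the sequences that variant calls produce rather than the calls themselves. The sketch below is not the realignment method behind the table; it is a minimal illustration, with made-up reference strings and a hypothetical helper `apply_variants`, showing two differently written calls that yield the same haplotype even though a record-by-record comparison treats them as different.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping variants to a reference string.

    variants: list of (pos, ref_allele, alt_allele), pos is 0-based.
    Returns the resulting haplotype sequence.
    """
    pieces = []
    cursor = 0
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants are not supported")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"REF allele {ref!r} does not match reference at {pos}")
        pieces.append(reference[cursor:pos])  # unchanged sequence before the variant
        pieces.append(alt)                    # the alternate allele
        cursor = pos + len(ref)
    pieces.append(reference[cursor:])
    return "".join(pieces)


# Toy reference containing a CT repeat (hypothetical sequence).
reference = "ACTCTCTG"

# The same CT insertion written at the left and right ends of the repeat.
left_aligned = [(0, "A", "ACT")]
right_aligned = [(6, "T", "TCT")]

same_records = set(left_aligned) == set(right_aligned)
same_haplotype = apply_variants(reference, left_aligned) == apply_variants(
    reference, right_aligned
)
print("identical records:  ", same_records)
print("identical haplotype:", same_haplotype)

# An MNP written as one record versus two adjacent SNPs.
mnp = apply_variants("ACGT", [(1, "CG", "TA")])
two_snps = apply_variants("ACGT", [(1, "C", "T"), (2, "G", "A")])
print("MNP equals two SNPs:", mnp == two_snps)
```

A comparison that only matches records would score the right-aligned call as one false positive plus one false negative, which is the kind of artifact the "traditional comparison" row reflects.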
Global Alliance for Genomics and Health
Benchmarking Task Team
• Formed June 2014 to develop
methods and tools for comparing
variant calls to a benchmark
• Developed standardized definitions
for performance metrics like TP, FP,
and FN.
• Initial focus on germline SNPs/indels
• Developing benchmarking tools
• Comparison engine
• Pluggable web interface with
modules for:
• Reporting/calculation of metrics
• Visualization/user interface
• Working with Genome in a Bottle
Consortium to host data and calls
from their well-characterized
genomes
www.bioplanet.com/gcat
Example User Interface
Stratifying Performance
• Measure performance for
different types of variants in
different sequence contexts
– Types of variants
• SNPs
• indels of different sizes
• complex variants
• structural variants
– Sequence contexts
• Homopolymers
• STRs
• Duplications
– Functional context
• Exome vs genome, etc
– Data characteristics
• Coverage
• Mapping quality
• Challenge of smaller gene
panels vs genome
sequencing
– one RM may not have a
sufficient number of
examples of different classes
of variants or sequence
contexts
– likely need more samples
with specific types of variants
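Stratification amounts to computing the same metrics separately for each class of variant or context. The following sketch assumes each compared call has already been labeled with a stratum and an outcome (TP, FP, or FN); the strata names and counts are hypothetical toy values, not results from any GIAB data.

```python
from collections import defaultdict


def stratified_metrics(records):
    """Compute precision and recall per stratum.

    records: iterable of (stratum, outcome), outcome in {"TP", "FP", "FN"}.
    """
    counts = defaultdict(lambda: {"TP": 0, "FP": 0, "FN": 0})
    for stratum, outcome in records:
        counts[stratum][outcome] += 1

    metrics = {}
    for stratum, c in counts.items():
        called = c["TP"] + c["FP"]   # everything the pipeline reported
        truth = c["TP"] + c["FN"]    # everything in the benchmark
        metrics[stratum] = {
            "precision": c["TP"] / called if called else None,
            "recall": c["TP"] / truth if truth else None,
            "n": called + c["FN"],
        }
    return metrics


# Hypothetical toy outcomes.
records = (
    [("SNP", "TP")] * 8
    + [("SNP", "FP"), ("SNP", "FN")]
    + [("indel in homopolymer", "TP")] * 2
    + [("indel in homopolymer", "FP")]
    + [("indel in homopolymer", "FN")] * 2
)

metrics = stratified_metrics(records)
for stratum, m in metrics.items():
    print(stratum, m)
```

The `n` field is kept alongside the ratios because a stratum with only a handful of examples gives unstable estimates, which is the concern raised above for small gene panels.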
NCBI/CDC GeT-RM Browser
• http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/
• Allows visualization of questionable calls
Initial uses of high-confidence NIST-GIAB
genotypes for NA12878
• NIST has released
several versions of
high-confidence genotypes
for its pilot RM
• These data are
presently being used for
benchmarking
– prior to release of RMs
– SNPs & indels
• ~77% of the genome
Using Genome in a Bottle calls to
benchmark clinical exome sequencing
at Mount Sinai School of Medicine
“We evaluate a set of
NA12878 technical replicates
against GIAB for each new
pipeline version.”
Benchmarking somatic variant calling
at Qiagen
Implications of Technical Accuracy in
Medical Genome Sequencing
• Collaboration with Euan
Ashley group at Stanford
• What is accuracy for
functional variants?
• How much of the exome
falls in high confidence
regions?
• “Black list” in databases
• Sensitivity
– WExS (95%) < WGS (98%)
• especially splicing
– genome < nonsyn < syn
– Most exome FNs caused by
low coverage
– Most WGS FNs caused by
filtering
• Only 81% of ClinVar
pathogenic or likely
pathogenic SNPs fall in
high-confidence regions
– Lots of work to do!
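A figure like the ClinVar percentage above comes from asking what fraction of a set of sites lies inside the high-confidence regions. The sketch below shows one way to compute that, assuming the regions are non-overlapping intervals in BED-style coordinates (0-based, half-open). The intervals and sites are hypothetical toy values, and the helper names are illustrative.

```python
from bisect import bisect_right


def build_index(intervals):
    """Group sorted interval starts and ends by chromosome.

    intervals: iterable of (chrom, start, end), 0-based half-open,
    assumed non-overlapping within a chromosome.
    """
    index = {}
    for chrom, start, end in sorted(intervals):
        starts, ends = index.setdefault(chrom, ([], []))
        starts.append(start)
        ends.append(end)
    return index


def in_regions(index, chrom, pos):
    """True if the 0-based position lies inside any interval."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    i = bisect_right(starts, pos) - 1  # last interval starting at or before pos
    return i >= 0 and pos < ends[i]


def fraction_covered(index, sites):
    """Fraction of (chrom, pos) sites that fall inside the regions."""
    sites = list(sites)
    if not sites:
        return 0.0
    return sum(in_regions(index, c, p) for c, p in sites) / len(sites)


# Hypothetical confident regions and sites of interest.
confident = build_index([("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 0, 50)])
sites = [("chr1", 150), ("chr1", 200), ("chr1", 399), ("chr2", 10), ("chr3", 5)]

fraction = fraction_covered(confident, sites)
print(f"{100 * fraction:.0f}% of sites fall in confident regions")
```

Sites outside the regions are not necessarily wrong calls; they simply cannot be assessed against the benchmark.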
Overview of NIST RM Development
Timeline columns: Q4 2014, Q1 2015, Q2 2015, Q3 2015, Q4 2015 (milestones for each genome are listed in timeline order)
HG-001/NA12878 (“Pilot” Genome)
– Release NIST RM8398; Preliminary large deletions
– Refined Structural Variants
HG-002 to HG-004 (Ashkenazim trio)
– Illumina, Complete Genomics, Ion, BioNano, homogeneity/stability
– Preliminary SNPs/indels; 120x-150x PacBio data; “moleculo”; mate-pair; CG-LFR
– Refined SNPs/indels; Preliminary SVs
– Refined Structural Variants
– NIST RMs 8391/8392 release
HG-005 (son in Asian trio)
– Illumina, Complete Genomics, Ion, BioNano, homogeneity/stability
– “moleculo”; mate-pair; CG-LFR
– Preliminary SNPs/indels
– Refined SNPs/indels; Refined Structural Variants
– NIST RM8393 release
Ashkenazim Jewish PGP RM Trio
Dataset | Characteristics | Coverage | Availability | Good for…
Illumina Paired-end | 150x150bp | ~300x/individual | Fastq on ftp | SNPs/indels/some SVs
Illumina Long Mate pair | ~6000 bp insert | ~40x/individual | Feb-Mar 2015 | SVs
Illumina “moleculo” | Custom library | ~30x by long fragments | Feb-Mar 2015 | SVs/phasing/assembly
Complete Genomics | | 100x/individual | On ftp | SNPs/indels/some SVs
Complete Genomics | LFR | | ?? | SNPs/indels/phasing
Ion Proton | Exome | 1000x/individual | On SRA | SNPs/indels in exome
BioNano Genomics | | | Feb 2015 | SVs/assembly
PacBio | ~10kb reads | ~120-150x on AJ trio | Finished ~Mar 2015 | SVs/phasing/assembly/STRs
Asian PGP trio
• Similar sequencing to
Ashkenazim trio except
for PacBio
• Only son will be NIST
RM
Future Directions
Germline mutations
• Difficult regions/variants
– Long-read technologies
– Forming an analysis group
• Tools for assessing
performance
– How to stratify performance
and understand biases?
Somatic mutations
• Pilot interlaboratory study
to assess comparability of
spike-ins
• Commercial members
developing FFPE cell lines
• Participants interested in
mixing different RMs
How to get involved
• Use our integrated
SNP/indel genotypes for
NA12878 and give us
feedback
– Cells and DNA currently
available from Coriell
– NIST RM available April
2015
• Join our new Analysis
group
– Use Long-read
technologies
– Structural Variant calls
– De novo assembly
– Help create the best-ever
characterized trio
• Attend our biannual
workshops (January in CA,
August in MD)
• Develop tools/metrics
with Global Alliance for
Genomics and Health
Benchmarking Team
Acknowledgments
• FDA – Elizabeth Mansfield,
HPC staff
• HSPH
• GCAT - David Mittelman,
Jason Wang
• Francisco De La Vega
• Illumina - Mike Eberle
• Personalis - Deanna Church
• NCBI – Chunlin Xiao
• Celera - Andrew Grupe
• Genome in a Bottle
– www.genomeinabottle.org
– New members welcome!
– Sign up for email newsletters
– jzook@nist.gov
