Genome in a Bottle: Tools for Using NIST Reference Materials
Next Generation Diagnostics Summit Short Course
August 2014
Justin Zook, Marc Salit, and the Genome in a Bottle Consortium
Learning Objectives
• How can Genome in a Bottle Reference Materials help with validating NGS assays?
• Comparing your variant calls to high-confidence calls
• Tools available for understanding potential false positives and false negatives
• Examples of how labs are using our high-confidence calls
NIST-hosted Genome in a Bottle Consortium
• Infrastructure for performance assessment of NGS
– support science-based regulatory oversight
• There is no widely accepted set of metrics to characterize the fidelity of variant calls from NGS
• The Genome in a Bottle Consortium is developing standards to address this:
– human genomes as Reference Materials (RMs)
• characterized and disseminated by NIST
– tools and methods to use these RMs
• common sequencing instruments
• bioinformatics workflows
http://genomeinabottle.org
Whole genome sequencing technologies disagree about hundreds of thousands of variants
[Venn diagram: SNP calls from Platform #1, Platform #2, and Platform #3. Region labels give # SNPs (% of SNPs detected by any platform): 3,198,316 (80.05%); 125,574 (3.14%); 230,311 (5.76%); 121,440 (3.04%); 208,038 (5.21%); 71,944 (1.80%); 39,604 (0.99%)]
Bioinformatics programs also disagree
O’Rawe et al. Genome Medicine 2013, 5:28
Measurement Process
Sample → gDNA isolation → Library Prep → Sequencing → Alignment/Mapping → Variant Calling → Confidence Estimates → Downstream Analysis
• gDNA reference materials will be developed to characterize the performance of a part of the process
– materials will be certified for their variants against a reference sequence, with confidence estimates
NIST Human Genome RMs in the pipeline
• All are 10 µg samples of DNA isolated from multistage large growth cell cultures
– all are intended to act as stable, homogeneous references suitable for use in regulated applications
– all genomes are also available from the Coriell repository
• Pilot Genome
– ~8400 tubes
• Ashkenazim Jewish Trio
– ~10000 tubes for the son; ~2500 for each parent
• Asian Trio
– ~10000 tubes for the son; parents not yet planned as NIST RMs
Goals for Data to Accompany RM
• ~0 false positive AND false negative calls in confident regions
• Include as much of the genome as possible in the confident regions (i.e., don’t just take the intersection)
• Avoid bias towards any particular platform
– take advantage of strengths of each platform
• Avoid bias towards any particular bioinformatics algorithm
Integration Methods to Establish Reference Variant Calls
Candidate variants → Concordant variants → Find characteristics of bias → Arbitrate using evidence of bias → Confidence Level
Zook et al., Nature Biotechnology, 2014.
Assigning confidence to genotypes
High-confidence sites
• Sequencing/bioinformatics methods agree or we understand the biases causing disagreement
• At least some methods have no evidence of bias
• Inherited as expected
Less confident sites
• In a region known to be difficult for current technologies
• State reasons for lower confidence
• If a site is near a low-confidence site, make it low confidence
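The last rule above (a site near a low-confidence site is itself made low confidence) can be sketched as a simple distance check. This is only an illustration under stated assumptions, not the consortium's implementation: the function name `demote_near_low_confidence` and the 50 bp default window are hypothetical choices made for the example.

```python
from bisect import bisect_left


def demote_near_low_confidence(high_sites, low_sites, window=50):
    """Split high-confidence positions into (kept, demoted).

    high_sites, low_sites: lists of (chrom, pos) tuples.
    window: hypothetical distance threshold in bp (an assumption,
    not a value taken from the slides).
    """
    # Group low-confidence positions by chromosome and sort them
    low_by_chrom = {}
    for chrom, pos in low_sites:
        low_by_chrom.setdefault(chrom, []).append(pos)
    for positions in low_by_chrom.values():
        positions.sort()

    kept, demoted = [], []
    for chrom, pos in high_sites:
        positions = low_by_chrom.get(chrom, [])
        i = bisect_left(positions, pos)
        # The nearest low-confidence neighbours sit at indices i-1 and i
        near = any(
            abs(positions[j] - pos) <= window
            for j in (i - 1, i)
            if 0 <= j < len(positions)
        )
        (demoted if near else kept).append((chrom, pos))
    return kept, demoted
```

Sorting the low-confidence positions once and using a binary search keeps the check cheap even for millions of sites; the appropriate window size would depend on the technologies and variant types involved.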
Reasons we exclude regions from high-confidence set
Challenges with assessing performance
• Not all variant types are equal
• Not all regions of the genome are equal
– Homopolymers, STRs, duplications
– Can be similar or different in different genomes
• Labeling difficult variants as uncertain leads to higher apparent accuracy when assessing performance
• Genotypes fall in 3+ categories (not positive/negative)
– standard diagnostic accuracy measures are not well posed
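To illustrate why two-class accuracy measures are awkward for genotypes, the sketch below tallies a concordance table over three genotype categories plus "no call". The category names, the function `genotype_concordance`, and the toy data are all hypothetical and serve only as an example.

```python
from collections import Counter

GENOTYPES = ("hom_ref", "het", "hom_alt")


def genotype_concordance(truth, calls):
    """Count (truth genotype, called genotype) pairs at every truth site.

    truth, calls: dicts mapping a site key, e.g. (chrom, pos), to one of
    GENOTYPES. Truth sites absent from `calls` are counted as "no_call".
    """
    table = Counter()
    for site, true_gt in truth.items():
        table[(true_gt, calls.get(site, "no_call"))] += 1
    return table


# Hypothetical toy data, for illustration only
truth = {("chr1", 10): "het", ("chr1", 20): "hom_alt", ("chr1", 30): "hom_ref"}
calls = {("chr1", 10): "het", ("chr1", 20): "het"}
table = genotype_concordance(truth, calls)
```

The (hom_alt, het) cell shows the difficulty: the variant was detected, so it is not a false negative in the presence/absence sense, yet the genotype is wrong. How such cells are counted has to be stated alongside any sensitivity or specificity figure.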
Preliminary uses of high-confidence NIST-GIAB genotypes for NA12878
• NIST has released several versions of high-confidence genotypes for its pilot RM
• These data are presently being used for benchmarking
– prior to release of RMs
– SNPs & indels
• ~77% of the genome
NIST Plays a Role in the First FDA Authorization for Next-Generation Sequencer
November 20, 2013
Integrating NIST Call Sets into a Validation Workflow
Validation Report
False Positive Ratio: FPR = FP/(FP + TN)
False Discovery Rate: FDR = FP/(FP + TP)
Sensitivity: Sens. = TP/(TP + FN)
Specificity: Spec. = TN/(FP + TN)
Balanced Accuracy: (Sens. + Spec.)/2
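The validation-report formulas above translate directly into code. The following is a minimal sketch: the function name `validation_metrics` and the example counts are hypothetical, and no handling of zero denominators is included.

```python
def validation_metrics(tp, fp, fn, tn):
    """Return the validation-report metrics for one variant class.

    tp, fp, fn, tn: counts of true positives, false positives,
    false negatives, and true negatives.
    """
    sens = tp / (tp + fn)
    spec = tn / (fp + tn)
    return {
        "FPR": fp / (fp + tn),
        "FDR": fp / (fp + tp),
        "sensitivity": sens,
        "specificity": spec,
        "balanced_accuracy": (sens + spec) / 2,
    }


# Hypothetical counts, for illustration only
m = validation_metrics(tp=90, fp=10, fn=10, tn=890)
```

Note that FPR and specificity always sum to 1, while FDR depends on TP rather than TN, so the two "false positive" measures can differ by orders of magnitude when TN is very large, as it is when every confident reference base counts as a true negative.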
GCAT – Interactive Performance Metrics
• NIST is working with GCAT to use our high-confidence variant calls
• Assess performance of many combinations of mappers and variant callers
• Currently assesses only exome sequencing
• www.bioplanet.com/gcat
GCAT Tests
GCAT Variant Calling Tests
Pre-run Tests
Upload your own variant calls
GCAT – Upload your own exome calls
Freebayes SNP calls changed very little in 2013
http://www.bioplanet.com/gcat/reports/1933-westleouzm/variant-calls/illumina-100bp-pe-exome-150x/bwamem-freebayes-0-9-10-131226/compare-1934-akckizzzfr-1931-laqgzjytqw-1935-xwckffckoa/snp/group-quality
Freebayes indel calls improved in 2013
http://www.bioplanet.com/gcat/reports/1933-westleouzm/variant-calls/illumina-100bp-pe-exome-150x/bwamem-freebayes-0-9-10-131226/compare-1934-akckizzzfr-1931-laqgzjytqw-1935-xwckffckoa/indel/group-quality
Analytical Validation of Clinical Whole-Exome Sequencing (CWES) Test
Background
• Clinical laboratory – Division of Genomic Diagnostics, certified by regulatory agencies (CAP).
• The CWES test requires stringent validation per CAP criteria to establish performance metrics of the test.
Utilizing NIST data in validation of CWES Test
• Sequence and call variants of NA12878 at CHOP
• CHOP ROI: Agilent SureSelect V5+ (SSV5+) baits file
• Compare CHOP data set to NIST data set for concordance
NIST Data Set Details:
• High-quality reference data set on NA12878 (Dec. 2013)
• NIST’s highly confident regions of interest (ROI)
• Variants called in 219,222 regions on hg19 assembly
NIST: National Institute of Standards and Technology
SENSITIVITY / SPECIFICITY
RefGene +/- 15bp (SSV5+), CHOP vs. NIST

Count    SNVs     INDELs
TP       18480    396
FP       26       3
FN       63       30

FP: False Positive; TP: True Positive; FN: False Negative; TN: True Negative
TN = NIST highly confident regions – CHOP ROIs

Metric                                   SNVs      INDELs
Sensitivity (TP/(TP+FN))                 99.66%    92.96%
Specificity (TN/(TN+FP))                 ~100%     ~100%
False positive rate (FP/(FP+TN))         0.02%     0.08%
Accuracy ((TP+TN)/(TP+TN+FP+FN))         ~100%     ~100%
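As a worked check, the sensitivity figures can be recomputed from the TP and FN counts on this slide. The sketch also computes FP/(FP+TP), the false discovery rate as defined on the validation-workflow slide earlier in the deck. The rows of the table that involve TN are not recomputed, because the slide gives no numeric value for TN. The helper names are hypothetical.

```python
# Counts as reported on the slide
counts = {
    "SNVs": {"tp": 18480, "fp": 26, "fn": 63},
    "INDELs": {"tp": 396, "fp": 3, "fn": 30},
}


def sensitivity(tp, fn):
    """TP / (TP + FN)"""
    return tp / (tp + fn)


def false_discovery_rate(fp, tp):
    """FP / (FP + TP)"""
    return fp / (fp + tp)


for name, c in counts.items():
    sens = sensitivity(c["tp"], c["fn"])
    fdr = false_discovery_rate(c["fp"], c["tp"])
    print(f"{name}: sensitivity = {sens:.2%}, FP/(FP+TP) = {fdr:.2%}")

# Prints:
# SNVs: sensitivity = 99.66%, FP/(FP+TP) = 0.14%
# INDELs: sensitivity = 92.96%, FP/(FP+TP) = 0.75%
```

The sensitivities match the slide. FP/(FP+TP) is a different quantity from the TN-based row of the table, which is why its values are larger.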
Further analysis of the presumptive 93 FNs and 29 FPs
• 93 FNs: 63 SNVs, 30 INDELs
• 29 FPs: 26 SNVs, 3 INDELs
Using the GeT-RM Browser
• http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/
• Allows visualization of questionable calls
GeT-RM Load alignments for visualization
Chr6:151669820 Chr6:151669828
Difficult site in homopolymer in intron of gene AKAP12
Chr1:1666303
SNP in Gene SLC35E2, which is also in a pseudogene and a segmental duplication
[Browser tracks labeled in the figure: Segmental Duplication, Pseudogene, Structural Variant]
Feedback from the MoCha lab at NCI
• We built a targeted amplicon NGS assay for detecting mutations in clinical tumor specimens
• To assess the assay’s specificity, we compared 84 runs of CEPH NA12878 data from our assay with NIST’s consensus variant list (VCF v2.15)
• We observed high overall concordance, with a few FP variants in homopolymeric regions unique to our platform
• We concluded that NIST GIAB is a useful reference standard to evaluate assay specificity
Using Genome in a Bottle calls to benchmark clinical exome sequencing at Mount Sinai School of Medicine
“We evaluate a set of NA12878 technical replicates against GIAB for each new pipeline version.”
Benchmarking somatic variant calling at Qiagen
HSPH – Brad Chapman
Comparing variant callers
http://bcbio.wordpress.com/2013/10/21/updated-comparison-of-variant-detection-methods-ensemble-freebayes-and-minimal-bam-preparation-pipelines/
NextSeq: New Chemistry – Does it work?
Whole Genome Metrics                                 NextSeq500          HiSeq2500
% Genome Covered (>= 10X in Q20 bases)               96%                 96%
Mean Coverage in Q20 Bases                           28.3X               31.8X
SNPs Called (% dbSNP 129)                            3,643,998 (89%)     3,664,014 (88%)
InDels Called (% dbSNP 129)                          646,907 (65.7%)     686,547 (64.5%)
Genome in a Bottle SNP Sensitivity & Precision       99.07% | 99.04%     99.25% | 99.90%
Genome in a Bottle Indel Sensitivity & Precision     86.90% | 98.85%     93.29% | 97.54%
[Figure: NextSeq 500: Genomic Coverage in High Quality Bases. Coverage in bases with MQ>=20 and Q>=20; x-axis: coverage (0 to 75), y-axis: proportion of genome at coverage. Mean: 28.33X; fraction at 2/3 mean: 0.9]
[Figure: HiSeq 2000: Genomic Coverage in High Quality Bases. Coverage in bases with MQ>=20 and Q>=20; x-axis: coverage (0 to 75), y-axis: proportion of genome at coverage. Mean: 31.86X; fraction at 2/3 mean: 0.91]
[Figure: Normalized coverage vs. GC content, by platform (HiSeq 2000, NextSeq 500)]
Ion Benchmarking I
Ion Benchmarking II
Command-line tools for variant benchmarking
• USeq VCFComparator
– http://sourceforge.net/projects/useq/
• RTG vcfeval
– ftp://ftp-trace.ncbi.nih.gov/giab/ftp/tools/RTG/
• bcbio.variation
– http://bcbio.wordpress.com/2013/05/06/framework-for-evaluating-variant-detection-methods-comparison-of-aligners-and-callers/
• SMaSH
– http://smash.cs.berkeley.edu/
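To make concrete what such a comparison involves, the following is a deliberately naive sketch of benchmarking a call set against high-confidence calls inside high-confidence regions. It is not how any of the tools listed above work internally: the function names, the tuple representation of a variant, and the exact-match rule are assumptions made for illustration. In practice dedicated tools are the better choice, because the same variant (indels especially) can be written in more than one way in a VCF file.

```python
def in_regions(chrom, pos, regions):
    """True if 1-based position `pos` falls in any BED-style interval.

    regions: dict mapping chrom -> list of (start, end) intervals,
    0-based and half-open as in BED files.
    """
    return any(start < pos <= end for start, end in regions.get(chrom, []))


def compare_calls(truth, calls, regions):
    """Classify variants inside the confident regions as TP, FP, or FN.

    truth, calls: sets of (chrom, pos, ref, alt) tuples with 1-based pos.
    Matching is by exact tuple equality (a simplifying assumption).
    """
    truth_in = {v for v in truth if in_regions(v[0], v[1], regions)}
    calls_in = {v for v in calls if in_regions(v[0], v[1], regions)}
    return {
        "TP": truth_in & calls_in,
        "FP": calls_in - truth_in,
        "FN": truth_in - calls_in,
    }


# Hypothetical toy data, for illustration only
regions = {"chr1": [(0, 1000)]}
truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")}
calls = {("chr1", 100, "A", "G"), ("chr1", 300, "G", "A"), ("chr1", 5000, "T", "C")}
result = compare_calls(truth, calls, regions)
```

The call at position 5000 lies outside the confident regions, so it is counted as neither a true nor a false positive; only calls inside the regions are assessed.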
How Can I Get Involved?
• Use our integrated SNP/indel genotypes for NA12878 and give us feedback
– Cells and DNA currently available from Coriell
– NIST RM available late 2014
• Sequencing/analyzing the new Genome in a Bottle samples
• Help with Structural Variant calls
• Help with analyzing data from long-read technologies
• Attend our biannual workshops (January in CA, August in MD)
• Help develop methods to measure performance using our well-characterized genomes
http://genomeinabottle.org
Email:
Justin Zook – jzook@nist.gov
Marc Salit – salit@nist.gov
Slides on slideshare at: http://www.slideshare.net/GenomeInABottle

Editor's Notes

  • Slide 24 (further analysis of presumptive FNs and FPs): One FN SNV is confirmed to be reference. One FP indel is confirmed to be a real indel. Three FP SNVs are confirmed to be real SNVs.