Addressing enzyme bias
Source of bias in NGS workflows:
The enzyme bias problem:
Once the library is constructed, it must be amplified in order to enrich for molecules that contain the appropriate adaptor configuration and obtain sufficient material for sequencing.
Ideally, the composition and diversity of the original library and the amplified library would be the same. In practice, amplification can severely distort the representation of the original library DNA. This phenomenon is the result of enzyme bias:
- AT- and GC-rich sequences and other ‘problem sequences’ have been reported to be under-represented in polymerase-driven NGS platforms.
- Many of these difficult regions represent high-value portions of the human genome (e.g. 5′ UTRs, CpG islands, and first exons).
- Certain organisms associated with human disease are notoriously difficult to sequence, e.g. P. falciparum (19% GC), M. tuberculosis (65% GC)
- To compensate for regions of under-representation, total sequencing coverage must be increased. This increases the cost of sequencing.
- Despite increased coverage, a certain percentage of a genome is never represented. Missing sequence information complicates data analysis and interpretation.
- Illumina has recently made attempts to address these bias problems by improving the cluster amplification and sequencing-by-synthesis (SBS) polymerase systems.
- Library amplification is still a major source of bias not adequately addressed by the instrument vendors.
Distributions of GC-content in 5′ UTRs, CDS, and 3′ UTRs of human genes
Zhang L et al. PNAS 2004;101:16855-16860
Why amplification bias is a problem:
Some regions of the genome are overrepresented and others are underrepresented. To achieve sufficient coverage depth to interpret the sequence data, the entire genome must be oversampled so that underrepresented regions still reach a baseline level of coverage. This increases the cost of sequencing: you have to sequence at higher coverage than you would if everything were amplified more equally.
Even with more sequencing, some regions of the genome are never represented. This results in missing sequence information.
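The oversampling cost described above can be illustrated with a rough calculation. This is a minimal sketch rather than a method from the original material: it assumes the worst-covered regions receive a fixed fraction of the mean depth (`relative_coverage`), and the function name and the example values are hypothetical.

```python
def required_mean_coverage(min_depth, relative_coverage):
    """Mean depth needed so that the worst-covered region reaches min_depth.

    relative_coverage is the depth in the worst-covered region divided by
    the genome-wide mean depth (1.0 means perfectly uniform coverage).
    """
    if not 0 < relative_coverage <= 1:
        raise ValueError("relative_coverage must be in (0, 1]")
    return min_depth / relative_coverage


# Hypothetical values, chosen only for illustration.
uniform = required_mean_coverage(min_depth=30, relative_coverage=1.0)
biased = required_mean_coverage(min_depth=30, relative_coverage=0.25)
print(uniform, biased)  # 30.0 120.0
```

Under these assumed numbers, a region amplified to one quarter of the mean depth would require four times as much sequencing to reach the same baseline. A region with no coverage at all cannot be rescued by any amount of additional sequencing.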
The improved processivity and higher-performance features of the engineered KAPAHiFi DNA Polymerase solve the bias problem in NGS library amplification!
Effect of GC-content on coverage depth for libraries amplified using proof-reading (Type B) polymerases
Indexed Illumina TruSeq libraries prepared from identical sheared S. aureus (33% GC; left panels) and M. tuberculosis (65% GC; right panels) gDNA were amplified using the indicated PCR reagents, and compared to an equivalent unamplified library by paired-end sequencing (2 x 75 bp). After filtering and aligning read pairs to reference sequences, 250,000 read pairs were randomly sampled for each genome, and scatter plots of mean sequence coverage depth vs. GC content were generated by analyzing 250 bp windows.
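The windowed analysis described in this caption can be sketched as follows. This is an illustrative reconstruction, not the pipeline used to produce the figure: it assumes non-overlapping windows and a per-base depth list already computed from the aligned reads, and the function name and toy input are hypothetical.

```python
def gc_and_depth_by_window(sequence, depth, window=250):
    """Return (gc_fraction, mean_depth) for each full non-overlapping window."""
    if len(sequence) != len(depth):
        raise ValueError("sequence and depth must have the same length")
    points = []
    # Step through the reference one window at a time; a trailing partial
    # window is ignored.
    for start in range(0, len(sequence) - window + 1, window):
        chunk = sequence[start:start + window].upper()
        gc = (chunk.count("G") + chunk.count("C")) / window
        mean_depth = sum(depth[start:start + window]) / window
        points.append((gc, mean_depth))
    return points


# Toy input, not real data: a 500 bp reference made of one AT-only window
# followed by one GC-only window, with lower depth over the second.
reference = "AT" * 125 + "GC" * 125
per_base_depth = [10] * 250 + [4] * 250
print(gc_and_depth_by_window(reference, per_base_depth))
# [(0.0, 10.0), (1.0, 4.0)]
```

Each returned pair corresponds to one point in a scatter plot of mean coverage depth against GC content.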
Low GC-content in P. falciparum libraries results in variable bias depending on the polymerase used for amplification
Observed frequencies of GC-content for reads are plotted for each condition tested (black = unamplified; green = KAPAHiFi HotStart ReadyMix; blue = Phusion HF Master Mix). The expected frequency distribution of reads is indicated by the grey shaded area. The unamplified library tracked the expected frequency distribution. Amplification with KAPAHiFi showed minimal bias, while amplification with Phusion resulted in a dramatic bias against reads with low GC-content.
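A per-read GC-content frequency distribution of the kind plotted here could be computed roughly as below. This is a minimal sketch under stated assumptions: reads are plain sequence strings, GC-content is binned to the nearest whole percent, and the function name and toy reads are hypothetical rather than taken from the original analysis.

```python
from collections import Counter


def gc_frequency(reads):
    """Map GC percentage (nearest integer) to the fraction of reads in that bin."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        if not seq:
            continue  # skip empty reads to avoid dividing by zero
        gc_percent = round(100 * (seq.count("G") + seq.count("C")) / len(seq))
        counts[gc_percent] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gc: n / total for gc, n in sorted(counts.items())}


# Toy reads, not real data.
reads = ["ATAT", "ATGC", "GGCC", "ATGC"]
print(gc_frequency(reads))  # {0: 0.25, 50: 0.5, 100: 0.25}
```

Comparing the distribution from an amplified library with that from the unamplified control shows which GC ranges lose reads during amplification.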
Coverage depth and GC-content across a ~7kb region of the B. pertussis genome
Within this region of the genome there are 4 distinct locations of high GC sequence (>75%) that lead to sequence coverage bias (grey bars). In these regions the library amplified using Phusion (blue) exhibits lower depth of coverage compared to the unamplified control. In contrast, the library amplified with KAPAHiFi (green) exhibits more even coverage depth, similar to the control library (black).
Coverage depth and GC-content across a ~7kb region of the P. falciparum genome
Within this region of the genome there are 3 locations where high-AT sequences (>80%) lead to coverage bias (grey bars). In all three regions coverage depth drops significantly after amplification with Phusion (blue), while the library amplified using KAPAHiFi (green) shows more uniform coverage depth which tracks that of the unamplified control library (black).
Coverage depth and GC-content across a ~7kb region of the S. pullorum genome
This region contains a ~1kb AT-rich stretch (grey bar) in which coverage depth drops dramatically in the library amplified using Phusion (blue). The coverage depth of the library amplified using KAPAHiFi (green) tracks that of the unamplified control (black).
The reduction of enzyme bias in NGS library amplification:
1. KAPAHiFi reduces the cost of sequencing (improved uniformity means less coverage is required to obtain useful sequence across the genome).
2. KAPAHiFi provides important sequence information not represented in biased systems (more complete information helps answer more biological questions).