Is Whole Genome Sequencing Really Worth It?
Article at a Glance
- Consumer genomic companies like Nebula (and their offshoot, DNA Complete) like to tout their “state of the art” whole genome sequencing (WGS) as a differentiator in the DNA testing space.
- WGS is still very expensive and is overkill for most consumers, unless discovering a rare or novel genetic variant is the goal.
- A technology known as imputation can fill in many of the gaps left by genotyping services, significantly increasing the number of genetic variants available from genotyping data by inferring missing genotypes from reference panels.
- For most consumers, paying for WGS isn’t worth the extra money.
From the early 2000s until only a few years ago, consumers who wanted to know their genetic makeup had only one realistic choice: genotyping. Genotyping detects genetic variants (SNPs) at approximately 100,000 to 500,000 positions in the genome (well under 1% of its total size) using a technology known as DNA chips or microarrays. However, another, much more complete way of exploring our genome, whole genome sequencing (WGS), is becoming increasingly affordable, though it is still considerably more expensive than genotyping.
But is WGS worth its price in consumer genomics?
The short answer is – it depends on your goals, expertise and budget. Genotyping is a cost-effective method that captures known variants and can be enhanced by imputation, while WGS offers a much more comprehensive view of the genome, including rare or novel variants and structural changes. For a longer explanation, read on! We’ll start with a brief intro to each method, then discuss how imputation helps boost genotyping and finally arrive at a more detailed conclusion.
How genotyping works
Recall that our DNA is a double helix, and the ‘double’ comes from the ability of each ‘letter’ in the genetic code (A, C, G, T) to pair with another specific letter on a complementary strand of DNA: ‘A’ pairs with ‘T’, and ‘G’ pairs with ‘C’.
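To make the pairing rule concrete, here is a tiny Python sketch (the function name is ours, purely for illustration) that spells out the letters on the complementary strand:

```python
# Minimal illustration of the base-pairing rule: each letter on one strand
# determines the letter opposite it on the complementary strand.
PAIRING = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complementary_strand(sequence: str) -> str:
    """Return the letter that pairs with each letter of `sequence`, in order."""
    return "".join(PAIRING[base] for base in sequence)

print(complementary_strand("ATCCATTC"))  # -> TAGGTAAG
```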
DNA chips are tightly lined with clusters of short synthetic DNA fragments, each containing one version of a SNP of interest, with a known map of where each cluster sits on the chip. To genotype a sample of DNA, the DNA is cut into small pieces, labeled with fluorescent dyes, and added to the chip. Pieces that match the synthetic DNA fragments on the chip will stick to those fragments, lighting them up. Scanning the chip with a fluorescence reader and analysing the signals computationally provides information about how much DNA was stuck to each synthetic fragment.
Now imagine there’s a SNP somewhere in the genome that we want to detect, embedded in a sequence that is invariant in most people: for example, ….ATCC[A/G]TTC…, and we want to understand whether a person’s two copies of DNA are homozygous for the A (AA) or G (GG) allele, or heterozygous (AG) at this SNP. To test this, two clusters of synthetic DNA fragments will be placed on the chip: one containing ….ATCCATTC… and the other containing ….ATCCGTTC…. By comparing the intensity of fluorescence emitted from the DNA sample bound to the respective parts of the chip, it is possible to find out whether the sample contains only one of the alleles at this SNP (in which case it is homozygous for either A or G) or both (in which case it is heterozygous at this SNP).
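If it helps, here is a deliberately simplified Python sketch of the comparison described above. Real array-calling software uses statistical clustering across many samples and probes; the thresholds and function name below are illustrative assumptions only:

```python
def call_genotype(intensity_a: float, intensity_g: float,
                  min_signal: float = 100.0) -> str:
    """Call AA, AG, GG or 'no call' from the signals of the two probe clusters."""
    total = intensity_a + intensity_g
    if total < min_signal:
        return "no call"      # too little DNA bound to either probe cluster
    fraction_a = intensity_a / total
    if fraction_a > 0.8:
        return "AA"           # almost all signal comes from the ...ATCCATTC... probes
    if fraction_a < 0.2:
        return "GG"           # almost all signal comes from the ...ATCCGTTC... probes
    return "AG"               # both probe clusters light up -> heterozygous

print(call_genotype(950.0, 30.0))   # -> AA
print(call_genotype(480.0, 510.0))  # -> AG
```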
Even if the above explanation sounds confusing, the key take-home message is quite simple: microarrays work well for measuring known variation at a small subset of variants. If a certain part of the DNA isn’t on the chip, or if the variant present in the sample differs from the possibilities accounted for on the chip (e.g. ….ATCCAGTC… instead of one of the ….ATCCATTC…/….ATCCGTTC… probes placed on it), it won’t be detected correctly with this technology. Another tough case for genotyping is detecting large structural changes in the DNA–e.g., when whole chunks are repositioned or deleted.
In conclusion, genotyping is fast and affordable, but it can’t look everywhere. Below we will discuss how some of these limitations can be mitigated using a computational technology called imputation.
The power of imputation
What do ancient catastrophes such as plagues, famines and natural disasters have to do with our topic? Quite a lot, it turns out, because their consequences have made it possible to drastically increase the amount of information that can be obtained from a DNA chip, through a process called imputation.
But first, let’s ask why knowing the genotype of a single position on each chromosome isn’t enough to know its full sequence right away. Sounds like a strange question, right?
- But maybe not so strange once you consider how chromosomes are passed down. When cells divide within an organism (e.g., when our bodies replace gut or skin cells), they duplicate their chromosomes in full and pass copies to daughter cells.
- If the DNA copying involved in making entire organisms worked the same way, with each parent passing down one of their two copies of each chromosome in its entirety to their child, we’d only need to check a few key positions to figure out which parental chromosome was inherited.
- From there, we could infer most of the sequence.
- However, that’s not quite what happens. And it’s easy to see why not: if reproduction worked this way, we’d have a very limited genetic diversity, and this would make us more vulnerable to diseases and environmental changes.
So when making eggs and sperm, the two copies of each chromosome in a parent don’t get passed to the daughter cells intact, but rather swap fragments with each other in a process known as crossing-over. It is largely because of this process that siblings from the same parents look quite different from each other, unless they are monozygotic twins (which are produced from the same combination of sperm and egg).
Crossing-over is one reason we can’t know the full sequence of a chromosome from just a single genotyped SNP. But crossing-over happens at a relatively small number of spots when sperm or egg cells are produced, and, what’s more, it tends to occur in certain parts of the chromosomes more often than others. As a result, most genotyped SNPs give us information about more than just a single position, but also about the alleles of SNPs around them––a phenomenon rather pompously known as ‘linkage disequilibrium.’ And the smaller a species’ population, the more information the allelic makeup of a single SNP conveys about nearby SNPs.
In species with enormous population sizes that have likely remained large since time immemorial–like fruit flies–genetic variants aren’t strongly linked, meaning that SNP alleles only correlate with each other within stretches of about 100 DNA letters (‘bases’) or so. In contrast, humans aren’t nearly as numerous today and have experienced drastic population reductions in the past (known as ‘population bottlenecks’): this is where the plagues, famines and natural disasters we mentioned at the beginning of this section become relevant. As a result, sets of SNPs with correlated alleles are much larger in humans. On average, for sufficiently common SNPs, knowing the allele of one variant allows you to ‘guess’ the alleles of a multitude of other common variants located up to ~50,000 bases away with reasonable confidence.
This guesswork is done by imputation, a statistical technique that takes as input the genotypes of the SNPs detected in the sample by a DNA chip, together with a reference panel: the known genotypes of every SNP we want to impute, measured in large cohorts of fully sequenced individuals.
As we said, humans are a good species for imputation, but the exact cohort of fully sequenced individuals used as a reference also matters. The larger the cohort, the more variants are detected in it at high quality; and the closer its donors’ population is to that of the person whose DNA is being analysed, the better imputation will work. For example, using the full genome sequences of a cohort of about 1000 individuals from several different populations produced by the “1000 Genomes” project, it is possible to impute the alleles of ~10 million common variants from a standard genotyping array with reasonable confidence. However, imputation will still struggle with inferring rare or large structural variants.
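To give a feel for the core idea, here is a deliberately toy Python sketch: find the reference haplotypes that best match the sample at the SNPs the chip did measure, and borrow their allele at a SNP it didn’t. Production tools such as Beagle or Minimac use far more sophisticated probabilistic models; the panel and SNP layout below are invented purely for illustration:

```python
# Toy reference panel: each haplotype lists alleles at four SNPs.
# SNP1-SNP3 are typed on the chip; SNP4 is not and must be imputed.
REFERENCE_HAPLOTYPES = [
    ("A", "C", "T", "G"),
    ("A", "C", "T", "G"),
    ("G", "T", "T", "A"),
    ("G", "T", "C", "A"),
]

def impute_snp4(typed_alleles):
    """Guess the allele at SNP4 from the reference haplotype that best matches
    the sample at the three typed SNPs."""
    def matches(haplotype):
        return sum(a == b for a, b in zip(haplotype[:3], typed_alleles))
    best = max(REFERENCE_HAPLOTYPES, key=matches)
    return best[3]

# The chip measured A, C, T at the typed SNPs; the best-matching reference
# haplotypes carry G at SNP4, so G is our best guess for the untyped SNP.
print(impute_snp4(("A", "C", "T")))  # -> G
```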
In conclusion, we humans are ‘lucky’ to have genomes amenable to imputation, and with data from large reference cohorts and the powerful statistical methods currently available, standard DNA chips can provide trustworthy information about millions of common SNPs.
Let’s now look at whole genome sequencing (WGS), which can provide an even more comprehensive view of the genome.
Whole genome sequencing (WGS)
At the core of modern DNA sequencing is the same principle used in the first sequencing reaction nearly 50 years ago, based on how DNA is naturally copied inside the cells before they divide. In this process, the double helix of the DNA in each chromosome is unwound, and a complementary strand is synthesized for each original strand, resulting in two identical double helices.
This natural process of copying the genome in different organisms uses various versions of a complex enzyme called DNA polymerase. Supplied with individual ‘letters’ as substrates, this enzyme moves along the DNA template, and for each letter it encounters, it adds ‘the opposite’ letter to the newly synthesised DNA strand. For example, if it reads an ‘A’ in the original DNA strand, it will add a ‘T’ to the new one, and so on.
For DNA sequencing, scientists ‘hijacked’ this process by getting it to work outside the cells (in vitro). They then used fluorescent dyes and some other chemical tricks to know exactly what letter the enzyme is adding, one after another. This is how the original sequencing technology worked, and it is also how the so-called next-generation sequencing (NGS) that underpins most WGS services works today. The only difference is that in the latter case, this technology is modified such that it’s possible to monitor not one but millions of sequencing reactions in parallel using slightly different chemical tricks and powerful optics.
Even though DNA sequencing is now a very mature technology, ‘reading’ human DNA chromosome-by-chromosome, from start to finish, is still technologically impossible. Current sequencing methods produce short fragments of DNA (100-200 bases with NGS; longer with some other, lower-throughput technologies). These fragments–known as sequencing reads–must then be combined to reconstruct the full sequence, akin to solving a jigsaw puzzle. Doing this without a ‘reference’ of what the full genome sequence looks like is incredibly hard, typically requires combining information from multiple sequencing technologies and is done by large-scale consortia of scientists. Once the reference genome for an organism is known, however–such as for us humans, where it has been available for over 20 years and has been continuously improving since–aligning short reads from an NGS dataset to the reference is straightforward and is now fully automated by specialised software pipelines.
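As a very rough illustration of the ‘jigsaw with a reference’ idea, the sketch below places each short read by looking for its sequence in a made-up reference string. Real aligners such as BWA or Bowtie use indexed data structures and tolerate mismatches and sequencing errors; this is only a naive toy:

```python
# Made-up reference sequence and reads, purely for illustration.
REFERENCE = "TTGACCATCCATTCGGATAACGTTAGC"

def align_read(read: str, reference: str = REFERENCE) -> int:
    """Return the 0-based position where the read matches the reference exactly,
    or -1 if it doesn't map anywhere."""
    return reference.find(read)

for read in ["ATCCATTC", "TAACGTTA", "GGGGGGGG"]:
    print(read, "->", align_read(read))
# ATCCATTC -> 6, TAACGTTA -> 17, GGGGGGGG -> -1 (doesn't map)
```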
Does it mean that we can now obtain precise information about the entirety of our 3.2-gigabase genomes from WGS? Not quite. About 5% of the genome, containing highly repetitive and difficult-to-copy regions, cannot be sequenced at all. In addition, another 15-20% can be sequenced but are too repetitive to be confidently assigned to a specific location on a chromosome. In fairness, most of these areas are equally difficult to detect by genotyping, and understanding their function remains a hot topic in biomedical research today.
In conclusion, WGS is powerful, but still isn’t perfect. And while its costs have plummeted drastically since the early days, it isn’t yet cheap–well into triple-digit dollars–compared with the double-digit costs of genotyping.
So – is WGS worth the price?
While genotyping is fast and cheap, WGS has clear advantages when you are looking for variants that genotyping and imputation can’t detect easily. These include:
- Rare variants (with allele frequencies below 5%) that are not present on the chips, because they are near-impossible to impute from the nearby SNPs. Note that the most important rare variants–e.g., those that have clear links with diseases or metabolism–are often added to the chips explicitly.
- Novel variants that may never have been encountered before, for example, formed in cancer cells or suspected to be behind some unusual pathologies.
- Large or complex structural variants–for example, those associated with some types of muscular dystrophies or found in cancers.
But the question is–do we need these variants in a consumer genomics setting? Most of them are very rare and/or not studied well enough to be interpreted easily without expert skills and knowledge.
Therefore, for most people, the added information provided by whole-genome sequencing isn’t that useful, at least for now. And getting your genome sequenced “in advance” may not be worth it for most of us either, as DNA sequencing is becoming cheaper, more complete and more precise by the day.
Get Started With Personalized Nutrition
Gene Food uses a proprietary algorithm to divide people into one of twenty diet types based on genetics. We score for cholesterol and sterol hyperabsorption, MTHFR status, histamine clearance, carbohydrate tolerance, and more. Where do you fit?