In June 2000, Bill Clinton, then President of the United States, stood smiling alongside the leaders of the Human Genome Project. “In genetic terms, all human beings, regardless of race, are more than 99.9% the same,” he stated. That was the message when the first draft of the human genome sequence was revealed at the White House.
The single strand of As, Ts, Cs, and Gs eventually became the first reference human genome. Since its publication in 2003, the reference has revolutionized genome sequencing and helped scientists find thousands of disease-causing mutations. However, at its core is a somewhat ironic problem: the code intended to represent the human species is based mainly on one man from Buffalo, New York.
Although humans are very similar, “a person is not representative of the world,” says Pui-Yan Kwok, a specialist in genome analysis at the University of California, San Francisco and Academia Sinica in Taiwan. As a result, most genome sequencing is fundamentally biased.
This bias limits the type of genetic variation that can be detected, leaving some patients undiagnosed and potentially without proper treatment. What’s more, people who share less ancestry with the Buffalo man are likely to benefit less from the next era of precision medicine, which promises to tailor medical care to individuals.
To combat this, researchers have begun to assemble reference genomes for specific countries, including South Korea, Japan, Sweden, Denmark and the United Arab Emirates. They hope this will better serve their populations, but critics worry it could turn immigrants into second-class citizens in their healthcare systems. Now a big new project offers a different solution to representing global diversity: a human pangenome.
Precision medicine, also known as personalized medicine, has been a buzzword in the medical community for years, and it certainly sounds good. “Getting the right drug to the right patient at the right time is the motto,” says Neil Hanchard, a physician-scientist at the US National Human Genome Research Institute.
But standard genome sequencing misses a lot of variation that could be related to disease. In most cases, it works by cutting DNA into small fragments known as “short reads”, before sequencing and organizing them into a genome using the reference as a guide.
Single nucleotide variants (SNVs), such as a change from a C to a T in a gene’s code, are mostly easy to detect in this way, but larger chunks of variation known as structural variants (SVs) are more complicated. New sections, sometimes hundreds or thousands of base pairs long, can go undetected, as can sections that are missing, inverted, or moved elsewhere. In those cases, the short reads can’t be easily assigned to the reference and “a lot,” says Kwok, get thrown away.
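To make the mechanism concrete, here is a minimal toy sketch of reference-guided mapping. It assumes exact string matching, whereas real aligners tolerate mismatches and small gaps. The sequences and the names `shred` and `map_reads` are hypothetical, invented purely for illustration. The sketch shows how reads drawn from an inserted stretch of DNA find no home in the reference.

```python
def shred(genome, read_length):
    """Cut a genome into overlapping short reads of a fixed length."""
    return [genome[i:i + read_length]
            for i in range(len(genome) - read_length + 1)]


def map_reads(reference, reads):
    """Split reads into those found in the reference and those that are not.

    Toy assumption: a read maps only if it matches the reference exactly.
    """
    mapped, unmapped = [], []
    for read in reads:
        position = reference.find(read)
        if position >= 0:
            mapped.append((read, position))
        else:
            unmapped.append(read)
    return mapped, unmapped


# Hypothetical reference and a sample carrying a six-base insertion.
REFERENCE = "ACGTACGGATCCTTAGGCAT"
INSERTION = "TTTTTT"
SAMPLE = REFERENCE[:10] + INSERTION + REFERENCE[10:]

reads = shred(SAMPLE, 6)
mapped, unmapped = map_reads(REFERENCE, reads)

print(f"{len(mapped)} reads mapped, {len(unmapped)} reads discarded")
```

In this toy, the reads spanning the insertion are the ones that end up in `unmapped`, which is the analogue of the short reads that Kwok says get thrown away.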
This means that standard genome sequencing is biased towards SVs already in the reference. If your SVs differ, you end up with a sequence that doesn’t fully capture your personal variation. Since it is these small differences between people that we hope will tell us, for example, why one person may respond well to a drug but another may not, that’s bad news.
Kwok’s work hints at how much SV goes under the radar. In 2019, his team analyzed samples from 154 people around the world and found that 60 million base pairs of SV genome content were missing from the reference, with much more still out there. A follow-up study of 338 people that searched only for extra inserted DNA found nearly 130,000 new sequences.
But SVs also seem to show different frequency patterns in different populations. By extension, Kwok says, if a person “belongs to a population quite different from the person from whom the genome reference is derived, there will be more misalignment” when their short reads are mapped to the reference. Consequently, he says: “We can miss risk variants in those regions not represented in the reference.”
This lack of representation is a general problem in genomics. Even the most studied SNVs show large data gaps. Recently, for example, Hanchard and his colleagues sampled 426 individuals from 50 ethnolinguistic groups across Africa and found more than 3 million new SNVs, mostly from populations that had never been sampled before. “We haven’t even touched [SVs],” says Hanchard, “but our preliminary data suggests it’s going to be more of the same.”
Such data disparities directly affect medical outcomes. For example, if a person with a rare variant has a rare disease, the variant is most likely to blame. But often we don’t know if the variants are truly rare or just common in understudied populations. In those cases, doctors cannot make a diagnosis. “For people of non-European ancestry, that happens a lot more,” Hanchard says.
As we move into an era of precision medicine, that will only become more important. Kári Stefánsson, whose Reykjavik-based biotech company DeCode Genetics specializes in connecting the dots between genetic variants and disease, says what keeps him up at night is that our understanding of diversity within populations of European ancestry is now so good that we can start using it for precision medicine. But for other populations, “we don’t have the same kind of data,” he says. “[This is] going to increase disparities in health care beyond what they are today.”
While there is no genetic basis for meaningfully grouping people into different races, some believe it makes sense to create references that capture variation within specific populations, such as ethnic groups and nation states. One country that now has its own reference is Denmark.
“What we see is that there is a lot of variation that [has only been detected in] the Danish population,” says computational biologist Simon Rasmussen of the University of Copenhagen, who led the work. That’s a strong argument for a local reference, and the appeal is obvious: a Danish reference is uniquely placed to empower the Danish healthcare system.
But some criticize national genomes for focusing too much on differences between populations, rather than individuals. Medical anthropologist Emma Kowal of Deakin University in Victoria, Australia, worries that national genomes could “keep the idea of race alive.” And framing genomes in terms of nationality inevitably leads to exclusion, says Jenny Reardon, a life sciences sociologist at the University of California, Santa Cruz. “We are deciding, in effect, who is Danish and who is not.”
Rasmussen admits that the reference would be less useful for the 15% of the Danish population who are migrants or their descendants. Samples from people with mixed ancestry were even removed during screening for the reference. But due to consent issues, the reference never made it to the clinic, so Rasmussen and his team want to create another. For that, he says, “We want to take a different [selection] approach.” Exactly how has yet to be determined.
However, there is an alternative to national genomes. Instead of targeting different populations, the Human Pangenome Reference Consortium wants to step back, overlaying many genomes to create a reference that has variation built in: a pangenome. The consortium recently published the first draft of such a reference in a preprint.
Comprising 47 exquisitely detailed genomes, the draft represents the first part of the 350 genomes the consortium plans to sequence to capture the most common variation worldwide. “This is not a standard that has been done before,” says Karen Miga of the University of California, Santa Cruz, who is part of the consortium.
But the project isn’t just about sequencing more diverse data. “We need to come up with a better data structure to encode that information,” says Miga’s colleague Ting Wang, of the Washington University School of Medicine in St Louis, Missouri.
That data structure is called a genome graph. In contrast to the current reference, which is just one long string of letters, the genome graph shows variation between genomes as deviations from an otherwise shared path. That will allow researchers and clinicians to assign short reads to the version of the path that best suits their sample.
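As a rough illustration of the idea, the sketch below builds a tiny graph in which one branch carries an insertion and the other skips it, then checks which paths can account for a given read. This is a toy under stated assumptions, not the consortium’s data structure or software: the node sequences and the names `NODES`, `EDGES`, `all_paths`, `spell` and `matching_paths` are hypothetical, and matching is by exact substring only.

```python
# Hypothetical toy graph: each node holds a stretch of sequence,
# and edges say which node may follow which.
NODES = {1: "ACGTACGGAT", 2: "TTTTTT", 3: "CCTTAGGCAT"}
EDGES = {1: [2, 3], 2: [3], 3: []}  # node 2 is an optional insertion


def all_paths(node, edges):
    """List every path from `node` to a node with no outgoing edges."""
    if not edges[node]:
        return [[node]]
    return [[node] + rest
            for successor in edges[node]
            for rest in all_paths(successor, edges)]


def spell(path, nodes):
    """Concatenate the sequences along a path into one genome string."""
    return "".join(nodes[n] for n in path)


def matching_paths(read, start, nodes, edges):
    """Return the paths whose spelled-out sequence contains the read."""
    return [path for path in all_paths(start, edges)
            if read in spell(path, nodes)]


# A read spanning the insertion fits only the path through node 2;
# a read spanning the junction without it fits only the direct path.
print(matching_paths("GATTTT", 1, NODES, EDGES))
print(matching_paths("GGATCC", 1, NODES, EDGES))
```

Because both versions of the region live in the same structure, a read from either kind of genome has somewhere to go, rather than being discarded for failing to match a single linear string.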
The natural question is: how do you choose who is going to represent the world? The first genomes were chosen for their high technical quality, but the consortium will need to choose new samples in the future. Since Africa is the cradle of humanity, Miga says, “The vast majority of the genomes we are including are of African descent.”
However, from Reardon’s perspective, 350 people may do a better job of representing the world than one person, but “[the consortium has] made some decisions about the groups,” she says. “Who did they sample? Who did they not sample?” As long as the reference contains only a subset, someone, arguably, won’t make the cut.
Miga does not deny it. “[We are] really trying to capture common variation on a global level, so things are seen quite often,” she says. Documenting common variation in this case leaves out rare variation. “If you’re looking for something extremely rare,” she says, “that’s not our charge right now.”
In an ideal world, individuals would have their genomes sequenced without the use of a reference. This has long been held up as the definitive, problem-free solution, but almost no one believes it is on the cards. “It’s not a trivial undertaking and I don’t see it being trivial 10 years from now,” says Hanchard.
And instead of using a broad global pangenome, countries could be served by a reference more closely matched to their population, one they maintain and control themselves. “We don’t really expect anyone but the Danes to make a Danish reference genome,” says Rasmussen, who expects the next iteration to be led by Denmark’s state-controlled National Genome Center, potentially as part of the EU’s Genome of Europe project.
Hanchard also sees the benefit of local or regional references. “[The pangenome] will not have all the variation represented,” he says. He is part of the H3Africa Consortium, whose goal is to bring the benefits of genomics to Africa and which is considering an Africa-specific genome graph. At the same time, he hopes that all of these references will eventually come together.
When asked about his hopes for the future of genomics, he talks about knowing and understanding variation relative to himself or anyone else of Jamaican descent. “I would love to get to a point where everyone feels represented and this is for them as much as any particular group,” he says. “We are of one humanity, that is the important part.”