The Real Issue: Diversity in Genetics Research

A recent article in Quartz with the headline “23andMe has a problem when it comes to ancestry reports for people of color” provides an opportunity for us to discuss the real issue largely missed by the author: the need for diversity in genetics research, and biomedical research in general. It’s also an opportunity to address mischaracterizations of our science within the article, and motivations ascribed to our research that are just plain wrong. We’ve raised these issues with Quartz and feel strongly enough to share our thoughts more publicly, in the hope of steering the discussion in an informative and worthwhile direction, since it is a conversation that the broader scientific community should be having.

Genetics research has a diversity problem. This is widely recognized, but when you see the data it’s staggering. For example, a 2011 article cited data showing that only 4 percent (!) of all genome-wide association studies had been conducted using samples of non-European descent. There are very real repercussions when research is limited to just one group. The seriousness of this situation was demonstrated in a recent study showing that due to biased research, African Americans are more likely than whites to mistakenly be told that they carry a mutation putting them at risk for the heart condition known as hypertrophic cardiomyopathy, when in fact they do not. This type of mistake has serious emotional and medical implications.

The lack of diversity in genetics research is reflective of the general lack of diversity in all aspects of health research. There are lots of reasons — historical, cultural, political. For good overviews we suggest reading this piece from Newsweek and this article from Oh, et al. These authors eloquently and forcefully explain the situation.

What does this have to do with 23andMe?

Though we do conduct our own research and are increasingly able to provide reports based on our own published studies, for the most part 23andMe is dependent on peer-reviewed scientific research published by others to fuel the reports in our product. Lack of diversity in what is studied by the research community at large therefore impacts the results we are able to provide to customers.
When it comes to our Ancestry Composition feature (which tells people the proportion of their DNA that comes from each of 31 populations), the results we provide to customers are derived by comparing their DNA to what are known as “reference datasets”. We use publicly available data from academic initiatives that have specifically sought to characterize large swaths of the human genetic diversity of our planet: the Human Genome Diversity Project (HGDP), HapMap3 and the 1000 Genomes Project. We augment this public data with information provided by a small percentage of our customers whose data meet our rigorous standards for being included in a reference dataset.

The author of the Quartz article focused on the numbers of people in our most granular population references. She specifically questioned how we can claim to provide accurate results about her ancestry (she identifies as Korean, but her 23andMe results indicated Japanese and Chinese ancestry, as well as Korean) when there are data for “only” 76 Koreans in our reference dataset. It’s an understandable worry, since so much of what we hear about these days in science is the power of “big data”. Bigger must be better, right? Not necessarily. For Ancestry Composition, high accuracy can often be obtained with relatively modest sample sizes. Sample size matters more for granularity than for accuracy.

We’ve done a comprehensive evaluation of our algorithm and reference datasets, and are confident in the accuracy of the results we report for customers with Korean or any other ancestry. If you want to get into the nitty-gritty you can check out this white paper (in which we describe the multiple steps that go into producing accurate results and the metrics demonstrating how well our algorithm works), this blog post or this scientific poster. The gist is that we only go as granular as we can for a given confidence threshold. Customers can adjust their results from a 90 percent confidence level (“Conservative”) down to 50 percent (“Speculative”). If we say that a section of DNA is most likely Japanese or Chinese or Korean at one of our higher confidence thresholds, we stand behind it. If we weren’t sure that we could make the distinction, we’d go up a level and say that the DNA reflects “East Asian” ancestry. The information would not be incorrect, simply less specific.
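To make the "go up a level" idea concrete, here is a toy sketch in Python. It is an illustration only, not the Ancestry Composition algorithm: the `PARENT` mapping, the `assign_label` function and the probabilities for the example segment are all hypothetical, and the real method involves many more steps than this.

```python
# Hypothetical two-level hierarchy: each fine-grained population
# rolls up into a broader regional label.
PARENT = {"Korean": "East Asian", "Japanese": "East Asian", "Chinese": "East Asian"}


def assign_label(posteriors, threshold):
    """Return the most specific label whose probability meets the threshold."""
    # First try the single most likely fine-grained population.
    best = max(posteriors, key=posteriors.get)
    if posteriors[best] >= threshold:
        return best

    # Otherwise pool the probabilities by broader region and try again.
    regional = {}
    for population, prob in posteriors.items():
        region = PARENT[population]
        regional[region] = regional.get(region, 0.0) + prob
    best_region = max(regional, key=regional.get)
    if regional[best_region] >= threshold:
        return best_region

    # Nothing is confident enough at either level.
    return "Unassigned"


# Made-up probabilities for one segment of DNA (illustrative values only).
segment = {"Korean": 0.62, "Japanese": 0.25, "Chinese": 0.13}

print(assign_label(segment, 0.9))  # East Asian
print(assign_label(segment, 0.5))  # Korean
```

In this sketch the same segment is labeled "East Asian" at the 90 percent setting and "Korean" at the 50 percent setting. The broader label is less specific, but it is not wrong.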

While the impact of adding more Korean individuals (or individuals of any of the groups we currently have represented in our reference datasets) is likely to be small in terms of improving the accuracy of our Ancestry Composition results, it could help us add more granularity to our results — i.e., groups of people from more specifically defined geographical areas. That way, for example, we might be able to distinguish between Cameroonian and Ghanaian ancestry, instead of lumping ancestry from these different locations under the label “West African.”

The author of the Quartz article correctly points out that we do have a lot of data and granularity for Europe (providing, for example, separate French & German and British & Irish groupings), but not for the rest of the world. She implies that this is driven by a lack of effort or concern on our part. The truth is that we have a great deal of customer data to enhance the public data available for Europeans because the vast majority of our customers have European ancestry, and several thousand of them are eligible to contribute to our reference dataset. This is not because of a “commercially driven ethnic bias,” and we have been making inroads for several years to improve the diversity in our research and reference data.

This is where we come back to where we began — this is not just a 23andMe issue but an issue with diversity in genetics research in general.

There’s only so much we can do as a single company, but we are trying. We funded the genotyping of the HGDP samples, data which are publicly available. Our Roots into the Future project enrolled 10,000 African Americans in our research program, and our African Ancestry Project helped contribute to our West African reference dataset. We recently collaborated to obtain data for over 400 representatives of populations in the Democratic Republic of Congo. One of our researchers was awarded NIH funding to develop new methods for uncovering genetic variants associated with disease, especially in those with non-European ancestry. There are additional programs in the works, and we look forward to sharing more as we can.

We’re also always looking for more publicly available data that we can use to improve our product and inform our research. But it’s not as simple as just pulling studies for a given group or purchasing reference data.  There are important scientific and ethical considerations that must be attended to when collecting population data and 23andMe is not willing to compromise our standards in the interest of appearing more diverse.

We can always do better. We want people — researchers, customers, journalists — to keep holding us accountable and pushing us to lead the way in consumer genetics and genetics research. But accusing us of racial bias and shoddy science is unfair. It reduces an important conversation to clickbait.

Please share your comments here but don’t limit it to our blog — discuss your ideas, raise awareness and spread the word that the lack of research diversity is a real problem. We want people to have informed opinions and conversations on this topic since the impact of genetics research will only continue to grow.






  • ABottleOfStTropez

    “The truth is that we have a great deal of customer data to enhance the public data available for Europeans because the vast majority of our customers have European ancestry, and several thousand of them are eligible to contribute to our reference dataset. ”

    So if you truly are augmenting your data, how come my ancestry results haven’t been updated in 2 years?

    • 77anteater

      Should be about three years, really. As far as I know, around December of 2014, 23andme just had a minor algorithm change where they phased parents and children together who were linked in split view. The last time they actually had a major Ancestry Composition update where they added new reference populations and samples was in November 2013, coming up on three years ago. Ancestry.com’s last update was a month or two earlier. I don’t think the companies have been putting much effort into adding new data, even when there ARE new references on 1000 Genomes that at least one of the companies should be able to use.
      I can understand not doing it every six months, but about three years?

  • TRS

    While the impact of adding more Korean
    individuals (or individuals of any of the groups we currently have
    represented in our reference datasets) is likely to be small in terms of
    improving the accuracy of our Ancestry Composition results, it could
    help us add more granularity to our results — i.e., groups of people
    from more specifically defined geographical areas. That way, for
    example, we might be able to distinguish between Cameroonian and
    Ghanaian ancestry, instead of lumping ancestry from these different
    locations under the label “West African.”

    If you lack granularity, fine, but what’s the point in lumping a population into a category where it obviously doesn’t belong? I’m talking about the reason for including certain Middle Eastern populations in the North African category.

    There are 249 samples in “North African”, but these are the non-North African populations found within that category:

    Palestinian HGDP 51
    Bedouin (from Negev) HGDP 48
    Palestine 23andMe 28
    Saudi Arabia 23andMe 8
    Jordan 23andMe 5
    Yemen 23andMe 5
    Kuwait 23andMe 3
    United Arab Emirates 23andMe 2
    Bahrain 23andMe 1

    In 249 samples, 151 are NOT North African! So how can you tell if your reported North African ancestry comes from Morocco, Algeria, etc…. or from non-North African Arabia? You cannot. This problem has been present since the ancestry composition update in 2014 and the company still refuses to acknowledge it.

  • Leon Bird

    Detailed human DNA population studies are in their infancy. As time goes on, the data imbalance will be corrected. Technology and innovation follow interest and opportunity, which are in Europe and America.

  • 23blog

    Udo, the Girl,
    We actually have a very large number of people with African ancestry in our database. We do need more reference samples from people within Africa to improve how specific we can get with the Ancestry Composition estimates we give customers.

    As for “wavy or straight” hair trait estimate, the result is best applied to customers of European ancestry. I think that is stated in the report itself and there is an additional note regarding ethnicity and the fact that Asians typically have straight hair, while people with African ancestry typically have very curly hair.

    • 77anteater

      Have you had any plans on offering a free project to people of East Asian and specifically Southeast Asian (Thai, Filipino, Vietnamese, Cambodian) descent? How about Middle Easterners, since there are (or at least should be) differences between Persian and Arab?
      I’ll be glad to help with it.

      On Africa, you mentioned the Democratic Republic of Congo in the essay, but I’m curious as to what work has been done for other parts like the regions with Senegambia (which need more than just the 25 or so Mandenka samples of HGDP) and Sierra Leone, since those areas may be more specific to African Americans? Speaking of Sierra Leone, from what I remember seeing of how Ancestry Composition was before the change to the “New Experience,” you do have samples listed as “Sierra Leone,” but don’t specify what ethnic groups they are. Are they Mende, Temne, etc.?
      Same as with the (I can’t remember the number, but it couldn’t have been any higher than 10) Ghana sample(s). I think it actually does matter what the ethnic groups were when you’re talking about Africa, and not just the country.
      Will you ever release a paper saying what reference populations were obtained through the African Ancestry Project a few years ago?

      You said, “We use publicly available data from academic initiatives that have specifically sought to characterize large swaths of the human genetic diversity of our planet: the Human Genome Diversity Project (HGDP), HapMap3 and the 1000 Genomes Project.”

      We know that there has to be more than just these three projects out there. Plus it seems that HapMap3 and HGDP are basically done, while 1000 Genomes has added more samples to them such as “Esan of Nigeria” and “Mende of Sierra Leone.” http://www.internationalgenome.org/sites/1000genomes.org/files/documents/nhgri_flyer_2013.pdf

      Is it possible to obtain reference populations from the African Genome Variation Project too? http://www.nature.com/nature/journal/v517/n7534/fig_tab/nature13997_F2.html
      There are not only Mandenka and Yoruba listed here but also more Senegambian populations such as Wolof and Jola. Plus there are Southern and Eastern African groups such as Zulu and Baganda as well. Is it possible to use this project too?

      There’s also the “Simons Genome Diversity Project.”
      http://www.unz.com/gnxp/the-simons-genome-diversity-project/
      https://www.simonsfoundation.org/life-sciences/simons-genome-diversity-project-dataset/
      It doesn’t appear to have a lot of samples for each population, but I’m sure that every bit should help, right? Could you use that project as well?
      I mean, it seems like there HAS been some more data over the years, but just the matter of you and the other companies using them. Am I right?

      I will give you at 23andme credit for at least TALKING about this. The other companies have kept silent on this issue.

      • 23blog

        77anteater,
        Thanks for the note and comments. You bring up a lot of points, and I think you are echoing some of the same frustrations that we’ve heard from others who want more detail than we currently have. What I can tell you is that we are continuing to improve Ancestry Composition, as we have done in the past. We do look opportunistically at the possibility of obtaining reference populations from other sources, but that raises other issues like how the samples were obtained and whether they meet our standards.
        I think you also make a good point about how we identify reference populations. Sometimes the national designation falls short of identifying ethnic groups within those countries. All I can say is we have a lot of smart people working on this. Stay tuned.

        • 77anteater

          Like I said, there just have to be more public projects available than just the old 1000 Genomes, HGDP and HapMap3, the latter two of which apparently haven’t added any new populations since they came out.

          Could you try to look into this as well?
          https://www.wegene.com/question/600

          The company Wegene apparently was able to obtain Indigenous American samples from Central and South America.

          You can say that peoples of North and South America have no genetic diversity and all there needs to be are the same old five HGDP groups: Colombian, Karitiana, Maya, Pima, and Surui… but I believe that many people of Hispanic origin in particular would be interested in finally seeing some detail beyond just “Native American.” Many of them would like to see at least what reference population they have affinity to, such as Maya or Mixtec or Zapotec or Mixe.

          Is it possible to look into where Wegene obtained these samples, and see if 23andme could use them for their Latin American customers?

        • 23blog

          I don’t want to comment about how or where wegene obtains its samples, but I do want to make sure that you understand it’s not just a matter of going to a country and taking people’s DNA. There are legal, ethical, cultural, and scientific guidelines that should be followed whenever we gather reference population data.

          My point in saying this is to address this idea that it’s simple to just go out and get samples. It’s not, at least in a way that would meet the criteria that we’ve set out.

          That doesn’t mean we couldn’t do more. We can and we are. I just want to address this idea that there are these readily available reference population data sets that 23andMe could use, or that 23andMe could simply send a researcher into a country and take DNA samples. It doesn’t work that way for us, nor should it.

          Thanks again for your comments.

        • 77anteater

          Alright. One last thing, do you think that an Ancestry Composition update could be done by this time next year, which would be almost four years since the previous one in November 2013?

          I mean I totally understand that ethnicity updates can’t be done every six months, but when you’re coming up on three years, and none of the companies (I understand that you can only speak for 23andme) have added any new reference populations or samples in all that time, it is kind of frustrating. But like I said, you do deserve credit for at least discussing it, while the other companies have kept silent.

  • You should share more gene science with your customers. For example: “Almost all living people outside of Africa trace back to a single migration more than 50,000 years ago”. But you will get immediate, aggressive hostility from anti-science weirdos.

  • 23blog

    Mango Martini,
    Thanks for your comments. We are continuing to improve what we can tell people with African ancestry, and part of that effort will be continuing to improve the amount of reference data we have to inform people about where in Africa their ancestors came.

  • Sai C.

    Yeah, this is a great example of deflection. I’m not buying that y’all learned your lesson. Why? Because:

    “I pointed out to Mountain that if 23andMe was so concerned about diversity, they could purchase or acquire data sets from outside the US. Like from Asia, for example. Was 23andMe at all interested in non-US DNA, I asked?
    Mountain replied yes, but the example she gave made me wonder whether she was listening to me at all. “One of the studies [we’re using] is called ‘Peoples of the British Isles,’ which is data for 2-3,000 people all over the British Isles.””

    I’m sorry but this just sounds like *blinding* Eurocentrism. Is it really that hard with all of the money you make, to send out sets to poorly represented populations? Yes, it’s a diversity problem affecting the entire field, but it’s clear that the scientists leading your operations can’t seem to even acknowledge that it’s a problem. This sounds like PR-speak rather than earnest improvement. It’s an insult that you can tell someone that they’re German and Polish, but you can’t tell me that I’m Gujarati and Telugu (two completely different ethnicities), I’m just “South Asian”.

    • 23blog

      Hi Sai,
      I would hope that the overall message you’d get from this is that we are working toward adding more diversity to our reference populations, and that this lack of diversity in genetic research is not unique to 23andMe and is related to many, many issues. If the only takeaway from Dr. Mountain’s comments was her “blinding Eurocentrism,” we really missed the mark, and you miss what has been the core of her life’s work in science, which has been focused on Africa.

      The bottom line is that we are working to improve this situation, actively working on projects that will make a difference.

  • 23blog

    Hi Don,
    The answer to your question is that it depends. Right now we only report “broadly south Asian” ancestry. This includes Afghan, Balochi, Bangladeshi, Brahui, Burusho, Hazara, Indian, Kalash, Makrani, Nepalese, Pakistani, Pathan, Sindhi, Sri Lankan, and Uyghur. But if you want to distinguish between those different ancestries we are not yet able to do that.
