May 18, 2017 6:00 AM

A Fish in a Pond or a Needle in a Haystack? DNA Tool Raises Promise, Privacy Concerns

For the first time, researchers connected two different types of DNA snippets to identify individuals. This could help researchers across many fields — but isn’t without risk.

TV crime dramas often feature law enforcement characters saying, “We got a hit in CODIS” — a potential match of DNA found at a crime scene to a central database of DNA samples gathered from people suspected of or convicted of past crimes.

A few tiny snippets of genetic material might be enough to send a perpetrator to jail by the episode’s end — or free a wrongfully convicted person.

Meanwhile, medical dramas feature DNA-based diagnosis and care, made possible by rapid DNA testing and research in collections of samples from thousands of people.

And nature shows might feature wildlife specialists using DNA from fur or droppings to query a database of animal DNA and guide their search for a member of an endangered species.

All three use different databases to achieve the same goal: finding a genetic needle in a haystack or a single fish in a pond.

These databases use different types of genetic fragments to identify people or animals, which keeps those databases from “talking” to one another. The haystacks and ponds stand apart.

Now, a team of researchers has published findings that could help break down that barrier. In the Proceedings of the National Academy of Sciences, the team reports a way to identify the same individuals across two genetic databases.

Using the type of fragments law enforcement employs, called STRs, or short tandem repeats, and the kind wielded more in the medical research realm, called SNPs, or single nucleotide polymorphisms, researchers matched the correct person more than 90 percent of the time in a group of 872 individuals.

The team used a technique that relies on linkage disequilibrium: DNA variants that sit close together on a chromosome tend to be passed down together, so knowing one can help you guess the others. Because STRs and the SNPs near them are inherited together in this way, a record in one database can be used to search for its match in another database.
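To make the idea concrete, here is a toy sketch in Python. It is not the study's actual pipeline. It assumes that nearby SNPs have already been used to estimate how likely each STR allele is for every SNP record, then scores each STR record against those estimates and picks the best fit. All record names, allele values, and probabilities are invented for illustration.

```python
import math

# Hypothetical toy data: for each SNP record, the estimated probability of
# each STR allele (repeat count) at two STR loci, as if inferred from SNPs
# that sit near those loci. Real data would cover more loci and two alleles
# per locus; this sketch uses one allele per locus for simplicity.
imputed_str_probs = {
    "snp_record_A": [{10: 0.7, 11: 0.2, 12: 0.1}, {7: 0.6, 8: 0.3, 9: 0.1}],
    "snp_record_B": [{10: 0.1, 11: 0.3, 12: 0.6}, {7: 0.2, 8: 0.2, 9: 0.6}],
    "snp_record_C": [{10: 0.2, 11: 0.6, 12: 0.2}, {7: 0.1, 8: 0.7, 9: 0.2}],
}

# Hypothetical STR records: the observed allele at each of the two loci.
str_records = {
    "str_record_1": [12, 9],
    "str_record_2": [10, 7],
    "str_record_3": [11, 8],
}


def match_score(observed_alleles, allele_probs):
    """Sum the log-probabilities of the observed STR alleles.

    An allele missing from the estimate gets a tiny floor probability
    so the logarithm stays defined.
    """
    return sum(
        math.log(probs.get(allele, 1e-6))
        for allele, probs in zip(observed_alleles, allele_probs)
    )


def best_matches(str_records, imputed_str_probs):
    """For each STR record, return the SNP record with the highest score."""
    return {
        str_id: max(
            imputed_str_probs,
            key=lambda snp_id: match_score(alleles, imputed_str_probs[snp_id]),
        )
        for str_id, alleles in str_records.items()
    }


matches = best_matches(str_records, imputed_str_probs)
for str_id, snp_id in matches.items():
    print(str_id, "->", snp_id)
```

With only three records and clear-cut probabilities, each STR record has one obvious partner. In a database of millions, many candidates could score almost equally well, which is why scale matters so much for this kind of matching.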

Their achievement, if proven to work in larger groups, could help researchers combine massive genetic databases — and avoid counting the same person twice in their analyses. That could help medical research move faster and further on many diseases.

The approach might also help law enforcement zero in on suspects who aren’t in the national CODIS database but whose DNA is on file in another database that investigators can access with a warrant.

Privacy concerns

Because of this potential, researchers note that the technique could open up privacy risks by linking records in the criminal justice system with those in medical or ancestry research, unless protections are put in place.

For instance, it could inadvertently give law enforcement information about the health-related genetic traits someone carries.

Or it could allow the DNA a person voluntarily contributed to a research project, or sent to a company that offers ancestry or health-related genetic testing, to be used for purposes they didn’t permit.

Jun Z. Li, Ph.D., the University of Michigan genetics and bioinformatics researcher who participated in the effort along with Stanford University and University of Manitoba colleagues, notes that these potential uses and misuses are still just concepts.

After all, the study used DNA from only a limited set of people, whose ethnic backgrounds were reasonably diverse but not fully representative.

The DNA samples were initially collected to study human diversity among different populations and were anonymous. So the “matching” was between records in two databases and didn’t find someone by name.

Still, the record matching the researchers achieved, and the data aggregation it could unlock, make Li and his Stanford colleague Noah Rosenberg, Ph.D., interested in studying the approach further in larger sets of DNA. Rosenberg and Li once worked together at U-M and have continued collaborating now that Rosenberg is at Stanford.

“We had a pond of less than 1,000 fish, and we put ‘bait’ in to see which fish would bite,” Li says. “But in practice, if you want to search population-level data, you’d be doing the equivalent of looking for a needle in a haystack and determining if you have a match among millions of records.”

Using the same technique in larger databases could lead to a number of moderately strong matches, rather than a surefire pinpointing of an individual.

But that could be enough to dramatically narrow a search for a suspect. As the CODIS database extends its sampling from 13 to 20 STRs, the potential precision of matches will increase.

Preventing unauthorized uses of DNA data and cross-matching of databases, Li says, could require research oversight boards, clinical providers and commercial DNA-testing companies to rethink their consent forms and security protections.

“In the end, record matching is a classification task, like matching a patient to a diagnosis,” says Li. “It has to do with the granularity of the classification. But record matching across different databases is a medical care issue as well as a privacy issue. And the chance of being identified beyond your consented use of your DNA is a real possibility. We need to figure out where to draw the line.”

Further applications

As the movement toward precision medicine gains steam, combining databases of DNA and the clinical data from the same patients could allow health researchers to identify genetic factors that make a person more likely to respond to certain treatments or have certain side effects.

Li is also studying variation in cancer cells by looking at the massive amounts of data generated when researchers track which genes are being expressed, or which mutations have occurred, at different stages in a tumor’s lifespan.

If they can connect the gene expression or mutation patterns with the clinical stage of the cancer, they might be able to understand better which vulnerabilities a tumor has, and therefore how to detect or treat it better.

Now that connecting haystacks and ponds has been shown to work, that potential grows ever stronger.