A new study presents a method for extracting much more clinically relevant information from relatively raw or seemingly less important data in large biobanks and genetic databases. According to one of the researchers, the method was first developed to increase the amount of human genetic information that is useful for research, which has been relatively limited.
Imagine being a researcher who wants to investigate the genetics behind major depressive disorder (MDD).
This requires large quantities of data: it could involve examining the genetic profiles of 100,000 people and linking them with their health history and other phenotypic traits.
These studies are very expensive, and the large biobanks may not contain the genetic and phenotypic data needed.
A new study now presents a method for estimating the probability that specific people have certain genetic and clinically relevant phenotypic traits without observing or measuring those traits directly.
These data for likely genetic and phenotypic traits can then be used to conduct studies and thereby obtain insight into the genetics behind MDD and how it differs between people.
“The new method is very useful because we can obtain considerably more insight into the genetics behind a disease or disorder without comprehensively characterising the genetics of the people involved in a specific study. In addition, we can save many resources on these typically very expensive studies, which in the long term can enable us to predict whether a given person has an increased risk of developing MDD or is likely to react positively or negatively to a specific treatment,” explains a researcher behind the study, Thomas Werge, Clinical Professor, Institute of Biological Psychiatry, Mental Health Centre Sct Hans, Copenhagen University Hospital Mental Health Services and Department of Clinical Medicine, University of Copenhagen.
The research has been published in Nature Genetics.
No need to know about the traits to be studied
The researchers validated a method to learn more about people’s genetic or phenotypic traits based on other phenotypic traits or other types of data.
For example, researchers may want to study dementia but only have people 30–50 years old to study.
Probably very few, if any, of these people have developed dementia yet, so linking their genetics with whether they have developed dementia to determine risk does not make much sense.
Instead of waiting about 20–40 years to see whether the people being studied develop dementia, researchers could use information about their parents’ history of dementia.
“This means that instead of assessing dementia among the people for whom we have genetic data, we estimate a probability of dementia based on knowing whether their parents had dementia. This uses probabilities to fill a gap of knowledge about these people’s future illness trajectory,” says Thomas Werge.
If one parent has or had dementia, a person can be assigned a certain probability of developing dementia, and the probability could be, for example, 50% higher if both parents have or had dementia.
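The idea of replacing an unobserved diagnosis with a probability based on parental history can be sketched in a few lines of Python. This is only an illustration, not the method from the study: the function name, the cohort and all three risk figures are hypothetical, and the sketch assumes that "50% higher" means relative to the risk with one affected parent.

```python
# Illustrative numbers only; none of them come from the study.
BASELINE_RISK = 0.10        # assumed risk when neither parent had dementia
ONE_PARENT_RISK = 0.20      # assumed risk when one parent had dementia
BOTH_PARENTS_UPLIFT = 1.5   # assumed: 50% higher than the one-parent risk


def dementia_probability(mother_affected: bool, father_affected: bool) -> float:
    """Return a probability used in place of the unobserved diagnosis."""
    affected_parents = int(mother_affected) + int(father_affected)
    if affected_parents == 0:
        return BASELINE_RISK
    if affected_parents == 1:
        return ONE_PARENT_RISK
    # Both parents affected: scale the one-parent risk, capped at 1.0.
    return min(1.0, ONE_PARENT_RISK * BOTH_PARENTS_UPLIFT)


# Hypothetical cohort of 30-50-year-olds with known parental history.
cohort = [
    {"id": "A", "mother": False, "father": False},
    {"id": "B", "mother": True, "father": False},
    {"id": "C", "mother": True, "father": True},
]

for person in cohort:
    p = dementia_probability(person["mother"], person["father"])
    print(person["id"], round(p, 2))
```

Each person then enters the genetic analysis with a probability rather than a yes-or-no diagnosis.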
“We do not need to know everything about a person to be able to analyse. If we know about some features, we can calculate a probability for other features of interest, and that provides sufficient data power to be able to draw results from our studies,” adds Thomas Werge.
Inferring relationships from other data sets
The researchers showed that relationships between genetic differences and traits can be identified in existing data sets. This insight can then be used to calculate, quite accurately, the probability that other people have clinically important traits and genetic variants, so that these people can also be included in studies of disease and strengthen them.
An example could be only knowing a person’s birthweight, education, sex and age but needing the person’s height for analysis.
Again, databases with millions of people can be used to identify correlations between birthweight, education, age, sex and height, and a probable height can then be calculated for each person in a specific study cohort.
This probable height can then be included in and significantly strengthen the study to provide useful conclusions.
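The height example can be sketched as a simple prediction problem: learn the relationship in a large reference data set, then apply it to a cohort in which height was never measured. The sketch below uses ordinary least squares in NumPy as a stand-in; the study's actual statistical model may well differ. The reference data are synthetic, generated only so that the code runs, and every variable name and coefficient is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)


def design_matrix(birthweight_kg, education_years, is_male, age):
    """Stack an intercept column with the four predictors."""
    return np.column_stack(
        [np.ones(len(age)), birthweight_kg, education_years, is_male, age]
    )


# Synthetic stand-in for a large reference database (made-up relationships).
n_ref = 5000
ref_bw = rng.normal(3.4, 0.5, n_ref)
ref_edu = rng.integers(9, 21, n_ref)
ref_male = rng.integers(0, 2, n_ref)
ref_age = rng.integers(30, 71, n_ref)
ref_height = (
    150 + 3.0 * ref_bw + 0.3 * ref_edu + 12.0 * ref_male - 0.05 * ref_age
    + rng.normal(0, 5, n_ref)  # unexplained variation in height
)

# Step 1: learn the correlations in the reference data (least squares fit).
coef, *_ = np.linalg.lstsq(
    design_matrix(ref_bw, ref_edu, ref_male, ref_age), ref_height, rcond=None
)

# Step 2: the study cohort has the predictors but no measured height.
cohort_bw = np.array([3.1, 3.8, 3.4])
cohort_edu = np.array([12, 16, 10])
cohort_male = np.array([0, 1, 1])
cohort_age = np.array([45, 38, 52])

# Step 3: calculate a probable height for each person in the cohort.
probable_height = (
    design_matrix(cohort_bw, cohort_edu, cohort_male, cohort_age) @ coef
)
print(np.round(probable_height, 1))
```

The probable heights can then be used in the genetic analysis in place of measured heights, with the caveat that they carry more uncertainty than direct measurements.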
“So even if we do not know the person’s height, but only a probable height, this can be included and contribute to genetic studies. The interesting thing we show is that researchers can use parental data or other information about the people of interest. Other large data sets provide correlations between phenotypic traits and genetics that enable the probability in a data set to be calculated,” explains Thomas Werge.
Even without data, researchers can learn more about MDD
The researchers showed that their method works in genetic studies of MDD.
MDD can be difficult to study because the diagnosis is not as clear-cut as that of type 1 diabetes, which is definitely yes or no, or as a trait such as height or weight, which can simply be measured.
“We can understand the genetics and causes of MDD much better if we have more data on large groups of people with MDD. But as I said, this type of study is very expensive,” says Thomas Werge.
Instead of starting studies from scratch, the researchers show that data can be obtained from major biobanks such as the UK Biobank and biobanks in Denmark.
Biobanks often contain genetic data on the participants and general information about their previous illnesses, education and the like, but detailed information is typically lacking about the many special and clinically decisive traits that vary between people with MDD.
However, the researchers do not need to observe these traits directly, because they can calculate probabilities for them and thereby base their analyses on the very large data sets available.
“The basic data and resources are there, and estimating the missing data needed does not cost a fortune and require a whole career. You can calculate something useful and thereby identify genetics that affects a clinically important aspect of MDD. This may be important for clinical practice and for treating people with MDD,” concludes Thomas Werge.