100,000 Studies: A Milestone for Human Genome Epidemiology (HuGE) and the HuGE Navigator
The HuGE published literature database now contains more than 100,000 citations, a milestone reached at the end of 2014. The Office of Public Health Genomics has compiled this database since 2001 via weekly systematic sweeps of PubMed performed by a single curator. For the first five years, a complex PubMed query was used to identify studies of genotype prevalence, gene-disease association, gene-environment interaction, and the performance characteristics of genetic tests. In 2006, a data mining approach using support vector machines replaced the PubMed query, reducing the time needed for hand curation and improving both sensitivity and specificity. The database and a suite of online tools to explore it were re-launched as the HuGE Navigator.
Since the first draft of the human genome sequence was announced in 2001, PubMed has added more than one million articles on human genetics and genomics. Human genome epidemiology has grown, too, but studies of genetic variation and disease in populations (i.e., groups of people not defined by family relationships) still account for only a small fraction of the total (Figure 1).
A boom in gene discovery followed the introduction of genome-wide association studies (GWAS) in 2005; following up on these discoveries to unravel genetic contributions to disease, however, remains extremely challenging. There are no “high-throughput” shortcuts to understanding. Now that it seems clear that common genetic variants have only small effects on disease risk, the field has shifted toward studies of rare variants with large effects. This may look like a return to the pre-Human Genome Project roots of genetic epidemiology; discoveries in this phase, however, are just the next steps toward building the knowledge base for population-level interpretation.
Meta-analysis has become popular as a first step in knowledge synthesis. Concern over the proliferation of poorly conducted meta-analyses, however, led the editors of PLOS ONE to establish explicit quality criteria for submitted manuscripts, and the American Journal of Epidemiology has endorsed this approach. Although rigorous meta-analysis can be useful for assessing and refining gene discoveries, it does not suggest next steps. Other methods are needed to integrate genetic data into ways of thinking that can help us understand, prevent, and treat disease. Human genome epidemiology must evolve to help meet this challenge.
On Jan 5, 2015, the HuGE Navigator completed its transition to a fully automated curation process based on machine learning and data extraction. This method has achieved 90% sensitivity and specificity when tested against the previous, semi-automated process. The HuGE published literature database will continue to be updated weekly with automatic indexing of gene symbols, study type (meta-analysis, GWAS), and category (pharmacogenomics, genetic testing).
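For readers less familiar with these measures, here is a minimal sketch of how sensitivity and specificity can be computed when one screening process is compared against another that is treated as the reference. It is not the HuGE Navigator's code: the function name `sensitivity_specificity` and the 20 example labels are hypothetical and made up purely for illustration.

```python
def sensitivity_specificity(reference, predicted):
    """Compare predicted include/exclude decisions against reference decisions.

    Both arguments are equal-length sequences of booleans, where True means
    "this article belongs in the database". Returns (sensitivity, specificity).
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")

    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)          # correctly included
    fn = sum(1 for r, p in pairs if r and not p)      # wrongly excluded
    tn = sum(1 for r, p in pairs if not r and not p)  # correctly excluded
    fp = sum(1 for r, p in pairs if not r and p)      # wrongly included

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative reference label")

    sensitivity = tp / (tp + fn)  # share of relevant articles that were found
    specificity = tn / (tn + fp)  # share of irrelevant articles that were rejected
    return sensitivity, specificity


# Made-up example: 10 relevant and 10 irrelevant articles per the reference process.
reference = [True] * 10 + [False] * 10
# Hypothetical automated decisions: one relevant article missed, one irrelevant one kept.
predicted = [True] * 9 + [False] + [False] * 9 + [True]

sens, spec = sensitivity_specificity(reference, predicted)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In this toy example both measures come out to 0.90. In practice, the reference labels would come from the earlier curated process and the predicted labels from the automated one.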
Human genome epidemiology is a global enterprise. The first 100,000 articles in the database included authors from 151 countries (Figure 2). The HuGE Navigator will remain online as a freely accessible resource for all who are interested in human genetic variation and population health.