Focus

January 13, 2006

BIOINFORMATICS


From Narratives to Networks: Annotation Mining Reveals Links Between Genes, Biological Context

Photo by Jeff Cleary

Isaac Kohane


With the explosion of high-capacity screening technologies, biologists are drowning in data. Microarray chips allow the simultaneous measurement of the expression of thousands of genes in a single experiment, and thousands of such experiments have been completed in the past decade. Today, the Gene Expression Omnibus (GEO), a public access database at the National Institutes of Health, contains more than 60,000 gene-profiling experiments, and that number is expected to double in the coming year.

Besides experimental results, GEO contains a wealth of annotation data. By explaining what researchers actually did—what species, condition, disease, and tissue they chose to study—these notes link genomic data to their biological context. But their unstructured text format has rendered annotations unusable by automated data-mining and clustering algorithms commonly applied to genomic data.

To rectify that situation, Isaac Kohane and Atul Butte, researchers at Children’s Hospital Boston, developed a method to transform the GEO annotations into computer-friendly codes. In a paper published in the January Nature Biotechnology, the two showed they could use the coded data to automatically link biological concepts like diabetes, aging, or stress, to gene expression patterns across the entire span of diseases in the GEO data collection.

“The genome captures an important part, but only a part, of what constitutes disease—and health,” said Kohane, the Lawrence J. Henderson associate professor of pediatrics and health sciences and technology and director of the Countway Library of Medicine. The environment works on the genome to create the organism’s big picture, which researchers call the phenome. “A lot of work has been done on the genome part of the equation, but the environmental information in annotations has been sparse and hard to process. We’re showing now that just as the genome has been commoditized, it is now possible to start commoditizing the phenome.”

To bring the two realms together, the researchers took advantage of the Unified Medical Language System, a million-term concept-coding thesaurus developed by the National Library of Medicine in anticipation of electronic medical records. The UMLS combines nearly 100 coding systems, including physician billing codes, pathology codes, genomic vocabulary, and other coded biomedical terms. Building on a text-reading function included in the UMLS, Kohane and Butte were able to automatically scan the annotations and map each GEO experiment to at least one relevant biomedical concept.
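As a rough illustration of the mapping step, the sketch below matches free-text annotations against a tiny phrase list. It is not the researchers' method or the UMLS text-reading function: the thesaurus, the concept codes, and the function name map_annotation are all hypothetical stand-ins, and a real system would draw on the full UMLS vocabulary.

```python
import re

# Hypothetical mini-thesaurus: phrase -> made-up concept code.
# These are NOT real UMLS identifiers; they only illustrate the idea.
TOY_THESAURUS = {
    "diabetes": "CONCEPT_DIABETES",
    "diabetic": "CONCEPT_DIABETES",
    "aging": "CONCEPT_AGING",
    "aged": "CONCEPT_AGING",
    "heart attack": "CONCEPT_MI",
    "myocardial infarction": "CONCEPT_MI",
    "skeletal muscle": "CONCEPT_MUSCLE",
}


def map_annotation(text, thesaurus=TOY_THESAURUS):
    """Return the set of concept codes whose phrases occur in the free text."""
    # Lowercase and collapse punctuation so phrases match on word boundaries.
    normalized = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    padded = f" {normalized} "
    return {code for phrase, code in thesaurus.items() if f" {phrase} " in padded}


# One annotation can map to several concepts, or to none.
example = map_annotation("Skeletal muscle from aged, diabetic mice")
```

Different wordings ("diabetes", "diabetic") collapse to the same code, which is what makes the annotations usable by clustering software.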

Then the number crunching could begin. By automatically scanning all of the gene expression data in GEO for all of the biomedical concepts, the researchers built a network of relations that let them see new connections between genes and conditions like aging or injury. Ultimately, the process could build new disease classification trees by linking previously unrelated biological processes through gene expression patterns. For example, the analysis revealed a common thread of elevated stress kinase gene expression across the clinically distinct processes of traumatic injury, stress, and ischemic heart attack.
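The linking step can be pictured with a toy calculation. The sketch below scores each gene by how much higher its average expression is in experiments tagged with a concept than in the rest. The difference of means is a stand-in chosen for simplicity, not the statistic used in the paper, and the gene names, experiment names, and values are invented for illustration.

```python
from statistics import mean


def concept_gene_scores(expression, concepts, concept):
    """Score each gene: mean expression in experiments tagged with `concept`
    minus mean expression in all other experiments."""
    tagged = [exp for exp, tags in concepts.items() if concept in tags]
    others = [exp for exp in concepts if exp not in tagged]
    if not tagged or not others:
        raise ValueError("need experiments both with and without the concept")
    return {
        gene: mean(values[exp] for exp in tagged) - mean(values[exp] for exp in others)
        for gene, values in expression.items()
    }


# Invented data: experiment -> concept tags, and gene -> expression per experiment.
concepts = {
    "exp1": {"injury"},
    "exp2": {"injury", "stress"},
    "exp3": {"aging"},
    "exp4": set(),
}
expression = {
    "geneA": {"exp1": 5.0, "exp2": 6.0, "exp3": 1.0, "exp4": 2.0},
    "geneB": {"exp1": 3.0, "exp2": 3.0, "exp3": 3.0, "exp4": 3.0},
}

scores = concept_gene_scores(expression, concepts, "injury")
ranked = sorted(scores, key=scores.get, reverse=True)
```

A gene that rises only in the tagged experiments ranks first; one that is flat everywhere scores zero. Repeating this for every concept yields the kind of gene-to-concept network the article describes.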

Not everyone agrees that mining annotations will lead to “Eureka” moments for biologists experienced in analyzing expression data. David Lipman, director of the NIH’s National Center for Biotechnology Information, home of GEO and GenBank, said, “For these scientists, the rate-limiting step in discovery is not trying to decipher annotations, it’s slogging through the data and thinking about what it means.” But Lipman allows that more standardized annotation does help a wider community of researchers to more easily access genomic data.

The new tool will be especially helpful to physicians who want to access genomic data for their favorite diseases, said Butte, lead author on the paper and a former instructor in pediatrics at Children’s. “We talk about translational research nowadays, but it’s still a challenge for physician-scientists to get involved in genomics. Because we used the UMLS, now we can take a familiar billing code for a favorite disease and use that to find the genomic experiments, and then say which genes are statistically associated with that billing code. A tool like this can help package and deliver a decade’s worth of genomic experiments to a physician-scientist in a way that’s been unattainable previously.” Butte, now a faculty member at Stanford, says the basic programs and mappings are available online at http://genotext.stanford.edu/, with continued development planned over the next few years.

