What’s the best way to bypass expensive conference fees as an undergrad? Volunteering, of course!
This year I’m volunteering at the ISMB 2008 conference in Toronto, which, for a minor time commitment, entitles me to hear fascinating presentations on all aspects of computational biology and bioinformatics. One of these was a tutorial on ncRNA (noncoding RNA) gene finding, given by Jan Gorodkin and Ivo Hofacker.
First, a quick intro. Most of what we know about the genome deals with coding sequence:
- DNA that’s transcribed into messenger RNA (mRNA)
- mRNA that’s translated into proteins
But only 1.3-1.6% of the (human) genome codes for protein! The rest, despite the popular misconception, is not “junk DNA”. This DNA, and the non-protein-coding RNA transcribed from it, have many functions, only a few of which we know about. The recent ENCODE pilot project estimated that as much as 97% of the human genome is transcribed into RNA at some point. But what does it do?
One well-studied type of ncRNA is microRNA, or miRNA. These are short sequences of RNA that are complementary to (protein-coding) mRNA. They act as downregulators (suppressors) of genes by attaching to the mRNA and getting in the way of the translational machinery.
Discovering ncRNA genes can be tricky! This is mostly because ncRNA function is in many cases tied inextricably to the structure of the transcribed RNA. RNA, being single-stranded, can double up on itself and form loops and helices (as pictured below). These crazy loops are called the secondary structure. The secondary structure of RNA is what results from the first pass of folding, and serves as a simplified (but useful) model for the 3D structure of the RNA.
Because the secondary structure results from the pairing of bases, any of the so-called canonical base pairs (C-G, A-U, U-G, and the reverse of all three) can occur. Mutations can change the sequence while keeping the bases paired in the same way, leading to structures that are the same even though the sequences are very different.
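To make that concrete, here is a minimal Python sketch that checks whether a sequence can form a given secondary structure. It assumes the structure is written in dot-bracket notation (matching parentheses mark paired positions, dots mark unpaired ones); the function names and the toy sequences are made up for illustration and don't come from any particular tool.

```python
# Canonical pairs as listed above: C-G, A-U, U-G, and the reverse of all three.
CANONICAL_PAIRS = {
    ("C", "G"), ("G", "C"),
    ("A", "U"), ("U", "A"),
    ("U", "G"), ("G", "U"),
}


def paired_positions(structure):
    """Return the (i, j) index pairs of matching brackets in a dot-bracket string."""
    stack, pairs = [], []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def is_compatible(sequence, structure):
    """True if every paired position in the structure holds a canonical base pair."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    return all(
        (sequence[i], sequence[j]) in CANONICAL_PAIRS
        for i, j in paired_positions(structure)
    )


# A toy hairpin: three base pairs closing a three-nucleotide loop.
hairpin = "(((...)))"
print(is_compatible("GGGAAACCC", hairpin))  # True
print(is_compatible("AUCAAAGAU", hairpin))  # True: every stem base differs, pairing is kept
print(is_compatible("GGGAAACAC", hairpin))  # False: one change leaves a G-A mismatch
```

The first two toy sequences differ at every position in the stem, yet both fit the same hairpin. The third differs from the first by a single nucleotide and no longer fits.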
However, a single nucleotide that no longer base pairs the same way can produce a completely different secondary structure. In the world of bioinformatics this can make it difficult for computer algorithms relying only on nucleotide data to align sequences that are too dissimilar. There are algorithms that can align sequences based on conserved structure, but they are computationally expensive both in terms of memory and CPU time.
That said, sometimes alignments based only on sequence are good enough. Sequence alignment tools are fairly common, and alignment data across many species is available for downloading. Algorithms can use these alignments to discover the genomic locations of new ncRNA genes. Because the sequence (well, structure) of an ncRNA gene will stand firm while the sequence around it mutates, functional genes will stand out as regions with high conservation across an evolutionary tree.
The alignment of multiple sequences is used in a few different ways to discover ncRNA genes. Some methods use the known evolutionary tree in a probabilistic way (how likely is it that this nucleotide mutated from A to U? What if it’s part of a base pair?) to try to find a consensus structure. Others calculate the stability of the structures formed. Sequences with the most stable structures tend to be functional. There are algorithms that combine the two approaches.
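As a rough illustration of what "standing out" means, the Python sketch below scores each column of a multiple alignment by how often its most common character appears, then reports windows whose average score is high. This is a deliberate simplification: it measures plain sequence identity only (gap characters are counted like any other symbol), it ignores the evolutionary tree, and the alignment, window size, and threshold are invented for the example.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of rows sharing the most common character, for each column."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


def conserved_windows(scores, window, threshold):
    """Start indices of windows whose mean conservation is at least the threshold."""
    return [
        start
        for start in range(len(scores) - window + 1)
        if sum(scores[start:start + window]) / window >= threshold
    ]


# Hypothetical alignment of four species; columns 2-5 are identical in all of them.
toy_alignment = [
    "ACGUGGCA",
    "UCGUGGAC",
    "GAGUGGCU",
    "CUGUGGUA",
]
scores = column_conservation(toy_alignment)
print(scores)
print(conserved_windows(scores, window=4, threshold=0.9))
```

On this toy input only the window starting at column 2 clears the threshold, which is the fully conserved stretch in the middle.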
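For a feel of the stability idea, here is a Python sketch that counts the maximum number of base pairs a sequence can form, using the classic Nussinov dynamic-programming recurrence. To be clear about the swap: this is a crude stand-in, not what the gene finders from the tutorial compute. Real tools estimate stability with thermodynamic energy models, whereas this sketch simply counts pairs. The function name and the `min_loop` parameter (the minimum number of unpaired bases inside a hairpin) are assumptions made for the example.

```python
# Allowed pairs: C-G, A-U, U-G, and the reverse of all three.
PAIRS = {
    ("C", "G"), ("G", "C"),
    ("A", "U"), ("U", "A"),
    ("U", "G"), ("G", "U"),
}


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq (Nussinov-style recurrence)."""
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] holds the best pair count for the subsequence seq[i..j].
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: position j stays unpaired
            for k in range(i, j - min_loop):  # case 2: position j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


print(max_base_pairs("GGGAAACCC"))  # 3: a three-pair stem around an AAA loop
print(max_base_pairs("AAAAAAAAA"))  # 0: nothing can pair
```

A sequence that can form many more pairs than shuffled versions of itself is, under this toy measure, the kind of candidate a stability-based method would flag.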
The sets of ncRNA genes predicted by these different methods have little overlap. This may be due to lots of false positives being predicted, or it may be because certain approaches are more likely to find ncRNA genes of certain types or with certain properties. Improving these methods, along with secondary-structure-based sequence alignment and the prediction of RNA structure and function, remains an area of ongoing research. It’s clear that we’ve already begun to crack the genetic code.
The secondary structure of ribosomal RNA from E. coli.