Saturday, October 13, 2012: 5:40 AM
Hall 4E/F (WSCC)
Bacterial genomes consist of regions that are transcribed into RNA (comprising the transcriptome) and regions that are not. Fully characterizing the transcriptome would give researchers a powerful way to pinpoint transcripts that relate to bacterial phenotypes, including those involved in pathogenic processes (e.g., in M. tuberculosis). RNA-seq, the high-throughput sequencing of cDNA libraries, allows us to build a map of a bacterial transcriptome by overlapping millions of individual reads. Currently, there are no methods for normalizing variation between highly and lowly expressed genes and for suppressing noise from short read lengths, read errors, coverage errors, and the dense nature of bacterial genomes. This makes it difficult to resolve all but the most highly expressed transcripts. Using E. coli RNA-seq data, we have developed an algorithm that scans RNA-seq expression data to identify transcript locations. Our most recent algorithm has achieved low to moderate sensitivity in identifying E. coli start-stops and moderate precision, with a supermajority of the identified start-stops accurate to within 200 base pairs. To increase precision and sensitivity, we aim to create several additional normalized variables derived from expression levels across the E. coli genome. We hope to significantly increase the precision of start-stop site identification and thereby lower the percentage of false positives. This approach might serve as a novel algorithm for deconstructing transcriptome data in a range of bacterial pathogens and other microbes.
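For illustration only, the following Python sketch shows one simple way a scan of expression data and a 200 base pair accuracy check could be set up. The abstract does not describe the algorithm's internals, so everything here is an assumption rather than the authors' method: the fixed coverage threshold, the hypothetical function names `call_transcripts` and `score_calls`, and the rule that a predicted transcript counts as correct only when both its start and its stop fall within the tolerance of an annotated transcript.

```python
from typing import List, Tuple

# (start, stop) in base pairs, 0-based, stop exclusive. Assumed convention.
Interval = Tuple[int, int]


def call_transcripts(coverage: List[float], threshold: float,
                     min_length: int = 1) -> List[Interval]:
    """Scan per-base coverage and return maximal runs at or above threshold.

    Assumption: a plain fixed threshold stands in for whatever
    normalization the real algorithm applies.
    """
    intervals: List[Interval] = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth >= threshold:
            if start is None:
                start = pos  # a candidate transcript begins here
        elif start is not None:
            if pos - start >= min_length:
                intervals.append((start, pos))
            start = None
    # Close a run that extends to the end of the genome.
    if start is not None and len(coverage) - start >= min_length:
        intervals.append((start, len(coverage)))
    return intervals


def score_calls(predicted: List[Interval], annotated: List[Interval],
                tolerance: int = 200) -> Tuple[float, float]:
    """Return (precision, sensitivity) for predicted start-stop pairs.

    Assumption: a prediction matches an annotation when both ends
    are within `tolerance` base pairs.
    """
    def matches(p: Interval, a: Interval) -> bool:
        return abs(p[0] - a[0]) <= tolerance and abs(p[1] - a[1]) <= tolerance

    correct = sum(any(matches(p, a) for a in annotated) for p in predicted)
    found = sum(any(matches(p, a) for p in predicted) for a in annotated)
    precision = correct / len(predicted) if predicted else 0.0
    sensitivity = found / len(annotated) if annotated else 0.0
    return precision, sensitivity
```

In this sketch, precision is the fraction of predicted start-stops that match an annotation, and sensitivity is the fraction of annotated start-stops that are recovered. A fixed threshold favors highly expressed genes, which is the limitation the normalized variables mentioned above are meant to address.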