Code Refactoring and Parallelization of a Novel Algorithm for Gene Expression Biclustering

Saturday, October 29, 2011
Hall 1-2 (San Jose Convention Center)
Daniel R. Chee, Computer Science, University of New Mexico, Albuquerque, NM
Susan R. Atlas, PhD, Department of Physics and Astronomy, University of New Mexico, Albuquerque, NM
The past several years have seen a vast increase in the amount of genomic data being generated, particularly sequence and gene expression data. It is now possible to generate detailed datasets that could shed light on the etiology and treatment of human cancer. Extracting that information requires the development of advanced statistical, mathematical, and data mining algorithms. Here we consider biclustering, a technique for identifying coherent patterns within a matrix, in this case a matrix of gene expression values for a set of patients. In this project we use a robust biclustering algorithm previously developed at UNM (Wang et al., 2007). We describe the basic ideas underlying the algorithm, emphasizing its novel features. The algorithm was originally implemented in MATLAB as independent pieces of code that required manual coordination to execute. We merged these pieces into a single unified code capable of running on the UNM supercomputer nano. In addition, we significantly improved the performance of the algorithm by restructuring components to run in parallel across multiple nodes. Compared to the original implementation, the improved code achieves close to linear speedup with respect to the number of processors. The code has been validated on two cohorts of pediatric pre-B acute lymphoblastic leukemia (ALL) patients and is currently being applied to a new cohort of 97 infant leukemia cases. The biclusters identified within this cohort will be analyzed for biological significance using pathway analysis software.
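To make the notion of a "coherent pattern within a matrix" concrete, the following minimal sketch scores a candidate bicluster (a subset of genes and a subset of patients) with the mean squared residue of Cheng and Church. This is a standard coherence measure from the biclustering literature and is used here only as an illustration; it is not the Wang et al. algorithm, whose scoring function the abstract does not specify. The function name, the toy matrix, and its values are all hypothetical.

```python
def mean_squared_residue(matrix, rows, cols):
    """Cheng-Church mean squared residue of the submatrix matrix[rows][cols].

    A score near zero means the selected genes vary in the same additive
    way across the selected patients. Illustrative only: this is NOT the
    scoring used by the algorithm described in the abstract.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(sub), len(sub[0])
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall
            total += residue * residue
    return total / (n_rows * n_cols)


# Toy data (made-up values): rows are genes, columns are patients.
expression = [
    [1.0, 2.0, 9.0, 3.0],
    [2.0, 3.0, 1.0, 4.0],
    [5.0, 6.0, 7.0, 7.0],
    [0.0, 8.0, 2.0, 1.0],
]

# Genes 0-2 follow the same additive pattern across patients 0, 1 and 3.
coherent = mean_squared_residue(expression, rows=[0, 1, 2], cols=[0, 1, 3])
# The full matrix has no such shared pattern.
whole = mean_squared_residue(expression, rows=[0, 1, 2, 3], cols=[0, 1, 2, 3])
print(coherent, whole)
```

In this toy example the selected submatrix scores essentially zero while the full matrix scores much higher, which is the kind of contrast a biclustering method searches for across many candidate gene and patient subsets.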