Master internship project: Statistical learning for the prediction of chromosomic interactions

Abstract:

The identity of a cell is determined by the expression patterns of its genes. Inside the nucleus, the DNA molecule is dynamically folded so that regulatory regions can come into contact. These physical interactions enable transcription factors to bind to DNA, thus controlling the initiation of gene transcription.

The mechanisms of such contacts, and the biological parameters driving them, are still far from fully understood. In the literature, very few projects focus on DNA sequence as a direct predictor variable for those contacts. The objective of this internship is to determine whether sequence alone is enough to predict the distal interactions between regulatory regions, in a statistical learning framework. We propose here methods to build appropriate datasets based on biological experiments, and to extract features using a sequence segmentation approach. Different types of models for supervised classification are trained and compared on those datasets.

With our current methods, the results show that the sequence variables of regulatory regions do not seem to carry enough information to accurately predict DNA folding. However, our feature extraction method revealed that the neighborhood of regulatory regions is more informative for predicting contacts than the regions themselves, and leads to acceptable classification performance.

Keywords:

Chromosomic interactions, machine learning, supervised classification, genomics, feature extraction

Océane Cassan
Postdoc researcher in statistical learning applied to gene regulation

Postdoc researcher in statistical learning and computational biology
