P5: Bioinformatics | inscida2016

PROJECT 5

Collapsing viral next-generation sequencing data

Client: Center for AIDS Research

Led by: Dr Ben Murrell (University of California, San Diego)

Model of an HIV particle. We will be working with sequence data from the env protein, which can be seen on the surface in orange.

Background information

Rapidly evolving viruses, such as HIV, can exist as very diverse populations, even within a single host, with each virus particle having a potentially different genome. The details of these populations can have medical relevance, contributing to our understanding of phenomena such as viral transmission dynamics between hosts, evolution to escape anti-viral therapy, evolution in response to the host immune system, and more.

One state-of-the-art window into these populations is next-generation sequencing, which converts physical DNA into digital read-outs in a very high-throughput manner, listing the individual nucleotides (A, C, G or T) that make up each DNA strand. For our purposes, we will consider data from targeted viral gene sequencing. This proceeds by first isolating the viral particles and then amplifying the genomic region of interest, which makes multiple DNA copies of each initial viral template sequence. The amplified DNA is then fed into the sequencer, which generates the digital sequences.

This is where the trouble begins. First, because of the DNA amplification process, there are many more sequences than there were initial viral particles. Second, there are too many sequence reads for the kinds of analyses that we typically want to conduct. Finally, each sequence read is noisy: it contains sequencing errors, which can be simple substitutions of one nucleotide for another or, more problematically, inserted or deleted nucleotides.
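To illustrate why inserted or deleted nucleotides are more troublesome than substitutions, here is a minimal Python sketch. The sequences are made up for illustration, and the function names `positional_mismatches` and `edit_distance` are hypothetical helpers, not part of any project tooling. A single deletion shifts every downstream position, so a naive position-by-position comparison reports many differences, while an edit distance that allows gaps still counts one error.

```python
def positional_mismatches(a, b):
    """Compare two sequences position by position, with no gaps allowed.

    Any difference in length is counted as additional mismatches.
    """
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + abs(len(a) - len(b))


def edit_distance(a, b):
    """Levenshtein distance: the minimum number of substitutions,
    insertions and deletions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca from a
                curr[j - 1] + 1,           # insert cb into a
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


# Toy template and two noisy reads (illustrative data only).
template = "ACGTACGGTCAT"
read_with_substitution = "ACGTTCGGTCAT"  # A -> T at one position
read_with_deletion = "ACGTCGGTCAT"       # the same A deleted

for name, read in [("substitution", read_with_substitution),
                   ("deletion", read_with_deletion)]:
    print(name,
          "positional:", positional_mismatches(template, read),
          "edit:", edit_distance(template, read))
```

Both reads are one error away from the template, but only the gap-aware distance reflects this for the read with a deletion. This is one reason alignment-based comparisons matter for this kind of data.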

See more background at: http://en.wikipedia.org/wiki/Multiple_sequence_alignment and http://en.wikipedia.org/wiki/Phylogenetics

Project aims and objectives

The problem we will attempt to solve is how to take a large collection of noisy, redundant sequence observations, and collapse these down to a smaller number of sequences, both reducing the data to a manageable size, and averaging away the observation noise. But when the levels of noise vary from one sequence to another, how do we decide which sequences to collapse together, and how do we do so efficiently when we have tens of thousands of sequences to consider?
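As a rough illustration of what "collapsing" could look like, the following Python sketch merges exact duplicates and then greedily assigns each remaining sequence to the first sufficiently similar, more abundant representative. This is only one simple baseline under stated assumptions, not the method the project will use: the function name `collapse_reads`, the threshold `max_dist`, and the toy reads are all hypothetical, and the most abundant sequence stands in for a proper consensus.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def collapse_reads(reads, max_dist=1):
    """Greedy, abundance-sorted collapsing (a simplifying assumption).

    Returns a list of (representative_sequence, total_read_count) pairs.
    """
    counts = Counter(reads)  # step 1: merge exact duplicates
    clusters = []            # each entry is [representative, total_count]
    # Visit unique sequences from most to least abundant, on the
    # assumption that abundant sequences are more likely to be error-free.
    for seq, n in counts.most_common():
        for cluster in clusters:
            if edit_distance(seq, cluster[0]) <= max_dist:
                cluster[1] += n  # absorb into an existing cluster
                break
        else:
            clusters.append([seq, n])  # start a new cluster
    return [(rep, total) for rep, total in clusters]


# Toy reads: two underlying variants plus a few noisy copies.
reads = (["ACGTACGT"] * 5 + ["ACGTACGA"] + ["ACGACGT"]
         + ["TTGTACCT"] * 3 + ["TTGTACCA"])
collapsed = collapse_reads(reads, max_dist=1)
print(collapsed)
```

Note that this sketch compares every unique sequence against every representative, and a single fixed threshold ignores the fact that noise levels vary between sequences. Both limitations become serious with tens of thousands of reads, which is exactly the difficulty raised above.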

About the dataset (and skills required)

This is as much a computational challenge as a statistical one, and the interested student should have strong programming abilities. Given the time limitation, we will attempt to exploit existing bioinformatics tools as much as possible to solve some of the intermediate problems, but some skillful programming will certainly be required. When applying, please list your experience with programming languages, and be honest about your fluency in each. Also list any general computing knowledge, such as familiarity with Linux, especially in a cluster environment. None of this is absolutely required, but it would be useful to know in advance. There will also be room for a purely statistical role on the team, so do not be discouraged if you are an interested statistician with limited programming abilities.

Applicants should consider getting familiar with http://julialang.org, as well as reading the following:

https://en.wikipedia.org/wiki/Cluster_analysis

https://en.wikipedia.org/wiki/Sequence_analysis

https://cs.brown.edu/courses/csci1810/bioprimer.pdf

Intended outcomes and real-world relevance

This project tackles a challenging problem and we do not expect to solve it entirely during the week. What we hope to do is:

Quantify the advantages and limitations of existing methods when applied to sequence collapsing;
Begin to develop and apply tools for data reduction in the context of the kinds of sequencing data described above;
Gain a better understanding of transmission dynamics between hosts, which, as mentioned, has important implications for understanding how HIV evolves in response to anti-viral therapy and to the host immune system;
Provide a thorough introduction to this area and stimulate interest in follow-up work, for which there is considerable scope.

InSciDa

WORKSHOP ON STATISTICS AND DATA SCIENCE IN INDUSTRY

18 - 23 January 2016