Consensus and Profile

🤔 Problem link

The Problem

A matrix is a rectangular table of values divided into rows and columns. An m×n matrix has m rows and n columns. Given a matrix A, we write Ai,j. to indicate the value found at the intersection of row i and column j.

Say that we have a collection of DNA strings, all having the same length n. Their profile matrix is a 4×n matrix P in which P1, j represents the number of times that 'A' occurs in the jth position of one of the strings, P2,j represents the number of times that C occurs in the jth position, and so on (see below).

A consensus string c is a string of length n formed from our collection by taking the most common symbol at each position; the jth symbol of c therefore corresponds to the symbol having the maximum value in the j-th column of the profile matrix. Of course, there may be more than one most common symbol, leading to multiple possible consensus strings.

DNA Strings
A T C C A G C T 
G G G C A A C T 
A T G G A T C T 
A A G C A A C C 
T T G G A A C T 
A T G C C A T T 
A T G G C A C T 
Profile
A   5 1 0 0 5 5 0 0
C   0 0 1 4 2 0 6 1
G   1 1 6 3 0 1 0 0
T   1 5 0 0 0 1 1 6
Consensus
A T G C A A C T
Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.

Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)

Sample Dataset
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
> Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT
Sample Output
ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6

The first thing we will need to do is read in the input fasta. In this case, we will not be reading in an actual fasta file, but a set of strings in fasta format. If we were reading in an actual fasta file, we could use the FASTX.jl package to help us with that.

Since the task required here is something that was already demonstrated in the GC-content tutorial, we can borrow the function from that tutorial.

fake_file = IOBuffer("""
    >Rosalind_1
    ATCCAGCT
    >Rosalind_2
    GGGCAACT
    >Rosalind_3
    ATGGATCT
    >Rosalind_4
    AAGCAACC
    >Rosalind_5
    TTGGAACT
    >Rosalind_6
    ATGCCATT
    >Rosalind_7
    ATGGCACT
    """
)

function parse_fasta(buffer)
    records = [] # this is a Vector of type `Any`
    record_name = ""
    sequence = ""
    for line in eachline(buffer)
        if startswith(line, ">")
            !isempty(record_name) && push!(records, (record_name, sequence))
            record_name = lstrip(line, '>')
            sequence = ""
        else
            sequence *= line
        end
    end
    push!(records, (record_name, sequence))
    return records
end

records = parse_fasta(fake_file)

Once the fasta is read in, we can iterate over each sequence/record and store its nucleotide sequence in a data matrix.

From there, we can generate the profile matrix. We'll need to sum the number of times each nucleotide appears at a particular column of the data matrix.

Then, we can identify the most common nucleotide at each column of the data matrix, which represent each index of the consensus string. After doing this for all columns of the data matrix, we can generate the consensus string.

function consensus(fasta_string)
    # extract strings from fasta
    records = parse_fasta(fasta_string)

    # make a vector of sequence strings
    data_vector = last.(records)

    # convert data_vector to matrix where each column is a character position and each row is a string
    data_matrix = reduce(vcat, permutedims.(collect.(data_vector)))

    # make profile matrix: (num_strings × n) of Chars
    # profile is 4×n (each row corresponds to A,C,G,T)
    nucs = ['A', 'C', 'G', 'T']
    profile = reduce(vcat, (sum(data_matrix .== nuc, dims=1) for nuc in nucs))

    # compute the consensus string
    consensus_string = join([nucs[argmax(@view profile[:, j])] for j in 1:size(profile, 2)])
    return(consensus_string)
end

consensus(fake_file)

As mentioned in the problem description above, it is possible that there can be multiple consensus strings, as some nucleotides may appear the same number of times in each column of the data matrix.

If this is the case, the function we are using (argmax) returns the nucleotide with the most occurrences that it first encounters.

The way our function is written, we first scan for 'A', 'C', then 'G' and 'T', so the final consensus string will be biased towards more A's, then C's, G's and T's. This is simply based on which nucleotide counts it will encounter first in the profile matrix.

In the example below, there are equal number of sequences that are all A's and G's, so the consensus string could be either AAAAAAAA or GGGGGGGG. However, because our solution scans for A first, the consensus string returned will be AAAAAAAA.

fake_file2 = IOBuffer("""
    >Rosalind_1
    AAAAAAAA
    >Rosalind_2
    AAAAAAAA
    >Rosalind_3
    AAAAAAAA
    >Rosalind_4
    GGGGGGGG
    >Rosalind_5
    GGGGGGGG
    >Rosalind_6
    GGGGGGGG
    """
)

consensus(fake_file2)