🧮 Problem 5: Computing GC Content

🤔 Problem link

The Problem

The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalindxxxx", where "xxxx" denotes a four-digit code between 0000 and 9999._Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated.

Sample Dataset
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT
Sample Output
Rosalind_0808
60.919540

There are really 3 parts of this problem.

Parse the input, which is in a common biological format, FASTA
Calculate the GC content
Iterate through the input, keeping track of the largest GC content.

Let's start with part 2, since it's quite similar to something we solved in Problem 1: Counting DNA Nucleotides.

Calculating GC content

Just as when we were counting the frequency of each base, here, we can calculate the GC content by simply counting the number of G's or C's and dividing that number by the total length of the sequence.

As a reminder, the x-> do stuff notation is an "anonymous function", which we're using as a "predicate" for the count() function. That is, count() will return the sum of elements where the predicate returns true.

my_seq = "AACCGGTTCT"

function myGC(seq)
    gcs = count(base-> base in ('G', 'C'), seq)
    return gcs / length(seq)
end

myGC(my_seq)  # 0.4

A few things to note about this - There's no validation of the input, so this function will happily count the capital G's and C's of any string:

myGC("Goodbye, Cruel World!")  # 0.1904...

But for now, this will do.

BioSequences method

As with many of these problems, there is built-in functionality in the BioSequences.jl package. But before we get there, let's take a look at another problem with being permissive in our type signature up above:

using BioSequences

bioseq = LongDNA{2}(my_seq)
myGC(bioseq)  # 0.0 -- wrong!

What's going on here? The definition above tries to count the Chars 'G' and 'C', but when we iterate a LongDNA, we get back nucleotide types. So none of them match, providing a count of \(0\). We could modify the original function to look also for DNA_C, DNA_G, RNA_C, and RNA_G, or we can use the built-in predicate function isGC() from BioSequences, which returns true if it encounters any G or C nucleotide.

function myGC(seq::LongNuc) # this type matches LongDNA or LongRNA
    gcs = count(isGC, seq)
    return gcs / length(seq)
end

myGC(bioseq)  # 0.4

Or, even more simply, there's already a gc_content() function in BioSequences.jl:

gc_content(bioseq)  # 0.4

Parsing the file

We need to get this part done next. We started talking about dealing with files in Problem 4, but let's go into a bit more detail.

When you want speed, there are a lot of tricks to directly parse the bytes of the file one-by-one. In fact, BioJulia developers created the package Automa.jl to create fast parsers, and Jakob Nissen, one of the co-administrators of BioJulia, is currently working on modifications to Julia Base to make dealing with files and byte streams even faster.

But for now, let's just do the easy thing, and parse the file line-by-line using the eachline() function. This function returns an iterator over the lines of the file, one at a time, that we can use in a for or while loop.

To begin with, let's write a function that just goes through the whole file, putting each sequence record into a Vector.

function parse_fasta(buffer)
    records = [] # this is a Vector of type `Any`
    record_name = ""
    sequence = ""
    for line in eachline(buffer)
        if startswith(line, ">")
            !isempty(record_name) && push!(records, (record_name, sequence))
            record_name = lstrip(line, '>')
            sequence = ""
        else
            sequence *= line
        end
    end
    push!(records, (record_name, sequence))
    return records
end

fake_file = IOBuffer("""
    >Rosalind_6404
    CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
    TCCCACTAATAATTCTGAGG
    >Rosalind_5959
    CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
    ATATCCATTTGTCAGCAGACACGC
    >Rosalind_0808
    CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
    TGGGAACCTGCGGGCAGTAGGTGGAAT
    """
)

records = parse_fasta(fake_file)

So we start with an empty Vector called records, and empty Strings representing the record name and sequence. we need to assign these variables outside the "scope" of the loop, otherwise they won't persist outside of it.

Then, we go through each line of the file, checking if it starts with ">". If it does, we push the current record into the records vector, and reset the record_name and sequence variables. If it doesn't, we append the line to the sequence variable (in julia, we combine strings with the * operator).

The isempty(record_name) check is used for the first record, to avoid pushing the initial empty strings, and the final push!(records, (record_name, sequence)) is to deal with the final sequence, since the loop will terminate without encountering a final >.

Tip: Unlike a String or other data structure, reading through an IOBuffer moves the "read head" — there's a pointer to the end of the buffer after we've read through it. If you need to use it again, reset the IOBuffer using the seekstart() function:
seekstart(fake_file)
records = parse_fasta(fake_file)

At this stage, we could take our records Vector, find the maximum gc entry, and then get that record. As with many problems, there are a lot of ways to do this, but we'll try the findmax function. This function takes a function as the first argument, and returns the "index" of the maximum value after using that function on the entry.

The "index" is an integer that we can use to access a particular entry in a vector or other container. In julia, the syntax for indexing is container[index]. And notice that, since each of our records are themselves containers (Tuples), we can access the first element of each record (the identifier) using record[1], and the second element of each record (the sequence) using record[2].

Tip: Julia uses one-based indexing, meaning that the first element of a container is at index 1, not 0 as in some other languages like Python.

i = findmax(record -> myGC(record[2]), records)[2]
top_gc = records[i][1] # the first index gets the record, and the `[1]` gets the identifier

Tip: do-block syntax

Many functions in Julia take a function as the first argument, and sometimes the -> anonymous function syntax is a bit annoying to use. Instead, one can use the do block syntax. This is a way to write a more complicated anonymous function, eg one with multiple lines.

The structure of it is:
some_func(iterator) do args # args is/are the argument(s) to the function
# function body
end
In other words, our findmax function above could have been written as:
i = findmax(records) do record
myGC(record[2])
end

Don't store the whole file in memory

If you end up with a really large file, storing every record in memory, and then iterating over it a second time to calculate the GC content may not be the best approach. Instead, you can use a streaming approach, where you read the file line by line, and calculate the GC content on the fly. This way, you only need to keep one record in memory at a time, and you can find the record with the highest GC content without having to store all records in memory.

function streaming_gc(buffer)
    max_gc = 0.0
    max_id = ""

    current_id = ""
    current_seq = ""
    for line in eachline(buffer)
        if startswith(line, ">")
            if length(current_seq) > 0
                current_gc = myGC(current_seq)
                if current_gc > max_gc
                    max_gc = current_gc
                    max_id = current_id
                end
            end
            current_id = lstrip(line, '>')
            current_seq = ""
        else
            current_seq *= line
        end
    end
    if length(current_seq) > 0
        current_gc = myGC(current_seq)
        if current_gc > max_gc
            max_gc = current_gc
            max_id = current_id
        end
    end
    return max_id
end

seekstart(fake_file)
streaming_gc(fake_file)  # "Rosalind_0808"

BioSequences method

As you might imagine, parsing FASTA files is something we do all the time, so there's a package for that. In julia, that package is called FASTX.jl (there's also a "FASTQ" file format).

This package provides FASTA "readers" and "writers", and return iterators that validate records and return data structures containing the sequence identifiers and BioSequences.

using FASTX

function fastx_gc(buffer)
    max_gc = 0.0
    max_id = ""

    FASTAReader(buffer) do reader # see the "tip" above about do-blocks
        for record in reader
            gc = gc_content(sequence(LongDNA{2}, record))
            if gc > max_gc
                max_gc = gc
                max_id = identifier(record)
            end
        end
    end
    return max_id, max_gc
end

seekstart(fake_file)
fastx_gc(fake_file)  # ("Rosalind_0808", 0.6091954...)

Reading the file

So far, we've been dealing with an IOBuffer, which is what we get when we open a file. But to actually solve the rosalind problem, we need to open a file on the disk.

In julia, we do this with the open function, which takes a file path, and optionally a function to call with the file handle. If we don't provide a function, we get an IOBuffer just as we've been using, but we need to be sure to close it when we're done.

function file_gc(filename)
    max_id, max_gc = open(fastx_gc, filename)
    println(max_id)
    println(100 * max_gc)
end

file_gc("rosalind_gc.txt")