🧮 Computing GC content

The Problem

The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.

Sample Dataset

>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT

Sample Output

Rosalind_0808
60.919540

There are are really 3 parts of this problem.

Parse the input, which is in a common biological format, FASTA
Calculate the GC content
Iterate through the input, keeping track of the largest GC content.

Let's start with part 2, since it's quite similar to something we solved in the very first problem.

Calculating GC content

Just as when we were counting the frequency of each base, here, we can calculate the GC content by simply counting the number of G's or C's and dividing that number by the total length of the sequence.

As a reminder, the x-> do stuff notation is an "anonymous function", which we're using as a "predicate" for the count() function. That is, count() will return the sum of elements where the predicate returns true.

julia

my_seq = "AACCGGTTCT"

function myGC(seq)
    gcs = count(base-> base in ('G', 'C'), seq)
    return gcs / length(seq)
end

myGC(my_seq)

0.5

A few things to note about this - There's no validation of the input, so this function will happily count the capital G's and C's of any string:

julia

myGC("Goodbye, Cruel World!")

0.09523809523809523

But for now, this will do.

BioSequences method

As with many of these problems, there is built-in functionality in the BioSequences.jl package. But before we get there, let's take a look at another problem with being permissive in our type signature up above:

julia

using BioSequences

bioseq = LongDNA{2}(my_seq)
myGC(bioseq)

0.0

What's going on here? The definition above tries to count the Chars 'G' and 'C', but when we iterate a LongDNA, we get back nucleotide types. So none of them match, providing a count of $0$ . We could modify the original function to look also for DNA_C, DNA_G, RNA_C, and RNA_G, or we can use the built-in predicate function isGC() from BioSequences, which returns true if it encounters any G or C nucleotide.

julia

function myGC(seq::LongNuc) # this type matches LongDNA or LongRNA
    gcs = count(isGC, bioseq)
    return gcs / length(seq)
end

myGC(bioseq)

0.5

Or, even more simply, there's already a gc_content() function in BioSequences.jl:

julia

gc_content(bioseq)

0.5

Parsing the file

We need to get this part done next. We started talking about dealing with files in the last problem, but let's go into a bit more detail.

When you want speed, there are a lot of tricks to directly parse the bytes of the file one-by-one. In fact, BioJulia developers created the package Automa.jl to create fast parsers, and Jakob Nissen, one of the co-administrators of BioJulia, is currently working on modifications to Julia Base to make dealing with files and byte streams even faster.

But for now, let's just do the easy thing, and parse the file line-by-line using the eachline() function. This function returns an iterator over the lines of the file, one at a time, that we can use in a for or while loop.

To begin with, let's write a function that just goes through the whole file, putting each sequence record into a Vector.

julia

function parse_fasta(buffer)
    records = [] # this is a Vector of type `Any`
    record_name = ""
    sequence = ""
    for line in eachline(buffer)
        if startswith(line, ">")
            !isempty(record_name) && push!(records, (record_name, sequence))
            record_name = lstrip(line, '>')
            sequence = ""
        else
            sequence *= line
        end
    end
    push!(records, (record_name, sequence))
    return records
end

fake_file = IOBuffer("""
    >Rosalind_6404
    CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
    TCCCACTAATAATTCTGAGG
    >Rosalind_5959
    CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
    ATATCCATTTGTCAGCAGACACGC
    >Rosalind_0808
    CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
    TGGGAACCTGCGGGCAGTAGGTGGAAT
    """
)

records = parse_fasta(fake_file)

3-element Vector{Any}:
 ("Rosalind_6404", "CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG")
 ("Rosalind_5959", "CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC")
 ("Rosalind_0808", "CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT")

So we start with an empty Vector called records, and empty Strings representing the record name and sequence. we need to assign these variables outside the "scope" of the loop, otherwise they won't persist outside of it. We'll talk more about "scope" another time.

Then, we go through each line of the file, checking if it starts with ">". If it does, we push the current record into the records vector, and reset the record_name and sequence variables. If it doesn't, we append the line to the sequence variable (in julia, we combine strings with the * operator).

The isempty(record_name) check in the 7th line of the function is used for the first record, to avoid pushing the initial empty strings, and the final push!(records, (record_name, sequence)) is to deal with the final sequence, since the loop will terminate without encountering a final >.

Our 'fake file'

To show off how this works, we're using an IOBuffer, which in most cases works pretty similar to an open file buffer. One thing to keep in mind is that, unlike a String or other data structure, the reading through of an IOBuffer moves the "read head" - in other words, there's a pointer to the end of the file after we've read through it.

So if we run our function again:

julia

records = parse_fasta(fake_file)

1-element Vector{Any}:
 ("", "")

We can see that the records vector only contains a Tuple with empty strings, since the loop never runs - eachline(file) here doesn't have any entries.

If you need to use it again, you can reset the IOBuffer using the seekstart() function:

julia

seekstart(fake_file)
records = parse_fasta(fake_file)

3-element Vector{Any}:
 ("Rosalind_6404", "CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG")
 ("Rosalind_5959", "CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC")
 ("Rosalind_0808", "CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT")

At this stage, we could take our records Vector, find the maximum gc entry, and then get that record. As with many problems, there are a lot of ways to do this, but we'll try the findmax function. This function takes a function as the first argument, and returns the "index" of the maximum value after using that function on the entry.

The "index" is an integer that we can use to access a particular entry in a vector or other container. In julia, the syntax for indexing is container[index]. And notice that, since each of our records are themselves containers (Tuples), we can access the first element of each record (the identifier) using record[1], and the second element of each record (the sequence)using record[2].

One-based indexing

Julia uses one-based indexing, meaning that the first element of a container is at index 1, not 0 as in some other languages like Python.

julia

i = findmax(record -> myGC(record[2]), records)[2]

julia

top_gc = records[i][1] # the first index gets the record, and the `[1]` gets the identifier

"Rosalind_0808"

do-block syntax

Many functions in Julia take a function as the first argument, and sometimes the -> anonymous function syntax is a bit annoying to use. Instead, one can use the do block syntax. This is a way to write a more complicated anonymous function, eg one with multiple lines.

The structure of it is:

julia

some_func(iterator) do args # args is/are the argument(s) to the function
    # function body
end

In other words, our findmax function above could have been written as:

julia

i = findmax(records) do record
    gc_content(record[2])
end

Don't store the whole file in memory

If you end up with a really large file, storing every record in memory, and then iterating over it a second time to calculate the GC content may not be the best approach. Instead, you can use a streaming approach, where you read the file line by line, and calculate the GC content on the fly. This way, you only need to keep one record in memory at a time, and you can find the record with the highest GC content without having to store all records in memory.

julia

function streaming_gc(buffer)
    max_gc = 0.0
    max_id = ""

    current_id = ""
    current_seq = ""
    for line in eachline(buffer)
        if startswith(line, ">")
            if length(current_seq) > 0
                current_gc = myGC(current_seq)
                if current_gc > max_gc
                    max_gc = current_gc
                    max_id = current_id
                end
            end
            current_id = lstrip(line, '>')
            current_seq = ""
        else
            current_seq *= line
        end
    end
    if length(current_seq) > 0
        current_gc = myGC(current_seq)
        if current_gc > max_gc
            max_gc = current_gc
            max_id = current_id
        end
    end
    return max_id
end

seekstart(fake_file)
streaming_gc(fake_file)

"Rosalind_0808"

BioSequences method

As you might imagine, parsing FASTA files is something we do all the time, so there's a package for that. In julia, that package is called FASTX.jl (there's also a "FASTQ" file format).

This package provides FASTA "readers" and "writers", and return iterators that validate recores and return data structures containing the sequeince identifiers and BioSeqeunces.

julia

using FASTX

function fastx_gc(buffer)
    max_gc = 0.0
    max_id = ""

    FASTAReader(buffer) do reader # see the "tip" above about do-blocks
        for record in reader
            gc = gc_content(sequence(LongDNA{2}, record))
            if gc > max_gc
                max_gc = gc
                max_id = identifier(record)
            end
        end
    end
    return max_id, max_gc
end

seekstart(fake_file)
fastx_gc(fake_file)

("Rosalind_0808", 0.6091954022988506)

Reading the file

So far, we've been dealing with an IOBuffer, which is what we get when we open a file. But to actually solve the rosalind problem, we need to open a file on the disk.

In julia, we do this with the open function, which takes a file path, and optionally a function to call with the file handle. If we don't provide a function, we get an IOBuffer just as we've been using, but we need to be sure to close it when we're done.

julia

function file_gc(filename)
    max_id, max_gc = open(fastx_gc, filename)
    println(max_id)
    println(100 * max_gc)
end

file_gc("problem_inputs/rosalind_gc.txt")

Rosalind_5466
51.991150442477874

🧮 Computing GC content ​

Calculating GC content ​

BioSequences method ​

Parsing the file ​

Don't store the whole file in memory ​

BioSequences method ​

Reading the file ​

🧮 Computing GC content

Calculating GC content

BioSequences method

Parsing the file

Don't store the whole file in memory

BioSequences method

Reading the file