🧮 Problem 5: Computing GC Content
🤔 Problem link
The Problem
The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.
In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalindxxxx", where "xxxx" denotes a four-digit code between 0000 and 9999._Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).
Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated.
Sample Dataset
>Rosalind_6404 CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC TCCCACTAATAATTCTGAGG >Rosalind_5959 CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT ATATCCATTTGTCAGCAGACACGC >Rosalind_0808 CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC TGGGAACCTGCGGGCAGTAGGTGGAATSample Output
Rosalind_0808 60.919540
There are really 3 parts of this problem.
- Parse the input, which is in a common biological format, FASTA
- Calculate the GC content
- Iterate through the input, keeping track of the largest GC content.
Let's start with part 2, since it's quite similar to something we solved in Problem 1: Counting DNA Nucleotides.
Calculating GC content
Just as when we were counting the frequency of each base, here, we can calculate the GC content by simply counting the number of G's or C's and dividing that number by the total length of the sequence.
As a reminder, the x-> do stuff notation is an "anonymous function",
which we're using as a "predicate" for the count() function.
That is, count() will return the sum of elements where the predicate
returns true.
my_seq = "AACCGGTTCT"
function myGC(seq)
gcs = count(base-> base in ('G', 'C'), seq)
return gcs / length(seq)
end
myGC(my_seq) # 0.4
A few things to note about this - There's no validation of the input, so this function will happily count the capital G's and C's of any string:
myGC("Goodbye, Cruel World!") # 0.1904...
But for now, this will do.
BioSequences method
As with many of these problems,
there is built-in functionality in the BioSequences.jl package.
But before we get there,
let's take a look at another problem with being permissive
in our type signature up above:
using BioSequences
bioseq = LongDNA{2}(my_seq)
myGC(bioseq) # 0.0 -- wrong!
What's going on here?
The definition above tries to count the Chars 'G' and 'C',
but when we iterate a LongDNA, we get back nucleotide types.
So none of them match, providing a count of \(0\).
We could modify the original function to look also for DNA_C,
DNA_G, RNA_C, and RNA_G,
or we can use the built-in predicate function isGC() from BioSequences,
which returns true if it encounters any G or C nucleotide.
function myGC(seq::LongNuc) # this type matches LongDNA or LongRNA
gcs = count(isGC, seq)
return gcs / length(seq)
end
myGC(bioseq) # 0.4
Or, even more simply, there's already a gc_content() function in BioSequences.jl:
gc_content(bioseq) # 0.4
Parsing the file
We need to get this part done next. We started talking about dealing with files in Problem 4, but let's go into a bit more detail.
When you want speed, there are a lot of tricks
to directly parse the bytes of the file one-by-one.
In fact, BioJulia developers created the package Automa.jl
to create fast parsers, and Jakob Nissen,
one of the co-administrators of BioJulia,
is currently working on modifications to Julia Base
to make dealing with files and byte streams even faster.
But for now, let's just do the easy thing,
and parse the file line-by-line using the eachline() function.
This function returns an iterator over the lines of the file,
one at a time,
that we can use in a for or while loop.
To begin with, let's write a function
that just goes through the whole file,
putting each sequence record into a Vector.
function parse_fasta(buffer)
records = [] # this is a Vector of type `Any`
record_name = ""
sequence = ""
for line in eachline(buffer)
if startswith(line, ">")
!isempty(record_name) && push!(records, (record_name, sequence))
record_name = lstrip(line, '>')
sequence = ""
else
sequence *= line
end
end
push!(records, (record_name, sequence))
return records
end
fake_file = IOBuffer("""
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT
"""
)
records = parse_fasta(fake_file)
So we start with an empty Vector called records,
and empty Strings representing the record name and sequence.
we need to assign these variables outside the "scope" of the loop,
otherwise they won't persist outside of it.
Then, we go through each line of the file,
checking if it starts with ">".
If it does, we push the current record into the records vector,
and reset the record_name and sequence variables.
If it doesn't, we append the line to the sequence variable
(in julia, we combine strings with the * operator).
The isempty(record_name) check is used
for the first record, to avoid pushing the initial empty strings,
and the final push!(records, (record_name, sequence))
is to deal with the final sequence,
since the loop will terminate without encountering a final >.
Tip: Unlike a
Stringor other data structure, reading through anIOBuffermoves the "read head" — there's a pointer to the end of the buffer after we've read through it. If you need to use it again, reset theIOBufferusing theseekstart()function:seekstart(fake_file) records = parse_fasta(fake_file)
At this stage, we could take our records Vector,
find the maximum gc entry, and then get that record.
As with many problems, there are a lot of ways to do this,
but we'll try the findmax function.
This function takes a function as the first argument,
and returns the "index" of the maximum value after using that function on the entry.
The "index" is an integer that we can use to access a particular entry
in a vector or other container.
In julia, the syntax for indexing is container[index].
And notice that, since each of our records are themselves containers (Tuples),
we can access the first element of each record (the identifier) using record[1],
and the second element of each record (the sequence) using record[2].
Tip: Julia uses one-based indexing, meaning that the first element of a container is at index 1, not 0 as in some other languages like Python.
i = findmax(record -> myGC(record[2]), records)[2]
top_gc = records[i][1] # the first index gets the record, and the `[1]` gets the identifier
Tip: do-block syntax
Many functions in Julia take a function as the first argument, and sometimes the
->anonymous function syntax is a bit annoying to use. Instead, one can use thedoblock syntax. This is a way to write a more complicated anonymous function, eg one with multiple lines.The structure of it is:
some_func(iterator) do args # args is/are the argument(s) to the function # function body endIn other words, our
findmaxfunction above could have been written as:i = findmax(records) do record myGC(record[2]) end
Don't store the whole file in memory
If you end up with a really large file, storing every record in memory, and then iterating over it a second time to calculate the GC content may not be the best approach. Instead, you can use a streaming approach, where you read the file line by line, and calculate the GC content on the fly. This way, you only need to keep one record in memory at a time, and you can find the record with the highest GC content without having to store all records in memory.
function streaming_gc(buffer)
max_gc = 0.0
max_id = ""
current_id = ""
current_seq = ""
for line in eachline(buffer)
if startswith(line, ">")
if length(current_seq) > 0
current_gc = myGC(current_seq)
if current_gc > max_gc
max_gc = current_gc
max_id = current_id
end
end
current_id = lstrip(line, '>')
current_seq = ""
else
current_seq *= line
end
end
if length(current_seq) > 0
current_gc = myGC(current_seq)
if current_gc > max_gc
max_gc = current_gc
max_id = current_id
end
end
return max_id
end
seekstart(fake_file)
streaming_gc(fake_file) # "Rosalind_0808"
BioSequences method
As you might imagine,
parsing FASTA files is something we do all the time,
so there's a package for that.
In julia, that package is called FASTX.jl
(there's also a "FASTQ" file format).
This package provides FASTA "readers" and "writers",
and return iterators that validate records and return
data structures containing the sequence identifiers and BioSequences.
using FASTX
function fastx_gc(buffer)
max_gc = 0.0
max_id = ""
FASTAReader(buffer) do reader # see the "tip" above about do-blocks
for record in reader
gc = gc_content(sequence(LongDNA{2}, record))
if gc > max_gc
max_gc = gc
max_id = identifier(record)
end
end
end
return max_id, max_gc
end
seekstart(fake_file)
fastx_gc(fake_file) # ("Rosalind_0808", 0.6091954...)
Reading the file
So far, we've been dealing with an IOBuffer, which is what we get when we open a file. But to actually solve the rosalind problem, we need to open a file on the disk.
In julia, we do this with the open function,
which takes a file path, and optionally a function to call with the file handle.
If we don't provide a function, we get an IOBuffer just as we've been using,
but we need to be sure to close it when we're done.
function file_gc(filename)
max_id, max_gc = open(fastx_gc, filename)
println(max_id)
println(100 * max_gc)
end
file_gc("rosalind_gc.txt")