🧮 Computing GC content
The Problem
The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.
In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.
Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).
Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.
Sample Dataset
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAATSample Output
Rosalind_0808
60.919540There are are really 3 parts of this problem.
Parse the input, which is in a common biological format, FASTA
Calculate the GC content
Iterate through the input, keeping track of the largest GC content.
Let's start with part 2, since it's quite similar to something we solved in the very first problem.
Calculating GC content
Just as when we were counting the frequency of each base, here, we can calculate the GC content by simply counting the number of G's or C's and dividing that number by the total length of the sequence.
As a reminder, the x-> do stuff notation is an "anonymous function", which we're using as a "predicate" for the count() function. That is, count() will return the sum of elements where the predicate returns true.
my_seq = "AACCGGTTCT"
function myGC(seq)
gcs = count(base-> base in ('G', 'C'), seq)
return gcs / length(seq)
end
myGC(my_seq)0.5A few things to note about this - There's no validation of the input, so this function will happily count the capital G's and C's of any string:
myGC("Goodbye, Cruel World!")0.09523809523809523But for now, this will do.
BioSequences method
As with many of these problems, there is built-in functionality in the BioSequences.jl package. But before we get there, let's take a look at another problem with being permissive in our type signature up above:
using BioSequences
bioseq = LongDNA{2}(my_seq)
myGC(bioseq)0.0What's going on here? The definition above tries to count the Chars 'G' and 'C', but when we iterate a LongDNA, we get back nucleotide types. So none of them match, providing a count of DNA_C, DNA_G, RNA_C, and RNA_G, or we can use the built-in predicate function isGC() from BioSequences, which returns true if it encounters any G or C nucleotide.
function myGC(seq::LongNuc) # this type matches LongDNA or LongRNA
gcs = count(isGC, bioseq)
return gcs / length(seq)
end
myGC(bioseq)0.5Or, even more simply, there's already a gc_content() function in BioSequences.jl:
gc_content(bioseq)0.5Parsing the file
We need to get this part done next. We started talking about dealing with files in the last problem, but let's go into a bit more detail.
When you want speed, there are a lot of tricks to directly parse the bytes of the file one-by-one. In fact, BioJulia developers created the package Automa.jl to create fast parsers, and Jakob Nissen, one of the co-administrators of BioJulia, is currently working on modifications to Julia Base to make dealing with files and byte streams even faster.
But for now, let's just do the easy thing, and parse the file line-by-line using the eachline() function. This function returns an iterator over the lines of the file, one at a time, that we can use in a for or while loop.
To begin with, let's write a function that just goes through the whole file, putting each sequence record into a Vector.
function parse_fasta(buffer)
records = [] # this is a Vector of type `Any`
record_name = ""
sequence = ""
for line in eachline(buffer)
if startswith(line, ">")
!isempty(record_name) && push!(records, (record_name, sequence))
record_name = lstrip(line, '>')
sequence = ""
else
sequence *= line
end
end
push!(records, (record_name, sequence))
return records
end
fake_file = IOBuffer("""
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT
"""
)
records = parse_fasta(fake_file)3-element Vector{Any}:
("Rosalind_6404", "CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG")
("Rosalind_5959", "CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC")
("Rosalind_0808", "CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT")So we start with an empty Vector called records, and empty Strings representing the record name and sequence. we need to assign these variables outside the "scope" of the loop, otherwise they won't persist outside of it. We'll talk more about "scope" another time.
Then, we go through each line of the file, checking if it starts with ">". If it does, we push the current record into the records vector, and reset the record_name and sequence variables. If it doesn't, we append the line to the sequence variable (in julia, we combine strings with the * operator).
The isempty(record_name) check in the 7th line of the function is used for the first record, to avoid pushing the initial empty strings, and the final push!(records, (record_name, sequence)) is to deal with the final sequence, since the loop will terminate without encountering a final >.
Our 'fake file'
To show off how this works, we're using an IOBuffer, which in most cases works pretty similar to an open file buffer. One thing to keep in mind is that, unlike a String or other data structure, the reading through of an IOBuffer moves the "read head" - in other words, there's a pointer to the end of the file after we've read through it.
So if we run our function again:
records = parse_fasta(fake_file)1-element Vector{Any}:
("", "")We can see that the records vector only contains a Tuple with empty strings, since the loop never runs - eachline(file) here doesn't have any entries.
If you need to use it again, you can reset the IOBuffer using the seekstart() function:
seekstart(fake_file)
records = parse_fasta(fake_file)3-element Vector{Any}:
("Rosalind_6404", "CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG")
("Rosalind_5959", "CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC")
("Rosalind_0808", "CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT")At this stage, we could take our records Vector, find the maximum gc entry, and then get that record. As with many problems, there are a lot of ways to do this, but we'll try the findmax function. This function takes a function as the first argument, and returns the "index" of the maximum value after using that function on the entry.
The "index" is an integer that we can use to access a particular entry in a vector or other container. In julia, the syntax for indexing is container[index]. And notice that, since each of our records are themselves containers (Tuples), we can access the first element of each record (the identifier) using record[1], and the second element of each record (the sequence)using record[2].
One-based indexing
Julia uses one-based indexing, meaning that the first element of a container is at index 1, not 0 as in some other languages like Python.
i = findmax(record -> myGC(record[2]), records)[2]3top_gc = records[i][1] # the first index gets the record, and the `[1]` gets the identifier"Rosalind_0808"do-block syntax
Many functions in Julia take a function as the first argument, and sometimes the -> anonymous function syntax is a bit annoying to use. Instead, one can use the do block syntax. This is a way to write a more complicated anonymous function, eg one with multiple lines.
The structure of it is:
some_func(iterator) do args # args is/are the argument(s) to the function
# function body
endIn other words, our findmax function above could have been written as:
i = findmax(records) do record
gc_content(record[2])
endDon't store the whole file in memory
If you end up with a really large file, storing every record in memory, and then iterating over it a second time to calculate the GC content may not be the best approach. Instead, you can use a streaming approach, where you read the file line by line, and calculate the GC content on the fly. This way, you only need to keep one record in memory at a time, and you can find the record with the highest GC content without having to store all records in memory.
function streaming_gc(buffer)
max_gc = 0.0
max_id = ""
current_id = ""
current_seq = ""
for line in eachline(buffer)
if startswith(line, ">")
if length(current_seq) > 0
current_gc = myGC(current_seq)
if current_gc > max_gc
max_gc = current_gc
max_id = current_id
end
end
current_id = lstrip(line, '>')
current_seq = ""
else
current_seq *= line
end
end
if length(current_seq) > 0
current_gc = myGC(current_seq)
if current_gc > max_gc
max_gc = current_gc
max_id = current_id
end
end
return max_id
end
seekstart(fake_file)
streaming_gc(fake_file)"Rosalind_0808"BioSequences method
As you might imagine, parsing FASTA files is something we do all the time, so there's a package for that. In julia, that package is called FASTX.jl (there's also a "FASTQ" file format).
This package provides FASTA "readers" and "writers", and return iterators that validate recores and return data structures containing the sequeince identifiers and BioSeqeunces.
using FASTX
function fastx_gc(buffer)
max_gc = 0.0
max_id = ""
FASTAReader(buffer) do reader # see the "tip" above about do-blocks
for record in reader
gc = gc_content(sequence(LongDNA{2}, record))
if gc > max_gc
max_gc = gc
max_id = identifier(record)
end
end
end
return max_id, max_gc
end
seekstart(fake_file)
fastx_gc(fake_file)("Rosalind_0808", 0.6091954022988506)Reading the file
So far, we've been dealing with an IOBuffer, which is what we get when we open a file. But to actually solve the rosalind problem, we need to open a file on the disk.
In julia, we do this with the open function, which takes a file path, and optionally a function to call with the file handle. If we don't provide a function, we get an IOBuffer just as we've been using, but we need to be sure to close it when we're done.
function file_gc(filename)
max_id, max_gc = open(fastx_gc, filename)
println(max_id)
println(100 * max_gc)
end
file_gc("problem_inputs/rosalind_gc.txt")Rosalind_5466
51.991150442477874