ð§Ž Problem 1: Counting DNA nucleotides â
ðĪ Problem link
Note
For each of these problems, you are strongly encouraged to read through the problem descriptions, especially if you're somewhat new to molecular biology. We will mostly not repeat the background concepts in these notebooks, except where they are relevant to the solutions.
The Problem
A string is simply an ordered collection of symbols selected from some alphabet and formed into a word; the length of a string is the number of symbols that it contains.
An example of a length 21 DNA string (whose alphabet contains the symbols 'A', 'C', 'G', and 'T') is "ATGCTTCAGAAAGGTCTTACG."
Given: A DNA string s of length at most 1000 nt.
Return: Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s .
Sample Dataset
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCSample Output
20 12 17 21Let's see how it's done!
DNA sequences are Strings of Chars â
In julia, single characters and strings, which are made up of multiple characters, have different types. Char and String respectively.
They are also written differently - single quotes for a Char, and double quotes for a `String'.
chr = 'a'
str = "A"
typeof(chr)Chartypeof(str)StringIn many ways, a String can be thought of as a vector of Chars, and many julia functions that operate on collections like Vectors will work on Strings. We can also loop over the contents of a string, which will treat each Char separately.
for c in "banana"
@info c
end[ Info: b
[ Info: a
[ Info: n
[ Info: a
[ Info: n
[ Info: aApproach 1: counting in loops â
One relatively straightforward way to approach this problem is to set a variable to 0 for each base, then loop through the sequence, adding 1 to the appropriate variable at each character.
I'll also stick this into a function, so we can easily reuse the code.
function countbases(seq) # here `seq` is an "argument" for the function
a = 0
c = 0
g = 0
t = 0
for base in seq
if base == 'A'
a += 1 # this is equivalent to `a = a + 1`
elseif base == 'C'
c += 1
elseif base == 'G'
g += 1
elseif base == 'T'
t += 1
else
# it is often a good idea to try to handle possible mistakes explicitly
error("Base $base is not supported")
end
end
return (a, c, g, t)
end
countbases("AAA")(3, 0, 0, 0)Now let's see if it works on the example dataset. Remember, we should be getting the answer 20 12 17 21
answer = "20 12 17 21"
input_dna = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
countbases(input_dna)(20, 12, 17, 21)Well, the formatting is just a bit different - The julia type is a Tuple, which is surrounded by parentheses. To fix this, we can use the join function.
@assert join(countbases(input_dna), " ") == answerApproach 2: using count() â
Another approach is to use the built-in count() function, which takes a "predicate" function as the first argument and an iterable collection as the second argument. The predicate function must take each element of the collection, and return either true or false. The count() function then returns the number of elements that returned true.
For example, if I define the lessthan5() function to return true if a value is less than 5, I can then use it as a predicate to count the number of values in a Vector of numbers that are less than 5.
function lessthan5(num)
return num < 5
end
count(lessthan5, [1, 5, 6, -3, 3])3Often, we don't want to have to define a simple function like lessthan5() for every predicate we want to test, especially if they will only be used once. Instead, we can use an "anonymous" function (also sometimes called "lambdas") as the first argument.
In julia, anonymous functions have the syntax arg -> func. body. In other words, the same expression above could be written as:
count(num -> num < 5, [1, 5, 6, -3, 3])3Here, num -> num < 5 is identical to the definition for lessthan5(num).
So, now we can write a different formulation of countbases() using count():
function countbases2(seq)
a = count(base-> base == 'A', seq)
c = count(base-> base == 'C', seq)
g = count(base-> base == 'G', seq)
t = count(base-> base == 'T', seq)
return (a,c,g,t)
endcountbases2 (generic function with 1 method)@assert countbases2(input_dna) == countbases(input_dna)Tip
Even though this approach is quite a bit more suscinct, it might end up being a bit slower than countbases, since it has to loop over the sequence 4 times instead of just once.
Sometimes, you need to make trade-offs between clarity and efficiency. One of the great things about julia is that a lot of ways of approaching the same problem are often possible, and often fast (or they can be made fast).
Approach 3: using BioSequences.jl â
The BioSequences.jl package is designed to efficiently work with biological sequences like DNA sequences. BioSequences.jl efficiently encodes biological sequences using special types that are not Char or Strings.
using BioSequences
seq = LongDNA{2}(input_dna)
sizeof(input_dna)70sizeof(seq)16Counting individual nucleotides isn't the most common operation, but BioSequences.jl has some advanced searching functionality built-in. It's a bit overkill for this task, but for completeness:
function countbases3(seq)
a = count(==(DNA_A), seq)
c = count(==(DNA_C), seq)
g = count(==(DNA_G), seq)
t = count(==(DNA_T), seq)
return (a,c,g,t)
end
@assert countbases3(seq) == countbases2(input_dna)Benchmarking â
Julia programmers like speed, so let's benchmark our approaches!
using BenchmarkTools
testseq = randdnaseq(100_000) #this is defined in BioSequences
testseq_str = string(testseq)
@benchmark countbases($testseq_str)BenchmarkTools.Trial: 7026 samples with 1 evaluation per sample.
Range (min âĶ max): 694.128 Ξs âĶ 852.825 Ξs â GC (min âĶ max): 0.00% âĶ 0.00%
Time (median): 707.593 Ξs â GC (median): 0.00%
Time (mean Âą Ï): 706.592 Ξs Âą 5.365 Ξs â GC (mean Âą Ï): 0.00% Âą 0.00%
ââââââ
â
ââââââââââââ
â
ââââââ
ââââââââ
âââââââââââ
âââââââââââââââââââââââ â
694 Ξs Histogram: frequency by time 721 Ξs <
Memory estimate: 0 bytes, allocs estimate: 0.@benchmark(countbases2($testseq_str))BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
Range (min âĶ max): 331.520 Ξs âĶ 613.497 Ξs â GC (min âĶ max): 0.00% âĶ 0.00%
Time (median): 339.004 Ξs â GC (median): 0.00%
Time (mean Âą Ï): 341.383 Ξs Âą 12.286 Ξs â GC (mean Âą Ï): 0.00% Âą 0.00%
âââââââââ
âââ
ââââââââ
â
ââââââ â
âââ
âââââââââââââââââââââââââââââââââââââââ
âââââ
âââ
ââ
âââ
â
âââââ â
332 Ξs Histogram: log(frequency) by time 359 Ξs <
Memory estimate: 0 bytes, allocs estimate: 0.@benchmark countbases3($testseq)BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
Range (min âĶ max): 9.207 Ξs âĶ 24.586 Ξs â GC (min âĶ max): 0.00% âĶ 0.00%
Time (median): 9.257 Ξs â GC (median): 0.00%
Time (mean Âą Ï): 9.348 Ξs Âą 764.504 ns â GC (mean Âą Ï): 0.00% Âą 0.00%
â
âââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ â
9.21 Ξs Histogram: frequency by time 11.4 Ξs <
Memory estimate: 0 bytes, allocs estimate: 0.Interestingly, on my system, countbases2() is actually faster than countbases(), at least for this longer sequence. This may be because SIMD lets the calls to count() work in parallel.
But, as you can see, countbases3() is even faster. Let me make one more function that mimics the behavior of the original countbases() but uses BioSequences.jl instead.
function countbases4(seq)
a = 0
c = 0
g = 0
t = 0
for base in seq
if base == DNA_A
a += 1 # this is equivalent to `a = a + 1`
elseif base == DNA_C
c += 1
elseif base == DNA_G
g += 1
elseif base == DNA_T
t += 1
else
# it is often a good idea to try to handle possible mistakes explicitly
error("Base $base is not supported")
end
end
return (a, c, g, t)
end
@benchmark countbases4($testseq)BenchmarkTools.Trial: 6443 samples with 1 evaluation per sample.
Range (min âĶ max): 768.206 Ξs âĶ 851.041 Ξs â GC (min âĶ max): 0.00% âĶ 0.00%
Time (median): 775.430 Ξs â GC (median): 0.00%
Time (mean Âą Ï): 774.752 Ξs Âą 4.578 Ξs â GC (mean Âą Ï): 0.00% Âą 0.00%
ââââ ââ âââââ
ââââââ â
ââââââââââ
âââââ
âââââââââââââââââââââââ
âââ
ââââ
âââ
â
âââââ
â
â
â
â
ââ
â
â
768 Ξs Histogram: log(frequency) by time 789 Ξs <
Memory estimate: 0 bytes, allocs estimate: 0.