🧬 Problem 1: Counting DNA Nucleotides
🤔 Problem link
Note: For each of these problems, you are strongly encouraged to read through the problem descriptions, especially if you're somewhat new to molecular biology. We will mostly not repeat the background concepts in these notebooks, except where they are relevant to the solutions.
The Problem
A string is simply an ordered collection of symbols selected from some alphabet and formed into a word; the length of a string is the number of symbols that it contains.
An example of a length 21 DNA string (whose alphabet contains the symbols 'A', 'C', 'G', and 'T') is "ATGCTTCAGAAAGGTCTTACG."
Given: A DNA string
sof length at most 1000 nt.Return: Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in
s.Sample Dataset
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCSample Output
20 12 17 21
Let's see how it's done!
DNA sequences are Strings of Chars
In julia, single characters and strings,
which are made up of multiple characters, have different types.
Char and String respectively.
They are also written differently - single quotes for a Char,
and double quotes for a String.
chr = 'a'
str = "A"
typeof(chr) # Char
typeof(str) # String
In many ways, a String can be thought of as a vector of Chars,
and many julia functions that operate on collections like Vectors
will work on Strings.
We can also loop over the contents of a string,
which will treat each Char separately.
for c in "banana"
@info c
end
Approach 1: counting in loops
One relatively straightforward way to approach this problem
is to set a variable to 0 for each base,
then loop through the sequence, adding 1 to the appropriate
variable at each character.
I'll also stick this into a function, so we can easily reuse the code.
function countbases(seq) # here `seq` is an "argument" for the function
a = 0
c = 0
g = 0
t = 0
for base in seq
if base == 'A'
a += 1 # this is equivalent to `a = a + 1`
elseif base == 'C'
c += 1
elseif base == 'G'
g += 1
elseif base == 'T'
t += 1
else
# it is often a good idea to try to handle possible mistakes explicitly
error("Base $base is not supported")
end
end
return (a, c, g, t)
end
countbases("AAA") # (3, 0, 0, 0)
Now let's see if it works on the example dataset.
Remember, we should be getting the answer 20 12 17 21
answer = "20 12 17 21"
input_dna = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
countbases(input_dna) # (20, 12, 17, 21)
Well, the formatting is just a bit different -
The julia type is a Tuple, which is surrounded by parentheses.
To fix this, we can use the join function.
@assert join(countbases(input_dna), " ") == answer
Approach 2: using count()
Another approach is to use the built-in count() function,
which takes a "predicate" function as the first argument
and an iterable collection as the second argument.
The predicate function must take each element of the collection,
and return either true or false.
The count() function then returns the number of elements
that returned true.
For example, if I define the lessthan5() function
to return true if a value is less than 5,
I can then use it as a predicate to count the number of values
in a Vector of numbers that are less than 5.
function lessthan5(num)
return num < 5
end
count(lessthan5, [1, 5, 6, -3, 3]) # 3
Often, we don't want to have to define a simple function like lessthan5()
for every predicate we want to test, especially if they will only be used once.
Instead, we can use an "anonymous" function (also sometimes called "lambdas")
as the first argument.
In julia, anonymous functions have the syntax arg -> func_body.
In other words, the same expression above could be written as:
count(num -> num < 5, [1, 5, 6, -3, 3]) # 3
Here, num -> num < 5 is identical to the definition for lessthan5(num).
So, now we can write a different formulation of countbases() using count():
function countbases2(seq)
a = count(base-> base == 'A', seq)
c = count(base-> base == 'C', seq)
g = count(base-> base == 'G', seq)
t = count(base-> base == 'T', seq)
return (a,c,g,t)
end
@assert countbases2(input_dna) == countbases(input_dna)
Tip: Even though this approach is quite a bit more succinct, it might end up being a bit slower than
countbases, since it has to loop over the sequence 4 times instead of just once.Sometimes, you need to make trade-offs between clarity and efficiency. One of the great things about
juliais that a lot of ways of approaching the same problem are often possible, and often fast (or they can be made fast).
Approach 3: using BioSequences.jl
The BioSequences.jl package is designed to efficiently work
with biological sequences like DNA sequences.
BioSequences.jl efficiently encodes biological sequences using
special types that are not Char or Strings.
using BioSequences
seq = LongDNA{2}(input_dna)
sizeof(input_dna) # 70 bytes (1 byte per char)
sizeof(seq) # 18 bytes (2 bits per base)
Counting individual nucleotides isn't the most common operation,
but BioSequences.jl has some advanced searching functionality
built-in. It's a bit overkill for this task, but for completeness:
function countbases3(seq)
a = count(==(DNA_A), seq)
c = count(==(DNA_C), seq)
g = count(==(DNA_G), seq)
t = count(==(DNA_T), seq)
return (a,c,g,t)
end
@assert countbases3(seq) == countbases2(input_dna)
Benchmarking
Julia programmers like speed, so let's benchmark our approaches!
using BenchmarkTools
testseq = randdnaseq(100_000) #this is defined in BioSequences
testseq_str = string(testseq)
@benchmark countbases($testseq_str)
@benchmark countbases2($testseq_str)
@benchmark countbases3($testseq)
Interestingly, countbases2() is actually faster than countbases(),
at least for longer sequences. This may be because SIMD lets the calls to count() work in parallel.
But countbases3() is even faster. Let me make one more function that mimics the behavior of the original countbases() but uses BioSequences.jl instead.
function countbases4(seq)
a = 0
c = 0
g = 0
t = 0
for base in seq
if base == DNA_A
a += 1
elseif base == DNA_C
c += 1
elseif base == DNA_G
g += 1
elseif base == DNA_T
t += 1
else
error("Base $base is not supported")
end
end
return (a, c, g, t)
end
@benchmark countbases4($testseq)