Skip to content

🧎 Problem 1: Counting DNA nucleotides ​

ðŸĪ” Problem link

Note

For each of these problems, you are strongly encouraged to read through the problem descriptions, especially if you're somewhat new to molecular biology. We will mostly not repeat the background concepts in these notebooks, except where they are relevant to the solutions.

The Problem

A string is simply an ordered collection of symbols selected from some alphabet and formed into a word; the length of a string is the number of symbols that it contains.

An example of a length 21 DNA string (whose alphabet contains the symbols 'A', 'C', 'G', and 'T') is "ATGCTTCAGAAAGGTCTTACG."

Given: A DNA string s of length at most 1000 nt.

Return: Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s .

Sample Dataset

AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC

Sample Output

20 12 17 21

Let's see how it's done!

DNA sequences are Strings of Chars ​

In julia, single characters and strings, which are made up of multiple characters, have different types. Char and String respectively.

They are also written differently - single quotes for a Char, and double quotes for a `String'.

julia
chr = 'a'
str = "A"
typeof(chr)
Char
julia
typeof(str)
String

In many ways, a String can be thought of as a vector of Chars, and many julia functions that operate on collections like Vectors will work on Strings. We can also loop over the contents of a string, which will treat each Char separately.

julia
for c in "banana"
	@info c
end
[ Info: b
[ Info: a
[ Info: n
[ Info: a
[ Info: n
[ Info: a

Approach 1: counting in loops ​

One relatively straightforward way to approach this problem is to set a variable to 0 for each base, then loop through the sequence, adding 1 to the appropriate variable at each character.

I'll also stick this into a function, so we can easily reuse the code.

julia
function countbases(seq) # here `seq` is an "argument" for the function
	a = 0
	c = 0
	g = 0
	t = 0
	for base in seq
		if base == 'A'
			a += 1 # this is equivalent to `a = a + 1`
		elseif base == 'C'
			c += 1
		elseif base == 'G'
			g += 1
		elseif base == 'T'
			t += 1
		else
			# it is often a good idea to try to handle possible mistakes explicitly
			error("Base $base is not supported")
		end
	end
	return (a, c, g, t)
end

countbases("AAA")
(3, 0, 0, 0)

Now let's see if it works on the example dataset. Remember, we should be getting the answer 20 12 17 21

julia
answer = "20 12 17 21"
input_dna = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
countbases(input_dna)
(20, 12, 17, 21)

Well, the formatting is just a bit different - The julia type is a Tuple, which is surrounded by parentheses. To fix this, we can use the join function.

julia
@assert join(countbases(input_dna), " ") == answer

Approach 2: using count() ​

Another approach is to use the built-in count() function, which takes a "predicate" function as the first argument and an iterable collection as the second argument. The predicate function must take each element of the collection, and return either true or false. The count() function then returns the number of elements that returned true.

For example, if I define the lessthan5() function to return true if a value is less than 5, I can then use it as a predicate to count the number of values in a Vector of numbers that are less than 5.

julia
function lessthan5(num)
	return num < 5
end

count(lessthan5, [1, 5, 6, -3, 3])
3

Often, we don't want to have to define a simple function like lessthan5() for every predicate we want to test, especially if they will only be used once. Instead, we can use an "anonymous" function (also sometimes called "lambdas") as the first argument.

In julia, anonymous functions have the syntax arg -> func. body. In other words, the same expression above could be written as:

julia
count(num -> num < 5,  [1, 5, 6, -3, 3])
3

Here, num -> num < 5 is identical to the definition for lessthan5(num).

So, now we can write a different formulation of countbases() using count():

julia
function countbases2(seq)
	a = count(base-> base == 'A', seq)
	c = count(base-> base == 'C', seq)
	g = count(base-> base == 'G', seq)
	t = count(base-> base == 'T', seq)
	return (a,c,g,t)
end
countbases2 (generic function with 1 method)
julia
@assert countbases2(input_dna) == countbases(input_dna)

Tip

Even though this approach is quite a bit more suscinct, it might end up being a bit slower than countbases, since it has to loop over the sequence 4 times instead of just once.

Sometimes, you need to make trade-offs between clarity and efficiency. One of the great things about julia is that a lot of ways of approaching the same problem are often possible, and often fast (or they can be made fast).

Approach 3: using BioSequences.jl ​

The BioSequences.jl package is designed to efficiently work with biological sequences like DNA sequences. BioSequences.jl efficiently encodes biological sequences using special types that are not Char or Strings.

julia
using BioSequences

seq = LongDNA{2}(input_dna)

sizeof(input_dna)
70
julia
sizeof(seq)
16

Counting individual nucleotides isn't the most common operation, but BioSequences.jl has some advanced searching functionality built-in. It's a bit overkill for this task, but for completeness:

julia
function countbases3(seq)
	a = count(==(DNA_A), seq)
	c = count(==(DNA_C), seq)
	g = count(==(DNA_G), seq)
	t = count(==(DNA_T), seq)
	return (a,c,g,t)
end

@assert countbases3(seq) == countbases2(input_dna)

Benchmarking ​

Julia programmers like speed, so let's benchmark our approaches!

julia
using BenchmarkTools

testseq = randdnaseq(100_000) #this is defined in BioSequences
testseq_str = string(testseq)

@benchmark countbases($testseq_str)
BenchmarkTools.Trial: 7026 samples with 1 evaluation per sample.
 Range (min â€Ķ max):  694.128 Ξs â€Ķ 852.825 Ξs  ┊ GC (min â€Ķ max): 0.00% â€Ķ 0.00%
 Time  (median):     707.593 ξs               ┊ GC (median):    0.00%
 Time  (mean Âą σ):   706.592 Ξs Âą   5.365 Ξs  ┊ GC (mean Âą σ):  0.00% Âą 0.00%

                              ▂▄██▆▅▂                            
  ▂▁▁▂▂▂▂▃▃▃▄▅▅▆▇█▇▆▅▄▃▃▃▃▃▄▅▇████████▆▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▄
  694 Ξs           Histogram: frequency by time          721 Ξs <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia
@benchmark(countbases2($testseq_str))
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min â€Ķ max):  331.520 Ξs â€Ķ 613.497 Ξs  ┊ GC (min â€Ķ max): 0.00% â€Ķ 0.00%
 Time  (median):     339.004 ξs               ┊ GC (median):    0.00%
 Time  (mean Âą σ):   341.383 Ξs Âą  12.286 Ξs  ┊ GC (mean Âą σ):  0.00% Âą 0.00%

         ▄▃▆▂▃▃▂▆▅█▁▅    ▁▂▂▁▁▃▄▅▅▆▄▃▂▁▁                        ▂
  ▃▃▅▄███████████████▇█▆▇████████████████▆▆▅▆▆▆▆▅▄▄▅▄▅▆▂▅▅▃▄▃▂▄ █
  332 Ξs        Histogram: log(frequency) by time        359 Ξs <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia
@benchmark countbases3($testseq)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min â€Ķ max):  9.207 Ξs â€Ķ  24.586 Ξs  ┊ GC (min â€Ķ max): 0.00% â€Ķ 0.00%
 Time  (median):     9.257 ξs               ┊ GC (median):    0.00%
 Time  (mean Âą σ):   9.348 Ξs Âą 764.504 ns  ┊ GC (mean Âą σ):  0.00% Âą 0.00%

   █                                                           
  ▄█▃▄▂▂▂▂▁▁▁▂▂▂▁▂▂▁▂▂▂▂▂▂▂▂▁▂▂▁▂▁▁▁▁▂▁▂▂▁▁▁▁▂▂▂▂▁▂▁▁▂▁▂▁▁▂▂▂ ▂
  9.21 Ξs         Histogram: frequency by time        11.4 Ξs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Interestingly, on my system, countbases2() is actually faster than countbases(), at least for this longer sequence. This may be because SIMD lets the calls to count() work in parallel.

But, as you can see, countbases3() is even faster. Let me make one more function that mimics the behavior of the original countbases() but uses BioSequences.jl instead.

julia
function countbases4(seq)
	a = 0
	c = 0
	g = 0
	t = 0
	for base in seq
		if base == DNA_A
			a += 1 # this is equivalent to `a = a + 1`
		elseif base == DNA_C
			c += 1
		elseif base == DNA_G
			g += 1
		elseif base == DNA_T
			t += 1
		else
			# it is often a good idea to try to handle possible mistakes explicitly
			error("Base $base is not supported")
		end
	end
	return (a, c, g, t)
end

@benchmark countbases4($testseq)
BenchmarkTools.Trial: 6443 samples with 1 evaluation per sample.
 Range (min â€Ķ max):  768.206 Ξs â€Ķ 851.041 Ξs  ┊ GC (min â€Ķ max): 0.00% â€Ķ 0.00%
 Time  (median):     775.430 ξs               ┊ GC (median):    0.00%
 Time  (mean Âą σ):   774.752 Ξs Âą   4.578 Ξs  ┊ GC (mean Âą σ):  0.00% Âą 0.00%

  ▇▂▄▃ ▁▁             ▇█▆▆▅▄▃▃▃▂▁                               ▂
  ████▁██▇█▅▃▃▁▄▅▁▄▁▃█████████████▇▇▆▇▆▅▇▇▅▆▆▇▅▄▆▅▅▆▄▆▆▅▅▅▅▅▄▅▅ █
  768 Ξs        Histogram: log(frequency) by time        789 Ξs <

 Memory estimate: 0 bytes, allocs estimate: 0.