🧬 Problem 1: Counting DNA nucleotides

Note

For each of these problems, you are strongly encouraged to read through the problem descriptions, especially if you're somewhat new to molecular biology. We will mostly not repeat the background concepts in these notebooks, except where they are relevant to the solutions.

The Problem

A string is simply an ordered collection of symbols selected from some alphabet and formed into a word; the length of a string is the number of symbols that it contains.

An example of a length 21 DNA string (whose alphabet contains the symbols 'A', 'C', 'G', and 'T') is "ATGCTTCAGAAAGGTCTTACG."

Given: A DNA string s of length at most 1000 nt.

Return: Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s .

Sample Dataset

AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC

Sample Output

20 12 17 21

Let's see how it's done!

DNA sequences are `String`s of `Char`s

In julia, single characters and strings, which are made up of multiple characters, have different types. Char and String respectively.

They are also written differently - single quotes for a Char, and double quotes for a `String'.

julia

chr = 'a'
str = "A"
typeof(chr)

Char

julia

typeof(str)

String

In many ways, a String can be thought of as a vector of Chars, and many julia functions that operate on collections like Vectors will work on Strings. We can also loop over the contents of a string, which will treat each Char separately.

julia

for c in "banana"
	@info c
end

[ Info: b
[ Info: a
[ Info: n
[ Info: a
[ Info: n
[ Info: a

Approach 1: counting in loops

One relatively straightforward way to approach this problem is to set a variable to 0 for each base, then loop through the sequence, adding 1 to the appropriate variable at each character.

I'll also stick this into a function, so we can easily reuse the code.

julia

function countbases(seq) # here `seq` is an "argument" for the function
	a = 0
	c = 0
	g = 0
	t = 0
	for base in seq
		if base == 'A'
			a += 1 # this is equivalent to `a = a + 1`
		elseif base == 'C'
			c += 1
		elseif base == 'G'
			g += 1
		elseif base == 'T'
			t += 1
		else
			# it is often a good idea to try to handle possible mistakes explicitly
			error("Base $base is not supported")
		end
	end
	return (a, c, g, t)
end

countbases("AAA")

(3, 0, 0, 0)

Now let's see if it works on the example dataset. Remember, we should be getting the answer 20 12 17 21

julia

answer = "20 12 17 21"
input_dna = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
countbases(input_dna)

(20, 12, 17, 21)

Well, the formatting is just a bit different - The julia type is a Tuple, which is surrounded by parentheses. To fix this, we can use the join function.

julia

@assert join(countbases(input_dna), " ") == answer

Approach 2: using `count()`

Another approach is to use the built-in count() function, which takes a "predicate" function as the first argument and an iterable collection as the second argument. The predicate function must take each element of the collection, and return either true or false. The count() function then returns the number of elements that returned true.

For example, if I define the lessthan5() function to return true if a value is less than 5, I can then use it as a predicate to count the number of values in a Vector of numbers that are less than 5.

julia

function lessthan5(num)
	return num < 5
end

count(lessthan5, [1, 5, 6, -3, 3])

Often, we don't want to have to define a simple function like lessthan5() for every predicate we want to test, especially if they will only be used once. Instead, we can use an "anonymous" function (also sometimes called "lambdas") as the first argument.

In julia, anonymous functions have the syntax arg -> func. body. In other words, the same expression above could be written as:

julia

count(num -> num < 5,  [1, 5, 6, -3, 3])

Here, num -> num < 5 is identical to the definition for lessthan5(num).

So, now we can write a different formulation of countbases() using count():

julia

function countbases2(seq)
	a = count(base-> base == 'A', seq)
	c = count(base-> base == 'C', seq)
	g = count(base-> base == 'G', seq)
	t = count(base-> base == 'T', seq)
	return (a,c,g,t)
end

countbases2 (generic function with 1 method)

julia

@assert countbases2(input_dna) == countbases(input_dna)

Tip

Even though this approach is quite a bit more suscinct, it might end up being a bit slower than countbases, since it has to loop over the sequence 4 times instead of just once.

Sometimes, you need to make trade-offs between clarity and efficiency. One of the great things about julia is that a lot of ways of approaching the same problem are often possible, and often fast (or they can be made fast).

Approach 3: using BioSequences.jl

The BioSequences.jl package is designed to efficiently work with biological sequences like DNA sequences. BioSequences.jl efficiently encodes biological sequences using special types that are not Char or Strings.

julia

using BioSequences

seq = LongDNA{2}(input_dna)

sizeof(input_dna)

julia

sizeof(seq)

Counting individual nucleotides isn't the most common operation, but BioSequences.jl has some advanced searching functionality built-in. It's a bit overkill for this task, but for completeness:

julia

function countbases3(seq)
	a = count(==(DNA_A), seq)
	c = count(==(DNA_C), seq)
	g = count(==(DNA_G), seq)
	t = count(==(DNA_T), seq)
	return (a,c,g,t)
end

@assert countbases3(seq) == countbases2(input_dna)

Benchmarking

Julia programmers like speed, so let's benchmark our approaches!

julia

using BenchmarkTools

testseq = randdnaseq(100_000) #this is defined in BioSequences
testseq_str = string(testseq)

@benchmark countbases($testseq_str)

BenchmarkTools.Trial: 7026 samples with 1 evaluation per sample.
 Range (min … max):  694.128 μs … 852.825 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     707.593 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   706.592 μs ±   5.365 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                              ▂▄██▆▅▂                            
  ▂▁▁▂▂▂▂▃▃▃▄▅▅▆▇█▇▆▅▄▃▃▃▃▃▄▅▇████████▆▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▄
  694 μs           Histogram: frequency by time          721 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia

@benchmark(countbases2($testseq_str))

BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  331.520 μs … 613.497 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     339.004 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   341.383 μs ±  12.286 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▄▃▆▂▃▃▂▆▅█▁▅    ▁▂▂▁▁▃▄▅▅▆▄▃▂▁▁                        ▂
  ▃▃▅▄███████████████▇█▆▇████████████████▆▆▅▆▆▆▆▅▄▄▅▄▅▆▂▅▅▃▄▃▂▄ █
  332 μs        Histogram: log(frequency) by time        359 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia

@benchmark countbases3($testseq)

BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  9.207 μs …  24.586 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.257 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.348 μs ± 764.504 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █                                                           
  ▄█▃▄▂▂▂▂▁▁▁▂▂▂▁▂▂▁▂▂▂▂▂▂▂▂▁▂▂▁▂▁▁▁▁▂▁▂▂▁▁▁▁▂▂▂▂▁▂▁▁▂▁▂▁▁▂▂▂ ▂
  9.21 μs         Histogram: frequency by time        11.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Interestingly, on my system, countbases2() is actually faster than countbases(), at least for this longer sequence. This may be because SIMD lets the calls to count() work in parallel.

But, as you can see, countbases3() is even faster. Let me make one more function that mimics the behavior of the original countbases() but uses BioSequences.jl instead.

julia

function countbases4(seq)
	a = 0
	c = 0
	g = 0
	t = 0
	for base in seq
		if base == DNA_A
			a += 1 # this is equivalent to `a = a + 1`
		elseif base == DNA_C
			c += 1
		elseif base == DNA_G
			g += 1
		elseif base == DNA_T
			t += 1
		else
			# it is often a good idea to try to handle possible mistakes explicitly
			error("Base $base is not supported")
		end
	end
	return (a, c, g, t)
end

@benchmark countbases4($testseq)

BenchmarkTools.Trial: 6443 samples with 1 evaluation per sample.
 Range (min … max):  768.206 μs … 851.041 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     775.430 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   774.752 μs ±   4.578 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇▂▄▃ ▁▁             ▇█▆▆▅▄▃▃▃▂▁                               ▂
  ████▁██▇█▅▃▃▁▄▅▁▄▁▃█████████████▇▇▆▇▆▅▇▇▅▆▆▇▅▄▆▅▅▆▄▆▆▅▅▅▅▅▄▅▅ █
  768 μs        Histogram: log(frequency) by time        789 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

🧬 Problem 1: Counting DNA nucleotides ​

DNA sequences are Strings of Chars ​

Approach 1: counting in loops ​

Approach 2: using count() ​

Approach 3: using BioSequences.jl ​

Benchmarking ​

🧬 Problem 1: Counting DNA nucleotides

DNA sequences are `String`s of `Char`s

Approach 1: counting in loops

Approach 2: using `count()`

Approach 3: using BioSequences.jl

Benchmarking