BioTutorials

😉 Problem 3 - Getting the complement

I know, I know, not the compliment, but if you have a better emoji idea, let me know.

The Problem

In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'.

The reverse complement of a DNA string $s$ is the string $s^c$ formed by reversing the symbols of $s$, then taking the complement of each symbol (e.g., the reverse complement of "GTCA" is "TGAC").

Given: A DNA string $s$ of length at most 1000 bp.

Return: The reverse complement $s^c$ of $s$.

Sample Dataset

txt

AAAACCCGGT

Sample Output

txt

ACCGGGTTTT

This one is a bit tougher - we need to change each base coming in, and then reverse the result. Actually, that second part is easy, because Julia has a built-in reverse() function that works for Strings.

julia

reverse("complement")

"tnemelpmoc"

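The problem statement reverses first and then complements, while the approach described above complements first and then reverses. Because complementing acts on each symbol independently, both orders should give the same string. Here is a minimal sketch of that check; `complement_base` is a hypothetical helper written only for this illustration, not part of the tutorial's code.

```julia
# Hypothetical helper (an assumption for this sketch): complement a single base.
complement_base(base::Char) =
    base == 'A' ? 'T' :
    base == 'T' ? 'A' :
    base == 'C' ? 'G' :
    base == 'G' ? 'C' :
    throw(ArgumentError("unexpected base: $base"))

seq = "GTCA"

# Order 1: reverse first, then complement each symbol (the problem's wording).
reverse_then_complement = map(complement_base, reverse(seq))

# Order 2: complement each symbol first, then reverse the result.
complement_then_reverse = reverse(map(complement_base, seq))

@assert reverse_then_complement == complement_then_reverse == "TGAC"
```
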
Approach 1: using a `Dict`ionary

In my opinion, the easiest thing to do is to use a Dict(), a data structure that maps arbitrary keys to arbitrary values.

For example:

julia

my_dictionary = Dict("thing1"=> "hello", "thing2" => "world!")


my_dictionary["thing1"]

"hello"

julia

my_dictionary["thing2"]

"world!"

So, we just make a dictionary with 4 entries, one for each base. Then, to apply this to every base in the sequence, we have a couple of options. One is to use the String() constructor and a "comprehension" - basically a for loop in a single phrase:

julia

function revc(seq)
	comp_dict = Dict(
		'A'=>'T',
		'C'=>'G',
		'G'=>'C',
		'T'=>'A'
	)
	comp = String([comp_dict[base] for base in seq])
	return reverse(comp)
end

revc (generic function with 1 method)

Here, the "comprehension" [comp_dict[base] for base in seq] is equivalent to something like

julia

comp = Char[]
for base in seq
	push!(comp, comp_dict[base])
end

So let's see if it works!

julia

input_dna = "AAAACCCGGT"
answer = "ACCGGGTTTT"

@assert revc(input_dna) == answer
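One thing to keep in mind with the dictionary approach: indexing a Dict with a key it doesn't contain throws a KeyError, so a sequence containing anything other than the four bases (an 'N', a lowercase letter, a trailing newline) will fail. The sketch below shows that behavior, plus one possible way around it using `get()` with a default. The name `revc_lenient` and the choice to pass unknown symbols through unchanged are assumptions made for illustration; depending on your data, failing loudly may be the better behavior.

```julia
comp_dict = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')

# Indexing with a key that is not in the Dict throws a KeyError.
threw_keyerror = try
    comp_dict['N']
    false
catch err
    err isa KeyError
end
@assert threw_keyerror

# Hypothetical variant of revc: get() returns its third argument
# when the key is missing, so unknown symbols are kept as they are.
function revc_lenient(seq)
    comp = String([get(comp_dict, base, base) for base in seq])
    return reverse(comp)
end

@assert revc_lenient("AAAACCCGGT") == "ACCGGGTTTT"
@assert revc_lenient("ACGN") == "NCGT"
```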

Approach 2: using `replace()` again

It turns out, the replace() function we used for the transcription problem can be passed multiple Pairs of patterns to replace!

So we can just pass the pairs directly:

julia

function revc2(seq)
	comp = replace(seq,
		'A'=>'T',
		'C'=>'G',
		'G'=>'C',
		'T'=>'A'
	)
	return reverse(comp)
end


@assert revc(input_dna) == revc2(input_dna)
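It's worth seeing why this has to be a single replace() call. When several pairs are passed at once, each character of the input is replaced at most once, so 'A'=>'T' and 'T'=>'A' don't undo each other. Chaining separate calls does not have that property. A small sketch of the difference (passing multiple pairs to replace() on a String requires Julia 1.7 or later):

```julia
# One call with several pairs: every character is matched against the
# original string, so it is replaced at most once.
@assert replace("AT", 'A' => 'T', 'T' => 'A') == "TA"

# Chaining two calls: the second call also rewrites the 'T'
# that the first call just produced, so the information is lost.
@assert replace(replace("AT", 'A' => 'T'), 'T' => 'A') == "AA"
```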

Approach 3: `BioSequences.jl`

This is a pretty common need in bioinformatics, so BioSequences.jl actually has a reverse_complement() function built-in.

julia

using BioSequences

reverse_complement(LongDNA{2}(input_dna))

10nt DNA Sequence:
ACCGGGTTTT

Once more, benchmarks

julia

using BenchmarkTools


testseq = randdnaseq(100_000) #this is defined in BioSequences
testseq_str = string(testseq)


@benchmark revc($testseq_str)

BenchmarkTools.Trial: 4713 samples with 1 evaluation per sample.
 Range (min … max):  991.133 μs …   6.187 ms  ┊ GC (min … max): 0.00% … 83.03%
 Time  (median):       1.002 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.050 ms ± 147.442 μs  ┊ GC (mean ± σ):  0.94% ±  3.84%

  ▃█▅▃▂   ▂▃   ▂    ▁▃▂        ▃▁                               ▁
  ██████▆▆███▆▆██▆▅▄████▇▆▃▃▄▄▆██▇▅▅▄▅▄▁▃▁▄▃▁▃▄▁▃▄▁▄▆██▇▄▄▅▄▄▁▄ █
  991 μs        Histogram: log(frequency) by time       1.46 ms <

 Memory estimate: 586.50 KiB, allocs estimate: 9.

julia

@benchmark revc2($testseq_str)

BenchmarkTools.Trial: 1349 samples with 1 evaluation per sample.
 Range (min … max):  3.621 ms …   9.888 ms  ┊ GC (min … max): 0.00% … 44.36%
 Time  (median):     3.657 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.706 ms ± 279.194 μs  ┊ GC (mean ± σ):  0.21% ±  2.00%

   █                                                           
  ▆█▆▃▃▇▄▃▂▂▂▂▂▂▂▁▂▂▁▂▂▂▁▂▂▂▁▁▁▁▂▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▂ ▂
  3.62 ms         Histogram: frequency by time        4.77 ms <

 Memory estimate: 215.19 KiB, allocs estimate: 5.

julia

@benchmark reverse_complement($testseq)

BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.927 μs … 310.346 μs  ┊ GC (min … max):  0.00% … 95.79%
 Time  (median):     2.373 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   3.878 μs ±   8.314 μs  ┊ GC (mean ± σ):  16.24% ±  9.46%

  █▆▃▁                       ▂▁                               ▁
  ████▆▅▅▁▃▄▄▁▁▁▄▄▁▃▁▁▁▁▃▁▃▃▇██▇▇▅▅▅▃▃▁▁▁▁▁▁▁▁▁▁▁▁▃▁▃▃▁▁▅▄▅▅▆ █
  1.93 μs      Histogram: log(frequency) by time      41.5 μs <

 Memory estimate: 48.97 KiB, allocs estimate: 4.

julia

@benchmark reverse_complement(testseq_4bit) setup=(testseq_4bit = convert(LongDNA{4}, testseq))

BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (min … max):  1.908 μs … 378.154 μs  ┊ GC (min … max):  0.00% … 95.92%
 Time  (median):     2.422 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   4.029 μs ±   9.191 μs  ┊ GC (mean ± σ):  15.87% ±  8.76%

  █▇▃▁                      ▂▂▁                               ▁
  ████▆▆▅▄▅▄▁▃▃▁▁▃▃▁▃▃▄▄▁▃▁▁███▇▆▆▅▄▁▄▃▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▄▁▄▄▅ █
  1.91 μs      Histogram: log(frequency) by time      41.9 μs <

 Memory estimate: 48.97 KiB, allocs estimate: 4.

Conclusions

This one is a no-brainer! Going by median times, the reverse_complement() function is roughly 400x faster than the dictionary method, and roughly 1500x faster than replace(), for both 2-bit and 4-bit DNA sequences.

⌛ Overall Conclusions

A lot of bioinformatics is essentially string manipulation. Julia has a lot of useful functionality to work with Strings directly, but those methods often leave a lot of performance on the table.

BioSequences.jl provides some nice sequence types and incredibly efficient data structures. We'll be seeing more of them in coming tutorials.
