Skip to content

😉 Problem 3 - Getting the complement ​

I know, I know, not the compliment, but if you have a better emoji idea, let me know.

The Problem

In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'.

The reverse complement of a DNA string s is the string sc formed by reversing the symbols of s, then taking the complement of each symbol (e.g., the reverse complement of "GTCA" is "TGAC").

Given: A DNA string s of length at most 1000 bp.

Return: The reverse complement sc of s.

Sample Dataset

txt
AAAACCCGGT

Sample Output

txt
ACCGGGTTTT

This one is a bit tougher - we need to change each base coming in, and then reverse the result. Actually, that second part is easy, becuase julia has a built-in reverse() function that works for Strings.

julia
reverse("complement")
"tnemelpmoc"

Approach 1: using a Dictionary ​

In my opinion, the easiest thing to do is to use a Dict(), a data structure that allows arbitrary keys to look up arbitrary entries.

For example:

julia
my_dictionary = Dict("thing1"=> "hello", "thing2" => "world!")


my_dictionary["thing1"]
"hello"
julia
my_dictionary["thing2"]
"world!"

So, we just make a dictionary with 4 entries, one for each base. Then, to apply this to every base in the sequence, we have a couple of options. One is to use the String() constructor and a "comprehension" - basically a for loop in a single phrase:

julia
function revc(seq)
	comp_dict = Dict(
		'A'=>'T',
		'C'=>'G',
		'G'=>'C',
		'T'=>'A'
	)
	comp = String([comp_dict[base] for base in seq])
	return reverse(comp)
end
revc (generic function with 1 method)

Here, the "comprehension" [comp_dict[base] for base in seq] is equivalent to something like

julia
comp = Char[]
for base in seq
	push!(comp, comp_dict[base])
end

So let's see if it works!

julia
input_dna = "AAAACCCGGT"
answer = "ACCGGGTTTT"

@assert revc(input_dna) == answer

Approach 2: using replace() again ​

It turns out, the replace() function we used for the transcription problem can be passed mulitple Pairs of patterns to replace!

So we can just pass the pairs directly:

julia
function revc2(seq)
	comp = replace(seq,
		'A'=>'T',
		'C'=>'G',
		'G'=>'C',
		'T'=>'A'
	)
	return reverse(comp)
end


@assert revc(input_dna) == revc2(input_dna)

Approach 3: BioSequences.jl ​

This is a pretty common need in bioinformatics, so BioSequences.jl actually has a reverse_complement() function built-in.

julia
using BioSequences

reverse_complement(LongDNA{2}(input_dna))
10nt DNA Sequence:
ACCGGGTTTT

Once more, benchmarks ​

julia
using BenchmarkTools


testseq = randdnaseq(100_000) #this is defined in BioSequences
testseq_str = string(testseq)


@benchmark revc($testseq_str)
BenchmarkTools.Trial: 4713 samples with 1 evaluation per sample.
 Range (min â€Ķ max):  991.133 Ξs â€Ķ   6.187 ms  ┊ GC (min â€Ķ max): 0.00% â€Ķ 83.03%
 Time  (median):       1.002 ms               ┊ GC (median):    0.00%
 Time  (mean Âą σ):     1.050 ms Âą 147.442 Ξs  ┊ GC (mean Âą σ):  0.94% Âą  3.84%

  ▃█▅▃▂   ▂▃   ▂    ▁▃▂        ▃▁                               ▁
  ██████▆▆███▆▆██▆▅▄████▇▆▃▃▄▄▆██▇▅▅▄▅▄▁▃▁▄▃▁▃▄▁▃▄▁▄▆██▇▄▄▅▄▄▁▄ █
  991 Ξs        Histogram: log(frequency) by time       1.46 ms <

 Memory estimate: 586.50 KiB, allocs estimate: 9.
julia
@benchmark revc2($testseq_str)
BenchmarkTools.Trial: 1349 samples with 1 evaluation per sample.
 Range (min â€Ķ max):  3.621 ms â€Ķ   9.888 ms  ┊ GC (min â€Ķ max): 0.00% â€Ķ 44.36%
 Time  (median):     3.657 ms               ┊ GC (median):    0.00%
 Time  (mean Âą σ):   3.706 ms Âą 279.194 Ξs  ┊ GC (mean Âą σ):  0.21% Âą  2.00%

   █                                                           
  ▆█▆▃▃▇▄▃▂▂▂▂▂▂▂▁▂▂▁▂▂▂▁▂▂▂▁▁▁▁▂▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▂ ▂
  3.62 ms         Histogram: frequency by time        4.77 ms <

 Memory estimate: 215.19 KiB, allocs estimate: 5.
julia
@benchmark reverse_complement($testseq)
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min â€Ķ max):  1.927 Ξs â€Ķ 310.346 Ξs  ┊ GC (min â€Ķ max):  0.00% â€Ķ 95.79%
 Time  (median):     2.373 ξs               ┊ GC (median):     0.00%
 Time  (mean Âą σ):   3.878 Ξs Âą   8.314 Ξs  ┊ GC (mean Âą σ):  16.24% Âą  9.46%

  █▆▃▁                       ▂▁                               ▁
  ████▆▅▅▁▃▄▄▁▁▁▄▄▁▃▁▁▁▁▃▁▃▃▇██▇▇▅▅▅▃▃▁▁▁▁▁▁▁▁▁▁▁▁▃▁▃▃▁▁▅▄▅▅▆ █
  1.93 Ξs      Histogram: log(frequency) by time      41.5 Ξs <

 Memory estimate: 48.97 KiB, allocs estimate: 4.
julia
@benchmark reverse_complement(testseq_4bit) setup=(testseq_4bit = convert(LongDNA{4}, testseq))
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (min â€Ķ max):  1.908 Ξs â€Ķ 378.154 Ξs  ┊ GC (min â€Ķ max):  0.00% â€Ķ 95.92%
 Time  (median):     2.422 ξs               ┊ GC (median):     0.00%
 Time  (mean Âą σ):   4.029 Ξs Âą   9.191 Ξs  ┊ GC (mean Âą σ):  15.87% Âą  8.76%

  █▇▃▁                      ▂▂▁                               ▁
  ████▆▆▅▄▅▄▁▃▃▁▁▃▃▁▃▃▄▄▁▃▁▁███▇▆▆▅▄▁▄▃▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▄▁▄▄▅ █
  1.91 Ξs      Histogram: log(frequency) by time      41.9 Ξs <

 Memory estimate: 48.97 KiB, allocs estimate: 4.

Conclusions ​

This one is a no-brainer! The reverse_complement() function is about 200x faster than the dictionary method, and about 1000x faster than replace() for both 2 bit and 4 bit DNA sequences.

⌛ Overall Conclusions ​

A lot of bioinformatics is essentially string manipulation. Julia has a lot of useful functionality to work with Strings directly, but those methods often leave a lot of performance on the table.

BioSequences.jl provides some nice sequence types and incredibly efficient data structures. We'll be seeing more of them in coming tutorials.