ð Problem 3 - Getting the complement â
I know, I know, not the compliment, but if you have a better emoji idea, let me know.
The Problem
In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'.
The reverse complement of a DNA string
Given: A DNA string
Return: The reverse complement
Sample Dataset
AAAACCCGGTSample Output
ACCGGGTTTTThis one is a bit tougher - we need to change each base coming in, and then reverse the result. Actually, that second part is easy, becuase julia has a built-in reverse() function that works for Strings.
reverse("complement")"tnemelpmoc"Approach 1: using a Dictionary â
In my opinion, the easiest thing to do is to use a Dict(), a data structure that allows arbitrary keys to look up arbitrary entries.
For example:
my_dictionary = Dict("thing1"=> "hello", "thing2" => "world!")
my_dictionary["thing1"]"hello"my_dictionary["thing2"]"world!"So, we just make a dictionary with 4 entries, one for each base. Then, to apply this to every base in the sequence, we have a couple of options. One is to use the String() constructor and a "comprehension" - basically a for loop in a single phrase:
function revc(seq)
comp_dict = Dict(
'A'=>'T',
'C'=>'G',
'G'=>'C',
'T'=>'A'
)
comp = String([comp_dict[base] for base in seq])
return reverse(comp)
endrevc (generic function with 1 method)Here, the "comprehension" [comp_dict[base] for base in seq] is equivalent to something like
comp = Char[]
for base in seq
push!(comp, comp_dict[base])
endSo let's see if it works!
input_dna = "AAAACCCGGT"
answer = "ACCGGGTTTT"
@assert revc(input_dna) == answerApproach 2: using replace() again â
It turns out, the replace() function we used for the transcription problem can be passed mulitple Pairs of patterns to replace!
So we can just pass the pairs directly:
function revc2(seq)
comp = replace(seq,
'A'=>'T',
'C'=>'G',
'G'=>'C',
'T'=>'A'
)
return reverse(comp)
end
@assert revc(input_dna) == revc2(input_dna)Approach 3: BioSequences.jl â
This is a pretty common need in bioinformatics, so BioSequences.jl actually has a reverse_complement() function built-in.
using BioSequences
reverse_complement(LongDNA{2}(input_dna))10nt DNA Sequence:
ACCGGGTTTTOnce more, benchmarks â
using BenchmarkTools
testseq = randdnaseq(100_000) #this is defined in BioSequences
testseq_str = string(testseq)
@benchmark revc($testseq_str)BenchmarkTools.Trial: 4713 samples with 1 evaluation per sample.
Range (min âĶ max): 991.133 Ξs âĶ 6.187 ms â GC (min âĶ max): 0.00% âĶ 83.03%
Time (median): 1.002 ms â GC (median): 0.00%
Time (mean Âą Ï): 1.050 ms Âą 147.442 Ξs â GC (mean Âą Ï): 0.94% Âą 3.84%
âââ
ââ ââ â âââ ââ â
âââââââââââââââââ
ââââââââââââââââ
â
ââ
âââââââââââââââââââââ
ââââ â
991 Ξs Histogram: log(frequency) by time 1.46 ms <
Memory estimate: 586.50 KiB, allocs estimate: 9.@benchmark revc2($testseq_str)BenchmarkTools.Trial: 1349 samples with 1 evaluation per sample.
Range (min âĶ max): 3.621 ms âĶ 9.888 ms â GC (min âĶ max): 0.00% âĶ 44.36%
Time (median): 3.657 ms â GC (median): 0.00%
Time (mean Âą Ï): 3.706 ms Âą 279.194 Ξs â GC (mean Âą Ï): 0.21% Âą 2.00%
â
âââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ â
3.62 ms Histogram: frequency by time 4.77 ms <
Memory estimate: 215.19 KiB, allocs estimate: 5.@benchmark reverse_complement($testseq)BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
Range (min âĶ max): 1.927 Ξs âĶ 310.346 Ξs â GC (min âĶ max): 0.00% âĶ 95.79%
Time (median): 2.373 Ξs â GC (median): 0.00%
Time (mean Âą Ï): 3.878 Ξs Âą 8.314 Ξs â GC (mean Âą Ï): 16.24% Âą 9.46%
ââââ ââ â
ââââââ
â
âââââââââââââââââââââââââ
â
â
âââââââââââââââââââââ
ââ
â
â â
1.93 Ξs Histogram: log(frequency) by time 41.5 Ξs <
Memory estimate: 48.97 KiB, allocs estimate: 4.@benchmark reverse_complement(testseq_4bit) setup=(testseq_4bit = convert(LongDNA{4}, testseq))BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
Range (min âĶ max): 1.908 Ξs âĶ 378.154 Ξs â GC (min âĶ max): 0.00% âĶ 95.92%
Time (median): 2.422 Ξs â GC (median): 0.00%
Time (mean Âą Ï): 4.029 Ξs Âą 9.191 Ξs â GC (mean Âą Ï): 15.87% Âą 8.76%
ââââ âââ â
âââââââ
ââ
ââââââââââââââââââââââââ
ââââââââââââââââââââââââââ
â
1.91 Ξs Histogram: log(frequency) by time 41.9 Ξs <
Memory estimate: 48.97 KiB, allocs estimate: 4.Conclusions â
This one is a no-brainer! The reverse_complement() function is about 200x faster than the dictionary method, and about 1000x faster than replace() for both 2 bit and 4 bit DNA sequences.
â Overall Conclusions â
A lot of bioinformatics is essentially string manipulation. Julia has a lot of useful functionality to work with Strings directly, but those methods often leave a lot of performance on the table.
BioSequences.jl provides some nice sequence types and incredibly efficient data structures. We'll be seeing more of them in coming tutorials.