๐ Problem 3: Reverse Complement
I know, I know, not the compliment, but if you have a better emoji idea, let me know.
๐ค Problem link
The Problem
In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'.
The reverse complement of a DNA string \(s\) is the string \(sc\) formed by reversing the symbols of \(s\), then taking the complement of each symbol (e.g., the reverse complement of "GTCA" is "TGAC").
Given: A DNA string \(s\) of length at most 1000 bp.
Return: The reverse complement \(sc\) of \(s\).
Sample Dataset
AAAACCCGGTSample Output
ACCGGGTTTT
This one is a bit tougher - we need to change each base coming in,
and then reverse the result. Actually, that second part is easy,
because julia has a built-in reverse() function that works for Strings.
reverse("complement") # "tnemelpmoc"
Approach 1: using a Dictionary
In my opinion, the easiest thing to do is to use a Dict(),
a data structure that allows arbitrary keys to look up arbitrary entries.
For example:
my_dictionary = Dict("thing1"=> "hello", "thing2" => "world!")
my_dictionary["thing1"] # "hello"
my_dictionary["thing2"] # "world!"
So, we just make a dictionary with 4 entries, one for each base.
Then, to apply this to every base in the sequence, we have a couple of options.
One is to use the String() constructor and a "comprehension" -
basically a for loop in a single phrase:
function revc(seq)
comp_dict = Dict(
'A'=>'T',
'C'=>'G',
'G'=>'C',
'T'=>'A'
)
comp = String([comp_dict[base] for base in seq])
return reverse(comp)
end
Here, the "comprehension" [comp_dict[base] for base in seq] is equivalent to something like
comp = Char[]
for base in seq
push!(comp, comp_dict[base])
end
So let's see if it works!
input_dna = "AAAACCCGGT"
answer = "ACCGGGTTTT"
@assert revc(input_dna) == answer
Approach 2: using replace() again
It turns out, the replace() function we used for the transcription problem
can be passed multiple Pairs of patterns to replace!
So we can just pass the pairs directly:
function revc2(seq)
comp = replace(seq,
'A'=>'T',
'C'=>'G',
'G'=>'C',
'T'=>'A'
)
return reverse(comp)
end
@assert revc(input_dna) == revc2(input_dna)
Approach 3: BioSequences.jl
This is a pretty common need in bioinformatics, so BioSequences.jl actually has a reverse_complement() function built-in.
using BioSequences
reverse_complement(LongDNA{2}(input_dna)) # ACCGGGTTTT
Benchmarks
using BenchmarkTools
testseq = randdnaseq(100_000) #this is defined in BioSequences
testseq_str = string(testseq)
@benchmark revc($testseq_str)
@benchmark revc2($testseq_str)
@benchmark reverse_complement($testseq)
@benchmark reverse_complement(testseq_4bit) setup=(testseq_4bit = convert(LongDNA{4}, testseq))
Conclusions
This one is a no-brainer! The reverse_complement() function is about 200x faster than the dictionary method, and about 1000x faster than replace() for both 2 bit and 4 bit DNA sequences.
โ Overall Conclusions
A lot of bioinformatics is essentially string manipulation.
Julia has a lot of useful functionality to work with Strings
directly, but those methods often leave a lot of performance on the table.
BioSequences.jl provides some nice sequence types and incredibly efficient
data structures. We'll be seeing more of them in coming tutorials.