Problem 2: Transcription
Problem link
The Problem
An RNA string is a string formed from the alphabet containing 'A', 'C', 'G', and 'U'.
Given a DNA string
Given: A DNA string
Return: The transcribed RNA string of
Sample Dataset
GATGGAACTTGACTACGTAAATT
Sample Output
GAUGGAACUUGACUACGUAAAUU
Approach 1 - string replace()
input_dna = "GATGGAACTTGACTACGTAAATT"
answer = "GAUGGAACUUGACUACGUAAAUU"
"GAUGGAACUUGACUACGUAAAUU"
This one is pretty straightforward, as described. All we need to do is replace any 'T's with 'U's. Happily, julia has a handy replace() function that takes a string and a Pair of the form pattern => replacement. In principle, the pattern can be a literal String, or even a regular expression. But here, we can just use a Char.
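As a small illustration of the three pattern kinds mentioned above, here is a minimal sketch using a made-up example string (`dna` is not part of the original problem data). For a single-character substitution like this one, all three forms should give the same result:

```julia
# Made-up example input, not from the problem dataset.
dna = "GATTACA"

# The same substitution written with three different pattern types.
by_char   = replace(dna, 'T' => 'U')    # Char pattern
by_string = replace(dna, "T" => "U")    # String pattern
by_regex  = replace(dna, r"T" => "U")   # Regex pattern

@assert by_char == by_string == by_regex == "GAUUACA"
```

The Char form is the simplest choice when exactly one character maps to one other character.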
I'll also write the function using julia's one-line function definition syntax:
input_dna == "GATGGAACTTGACTACGTAAATT"
simple_transcribe(seq) = replace(seq, 'T' => 'U')
@assert simple_transcribe(input_dna) == answer
As always, there are lots of ways you could do this. This function won't handle poorly formatted sequences, for example. Or rather, it will handle them, even though it shouldn't: any string is accepted, and every 'T' in it becomes a 'U'.
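If rejecting malformed input matters, one option is to check each character before replacing. The following is only a sketch under stated assumptions: `strict_transcribe` is a hypothetical name, not part of the original solution, and it assumes that only uppercase A, C, G, and T count as valid DNA.

```julia
# Hypothetical stricter variant (not from the original post).
# Assumption: only uppercase 'A', 'C', 'G', and 'T' are valid DNA characters.
function strict_transcribe(seq::AbstractString)
    for (i, c) in enumerate(seq)
        # Stop at the first character that is not a valid DNA base.
        c in ('A', 'C', 'G', 'T') ||
            throw(ArgumentError("invalid DNA character '$c' at position $i"))
    end
    return replace(seq, 'T' => 'U')
end

strict_transcribe("GATGGAACTTGACTACGTAAATT")   # transcribes as before
# strict_transcribe("This Is QUITE silly")     # would throw an ArgumentError
```

The extra pass over the string costs some time, so this trades speed for safety.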
Approach 2 - BioSequences LongRNA
As you might expect, BioSequences.jl has a way to do this as well. Rather than using a String to represent sequences, BioSequences.jl provides special types that efficiently encode nucleic acid or amino acid sequences. In some cases, e.g. DNA or RNA with no ambiguous bases, they use as few as 2 bits per base.
using BioSequences
dna_seq = LongDNA{2}(input_dna)
simple_transcribe(seq::LongDNA{N}) where N = LongRNA{N}(seq)
rna_seq = simple_transcribe(dna_seq)
23nt RNA Sequence:
GAUGGAACUUGACUACGUAAAUU
@assert String(rna_seq) == answer
The String method, meanwhile, still accepts arbitrary text:
simple_transcribe("This Is QUITE silly")
"Uhis Is QUIUE silly"
A couple of things to note here. First, I'm taking advantage of julia's multiple dispatch system. Instead of writing a separately named function for dealing with a LongDNA from BioSequences.jl, I wrote a new method for the same function by adding ::LongDNA{N} to the argument.
This tells julia to call this version of simple_transcribe() whenever the argument is a LongDNA. Otherwise, it will fall back to the original (julia always uses the method that is most specific for its arguments).
The last thing to note is the {N} ... where N. This lets a single method cover any DNA alphabet (2-bit or 4-bit): N is captured from the argument's type and reused for the LongRNA{N} result.
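To see the dispatch and `where N` mechanics in isolation, without BioSequences.jl, here is a toy sketch. `ToySeq` and `describe` are hypothetical names invented for this illustration; `ToySeq{N}` only stands in for a sequence type that carries a bits-per-symbol parameter.

```julia
# Hypothetical stand-in for a sequence type with a bits-per-symbol parameter.
struct ToySeq{N}
    data::String
end

# Generic fallback, used when no more specific method matches.
describe(x) = "fallback method"

# More specific method: matches ToySeq{2}, ToySeq{4}, and so on, capturing N.
describe(x::ToySeq{N}) where {N} = "ToySeq method with N = $N"

@assert describe("GATTACA") == "fallback method"
@assert describe(ToySeq{2}("GATTACA")) == "ToySeq method with N = 2"
@assert describe(ToySeq{4}("GATTACA")) == "ToySeq method with N = 4"
```

The same function name serves every case, and the most specific matching method is chosen from the argument's type.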
Benchmarks
using BenchmarkTools
testseq = randdnaseq(100_000) # randdnaseq is defined in BioSequences
testseq_str = string(testseq)
@benchmark simple_transcribe($testseq)
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
Range (min … max): 1.391 μs … 695.413 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.855 μs ┊ GC (median): 0.00%
Time (mean ± σ): 4.410 μs ± 10.970 μs ┊ GC (mean ± σ): 17.93% ± 11.14%
1.39 μs Histogram: log(frequency) by time 47 μs <
Memory estimate: 48.97 KiB, allocs estimate: 4.
@benchmark simple_transcribe(x) setup=(x=LongDNA{2}(testseq))
BenchmarkTools.Trial: 10000 samples with 68 evaluations per sample.
Range (min … max): 835.088 ns … 86.832 μs ┊ GC (min … max): 0.00% … 8.51%
Time (median): 1.102 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.701 μs ± 2.227 μs ┊ GC (mean ± σ): 22.41% ± 17.99%
835 ns Histogram: log(frequency) by time 11.1 μs <
Memory estimate: 24.53 KiB, allocs estimate: 4.
@benchmark simple_transcribe(x) setup=(x=LongDNA{4}(testseq))
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
Range (min … max): 1.520 μs … 760.248 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.825 μs ┊ GC (median): 0.00%
Time (mean ± σ): 5.131 μs ± 12.230 μs ┊ GC (mean ± σ): 15.77% ± 10.16%
1.52 μs Histogram: log(frequency) by time 48 μs <
Memory estimate: 48.97 KiB, allocs estimate: 4.
Conclusions
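I'm actually a little surprised that the replace() method does so well, but there you have it. The BioJulia method is about 2x faster on a 2-bit sequence (that is, if there's no ambiguity), but about the same speed on 4-bit sequences. One caveat: the first benchmark was run on testseq rather than testseq_str, and its memory estimate (48.97 KiB) is identical to that of the 4-bit run, so it likely measured the LongDNA method rather than replace() on a String.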
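For a rough timing of replace() on a plain String specifically, a sketch using only the standard library might look like the following. It is not a substitute for BenchmarkTools: `@elapsed` times a single call, and `rand_dna_string` is a hypothetical helper standing in for `string(randdnaseq(100_000))`. No timing is quoted here, since it depends on the machine.

```julia
using Random

# Hypothetical stdlib-only stand-in for `string(randdnaseq(100_000))`.
rand_dna_string(n) = randstring("ACGT", n)

# Same one-line String method as in Approach 1.
simple_transcribe(seq) = replace(seq, 'T' => 'U')

testseq_str = rand_dna_string(100_000)

simple_transcribe(testseq_str)                      # warm-up call, so compilation is not timed
elapsed = @elapsed simple_transcribe(testseq_str)   # seconds for one call

println("String replace: ", round(elapsed * 1e6; digits = 1), " μs")
```

With BenchmarkTools loaded, the equivalent measurement would presumably be `@benchmark simple_transcribe($testseq_str)`.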