✍️ Problem 2: Transcription
The Problem
An RNA string is a string formed from the alphabet containing 'A', 'C', 'G', and 'U'.
Given a DNA string \(t\) corresponding to a coding strand, its transcribed RNA string \(u\) is formed by replacing all occurrences of 'T' in \(t\) with 'U' in \(u\).
Given: A DNA string \(t\) having length at most 1000 nt.
Return: The transcribed RNA string of \(t\).
Sample Dataset
GATGGAACTTGACTACGTAAATT
Sample Output
GAUGGAACUUGACUACGUAAAUU
Approach 1 - string replace()
input_dna = "GATGGAACTTGACTACGTAAATT"
answer = "GAUGGAACUUGACUACGUAAAUU"
This one is pretty straightforward, as described.
All we need to do is replace any 'T's with 'U's.
Happily, Julia has a handy replace() function
that takes a string and a Pair of the form pattern => replacement.
In principle, the pattern can be a literal String
or even a regular expression, but here we can just use a Char.
I'll also write the function using Julia's one-line function definition syntax:
simple_transcribe(seq) = replace(seq, 'T' => 'U')
@assert simple_transcribe(input_dna) == answer
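To make the point about pattern types concrete, here is a small sketch showing that a Char, a String, and a regular expression all give the same result for this task. The variable names are only for illustration.

```julia
seq = "GATTACA"

# Three equivalent ways to spell the pattern in a pattern => replacement Pair.
by_char   = replace(seq, 'T' => 'U')    # Char pattern
by_string = replace(seq, "T" => "U")    # String pattern
by_regex  = replace(seq, r"T" => "U")   # Regex pattern

@assert by_char == by_string == by_regex == "GAUUACA"

# A regex becomes useful when the match is less literal,
# for example a case-insensitive match on lowercase input.
mixed = replace("gatTaca", r"t"i => "U")
@assert mixed == "gaUUaca"
```

For a single-character substitution like this one, the Char form is the simplest to read.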
As always, there are lots of ways you could do this. This function won't handle poorly formatted sequences, for example. Or rather, it will handle them, even though it shouldn't:
simple_transcribe("This Is QUITE silly") # "This Is QUIUE silly"
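If you do want the String version to reject input like that, one option is to check the characters first. This is a minimal sketch, assuming that only uppercase 'A', 'C', 'G', and 'T' count as valid DNA; strict_transcribe is a hypothetical name, not part of the original solution.

```julia
# Hypothetical stricter variant: refuse anything that is not plain ACGT.
function strict_transcribe(seq::AbstractString)
    # in("ACGT") builds a predicate that tests membership in the string "ACGT".
    all(in("ACGT"), seq) ||
        throw(ArgumentError("sequence contains characters other than A, C, G, T"))
    return replace(seq, 'T' => 'U')
end

@assert strict_transcribe("GATGGAACTTGACTACGTAAATT") == "GAUGGAACUUGACUACGUAAAUU"

# The silly input now raises an error instead of being quietly "transcribed".
rejected = try
    strict_transcribe("This Is QUITE silly")
    false
catch err
    err isa ArgumentError
end
@assert rejected
```

The check costs an extra pass over the string, so whether it is worth it depends on how much you trust your input.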
Approach 2 - BioSequences LongRNA
As you might expect, BioSequences.jl has a way to do this as well.
BioSequences.jl doesn't just use a String to represent sequences;
it provides special types that can efficiently encode nucleic acid
or amino acid sequences.
In some cases, e.g. DNA or RNA with no ambiguous bases, these types use as few as 2 bits
per base.
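To get a feel for what "2 bits per base" means, here is a rough sketch of packing a DNA string into 64-bit words, 32 bases per word. The code table and the pack2bit helper are assumptions made for illustration; they are not meant to describe the exact memory layout BioSequences.jl uses.

```julia
# Hypothetical 2-bit code for each unambiguous base.
const TWO_BIT_CODE = Dict('A' => 0x00, 'C' => 0x01, 'G' => 0x02, 'T' => 0x03)

# Pack a DNA string into UInt64 words, 2 bits per base (32 bases per word).
function pack2bit(seq::AbstractString)
    words = zeros(UInt64, cld(length(seq), 32))
    for (i, base) in enumerate(seq)
        word, offset = divrem(i - 1, 32)   # which word, and which slot inside it
        words[word + 1] |= UInt64(TWO_BIT_CODE[base]) << (2 * offset)
    end
    return words
end

packed = pack2bit("ACGT"^250)   # a 1000 nt sequence

@assert sizeof("ACGT"^250) == 1000   # one byte per base as a String
@assert sizeof(packed) == 256        # 32 words of 8 bytes each
```

In a scheme like this, going from DNA to RNA would not need to touch the packed data at all, since 'T' and 'U' can share the same code and only the type label changes.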
using BioSequences
dna_seq = LongDNA{2}(input_dna)
simple_transcribe(seq::LongDNA{N}) where N = LongRNA{N}(seq)
rna_seq = simple_transcribe(dna_seq)
@assert String(rna_seq) == answer
A couple of things to note here. First,
I'm taking advantage of Julia's multiple dispatch system.
Instead of writing a separately named function for dealing with
a LongDNA from BioSequences.jl, I wrote a new method
for the same function by adding ::LongDNA{N} to the argument.
This tells Julia to call this version of simple_transcribe()
whenever the argument is a LongDNA. Otherwise, it will fall back to the original
(Julia always uses the method that is most specific for its arguments).
Second, note the {N} ... where N. This lets a single method
accept either DNA alphabet (2-bit or 4-bit) and return an RNA sequence with the same number of bits per base.
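The dispatch behavior can be seen in isolation with a small sketch that doesn't depend on BioSequences.jl. ToyDNA, ToyRNA, and toy_transcribe are hypothetical stand-ins for LongDNA, LongRNA, and simple_transcribe; the parameter N plays the role of the bits-per-base setting.

```julia
# Hypothetical stand-ins for parametric sequence types.
struct ToyDNA{N}
    bases::String
end

struct ToyRNA{N}
    bases::String
end

# Generic fallback: used for any argument without a more specific method.
toy_transcribe(seq) = replace(seq, 'T' => 'U')

# More specific method: chosen whenever the argument is a ToyDNA.
# The `where {N}` carries the parameter through to the result type.
toy_transcribe(seq::ToyDNA{N}) where {N} = ToyRNA{N}(replace(seq.bases, 'T' => 'U'))

@assert toy_transcribe("GATT") == "GAUU"                   # falls back to the generic method
@assert toy_transcribe(ToyDNA{2}("GATT")) isa ToyRNA{2}    # N = 2 in, N = 2 out
@assert toy_transcribe(ToyDNA{4}("GATT")) isa ToyRNA{4}    # N = 4 in, N = 4 out
@assert length(methods(toy_transcribe)) == 2               # one function, two methods
```

Both calls use the same function name, and the argument type alone decides which method runs.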
Benchmarks
using BenchmarkTools
testseq = randdnaseq(100_000) # randdnaseq is defined in BioSequences
testseq_str = string(testseq)
@benchmark simple_transcribe($testseq_str)
@benchmark simple_transcribe(x) setup=(x=LongDNA{2}(testseq))
@benchmark simple_transcribe(x) setup=(x=LongDNA{4}(testseq))
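If you don't have BenchmarkTools installed, a much cruder timing of the String version is possible with Base's @elapsed. This is only a sketch: it times a single run, so it is far noisier than @benchmark, and string_transcribe and random_dna are names made up for this example.

```julia
# Build a random DNA string from plain characters (no BioSequences needed).
n = 100_000
random_dna = join(rand(['A', 'C', 'G', 'T'], n))

string_transcribe(seq) = replace(seq, 'T' => 'U')

# Warm-up call so that compilation time is not included in the measurement.
string_transcribe(random_dna)

# Time a single run, in seconds.
elapsed = @elapsed string_transcribe(random_dna)
println("one run took ", elapsed, " seconds")
```

A single measurement like this is fine for a sanity check, but for comparing methods the repeated sampling that @benchmark does is the better tool.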
Conclusions
The replace() method does surprisingly well,
but the BioJulia method is about 2x faster on a 2-bit sequence
(that is, if there's no ambiguity), and about the same speed on 4-bit sequences.