✍ïļ Problem 2: Transcription ​

ðŸĪ” Problem link

The Problem

An RNA string is a string formed from the alphabet containing 'A', 'C', 'G', and 'U'.

Given a DNA string t corresponding to a coding strand, its transcribed RNA string u is formed by replacing all occurrences of 'T' in t with 'U' in u.

Given: A DNA string t having length at most 1000 nt.

Return: The transcribed RNA string of t.

Sample Dataset

txt
GATGGAACTTGACTACGTAAATT

Sample Output

txt
GAUGGAACUUGACUACGUAAAUU

Approach 1 - string replace()

julia
input_dna = "GATGGAACTTGACTACGTAAATT"
answer = "GAUGGAACUUGACUACGUAAAUU"
"GAUGGAACUUGACUACGUAAAUU"

This one is pretty straightforward, as described. All we need to do is replace any 'T's with 'U's. Happily, julia has a handy replace() function that takes a string and a Pair of the form pattern => replacement. In principle, the pattern can be a literal String, or even a regular expression. But here, we can just use a Char.
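As a quick sketch of those three pattern kinds (the input and variable names here are only for illustration), each of the following calls should give the same result:

```julia
dna = "GATTACA"  # illustrative input, not the problem's sample dataset

from_char   = replace(dna, 'T' => 'U')    # Char pattern
from_string = replace(dna, "T" => "U")    # String pattern
from_regex  = replace(dna, r"T" => "U")   # Regex pattern

@assert from_char == from_string == from_regex == "GAUUACA"
```

The Char form is the simplest option when the pattern is a single character.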

I'll also write the function using julia's one-line function definition syntax:

julia
input_dna = "GATGGAACTTGACTACGTAAATT"

# One-line method: replace every 'T' with 'U'.
simple_transcribe(seq) = replace(seq, 'T' => 'U')

@assert simple_transcribe(input_dna) == answer

As always, there are lots of ways you could do this. This function won't handle poorly formatted sequences, for example. Or rather, it will handle them, even though it shouldn't: it replaces every 'T' in any string, whether or not that string is a valid DNA sequence.
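One way to guard against that, as a minimal sketch: check every character before replacing. The name strict_transcribe is hypothetical (it is not part of the original solution), and the sketch assumes only uppercase A, C, G, and T count as valid input.

```julia
# Hypothetical helper: reject anything that is not an uppercase
# A, C, G, or T before transcribing.
function strict_transcribe(seq::AbstractString)
    for (i, c) in enumerate(seq)
        c in ('A', 'C', 'G', 'T') ||
            throw(ArgumentError("invalid DNA character '$c' at position $i"))
    end
    return replace(seq, 'T' => 'U')
end

strict_transcribe("GATGGAACTTGACTACGTAAATT")  # "GAUGGAACUUGACUACGUAAAUU"
# strict_transcribe("This Is QUITE silly") would throw an ArgumentError
```

The cost is an extra pass over the string, which is one reason a dedicated sequence type can be a better fit.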

Approach 2 - BioSequences LongRNA

As you might expect, BioSequences.jl has a way to do this as well. BioSequences.jl doesn't just use a String to represent sequences; it provides special types that can efficiently encode nucleic acid or amino acid sequences. In some cases, e.g. DNA or RNA with no ambiguous bases, these use as few as 2 bits per base.
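To get a feel for what the bit width means for storage, here is a back-of-the-envelope sketch. The helper packed_bytes is hypothetical, and it only counts the raw bits; the real sequence types carry some extra overhead.

```julia
# Hypothetical helper: bytes needed to hold n bases at a given bits-per-base.
packed_bytes(n, bits) = cld(n * bits, 8)

n = 100_000
packed_bytes(n, 2)   # 25_000 bytes with a 2-bit alphabet
packed_bytes(n, 4)   # 50_000 bytes with a 4-bit alphabet
sizeof("A"^n)        # 100_000 bytes as an ASCII String (1 byte per base)
```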

julia
using BioSequences

dna_seq = LongDNA{2}(input_dna)


simple_transcribe(seq::LongDNA{N}) where N = LongRNA{N}(seq)

rna_seq = simple_transcribe(dna_seq)
23nt RNA Sequence:
GAUGGAACUUGACUACGUAAAUU

julia
@assert String(rna_seq) == answer

julia
simple_transcribe("This Is QUITE silly")
"Uhis Is QUIUE silly"

A couple of things to note here. First, I'm taking advantage of julia's multiple dispatch system. Instead of writing a separate function name for dealing with a LongDNA from BioSequences.jl, I wrote a new method for the same function by adding ::LongDNA{N} to the argument.

This tells julia to call this version of simple_transcribe() whenever the argument is a LongDNA. Otherwise, it falls back to the original method, because julia always uses the most specific method that matches its arguments. That is why the String call above still returns "Uhis Is QUIUE silly".

The last thing to note is the {N} ... where N. Here N is a type parameter, so the same method accepts any DNA alphabet (2 bit or 4 bit) and returns a LongRNA with the matching number of bits.
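To see both ideas in isolation, here is a toy sketch that runs without BioSequences. ToyDNA, ToyRNA, and toy_transcribe are hypothetical stand-ins for LongDNA, LongRNA, and simple_transcribe; the parameter N is carried along but does not affect how the data is stored.

```julia
# Hypothetical toy types standing in for LongDNA{N} / LongRNA{N}.
# N mimics "bits per base" but is not used for any real encoding here.
struct ToyDNA{N}
    data::String
end

struct ToyRNA{N}
    data::String
end

# Fallback method for plain strings.
toy_transcribe(seq::AbstractString) = replace(seq, 'T' => 'U')

# More specific method, chosen whenever the argument is a ToyDNA.
# `where N` captures the type parameter so the result keeps the same N.
toy_transcribe(seq::ToyDNA{N}) where N = ToyRNA{N}(replace(seq.data, 'T' => 'U'))

toy_transcribe("GATTACA")             # String method, returns "GAUUACA"
toy_transcribe(ToyDNA{2}("GATTACA"))  # returns a ToyRNA{2}
toy_transcribe(ToyDNA{4}("GATTACA"))  # returns a ToyRNA{4}
```

Both methods share one function name, and the argument type alone decides which one runs.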

Benchmarks

julia
using BenchmarkTools

testseq = randdnaseq(100_000) # defined in BioSequences; returns a LongDNA{4}
testseq_str = string(testseq)


@benchmark simple_transcribe($testseq)
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.391 μs … 695.413 μs  ┊ GC (min … max):  0.00% …  0.00%
 Time  (median):     1.855 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   4.410 μs ±  10.970 μs  ┊ GC (mean ± σ):  17.93% ± 11.14%

  █▅▆▂                     ▃▃▁                                ▁
  █████▆▇▄▄▁▁▃▄▃▅▃▅▃▄▄▄▅▅▄▅███▇▆▅▄▃▃▁▄▅▅▄▅▅▅▅▅▅▅▆▅▃▅▄▅▅▅▅▄▄▄▅ █
  1.39 μs      Histogram: log(frequency) by time        47 μs <

 Memory estimate: 48.97 KiB, allocs estimate: 4.

julia
@benchmark simple_transcribe(x) setup=(x=LongDNA{2}(testseq))
BenchmarkTools.Trial: 10000 samples with 68 evaluations per sample.
 Range (min … max):  835.088 ns … 86.832 μs  ┊ GC (min … max):  0.00% …  8.51%
 Time  (median):       1.102 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.701 μs ±  2.227 μs  ┊ GC (mean ± σ):  22.41% ± 17.99%

  ▇█▇▄▃▃▂                           ▂▂▂▁                       ▂
  ████████▅▆▅▁▅▄▃▄▁▃▄▁▁▁▃▃▃▁▃▄▁▄▆▆▇███████▆▅▆▅▄▅▅▅▅▅▆▅▅▅▆▅▅▃▆█ █
  835 ns        Histogram: log(frequency) by time      11.1 μs <

 Memory estimate: 24.53 KiB, allocs estimate: 4.

julia
@benchmark simple_transcribe(x) setup=(x=LongDNA{4}(testseq))
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.520 μs … 760.248 μs  ┊ GC (min … max):  0.00% …  0.00%
 Time  (median):     2.825 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   5.131 μs ±  12.230 μs  ┊ GC (mean ± σ):  15.77% ± 10.16%

  ▅█▇▄▁                    ▃▂                                 ▂
  █████▇▃▄▅▄▅▄▃▄▄▄▄▄▅▁▆▃▄▃████▇▆▄▄▃▁▁▃▁▄▁▃▁▁▃▁▁▁▁▃▃▅▅▅▆▅▆▅▄▄▅ █
  1.52 μs      Histogram: log(frequency) by time        48 μs <

 Memory estimate: 48.97 KiB, allocs estimate: 4.
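To time the String method on its own without any packages, here is a rough sketch using Base's @elapsed. It is far less rigorous than @benchmark (one call, no statistics). The names transcribe_str and dna_str are hypothetical, and the input is freshly generated random data, not the sequence benchmarked above.

```julia
# Same body as the String version of simple_transcribe, under a hypothetical name.
transcribe_str(seq::AbstractString) = replace(seq, 'T' => 'U')

# Random 100_000-character DNA string (illustrative test data).
dna_str = String(rand(['A', 'C', 'G', 'T'], 100_000))

transcribe_str(dna_str)                     # warm up so compilation is not timed
elapsed = @elapsed transcribe_str(dna_str)  # seconds for a single call
println("String replace: ", round(elapsed * 1e6; digits = 1), " μs")
```

With BenchmarkTools loaded, the equivalent call would be @benchmark simple_transcribe($testseq_str).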

Conclusions

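I'm actually a little surprised that the replace() method appears to do so well, but one caveat applies: the first benchmark passes testseq, which randdnaseq() returns as a LongDNA{4}, rather than testseq_str, so it most likely measures the BioSequences method instead of replace(). Its memory estimate (48.97 KiB) is identical to that of the 4-bit run, which supports that reading. What the numbers do show is that the BioJulia method is about 2x faster on a 2-bit sequence (that is, if there's no ambiguity) than on a 4-bit sequence.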