✍️ Problem 2: Transcription

🤔 Problem link

The Problem

An RNA string is a string formed from the alphabet containing 'A', 'C', 'G', and 'U'.

Given a DNA string \(t\) corresponding to a coding strand, its transcribed RNA string \(u\) is formed by replacing all occurrences of 'T' in \(t\) with 'U' in \(u\).

Given: A DNA string \(t\) having length at most 1000 nt.

Return: The transcribed RNA string of \(t\).

Sample Dataset

GATGGAACTTGACTACGTAAATT

Sample Output

GAUGGAACUUGACUACGUAAAUU

Approach 1 - string replace()

input_dna = "GATGGAACTTGACTACGTAAATT"
answer = "GAUGGAACUUGACUACGUAAAUU"

This one is pretty straightforward, as described. All we need to do is replace every 'T' with a 'U'. Happily, Julia has a handy replace() function that takes a string and a Pair of the form pattern => replacement. In principle, the pattern can be a literal String or even a regular expression, but here we can just use a Char.

I'll also write the function using Julia's one-line function definition syntax:

simple_transcribe(seq) = replace(seq, 'T' => 'U')

@assert simple_transcribe(input_dna) == answer
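To make the point about pattern types concrete, here is a small sketch of the same replacement written with a Char, a String, and a regular expression. The variable names are just for illustration:

```julia
# All three Pair forms give the same result on this input.
by_char   = replace("GATTACA", 'T' => 'U')    # Char => Char
by_string = replace("GATTACA", "T" => "U")    # String => String
by_regex  = replace("GATTACA", r"T" => "U")   # Regex => String

@assert by_char == by_string == by_regex
```

For a single-character swap like this one, the Char form is the simplest to read, which is why I used it above.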

As always, there are lots of ways you could do this. This function won't handle poorly formatted sequences, for example. Or rather, it will handle them, even though it shouldn't:

simple_transcribe("This Is QUITE silly")  # "This Is QUIUE silly"
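If you would rather have malformed input rejected than quietly "transcribed", one option is to validate before replacing. This is only a minimal sketch: strict_transcribe is a hypothetical name, and I'm assuming that only uppercase 'A', 'C', 'G', and 'T' count as valid DNA.

```julia
# Hypothetical stricter variant; not part of the original solution.
function strict_transcribe(seq::AbstractString)
    # Assumption: valid input contains only uppercase, unambiguous DNA bases.
    all(in("ACGT"), seq) || throw(ArgumentError("not a DNA string: $seq"))
    return replace(seq, 'T' => 'U')
end

strict_transcribe("GATGGAACTTGACTACGTAAATT")  # transcribes as before
# strict_transcribe("This Is QUITE silly")    # would throw an ArgumentError
```

The extra pass over the string costs a little time, so whether it is worth it depends on how much you trust your input.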

Approach 2 - BioSequences LongRNA

As you might expect, BioSequences.jl has a way to do this as well. BioSequences.jl doesn't just use a String to represent sequences; it has special types that can efficiently encode nucleic acid or amino acid sequences. In some cases (e.g. DNA or RNA with no ambiguous bases), these types use as few as 2 bits per base.
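To get a feel for what "2 bits per base" means, here is a naive packing sketch using only Base Julia. It is an illustration of the idea, not how BioSequences.jl is implemented; base_code and pack2bit are hypothetical names, and the code assignment (A=0, C=1, G=2, T=3) is my own assumption.

```julia
# Illustration only: a naive 2-bit packing, not BioSequences.jl internals.
function base_code(base::Char)
    idx = findfirst(==(base), "ACGT")  # position 1..4, or nothing if absent
    idx === nothing && throw(ArgumentError("not an unambiguous DNA base: $base"))
    return UInt64(idx - 1)             # 0, 1, 2, or 3, which fits in 2 bits
end

function pack2bit(seq::AbstractString)
    length(seq) <= 32 || throw(ArgumentError("this sketch packs at most 32 bases"))
    packed = UInt64(0)
    for (i, base) in enumerate(seq)
        # Base i occupies the 2 bits starting at bit offset 2(i - 1).
        packed |= base_code(base) << (2 * (i - 1))
    end
    return packed
end

packed = pack2bit("GATGGAACTTGACTACGTAAATT")  # 23 bases in one 64-bit word
```

With four possible symbols there is no room left for ambiguity codes like 'N', which is why the 2-bit form only works for unambiguous sequences.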

using BioSequences

dna_seq = LongDNA{2}(input_dna)

simple_transcribe(seq::LongDNA{N}) where N = LongRNA{N}(seq)

rna_seq = simple_transcribe(dna_seq)

@assert String(rna_seq) == answer

A couple of things to note here. First, I'm taking advantage of Julia's multiple dispatch system. Instead of writing a separately named function for dealing with a LongDNA from BioSequences.jl, I wrote a new method for the same function by adding ::LongDNA{N} to the argument.

This tells Julia to call this version of simple_transcribe() whenever the argument is a LongDNA. Otherwise, it will fall back to the original (Julia always uses the method that is most specific for its arguments).

The last thing to note is the {N} ... where N. Here, N is a type parameter that stands for the number of bits per base, so the same method works for any DNA alphabet (2 bit or 4 bit) and returns a LongRNA with the same N.
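If the dispatch and where N parts feel abstract, here is a toy version that runs without BioSequences.jl. ToyDNA, ToyRNA, and toy_transcribe are hypothetical stand-ins I made up for illustration; they just wrap a String and carry a type parameter.

```julia
# Toy stand-ins for parameterised sequence types (hypothetical, not BioSequences.jl).
struct ToyDNA{N}
    data::String
end

struct ToyRNA{N}
    data::String
end

# Generic fallback: used for anything without a more specific method.
toy_transcribe(seq) = replace(seq, 'T' => 'U')

# More specific method: N is captured from the argument's type
# and reused to build the result type.
toy_transcribe(seq::ToyDNA{N}) where N = ToyRNA{N}(replace(seq.data, 'T' => 'U'))

from_string = toy_transcribe("GATTACA")             # uses the fallback
from_toy    = toy_transcribe(ToyDNA{2}("GATTACA"))  # uses the ToyDNA method
```

Here from_toy is a ToyRNA{2}: the 2 came along from the input type without me having to write one method per value of N.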

Benchmarks

using BenchmarkTools

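testseq = randdnaseq(100_000) # this is defined in BioSequences
testseq_str = string(testseq)

# String method; $ interpolates the global so its lookup isn't part of the timing
@benchmark simple_transcribe($testseq_str)
# BioSequences methods; setup runs before each sample and is not timed,
# so only the transcription itself is measured
@benchmark simple_transcribe(x) setup=(x=LongDNA{2}($testseq))
@benchmark simple_transcribe(x) setup=(x=LongDNA{4}($testseq))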

Conclusions

The replace() method does surprisingly well, but the BioJulia method is about 2x faster on a 2-bit sequence (that is, if there's no ambiguity), and about the same speed on 4-bit sequences.