Searching for sequence motifs
There are many ways to search for particular motifs in biological sequences:
- Exact searches, where you are looking for exact matches of a particular character of substring.
- Approximate searches, where you are looking for sequences that are sufficiently similar to a given sequence or family of sequences.
- Searches where you are looking for sequences that conform to some sort of pattern.
Like other Julia sequences such as Vector
, you can search a BioSequence
with the findfirst(predicate, collection)
method pattern.
All these kinds of searches are provided in BioSequences.jl, and they all conform to the findnext
, findprev
, and occursin
patterns established in Base
for String
and collections like Vector
.
The exception is searching using the specialised regex provided in this package, which as you shall see, conforms to the match
pattern established in Base
for pcre and String
s.
Symbol search
julia> seq = dna"ACAGCGTAGCT";
julia> findfirst(DNA_A, seq)
1
julia> findlast(DNA_A, seq)
8
julia> findnext(DNA_A, seq, 2)
3
julia> findprev(DNA_A, seq, 7)
3
julia> findall(DNA_A, seq)
3-element Vector{Int64}:
1
3
8
Exact search
BioSequences.ExactSearchQuery
— TypeExactSearchQuery{F<:Function,S<:BioSequence}
Query type for exact sequence search.
An exact search, is one where are you are looking in some given sequence, for exact instances of some given substring.
These queries are used as a predicate for the Base.findnext
, Base.findprev
, Base.occursin
, Base.findfirst
, and Base.findlast
functions.
Examples
julia> seq = dna"ACAGCGTAGCT";
julia> query = ExactSearchQuery(dna"AGC");
julia> findfirst(query, seq)
3:5
julia> findlast(query, seq)
8:10
julia> findnext(query, seq, 6)
8:10
julia> findprev(query, seq, 7)
3:5
julia> findall(query, seq)
2-element Vector{UnitRange{Int64}}:
3:5
8:10
julia> occursin(query, seq)
true
You can pass a comparator function such as isequal
or iscompatible
to its constructor to modify the search behaviour.
The default is isequal
, however, in biology, sometimes we want a more flexible comparison to find subsequences of compatible symbols.
julia> query = ExactSearchQuery(dna"CGT", iscompatible);
julia> findfirst(query, dna"ACNT") # 'N' matches 'G'
2:4
julia> findfirst(query, dna"ACGT") # 'G' matches 'N'
2:4
julia> occursin(ExactSearchQuery(dna"CNT", iscompatible), dna"ACNT")
true
Allowing mismatches
BioSequences.ApproximateSearchQuery
— TypeApproximateSearchQuery{F<:Function,S<:BioSequence}
Query type for approximate sequence search.
These queries are used as a predicate for the Base.findnext
, Base.findprev
, Base.occursin
, Base.findfirst
, and Base.findlast
functions.
Using these functions with these queries allows you to search a given sequence for a sub-sequence, whilst allowing a specific number of errors.
In other words they find a subsequence of the target sequence within a specific Levenshtein distance of the query sequence.
Examples
julia> seq = dna"ACAGCGTAGCT";
julia> query = ApproximateSearchQuery(dna"AGGG");
julia> findfirst(query, 0, seq) == nothing # nothing matches with no errors
true
julia> findfirst(query, 1, seq) # seq[3:6] matches with one error
3:6
julia> findfirst(query, 2, seq) # seq[1:4] matches with two errors
1:4
You can pass a comparator function such as isequal
or iscompatible
to its constructor to modify the search behaviour.
The default is isequal
, however, in biology, sometimes we want a more flexible comparison to find subsequences of compatible symbols.
julia> query = ApproximateSearchQuery(dna"AGGG", iscompatible);
julia> occursin(query, 1, dna"AAGNGG") # 1 mismatch permitted (A vs G) & matched N
true
julia> findnext(query, 1, dna"AAGNGG", 1) # 1 mismatch permitted (A vs G) & matched N
1:4
This method of searching for motifs was implemented with smaller query motifs in mind.
If you are looking to search for imperfect matches of longer sequences in this manner, you are likely better off using some kind of local-alignment algorithm or one of the BLAST variants.
Searching according to a pattern
Regular expression search
Query patterns can be described in regular expressions. The syntax supports a subset of Perl and PROSITE's notation.
Biological regexes can be constructed using the BioRegex
constructor, for example by doing BioRegex{AminoAcid}("MV+")
. For bioregex literals, it is instead recommended using the @biore_str
macro:
The Perl-like syntax starts with biore
(BIOlogical REgular expression) and ends with a symbol option: "dna", "rna" or "aa". For example, biore"A+"dna
is a regular expression for DNA sequences and biore"A+"aa
is for amino acid sequences. The symbol options can be abbreviated to its first character: "d", "r" or "a", respectively.
Here are examples of using the regular expression for BioSequence
s:
julia> match(biore"A+C*"dna, dna"AAAACC")
RegexMatch("AAAACC")
julia> match(biore"A+C*"d, dna"AAAACC")
RegexMatch("AAAACC")
julia> occursin(biore"A+C*"dna, dna"AAC")
true
julia> occursin(biore"A+C*"dna, dna"C")
false
match
will return a RegexMatch
if a match is found, otherwise it will return nothing
if no match is found.
The table below summarizes available syntax elements.
Syntax | Description | Example |
---|---|---|
| | alternation | "A|T" matches "A" and "T" |
* | zero or more times repeat | "TA*" matches "T" , "TA" and "TAA" |
+ | one or more times repeat | "TA+" matches "TA" and "TAA" |
? | zero or one time | "TA?" matches "T" and "TA" |
{n,} | n or more times repeat | "A{3,}" matches "AAA" and "AAAA" |
{n,m} | n -m times repeat | "A{3,5}" matches "AAA" , "AAAA" and "AAAAA" |
^ | the start of the sequence | "^TAN*" matches "TATGT" |
$ | the end of the sequence | "N*TA$" matches "GCTA" |
(...) | pattern grouping | "(TA)+" matches "TA" and "TATA" |
[...] | one of symbols | "[ACG]+" matches "AGGC" |
eachmatch
and findfirst
are also defined, just like usual regex and strings found in Base
.
julia> collect(matched(x) for x in eachmatch(biore"TATA*?"d, dna"TATTATAATTA")) # overlap
4-element Vector{LongSequence{DNAAlphabet{4}}}:
TAT
TAT
TATA
TATAA
julia> collect(matched(x) for x in eachmatch(biore"TATA*"d, dna"TATTATAATTA", false)) # no overlap
2-element Vector{LongSequence{DNAAlphabet{4}}}:
TAT
TATAA
julia> findfirst(biore"TATA*"d, dna"TATTATAATTA")
1:3
julia> findfirst(biore"TATA*"d, dna"TATTATAATTA", 2)
4:8
Noteworthy differences from strings are:
- Ambiguous characters match any compatible characters (e.g.
biore"N"d
is equivalent tobiore"[ACGT]"d
). - Whitespaces are ignored (e.g.
biore"A C G"d
is equivalent tobiore"ACG"d
).
The PROSITE notation is described in ScanProsite - user manual. The syntax supports almost all notations including the extended syntax. The PROSITE notation starts with prosite
prefix and no symbol option is needed because it always describes patterns of amino acid sequences:
julia> match(prosite"[AC]-x-V-x(4)-{ED}", aa"CPVPQARG")
RegexMatch("CPVPQARG")
julia> match(prosite"[AC]xVx(4){ED}", aa"CPVPQARG")
RegexMatch("CPVPQARG")
Position weight matrix search
A motif can be specified using position weight matrix (PWM) in a probabilistic way. This method searches for the first position in the sequence where a score calculated using a PWM is greater than or equal to a threshold. More formally, denoting the sequence as $S$ and the PWM value of symbol $s$ at position $j$ as $M_{s,j}$, the score starting from a position $p$ is defined as
\[\operatorname{score}(S, p) = \sum_{i=1}^L M_{S[p+i-1],i}\]
and the search returns the smallest $p$ that satisfies $\operatorname{score}(S, p) \ge t$.
There are two kinds of matrices in this package: PFM
and PWM
. The PFM
type is a position frequency matrix and stores symbol frequencies for each position. The PWM
is a position weight matrix and stores symbol scores for each position. You can create a PFM
from a set of sequences with the same length and then create a PWM
from the PFM
object.
julia> motifs = [dna"TTA", dna"CTA", dna"ACA", dna"TCA", dna"GTA"]
5-element Vector{LongSequence{DNAAlphabet{4}}}:
TTA
CTA
ACA
TCA
GTA
julia> pfm = PFM(motifs) # sequence set => PFM
4×3 PFM{DNA, Int64}:
A 1 0 5
C 1 2 0
G 1 0 0
T 2 3 0
julia> pwm = PWM(pfm) # PFM => PWM
4×3 PWM{DNA, Float64}:
A -0.321928 -Inf 2.0
C -0.321928 0.678072 -Inf
G -0.321928 -Inf -Inf
T 0.678072 1.26303 -Inf
julia> pwm = PWM(pfm .+ 0.01) # add pseudo counts to avoid infinite values
4×3 PWM{DNA, Float64}:
A -0.319068 -6.97728 1.99139
C -0.319068 0.673772 -6.97728
G -0.319068 -6.97728 -6.97728
T 0.673772 1.25634 -6.97728
julia> pwm = PWM(pfm .+ 0.01, prior=[0.2, 0.3, 0.3, 0.2]) # GC-rich prior
4×3 PWM{DNA, Float64}:
A 0.00285965 -6.65535 2.31331
C -0.582103 0.410737 -7.24031
G -0.582103 -7.24031 -7.24031
T 0.9957 1.57827 -6.65535
The $PWM_{s,j}$ matrix is computed from $PFM_{s,j}$ and the prior probability $p(s)$ as follows ([Wasserman2004]):
\[\begin{align} PWM_{s,j} &= \log_2 \frac{p(s,j)}{p(s)} \\ p(s,j) &= \frac{PFM_{s,j}}{\sum_{s'} PFM_{s',j}}. \end{align}\]
However, if you just want to quickly conduct a search, constructing the PFM and PWM is done for you as a convenience if you build a PWMSearchQuery
, using a collection of sequences:
julia> motifs = [dna"TTA", dna"CTA", dna"ACA", dna"TCA", dna"GTA"]
5-element Vector{LongSequence{DNAAlphabet{4}}}:
TTA
CTA
ACA
TCA
GTA
julia> subject = dna"TATTATAATTA";
julia> qa = PWMSearchQuery(motifs, 1.0);
julia> findfirst(qa, subject)
3
julia> findall(qa, subject)
3-element Vector{Int64}:
3
5
9
[Wasserman2004]: https://doi.org/10.1038/nrg1315