BioMarkovChains.TransitionModel - Type
struct TransitionModel

The TransitionModel struct represents a transition model used in sequence analysis. It consists of a transition probability matrix (TransitionProbabilityMatrix), the initial distribution probabilities (initials), and the order of the model (n).

Fields

  • TransitionProbabilityMatrix::Matrix{Float64}: The transition probability matrix; entry (i, j) is the probability of transitioning from state i to state j.
  • initials::Matrix{Float64}: The initial distribution probabilities, i.e. the probability of starting in each state.
  • n::Int64: The order of the transition model, i.e. the order of the resulting Markov chain.

Constructors

  • TransitionModel(tpm::Matrix{Float64}, initials::Matrix{Float64}; n::Int64=1): Constructs a TransitionModel object with the provided transition probability matrix and initial distribution probabilities.
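The field layout above can be pictured with a small stand-in type. This is a minimal sketch rather than the package's definition: SketchTransitionModel is a hypothetical name, initials is stored as a Vector{Float64} for simplicity, and the two validation checks are assumptions about what a constructor might reasonably enforce.

```julia
# Hypothetical stand-in for illustration only; not the package's definition.
struct SketchTransitionModel
    tpm::Matrix{Float64}       # entry (i, j) = P(next state j | current state i)
    initials::Vector{Float64}  # probability of starting in each state
    n::Int64                   # order of the Markov chain

    # Keyword `n` mirrors the documented constructor; the checks are assumptions.
    function SketchTransitionModel(tpm, initials; n::Int64 = 1)
        size(tpm, 1) == size(tpm, 2) ||
            throw(ArgumentError("tpm must be square"))
        length(initials) == size(tpm, 1) ||
            throw(ArgumentError("one initial probability per state is required"))
        new(tpm, initials, n)
    end
end

# Two-state toy model with made-up probabilities.
model = SketchTransitionModel([0.9 0.1; 0.4 0.6], [0.5, 0.5])
```

Keeping the order n alongside the matrix lets downstream code tell a first-order matrix apart from one that has already been raised to a higher order.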
source
BioMarkovChains.dnaseqprobability - Method
dnaseqprobability(sequence::LongNucOrView{4}, tm::TransitionModel)

Compute the probability of a given sequence using a transition probability matrix and the initial probability distribution.

\[P(X_1 = i_1, \ldots, X_T = i_T) = \pi_{i_1} \prod_{t=1}^{T-1} a_{i_t, i_{t+1}}\]

Arguments

  • sequence::LongNucOrView{4}: The input sequence of nucleotides.
  • tm::TransitionModel: The transition model, composed of the transition probability matrix tpm::Matrix{Float64} and the initial state probabilities initials::Vector{Float64}.

Returns

  • probability::Float64: The probability of the input sequence.
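The product in the formula can be sketched on plain strings and arrays. The helper name sketch_sequence_probability is hypothetical, and the A, C, G, T state ordering is an assumption taken from the matrices shown in this section; the package itself works on BioSequences types. The numbers are the rounded values from this entry's example.

```julia
# Hypothetical helper, not part of the package API.
# Assumption: states are ordered A, C, G, T.
function sketch_sequence_probability(seq::AbstractString,
                                     tpm::Matrix{Float64},
                                     initials::Vector{Float64})
    index = Dict('A' => 1, 'C' => 2, 'G' => 3, 'T' => 4)
    states = [index[c] for c in seq]
    p = initials[states[1]]                  # initial probability of the first symbol
    for t in 1:length(states)-1
        p *= tpm[states[t], states[t+1]]     # one factor a_{i_t, i_{t+1}} per transition
    end
    return p
end

tpm = [0.0  1.0   0.0   0.0;
       0.0  0.5   0.2   0.3;
       0.25 0.125 0.625 0.0;
       0.0  0.667 0.333 0.0]
initials = [0.0869565, 0.434783, 0.347826, 0.130435]

p = sketch_sequence_probability("CCTG", tpm, initials)
```

For CCTG the factors are the initial probability of C (0.434783) followed by the C to C (0.5), C to T (0.3) and T to G (0.333) transitions, which multiply to roughly 0.0217.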

Example

mainseq = LongDNA{4}("CCTCCCGGACCCTGGGCTCGGGAC")

tpm = transition_probability_matrix(mainseq)
    
    4×4 Matrix{Float64}:
    0.0   1.0    0.0    0.0
    0.0   0.5    0.2    0.3
    0.25  0.125  0.625  0.0
    0.0   0.667  0.333  0.0

initials = initial_distribution(mainseq)

    1×4 Vector{Float64}:
    0.0869565  
    0.434783
    0.347826
    0.130435
    
tm = transition_model(tpm, initials)
- Transition Probability Matrix -> Matrix{Float64}(4 × 4):
    0.0	  1.0	 0.0	0.0
    0.0	  0.5	 0.2	0.3
    0.25  0.125	 0.625	0.0
    0.0	  0.667	 0.333	0.0
- Initial Probabilities -> Vector{Float64}(4 × 1):
   0.087	
   0.435	
   0.348	
   0.13
- Markov Chain Order:1

newseq = LongDNA{4}("CCTG")

    4nt DNA Sequence:
    CCTG


dnaseqprobability(newseq, tm)
    
    0.0217
source
BioMarkovChains.hasprematurestop - Method
hasprematurestop(sequence::LongNucOrView{4})::Bool

Determine whether the sequence of type LongSequence{DNAAlphabet{4}} contains a premature stop codon.

Returns a boolean indicating whether the sequence has more than one stop codon.
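The "more than one stop codon" rule can be sketched on a plain string. This is an illustration under stated assumptions, not the package's code: sketch_hasprematurestop is a hypothetical name, codons are read in frame starting at the first base, and any trailing partial codon is ignored.

```julia
# Hypothetical helper on plain strings; the package operates on BioSequences types.
function sketch_hasprematurestop(seq::AbstractString)
    stops = ("TAA", "TAG", "TGA")
    # Split into in-frame codons, ignoring a trailing partial codon.
    codons = [seq[i:i+2] for i in 1:3:length(seq)-2]
    # More than one in-frame stop means one of them occurs before the end.
    return count(in(stops), codons) > 1
end

early = sketch_hasprematurestop("ATGTAAGCATAG")  # codons: ATG TAA GCA TAG
clean = sketch_hasprematurestop("ATGGCATCTTAG")  # codons: ATG GCA TCT TAG
```

The first sequence contains TAA in frame before the terminal TAG, so it is flagged; the second has only the terminal stop.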

source
BioMarkovChains.iscoding - Function
iscoding(
    sequence::LongSequence{DNAAlphabet{4}}, 
    codingmodel::TransitionModel, 
    noncodingmodel::TransitionModel,
    η::Float64 = 1e-5
    )

Check if a given DNA sequence is likely to be coding based on a log-odds ratio. The log-odds ratio is a statistical measure used to assess the likelihood of a sequence being coding or non-coding. It compares the probability of the sequence generated by a coding model to the probability of the sequence generated by a non-coding model. If the log-odds ratio exceeds a given threshold (η), the sequence is considered likely to be coding. It is formally described as a decision rule:

\[S(X) = \log \left( \frac{P_C(X_1=i_1, \ldots, X_T=i_T)}{P_N(X_1=i_1, \ldots, X_T=i_T)} \right) \begin{cases} > \eta & \Rightarrow \text{coding} \\ < \eta & \Rightarrow \text{noncoding} \end{cases}\]
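The decision rule can be sketched as a difference of log-probabilities. Everything here is illustrative: sketch_logprob and sketch_iscoding are hypothetical helpers, each model is represented as a (tpm, initials) tuple rather than a TransitionModel, the state order A, C, G, T is assumed, and the two toy models use made-up numbers.

```julia
# Hypothetical helper: log-probability of a sequence under a first-order model.
# Assumption: states are ordered A, C, G, T.
function sketch_logprob(seq::AbstractString, tpm::Matrix{Float64}, initials::Vector{Float64})
    index = Dict('A' => 1, 'C' => 2, 'G' => 3, 'T' => 4)
    s = [index[c] for c in seq]
    lp = log(initials[s[1]])
    for t in 1:length(s)-1
        lp += log(tpm[s[t], s[t+1]])
    end
    return lp
end

# S(X) = log P_C(X) - log P_N(X); the sequence is called coding when S(X) > η.
function sketch_iscoding(seq, coding, noncoding, η = 1e-5)
    score = sketch_logprob(seq, coding...) - sketch_logprob(seq, noncoding...)
    return score > η
end

# Toy models (made-up numbers): "coding" favours repeating the same base,
# "noncoding" is uniform over all transitions.
coding = ([i == j ? 0.7 : 0.1 for i in 1:4, j in 1:4], fill(0.25, 4))
noncoding = (fill(0.25, 4, 4), fill(0.25, 4))

repeats = sketch_iscoding("AAAACCCC", coding, noncoding)
alternating = sketch_iscoding("ACGTACGT", coding, noncoding)
```

Summing logs instead of dividing raw probabilities is a choice made for this sketch: it keeps long sequences from underflowing to zero before the ratio is taken.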

Arguments

  • sequence::LongSequence{DNAAlphabet{4}}: The DNA sequence to be evaluated.
  • codingmodel::TransitionModel: The transition model for coding regions.
  • noncodingmodel::TransitionModel: The transition model for non-coding regions.
  • η::Float64 = 1e-5: The threshold value (eta) for the log-odds ratio (default: 1e-5).

Returns

  • true if the sequence is likely to be coding.
  • false if the sequence is likely to be non-coding.

Raises

  • ErrorException: if the length of the sequence is not divisible by 3.
  • ErrorException: if the sequence contains a premature stop codon.

Example

sequence = LongDNA{4}("ATGGCATCTAG")
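codingmodel = TransitionModel()     # placeholder: supply a model built from coding sequences
noncodingmodel = TransitionModel()  # placeholder: supply a model built from non-coding sequences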
iscoding(sequence, codingmodel, noncodingmodel)  # Returns: true
source
BioMarkovChains.transition_count_matrix - Method
transition_count_matrix(sequence::LongSequence{DNAAlphabet{4}})

Compute the transition count matrix (TCM) of a given DNA sequence.

Arguments

  • sequence::LongSequence{DNAAlphabet{4}}: a LongSequence{DNAAlphabet{4}} object representing the DNA sequence.

Keywords

  • extended_alphabet::Bool=false: If true, the extended DNA alphabet is used when counting transitions.

Returns

A Matrix object representing the transition count matrix of the sequence.
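Counting transitions amounts to one pass over adjacent pairs. The sketch below works on a plain string with a hypothetical helper name and assumes the A, C, G, T ordering of rows and columns used in this entry's example; it is not the package's implementation.

```julia
# Hypothetical helper on plain strings.
# Assumption: rows and columns are ordered A, C, G, T.
function sketch_transition_count_matrix(seq::AbstractString)
    index = Dict('A' => 1, 'C' => 2, 'G' => 3, 'T' => 4)
    tcm = zeros(Int64, 4, 4)
    for t in 1:length(seq)-1
        # Row = current base, column = next base.
        tcm[index[seq[t]], index[seq[t+1]]] += 1
    end
    return tcm
end

tcm = sketch_transition_count_matrix("AGCTAGCTAGCT")
```

A sequence of length T contributes T - 1 transitions, which is why the T to A count is 2 while the other three observed pairs each occur 3 times.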

Example

seq = LongDNA{4}("AGCTAGCTAGCT")

tcm = transition_count_matrix(seq)

4x4 Matrix{Int64}:
   A C G T
A  0 0 3 0
C  0 0 0 3
G  0 3 0 0
T  2 0 0 0
source
BioMarkovChains.transition_model - Function
transition_model(sequence::LongNucOrView{4}, n::Int64=1)

Constructs a transition model based on the given DNA sequence and transition order.

Arguments

  • sequence::LongNucOrView{4}: A DNA sequence represented as a LongNucOrView{4} object.
  • n::Int64 (optional): The transition order (default: 1).

Returns

A TransitionModel object representing the transition model.


transition_model(tpm::Matrix{Float64}, initials::Matrix{Float64}, n::Int64=1)

Builds a transition model from the transition probability matrix and the initial distribution. It can also calculate higher orders of the model when n is changed.

Arguments

  • tpm::Matrix{Float64}: the transition probability matrix
  • initials::Vector{Float64}: the initial distributions of the model.
  • n::Int64 (optional): The transition order (default: 1).

Returns

A TransitionModel object built from the given transition probability matrix and initial distribution.
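The n = 2 output in this entry's example is consistent with raising the first-order matrix to the n-th power. Assuming that reading, the sketch below rebuilds the first-order matrix for "ACTACATCTA" by hand (rows and columns ordered A, C, G, T) and squares it; it illustrates the arithmetic and does not call the package.

```julia
using LinearAlgebra  # matrix multiplication and integer matrix powers

# First-order transition probabilities of "ACTACATCTA", derived by hand:
# A->C twice, A->T once; C->T twice, C->A once; T->A twice, T->C once; G never occurs.
tpm = [0.0 2/3 0.0 1/3;
       1/3 0.0 0.0 2/3;
       0.0 0.0 0.0 0.0;
       2/3 1/3 0.0 0.0]

# Assumption: an order-n model uses the n-th matrix power of the first-order matrix.
tpm2 = tpm^2
rounded = round.(tpm2, digits = 3)
```

Each entry of the squared matrix sums over the intermediate base, so the A to A entry is (2/3)(1/3) + (1/3)(2/3) = 4/9, or about 0.444.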

Example

sequence = LongDNA{4}("ACTACATCTA")

model = transition_model(sequence, 2)
TransitionModel:
  - Transition Probability Matrix -> Matrix{Float64}(4 × 4):
    0.444	0.111	0.0	  0.444
    0.444	0.444	0.0	  0.111
    0.0	    0.0	    0.0	  0.0
    0.111	0.444	0.0	  0.444
  - Initial Probabilities -> Vector{Float64}(4 × 1):
    0.333
    0.333
    0.0
    0.333
  - Markov Chain Order:2
source
BioMarkovChains.transition_probability_matrix - Function
transition_probability_matrix(sequence::LongSequence{DNAAlphabet{4}})

Compute the transition probability matrix (TPM) of a given DNA sequence. Formally, it constructs $\hat{A}$ where:

\[a_{ij} = P(X_t = j \mid X_{t-1} = i) = \frac{P(X_{t-1} = i, X_t = j)}{P(X_{t-1} = i)}\]

Arguments

  • sequence::LongNucOrView{4}: a LongNucOrView{4} object representing the DNA sequence.
  • n::Int64=1: The order of the Markov model, i.e. the power n in $\hat{A}^{n}$.

Keywords

  • extended_alphabet::Bool=false: If true, the extended DNA alphabet is used when counting transitions.

Returns

A Matrix object representing the transition probability matrix of the sequence.
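In terms of counts, the conditional probability above reduces to dividing each row of the transition count matrix by its row sum. The sketch below shows that step with a hypothetical helper; leaving all-zero rows at zero for bases that never appear as a predecessor is an assumption, chosen because it matches the zero rows visible in this section's example outputs.

```julia
# Hypothetical helper: row-normalise a transition count matrix.
function sketch_transition_probability_matrix(tcm::Matrix{Int64})
    tpm = zeros(Float64, size(tcm))
    for i in axes(tcm, 1)
        total = sum(tcm[i, :])
        total == 0 && continue            # unseen state: keep the row at zero
        tpm[i, :] = tcm[i, :] ./ total    # a_ij = count(i -> j) / count(i -> anything)
    end
    return tpm
end

# Counts for "AGCTAGCTAGCT", rows and columns ordered A, C, G, T.
tcm = [0 0 3 0; 0 0 0 3; 0 3 0 0; 2 0 0 0]
tpm = sketch_transition_probability_matrix(tcm)
```

Because each base in this sequence is always followed by the same base, every non-empty row normalises to a single 1.0.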

Example

seq = dna"AGCTAGCTAGCT"

tpm = transition_probability_matrix(seq)

4x4 Matrix{Float64}:
   A   C   G   T
A  0.0 0.0 1.0 0.0
C  0.0 0.0 0.0 1.0
G  0.0 1.0 0.0 0.0
T  1.0 0.0 0.0 0.0
source
BioMarkovChains.transitions - Method
transitions(sequence::LongSequence)

Compute the transition counts of each pair in a given biological sequence.

Arguments

  • sequence::LongSequence{DNAAlphabet{4}}: a LongSequence{DNAAlphabet{4}} object representing the DNA sequence.

Returns

A Dict{Tuple{DNA, DNA}, Int64} whose keys are Tuple{DNA, DNA} pairs representing the dinucleotides and whose values are the number of occurrences of each dinucleotide in the sequence.
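The same counting can be sketched with a dictionary keyed by adjacent pairs. This illustration uses Char tuples on a plain string and a hypothetical helper name, whereas the package returns keys of type Tuple{DNA, DNA}.

```julia
# Hypothetical helper on plain strings; keys are (current, next) character pairs.
function sketch_transitions(seq::AbstractString)
    counts = Dict{Tuple{Char, Char}, Int64}()
    for t in 1:length(seq)-1
        pair = (seq[t], seq[t+1])
        counts[pair] = get(counts, pair, 0) + 1   # start at 0 for unseen pairs
    end
    return counts
end

counts = sketch_transitions("AGCTAGCTAGCT")
```

Only pairs that actually occur get an entry, so this sequence yields four keys rather than all sixteen possible dinucleotides.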

Example

seq = dna"AGCTAGCTAGCT"

transitions(seq)
Dict{Tuple{DNA, DNA}, Int64} with 4 entries:
  (DNA_C, DNA_T) => 3
  (DNA_G, DNA_C) => 3
  (DNA_T, DNA_A) => 2
  (DNA_A, DNA_G) => 3
source