Intervals: Genomic Interval Manipulation
The Intervals
module consists of tools for working efficiently with genomic intervals.
Interval types
Intervals in Bio.jl are consistent with ranges in Julia: 1-based and end-inclusive. When data is read from formats with different representations (i.e. 0-based and/or end-exclusive) they are always converted automatically. Similarly when writing data. You should not have to reason about off-by-one errors due to format differences while using functionality provided in Bio.jl.
The Interval
type is defined as
type Interval{T} <: AbstractInterval{Int64} seqname::StringField first::Int64 last::Int64 strand::Strand metadata::T end
The first three fields (seqname
, first
, and last
) are mandatory arguments when constructing an Interval
object. seqname
is the sequence name associated with the interval. The first
and last
fields are the leftmost and rightmost positions of the interval, which can be accessed with leftposition
and rightposition
functions, respectively.
The strand
field can take four kinds of values listed in the next table:
Symbol | Constant | Meaning |
---|---|---|
'?' |
STRAND_NA |
strand is unknown or inapplicable |
'+' |
STRAND_POS |
positive strand |
'-' |
STRAND_NEG |
negative strand |
'.' |
STRAND_BOTH |
non-strand-specific feature |
Similarly to the SeqRecord
type in the Bio.Seq
module, Interval
is parameterized on metadata type, which lets it efficiently and precisely be specialized to represent intervals from a variety of formats.
The default strand and metadata values are STRAND_BOTH
and nothing
:
julia> Interval("chr1", 10000, 20000) Bio.Intervals.Interval{Void}: sequence name: chr1 leftmost position: 10000 rightmost position: 20000 strand: . metadata: nothing julia> Interval("chr1", 10000, 20000, '+') Bio.Intervals.Interval{Void}: sequence name: chr1 leftmost position: 10000 rightmost position: 20000 strand: + metadata: nothing
The following example shows all accessor functions for the five fields:
julia> i = Interval("chr1", 10000, 20000, '+', "some annotation") Bio.Intervals.Interval{String}: sequence name: chr1 leftmost position: 10000 rightmost position: 20000 strand: + metadata: some annotation julia> seqname(i) "chr1" julia> leftposition(i) 10000 julia> rightposition(i) 20000 julia> strand(i) STRAND_POS julia> metadata(i) "some annotation"
Collections of intervals
Collections of intervals are represented using the IntervalCollection
type, which is a general purpose indexed container for intervals. It supports fast intersection operations as well as insertion, deletion, and sorted iteration.
Interval collections can be initialized by inserting elements one by one using push!
.
# The type parameter (Void here) indicates the interval metadata type. incol = IntervalCollection{Void}() for i in 1:100:10000 push!(incol, Interval("chr1", i, i + 99)) end
Incrementally building an interval collection like this works, but IntervalCollection
also has a bulk insertion constructor that is able to build the indexed data structure extremely efficiently from an array of intervals.
incol = IntervalCollection([Interval("chr1", i, i + 99) for i in 1:100:10000])
Bulding IntervalCollections
in one shot like this should be preferred when it's convenient or speed in an issue.
Intersection
There are number of intersect
function in the Intervals module. They follow two patterns: interval versus collection queries which return an iterator over intervals in the collection that intersect the query, and collection versus collection queries which iterate over all pairs of intersecting intervals.
#
Base.intersect
— Function.
intersect(s1,s2...) ∩(s1,s2)
Construct the intersection of two or more sets. Maintains order and multiplicity of the first argument for arrays and ranges.
Interval streams
Intervals need not necessarily stored in an indexed data structure for efficient intersection to be practical. Two collections of intervals need only be both sorted to compute all intersecting pairs. This is particularly useful in genomics where datasets are sometimes so large that loading them entirely into memory is not practical.
The Intervals module is able to intersect any two iterators that yield intervals in sorted order, which we refer to as "interval streams". An IntervalCollection
is also a interval stream, but so is a sorted array of intervals, and parsers over interval file formats. This allows for a very general notion of intersection.
for (x, y) in intersect(open(BEDReader, "x_features.bed"), open(BEDReader, "y_features.bed")) println("Intersection found between ", x, " and ", y) end
An exception will be thrown if an interval in encountered out of order while processing an interval stream. Ordering of intervals has one complication: there is not necessarily a standardized way to order sequence names. By default in Bio.jl intervals are sorted using a special alphanum_isless
comparison function that compares numbers numerically if they exist in string, so that names like chr1, chr2, chr10
end up in their natural order.
The intersect
function takes as an optional parameter an isless
function to use to compare sequence names to account for arbitrary sequence name orderings.
# assume lexigraphic ordering for sequence names for (x, y) in intersect(xs, ys, isless) println("Intersection found between ", a, " and ", b) end
A special sort of intersection can also be performed on a IntervalStreams
against itself to produce "coverage intervals".
#
Bio.Intervals.coverage
— Function.
coverage(intervals)
Compute the coverage of a collection of intervals and return an IntervalCollection
that contains run-length encoded coverage data.
For example, given intervals like:
[------] [------------] [---------------]
This function would return a new set of disjoint intervals with annotated coverage like:
[1][-2-][-1-][--2--][--1--]