Intervals: Genomic Interval Manipulation

The Intervals module consists of tools for working efficiently with genomic intervals.

Interval types

Intervals in Bio.jl are consistent with ranges in Julia: 1-based and end-inclusive. When data is read from formats that use a different representation (e.g. 0-based and/or end-exclusive), coordinates are converted automatically, and the reverse conversion is applied when data is written. You should not have to reason about off-by-one errors caused by format differences when using functionality provided in Bio.jl.
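To make the conversion concrete, here is a minimal sketch of the arithmetic involved when a format such as BED stores 0-based, end-exclusive coordinates. The function names are hypothetical and are not part of Bio.jl; the library performs the equivalent conversion internally in its readers and writers.

```julia
# Hypothetical helpers for illustration only; not Bio.jl API.

# 0-based, end-exclusive (BED-style) -> 1-based, end-inclusive.
# Only the start moves: the exclusive end equals the inclusive end.
from_zero_based_half_open(start, stop) = (start + 1, stop)

# 1-based, end-inclusive -> 0-based, end-exclusive.
to_zero_based_half_open(first, last) = (first - 1, last)

# A BED record "chr1 9999 20000" covers positions 10000 through 20000.
from_zero_based_half_open(9999, 20000)  # (10000, 20000)
```

Both representations describe the same number of positions: `stop - start` in the 0-based form equals `last - first + 1` in the 1-based form.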

The Interval type is defined as

type Interval{T} <: AbstractInterval{Int64}
    seqname::StringField
    first::Int64
    last::Int64
    strand::Strand
    metadata::T
end

The first three fields (seqname, first, and last) are mandatory arguments when constructing an Interval object. seqname is the sequence name associated with the interval. The first and last fields are the leftmost and rightmost positions of the interval, which can be accessed with the leftposition and rightposition functions, respectively.

The strand field can take one of four values, listed in the following table:

Symbol	Constant	Meaning
`'?'`	`STRAND_NA`	strand is unknown or inapplicable
`'+'`	`STRAND_POS`	positive strand
`'-'`	`STRAND_NEG`	negative strand
`'.'`	`STRAND_BOTH`	non-strand-specific feature

Like the SeqRecord type in the Bio.Seq module, Interval is parameterized on its metadata type, which lets it be specialized efficiently and precisely to represent intervals from a variety of formats.

The default strand and metadata values are STRAND_BOTH and nothing:

julia> Interval("chr1", 10000, 20000)
Bio.Intervals.Interval{Void}:
  sequence name: chr1
  leftmost position: 10000
  rightmost position: 20000
  strand: .
  metadata: nothing

julia> Interval("chr1", 10000, 20000, '+')
Bio.Intervals.Interval{Void}:
  sequence name: chr1
  leftmost position: 10000
  rightmost position: 20000
  strand: +
  metadata: nothing

The following example shows all accessor functions for the five fields:

julia> i = Interval("chr1", 10000, 20000, '+', "some annotation")
Bio.Intervals.Interval{String}:
  sequence name: chr1
  leftmost position: 10000
  rightmost position: 20000
  strand: +
  metadata: some annotation

julia> seqname(i)
"chr1"

julia> leftposition(i)
10000

julia> rightposition(i)
20000

julia> strand(i)
STRAND_POS

julia> metadata(i)
"some annotation"

Collections of intervals

Collections of intervals are represented using the IntervalCollection type, which is a general purpose indexed container for intervals. It supports fast intersection operations as well as insertion, deletion, and sorted iteration.

Interval collections can be initialized by inserting elements one by one using push!.

# The type parameter (Void here) indicates the interval metadata type.
incol = IntervalCollection{Void}()

for i in 1:100:10000
    push!(incol, Interval("chr1", i, i + 99))
end

Building an interval collection incrementally like this works, but IntervalCollection also has a bulk insertion constructor that can build the indexed data structure very efficiently from an array of intervals.

incol = IntervalCollection([Interval("chr1", i, i + 99) for i in 1:100:10000])

Building an IntervalCollection in one shot like this should be preferred when it is convenient or when speed is an issue.

Intersection

There are a number of intersect functions in the Intervals module. They follow two patterns: interval versus collection queries, which return an iterator over the intervals in the collection that intersect the query, and collection versus collection queries, which iterate over all pairs of intersecting intervals.
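The two query patterns can be pinned down with a brute-force sketch. It uses plain `(seqname, first, last)` tuples and hypothetical function names rather than Bio.jl types, and it checks every candidate instead of using an index, so it only illustrates what the queries mean, not how IntervalCollection answers them quickly.

```julia
# Hypothetical stand-ins for illustration; not Bio.jl API.
# An interval is a (seqname, first, last) tuple, 1-based and end-inclusive.

# Two intervals intersect when they are on the same sequence and
# neither one ends before the other begins.
overlaps(a, b) = a[1] == b[1] && a[2] <= b[3] && b[2] <= a[3]

# Interval versus collection: every element that intersects the query.
query_hits(collection, query) = [x for x in collection if overlaps(x, query)]

# Collection versus collection: every intersecting pair.
all_pairs(xs, ys) = [(x, y) for x in xs for y in ys if overlaps(x, y)]

features = [("chr1", 1, 100), ("chr1", 101, 200), ("chr2", 1, 100)]
query_hits(features, ("chr1", 90, 110))
```

Because positions are end-inclusive, intervals that share a single position (for example 1 to 100 and 100 to 200) count as intersecting.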

# Base.intersect — Function.

intersect(s1,s2...)
∩(s1,s2)

Construct the intersection of two or more sets. Maintains order and multiplicity of the first argument for arrays and ranges.


Interval streams

Intervals need not necessarily be stored in an indexed data structure for efficient intersection to be practical. Two collections of intervals need only both be sorted to compute all intersecting pairs. This is particularly useful in genomics, where datasets are sometimes so large that loading them entirely into memory is not practical.

The Intervals module is able to intersect any two iterators that yield intervals in sorted order, which we refer to as "interval streams". An IntervalCollection is an interval stream, but so is a sorted array of intervals, and so are parsers for interval file formats. This allows for a very general notion of intersection.

for (x, y) in intersect(open(BEDReader, "x_features.bed"), open(BEDReader, "y_features.bed"))
    println("Intersection found between ", x, " and ", y)
end
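Sorted input is what makes a single forward pass sufficient. The following is a simplified sketch of that idea, not Bio.jl's implementation: it assumes both inputs are arrays of `(first, last)` tuples on a single sequence, each sorted by first position, and the function name is hypothetical. A true streaming version would buffer only the still-relevant intervals instead of indexing into arrays.

```julia
# Hypothetical sketch for illustration; not Bio.jl API.
# xs and ys hold (first, last) tuples on one sequence, sorted by first position.
function sorted_intersect(xs, ys)
    hits = Tuple{Tuple{Int,Int},Tuple{Int,Int}}[]
    lo = 1  # earliest element of ys that can still intersect x or any later x
    for x in xs
        # Leading ys that end before x starts can never match again,
        # because later elements of xs start at or after x.
        while lo <= length(ys) && ys[lo][2] < x[1]
            lo += 1
        end
        # Scan forward until ys starts past the end of x.
        j = lo
        while j <= length(ys) && ys[j][1] <= x[2]
            if ys[j][2] >= x[1]
                push!(hits, (x, ys[j]))
            end
            j += 1
        end
    end
    return hits
end

sorted_intersect([(1, 10), (20, 30)], [(5, 8), (9, 25), (40, 50)])
```

The call at the end yields three pairs: `(1, 10)` with `(5, 8)`, `(1, 10)` with `(9, 25)`, and `(20, 30)` with `(9, 25)`.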

An exception will be thrown if an interval is encountered out of order while processing an interval stream. Ordering of intervals has one complication: there is not necessarily a standardized way to order sequence names. By default, Bio.jl sorts intervals using a special alphanum_isless comparison function that compares numbers numerically when they occur in a string, so that names like chr1, chr2, chr10 end up in their natural order.
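The difference between natural and plain string ordering can be seen with a small stand-in comparison. This is a sketch under assumptions: `natural_isless` is a hypothetical name, and the actual alphanum_isless in Bio.jl may differ in details such as how it handles leading zeros or mixed chunks.

```julia
# Hypothetical natural-order comparison for illustration; not Bio.jl's alphanum_isless.
function natural_isless(a::AbstractString, b::AbstractString)
    # Split a name into alternating runs of digits and non-digits.
    chunks(s) = [m.match for m in eachmatch(r"\d+|\D+", s)]
    ca, cb = chunks(a), chunks(b)
    for (x, y) in zip(ca, cb)
        if all(isdigit, x) && all(isdigit, y)
            # Digit runs compare by numeric value, so "2" sorts before "10".
            nx, ny = parse(Int, x), parse(Int, y)
            nx != ny && return nx < ny
        elseif x != y
            # Everything else compares as ordinary strings.
            return x < y
        end
    end
    # All shared chunks are equal: the name with fewer chunks sorts first.
    return length(ca) < length(cb)
end

names = ["chr10", "chr2", "chr1"]
sort(names, lt=natural_isless)  # natural order: chr1, chr2, chr10
sort(names)                     # plain string order: chr1, chr10, chr2
```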

The intersect function takes an optional isless function that is used to compare sequence names, which makes it possible to account for arbitrary sequence name orderings.

# assume lexicographic ordering for sequence names
for (x, y) in intersect(xs, ys, isless)
    println("Intersection found between ", x, " and ", y)
end

A special sort of intersection can also be performed on an interval stream against itself to produce "coverage intervals".

# Bio.Intervals.coverage — Function.

coverage(intervals)

Compute the coverage of a collection of intervals and return an IntervalCollection that contains run-length encoded coverage data.

For example, given intervals like:

[------]     [------------]
   [---------------]

This function would return a new set of disjoint intervals with annotated coverage like:

[1][-2-][-1-][--2--][--1--]

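The diagram above can be reproduced with a small sweep over interval endpoints. This is a minimal sketch of the run-length encoding idea, not the coverage function itself: `coverage_runs` is a hypothetical name, intervals are plain `(first, last)` tuples on a single sequence, and the result is an array of `(first, last, depth)` tuples rather than an IntervalCollection.

```julia
# Hypothetical sketch for illustration; not Bio.jl's coverage implementation.
# Intervals are 1-based, end-inclusive (first, last) tuples on one sequence.
function coverage_runs(intervals)
    # Depth rises by one where an interval starts and falls by one
    # just past the position where it ends.
    events = Tuple{Int,Int}[]
    for (lo, hi) in intervals
        push!(events, (lo, 1))
        push!(events, (hi + 1, -1))
    end
    sort!(events)

    runs = Tuple{Int,Int,Int}[]  # (first, last, depth)
    depth = 0
    prev = 0
    i = 1
    while i <= length(events)
        pos = events[i][1]
        # Emit the covered run that ends just before this position.
        if depth > 0 && pos > prev
            push!(runs, (prev, pos - 1, depth))
        end
        # Apply every event at this position before moving on.
        while i <= length(events) && events[i][1] == pos
            depth += events[i][2]
            i += 1
        end
        prev = pos
    end
    return runs
end

# Column positions of the three intervals drawn in the diagram.
coverage_runs([(1, 8), (14, 27), (4, 20)])
```

With the diagram's column positions as coordinates, the result is five runs with depths 1, 2, 1, 2, 1, matching the annotated output. Positions with zero coverage produce no run, and this sketch does not merge adjacent runs that happen to have the same depth.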