Genomic Interval Manipulation

The GenomicFeatures module consists of tools for working efficiently with genomic intervals.

GenomicInterval Type

Intervals in GenomicFeatures are consistent with ranges in Julia: 1-based and end-inclusive. When data is read from formats with different representations (i.e. 0-based and/or end-exclusive) they are always converted automatically. Similarly when writing data, you should not have to reason about off-by-one errors due to format differences while using functionality provided in GenomicFeatures.

The GenomicInterval type is defined as

struct GenomicInterval{T} <: IntervalTrees.AbstractInterval{Int64}
    groupname::String
    first::Int64
    last::Int64
    strand::Strand
    metadata::T
end

The first three fields (groupname, first, and last) are mandatory arguments when constructing the GenomicInterval object. The groupname field holds the group name associated with the interval. In some cases the groupname may denote the chromosome. The first and last fields are the leftmost and rightmost positions of the interval, which can be accessed with leftposition and rightposition functions, respectively.

The strand field can take four kinds of values listed in the next table:

Symbol	Constant	Meaning
`'?'`	`STRAND_NA`	strandedness is relevant, but unknown
`'+'`	`STRAND_POS`	positive strand
`'-'`	`STRAND_NEG`	negative strand
`'.'`	`STRAND_BOTH`	non-strand-specific feature

GenomicInterval is parameterized on metadata type, which lets it efficiently and precisely be specialized to represent intervals from a variety of formats.

The default strand and metadata value are STRAND_BOTH and nothing:

julia> GenomicInterval("chr1", 10000, 20000)
GenomicInterval{Nothing}:
  group name: chr1
  leftmost position: 10000
  rightmost position: 20000
  strand: .
  metadata: nothing

julia> GenomicInterval("chr1", 10000, 20000, '+')
GenomicInterval{Nothing}:
  group name: chr1
  leftmost position: 10000
  rightmost position: 20000
  strand: +
  metadata: nothing

The following example shows all accessor functions for the five fields:

julia> i = GenomicInterval("chr1", 10000, 20000, '+', "some annotation")
GenomicInterval{String}:
  group name: chr1
  leftmost position: 10000
  rightmost position: 20000
  strand: +
  metadata: some annotation

julia> groupname(i)
"chr1"

julia> leftposition(i)
10000

julia> rightposition(i)
20000

julia> strand(i)
STRAND_POS

julia> metadata(i)
"some annotation"

Collections of GenomicIntervals

Collections of intervals are represented using the GenomicIntervalCollection type, which is a general purpose indexed container for intervals. It supports fast intersection operations as well as insertion, deletion, and sorted iteration.

Empty interval collections can be initialized, and intervals elements can be added to the collection one-by-one using push!.

# The type parameter (Nothing here) indicates the interval metadata type.
col = GenomicIntervalCollection{GenomicInterval{Nothing}}()

for i in 1:100:10000
    push!(col, GenomicInterval("chr1", i, i + 99))
end

Incrementally building an interval collection like this works, but GenomicIntervalCollection also has a bulk insertion constructor that is able to build the indexed data structure extremely efficiently from a sorted vector of intervals.

col = GenomicIntervalCollection([GenomicInterval("chr1", i, i + 99) for i in 1:100:10000])

Building GenomicIntervalCollections in one shot like this should be preferred when it's convenient or speed is an issue.

Filtering

Below are some examples of filtering intervals. The examples take advantage of bulk insertion.

intervals = [
  GenomicInterval("chr1", 1, 8),
  GenomicInterval("chr1", 4, 20),
  GenomicInterval("chr1", 14, 27)]

col = GenomicIntervalCollection(intervals)

predicate(i) = isodd(leftposition(i))

selected = GenomicIntervalCollection(Base.Iterators.filter(predicate, col))
selected = GenomicIntervalCollection([x for x in col if predicate(x)])

# output

GenomicIntervalCollection{GenomicInterval{Nothing}} with 1 intervals:
  chr1:1-8  .  nothing

Overlap Query

There are number of eachoverlap functions in the GenomicFeatures module. They follow two patterns:

interval versus collection queries which return an iterator over intervals in the collection that overlap the query, and
collection versus collection queries which iterate over all pairs of overlapping intervals.

GenomicFeatures.eachoverlap — Function

eachoverlap(intervals_a, intervals_b, [groupname_isless=Base.isless])

Create an iterator of overlapping intervals between intervals_a and intervals_b.

This function assumes elements of intervals_a and intervals_b are sorted by its group name and left position. If the element type is not a subtype of GenomicFeatures.AbstractGenomicInterval, elements are converted to GenomicInterval objects.

The third optional argument is a function that defines the order of the group names. The default function is Base.isless, which is the lexicographical order.

source

The order of interval pairs is the same as the following nested loop but eachoverlap is often much faster:

for a in intervals_a, b in intervals_b
    if isoverlapping(a, b)
        # do something...
    end
end

Coverage

A special sort of intersection can also be performed on an interval stream against itself to produce "coverage intervals".

GenomicFeatures.coverage — Function

coverage(intervals)

Compute the coverage of a collection of intervals and return an GenomicIntervalCollection that contains run-length encoded coverage data.

For example, given intervals like:

[------]     [------------]
   [---------------]

This function would return a new set of disjoint intervals with annotated coverage like:

[1][-2-][-1-][--2--][--1--]

Example

julia> intervals = [
           GenomicInterval("chr1", 1, 8),
           GenomicInterval("chr1", 4, 20),
           GenomicInterval("chr1", 14, 27)];

julia> coverage(intervals)
GenomicIntervalCollection{GenomicInterval{UInt32}} with 5 intervals:
  chr1:1-3  .  1
  chr1:4-8  .  2
  chr1:9-13  .  1
  chr1:14-20  .  2
  chr1:21-27  .  1

source