Genomic Interval Manipulation
The GenomicFeatures
module consists of tools for working efficiently with genomic intervals.
Interval type
Intervals in GenomicFeatures.jl are consistent with ranges in Julia: 1-based and end-inclusive. When data is read from formats with different representations (i.e. 0-based and/or end-exclusive) they are always converted automatically. Similarly when writing data. You should not have to reason about off-by-one errors due to format differences while using functionality provided in GenomicFeatures.jl.
The Interval
type is defined as
immutable Interval{T} <: AbstractInterval{Int64} seqname::String first::Int64 last::Int64 strand::Strand metadata::T end
The first three fields (seqname
, first
, and last
) are mandatory arguments when constructing an Interval
object. seqname
is the sequence name associated with the interval. The first
and last
fields are the leftmost and rightmost positions of the interval, which can be accessed with leftposition
and rightposition
functions, respectively.
The strand
field can take four kinds of values listed in the next table:
Symbol | Constant | Meaning |
---|---|---|
'?' |
STRAND_NA |
strand is unknown or inapplicable |
'+' |
STRAND_POS |
positive strand |
'-' |
STRAND_NEG |
negative strand |
'.' |
STRAND_BOTH |
non-strand-specific feature |
Interval
is parameterized on metadata type, which lets it efficiently and precisely be specialized to represent intervals from a variety of formats.
The default strand and metadata value are STRAND_BOTH
and nothing
:
julia> Interval("chr1", 10000, 20000) GenomicFeatures.Interval{Void}: sequence name: chr1 leftmost position: 10000 rightmost position: 20000 strand: . metadata: nothing julia> Interval("chr1", 10000, 20000, '+') GenomicFeatures.Interval{Void}: sequence name: chr1 leftmost position: 10000 rightmost position: 20000 strand: + metadata: nothing
The following example shows all accessor functions for the five fields:
julia> i = Interval("chr1", 10000, 20000, '+', "some annotation") GenomicFeatures.Interval{String}: sequence name: chr1 leftmost position: 10000 rightmost position: 20000 strand: + metadata: some annotation julia> seqname(i) "chr1" julia> leftposition(i) 10000 julia> rightposition(i) 20000 julia> strand(i) STRAND_POS julia> metadata(i) "some annotation"
Collections of intervals
Collections of intervals are represented using the IntervalCollection
type, which is a general purpose indexed container for intervals. It supports fast intersection operations as well as insertion, deletion, and sorted iteration.
Interval collections can be initialized by inserting elements one by one using push!
.
# The type parameter (Void here) indicates the interval metadata type. col = IntervalCollection{Void}() for i in 1:100:10000 push!(col, Interval("chr1", i, i + 99)) end
Incrementally building an interval collection like this works, but IntervalCollection
also has a bulk insertion constructor that is able to build the indexed data structure extremely efficiently from an array of intervals.
col = IntervalCollection([Interval("chr1", i, i + 99) for i in 1:100:10000])
Bulding IntervalCollections
in one shot like this should be preferred when it's convenient or speed in an issue.
IntervalCollection
s can also be build directly from a genome annotation file, here in GFF3 format:
reader = open(GFF3.Reader, "some_genome.gff3") features = IntervalCollection(reader)
Overlap query
There are number of eachoverlap
function in the GenomicFeatures
module. They follow two patterns: interval versus collection queries which return an iterator over intervals in the collection that overlap the query, and collection versus collection queries which iterate over all pairs of overlapping intervals.
#
GenomicFeatures.eachoverlap
— Function.
eachoverlap(intervals_a, intervals_b, [seqname_isless=Base.isless])
Create an iterator of overlapping intervals between intervals_a
and intervals_b
.
This function assumes elements of intervals_a
and intervals_b
are sorted by its sequence name and left position. If the element type is not a subtype of GenomicFeatures.Interval
, elements are converted to Interval
objects.
The third optional argument is a function that defines the order of sequence names. The default function is Base.isless
, which is the lexicographical order.
The order of interval pairs is the same as the following nested loop but eachoverlap
is often much faster:
for a in intervals_a, b in intervals_b if isoverlapping(a, b) # do something... end end
Interval streams
Intervals need not necessarily stored in an indexed data structure for efficient intersection to be practical. Two collections of intervals need only be both sorted to compute all overlapping pairs. This is particularly useful in genomics where datasets are sometimes so large that loading them entirely into memory is not practical.
The GenomicFeatures
module is able to intersect any two iterators that yield intervals in sorted order, which we refer to as "interval streams". An IntervalCollection
is also an interval stream, but so is a sorted array of intervals, and parsers over interval file formats. This allows for a very general notion of intersection.
features_x = open(BED.Reader, "features_x.bed") features_y = open(BED.Reader, "features_y.bed") for (x, y) in eachoverlap(features_x, features_y) println("intersection found between ", x, " and ", y) end close(features_x) close(features_y)
An exception will be thrown if an interval in encountered out of order while processing an interval stream. Ordering of intervals has one complication: there is not necessarily a standardized way to order sequence names. By default in GenomicFeatures.jl intervals are sorted using a Base.isless
comparison function that is a default order in most command-line tools.
A special sort of intersection can also be performed on an interval stream against itself to produce "coverage intervals".
#
GenomicFeatures.coverage
— Function.
coverage(intervals)
Compute the coverage of a collection of intervals and return an IntervalCollection
that contains run-length encoded coverage data.
For example, given intervals like:
[------] [------------] [---------------]
This function would return a new set of disjoint intervals with annotated coverage like:
[1][-2-][-1-][--2--][--1--]