Genomic Interval Manipulation
The GenomicFeatures
module consists of tools for working efficiently with genomic intervals.
GenomicInterval Type
Intervals in GenomicFeatures
are consistent with ranges in Julia: 1-based and end-inclusive. When data is read from formats with different representations (i.e. 0-based and/or end-exclusive) they are always converted automatically. Similarly when writing data, you should not have to reason about off-by-one errors due to format differences while using functionality provided in GenomicFeatures
.
The GenomicInterval
type is defined as
struct GenomicInterval{T} <: IntervalTrees.AbstractInterval{Int64}
groupname::String
first::Int64
last::Int64
strand::Strand
metadata::T
end
The first three fields (groupname
, first
, and last
) are mandatory arguments when constructing the GenomicInterval
object. The groupname
field holds the group name associated with the interval. In some cases the groupname
may denote the chromosome. The first
and last
fields are the leftmost and rightmost positions of the interval, which can be accessed with leftposition
and rightposition
functions, respectively.
The strand
field can take four kinds of values listed in the next table:
Symbol | Constant | Meaning |
---|---|---|
'?' | STRAND_NA | strandedness is relevant, but unknown |
'+' | STRAND_POS | positive strand |
'-' | STRAND_NEG | negative strand |
'.' | STRAND_BOTH | non-strand-specific feature |
GenomicInterval
is parameterized on metadata type, which lets it efficiently and precisely be specialized to represent intervals from a variety of formats.
The default strand and metadata value are STRAND_BOTH
and nothing
:
julia> GenomicInterval("chr1", 10000, 20000)
GenomicInterval{Nothing}:
group name: chr1
leftmost position: 10000
rightmost position: 20000
strand: .
metadata: nothing
julia> GenomicInterval("chr1", 10000, 20000, '+')
GenomicInterval{Nothing}:
group name: chr1
leftmost position: 10000
rightmost position: 20000
strand: +
metadata: nothing
The following example shows all accessor functions for the five fields:
julia> i = GenomicInterval("chr1", 10000, 20000, '+', "some annotation")
GenomicInterval{String}:
group name: chr1
leftmost position: 10000
rightmost position: 20000
strand: +
metadata: some annotation
julia> groupname(i)
"chr1"
julia> leftposition(i)
10000
julia> rightposition(i)
20000
julia> strand(i)
STRAND_POS
julia> metadata(i)
"some annotation"
Collections of GenomicIntervals
Collections of intervals are represented using the GenomicIntervalCollection
type, which is a general purpose indexed container for intervals. It supports fast intersection operations as well as insertion, deletion, and sorted iteration.
Empty interval collections can be initialized, and intervals elements can be added to the collection one-by-one using push!
.
# The type parameter (Nothing here) indicates the interval metadata type.
col = GenomicIntervalCollection{GenomicInterval{Nothing}}()
for i in 1:100:10000
push!(col, GenomicInterval("chr1", i, i + 99))
end
Incrementally building an interval collection like this works, but GenomicIntervalCollection
also has a bulk insertion constructor that is able to build the indexed data structure extremely efficiently from a sorted vector of intervals.
col = GenomicIntervalCollection([GenomicInterval("chr1", i, i + 99) for i in 1:100:10000])
Building GenomicIntervalCollection
s in one shot like this should be preferred when it's convenient or speed is an issue.
Filtering
Below are some examples of filtering intervals. The examples take advantage of bulk insertion.
intervals = [
GenomicInterval("chr1", 1, 8),
GenomicInterval("chr1", 4, 20),
GenomicInterval("chr1", 14, 27)]
col = GenomicIntervalCollection(intervals)
predicate(i) = isodd(leftposition(i))
selected = GenomicIntervalCollection(Base.Iterators.filter(predicate, col))
selected = GenomicIntervalCollection([x for x in col if predicate(x)])
# output
GenomicIntervalCollection{GenomicInterval{Nothing}} with 1 intervals:
chr1:1-8 . nothing
Overlap Query
There are number of eachoverlap
functions in the GenomicFeatures
module. They follow two patterns:
- interval versus collection queries which return an iterator over intervals in the collection that overlap the query, and
- collection versus collection queries which iterate over all pairs of overlapping intervals.
GenomicFeatures.eachoverlap
— Functioneachoverlap(intervals_a, intervals_b, [groupname_isless=Base.isless])
Create an iterator of overlapping intervals between intervals_a
and intervals_b
.
This function assumes elements of intervals_a
and intervals_b
are sorted by its group name and left position. If the element type is not a subtype of GenomicFeatures.AbstractGenomicInterval
, elements are converted to GenomicInterval
objects.
The third optional argument is a function that defines the order of the group names. The default function is Base.isless
, which is the lexicographical order.
The order of interval pairs is the same as the following nested loop but eachoverlap
is often much faster:
for a in intervals_a, b in intervals_b
if isoverlapping(a, b)
# do something...
end
end
Coverage
A special sort of intersection can also be performed on an interval stream against itself to produce "coverage intervals".
GenomicFeatures.coverage
— Functioncoverage(intervals)
Compute the coverage of a collection of intervals and return an GenomicIntervalCollection
that contains run-length encoded coverage data.
For example, given intervals like:
[------] [------------]
[---------------]
This function would return a new set of disjoint intervals with annotated coverage like:
[1][-2-][-1-][--2--][--1--]
Example
julia> intervals = [
GenomicInterval("chr1", 1, 8),
GenomicInterval("chr1", 4, 20),
GenomicInterval("chr1", 14, 27)];
julia> coverage(intervals)
GenomicIntervalCollection{GenomicInterval{UInt32}} with 5 intervals:
chr1:1-3 . 1
chr1:4-8 . 2
chr1:9-13 . 1
chr1:14-20 . 2
chr1:21-27 . 1