GFF3
Description
GFF3 is a text-based file format for representing genomic annotations. The major difference from BED is that GFF3 is more structured and can include sequences in the FASTA file format.
I/O tools for GFF3 are provided by the GenomicFeatures.GFF3 module, which exports the following three types:
- Reader type: GFF3.Reader
- Writer type: GFF3.Writer
- Element type: GFF3.Record
A GFF3 file may contain directives and/or comments in addition to genomic features. These lines are skipped by default, but you can control this behavior by passing keyword arguments to GFF3.Reader. See the docstring for details.
Examples
Here is a common workflow to iterate over all records in a GFF3 file:
```julia
# Import the GFF3 module.
using GenomicFeatures

# Open a GFF3 file.
reader = open(GFF3.Reader, "data.gff3")

# Iterate over records.
for record in reader
    # Do something with the record (see Accessors section).
    seqid = GFF3.seqid(record)
    # ...
end

# Finally, close the reader.
close(reader)
```
If you are interested in directives (which start with '##') in addition to genomic features, you need to pass skip_directives=false when constructing a GFF3.Reader:
```julia
# Set skip_directives to false (it is true by default).
reader = GFF3.Reader(open("data.gff3"), skip_directives=false)
for record in reader
    # Branch by record type.
    if GFF3.isfeature(record)
        # ...
    elseif GFF3.isdirective(record)
        # ...
    else
        # This never happens because comments are still skipped.
        @assert false
    end
end
close(reader)
```
GenomicFeatures.jl supports tabix for retrieving records that overlap a specific interval. First, create a block-compressed (BGZF) file from a GFF3 file using bgzip, then index it using tabix.
```bash
cat data.gff3 | grep -v "^#" | sort -k1,1 -k4,4n | bgzip >data.gff3.bgz
tabix data.gff3.bgz  # this creates data.gff3.bgz.tbi
```
Then you can read the block-compressed file as follows:
```julia
# Read the block-compressed (BGZF) file.
reader = GFF3.Reader("data.gff3.bgz")
for record in eachoverlap(reader, Interval("chr1", 250_000, 300_000))
    # Each record overlaps the query interval.
    # ...
end
```
API
GenomicFeatures.GFF3.Reader
— Type.
GFF3.Reader(input::IO; index=nothing, save_directives::Bool=false, skip_features::Bool=false, skip_directives::Bool=true, skip_comments::Bool=true)
GFF3.Reader(input::AbstractString; index=:auto, save_directives::Bool=false, skip_features::Bool=false, skip_directives::Bool=true, skip_comments::Bool=true)
Create a reader for data in GFF3 format.
The first argument specifies the data source. When it is a filepath that ends with .bgz, it is considered to be block compression file format (BGZF) and the function will try to find a tabix index file (
Arguments
- input: data source (IO object or filepath)
- index: path to a tabix file
- save_directives: flag to save directive records (which can be accessed with GFF3.directives)
- skip_features: flag to skip feature records
- skip_directives: flag to skip directive records
- skip_comments: flag to skip comment records
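The keyword arguments can be tried out without a file on disk, since the reader also accepts an IO object. The following is a minimal sketch, assuming a GenomicFeatures.jl release that still bundles the GFF3 module. The in-memory data and the helper name record_kinds are made up for illustration and are not part of the package.

```julia
using GenomicFeatures

# Hypothetical in-memory GFF3 data: one directive, one comment, one feature.
data = """
##gff-version 3
#a comment line
chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene1
"""

# Illustrative helper (not part of the package): classify every record
# produced by a reader.
function record_kinds(reader)
    kinds = Symbol[]
    for record in reader
        if GFF3.isfeature(record)
            push!(kinds, :feature)
        elseif GFF3.isdirective(record)
            push!(kinds, :directive)
        elseif GFF3.iscomment(record)
            push!(kinds, :comment)
        end
    end
    return kinds
end

# Keep directives and comments instead of skipping them.
reader = GFF3.Reader(IOBuffer(data), skip_directives=false, skip_comments=false)
kinds = record_kinds(reader)
close(reader)
println(kinds)
```

Leaving both flags at their defaults should make the same loop yield feature records only.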
GenomicFeatures.GFF3.directives
— Function.
Return all directives that preceded the last parsed GFF3 entry, as an array of strings.
Directives at the end of the file can be accessed by calling close(reader) and then directives(reader).
GenomicFeatures.GFF3.hasfasta
— Function.
Return true if the GFF3 stream is at its end and there is trailing FASTA data.
GenomicFeatures.GFF3.getfasta
— Function.
Return a BioSequences.FASTA.Reader initialized to parse trailing FASTA data.
Throws an exception if there is no trailing FASTA, which can be checked using hasfasta.
GenomicFeatures.GFF3.Writer
— Type.
GFF3.Writer(output::IO)
Create a data writer of the GFF3 file format.
Arguments:
- output: data sink
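As a rough sketch of how the writer might be used, the example below writes one record to a temporary file and reads the file back. It assumes that write(writer, record) and close(writer) are defined for GFF3.Writer, as they are for other BioJulia writers. Neither is documented in this section, so treat both as assumptions. The record line is made up.

```julia
using GenomicFeatures

# A made-up feature line with the nine tab-separated GFF3 columns.
line = "chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene1"
record = GFF3.Record(line)

# Write the record to a temporary file.
path, io = mktemp()
writer = GFF3.Writer(io)
write(writer, record)   # assumed method, see the note above
close(writer)           # assumed to flush and close the underlying stream

# Read the file back to inspect what was written.
written = read(path, String)
print(written)
rm(path)
```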
GenomicFeatures.GFF3.Record
— Type.
GFF3.Record()
Create an unfilled GFF3 record.
GFF3.Record(data::Vector{UInt8})
Create a GFF3 record object from data. This function verifies and indexes fields for accessors. Note that the ownership of data is transferred to a new record object.
GFF3.Record(str::AbstractString)
Create a GFF3 record object from str. This function verifies and indexes fields for accessors.
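For instance, records of each kind can be built directly from strings, which is handy for quick experiments. This is a small sketch with made-up lines, assuming the GFF3 module is reachable through GenomicFeatures as in the examples earlier on this page.

```julia
using GenomicFeatures

# Made-up input lines: a feature, a directive and a comment.
feature   = GFF3.Record("chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene1")
directive = GFF3.Record("##gff-version 3")
comment   = GFF3.Record("#just a comment")

# Each record reports its own kind.
println(GFF3.isfeature(feature))
println(GFF3.isdirective(directive))
println(GFF3.iscomment(comment))

# content returns the text of a directive or comment without the leading '#'.
println(GFF3.content(directive))
println(GFF3.content(comment))
```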
GenomicFeatures.GFF3.isfeature
— Function.
isfeature(record::Record)::Bool
Test if record is a feature record.
GenomicFeatures.GFF3.isdirective
— Function.
isdirective(record::Record)::Bool
Test if record is a directive record.
GenomicFeatures.GFF3.iscomment
— Function.
iscomment(record::Record)::Bool
Test if record is a comment record.
GenomicFeatures.GFF3.seqid
— Function.
seqid(record::Record)::String
Get the sequence ID of record.
GenomicFeatures.GFF3.source
— Function.
source(record::Record)::String
Get the source of record.
GenomicFeatures.GFF3.featuretype
— Function.
featuretype(record::Record)::String
Get the type of record.
GenomicFeatures.GFF3.seqstart
— Function.
seqstart(record::Record)::Int
Get the start coordinate of record.
GenomicFeatures.GFF3.seqend
— Function.
seqend(record::Record)::Int
Get the end coordinate of record.
GenomicFeatures.GFF3.score
— Function.
score(record::Record)::Float64
Get the score of record.
GenomicFeatures.GFF3.strand
— Function.
strand(record::Record)::GenomicFeatures.Strand
Get the strand of record.
GenomicFeatures.GFF3.phase
— Function.
phase(record::Record)::Int
Get the phase of record.
GenomicFeatures.GFF3.attributes
— Function.
attributes(record::Record)::Vector{Pair{String,Vector{String}}}
Get the attributes of record.
attributes(record::Record, key::String)::Vector{String}
Get the attributes of record with key.
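The two methods can be compared on a record whose attribute column holds a multi-valued key. This is a hedged sketch: the exon line is made up, and it assumes that comma-separated values are split into separate strings, as the Vector{String} return type suggests.

```julia
using GenomicFeatures

# Made-up exon line; the Parent attribute carries two comma-separated values.
record = GFF3.Record("chr1\texample\texon\t1300\t1500\t.\t+\t.\tID=exon1;Parent=mRNA1,mRNA2")

# All attributes as key => values pairs.
for (key, values) in GFF3.attributes(record)
    println(key, " => ", join(values, ", "))
end

# Values of a single attribute.
parents = GFF3.attributes(record, "Parent")
println(parents)
```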
GenomicFeatures.GFF3.content
— Function.
content(record::Record)::String
Get the content of record. Leading '#' characters are removed.
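Putting the column accessors above together, the sketch below reads the fields of a single feature record. The line is made up, and the example assumes a GenomicFeatures.jl release that still bundles the GFF3 module.

```julia
using GenomicFeatures

# Made-up feature line with the nine tab-separated GFF3 columns.
record = GFF3.Record("chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene1")

println(GFF3.seqid(record))        # column 1
println(GFF3.source(record))       # column 2
println(GFF3.featuretype(record))  # column 3
println(GFF3.seqstart(record))     # column 4
println(GFF3.seqend(record))       # column 5
println(GFF3.strand(record))       # column 7

# Columns 6 (score) and 8 (phase) are "." in this line, which GFF3 uses for
# undefined values, so this sketch does not call GFF3.score or GFF3.phase.
```

How the accessors behave on an undefined field is not covered in this section, so check for it before relying on score or phase.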