IO - FASTA formatted files
FASTA is a text-based file format for representing biological sequences. A FASTA file stores a list of sequence records, each with a name, an optional description, and a sequence.
The template of a sequence record is:
>{name} {description}?
{sequence}
Here is an example of a chromosomal sequence:
>chrI chromosome 1
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC
CACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTG
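To make the template concrete, here is a minimal sketch in plain Julia that splits one record into its three fields. The helper `parse_fasta_record` is hypothetical and written only for illustration; it is not part of BioSequences and is not how the package parses files. It assumes the name ends at the first space of the header line and that the sequence may be wrapped over several lines.

```julia
# Hypothetical helper (not part of BioSequences): parse one FASTA record from text.
function parse_fasta_record(text::AbstractString)
    lines = split(strip(text), '\n')
    header = lines[1]
    startswith(header, ">") || error("a FASTA record must start with '>'")
    # Split the header at the first space: the name, then the optional description.
    parts = split(header[2:end], ' '; limit=2)
    name = String(parts[1])
    description = length(parts) == 2 ? String(parts[2]) : nothing
    # The sequence may be wrapped over several lines; join them into one string.
    sequence = join(strip.(lines[2:end]))
    return (name=name, description=description, sequence=sequence)
end

record = parse_fasta_record("""
>chrI chromosome 1
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC
CACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTG
""")

println(record.name)         # the {name} part of the header
println(record.description)  # the {description} part, or nothing if absent
```

For real files, use the reader described in the next section rather than a hand-written parser like this one.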
Readers and Writers
The reader and writer for FASTA formatted files are found within the BioSequences.FASTA module.
BioSequences.FASTA.Reader - Type
FASTA.Reader(input::IO; index=nothing)
Create a data reader of the FASTA file format.
Arguments
input: data source
index=nothing: filepath to a random access index (currently fai is supported)
BioSequences.FASTA.Writer - Type
FASTA.Writer(output::IO; width=70)
Create a data writer of the FASTA file format.
Arguments
output: data sink
width=70: wrapping width of sequence characters
They can be created with IOStreams:
r = FASTA.Reader(open("MyInput.fasta", "r"))
w = FASTA.Writer(open("MyFile.fasta", "w"))
Usually sequence records will be read sequentially from a file by iteration.
using BioSequences
reader = FASTA.Reader(open("hg38.fa", "r"))
for record in reader
    # Do something
end
close(reader)
If the FASTA file has an auxiliary index file in fai format, the reader also supports random access to FASTA records, which is useful when accessing specific parts of a huge genome sequence:
reader = open(FASTA.Reader, "sacCer.fa", index="sacCer.fa.fai")
chrIV = reader["chrIV"] # directly read the record named chrIV
Reading in a sequence from a FASTA formatted file will give you a variable of type FASTA.Record.
BioSequences.FASTA.Record - Type
FASTA.Record()
Create an unfilled FASTA record.
FASTA.Record(data::Vector{UInt8})
Create a FASTA record object from data.
This function verifies and indexes fields for accessors. Note that the ownership of data is transferred to a new record object.
FASTA.Record(str::AbstractString)
Create a FASTA record object from str.
This function verifies and indexes fields for accessors.
FASTA.Record(identifier, sequence)
Create a FASTA record object from identifier and sequence.
FASTA.Record(identifier, description, sequence)
Create a FASTA record object from identifier, description and sequence.
Various getters and setters are available for FASTA.Records:
BioSequences.FASTA.hasidentifier - Function
hasidentifier(record::Record)
Checks whether or not the record has an identifier.
BioSequences.FASTA.identifier - Function
identifier(record::Record)::String
Get the sequence identifier of record.
BioSequences.FASTA.hasdescription - Function
hasdescription(record::Record)
Checks whether or not the record has a description.
BioSequences.FASTA.description - Function
description(record::Record)::String
Get the description of record.
BioSequences.FASTA.hassequence - Function
hassequence(record::Record)
Checks whether or not a sequence record contains a sequence.
BioSequences.FASTA.sequence - Method
sequence(record::Record, [part::UnitRange{Int}])
Get the sequence of record.
This function infers the sequence type from the data. When it is wrong or unreliable, use sequence(::Type{S}, record::Record). If the part argument is given, it returns the specified part of the sequence.
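The accessors above can be combined as in the following sketch. It uses only the constructor and functions documented on this page and assumes a BioSequences release that provides the FASTA submodule as described here; the record contents are made up for illustration.

```julia
using BioSequences

# Build a record from its three parts (illustrative values, not real data).
rec = FASTA.Record("chrI", "chromosome 1", dna"CCACACCACACC")

# Guard each accessor with its has* check before reading the field.
if FASTA.hasidentifier(rec)
    println("identifier:  ", FASTA.identifier(rec))
end
if FASTA.hasdescription(rec)
    println("description: ", FASTA.description(rec))
end
if FASTA.hassequence(rec)
    seq = FASTA.sequence(rec)          # sequence type is inferred from the data
    prefix = FASTA.sequence(rec, 1:4)  # only the first four positions
    println("length: ", length(seq), ", prefix: ", prefix)
end
```

Checking with the has* functions first is most relevant for records read from files, where a field such as the description may be absent.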
To write a BioSequence to a FASTA file, you first have to create a FASTA.Record:
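using BioSequences
x = dna"aaaaatttttcccccggggg"
rec = FASTA.Record("MySeq", x)  # wrap the sequence in a record named "MySeq"
w = FASTA.Writer(open("MyFile.fasta", "w"))
write(w, rec)
close(w)  # close the writer so the record is flushed to the file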
As always with Julia IO types, remember to close your file readers and writers after you are finished.