IO - FASTA formatted files

FASTA is a text-based file format for representing biological sequences. A FASTA file stores a list of sequence records, each with a name, an optional description, and a sequence.

The template of a sequence record is:

>{name} {description}?
{sequence}

Here is an example of a chromosomal sequence:

>chrI chromosome 1
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC
CACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTG
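
To make the template concrete, here is a minimal sketch in plain Julia that splits the text of one record into its three parts. The helper `parse_fasta_record` is hypothetical and is not part of BioSequences; it assumes the name ends at the first space of the header line and that everything after that space is the description.

```julia
# Hypothetical helper (not part of BioSequences): split one record's text
# into its name, optional description, and sequence.
function parse_fasta_record(text::AbstractString)
    lines = split(strip(text), '\n')
    header = lines[1]
    startswith(header, ">") || error("a FASTA record must start with '>'")
    # The name runs up to the first space; anything after it is the description.
    parts = split(header[2:end], ' '; limit=2)
    name = String(parts[1])
    description = length(parts) == 2 ? String(parts[2]) : nothing
    # Sequence lines are joined; line breaks are not part of the sequence.
    sequence = join(strip.(lines[2:end]))
    return (name=name, description=description, sequence=sequence)
end

record_text = """
>chrI chromosome 1
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC
CACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTG
"""

rec = parse_fasta_record(record_text)
println(rec.name)
println(rec.description)
println(length(rec.sequence))
```

The FASTA.Reader described below does this work for you; the sketch only illustrates how a record is laid out.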

Readers and Writers

The reader and writer for FASTA formatted files are found within the BioSequences.FASTA module.

FASTA.Reader(input::IO; index=nothing)

Create a data reader of the FASTA file format.

Arguments

  • input: data source
  • index=nothing: path to a random access index file (currently only the fai format is supported)

FASTA.Writer(output::IO; width=70)

Create a data writer of the FASTA file format.

Arguments

  • output: data sink
  • width=70: wrapping width of sequence characters
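
As an illustration of what the width argument controls, the following plain Julia sketch breaks a sequence string into lines of at most width characters. The helper `wrap_sequence` is hypothetical; it only mimics the visible effect of wrapping, and the writer's own implementation may differ.

```julia
# Hypothetical helper (not part of BioSequences): break a sequence string
# into lines of at most `width` characters.
function wrap_sequence(seq::AbstractString, width::Integer=70)
    width > 0 || throw(ArgumentError("width must be positive"))
    # Sequence text is ASCII here, so indexing by position is safe.
    return [seq[i:min(i + width - 1, lastindex(seq))] for i in 1:width:lastindex(seq)]
end

seq = "A"^150
for line in wrap_sequence(seq, 70)
    println(line)
end
```

With a 150-character sequence and a width of 70, the sketch produces two full lines and one shorter final line.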

They can be created with IOStreams:

r = FASTA.Reader(open("MyInput.fasta", "r"))
w = FASTA.Writer(open("MyFile.fasta", "w"))

Usually sequence records will be read sequentially from a file by iteration.

using BioSequences
reader = FASTA.Reader(open("hg38.fa", "r"))
for record in reader
    # Do something
end
close(reader)

If the FASTA file has an auxiliary index file in the fai format, the reader also supports random access to FASTA records. This is useful when you need specific parts of a huge genome sequence:

reader = open(FASTA.Reader, "sacCer.fa", index="sacCer.fa.fai")
chrIV = reader["chrIV"]  # directly read the record named chrIV
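
To show why an index makes random access cheap, here is a plain Julia sketch of the offset arithmetic a fai-style index allows. It assumes the usual fai fields (byte offset of the first base, bases per line, and bytes per line including the newline) and computes them by hand from the example record instead of reading a real index file. The helper `base_offset` is hypothetical and is not part of BioSequences.

```julia
# The example record from above, as it would sit in a file.
fasta_text = ">chrI chromosome 1\n" *
             "CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC\n" *
             "CACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTG\n"

# Values an index entry would carry for this record (computed by hand here;
# a real .fai file stores them as tab-separated columns).
header, line1, line2 = split(chomp(fasta_text), '\n')
offset = ncodeunits(header) + 1   # bytes before the first base
linebases = length(line1)         # bases per full line
linewidth = linebases + 1         # bytes per full line, including '\n'

# Hypothetical helper: zero-based byte offset of the base at 1-based position `pos`.
function base_offset(pos, offset, linebases, linewidth)
    full_lines, within = divrem(pos - 1, linebases)
    return offset + full_lines * linewidth + within
end

bytes = codeunits(fasta_text)
first_of_line2 = base_offset(linebases + 1, offset, linebases, linewidth)
println(Char(bytes[first_of_line2 + 1]))  # +1 because Julia arrays are 1-based
```

Because the position of any base follows from three numbers, a reader can seek straight to the requested region without scanning the file from the start.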

Reading a sequence from a FASTA formatted file yields a value of type FASTA.Record.

FASTA.Record()

Create an unfilled FASTA record.

FASTA.Record(data::Vector{UInt8})

Create a FASTA record object from data.

This function verifies and indexes fields for accessors. Note that the ownership of data is transferred to a new record object.

FASTA.Record(str::AbstractString)

Create a FASTA record object from str.

This function verifies and indexes fields for accessors.

FASTA.Record(identifier, sequence)

Create a FASTA record object from identifier and sequence.

FASTA.Record(identifier, description, sequence)

Create a FASTA record object from identifier, description and sequence.


Various accessor functions are available for FASTA.Record objects:

hasidentifier(record::Record)

Checks whether or not the record has an identifier.

identifier(record::Record)::String

Get the sequence identifier of record.

hasdescription(record::Record)

Checks whether or not the record has a description.

description(record::Record)::String

Get the description of record.

hassequence(record::Record)

Checks whether or not a sequence record contains a sequence.

sequence(record::Record, [part::UnitRange{Int}])

Get the sequence of record.

This function infers the sequence type from the data. When the inferred type is wrong or unreliable, use sequence(::Type{S}, record::Record) instead. If the part argument is given, only the specified part of the sequence is returned.

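
The accessors above can be combined as in the following sketch. It assumes a BioSequences release that still ships the FASTA submodule, that the accessors are called through the FASTA module name (for example FASTA.identifier), and that DNASequence is the DNA sequence type of that release. The record is built in memory so no input file is needed.

```julia
using BioSequences

# Build a record in memory so the example needs no input file.
rec = FASTA.Record("chrI", "chromosome 1", dna"CCACACCACACCCACACACC")

# Guard each accessor with its has* check before reading the field.
if FASTA.hasidentifier(rec)
    println("identifier:  ", FASTA.identifier(rec))
end
if FASTA.hasdescription(rec)
    println("description: ", FASTA.description(rec))
end
if FASTA.hassequence(rec)
    # Let the library infer the sequence type from the data...
    println("inferred:    ", FASTA.sequence(rec))
    # ...or state the type explicitly and read only the first four residues.
    println("explicit:    ", FASTA.sequence(DNASequence, rec, 1:4))
end
```

Stating the type explicitly is the safer choice for short or ambiguous sequences, where inference from the data has little to go on.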

To write a BioSequence to a FASTA file, you first have to create a FASTA.Record:

using BioSequences
x = dna"aaaaatttttcccccggggg"
# Wrap the sequence in a record with the identifier "MySeq".
rec = FASTA.Record("MySeq", x)
w = FASTA.Writer(open("MyFile.fasta", "w"))
write(w, rec)
# Close the writer so that buffered data is flushed to disk.
close(w)

As always with Julia IO types, remember to close your readers and writers when you are finished with them.