FASTA formatted files
NB: First read the overview in the sidebar
FASTA is a text-based file format for representing biological sequences. A FASTA file stores a list of sequence records with name, description, and sequence.
The template of a sequence record is:
>{description}
{sequence}
Where the "identifier" is the first part of the description up to the first whitespace (or the entire description if there is no whitespace).
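The identifier rule can be sketched in plain Julia, without FASTX. The helper name split_header is hypothetical and exists only to illustrate the rule; with FASTX loaded, the identifier and description functions are what you would normally call.

```julia
# Hypothetical helper illustrating the identifier rule; not part of FASTX.
function split_header(line::AbstractString)
    startswith(line, '>') || throw(ArgumentError("a header line must start with '>'"))
    description = SubString(line, 2)   # everything after the leading '>'
    # The identifier runs up to the first whitespace, or is the whole
    # description if there is no whitespace.
    identifier = first(split(description, isspace; limit=2))
    return (identifier=identifier, description=description)
end

header = split_header(">seq1 some description")
println(header.identifier)
println(header.description)
```

Note that the description returned here still contains the identifier, matching the definition above.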
Here is an example of a chromosomal sequence:
>chrI chromosome 1
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC
CACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTA
Here:
- The identifier is "chrI"
- The description is "chrI chromosome 1", containing the identifier
- The sequence is the DNA sequence "CCACA..."
The FASTARecord
FASTA records are, by design, very lax in what they can contain. They can hold almost arbitrary byte sequences, including invalid Unicode. Trailing whitespace on a sequence line is interpreted as part of the sequence. If you want more certainty about the content, you can either check the sequences with a regex or, preferably, convert them to the desired BioSequence type.
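A minimal sketch of the regex check, assuming FASTX is installed. The helper is_plain_dna and the choice of allowed letters are assumptions made for illustration; adjust the character class to whatever alphabet your data should contain.

```julia
using FASTX

# Hypothetical helper: true if the record's sequence consists only of the
# letters A, C, G and T (either case). Not part of FASTX.
is_plain_dna(rec::FASTARecord) = occursin(r"\A[ACGTacgt]*\z", sequence(String, rec))

good = parse(FASTARecord, ">some header\nTAGA\nCC")
bad = parse(FASTARecord, ">some header\nTAqA\nCC")  # parses fine, but contains 'q'

println(is_plain_dna(good))
println(is_plain_dna(bad))
```

Because the record itself accepts both inputs, a check like this has to be applied explicitly after parsing.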
FASTX.FASTA.Record — Type
FASTA.Record
Mutable struct representing a FASTA record as parsed from a FASTA file. The content of the record can be queried with the following functions: identifier, description, sequence.
FASTA records are un-typed, i.e. they are agnostic to what kind of data they contain.
See also: FASTA.Reader, FASTA.Writer
Examples
julia> rec = parse(FASTARecord, ">some header\nTAqA\nCC");
julia> identifier(rec)
"some"
julia> description(rec)
"some header"
julia> sequence(rec)
"TAqACC"
julia> typeof(description(rec)) == typeof(sequence(rec)) <: AbstractString
true
FASTAReader and FASTAWriter
FASTAWriter can optionally be passed the keyword width to control the line width. If this is zero or negative, it will write all record sequences on a single line. Else, it will wrap lines to the given maximal width.
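The effect of width can be sketched as follows, assuming FASTX is installed. The helper fasta_text is hypothetical; it writes one record to a temporary file and reads the text back, so that the underlying IO is only touched through the writer.

```julia
using FASTX

rec = parse(FASTARecord, ">seq1 demo\nACGTACGTAC")

# Hypothetical helper: write one record with the given width and return
# the resulting file contents as a String.
function fasta_text(rec; width)
    path = tempname()
    writer = FASTAWriter(open(path, "w"); width=width)
    write(writer, rec)
    close(writer)              # also closes the underlying file
    text = read(path, String)
    rm(path)
    return text
end

wrapped = fasta_text(rec; width=4)    # sequence lines of at most 4 characters
unwrapped = fasta_text(rec; width=0)  # whole sequence on a single line
print(wrapped)
print(unwrapped)
```

With width=4 the ten-letter sequence is split over three lines; with width=0 it stays on one.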
Reference:
FASTX.FASTA — Module
FASTA
Module under FASTX with code related to FASTA files.
FASTX.FASTA.Reader — Type
FASTA.Reader(input::IO; index=nothing, copy::Bool=true)
Create a buffered data reader of the FASTA file format. The reader is a BioGenerics.IO.AbstractReader, a stateful iterator of FASTA.Record. Readers take ownership of the underlying IO. Mutating or closing the underlying IO not using the reader is undefined behaviour. Closing the Reader also closes the underlying IO.
See more examples in the FASTX documentation.
See also: FASTA.Record, FASTA.Writer
Arguments
- input: data source
- index: Optional random access index (currently fai is supported). index can be nothing, a FASTA.Index, or an IO, in which case an index will be parsed from the IO, or an AbstractString, in which case it will be treated as a path to a fai file.
- copy::Bool: iterating returns fresh copies instead of the same Record. Set to false for improved performance, but be wary that iterating mutates records.
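A minimal sketch of iterating with copy=false, assuming FASTX is installed. Since the record may be reused between iterations, the loop copies out what it needs as fresh values instead of keeping the record itself; the variable names are illustrative only.

```julia
using FASTX

rdr = FASTAReader(IOBuffer(">header\nTAG\n>another\nAGAC"); copy=false)
lengths = Dict{String,Int}()
for rec in rdr
    # With copy=false the same record may be overwritten on the next
    # iteration, so store fresh Strings and numbers, not the record.
    lengths[String(identifier(rec))] = length(sequence(String, rec))
end
close(rdr)

println(lengths["header"])
println(lengths["another"])
```

Collecting the records themselves, as with collect(rdr), is only safe with the default copy=true.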
Examples
julia> rdr = FASTAReader(IOBuffer(">header\nTAG\n>another\nAGA"));
julia> records = collect(rdr); close(rdr);
julia> foreach(println, map(identifier, records))
header
another
julia> foreach(println, map(sequence, records))
TAG
AGA
FASTX.FASTA.Writer — Type
FASTA.Writer(output::IO; width=70)
Create a data writer of the FASTA file format. The writer is a BioGenerics.IO.AbstractWriter. Writers take ownership of the underlying IO. Mutating or closing the underlying IO not using the writer is undefined behaviour. Closing the writer also closes the underlying IO.
See more examples in the FASTX documentation.
See also: FASTA.Record, FASTA.Reader
Arguments
- output: Data sink to write to
- width: Wrapping width of sequence characters. If < 1, no wrapping.
Examples
julia> FASTA.Writer(open("some_file.fna", "w")) do writer
           write(writer, record) # a FASTA.Record
       end
FASTX.FASTA.validate_fasta — Function
validate_fasta(io::IO) >: Nothing
Check if io is a valid FASTA file. Return nothing if it is, and an instance of another type if not.
Examples
julia> validate_fasta(IOBuffer(">a bc\nTAG\nTA")) === nothing
true
julia> validate_fasta(IOBuffer(">a bc\nT>G\nTA")) === nothing
false