FASTA index (FAI files)
FASTX.jl supports FASTA index (FAI) files. When a FASTA file is indexed with a FAI file, one can seek records by their name, or extract parts of records easily.
See the FAI specification here: http://www.htslib.org/doc/faidx.html
Making an Index
A FASTA index (of type Index) can be constructed from an IO object representing a FAI file:
julia> io = IOBuffer("seqname\t9\t2\t6\t8");
julia> Index(io) isa Index
true
Or from a path representing a FAI file:
julia> Index("../test/data/test.fasta.fai");
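Each line of a FAI file describes one record in five tab-separated columns, which the FAI specification calls NAME, LENGTH, OFFSET, LINEBASES and LINEWIDTH. As a rough illustration of what the string passed to Index above contains, the following sketch splits one such line into named fields. The helper parse_fai_line and its field names are hypothetical and not part of FASTX.jl, which does its own parsing and validation.

```julia
# Hypothetical helper, not a FASTX.jl function: split one FAI line into
# its five columns (NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH).
function parse_fai_line(line::AbstractString)
    fields = split(line, '\t')
    length(fields) == 5 || error("expected 5 tab-separated fields")
    return (
        name = String(fields[1]),
        len = parse(Int, fields[2]),        # number of bases in the sequence
        offset = parse(Int, fields[3]),     # byte offset of the first base
        linebases = parse(Int, fields[4]),  # bases per sequence line
        linewidth = parse(Int, fields[5]),  # bytes per line, including newline
    )
end

entry = parse_fai_line("seqname\t9\t2\t6\t8")
```

Here entry.len is 9 and entry.offset is 2, matching the second and third columns of the input line.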
Alternatively, a FASTA file can be indexed to produce an Index using faidx:
julia> faidx(IOBuffer(">abc\nTAGA\nTA"))
Index:
abc 6 5 4 5
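To see where the numbers in this output come from, here is a deliberately simplified indexer written in plain Julia. It is only a sketch: toy_faidx is a hypothetical name, it assumes non-empty identifiers and well-formed input, and it performs none of the validation that faidx does.

```julia
# Hypothetical, simplified indexer for illustration only; use FASTX.jl's
# faidx for real data. Returns one named tuple per record.
function toy_faidx(data::AbstractString)
    entries = Vector{NamedTuple}()
    pos = 0            # byte offset of the start of the current line
    current = nothing  # entry for the record currently being read
    for line in eachline(IOBuffer(data); keep=true)
        nbytes = ncodeunits(line)                        # bytes incl. newline
        bases = ncodeunits(rstrip(line, ['\r', '\n']))   # bytes excl. newline
        if startswith(line, ">")
            current === nothing || push!(entries, current)
            # The identifier is the header text up to the first whitespace.
            name = String(first(split(line[2:end])))
            current = (name=name, len=0, offset=pos + nbytes,
                       linebases=0, linewidth=0)
        elseif current !== nothing
            if current.len == 0
                # The first sequence line fixes the line layout of the record.
                current = merge(current, (linebases=bases, linewidth=nbytes))
            end
            current = merge(current, (len=current.len + bases,))
        end
        pos += nbytes
    end
    current === nothing || push!(entries, current)
    return entries
end

entries = toy_faidx(">abc\nTAGA\nTA")
```

For this input the header line ">abc\n" occupies 5 bytes, so the sequence starts at offset 5; the sequence has 6 bases, with 4 bases and 5 bytes on its first line.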
A FASTA file can also be indexed and the index immediately written to a FAI file by passing an AbstractString path to faidx:
julia> rm("../test/data/test.fasta.fai") # remove existing fai
julia> ispath("../test/data/test.fasta.fai")
false
julia> faidx("../test/data/test.fasta");
julia> ispath("../test/data/test.fasta.fai")
true
Note that the restrictions on FASTA files for indexing are stricter than those of FASTX.jl's FASTA parser, so not all FASTA files that can be read can be indexed:
julia> str = ">\0\n\0";
julia> first(FASTAReader(IOBuffer(str))) isa FASTARecord
true
julia> Index(IOBuffer(str))
ERROR
[...]
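As one illustration of these stricter rules: a FAI entry stores a single line length per record, so every sequence line of a record except the last has to have the same length. The sketch below checks that property for the sequence lines of one record. The helper has_uniform_lines is hypothetical, not a FASTX.jl function, and it covers only this one restriction; see the FAI specification for the full set.

```julia
# Hypothetical check, not part of FASTX.jl: do all sequence lines of one
# record, except possibly the last, have the same length in bytes?
function has_uniform_lines(seqlines::AbstractVector{<:AbstractString})
    isempty(seqlines) && return true
    width = ncodeunits(first(seqlines))
    # The last line is excluded because it is allowed to differ.
    return all(l -> ncodeunits(l) == width, seqlines[1:end-1])
end

ok = has_uniform_lines(["TAGA", "TAGA", "TA"])
bad = has_uniform_lines(["TA", "TAGA", "TA"])
```

The first call returns true. The second returns false, because the first two lines differ in length while more lines follow.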
Writing a FAI file
If you have an Index object, you can simply write it to an IO:
julia> index = open(i -> Index(i), "../test/data/test.fasta.fai");
julia> filename = tempname();
julia> open(i -> write(i, index), filename, "w");
julia> index2 = open(i -> Index(i), filename);
julia> string(index) == string(index2)
true
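The written file is plain text with one tab-separated line per record. As a minimal sketch of that layout, independent of FASTX.jl, the following code serializes one entry by hand. The named tuple and the helper fai_line are hypothetical stand-ins for this example; in practice write(io, index) does this for you.

```julia
# Hypothetical entry; the field names are this sketch's own, not FASTX.jl's.
entry = (name="seqname", len=9, offset=2, linebases=6, linewidth=8)

# Serialize one entry as a tab-separated FAI line (without the newline).
fai_line(e) = join((e.name, e.len, e.offset, e.linebases, e.linewidth), '\t')

io = IOBuffer()
println(io, fai_line(entry))   # println appends the line terminator
written = String(take!(io))
```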
Attaching an Index to a Reader
When opening a FASTA.Reader, you can attach an Index by passing the index keyword. You can pass an Index directly, an IO from which an Index will be parsed, or an AbstractString that will be interpreted as a path to a FAI file:
julia> str = ">abc\nTAG\nTA";
julia> idx = faidx(IOBuffer(str));
julia> rdr = FASTAReader(IOBuffer(str), index=idx);
You can also add an index to an existing reader using the index! function:
FASTX.FASTA.index! - Function
index!(r::FASTA.Reader, ind::Union{Nothing, Index, IO, AbstractString})
Set the index of r, and return r. If ind isa Union{Nothing, Index}, directly set the index to ind. If ind isa IO, parse the index from the FAI-formatted IO first. If ind isa AbstractString, treat it as the path to a FAI file to parse.
See also: Index, FASTA.Reader
Seeking using an Index
With an Index attached to a Reader, you can do the following operations in O(1) time. In these examples, we will use the following FASTA file:
>seq1 sequence
TAGAAAGCAA
TTAAAC
>seq2 sequence
AACGG
UUGC
- Seek to a Record using its identifier:
julia> seekrecord(reader, "seq2");
julia> record = first(reader); sequence(record)
"AACGGUUGC"
- Directly extract a record using its identifier:
julia> record = reader["seq1"];
julia> description(record)
"seq1 sequence"
- Extract a sequence directly without loading the whole record into memory. This is useful for huge sequences like chromosomes:
julia> extract(reader, "seq1", 3:5)
"GAA"
FASTX.jl does not yet support indexing FASTQ files.
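The constant-time seeking and extraction shown above are possible because the index stores, for each record, the byte offset of its first base and its line layout, so the position of any base can be computed directly. The following sketch reproduces the extract example by hand on the example FASTA file. The helpers base_offset and toy_extract are hypothetical, and the values 15, 10 and 11 (offset, bases per line, bytes per line for seq1) were worked out by hand for this file, assuming "\n" line endings and a trailing newline.

```julia
# Hypothetical helper: 0-based byte offset of the 1-based sequence position i.
base_offset(offset, linebases, linewidth, i) =
    offset + div(i - 1, linebases) * linewidth + rem(i - 1, linebases)

# Hypothetical sketch of range extraction; not FASTX.jl's implementation.
function toy_extract(io::IO, offset, linebases, linewidth, range::UnitRange{Int})
    isempty(range) && return ""
    start = base_offset(offset, linebases, linewidth, first(range))
    stop = base_offset(offset, linebases, linewidth, last(range))
    seek(io, start)                    # jump straight to the first base
    raw = read(io, stop - start + 1)   # read only the bytes spanning the range
    # Drop line terminators that fall inside the requested span.
    return String(filter(b -> b != UInt8('\n') && b != UInt8('\r'), raw))
end

fasta = ">seq1 sequence\nTAGAAAGCAA\nTTAAAC\n>seq2 sequence\nAACGG\nUUGC\n"

# seq1: header line is 15 bytes, 10 bases per line, 11 bytes per line.
sub = toy_extract(IOBuffer(fasta), 15, 10, 11, 3:5)
```

Here sub is "GAA", the same result as extract(reader, "seq1", 3:5). A range such as 9:12 crosses a line break, and the newline inside the span is filtered out.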
Reference:
FASTX.FASTA.faidx - Function
faidx(io::IO)::Index
Index the FASTA data read from io, returning a FASTA.Index.
See also: Index
Examples
julia> ind = faidx(IOBuffer(">ab\nTA\nT\n>x y\nGAG\nGA"))
Index:
ab 3 4 2 3
x 5 14 3 4
faidx(fnapath::AbstractString, [idxpath::AbstractString], check=true)
Index the FASTA file at fnapath and write the index to idxpath. If idxpath is not given, default to fnapath * ".fai". If check is true, throw an error if the output file already exists.
See also: Index
FASTX.FASTA.seekrecord - Function
seekrecord(reader::FASTAReader, i::Union{AbstractString, Integer})
Seek Reader to the i'th record. The next iterated record will be the i'th record. i can be the identifier of a sequence, or the 1-based record number in the Index.
The Reader needs to be indexed for this to work.
FASTX.FASTA.extract - Function
extract(reader::Reader, name::AbstractString, range::Union{Nothing, UnitRange})
Extract a subsequence given by index range from the sequence named name in a Reader with an index. Returns a String. If range is nothing (the default value), return the entire sequence.
FASTX.FASTA.Index - Type
Index(src::Union{IO, AbstractString})
FASTA index object, which allows constant-time seeking of FASTA files by name. The index is assumed to be in FAI format.
Notable methods:
- Index(::Union{IO, AbstractString}): Read FAI file from IO or file at path
- write(::IO, ::Index): Write index in FAI format
- faidx(::IO)::Index: Index FASTA file
- seekrecord(::Reader, ::AbstractString): Go to position of seq
- extract(::Reader, ::AbstractString): Extract part of sequence
Note that the FAI specs are stricter than FASTX.jl's definition of FASTA, such that some valid FASTA records may not be indexable. See the specs at: http://www.htslib.org/doc/faidx.html
See also: FASTA.Reader
Examples
julia> src = IOBuffer("seqname\t9\t14\t6\t8\nA\t1\t3\t1\t2");
julia> fna = IOBuffer(">A\nG\n>seqname\nACGTAC\r\nTTG");
julia> rdr = FASTA.Reader(fna; index=src);
julia> seekrecord(rdr, "seqname");
julia> sequence(String, first(rdr))
"ACGTACTTG"