FASTA index (FAI files)
FASTX.jl supports FASTA index (FAI) files. When a FASTA file is indexed with a FAI file, one can seek records by their name, or extract parts of records easily.
See the FAI specification here: http://www.htslib.org/doc/faidx.html
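For orientation, each line of a FAI file holds five tab-separated columns, which the htslib specification names NAME, LENGTH, OFFSET, LINEBASES and LINEWIDTH. The following is a minimal sketch in plain Julia of how those columns locate a base in the FASTA file. The helpers `parse_fai_line` and `base_byte_offset` are hypothetical names used only for illustration. They are not part of FASTX.jl and do not reflect how the package is implemented.

```julia
# Hypothetical helpers for illustration; neither is part of FASTX.jl.

# Split one FAI line into its five tab-separated columns,
# using the column names from the htslib faidx specification.
function parse_fai_line(line::AbstractString)
    fields = split(chomp(line), '\t')
    length(fields) == 5 || error("expected 5 tab-separated columns")
    return (
        name      = String(fields[1]),      # sequence identifier
        length    = parse(Int, fields[2]),  # number of bases in the sequence
        offset    = parse(Int, fields[3]),  # 0-based byte offset of the first base
        linebases = parse(Int, fields[4]),  # bases per sequence line
        linewidth = parse(Int, fields[5]),  # bytes per sequence line, newline included
    )
end

# 0-based byte offset of the base at 1-based position `pos`.
function base_byte_offset(entry, pos::Integer)
    1 <= pos <= entry.length || throw(BoundsError(1:entry.length, pos))
    line, col = divrem(pos - 1, entry.linebases)  # full lines to skip, column in line
    return entry.offset + line * entry.linewidth + col
end

fasta = ">abc\nTAGA\nTA"
entry = parse_fai_line("abc\t6\t5\t4\t5")   # FAI line describing the record above

byte = base_byte_offset(entry, 5)           # fifth base sits on the second line
base = Char(codeunits(fasta)[byte + 1])     # +1 because Julia arrays are 1-based
```

Here `byte` is 10 and `base` is 'T', the fifth base of the record. Because the position is found by arithmetic alone, no scanning of the file is needed.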
Making an Index
A FASTA index (of type Index) can be constructed from an IO object representing a FAI file:
julia> io = IOBuffer("seqname\t9\t2\t6\t8");
julia> Index(io) isa Index
true
Or from a path representing a FAI file:
julia> Index("../test/data/test.fasta.fai");
Alternatively, a FASTA file can be indexed to produce an Index using faidx:
julia> faidx(IOBuffer(">abc\nTAGA\nTA"))
Index:
abc 6 5 4 5
Alternatively, a FASTA file can be indexed, and the index immediately written to a FAI file, by passing an AbstractString to faidx:
julia> rm("../test/data/test.fasta.fai") # remove existing fai
julia> ispath("../test/data/test.fasta.fai")
false
julia> faidx("../test/data/test.fasta");
julia> ispath("../test/data/test.fasta.fai")
true
Note that the requirements for indexing a FASTA file are stricter than those of FASTX.jl's FASTA parser, so not every FASTA file that can be read can also be indexed:
julia> str = ">\0\n\0";
julia> first(FASTAReader(IOBuffer(str))) isa FASTARecord
true
julia> Index(IOBuffer(str))
ERROR
[...]
Writing a FAI file
If you have an Index object, you can simply write it to an IO:
julia> index = open(i -> Index(i), "../test/data/test.fasta.fai");
julia> filename = tempname();
julia> open(i -> write(i, index), filename, "w");
julia> index2 = open(i -> Index(i), filename);
julia> string(index) == string(index2)
true
Attaching an Index to a Reader
When opening a FASTA.Reader, you can attach an Index by passing the index keyword. The value can be an Index, an IO from which an Index will be parsed, or an AbstractString that is interpreted as the path to a FAI file:
julia> str = ">abc\nTAG\nTA";
julia> idx = faidx(IOBuffer(str));
julia> rdr = FASTAReader(IOBuffer(str), index=idx);
You can also add an index to an existing reader using the index! function:
FASTX.FASTA.index! — Function
index!(r::FASTA.Reader, ind::Union{Nothing, Index, IO, AbstractString})
Set the index of r, and return r. If ind isa Union{Nothing, Index}, directly set the index to ind. If ind isa IO, parse the index from the FAI-formatted IO first. If ind isa AbstractString, treat it as the path to a FAI file to parse.
See also: Index, FASTA.Reader
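As a brief illustration of the docstring above, the following sketch builds an index in memory and attaches it to a reader that was opened without one. It reuses the small FASTA string from the earlier example and calls the function by its qualified name, FASTA.index!. The variable names are arbitrary.

```julia
using FASTX

str = ">abc\nTAG\nTA"
idx = faidx(IOBuffer(str))          # build an Index from the FASTA data

rdr = FASTAReader(IOBuffer(str))    # reader opened without an index
FASTA.index!(rdr, idx)              # attach the index; returns the reader

seekrecord(rdr, "abc")              # seeking requires an attached index
rec = first(rdr)
seq = sequence(String, rec)
```

Passing nothing as the second argument should, per the docstring, set the reader's index to nothing again.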
Seeking using an Index
With an Index attached to a Reader, you can do the following operations in O(1) time. In these examples, we will use the following FASTA file:
>seq1 sequence
TAGAAAGCAA
TTAAAC
>seq2 sequence
AACGG
UUGC
- Seek to a Record using its identifier:
julia> seekrecord(reader, "seq2");
julia> record = first(reader); sequence(record)
"AACGGUUGC"
- Directly extract a record using its identifier:
julia> record = reader["seq1"];
julia> description(record)
"seq1 sequence"
- Extract a sequence directly without loading the whole record into memory. This is useful for huge sequences like chromosomes:
julia> extract(reader, "seq1", 3:5)
"GAA"
FASTX.jl does not yet support indexing FASTQ files.
Reference:
FASTX.FASTA.faidx — Function
faidx(io::IO)::Index
Index the FASTA data read from io and return the resulting FASTA.Index.
See also: Index
Examples
julia> ind = faidx(IOBuffer(">ab\nTA\nT\n>x y\nGAG\nGA"))
Index:
ab 3 4 2 3
x 5 14 3 4
faidx(fnapath::AbstractString, [idxpath::AbstractString], check=true)
Index the FASTA file at fnapath and write the index to idxpath. If idxpath is not given, it defaults to fnapath * ".fai". If check is true, throw an error if the output file already exists.
See also: Index
FASTX.FASTA.seekrecord — Function
seekrecord(reader::FASTAReader, i::Union{AbstractString, Integer})
Seek Reader to the i'th record. The next iterated record will be the i'th record. i can be the identifier of a sequence, or the 1-based record number in the Index.
The Reader needs to be indexed for this to work.
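The two forms of the i argument can be sketched as follows, assuming the small two-record FASTA string from the faidx example above. The variable names are illustrative only, and the record number refers to the order of entries in the Index.

```julia
using FASTX

str = ">ab\nTA\nT\n>x y\nGAG\nGA"
rdr = FASTAReader(IOBuffer(str); index=faidx(IOBuffer(str)))

seekrecord(rdr, 2)                  # seek by 1-based record number in the Index
by_number = identifier(first(rdr))

seekrecord(rdr, "ab")               # seek by sequence identifier
by_name = identifier(first(rdr))
```

Note that the identifier of the second record is "x", not "x y": the index stores only the identifier part of the header.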
FASTX.FASTA.extract — Function
extract(reader::Reader, name::AbstractString, range::Union{Nothing, UnitRange})
Extract a subsequence given by index range from the sequence named in a Reader with an index. Returns a String. If range is nothing (the default value), return the entire sequence.
FASTX.FASTA.Index — Type
Index(src::Union{IO, AbstractString})
FASTA index object, which allows constant-time seeking of FASTA files by name. The index is assumed to be in FAI format.
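A short sketch of both call forms, assuming the two-record example file from the seeking section is held in a string rather than on disk. The variable names are illustrative.

```julia
using FASTX

str = ">seq1 sequence\nTAGAAAGCAA\nTTAAAC\n>seq2 sequence\nAACGG\nUUGC"
reader = FASTAReader(IOBuffer(str); index=faidx(IOBuffer(str)))

whole = extract(reader, "seq1")        # range defaults to nothing: entire sequence
part  = extract(reader, "seq1", 3:5)   # only bases 3 through 5
```

The returned strings contain sequence data only, so a sequence spanning several lines in the file comes back without the line breaks.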
Notable methods:
- Index(::Union{IO, AbstractString}): Read FAI file from IO or file at path
- write(::IO, ::Index): Write index in FAI format
- faidx(::IO)::Index: Index FASTA file
- seekrecord(::Reader, ::AbstractString): Go to position of seq
- extract(::Reader, ::AbstractString): Extract part of sequence
Note that the FAI specs are stricter than FASTX.jl's definition of FASTA, such that some valid FASTA records may not be indexable. See the specs at: http://www.htslib.org/doc/faidx.html
See also: FASTA.Reader
Examples
julia> src = IOBuffer("seqname\t9\t14\t6\t8\nA\t1\t3\t1\t2");
julia> fna = IOBuffer(">A\nG\n>seqname\nACGTAC\r\nTTG");
julia> rdr = FASTA.Reader(fna; index=src);
julia> seekrecord(rdr, "seqname");
julia> sequence(String, first(rdr))
"ACGTACTTG"