FASTX formatted files

Readers and writers

A Reader and a Writer are structs that wrap an IO, and allows efficient reading/writing of FASTX Records. For FASTA, use FASTAReader and FASTAWriter, and for FASTQ - well I'm sure you've guessed it.

Readers and writers take control over the underlying IO, and manipulating the IO underneath a Reader/Writer, e.g. by flushing or closing it, cause them to behave in an undefined manner. Instead, you must interact directly with the reader and writer object.

Closing readers/writers closes the underlying IO. Because they carry their own buffers, it's important to remember to close writers in particular, else the results may not be fully written to the file.

Readers are iterables of Record:

julia> reader = FASTAReader(IOBuffer(">A\nTAG\n>B\nAGA"));

julia> record = first(reader); typeof(record) == FASTA.Record
true

julia> sequence(record)
"TAG"

julia> # NB! Readers are mutable iterators as can be seen here:

julia> sequence(first(reader))
"AGA"

julia> iterate(reader) === nothing # now empty
true

julia> close(reader)

They are normally more than fast enough as they are. To squeeze extra performance out, you can pass the keyword copy=false. This will cause the reader to return the same record over and over, and mutate it into place.

julia> reader = FASTAReader(IOBuffer(">A\nTAG\n>B\nAGA"); copy=false);

julia> rec1 = first(reader); sequence(rec1)
"TAG"

julia> rec2 = first(reader); sequence(rec2)
"AGA"

julia> rec1 === rec2
true

julia> sequence(rec1)
"AGA"

julia> close(reader)

When using readers and writers, be careful that they carry their own buffer, meaning that the underlying IO may not be updated immediately after reading/writing:

julia> io = IOBuffer();

julia> writer = FASTAWriter(io);

julia> write(writer, parse(FASTARecord, ">ABC\nDEF"));

julia> take!(io) # NB: Empty!
UInt8[]

To use it correctly, either call flush, or close the writer first (which also closes the underlying stream). It is recommended to use readers and writers to do syntax in the form:

julia> FASTAWriter(open(tempname(), "w")) do writer
           write(writer, FASTARecord("identifier", "seq"))
       end
16

Which will work for most underlying IO types, and will close the writer when the function returns (hence also closing the underlying IO).

Alternatively, the following syntax may be used:

julia> open(FASTAWriter, tempname()) do writer
           write(writer, FASTARecord("identifier", "seq"))
       end
16

However, this latter syntax does not easily extend to different types of IO, such as gzip compressed streams.

Validate files

The functions validate_fasta and validate_fastq can be used to check if an IO contains data that can be read as FASTX. They return nothing if the IO is correctly formatted, and another value if not.

They are significantly faster than parsing the whole file into records, and are memory efficient. Be aware that the validators mutate the IO by reading it, so make sure to reset the IO before using it to parse FASTX files.

julia> io = IOBuffer(">header\r\nAGG\nKK");

julia> validate_fasta(io) === nothing
true

julia> read(io) # NB: IO is now exhausted
UInt8[]

julia> validate_fastq(IOBuffer("@header\nTTT\n+\njkm")) === nothing
true