FASTQ formatted files
NB: First read the overview in the sidebar
FASTQ is a text-based file format for representing DNA sequences along with qualities for each base. A FASTQ file stores a list of sequence records.
The template of a sequence record is:
@{description}
{sequence}
+{description}?
{qualities}
Here, the "identifier" is the first part of the description up to the first whitespace (or the entire description if there is no whitespace).
The description may optionally be repeated on the third line, and if so, it must be identical to the description on the first line.
Here is an example of one record from a FASTQ file:
@FSRRS4401BE7HA
tcagTTAAGATGGGAT
+
###EEEEEEEEE##E#
Where:
identifier is "FSRRS4401BE7HA"
description is also "FSRRS4401BE7HA"
sequence is "tcagTTAAGATGGGAT"
quality is "###EEEEEEEEE##E#"
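The record layout described above can be sketched in plain Julia. This is only an illustration of the format rules, not how FASTX parses records: split_fastq_record is a hypothetical helper, it assumes a single record with "\n" line endings, and it assumes one quality character per base.

```julia
# Hypothetical helper, not part of FASTX: split one four-line FASTQ record
# into its parts using only Base Julia.
function split_fastq_record(text::AbstractString)
    lines = split(chomp(text), '\n')
    length(lines) == 4 || error("expected exactly four lines")
    startswith(lines[1], "@") || error("first line must start with '@'")
    startswith(lines[3], "+") || error("third line must start with '+'")

    description = String(lines[1][2:end])
    second_header = String(lines[3][2:end])
    # If the description is repeated on the third line, it must be identical.
    isempty(second_header) || second_header == description ||
        error("third line does not match the description")

    # The identifier is the description up to the first whitespace,
    # or the whole description if there is no whitespace.
    ws = findfirst(isspace, description)
    identifier = ws === nothing ? description : description[1:prevind(description, ws)]

    sequence, quality = String(lines[2]), String(lines[4])
    # Assumption: one quality character per base.
    length(sequence) == length(quality) || error("sequence and quality lengths differ")

    return (identifier=identifier, description=description,
            sequence=sequence, quality=quality)
end

rec = split_fastq_record("@FSRRS4401BE7HA\ntcagTTAAGATGGGAT\n+\n###EEEEEEEEE##E#")
```

For real files, prefer parse(FASTQRecord, ...) or a FASTQReader, which handle buffering and validation.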
The FASTQRecord
FASTQRecords optionally have the description repeated on the third line. This can be toggled with quality_header!(::Record, ::Bool):
julia> record = parse(FASTQRecord, "@ILL01\nCCCGC\n+\nKM[^d");
julia> print(record)
@ILL01
CCCGC
+
KM[^d
julia> quality_header!(record, true); print(record)
@ILL01
CCCGC
+ILL01
KM[^d
FASTX.FASTQ.Record — Type
FASTQ.Record
Mutable struct representing a FASTQ record as parsed from a FASTQ file. The content of the record can be queried with the following functions: identifier, description, sequence, quality
FASTQ records are un-typed, i.e. they are agnostic to what kind of data they contain.
See also: FASTQ.Reader, FASTQ.Writer
Examples
julia> rec = parse(FASTQRecord, "@ill r1\nGGC\n+\njjk");
julia> identifier(rec)
"ill"
julia> description(rec)
"ill r1"
julia> sequence(rec)
"GGC"
julia> show(collect(quality_scores(rec)))
Int8[73, 73, 74]
julia> typeof(description(rec)) == typeof(sequence(rec)) <: AbstractString
true
Qualities
Unlike FASTARecords, a FASTQRecord contains quality scores; see the example above.
The quality string can be obtained using the quality method:
julia> record = parse(FASTQRecord, "@ILL01\nCCCGC\n+\nKM[^d");
julia> quality(record)
"KM[^d"
Qualities are numerical values that are encoded by ASCII characters. Unfortunately, multiple encoding schemes exist, although PHRED+33 is the most common. The scores can be obtained using the quality_scores function, which returns an iterator of PHRED+33 scores:
julia> collect(quality_scores(record))
5-element Vector{Int8}:
42
44
58
61
67
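To make the decoding step concrete, here is a minimal sketch in plain Julia of what "PHRED+33" means: each score is the character's ASCII value minus 33. The helper names phred_scores and error_probability are hypothetical and are not part of FASTX. The conversion to an error probability uses the standard PHRED definition Q = -10 * log10(P), which this page does not itself state.

```julia
# Plain-Julia sketch, not the FASTX implementation.
# Each score is the character's code point minus the encoding offset.
phred_scores(quality::AbstractString; offset::Integer=33) =
    [Int(codepoint(c)) - offset for c in quality]

# Standard PHRED definition: Q = -10 * log10(P), so P = 10^(-Q / 10).
error_probability(q::Real) = 10.0^(-q / 10)

scores = phred_scores("KM[^d")              # [42, 44, 58, 61, 67]
probabilities = error_probability.(scores)  # one error probability per base
```

The scores agree with the quality_scores output shown above for the same quality string.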
If you want to decode the qualities using another scheme, you can use one of the predefined QualityEncoding objects. For example, Illumina v1.3 used PHRED+64:
julia> collect(quality_scores(record, FASTQ.ILLUMINA13_QUAL_ENCODING))
5-element Vector{Int8}:
11
13
27
30
36
Alternatively, quality_scores accepts the name of one of the known quality encodings:
julia> (collect(quality_scores(record, FASTQ.ILLUMINA13_QUAL_ENCODING)) ==
collect(quality_scores(record, :illumina13)))
true
Lastly, you can create your own:
FASTX.FASTQ.QualityEncoding — Type
QualityEncoding(range::StepRange{Char}, offset::Integer)
FASTQ quality encoding scheme. QualityEncoding objects are used to interpret the quality scores of FASTQ records. range is a range of allowed ASCII chars in the encoding, e.g. '!':'~' for the most common encoding scheme. The offset is the ASCII offset, i.e. a character with ASCII value x encodes the value x - offset.
See also: quality_scores
Examples
julia> read = parse(FASTQ.Record, "@hdr\nAGA\n+\nabc");
julia> qe = QualityEncoding('a':'z', 16); # hypothetical encoding
julia> collect(quality_scores(read, qe)) == [Int8(i) - 16 for i in "abc"]
true
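The two ingredients of a QualityEncoding described above, a range of allowed characters and an offset, can be illustrated without FASTX. The function decode_quality below is a hypothetical stand-in written for this sketch. It only mirrors the documented rule (a character with ASCII value x encodes x - offset) and assumes that a character outside the range should be rejected with an error.

```julia
# Hypothetical sketch of what an encoding does, not the FASTX implementation.
function decode_quality(quality::AbstractString, range::AbstractRange{Char}, offset::Integer)
    map(collect(quality)) do c
        # Assumption: characters outside the allowed range are an error.
        c in range || error("character $(repr(c)) is outside the allowed range")
        Int(codepoint(c)) - offset
    end
end

custom = decode_quality("abc", 'a':'z', 16)     # the hypothetical encoding from the example
sanger = decode_quality("KM[^d", '!':'~', 33)   # the most common range, with offset 33
```

With the range 'a':'z', an uppercase quality string such as "AB" would be rejected by this sketch.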
Reference:
FASTX.FASTQ.quality — Function
quality([T::Type{String, StringView}], record::FASTQ.Record, [part::UnitRange])
Get the ASCII quality of record at positions part as type T. If not passed, T defaults to StringView. If not passed, part defaults to the entire quality string.
Examples
julia> rec = parse(FASTQ.Record, "@hdr\nUAGUCU\n+\nCCDFFG");
julia> qual = quality(rec)
"CCDFFG"
julia> qual isa AbstractString
true
FASTX.FASTQ.quality_scores — Function
quality_scores(record::FASTQ.Record, [encoding::QualityEncoding], [part::UnitRange])
Get an iterator of PHRED base quality scores of record at positions part. This iterator is corrupted if the record is mutated. By default, part is the whole sequence. By default, the encoding is PHRED33 Sanger encoding, but may be specified with a QualityEncoding object.
quality(record::Record, encoding_name::Symbol, [part::UnitRange])::Vector{UInt8}
Get an iterator of base quality of the slice part of record's quality.
The encoding_name can be either :sanger, :solexa, :illumina13, :illumina15, or :illumina18.
FASTX.FASTQ.quality_header! — Function
quality_header!(record::Record, x::Bool)
Set whether the record repeats its header on the quality comment line, i.e. the line with +.
Examples
julia> record = parse(FASTQ.Record, "@A B\nT\n+\nJ");
julia> string(record)
"@A B\nT\n+\nJ"
julia> quality_header!(record, true);
julia> string(record)
"@A B\nT\n+A B\nJ"
FASTQReader and FASTQWriter
FASTQWriter can optionally be passed the keyword quality_header to control whether or not to print the description on the third line (the one with +). By default this is nothing, meaning that it will print the second header, if present in the record itself.
If set to a Bool value, the Writer will override the Records, without changing the records themselves.
Reference:
FASTX.FASTQ — Module
FASTQ
Module under FASTX with code related to FASTQ files.
FASTX.FASTQ.Reader — Type
FASTQ.Reader(input::IO; copy::Bool=true)
Create a buffered data reader of the FASTQ file format. The reader is a BioGenerics.IO.AbstractReader, a stateful iterator of FASTQ.Record. Readers take ownership of the underlying IO. Mutating or closing the underlying IO not using the reader is undefined behaviour. Closing the Reader also closes the underlying IO.
See more examples in the FASTX documentation.
See also: FASTQ.Record, FASTQ.Writer
Arguments
input: data source
copy::Bool: iterating returns fresh copies instead of the same Record. Set to false for improved performance, but be wary that iterating mutates records.
Examples
julia> rdr = FASTQReader(IOBuffer("@readname\nGGCC\n+\njk;]"));
julia> record = first(rdr); close(rdr);
julia> identifier(record)
"readname"
julia> sequence(record)
"GGCC"
julia> show(collect(quality_scores(record))) # phred 33 encoding by default
Int8[73, 74, 26, 60]
FASTX.FASTQ.Writer — Type
FASTQ.Writer(output::IO; quality_header::Union{Nothing, Bool}=nothing)
Create a data writer of the FASTQ file format. The writer is a BioGenerics.IO.AbstractWriter. Writers take ownership of the underlying IO. Mutating or closing the underlying IO not using the writer is undefined behaviour. Closing the writer also closes the underlying IO.
See more examples in the FASTX documentation.
See also: FASTQ.Record, FASTQ.Reader
Arguments
output: Data sink to write to
quality_header: Whether to print second header on the + line. If nothing (default), check the individual Record objects for whether they contain a second header.
Examples
julia> FASTQ.Writer(open("some_file.fq", "w")) do writer
           write(writer, record) # a FASTQ.Record
       end
FASTX.FASTQ.validate_fastq — Function
validate_fastq(io::IO) >: Nothing
Check if io is a valid FASTQ file. Return nothing if it is, and an instance of another type if not.
Examples
julia> validate_fastq(IOBuffer("@i1 r1\nuuag\n+\nHJKI")) === nothing
true
julia> validate_fastq(IOBuffer("@i1 r1\nu;ag\n+\nHJKI")) === nothing
false