FASTQ formatted files
NB: First read the overview in the sidebar
FASTQ is a text-based file format for representing DNA sequences along with a quality for each base. A FASTQ file stores a list of sequence records.
The template of a sequence record is:
@{description}
{sequence}
+{description}?
{qualities}
Where the "identifier" is the first part of the description up to the first whitespace (or the entire description if there is no whitespace).
The description may optionally be present on the third line, and if so, must be identical to the description on the first line.
Here is an example of one record from a FASTQ file:
@FSRRS4401BE7HA
tcagTTAAGATGGGAT
+
###EEEEEEEEE##E#
Where:
identifier is "FSRRS4401BE7HA"
description is also "FSRRS4401BE7HA"
sequence is "tcagTTAAGATGGGAT"
quality is "###EEEEEEEEE##E#"
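The four-line template can also be checked by hand. The following is a minimal sketch in plain Julia, not the parser FASTX uses; `split_record` is a hypothetical helper name, and the check that sequence and quality have equal length is an assumption drawn from the format storing one quality per base.

```julia
# Hand-rolled illustration of the four-line record layout.
# `split_record` is a hypothetical helper and is not part of FASTX.
function split_record(text::AbstractString)
    lines = split(text, '\n')
    length(lines) == 4 || error("expected exactly four lines")
    startswith(lines[1], "@") || error("first line must start with '@'")
    startswith(lines[3], "+") || error("third line must start with '+'")

    description = lines[1][2:end]
    # The identifier is the description up to the first whitespace.
    identifier = first(split(description, isspace; limit=2))

    sequence, quality = lines[2], lines[4]
    # Assumption: one quality character per base.
    length(sequence) == length(quality) || error("sequence and quality lengths differ")

    # If the third line carries a description, it must repeat the first line's.
    second = lines[3][2:end]
    isempty(second) || second == description || error("third-line description differs")

    return (identifier=identifier, description=description,
            sequence=sequence, quality=quality)
end

rec = split_record("@FSRRS4401BE7HA\ntcagTTAAGATGGGAT\n+\n###EEEEEEEEE##E#")
```

Here `rec.identifier` and `rec.description` are both "FSRRS4401BE7HA", matching the breakdown above. For real files, use the FASTQ reader and record types described in the rest of this page.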
The FASTQRecord
FASTQRecords optionally have the description repeated on the third line. This can be toggled with quality_header!(::Record, ::Bool):
julia> record = parse(FASTQRecord, "@ILL01\nCCCGC\n+\nKM[^d");
julia> print(record)
@ILL01
CCCGC
+
KM[^d
julia> quality_header!(record, true); print(record)
@ILL01
CCCGC
+ILL01
KM[^d
FASTX.FASTQ.Record — Type
FASTQ.Record
Mutable struct representing a FASTQ record as parsed from a FASTQ file. The content of the record can be queried with the following functions: identifier, description, sequence, quality
FASTQ records are un-typed, i.e. they are agnostic to what kind of data they contain.
See also: FASTQ.Reader, FASTQ.Writer
Examples
julia> rec = parse(FASTQRecord, "@ill r1\nGGC\n+\njjk");
julia> identifier(rec)
"ill"
julia> description(rec)
"ill r1"
julia> sequence(rec)
"GGC"
julia> show(collect(quality_scores(rec)))
Int8[73, 73, 74]
julia> typeof(description(rec)) == typeof(sequence(rec)) <: AbstractString
true
Qualities
Unlike FASTARecords, a FASTQRecord contains quality scores; see the example above.
The quality string can be obtained using the quality method:
julia> record = parse(FASTQRecord, "@ILL01\nCCCGC\n+\nKM[^d");
julia> quality(record)
"KM[^d"
Qualities are numerical values that are encoded by ASCII characters. Unfortunately, multiple encoding schemes exist, although PHRED+33 is the most common. The scores can be obtained using the quality_scores function, which returns an iterator of PHRED+33 scores:
julia> collect(quality_scores(record))
5-element Vector{Int8}:
42
44
58
61
67
If you want to decode the qualities using another scheme, you can use one of the predefined QualityEncoding objects. For example, Illumina v 1.3 used PHRED+64:
julia> collect(quality_scores(record, FASTQ.ILLUMINA13_QUAL_ENCODING))
5-element Vector{Int8}:
11
13
27
30
36
Alternatively, quality_scores accepts the name of one of the known quality encodings:
julia> (collect(quality_scores(record, FASTQ.ILLUMINA13_QUAL_ENCODING)) ==
        collect(quality_scores(record, :illumina13)))
true
Lastly, you can create your own:
FASTX.FASTQ.QualityEncoding — Type
QualityEncoding(range::StepRange{Char}, offset::Integer)
FASTQ quality encoding scheme. QualityEncoding objects are used to interpret the quality scores of FASTQ records. range is a range of allowed ASCII chars in the encoding, e.g. '!':'~' for the most common encoding scheme. The offset is the ASCII offset, i.e. a character with ASCII value x encodes the value x - offset.
See also: quality_scores
Examples
julia> read = parse(FASTQ.Record, "@hdr\nAGA\n+\nabc");
julia> qe = QualityEncoding('a':'z', 16); # hypothetical encoding
julia> collect(quality_scores(read, qe)) == [Int8(i) - 16 for i in "abc"]
true
Reference:
FASTX.FASTQ.quality — Function
quality([T::Type{String, StringView}], record::FASTQ.Record, [part::UnitRange])
Get the ASCII quality of record at positions part as type T. If not passed, T defaults to StringView. If not passed, part defaults to the entire quality string.
Examples
julia> rec = parse(FASTQ.Record, "@hdr\nUAGUCU\n+\nCCDFFG");
julia> qual = quality(rec)
"CCDFFG"
julia> qual isa AbstractString
true
FASTX.FASTQ.quality_scores — Function
quality_scores(record::FASTQ.Record, [encoding::QualityEncoding], [part::UnitRange])
Get an iterator of PHRED base quality scores of record at positions part. This iterator is corrupted if the record is mutated. By default, part is the whole sequence. By default, the encoding is PHRED33 Sanger encoding, but may be specified with a QualityEncoding object.
quality_scores(record::Record, encoding_name::Symbol, [part::UnitRange])
Get an iterator of base quality scores of the slice part of record's quality.
The encoding_name can be either :sanger, :solexa, :illumina13, :illumina15, or :illumina18.
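The "x - offset" rule described for QualityEncoding can be reproduced by hand. This is a minimal sketch in plain Julia, assuming only the offset arithmetic stated above; `decode_quality` and its `allowed` keyword are hypothetical names, not part of FASTX, and the default range is simply the '!':'~' range mentioned earlier.

```julia
# Decode a quality string by subtracting an ASCII offset from each character.
# `decode_quality` is a hypothetical helper and is not part of FASTX.
function decode_quality(qual::AbstractString, offset::Integer; allowed='!':'~')
    # Reject characters outside the range permitted by the encoding.
    all(in(allowed), qual) || error("quality character outside the allowed range")
    return [Int8(UInt8(c) - offset) for c in qual]
end

phred33 = decode_quality("KM[^d", 33)   # offset 33, as in PHRED+33
phred64 = decode_quality("KM[^d", 64)   # offset 64, as in PHRED+64
```

With the string "KM[^d" used throughout this page, `phred33` holds 42, 44, 58, 61, 67 and `phred64` holds 11, 13, 27, 30, 36, the same values shown in the quality_scores examples. The sketch builds a vector eagerly, whereas quality_scores is documented to return an iterator.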
FASTX.FASTQ.quality_header! — Function
quality_header!(record::Record, x::Bool)
Set whether the record repeats its header on the quality comment line, i.e. the line with +.
Examples
julia> record = parse(FASTQ.Record, "@A B\nT\n+\nJ");
julia> string(record)
"@A B\nT\n+\nJ"
julia> quality_header!(record, true);
julia> string(record)
"@A B\nT\n+A B\nJ"
FASTQReader and FASTQWriter
FASTQWriter can optionally be passed the keyword quality_header to control whether or not to print the description on the third line (the one with +). By default this is nothing, meaning that it will print the second header, if present in the record itself.
If set to a Bool value, the writer overrides the setting of every record it writes, without modifying the records themselves.
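The three possible values of the keyword can be summarised as a small decision rule. The sketch below is plain Julia written for illustration only; `plus_line` is a hypothetical helper, `record_has_header` stands in for the record's own setting, and `override` stands in for the writer's quality_header keyword. It does not show how the writer is implemented.

```julia
# Hypothetical helper: decide what gets printed on the '+' line.
# `record_has_header` models the record's own setting,
# `override` models the writer's `quality_header` keyword.
function plus_line(description::AbstractString, record_has_header::Bool,
                   override::Union{Nothing,Bool}=nothing)
    # `nothing` defers to the record; a Bool takes precedence over it.
    repeat_header = override === nothing ? record_has_header : override
    return repeat_header ? "+" * description : "+"
end

plus_line("A B", true)          # record setting is used
plus_line("A B", true, false)   # writer setting wins
```

Note that the helper only returns a string and never touches its inputs, mirroring the statement that the records themselves are left unchanged.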
Reference:
FASTX.FASTQ — Module
FASTQ
Module under FASTX with code related to FASTQ files.
FASTX.FASTQ.Reader — Type
FASTQ.Reader(input::IO; copy::Bool=true)
Create a buffered data reader of the FASTQ file format. The reader is a BioGenerics.IO.AbstractReader, a stateful iterator of FASTQ.Record. Readers take ownership of the underlying IO. Mutating or closing the underlying IO not using the reader is undefined behaviour. Closing the Reader also closes the underlying IO.
See more examples in the FASTX documentation.
See also: FASTQ.Record, FASTQ.Writer
Arguments
input: data source
copy::Bool: iterating returns fresh copies instead of the same Record. Set to false for improved performance, but be wary that iterating mutates records.
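The warning about copy=false is easiest to see with a toy model. The sketch below uses plain Julia and hypothetical stand-in types (`ToyRecord`, `ReusingReader`); it only illustrates the aliasing that results when an iterator hands out the same mutable object on every step, and makes no claim about how the reader is implemented internally.

```julia
# Hypothetical stand-ins for a record and a reader that reuses one record.
mutable struct ToyRecord
    name::String
end

struct ReusingReader
    names::Vector{String}
    record::ToyRecord   # the single record handed out when copy is false
    copy::Bool
end

function Base.iterate(r::ReusingReader, i::Int=1)
    i > length(r.names) && return nothing
    r.record.name = r.names[i]   # overwrite the shared record in place
    item = r.copy ? ToyRecord(r.record.name) : r.record
    return (item, i + 1)
end
Base.length(r::ReusingReader) = length(r.names)
Base.eltype(::Type{ReusingReader}) = ToyRecord

shared = collect(ReusingReader(["r1", "r2"], ToyRecord(""), false))
fresh  = collect(ReusingReader(["r1", "r2"], ToyRecord(""), true))
```

In this toy model every element of `shared` is the same object and ends up named "r2", while `fresh` keeps "r1" and "r2" apart. The practical consequence is that, without copying, a record should be fully processed before advancing the iterator rather than stored for later.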
Examples
julia> rdr = FASTQReader(IOBuffer("@readname\nGGCC\n+\njk;]"));
julia> record = first(rdr); close(rdr);
julia> identifier(record)
"readname"
julia> sequence(record)
"GGCC"
julia> show(collect(quality_scores(record))) # phred 33 encoding by default
Int8[73, 74, 26, 60]
FASTX.FASTQ.Writer — Type
FASTQ.Writer(output::IO; quality_header::Union{Nothing, Bool}=nothing)
Create a data writer of the FASTQ file format. The writer is a BioGenerics.IO.AbstractWriter. Writers take ownership of the underlying IO. Mutating or closing the underlying IO not using the writer is undefined behaviour. Closing the writer also closes the underlying IO.
See more examples in the FASTX documentation.
See also: FASTQ.Record, FASTQ.Reader
Arguments
output: Data sink to write to
quality_header: Whether to print second header on the + line. If nothing (default), check the individual Record objects for whether they contain a second header.
Examples
julia> FASTQ.Writer(open("some_file.fq", "w")) do writer
           write(writer, record) # a FASTQ.Record
       end
FASTX.FASTQ.validate_fastq — Function
validate_fastq(io::IO) >: Nothing
Check if io is a valid FASTQ file. Return nothing if it is, and an instance of another type if not.
Examples
julia> validate_fastq(IOBuffer("@i1 r1\nuuag\n+\nHJKI")) === nothing
true
julia> validate_fastq(IOBuffer("@i1 r1\nu;ag\n+\nHJKI")) === nothing
false