API: The Graphs submodule
The Graphs submodule contains the core graph data-structure used in GenomeGraphs.jl: The SequenceDistanceGraph.
It also defines some of the common methods used in the rest of the higher-level parts GenomeGraphs.jl:
- To query nodes/sequences in the graph.
- To find certain topological structures.
- To edit graph structure.
- To write/load SequenceDistanceGraph's to/from file.
Some methods, particularly ones that access the sequences stored in the graph, and to edit topology, are marked as unsafe and are not exported deliberately.
All types and methods defined in this module are documented here.
This is a reference of an internal sub-module's API for developers and experienced users. First ask yourself if what you need isn't covered by the higher-level WorkSpace API.
Types
GenomeGraphs.Graphs.SequenceDistanceGraph
— TypeThe SequenceDistanceGraph is a representation of a genome assembly.
The SequenceDistanceGraph stores sequences in a vector of SequenceDistanceGraphNode
s (or SDGNode
for short).
A singe node represents a sequence and its reverse complement.
Every node also has an orientaton: Each node has a positive end (+), and a negative end (-). So when a node is accessed with (or traversed by entering) the positive end the node yields the stored sequence. Conversely, when a node is accessed with (or traversed by entering) the negative end the node yelds the reverse complement of the stored sequence. In this way the positive end can be thought of as the sequence start of the node's canonical sequence, and the negative end can be thought of as its end.
Thankfully, you don't need to think about "positive node ends" and "negative node ends" most of the time you interact with a SequenceDistanceGraph
: In a SequenceDistanceGraph
, every has a correlative ID starting from 1. For every node stored in the graph with an ID X, the negative ID -X is mapped to the reverse complement of that node. This mapping is virtual: Only one node is stored in the graph. To see more about how these positive and negative node IDs work, refer to the docs for methods such as get_next_nodes
and get_previous_nodes
. These high level functions simplify coding graph traversal A LOT.
This SequenceDistanceGraph
type is only intended to be interacted with directly by developers and users who know what they are doing. Most tasks an end user wants to do to manipulate or query a graph can be achieved through the WorkSpace
type and the NodeView interface.
GenomeGraphs.Graphs.SDG
— TypeShorthand for SequenceDistanceGraph
GenomeGraphs.Graphs.SDGNode
— TypeThe SDGNode type represents a node in a SequenceDistanceGraph.
At present it contains only two fields, first it holds an instance of a BioSequences BioSequence type.
Secondly, it tracks a flag which indicates if the node has been deleted or not.
The deleted flag allows us to mark nodes in the graph as deleted, which can be of help in some algorithms where the graph structure is being edited (merging nodes for example).
Actually deleting the node would shift node IDs and require redoing all links in the graph and so on.
So just marking a node as deleted and not using it anymore is a lazy but sometimes helpful choice.
GenomeGraphs.Graphs.SequenceDistanceGraphLink
— TypeA SequenceDistanceGraphLink or SDGLink represents a single distance between two sequences in a SequenceDistanceGraph.
Every link describes a connection between two node ends and contains a distance (they take the form ([+, -]n1, [+, -]n2, [+, -]dist)
).
A link connects two node ends, and so the order of the signed nodes in the links does not change the link.
If the distance in a link is negative, this represents an overlap between two sequences. These overlaps must be perfect overlaps.
Two links are considered equal (==
will return true
) if the source and the destination fields of the two links are the same, the distance or dist field is not considered!
Public / Safe methods
Graph nodes and sequences
GenomeGraphs.Graphs.name
— FunctionGet the name of the graph. Defaults to the symbol :sdg.
GenomeGraphs.Graphs.n_nodes
— FunctionGet the number of nodes in the sequence distance graph sg
.
GenomeGraphs.Graphs.sequence
— Functionsequence(sg::SequenceDistanceGraph, n::NodeID)
Get the full sequence of a node in a sequence distance graph using its correlative node id n
.
sequence
accepts a NodeID that can be positive or negative. Nodes represent stretches of sequence in a canonical orientation, if you ask for for the sequence of say the third node, the positive node id 3 (which denotes traversing the third node in the forward direction), gives you the canonical sequence. If you use the negative ID -3 (which denotes traversing the third node in the reverse direction), you will get the reverse complement of the node's canonical (forward) sequence.
It is safe to modify the returned sequence without screwing up your graph, yet thanks to BioSequences.jl's copy on write system for LongSequences, data copying will only occur if nessecery. You get the best of both worlds.
Graph Topology
GenomeGraphs.Graphs.source
— FunctionGet the source node end of a link.
GenomeGraphs.Graphs.destination
— FunctionGet the destination node end of a link.
GenomeGraphs.Graphs.distance
— FunctionGet the distance between two node ends involved in a link.
GenomeGraphs.Graphs.is_forwards_from
— FunctionTest if link l
is a forward link leaving node n
.
GenomeGraphs.Graphs.is_backwards_from
— FunctionTest if link l
is a forward link leaving node n
.
GenomeGraphs.Graphs.find_link
— Functionfind_link(sg::SequenceDistanceGraph, src::NodeID, dst::NodeID)
Find and return the link that exists between a source node and a destination node, using their correlative node ids.
In this instance, the IDs also denote the "ends" of a node, a link connects.
Recall that every node has an orientaton: Each node has a positive end (+), and a negative end (-). So when a node is accessed with (or traversed by entering) the positive end the node yields the stored sequence. Conversely, when a node is accessed with (or traversed by entering) the negative end the node yelds the reverse complement of the stored sequence.
Thus find_link(sg, -5, 1)
means you want to find a link in the graph that allows you to exit the (-) end of node 5 (meaning you just traversed it in the canonical + –> - orientation), and enter the (+) end of node 1 (meaning you will also traverse node one in the canonical + –> - orientation).
By contrast find_link(sg, -5, -1)
means you want to find a link in the graph that allows you to exit the (-) end of node 5 - again having just traverse it in the + –> - orientation, but want to enter node 1 through it's (-) end - and thus traversing node 1 in the + <– - orientation, yielding the reverse complement of the canonical sequence of node 1.
If a link connecting the two node ends is present, you get it. If not, you get nothing
. So make sure to check the output.
This method is safe and public, but not reeeeaaallly intended for the end user, as they have to worry about links between "ends" of a node. In contrast to the rest of this framework, which rather has the user thinking in node centric terms. e.g. "I am at node 5 which means I have traversed it forwards or I am at node -5, which means I have traversed node 5 backwards. From here I can visit node 1, -3, and 2, so I may go through nodes 1 and 2 forwards, and through 3 backwards. The methods get_next_nodes
and get_previous_nodes
are more user-friendly and follow this more node-centric mindset.
GenomeGraphs.Graphs.forward_links
— Functionforward_links(sg::SequenceDistanceGraph, n::NodeID)
Get a vector of the links that are ahread of you, as you traverse node n
, continuing forward in you present direction of travel (n
can be a positive or negative node id).
The node id can be positive or negative. For example if you use a positive ID, such as 5, this means you are traversing node 5 in the canonical orientation, and so you will get the links that allow you to leave node 5 in the canonical direction. If you used a negative ID such as -5, that means you are traversing node 5 in the non-canonical orientation, and so you will get the links that allow you to leave node 5 in the non-canonical direction.
GenomeGraphs.Graphs.backward_links
— Functionbackward_links(sg::SequenceDistanceGraph, n::NodeID)
Get a vector of the links that are behind of you, as you traverse node n
, continuing forward in you present direction of travel (n
can be a positive or negative node id).
The node id can be positive or negative. For example if you use a positive ID, such as 5, this means you are traversing node 5 in the canonical orientation, and so you will get the links that would allow you to enter node 5 in the canonical direction. If you used a negative ID such as -5, that means you are traversing node 5 in the non-canonical orientation, and so you will get the links that would allow you to enter node 5 in the non-canonical direction.
GenomeGraphs.Graphs.get_next_nodes
— Functionget_next_nodes(sg::SequenceDistanceGraph, n::NodeID)
Get the nodes you may visit next as you exit node n
, maintaining your current direction of travel (n
can be a positive or negative node ID).
The node id can be positive or negative. For example if you use a positive ID, such as 5, this means you are traversing node 5 in the canonical orientation, and so you will get the nodes you may visit as you leave node 5 in the canonical direction. If you use a negative ID, such as -5, this means you are traversing node 5 in the non-canonical orientation, and so you will get the nodes you may visit as you leave node 5 in the non-canonical direction.
The list of nodes returned are also signed to denote direction: Say you got [1, -2, 3]
as a result of get_next_nodes(sg, 5)
. That means as you leave node 5 after traversing it in the canonical direction, you may proceed to travel through nodes 1 and 3 in the canonical direction, and node 2 in the non-canonical direction.
GenomeGraphs.Graphs.get_previous_nodes
— Functionget_previous_nodes(sg::SequenceDistanceGraph, n::NodeID)
Get the nodes you may have previously been on before you enterd node n
, maintaining your current direction of travel (n
can be a positive or negative node ID).
The node id can be positive or negative. For example if you use a positive ID, such as 5, this means you are traversing node 5 in the canonical orientation, and so you will get the nodes you may have visited prior to entering node 5 in the canonical direction. If you use a negative ID, such as -5, this means you are traversing node 5 in the non-canonical orientation, and so you will get the nodes you may have visited prior to entering node 5 in the non-canonical direction.
The list of nodes returned are also signed to denote direction: Say you got [1, -2, 3]
as a result of get_previous_nodes(sg, 5)
. That means prior to entering node 5 in the canonical direction, you may have traveled through nodes 1 and 3 in the canonical direction, and node 2 in the non-canonical direction.
GenomeGraphs.Graphs.find_tip_nodes
— Functionfind_tip_nodes(sg::SequenceDistanceGraph, min_size::Integer)
Get a set of IDs of all the nodes in the graph sg
that count as "tips".
Here a tip node is defined as a node that only has a single neighbouring node at one of it's ends, and no neighbouring nodes out of one of its other end.
The node's sequence must also be larger than min_size
base pairs in length.
GenomeGraphs.Graphs.find_tip_nodes!
— Functionfind_tip_nodes!(result::Set{NodeID}, sg::SequenceDistanceGraph, min_size::Integer)
Get a set of IDs of all the nodes in the graph sg
that count as "tips".
Here a tip node is defined as a node that only has a single neighbouring node at one of it's ends, and no neighbouring nodes out of one of its other end.
The node's sequence must also be larger than min_size
base pairs in length.
This method modifys a result
input. So if you want to find tips in a graph repeatedly, you can resuse a Set
, saving you some allocation costs.
GenomeGraphs.Graphs.find_all_unitigs
— Functionfind_all_unitigs(sg::G, min_nodes::Integer) where {G<:SequenceDistanceGraph}
Find and return a vector of paths through the graph that represent all the unitigs or transitive paths in the graphs. Such paths are defined as a chain of nodes with only one neighbour. Such simple regions of the graph can safely be collapsed into one larger node.
GenomeGraphs.Graphs.find_all_unitigs!
— Functionfind_all_unitigs!(unitigs::Vector{SequenceDistanceGraphPath{G}}, sg::G, min_nodes::Integer) where {G<:SequenceDistanceGraph}
Find and return a vector of paths through the graph that represent all the unitigs or transitive paths in the graphs. Such paths are defined as a chain of nodes with only one neighbour. Such simple regions of the graph can safely be collapsed into one larger node.
This vector modifies an input unitigs
vector to contain the result. This is useful for situations where you want to repeatedly find unitigs in the graph to save on additional allocations.
Graph IO
GenomeGraphs.Graphs.write_to_gfa1
— Functionwrite_to_gfa1(sg::SequenceDistanceGraph, filename::String)
Write a graph from a GFAv1 formatted file, and associated FASTA file of node sequences.
The GFA format permits storing sequences of the graph nodes in a seperate fasta file, instead of in the GFA file. This is so as the sequences of the graph nodes can be easily fed into other tools that typically accept FASTA files as input. Many assemblers also output a GFA + FASTA combo. Therefore, this method writes node sequences to a file called [filename].fasta
, and the graph structure to a file called [filename].gfa
.
GenomeGraphs.Graphs.load_from_gfa1!
— Functionload_from_gfa1!(sg::SequenceDistanceGraph{S}, gfafile::AbstractString, fafile::AbstractString) where {S<:BioSequence}
Load a graph from a GFAv1 formatted file, and associated FASTA file of node sequences.
The GFA format permits storing sequences of the graph nodes in a seperate fasta file, instead of in the GFA file. This is so as the sequences of the graph nodes can be easily fed into other tools that typically accept FASTA files as input. Many assemblers also output a GFA + FASTA combo. Therefore, this method asks for the filepath of a GFAv1 file, as well as a filepath to a FASTA formatted file. This method reads the node sequences from the FASTA file, before getting the links between nodes from the GFAv1 file.
Internal / Unsafe methods
Graph nodes and sequences
GenomeGraphs.Graphs.empty_seq
— Functionempty_seq(::Type{LongSequence{DNAAlphabet{2}}})
Get a reference to an empty 2-bit long dna sequence.
In this module, a single empty sequence is allocated, and every call to empty_seq
returns a reference to that same allocated sequence.
This saves on allocations if you are deleting many nodes in a SequenceDistanceGraph.
empty_seq(::Type{LongSequence{DNAAlphabet{4}}})
Get a reference to an empty 4-bit long dna sequence.
In this module, a single empty sequence is allocated, and every call to empty_seq
returns a reference to that same allocated sequence.
This saves on allocations if you are deleting many nodes in a SequenceDistanceGraph.
GenomeGraphs.Graphs.is_deleted
— FunctionCheck if a SDGNode is a deleted.
GenomeGraphs.Graphs.unsafe_sequence
— Functionunsafe_sequence(n::SDGNode{S}) where {S<:BioSequence}
Get the reference to a node's underlying sequence object.
You get a reference to the node's sequence object - not a copy. So doing transformation operations (reverse_complement, setindex, etc.) to it will probably screw up the graph!
unsafe_sequence(sg::SequenceDistanceGraph, n::NodeID)
Get the reference to a node's underlying sequence object.
This method is unsafe, no checking of node id's occurs and you get a reference to the node's sequence object - not a copy, so doing transformation operations (reverse_complement, setindex, etc.) to it will probably screw up the graph!
GenomeGraphs.Graphs.check_node_id
— Functioncheck_node_id(sg::SequenceDistanceGraph, i::NodeID)
The method responsible for checking that a node id used as input to a method that queries or edits a SequenceDistanceGraph
is a sensible value.
GenomeGraphs.Graphs.nodes
— FunctionGet a reference to the vector of nodes in a graph sg
.
It is a bad idea to edit this vector yourself unless you know what you are doing.
GenomeGraphs.Graphs.node_unsafe
— Functionnode_unsafe(sg::SequenceDistanceGraph, n::NodeID)
Get a reference specific node from a sequence distance graph sg
using its correlative node id n
.
node_unsafe
accepts a NodeID that can be positive or negative. E.g. providing either 5 or -5 both mean node 5 in a graph, and so you will get the node for node 5.
This method returns a reference to a graph's underlying SDGNode
. NOT a copy! Messing with this will screw up your graph. So this method is not recommended or exported - unless you really know what you're doing.
This method is explicitly marked unsafe for a reason. It does zero checking of the value it is passed as the node id. It also makes use of the @inbounds
macro. This makes it faster, but you need to be 100% sure that your code won't try to call this with a bad id. So it is not exported or recommended
- unless you really know what you're doing.
GenomeGraphs.Graphs.node
— Functionnode(sg::SequenceDistanceGraph, n::NodeID)
Get a reference to a specific node from a sequence distance graph sg
using its correlative node id n
.
node
accepts a NodeID that can be positive or negative. E.g. providing either 5 or -5 both mean node 5 in a graph, and so you will get the links for node 5.
This method returns a reference to a graph's underlying SDGNode
. NOT a copy! Messing with this will screw up your graph. So this method is not recommended or exported - unless you really know what you're doing.
Graph topology
GenomeGraphs.Graphs.links
— FunctionGet a reference to the vector of vectors of links in a graph sg
.
It is a bad idea to edit this vector yourself unless you know what you are doing.
GenomeGraphs.Graphs.linksof_unsafe
— Functionlinksof_unsafe(sg::SequenceDistanceGraph, n::NodeID)
Get a reference to a vector storing all the links of a node in a SequenceDistanceGraph
, the node is specified using its correlative node id n
.
links_unsafe
accepts a NodeID that can be positive or negative. E.g. providing either 5 or -5 both mean node 5 in a graph, and so you will get the links for node 5.
This method returns a reference to an underlying links vector that the SequenceDistanceGraph
owns - NOT a copy! Messing with this will screw up your graph. So this method is not recommended or exported - unless you really know what you're doing.
This method is explicitly marked unsafe for a reason. It does zero checking of the value it is passed as the node id. It also makes use of the @inbounds
macro. This makes it faster, but you need to be 100% sure that your code won't try to call this with a bad id. So it is not exported or recommended
- unless you really know what you're doing.
Graph editing and manipulation
GenomeGraphs.Graphs.add_node!
— Functionadd_node!(sg::SequenceDistanceGraph{S}, n::SDGNode{S}) where {S<:BioSequence}
Add a node n
to a graph sg
.
Returns the node ID used to access the new node added in the graph.
We don't enforce the sequence in the node is canonical here. We just trust that it is canonical.
Adding a node to the graph does just that. After adding the node it still will not be linked to any other nodes.
add_node!(sg::SequenceDistanceGraph{S}, seq::BioSequence) where {S<:BioSequence}
Add a sequence to a sequence distance graph as a node.
Returns the node ID used to access the new node added in the graph.
Can accept any sequence type and will attempt to coerce the input sequence to the type required by the graph.
We don't enforce the sequence in the node is canonical here. We just trust that it is canonical.
Adding a node to the graph does just that. After adding the node it still will not be linked to any other nodes.
GenomeGraphs.Graphs.remove_node!
— Functionremove_node!(sg::SequenceDistanceGraph{S}, n::NodeID) where {S<:BioSequence}
Remove a node from a sequence distance graph.
This method can accepts a NodeID that can be positive or negative. E.g. providing either 5 or -5 both mean node 5 in a graph, and so you will end up deleting node 5.
Links involving this node will also be removed from the graph.
GenomeGraphs.Graphs.add_link!
— Functionadd_link!(sg::SequenceDistanceGraph, source::NodeID, dest::NodeID, dist::Int)
Construct a link between two node ends in a sequence Graph.
GenomeGraphs.Graphs.remove_link!
— Functionremove_link!(sg::SequenceDistanceGraph, src::NodeID, dst::NodeID)
Remove a link between two nodes in a SequenceDistanceGraph. Returns a boolean indicating whether the removal was successful. Reasons this function would not return true
include that the link didn't exist in the graph, and so could not be removed.
GenomeGraphs.Graphs.disconnect_node!
— FunctionRemoves all the links in the collection from and to a given nodeID.
GenomeGraphs.Graphs.collapse_all_unitigs!
— Functioncollapse_all_unitigs!(unitigs::Vector{SequenceDistanceGraphPath{G}}, newnodes::Vector{NodeID}, sg::G, min_nodes::Integer, consume::Bool) where {G<:SequenceDistanceGraph}
Detects all of the trivial paths through the graph that define unitigs. Such paths are defined as a chain of nodes with only one neighbour. Each such simple paths through the graph can safely be collapsed into one larger node, which is what this function does.
This method will edit unitigs
so as after this method returns it contains All the paths that were detected as unitigs in the graph. This method also edits the newnodes
vector, so as after this method returns it contains the IDs of all new nodes that were created as a result of collapsing the paths in unitigs
.
Modifies the SequenceDistanceGraph sg
.
collapse_all_unitigs!(sg::SequenceDistanceGraph, min_nodes::Integer, consume::Bool)
Detects all of the trivial paths through the graph that define unitigs. Such paths are defined as a chain of nodes with only one neighbour. Each such simple paths through the graph can safely be collapsed into one larger node, which is what this function does.
Modifies the SequenceDistanceGraph sg
.