API: The Graphs submodule

The Graphs submodule contains the core graph data-structure used in GenomeGraphs.jl: The SequenceDistanceGraph.

It also defines some of the common methods used in the rest of the higher-level parts GenomeGraphs.jl:

  • To query nodes/sequences in the graph.
  • To find certain topological structures.
  • To edit graph structure.
  • To write/load SequenceDistanceGraph's to/from file.

Some methods, particularly ones that access the sequences stored in the graph, and to edit topology, are marked as unsafe and are not exported deliberately.

All types and methods defined in this module are documented here.

Note

This is a reference of an internal sub-module's API for developers and experienced users. First ask yourself if what you need isn't covered by the higher-level WorkSpace API.

Types

GenomeGraphs.Graphs.SequenceDistanceGraphType

The SequenceDistanceGraph is a representation of a genome assembly.

The SequenceDistanceGraph stores sequences in a vector of SequenceDistanceGraphNodes (or SDGNode for short).

A singe node represents a sequence and its reverse complement.

Every node also has an orientaton: Each node has a positive end (+), and a negative end (-). So when a node is accessed with (or traversed by entering) the positive end the node yields the stored sequence. Conversely, when a node is accessed with (or traversed by entering) the negative end the node yelds the reverse complement of the stored sequence. In this way the positive end can be thought of as the sequence start of the node's canonical sequence, and the negative end can be thought of as its end.

Thankfully, you don't need to think about "positive node ends" and "negative node ends" most of the time you interact with a SequenceDistanceGraph: In a SequenceDistanceGraph, every has a correlative ID starting from 1. For every node stored in the graph with an ID X, the negative ID -X is mapped to the reverse complement of that node. This mapping is virtual: Only one node is stored in the graph. To see more about how these positive and negative node IDs work, refer to the docs for methods such as get_next_nodes and get_previous_nodes. These high level functions simplify coding graph traversal A LOT.

This SequenceDistanceGraph type is only intended to be interacted with directly by developers and users who know what they are doing. Most tasks an end user wants to do to manipulate or query a graph can be achieved through the WorkSpace type and the NodeView interface.

GenomeGraphs.Graphs.SDGNodeType

The SDGNode type represents a node in a SequenceDistanceGraph.

At present it contains only two fields, first it holds an instance of a BioSequences BioSequence type.

Secondly, it tracks a flag which indicates if the node has been deleted or not.

Note

The deleted flag allows us to mark nodes in the graph as deleted, which can be of help in some algorithms where the graph structure is being edited (merging nodes for example).

Actually deleting the node would shift node IDs and require redoing all links in the graph and so on.

So just marking a node as deleted and not using it anymore is a lazy but sometimes helpful choice.

GenomeGraphs.Graphs.SequenceDistanceGraphLinkType

A SequenceDistanceGraphLink or SDGLink represents a single distance between two sequences in a SequenceDistanceGraph.

Every link describes a connection between two node ends and contains a distance (they take the form ([+, -]n1, [+, -]n2, [+, -]dist)).

Note

A link connects two node ends, and so the order of the signed nodes in the links does not change the link.

Note

If the distance in a link is negative, this represents an overlap between two sequences. These overlaps must be perfect overlaps.

Note

Two links are considered equal (== will return true) if the source and the destination fields of the two links are the same, the distance or dist field is not considered!

Public / Safe methods

Graph nodes and sequences

GenomeGraphs.Graphs.sequenceFunction
sequence(sg::SequenceDistanceGraph, n::NodeID)

Get the full sequence of a node in a sequence distance graph using its correlative node id n.

Note

sequence accepts a NodeID that can be positive or negative. Nodes represent stretches of sequence in a canonical orientation, if you ask for for the sequence of say the third node, the positive node id 3 (which denotes traversing the third node in the forward direction), gives you the canonical sequence. If you use the negative ID -3 (which denotes traversing the third node in the reverse direction), you will get the reverse complement of the node's canonical (forward) sequence.

Note

It is safe to modify the returned sequence without screwing up your graph, yet thanks to BioSequences.jl's copy on write system for LongSequences, data copying will only occur if nessecery. You get the best of both worlds.

Graph Topology

GenomeGraphs.Graphs.find_linkFunction
find_link(sg::SequenceDistanceGraph, src::NodeID, dst::NodeID)

Find and return the link that exists between a source node and a destination node, using their correlative node ids.

In this instance, the IDs also denote the "ends" of a node, a link connects.

Recall that every node has an orientaton: Each node has a positive end (+), and a negative end (-). So when a node is accessed with (or traversed by entering) the positive end the node yields the stored sequence. Conversely, when a node is accessed with (or traversed by entering) the negative end the node yelds the reverse complement of the stored sequence.

Thus find_link(sg, -5, 1) means you want to find a link in the graph that allows you to exit the (-) end of node 5 (meaning you just traversed it in the canonical + –> - orientation), and enter the (+) end of node 1 (meaning you will also traverse node one in the canonical + –> - orientation).

By contrast find_link(sg, -5, -1) means you want to find a link in the graph that allows you to exit the (-) end of node 5 - again having just traverse it in the + –> - orientation, but want to enter node 1 through it's (-) end - and thus traversing node 1 in the + <– - orientation, yielding the reverse complement of the canonical sequence of node 1.

If a link connecting the two node ends is present, you get it. If not, you get nothing. So make sure to check the output.

Note

This method is safe and public, but not reeeeaaallly intended for the end user, as they have to worry about links between "ends" of a node. In contrast to the rest of this framework, which rather has the user thinking in node centric terms. e.g. "I am at node 5 which means I have traversed it forwards or I am at node -5, which means I have traversed node 5 backwards. From here I can visit node 1, -3, and 2, so I may go through nodes 1 and 2 forwards, and through 3 backwards. The methods get_next_nodes and get_previous_nodes are more user-friendly and follow this more node-centric mindset.

GenomeGraphs.Graphs.forward_linksFunction
forward_links(sg::SequenceDistanceGraph, n::NodeID)

Get a vector of the links that are ahread of you, as you traverse node n, continuing forward in you present direction of travel (n can be a positive or negative node id).

The node id can be positive or negative. For example if you use a positive ID, such as 5, this means you are traversing node 5 in the canonical orientation, and so you will get the links that allow you to leave node 5 in the canonical direction. If you used a negative ID such as -5, that means you are traversing node 5 in the non-canonical orientation, and so you will get the links that allow you to leave node 5 in the non-canonical direction.

GenomeGraphs.Graphs.backward_linksFunction
backward_links(sg::SequenceDistanceGraph, n::NodeID)

Get a vector of the links that are behind of you, as you traverse node n, continuing forward in you present direction of travel (n can be a positive or negative node id).

The node id can be positive or negative. For example if you use a positive ID, such as 5, this means you are traversing node 5 in the canonical orientation, and so you will get the links that would allow you to enter node 5 in the canonical direction. If you used a negative ID such as -5, that means you are traversing node 5 in the non-canonical orientation, and so you will get the links that would allow you to enter node 5 in the non-canonical direction.

GenomeGraphs.Graphs.get_next_nodesFunction
get_next_nodes(sg::SequenceDistanceGraph, n::NodeID)

Get the nodes you may visit next as you exit node n, maintaining your current direction of travel (n can be a positive or negative node ID).

Note

The node id can be positive or negative. For example if you use a positive ID, such as 5, this means you are traversing node 5 in the canonical orientation, and so you will get the nodes you may visit as you leave node 5 in the canonical direction. If you use a negative ID, such as -5, this means you are traversing node 5 in the non-canonical orientation, and so you will get the nodes you may visit as you leave node 5 in the non-canonical direction.

Note

The list of nodes returned are also signed to denote direction: Say you got [1, -2, 3] as a result of get_next_nodes(sg, 5). That means as you leave node 5 after traversing it in the canonical direction, you may proceed to travel through nodes 1 and 3 in the canonical direction, and node 2 in the non-canonical direction.

GenomeGraphs.Graphs.get_previous_nodesFunction
get_previous_nodes(sg::SequenceDistanceGraph, n::NodeID)

Get the nodes you may have previously been on before you enterd node n, maintaining your current direction of travel (n can be a positive or negative node ID).

Note

The node id can be positive or negative. For example if you use a positive ID, such as 5, this means you are traversing node 5 in the canonical orientation, and so you will get the nodes you may have visited prior to entering node 5 in the canonical direction. If you use a negative ID, such as -5, this means you are traversing node 5 in the non-canonical orientation, and so you will get the nodes you may have visited prior to entering node 5 in the non-canonical direction.

Note

The list of nodes returned are also signed to denote direction: Say you got [1, -2, 3] as a result of get_previous_nodes(sg, 5). That means prior to entering node 5 in the canonical direction, you may have traveled through nodes 1 and 3 in the canonical direction, and node 2 in the non-canonical direction.

GenomeGraphs.Graphs.find_tip_nodesFunction
find_tip_nodes(sg::SequenceDistanceGraph, min_size::Integer)

Get a set of IDs of all the nodes in the graph sg that count as "tips".

Here a tip node is defined as a node that only has a single neighbouring node at one of it's ends, and no neighbouring nodes out of one of its other end.

The node's sequence must also be larger than min_size base pairs in length.

GenomeGraphs.Graphs.find_tip_nodes!Function
find_tip_nodes!(result::Set{NodeID}, sg::SequenceDistanceGraph, min_size::Integer)

Get a set of IDs of all the nodes in the graph sg that count as "tips".

Here a tip node is defined as a node that only has a single neighbouring node at one of it's ends, and no neighbouring nodes out of one of its other end.

The node's sequence must also be larger than min_size base pairs in length.

Note

This method modifys a result input. So if you want to find tips in a graph repeatedly, you can resuse a Set, saving you some allocation costs.

GenomeGraphs.Graphs.find_all_unitigsFunction
find_all_unitigs(sg::G, min_nodes::Integer) where {G<:SequenceDistanceGraph}

Find and return a vector of paths through the graph that represent all the unitigs or transitive paths in the graphs. Such paths are defined as a chain of nodes with only one neighbour. Such simple regions of the graph can safely be collapsed into one larger node.

GenomeGraphs.Graphs.find_all_unitigs!Function
find_all_unitigs!(unitigs::Vector{SequenceDistanceGraphPath{G}}, sg::G, min_nodes::Integer) where {G<:SequenceDistanceGraph}

Find and return a vector of paths through the graph that represent all the unitigs or transitive paths in the graphs. Such paths are defined as a chain of nodes with only one neighbour. Such simple regions of the graph can safely be collapsed into one larger node.

Note

This vector modifies an input unitigs vector to contain the result. This is useful for situations where you want to repeatedly find unitigs in the graph to save on additional allocations.

Graph IO

GenomeGraphs.Graphs.write_to_gfa1Function
write_to_gfa1(sg::SequenceDistanceGraph, filename::String)

Write a graph from a GFAv1 formatted file, and associated FASTA file of node sequences.

Note

The GFA format permits storing sequences of the graph nodes in a seperate fasta file, instead of in the GFA file. This is so as the sequences of the graph nodes can be easily fed into other tools that typically accept FASTA files as input. Many assemblers also output a GFA + FASTA combo. Therefore, this method writes node sequences to a file called [filename].fasta, and the graph structure to a file called [filename].gfa.

GenomeGraphs.Graphs.load_from_gfa1!Function
load_from_gfa1!(sg::SequenceDistanceGraph{S}, gfafile::AbstractString, fafile::AbstractString) where {S<:BioSequence}

Load a graph from a GFAv1 formatted file, and associated FASTA file of node sequences.

Note

The GFA format permits storing sequences of the graph nodes in a seperate fasta file, instead of in the GFA file. This is so as the sequences of the graph nodes can be easily fed into other tools that typically accept FASTA files as input. Many assemblers also output a GFA + FASTA combo. Therefore, this method asks for the filepath of a GFAv1 file, as well as a filepath to a FASTA formatted file. This method reads the node sequences from the FASTA file, before getting the links between nodes from the GFAv1 file.

Internal / Unsafe methods

Graph nodes and sequences

GenomeGraphs.Graphs.empty_seqFunction
empty_seq(::Type{LongSequence{DNAAlphabet{2}}})

Get a reference to an empty 2-bit long dna sequence.

In this module, a single empty sequence is allocated, and every call to empty_seq returns a reference to that same allocated sequence.

This saves on allocations if you are deleting many nodes in a SequenceDistanceGraph.

empty_seq(::Type{LongSequence{DNAAlphabet{4}}})

Get a reference to an empty 4-bit long dna sequence.

In this module, a single empty sequence is allocated, and every call to empty_seq returns a reference to that same allocated sequence.

This saves on allocations if you are deleting many nodes in a SequenceDistanceGraph.

GenomeGraphs.Graphs.unsafe_sequenceFunction
unsafe_sequence(n::SDGNode{S}) where {S<:BioSequence}

Get the reference to a node's underlying sequence object.

Warning

You get a reference to the node's sequence object - not a copy. So doing transformation operations (reverse_complement, setindex, etc.) to it will probably screw up the graph!

unsafe_sequence(sg::SequenceDistanceGraph, n::NodeID)

Get the reference to a node's underlying sequence object.

Warning

This method is unsafe, no checking of node id's occurs and you get a reference to the node's sequence object - not a copy, so doing transformation operations (reverse_complement, setindex, etc.) to it will probably screw up the graph!

GenomeGraphs.Graphs.check_node_idFunction
check_node_id(sg::SequenceDistanceGraph, i::NodeID)

The method responsible for checking that a node id used as input to a method that queries or edits a SequenceDistanceGraph is a sensible value.

GenomeGraphs.Graphs.nodesFunction

Get a reference to the vector of nodes in a graph sg.

Warning

It is a bad idea to edit this vector yourself unless you know what you are doing.

GenomeGraphs.Graphs.node_unsafeFunction
node_unsafe(sg::SequenceDistanceGraph, n::NodeID)

Get a reference specific node from a sequence distance graph sg using its correlative node id n.

Note

node_unsafe accepts a NodeID that can be positive or negative. E.g. providing either 5 or -5 both mean node 5 in a graph, and so you will get the node for node 5.

Warning

This method returns a reference to a graph's underlying SDGNode. NOT a copy! Messing with this will screw up your graph. So this method is not recommended or exported - unless you really know what you're doing.

Warning

This method is explicitly marked unsafe for a reason. It does zero checking of the value it is passed as the node id. It also makes use of the @inbounds macro. This makes it faster, but you need to be 100% sure that your code won't try to call this with a bad id. So it is not exported or recommended

  • unless you really know what you're doing.
GenomeGraphs.Graphs.nodeFunction
node(sg::SequenceDistanceGraph, n::NodeID)

Get a reference to a specific node from a sequence distance graph sg using its correlative node id n.

Note

node accepts a NodeID that can be positive or negative. E.g. providing either 5 or -5 both mean node 5 in a graph, and so you will get the links for node 5.

Warning

This method returns a reference to a graph's underlying SDGNode. NOT a copy! Messing with this will screw up your graph. So this method is not recommended or exported - unless you really know what you're doing.

Graph topology

GenomeGraphs.Graphs.linksFunction

Get a reference to the vector of vectors of links in a graph sg.

Warning

It is a bad idea to edit this vector yourself unless you know what you are doing.

GenomeGraphs.Graphs.linksof_unsafeFunction
linksof_unsafe(sg::SequenceDistanceGraph, n::NodeID)

Get a reference to a vector storing all the links of a node in a SequenceDistanceGraph, the node is specified using its correlative node id n.

Note

links_unsafe accepts a NodeID that can be positive or negative. E.g. providing either 5 or -5 both mean node 5 in a graph, and so you will get the links for node 5.

Warning

This method returns a reference to an underlying links vector that the SequenceDistanceGraph owns - NOT a copy! Messing with this will screw up your graph. So this method is not recommended or exported - unless you really know what you're doing.

Warning

This method is explicitly marked unsafe for a reason. It does zero checking of the value it is passed as the node id. It also makes use of the @inbounds macro. This makes it faster, but you need to be 100% sure that your code won't try to call this with a bad id. So it is not exported or recommended

  • unless you really know what you're doing.

Graph editing and manipulation

GenomeGraphs.Graphs.add_node!Function
add_node!(sg::SequenceDistanceGraph{S}, n::SDGNode{S}) where {S<:BioSequence}

Add a node n to a graph sg.

Returns the node ID used to access the new node added in the graph.

Warning

We don't enforce the sequence in the node is canonical here. We just trust that it is canonical.

Warning

Adding a node to the graph does just that. After adding the node it still will not be linked to any other nodes.

add_node!(sg::SequenceDistanceGraph{S}, seq::BioSequence) where {S<:BioSequence}

Add a sequence to a sequence distance graph as a node.

Returns the node ID used to access the new node added in the graph.

Can accept any sequence type and will attempt to coerce the input sequence to the type required by the graph.

Warning

We don't enforce the sequence in the node is canonical here. We just trust that it is canonical.

Warning

Adding a node to the graph does just that. After adding the node it still will not be linked to any other nodes.

GenomeGraphs.Graphs.remove_node!Function
remove_node!(sg::SequenceDistanceGraph{S}, n::NodeID) where {S<:BioSequence}

Remove a node from a sequence distance graph.

Note

This method can accepts a NodeID that can be positive or negative. E.g. providing either 5 or -5 both mean node 5 in a graph, and so you will end up deleting node 5.

Note

Links involving this node will also be removed from the graph.

GenomeGraphs.Graphs.add_link!Function
add_link!(sg::SequenceDistanceGraph, source::NodeID, dest::NodeID, dist::Int)

Construct a link between two node ends in a sequence Graph.

GenomeGraphs.Graphs.remove_link!Function
remove_link!(sg::SequenceDistanceGraph, src::NodeID, dst::NodeID)

Remove a link between two nodes in a SequenceDistanceGraph. Returns a boolean indicating whether the removal was successful. Reasons this function would not return true include that the link didn't exist in the graph, and so could not be removed.

GenomeGraphs.Graphs.collapse_all_unitigs!Function
collapse_all_unitigs!(unitigs::Vector{SequenceDistanceGraphPath{G}}, newnodes::Vector{NodeID}, sg::G, min_nodes::Integer, consume::Bool) where {G<:SequenceDistanceGraph}

Detects all of the trivial paths through the graph that define unitigs. Such paths are defined as a chain of nodes with only one neighbour. Each such simple paths through the graph can safely be collapsed into one larger node, which is what this function does.

Note

This method will edit unitigs so as after this method returns it contains All the paths that were detected as unitigs in the graph. This method also edits the newnodes vector, so as after this method returns it contains the IDs of all new nodes that were created as a result of collapsing the paths in unitigs.

Note

Modifies the SequenceDistanceGraph sg.

collapse_all_unitigs!(sg::SequenceDistanceGraph, min_nodes::Integer, consume::Bool)

Detects all of the trivial paths through the graph that define unitigs. Such paths are defined as a chain of nodes with only one neighbour. Each such simple paths through the graph can safely be collapsed into one larger node, which is what this function does.

Note

Modifies the SequenceDistanceGraph sg.