Package Guide

Installation

You can install GenomeGraphs from the julia REPL. Press ] to enter pkg mode, and enter the following:

pkg> add GenomeGraphs

If you are interested in the cutting edge of the development, please check out the master branch to try new features before release.

Creating your first WorkSpace

You can create an empty genome graph workspace with the empty constructor:

using GenomeGraphs

ws = WorkSpace()

Obviously this workspace is quite useless on its own!

You need a genome graph and information to project onto it before you can do any exploration or analysis. There are a few ways to get your first graph:

Construct a de-Bruijn graph from raw sequencing reads.
Load a graph (such as produced from another assembler) from a GFAv1 file

Constructing your first de-Bruijn graph

Let's see how to do option number 1, and construct a de-Bruijn graph from raw sequencing reads. This can be achieved with a few simple steps:

Prepare the sequencing reads & build a datastore.
Add the read datastore to a WorkSpace.
Run the dbg process.
Run the tip removal process.

Preparing the sequencing reads

Let's prepare the sequencing reads.

WorkSpaces store sequencing reads in ReadDatastores, provided by the ReadDatastores.jl package.

GenomeGraphs uses (and re-exports types and methods from) ReadDatastores. ReadDatastores is a standalone package in its own right (although it was built in the first place for GenomeGraphs).

If you want to use ReadDatastores as a standalone package in another Bio(Julia) based project (and we recommend you do - the data stores are more efficient than text files), you can find standalone docs for the package here. Some types and methods documented there, are repeated in the Library section of this manual here, for convenience.

Anyway, let's see how to build a paired end reads datastore!

using GenomeGraphs

fwq = open(FASTQ.Reader, "test/ecoli_pe_R1.fastq")
rvq = open(FASTQ.Reader, "test/ecoli_pe_R2.fastq")

ds = PairedReads(fwq, rvq, "ecoli-test-paired", "my-ecoli", 250, 300, 0, FwRv)

Here "ecoli-test-paired" is provided as the base filename of the datastore, the datastore is given the name of "my-ecoli", this name will be used to identify it in the workspace later.

The minimum length for the reads is set at 250 base pairs, and the maximum length is set to 300 base pairs. Reads that are too short will be discarded, reads that are too long are truncated.

Note

The insert size of the paired reads to 0, since I'm not sure of it and right now the value is optional.

I set the orientation of the paired reads to FwRv. This is the default, and means for every pair of reads, read 1 is oriented in the forward direction, and read 2 is oriented backwards (forwards on the opposite strand). This orientation distinguishes regular paired-end reads from other paired read types like Long Mate Pairs.

Add datastore to the WorkSpace

Now the datastore is created, it can be added to a workspace.

ws = WorkSpace()
add_paired_reads!(ws, ds)

Run the `dbg` process

GenomeGraphs comes with some very high-level methods in it's API, that we like to call processes. They perform some critical and common task as part of a larger workflow. Examples include constructing a de-Bruijn graph from sequencing reads, mapping reads to a graph, kmer counting and so on.

Once a workspace has an attached read datastore, you can run the dbg process to produce a first de-Bruijn graph of the genome.

dbg!(BigDNAMer{61}, 10, ws, "my-ecoli")

Warning

Some steps of this process, especially the Kmer counting steps, may take a long time for big inputs. No parallelism or batching to disk is used currently (although it is planned). This process should take just about a minute for E.coli paired end reads with a decent coverage. So, be warned, for big stuff, performance may suuuuuuucck in these early days!

Run the `remove_tips` process

Now you have a raw compressed de-Bruijn assembly graph. You can start to use it for analyses, and also try to improve its structure and resolve parts of the graph that represent error, repetitive content, and so forth. Some of these structures can be identified and resolved using only the topology of the graph, some require additional information sources (linked reads / long reads / and so on), to be incorporated into the workspace first.

Here let's see how to resolve a common structure, using only the graph topology.

Let's fix the tips of the graph.

Tips looks like this:

tips

See how a piece of the assembly which should be one long stretch of sequence is broken into 3 pieces (red) because of the existence of two tips (blue). Such tips are defined topologically as very short segments which have one incoming neighbour, and no outgoing neighbour.

Tips are caused by sequencer errors that occur at the end of reads, because the sequencing by synthesis technique becomes more error prone over time; reagents are consumed and products generated as time progresses, making the base detection more difficult. Hence errors occur at the ends of reads, and erroneous kmers from the read ends are unlikely to have forward neighbours, and they end up forming tip nodes when they are incorporated into the de-Bruijn graph.

you can remove these tips and improve the contiguity of the graph by using the remove_tips! process.

remove_tips!(ws, 200)

The value of 200 provided is a size (in base pairs) threshold: Nodes are considered tips only if they have one ingoing neighbour, no outgoing neighbours, and are (in this case) smaller than 200 base pairs in length.

Once this process is finished you will have a collapsed de-Bruijn graph with the tips removed.

Using the NodeView interface

Graphs can be complicated data structures. If you want an in depth explanation of the data-structure used to represent genome graphs, then head here.

However, most people should not have to care about the internal structure of the graph.

To make things as simple as possible, the NodeView type provides a single entry point for node-centric analyses. The NodeView wraps a point to a workspace's graph and contains a node id. A NodeView gives you acces to a node's underlying sequence, the nodes neighbouring nodes in the forward and backward directions, the reads mapped to a node, and kmer coverage over the node.

To get a NodeView of a node, use the node method on a WorkSpace, providing a node id number. A positive ID denotes a view of the node traversing it in the forward (canonical) direction. A negative ID denotes a view of the node, traversing it in the reverse complement direction.

julia> n = node(ws, 3)
A view of a graph node (node: 3, graph: sdg):
  AAAAAACCTCCGCAACCCCATGTTTTCACATAACTGTTG…GCCATGACCGGCTGGCTGTCAGGCTGTCACTGATAATCA