name: titleslide slide_author: Ben J. Ward title: How BioJulia is building for the future. author: Ben J. Ward class: center, middle # How biojulia is building for the future <object data="fonts/SinhalaMN/biojulia.svg" type="image/svg+xml" /> --- name: whoami1 slide_author: Ben J. Ward title: How BioJulia is building for the future. author: Ben J. Ward class: center, middle # > whoami **Ben** / **Ward9250** Evolutionary Biology/Genetics, Bioinformatics, Programming - I like them! Games and Batman - I also like them! .center[] --- name: whoami2 slide_author: Ben J. Ward title: How BioJulia is building for the future. author: Ben J. Ward # > whoami ### Roles: - PhD Candidate (Evolutionary and Population Genetics) @ **University of East Anglia** - van Oosterhout Group @ The School of Environmental Sciences - c.van-oosterhout@uea.ac.uk - Software Developer @ **The Genome Analysis Center** - Plant and Microbial Genomics - Matt Clark Group - matt.clark@tgac.ac.uk --- name: whoami3 slide_author: Ben J. Ward title: How BioJulia is building for the future. author: Ben J. Ward # > whoami ### Projects - **BioJulia** - Slidewinder - HybridCheck - Evolution of Polar Diatoms. - Plant pathogens genomics; Hybridisation and Introgression. --- name: contact slide_author: Ben J. Ward title: How BioJulia is building for the future. author: Ben J. Ward # Reaching me I use the handle "Ward9250" where possible. - @Ward9250 - ben.ward@tgac.ac.uk - webtocome - https://github.com/Ward9250 - https://keybase.io/Ward9250 <- Talk to me all secret-agent like! --- name: julialogo slide_author: Ben J. Ward title: How BioJulia is building for the future. author: Ben J. Ward class: center, middle <svg version="1.1" id="Layer_1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px" y="0px" width="210px" height="142px" viewBox="0 0 310 216" enable-background="new 0 0 310 216" xml:space="preserve"> <!-- blue dot --> <circle fill="#6b85dd" stroke="#4266d5" stroke-width="3" cx="50.5" cy="58.665" r="16.5"></circle> <!-- red dot --> <circle fill="#d66661" stroke="#c93d39" stroke-width="3" cx="212.459" cy="60.249" r="16.5"></circle> <!-- green dot --> <circle fill="#6bab5b" stroke="#3b972e" stroke-width="3" cx="233.834" cy="23.874" r="16.5"></circle> <!-- purple dot --> <circle fill="#aa7dc0" stroke="#945bb0" stroke-width="3" cx="255.459" cy="59.999" r="16.5"></circle> <!-- "j" --> <path fill="#252525" d="M37.216,138.427c0-15.839,0.006-31.679-0.018-47.517c-0.001-0.827,0.169-1.234,1.043-1.47 c7.876-2.127,15.739-4.308,23.606-6.47c1.33-0.366,1.333-0.36,1.333,1.019c0,25.758,0.015,51.517-0.012,77.274 c-0.006,5.514,0.245,11.032-0.272,16.543c-0.628,6.69-2.15,13.092-6.438,18.506c-3.781,4.771-8.898,7.25-14.767,8.338 c-6.599,1.222-13.251,1.552-19.934,0.938c-4.616-0.423-9.045-1.486-12.844-4.363c-2.863-2.168-4.454-4.935-3.745-8.603 c0.736-3.806,3.348-5.978,6.861-7.127c2.262-0.74,4.628-0.872,6.994-0.53c1.823,0.264,3.42,1.023,4.779,2.288 c1.38,1.284,2.641,2.674,3.778,4.177c0.872,1.15,1.793,2.256,2.991,3.086c2.055,1.426,4,0.965,5.213-1.216 c0.819-1.473,0.997-3.106,1.173-4.731c0.255-2.348,0.255-4.707,0.256-7.062C37.218,167.145,37.216,152.786,37.216,138.427z"></path> <!-- "u" --> <path fill="#252525" d="M125.536,162.479c-2.908,2.385-5.783,4.312-8.88,5.904c-10.348,5.323-20.514,4.521-30.324-1.253 c-6.71-3.95-11.012-9.849-12.52-17.606c-0.236-1.213-0.363-2.438-0.363-3.688c0.01-19.797,0.017-39.593-0.02-59.39 c-0.002-1.102,0.285-1.357,1.363-1.351c7.798,0.049,15.597,0.044,23.396,0.003c0.95-0.005,1.177,0.25,1.175,1.183 c-0.027,19.356-0.025,38.713-0.018,58.07c0.002,6.34,3.599,10.934,9.672,12.42c2.13,0.521,4.19,0.396,6.173-0.6 c4.26-2.139,7.457-5.427,10.116-9.307c0.333-0.487,0.224-1,0.224-1.51c0.007-19.635,0.016-39.271-0.02-58.904 c-0.002-1.083,0.255-1.369,1.353-1.361c7.838,0.052,15.677,0.045,23.515,0.004c0.916-0.005,1.103,0.244,1.102,1.124 c-0.025,27.677-0.026,55.353,0.002,83.024c0.001,0.938-0.278,1.099-1.139,1.095c-7.918-0.028-15.837-0.028-23.756-0.001 c-0.815,0.003-1.1-0.166-1.073-1.037C125.581,167.117,125.536,164.928,125.536,162.479z"></path> <!-- "l" --> <path fill="#252525" d="M187.423,107.08c0,20.637-0.011,41.273,0.026,61.91c0.003,1.119-0.309,1.361-1.381,1.355 c-7.799-0.052-15.598-0.047-23.396-0.008c-0.898,0.008-1.117-0.222-1.115-1.115c0.021-39.074,0.021-78.147,0-117.226 c0-0.811,0.189-1.169,1.006-1.392c7.871-2.149,15.73-4.327,23.584-6.545c1.045-0.295,1.308-0.17,1.306,0.985 C187.412,65.727,187.423,86.403,187.423,107.08z"></path> <!-- "i" --> <path fill="#252525" d="M223.46,126.477c0,14.155-0.011,28.312,0.021,42.467c0.002,1.027-0.164,1.418-1.332,1.408 c-7.838-0.061-15.676-0.047-23.516-0.01c-0.881,0.004-1.121-0.189-1.119-1.104c0.026-26.153,0.025-52.307,0-78.458 c0-0.776,0.203-1.101,0.941-1.302c7.984-2.172,15.972-4.35,23.938-6.596c1.049-0.296,1.08,0.031,1.078,0.886 C223.454,98.004,223.46,112.239,223.46,126.477z"></path> <!-- "a" --> <path fill="#252525" d="M277.695,163.6c-0.786,0.646-1.404,1.125-2,1.635c-4.375,3.746-9.42,5.898-15.16,6.42 c-5.792,0.527-11.479,0.244-16.934-2.047c-12.08-5.071-15.554-17.188-11.938-27.448c1.799-5.111,5.472-8.868,9.831-11.94 c5.681-4.003,12.009-6.732,18.504-9.074c5.576-2.014,11.186-3.939,16.955-5.347c0.445-0.104,0.773-0.243,0.757-0.854 c-0.136-4.389,0.261-8.79-0.479-13.165c-1.225-7.209-6.617-10.013-12.895-9.348c-0.516,0.055-1.029,0.129-1.536,0.241 c-4.877,1.081-7.312,4.413-7.374,10.127c-0.02,1.729-0.229,3.418-0.693,5.084c-0.906,3.229-2.969,5.354-6.168,6.266 c-3.422,0.979-6.893,0.998-10.23-0.305c-6.529-2.543-8.877-10.164-5.12-16.512c2.249-3.799,5.606-6.4,9.461-8.405 c6.238-3.246,12.914-4.974,19.896-5.537c7.565-0.61,15.096-0.366,22.49,1.507c4.285,1.085,8.312,2.776,11.744,5.657 c4.473,3.749,6.776,8.647,6.812,14.374c0.139,21.477,0.096,42.951,0.143,64.428c0.002,0.799-0.248,0.983-1.021,0.98 c-8.035-0.025-16.074-0.023-24.113-0.001c-0.716,0.002-0.973-0.146-0.941-0.915C277.736,167.562,277.695,165.698,277.695,163.6z M277.695,126.393c-4.793,2.104-9.25,4.373-13.287,7.408c-2.151,1.618-4.033,3.483-5.732,5.581 c-4.229,5.226-1.988,13.343,1.693,16.599c1.592,1.406,3.359,1.906,5.419,1.521c1.621-0.307,3.149-0.857,4.549-1.734 c1.521-0.951,2.949-2.072,4.539-2.887c2.31-1.18,2.97-2.861,2.894-5.445C277.561,140.484,277.695,133.527,277.695,126.393z"></path> </svg> --- name: julia_features slide_author: Ben J. Ward title: How BioJulia is building for the future. author: Ben J. Ward # julia's features... - Multiple dispatch - Dynamic type system - User defined types - Speed - Package manager - Macros and metaprogramming - PyCall - Call C code directly - Shell capability - Coroutines and tasks --- name: benchmarks slide_author: Ben J. Ward title: How BioJulia is building for the future. author: Ben J. Ward # The canonical benchmarks - (caveats of course apply!)  .footnote[http://julialang.org/] --- name: bmarkcode slide_author: Ben J. Ward title: How BioJulia is building for the future. author: Ben J. Ward # Example code ```julia function randmatstat(t) n = 5 v = zeros(t) w = zeros(t) for i = 1:t a = randn(n,n) b = randn(n,n) c = randn(n,n) d = randn(n,n) P = [a b c d] Q = [a b; c d] v[i] = trace((P.'*P)^4) w[i] = trace((Q.'*Q)^4) end std(v)/mean(v), std(w)/mean(w) end ``` .footnote[http://julialang.org/] --- name: whyjulia slide_author: Ben J. Ward title: How BioJulia is building for the future. author: Ben J. Ward # Ok... but why? - Came about to tackle the two-language problem. - I love programming and different tools and languages... but! - The two language problem hinders science. - It prevents us "treating code right". - Too many languages for the non-code-savvy scientist. - Time spent mastering the n'th language is time not thinking about subject. - Mixing and combining languages is even more troublesome! - What does this mean when it comes to review? - And what about packages, publication = one review? --- name: mozillacr slide_author: Ben J. Ward title: How BioJulia is building for the future. author: Ben J. Ward # Mozilla Code Review - Mozilla & PLOS Compulational Biology - Code between 200-500 lines from a series of selected papers. - Typical analysis code in Python, R, Perl - Not packages. - Interviewed volunteers. ## Findings - Most scientists claimed it was their first time. - Code was not written for others to use. - Lack of commenting and documentation. - Much discussion on whether scientists are encouraged to participate in CR. .footnote[Full write up: http://arxiv.org/abs/1311.2412] --- name: origins slide_author: Ben J. Ward title: How BioJulia is building for the future. author: Ben J. Ward # Enter BioJulia ## Began with individual efforts - BioSeq.jl - Phylogenetics.jl ## Core Team... Daniel Jones (University of Washington) - @dcjones Richard Smith Unna (Cambridge University) - @blahah404 Ben J. Ward (UEA, TGAC) - @Ward9250 Kenta Sato (University of Tokyo) - @bicycle1885 Paulo Castro (University of São Paulo) - @prcastro Diego Zea (Buenos Aires, Argentina) - @diegozea --- name: packages slide_author: Ben J. Ward title: How BioJulia is building for the future. author: Ben J. Ward # Maintained packages - Bio - FMIndexes - Libz - WaveletMatrices - IndexableBitVectors - BioFmtSpecimens - BufferedStreams - IntervalStreams - Dat - Dalliance --- name: whatwewant slide_author: Ben J. Ward title: How BioJulia is building for the future. author: Ben J. Ward # What we want - A well maintained base infrastructure package of very efficient types and algorithms for biology. - Bio.jl - Suitable for more high level packages. - Constant and consistent code review. - An open and inclusive community. - Learning resources. - To write out code *once*, to have more people able to review it, modify it, improve it! --- name: parsing1 slide_author: Ben J. Ward title: How BioJulia is building for the future. author: Ben J. Ward # Parsers can be a pain! ## Fasta - simple format to parse and enforce. ``` » seqtk seq -r sweetpea.fasta >gi|3176602|gb|U78617.1|LOU78617 Lathyrus odoratus phytochrome A (PHYA) gene, partial cds AACAAACCTCGGAGTAGTGTTATGACAAACTACCAAACCCCAAAGTCTCTTTTTCTTTTGTGGTAGAACTGCGTCACGGCTATCTCCATCTTC ``` --- name: parsing2 slide_author: Ben J. Ward title: How BioJulia is building for the future. author: Ben J. Ward # Parsers can be a pain! ## So why can you parse ASCII pizzas with FASTA parsers? ``` » seqtk seq -r pizza.txt > | " `.' " // .' : " ' ~ // `""" *| | , . * @ "@ .@.@ )Q( . / , )O( " , ''---..__ . . " ; ,@'. ~ .@.@ . ~ ""--- / , " _ '''--..,__ ' . . @ .@ * . " ..___/ '. ''---..__ '''--@..@ | ' \\ , .--. " ~ '.@ * __ ''--...__ " @ ' | | \\ " . "--@ * .@',_' , '"'--..,__ """@ ~ | | | | ~ \\ ))( . ' " '. ' , '"--..__ ~ ', .' || , | | | | ' _ " .--. .-"", __ " )0( " ___| |__ || *| | | | )G( * @ ~ )@( '...@ _ , ~ .-. ', ~ | | | | " . ~ // )g( "==== , ~ , ' ~ // . " | | >,..--''' . * ~ || * ~ / " * , ~ ` ~ , (O) ' . '.-@ ''"""""'@ , @ _.-@ '--..___ ___..--@ / _.-@ .@ '-.,_ ""--....--@" _.@ / " . .@ _.@ '-._ '"-.__ " @ * _..-@@ , / // , .@ .@ '._ "-._ * )O( \\ , _., / . // " .@ . ', "-._ ~ * , \\ )o( " / / )G( . * . .@ ', '._ * '@ ' ~ * / / . ' )g( , @ . / '. " @', : , ,__. ~ ,/ / ~ * @ "@ ~ . .@ ', . // '. ' : : " / / " ; ,@'. * ----- .@ @ / . // )@( .--, || .-. / /,__. @ .@ _____ @ . , ' " ./ * _ . " || )O( ~/ / : : j| .--. . ~ " ' '. ' " * . || " ~ _ / ' .-. || " " )@( , | | ``` This particular parser is used 1615 times on GitHub. Because formats are not standardized, parsers are often "non-validating" or vaguely defined. --- name: parsing3 slide_author: Ben J. Ward title: How BioJulia is building for the future. author: Ben J. Ward # Enter State machine specifications With Ragel we can define a FASTA parser in 9 lines that's faster and more accurate that most hand-written C/C++ parsers. ``` newline = '\r'? '\n'; hspace = [ \t\v]; whitespace = space | newline; identifier = (any - space)+; description = ((any - hspace) [^\r\n]*); letters = (any - space - '>')+; sequence = whitespace* letters? (whitespace+ letters)*; fasta_entry = '>' identifier (hspace+ description)? newline sequence whitespace*; main := whitespace* (fasta_entry)*; ``` The same specification can be used to generate equivalent parsers C, D, Go, Java, Ruby, C#, OCaml. --- name: parsing4 slide_author: Ben J. Ward title: How BioJulia is building for the future. author: Ben J. Ward # Parsers are fast ```julia using Bio.Seq @time collect(read("/Users/dcjones/Homo_sapiens.GRCh38.dna.primary_assembly.fa", FASTA)); ``` 26.561 seconds (8951 allocations: 1423 MB, 4.74% gc time) 30.8 seconds in Biostrings (R), 39.7 seconds in SeqAn(C++) Other cool stuff is possible too! ``` using Bio.Seq # Memory mapping read("/Users/dcjones/Homo_sapiens.GRCh38.dna.primary_assembly.fa", FASTA, memory_map=true) # Reading from commands for seqrec in take(drop(read(`quip -cd /Users/dcjones/SRR094166.fastq.qp`, FASTQ), 10000), 3) println(seqrec) end ``` --- name: nucseq slide_author: Ben J. Ward title: How BioJulia is building for the future. author: Ben J. Ward # Nucleotide Sequences ### 2 bit encoding: ``` A => 0b00 C => 0b01 T => 0b10 G => 0b11 ``` ### Reverse complement ```julia using Bio.Seq const chr1 = first(read("/Users/dcjones/Homo_sapiens.GRCh38.dna.primary_assembly.fa", FASTA)).seq @time reverse_complement(chr1); ``` 151.301 milliseconds (11 allocations: 91171 KB) Versus 956 milliseconds in Biostrings (R) and 445 milliseconds in SeqAn (C++) --- name: nucseq2 slide_author: Ben J. Ward title: How BioJulia is building for the future. author: Ben J. Ward ### Counting Kmers ```julia @time DNAKmerCounts{10}(chr1); ``` 1.901 seconds (8 allocations: 4096 KB) Versus 2.724 seconds in Biostrings (R) and 5.954 seconds in SeqAn (C++) ### Counting Nucleotides ```julia @time NucleotideCounts(chr1) ``` 67.946 milliseconds (196 k allocations: 3058 KB) Versus 1193 milliseconds in Biostrings (R) 3165 milliseconds in SeqAn (C++) ```julia count_c(x::Uint64) = count_ones((((~x) >>> 1) & x) & 0x5555555555555555) ``` --- name: nucseq3 slide_author: Ben J. Ward title: How BioJulia is building for the future. author: Ben J. Ward ### Translation ```julia const seqs = collect(read("/Users/dcjones/human-coding-sequences.fa", FASTA)) @time for seq in seqs translate(convert(RNASequence, seq.seq[1:end-3]), Seq.standard_genetic_code, true) end ``` 467.524 milliseconds (765 k allocations: 65283 KB) 632 milliseconds in Biostrings (R) and 938 milliseconds in SeqAn (C++) ### Subsequences are cheap ```julia @time [chr1[i:end] for i in 1:1000000]; ``` 48.829 milliseconds (1000 k allocations: 54688 KB) Immutable by convention. --- name: intervals slide_author: Ben J. Ward title: How BioJulia is building for the future. author: Ben J. Ward # Intervals ```julia immutable Interval{T} seqname::ASCIIString first::Int64 last::Int64 strand::Strand metadata::T end ``` B+-trees are provided by IntervalTrees: .center[] Intersection is a fundamental operation for sets of intervals. Find all pairs of intervals from A and B that intersect. --- name: intervals2 slide_author: Ben J. Ward title: How BioJulia is building for the future. author: Ben J. Ward ### Intersection ```julia genes = read("genes.bed", BED) reads = IntervalCollection(read("reads.bed", BED)) @time for (a,b) in intersect(reads, genes) end ``` 3.190 seconds (17592 k allocations: 612 MB, 59.76% gc time) 5.9 seconds for `bedtools intersect -loj`. .center[] --- name: intervals3 slide_author: Ben J. Ward title: How BioJulia is building for the future. author: Ben J. Ward ### Coverage ```julia @time coverage(read("bodymap-heart-2-chr1.bed", BED)) IntervalCollection with 3208603 intervals: chr1:10586-10633 . 1 ⋮ ``` 34.209 seconds (419 M allocations: 16216 MB, 31.36% gc time) `bedtools genomecov` takes 54.4 seconds to do the same. .center[] --- name: alignments slide_author: Ben J. Ward title: How BioJulia is building for the future. author: Ben J. Ward # Sequence alignments Use an anchor based system. ```julia immutable AlignmentAnchor seqpos::Int refpos::Int op::Operation end immutable Alignment anchors::Vector{AlignmentAnchor} firstref::Int lastref::Int end immutable AlignedSequence{S} seq::S aln::Alignment end ``` Operations supported covers all CIGAR ops, encoded as single bytes. --- name: moretocome slide_author: Ben J. Ward title: How BioJulia is building for the future. author: Ben J. Ward # More to come - Dalliance - BED & BigBED - Alignment Algorithms - Project Management - Phylogenetics - Population Genetics - Assembly and PanGenomics The list goes on... --- name: thankyou slide_author: Ben J. Ward title: How BioJulia is building for the future. author: Ben J. Ward # Thank You! Github: https://github.com/BioJulia Gitter: https://gitter.im/BioJulia/Bio.jl