Genome Informatics 2016

This annual conference alternates between the Wellcome Genome Campus in Hinxton, UK, and Cold Spring Harbor Laboratory (CSHL), USA. While the CSHL campus is quite lovely and prestigious, I was thrilled to finally have a chance to visit the vaunted Sanger Institute; they were pioneers of genomics from the very beginning, and are still a leading source of innovation in the field. The campus was a great mix of older original buildings and modern construction, and the conference drew some of the leaders in genomics.

One of the keynote speakers was Richard Durbin, a leader of such projects as the 1000 Genomes Project, Pfam, WormBase, and Ensembl. His book on Biological Sequence Analysis is the standard in the field, and his sequence similarity matching software is probably the most widely used after BLAST. Here, he was speaking about a new way to store genome information, one that the field will likely be using in the not-so-distant future. Today, a genome is typically stored as a sequence of characters, which works very well. But if you sequence thousands of individuals from a population, the vast majority of these characters will be repeated. This ends up wasting terabytes of storage, and further, the collection of files tells you nothing about the population itself without extensive processing. A graph-based representation of the population of genomes is far more efficient, allows common errors to be easily identified and corrected, inherently contains information about the population, and lends itself to efficient computation and manipulation. This may seem like a minor and overly technical detail, but it is likely to change the field in the near future.
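To make the storage argument concrete, here is a minimal Python sketch of one simple kind of sequence graph (a de Bruijn-style graph built from overlapping k-mers). This is my own illustration, not necessarily the structure described in the talk; the function names, the toy sequences, and the choice of k are all assumptions.

```python
def kmers(seq, k):
    """Return every overlapping substring of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_graph(genomes, k=5):
    """Map each (k-1)-mer node to the set of successor nodes seen in any genome.

    Identical stretches of sequence collapse onto the same nodes and edges,
    so shared sequence is stored once no matter how many individuals carry it.
    """
    graph = {}
    for seq in genomes:
        for kmer in kmers(seq, k):
            graph.setdefault(kmer[:-1], set()).add(kmer[1:])
    return graph


# Toy population (hypothetical data): nine individuals carry the reference
# sequence and one carries a single-base variant (G -> A at position 9).
reference = "ACGTTGCATGGACCTA"
variant = "ACGTTGCATAGACCTA"
population = [reference] * 9 + [variant]

graph = build_graph(population, k=5)

# k-mers stored if every genome is kept as its own flat sequence.
flat_kmers = sum(len(kmers(seq, 5)) for seq in population)
# Distinct edges needed once the population is merged into a graph.
graph_edges = sum(len(successors) for successors in graph.values())
# Nodes with more than one successor mark where individuals diverge.
branch_points = sorted(node for node, nxt in graph.items() if len(nxt) > 1)

print(flat_kmers, graph_edges, branch_points)
```

In this toy case the graph needs far fewer edges than the flat files need k-mers, and the single branching node sits right before the variant site, which is the sense in which the structure itself carries information about the population.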

There were, of course, many other things to be learned. As an example, RNAseq is a widespread technique used to study gene expression in some cell or organism. Unfortunately, analysis of these data can be deceptively difficult, and mishandling them can easily result in incorrect conclusions. A Venn diagram is a common way to display complex membership data, and research groups have been using these with RNAseq data for years. However, binning the data in this way completely discards all relative information, and often lower-level membership information as well. A group from Melbourne has developed an excellent tool to make these analyses much easier, even for non-specialists, and adoption of this or similar software could make RNAseq studies more reliable and predictive in the future.

The conference was excellent, and I would absolutely encourage others to attend future iterations if they are interested in genomics. I sincerely appreciate the support of the Office of Graduate Education and the HPI for allowing me to go.


Bioinformatics in Toronto

In May of this year the University of Toronto hosted the first Canadian Computational Biology Conference, and bioinformatics geeks like myself flocked to their beautiful campus from all over the continent and across the pond. There was a lot of cool new work being presented, and it’s always interesting to get an update on the current state of our art. It was especially striking as the conference took place at Victoria College, which is essentially a castle sitting on the U of T campus. I’ve never before heard about new applications of set cover machines or the pitfalls of failing to treat metagenomics data as compositional while the speaker is being framed by beautiful 180-year-old stained glass windows.

I was fortunate enough to be able to present my work in a talk, which was somewhat nerve-wracking as my session was chaired by Dannie Durand, the very researcher whose work I have adapted and modified for my own purposes. She spoke about their recent work analyzing gene families by considering the evolution of each domain separately. We also heard about the staggering complexity involved in trying to predict transcription factor binding behavior, about how representing bacterial genomes as profiles of k-mers yields better phenotypic information than standard phylogenetic analyses, and how disordered regions on eukaryotic proteins are vital to their functions (especially transcription factors).

In addition, we heard about several projects out of Greg Gloor’s group, with his talk titled “We’ve been analyzing high throughput sequence data in the wrong geometric space”. It raised a fundamental problem with how abundance sequencing data is commonly handled, essentially suggesting that most of the metagenomics field has been analyzing its data incorrectly. It boils down to the fact that 10% of one sample can’t be assumed to be the same as 10% of an independent sample when you have no idea what the total abundance in each sample actually is. If your data are ratios, you can’t treat them the same as if they were actual counts. Spurious correlations will absolutely appear simply due to the structure of the data, and even simple operations like addition and subtraction no longer really work. Researchers have not properly respected this fundamental issue, but there is now a tool available called CoDaSeq to help them handle it in the future.
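A small worked example may help here. The Python sketch below uses made-up abundances for three hypothetical taxa, and the centred log-ratio (CLR) transform, which is a standard tool from compositional data analysis. I am assuming it is representative of the general approach; it is not taken from the talk or from any particular package.

```python
import math


def close(counts):
    """Convert absolute counts to proportions that sum to 1 (the 'closure' step)."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def clr(proportions):
    """Centred log-ratio: log of each part minus the mean log of all parts."""
    logs = {taxon: math.log(p) for taxon, p in proportions.items()}
    mean_log = sum(logs.values()) / len(logs)
    return {taxon: value - mean_log for taxon, value in logs.items()}


# Hypothetical true abundances: only taxon C differs between the samples.
sample_1 = {"A": 100, "B": 200, "C": 700}
sample_2 = {"A": 100, "B": 200, "C": 1700}

p1, p2 = close(sample_1), close(sample_2)

# Taxon A looks like it halved, although its true abundance never changed.
print("proportion of A:", p1["A"], p2["A"])

# The ratio between two parts is unaffected by the closure.
print("B/A ratio:", p1["B"] / p1["A"], p2["B"] / p2["A"])

# After the CLR transform, the difference between two parts is their log-ratio,
# so comparisons between A and B agree across the two samples.
c1, c2 = clr(p1), clr(p2)
print("CLR(B) - CLR(A):", c1["B"] - c1["A"], c2["B"] - c2["A"])
```

The point of the design is that the proportions of A and B move purely because C changed, whereas quantities built from ratios between parts stay put, which is why log-ratio methods are the usual remedy for this kind of data.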

Overall I found the conference very valuable, and I am very grateful to the HPI for the funding and support that sent me there. Hopefully this conference will continue in the future!


ISMB 2015 – Dublin

For 5 days in July, I was fortunate enough to attend the Intelligent Systems for Molecular Biology (ISMB) / European Conference on Computational Biology (ECCB), which was held in the rather stunning Dublin convention center. This is the biggest bioinformatics conference around, and it was amazing to see the work of thousands of fellow researchers. There were 4 keynote presentations, over 100 talks, 600 posters, and Q&A sessions with industry. It was great to get a chance to talk to Dutch biotech entrepreneurs, the head of IBM’s bioinformatics program, and a research director with Pfizer.

There were a lot of great talks, on some very interesting topics (well, interesting to me): “Understanding operon evolution using an event-driven model and phylogenetic visualizations”, “Inferring parental genomic ancestries using pooled semi-Markov processes”, “Detecting molecular similarities between allergenic and metazoan parasitic proteins: allergy in the light of immunity”, and many more. Another highlight of the conference was the reception dinner, which involved a tour of the absolutely enormous Guinness brewery. The complex includes dozens of buildings, and must cover at least 10 city blocks on either side. Fun fact: this is where William Sealy Gosset was working when he developed the t-test in 1908.


There weren’t many hotels available in the area, so I was forced to stay in the Marker, a glorious building just across the river from the convention center. Besides the great location, they served up an amazingly delicious breakfast every morning before the conference began. I can still remember the buttered toast, potatoes, and espresso. It was really a great experience, and I’m thankful I was able to attend.

Article by Dave Curran, PhD candidate in the Wasmuth lab.




Worms and bugs in the mountains

Reposted from the UCVM helminthology blog: The Banff Conference on Infectious Diseases (BCID) ran this year from Wed-Sun over the first week of June. It has always been organized jointly between the Universities of Calgary and Alberta, and as usual they brought in many top researchers from around the world. I have attended twice in the past, back when I was working in microbiology and infectious diseases, but this time our lab went to represent parasites. This is actually something that I felt was somewhat missing from my previous degree; in classes or seminars or conferences we heard about bacteria, viruses, prions, and malaria, but almost never about parasitic worms. So my supervisor, James Wasmuth, gave a talk on Trichinella spiralis, and I presented a poster about our work on worms.
There were a lot of very interesting talks at the conference. One that stuck in my mind is from a group looking to quantify the response of a single bacterial cell under different conditions. Their rather interesting approach was to use nanotechnology to create bacterial traps. They manufactured a chip with a few dozen very small structures on it, where each was in the shape of a “C” that was just large enough to accommodate a single bacterium inside. When the chip was exposed to heat, the material would swell shut, effectively trapping that bacterium in a set location. You could then expose the cells to any treatment you like, measure a response, and repeat; each measurement would be on the exact same cell, so you wouldn’t be limited to population-level data.
Another talk that I found very interesting had to do with horizontal gene transfer in bacteria. It is a well-known and well-studied phenomenon that many bacterial species are able to take up DNA from their environment, and that many of them integrate it into their own genomes. It is a common mechanism of spreading useful genes (often virulence factors) even between divergent species.
But when you think about it, taking up a whole operon from a different bacterium carries with it quite a bit of risk. A bacterium doesn’t want something that is constantly draining resources for no benefit, or worse, some product that would interfere with the standard cellular processes. And for many complex features, such as a type VI secretion system or a toxin/antitoxin pair, all of the parts would have to begin working at the same time in order to be at all useful or safe. It is extremely likely that any foreign DNA would have to undergo many specific mutational changes in order to begin to function properly within the cell, which is extremely unlikely to occur in a short time period. All of these risks create strong selective pressure to kick out foreign DNA, but on the other hand the rapid spread of new genetic material can provide a very beneficial evolutionary response to changing environments.
It turns out that Salmonella uses the histone-like nucleoid structuring (H-NS) system to have its cake and eat it too. The genomes of different species of bacteria often have very different G-C content (it can range from 25% to 75%); Salmonella has a relatively high level. This means that any foreign DNA it takes up will likely be comparatively A-T rich. The H-NS system exploits this fact, and is able to bind to A-T rich regions of the genome regardless of the specific sequences. These proteins effectively supercoil the genome at these regions, preventing transcription at very low cost to the cell. Salmonella is therefore able to maintain a diverse reservoir of genes that are not actually active, but can collect mutations over time and quickly be turned on when the appropriate infrastructure comes into being. Systems like these have been found in several other bacterial species, and may help to explain how multi-part complex traits sometimes seem to suddenly spring into existence.
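The compositional signal that H-NS is described as exploiting is easy to see computationally. Below is a minimal Python sketch that scans a sequence in sliding windows and flags those that are much more A-T rich than the genome as a whole. It only illustrates the base-composition idea, not how H-NS actually recognizes DNA; the function names, window size, margin, and toy sequences are all assumptions of mine.

```python
def at_fraction(seq):
    """Fraction of bases in seq that are A or T."""
    seq = seq.upper()
    return sum(base in "AT" for base in seq) / len(seq)


def at_rich_windows(genome, window=20, step=10, margin=0.15):
    """Return (start, AT fraction) for each window whose AT fraction
    exceeds the genome-wide AT fraction by more than `margin`."""
    baseline = at_fraction(genome)
    hits = []
    for start in range(0, len(genome) - window + 1, step):
        frac = at_fraction(genome[start:start + window])
        if frac > baseline + margin:
            hits.append((start, frac))
    return hits


# Toy data (hypothetical): a G-C rich "host" background with a short
# A-T rich "foreign island" inserted in the middle.
host = "GCCGGCATCGCCGGTCAGCG" * 3
island = "ATTTAATATTAAATTTATAA"
genome = host + island + host

for start, frac in at_rich_windows(genome):
    print(start, frac)
```

On this toy genome only the windows overlapping the inserted island are flagged, which mirrors the idea that foreign DNA stands out by composition alone, regardless of its specific sequence.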
So quite an interesting conference, and a good opportunity for me to see new research that is not directly related to what I do. Hopefully some people found our work as interesting as I found theirs.