In May of this year the University of Toronto hosted the first Canadian Computational Biology Conference, and bioinformatics geeks like me flocked to its beautiful campus from all over the continent and across the pond. A lot of cool new work was presented, and it’s always interesting to get an update on the current state of our art. The setting was especially striking, as the conference took place at Victoria College, which is essentially a castle sitting on the U of T campus. I had never before heard about new applications of set cover machines, or the pitfalls of failing to treat metagenomics data as compositional, while the speaker was framed by beautiful 180-year-old stained glass windows.
I was fortunate enough to present my work in a talk, which was somewhat nerve-wracking because my session was chaired by Dannie Durand, the very researcher whose work I have adapted and modified for my own purposes. She spoke about her group’s recent work analyzing gene families by considering the evolution of each domain separately. We also heard about the staggering complexity involved in predicting transcription factor binding behavior, about how representing bacterial genomes as profiles of k-mers yields better phenotypic information than standard phylogenetic analyses, and about how disordered regions in eukaryotic proteins (especially transcription factors) are vital to their functions.
In addition, we heard about several projects out of Greg Gloor’s group, with his talk titled “We’ve been analyzing high throughput sequence data in the wrong geometric space”. It raised a fundamental problem with how abundance sequencing data is commonly handled, essentially suggesting that most of the metagenomics field has been analyzing its data incorrectly. It boils down to the fact that 10% of one sample can’t be assumed to be the same as 10% of an independent sample when you have no idea what the total is in each sample. If your data are ratios, you can’t treat them as if they were actual counts. Spurious correlations will appear simply because of the structure of the data, and even simple operations like addition and subtraction no longer behave as expected. The field has largely overlooked this fundamental issue, but a tool called CoDaSeq is now available to help researchers handle it in the future.
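To make the compositional problem concrete, here is a small Python sketch of my own. It is not taken from the talk and it is not the CoDaSeq interface. The counts are invented toy numbers, and the helper names `closure` and `clr` are just labels I chose. It shows a taxon whose absolute count never changes appearing to drop once the data are turned into proportions, while the log-ratio between two unchanged taxa stays put. The centered log-ratio (CLR) is one standard transform from compositional data analysis that builds on this property.

```python
import math

def closure(counts):
    """Convert counts to proportions that sum to 1 (what sequencing effectively gives us)."""
    total = sum(counts)
    return [c / total for c in counts]

def clr(parts):
    """Centered log-ratio: log of each part minus the mean of the logs."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

# Hypothetical absolute abundances for taxa A, B, C in two samples.
# Only taxon B changes; A and C are identical in both.
sample_1 = [100, 100, 200]
sample_2 = [100, 900, 200]

props_1 = closure(sample_1)
props_2 = closure(sample_2)

# Taxon A looks like it dropped, even though its absolute count is unchanged.
print("A as a proportion:", props_1[0], "vs", props_2[0])

# The log-ratio of A to C is the same in both samples.
log_ratio_1 = math.log(props_1[0] / props_1[2])
log_ratio_2 = math.log(props_2[0] / props_2[2])
print("log(A/C):", log_ratio_1, "vs", log_ratio_2)

# CLR gives the same values whether we start from counts or from proportions.
print("CLR of sample 2:", clr(props_2))
```

Comparing proportions directly would suggest that taxon A declined between the two samples, which is exactly the kind of artifact the talk warned about. Ratios between parts do not depend on the unknown total, so they are a safer thing to compare.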
Overall I found the conference very valuable, and I am very grateful to the HPI for the funding and support that sent me there. Hopefully this conference will continue in the future!