Sunday, December 13, 2009

Ontology matching and Phylogenies

As many will know, I've been spending the autumn at NESCent, working on two projects: a continuing effort in Phenoscape, and a new project to develop and implement an algorithm to align multiple taxon-specific ontologies using a tree. The resulting tool, Phylontal, is still a ways from even an initial release, but I gave a brown-bag talk on Friday that covered ontology matching as it relates to evolutionary biology, particularly comparative methods. While there is ongoing interest in the general topic of ontology matching (e.g., the OntologyMatching site), there has been relatively little interest in either the model organism or evolutionary biology communities. This is starting to change: several approaches are being tried by model organism projects (most notably Uberon, and the Homontol tool and Homology ontology of the BGEE project).

Although Uberon and Homontol may represent viable approaches for linking model organism ontologies, I've been dubious from the start that any approach that ignores or minimizes the role of phylogeny would be appropriate for studies that combine ontologies to ask comparative questions. Phylontal extends some of the ideas introduced by Homontol and its Homologous Organ Groups (HOGs) by attaching alignments (the results of matching operations) to specific nodes in a tree and by explicitly distinguishing homologous and non-homologous alignments. Homontol could move in a similar direction, and the BGEE homology ontology suggests its developers have been thinking about other types of correspondences between anatomical terms, but I think their multispecies gene expression database is plenty to fill their plate. If nothing else, introducing phylogeneticists to these issues will get people thinking about them.

In the talk, the question of missing versus absent terms came up, especially when I discussed how Phylontal could deal with a term missing from an ingroup but shared with an outgroup. I'm beginning to think that the OwlWatcher approach of reasoning up from a series of instances, each of which is a graph, might allow the distinction between absent and missing terms to appear. This is particularly true in behavior sequences: if in one clade the sequence A->B->C is observed, and in another C immediately follows A in all the observed instances, then B is absent where it would be expected to be observed. Likewise, if all the observations show no successor to A and no predecessor to C, then B may just not have been observed. It's the combination of the use of sequences (complete orderings) and the ability to refer back to observed instances that makes the difference. In principle, you could do the same sort of thing with anatomy by building chains of connections, but these are not the sort of details that make it into character matrices, so it would require going back to drawings, photos, or free text, and probably putting the information into the taxon-level phenotype statements rather than the multispecies ontologies.
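
The absent-versus-unobserved distinction in behavior sequences can be sketched in a few lines of Python. This is only an illustration of the reasoning, not how OwlWatcher or Phylontal work: the function name `classify_middle_step`, the plain-list representation of observed sequences, and the three labels are all assumptions made for the example.

```python
def classify_middle_step(sequences, before, middle, after):
    """Classify `middle` for one clade from its observed sequences.

    Hypothetical helper: each observed instance is a list of behavior
    labels in the order they occurred.
    """
    saw_direct_jump = False
    for seq in sequences:
        if middle in seq:
            # Any instance containing the step means it is present.
            return "present"
        # Look for `after` immediately following `before`.
        for first, second in zip(seq, seq[1:]):
            if first == before and second == after:
                saw_direct_jump = True
    # A direct A->C transition is evidence that B is truly absent;
    # without it, B may simply not have been observed.
    return "absent" if saw_direct_jump else "unobserved"


clade_one = [["A", "B", "C"]]
clade_two = [["A", "C"], ["A", "C"]]
clade_three = [["A"], ["C"]]

print(classify_middle_step(clade_one, "A", "B", "C"))
print(classify_middle_step(clade_two, "A", "B", "C"))
print(classify_middle_step(clade_three, "A", "B", "C"))
```

The point of the sketch is that the answer depends on having the ordered instances available; a bare list of which terms occur in each clade could not separate the second case from the third.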

Aligning phenotypes might be a new frontier for Phenoscape and similar projects.

There was also some discussion about whether a down-pass from tips to root is sufficient to match terms. If so, then Phylontal can avoid some work when phylogenies change by getting the tree nodes aligned first. If not, as Dave Swofford pointed out, it may be necessary to align from scratch with each new tree. This is an open and potentially important question.
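
For readers unfamiliar with the term, a down-pass visits the tree in post-order, so each internal node is handled only after all of its descendants. The sketch below is a toy version, not Phylontal's algorithm: it assumes a tree written as nested `(name, children)` tuples with tips as plain strings, and it stands in exact term-name matching (set intersection) for whatever real matching operation would be used. The taxa and terms are made up for illustration.

```python
def down_pass(tree, tip_terms):
    """Return {node name: terms shared by all tips below that node}.

    Hypothetical representation: a tip is a string, an internal node is
    a (name, [children]) tuple. Matching by identical name is a stand-in
    for a real ontology matcher.
    """
    shared = {}

    def visit(node):
        if isinstance(node, str):
            # Tip: start from the taxon's own term set.
            shared[node] = set(tip_terms[node])
            return shared[node]
        name, children = node
        # Children are resolved first (post-order), then combined.
        child_sets = [visit(child) for child in children]
        shared[name] = set.intersection(*child_sets)
        return shared[name]

    visit(tree)
    return shared


tip_terms = {
    "zebrafish": {"eye", "fin"},
    "catfish": {"eye", "fin", "barbel"},
    "frog": {"eye", "limb"},
}
tree = ("root", [("fishes", ["zebrafish", "catfish"]), "frog"])

alignments = down_pass(tree, tip_terms)
print(alignments["fishes"])
print(alignments["root"])
```

In this toy version the sets at internal nodes depend on the topology while the tip sets do not, which is one way to see why a changed tree might force at least the internal alignments to be recomputed.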