Abstract
Many modern information systems rely on faithful knowledge retrieval to function effectively. For instance, large language models (LLMs) mitigate hallucinations through techniques such as retrieval-augmented generation (RAG). Meanwhile, the healthcare industry relies on structured knowledge bases, such as SNOMED CT, to aid decision support and to enable clinical reporting. Despite the central role that knowledge retrieval plays, LLMs continue to struggle with factual accuracy, and studies on effective retrieval for SNOMED CT remain limited. Motivated by these shortcomings, this work investigates the effectiveness of ontology-aware (hyperbolic) bi-encoders, focusing on the Hierarchy Transformer (HiT) and Ontology Transformer (OnT) frameworks. Through an investigation of ontology-grounded knowledge retrieval using SNOMED CT, we assess whether OnT-based retrieval improvements transfer to downstream tasks, including RAG-based biomedical multiple-choice question answering (MCQA) and web-based search. We construct an out-of-vocabulary (OOV) mention set using the MIRAGE benchmark, annotate gold target reference classes from SNOMED CT, and evaluate retrieval in single-target, multi-target and application-specific settings. We compare our results against strong lexical (TF–IDF, BM25) and contextual (Sentence-BERT, SBERT) baselines, whilst also evaluating the potential of exploratory techniques such as mixed model spaces with heterogeneous curvature. For single-target retrieval, we find that the best-performing OnT models provide a 13-point gain in mean reciprocal rank (MRR) over SBERT and more than double the performance of the lexical baselines. Similarly, in the multi-target setting, ontology encoders continue to outperform both lexical and contextual baselines, and a depth-biased subsumption score further improves mAP by 1–2 points over pure geodesic distance, with minimal nDCG trade-off.
Despite these performance improvements, single (top-1) concept retrieval applied to vanilla RAG for biomedical MCQA shows no significant accuracy gains on MIRAGE. This likely stems from language model mismatch, short context length and insufficiencies tied to axiom verbalisation, suggesting that further work is required. We release a modular retrieval toolkit, annotated OOV queries, and a reproducible artefact to support future work, available at https://github.com/jonathondilworth/uom-thesis.