Research
My research falls into two overlapping fields: informetrics and natural language processing.
In terms of informetrics, I work with publication data. More specifically, I'm interested in any data that deals with scientific research output, from journal articles to open source software repositories on GitHub. My focus so far has been the evaluation of impact indicators (such as the h-index) and ranking algorithms (such as PageRank) on publication databases to understand how well and how fairly they rank researchers and papers. My current focus is on South African research output stemming from higher education institutions. I am interested in developing data mining techniques to extract useful scientometric metadata from unstructured text sources.
In terms of natural language processing, I focus on transformer-based large language models, particularly in low-resource settings. I also try to combine natural language processing techniques with bibliometric data in order to gain further insights from the data contained in articles and dissertations. Specifically, I am interested in using large language models for hierarchical classification, extracting in-text citation context from documents, and using citation networks for community detection and citation weighting.
Summary of my PhD work
The focus of my PhD work was on how bibliometric indicators such as the h-index can be evaluated using test data. The test data I collected comprised multiple datasets of author and paper entities that have won scientific awards and other academic accolades for their high impact and influence. These test datasets were used to evaluate whether current impact metrics fulfil their intended purpose and how fairly they rank papers and authors from different academic disciplines. I conducted experiments on large real-world publication databases comprising up to half a billion data points. Furthermore, I made methodological contributions to the field of informetrics by transferring and adapting statistical methods from the field of information retrieval to the context of rankings of academic entities.
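As a rough illustration of this kind of evaluation, the sketch below ranks a handful of authors by h-index and scores that ranking against a test set of award winners using average precision, a standard information retrieval measure. It is a toy example under stated assumptions, not the method or data used in the thesis: the author names, citation counts, and award set are all made up, and average precision is only one of many measures that could be used here.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def average_precision(ranking, relevant):
    """Average precision of a ranked list against a set of relevant items."""
    hits = 0
    precision_sum = 0.0
    for position, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / len(relevant) if relevant else 0.0


# Hypothetical toy data: citation counts per paper for four made-up authors.
citations_by_author = {
    "author_a": [10, 8, 5, 4, 3],
    "author_b": [25, 8, 5, 3, 3],
    "author_c": [2, 1],
    "author_d": [6, 6, 6, 6, 6, 6],
}
# Hypothetical test set: authors assumed to have won an award.
award_winners = {"author_b", "author_d"}

# Rank authors by h-index (highest first, ties broken by name).
scores = {author: h_index(c) for author, c in citations_by_author.items()}
ranking = sorted(scores, key=lambda author: (-scores[author], author))

# A higher value means the indicator places award winners nearer the top.
ap = average_precision(ranking, award_winners)
print(ranking, round(ap, 3))
```

In this toy data, author_b has the most-cited paper but a lower h-index than author_a, so the ranking places one award winner below a non-winner. This is the sort of behaviour such a test set is meant to surface.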