A reliable retrieval system is crucial to satisfying users’ needs, but its reliability depends on the varying effects of the test collection. Reliability is usually evaluated by the similarity of a set of system rankings, in order to understand the impact of variations in relevance judgments or effectiveness metrics. However, such evaluations do not indicate the reliability of individual system rankings. This study proposes a method to measure the reliability of individual retrieval systems based on their relative rankings. The intraclass correlation coefficient (ICC) is used as a reliability measure of individual system ranks. The experiments cover various combinations of effectiveness metrics chosen according to their clusters, a selection of topic set sizes, and Kendall’s tau correlation with the gold standard. The metrics average precision (AP) and rank-biased precision (RBP) are suitable for measuring the reliability of system rankings and for generalizing the outcome to other similar metrics. Highly reliable system rankings belong mostly to the top- and mid-performing systems and are strongly correlated with the gold standard system ranks. The proposed method can be replicated on other test collections because it uses relative rankings to measure reliability. The study measures the ranking reliability of individual retrieval systems to indicate the level of reliability a user can expect from a retrieval system regardless of its performance.
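The two statistics named above can be illustrated with a minimal sketch. The abstract does not state which ICC form the study uses or how the rank matrix is arranged for a single system, so the code below assumes the two-way random-effects, absolute-agreement, single-measurement form, often written ICC(2,1), and a simple tau-a variant of Kendall's tau that ignores tie corrections. The function names `icc_2_1` and `kendall_tau_a` and the toy rank values are hypothetical and are not taken from the study.

```python
from itertools import combinations


def icc_2_1(ranks):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    `ranks` is a list of rows. Each row is one target and each column is one
    measurement condition (for example, one effectiveness metric). This layout
    is an assumption made for illustration only.
    """
    n = len(ranks)       # number of targets (rows)
    k = len(ranks[0])    # number of conditions (columns)
    if n < 2 or k < 2:
        raise ValueError("need at least 2 rows and 2 columns")

    grand = sum(sum(row) for row in ranks) / (n * k)
    row_means = [sum(row) / k for row in ranks]
    col_means = [sum(row[j] for row in ranks) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_total = sum((x - grand) ** 2 for row in ranks for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    denominator = ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    return (ms_rows - ms_error) / denominator


def kendall_tau_a(x, y):
    """Kendall's tau-a between two equal-length rankings (no tie correction)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("rankings must have the same length, at least 2")
    pairs = list(combinations(range(len(x)), 2))
    score = 0
    for i, j in pairs:
        s = (x[i] - x[j]) * (y[i] - y[j])
        score += (s > 0) - (s < 0)   # +1 concordant, -1 discordant, 0 tied
    return score / len(pairs)


# Hypothetical toy data: ranks of five systems under three conditions.
toy_ranks = [
    [1, 1, 2],
    [2, 3, 1],
    [3, 2, 3],
    [4, 5, 4],
    [5, 4, 5],
]
gold_standard = [1, 2, 3, 4, 5]           # hypothetical reference ranking
first_condition = [row[0] for row in toy_ranks]

toy_icc = icc_2_1(toy_ranks)
toy_tau = kendall_tau_a(first_condition, gold_standard)
```

An ICC near 1 suggests that the ranks agree closely across conditions, while a tau near 1 suggests that a ranking orders the systems almost as the reference ranking does. In practice, a vetted statistics library would be preferable to hand-written formulas.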