Document recommendation systems for searching relevant literature have mostly relied on methods developed a decade ago. This is largely due to the lack of a large offline gold-standard benchmark of relevant documents covering a variety of research fields, against which newly developed literature search techniques can be compared, improved, and translated into practice. To overcome this bottleneck, we have established the RElevant LIterature SearcH consortium of more than 1,500 scientists in 84 countries, who have collectively annotated the relevance of over 180,000 PubMed articles with regard to their respective seed (input) article(s). The majority of annotations were contributed by highly experienced original authors of the seed articles. The collected data cover 76% of all unique PubMed MeSH descriptors. No systematic biases were observed across different experience levels, research fields, or time spent on annotations. More importantly, annotations of the same document pairs by different researchers are highly consistent with each other. We further show that three representative baseline methods [Okapi Best Matching 25 (BM25), Term Frequency–Inverse Document Frequency (TF-IDF), and PubMed Related Articles (PMRA)] have similar overall performance. The database server, located at https://relishdb.ict.griffith.edu.au, is freely available for data downloading and blind testing of new methods.
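To make the seed-to-candidate setting concrete, the following is a minimal sketch of how one of the named baselines, Okapi BM25, could rank candidate articles against a seed article. It is an illustration under stated assumptions, not the implementation evaluated in this work: the function name `bm25_scores`, the whitespace tokenization, the parameter values `k1=1.2` and `b=0.75`, the smoothed IDF variant, and the toy texts are all hypothetical choices.

```python
import math
from collections import Counter


def bm25_scores(seed_tokens, candidate_docs, k1=1.2, b=0.75):
    """Score each tokenized candidate against the seed tokens with Okapi BM25.

    Hypothetical helper: k1 and b are common defaults, not values from the paper.
    """
    if not candidate_docs:
        return []
    n_docs = len(candidate_docs)
    avg_len = sum(len(doc) for doc in candidate_docs) / n_docs

    # Document frequency: number of candidates that contain each term.
    doc_freq = Counter()
    for doc in candidate_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in candidate_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in set(seed_tokens):
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # terms absent from the candidate contribute nothing
            df = doc_freq[term]
            # Smoothed IDF; the "1 +" inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy data (invented for illustration, not drawn from the benchmark).
seed = "protein structure prediction with deep learning".split()
candidates = [
    "deep learning for protein structure prediction".split(),
    "clinical trial of a new diabetes drug".split(),
    "protein folding and structure".split(),
]
scores = bm25_scores(seed, candidates)
# Candidate indices ordered from most to least similar to the seed.
ranking = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
```

In a benchmark setting of this kind, a ranking such as `ranking` would then be compared against the annotators' relevance labels for the same seed article; the comparison metric is left open here.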