Managing drug–drug interactions (DDIs) is a critical issue because of the overwhelming amount of information available on them. Natural Language Processing (NLP) techniques can reduce the time healthcare professionals spend reviewing the biomedical literature. However, the shortage of annotated corpora for DDI extraction is the main bottleneck in developing NLP systems for this area of pharmacovigilance. For this reason, we are pleased to announce that the DDI corpus, a corpus annotated with pharmacological substances and drug–drug interactions, is now available at http://labda.inf.uc3m.es/ddicorpus.
The DDI corpus is made up of 792 texts selected from the DrugBank database and 233 additional MedLine abstracts on the subject of DDIs. The corpus was annotated with a total of 18,502 pharmacological substances and 5,028 DDIs, including both pharmacokinetic (PK) and pharmacodynamic (PD) interactions. To date, corpora annotated with DDIs have focused on PK DDIs and have not covered PD DDIs.
Annotation guidelines were developed by domain experts in order to ensure a high-quality, reliable and accurate annotation of the corpus. Pharmacological substances were classified according to four entity types: drug (for generic drugs), brand (for trade drugs), group (for drug classes) and drug_n (for active substances not approved for human use). DDIs were also classified into four types: mechanism (for DDIs describing the way the interaction occurs), effect (for DDIs describing the consequence of the interaction), advice (for DDIs described by a recommendation or advice) and int (for DDIs without any additional information). Inter-Annotator Agreement (IAA) was measured to assess the consistency and quality of the corpus. The agreement was almost perfect (Kappa up to 0.96 and generally over 0.80), except for the DDIs in the MedLine database (0.55–0.72).
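As a rough illustration of the label sets and of how a kappa statistic of this kind is computed, here is a minimal Python sketch. It assumes two annotators and Cohen's kappa; the announcement only says "Kappa", so the exact variant used for the corpus is an assumption, and the names EntityType, DDIType and cohen_kappa, as well as the toy labels in the usage example, are hypothetical.

```python
from collections import Counter
from enum import Enum


class EntityType(Enum):
    """The four entity types named in the annotation guidelines."""
    DRUG = "drug"        # generic drugs
    BRAND = "brand"      # trade drugs
    GROUP = "group"      # drug classes
    DRUG_N = "drug_n"    # active substances not approved for human use


class DDIType(Enum):
    """The four DDI types named in the annotation guidelines."""
    MECHANISM = "mechanism"  # how the interaction occurs
    EFFECT = "effect"        # consequence of the interaction
    ADVICE = "advice"        # recommendation or advice
    INT = "int"              # no additional information


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: this is one common kappa variant, not necessarily
    the one used to evaluate the DDI corpus.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Toy, made-up annotations; not taken from the corpus.
    annotator_a = [DDIType.MECHANISM, DDIType.MECHANISM, DDIType.EFFECT, DDIType.EFFECT]
    annotator_b = [DDIType.MECHANISM, DDIType.MECHANISM, DDIType.EFFECT, DDIType.ADVICE]
    print(cohen_kappa(annotator_a, annotator_b))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is a stricter measure than simple percentage agreement when some labels are much more frequent than others.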
The DDI corpus was developed for the SemEval-2013 DDIExtraction 2013 task, whose main goal was to provide a common framework for evaluating information extraction techniques applied to the recognition and classification of pharmacological substances (DrugNER subtask) and to the detection and classification of drug–drug interactions (DDIExtraction subtask) in biomedical texts. The DDI corpus is a valuable gold standard for research groups interested in the recognition of pharmacologically active substances, including drugs, groups of drugs and toxins, as well as for those working specifically on DDI relation extraction.
The DDI corpus is divided into two datasets: training and test. The training dataset is the same for both subtasks and contains gold-standard annotations of pharmacological substances and their interactions. It consists of 714 texts (572 from DrugBank and 142 MedLine abstracts) annotated with pharmacological substances (13029 from DrugBank and 1826 from MedLine) and a total of 4037 DDIs (3805 from DrugBank and 232 from MedLine). The test dataset for the DrugNER subtask consists of 52 DrugBank texts (annotated with 303 pharmacological substances) and 58 MedLine abstracts (with 382 pharmacological substances). The test dataset for the DDIExtraction subtask consists of 158 DrugBank texts (annotated with 889 DDIs) and 33 MedLine abstracts (with 95 DDIs). We hope that the release of this dataset will encourage further research on the DDI problem.
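For readers who want to sanity-check the split sizes before working with the data, the figures above can be tabulated with a short Python sketch. The Split class and the constant names are hypothetical helpers introduced for illustration; only the per-source counts come from this announcement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    """Counts for one dataset, broken down by source (hypothetical helper)."""
    drugbank: int
    medline: int

    @property
    def total(self) -> int:
        return self.drugbank + self.medline


# Per-source counts as stated in the announcement.
TRAIN_TEXTS = Split(drugbank=572, medline=142)
TRAIN_DDIS = Split(drugbank=3805, medline=232)
TEST_NER_TEXTS = Split(drugbank=52, medline=58)
TEST_NER_SUBSTANCES = Split(drugbank=303, medline=382)
TEST_DDI_TEXTS = Split(drugbank=158, medline=33)
TEST_DDI_DDIS = Split(drugbank=889, medline=95)

if __name__ == "__main__":
    rows = {
        "training texts": TRAIN_TEXTS,
        "training DDIs": TRAIN_DDIS,
        "DrugNER test texts": TEST_NER_TEXTS,
        "DrugNER test substances": TEST_NER_SUBSTANCES,
        "DDIExtraction test texts": TEST_DDI_TEXTS,
        "DDIExtraction test DDIs": TEST_DDI_DDIS,
    }
    for name, split in rows.items():
        print(f"{name}: {split.drugbank} + {split.medline} = {split.total}")
```

The training totals computed this way (714 texts and 4037 DDIs) match the totals stated above.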
A detailed description of the DDI corpus and the DDIExtraction 2013 task can be found in the following articles:
María Herrero-Zazo, Isabel Segura-Bedmar, Paloma Martínez, Thierry Declerck. The DDI corpus: An annotated corpus with pharmacological substances and drug–drug interactions. Journal of Biomedical Informatics, Volume 46, Issue 5, October 2013, Pages 914-920, ISSN 1532-0464, http://dx.doi.org/10.1016/j.jbi.2013.07.011.
Isabel Segura-Bedmar, Paloma Martínez, María Herrero-Zazo. SemEval-2013 Task 9: Extraction of Drug-Drug Interactions from Biomedical Texts (DDIExtraction 2013). In Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval 2013).
Contact info: Isabel Segura-Bedmar