Coreference resolution is the task of identifying all the expressions in a text (pronouns, nouns, or noun phrases) that refer to the same entity, and linking them to that entity. This post, inspired by a real-world problem, describes a few challenges and explores a few approaches, along with code snippets.

First, a simple example. Given the text “John Smith lives in Singapore. He has a 2-year-old golden retriever”, the coreference resolution would identify that “He” refers to “John Smith”.

The Challenges

Real World Challenges

Below are a few challenges observed in real-world publications.

Real World Challenge #1

(RWC1 was extracted from [#6])

On November 3, 1992, Clinton was elected the 42nd president of the United States, and the following year Hillary Clinton became the first lady. In 2013, he won the Presidential Medal of Freedom.

Challenge (for the model) : “he” refers to “Clinton” or “Hillary Clinton”?

Real World Challenge #2

(RWC2 was extracted from this news article published by The Straits Times)

While it is normal for defendants charged with felonies to be handcuffed – as former Trump Organisation chief financial officer Allen Weisselberg was in 2021 – one of Trump’s lawyers, Mr Joseph Tacopina, has said he does not expect that to occur.

Challenge (for the model) : “he” refers to “Joseph Tacopina”, “Trump” or “Allen Weisselberg”?

Real World Challenge #3

(RWC3 was extracted from this news article published by The Straits Times)

Mr Li, who was then the Shanghai chief, said PM Lee had gone to great lengths to discuss “people’s well-being and how to deliver tangible benefits to the people”. “I was impressed by our conversation. You also talked about cultural diversity and inclusiveness,” he told PM Lee.

Challenge (for the model) : “he” refers to “PM Lee” or “Mr Li”? “I” refers to “Mr Li” or “PM Lee”?

Winograd Schema Challenge

The Winograd Schema Challenge (WSC) is a test that assesses a system’s capability to perform commonsense reasoning, and it also serves as an alternative to the Turing Test. A Winograd schema is a pair of sentences that differ in only one or two words but contain a highly ambiguous pronoun. The pronoun resolves differently in the two sentences, and resolving it correctly requires commonsense knowledge. The examples were created to be easy for humans but challenging for machines, which need a deeper understanding of the text’s context and the situation it describes.

A few Winograd schemas (from [#8]) :

WSC02 : “The trophy does not fit into the brown suitcase because it is too large.”

Question asked : What is too large? the trophy or the suitcase?

WSC06 : “The delivery truck zoomed by the school bus because it was going so slow.”

Question asked : What was going so slow? the delivery truck or the school bus?

WSC10 : “John couldn’t see the stage with Billy in front of him because he is so tall.”

Question asked : Who is so tall? John or Billy?

A Few Approaches

Over the years, various approaches have been proposed to improve the performance of coreference resolution. These include:

  • Rule-based approaches:
    • Use a set of hand-crafted rules to identify and cluster mentions in a text.
    • Simple and interpretable, but may not generalize well to new datasets.
  • Mention-ranking approaches:
    • Use machine learning techniques to rank candidate antecedents for each mention based on their features. Features may include syntactic, semantic, and discourse-level information. The highest-ranking antecedent is then selected as the final antecedent.
    • Effective, but can be computationally expensive.
  • Entity-based approaches:
    • Use named entity recognition (NER) (and other approaches) to identify and cluster mentions that refer to the same entity.
    • Examples include : entity-centric [#2], [#3], entity-grid [#1]
  • Hybrid approaches: These approaches combine multiple techniques to improve the performance of coreference resolution. Examples include,
    • Deep reinforcement learning + mention-ranking [#4]
    • SpanBERT [#5]
    • Entity-centric + graph neural networks [#6]
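To make the rule-based idea concrete, here is a minimal sketch (not taken from any published system or the libraries discussed below) that resolves a pronoun to the nearest preceding mention with compatible gender and number features. The mention list and its feature tags are hand-supplied assumptions; real rule-based systems use far richer rule sets.

```python
# Minimal rule-based sketch: resolve a pronoun to the nearest
# preceding mention whose gender/number features are compatible.
# (Illustrative only -- not a production-grade resolver.)

PRONOUN_FEATURES = {
    "he": ("male", "singular"),
    "she": ("female", "singular"),
    "it": ("neuter", "singular"),
    "they": ("any", "plural"),
}

def resolve_pronoun(pronoun, mentions):
    """mentions: list of (text, gender, number) tuples, in document order."""
    gender, number = PRONOUN_FEATURES[pronoun.lower()]
    # Walk backwards: the nearest preceding compatible mention wins.
    for text, g, n in reversed(mentions):
        if gender in ("any", g) and number == n:
            return text
    return None

mentions = [
    ("John Smith", "male", "singular"),
    ("Singapore", "neuter", "singular"),
]
print(resolve_pronoun("He", mentions))  # John Smith
```

On the running example, “He” skips “Singapore” (incompatible gender) and lands on “John Smith”, which also illustrates why such hand-crafted rules break down on harder cases like RWC1–RWC3.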

Mention-ranking approach

Mention-ranking approaches to coreference resolution work by assigning scores to pairs of mentions based on their likelihood of coreference. Consider the following text as an example: “Nirmalya invited his old friends Suraj and Sandeep for dinner. They had not seen each other in years, so the catch-up was long overdue.” There are several mentions: “Nirmalya”, “his”, “Suraj”, “Sandeep”, and “they”. A model based on the mention-ranking approach will first compute pairwise probabilities such as

P("Nirmalya", "his") = 0.5             # OK
P("Suraj", "they") = 0.05
P("Sandeep", "they") = 0.03
P("Nirmalya", "they") = 0.02
P("Suraj and Sandeep", "they") = 0.4   # OK
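Given such pairwise probabilities, the model then links each mention to its highest-scoring candidate antecedent, typically with a threshold below which the mention starts a new cluster instead. A minimal sketch of that selection step (the scores and the 0.1 threshold are illustrative assumptions, not the output of a trained model):

```python
# Sketch of the antecedent-selection step in mention ranking:
# each mention is linked to the candidate antecedent with the
# highest pairwise score, provided that score clears a threshold.
# Scores are illustrative, matching the example probabilities above.

scores = {
    "his": {"Nirmalya": 0.5},
    "they": {
        "Suraj": 0.05,
        "Sandeep": 0.03,
        "Nirmalya": 0.02,
        "Suraj and Sandeep": 0.4,
    },
}

def select_antecedents(scores, threshold=0.1):
    links = {}
    for mention, candidates in scores.items():
        antecedent, score = max(candidates.items(), key=lambda kv: kv[1])
        # Below the threshold, the mention starts its own cluster.
        links[mention] = antecedent if score >= threshold else None
    return links

print(select_antecedents(scores))
# {'his': 'Nirmalya', 'they': 'Suraj and Sandeep'}
```

Here “his” links to “Nirmalya” and “they” links to “Suraj and Sandeep”, the two pairs marked OK above.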

Evaluation

At the time of writing (March 2023), I am aware of 3 main Python libraries used for coreference resolution. These are: allennlp, fastcoref and neuralcoref. These 3 libraries have a few conflicting dependencies.

| Library     | Version | Dependencies            | License    |
| ----------- | ------- | ----------------------- | ---------- |
| allennlp    | 2.10.1  | spacy >= 2.1.0          | Apache 2.0 |
| fastcoref   | 2.1.1   | spacy == 3.0.6          | MIT        |
| neuralcoref | 4.0.0   | spacy >= 2.1.0, < 3.0.0 | MIT        |

Ease Of Installation (based on my experience)

  • allennlp, ⭐⭐⭐
  • fastcoref, ⭐⭐⭐⭐⭐
  • neuralcoref, ⭐⭐

Evaluation Methodology

For each library evaluated, I used the texts RWC1, RWC2, RWC3, WSC02, WSC06, and WSC10. I then calculated a score (maximum 6.0) for each, based on correct (1 point), partial (0.5 points) and incorrect (0 points) identification of the expected clusters.

Please note:

  • This is not intended to be an academic / scientific / research-grade evaluation.
  • The intent is to get a quick idea of the performance of the libraries against a few real world examples, challenges encountered (if any) and licenses.
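The scoring scheme above can be expressed as a small helper. The per-example verdicts below are hypothetical placeholders; the actual verdicts for each library appear in the results tables that follow.

```python
# Scoring helper for the evaluation: 1 point for a correct cluster
# identification, 0.5 for partial, 0 for incorrect.
# The verdicts dict below is a hypothetical example.

POINTS = {"correct": 1.0, "partial": 0.5, "incorrect": 0.0}

def score(verdicts):
    """verdicts: dict mapping example id -> 'correct' / 'partial' / 'incorrect'."""
    return sum(POINTS[v] for v in verdicts.values())

verdicts = {
    "RWC1": "correct", "RWC2": "correct", "RWC3": "partial",
    "WSC02": "correct", "WSC06": "incorrect", "WSC10": "incorrect",
}
print(f"{score(verdicts)} / {len(verdicts)}.0")  # 3.5 / 6.0
```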

Texts Used For Evaluation

texts_for_testing = [
    "On November 3, 1992, Clinton was elected the 42nd president of the United States, and the following year Hillary Clinton became the first lady. In 2013, he won the Presidential Medal of Freedom.",
    "While it is normal for defendants charged with felonies to be handcuffed – as former Trump Organisation chief financial officer Allen Weisselberg was in 2021 – one of Trump’s lawyers, Mr Joseph Tacopina, has said he does not expect that to occur.",
    """Mr Li, who was then the Shanghai chief, said PM Lee had gone to great lengths to discuss "people’s well-being and how to deliver tangible benefits to the people". "I was impressed by our conversation. You also talked about cultural diversity and inclusiveness," he told PM Lee.""",
    "The trophy does not fit into the brown suitcase because it is too large.",
    "The delivery truck zoomed by the school bus because it was going so slow.",
    "John couldn’t see the stage with Billy in front of him because he is so tall.",
]

allennlp

Installing allennlp was not smooth for me. I have created a Google Colab notebook to make it easier for you to try it out.

from allennlp.predictors.predictor import Predictor

model_url = "https://storage.googleapis.com/allennlp-public-models/coref-spanbert-large-2020.02.27.tar.gz"
predictor = Predictor.from_path(model_url)

for text in texts_for_testing:
  print("    TEXT :", text)
  prediction = predictor.predict(document=text)
  tokens = prediction["document"]
  l_clusters = []
  for cluster in prediction["clusters"]:
      # Each cluster is a list of [start, end] token spans (end index inclusive).
      a_cluster = [" ".join(tokens[start:end + 1]) for start, end in cluster]
      l_clusters.append(a_cluster)

  print("CLUSTERS :", l_clusters)

Results For allennlp

| Example | Clusters | Correct? |
| --- | --- | --- |
| RWC1 | [['Clinton', 'he']] | Correct |
| RWC2 | [['Trump Organisation', 'Trump ’s'], ['one of Trump ’s lawyers , Mr Joseph Tacopina ,', 'he'], ['be', 'that']] | Correct |
| RWC3 | [['Mr Li , who was then the Shanghai chief', 'I', 'he'], ['PM Lee', 'You', 'PM Lee']] | Correct |
| WSC02 | [['The trophy', 'it']] | Correct |
| WSC06 | [['The delivery truck', 'it']] | Correct |
| WSC10 | [['John', 'him'], ['Billy', 'he']] | Correct |

Evaluation Score For allennlp : 6.0 / 6.0

fastcoref

from fastcoref import FCoref

model = FCoref(device='cuda:0')

preds = model.predict(texts=texts_for_testing)

for i, text in enumerate(texts_for_testing):
  print("    TEXT :", text)
  print("CLUSTERS :", preds[i].get_clusters())

Results For fastcoref

| Example | Clusters | Correct? |
| --- | --- | --- |
| RWC1 | [['Clinton', 'he']] | Correct |
| RWC2 | [['one of Trump’s lawyers, Mr Joseph Tacopina', 'he'], ['handcuffed', 'that']] | Correct |
| RWC3 | [['Mr Li, who was then the Shanghai chief,', 'I', 'he'], ['PM Lee', 'You', 'PM Lee']] | Correct |
| WSC02 | [['the brown suitcase', 'it']] | Incorrect |
| WSC06 | [['The delivery truck', 'it']] | Correct |
| WSC10 | [['John', 'him', 'he']] | Incorrect |

Evaluation Score For fastcoref : 4.0 / 6.0

neuralcoref

Installing neuralcoref was not smooth for me. So, I have added a separate subsection for help with troubleshooting.

import neuralcoref
import spacy

nlp = spacy.load("en_core_web_sm")
# Register neuralcoref in the spaCy pipeline so that doc._.coref_clusters is available.
neuralcoref.add_to_pipe(nlp)

for i, text in enumerate(texts_for_testing):
  doc_x = nlp(text)
  print("    TEXT :", text)
  print("CLUSTERS :", doc_x._.coref_clusters)

Results For neuralcoref

| Example | Clusters | Correct? |
| --- | --- | --- |
| RWC1 | [Clinton: [Clinton, Hillary Clinton, he]] | Incorrect |
| RWC2 | [Trump: [Trump, Trump], Mr Joseph Tacopina: [Mr Joseph Tacopina, he]] | Correct |
| RWC3 | [PM Lee: [PM Lee, he, PM Lee]] | Partial |
| WSC02 | [The trophy: [The trophy, it]] | Correct |
| WSC06 | [] | Incorrect |
| WSC10 | [John: [John, him, he]] | Incorrect |

Evaluation Score For neuralcoref : 2.5 / 6.0

Installation Troubleshooting For neuralcoref

  1. If you encounter an error similar to the one below, follow the instructions articulated in this answer on StackOverflow.

Installing collected packages: neuralcoref
  error: subprocess-exited-with-error

  × Running setup.py install for neuralcoref did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  Running setup.py install for neuralcoref ... error: legacy-install-failure

× Encountered error while trying to install package.
╰─> neuralcoref

  2. If you encounter a warning similar to the one below, you could try the suggestion articulated in this Medium post.

RuntimeWarning: spacy.tokens.span.Span size changed, may indicate binary incompatibility. Expected X from C header, got Y from PyObject

  3. I have created a Google Colab notebook, adapted from this answer on StackOverflow.

TL;DR

  • Coreference resolution is hard.
  • Python libraries such as allennlp, fastcoref and neuralcoref make the task simpler.
  • Balancing speed, ease of installation and performance on real-world examples, I suggest using fastcoref. Moreover, it comes with an MIT license, making it well suited for use in commercial applications.

References

  1. Modeling Local Coherence: An Entity-Based Approach (2008)
  2. Entity-Centric Coreference Resolution with Model Stacking (2015)
  3. Deterministic Coreference Resolution Based on Entity-Centric, Precision-Ranked Rules (2013)
  4. Deep Reinforcement Learning for Mention-Ranking Coreference Models (2016)
  5. SpanBERT: Improving Pre-training by Representing and Predicting Spans (2019)
  6. Improving Coreference Resolution by Leveraging Entity-Centric Features with Graph Neural Networks and Second-order Inference (2020)
  7. A Brief Survey on Recent Advances in Coreference Resolution (2021)
  8. NYU’s Collection of Winograd Schemas
  9. Here are two great sites for understanding open source software licenses : FOSSA, and choosealicense.com