Developing custom Named Entity Recognition (NER) models for specific use cases depends on the availability of high-quality annotated datasets, which can be expensive to produce. As someone who has worked on several real-world use cases, I know the challenges all too well. This post describes a few of those real-world challenges, a solution that reduces human effort whilst maintaining high quality, and code snippets for the solution.

Real World Challenges

spaCy is an excellent industrial-grade Python library for various NLP tasks. However, it sometimes falls short when used “out of the box” for specific NLP tasks, such as NER, on region-specific texts.

  • Listed below are a few such examples.
    • Geographies covered: Singapore, Malaysia, Thailand and Indonesia.
    • The text spans of interest within each example are indicated in bold italics.
  • I have used the latest version (at the time of writing) of the spaCy model, English - en_core_web_sm (v3.5.0), to evaluate the examples below. The spaCy model’s output for each example can be visualized using the links; a minimal snippet for reproducing the evaluation locally follows this list.
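For reference, here is a minimal sketch of how such an evaluation can be run, assuming the en_core_web_sm model is installed; the sentence is adapted from Example 1 below.

import spacy
from spacy import displacy

# Load the small English pipeline (v3.5.0 at the time of writing)
nlp = spacy.load("en_core_web_sm")

doc = nlp("Hajjah Fatimah binte Sulaiman was born in what is now Malacca in the mid-1700s.")
print([(ent.text, ent.label_) for ent in doc.ents])

# Visualize the entities; use displacy.serve(doc, style="ent") outside a notebook
displacy.render(doc, style="ent")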

TL;DR: If you have just a minute, I’d encourage you to read through a few of the examples before jumping to the TL;DR at the end.

Example 1 : Singapore

If this name is familiar to you, you might be thinking of Masjid Hajjah Fatimah located along Beach Road. Yup, this is the iconic lady that the mosque was commissioned by and is named after! Hajjah Fatimah binte Sulaiman was born in what is now Malacca in the mid-1700s, but she later moved to Singapore with her merchant husband. After his death, Hajjah Fatimah took over his business and grew it into an impressive trading operation.[Source] (11 Jan 2023).

In this example, spaCy identified the following entities (visualize this example):

| Entity | Entity Type Inferred By spaCy | Correct? |
| --- | --- | --- |
| Masjid Hajjah Fatimah | PERSON | Incorrect |
| Hajjah Fatimah | PERSON | Incorrect - Partial |
| binte Sulaiman | PERSON | Incorrect - Partial |
| Malacca | PERSON | Incorrect - Misclassified |
| the mid-1700s | DATE | Correct |
| Singapore | GPE | Correct |
| Hajjah Fatimah | PERSON | Correct |
  • The name of one of the persons of interest within this text, Hajjah Fatimah binte Sulaiman, was identified across two spans, thus indicating two persons, which might not be useful for the use case.
  • Another mistake was the span Masjid Hajjah Fatimah being identified as a PERSON, when in reality it refers to a mosque (the word “masjid” means mosque).

Example 2 : Malaysia

Hailing from Johor, Associate Professor Madya Dr Nur Adlyka Binti Ainul Annuar was declared a winner of Britain’s Women of the Future Award South East Asia 2021[Source] (04 Mar 2022).

In this example, spaCy identified the following entities (visualize this example):

| Entity | Entity Type Inferred By spaCy | Correct? |
| --- | --- | --- |
| Madya | PERSON | Incorrect |
| Britain | GPE | Correct |
| Women of the Future Award South East Asia | ORG | Incorrect |
| 2021 | DATE | Correct |

The following text spans of interest were not identified by the spaCy model:

  • The place, Johor
  • The person’s name, Dr Nur Adlyka Binti Ainul Annuar

Example 3 : Indonesia

The startup’s most recent round was its series A in August 2016. Jualo’s founder, Chaim Fetter, is a Dutch tech entrepreneur who also started Peduli Anak Indonesia, a nonprofit that helps underprivileged children in Lombok.[Source] (26 Mar 2022).

In this example, spaCy was not able to identify the person or place, and made a few misclassifications (visualize this example):

| Entity | Entity Type Inferred By spaCy | Correct? |
| --- | --- | --- |
| August 2016 | DATE | Correct |
| Jualo | PERSON | Incorrect - Misclassified |
| Chaim Fetter | ORG | Incorrect - Misclassified |
| Peduli Anak | GPE | Incorrect - Partial & Misclassified |
| Indonesia | GPE | Incorrect - Partial & Misclassified |
| Lombok | GPE | Correct |
  • The presence of the word Indonesia within the name of the organization Peduli Anak Indonesia was likely the reason spaCy identified it as two separate named entities.

Example 4 : Thailand

When it comes to fashion in Thailand, Pun Thriratanachat is one of the undisputed masters of fashion and design.[Source] (26 Mar 2022).

In this example, spaCy was not able to identify the person (visualize this example):

| Entity | Entity Type Inferred By spaCy | Correct? |
| --- | --- | --- |
| Thailand | GPE | Correct |

Why Custom NER Models?

In order to overcome limitations similar to those in the examples above (which, by the way, are perfectly understandable), custom NER models need to be developed for specific use cases. This can get expensive. Costs arise not just from the human annotation exercise but also from validation and, worst of all, corrections.

GPT-3/3.5 Using Promptify

First, a recap:

  • ChatGPT is powered by the GPT-3.5 family of large language models (LLMs) and has been fine-tuned (an approach to transfer learning) using Reinforcement Learning from Human Feedback (RLHF).
  • Promptify is a Python library for generating the prompts used to interact with LLMs on prompt-based NLP tasks such as:
    • Named Entity Recognition (NER)
    • Text Classification
    • Question Answering
    • Etc.
  • Prompts are text inputs provided to LLMs, such as ChatGPT, to serve as a starting point for the model to generate its output. A prompt is analogous to a cue given to a human when asking her/him a question. Over recent months, a new discipline, dubbed “prompt engineering”, has arisen - the motivation behind it is that simple or complex perturbations of the text inputs, i.e. the prompts, can significantly improve the results returned for the same query.

Baseline - Zero Shot

Zero-Shot Learning (ZSL) in NLP is a technique that allows models to analyze language similar to how humans learn. It enables models to make inferences on new, unseen data even if they have not been trained on that specific data.

“ZSL helps improve productivity” - the author’s and other practitioners’ experience

  • For some use cases, ZSL helps save time and resources by eliminating the need to train separate models.
  • For other use cases, ZSL helps save time on the annotation required for specific NLP tasks such as text classification, NER, etc.

Creating A Baseline Model

OpenAI has released several GPT-3 models, which have since been superseded by more powerful GPT-3.5 generation models. For the purpose of creating a baseline, I chose the text-babbage-001 model, as I observed that it performed reasonably well on the text blocks I used for evaluation. Prices of the different models vary; more details can be found here.

The code snippets below make use of the promptify library.

First, initialize the model and the prompter. The default model is text-davinci-003.

from promptify import OpenAI, Prompter

llm_model = OpenAI(api_key)         # wraps the OpenAI API, using the default model
llm_prompter = Prompter(llm_model)  # builds prompts from templates and sends them to the model
  • The choice of model can be changed via the model parameter.
  • Supported models at the time of writing are: gpt-3.5-turbo (can be expensive, depending on the volume), text-davinci-003, text-curie-001, text-babbage-001, and text-ada-001 (cheapest, but not practical).
llm_model = OpenAI(api_key, model="text-babbage-001")

Next, use the instance of the Prompter to construct a simple prompt with instructions and send to the LLM.

  • Note: The labels PERSON, ORG, PLACE were not pre-defined - they were introduced for this specific NER task.
text = "Hajjah Fatimah Binte Sulaiman was born in Malacca"
# Extracted from example 1

result = llm_prompter.fit(
   "ner.jinja",
   domain = "general",
   text_input = text,
   labels = ["PERSON", "ORG", "PLACE"])
print(result)

Results (formatted for ease of readability)

{'text': " [[
   {'E': 'Hajjah Fatimah Binte Sulaiman', 'T': 'BORN'}, 
   {'E': 'Malacca', 'T': 'PLACE'}]]",
   'prompt_tokens': 325, 'completion_tokens': 43, 'total_tokens': 368}

A few things to take note of from the example above:

  • Whilst the text span of the person was identified, the “out of the box” model misclassified its entity type. Capturing the full span is still a major improvement, as a common challenge with NER is identifying text spans with four or more words.
  • It correctly identified the place, Malacca.
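Before moving on, here is a small sketch for turning the completion into Python objects for downstream use, assuming the output has the structure shown above (the parsing approach is illustrative, not part of the Promptify API).

import ast

# The 'text' field holds a string representation of a nested Python list,
# so ast.literal_eval can parse it. Wrap this in try/except in practice,
# since the LLM may occasionally return malformed output.
parsed = ast.literal_eval(result["text"].strip())
for entity in parsed[0]:
    print(entity["E"], "->", entity["T"])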

The next step will be to provide a few examples in the prompt sent to the LLM.

Few Shot

Few-Shot Learning (FSL) refers to the ability to learn new concepts by training machine learning models with only a few examples. Most approaches to few-shot learning involve meta-learning, often referred to as “learning to learn”.

Meta-learning performs the learning by training on a variety of tasks, each of which requires the model to learn from only a few examples. During this process, it learns how to improve its learning algorithm, thus allowing it to generalize, i.e. adapt to new tasks based on only a few examples.

The code snippets below build on the work done in the previous section using ZSL.

Curate Examples For Few-Shot Learning

This step is crucial for a successful outcome.

“similar yet diverse enough” - author’s experience

The examples chosen need to be similar enough to learn from, yet diverse enough for the model to generalize.

I curated just 23 examples covering the three entity classes of interest, each with one or more entities.

few_shot_examples = [
   ["Tun Dr. Siti Hasmah Mohamad Ali is the wife of Malaysia's former Prime Minister, Tun Dr. Mahathir Mohamad and a well-respected medical doctor.", [{'E' : "Dr. Siti Hasmah Mohamad Ali", 'T': "PERSON" }, {'E' : "Malaysia", 'T': "GPE" }, {'E' : "Dr. Mahathir Mohamad", 'T': "PERSON" }]],
   ["Raja Permaisuri Agong Tunku Azizah Aminah Maimunah Iskandariah is the current Queen consort of Malaysia and the wife of the Yang di-Pertuan Agong, the Malaysian monarch.", [{'E' : "Azizah Aminah Maimunah Iskandariah", 'T': "PERSON" }, {'E' : "Malaysia", 'T': "GPE" }]],
   ["Zainah Alsagoff is a prominent lawyer and a senior partner at the law firm WongPartnership LLP.", [{'E' : "Zainah Alsagoff", 'T': "PERSON" }, {'E' : "WongPartnership LLP", 'T': "ORG" }]],
   ... # Truncated for brevity
]

Next, reconstruct the prompt with the few-shot examples, and send it to the LLM.

llm_model = OpenAI(api_key, model="text-davinci-003")
llm_prompter = Prompter(llm_model)  # re-create the prompter so it uses the new model

result = llm_prompter.fit(
   "ner.jinja",
   domain = "general",
   text_input = text,
   examples = few_shot_examples,
   labels = ["PERSON", "ORG", "PLACE", "GPE"])
print(result)

Results Using FSL

Summary

| GPT Model | Entity Types Correctly Identified | Approximate Cost |
| --- | --- | --- |
| text-babbage-001 | 1 (PERSON) | USD 0.90 |
| text-davinci-003 | 1 (PERSON), 1 partial (GPE vs PLACE) | USD 36.00 |
| gpt-3.5-turbo | 1 (PERSON), 1 partial (GPE vs PLACE) | USD 3.60 |
  • The approximate cost refers to the cost of automated annotation for 1,000 records of similar length, with a similar number of few-shot examples.

Details

Results (formatted for ease of readability)

When using the text-babbage-001 model,

{'text': "
   [[{'E': 'Hajjah Fatimah Binte Sulaiman', 'T': 'PERSON'}]]",
   'prompt_tokens': 1558, 'completion_tokens': 27, 'total_tokens': 1585}

When using the text-davinci-003 and gpt-3.5-turbo models,

{'text': "[[
   {'E': 'Hajjah Fatimah Binte Sulaiman', 'T': 'PERSON'},
   {'E': 'Malacca', 'T': 'GPE'}]]",
   'prompt_tokens': 1367, 'completion_tokens': 39, 'total_tokens': 1406}

A few things to take note of from the example above:

  • The text-babbage-001, text-davinci-003 and gpt-3.5-turbo models were able to pick up the name of the person. Curiously, text-curie-001 was not able to.
  • Depending on the use case, a significant improvement in the entity class of interest can be considered a success.
  • If the set of place names of interest is small, other approaches can be employed to make the necessary corrections between the entity types GPE and PLACE. One library I have often used is flashtext; a short sketch follows this list.
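Here is a minimal sketch of that correction step, assuming a small, known list of place names; the known_places list and the relabel_places helper are illustrative, not part of the flashtext API.

from flashtext import KeywordProcessor

# Hypothetical list of place names of interest
known_places = ["Malacca", "Johor", "Lombok"]

kp = KeywordProcessor(case_sensitive=True)
kp.add_keywords_from_list(known_places)

def relabel_places(text, entities):
    # entities: LLM output dicts such as {'E': 'Malacca', 'T': 'GPE'}
    found = set(kp.extract_keywords(text))
    return [{**ent, "T": "PLACE"} if ent["E"] in found else ent for ent in entities]

entities = [{"E": "Hajjah Fatimah Binte Sulaiman", "T": "PERSON"},
            {"E": "Malacca", "T": "GPE"}]
print(relabel_places("Hajjah Fatimah Binte Sulaiman was born in Malacca", entities))
# [{'E': 'Hajjah Fatimah Binte Sulaiman', 'T': 'PERSON'}, {'E': 'Malacca', 'T': 'PLACE'}]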

Constructing Own Prompts

An alternative to libraries such as Promptify would be to construct your own prompts, with examples and custom instructions. Here’s an example.
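For illustration, below is a minimal sketch of a hand-constructed few-shot NER prompt sent with the pre-v1.0 openai Python client; the prompt wording and the build_prompt helper are assumptions for illustration, not the content of the linked example.

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def build_prompt(text, examples, labels):
    # Assemble the instructions, a handful of worked examples, and the new text
    lines = [
        "You are a named entity recognition system. Extract entities of types "
        + ", ".join(labels)
        + " from the text and return them as a list of {'E': <entity>, 'T': <type>} dictionaries.",
        "",
    ]
    for example_text, example_entities in examples:
        lines.append("Text: " + example_text)
        lines.append("Entities: " + str(example_entities))
        lines.append("")
    lines.append("Text: " + text)
    lines.append("Entities:")
    return "\n".join(lines)

prompt = build_prompt(
    "Hajjah Fatimah Binte Sulaiman was born in Malacca",
    few_shot_examples,  # curated in the previous section
    ["PERSON", "ORG", "PLACE", "GPE"],
)

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=256,
    temperature=0,  # deterministic output is preferable for annotation
)
print(response["choices"][0]["text"])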

The ability to develop good prompts is an art (aka “prompt engineering”) that is fast emerging as a must-have skill - more on that in a future blog post.

Add To Existing Annotations

The programmatically annotated datasets produced by the approaches above would need to be converted into a format acceptable to spaCy NER. This can be done using spaCy’s convert tool; a sketch of these steps follows the list below.

  • First, convert the inferred entities produced by the approaches above (I refer to these as the programmatically annotated datasets). You can refer to this gist on how to do so.
  • Next, convert the datasets into one of the conll, conllu, iob or ner formats.
  • Finally, use spaCy’s convert tool.
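Below is a minimal sketch of these steps, assuming the entities have already been parsed out of the LLM responses into (text, [(entity, label), ...]) records; the record format and file names are illustrative.

import spacy

nlp = spacy.blank("en")

# Hypothetical records produced by the annotation approaches above
records = [
    ("Hajjah Fatimah Binte Sulaiman was born in Malacca",
     [("Hajjah Fatimah Binte Sulaiman", "PERSON"), ("Malacca", "GPE")]),
]

with open("annotations.iob", "w", encoding="utf-8") as f:
    for text, entities in records:
        doc = nlp.make_doc(text)
        tags = ["O"] * len(doc)  # default: outside any entity
        for ent_text, label in entities:
            start = text.find(ent_text)  # naive: first occurrence only
            span = doc.char_span(start, start + len(ent_text), label=label) if start != -1 else None
            if span is None:  # skip spans that do not align to token boundaries
                continue
            tags[span.start] = "B-" + label
            for i in range(span.start + 1, span.end):
                tags[i] = "I-" + label
        # IOB format: one sentence per line, token|tag pairs separated by spaces
        f.write(" ".join(tok.text + "|" + tag for tok, tag in zip(doc, tags)) + "\n")

# Then, on the command line (verify the flags against your spaCy version):
# python -m spacy convert annotations.iob ./corpus --converter iob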

TL;DR

  • spaCy is an excellent industrial-grade Python library for various NLP tasks, including NER. However, using it “out of the box” sometimes does not work well for use cases which require a custom NER model to be developed on region/domain-specific texts.
  • Large language models (LLMs) similar to those powering ChatGPT can be used to produce annotated datasets that are not only high-quality but also require significantly reduced human effort.
  • Using LLMs to produce annotated datasets requires a few carefully curated examples to learn from, and well-defined prompts.
  • Depending on the use-case and the volume of records, the cost for automated annotation can be below US$ 100.

References