Developing custom Named Entity Recognition (NER) models for specific use cases depends on the availability of high-quality annotated datasets, which can be expensive to produce. As someone who has worked on several real-world use cases, I know the challenges all too well. This post describes a few of those challenges, a solution that reduces human effort whilst maintaining high quality, and code snippets for the solution.
Real World Challenges
spaCy is an excellent industrial-grade Python library for various NLP tasks. However, it sometimes falls short when used “out of the box” for specific NLP tasks, such as NER, on region-specific texts.
Listed below are a few such examples.
- Geographies covered: Singapore, Malaysia, Thailand and Indonesia.
- The text spans of interest within each example are indicated in bold italics.
- I have used the latest version (at the time of writing) of the spaCy English model, en_core_web_sm (v3.5.0), to evaluate the examples below. The spaCy model’s output for each example can be visualized using the links.
TL;DR: If you have just a minute, I’d encourage you to read through a few of the examples below before jumping to the TL;DR at the end.
Example 1 : Singapore
“If this name is familiar to you, you might be thinking of Masjid Hajjah Fatimah located along Beach Road. Yup, this is the iconic lady that the mosque was commissioned by and is named after! Hajjah Fatimah binte Sulaiman was born in what is now Malacca in the mid-1700s, but she later moved to Singapore with her merchant husband. After his death, Hajjah Fatimah took over his business and grew it into an impressive trading operation.” [Source] (11 Jan 2023).
In this example, spaCy identified the following entities (visualize this example):
Entity | Entity Type Inferred By spaCy | Correct? |
---|---|---|
Masjid Hajjah Fatimah | PERSON | Incorrect |
Hajjah Fatimah | PERSON | Incorrect - Partial |
binte Sulaiman | PERSON | Incorrect - Partial |
Malacca | PERSON | Incorrect - Misclassified |
the mid-1700s | DATE | Correct |
Singapore | GPE | Correct |
Hajjah Fatimah | PERSON | Correct |
- The name of one of the persons of interest within this text, Hajjah Fatimah binte Sulaiman, was identified across two spans, thus indicating two persons, which might not be useful for the use case.
- Another mistake was the span Masjid Hajjah Fatimah being identified as a PERSON, when in reality it refers to a mosque (the word “masjid” means mosque).
Example 2 : Malaysia
“Hailing from Johor, Associate Professor Madya Dr Nur Adlyka Binti Ainul Annuar was declared a winner of Britain’s Women of the Future Award South East Asia 2021” [Source] (04 Mar 2022).
In this example, spaCy identified the following entities (visualize this example):
Entity | Entity Type Inferred By spaCy | Correct? |
---|---|---|
Madya | PERSON | Incorrect |
Britain | GPE | Correct |
Women of the Future Award South East Asia | ORG | Incorrect |
2021 | DATE | Correct |
The following text spans of interest were not identified by the spaCy model:
- The place, Johor
- The person’s name, Dr Nur Adlyka Binti Ainul Annuar
Example 3 : Indonesia
“The startup’s most recent round was its series A in August 2016. Jualo’s founder, Chaim Fetter, is a Dutch tech entrepreneur who also started Peduli Anak Indonesia, a nonprofit that helps underprivileged children in Lombok.” [Source] (26 Mar 2022).
In this example, spaCy was not able to identify the person or place, and made a few misclassifications (visualize this example):
Entity | Entity Type Inferred By spaCy | Correct? |
---|---|---|
August 2016 | DATE | Correct |
Jualo | PERSON | Incorrect - Misclassified |
Chaim Fetter | ORG | Incorrect - Misclassified |
Peduli Anak | GPE | Incorrect - Partial & Misclassified |
Indonesia | GPE | Incorrect - Partial & Misclassified |
Lombok | GPE | Correct |
- The presence of the word Indonesia within the name of the organization Peduli Anak Indonesia was likely the reason spaCy identified it as two separate named entities.
Example 4 : Thailand
“When it comes to fashion in Thailand, Pun Thriratanachat is one of the undisputed masters of fashion and design.” [Source] (26 Mar 2022).
In this example, spaCy was not able to identify the person (visualize this example):
Entity | Entity Type Inferred By spaCy | Correct? |
---|---|---|
Thailand | GPE | Correct |
Why Custom NER Models?
In order to overcome limitations similar to those in the examples above (which, to be fair, are perfectly understandable), custom NER models need to be developed for specific use cases. This can get expensive. Costs arise not just from the human annotation exercise, but also from validation and, worse still, corrections.
GPT-3/3.5 Using Promptify
First, a recap:
- ChatGPT is powered by the GPT-3.5 family of large language models (LLMs) and has been fine-tuned (an approach to transfer learning) using Reinforcement Learning from Human Feedback (RLHF).
- Promptify is a Python library used to generate prompts for interacting with LLMs on prompt-based NLP tasks such as:
- Named Entity Recognition (NER)
- Text Classification
- Question Answering
- Etc.
- Prompts are text inputs provided to LLMs, such as ChatGPT, to serve as a starting point for the model to generate its output. A prompt is analogous to a cue given to a human when asking a question. Over recent months, a new discipline dubbed “prompt engineering” has arisen; the motivation behind it is that simple or complex perturbations of the text inputs, i.e. the prompts, can yield significantly improved results for the same query.
Baseline - Zero Shot
Zero-Shot Learning (ZSL) in NLP is a technique that enables models to make inferences on new, unseen data, even if they have not been specifically trained on that data.
In the author’s (and other practitioners’) experience, ZSL helps improve productivity:
- For some use cases, ZSL saves time and resources by eliminating the need to train separate models.
- For other use cases, ZSL saves time on the annotation required for specific NLP tasks such as text classification, NER, etc.
Creating A Baseline Model
OpenAI has released several GPT-3 models, which have since been superseded by the more powerful GPT-3.5 generation of models. For the purpose of creating a baseline, I chose the text-babbage-001 model, as I observed that it performed reasonably well on the text blocks I used for evaluation. Prices of the different models vary; more details can be found here.
The code snippets below make use of the `promptify` library.
First, initialize the model and prompter. The default model is `text-davinci-003`.
```python
from promptify import OpenAI, Prompter

llm_model = OpenAI(api_key)
llm_prompter = Prompter(llm_model)
```
- The choice of model can be changed via the `model` parameter.
- Supported models at the time of writing are: `gpt-3.5-turbo` (can be expensive, depending on the volume), `text-davinci-003`, `text-curie-001`, `text-babbage-001`, and `text-ada-001` (cheapest, but not practical).
```python
llm_model = OpenAI(api_key, model="text-babbage-001")
```
Next, use the instance of the `Prompter` to construct a simple prompt with instructions and send it to the LLM.
- Note: The labels `PERSON`, `ORG` and `PLACE` were not pre-defined - they were introduced for this specific NER task.
```python
text = "Hajjah Fatimah Binte Sulaiman was born in Malacca"  # Extracted from example 1

result = llm_prompter.fit(
    "ner.jinja",
    domain="general",
    text_input=text,
    labels=["PERSON", "ORG", "PLACE"])
print(result)
```
Results (formatted for ease of readability):

```python
{'text': "[[{'E': 'Hajjah Fatimah Binte Sulaiman', 'T': 'BORN'},
            {'E': 'Malacca', 'T': 'PLACE'}]]",
 'prompt_tokens': 325, 'completion_tokens': 43, 'total_tokens': 368}
```
A few things to take note of from the example above:
- Whilst the text span of the person was identified, the “out of the box” model misclassified its type (as BORN). Correctly identifying the span itself is a major improvement, as a common challenge with NER is identifying text spans with four or more words.
- It correctly identified the place, Malacca.
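The dictionary returned by Promptify carries the extracted entities as a string in its `'text'` field. Below is a minimal sketch of one safe way to turn that string into Python objects, with the raw output hard-coded from the zero-shot result above:

```python
import ast

# Raw result hard-coded from the zero-shot run above
result = {
    "text": "[[{'E': 'Hajjah Fatimah Binte Sulaiman', 'T': 'BORN'}, "
            "{'E': 'Malacca', 'T': 'PLACE'}]]",
    "prompt_tokens": 325, "completion_tokens": 43, "total_tokens": 368,
}

# The 'text' field is a Python-literal string, so ast.literal_eval is
# safer than eval for parsing it
entities = ast.literal_eval(result["text"])[0]
for entity in entities:
    print(entity["E"], "->", entity["T"])
```

In practice the `'text'` field may carry leading whitespace or line breaks, so a `.strip()` before parsing is prudent.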
The next step will be to provide a few examples in the prompt sent to the LLM.
Few Shot
Few-Shot Learning (FSL) refers to the ability to learn new concepts by training machine learning models with only a few examples. Most approaches to few-shot learning involve meta-learning, often referred to as “learning to learn”.
Meta-learning performs the learning through training on a variety of tasks, each of which requires the model to learn from a few examples. During this process, the model learns how to improve its learning algorithm, thus allowing it to generalize, i.e. adapt to new tasks based on only a few examples.
The code snippets below build on the work done in the previous section using ZSL.
Curate Examples For Few-Shot Learning
This step is crucial for a successful outcome.
“similar yet diverse enough” - author’s experience
The examples chosen need to be similar enough to learn from, yet diverse enough to generalize from.
I curated just 23 examples covering the entity classes of interest, each with one or more entities.
```python
few_shot_examples = [
    ["Tun Dr. Siti Hasmah Mohamad Ali is the wife of Malaysia's former Prime Minister, Tun Dr. Mahathir Mohamad and a well-respected medical doctor.",
     [{'E': "Dr. Siti Hasmah Mohamad Ali", 'T': "PERSON"},
      {'E': "Malaysia", 'T': "GPE"},
      {'E': "Dr. Mahathir Mohamad", 'T': "PERSON"}]],
    ["Raja Permaisuri Agong Tunku Azizah Aminah Maimunah Iskandariah is the current Queen consort of Malaysia and the wife of the Yang di-Pertuan Agong, the Malaysian monarch.",
     [{'E': "Azizah Aminah Maimunah Iskandariah", 'T': "PERSON"},
      {'E': "Malaysia", 'T': "GPE"}]],
    ["Zainah Alsagoff is a prominent lawyer and a senior partner at the law firm WongPartnership LLP.",
     [{'E': "Zainah Alsagoff", 'T': "PERSON"},
      {'E': "WongPartnership LLP", 'T': "ORG"}]],
    ...  # Truncated for brevity
]
```
Next, reconstruct the prompt with the few-shot examples, and send it to the LLM.
```python
llm_model = OpenAI(api_key, model="text-davinci-003")

result = llm_prompter.fit(
    "ner.jinja",
    domain="general",
    text_input=text,
    examples=few_shot_examples,
    labels=["PERSON", "ORG", "PLACE", "GPE"])
print(result)
```
Results Using FSL
Summary
GPT Model | Entity Types Correctly Identified | Approximate Cost |
---|---|---|
`text-babbage-001` | 1 (`PERSON`) | USD 0.90 |
`text-davinci-003` | 1 (`PERSON`), 1 partial (`GPE` vs `PLACE`) | USD 36.00 |
`gpt-3.5-turbo` | 1 (`PERSON`), 1 partial (`GPE` vs `PLACE`) | USD 3.60 |
- The approximate cost refers to the cost of automated annotation for 1,000 records of similar length, using a similar number of examples.
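These figures come from simple token arithmetic. The sketch below shows the back-of-the-envelope calculation; the per-1K-token prices and the ~1,800 tokens per record are my assumptions based on OpenAI's pricing at the time of writing, so check the pricing page for current rates:

```python
# Rough annotation-cost estimate. Prices (USD per 1K tokens) and the
# tokens-per-record figure are assumptions, not official numbers.
PRICE_PER_1K = {
    "text-babbage-001": 0.0005,
    "text-davinci-003": 0.02,
    "gpt-3.5-turbo": 0.002,
}

def annotation_cost(model, tokens_per_record, n_records=1000):
    """Estimate the cost of annotating n_records, given tokens used per record."""
    total_tokens = tokens_per_record * n_records
    return total_tokens / 1000 * PRICE_PER_1K[model]

# ~1,800 tokens per record (prompt with few-shot examples, plus completion)
print(annotation_cost("text-davinci-003", 1800))  # prints 36.0
```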
Details
Results (formatted for ease of readability):

When using the `text-babbage-001` model:

```python
{'text': "[[{'E': 'Hajjah Fatimah Binte Sulaiman', 'T': 'PERSON'}]]",
 'prompt_tokens': 1558, 'completion_tokens': 27, 'total_tokens': 1585}
```

When using the `text-davinci-003` and `gpt-3.5-turbo` models:

```python
{'text': "[[{'E': 'Hajjah Fatimah Binte Sulaiman', 'T': 'PERSON'},
            {'E': 'Malacca', 'T': 'GPE'}]]",
 'prompt_tokens': 1367, 'completion_tokens': 39, 'total_tokens': 1406}
```
A few things to take note of from the example above:
- The `text-babbage-001`, `text-davinci-003` and `gpt-3.5-turbo` models were able to pick up the name of the person. Curiously, `text-curie-001` was not able to.
- Depending on the use case, a significant improvement in the entity class of interest can be considered a success.
- If the set of place names of interest is small, other approaches can be employed to make the necessary corrections between the entity types `GPE` and `PLACE`. One library I have often used is `flashtext`.
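The correction mentioned in the last bullet boils down to keyword matching against a curated list of place names. A minimal sketch of the idea, using a plain set lookup (the place list here is hypothetical; in practice the names could be loaded into flashtext for fast matching over large lists):

```python
# A small, hypothetical list of place names for the use case
KNOWN_PLACES = {"malacca", "johor", "lombok", "singapore"}

def fix_entity_type(entity_text, inferred_type):
    """Relabel GPE as PLACE when the span matches a known place name."""
    if inferred_type == "GPE" and entity_text.lower() in KNOWN_PLACES:
        return "PLACE"
    return inferred_type

print(fix_entity_type("Malacca", "GPE"))  # prints PLACE
print(fix_entity_type("Britain", "GPE"))  # prints GPE
```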
Constructing Own Prompts
An alternative to libraries such as `Promptify` would be to construct your own prompts, with examples and custom instructions.
Here’s an example.
The ability to develop good prompts is an art (aka “prompt engineering”) that is fast emerging as a must-have skill - more on that in a future blog post.
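A hand-rolled few-shot NER prompt might be assembled like this; the template wording below is my own illustrative choice, not Promptify's actual template:

```python
def build_ner_prompt(text, labels, examples):
    """Assemble a simple few-shot NER prompt by hand (hypothetical template)."""
    lines = [
        f"Extract entities of types {', '.join(labels)} from the text.",
        "Return a list of {'E': <span>, 'T': <type>} dictionaries.",
        "",
    ]
    # Each curated example contributes a (text, entities) demonstration pair
    for example_text, example_entities in examples:
        lines.append(f"Text: {example_text}")
        lines.append(f"Entities: {example_entities}")
        lines.append("")
    lines.append(f"Text: {text}")
    lines.append("Entities:")
    return "\n".join(lines)

prompt = build_ner_prompt(
    "Hajjah Fatimah Binte Sulaiman was born in Malacca",
    ["PERSON", "ORG", "PLACE"],
    [("Zainah Alsagoff is a senior partner at WongPartnership LLP.",
      [{"E": "Zainah Alsagoff", "T": "PERSON"},
       {"E": "WongPartnership LLP", "T": "ORG"}])],
)
print(prompt)
```

The assembled string would then be sent to the completion API, and the model's reply parsed back into entity dictionaries.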
Add To Existing Annotations
The programmatically annotated datasets produced by the approaches above need to be converted into a format acceptable to spaCy NER. This can be done using spaCy's `convert` tool.
- First, convert the inferred entities produced by the approaches above (I refer to these as the programmatically annotated datasets). You can refer to this gist on how to do so.
- Next, convert the datasets into one of the `conll`, `conllu`, `iob` or `ner` formats.
- Finally, use spaCy's `convert` tool.
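As a sketch of what the first conversion step involves, span-level annotations can be mapped onto token-level IOB tags. This minimal version assumes whitespace tokenization and exact span matches; a production version should work with character offsets and spaCy's own tokenizer:

```python
def to_iob(text, entities):
    """Map span annotations onto whitespace-token IOB tags.
    Assumes each entity span matches an exact token sequence."""
    tokens = text.split()
    tags = ["O"] * len(tokens)
    for entity in entities:
        span = entity["E"].split()
        # Find the first token position where the entity span occurs
        for i in range(len(tokens) - len(span) + 1):
            if tokens[i:i + len(span)] == span:
                tags[i] = f"B-{entity['T']}"
                for j in range(1, len(span)):
                    tags[i + j] = f"I-{entity['T']}"
                break
    return list(zip(tokens, tags))

rows = to_iob(
    "Hajjah Fatimah Binte Sulaiman was born in Malacca",
    [{"E": "Hajjah Fatimah Binte Sulaiman", "T": "PERSON"},
     {"E": "Malacca", "T": "PLACE"}],
)
for token, tag in rows:
    print(f"{token}\t{tag}")
```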
TL;DR
- spaCy is an excellent industrial-grade Python library for various NLP tasks, including NER. However, using it “out of the box” sometimes does not work well for use cases which require a custom NER model developed on region/domain-specific texts.
- Large language models (LLMs) similar to those powering ChatGPT can be used to produce annotated datasets that are not only high-quality but also require significantly reduced human effort.
- Using LLMs to produce annotated datasets requires a few carefully curated examples to learn from, and well-defined prompts.
- Depending on the use case and the volume of records, the cost of automated annotation can be below US$100.
References
- Few-Shot Learning & Meta-Learning | Tutorial (30 Mar 2023)
- Online tokenizer tool, used for estimating the number of tokens in input prompts