Large language models (LLMs) are made up of billions of parameters, which poses challenges when loading them onto GPU memory for model inference or fine-tuning. This post briefly explains the challenges and describes a solution for loading Mixtral 8x7B, a state-of-the-art (SOTA) LLM, onto consumer-grade GPUs, followed by using the model for NLP tasks such as Named Entity Recognition (NER), Sentiment Analysis, and Text Classification.

Image generated using DALL·E 3

Mixtral 8x7B is an open weight large language model (LLM) with performance better than GPT-3.5 (which currently powers ChatGPT). Mixtral 8x7B is a Mixture of Experts (MoE) model.

Open Weight Or Open Source? The model weights are made publicly accessible, so the term “open weight” is more suitable than incorrectly calling it “open source”: what is released is just the weights, not the source code.

A Short Recap

What Are Mixture of Experts (MoE) Models?

In most LLMs, such as Llama 2, every token (roughly equivalent to ¾ of a word) is processed by all of the model’s parameters at once. This can be slow, especially for complex tasks.

Mixtral 8x7B uses an MoE architecture. Instead of a single dense feed-forward block, each layer contains several “expert” sub-networks (each can be thought of as a specialist). Before any expert does its work, a “router” analyses each token and decides which experts are best suited to handle it.
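To make the routing idea concrete, below is a minimal, illustrative top-k routing sketch in PyTorch. The class, layer sizes, and variable names are my own simplification, not Mixtral’s actual implementation (Mixtral has 8 experts per layer and routes each token to the top 2).

```python
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Illustrative top-k mixture-of-experts layer (not Mixtral's actual code)."""
    def __init__(self, hidden=16, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(hidden, num_experts)   # scores every expert for each token
        self.experts = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_experts))

    def forward(self, x):                              # x: (num_tokens, hidden)
        # Pick the k highest-scoring experts per token, with softmax weights.
        weights, idx = torch.topk(self.router(x).softmax(dim=-1), self.k)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            token_ids, slot = (idx == i).nonzero(as_tuple=True)   # tokens routed to expert i
            if token_ids.numel():
                out[token_ids] += weights[token_ids, slot, None] * expert(x[token_ids])
        return out

# Each token is only ever processed by k (here 2) of the 8 experts.
y = ToyMoELayer()(torch.randn(5, 16))
```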

Similar To A Consulting Firm

Mixtral 8x7B is like having a team of deep specialists within a consulting firm.

  • Instead of one all-purpose consultant, it has 8 specialists (“experts”), each with deep knowledge in specific areas.
  • One might be focused on Data Engineering, whilst the others might be specialists in areas such as ML Engineering, MLOps, Natural Language Processing, Cloud Solutions Architecture, Data Governance, etc.
  • The tokens received would be equivalent to incoming requests for consultation.
  • After analysis of each request, the manager (“router”) would route it to the best specialist suited to handle it.
  • This way, only the relevant experts work on each request. For example, if a request were about deploying complex ML models, it would be routed to an MLE expert. Similarly, requests about optimising a complex data pipeline would be routed to a Data Engineering expert, and so on.

Model Weights

Most LLMs comprise several billion parameters; “model weights” refers to these parameters. Each parameter is a decimal number, for example 1.2345, and is stored as a 16- or 32-bit floating point number, i.e., float16 (FP16) or float32 (FP32).

56B or 47B?

The “8x7B” in the name of the Mixtral 8x7B model might suggest it is a 56 billion parameter model. However, it is a 46.7 billion parameter model, because only the expert feed-forward blocks are replicated eight times; the attention and other layers are shared across experts. [#1, #2]

Even though it has 46.7 billion parameters, it only uses about 12.9 billion parameters per token. It therefore processes input and generates output at roughly the same speed and cost as a 12.9 billion parameter model.

GPU VRAM

VRAM, or Video Random Access Memory, is crucial for loading LLM weights onto GPUs for inference or fine-tuning. Serving as a high-speed buffer, it stores data for quick access during computations, ensuring efficient loading and preventing performance issues.

Consumer Grade GPUs

For reference, here are a few of the “consumer-grade” GPUs at the time of writing.

In comparison, the GPU available via Google Colab (free tier) is the T4, which has 16 GB of VRAM.

Industrial Grade GPUs

For reference, here are a few data center (i.e., “industrial-grade”) GPUs.

Model Weights And GPU VRAM

A few rules of thumb regarding the GPU VRAM required to load an LLM with X billion parameters (a quick sketch applying these rules follows the lists below):

  • 4*X GB of VRAM (if float32)
  • 2*X GB of VRAM (if float16)

👉 Using the rules above, the minimum GPU memory required for a few open weight models is as follows:

  • Llama 2 7B model, 28 GB (30 GB per the NVIDIA documentation).
  • Llama 2 13B model, 52 GB
  • Falcon 40B model, 80 GB
  • Mixtral 8x7B model, 94 GB
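As a quick sanity check, here is a minimal sketch of the rule of thumb above. It only accounts for holding the weights themselves and ignores activations, KV cache, and framework overhead:

```python
def vram_gb(params_billion: float) -> dict:
    """Rule-of-thumb VRAM (in GB) needed just to hold the model weights."""
    return {
        "float32": 4 * params_billion,  # 4 bytes per parameter
        "float16": 2 * params_billion,  # 2 bytes per parameter
    }

for name, size_b in [("Llama 2 7B", 7), ("Llama 2 13B", 13),
                     ("Falcon 40B", 40), ("Mixtral 8x7B", 46.7)]:
    print(name, vram_gb(size_b))
```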

❌ None of the models listed above can be loaded onto a Google Colab notebook “as is”.

Reading up to this point, an obvious question that might come to mind is: Why does this article have such a misleading title?

Fortunately, there is a solution, and the title is not misleading. 😏

Solution

This section describes how to make use of the highly performant Mixtral 8x7B LLM within a Google Colab notebook.

Credit

The work was done by Denis Mazur and Artyom Eliseev of the Moscow Institute of Physics and Technology. Details.

What Was Done?

The following is based on my reading of their report. Any misinterpretation is unintentional.

  • The original model was quantized in mixed precision mode.
    • They chose Half-Quadratic Quantization (HQQ). The choice of quantization method is important, as a wrong choice could result in a loss of quality (higher perplexity). Since HQQ had already been used for Mixtral, it was chosen over GPTQ or AWQ.
    • 4-bit HQQ quantization was used for the shared attention layers, and 2- or 3-bit quantization for the experts (a hedged loading sketch follows this list).
  • An MoE-specific offloading strategy was implemented.
    • The researchers studied how an MoE language model uses its experts across tokens and identified two patterns: (i) it tends to reuse certain experts for consecutive tokens, and (ii) the hidden states of early layers already anticipate which experts will be used in later layers.
    • They designed an MoE-specific offloading strategy that exploits these patterns: (i) an LRU cache keeps recently used experts on the GPU, reducing GPU–RAM traffic and speeding up generation, and (ii) the experts likely to be needed next are guessed ahead of time, so loading them can overlap with the computation already in flight.
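The researchers’ pipeline uses HQQ together with their custom expert-offloading code (linked above). As a rough, generic illustration of what quantized loading looks like in practice, here is a sketch using the widely used transformers + bitsandbytes 4-bit path instead. This is not their method: it uses a single uniform bit width rather than their mixed 2-/3-/4-bit scheme, and on its own it does not include the MoE offloading that makes the model fit on a free-tier T4.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# Generic 4-bit (NF4) quantization config; the researchers used HQQ with
# 4-bit attention layers and 2-/3-bit experts instead.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",   # spreads layers across GPU/CPU as memory allows
)
```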

Brief Explanations

For readers unfamiliar with some of the terms:

  • What is Half-Quadratic Quantization? A data-free quantization algorithm that supports a variety of bit rates and can reduce the memory footprint of LLMs without sacrificing accuracy or speed. Several LLMs and vision models can be quantized with HQQ, including Llama 2, Mixtral, and OpenCLIP models from the Vision Transformer (ViT) family. Full list of ready-to-use quantized models.
  • What is an LRU Cache? An LRU Cache is like a small toy box. When it’s full and you get a new toy, you remove the toy you haven’t played with for the longest time to make room for the new one. It’s a way computers manage limited memory, keeping recent information and removing older, less used data.
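To make the toy-box analogy concrete, here is a minimal LRU cache sketch in Python, framed as an expert cache. The class name and the load_fn callback are hypothetical names of my own for illustration; they are not the researchers’ actual code.

```python
from collections import OrderedDict

class ExpertLRUCache:
    """Keep up to `capacity` recently used experts on the GPU; evict the least recently used."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._cache = OrderedDict()              # expert_id -> expert weights

    def get(self, expert_id, load_fn):
        if expert_id in self._cache:
            self._cache.move_to_end(expert_id)   # mark as most recently used
            return self._cache[expert_id]
        expert = load_fn(expert_id)              # e.g. copy weights from CPU RAM to the GPU
        self._cache[expert_id] = expert
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)      # evict the least recently used expert
        return expert
```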

Quick Test

My first test was completely non-serious, with no real expectations.

User: “What does Nirmalya do in Singapore?”
Mixtral: “Nirmalya is a common Indian name, and there may be many individuals with that name in Singapore, so I can’t provide information on a specific person. However, I can provide some information about the significance and cultural context of the name. “Nirmalya” is a Sanskrit name, and in Hinduism, it refers to the ritual of offering flowers, leaves, and other natural materials to a deity or a sacred symbol, and then the offering is referred to as “Nirmalya." It is also a name of Lord Shiva. In Singapore, individuals with the name Nirmalya would likely be of Indian descent and may be involved in Hindu religious activities or cultural organizations. However, without more specific information, I cannot provide details about a person’s occupation or activities.

The response generated was accurate. Not many people around me (of Indian descent) know the meaning, so it is impressive that an LLM was able to correctly articulate the significance and cultural context of my first name.

Using Mixtral 8x7B For A Few NLP Tasks

Named Entity Recognition

I used the quantized Mixtral 8x7B model for Named Entity Recognition (NER) on region-specific texts from Singapore, Malaysia, Indonesia, and Thailand. For comparison, I used identical examples from my April 2023 blog post.

Prompt used for the purpose of conducting the tests: Given the indicated text, please help identify the dates, persons, places and organisations within it. These 4 entity types must be annotated with the entity types DATE, PERSON, GPE and ORG, respectively. The output should indicate the start and ending position of the occurrence of the entity in the text. The output must not indicate other entity types in the response. ###Text:### {placeholder for text}
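For context, here is a hedged sketch of how such a prompt could be sent to the model with transformers, assuming model and tokenizer refer to the quantized Mixtral model and its tokenizer loaded earlier; the generation settings are my assumptions, not necessarily what was used for these tests.

```python
# `model` and `tokenizer` are assumed to be the quantized Mixtral model and its
# tokenizer loaded earlier; `text` is the Singapore passage from Test #1 (abridged).
text = "If this name is familiar to you, you might be thinking of Masjid Hajjah Fatimah ..."

prompt = (
    "Given the indicated text, please help identify the dates, persons, places and "
    "organisations within it. These 4 entity types must be annotated with the entity "
    "types DATE, PERSON, GPE and ORG, respectively. The output should indicate the "
    "start and ending position of the occurrence of the entity in the text. The output "
    "must not indicate other entity types in the response. ###Text:### " + text
)

# Mixtral-Instruct expects the [INST] chat format; apply_chat_template takes care of it.
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```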

Test #1: Singapore

Text: “If this name is familiar to you, you might be thinking of Masjid Hajjah Fatimah located along Beach Road. Yup, this is the iconic lady that the mosque was commissioned by and is named after! Hajjah Fatimah binte Sulaiman was born in what is now Malacca in the mid-1700s, but she later moved to Singapore with her merchant husband. After his death, Hajjah Fatimah took over his business and grew it into an impressive trading operation.”

Mixtral:
“In the mid-1700s” - DATE, start position: 90, end position: 98
“Masjid Hajjah Fatimah” - GPE, start position: 0, end position: 22
“Hajjah Fatimah binte Sulaiman” - PERSON, start position: 23, end position: 44
“Malacca” - GPE, start position: 50, end position: 56
“Singapore” - GPE, start position: 62, end position: 70
“beach road” - GPE, start position: 4, end position: 14
“her merchant husband” - PERSON, start position: 84, end position: 90
“his business” - ORG, start position: 105, end position: 110
“an impressive trading operation” - ORG, start position: 115, end position: 127

Outcome: ❌ (6 True Positives, 3 False Positives). The last 3 are False Positives.

Test #2: Malaysia

Text: “Hailing from Johor, Associate Professor Madya Dr Nur Adlyka Binti Ainul Annuar was declared a winner of Britain’s Women of the Future Award South East Asia 2021”

Mixtral:
“Johor” - GPE, start position: 0, end position: 5
“Associate Professor Madya Dr Nur Adlyka Binti Ainul Annuar” - PERSON, start position: 21, end position: 76
“Britain’s Women of the Future Award South East Asia 2021” - ORG, start position: 92, end position: 108
“2021” - DATE, start position: 108, end position: 109

Outcome: ✅ (3 True Positives, 1 False Positive)

Test #3: Indonesia

Text: “The startup’s most recent round was its series A in August 2016. Jualo’s founder, Chaim Fetter, is a Dutch tech entrepreneur who also started Peduli Anak Indonesia, a nonprofit that helps underprivileged children in Lombok.”

Mixtral:
“August 2016” - DATE, start position: 63, end position: 69
“Jualo” - ORG, start position: 13, end position: 18
“Chaim Fetter” - PERSON, start position: 79, end position: 90
“Peduli Anak Indonesia” - ORG, start position: 104, end position: 119
“Dutch” - ORIG, start position: 80, end position: 84
“Lombok” - GPE, start position: 120, end position: 125
“started” - ORG, start position: 91, end position: 95

Outcome: ✅ (6 True Positives, 1 False Positive, “started”)

Test #4: Thailand

Text: “When it comes to fashion in Thailand, Pun Thriratanachat is one of the undisputed masters of fashion and design”

Mixtral:
“Thailand” - GPE, start position: 15, end position: 22
“Pun Thriratanachat” - PERSON, start position: 43, end position: 55
“fashion” - ORG, start position: 6, end position: 13
“fashion in Thailand” - ORG, start position: 6, end position: 22
“one of the undisputed masters of fashion and design” - ORG, start position: 44, end position: 76

Outcome: ❌ (2 True Positives, 3 False Positives)

Sentiment Analysis

I used the quantized Mixtral 8x7B model for Sentiment Analysis. For the tests, I chose random paragraphs from random articles published in The Straits Times over the last two to three days.

Sentiment Analysis Test #1

Prompt used for the purpose of conducting the tests: Given the indicated paragraph, please help identify the sentiment of the text within the paragraph. In your analysis, do not just look for positive or negative words, but look at the entire text. Sentiment can be NEGATIVE, SLIGHTLY_NEGATIVE, NEUTRAL, SLIGHTLY_POSITIVE, POSITIVE. ###Paragraph:### {placeholder for text}

Text: “Deputy Prime Minister Lawrence Wong, whose appearance has also been used to promote investment scams, alerted the public on Dec 11 of deepfake posts spreading misinformation that the authorities were planning a circuit breaker amid a spike in Covid-19 cases.”
[extracted from this article]

Mixtral: The sentiment of the given paragraph is NEUTRAL. It is primarily informative and not meant to provoke any particular emotion from the reader. The content is focused on relaying information about potential misinformation, not expressing a sentiment towards Deputy Prime Minister Lawrence Wong or the content of the deepfake posts.

Outcome: ✅

Sentiment Analysis Test #2

Text: “Ms Choy, a 34-year-old artist, said: “It’s something unique and special, it’s grown and cultivated right here in Singapore… and is a great reminder of the humble mushroom that plays such a unique role in… the environment.””
[extracted from this article]

Mixtral: The sentiment of the given paragraph is POSITIVE. The content expresses a sense of appreciation and admiration for something unique and special that is grown and cultivated in Singapore. The focus on the mushroom’s role in the environment highlights a positive environmental impact.

Outcome: ✅

Sentiment Analysis Test #3

Text: “All Eurostar trains were cancelled on Dec 30 due to flooded tunnels in southern England, causing misery for New Year travellers in the second major disruption in 10 days. The latest cancellations follow a wildcat strike by French unions days before Christmas and weather warnings from Britain’s Met Office for rain, snow and ice across large parts of the country. Hundreds of travellers were left stranded at London’s St Pancras station after Eurostar cancelled all services for the entire day.”
[extracted from this article]

Mixtral: The sentiment of the indicated paragraph is NEGATIVE. The text describes several negative events, including cancelled Eurostar trains, flooded tunnels, a wildcat strike, and severe weather warnings. These events are causing inconvenience and misery for travelers, which further emphasizes the negative tone of the text. There are no positive or uplifting details in the text to balance out the negative information, so the overall sentiment is negative.

Outcome: ✅

Text Classification

Prompt used for the purpose of conducting the tests: Given the indicated paragraph, please help identify the type of toxicity in the text indicated below. In your analysis, do not just look for keywords, but look at the entire text. The output can only be 1 of the 6 types of toxicity, which are toxic, severe toxic, obscene, threat, insult, and identity hate. Do not compare with previous examples. ###Text:### {placeholder for text}

Being based in Singapore, I also wanted to test whether it is able to correctly classify toxic comments in Singlish. These can be seen in tests [#4], [#5] and [#6] below.

What is Singlish? [from one of my older articles on medium.com]
"(For those not familiar with) Singlish emerged from Singapore’s multiethnic community, with Mandarin, Hokkien, Cantonese, and Teochew influencing its intonation and sentence structure; and with words/expressions added in from Malay. Singlish is often used by Singaporeans (and long-term residents) among friends and family in informal settings.”

Text Classification Test #1

Text: “I would like to punch you. In hindsight, I do apologize for my previous statement.”

Mixtral: Based on the provided text, the type of toxicity that can be identified is a “threat.” The speaker initially expresses a desire to physically harm the recipient, which is classified as a threat under most community guidelines for online behavior. However, the speaker does qualify their statement by acknowledging it was inappropriate and offering an apology, which can mitigate the severity of the threat to some extent. It is important to note that even if the threat was not carried out, it can still have a negative impact on the recipient and other readers, and should be avoided in online communication.

Outcome: ✅

Text Classification Test #2

Text: “Go f[masked] yourself, you piece of s[masked].”

Mixtral: The type of toxicity present in the indicated text is obscene. The use of expletives and profane language, while offensive and impolite, do not rise to the level of a specific threat, insult, or hate speech, which are other types of toxicity. Instead, the language used is vulgar and designed to be deliberately offensive, making it an example of obscene toxicity.

Outcome: ✅

Text Classification Test #3

Text: “You’re just a [masked], go back to where you came from.”

Mixtral: The type of toxicity present in the indicated text is “insult.” This is because the statement is a personal attack that is meant to belittle or belittle the individual or group being referred to. The use of the term “[masked]” is a derogatory label, and the phrase “go back to where you came from” is a common insult used to demean people from a specific cultural, racial, or ethnic background, implying they are not truly part of the current society or community.

Outcome: ✅

Text Classification Test #4 (Singlish, racist comment)

Text: “You’re not that kind of Indian - you’re the good kind”

Mixtral: The toxicity in this text can be classified as identity hate. The statement “You’re not that kind of Indian - you’re the good kind” implies a negative stereotype or prejudiced view about a certain group of people based on their ethnicity or race. This type of statement is considered toxic because it dehumanizes and marginalizes individuals who belong to that group, perpetuating harmful stereotypes and biases. The speaker’s attempt to differentiate between “kinds” of Indian people indicates a discriminatory attitude that is harmful and divisive. This kind of identity hate is toxic as it can create and reinforce negative attitudes, behaviors, and systems that affect individuals and groups of people based on their identity.

Outcome: ✅

Text Classification Test #5 (Singlish, obscene comment)

Text: “Go cb lah”

Mixtral: Based on the given text, the type of toxicity present is “insult”. This is inferred from the offensive and disrespectful language used in the statement, which is intended to harm the recipient’s feelings or self-esteem. It does not contain any elements of threat, severe toxicity, obscenity, or hate speech related to the person’s identity.

Outcome: ✅

Text Classification Test #6 (Singlish, insult comment)

Text: “You’re so bo liao.”

Mixtral: The type of toxicity in the indicated text is “insult.” The statement is a colloquial expression in the Chinese language, which roughly translates to “you’re so boring” or “you’re so tedious” in English. The language is disrespectful and dismissive, and such language should be avoided in professional or civil discourse. However, the statement is not severe toxic, obscene, a threat, or identity hate as it does not include any explicit or direct threats of harm or target a specific identity group. The statement is less severe than previous examples, and thus the level of toxicity is lower. In this context, the term “bo liao” is a colloquial expression that is often used to express frustration, disappointment, or to dismiss an idea or person. The level of toxicity is low, as it is not meant to cause harm or hurt, but rather to express a negative emotion or dismiss a statement or idea.

Outcome: ✅

References

  1. Mixtral of experts (Dec 2023)
  2. Introducing Mixtral 8x7B with Databricks Model Serving (Dec 2023)