While large language models (LLMs), such as GPT-4 and Claude, are capable of extracting structured information from text, small language models (SLMs) have historically struggled to do so reliably. Previously, the only viable approach was to fine-tune a larger open-weights model using distillation. A week ago, there was an announcement of what appears to be an alternative.
Osmosis-Structure-0.6B is a 596-million-parameter SLM (file size 1.2 GB), fine-tuned from another SLM, Qwen3-0.6B, using approximately 500,000 examples of JSON-to-natural-language pairs.
Sample Code
First, create subclasses of Pydantic's BaseModel - these define and validate the structured information being extracted from the text.
from typing import List, Optional

from pydantic import BaseModel


class GeographicInformation(BaseModel):
    # In the examples below,
    # this is used to capture the city and/or country of domicile
    city: Optional[str] = None
    country: Optional[str] = None


class PersonInformation(BaseModel):
    # In the examples below,
    # this is used to capture the information about a person
    name: str
    age: Optional[int] = None
    hobbies: Optional[List[str]] = None
    residence: Optional[GeographicInformation] = None
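Before wiring these models into the extraction call, a quick sanity check can confirm the schema behaves as intended. The sample dict below is made up purely for illustration (it is not one of the inputs used later):

# Hypothetical sample, just to exercise the schema
sample = {
    "name": "Alex",
    "age": 34,
    "hobbies": ["hiking"],
    "residence": {"city": "Singapore", "country": None},
}
person = PersonInformation.model_validate(sample)
print(person.residence.city)  # Singapore

# The JSON Schema below is what gets passed to Ollama as format=
print(PersonInformation.model_json_schema()["required"])  # ['name']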
Next, craft the prompt. I find it easier to define the instructions clearly, one per line, and concatenate them before sending the prompt (with the instructions) to the model.
instructions = [
    "Extract ONLY the following information "
    "that is EXPLICITLY PRESENT in the text:",
    "1. name: The person's exact name as written in the text (if present)",
    "2. age: The exact age number mentioned (if present)",
    "3. hobbies: Exact hobbies mentioned (if any)",
    "4. city of domicile: if mentioned in the text",
    "5. country of domicile: if mentioned in the text",
    "",
    "RULES:",
    "- NEVER invent information not in the text",
    "- For name: Must match exactly what's written",
    "- For age: Must be the exact number written",
    "- For hobbies: Must be verbatim from text",
    "- If information is missing, use null",
    "",
    "Now process this input:"
]
Next, use Ollama's Python client to call the Osmosis-Structure-0.6B model.
from ollama import chat

messages = [
    {
        "role": "system",
        "content": "\n".join(instructions)
    },
    {
        "role": "user",
        "content": input_text.replace("\n", " "),
    }
]

response = chat(
    messages=messages,
    model="Osmosis/Osmosis-Structure-0.6B:latest",
    options={
        "temperature": 0,
        "timeout": 180,
        "num_ctx": 8192
    },
    format=PersonInformation.model_json_schema()
)
Lastly, work with the information extracted by the model.
raw_response = response.message.content
answer = PersonInformation.model_validate_json(raw_response)
output_dict = answer.model_dump()
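One caveat: model_validate_json raises a ValidationError if the model's output does not conform to the schema. A minimal defensive sketch of the same parsing step, assuming you want to keep the raw response around for debugging:

from pydantic import ValidationError

try:
    answer = PersonInformation.model_validate_json(raw_response)
    output_dict = answer.model_dump()
except ValidationError as err:
    # The SLM occasionally drifts from the schema; keep the raw text for debugging
    print("Schema validation failed:", err)
    print("Raw model output:", raw_response)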
A Few Observations
Observation 1: If the text starts with an entity of interest, the model tends to hallucinate
// Model input: Alex, currently living in Singapore, is in her 30s.
// She loves to travel, sing, and go on hikes.
// Model output:
{
"name": "EjgqzvzModel", // NOTE: The model hallucinates the name
"age": 30,
"hobbies": [
"travel",
"music", // NOTE: The model assumes liking to sing, means music is a hobby
"hiking"
],
"residence": {
"city": "Singapore",
"country": "SG" // NOTE: 'SG' was not present in the input text
}
}
Observation 2: Adding an <ignore> prefix helps prevent the first entity from being hallucinated
// Model input: <ignore> Alex is in her 30s...
// Model output:
{
"name": "Alex",
"age": 30,
"hobbies": [
"travel",
"music", // NOTE: The model assumes liking to sing, means music is a hobby
"hiking"
],
"residence": {
"city": "Singapore",
"country": "SG" // NOTE: 'SG' was not present in the input text
}
}
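To apply this workaround in the code above, it is a one-line change to the user message; the helper function below is my own naming, not anything from the model's documentation.

# Sketch: prepend "<ignore> " so the first entity is no longer the first token
def with_ignore_prefix(text: str) -> str:
    return "<ignore> " + text.replace("\n", " ")

messages[1]["content"] = with_ignore_prefix(input_text)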
Observation 3: The model tends to hallucinate more with longer texts
I used the text from Chip Huyen's LinkedIn profile (as it was publicly visible a few hours ago).
// Model input: I'm Chip Huyen, a writer and computer scientist. I'm building infrastructure for real-time ML. I also teach Machine Learning Systems Design at Stanford. Previously, I was with Snorkel AI, NVIDIA, Netflix, Primer, Baomoi.com (acquired by VNG). I helped launch Coc Coc - Vietnam's second most popular web browser with 20+ million monthly active users. In my free time, I travel and write. After high school, I went to Brunei for a 3-day vacation which turned into a 3-year trip through Asia, Africa, and South America. During my trip, I worked as a Bollywood extra, a casino hostess, and a street performer. I'm the author of four bestselling Vietnamese books. I'm working on an English book on machine learning interviews.
// Model output:
{
"name": "Chip Huyen",
"age": 32, // NOTE: not present in the input text
"hobbies": [
"travel",
"writing"
],
"residence": {
// NOTE: the input text has no mention of where she lives, so this is hallucinated by the model
"city": "Brunei",
"country": "Vietnam"
}
}
Conclusion
I will not be using the current version of this model for any task.