Product data is messy. Typos, inconsistent SKUs, and incomplete metadata make accurate matching a challenge — and traditional methods like string matching or full-text search often fall short.
We solved this by adding semantic text matching to a Django app using BERT embeddings and pgvector. The results were dramatically better, with minimal architectural changes. Since we were already using Django and PostgreSQL, pgvector integrated seamlessly — no new infrastructure, no added complexity.
The outcome: fast, accurate, and maintainable text similarity that just worked.
Our goal was to standardize product names and metadata — such as SKU, brand, and description — so that user queries would reliably match the correct items in our database.
The challenge? The data was highly inconsistent. Some records used full brand names, others abbreviations. Some included SKUs, others didn’t. Meanwhile, user input varied widely in structure, spelling, and completeness.
To address this, we built a system that semantically compared user queries — often incomplete or inconsistently formatted — to a curated list of standardized product records using embedding-based similarity. This allowed us to match meaning, not just characters.
Here’s how varied the data can be for a single product: the standardized database record stores the full name, brand, SKU, and description, while user queries for that same item may abbreviate the brand, reorder or drop words, omit the SKU, or misspell the name entirely.
These variations broke traditional matching logic. A robust standardization system was essential — not just for accuracy, but for user satisfaction. Without it, even highly motivated users could fail to find what they were looking for.
To tackle this problem, we explored three common classes of approaches to determine the best fit for our use case. Each came with trade-offs in accuracy, complexity, and resilience to messy data.
Techniques like Levenshtein distance, Jaccard similarity, or cosine similarity on TF-IDF vectors handled basic differences like typos or word substitutions. However, they failed when word order changed or structure varied.
For example, “Wireless Headphones Moxy” and “Moxy Wireless Headphones” scored poorly despite being semantically identical. These methods also struggled with inconsistent or missing metadata.
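As a rough illustration, here is a sketch using Python's standard difflib (not the exact metrics we evaluated) of how character-level similarity penalizes this kind of reordering:

from difflib import SequenceMatcher

# Two semantically identical product names, differing only in word order.
a = "Wireless Headphones Moxy"
b = "Moxy Wireless Headphones"

# The character-level score lands noticeably below 1.0 for the reordered pair,
# even though both strings describe the same product.
score = SequenceMatcher(None, a, b).ratio()
print(f"Character-level similarity: {score:.2f}")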
Tools like PostgreSQL full-text search or Elasticsearch offered fast indexing and keyword-based retrieval. But they relied on lexical matching, not semantics.
As a result, they often missed relationships between reworded or partially structured inputs — making them unsuitable for matching varied product metadata.
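For comparison, a lexical lookup of this kind might look like the following sketch, which assumes Django's built-in PostgreSQL full-text search and the Product model shown later in this post:

from django.contrib.postgres.search import SearchQuery, SearchVector

# Full-text search matches on word stems, not meaning: a query phrased
# differently from the stored record (abbreviations, synonyms, missing
# metadata) can fail to match even when it refers to the same product.
results = Product.objects.annotate(
    search=SearchVector("name", "description"),
).filter(search=SearchQuery("moxy wireless headphones"))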
Modern transformer models like BERT understood context and meaning. These models generated dense vector representations of text that captured relationships between words and concepts — even when phrasing or formatting changed.
Embeddings allowed us to compare user queries and product records on a semantic level, rather than relying on exact word matches.
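Here is a minimal sketch of that idea, using the sentence-transformers library (with the model we ultimately chose, introduced below) and the same pair of product names as above:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

# Encode both phrasings into dense vectors.
embeddings = model.encode([
    "Wireless Headphones Moxy",
    "Moxy Wireless Headphones",
])

# The cosine similarity between the two vectors is close to 1.0,
# because the model captures that they mean the same thing.
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(f"Semantic similarity: {similarity.item():.2f}")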
We implemented a hybrid solution using BERT sentence embeddings and pgvector, fully integrated into our Django + PostgreSQL stack. This architecture gave us the best balance of accuracy, simplicity, and scalability.
We chose all-mpnet-base-v2, a general-purpose model that excels at semantic similarity tasks. It has broad knowledge across many domains, which is helpful when dealing with varied product names and descriptions. It is also relatively fast and resource-efficient compared to larger models.
For different tasks, other models may be preferred. For example, if we were handling multilingual product catalogs, we would consider alternatives like paraphrase-multilingual-mpnet-base-v2.
We generated two embeddings per product: one for the product name, and one for a combined block of metadata (SKU, brand, description, and other attributes).
These vectors were stored in PostgreSQL using pgvector, a native extension that allowed us to avoid deploying additional services like Milvus, Weaviate, or Pinecone that we would have to tune and maintain.
At our data volume levels, pgvector delivered excellent performance without the operational overhead of managing another system.
When a user submitted a query, we generated embeddings for the query's product name and metadata in the same way, compared them against the stored vectors using cosine distance, and combined the two scores, weighting the name at 70% and the metadata at 30%.
This weighting prioritized the most identifiable element, the product name, while still leveraging metadata for improved relevance.
For our PostgreSQL database, we used the following base image for the Postgres container:
image: pgvector/pgvector:pg17
This added pgvector to the Postgres image.
In Python, we installed the required libraries:
pip install pgvector sentence-transformers
This also pulled in additional dependencies, including transformers, numpy, and torch.
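The vector extension also has to be enabled in the database itself. One way to do that with the pgvector Python package is a Django migration using its VectorExtension operation (a minimal sketch; in a real project the dependencies would point at your app's previous migration):

from django.db import migrations
from pgvector.django import VectorExtension

class Migration(migrations.Migration):
    dependencies = []

    # Runs CREATE EXTENSION IF NOT EXISTS vector in the database.
    operations = [VectorExtension()]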
In our Django model, we used VectorField to store the embeddings. The dimension was set to 768, matching the output of the BERT SentenceTransformer model "all-mpnet-base-v2":
from django.db import models
from pgvector.django import VectorField
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

class Product(models.Model):
    name = models.CharField(max_length=255)
    sku = models.CharField(max_length=255, blank=True, null=True)
    brand = models.CharField(max_length=255, blank=True, null=True)
    description = models.TextField(blank=True, null=True)
    ...

    # Vector fields holding the 768-dimensional embeddings
    name_vector = VectorField(dimensions=768, blank=True, null=True)
    metadata_vector = VectorField(dimensions=768, blank=True, null=True)
We calculated and stored embeddings like this:
def get_text_embedding(text):
    embedding = model.encode(text)
    return embedding.tolist()

class Product(models.Model):
    ...
    def save(self, *args, **kwargs):
        # Recompute embeddings on every save so they stay in sync with the text fields.
        self.name_vector = get_text_embedding(self.name)
        metadata_text = f"""
        SKU: {self.sku}
        Brand: {self.brand}
        Description: {self.description}
        Category: {self.category}
        Color: {self.color}
        """
        self.metadata_vector = get_text_embedding(metadata_text)
        super().save(*args, **kwargs)
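For products created before the vector fields existed (or after switching models), embeddings have to be backfilled in bulk rather than one save() at a time. A rough sketch of how that could look (the helper name and batch size here are assumptions, not part of the original code):

def backfill_embeddings():
    """Recompute and store name embeddings for all existing products."""
    products = list(Product.objects.all())

    # Encoding all names in one call lets the model batch the work internally,
    # which is far faster than encoding row by row (especially on a GPU).
    name_vectors = model.encode([p.name for p in products])

    for product, vector in zip(products, name_vectors):
        product.name_vector = vector.tolist()
        # metadata_vector would be rebuilt the same way from the metadata text.

    # bulk_update writes the vectors without calling save() for every row.
    Product.objects.bulk_update(products, ["name_vector"], batch_size=500)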
For querying, given the input fields product_name, product_sku, product_brand, product_description, and other attributes, we generated embeddings in the same way:
query_name_vector = get_text_embedding(product_name)
query_metadata_text = f"""
    SKU: {product_sku}
    Brand: {product_brand}
    Description: {product_description}
    Category: {product_category}
    Color: {product_color}
"""
query_metadata_vector = get_text_embedding(query_metadata_text)
We used the Django ORM to search for similar products, which simplified research and testing by eliminating the need to write complex SQL queries. Specifically, we calculated a weighted similarity by combining two scores — 70% from the name vector and 30% from the metadata vector — and returned the closest match:
from django.db.models import F
from pgvector.django import CosineDistance

similar_objects = Product.objects.annotate(
    name_similarity=CosineDistance("name_vector", query_name_vector),
    metadata_similarity=CosineDistance("metadata_vector", query_metadata_vector),
    weighted_similarity=(
        0.7 * F("name_similarity")
        + 0.3 * F("metadata_similarity")
    ),
).order_by("weighted_similarity")

# CosineDistance is a distance (lower is closer), so the first row is the best match.
result = similar_objects.first()
A few practical lessons from building and tuning this system:
Tune the Similarity Weights
The 70/30 split between the name and metadata vectors worked well for our catalog, but the right balance depends on your data, so treat the weights as something to experiment with.
Beware of Typos
Embedding models differ in how well they tolerate misspellings. If your data contains many typos, consider models such as "paraphrase-mpnet-base-v2" or the ModernBERT family, which are more robust. Keep in mind: switching models requires re-encoding all stored embeddings, as they are model-specific.
Use a GPU for Bulk Processing
Encoding a large catalog (or re-encoding it after a model switch) is dramatically faster on a GPU than on a CPU.
Indexing Matters
At larger data volumes, exact nearest-neighbor search becomes slow. pgvector supports approximate indexes (ivfflat or hnsw) that keep similarity queries fast; a sketch of adding one follows below.
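Here is a minimal sketch of adding an HNSW index on the name vector with pgvector's Django integration (the index name and parameters are illustrative defaults, not values from our production setup):

from django.db import models
from pgvector.django import HnswIndex, VectorField

class Product(models.Model):
    ...
    name_vector = VectorField(dimensions=768, blank=True, null=True)

    class Meta:
        indexes = [
            # Approximate nearest neighbor index for cosine distance queries.
            HnswIndex(
                name="product_name_vector_hnsw_idx",
                fields=["name_vector"],
                m=16,
                ef_construction=64,
                opclasses=["vector_cosine_ops"],
            ),
        ]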
Our embedding-based solution delivered significantly higher accuracy than traditional text matching methods such as regular expressions, full-text search, or basic string similarity techniques.
By leveraging BERT embeddings and pgvector, we were able to add semantic matching capabilities directly into our existing Django + PostgreSQL stack with minimal overhead. The integration was clean, the performance solid, and the impact on the user experience substantial.
This approach allowed us to reliably match product names and metadata — even with typos, inconsistent formatting, or missing fields — resulting in a more intuitive search experience.
Want to see vector search in action at scale? Check out our case study, Extracting Gold From Millions of Datapoints, and explore how our Big Data capabilities can help you unlock deeper insights and better results.