Implement Text Similarity with Embeddings in Django

Product data is messy. Typos, inconsistent SKUs, and incomplete metadata make accurate matching a challenge — and traditional methods like string matching or full-text search often fall short.

We solved this by adding semantic text matching to a Django app using BERT embeddings and pgvector. The results were dramatically better, with minimal architectural changes. Since we were already using Django and PostgreSQL, pgvector integrated seamlessly — no new infrastructure, no added complexity.

The outcome: fast, accurate, and maintainable text similarity that just worked.

The Problem: Matching Inconsistent Product Data

Our goal was to standardize product names and metadata — such as SKU, brand, and description — so that user queries would reliably match the correct items in our database.

The challenge? The data was highly inconsistent. Some records used full brand names, others abbreviations. Some included SKUs, others didn’t. Meanwhile, user input varied widely in structure, spelling, and completeness.

To address this, we built a system that semantically compared user queries — often incomplete or inconsistently formatted — to a curated list of standardized product records using embedding-based similarity. This allowed us to match meaning, not just characters.

The Challenge of Variations

Here’s how varied the data can be for a single product:

Standardized database record:

  • Product: Premium Wireless Noise-Cancelling Headphones
  • SKU: WH-1000XM4
  • Brand: Moxy Electronics
  • Description: Industry-leading noise cancellation, 30-hour battery life, touch controls
  • Category: Audio Devices
  • Color: Midnight Black

Example user queries:

  • "Moxy Wireless Headphones, WH-1000XM4, Black" — abbreviated brand, simplified product name
  • "Premum Wirless Noise-Canceling Headfones, WH1000XM4, Moxy" — typos, different SKU formatting
  • "Headphones Wireless Moxy, Noise Cancellation, Black WH-1000" — swapped word order, partial SKU
  • "Moxy WH-1000XM4 with 30-hour battery" — minimal metadata with one product feature
  • "Premium Moxy Noise-Cancelling Headphones" — just brand and product name

These variations broke traditional matching logic. A robust standardization system was essential — not just for accuracy, but for user satisfaction. Without it, even highly motivated users could fail to find what they were looking for.

Text Similarity Methods

To tackle this problem, we explored three common classes of approaches to determine the best fit for our use case. Each came with trade-offs in accuracy, complexity, and resilience to messy data.

String-Based Methods

Techniques like Levenshtein distance, Jaccard similarity, or cosine similarity on TF-IDF vectors handled basic differences like typos or word substitutions. However, they failed when word order changed or structure varied.

For example, “Wireless Headphones Moxy” and “Moxy Wireless Headphones” scored poorly despite being semantically identical. These methods also struggled with inconsistent or missing metadata.
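As a rough illustration (not part of our pipeline), Python's built-in difflib shows how character-level matching penalizes a simple reordering of the same words:

from difflib import SequenceMatcher

a = "Moxy Wireless Headphones"
b = "Wireless Headphones Moxy"

# Only "Wireless Headphones" lines up as a common block; the moved "Moxy"
# doesn't, so the ratio lands around 0.79 instead of 1.0. Add the typos from
# the example queries above and the score drops further.
print(SequenceMatcher(None, a, b).ratio())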

Full-Text Search Engines

Tools like PostgreSQL full-text search or Elasticsearch offered fast indexing and keyword-based retrieval. But they relied on lexical matching, not semantics.

As a result, they often missed relationships between reworded or partially structured inputs — making them unsuitable for matching varied product metadata.
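For contrast, this is roughly what the lexical approach looks like in Django (a sketch assuming django.contrib.postgres is enabled and a Product model like the one shown later); only rows sharing stemmed terms with the query are returned:

from django.contrib.postgres.search import SearchQuery, SearchVector

# Keyword-based retrieval: matches depend on shared (stemmed) terms, so a
# reworded query like "noise reduction earphones" would miss this product.
results = Product.objects.annotate(
    search=SearchVector("name", "description"),
).filter(search=SearchQuery("noise cancelling headphones"))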

Transformer-Based Embeddings

Modern transformer models like BERT understood context and meaning. These models generated dense vector representations of text that captured relationships between words and concepts — even when phrasing or formatting changed.

Embeddings allowed us to compare user queries and product records on a semantic level, rather than relying on exact word matches.
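A quick way to see this with sentence-transformers (illustrative only; the model shown is the one we settled on below):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

# Reworded and reordered descriptions of the same product land close together
# in embedding space, so their cosine similarity stays high.
embeddings = model.encode([
    "Moxy Wireless Headphones",
    "Wireless Headphones Moxy, Noise Cancellation",
])
print(util.cos_sim(embeddings[0], embeddings[1]))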

Our Approach: Embeddings + Vector Search

We implemented a hybrid solution using BERT sentence embeddings and pgvector, fully integrated into our Django + PostgreSQL stack. This architecture gave us the best balance of accuracy, simplicity, and scalability.

Model Selection

We chose all-mpnet-base-v2, a general-purpose model that excels at semantic similarity tasks. It has broad coverage across many domains, which helps when dealing with varied product names and descriptions.

Plus, it's relatively fast and resource-efficient compared to larger models.

For different tasks, other models may be preferred. For example, if we were handling multilingual product catalogs, we would consider alternatives like paraphrase-multilingual-mpnet-base-v2.

How It Works

We generated two embeddings per product:

  • One for the product name
  • One for the metadata (SKU, brand, description, category, color)

These vectors were stored in PostgreSQL using pgvector, an open-source extension that allowed us to avoid deploying additional services like Milvus, Weaviate, or Pinecone that we would have had to tune and maintain.

At our data volume levels, pgvector delivered excellent performance without the operational overhead of managing another system.

When a user submitted a query, we:

  1. Generated two embeddings (name and metadata) using the same method
  2. Compared them to stored vectors using cosine similarity
  3. Applied a weighted score:
    • 70% from product name similarity
    • 30% from metadata similarity
  4. Returned the closest match from the standardized product list

This weighting prioritized the most identifiable element — the product name — while still leveraging metadata for improved relevance.
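For example, with illustrative scores of 0.90 for name similarity and 0.70 for metadata similarity, the combined score would be 0.7 × 0.90 + 0.3 × 0.70 = 0.84.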

Implementation

For our PostgreSQL database, we based the Postgres service on the pgvector image in our Docker configuration:

image: pgvector/pgvector:pg17

This gave us a standard Postgres image with the pgvector extension included.
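The image ships pgvector, but the extension still has to be enabled in the database. One way to do that from Django is an early migration using the VectorExtension operation provided by the pgvector package (a minimal sketch):

from django.db import migrations
from pgvector.django import VectorExtension

class Migration(migrations.Migration):
    dependencies = []

    operations = [
        # Runs CREATE EXTENSION IF NOT EXISTS vector
        VectorExtension(),
    ]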

In Python, we installed the required libraries:

pip install pgvector sentence-transformers

This also pulled in additional dependencies, including transformers, numpy, and torch.

In our Django model, we used VectorField to store the embeddings. The dimension was set to 768, matching the output of the SentenceTransformer model "all-mpnet-base-v2":

from django.db import models
from pgvector.django import VectorField
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

class Product(models.Model):
    name = models.CharField(max_length=255)
    sku = models.CharField(max_length=255, blank=True, null=True)
    brand = models.CharField(max_length=255, blank=True, null=True)
    description = models.TextField(blank=True, null=True)
    ...

    # Vector fields
    name_vector = VectorField(dimensions=768, blank=True, null=True)
    metadata_vector = VectorField(dimensions=768, blank=True, null=True)

We calculated and stored embeddings like this:

def get_text_embedding(text):
    # Encode text into a 768-dimensional vector and convert it to a plain list
    embedding = model.encode(text)
    return embedding.tolist()

class Product(models.Model):
    ...

    def save(self, *args, **kwargs):
        # Embed the product name on its own, and the remaining metadata as one block of text
        self.name_vector = get_text_embedding(self.name)
        metadata_text = f"""
            SKU: {self.sku}
            Brand: {self.brand}
            Description: {self.description}
            Category: {self.category}
            Color: {self.color}
        """
        self.metadata_vector = get_text_embedding(metadata_text)
        super().save(*args, **kwargs)
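With this override in place, creating a record through the ORM populates both vectors automatically (the field values below are the standardized record from earlier; note that bulk_create bypasses save(), so bulk imports need to set the vectors themselves):

product = Product.objects.create(
    name="Premium Wireless Noise-Cancelling Headphones",
    sku="WH-1000XM4",
    brand="Moxy Electronics",
    description="Industry-leading noise cancellation, 30-hour battery life, touch controls",
    category="Audio Devices",
    color="Midnight Black",
)
# product.name_vector and product.metadata_vector are now stored alongside the record.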

For querying, given the input fields product_name, product_sku, product_brand, product_description, and other attributes, we generated embeddings in the same way:

query_name_vector = get_text_embedding(product_name)

# Format the query metadata the same way as the stored records,
# so the two embeddings are directly comparable
query_metadata_text = f"""
    SKU: {product_sku}
    Brand: {product_brand}
    Description: {product_description}
    Category: {product_category}
    Color: {product_color}
"""
query_metadata_vector = get_text_embedding(query_metadata_text)

We used the Django ORM to search for similar products, which simplified experimentation and testing by eliminating the need to hand-write complex SQL queries. Specifically, we calculated a weighted similarity by combining two scores — 70% from the name vector and 30% from the metadata vector — and returned the closest match:

from django.db.models import F
from pgvector.django import CosineDistance

# CosineDistance yields a distance (lower means more similar), so ordering
# ascending by the weighted value puts the closest match first
similar_objects = Product.objects.annotate(
    name_similarity=CosineDistance("name_vector", query_name_vector),
    metadata_similarity=CosineDistance("metadata_vector", query_metadata_vector),
    weighted_similarity=(
        0.7 * F("name_similarity")
        + 0.3 * F("metadata_similarity")
    ),
).order_by("weighted_similarity")

result = similar_objects.first()

Tips and Gotchas

Tune the Similarity Weights

  • We started with a 70/30 split between name and metadata similarity, but this ratio can be adjusted based on experimental results and domain-specific behavior, as in the sketch below.
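One low-effort way to experiment is to parameterize the split in the query from the Implementation section and score each setting against a small labeled set of query/product pairs (a sketch, not our exact tuning code):

from django.db.models import F
from pgvector.django import CosineDistance

def rank_products(query_name_vector, query_metadata_vector, name_weight=0.7):
    # Same ranking as in the Implementation section, with the name/metadata
    # split exposed as a parameter so different weights can be compared.
    return Product.objects.annotate(
        name_similarity=CosineDistance("name_vector", query_name_vector),
        metadata_similarity=CosineDistance("metadata_vector", query_metadata_vector),
        weighted_similarity=(
            name_weight * F("name_similarity")
            + (1 - name_weight) * F("metadata_similarity")
        ),
    ).order_by("weighted_similarity")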

Beware of Typos

  • BERT models don’t handle misspellings well. If typos are common in your data, consider models like "paraphrase-mpnet-base-v2" or the ModernBERT family, which are more robust. Keep in mind: switching models requires re-encoding all stored embeddings, as they are model-specific.

Use a GPU for Bulk Processing

  • Generating a large number of embeddings can be slow on CPU. If you're processing at scale (e.g., large product catalogs or frequent updates), a GPU significantly speeds up encoding, as in the sketch below.
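A minimal sketch of bulk encoding with a GPU when one is available (the batch size is just a starting point to tune):

import torch
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("all-mpnet-base-v2", device=device)

# Encode the whole catalog in batches rather than one record at a time.
names = list(Product.objects.values_list("name", flat=True))
name_vectors = model.encode(names, batch_size=64, show_progress_bar=True)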

Indexing Matters

  • As your dataset grows, you can improve query performance by adding pgvector-specific indexes. These aren’t as straightforward as BTree indexes, so refer to the pgvector documentation for implementation best practices and index types (like ivfflat or hnsw); a sketch follows below.
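As a starting point, here is how an HNSW index on the name vector can be declared with pgvector's Django integration (the index name and parameters are illustrative, not tuned recommendations):

from pgvector.django import HnswIndex

class Product(models.Model):
    ...

    class Meta:
        indexes = [
            HnswIndex(
                name="product_name_vector_hnsw",
                fields=["name_vector"],
                m=16,
                ef_construction=64,
                opclasses=["vector_cosine_ops"],
            ),
        ]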

Conclusion

Our embedding-based solution delivered significantly higher accuracy than traditional text matching methods — such as regular expressions, full-text search, or basic string similarity techniques.

By leveraging BERT embeddings and pgvector, we were able to add semantic matching capabilities directly into our existing Django + PostgreSQL stack with minimal overhead. The integration was clean, the performance solid, and the impact on the user experience substantial.

This approach allowed us to reliably match product names and metadata — even with typos, inconsistent formatting, or missing fields — resulting in a more intuitive search experience.

Want to see vector search in action at scale? Check out our case study, Extracting Gold From Millions of Datapoints, and explore how our Big Data capabilities can help you unlock deeper insights and better results.
