
Text clustering is an attempt to group similar texts based on their semantic content, meaning, and relationships. Clustering our text along these three axes lets us explore unstructured data and take it further into data analysis; the image demonstrating this concept shows groupings for visual simplicity. As an analogy for how to view text clustering, imagine an unstructured dataset you have yet to explore but which you suspect contains information you could use for extracting insight. Much like an LLM that, on the backend, maps the words of your question into a representation with dimensions capturing each word and the relationships it might represent, the visual depicts how this would work.
The code in this example is repurposed, with modifications, from Chapter 5 of Hands-On Large Language Models by Jay Alammar and Maarten Grootendorst.
Getting Started
For reference, I'm using the following tools for this demonstration:
- Google Colab (T4 GPU runtime)
- Python
If you haven't created a notebook yet and would like to follow along, feel free; otherwise, the structure of this post will resemble the cell-by-cell style of a Jupyter notebook.
Imports
%pip install datasets
%pip install sentence-transformers==5.1.1
%pip install bertopic==0.17.3
Once these are installed, we can use the datasets library to load our data. For this demo I'm using the following dataset of ArXiv articles, loaded from an existing dataset hosted on Hugging Face.
from datasets import load_dataset
dataset = load_dataset("maartengr/arxiv_nlp")["train"]
# Extract metadata
abstracts = dataset["Abstracts"]
titles = dataset["Titles"]
The output of this code should look like the image below.

from sentence_transformers import SentenceTransformer
# Create embeddings; we are using a small model
embedding_model = SentenceTransformer("thenlper/gte-small")  # Load the model we are referencing
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)

This will take some time to run; you can watch the batches progress through 1 of 1,405.
Once this is complete we should be able to access our embeddings with the following code.
embeddings.shape
# This checks our dimensions: each embedding contains 384 values
# Result should match this below
(44949, 384)
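To build intuition for what those 384 values encode: semantically similar abstracts end up as vectors pointing in similar directions, which we can measure with cosine similarity. Below is a minimal sketch using made-up 4-dimensional toy vectors in place of the real 384-dimensional embeddings; in the notebook you would compare rows of `embeddings` instead.

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product of the two vectors divided by the product of their
    # magnitudes; 1.0 means the vectors point in the same direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" standing in for the 384-dimensional ones
doc_a = np.array([0.9, 0.1, 0.0, 0.2])  # e.g. an NLP abstract
doc_b = np.array([0.8, 0.2, 0.1, 0.3])  # a similar NLP abstract
doc_c = np.array([0.0, 0.1, 0.9, 0.0])  # an unrelated abstract

print(cosine_similarity(doc_a, doc_b) > cosine_similarity(doc_a, doc_c))  # → True
```

The same comparison over real abstract embeddings is what the clustering steps below exploit.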
From here we can think about methods to compress the embeddings. A few exist; we are going to use Uniform Manifold Approximation and Projection (UMAP), which is better suited for nonlinear relationships and structures. Recall that we passed in abstracts to produce the embeddings.
from umap import UMAP
# Call our UMAP with the parameters - notice we are using n_components to reduce to 5 dimensions
umap_model = UMAP(
    n_components=5, min_dist=0.0, metric="cosine", random_state=42
)
reduced_embeddings = umap_model.fit_transform(embeddings)
After running this, checking reduced_embeddings.shape returns (44949, 5).
Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN)
With our reduced embeddings we can now detect outliers using the HDBSCAN class. We define the following parameters:
- min_cluster_size – set to the smallest size grouping that you wish to consider a cluster; lower values yield more, smaller clusters.
- metric – "euclidean" – computes the straight-line distance between two points in space.
- cluster_selection_method – "eom" – stands for Excess of Mass; it tries to find the most stable clusters and is suited for larger clusters. Depending on what you're trying to achieve, this can be changed to the "leaf" method.
from hdbscan import HDBSCAN
# Fit the model and extract clusters
hdbscan_model = HDBSCAN(
    min_cluster_size=100, metric="euclidean", cluster_selection_method="eom"
).fit(reduced_embeddings)
clusters = hdbscan_model.labels_
# Count the distinct cluster labels we've created (includes the -1 outlier label)
len(set(clusters))
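Beyond the raw number of labels, it helps to see how large each cluster is and how many documents HDBSCAN marked as outliers (label -1). Here is a small sketch with a toy label list; when running this in the notebook, swap in the clusters array from above.

```python
from collections import Counter

# Toy HDBSCAN-style labels: -1 marks outliers, 0..n are cluster ids.
# In the notebook, replace `labels` with the `clusters` array from above.
labels = [0, 0, 1, -1, 1, 1, 0, -1, 2]

counts = Counter(labels)
n_outliers = counts.pop(-1, 0)        # remove outliers from the tally
sizes = dict(sorted(counts.items()))  # cluster id -> number of documents

print(f"{len(sizes)} clusters, {n_outliers} outliers")  # → 3 clusters, 2 outliers
print(sizes)                                            # → {0: 3, 1: 3, 2: 1}
```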
Once this completes we can explore a cluster. I've modified the code to cast the index to an integer before printing it to the end user.
# Imports of numpy
import numpy as np
# Print the first ten documents in cluster 0
cluster = 0
for index in np.where(clusters == cluster)[0][:10]:
    index = int(index)
    print(abstracts[index][:300] + "......\n")

Once we've identified our dataset outputs, let's put this together in a visualization.
import pandas as pd
import matplotlib.pyplot as plt
# Reduce 384-dimensional embeddings to two dimensions for easier visualization
reduced_embeddings = UMAP(
    n_components=2, min_dist=0.0, metric="cosine", random_state=42
).fit_transform(embeddings)
# Create dataframe
df = pd.DataFrame(reduced_embeddings, columns=["x", "y"])
df["title"] = titles
df["cluster"] = [str(c) for c in clusters]
# Select outliers and non-outliers (clusters)
to_plot = df.loc[df.cluster != "-1", :]
outliers = df.loc[df.cluster == "-1", :]
# Plot outliers and non-outliers separately
plt.scatter(outliers.x, outliers.y, alpha=0.05, s=2, c="grey")
# Scatter initialization
plt.scatter(
    to_plot.x, to_plot.y, c=to_plot.cluster.astype(int),
    alpha=0.6, s=2, cmap="tab20b"
)
plt.axis("off")

Up to this point we had reduced the original embeddings to 5 components; for the visual we shrank this even further, setting n_components to 2 dimensions.
Starting our BERTopic
from bertopic import BERTopic
# Train our model with our previously defined models
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    verbose=True
).fit(abstracts, embeddings)

We can run topic_model.get_topic_info() to inspect the topics that were found.

Creating a visualization of our topics.
import numpy as np
# Ensure clean input types
titles = [str(t) for t in titles]
reduced_embeddings = np.array(reduced_embeddings, dtype=float)
# Visualize topics and documents
fig = topic_model.visualize_documents(
    titles,
    reduced_embeddings=reduced_embeddings,
    width=1400,
    hide_annotations=True
)
# Update fonts of legend for easier visualization
fig.update_layout(font=dict(size=20))

We can also interactively click on one of the topics in the visual to see how often it appears in our cluster.
Visualizing a heatmap
topic_model.visualize_heatmap(n_clusters=30)

# Minimizing our topics using helpers
from copy import deepcopy
original_topics = deepcopy(topic_model.topic_representations_)
# Define our topic differences
def topic_differences(model, original_topics, nr_topics=5):
    """Show the differences in topic representations between two models."""
    df = pd.DataFrame(columns=["Topic", "Original", "Updated"])
    for topic in range(nr_topics):
        # Extract top 5 words per topic per model
        og_words = " | ".join(list(zip(*original_topics[topic]))[0][:5])
        new_words = " | ".join(list(zip(*model.get_topic(topic)))[0][:5])
        df.loc[len(df)] = [topic, og_words, new_words]
    return df
This function builds a data frame with Topic, Original, and Updated columns; we can then run the following to get the representation in a visual format. When we call KeyBERTInspired, we are fine-tuning based on the semantic relationship between keywords/keyphrases and the set of documents in each topic. First we initialize it in the representation_model variable, then we pass that as the parameter to update_topics and call our topic_differences function.
from bertopic.representation import KeyBERTInspired
# Update our topic representations using KeyBERTInspired
representation_model = KeyBERTInspired()
topic_model.update_topics(abstracts, representation_model=representation_model)
# Show topic differences
topic_differences(topic_model, original_topics)

Then we use MaximalMarginalRelevance to tighten this up, putting our function back to use at the end. For reference, this concept is a retrieval and re-ranking technique that balances relevance to a query with diversity among the retrieved documents.
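To make that relevance/diversity trade-off concrete, here is a toy greedy MMR sketch. The function, scores, and similarity matrices below are illustrative assumptions, not BERTopic's internal implementation: with two near-duplicate candidates, a nonzero diversity weight pushes the second pick toward the distinct one.

```python
import numpy as np

def mmr(doc_sims, doc_doc_sims, top_k=2, diversity=0.2):
    """Greedy Maximal Marginal Relevance (toy sketch).

    doc_sims: similarity of each candidate to the query/topic.
    doc_doc_sims: pairwise similarity between the candidates.
    """
    selected = [int(np.argmax(doc_sims))]  # start with the most relevant
    candidates = [i for i in range(len(doc_sims)) if i not in selected]
    while candidates and len(selected) < top_k:
        # Relevance, minus redundancy with what we've already picked
        scores = [
            (1 - diversity) * doc_sims[c]
            - diversity * max(doc_doc_sims[c][s] for s in selected)
            for c in candidates
        ]
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected

# Three candidates: 0 and 1 are near-duplicates, 2 is distinct
doc_sims = np.array([0.9, 0.85, 0.6])
doc_doc_sims = np.array([
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
])
print(mmr(doc_sims, doc_doc_sims, top_k=2, diversity=0.5))  # → [0, 2]
```

With diversity=0.0 the second pick would simply be the next most relevant candidate (index 1), duplicates and all.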
from bertopic.representation import MaximalMarginalRelevance
# Update our topic representations to MaximalMarginalRelevance
representation_model = MaximalMarginalRelevance(diversity=0.2)
topic_model.update_topics(abstracts, representation_model=representation_model)
# Show topic differences
topic_differences(topic_model, original_topics)

Where is the LLM?
Throughout this post we've shown how to convert documents into embeddings and how to compress the dimensionality of those embeddings, so you might be wondering: where is the LLM? Let's put one to work. We've rearranged our data from original to updated to tighten up how each topic is modeled; you can see differences such as "question" becoming "qa".
The next code block calls google/flan-t5-small; if you're limited to CPU power, you can do this portion with another model, such as one served via API.
from transformers import pipeline
from bertopic.representation import TextGeneration
prompt = """I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: '[KEYWORDS]'.
Based on the documents and keywords, what is this topic about?"""
# Update our topic representations using Flan-T5
generator = pipeline("text2text-generation", model="google/flan-t5-small")
representation_model = TextGeneration(
    generator, prompt=prompt, doc_length=50, tokenizer="whitespace"
)
topic_model.update_topics(abstracts, representation_model=representation_model)
# Show topic differences
topic_differences(topic_model, original_topics)

Great: now we can see that even the small model was able to transform our Updated column with closer categorization.
Visualizing our updated topics
import numpy as np
# Ensure clean input types
titles = [str(t) for t in titles]
reduced_embeddings = np.array(reduced_embeddings, dtype=float)
fig = topic_model.visualize_documents(
    titles,
    reduced_embeddings=reduced_embeddings,
    width=1000,
    height=800,
    hide_annotations=True
)

Summary
This blog post covered both the use of embeddings and the compression of embeddings using algorithms such as UMAP and HDBSCAN; the latter represents the data in a hierarchy, which allows us to find outliers without specifying the number of clusters up front. The key idea is to go beyond the mere use of embeddings and recognize the power of methods for finding outliers in your data. BERTopic gives us flexibility and speeds up the process by wrapping the models in a pipeline; through additions such as KeyBERTInspired, you could also use a small language model to refine what wasn't captured by the flan-t5 model.