Topic modeling and text clustering are powerful unsupervised learning techniques that help discover hidden thematic structures in large collections of documents. These methods are essential for organizing, searching, and understanding large text corpora without requiring labeled data. This chapter explores how to implement these techniques using Python libraries and provides practical examples for various applications.
Introduction to Topic Modeling
Topic modeling algorithms identify abstract "topics" that occur in a collection of documents. A topic is typically represented as a probability distribution over words, where words that frequently co-occur are grouped together. These techniques help answer questions like:
- What are the main themes in a collection of documents?
- How do these themes evolve over time?
- Which documents discuss similar subjects?
Latent Dirichlet Allocation (LDA)
LDA is one of the most popular topic modeling techniques. It represents documents as mixtures of topics, where each topic is a probability distribution over words.
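The "mixture of topics" idea is easiest to see by running LDA's generative story forward. The sketch below is an illustration only, not how `gensim` fits a model: the two toy topics, their word probabilities, and the helper names (`sample_dirichlet`, `generate_document`) are all assumptions made for this example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(topics, num_words, alpha, rng):
    """Generate one document following LDA's generative process."""
    theta = sample_dirichlet(alpha, rng)      # document-topic mixture
    words = []
    for _ in range(num_words):
        z = sample_index(theta, rng)          # choose a topic for this word
        vocab = list(topics[z])
        word_probs = list(topics[z].values())
        words.append(vocab[sample_index(word_probs, rng)])  # choose a word from that topic
    return theta, words

# Hypothetical toy topics: each one is a probability distribution over words.
toy_topics = [
    {"goal": 0.5, "match": 0.3, "team": 0.2},     # a "sports" topic
    {"stock": 0.5, "market": 0.3, "price": 0.2},  # a "finance" topic
]

rng = random.Random(0)
theta, words = generate_document(toy_topics, num_words=8, alpha=[0.5, 0.5], rng=rng)
print("Topic mixture:", [round(t, 3) for t in theta])
print("Words:", words)
```

Fitting LDA means running this process in reverse: given only the words, infer the topic-word distributions and each document's mixture.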
Let's implement LDA using the `gensim` library:
print("\nTopic distribution for each document:")
for i, doc in enumerate(corpus):
    print(f"Document {i+1}: {documents[i][:50]}...")
    topic_dist = lda_model.get_document_topics(doc)
    for topic_id, prob in topic_dist:
        print(f" Topic {topic_id}: {prob:.4f}")
Evaluating Topic Models
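Evaluating topic models is challenging because there are no ground-truth labels to compare against. Common approaches include coherence scores, which measure how often a topic's top words co-occur in the same documents, and perplexity, which measures how well the model predicts held-out text (lower is better).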
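To make the coherence idea concrete, here is a dependency-free sketch of one common variant, UMass coherence, which scores a topic by how often its top words appear together in documents. The function name and the toy corpus are assumptions for illustration; library implementations add options this sketch omits.

```python
import math

def umass_coherence(top_words, documents):
    """UMass coherence: sum of log((D(w_m, w_l) + 1) / D(w_l)) over word pairs."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                continue  # skip words that never occur in the corpus
            score += math.log((doc_freq(top_words[m], top_words[l]) + 1) / d_l)
    return score

# Hypothetical tokenized corpus.
toy_docs = [
    ["cat", "dog", "pet"],
    ["cat", "dog"],
    ["stock", "market"],
    ["stock", "price", "market"],
]

coherent = umass_coherence(["cat", "dog"], toy_docs)
mixed = umass_coherence(["cat", "stock"], toy_docs)
print(round(coherent, 3))  # 0.405
print(round(mixed, 3))     # -0.693
```

Words that co-occur ("cat", "dog") score higher than words that never share a document ("cat", "stock"), which is the behaviour a coherence metric is meant to reward.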
    # Select the model with the highest coherence score
    optimal_lda_model = model_list[coherence_values.index(max(coherence_values))]
    print("\nTop words per topic (optimal model):")
    for idx, topic in optimal_lda_model.print_topics(-1):
        print(f"Topic {idx}: {topic}")
except Exception as e:
    print(f"Error computing coherence values: {e}")
Non-Negative Matrix Factorization (NMF)
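NMF is another popular topic modeling technique. It factorizes a non-negative document-term matrix (typically TF-IDF) into a document-topic matrix and a topic-term matrix, and it often produces more coherent topics than LDA for short texts.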
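The factorization behind NMF can be sketched without any libraries using the classic multiplicative update rules. This is a toy illustration under stated assumptions: the 4x4 matrix stands in for a document-term matrix, and the function names, rank, and iteration count are chosen only for this example. In practice a library implementation such as scikit-learn's `NMF` would be used.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factorize V (docs x terms) into W (docs x topics) and H (topics x terms)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h

# Hypothetical document-term weights: two documents per theme.
toy_matrix = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 4],
]

w0, h0 = nmf(toy_matrix, rank=2, iterations=0)   # random starting point
w, h = nmf(toy_matrix, rank=2, iterations=200)
print("Initial error:", round(frobenius_error(toy_matrix, w0, h0), 3))
print("Final error:  ", round(frobenius_error(toy_matrix, w, h), 3))
```

Because every entry of `w` and `h` stays non-negative, each row of `h` can be read as a topic (term weights) and each row of `w` as a document's topic weights.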
plt.figure(figsize=(12, 8))
plt.imshow(nmf_document_topics, cmap='viridis', aspect='auto')
plt.colorbar(label='Topic Weight')
plt.xlabel('Topic')
plt.ylabel('Document')
plt.title('NMF Topic Distribution Across Documents')
plt.tight_layout()
plt.savefig('nmf_topic_distribution.png')
print("\nNMF topic distribution visualization saved to 'nmf_topic_distribution.png'")
BERTopic: Modern Topic Modeling with Transformers
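BERTopic combines transformer models with traditional clustering techniques. It embeds documents with a transformer model, reduces the dimensionality of the embeddings (UMAP by default), clusters them (HDBSCAN by default), and then extracts representative words for each cluster using a class-based TF-IDF.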
Text Clustering
Text clustering groups similar documents together based on their content. Unlike topic modeling, which assigns multiple topics to each document, clustering typically assigns each document to a single cluster.
K-Means Clustering
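K-Means is a popular clustering algorithm that partitions data into K clusters by repeatedly assigning each point to its nearest centroid and then moving each centroid to the mean of its assigned points.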
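The partitioning step can be sketched in a few lines of plain Python (Lloyd's algorithm). The 2-D points below are hypothetical stand-ins for document vectors, and the naive "first K points" initialization is an assumption made to keep the example deterministic; library implementations typically use smarter seeding such as k-means++.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, max_iterations=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = [list(p) for p in points[:k]]  # naive initialization (assumption)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical 2-D document vectors forming two well-separated groups.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

labels, centroids = k_means(toy_points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Even though both initial centroids start inside the first group, the update step pulls one of them toward the second group and the assignments settle after a few iterations.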
plt.figure(figsize=(12, 8))
scatter = plt.scatter(tfidf_tsne[:, 0], tfidf_tsne[:, 1],
                      c=clusters, cmap='viridis', alpha=0.8)
plt.colorbar(scatter, label='Cluster')
plt.title('K-Means Clustering Visualization (t-SNE)')
plt.tight_layout()
plt.savefig('kmeans_tsne_visualization.png')
print("\nK-Means clustering visualization saved to 'kmeans_tsne_visualization.png'")
Hierarchical Clustering
Hierarchical clustering creates a tree of clusters, allowing for exploration at different levels of granularity.
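A bottom-up (agglomerative) version of this idea can be sketched as follows: start with every document in its own cluster and repeatedly merge the two closest clusters. Average linkage, the toy points, and the function names are assumptions chosen for illustration; other linkage criteria (single, complete, Ward) follow the same pattern.

```python
import math

def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(math.dist(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerative(points, num_clusters):
    """Merge the closest pair of clusters until num_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    merges = []  # merge history: (cluster_a, cluster_b, distance)
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return clusters, merges

# Hypothetical 2-D document vectors.
toy_points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]

three, _ = agglomerative(toy_points, 3)
two, _ = agglomerative(toy_points, 2)
print(sorted(sorted(c) for c in three))  # [[0, 1], [2, 3], [4]]
print(sorted(sorted(c) for c in two))    # [[0, 1, 2, 3], [4]]
```

Cutting the same merge tree at three clusters or at two gives a finer or coarser grouping, which is the "different levels of granularity" described above.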
plt.figure(figsize=(12, 8))
scatter = plt.scatter(tfidf_tsne[:, 0], tfidf_tsne[:, 1],
                      c=hierarchical_labels, cmap='viridis', alpha=0.8)
plt.colorbar(scatter, label='Cluster')
plt.title('Hierarchical Clustering Visualization (t-SNE)')
plt.tight_layout()
plt.savefig('hierarchical_tsne_visualization.png')
print("\nHierarchical clustering visualization saved to 'hierarchical_tsne_visualization.png'")
DBSCAN: Density-Based Clustering
DBSCAN identifies clusters as dense regions separated by regions of lower density, and can discover clusters of arbitrary shape.
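The density idea can be sketched in plain Python. A point is treated as a "core" point if at least `min_samples` points (itself included) lie within distance `eps`; clusters grow outward from core points, and anything unreachable is labelled noise (`-1`). The toy points and parameter values are assumptions for illustration.

```python
import math

def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (including itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]

def dbscan(points, eps, min_samples):
    NOISE = -1
    labels = [None] * len(points)  # None means "not yet visited"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # core point: keep expanding the cluster
    return labels

# Hypothetical 2-D document vectors: two dense groups and one outlier.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]

dbscan_labels = dbscan(toy_points, eps=1.5, min_samples=3)
print(dbscan_labels)  # [0, 0, 0, 1, 1, 1, -1]
```

Unlike K-Means, the number of clusters is not specified up front, and the isolated point is reported as noise rather than forced into a cluster.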
plt.figure(figsize=(12, 8))
scatter = plt.scatter(tfidf_tsne[:, 0], tfidf_tsne[:, 1],
                      c=dbscan_labels, cmap='viridis', alpha=0.8)
plt.colorbar(scatter, label='Cluster')
plt.title('DBSCAN Clustering Visualization (t-SNE)')
plt.tight_layout()
plt.savefig('dbscan_tsne_visualization.png')
print("\nDBSCAN clustering visualization saved to 'dbscan_tsne_visualization.png'")
Combining Topic Modeling and Clustering
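Topic modeling and clustering can be combined: for example, documents can be clustered first, and the average topic distribution of each cluster then describes what that cluster is about.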
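One simple way to combine the two, sketched here under assumptions, is to average the document-topic weights within each cluster. The topic vectors and cluster labels below are hypothetical; in practice they would come from a fitted topic model and a clustering algorithm.

```python
from collections import defaultdict

def average_topics_per_cluster(doc_topics, cluster_labels):
    """Return {cluster label: mean topic-weight vector of its documents}."""
    grouped = defaultdict(list)
    for vec, label in zip(doc_topics, cluster_labels):
        grouped[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }

# Hypothetical document-topic weights (one row per document, one column per topic).
toy_doc_topics = [
    [0.9, 0.1],
    [0.7, 0.3],
    [0.2, 0.8],
    [0.4, 0.6],
]
toy_labels = [0, 0, 1, 1]  # hypothetical cluster assignment per document

cluster_profiles = average_topics_per_cluster(toy_doc_topics, toy_labels)
for label, profile in sorted(cluster_profiles.items()):
    print(f"Cluster {label}: {[round(p, 2) for p in profile]}")
```

Here cluster 0 is dominated by the first topic and cluster 1 by the second, which gives each cluster a readable description.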
plt.xlabel('Topic ID')
plt.ylabel('Average Topic Weight')
plt.title('Average Topic Distribution per Cluster')
plt.xticks(np.arange(num_topics) + 0.25,
           [f'Topic {i}' for i in range(num_topics)])
plt.legend()
plt.tight_layout()
plt.savefig('topic_cluster_distribution.png')
print("\nTopic-cluster distribution visualization saved to 'topic_cluster_distribution.png'")
Real-World Applications
Document Organization and Retrieval
Topic modeling and clustering can help organize large document collections and improve retrieval.
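As a minimal sketch of topic-based retrieval, documents and a query can each be represented as a topic-weight vector and ranked by cosine similarity. The vectors, titles, and the `find_similar` helper are hypothetical; a real system would infer the query's vector with the same topic model used for the documents.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def find_similar(query_vec, doc_vecs, documents, top_n=2):
    """Return the top_n (document, similarity) pairs, most similar first."""
    scored = [
        (doc, cosine_similarity(query_vec, vec))
        for doc, vec in zip(documents, doc_vecs)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hypothetical documents with topic weights for (sports, finance).
toy_documents = ["Football match report", "Stock market update", "League transfer news"]
toy_doc_vecs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]]
toy_query_vec = [1.0, 0.0]  # a query that is purely about sports

similar_docs = find_similar(toy_query_vec, toy_doc_vecs, toy_documents)
for doc, similarity in similar_docs:
    print(f"{similarity:.4f}  {doc}")
```

Matching in topic space lets a query retrieve documents that share a theme even when they share few exact words.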
print(f"\nQuery: {query}")
print("Similar documents:")
for doc, similarity in similar_docs:
    print(f" - Similarity: {similarity:.4f}")
    print(f" Document: {doc}")
Content Recommendation
Topic modeling can power content recommendation systems.
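One way this could work, sketched under assumptions, is to build a user profile as the average topic vector of the items the user has already read and then recommend unread items whose topic vectors are closest to that profile. The catalog, its topic weights, and the helper names are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_profile(read_titles, catalog):
    """Average the topic vectors of the items the user has read."""
    vectors = [catalog[title] for title in read_titles]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def recommend(read_titles, catalog, top_n=1):
    """Rank unread items by similarity to the user's topic profile."""
    profile = build_profile(read_titles, catalog)
    candidates = [
        (title, cosine_similarity(profile, vec))
        for title, vec in catalog.items()
        if title not in read_titles
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hypothetical catalog: topic weights for (machine learning, finance, cooking).
toy_catalog = {
    "Intro to neural networks": [0.8, 0.1, 0.1],
    "Stock market basics": [0.1, 0.8, 0.1],
    "Deep learning for vision": [0.7, 0.1, 0.2],
    "Healthy cooking tips": [0.1, 0.1, 0.8],
}

recommendations = recommend(["Intro to neural networks"], toy_catalog)
for doc, similarity in recommendations:
    print(f"{similarity:.4f}  {doc}")
```

The already-read item is excluded from the candidates, so the top result is the unread item with the most similar topic mix.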
print(f"\nUser interests: {user_interests}")
print("Recommended content:")
for doc, similarity in recommendations:
    print(f" - Similarity: {similarity:.4f}")
    print(f" Document: {doc}")
Trend Analysis
Topic modeling can identify trends in text data over time.
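A simple way to surface such trends, sketched here with hypothetical data, is to group documents by ISO week and average their topic weights within each week. The dates and weights below are made up for illustration; real inputs would pair each document's timestamp with its inferred topic distribution.

```python
from collections import defaultdict
from datetime import date

def weekly_topic_trends(dated_topics):
    """Return {ISO week number: average topic-weight vector for that week}."""
    by_week = defaultdict(list)
    for day, weights in dated_topics:
        by_week[day.isocalendar()[1]].append(weights)
    return {
        week: [sum(col) / len(rows) for col in zip(*rows)]
        for week, rows in sorted(by_week.items())
    }

# Hypothetical (publication date, topic weights) pairs.
toy_dated_topics = [
    (date(2024, 1, 1), [0.8, 0.2]),
    (date(2024, 1, 3), [0.6, 0.4]),
    (date(2024, 1, 8), [0.3, 0.7]),
    (date(2024, 1, 10), [0.1, 0.9]),
]

trends = weekly_topic_trends(toy_dated_topics)
for week, averages in trends.items():
    print(f"Week {week}: {[round(a, 2) for a in averages]}")
```

In this toy data the first topic fades from week 1 to week 2 while the second rises, which is the kind of shift a trend plot would show.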
plt.xlabel('Week Number')
plt.ylabel('Average Topic Weight')
plt.title('Topic Trends Over Time')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.savefig('topic_trends.png')
print("\nTopic trends visualization saved to 'topic_trends.png'")
Advanced Topic Modeling Techniques
Dynamic Topic Models
Dynamic Topic Models (DTM) extend LDA to capture how topics evolve over time.
for topic_id in range(5):
    for time_slice in range(len(time_slices)):
        print(f"Topic {topic_id} at time {time_slice}:")
        print(dtm.print_topic_times(topic_id, time_slice))
"""
print("\nDynamic Topic Modeling requires larger datasets with time information.")
print("The concept involves tracking how topics evolve over different time periods.")
Correlated Topic Models
Correlated Topic Models (CTM) extend LDA by modeling correlations between topics.
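The key change in CTM is the prior over topic mixtures: instead of a Dirichlet, mixtures are drawn from a logistic normal, meaning a Gaussian vector passed through a softmax, so a covariance structure can make topics rise and fall together. The sketch below only samples from such a prior and is not a CTM fitting procedure; the shared-factor construction, noise levels, and function names are assumptions for illustration.

```python
import math
import random

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def sample_topic_mixture(rng):
    """Draw a 3-topic mixture in which topics 0 and 1 are positively correlated."""
    shared = rng.gauss(0.0, 1.0)  # common factor that links topics 0 and 1
    eta = [
        shared + rng.gauss(0.0, 0.3),
        shared + rng.gauss(0.0, 0.3),
        rng.gauss(0.0, 1.0),      # topic 2 varies independently
    ]
    return eta, softmax(eta)

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(42)
samples = [sample_topic_mixture(rng) for _ in range(2000)]
etas = [eta for eta, _ in samples]

corr_01 = pearson([e[0] for e in etas], [e[1] for e in etas])
corr_02 = pearson([e[0] for e in etas], [e[2] for e in etas])
print(f"Correlation between topics 0 and 1: {corr_01:.2f}")
print(f"Correlation between topics 0 and 2: {corr_02:.2f}")
```

A Dirichlet prior cannot express this kind of positive dependence between specific topics, which is the limitation CTM is designed to address.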
print("\nTop terms for each topic:")
feature_names = tfidf_vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(topic_term_matrix):
    top_words_idx = topic.argsort()[:-11:-1]  # indices of the 10 highest-weighted terms
    top_words = [feature_names[i] for i in top_words_idx]
    print(f"Topic {topic_idx}: {' '.join(top_words)}")
Topic Modeling with Word Embeddings
Incorporating word embeddings can improve topic coherence.
documents, document_scores, document_ids = model.search_documents_by_topic(topic_num=0, num_docs=5)
"""
print("\nTopic modeling with word embeddings combines the strengths of")
print("distributional semantics with traditional topic modeling approaches.")
print("Libraries like Top2Vec and BERTopic implement these techniques.")
Visualizing Topics and Clusters
Effective visualization is crucial for interpreting topic models and clusters.
    # Plot
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(f'Topic {topic_id} Word Cloud')
    plt.tight_layout()
    plt.savefig(f'topic_{topic_id}_wordcloud.png')
    print(f"Word cloud for Topic {topic_id} saved to 'topic_{topic_id}_wordcloud.png'")
except ImportError:
    print("\nWordCloud not installed. To create word clouds, install:")
    print("pip install wordcloud")
Conclusion
Topic modeling and text clustering are powerful techniques for discovering hidden structures in text data. This chapter covered:
1. Topic Modeling Techniques:
   - Latent Dirichlet Allocation (LDA)
   - Non-Negative Matrix Factorization (NMF)
   - Modern approaches like BERTopic
2. Text Clustering Methods:
   - K-Means Clustering
   - Hierarchical Clustering
   - DBSCAN
3. Evaluation Metrics:
   - Coherence scores for topic models
   - Silhouette scores for clustering
4. Applications:
   - Document organization and retrieval
   - Content recommendation
   - Trend analysis
5. Advanced Techniques:
   - Dynamic Topic Models
   - Correlated Topic Models
   - Topic modeling with word embeddings
These unsupervised learning approaches are particularly valuable when dealing with large collections of unlabeled text data. By identifying latent topics and grouping similar documents, they provide structure and insights that can inform further analysis or drive practical applications like search, recommendation, and content organization.
Practice exercises:
1. Apply topic modeling to a corpus of news articles and analyze how topics differ across news categories.
2. Compare the performance of LDA and NMF on short texts like tweets or product reviews.
3. Implement a document retrieval system using topic modeling.
4. Analyze topic evolution over time in a corpus with timestamps.
5. Combine topic modeling with supervised learning for improved text classification.