# 8. Machine Learning for NLP with Python

Machine Learning (ML) provides the core algorithms that power many Natural Language Processing (NLP) applications. While previous chapters introduced classical models and feature engineering, this chapter focuses on the broader application of ML techniques to various NLP tasks using Python. We will explore supervised, unsupervised, and semi-supervised learning approaches for problems like text classification, clustering, topic modeling, and sequence labeling.

## Overview of ML Paradigms in NLP

1. Supervised Learning: Models learn from labeled data (e.g., text documents paired with categories). Common tasks include text classification, sentiment analysis, and named entity recognition.
2. Unsupervised Learning: Models find patterns in unlabeled data. Common tasks include topic modeling, text clustering, and dimensionality reduction.
3. Semi-Supervised Learning: Models leverage a small amount of labeled data along with a large amount of unlabeled data.
4. Reinforcement Learning: Less common in core NLP, but used in areas like dialogue systems.

This chapter primarily focuses on supervised and unsupervised learning techniques commonly applied in NLP using Python libraries like scikit-learn, NLTK, and Gensim.

## Supervised Learning for NLP Tasks

### Text Classification Revisited

We previously built text classifiers using classical models. Let's refine that process with scikit-learn pipelines, which chain vectorization and classification into a single estimator, and compare models using cross-validation.
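For context, the prediction snippet below assumes a set of fitted pipelines has already been built and scored. Here is a minimal sketch of what that setup might look like, using a toy three-class sentiment dataset and two candidate models; the data, model choices, and the `results` / `label_map` structures are illustrative assumptions, not fixed conventions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy three-class sentiment data; label_map translates string labels to ids.
label_map = {"negative": 0, "neutral": 1, "positive": 2}
texts = [
    "I love this product, it works perfectly",
    "Absolutely fantastic, highly recommend it",
    "Terrible quality, broke after one day",
    "Worst purchase I have ever made",
    "It is okay, nothing special",
    "Does the job, neither good nor bad",
]
labels = [2, 2, 0, 0, 1, 1]

# One pipeline per candidate model, so vectorization and classification
# are fit together and preprocessing never sees test data during CV.
candidates = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, clf in candidates.items():
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
        ("clf", clf),
    ])
    cv_accuracy = cross_val_score(pipeline, texts, labels, cv=2).mean()
    pipeline.fit(texts, labels)  # refit on all data for later predictions
    results[name] = {"pipeline": pipeline, "cv_accuracy": cv_accuracy}

best_model_name = max(results, key=lambda name: results[name]["cv_accuracy"])
print(f"Best model by cross-validated accuracy: {best_model_name}")
```

Because each `Pipeline` bundles the vectorizer with the classifier, cross-validation refits the TF-IDF vocabulary inside every fold, avoiding information leakage from the held-out split.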

Once the candidate pipelines are fitted and scored, the best one can classify new, unseen text:

```python
# Retrieve the best pipeline and classify an unseen review.
best_pipeline = results[best_model_name]["pipeline"]
new_text = ["This is an average product, not great but not bad either."]
prediction = best_pipeline.predict(new_text)

# Invert label_map to translate the numeric prediction back to its string label.
predicted_label = list(label_map.keys())[list(label_map.values()).index(prediction[0])]
print(f"Prediction for '{new_text[0]}' -> {predicted_label}")
```

### Sequence Labeling (e.g., POS Tagging, NER)

Sequence labeling assigns a label to each token in a sequence. Classical ML approaches often use features derived from the word and its context.

#### Conditional Random Fields (CRFs)

CRFs are discriminative probabilistic graphical models that predict an entire label sequence jointly, letting each token's label depend on neighboring labels as well as on arbitrary features of the input. This makes them a strong classical choice for sequence labeling tasks like Part-of-Speech (POS) tagging and Named Entity Recognition (NER).
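The inspection snippet further below assumes a trained CRF together with `sent2tokens`, `test_sents`, `y_test`, and `y_pred`. One plausible way to produce them is sketched here with the sklearn-crfsuite package and NLTK's CoNLL-2002 Spanish NER corpus; the feature template is intentionally minimal and would be richer in practice.

```python
import nltk
import sklearn_crfsuite
from sklearn_crfsuite import metrics

nltk.download("conll2002", quiet=True)

def word2features(sent, i):
    """Features for token i: the word itself plus light context cues."""
    word = sent[i][0]
    features = {
        "word.lower": word.lower(),
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
        "BOS": i == 0,
        "EOS": i == len(sent) - 1,
    }
    if i > 0:
        features["prev_word.lower"] = sent[i - 1][0].lower()
    if i < len(sent) - 1:
        features["next_word.lower"] = sent[i + 1][0].lower()
    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for _, _, label in sent]

def sent2tokens(sent):
    return [token for token, _, _ in sent]

# CoNLL-2002 Spanish sentences: lists of (word, POS, NER-tag) triples.
train_sents = list(nltk.corpus.conll2002.iob_sents("esp.train"))
test_sents = list(nltk.corpus.conll2002.iob_sents("esp.testb"))

X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]
X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs", c1=0.1, c2=0.1,
    max_iterations=50, all_possible_transitions=True,
)
crf.fit(X_train, y_train)
y_pred = crf.predict(X_test)
print(f"Token-level weighted F1: {metrics.flat_f1_score(y_test, y_pred, average='weighted'):.4f}")
```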

After training, individual predictions can be inspected against the gold labels:

```python
# Inspect one test sentence alongside its predicted and true label sequences.
i = 10  # example sentence index
print(f"Sample sentence: {' '.join(sent2tokens(test_sents[i]))}")
print(f"Predicted labels: {y_pred[i]}")
print(f"True labels:      {y_test[i]}")
```

## Unsupervised Learning for NLP Tasks

Unsupervised learning helps discover hidden structures in unlabeled text data.

### Text Clustering

Clustering groups similar documents together without predefined labels.
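The plotting fragment below needs a fitted clustering and a 2-D layout to draw. Here is a minimal sketch, assuming a handful of toy documents, TF-IDF features, K-Means with three clusters, and a TruncatedSVD projection for display; all of these choices are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The stock market rallied after strong earnings reports",
    "Investors are watching the central bank's interest rate decision",
    "The team won the championship in overtime",
    "A late goal decided the football match",
    "New smartphone models feature faster processors",
    "The latest laptop chips improve battery life",
]

# Vectorize, cluster, then project to two dimensions for plotting.
X = TfidfVectorizer(stop_words="english").fit_transform(documents)
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
coords = TruncatedSVD(n_components=2, random_state=42).fit_transform(X)

scatter = plt.scatter(coords[:, 0], coords[:, 1], c=cluster_labels, cmap="viridis")
plt.title("K-Means clusters (TruncatedSVD projection)")
```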

The 2-D scatter of cluster assignments can then be given a legend and saved:

```python
# Attach a legend mapping colors to cluster ids, then save the figure.
plt.legend(handles=scatter.legend_elements()[0], title="Clusters")
plt.savefig("kmeans_clustering.png")
print("K-Means clustering visualization saved to kmeans_clustering.png")
```

### Topic Modeling

Topic modeling algorithms like Latent Dirichlet Allocation (LDA) discover abstract topics within a collection of documents.
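A minimal Gensim LDA sketch follows, assuming tiny hand-tokenized documents purely for illustration; it produces the `lda_model`, `corpus`, and `dictionary` objects that the visualization snippet below expects.

```python
from gensim import corpora
from gensim.models import LdaModel

# Tiny hand-tokenized documents; real corpora need proper preprocessing.
tokenized_docs = [
    ["stock", "market", "earnings", "investors"],
    ["interest", "rates", "market", "bank"],
    ["team", "championship", "goal", "match"],
    ["football", "season", "team", "coach"],
    ["smartphone", "processor", "battery", "chip"],
    ["laptop", "chip", "battery", "performance"],
]

# Map each token to an integer id, then encode documents as bags of words.
dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

lda_model = LdaModel(corpus=corpus, id2word=dictionary,
                     num_topics=3, passes=10, random_state=42)

for topic_id, words in lda_model.print_topics(num_words=4):
    print(f"Topic {topic_id}: {words}")
```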

print(" Generating pyLDAvis visualization...") vis_data = gensimvis.prepare(lda_model, corpus, dictionary) pyLDAvis.save_html(vis_data, 'lda_visualization.html') print("LDA visualization saved to lda_visualization.html")

## Combining ML Techniques

Often, unsupervised techniques like topic modeling or clustering can generate features for supervised models.
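As a concrete sketch of this idea, the code below reuses the `tokenized_docs`, `corpus`, and `lda_model` from the topic-modeling example, appends each document's LDA topic proportions to its TF-IDF vector, and trains a classifier on the stacked matrix. The toy labels, the 50/50 split, and the choice of logistic regression are assumptions for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Reuse tokenized_docs, corpus, and lda_model from the LDA example above.
raw_docs = [" ".join(doc) for doc in tokenized_docs]
labels_comb = [0, 0, 1, 1, 2, 2]  # toy labels: finance / sports / tech

tfidf_matrix = TfidfVectorizer().fit_transform(raw_docs)

# One row of topic proportions per document, from the fitted LDA model.
topic_matrix = np.zeros((len(corpus), lda_model.num_topics))
for doc_idx, bow in enumerate(corpus):
    for topic_id, prob in lda_model.get_document_topics(bow, minimum_probability=0.0):
        topic_matrix[doc_idx, topic_id] = prob

# Stack sparse TF-IDF columns next to the dense topic proportions.
X_comb = hstack([tfidf_matrix, csr_matrix(topic_matrix)]).tocsr()

X_train_c, X_test_c, y_train_c, y_test_comb = train_test_split(
    X_comb, labels_comb, test_size=0.5, random_state=42, stratify=labels_comb)

clf_comb = LogisticRegression(max_iter=1000).fit(X_train_c, y_train_c)
y_pred_comb = clf_comb.predict(X_test_c)
accuracy_comb = accuracy_score(y_test_comb, y_pred_comb)
```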

print(" Classification using Combined TF-IDF and Topic Features:") print(f"Accuracy: {accuracy_comb:.4f}") print(classification_report(y_test_comb, y_pred_comb)) else: print(" Skipping combined feature classification due to dataset mismatch.")

## Conclusion

Machine learning provides a powerful toolkit for tackling diverse NLP problems. This chapter demonstrated how to apply supervised techniques like classification (Naive Bayes, Logistic Regression, SVM) and sequence labeling (CRF), as well as unsupervised methods like clustering (K-Means) and topic modeling (LDA) using Python libraries.

Key takeaways:

- Supervised learning requires labeled data and is effective for tasks like classification and sequence labeling.
- Unsupervised learning discovers patterns in unlabeled data, useful for clustering and topic modeling.
- Pipelines are essential for creating reproducible and efficient ML workflows in NLP.
- Combining features from different sources (e.g., TF-IDF + topic models) can sometimes improve performance.
- Choosing the right ML algorithm and features depends heavily on the specific NLP task and data characteristics.

While classical ML models are valuable, the next chapters will delve into deep learning approaches, which often achieve state-of-the-art performance on complex NLP tasks by automatically learning hierarchical features from raw text.

Practice exercises:

1. Apply different clustering algorithms (e.g., DBSCAN, Agglomerative Clustering) to text data and compare the results.
2. Train a POS tagger using a different classical ML model (e.g., SVM with engineered features) and compare its performance to the CRF model.
3. Experiment with different numbers of topics in LDA and evaluate topic coherence.
4. Build a text classifier using features derived from word embeddings (e.g., averaged Word2Vec vectors) and compare its performance to TF-IDF features.
5. Implement a semi-supervised learning approach (e.g., label propagation) for text classification when only a small amount of labeled data is available.