Before the rise of deep learning, classical machine learning models were the workhorses of Natural Language Processing (NLP). These models, often combined with carefully engineered features like Bag-of-Words or TF-IDF, remain valuable for many NLP tasks, especially when data is limited or computational resources are constrained. This chapter explores how to build and evaluate classical NLP models using Python, primarily leveraging the scikit-learn library.
Overview of Classical Models for NLP
Several classical machine learning algorithms are well-suited for text data:
1. Naive Bayes: A probabilistic classifier based on Bayes' theorem with a "naive" assumption of feature independence. It is particularly effective for text classification.
2. Logistic Regression: A linear model that estimates class probabilities with a logistic function. Although defined for binary classification, it extends to multiclass problems via one-vs-rest or multinomial (softmax) formulations.
3. Support Vector Machines (SVM): A powerful classifier that finds an optimal separating hyperplane in high-dimensional space, making it effective for text classification with high-dimensional features.
4. Decision Trees and Random Forests: Tree-based models that partition the feature space. Random Forests improve robustness by averaging many decision trees.
5. K-Nearest Neighbors (KNN): A non-parametric algorithm that classifies a document by the majority class among its nearest neighbors in feature space.
We will implement these models for common NLP tasks like text classification.
Text Classification with Classical Models
Text classification involves assigning predefined categories or labels to text documents. A common example is sentiment analysis.
Preparing the Data
Let's use a sample dataset and apply feature engineering techniques from the previous chapter.
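As a minimal sketch, assume a tiny three-class sentiment dataset and a TF-IDF vectorizer like the one built in the previous chapter. The example texts, the label_map encoding, and the vectorizer settings below are illustrative assumptions; with a real corpus you would load your own data and reuse your existing feature engineering code.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative toy dataset; replace with your own corpus.
data = pd.DataFrame({
    "text": [
        "I loved this movie, it was fantastic",
        "Absolutely terrible, a waste of time",
        "It was okay, nothing special",
        "Great acting and a wonderful story",
        "I hated every minute of it",
        "An average film with some decent moments",
        "Brilliant direction and superb pacing",
        "Boring plot and poor dialogue",
        "Mediocre at best, but watchable",
        "One of the best films of the year",
        "Awful script and wooden performances",
        "Neither good nor bad, just fine",
    ],
    "label": ["positive", "negative", "neutral", "positive", "negative", "neutral",
              "positive", "negative", "neutral", "positive", "negative", "neutral"],
})

# Encode string labels as integers (negative=0, neutral=1, positive=2).
label_map = {"negative": 0, "neutral": 1, "positive": 2}
data["label_code"] = data["label"].map(label_map)

# Hold out part of the data for evaluation, stratified by class.
X_train, X_test, y_train, y_test = train_test_split(
    data["text"], data["label_code"],
    test_size=0.25, random_state=42, stratify=data["label_code"])

# TF-IDF features, as introduced in the previous chapter.
tfidf_vectorizer = TfidfVectorizer(lowercase=True, stop_words="english", ngram_range=(1, 2))
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

The shapes of the resulting feature matrices can then be checked: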
print(f"Training data shape: {X_train_tfidf.shape}") print(f"Testing data shape: {X_test_tfidf.shape}")
Naive Bayes Classifier
Multinomial Naive Bayes is often a good baseline for text classification.
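A minimal training and evaluation sketch, assuming the X_train_tfidf, X_test_tfidf, y_train, y_test, and label_map objects prepared above:

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Fit Multinomial Naive Bayes on the TF-IDF features.
nb_classifier = MultinomialNB(alpha=1.0)  # alpha controls additive (Laplace/Lidstone) smoothing
nb_classifier.fit(X_train_tfidf, y_train)

# Evaluate on the held-out test set.
y_pred_nb = nb_classifier.predict(X_test_tfidf)
print(f"Accuracy (Naive Bayes): {accuracy_score(y_test, y_pred_nb):.4f}")
print(classification_report(y_test, y_pred_nb, target_names=list(label_map.keys()), zero_division=0))

We can also inspect the predicted class probabilities for each test example: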
probabilities_nb = nb_classifier.predict_proba(X_test_tfidf)

print("\nSample Probabilities (Naive Bayes):")
for i in range(len(X_test)):
    predicted_label = list(label_map.keys())[list(label_map.values()).index(y_pred_nb[i])]
    print(f"Text: {X_test.iloc[i][:50]}...")
    print(f"  Predicted: {predicted_label}")
    print(f"  Probabilities: Neg={probabilities_nb[i][0]:.2f}, Neu={probabilities_nb[i][1]:.2f}, Pos={probabilities_nb[i][2]:.2f}")
Logistic Regression Classifier
Logistic Regression is another strong baseline, often performing well on text data.
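The same pattern applies, swapping in LogisticRegression; max_iter is raised as a precaution to help convergence on sparse TF-IDF features:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Multiclass logistic regression on the TF-IDF features.
lr_classifier = LogisticRegression(C=1.0, max_iter=1000, random_state=42)
lr_classifier.fit(X_train_tfidf, y_train)

y_pred_lr = lr_classifier.predict(X_test_tfidf)
print(f"Accuracy (Logistic Regression): {accuracy_score(y_test, y_pred_lr):.4f}")
print(classification_report(y_test, y_pred_lr, target_names=list(label_map.keys()), zero_division=0))

As with Naive Bayes, the class probabilities can be inspected: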
probabilities_lr = lr_classifier.predict_proba(X_test_tfidf)

print("\nSample Probabilities (Logistic Regression):")
for i in range(len(X_test)):
    predicted_label = list(label_map.keys())[list(label_map.values()).index(y_pred_lr[i])]
    print(f"Text: {X_test.iloc[i][:50]}...")
    print(f"  Predicted: {predicted_label}")
    print(f"  Probabilities: Neg={probabilities_lr[i][0]:.2f}, Neu={probabilities_lr[i][1]:.2f}, Pos={probabilities_lr[i][2]:.2f}")
Support Vector Machine (SVM) Classifier
SVMs, especially with a linear kernel, are often very effective for high-dimensional sparse text data.
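A minimal sketch using LinearSVC, scikit-learn's linear SVM geared toward exactly this kind of high-dimensional sparse input; the C value is a default-like assumption you would normally tune:

from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

# Linear SVM on the TF-IDF features.
svm_classifier = LinearSVC(C=1.0, random_state=42)
svm_classifier.fit(X_train_tfidf, y_train)

y_pred_svm = svm_classifier.predict(X_test_tfidf)
print(f"Accuracy (SVM): {accuracy_score(y_test, y_pred_svm):.4f}")
print(classification_report(y_test, y_pred_svm, target_names=list(label_map.keys()), zero_division=0))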
Random Forest Classifier
Random Forests can capture non-linear relationships but might be less effective than linear models on very high-dimensional sparse data like TF-IDF.
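A sketch of a Random Forest on the same features, including the impurity-based feature importances inspected below. The hyperparameters are illustrative, and get_feature_names_out assumes scikit-learn 1.0 or newer:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Train a Random Forest on the TF-IDF features.
rf_classifier = RandomForestClassifier(n_estimators=200, random_state=42)
rf_classifier.fit(X_train_tfidf, y_train)

y_pred_rf = rf_classifier.predict(X_test_tfidf)
print(f"Accuracy (Random Forest): {accuracy_score(y_test, y_pred_rf):.4f}")

# Rank TF-IDF features by the forest's impurity-based importances.
feature_importance_df = pd.DataFrame({
    "feature": tfidf_vectorizer.get_feature_names_out(),
    "importance": rf_classifier.feature_importances_,
}).sort_values("importance", ascending=False)

The top-ranked features can then be inspected: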
print(" Top 10 Important Features (Random Forest):") print(feature_importance_df.head(10))
Model Evaluation and Comparison
Evaluating models with appropriate metrics is crucial. Accuracy alone can be misleading, especially on imbalanced data, so per-class precision, recall, and F1 together with a confusion matrix give a fuller picture.
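One straightforward comparison is to score every model on the same held-out test set and then inspect the confusion matrix of the strongest one. The sketch below assumes the predictions computed earlier and, for illustration, treats the SVM as the best-performing model:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix

# Test-set accuracy for each classifier trained above.
results = {
    "Naive Bayes": accuracy_score(y_test, y_pred_nb),
    "Logistic Regression": accuracy_score(y_test, y_pred_lr),
    "SVM": accuracy_score(y_test, y_pred_svm),
    "Random Forest": accuracy_score(y_test, y_pred_rf),
}
for name, acc in results.items():
    print(f"{name}: {acc:.4f}")

# Confusion matrix for the model we will plot below.
best_model_name = "SVM"
target_names_svm = list(label_map.keys())
cm = confusion_matrix(y_test, y_pred_svm, labels=list(label_map.values()))

The matrix can then be visualized as a heatmap: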
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=target_names_svm, yticklabels=target_names_svm)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title(f'Confusion Matrix - {best_model_name}')
plt.savefig("confusion_matrix_svm.png")
print(f"\nConfusion matrix for {best_model_name} saved to confusion_matrix_svm.png")
Hyperparameter Tuning
Optimizing model hyperparameters can significantly improve performance.
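A sketch of a grid search over the linear SVM's regularization strength C. The parameter grid is an assumption you would widen for a real corpus, and cv=3 is used only so the folds stay valid on the tiny illustrative dataset (cv=5 or 10 is more common in practice):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Cross-validated search over the regularization strength.
param_grid_svm = {"C": [0.01, 0.1, 1.0, 10.0]}
grid_search_svm = GridSearchCV(
    LinearSVC(random_state=42),
    param_grid_svm,
    cv=3,
    scoring="accuracy",
    n_jobs=-1,
)
grid_search_svm.fit(X_train_tfidf, y_train)

print(f"Best parameters (SVM): {grid_search_svm.best_params_}")
print(f"Best cross-validation accuracy: {grid_search_svm.best_score_:.4f}")

The best estimator found by the search can then be evaluated on the test set: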
best_svm_classifier = grid_search_svm.best_estimator_
y_pred_best_svm = best_svm_classifier.predict(X_test_tfidf)
accuracy_best_svm = accuracy_score(y_test, y_pred_best_svm)
print(f"Test Set Accuracy (Best SVM): {accuracy_best_svm:.4f}")
Building a Reusable Pipeline
Combining feature extraction and classification into a scikit-learn Pipeline simplifies the workflow.
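A minimal sketch chaining the TF-IDF vectorizer and a Logistic Regression classifier so the pipeline accepts raw text directly; the new_texts examples are invented for illustration:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# One estimator from raw text to predicted label code.
text_clf_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english", ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])
text_clf_pipeline.fit(X_train, y_train)  # raw text in, not precomputed TF-IDF

# Classify unseen texts end to end and map codes back to label names.
inverse_label_map = {code: name for name, code in label_map.items()}
new_texts = ["This was a delightful surprise", "Not worth the ticket price"]
new_labels = [inverse_label_map[code] for code in text_clf_pipeline.predict(new_texts)]

The fitted pipeline's predictions on the new examples can then be printed: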
print(" Predictions on new texts:") for text, label in zip(new_texts, new_labels): print(f"Text: '{text}' -> Predicted: {label}")
Handling Imbalanced Data
Text datasets are often imbalanced. Techniques like adjusting class weights or resampling can help.
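The sketch below shows the resampling route with SMOTE from the third-party imbalanced-learn package, using its sampler-aware pipeline. The settings are assumptions: k_neighbors=2 only accommodates the tiny toy dataset (on a roughly balanced sample SMOTE changes little), and setting class_weight="balanced" on the classifier is the lighter-weight alternative:

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Oversample minority classes in TF-IDF space, then fit the classifier.
# (The simpler alternative: LogisticRegression(class_weight="balanced").)
smote_pipeline = ImbPipeline([
    ("smote", SMOTE(k_neighbors=2, random_state=42)),  # small k because the toy training set is tiny
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])
smote_pipeline.fit(X_train_tfidf, y_train)

y_pred_smote = smote_pipeline.predict(X_test_tfidf)
accuracy_smote = accuracy_score(y_test, y_pred_smote)
target_codes_lr = list(label_map.values())
target_names_lr = list(label_map.keys())

The resampled pipeline's performance can then be reported: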
print(" Pipeline with SMOTE Oversampling:") print(f"Accuracy: {accuracy_smote:.4f}") print(classification_report(y_test, y_pred_smote, labels=target_codes_lr, target_names=target_names_lr, zero_division=0))
Conclusion
Classical machine learning models provide strong baselines and are often effective for various NLP tasks, especially when paired with appropriate feature engineering techniques like TF-IDF.
Key takeaways:
- Naive Bayes, Logistic Regression, and SVMs are often excellent choices for text classification.
- Pipelines streamline the process of feature extraction and model training.
- Hyperparameter tuning is crucial for optimizing model performance.
- Handling imbalanced data is important for real-world applications.
While deep learning models have achieved state-of-the-art results in many complex NLP tasks, classical models remain relevant due to their interpretability, efficiency, and effectiveness on smaller datasets or simpler problems.
In the next chapter, we will delve deeper into word embeddings and vector representations, which bridge the gap between classical methods and modern deep learning approaches.
Practice exercises:
1. Train and evaluate a K-Nearest Neighbors (KNN) classifier on the text classification task.
2. Compare the performance of different feature representations (BoW vs. TF-IDF vs. Hashing) with the same classifier (e.g., Logistic Regression).
3. Implement cross-validation manually (without GridSearchCV) to evaluate model stability.
4. Explore different text preprocessing steps within the pipeline and analyze their impact on model accuracy.
5. Apply these classical models to a different NLP task, such as spam detection or topic classification.