Sentiment analysis and text classification are among the most widely used applications of Natural Language Processing (NLP). Sentiment analysis determines the emotional tone behind text (positive, negative, neutral), while text classification assigns predefined categories to documents. This chapter explores how to implement these tasks using various Python libraries and techniques, from traditional machine learning approaches to advanced deep learning models.
Introduction to Sentiment Analysis
Sentiment analysis (or opinion mining) aims to identify and extract subjective information from text. It has numerous applications, including:
- Brand monitoring and reputation management
- Customer feedback analysis
- Social media monitoring
- Market research and competitive analysis
- Product improvement
Let's start with basic sentiment analysis techniques and progressively move to more advanced methods.
Rule-Based Sentiment Analysis
The simplest approach uses lexicons (dictionaries) of words with associated sentiment scores.
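To make the idea concrete, here is a minimal lexicon scorer in plain Python. It is only a sketch: the word list, the scores, the one-token negation rule, and the names `LEXICON`, `lexicon_score`, and `label_sentiment` are illustrative assumptions and are not taken from VADER or any other published lexicon.

```python
# Toy lexicon: words and scores are illustrative assumptions, not VADER's values.
LEXICON = {
    "good": 1.0, "great": 2.0, "love": 2.0, "excellent": 3.0,
    "bad": -1.0, "poor": -1.5, "hate": -2.0, "terrible": -3.0,
}
NEGATIONS = {"not", "no", "never"}


def lexicon_score(text: str) -> float:
    """Sum word scores, flipping the sign of a word that directly follows a negation."""
    tokens = [tok.strip(".,!?;:").lower() for tok in text.split()]
    total = 0.0
    negate = False
    for tok in tokens:
        if tok in NEGATIONS:
            negate = True
            continue
        score = LEXICON.get(tok, 0.0)
        total += -score if negate else score
        negate = False  # a negation only affects the very next token
    return total


def label_sentiment(score: float, threshold: float = 0.5) -> str:
    """Map a raw score to a label; the threshold is an arbitrary choice."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


for sentence in ["I love this great phone", "This is not good", "The battery is okay"]:
    score = lexicon_score(sentence)
    print(f"{sentence!r}: score={score:+.1f}, label={label_sentiment(score)}")
```

Because the negation flag is reset after a single token, a phrase such as "not very good" would still score as positive here, which hints at why purely rule-based scoring needs many hand-written special cases.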
plt.figure(figsize=(12, 6))
df_components = df_results[['positive', 'neutral', 'negative']].copy()
df_components.index = [f"Text {i+1}" for i in range(len(sentences))]
df_components.plot(kind='bar', figsize=(12, 6))
plt.title('Sentiment Components')
plt.ylabel('Score')
plt.tight_layout()
plt.savefig('sentiment_components.png')
print("Sentiment components visualization saved to sentiment_components.png")
Limitations of Rule-Based Approaches
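Rule-based methods like VADER have limitations: they are simple and interpretable, but they struggle with context and nuance, and a fixed lexicon does not adapt on its own to domain-specific vocabulary.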
Machine Learning for Sentiment Analysis
Machine learning approaches can overcome some limitations of rule-based methods by learning patterns from labeled data.
Traditional ML Approach with Scikit-learn
best_classifier = results[best_model_name]['classifier']
print(f" Predictions using {best_model_name}:")
for text in new_texts:
    result = predict_sentiment_ml(text, best_classifier, tfidf_vectorizer, label_names)
    print(f"Text: '{text}'")
    print(f"Predicted sentiment: {result['sentiment']}")
    print(f"Probabilities/Scores: {result['probabilities']}")
    print("-" * 50)
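For readers who want something runnable end to end, the following is a minimal sketch of a TF-IDF plus Logistic Regression workflow with scikit-learn. The tiny dataset is made up for illustration, and `predict_sentiment` is a hypothetical helper written for this sketch; it should not be read as the definition of the `predict_sentiment_ml` function used in this chapter.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny illustrative dataset (an assumption; real work needs far more data).
train_texts = [
    "I love this product, it works great",
    "Excellent quality and fast delivery",
    "Really happy with this purchase",
    "Terrible experience, it broke after a day",
    "I hate the poor build quality",
    "Awful support and a waste of money",
]
train_labels = [1, 1, 1, 0, 0, 0]  # 1 = positive, 0 = negative
label_names = ["negative", "positive"]  # index matches the integer label

# Turn raw text into TF-IDF features, then fit a linear classifier.
tfidf_vectorizer = TfidfVectorizer()
X_train = tfidf_vectorizer.fit_transform(train_texts)
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, train_labels)


def predict_sentiment(text, classifier, vectorizer, label_names):
    """Hypothetical helper: return the predicted label and per-class probabilities."""
    features = vectorizer.transform([text])  # reuse the fitted vocabulary
    probabilities = classifier.predict_proba(features)[0]
    best = int(probabilities.argmax())
    return {
        "sentiment": label_names[int(classifier.classes_[best])],
        "probabilities": {
            label_names[int(cls)]: float(p)
            for cls, p in zip(classifier.classes_, probabilities)
        },
    }


result = predict_sentiment("I love it, great product", classifier, tfidf_vectorizer, label_names)
print(result)
```

Note that the vectorizer is fitted once on the training texts and only `transform` is called at prediction time, so new text is mapped onto the same vocabulary the classifier was trained on.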
Feature Importance Analysis
Understanding which words contribute most to sentiment classification:
Deep Learning for Sentiment Analysis
Deep learning models, particularly those based on neural networks, can capture more complex patterns in text.
Using Word Embeddings with Keras
print(" Predictions using Deep Learning model:")
for text in new_texts:
    result = predict_sentiment_dl(text, model, tokenizer)
    print(f"Text: '{text}'")
    print(f"Predicted sentiment: {result['sentiment']}")
    print(f"Probabilities: negative={result['probabilities']['negative']:.4f}, " +
          f"positive={result['probabilities']['positive']:.4f}, " +
          f"neutral={result['probabilities']['neutral']:.4f}")
    print("-" * 50)
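Before text can reach an embedding layer, it has to be converted into fixed-length sequences of integer word ids. The plain-Python sketch below illustrates that preprocessing step only; it is not the Keras tokenizer, and the names `build_vocab`, `texts_to_padded`, `PAD`, and `OOV` are assumptions made for this example.

```python
from collections import Counter

PAD, OOV = 0, 1  # reserved ids: padding and out-of-vocabulary words


def build_vocab(texts, max_words=1000):
    """Assign integer ids to the most frequent words, starting after the reserved ids."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {
        word: idx
        for idx, (word, _) in enumerate(counts.most_common(max_words), start=2)
    }


def texts_to_padded(texts, vocab, max_len):
    """Convert each text to a list of ids, truncated or right-padded to max_len."""
    sequences = []
    for text in texts:
        ids = [vocab.get(word, OOV) for word in text.lower().split()][:max_len]
        ids += [PAD] * (max_len - len(ids))
        sequences.append(ids)
    return sequences


corpus = ["the movie was great", "the movie was bad"]
vocab = build_vocab(corpus)
padded = texts_to_padded(["the movie was superb", "great"], vocab, max_len=6)
for row in padded:
    print(row)
```

An embedding layer then maps each id to a dense vector that is learned during training. This sketch pads and truncates at the end of each sequence; Keras's `pad_sequences` pads and truncates at the start by default, so the two are not drop-in replacements for each other.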
Using Pre-trained Transformer Models
Transformer models like BERT have achieved state-of-the-art results in sentiment analysis:
    # Test with examples
    print(" Predictions using Transformer model:")
    for text in new_texts:
        result = sentiment_pipeline(text)[0]
        print(f"Text: '{text}'")
        print(f"Predicted sentiment: {result['label']}")
        print(f"Confidence: {result['score']:.4f}")
        print("-" * 50)
except Exception as e:
    print(f"Error loading Transformer pipeline: {e}")
    print("Try installing the necessary packages with: pip install transformers[torch]")
Text Classification Beyond Sentiment
Text classification extends beyond sentiment analysis to categorize text into various predefined classes.
Multi-class Classification Example
print(f"Text: '{text}'")
print(f"Predicted category: {prediction}")
print("Top probabilities:")
for category, prob in sorted_probas[:3]:  # Show top 3
    print(f" {category}: {prob:.4f}")
print("-" * 50)
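As a self-contained illustration of multi-class prediction with ranked probabilities, here is a minimal sketch using a TF-IDF and Multinomial Naive Bayes pipeline. The eight example texts, the four categories, and the helper name `top_categories` are assumptions chosen for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training examples, two per category.
texts = [
    "the team won the championship game",
    "the striker scored two goals in the match",
    "the new smartphone has a faster processor",
    "the software update improves battery life",
    "the senate passed the new bill",
    "the president announced an election date",
    "the company reported record quarterly profits",
    "stock markets fell after the earnings report",
]
categories = [
    "sports", "sports",
    "technology", "technology",
    "politics", "politics",
    "business", "business",
]

# The pipeline vectorizes raw text and classifies it in one object.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, categories)


def top_categories(text, model, k=3):
    """Return the k most probable categories with their probabilities, best first."""
    probas = model.predict_proba([text])[0]
    ranked = sorted(zip(model.classes_, probas), key=lambda pair: pair[1], reverse=True)
    return [(str(category), float(prob)) for category, prob in ranked[:k]]


query = "the goalkeeper saved a penalty in the final match"
for category, prob in top_categories(query, model):
    print(f"{category}: {prob:.4f}")
```

Showing the top few probabilities rather than a single label makes it easier to spot texts the model is unsure about, for example when two categories receive nearly equal scores.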
Hierarchical Classification
Some classification tasks involve hierarchical categories:
print(" Hierarchical predictions:")
for text in new_hierarchical_texts:
    result = predict_hierarchical(text, main_classifier, tfidf_vectorizer, sub_classifiers)
    print(f"Text: '{text}'")
    print(f"Main category: {result['main_category']}")
    print(f"Sub-category: {result['sub_category']}")
    print("-" * 50)
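One simple way to structure such a system is a top-level classifier that picks the main category, followed by a separate classifier per main category that picks the sub-category. The sketch below shows that two-stage routing under stated assumptions: the data is invented, and this `predict_hierarchical` is a guess at what a helper with that call signature could look like, not the chapter's original definition.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented examples: (text, main category, sub-category).
data = [
    ("the team won the football match", "sports", "football"),
    ("a late goal decided the football game", "sports", "football"),
    ("the tennis player won the final set", "sports", "tennis"),
    ("she served an ace in the tennis match", "sports", "tennis"),
    ("the new phone has a faster chip", "technology", "hardware"),
    ("the laptop ships with more memory", "technology", "hardware"),
    ("the app update fixes several bugs", "technology", "software"),
    ("the software release adds new features", "technology", "software"),
]
texts = [text for text, _, _ in data]
main_labels = [main for _, main, _ in data]
sub_labels = [sub for _, _, sub in data]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Stage 1: one classifier over the main categories.
main_classifier = LogisticRegression(max_iter=1000).fit(X, main_labels)

# Stage 2: one classifier per main category, trained only on that category's rows.
sub_classifiers = {}
for main in sorted(set(main_labels)):
    rows = [i for i, label in enumerate(main_labels) if label == main]
    sub_classifiers[main] = LogisticRegression(max_iter=1000).fit(
        X[rows], [sub_labels[i] for i in rows]
    )


def predict_hierarchical(text, main_classifier, vectorizer, sub_classifiers):
    """Predict the main category first, then route to that category's sub-classifier."""
    features = vectorizer.transform([text])
    main_category = str(main_classifier.predict(features)[0])
    sub_category = str(sub_classifiers[main_category].predict(features)[0])
    return {"main_category": main_category, "sub_category": sub_category}


print(predict_hierarchical("the football team scored a goal",
                           main_classifier, vectorizer, sub_classifiers))
```

A consequence of this design is that a mistake at the first stage cannot be recovered at the second, since the text is only ever shown to the sub-classifier of the predicted main category.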
Advanced Techniques for Text Classification
Handling Class Imbalance
Many real-world text classification problems involve imbalanced classes:
    # Evaluate
    y_pred_smote = smote_classifier.predict(X_test_tfidf)
    print(" Results with SMOTE:")
    print(classification_report(y_test, y_pred_smote, target_names=label_names))
except Exception as e:
    print(f"Error applying SMOTE: {e}")
    print("Try installing imbalanced-learn: pip install imbalanced-learn")
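Class weighting is a different and lighter-weight way to counter imbalance that needs only scikit-learn. It is not SMOTE: instead of synthesizing new minority samples, it makes errors on rare classes cost more during training. The sketch below uses a made-up label array with an 8:2 split to show how the "balanced" weights are derived.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels: 8 samples of class 0, 2 samples of class 1.
y_train = np.array([0] * 8 + [1] * 2)
classes = np.unique(y_train)

# "balanced" weights are n_samples / (n_classes * count_of_class).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weights = dict(zip(classes.tolist(), weights.tolist()))
print(class_weights)  # class 0 -> 10 / (2 * 8) = 0.625, class 1 -> 10 / (2 * 2) = 2.5

# Most scikit-learn linear classifiers accept the same setting directly.
weighted_classifier = LogisticRegression(class_weight="balanced", max_iter=1000)
```

With these weights, each misclassified minority example contributes four times as much to the loss as a majority example, which is the inverse of the 8:2 class ratio.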
Ensemble Methods
Combining multiple classifiers can improve performance:
plt.figure(figsize=(10, 6))
# Convert the dict views to lists so matplotlib receives plain sequences
plt.bar(list(results.keys()), list(results.values()))
plt.title('Classifier Comparison')
plt.ylabel('Accuracy')
plt.ylim(0, 1)
# Annotate each bar with its accuracy value
for i, (name, accuracy) in enumerate(results.items()):
    plt.text(i, accuracy + 0.02, f"{accuracy:.4f}", ha='center')
plt.tight_layout()
plt.savefig('classifier_comparison.png')
print(" Classifier comparison saved to classifier_comparison.png")
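As one possible way to combine classifiers, scikit-learn's `VotingClassifier` averages the predicted probabilities of its members when `voting="soft"`. The sketch below is a minimal example under stated assumptions: the six training texts are invented, and the choice of Logistic Regression plus Multinomial Naive Bayes as members is arbitrary.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training data: 1 = positive, 0 = negative.
texts = [
    "I love this product, it works great",
    "Excellent quality and fast delivery",
    "Really happy with this purchase",
    "Terrible experience, it broke after a day",
    "I hate the poor build quality",
    "Awful support and a waste of money",
]
labels = [1, 1, 1, 0, 0, 0]

# Soft voting averages the members' class probabilities before picking a label.
ensemble = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("nb", MultinomialNB()),
        ],
        voting="soft",
    ),
)
ensemble.fit(texts, labels)

new_texts = ["great quality, I love it", "awful, a terrible waste"]
predictions = ensemble.predict(new_texts)
probabilities = ensemble.predict_proba(new_texts)
for text, label, probs in zip(new_texts, predictions, probabilities):
    print(f"{text!r} -> class {label}, probabilities {probs.round(3)}")
```

Soft voting requires every member to implement `predict_proba`. For models without probability estimates, `voting="hard"` takes a majority vote over predicted labels instead.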
Evaluating Text Classification Models
Beyond accuracy, it's important to consider other metrics and evaluation techniques:
plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.tight_layout()
plt.savefig('roc_curve.png')
print(" ROC curve saved to roc_curve.png")
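To see where these metrics come from, it can help to compute them by hand from the four confusion-matrix counts. The following plain-Python sketch does that for a binary problem; the function name `binary_metrics` and the ten example labels are assumptions made for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute common binary classification metrics from confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # also the true positive rate
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Made-up labels: 4 positives and 6 negatives, with one miss and one false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here 3 of the 4 predicted positives are correct (precision 0.75) and 3 of the 4 actual positives are found (recall 0.75), while accuracy is 0.8. Each point on an ROC curve is the pair (false positive rate, true positive rate) obtained at one decision threshold.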
Practical Applications and Deployment
Building a Simple Sentiment Analysis API
print(" Model and vectorizer saved for deployment.")
print("To deploy, create a Flask API as shown in the commented code above.")
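The core of such a service is small: persist the fitted model and vectorizer, load them once at startup, and map a JSON-like request to a JSON-like response. The sketch below shows that core with `joblib` and a framework-agnostic handler that a web route could call. The file names, the `handle_predict` function, the payload shape, and the toy training data are all assumptions made for this example.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

LABEL_NAMES = ["negative", "positive"]  # index matches the integer label

# Train a toy model (invented data) so there is something to persist.
texts = [
    "I love this product", "Excellent quality", "Really happy with it",
    "Terrible experience", "I hate the build quality", "Awful support",
]
labels = [1, 1, 1, 0, 0, 0]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
classifier = LogisticRegression(max_iter=1000).fit(X, labels)

# Save both artifacts; the vectorizer is needed to reproduce the feature space.
model_dir = tempfile.mkdtemp()
model_path = os.path.join(model_dir, "sentiment_model.joblib")
vectorizer_path = os.path.join(model_dir, "tfidf_vectorizer.joblib")
joblib.dump(classifier, model_path)
joblib.dump(vectorizer, vectorizer_path)

# In a real service these loads would run once at startup.
loaded_classifier = joblib.load(model_path)
loaded_vectorizer = joblib.load(vectorizer_path)


def handle_predict(payload: dict) -> dict:
    """Hypothetical request handler: validate input, predict, return a plain dict."""
    text = payload.get("text", "")
    if not isinstance(text, str) or not text.strip():
        return {"error": "Field 'text' must be a non-empty string."}
    features = loaded_vectorizer.transform([text])
    label_id = int(loaded_classifier.predict(features)[0])
    confidence = float(loaded_classifier.predict_proba(features)[0].max())
    return {"sentiment": LABEL_NAMES[label_id], "confidence": confidence}


print(handle_predict({"text": "I love the excellent quality"}))
print(handle_predict({}))
```

Keeping the handler independent of the web framework makes it easy to unit-test and to reuse, whether it ends up behind a Flask route or another server. Files saved with `joblib` should only be loaded from trusted sources, since loading them can execute arbitrary code.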
Real-time Sentiment Analysis Dashboard
print(" To create a real-time sentiment analysis dashboard:")
print("1. Install Streamlit: pip install streamlit")
print("2. Save the above code to a file (e.g., app.py)")
print("3. Run the app: streamlit run app.py")
Conclusion
Sentiment analysis and text classification are powerful NLP techniques with numerous applications. This chapter covered:
1. Rule-based sentiment analysis using lexicons like VADER
2. Machine learning approaches with traditional algorithms like Naive Bayes, Logistic Regression, and SVM
3. Deep learning methods including LSTM networks and transformer models
4. Multi-class and hierarchical text classification for categorizing text beyond sentiment
5. Advanced techniques for handling class imbalance and improving performance
6. Evaluation methods beyond simple accuracy
7. Practical applications including API development and dashboards
Each approach has its strengths and weaknesses. Rule-based methods are simple and interpretable but struggle with context and nuance. Machine learning models can learn patterns from data but require labeled examples. Deep learning models often achieve the best performance but require more data and computational resources.
The choice of method depends on your specific requirements, available data, and computational constraints. For many applications, a combination of approaches may yield the best results.
Practice exercises:
1. Compare the performance of different sentiment analysis approaches on a domain-specific dataset (e.g., product reviews, movie reviews)
2. Build a multi-class classifier for categorizing news articles or blog posts
3. Implement a hierarchical classifier for a taxonomy of your choice
4. Create a sentiment analysis dashboard for monitoring social media mentions of a brand or product
5. Develop a system that combines rule-based and machine learning approaches for more robust sentiment analysis