12. Sentiment Analysis and Text Classification in Python

Sentiment analysis and text classification are among the most widely used applications of Natural Language Processing (NLP). Sentiment analysis determines the emotional tone behind text (positive, negative, neutral), while text classification assigns predefined categories to documents. This chapter explores how to implement these tasks using various Python libraries and techniques, from traditional machine learning approaches to advanced deep learning models.

Introduction to Sentiment Analysis

Sentiment analysis (or opinion mining) aims to identify and extract subjective information from text. It has numerous applications, including:

- Brand monitoring and reputation management
- Customer feedback analysis
- Social media monitoring
- Market research and competitive analysis
- Product improvement

Let's start with basic sentiment analysis techniques and progressively move to more advanced methods.

Rule-Based Sentiment Analysis

The simplest approach uses lexicons (dictionaries) of words with associated sentiment scores.
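A minimal sketch of the lexicon idea, assuming a tiny hand-made lexicon and a one-word negation rule. `TOY_LEXICON`, `NEGATIONS`, and both function names are hypothetical and exist only for illustration; a real lexicon holds thousands of scored entries and applies richer heuristics.

```python
import re

# Toy lexicon for illustration only: word -> sentiment score.
TOY_LEXICON = {
    "good": 1.0, "great": 2.0, "love": 2.0, "excellent": 2.5,
    "bad": -1.0, "poor": -1.5, "hate": -2.0, "terrible": -2.5,
}
NEGATIONS = {"not", "no", "never"}


def lexicon_score(text):
    """Sum word scores, flipping the sign of a word that directly follows a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    for i, token in enumerate(tokens):
        score = TOY_LEXICON.get(token, 0.0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            score = -score
        total += score
    return total


def lexicon_label(text, threshold=0.5):
    """Map the summed score to a label; scores near zero count as neutral."""
    score = lexicon_score(text)
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


for sentence in ["I love this great phone", "This is not good", "It arrived on Monday"]:
    print(f"{sentence!r}: {lexicon_label(sentence)} ({lexicon_score(sentence):+.1f})")
```

Words missing from the lexicon contribute nothing, which is why coverage of the target domain matters so much for this family of methods.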

    df_components = df_results[['positive', 'neutral', 'negative']].copy()
    df_components.index = [f"Text {i+1}" for i in range(len(sentences))]
    # DataFrame.plot creates its own figure, so a separate plt.figure() call
    # would only leave an empty figure behind
    df_components.plot(kind='bar', figsize=(12, 6))
    plt.title('Sentiment Components')
    plt.ylabel('Score')
    plt.tight_layout()
    plt.savefig('sentiment_components.png')
    print("Sentiment components visualization saved to sentiment_components.png")

Limitations of Rule-Based Approaches

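Rule-based methods like VADER have limitations: they rely on fixed word scores and hand-written heuristics, so they struggle with context and nuance such as sarcasm, domain-specific vocabulary, and words whose polarity depends on the surrounding text.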

Machine Learning for Sentiment Analysis

Machine learning approaches can overcome some limitations of rule-based methods by learning patterns from labeled data.

Traditional ML Approach with Scikit-learn

    best_classifier = results[best_model_name]['classifier']
    print(f"\nPredictions using {best_model_name}:")
    for text in new_texts:
        result = predict_sentiment_ml(text, best_classifier, tfidf_vectorizer, label_names)
        print(f"Text: '{text}'")
        print(f"Predicted sentiment: {result['sentiment']}")
        print(f"Probabilities/Scores: {result['probabilities']}")
        print("-" * 50)
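The fragment above depends on names defined elsewhere. As a self-contained sketch of the same workflow, the following trains three scikit-learn classifiers on TF-IDF features and wraps prediction in a helper. The dataset is a tiny hand-written one, and this `predict_sentiment_ml` is an assumed reconstruction that only mirrors the call signature seen above; it is not necessarily the original implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Tiny hand-written dataset for illustration only (0 = negative, 1 = positive).
train_texts = [
    "I love this product, it is excellent",
    "Excellent quality and wonderful service",
    "Wonderful experience, I love it",
    "I hate this product, it is terrible",
    "Terrible quality and awful service",
    "Awful experience, I hate it",
]
train_labels = [1, 1, 1, 0, 0, 0]
label_names = ["negative", "positive"]

# Turn raw text into TF-IDF feature vectors.
tfidf_vectorizer = TfidfVectorizer()
X_train = tfidf_vectorizer.fit_transform(train_texts)

classifiers = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(),
}
for classifier in classifiers.values():
    classifier.fit(X_train, train_labels)


def predict_sentiment_ml(text, classifier, vectorizer, label_names):
    """Hypothetical helper: classify one text and return its label and scores."""
    features = vectorizer.transform([text])
    predicted = int(classifier.predict(features)[0])
    if hasattr(classifier, "predict_proba"):
        scores = classifier.predict_proba(features)[0]
    else:
        # LinearSVC has no predict_proba; fall back to the signed margin.
        scores = np.atleast_1d(classifier.decision_function(features))
    return {"sentiment": label_names[predicted], "probabilities": scores.tolist()}


for name, classifier in classifiers.items():
    result = predict_sentiment_ml(
        "wonderful, excellent service", classifier, tfidf_vectorizer, label_names
    )
    print(f"{name}: {result['sentiment']} {result['probabilities']}")
```

With only six training sentences the scores are not meaningful; the point is the shape of the pipeline: vectorize, fit, then reuse the same fitted vectorizer at prediction time.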

Feature Importance Analysis

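Understanding which words contribute most to sentiment classification helps explain and debug a model. For linear classifiers, the learned feature weights give a direct view of this.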

Deep Learning for Sentiment Analysis

Deep learning models, particularly those based on neural networks, can capture more complex patterns in text.

Using Word Embeddings with Keras

    print("\nPredictions using Deep Learning model:")
    for text in new_texts:
        result = predict_sentiment_dl(text, model, tokenizer)
        print(f"Text: '{text}'")
        print(f"Predicted sentiment: {result['sentiment']}")
        print(f"Probabilities: negative={result['probabilities']['negative']:.4f}, " +
              f"positive={result['probabilities']['positive']:.4f}, " +
              f"neutral={result['probabilities']['neutral']:.4f}")
        print("-" * 50)
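Before text can reach an embedding layer, it has to become fixed-length sequences of integer word ids. The sketch below shows that preprocessing step in plain Python as a stand-in for a framework tokenizer. The function names, the reserved ids (0 for padding, 1 for unknown words), and the choice to pad and truncate at the end are all assumptions made for illustration.

```python
from collections import Counter


def build_vocab(texts, num_words=None):
    """Assign integer ids by descending frequency; ids 0 and 1 are reserved."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {
        word: idx
        for idx, (word, _) in enumerate(counts.most_common(num_words), start=2)
    }


def texts_to_padded(texts, vocab, maxlen):
    """Convert texts to id sequences, truncated or zero-padded to maxlen."""
    rows = []
    for text in texts:
        ids = [vocab.get(word, 1) for word in text.lower().split()][:maxlen]
        rows.append(ids + [0] * (maxlen - len(ids)))
    return rows


corpus = ["good movie", "bad movie"]
vocab = build_vocab(corpus)
padded = texts_to_padded(["good movie", "movie was bad indeed truly"], vocab, maxlen=4)
print(vocab)
print(padded)
```

An embedding layer then maps each id to a dense vector, so every row of the padded matrix becomes a `maxlen`-by-embedding-size input for the network.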

Using Pre-trained Transformer Models

Transformer models like BERT have achieved state-of-the-art results in sentiment analysis:

        # Test with examples
        print("\nPredictions using Transformer model:")
        for text in new_texts:
            result = sentiment_pipeline(text)[0]
            print(f"Text: '{text}'")
            print(f"Predicted sentiment: {result['label']}")
            print(f"Confidence: {result['score']:.4f}")
            print("-" * 50)
    except Exception as e:
        print(f"Error loading Transformer pipeline: {e}")
        print("Try installing the necessary packages with: pip install transformers[torch]")
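As the fragment above shows, each pipeline result is a dictionary with a `label` and a `score`. A small post-processing step, sketched here with hand-written stand-in results rather than real model output, can turn low-confidence predictions into an explicit "uncertain" outcome. The helper name and the 0.75 threshold are hypothetical, and label strings vary from model to model.

```python
def interpret_prediction(prediction, min_confidence=0.75):
    """Normalize one pipeline-style result and flag low-confidence cases."""
    label = str(prediction["label"]).lower()
    score = float(prediction["score"])
    if score < min_confidence:
        return {"sentiment": "uncertain", "confidence": score}
    return {"sentiment": label, "confidence": score}


# Hand-written stand-ins for pipeline results; not produced by a real model.
mock_results = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.55},
]
for item in mock_results:
    print(interpret_prediction(item))
```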

Text Classification Beyond Sentiment

Text classification extends beyond sentiment analysis to categorize text into various predefined classes.

Multi-class Classification Example

    print(f"Text: '{text}'")
    print(f"Predicted category: {prediction}")
    print("Top probabilities:")
    for category, prob in sorted_probas[:3]:  # Show top 3
        print(f"  {category}: {prob:.4f}")
    print("-" * 50)
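A self-contained sketch of multi-class classification with ranked probabilities, assuming a TF-IDF plus Naive Bayes pipeline and a tiny hand-written dataset with four categories. The `top_categories` helper is hypothetical; it produces the kind of `sorted_probas` list the fragment above iterates over.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hand-written examples for illustration only.
train_texts = [
    "the team won the match with a late goal",
    "the coach praised the team after the match",
    "a goal in the final minute decided the match",
    "the new phone has a faster processor",
    "the software update improves the phone battery",
    "the laptop processor runs the new software",
    "the senate passed the election bill",
    "voters choose a president in the election",
    "the president signed the bill",
    "the doctor recommended a healthy diet",
    "regular exercise keeps the heart healthy",
    "the vaccine protects patients, the doctor said",
]
train_categories = ["sports"] * 3 + ["technology"] * 3 + ["politics"] * 3 + ["health"] * 3

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_categories)


def top_categories(text, model, k=3):
    """Return the k most probable (category, probability) pairs, best first."""
    probas = model.predict_proba([text])[0]
    sorted_probas = sorted(zip(model.classes_, probas), key=lambda p: p[1], reverse=True)
    return [(str(category), float(prob)) for category, prob in sorted_probas[:k]]


text = "the team scored a goal in the match"
print(f"Predicted category: {model.predict([text])[0]}")
for category, prob in top_categories(text, model):
    print(f"  {category}: {prob:.4f}")
```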

Hierarchical Classification

Some classification tasks involve hierarchical categories:

    print("\nHierarchical predictions:")
    for text in new_hierarchical_texts:
        result = predict_hierarchical(text, main_classifier, tfidf_vectorizer, sub_classifiers)
        print(f"Text: '{text}'")
        print(f"Main category: {result['main_category']}")
        print(f"Sub-category: {result['sub_category']}")
        print("-" * 50)
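One simple way to realize a two-level hierarchy is to train a main classifier over all documents and a separate sub-classifier for each main category. The sketch below assumes that design, a hand-written dataset, and Logistic Regression at both levels; `predict_hierarchical` mirrors the call signature in the fragment above but is an assumed reconstruction.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# (text, main category, sub-category); hand-written for illustration only.
data = [
    ("the striker scored a goal in the football match", "sports", "football"),
    ("the football team defended the penalty kick", "sports", "football"),
    ("she won the tennis set with a strong serve", "sports", "tennis"),
    ("the tennis player hit a backhand at the net", "sports", "tennis"),
    ("the new phone has a bright touchscreen and camera", "technology", "phones"),
    ("this phone battery lasts all day", "technology", "phones"),
    ("the laptop keyboard is comfortable for typing", "technology", "laptops"),
    ("this laptop has a fast processor and large screen", "technology", "laptops"),
]
texts = [row[0] for row in data]
main_labels = [row[1] for row in data]
sub_labels = [row[2] for row in data]

tfidf_vectorizer = TfidfVectorizer()
X = tfidf_vectorizer.fit_transform(texts)

# Level 1: one classifier over all documents.
main_classifier = LogisticRegression(max_iter=1000).fit(X, main_labels)

# Level 2: one classifier per main category, trained only on that category's rows.
sub_classifiers = {}
for main in sorted(set(main_labels)):
    rows = [i for i, label in enumerate(main_labels) if label == main]
    sub_classifiers[main] = LogisticRegression(max_iter=1000).fit(
        X[rows], [sub_labels[i] for i in rows]
    )


def predict_hierarchical(text, main_classifier, vectorizer, sub_classifiers):
    """Predict the main category first, then route to its sub-classifier."""
    features = vectorizer.transform([text])
    main_category = str(main_classifier.predict(features)[0])
    sub_category = str(sub_classifiers[main_category].predict(features)[0])
    return {"main_category": main_category, "sub_category": sub_category}


print(predict_hierarchical(
    "the football team scored a late goal", main_classifier, tfidf_vectorizer, sub_classifiers
))
```

A drawback of this routing design is that a mistake at the main level cannot be corrected further down, so the main classifier deserves the most careful evaluation.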

Advanced Techniques for Text Classification

Handling Class Imbalance

Many real-world text classification problems involve imbalanced classes:

        # Evaluate
        y_pred_smote = smote_classifier.predict(X_test_tfidf)
        print("\nResults with SMOTE:")
        print(classification_report(y_test, y_pred_smote, target_names=label_names))
    except Exception as e:
        print(f"Error applying SMOTE: {e}")
        print("Try installing imbalanced-learn: pip install imbalanced-learn")
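The fragment above evaluates a classifier trained with SMOTE, which needs the imbalanced-learn package. A lighter option that needs only scikit-learn is class weighting, sketched below. Note that this is a different technique: it reweights errors on the minority class instead of synthesizing new minority samples. The dataset is hand-written for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced toy data: 8 positive reviews, 2 negative (1 = positive, 0 = negative).
texts = [
    "great phone", "love the screen", "excellent battery", "works great",
    "very happy with it", "love it", "excellent value", "great camera",
    "terrible battery", "broke after a week",
]
labels = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])

# "balanced" weights are n_samples / (n_classes * class_count).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=labels)
print(dict(zip([0, 1], weights)))

X = TfidfVectorizer().fit_transform(texts)

# The same weighting is applied inside the model via class_weight="balanced".
weighted_clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, labels)
print(weighted_clf.predict(X))
```

Here the minority class gets weight 10 / (2 × 2) = 2.5 and the majority class 10 / (2 × 8) = 0.625, so each negative example counts four times as much as a positive one during training.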

Ensemble Methods

Combining multiple classifiers can improve performance:

    plt.figure(figsize=(10, 6))
    plt.bar(results.keys(), results.values())
    plt.title('Classifier Comparison')
    plt.ylabel('Accuracy')
    plt.ylim(0, 1)
    for i, (name, accuracy) in enumerate(results.items()):
        plt.text(i, accuracy + 0.02, f"{accuracy:.4f}", ha='center')
    plt.tight_layout()
    plt.savefig('classifier_comparison.png')
    print("\nClassifier comparison saved to classifier_comparison.png")
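A minimal sketch of one ensemble method, majority voting, using scikit-learn's `VotingClassifier`. The three base models and the hand-written dataset are assumptions chosen for illustration; the original comparison may have used other ensembles.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hand-written examples for illustration only (1 = positive, 0 = negative).
train_texts = [
    "I love this movie", "an excellent and moving story", "great acting, wonderful cast",
    "I hate this movie", "a terrible and boring story", "awful acting, dull cast",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Hard voting: each base model casts one vote and the majority label wins.
# It works with LinearSVC, which does not provide class probabilities.
ensemble = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("lr", LogisticRegression(max_iter=1000)),
            ("svm", LinearSVC()),
        ],
        voting="hard",
    ),
)
ensemble.fit(train_texts, train_labels)

new_texts = ["a wonderful story", "boring and awful"]
predictions = ensemble.predict(new_texts)
for text, label in zip(new_texts, predictions):
    print(f"{text!r} -> {label}")
```

Soft voting (`voting="soft"`) averages predicted probabilities instead, but it requires every base model to implement `predict_proba`.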

Evaluating Text Classification Models

Beyond accuracy, it's important to consider other metrics and evaluation techniques:

    plt.plot([0, 1], [0, 1], 'k--', lw=2)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend(loc="lower right")
    plt.tight_layout()
    plt.savefig('roc_curve.png')
    print("\nROC curve saved to roc_curve.png")
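To make the metrics concrete, the sketch below computes a confusion matrix, precision, recall, F1, and ROC AUC with scikit-learn. The labels and scores are a small hand-made example, not results from any model in this chapter.

```python
from sklearn.metrics import (
    confusion_matrix,
    precision_recall_fscore_support,
    roc_auc_score,
)

# Hand-made example: true labels, hard predictions, and positive-class scores.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]
y_score = [0.1, 0.2, 0.3, 0.6, 0.9, 0.8, 0.7, 0.4]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)

# ROC AUC uses the continuous scores, not the thresholded predictions.
auc = roc_auc_score(y_true, y_score)

print(cm)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} auc={auc:.4f}")
```

In this example 3 of the 4 predicted positives are correct and 3 of the 4 true positives are found, so precision, recall, and F1 are all 0.75. The AUC is 15/16 = 0.9375 because the scores rank 15 of the 16 positive-negative pairs correctly.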

Practical Applications and Deployment

Building a Simple Sentiment Analysis API

    print("\nModel and vectorizer saved for deployment.")
    print("To deploy, create a Flask API as shown in the commented code above.")
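The deployment steps can be sketched without committing to a web framework: persist the fitted pipeline, reload it as a server process would at startup, and wrap prediction in a handler that takes a JSON-like dictionary. The file name, the `handle_predict` function, and the response format are all hypothetical; inside a Flask route the same function body would sit behind the request parsing.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hand-written training data for illustration only.
texts = [
    "I love it", "excellent service", "great product",
    "I hate it", "terrible service", "awful product",
]
labels = ["positive", "positive", "positive", "negative", "negative", "negative"]

# Bundling vectorizer and classifier in one pipeline means one file to deploy.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)).fit(texts, labels)

model_path = os.path.join(tempfile.mkdtemp(), "sentiment_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# At service startup, load the model once and reuse it for every request.
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)


def handle_predict(payload, model=loaded_model):
    """Validate a JSON-like payload and return (response body, HTTP status)."""
    text = payload.get("text") if isinstance(payload, dict) else None
    if not isinstance(text, str) or not text.strip():
        return {"error": "Field 'text' must be a non-empty string."}, 400
    probabilities = model.predict_proba([text])[0]
    scores = {str(label): float(p) for label, p in zip(model.classes_, probabilities)}
    return {"sentiment": max(scores, key=scores.get), "probabilities": scores}, 200


print(handle_predict({"text": "great service, I love it"}))
print(handle_predict({}))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.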

Real-time Sentiment Analysis Dashboard

    print("\nTo create a real-time sentiment analysis dashboard:")
    print("1. Install Streamlit: pip install streamlit")
    print("2. Save the above code to a file (e.g., app.py)")
    print("3. Run the app: streamlit run app.py")

Conclusion

Sentiment analysis and text classification are powerful NLP techniques with numerous applications. This chapter covered:

1. Rule-based sentiment analysis using lexicons like VADER
2. Machine learning approaches with traditional algorithms like Naive Bayes, Logistic Regression, and SVM
3. Deep learning methods including LSTM networks and transformer models
4. Multi-class and hierarchical text classification for categorizing text beyond sentiment
5. Advanced techniques for handling class imbalance and improving performance
6. Evaluation methods beyond simple accuracy
7. Practical applications including API development and dashboards

Each approach has its strengths and weaknesses. Rule-based methods are simple and interpretable but struggle with context and nuance. Machine learning models can learn patterns from data but require labeled examples. Deep learning models often achieve the best performance but require more data and computational resources.

The choice of method depends on your specific requirements, available data, and computational constraints. For many applications, a combination of approaches may yield the best results.

Practice exercises:

1. Compare the performance of different sentiment analysis approaches on a domain-specific dataset (e.g., product reviews, movie reviews)
2. Build a multi-class classifier for categorizing news articles or blog posts
3. Implement a hierarchical classifier for a taxonomy of your choice
4. Create a sentiment analysis dashboard for monitoring social media mentions of a brand or product
5. Develop a system that combines rule-based and machine learning approaches for more robust sentiment analysis