# Feature Engineering for NLP in Python
Feature engineering is a critical step in the Natural Language Processing (NLP) pipeline that transforms preprocessed text into numerical representations that machine learning algorithms can understand. This chapter explores various techniques for converting text data into meaningful features, from traditional approaches like Bag-of-Words to advanced embedding methods.
## Understanding Feature Engineering for Text
Machine learning models require numerical inputs, but text is inherently symbolic and unstructured. Feature engineering bridges this gap by converting text into vectors of numbers that capture various linguistic properties while preserving semantic meaning.
The quality of these features significantly impacts model performance. Good features should:

- Capture relevant patterns in the text
- Represent semantic relationships between words and documents
- Be computationally efficient
- Scale well with large vocabularies and datasets
Let's explore the most important feature engineering techniques for NLP, implementing each with Python code examples.
## Count-Based Methods
### Bag-of-Words (BoW)
The Bag-of-Words model is one of the simplest and most common text representation methods. It represents text as an unordered collection of words, disregarding grammar and word order but keeping track of word frequency.
```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Sample documents
documents = [
    "Natural language processing is fascinating.",
    "I love working with text data.",
    "NLP combines linguistics and computer science.",
    "Text processing techniques are essential for NLP.",
    "Processing natural language requires understanding linguistics."
]

# Create a Bag-of-Words representation
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

# Get feature names (vocabulary)
feature_names = vectorizer.get_feature_names_out()

# Convert to DataFrame for better visualization
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=feature_names)
print("Bag-of-Words Matrix:")
print(bow_df)

# Examine the vocabulary
print(f"\nVocabulary size: {len(feature_names)}")
print(f"Vocabulary: {feature_names}")

# Sparse matrix details
print(f"\nMatrix shape: {bow_matrix.shape}")
print(f"Matrix density (fraction of non-zero entries): {bow_matrix.nnz / (bow_matrix.shape[0] * bow_matrix.shape[1]):.4f}")
```
#### Customizing the Bag-of-Words Model
You can customize the BoW model to handle different tokenization patterns, n-grams, and vocabulary constraints:
```python
# Customized Bag-of-Words
custom_vectorizer = CountVectorizer(
    min_df=2,              # Minimum document frequency (ignore terms that appear in fewer than 2 documents)
    max_df=0.8,            # Maximum document frequency (ignore terms that appear in more than 80% of documents)
    ngram_range=(1, 2),    # Include both unigrams and bigrams
    stop_words='english',  # Remove English stopwords
    token_pattern=r'\w+'   # Only keep word characters
)

custom_bow = custom_vectorizer.fit_transform(documents)
custom_features = custom_vectorizer.get_feature_names_out()

# Convert to DataFrame
custom_bow_df = pd.DataFrame(custom_bow.toarray(), columns=custom_features)
print("Customized Bag-of-Words Matrix:")
print(custom_bow_df)

print(f"\nCustomized vocabulary size: {len(custom_features)}")
print(f"Customized vocabulary: {custom_features}")
```
### Term Frequency-Inverse Document Frequency (TF-IDF)
TF-IDF addresses a limitation of BoW by weighting terms based on their importance in a document relative to the entire corpus. It reduces the impact of common words and emphasizes distinctive terms.
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TF-IDF representation
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Get feature names
tfidf_features = tfidf_vectorizer.get_feature_names_out()

# Convert to DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_features)
print("TF-IDF Matrix:")
print(tfidf_df)

# Compare BoW and TF-IDF for a specific term across documents
term = "processing"
if term in feature_names and term in tfidf_features:
    term_index_bow = list(feature_names).index(term)
    term_index_tfidf = list(tfidf_features).index(term)

    print(f"\nComparing '{term}' representation:")
    print("Document | BoW Count | TF-IDF Weight")
    print("-" * 40)
    for i in range(len(documents)):
        bow_value = bow_matrix[i, term_index_bow]
        tfidf_value = tfidf_matrix[i, term_index_tfidf]
        print(f"{i+1:8} | {bow_value:9} | {tfidf_value:.6f}")
```
#### Customizing TF-IDF
TF-IDF can be customized to adjust term weighting and normalization:
```python
# Customized TF-IDF
custom_tfidf = TfidfVectorizer(
    min_df=2,              # Minimum document frequency
    max_df=0.8,            # Maximum document frequency
    ngram_range=(1, 2),    # Include both unigrams and bigrams
    stop_words='english',  # Remove English stopwords
    norm='l2',             # Apply L2 normalization
    use_idf=True,          # Enable IDF weighting
    smooth_idf=True,       # Add 1 to document frequencies to prevent division by zero
    sublinear_tf=True      # Apply sublinear TF scaling (1 + log(TF))
)

custom_tfidf_matrix = custom_tfidf.fit_transform(documents)
custom_tfidf_features = custom_tfidf.get_feature_names_out()

# Convert to DataFrame
custom_tfidf_df = pd.DataFrame(custom_tfidf_matrix.toarray(), columns=custom_tfidf_features)
print("Customized TF-IDF Matrix:")
print(custom_tfidf_df)
```
### N-grams
N-grams capture sequences of N consecutive words, helping to preserve some word order information.
```python
# Unigrams, Bigrams, and Trigrams
unigram_vectorizer = CountVectorizer(ngram_range=(1, 1))
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
trigram_vectorizer = CountVectorizer(ngram_range=(3, 3))
combined_vectorizer = CountVectorizer(ngram_range=(1, 3))

# Fit and transform
unigram_matrix = unigram_vectorizer.fit_transform(documents)
bigram_matrix = bigram_vectorizer.fit_transform(documents)
trigram_matrix = trigram_vectorizer.fit_transform(documents)
combined_matrix = combined_vectorizer.fit_transform(documents)

# Get feature names
unigram_features = unigram_vectorizer.get_feature_names_out()
bigram_features = bigram_vectorizer.get_feature_names_out()
trigram_features = trigram_vectorizer.get_feature_names_out()
combined_features = combined_vectorizer.get_feature_names_out()

print(f"Number of unigram features: {len(unigram_features)}")
print(f"Sample unigrams: {unigram_features[:5]}")

print(f"\nNumber of bigram features: {len(bigram_features)}")
print(f"Sample bigrams: {bigram_features[:5]}")

print(f"\nNumber of trigram features: {len(trigram_features)}")
print(f"Sample trigrams: {trigram_features[:5] if len(trigram_features) >= 5 else trigram_features}")

print(f"\nNumber of combined n-gram features: {len(combined_features)}")
print(f"Sample combined n-grams: {combined_features[:5]}")
```
### Character N-grams
Character n-grams can capture sub-word patterns and are useful for handling misspellings and morphological variations.
```python
# Character n-grams
char_vectorizer = CountVectorizer(
    analyzer='char',
    ngram_range=(3, 5)  # Character n-grams from 3 to 5 characters long
)

char_matrix = char_vectorizer.fit_transform(documents)
char_features = char_vectorizer.get_feature_names_out()

print(f"Number of character n-gram features: {len(char_features)}")
print(f"Sample character n-grams: {char_features[:10]}")

# Character n-grams restricted to word boundaries
char_wb_vectorizer = CountVectorizer(
    analyzer='char_wb',  # Character n-grams only from text inside word boundaries
    ngram_range=(3, 5)
)

char_wb_matrix = char_wb_vectorizer.fit_transform(documents)
char_wb_features = char_wb_vectorizer.get_feature_names_out()

print(f"\nNumber of character word-boundary n-gram features: {len(char_wb_features)}")
print(f"Sample character word-boundary n-grams: {char_wb_features[:10]}")
```
### Hashing Vectorizer
The Hashing Vectorizer is a memory-efficient alternative to CountVectorizer and TfidfVectorizer, especially useful for large datasets.
```python
from sklearn.feature_extraction.text import HashingVectorizer

# Create a Hashing Vectorizer
hash_vectorizer = HashingVectorizer(
    n_features=2**10,      # 1024 features
    alternate_sign=False   # Use only positive values
)

hash_matrix = hash_vectorizer.fit_transform(documents)

# Note: the Hashing Vectorizer doesn't provide feature names
print(f"Hashing Vectorizer matrix shape: {hash_matrix.shape}")
print(f"Hashing Vectorizer matrix density: {hash_matrix.nnz / (hash_matrix.shape[0] * hash_matrix.shape[1]):.4f}")

# Convert to DataFrame (with arbitrary feature names)
hash_df = pd.DataFrame(
    hash_matrix.toarray(),
    columns=[f"feature_{i}" for i in range(hash_matrix.shape[1])]
)
print("\nHashing Vectorizer Matrix (first 5 columns):")
print(hash_df.iloc[:, :5])
```
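The key idea is the "hashing trick": instead of building a vocabulary, each token is hashed directly to a column index. The sketch below illustrates the idea with scikit-learn's `murmurhash3_32`; it demonstrates the principle rather than reproducing `HashingVectorizer`'s exact internal indexing.

```python
from sklearn.utils import murmurhash3_32

def hashed_index(token, n_features=2**10):
    """Map a token to a column index without storing a vocabulary (illustrative only)."""
    return murmurhash3_32(token, seed=0, positive=True) % n_features

for token in ["processing", "language", "nlp"]:
    print(f"'{token}' -> column {hashed_index(token)}")
```

Because different tokens can hash to the same column, occasional collisions are possible; that is the price paid for the constant memory footprint.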
## Word Embeddings
Word embeddings represent words as dense vectors in a continuous vector space, where semantically similar words are mapped to nearby points.
### Word2Vec
Word2Vec learns word embeddings either by predicting a target word from its surrounding context (CBOW) or by predicting the context words from a target word (Skip-gram).
```python
import gensim
from gensim.models import Word2Vec
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import numpy as np

# Prepare sentences for Word2Vec (requires tokenized sentences)
tokenized_sentences = [doc.lower().split() for doc in documents]

# Train Word2Vec model
w2v_model = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=100,  # Embedding dimension
    window=5,         # Context window size
    min_count=1,      # Minimum word frequency
    workers=4,        # Number of threads
    sg=1              # Skip-gram model (use 0 for CBOW)
)

# Get vocabulary
vocabulary = list(w2v_model.wv.index_to_key)
print(f"Word2Vec vocabulary: {vocabulary}")

# Get vector for a specific word
word = "processing"
if word in w2v_model.wv:
    vector = w2v_model.wv[word]
    print(f"\nVector for '{word}' (first 10 dimensions): {vector[:10]}")

# Find similar words
similar_words = w2v_model.wv.most_similar("language", topn=3)
print(f"\nWords most similar to 'language': {similar_words}")

# Visualize word embeddings in 2D
def plot_embeddings(model, words=None):
    # Use the full vocabulary if no word list is given
    if words is None:
        words = list(model.wv.index_to_key)

    # Get word vectors
    word_vectors = np.array([model.wv[word] for word in words])

    # Apply PCA to reduce to 2 dimensions
    pca = PCA(n_components=2)
    result = pca.fit_transform(word_vectors)

    # Create a scatter plot
    plt.figure(figsize=(10, 8))
    plt.scatter(result[:, 0], result[:, 1], c='b', alpha=0.5)

    # Add word labels
    for i, word in enumerate(words):
        plt.annotate(word, xy=(result[i, 0], result[i, 1]))

    plt.title("Word Embeddings Visualization")
    plt.savefig("word_embeddings.png")
    print("Word embeddings visualization saved to 'word_embeddings.png'")

# Plot embeddings
plot_embeddings(w2v_model)
```
### Pre-trained Word Embeddings
Using pre-trained word embeddings like GloVe or fastText can be more effective than training your own, especially for small datasets.
```python
import gensim.downloader as api
import numpy as np

# Load pre-trained GloVe embeddings
try:
    glove_vectors = api.load("glove-wiki-gigaword-100")
    print(f"Loaded GloVe embeddings with {len(glove_vectors)} words")

    # Check if specific words are in the vocabulary
    test_words = ["language", "processing", "computer", "artificial", "intelligence"]
    for word in test_words:
        if word in glove_vectors:
            print(f"'{word}' is in the vocabulary")
        else:
            print(f"'{word}' is NOT in the vocabulary")

    # Get vector for a word
    if "language" in glove_vectors:
        vector = glove_vectors["language"]
        print(f"\nGloVe vector for 'language' (first 10 dimensions): {vector[:10]}")

    # Find similar words
    if "language" in glove_vectors:
        similar_words = glove_vectors.most_similar("language", topn=5)
        print(f"\nWords most similar to 'language' in GloVe: {similar_words}")

    # Word analogies
    if all(word in glove_vectors for word in ["king", "man", "woman"]):
        result = glove_vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
        print(f"\nking - man + woman = {result}")
except Exception as e:
    print(f"Error loading GloVe embeddings: {e}")
    print("You may need to download the embeddings first with: python -m gensim.downloader --download glove-wiki-gigaword-100")
```
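One simple way to turn pre-trained word vectors into document-level features is mean pooling: average the vectors of the in-vocabulary tokens in each document. This is a minimal sketch that assumes the GloVe vectors loaded successfully above; tokens not found in the vocabulary (including punctuation left by the naive `split()`) are skipped.

```python
import numpy as np

def mean_pool(text, keyed_vectors):
    """Average the embeddings of in-vocabulary tokens; return a zero vector if none match."""
    tokens = [t for t in text.lower().split() if t in keyed_vectors]
    if not tokens:
        return np.zeros(keyed_vectors.vector_size)
    return np.mean([keyed_vectors[t] for t in tokens], axis=0)

try:
    doc_features = np.vstack([mean_pool(doc, glove_vectors) for doc in documents])
    print(f"Mean-pooled document feature matrix shape: {doc_features.shape}")
except NameError:
    print("GloVe vectors not loaded; skipping the mean-pooling example.")
```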
### Document Embeddings with Doc2Vec
Doc2Vec extends Word2Vec to learn embeddings for entire documents.
```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Prepare documents for Doc2Vec
tagged_documents = [TaggedDocument(words=doc.lower().split(), tags=[i])
                    for i, doc in enumerate(documents)]

# Train Doc2Vec model
d2v_model = Doc2Vec(
    documents=tagged_documents,
    vector_size=100,  # Embedding dimension
    window=5,         # Context window size
    min_count=1,      # Minimum word frequency
    workers=4,        # Number of threads
    epochs=100        # Number of training epochs
)

# Get document vectors
doc_vectors = [d2v_model.dv[i] for i in range(len(documents))]

# Print first document vector
print(f"Document vector for first document (first 10 dimensions): {doc_vectors[0][:10]}")

# Find most similar document to a query
query = "natural language processing techniques"
query_vector = d2v_model.infer_vector(query.lower().split())
similar_docs = d2v_model.dv.most_similar([query_vector], topn=len(documents))

print(f"\nDocuments most similar to query '{query}':")
for doc_id, similarity in similar_docs:
    print(f"Document {doc_id+1}: {similarity:.4f} - {documents[doc_id]}")
```
## Sentence and Document Embeddings
### Sentence-BERT
Sentence-BERT (SBERT) provides semantically meaningful sentence embeddings that can be compared using cosine similarity.
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained Sentence-BERT model
try:
    sbert_model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

    # Encode sentences
    sentence_embeddings = sbert_model.encode(documents)

    print(f"Sentence embeddings shape: {sentence_embeddings.shape}")
    print(f"First sentence embedding (first 10 dimensions): {sentence_embeddings[0][:10]}")

    # Calculate cosine similarity between all sentence pairs
    similarity_matrix = cosine_similarity(sentence_embeddings)

    # Print similarity matrix
    print("\nSentence similarity matrix:")
    similarity_df = pd.DataFrame(similarity_matrix,
                                 index=[f"Doc {i+1}" for i in range(len(documents))],
                                 columns=[f"Doc {i+1}" for i in range(len(documents))])
    print(similarity_df)

    # Find most similar sentence pair
    max_sim = 0
    max_pair = (0, 0)
    for i in range(len(documents)):
        for j in range(i+1, len(documents)):
            if similarity_matrix[i, j] > max_sim:
                max_sim = similarity_matrix[i, j]
                max_pair = (i, j)

    print(f"\nMost similar document pair:")
    print(f"Document {max_pair[0]+1}: {documents[max_pair[0]]}")
    print(f"Document {max_pair[1]+1}: {documents[max_pair[1]]}")
    print(f"Similarity: {max_sim:.4f}")

    # Query-document similarity
    query = "What is natural language processing?"
    query_embedding = sbert_model.encode([query])[0]

    # Calculate similarity between the query and all documents
    query_similarities = cosine_similarity([query_embedding], sentence_embeddings)[0]

    print(f"\nQuery: '{query}'")
    print("Document similarities:")
    for i, sim in enumerate(query_similarities):
        print(f"Document {i+1}: {sim:.4f} - {documents[i]}")
except Exception as e:
    print(f"Error loading Sentence-BERT model: {e}")
    print("You may need to install sentence-transformers: pip install sentence-transformers")
```
## Feature Selection and Dimensionality Reduction
Feature selection and dimensionality reduction techniques help manage the high dimensionality of text features.
### Chi-Square Feature Selection
The chi-square test selects the features that are most dependent on the class label.
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2, SelectKBest

# Sample documents with labels
labeled_documents = [
    "Natural language processing is fascinating.",
    "I love working with text data.",
    "NLP combines linguistics and computer science.",
    "Text processing techniques are essential for NLP.",
    "Processing natural language requires understanding linguistics."
]
labels = [0, 1, 0, 1, 0]  # Binary labels for demonstration

# Create Bag-of-Words representation
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(labeled_documents)
feature_names = vectorizer.get_feature_names_out()

# Apply chi-square feature selection
k = 10  # Number of features to select
chi2_selector = SelectKBest(chi2, k=min(k, X.shape[1]))
X_chi2 = chi2_selector.fit_transform(X, labels)

# Get selected feature indices
selected_indices = chi2_selector.get_support(indices=True)
selected_features = [feature_names[i] for i in selected_indices]

print(f"Original number of features: {len(feature_names)}")
print(f"Number of selected features: {len(selected_features)}")
print(f"Selected features: {selected_features}")

# Get chi-square scores
chi2_scores = chi2_selector.scores_
feature_scores = [(feature, chi2_scores[i]) for i, feature in enumerate(feature_names)]
feature_scores.sort(key=lambda x: x[1], reverse=True)

print("\nTop features by chi-square score:")
for feature, score in feature_scores[:10]:
    print(f"{feature}: {score:.4f}")
```
### Latent Semantic Analysis (LSA)
LSA uses Singular Value Decomposition (SVD) to reduce the dimensionality of the document-term matrix while preserving semantic relationships.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt

# Create TF-IDF representation
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(documents)
feature_names = tfidf_vectorizer.get_feature_names_out()

# Apply LSA (Truncated SVD)
n_components = 2  # Number of topics/components
lsa_model = TruncatedSVD(n_components=n_components)
X_lsa = lsa_model.fit_transform(X_tfidf)

print(f"Original TF-IDF matrix shape: {X_tfidf.shape}")
print(f"LSA matrix shape: {X_lsa.shape}")
print(f"Explained variance ratio: {lsa_model.explained_variance_ratio_}")
print(f"Total explained variance: {sum(lsa_model.explained_variance_ratio_):.4f}")

# Get feature weights for each component
feature_weights = lsa_model.components_

# Print top terms for each component/topic
print("\nTop terms for each topic:")
for i, component in enumerate(feature_weights):
    sorted_indices = component.argsort()[::-1]
    top_terms = [feature_names[idx] for idx in sorted_indices[:10]]
    print(f"Topic {i+1}: {', '.join(top_terms)}")

# Visualize documents in the LSA space (if using 2 components)
if n_components == 2:
    plt.figure(figsize=(10, 8))
    plt.scatter(X_lsa[:, 0], X_lsa[:, 1])

    # Add document labels
    for i, doc in enumerate(documents):
        plt.annotate(f"Doc {i+1}", xy=(X_lsa[i, 0], X_lsa[i, 1]))

    plt.title("Documents in LSA Space")
    plt.xlabel("Component 1")
    plt.ylabel("Component 2")
    plt.savefig("lsa_visualization.png")
    print("\nLSA visualization saved to 'lsa_visualization.png'")
```
### Non-negative Matrix Factorization (NMF)
NMF is another dimensionality reduction technique; because its factors are non-negative, the resulting topics are often easier to interpret than those produced by LSA.
```python
from sklearn.decomposition import NMF

# Apply NMF
n_components = 2  # Number of topics
nmf_model = NMF(n_components=n_components, random_state=42)
X_nmf = nmf_model.fit_transform(X_tfidf)

print(f"NMF matrix shape: {X_nmf.shape}")

# Get feature weights for each component
feature_weights = nmf_model.components_

# Print top terms for each component/topic
print("\nTop terms for each NMF topic:")
for i, component in enumerate(feature_weights):
    sorted_indices = component.argsort()[::-1]
    top_terms = [feature_names[idx] for idx in sorted_indices[:10]]
    print(f"Topic {i+1}: {', '.join(top_terms)}")

# Visualize documents in the NMF space (if using 2 components)
if n_components == 2:
    plt.figure(figsize=(10, 8))
    plt.scatter(X_nmf[:, 0], X_nmf[:, 1])

    # Add document labels
    for i, doc in enumerate(documents):
        plt.annotate(f"Doc {i+1}", xy=(X_nmf[i, 0], X_nmf[i, 1]))

    plt.title("Documents in NMF Space")
    plt.xlabel("Component 1")
    plt.ylabel("Component 2")
    plt.savefig("nmf_visualization.png")
    print("\nNMF visualization saved to 'nmf_visualization.png'")
```
## Custom Feature Engineering
Sometimes you need to create custom features based on domain knowledge or specific text properties.
### Text Statistics Features
```python
import numpy as np
import re
from textstat import textstat  # pip install textstat

def extract_text_statistics(text):
    """Extract statistical features from text."""
    # Basic counts
    char_count = len(text)
    word_count = len(text.split())
    sentence_count = len(re.split(r'[.!?]+', text)) - 1  # -1 to handle the potential empty string at the end

    # Average lengths
    avg_word_length = char_count / word_count if word_count > 0 else 0
    avg_sentence_length = word_count / sentence_count if sentence_count > 0 else 0

    # Readability scores
    flesch_reading_ease = textstat.flesch_reading_ease(text)
    flesch_kincaid_grade = textstat.flesch_kincaid_grade(text)

    # Lexical diversity (unique words / total words)
    unique_words = len(set(text.lower().split()))
    lexical_diversity = unique_words / word_count if word_count > 0 else 0

    return {
        'char_count': char_count,
        'word_count': word_count,
        'sentence_count': sentence_count,
        'avg_word_length': avg_word_length,
        'avg_sentence_length': avg_sentence_length,
        'flesch_reading_ease': flesch_reading_ease,
        'flesch_kincaid_grade': flesch_kincaid_grade,
        'lexical_diversity': lexical_diversity
    }

# Extract text statistics for each document
text_stats = [extract_text_statistics(doc) for doc in documents]

# Convert to DataFrame
stats_df = pd.DataFrame(text_stats)
print("Text Statistics Features:")
print(stats_df)
```
### Linguistic Features
```python
import spacy
import numpy as np

# Load spaCy model
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    print("Downloading spaCy model...")
    spacy.cli.download("en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")

def extract_linguistic_features(text):
    """Extract linguistic features using spaCy."""
    doc = nlp(text)

    # POS tag counts
    pos_counts = {}
    for token in doc:
        pos = token.pos_
        pos_counts[pos] = pos_counts.get(pos, 0) + 1

    # Normalize by total tokens
    total_tokens = len(doc)
    for pos in pos_counts:
        pos_counts[pos] /= total_tokens

    # Named entity counts
    ent_counts = {}
    for ent in doc.ents:
        ent_type = ent.label_
        ent_counts[ent_type] = ent_counts.get(ent_type, 0) + 1

    # Syntactic dependency counts
    dep_counts = {}
    for token in doc:
        dep = token.dep_
        dep_counts[dep] = dep_counts.get(dep, 0) + 1

    # Normalize by total tokens
    for dep in dep_counts:
        dep_counts[dep] /= total_tokens

    # Sentence complexity features
    sentences = list(doc.sents)
    if sentences:
        avg_depth = np.mean([token.head.i - token.i for token in doc if token.head.i != token.i])
        max_depth = max([len(list(token.ancestors)) for token in doc])
    else:
        avg_depth = 0
        max_depth = 0

    return {
        'noun_ratio': pos_counts.get('NOUN', 0),
        'verb_ratio': pos_counts.get('VERB', 0),
        'adj_ratio': pos_counts.get('ADJ', 0),
        'adv_ratio': pos_counts.get('ADV', 0),
        'entity_count': len(doc.ents),
        'avg_dependency_depth': avg_depth,
        'max_dependency_depth': max_depth
    }

# Extract linguistic features for each document
linguistic_features = [extract_linguistic_features(doc) for doc in documents]

# Convert to DataFrame
ling_df = pd.DataFrame(linguistic_features)
print("Linguistic Features:")
print(ling_df)
```
### Sentiment and Emotion Features
```python
from textblob import TextBlob
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Ensure NLTK data is downloaded
try:
    nltk.data.find('sentiment/vader_lexicon.zip')
except LookupError:
    nltk.download('vader_lexicon')

def extract_sentiment_features(text):
    """Extract sentiment and emotion features."""
    # TextBlob sentiment
    blob = TextBlob(text)
    polarity = blob.sentiment.polarity
    subjectivity = blob.sentiment.subjectivity

    # VADER sentiment
    sid = SentimentIntensityAnalyzer()
    vader_scores = sid.polarity_scores(text)

    # Emotion words (simplified approach)
    positive_words = ['good', 'great', 'excellent', 'amazing', 'wonderful', 'love', 'happy', 'joy']
    negative_words = ['bad', 'terrible', 'awful', 'horrible', 'hate', 'sad', 'angry', 'fear']

    words = text.lower().split()
    positive_count = sum(1 for word in words if word in positive_words)
    negative_count = sum(1 for word in words if word in negative_words)

    return {
        'textblob_polarity': polarity,
        'textblob_subjectivity': subjectivity,
        'vader_compound': vader_scores['compound'],
        'vader_pos': vader_scores['pos'],
        'vader_neg': vader_scores['neg'],
        'vader_neu': vader_scores['neu'],
        'positive_word_count': positive_count,
        'negative_word_count': negative_count
    }

# Extract sentiment features for each document
sentiment_features = [extract_sentiment_features(doc) for doc in documents]

# Convert to DataFrame
sentiment_df = pd.DataFrame(sentiment_features)
print("Sentiment Features:")
print(sentiment_df)
```
## Combining Features
In practice, it's often beneficial to combine different types of features.
```python
from sklearn.preprocessing import StandardScaler
from scipy.sparse import hstack, csr_matrix

# Combine TF-IDF with statistical and linguistic features
# First, standardize the numerical features
scaler = StandardScaler()
stats_scaled = scaler.fit_transform(stats_df)
ling_scaled = scaler.fit_transform(ling_df)

# Convert to sparse matrices for compatibility with TF-IDF
stats_sparse = csr_matrix(stats_scaled)
ling_sparse = csr_matrix(ling_scaled)

# Combine all features
combined_features = hstack([X_tfidf, stats_sparse, ling_sparse])

print(f"TF-IDF features shape: {X_tfidf.shape}")
print(f"Statistical features shape: {stats_sparse.shape}")
print(f"Linguistic features shape: {ling_sparse.shape}")
print(f"Combined features shape: {combined_features.shape}")
```
```python
# Feature importance analysis (example with a simple model)
from sklearn.ensemble import RandomForestClassifier

# Sample labels for demonstration
y = [0, 1, 0, 1, 0]  # Binary labels

# Train a model on the combined features
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(combined_features, y)

# Get feature importances for TF-IDF features
tfidf_feature_importances = model.feature_importances_[:X_tfidf.shape[1]]
tfidf_importance_dict = {feature: importance
                         for feature, importance in zip(feature_names, tfidf_feature_importances)}

# Sort by importance
sorted_tfidf_importances = sorted(tfidf_importance_dict.items(), key=lambda x: x[1], reverse=True)

print("\nTop TF-IDF features by importance:")
for feature, importance in sorted_tfidf_importances[:10]:
    print(f"{feature}: {importance:.6f}")

# Get feature importances for statistical features
stats_feature_importances = model.feature_importances_[X_tfidf.shape[1]:X_tfidf.shape[1] + stats_sparse.shape[1]]
stats_importance_dict = {feature: importance
                         for feature, importance in zip(stats_df.columns, stats_feature_importances)}

print("\nStatistical features by importance:")
for feature, importance in sorted(stats_importance_dict.items(), key=lambda x: x[1], reverse=True):
    print(f"{feature}: {importance:.6f}")
```
## Feature Engineering Pipeline
Creating a reusable feature engineering pipeline helps streamline the process for new data.
```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin

# Custom transformer for text statistics
class TextStatsTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        stats = [extract_text_statistics(text) for text in X]
        return pd.DataFrame(stats).values

# Custom transformer for linguistic features
class LinguisticTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        features = [extract_linguistic_features(text) for text in X]
        return pd.DataFrame(features).values

# Create a feature engineering pipeline
feature_pipeline = Pipeline([
    ('features', FeatureUnion([
        ('tfidf', Pipeline([
            ('vectorizer', TfidfVectorizer(min_df=2, max_df=0.8))
        ])),
        ('stats', Pipeline([
            ('extractor', TextStatsTransformer()),
            ('scaler', StandardScaler())
        ])),
        ('linguistic', Pipeline([
            ('extractor', LinguisticTransformer()),
            ('scaler', StandardScaler())
        ]))
    ]))
])

# Apply the pipeline to the documents
X_transformed = feature_pipeline.fit_transform(documents)
print(f"Transformed features shape: {X_transformed.shape}")
```
```python
# Example of using the pipeline in a complete ML workflow
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Sample data for demonstration
X = documents
y = [0, 1, 0, 1, 0]  # Binary labels

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Create a complete pipeline with feature engineering and model
complete_pipeline = Pipeline([
    ('features', feature_pipeline),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train the model
complete_pipeline.fit(X_train, y_train)

# Make predictions
y_pred = complete_pipeline.predict(X_test)

# Evaluate
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
```
## Conclusion
Feature engineering is a critical step in the NLP pipeline that transforms raw text into numerical representations suitable for machine learning algorithms. This chapter covered a range of techniques, from traditional count-based methods like Bag-of-Words and TF-IDF to advanced embedding approaches like Word2Vec and Sentence-BERT.
The choice of feature engineering technique depends on your specific task, dataset characteristics, and computational constraints. Often, combining multiple feature types (lexical, statistical, linguistic, semantic) yields the best results.
Remember that feature engineering is both an art and a science—it requires creativity, domain knowledge, and experimentation to find the optimal representation for your text data.
In the next chapter, we'll explore how to build classical NLP models using these engineered features, focusing on traditional machine learning approaches for tasks like text classification, clustering, and information extraction.
Practice exercises:

1. Compare the performance of BoW, TF-IDF, and Word2Vec features on a text classification task
2. Implement a custom feature extractor for detecting text formality or complexity
3. Create a feature engineering pipeline that combines n-grams, character-level features, and linguistic features
4. Experiment with different dimensionality reduction techniques (LSA, NMF, t-SNE) and visualize the results
5. Build a document similarity system using different feature representations and evaluate their effectiveness