5. Feature Engineering for NLP in Python


Feature engineering is a critical step in the Natural Language Processing (NLP) pipeline that transforms preprocessed text into numerical representations that machine learning algorithms can understand. This chapter explores various techniques for converting text data into meaningful features, from traditional approaches like Bag-of-Words to advanced embedding methods.

Understanding Feature Engineering for Text

Machine learning models require numerical inputs, but text is inherently symbolic and unstructured. Feature engineering bridges this gap by converting text into vectors of numbers that capture various linguistic properties while preserving semantic meaning.

The quality of these features significantly impacts model performance. Good features should:

- Capture relevant patterns in the text
- Represent semantic relationships between words and documents
- Be computationally efficient
- Scale well with large vocabularies and datasets

Let's explore the most important feature engineering techniques for NLP, implementing each with Python code examples.

Count-Based Methods

Bag-of-Words (BoW)

The Bag-of-Words model is one of the simplest and most common text representation methods. It represents text as an unordered collection of words, disregarding grammar and word order but keeping track of word frequency.
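
A minimal from-scratch sketch of the idea, using Python's collections.Counter on a made-up sentence (not part of the chapter's sample corpus): word order is discarded and only per-word counts remain. The CountVectorizer used below does the same thing at scale, adding tokenization and vocabulary management.

from collections import Counter

sentence = "natural language processing makes processing text natural"
bow = Counter(sentence.split())
print(bow)  # counts for 'natural' and 'processing' are 2; word order is gone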

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd


# Sample documents
documents = [
    "Natural language processing is fascinating.",
    "I love working with text data.",
    "NLP combines linguistics and computer science.",
    "Text processing techniques are essential for NLP.",
    "Processing natural language requires understanding linguistics."
]

# Create a Bag-of-Words representation
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

# Get feature names (vocabulary)
feature_names = vectorizer.get_feature_names_out()

# Convert to DataFrame for better visualization
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=feature_names)
print("Bag-of-Words Matrix:")
print(bow_df)

# Examine the vocabulary
print(f"\nVocabulary size: {len(feature_names)}")
print(f"Vocabulary: {feature_names}")

# Sparse matrix details
print(f"\nMatrix shape: {bow_matrix.shape}")
print(f"Matrix density (fraction of non-zero entries): {bow_matrix.nnz / (bow_matrix.shape[0] * bow_matrix.shape[1]):.4f}")

#### Customizing the Bag-of-Words Model

You can customize the BoW model to handle different tokenization patterns, n-grams, and vocabulary constraints:

# Customized Bag-of-Words
custom_vectorizer = CountVectorizer(
    min_df=2,              # Minimum document frequency (ignore terms that appear in fewer than 2 documents)
    max_df=0.8,            # Maximum document frequency (ignore terms that appear in more than 80% of documents)
    ngram_range=(1, 2),    # Include both unigrams and bigrams
    stop_words='english',  # Remove English stopwords
    token_pattern=r'\w+'   # Tokens are runs of word characters
)

custom_bow = custom_vectorizer.fit_transform(documents)
custom_features = custom_vectorizer.get_feature_names_out()


# Convert to DataFrame
custom_bow_df = pd.DataFrame(custom_bow.toarray(), columns=custom_features)
print("Customized Bag-of-Words Matrix:")
print(custom_bow_df)

print(f" Customized vocabulary size: {len(custom_features)}") print(f"Customized vocabulary: {custom_features}")

Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF addresses a limitation of BoW by weighting terms based on their importance in a document relative to the entire corpus. It reduces the impact of common words and emphasizes distinctive terms.
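
To make the weighting concrete, here is a minimal sketch of the formula scikit-learn's TfidfVectorizer applies with its defaults (smooth_idf=True, norm='l2'). The numbers are taken from the sample corpus, where "processing" appears in three of the five documents.

import numpy as np

# Smoothed IDF: idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1
# tf-idf(t, d) = count(t, d) * idf(t), then each document row is L2-normalized
n_docs = 5    # five sample documents
df_term = 3   # "processing" occurs in three of them
tf_term = 1   # and once in the document being scored

idf = np.log((1 + n_docs) / (1 + df_term)) + 1
print(f"idf: {idf:.4f}, unnormalized tf-idf: {tf_term * idf:.4f}")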

from sklearn.feature_extraction.text import TfidfVectorizer


# Create a TF-IDF representation
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Get feature names
tfidf_features = tfidf_vectorizer.get_feature_names_out()

# Convert to DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_features)
print("TF-IDF Matrix:")
print(tfidf_df)

# Compare BoW and TF-IDF for a specific term across documents
term = "processing"
if term in feature_names and term in tfidf_features:
    term_index_bow = list(feature_names).index(term)
    term_index_tfidf = list(tfidf_features).index(term)

    print(f"\nComparing '{term}' representation:")
    print("Document | BoW Count | TF-IDF Weight")
    print("-" * 40)
    for i in range(len(documents)):
        bow_value = bow_matrix[i, term_index_bow]
        tfidf_value = tfidf_matrix[i, term_index_tfidf]
        print(f"{i+1:8} | {bow_value:9} | {tfidf_value:.6f}")

#### Customizing TF-IDF

TF-IDF can be customized to adjust term weighting and normalization:

# Customized TF-IDF
custom_tfidf = TfidfVectorizer(
    min_df=2,              # Minimum document frequency
    max_df=0.8,            # Maximum document frequency
    ngram_range=(1, 2),    # Include both unigrams and bigrams
    stop_words='english',  # Remove English stopwords
    norm='l2',             # Apply L2 normalization
    use_idf=True,          # Enable IDF weighting
    smooth_idf=True,       # Add 1 to document frequencies to prevent division by zero
    sublinear_tf=True      # Apply sublinear TF scaling (1 + log(TF))
)

custom_tfidf_matrix = custom_tfidf.fit_transform(documents)
custom_tfidf_features = custom_tfidf.get_feature_names_out()


# Convert to DataFrame
custom_tfidf_df = pd.DataFrame(custom_tfidf_matrix.toarray(), columns=custom_tfidf_features)
print("\nCustomized TF-IDF Matrix:")
print(custom_tfidf_df)

N-grams

N-grams capture sequences of N consecutive words, helping to preserve some word order information.

# Unigrams, Bigrams, and Trigrams
unigram_vectorizer = CountVectorizer(ngram_range=(1, 1))
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
trigram_vectorizer = CountVectorizer(ngram_range=(3, 3))
combined_vectorizer = CountVectorizer(ngram_range=(1, 3))


# Fit and transform
unigram_matrix = unigram_vectorizer.fit_transform(documents)
bigram_matrix = bigram_vectorizer.fit_transform(documents)
trigram_matrix = trigram_vectorizer.fit_transform(documents)
combined_matrix = combined_vectorizer.fit_transform(documents)

# Get feature names
unigram_features = unigram_vectorizer.get_feature_names_out()
bigram_features = bigram_vectorizer.get_feature_names_out()
trigram_features = trigram_vectorizer.get_feature_names_out()
combined_features = combined_vectorizer.get_feature_names_out()

print(f"Number of unigram features: {len(unigram_features)}") print(f"Sample unigrams: {unigram_features[:5]}")

print(f" Number of bigram features: {len(bigram_features)}") print(f"Sample bigrams: {bigram_features[:5]}")

print(f" Number of trigram features: {len(trigram_features)}") print(f"Sample trigrams: {trigram_features[:5] if len(trigram_features) >= 5 else trigram_features}")

print(f" Number of combined n-gram features: {len(combined_features)}") print(f"Sample combined n-grams: {combined_features[:5]}")

Character N-grams

Character n-grams can capture sub-word patterns and are useful for handling misspellings and morphological variations.
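
As a quick illustration with a made-up misspelling (not from the sample corpus), most character 3-grams survive a dropped letter, so the two spellings still share a large fraction of their features:

def char_ngrams(word, n=3):
    """Return the set of character n-grams in a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

correct = char_ngrams("processing")
misspelled = char_ngrams("procesing")  # one 's' dropped
print(f"Shared 3-grams: {sorted(correct & misspelled)}")
print(f"Jaccard overlap: {len(correct & misspelled) / len(correct | misspelled):.2f}")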

# Character n-grams
char_vectorizer = CountVectorizer(
    analyzer='char',
    ngram_range=(3, 5)  # Character n-grams from 3 to 5 characters long
)

char_matrix = char_vectorizer.fit_transform(documents)
char_features = char_vectorizer.get_feature_names_out()

print(f"Number of character n-gram features: {len(char_features)}")
print(f"Sample character n-grams: {char_features[:10]}")


# Character n-grams restricted to word boundaries
char_wb_vectorizer = CountVectorizer(
    analyzer='char_wb',  # Character n-grams only from text inside word boundaries
    ngram_range=(3, 5)
)

char_wb_matrix = char_wb_vectorizer.fit_transform(documents)
char_wb_features = char_wb_vectorizer.get_feature_names_out()

print(f" Number of character word-boundary n-gram features: {len(char_wb_features)}") print(f"Sample character word-boundary n-grams: {char_wb_features[:10]}")

Hashing Vectorizer

The Hashing Vectorizer is a memory-efficient alternative to CountVectorizer and TfidfVectorizer, especially useful for large datasets.
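
It relies on the "hashing trick": each token is hashed directly to a column index, so no vocabulary has to be stored in memory. A rough conceptual sketch follows (HashingVectorizer itself uses a signed MurmurHash3; Python's built-in hash is used here purely for illustration):

def hashed_counts(text, n_features=16):
    """Toy version of the hashing trick: map tokens to buckets without a vocabulary."""
    vec = [0] * n_features
    for token in text.lower().split():
        vec[hash(token) % n_features] += 1  # collisions are accepted by design
    return vec

print(hashed_counts("natural language processing is fascinating"))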

from sklearn.feature_extraction.text import HashingVectorizer


# Create a Hashing Vectorizer
hash_vectorizer = HashingVectorizer(
    n_features=2**10,     # 1024 hashed features
    alternate_sign=False  # Use only non-negative values
)

hash_matrix = hash_vectorizer.fit_transform(documents)

# Note: the Hashing Vectorizer doesn't provide feature names
print(f"Hashing Vectorizer matrix shape: {hash_matrix.shape}")
print(f"Hashing Vectorizer matrix density: {hash_matrix.nnz / (hash_matrix.shape[0] * hash_matrix.shape[1]):.4f}")

# Convert to DataFrame (with arbitrary feature names)
hash_df = pd.DataFrame(
    hash_matrix.toarray(),
    columns=[f"feature_{i}" for i in range(hash_matrix.shape[1])]
)
print("\nHashing Vectorizer Matrix (first 5 columns):")
print(hash_df.iloc[:, :5])

Word Embeddings

Word embeddings represent words as dense vectors in a continuous vector space, where semantically similar words are mapped to nearby points.
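
"Nearby" is usually measured with cosine similarity. A minimal sketch on two toy 3-dimensional vectors (real embeddings have tens to hundreds of dimensions, and the values below are invented for illustration):

import numpy as np

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

vec_cat = np.array([0.8, 0.3, 0.1])     # toy embedding for "cat"
vec_kitten = np.array([0.7, 0.4, 0.1])  # toy embedding for "kitten"
print(f"cosine(cat, kitten) = {cosine(vec_cat, vec_kitten):.3f}")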

Word2Vec

Word2Vec learns word embeddings by predicting words from their context (Skip-gram) or predicting context from words (CBOW).

import gensim
from gensim.models import Word2Vec
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import numpy as np


# Prepare sentences for Word2Vec (requires tokenized sentences)
tokenized_sentences = [doc.lower().split() for doc in documents]

# Train a Word2Vec model
w2v_model = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=100,   # Embedding dimension
    window=5,          # Context window size
    min_count=1,       # Minimum word frequency
    workers=4,         # Number of threads
    sg=1               # Skip-gram model (use 0 for CBOW)
)

# Get vocabulary
vocabulary = list(w2v_model.wv.index_to_key)
print(f"Word2Vec vocabulary: {vocabulary}")

# Get vector for a specific word
word = "processing"
if word in w2v_model.wv:
    vector = w2v_model.wv[word]
    print(f"\nVector for '{word}' (first 10 dimensions): {vector[:10]}")

# Find similar words
similar_words = w2v_model.wv.most_similar("language", topn=3)
print(f"\nWords most similar to 'language': {similar_words}")

# Visualize word embeddings in 2D
def plot_embeddings(model, words=None):
    # Get all word vectors
    if words is None:
        words = list(model.wv.index_to_key)

    # Get word vectors
    word_vectors = np.array([model.wv[word] for word in words])

    # Apply PCA to reduce to 2 dimensions
    pca = PCA(n_components=2)
    result = pca.fit_transform(word_vectors)

    # Create a scatter plot
    plt.figure(figsize=(10, 8))
    plt.scatter(result[:, 0], result[:, 1], c='b', alpha=0.5)

    # Add word labels
    for i, word in enumerate(words):
        plt.annotate(word, xy=(result[i, 0], result[i, 1]))

    plt.title("Word Embeddings Visualization")
    plt.savefig("word_embeddings.png")
    print("Word embeddings visualization saved to 'word_embeddings.png'")

# Plot embeddings
plot_embeddings(w2v_model)

Pre-trained Word Embeddings

Using pre-trained word embeddings like GloVe or fastText can be more effective than training your own, especially for small datasets.

import gensim.downloader as api
import numpy as np


# Load pre-trained GloVe embeddings
try:
    glove_vectors = api.load("glove-wiki-gigaword-100")
    print(f"Loaded GloVe embeddings with {len(glove_vectors)} words")

    # Check if specific words are in the vocabulary
    test_words = ["language", "processing", "computer", "artificial", "intelligence"]
    for word in test_words:
        if word in glove_vectors:
            print(f"'{word}' is in the vocabulary")
        else:
            print(f"'{word}' is NOT in the vocabulary")

    # Get vector for a word
    if "language" in glove_vectors:
        vector = glove_vectors["language"]
        print(f"\nGloVe vector for 'language' (first 10 dimensions): {vector[:10]}")

    # Find similar words
    if "language" in glove_vectors:
        similar_words = glove_vectors.most_similar("language", topn=5)
        print(f"\nWords most similar to 'language' in GloVe: {similar_words}")

    # Word analogies
    if all(word in glove_vectors for word in ["king", "man", "woman"]):
        result = glove_vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
        print(f"\nking - man + woman = {result}")
except Exception as e:
    print(f"Error loading GloVe embeddings: {e}")
    print("You may need to download the embeddings first with: python -m gensim.downloader --download glove-wiki-gigaword-100")

Document Embeddings with Doc2Vec

Doc2Vec extends Word2Vec to learn embeddings for entire documents.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument


# Prepare documents for Doc2Vec

tagged_documents = [TaggedDocument(words=doc.lower().split(), tags=[i]) for i, doc in enumerate(documents)]

# Train a Doc2Vec model
d2v_model = Doc2Vec(
    documents=tagged_documents,
    vector_size=100,   # Embedding dimension
    window=5,          # Context window size
    min_count=1,       # Minimum word frequency
    workers=4,         # Number of threads
    epochs=100         # Number of training epochs
)

# Get document vectors

doc_vectors = [d2v_model.dv[i] for i in range(len(documents))]

# Print the first document vector

print(f"Document vector for first document (first 10 dimensions): {doc_vectors[0][:10]}")

# Find the most similar document to a query
query = "natural language processing techniques"
query_vector = d2v_model.infer_vector(query.lower().split())
similar_docs = d2v_model.dv.most_similar([query_vector], topn=len(documents))

print(f" Documents most similar to query '{query}':") for doc_id, similarity in similar_docs: print(f"Document {doc_id+1}: {similarity:.4f} - {documents[doc_id]}")

Sentence and Document Embeddings

Sentence-BERT

Sentence-BERT (SBERT) provides semantically meaningful sentence embeddings that can be compared using cosine similarity.

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity


# Load a pre-trained Sentence-BERT model
try:
    sbert_model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

    # Encode sentences
    sentence_embeddings = sbert_model.encode(documents)

    print(f"Sentence embeddings shape: {sentence_embeddings.shape}")
    print(f"First sentence embedding (first 10 dimensions): {sentence_embeddings[0][:10]}")

    # Calculate cosine similarity between all sentence pairs
    similarity_matrix = cosine_similarity(sentence_embeddings)

    # Print similarity matrix
    print("\nSentence similarity matrix:")
    similarity_df = pd.DataFrame(
        similarity_matrix,
        index=[f"Doc {i+1}" for i in range(len(documents))],
        columns=[f"Doc {i+1}" for i in range(len(documents))]
    )
    print(similarity_df)

    # Find the most similar sentence pair
    max_sim = 0
    max_pair = (0, 0)
    for i in range(len(documents)):
        for j in range(i+1, len(documents)):
            if similarity_matrix[i, j] > max_sim:
                max_sim = similarity_matrix[i, j]
                max_pair = (i, j)

    print(f"\nMost similar document pair:")
    print(f"Document {max_pair[0]+1}: {documents[max_pair[0]]}")
    print(f"Document {max_pair[1]+1}: {documents[max_pair[1]]}")
    print(f"Similarity: {max_sim:.4f}")

    # Query-document similarity
    query = "What is natural language processing?"
    query_embedding = sbert_model.encode([query])[0]

    # Calculate similarity between the query and all documents
    query_similarities = cosine_similarity([query_embedding], sentence_embeddings)[0]

    print(f"\nQuery: '{query}'")
    print("Document similarities:")
    for i, sim in enumerate(query_similarities):
        print(f"Document {i+1}: {sim:.4f} - {documents[i]}")
except Exception as e:
    print(f"Error loading Sentence-BERT model: {e}")
    print("You may need to install sentence-transformers: pip install sentence-transformers")

Feature Selection and Dimensionality Reduction

Feature selection and dimensionality reduction techniques help manage the high dimensionality of text features.

Chi-Square Feature Selection

The chi-square test scores each feature by how strongly it depends on the class label, so the most class-discriminative terms can be kept.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2, SelectKBest


# Sample documents with labels
labeled_documents = [
    "Natural language processing is fascinating.",
    "I love working with text data.",
    "NLP combines linguistics and computer science.",
    "Text processing techniques are essential for NLP.",
    "Processing natural language requires understanding linguistics."
]
labels = [0, 1, 0, 1, 0]  # Binary labels for demonstration

# Create a Bag-of-Words representation
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(labeled_documents)
feature_names = vectorizer.get_feature_names_out()

# Apply chi-square feature selection
k = 10  # Number of features to select
chi2_selector = SelectKBest(chi2, k=min(k, X.shape[1]))
X_chi2 = chi2_selector.fit_transform(X, labels)

# Get selected feature indices
selected_indices = chi2_selector.get_support(indices=True)
selected_features = [feature_names[i] for i in selected_indices]

print(f"Original number of features: {len(feature_names)}")
print(f"Number of selected features: {len(selected_features)}")
print(f"Selected features: {selected_features}")

# Get chi-square scores
chi2_scores = chi2_selector.scores_
feature_scores = [(feature, chi2_scores[i]) for i, feature in enumerate(feature_names)]
feature_scores.sort(key=lambda x: x[1], reverse=True)

print("\nTop features by chi-square score:")
for feature, score in feature_scores[:10]:
    print(f"{feature}: {score:.4f}")

Latent Semantic Analysis (LSA)

LSA uses Singular Value Decomposition (SVD) to reduce the dimensionality of the document-term matrix while preserving semantic relationships.
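
Concretely, truncated SVD factorizes the document-term matrix as X ≈ U_k S_k V_kᵀ: fit_transform returns the document coordinates (U_k S_k), and components_ holds V_kᵀ, the topics over terms. A quick numpy sketch of that bookkeeping on a random toy matrix (not the chapter corpus):

import numpy as np

X_toy = np.random.rand(5, 8)  # 5 "documents" x 8 "terms"
U, S, Vt = np.linalg.svd(X_toy, full_matrices=False)
k = 2
docs_in_topic_space = U[:, :k] * S[:k]  # analogous to TruncatedSVD.fit_transform output
topics_over_terms = Vt[:k]              # analogous to TruncatedSVD.components_
print(docs_in_topic_space.shape, topics_over_terms.shape)  # (5, 2) (2, 8)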

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt


# Create a TF-IDF representation
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(documents)
feature_names = tfidf_vectorizer.get_feature_names_out()

# Apply LSA (Truncated SVD)
n_components = 2  # Number of topics/components
lsa_model = TruncatedSVD(n_components=n_components)
X_lsa = lsa_model.fit_transform(X_tfidf)

print(f"Original TF-IDF matrix shape: {X_tfidf.shape}")
print(f"LSA matrix shape: {X_lsa.shape}")
print(f"Explained variance ratio: {lsa_model.explained_variance_ratio_}")
print(f"Total explained variance: {sum(lsa_model.explained_variance_ratio_):.4f}")

# Get feature weights for each component
feature_weights = lsa_model.components_

# Print the top terms for each component/topic
print("\nTop terms for each topic:")
for i, component in enumerate(feature_weights):
    sorted_indices = component.argsort()[::-1]
    top_terms = [feature_names[idx] for idx in sorted_indices[:10]]
    print(f"Topic {i+1}: {', '.join(top_terms)}")

# Visualize documents in the LSA space (if using 2 components)
if n_components == 2:
    plt.figure(figsize=(10, 8))
    plt.scatter(X_lsa[:, 0], X_lsa[:, 1])

    # Add document labels
    for i, doc in enumerate(documents):
        plt.annotate(f"Doc {i+1}", xy=(X_lsa[i, 0], X_lsa[i, 1]))

    plt.title("Documents in LSA Space")
    plt.xlabel("Component 1")
    plt.ylabel("Component 2")
    plt.savefig("lsa_visualization.png")
    print("\nLSA visualization saved to 'lsa_visualization.png'")

Non-negative Matrix Factorization (NMF)

NMF is another dimensionality reduction technique; because its factors are constrained to be non-negative, it often produces more interpretable topics than LSA.

from sklearn.decomposition import NMF


# Apply NMF
n_components = 2  # Number of topics
nmf_model = NMF(n_components=n_components, random_state=42)
X_nmf = nmf_model.fit_transform(X_tfidf)

print(f"NMF matrix shape: {X_nmf.shape}")

# Get feature weights for each component
feature_weights = nmf_model.components_

# Print the top terms for each component/topic
print("\nTop terms for each NMF topic:")
for i, component in enumerate(feature_weights):
    sorted_indices = component.argsort()[::-1]
    top_terms = [feature_names[idx] for idx in sorted_indices[:10]]
    print(f"Topic {i+1}: {', '.join(top_terms)}")

# Visualize documents in the NMF space (if using 2 components)
if n_components == 2:
    plt.figure(figsize=(10, 8))
    plt.scatter(X_nmf[:, 0], X_nmf[:, 1])

    # Add document labels
    for i, doc in enumerate(documents):
        plt.annotate(f"Doc {i+1}", xy=(X_nmf[i, 0], X_nmf[i, 1]))

    plt.title("Documents in NMF Space")
    plt.xlabel("Component 1")
    plt.ylabel("Component 2")
    plt.savefig("nmf_visualization.png")
    print("\nNMF visualization saved to 'nmf_visualization.png'")

Custom Feature Engineering

Sometimes you need to create custom features based on domain knowledge or specific text properties.

Text Statistics Features

import numpy as np
import re
from textstat import textstat  # pip install textstat

def extract_text_statistics(text):
    """Extract statistical features from text."""
    # Basic counts
    char_count = len(text)
    word_count = len(text.split())
    sentence_count = len(re.split(r'[.!?]+', text)) - 1  # -1 to handle potential empty string at end
    
    # Average lengths
    avg_word_length = char_count / word_count if word_count > 0 else 0
    avg_sentence_length = word_count / sentence_count if sentence_count > 0 else 0
    
    # Readability scores
    flesch_reading_ease = textstat.flesch_reading_ease(text)
    flesch_kincaid_grade = textstat.flesch_kincaid_grade(text)
    
    # Lexical diversity (unique words / total words)
    unique_words = len(set(text.lower().split()))
    lexical_diversity = unique_words / word_count if word_count > 0 else 0
    
    return {
        'char_count': char_count,
        'word_count': word_count,
        'sentence_count': sentence_count,
        'avg_word_length': avg_word_length,
        'avg_sentence_length': avg_sentence_length,
        'flesch_reading_ease': flesch_reading_ease,
        'flesch_kincaid_grade': flesch_kincaid_grade,
        'lexical_diversity': lexical_diversity
    }


# Extract text statistics for each document
text_stats = [extract_text_statistics(doc) for doc in documents]

# Convert to DataFrame
stats_df = pd.DataFrame(text_stats)
print("Text Statistics Features:")
print(stats_df)

Linguistic Features

import spacy
import numpy as np


# Load the spaCy model (downloading it if necessary)
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    print("Downloading spaCy model...")
    spacy.cli.download("en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")

def extract_linguistic_features(text):
    """Extract linguistic features using spaCy."""
    doc = nlp(text)

    # POS tag counts
    pos_counts = {}
    for token in doc:
        pos = token.pos_
        pos_counts[pos] = pos_counts.get(pos, 0) + 1

    # Normalize by total tokens
    total_tokens = len(doc)
    for pos in pos_counts:
        pos_counts[pos] /= total_tokens

    # Named entity counts
    ent_counts = {}
    for ent in doc.ents:
        ent_type = ent.label_
        ent_counts[ent_type] = ent_counts.get(ent_type, 0) + 1

    # Syntactic dependency counts
    dep_counts = {}
    for token in doc:
        dep = token.dep_
        dep_counts[dep] = dep_counts.get(dep, 0) + 1

    # Normalize by total tokens
    for dep in dep_counts:
        dep_counts[dep] /= total_tokens

    # Sentence complexity features: average (absolute) distance in tokens between
    # each token and its syntactic head, and the maximum depth of any token in the
    # dependency tree
    sentences = list(doc.sents)
    if sentences:
        avg_depth = np.mean([abs(token.head.i - token.i) for token in doc if token.head.i != token.i])
        max_depth = max([len(list(token.ancestors)) for token in doc])
    else:
        avg_depth = 0
        max_depth = 0

    return {
        'noun_ratio': pos_counts.get('NOUN', 0),
        'verb_ratio': pos_counts.get('VERB', 0),
        'adj_ratio': pos_counts.get('ADJ', 0),
        'adv_ratio': pos_counts.get('ADV', 0),
        'entity_count': len(doc.ents),
        'avg_dependency_depth': avg_depth,
        'max_dependency_depth': max_depth
    }

# Extract linguistic features for each document
linguistic_features = [extract_linguistic_features(doc) for doc in documents]

# Convert to DataFrame
ling_df = pd.DataFrame(linguistic_features)
print("Linguistic Features:")
print(ling_df)

Sentiment and Emotion Features

from textblob import TextBlob
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer


# Ensure the required NLTK data is downloaded
try:
    nltk.data.find('sentiment/vader_lexicon.zip')
except LookupError:
    nltk.download('vader_lexicon')

def extract_sentiment_features(text):
    """Extract sentiment and emotion features."""
    # TextBlob sentiment
    blob = TextBlob(text)
    polarity = blob.sentiment.polarity
    subjectivity = blob.sentiment.subjectivity

    # VADER sentiment
    sid = SentimentIntensityAnalyzer()
    vader_scores = sid.polarity_scores(text)

    # Emotion words (simplified lexicon-based approach)
    positive_words = ['good', 'great', 'excellent', 'amazing', 'wonderful', 'love', 'happy', 'joy']
    negative_words = ['bad', 'terrible', 'awful', 'horrible', 'hate', 'sad', 'angry', 'fear']

    words = text.lower().split()
    positive_count = sum(1 for word in words if word in positive_words)
    negative_count = sum(1 for word in words if word in negative_words)

    return {
        'textblob_polarity': polarity,
        'textblob_subjectivity': subjectivity,
        'vader_compound': vader_scores['compound'],
        'vader_pos': vader_scores['pos'],
        'vader_neg': vader_scores['neg'],
        'vader_neu': vader_scores['neu'],
        'positive_word_count': positive_count,
        'negative_word_count': negative_count
    }

# Extract sentiment features for each document
sentiment_features = [extract_sentiment_features(doc) for doc in documents]

# Convert to DataFrame
sentiment_df = pd.DataFrame(sentiment_features)
print("Sentiment Features:")
print(sentiment_df)

Combining Features

In practice, it's often beneficial to combine different types of features.

from sklearn.preprocessing import StandardScaler
from scipy.sparse import hstack


# Combine TF-IDF with statistical and linguistic features.
# First, standardize the numerical features
scaler = StandardScaler()
stats_scaled = scaler.fit_transform(stats_df)
ling_scaled = scaler.fit_transform(ling_df)

# Convert to sparse matrices for compatibility with TF-IDF
from scipy.sparse import csr_matrix

stats_sparse = csr_matrix(stats_scaled)
ling_sparse = csr_matrix(ling_scaled)

# Combine all features

combined_features = hstack([X_tfidf, stats_sparse, ling_sparse])

print(f"TF-IDF features shape: {X_tfidf.shape}") print(f"Statistical features shape: {stats_sparse.shape}") print(f"Linguistic features shape: {ling_sparse.shape}") print(f"Combined features shape: {combined_features.shape}")

# Feature importance analysis (example with a simple model)

from sklearn.ensemble import RandomForestClassifier

# Sample labels for demonstration

y = [0, 1, 0, 1, 0] # Binary labels

# Train a model on the combined features
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(combined_features, y)

# Get feature importances for the TF-IDF features
tfidf_feature_importances = model.feature_importances_[:X_tfidf.shape[1]]
tfidf_importance_dict = {feature: importance for feature, importance in zip(feature_names, tfidf_feature_importances)}

# Sort by importance

sorted_tfidf_importances = sorted(tfidf_importance_dict.items(), key=lambda x: x[1], reverse=True)

print(" Top TF-IDF features by importance:") for feature, importance in sorted_tfidf_importances[:10]: print(f"{feature}: {importance:.6f}")

# Get feature importances for the statistical features
stats_feature_importances = model.feature_importances_[X_tfidf.shape[1]:X_tfidf.shape[1] + stats_sparse.shape[1]]
stats_importance_dict = {feature: importance for feature, importance in zip(stats_df.columns, stats_feature_importances)}

print(" Statistical features by importance:") for feature, importance in sorted(stats_importance_dict.items(), key=lambda x: x[1], reverse=True): print(f"{feature}: {importance:.6f}")

Feature Engineering Pipeline

Creating a reusable feature engineering pipeline helps streamline the process for new data.

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin


# Custom transformer for text statistics
class TextStatsTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        stats = [extract_text_statistics(text) for text in X]
        return pd.DataFrame(stats).values

# Custom transformer for linguistic features
class LinguisticTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        features = [extract_linguistic_features(text) for text in X]
        return pd.DataFrame(features).values

# Create a feature engineering pipeline
feature_pipeline = Pipeline([
    ('features', FeatureUnion([
        ('tfidf', Pipeline([
            ('vectorizer', TfidfVectorizer(min_df=2, max_df=0.8))
        ])),
        ('stats', Pipeline([
            ('extractor', TextStatsTransformer()),
            ('scaler', StandardScaler())
        ])),
        ('linguistic', Pipeline([
            ('extractor', LinguisticTransformer()),
            ('scaler', StandardScaler())
        ]))
    ]))
])

# Apply the pipeline to the documents
X_transformed = feature_pipeline.fit_transform(documents)
print(f"Transformed features shape: {X_transformed.shape}")

# Example of using the pipeline in a complete ML workflow
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Sample data for demonstration
X = documents
y = [0, 1, 0, 1, 0]  # Binary labels

# Split the data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Create a complete pipeline with feature engineering and a model
complete_pipeline = Pipeline([
    ('features', feature_pipeline),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train the model

complete_pipeline.fit(X_train, y_train)

# Make predictions

y_pred = complete_pipeline.predict(X_test)

# Evaluate
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Conclusion

Feature engineering is a critical step in the NLP pipeline that transforms raw text into numerical representations suitable for machine learning algorithms. This chapter covered a range of techniques, from traditional count-based methods like Bag-of-Words and TF-IDF to advanced embedding approaches like Word2Vec and Sentence-BERT.

The choice of feature engineering technique depends on your specific task, dataset characteristics, and computational constraints. Often, combining multiple feature types (lexical, statistical, linguistic, semantic) yields the best results.

Remember that feature engineering is both an art and a science—it requires creativity, domain knowledge, and experimentation to find the optimal representation for your text data.

In the next chapter, we'll explore how to build classical NLP models using these engineered features, focusing on traditional machine learning approaches for tasks like text classification, clustering, and information extraction.

Practice exercises:

1. Compare the performance of BoW, TF-IDF, and Word2Vec features on a text classification task
2. Implement a custom feature extractor for detecting text formality or complexity
3. Create a feature engineering pipeline that combines n-grams, character-level features, and linguistic features
4. Experiment with different dimensionality reduction techniques (LSA, NMF, t-SNE) and visualize the results
5. Build a document similarity system using different feature representations and evaluate their effectiveness