# Feature Engineering for NLP in Python
Feature engineering is a critical step in the Natural Language Processing (NLP) pipeline that transforms preprocessed text into numerical representations that machine learning algorithms can understand. This chapter explores various techniques for converting text data into meaningful features, from traditional approaches like Bag-of-Words to advanced embedding methods.
## Understanding Feature Engineering for Text
Machine learning models require numerical inputs, but text is inherently symbolic and unstructured. Feature engineering bridges this gap by converting text into vectors of numbers that capture various linguistic properties while preserving semantic meaning.
The quality of these features significantly impacts model performance. Good features should:

- Capture relevant patterns in the text
- Represent semantic relationships between words and documents
- Be computationally efficient
- Scale well with large vocabularies and datasets
Let's explore the most important feature engineering techniques for NLP, implementing each with Python code examples.
## Count-Based Methods
### Bag-of-Words (BoW)
The Bag-of-Words model is one of the simplest and most common text representation methods. It represents text as an unordered collection of words, disregarding grammar and word order but keeping track of word frequency.
```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Sample documents
documents = [
    "Natural language processing is fascinating.",
    "I love working with text data.",
    "NLP combines linguistics and computer science.",
    "Text processing techniques are essential for NLP.",
    "Processing natural language requires understanding linguistics."
]

# Create a Bag-of-Words representation
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

# Get feature names (vocabulary)
feature_names = vectorizer.get_feature_names_out()

# Convert to DataFrame for better visualization
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=feature_names)
print("Bag-of-Words Matrix:")
print(bow_df)

# Examine the vocabulary
print(f"\nVocabulary size: {len(feature_names)}")
print(f"Vocabulary: {feature_names}")

# Sparse matrix details
print(f"\nMatrix shape: {bow_matrix.shape}")
print(f"Matrix density (fraction of non-zero entries): {bow_matrix.nnz / (bow_matrix.shape[0] * bow_matrix.shape[1]):.4f}")
```
#### Customizing the Bag-of-Words Model
You can customize the BoW model to handle different tokenization patterns, n-grams, and vocabulary constraints:
```python
# Customized Bag-of-Words
custom_vectorizer = CountVectorizer(
    min_df=2,              # Minimum document frequency (ignore terms that appear in fewer than 2 documents)
    max_df=0.8,            # Maximum document frequency (ignore terms that appear in more than 80% of documents)
    ngram_range=(1, 2),    # Include both unigrams and bigrams
    stop_words='english',  # Remove English stopwords
    token_pattern=r'\w+'   # Only keep word characters
)

custom_bow = custom_vectorizer.fit_transform(documents)
custom_features = custom_vectorizer.get_feature_names_out()

# Convert to DataFrame
custom_bow_df = pd.DataFrame(custom_bow.toarray(), columns=custom_features)
print("Customized Bag-of-Words Matrix:")
print(custom_bow_df)

print(f"\nCustomized vocabulary size: {len(custom_features)}")
print(f"Customized vocabulary: {custom_features}")
```
### Term Frequency-Inverse Document Frequency (TF-IDF)
TF-IDF addresses a limitation of BoW by weighting terms based on their importance in a document relative to the entire corpus. It reduces the impact of common words and emphasizes distinctive terms.
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TF-IDF representation
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Get feature names
tfidf_features = tfidf_vectorizer.get_feature_names_out()

# Convert to DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_features)
print("TF-IDF Matrix:")
print(tfidf_df)

# Compare BoW and TF-IDF for a specific term across documents
term = "processing"
if term in feature_names and term in tfidf_features:
    term_index_bow = list(feature_names).index(term)
    term_index_tfidf = list(tfidf_features).index(term)

    print(f"\nComparing '{term}' representation:")
    print("Document | BoW Count | TF-IDF Weight")
    print("-" * 40)
    for i in range(len(documents)):
        bow_value = bow_matrix[i, term_index_bow]
        tfidf_value = tfidf_matrix[i, term_index_tfidf]
        print(f"{i+1:8} | {bow_value:9} | {tfidf_value:.6f}")
```
#### Customizing TF-IDF
TF-IDF can be customized to adjust term weighting and normalization:
```python
# Customized TF-IDF
custom_tfidf = TfidfVectorizer(
    min_df=2,              # Minimum document frequency
    max_df=0.8,            # Maximum document frequency
    ngram_range=(1, 2),    # Include both unigrams and bigrams
    stop_words='english',  # Remove English stopwords
    norm='l2',             # Apply L2 normalization
    use_idf=True,          # Enable IDF weighting
    smooth_idf=True,       # Add 1 to document frequencies to prevent division by zero
    sublinear_tf=True      # Apply sublinear TF scaling (1 + log(TF))
)

custom_tfidf_matrix = custom_tfidf.fit_transform(documents)
custom_tfidf_features = custom_tfidf.get_feature_names_out()

# Convert to DataFrame
custom_tfidf_df = pd.DataFrame(custom_tfidf_matrix.toarray(), columns=custom_tfidf_features)
print("Customized TF-IDF Matrix:")
print(custom_tfidf_df)
```
### N-grams
N-grams capture sequences of N consecutive words, helping to preserve some word order information.
```python
# Unigrams, Bigrams, and Trigrams
unigram_vectorizer = CountVectorizer(ngram_range=(1, 1))
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
trigram_vectorizer = CountVectorizer(ngram_range=(3, 3))
combined_vectorizer = CountVectorizer(ngram_range=(1, 3))

# Fit and transform
unigram_matrix = unigram_vectorizer.fit_transform(documents)
bigram_matrix = bigram_vectorizer.fit_transform(documents)
trigram_matrix = trigram_vectorizer.fit_transform(documents)
combined_matrix = combined_vectorizer.fit_transform(documents)

# Get feature names
unigram_features = unigram_vectorizer.get_feature_names_out()
bigram_features = bigram_vectorizer.get_feature_names_out()
trigram_features = trigram_vectorizer.get_feature_names_out()
combined_features = combined_vectorizer.get_feature_names_out()

print(f"Number of unigram features: {len(unigram_features)}")
print(f"Sample unigrams: {unigram_features[:5]}")

print(f"\nNumber of bigram features: {len(bigram_features)}")
print(f"Sample bigrams: {bigram_features[:5]}")

print(f"\nNumber of trigram features: {len(trigram_features)}")
print(f"Sample trigrams: {trigram_features[:5] if len(trigram_features) >= 5 else trigram_features}")

print(f"\nNumber of combined n-gram features: {len(combined_features)}")
print(f"Sample combined n-grams: {combined_features[:5]}")
```
### Character N-grams
Character n-grams can capture sub-word patterns and are useful for handling misspellings and morphological variations.
```python
# Character n-grams
char_vectorizer = CountVectorizer(
    analyzer='char',
    ngram_range=(3, 5)  # Character n-grams from 3 to 5 characters long
)

char_matrix = char_vectorizer.fit_transform(documents)
char_features = char_vectorizer.get_feature_names_out()

print(f"Number of character n-gram features: {len(char_features)}")
print(f"Sample character n-grams: {char_features[:10]}")

# Character n-grams restricted to word boundaries
char_wb_vectorizer = CountVectorizer(
    analyzer='char_wb',  # Character n-grams only from text inside word boundaries
    ngram_range=(3, 5)
)

char_wb_matrix = char_wb_vectorizer.fit_transform(documents)
char_wb_features = char_wb_vectorizer.get_feature_names_out()

print(f"\nNumber of character word-boundary n-gram features: {len(char_wb_features)}")
print(f"Sample character word-boundary n-grams: {char_wb_features[:10]}")
```
### Hashing Vectorizer
The Hashing Vectorizer is a memory-efficient alternative to CountVectorizer and TfidfVectorizer, especially useful for large datasets.
```python
from sklearn.feature_extraction.text import HashingVectorizer

# Create a Hashing Vectorizer
hash_vectorizer = HashingVectorizer(
    n_features=2**10,      # 1024 features
    alternate_sign=False   # Use only positive values
)

hash_matrix = hash_vectorizer.fit_transform(documents)

# Note: the Hashing Vectorizer doesn't provide feature names
print(f"Hashing Vectorizer matrix shape: {hash_matrix.shape}")
print(f"Hashing Vectorizer matrix density: {hash_matrix.nnz / (hash_matrix.shape[0] * hash_matrix.shape[1]):.4f}")

# Convert to DataFrame (with arbitrary feature names)
hash_df = pd.DataFrame(
    hash_matrix.toarray(),
    columns=[f"feature_{i}" for i in range(hash_matrix.shape[1])]
)
print("\nHashing Vectorizer Matrix (first 5 columns):")
print(hash_df.iloc[:, :5])
```
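The key idea is the "hashing trick": instead of building a vocabulary, each token is hashed directly to a column index. The sketch below illustrates the idea with scikit-learn's `murmurhash3_32`; it demonstrates the principle rather than reproducing `HashingVectorizer`'s exact internal indexing.

```python
from sklearn.utils import murmurhash3_32

def hashed_index(token, n_features=2**10):
    """Map a token to a column index without storing a vocabulary (illustrative only)."""
    return murmurhash3_32(token, seed=0, positive=True) % n_features

for token in ["processing", "language", "nlp"]:
    print(f"'{token}' -> column {hashed_index(token)}")
```

Because different tokens can hash to the same column, occasional collisions are possible; that is the price paid for the constant memory footprint.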
## Word Embeddings
Word embeddings represent words as dense vectors in a continuous vector space, where semantically similar words are mapped to nearby points.
### Word2Vec
Word2Vec learns word embeddings either by predicting a target word from its surrounding context (CBOW) or by predicting the context words from a target word (Skip-gram).
```python
import gensim
from gensim.models import Word2Vec
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import numpy as np

# Prepare sentences for Word2Vec (requires tokenized sentences)
tokenized_sentences = [doc.lower().split() for doc in documents]

# Train Word2Vec model
w2v_model = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=100,  # Embedding dimension
    window=5,         # Context window size
    min_count=1,      # Minimum word frequency
    workers=4,        # Number of threads
    sg=1              # Skip-gram model (use 0 for CBOW)
)

# Get vocabulary
vocabulary = list(w2v_model.wv.index_to_key)
print(f"Word2Vec vocabulary: {vocabulary}")

# Get vector for a specific word
word = "processing"
if word in w2v_model.wv:
    vector = w2v_model.wv[word]
    print(f"\nVector for '{word}' (first 10 dimensions): {vector[:10]}")

# Find similar words
similar_words = w2v_model.wv.most_similar("language", topn=3)
print(f"\nWords most similar to 'language': {similar_words}")

# Visualize word embeddings in 2D
def plot_embeddings(model, words=None):
    # Use the full vocabulary if no word list is given
    if words is None:
        words = list(model.wv.index_to_key)

    # Get word vectors
    word_vectors = np.array([model.wv[word] for word in words])

    # Apply PCA to reduce to 2 dimensions
    pca = PCA(n_components=2)
    result = pca.fit_transform(word_vectors)

    # Create a scatter plot
    plt.figure(figsize=(10, 8))
    plt.scatter(result[:, 0], result[:, 1], c='b', alpha=0.5)

    # Add word labels
    for i, word in enumerate(words):
        plt.annotate(word, xy=(result[i, 0], result[i, 1]))

    plt.title("Word Embeddings Visualization")
    plt.savefig("word_embeddings.png")
    print("Word embeddings visualization saved to 'word_embeddings.png'")

# Plot embeddings
plot_embeddings(w2v_model)
```
### Pre-trained Word Embeddings
Using pre-trained word embeddings like GloVe or fastText can be more effective than training your own, especially for small datasets.
```python
import gensim.downloader as api
import numpy as np

# Load pre-trained GloVe embeddings
try:
    glove_vectors = api.load("glove-wiki-gigaword-100")
    print(f"Loaded GloVe embeddings with {len(glove_vectors)} words")

    # Check if specific words are in the vocabulary
    test_words = ["language", "processing", "computer", "artificial", "intelligence"]
    for word in test_words:
        if word in glove_vectors:
            print(f"'{word}' is in the vocabulary")
        else:
            print(f"'{word}' is NOT in the vocabulary")

    # Get vector for a word
    if "language" in glove_vectors:
        vector = glove_vectors["language"]
        print(f"\nGloVe vector for 'language' (first 10 dimensions): {vector[:10]}")

    # Find similar words
    if "language" in glove_vectors:
        similar_words = glove_vectors.most_similar("language", topn=5)
        print(f"\nWords most similar to 'language' in GloVe: {similar_words}")

    # Word analogies
    if all(word in glove_vectors for word in ["king", "man", "woman"]):
        result = glove_vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
        print(f"\nking - man + woman = {result}")
except Exception as e:
    print(f"Error loading GloVe embeddings: {e}")
    print("You may need to download the embeddings first with: python -m gensim.downloader --download glove-wiki-gigaword-100")
```
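One simple way to turn pre-trained word vectors into document-level features is mean pooling: average the vectors of the in-vocabulary tokens in each document. This is a minimal sketch that assumes the GloVe vectors loaded successfully above; tokens not found in the vocabulary (including punctuation left by the naive `split()`) are skipped.

```python
import numpy as np

def mean_pool(text, keyed_vectors):
    """Average the embeddings of in-vocabulary tokens; return a zero vector if none match."""
    tokens = [t for t in text.lower().split() if t in keyed_vectors]
    if not tokens:
        return np.zeros(keyed_vectors.vector_size)
    return np.mean([keyed_vectors[t] for t in tokens], axis=0)

try:
    doc_features = np.vstack([mean_pool(doc, glove_vectors) for doc in documents])
    print(f"Mean-pooled document feature matrix shape: {doc_features.shape}")
except NameError:
    print("GloVe vectors not loaded; skipping the mean-pooling example.")
```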
### Document Embeddings with Doc2Vec
Doc2Vec extends Word2Vec to learn embeddings for entire documents.
```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Prepare documents for Doc2Vec
tagged_documents = [TaggedDocument(words=doc.lower().split(), tags=[i])
                    for i, doc in enumerate(documents)]

# Train Doc2Vec model
d2v_model = Doc2Vec(
    documents=tagged_documents,
    vector_size=100,  # Embedding dimension
    window=5,         # Context window size
    min_count=1,      # Minimum word frequency
    workers=4,        # Number of threads
    epochs=100        # Number of training epochs
)

# Get document vectors
doc_vectors = [d2v_model.dv[i] for i in range(len(documents))]

# Print first document vector
print(f"Document vector for first document (first 10 dimensions): {doc_vectors[0][:10]}")

# Find most similar document to a query
query = "natural language processing techniques"
query_vector = d2v_model.infer_vector(query.lower().split())
similar_docs = d2v_model.dv.most_similar([query_vector], topn=len(documents))

print(f"\nDocuments most similar to query '{query}':")
for doc_id, similarity in similar_docs:
    print(f"Document {doc_id+1}: {similarity:.4f} - {documents[doc_id]}")
```
## Sentence and Document Embeddings
### Sentence-BERT
Sentence-BERT (SBERT) provides semantically meaningful sentence embeddings that can be compared using cosine similarity.
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained Sentence-BERT model
try:
    sbert_model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

    # Encode sentences
    sentence_embeddings = sbert_model.encode(documents)

    print(f"Sentence embeddings shape: {sentence_embeddings.shape}")
    print(f"First sentence embedding (first 10 dimensions): {sentence_embeddings[0][:10]}")

    # Calculate cosine similarity between all sentence pairs
    similarity_matrix = cosine_similarity(sentence_embeddings)

    # Print similarity matrix
    print("\nSentence similarity matrix:")
    similarity_df = pd.DataFrame(similarity_matrix,
                                 index=[f"Doc {i+1}" for i in range(len(documents))],
                                 columns=[f"Doc {i+1}" for i in range(len(documents))])
    print(similarity_df)

    # Find most similar sentence pair
    max_sim = 0
    max_pair = (0, 0)
    for i in range(len(documents)):
        for j in range(i+1, len(documents)):
            if similarity_matrix[i, j] > max_sim:
                max_sim = similarity_matrix[i, j]
                max_pair = (i, j)

    print(f"\nMost similar document pair:")
    print(f"Document {max_pair[0]+1}: {documents[max_pair[0]]}")
    print(f"Document {max_pair[1]+1}: {documents[max_pair[1]]}")
    print(f"Similarity: {max_sim:.4f}")

    # Query-document similarity
    query = "What is natural language processing?"
    query_embedding = sbert_model.encode([query])[0]

    # Calculate similarity between the query and all documents
    query_similarities = cosine_similarity([query_embedding], sentence_embeddings)[0]

    print(f"\nQuery: '{query}'")
    print("Document similarities:")
    for i, sim in enumerate(query_similarities):
        print(f"Document {i+1}: {sim:.4f} - {documents[i]}")
except Exception as e:
    print(f"Error loading Sentence-BERT model: {e}")
    print("You may need to install sentence-transformers: pip install sentence-transformers")
```
## Feature Selection and Dimensionality Reduction
Feature selection and dimensionality reduction techniques help manage the high dimensionality of text features.
### Chi-Square Feature Selection
The chi-square test selects the features that are most dependent on the class label.
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2, SelectKBest

# Sample documents with labels
labeled_documents = [
    "Natural language processing is fascinating.",
    "I love working with text data.",
    "NLP combines linguistics and computer science.",
    "Text processing techniques are essential for NLP.",
    "Processing natural language requires understanding linguistics."
]
labels = [0, 1, 0, 1, 0]  # Binary labels for demonstration

# Create Bag-of-Words representation
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(labeled_documents)
feature_names = vectorizer.get_feature_names_out()

# Apply chi-square feature selection
k = 10  # Number of features to select
chi2_selector = SelectKBest(chi2, k=min(k, X.shape[1]))
X_chi2 = chi2_selector.fit_transform(X, labels)

# Get selected feature indices
selected_indices = chi2_selector.get_support(indices=True)
selected_features = [feature_names[i] for i in selected_indices]

print(f"Original number of features: {len(feature_names)}")
print(f"Number of selected features: {len(selected_features)}")
print(f"Selected features: {selected_features}")

# Get chi-square scores
chi2_scores = chi2_selector.scores_
feature_scores = [(feature, chi2_scores[i]) for i, feature in enumerate(feature_names)]
feature_scores.sort(key=lambda x: x[1], reverse=True)

print("\nTop features by chi-square score:")
for feature, score in feature_scores[:10]:
    print(f"{feature}: {score:.4f}")
```
### Latent Semantic Analysis (LSA)
LSA uses Singular Value Decomposition (SVD) to reduce the dimensionality of the document-term matrix while preserving semantic relationships.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt

# Create TF-IDF representation
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(documents)
feature_names = tfidf_vectorizer.get_feature_names_out()

# Apply LSA (Truncated SVD)
n_components = 2  # Number of topics/components
lsa_model = TruncatedSVD(n_components=n_components)
X_lsa = lsa_model.fit_transform(X_tfidf)

print(f"Original TF-IDF matrix shape: {X_tfidf.shape}")
print(f"LSA matrix shape: {X_lsa.shape}")
print(f"Explained variance ratio: {lsa_model.explained_variance_ratio_}")
print(f"Total explained variance: {sum(lsa_model.explained_variance_ratio_):.4f}")

# Get feature weights for each component
feature_weights = lsa_model.components_

# Print top terms for each component/topic
print("\nTop terms for each topic:")
for i, component in enumerate(feature_weights):
    sorted_indices = component.argsort()[::-1]
    top_terms = [feature_names[idx] for idx in sorted_indices[:10]]
    print(f"Topic {i+1}: {', '.join(top_terms)}")

# Visualize documents in the LSA space (if using 2 components)
if n_components == 2:
    plt.figure(figsize=(10, 8))
    plt.scatter(X_lsa[:, 0], X_lsa[:, 1])

    # Add document labels
    for i, doc in enumerate(documents):
        plt.annotate(f"Doc {i+1}", xy=(X_lsa[i, 0], X_lsa[i, 1]))

    plt.title("Documents in LSA Space")
    plt.xlabel("Component 1")
    plt.ylabel("Component 2")
    plt.savefig("lsa_visualization.png")
    print("\nLSA visualization saved to 'lsa_visualization.png'")
```
### Non-negative Matrix Factorization (NMF)
NMF is another dimensionality reduction technique; because its factors are non-negative, the resulting topics are often easier to interpret than those produced by LSA.
```python
from sklearn.decomposition import NMF

# Apply NMF
n_components = 2  # Number of topics
nmf_model = NMF(n_components=n_components, random_state=42)
X_nmf = nmf_model.fit_transform(X_tfidf)

print(f"NMF matrix shape: {X_nmf.shape}")

# Get feature weights for each component
feature_weights = nmf_model.components_

# Print top terms for each component/topic
print("\nTop terms for each NMF topic:")
for i, component in enumerate(feature_weights):
    sorted_indices = component.argsort()[::-1]
    top_terms = [feature_names[idx] for idx in sorted_indices[:10]]
    print(f"Topic {i+1}: {', '.join(top_terms)}")

# Visualize documents in the NMF space (if using 2 components)
if n_components == 2:
    plt.figure(figsize=(10, 8))
    plt.scatter(X_nmf[:, 0], X_nmf[:, 1])

    # Add document labels
    for i, doc in enumerate(documents):
        plt.annotate(f"Doc {i+1}", xy=(X_nmf[i, 0], X_nmf[i, 1]))

    plt.title("Documents in NMF Space")
    plt.xlabel("Component 1")
    plt.ylabel("Component 2")
    plt.savefig("nmf_visualization.png")
    print("\nNMF visualization saved to 'nmf_visualization.png'")
```
## Custom Feature Engineering
Sometimes you need to create custom features based on domain knowledge or specific text properties.
### Text Statistics Features
```python
import numpy as np
import re
from textstat import textstat  # pip install textstat

def extract_text_statistics(text):
    """Extract statistical features from text."""
    # Basic counts
    char_count = len(text)
    word_count = len(text.split())
    sentence_count = len(re.split(r'[.!?]+', text)) - 1  # -1 to handle the potential empty string at the end

    # Average lengths
    avg_word_length = char_count / word_count if word_count > 0 else 0
    avg_sentence_length = word_count / sentence_count if sentence_count > 0 else 0

    # Readability scores
    flesch_reading_ease = textstat.flesch_reading_ease(text)
    flesch_kincaid_grade = textstat.flesch_kincaid_grade(text)

    # Lexical diversity (unique words / total words)
    unique_words = len(set(text.lower().split()))
    lexical_diversity = unique_words / word_count if word_count > 0 else 0

    return {
        'char_count': char_count,
        'word_count': word_count,
        'sentence_count': sentence_count,
        'avg_word_length': avg_word_length,
        'avg_sentence_length': avg_sentence_length,
        'flesch_reading_ease': flesch_reading_ease,
        'flesch_kincaid_grade': flesch_kincaid_grade,
        'lexical_diversity': lexical_diversity
    }

# Extract text statistics for each document
text_stats = [extract_text_statistics(doc) for doc in documents]

# Convert to DataFrame
stats_df = pd.DataFrame(text_stats)
print("Text Statistics Features:")
print(stats_df)
```
### Linguistic Features
```python
import spacy
import numpy as np

# Load spaCy model
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    print("Downloading spaCy model...")
    spacy.cli.download("en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")

def extract_linguistic_features(text):
    """Extract linguistic features using spaCy."""
    doc = nlp(text)

    # POS tag counts
    pos_counts = {}
    for token in doc:
        pos = token.pos_
        pos_counts[pos] = pos_counts.get(pos, 0) + 1

    # Normalize by total tokens
    total_tokens = len(doc)
    for pos in pos_counts:
        pos_counts[pos] /= total_tokens

    # Named entity counts
    ent_counts = {}
    for ent in doc.ents:
        ent_type = ent.label_
        ent_counts[ent_type] = ent_counts.get(ent_type, 0) + 1

    # Syntactic dependency counts
    dep_counts = {}
    for token in doc:
        dep = token.dep_
        dep_counts[dep] = dep_counts.get(dep, 0) + 1

    # Normalize by total tokens
    for dep in dep_counts:
        dep_counts[dep] /= total_tokens

    # Sentence complexity features
    sentences = list(doc.sents)
    if sentences:
        avg_depth = np.mean([token.head.i - token.i for token in doc if token.head.i != token.i])
        max_depth = max([len(list(token.ancestors)) for token in doc])
    else:
        avg_depth = 0
        max_depth = 0

    return {
        'noun_ratio': pos_counts.get('NOUN', 0),
        'verb_ratio': pos_counts.get('VERB', 0),
        'adj_ratio': pos_counts.get('ADJ', 0),
        'adv_ratio': pos_counts.get('ADV', 0),
        'entity_count': len(doc.ents),
        'avg_dependency_depth': avg_depth,
        'max_dependency_depth': max_depth
    }

# Extract linguistic features for each document
linguistic_features = [extract_linguistic_features(doc) for doc in documents]

# Convert to DataFrame
ling_df = pd.DataFrame(linguistic_features)
print("Linguistic Features:")
print(ling_df)
```
### Sentiment and Emotion Features
```python
from textblob import TextBlob
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Ensure NLTK data is downloaded
try:
    nltk.data.find('sentiment/vader_lexicon.zip')
except LookupError:
    nltk.download('vader_lexicon')

def extract_sentiment_features(text):
    """Extract sentiment and emotion features."""
    # TextBlob sentiment
    blob = TextBlob(text)
    polarity = blob.sentiment.polarity
    subjectivity = blob.sentiment.subjectivity

    # VADER sentiment
    sid = SentimentIntensityAnalyzer()
    vader_scores = sid.polarity_scores(text)

    # Emotion words (simplified approach)
    positive_words = ['good', 'great', 'excellent', 'amazing', 'wonderful', 'love', 'happy', 'joy']
    negative_words = ['bad', 'terrible', 'awful', 'horrible', 'hate', 'sad', 'angry', 'fear']

    words = text.lower().split()
    positive_count = sum(1 for word in words if word in positive_words)
    negative_count = sum(1 for word in words if word in negative_words)

    return {
        'textblob_polarity': polarity,
        'textblob_subjectivity': subjectivity,
        'vader_compound': vader_scores['compound'],
        'vader_pos': vader_scores['pos'],
        'vader_neg': vader_scores['neg'],
        'vader_neu': vader_scores['neu'],
        'positive_word_count': positive_count,
        'negative_word_count': negative_count
    }

# Extract sentiment features for each document
sentiment_features = [extract_sentiment_features(doc) for doc in documents]

# Convert to DataFrame
sentiment_df = pd.DataFrame(sentiment_features)
print("Sentiment Features:")
print(sentiment_df)
```
## Combining Features
In practice, it's often beneficial to combine different types of features.
```python
from sklearn.preprocessing import StandardScaler
from scipy.sparse import hstack, csr_matrix

# Combine TF-IDF with statistical and linguistic features
# First, standardize the numerical features
scaler = StandardScaler()
stats_scaled = scaler.fit_transform(stats_df)
ling_scaled = scaler.fit_transform(ling_df)

# Convert to sparse matrices for compatibility with TF-IDF
stats_sparse = csr_matrix(stats_scaled)
ling_sparse = csr_matrix(ling_scaled)

# Combine all features
combined_features = hstack([X_tfidf, stats_sparse, ling_sparse])

print(f"TF-IDF features shape: {X_tfidf.shape}")
print(f"Statistical features shape: {stats_sparse.shape}")
print(f"Linguistic features shape: {ling_sparse.shape}")
print(f"Combined features shape: {combined_features.shape}")
```
```python
# Feature importance analysis (example with a simple model)
from sklearn.ensemble import RandomForestClassifier

# Sample labels for demonstration
y = [0, 1, 0, 1, 0]  # Binary labels

# Train a model on the combined features
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(combined_features, y)

# Get feature importances for TF-IDF features
tfidf_feature_importances = model.feature_importances_[:X_tfidf.shape[1]]
tfidf_importance_dict = {feature: importance
                         for feature, importance in zip(feature_names, tfidf_feature_importances)}

# Sort by importance
sorted_tfidf_importances = sorted(tfidf_importance_dict.items(), key=lambda x: x[1], reverse=True)

print("\nTop TF-IDF features by importance:")
for feature, importance in sorted_tfidf_importances[:10]:
    print(f"{feature}: {importance:.6f}")

# Get feature importances for statistical features
stats_feature_importances = model.feature_importances_[X_tfidf.shape[1]:X_tfidf.shape[1] + stats_sparse.shape[1]]
stats_importance_dict = {feature: importance
                         for feature, importance in zip(stats_df.columns, stats_feature_importances)}

print("\nStatistical features by importance:")
for feature, importance in sorted(stats_importance_dict.items(), key=lambda x: x[1], reverse=True):
    print(f"{feature}: {importance:.6f}")
```
## Feature Engineering Pipeline
Creating a reusable feature engineering pipeline helps streamline the process for new data.
```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin

# Custom transformer for text statistics
class TextStatsTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        stats = [extract_text_statistics(text) for text in X]
        return pd.DataFrame(stats).values

# Custom transformer for linguistic features
class LinguisticTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        features = [extract_linguistic_features(text) for text in X]
        return pd.DataFrame(features).values

# Create a feature engineering pipeline
feature_pipeline = Pipeline([
    ('features', FeatureUnion([
        ('tfidf', Pipeline([
            ('vectorizer', TfidfVectorizer(min_df=2, max_df=0.8))
        ])),
        ('stats', Pipeline([
            ('extractor', TextStatsTransformer()),
            ('scaler', StandardScaler())
        ])),
        ('linguistic', Pipeline([
            ('extractor', LinguisticTransformer()),
            ('scaler', StandardScaler())
        ]))
    ]))
])

# Apply the pipeline to the documents
X_transformed = feature_pipeline.fit_transform(documents)
print(f"Transformed features shape: {X_transformed.shape}")
```
```python
# Example of using the pipeline in a complete ML workflow
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Sample data for demonstration
X = documents
y = [0, 1, 0, 1, 0]  # Binary labels

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Create a complete pipeline with feature engineering and model
complete_pipeline = Pipeline([
    ('features', feature_pipeline),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train the model
complete_pipeline.fit(X_train, y_train)

# Make predictions
y_pred = complete_pipeline.predict(X_test)

# Evaluate
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
```
## Conclusion
Feature engineering is a critical step in the NLP pipeline that transforms raw text into numerical representations suitable for machine learning algorithms. This chapter covered a range of techniques, from traditional count-based methods like Bag-of-Words and TF-IDF to advanced embedding approaches like Word2Vec and Sentence-BERT.
The choice of feature engineering technique depends on your specific task, dataset characteristics, and computational constraints. Often, combining multiple feature types (lexical, statistical, linguistic, semantic) yields the best results.
Remember that feature engineering is both an art and a science—it requires creativity, domain knowledge, and experimentation to find the optimal representation for your text data.
In the next chapter, we'll explore how to build classical NLP models using these engineered features, focusing on traditional machine learning approaches for tasks like text classification, clustering, and information extraction.
Practice exercises:

1. Compare the performance of BoW, TF-IDF, and Word2Vec features on a text classification task
2. Implement a custom feature extractor for detecting text formality or complexity
3. Create a feature engineering pipeline that combines n-grams, character-level features, and linguistic features
4. Experiment with different dimensionality reduction techniques (LSA, NMF, t-SNE) and visualize the results
5. Build a document similarity system using different feature representations and evaluate their effectiveness