Natural Language Processing (NLP) requires a solid foundation in Python programming. This chapter covers essential Python concepts and techniques specifically relevant to NLP tasks, focusing on data structures, functions, and programming patterns that you'll use frequently when working with text data.
Python Data Structures for Text Processing
When working with text data in NLP, you'll rely heavily on certain Python data structures. Understanding how to use these efficiently is crucial for building effective NLP applications.
Strings and String Operations
Strings are the fundamental data type for text in Python. The language provides powerful built-in methods for string manipulation:
# Basic string operations
text = "Natural Language Processing with Python"
word = "tokenization" count = 42 print(f"The word '{word}' appears {count} times in the corpus.")
For NLP tasks, you'll often need to handle special characters, whitespace, and formatting:
import re
text_with_emails = "Contact us at support@example.com or sales@company.org for more information."
emails = re.findall(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}', text_with_emails)
print(f"Extracted emails: {emails}")
Lists and List Comprehensions
Lists are essential for storing sequences of tokens, sentences, or documents:
# Working with lists for NLP
tokens = ["Natural", "Language", "Processing", "with", "Python"]
tokens = ["I", "love", "natural", "language", "processing"] bigrams = [(tokens[i], tokens[i+1]) for i in range(len(tokens)-1)] print(f"Bigrams: {bigrams}")
Dictionaries for Frequency Counts and Mappings
Dictionaries are perfect for word counts, vocabulary mappings, and feature representations:
# Using dictionaries for word frequency
text = "to be or not to be that is the question to be or not"
words = text.lower().split()
word_freq = {word: words.count(word) for word in set(words)}
print(f"Word frequencies: {word_freq}")
# Dictionaries can also map words to embedding vectors; word_vectors and
# cosine_similarity are assumed here (see the sketch below)
king_queen_similarity = cosine_similarity(word_vectors["king"], word_vectors["queen"])
print(f"Similarity between 'king' and 'queen': {king_queen_similarity:.4f}")
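The similarity example above assumes a word_vectors dictionary and a cosine_similarity helper. A minimal, self-contained sketch of those pieces, using made-up toy vectors rather than real embeddings, might look like this:

import math

# Toy word-to-vector mapping (illustrative values, not real embeddings)
word_vectors = {
    "king": [0.9, 0.7, 0.1],
    "queen": [0.88, 0.72, 0.11],
}

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0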
Sets for Unique Words and Operations
Sets are useful for vocabulary management and text comparison:
# Using sets for NLP tasks
text1 = "Python is great for natural language processing"
text2 = "Natural language processing is easy with Python"
stopwords = {"is", "the", "with", "for"} filtered_words = words1.difference(stopwords) print(f"After stopword removal: {filtered_words}")
File Handling for NLP
NLP often involves processing large text files or corpora. Python's file handling capabilities are essential for these tasks:
# Writing text to a file
with open("sample_text.txt", "w") as file:
file.write("This is a sample text file.
")
file.write("It contains multiple lines.
")
file.write("We'll use it for NLP processing examples.")
with open("sample_text.txt", "r") as file: text = file.read().lower() words = re.findall(r'w+', text) # Extract words using regex word_freq = Counter(words) print(" Word frequency:") for word, count in word_freq.most_common(): print(f"{word}: {count}")
For larger datasets, you might need to process files in chunks:
# Processing large files in chunks
def process_large_file(filename, chunk_size=1024):
word_count = 0
with open(filename, "r") as file:
while True:
chunk = file.read(chunk_size)
if not chunk:
break
# Process the chunk
            words = re.findall(r'\w+', chunk.lower())
word_count += len(words)
return word_count
try:
    count = process_large_file("sample_text.txt")
    print(f"Total words in file: {count}")
except FileNotFoundError:
    print("File not found. Please check the path.")
Functions and Lambda Expressions for Text Processing
Well-designed functions make NLP code more modular and reusable:
# Basic text cleaning function
def clean_text(text):
"""
Clean text by converting to lowercase, removing punctuation and extra whitespace.
Args:
text (str): Input text to clean
Returns:
str: Cleaned text
"""
# Convert to lowercase
text = text.lower()
# Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
return text
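To see the function in action, here is a quick check on a made-up noisy string (the sample sentence is purely illustrative):

# Example: cleaning a noisy string
messy = "  Hello,   WORLD!!  This is   NLP...  "
print(clean_text(messy))  # hello world this is nlp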
# Lambda expressions for quick, throwaway transformations
texts = ["Natural language processing is fun", "Python makes NLP easy"]  # illustrative sample texts
text_info = list(map(lambda x: (x.lower(), len(x.split())), texts))
print(f"Text info (lowercase, word count): {text_info}")
Iterators and Generators for Efficient Text Processing
When working with large text datasets, memory efficiency becomes crucial. Iterators and generators help process data without loading everything into memory:
# Generator function for reading a file line by line
def file_line_generator(filename):
"""Yield each line from a file."""
with open(filename, "r") as file:
for line in file:
yield line.strip()
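The trigram example that follows relies on a generate_ngrams generator and a sample_tokens list that are not defined in this excerpt. A minimal sketch under that assumption:

# Generator that yields n-grams lazily from a token list
def generate_ngrams(tokens, n):
    """Yield successive n-grams (as tuples) from a list of tokens."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

# Illustrative token list (assumed for the example below)
sample_tokens = ["natural", "language", "processing", "with", "python", "is", "fun"]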
print(" Trigrams:") for trigram in generate_ngrams(sample_tokens, 3): print(trigram)
Error Handling in NLP Applications
Robust NLP applications need proper error handling to deal with unexpected inputs or processing failures:
# Error handling for text processing
def process_text_file_safely(filename):
"""Process a text file with proper error handling."""
try:
with open(filename, "r", encoding="utf-8") as file:
text = file.read()
            word_count = len(re.findall(r'\w+', text))
return word_count
except FileNotFoundError:
print(f"Error: File '{filename}' not found.")
return 0
except UnicodeDecodeError:
print(f"Error: File '{filename}' has encoding issues. Trying with different encoding...")
try:
with open(filename, "r", encoding="latin-1") as file:
text = file.read()
                word_count = len(re.findall(r'\w+', text))
return word_count
except Exception as e:
print(f"Error: Failed to process file with alternative encoding: {e}")
return 0
except Exception as e:
print(f"Error: An unexpected error occurred: {e}")
return 0
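The loop below refers to an extract_numbers_safely helper and a texts list that are not shown in this excerpt. One plausible sketch, assuming the goal is to pull numeric values out of free text without crashing on bad input:

def extract_numbers_safely(text):
    """Extract numeric values from text, returning an empty list on bad input."""
    try:
        return [float(match) for match in re.findall(r'\d+(?:\.\d+)?', text)]
    except (TypeError, ValueError):
        return []

# Illustrative inputs (assumed for the example below)
texts = ["The model scored 92.5 on 1000 examples", "No numbers here"]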
for text in texts:
    numbers = extract_numbers_safely(text)
    print(f"Text: '{text}'")
    print(f"Extracted numbers: {numbers}")
    print()
Classes and Object-Oriented Programming for NLP
Object-oriented programming helps organize complex NLP workflows:
# A simple document class for NLP
class Document:
"""Class representing a text document for NLP processing."""
def __init__(self, text, doc_id=None):
"""Initialize with text content."""
self.doc_id = doc_id
self.raw_text = text
self.clean_text = self._clean(text)
self.tokens = self._tokenize(self.clean_text)
self.word_count = len(self.tokens)
self.unique_words = set(self.tokens)
self.vocabulary_size = len(self.unique_words)
def _clean(self, text):
"""Clean the text."""
text = text.lower()
        text = re.sub(r'[^\w\s]', '', text)
        text = re.sub(r'\s+', ' ', text).strip()
return text
def _tokenize(self, text):
"""Tokenize the text."""
return text.split()
def get_word_frequencies(self):
"""Get word frequency dictionary."""
return Counter(self.tokens)
def contains_word(self, word):
"""Check if document contains a specific word."""
return word.lower() in self.unique_words
def similarity(self, other_doc):
"""Calculate Jaccard similarity with another document."""
if not isinstance(other_doc, Document):
raise TypeError("Comparison must be with another Document object")
intersection = len(self.unique_words.intersection(other_doc.unique_words))
union = len(self.unique_words.union(other_doc.unique_words))
return intersection / union if union > 0 else 0
def __str__(self):
"""String representation of the document."""
return f"Document(id={self.doc_id}, words={self.word_count}, vocab={self.vocabulary_size})"
def summary(self):
"""Return a summary of the document."""
freq = self.get_word_frequencies()
top_words = freq.most_common(5)
return {
"id": self.doc_id,
"word_count": self.word_count,
"vocabulary_size": self.vocabulary_size,
"top_words": top_words
}
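The word-presence check below assumes two Document instances, doc1 and doc2, created earlier. A minimal sketch under that assumption, with made-up sample texts:

# Create two sample documents (illustrative texts)
doc1 = Document("Natural language processing lets machines understand human language.", doc_id=1)
doc2 = Document("Machine learning and NLP are used to process large amounts of text.", doc_id=2)
print(doc1)
print(doc2)
print(f"Jaccard similarity: {doc1.similarity(doc2):.3f}")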
nlp_related_words = ["language", "processing", "nlp", "text", "machine", "learning"] print(" Word presence in documents:") for word in nlp_related_words: in_doc1 = doc1.contains_word(word) in_doc2 = doc2.contains_word(word) print(f"'{word}': Doc1: {in_doc1}, Doc2: {in_doc2}")
A More Complex Example: Building a Simple Text Classifier
Let's combine the concepts we've learned to build a simple text classifier:
# A simple text classifier using Python basics
import math
from collections import Counter
class SimpleTextClassifier:
"""A basic text classifier using bag-of-words approach."""
def __init__(self):
"""Initialize the classifier."""
self.class_word_counts = {} # Word counts per class
self.class_doc_counts = {} # Document counts per class
self.vocabulary = set() # All unique words
self.total_docs = 0
def _preprocess(self, text):
"""Clean and tokenize text."""
# Convert to lowercase
text = text.lower()
# Remove punctuation
        text = re.sub(r'[^\w\s]', '', text)
# Tokenize
tokens = text.split()
return tokens
def train(self, texts, labels):
"""Train the classifier with texts and their labels."""
if len(texts) != len(labels):
raise ValueError("Number of texts and labels must match")
for text, label in zip(texts, labels):
# Preprocess the text
tokens = self._preprocess(text)
# Update vocabulary
self.vocabulary.update(tokens)
# Update class document counts
self.class_doc_counts[label] = self.class_doc_counts.get(label, 0) + 1
self.total_docs += 1
# Update word counts for this class
if label not in self.class_word_counts:
self.class_word_counts[label] = Counter()
self.class_word_counts[label].update(tokens)
print(f"Training complete. {self.total_docs} documents, {len(self.vocabulary)} unique words.")
def predict(self, text):
"""Predict the class for a given text."""
tokens = self._preprocess(text)
best_score = float('-inf')
best_class = None
# Calculate score for each class
for label in self.class_doc_counts:
# Prior probability P(class)
prior = self.class_doc_counts[label] / self.total_docs
# Calculate log probability to avoid underflow
score = math.log(prior)
# Add log likelihood for each token
for token in tokens:
if token in self.vocabulary:
# Get count of this word in this class (with smoothing)
word_count = self.class_word_counts[label].get(token, 0) + 1
# Total words in this class (with smoothing for all vocabulary words)
total_words = sum(self.class_word_counts[label].values()) + len(self.vocabulary)
# Add log probability of this word given the class
score += math.log(word_count / total_words)
if score > best_score:
best_score = score
best_class = label
return best_class
def evaluate(self, test_texts, test_labels):
"""Evaluate the classifier on test data."""
if len(test_texts) != len(test_labels):
raise ValueError("Number of texts and labels must match")
correct = 0
for text, true_label in zip(test_texts, test_labels):
predicted_label = self.predict(text)
if predicted_label == true_label:
correct += 1
accuracy = correct / len(test_texts)
return accuracy
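The evaluation call below assumes a trained classifier together with held-out test_texts and test_labels. A small, made-up example of how that setup might look (the sentences and labels are invented for illustration):

# Tiny illustrative dataset
train_texts = [
    "I love this movie, it was fantastic",
    "Great acting and a wonderful plot",
    "Terrible film, a complete waste of time",
    "I hated every minute of this boring movie",
]
train_labels = ["positive", "positive", "negative", "negative"]

test_texts = ["What a wonderful fantastic movie", "Boring and terrible plot"]
test_labels = ["positive", "negative"]

classifier = SimpleTextClassifier()
classifier.train(train_texts, train_labels)
print(f"Prediction: {classifier.predict('A fantastic and wonderful experience')}")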
accuracy = classifier.evaluate(test_texts, test_labels)
print(f"Accuracy: {accuracy:.2f}")
Python's Standard Library for NLP Tasks
Python's standard library includes several modules useful for NLP:
# Collections module for NLP tasks
from collections import Counter, defaultdict, namedtuple
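The imports above go unused in the excerpt that follows, so here is a quick illustration of how each is typically applied to text (the sample sentence is arbitrary):

# Counter, defaultdict, and namedtuple in NLP-flavored use
sample = "the quick brown fox jumps over the lazy dog the fox".split()

# Counter: word frequencies in one line
print(Counter(sample).most_common(3))

# defaultdict: group words by their first letter without key checks
by_letter = defaultdict(list)
for word in sample:
    by_letter[word[0]].append(word)
print(dict(by_letter))

# namedtuple: lightweight record for an annotated token
Token = namedtuple("Token", ["text", "pos"])
print(Token(text="fox", pos="NOUN"))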
text = "Hellloooo woorrrllldd!!" grouped = [(char, len(list(group))) for char, group in itertools.groupby(text)] print(f" Grouped characters: {grouped}") compressed = ''.join(char for char, _ in grouped) print(f"Compressed text: {compressed}")
Working with Dates and Times in NLP
Processing dates and times is common in many NLP applications:
from datetime import datetime, timedelta
import re
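The loop below uses a relative_time_description helper and an example_dates list that are not shown here. One possible sketch, assuming dates arrive as 'YYYY-MM-DD' strings:

def relative_time_description(date_string):
    """Describe a 'YYYY-MM-DD' date relative to today (e.g. '3 days ago')."""
    date = datetime.strptime(date_string, "%Y-%m-%d")
    delta = datetime.now() - date
    if delta.days == 0:
        return "today"
    elif delta.days > 0:
        return f"{delta.days} days ago"
    else:
        return f"in {-delta.days} days"

# Illustrative dates (assumed for the example below)
example_dates = ["2023-01-15", "2024-06-30"]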
print(" Relative time descriptions:") for date in example_dates: description = relative_time_description(date) print(f"{date} → {description}")
Conclusion
This chapter has covered the essential Python concepts and techniques that form the foundation for NLP development. By mastering these basics, you'll be well-equipped to work with the specialized NLP libraries and frameworks we'll explore in subsequent chapters.
Remember that clean, efficient code is particularly important in NLP, where you often work with large datasets and complex processing pipelines. The Python skills you've learned here will help you write more maintainable and performant NLP applications.
In the next chapter, we'll dive into the essential Python libraries specifically designed for NLP tasks, building on the fundamental concepts we've covered here.
Practice exercises:
1. Create a function that calculates the lexical diversity of a text (unique words divided by total words).
2. Build a simple concordance tool that shows the context around each occurrence of a specific word.
3. Implement a basic spell checker using Python's built-in data structures.
4. Create a function to detect the language of a text based on character and word frequencies.
5. Build a simple text summarizer that extracts the most important sentences based on word frequency.