Natural Language Processing (NLP) requires a solid foundation in Python programming. This chapter covers essential Python concepts and techniques specifically relevant to NLP tasks, focusing on data structures, functions, and programming patterns that you'll use frequently when working with text data.
Python Data Structures for Text Processing
When working with text data in NLP, you'll rely heavily on certain Python data structures. Understanding how to use these efficiently is crucial for building effective NLP applications.
Strings and String Operations
Strings are the fundamental data type for text in Python. The language provides powerful built-in methods for string manipulation:
# Basic string operations
text = "Natural Language Processing with Python"
word = "tokenization" count = 42 print(f"The word '{word}' appears {count} times in the corpus.")
For NLP tasks, you'll often need to handle special characters, whitespace, and formatting:
import re
text_with_emails = "Contact us at support@example.com or sales@company.org for more information."
emails = re.findall(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}', text_with_emails)
print(f"Extracted emails: {emails}")
Lists and List Comprehensions
Lists are essential for storing sequences of tokens, sentences, or documents:
# Working with lists for NLP
tokens = ["Natural", "Language", "Processing", "with", "Python"]
tokens = ["I", "love", "natural", "language", "processing"] bigrams = [(tokens[i], tokens[i+1]) for i in range(len(tokens)-1)] print(f"Bigrams: {bigrams}")
Dictionaries for Frequency Counts and Mappings
Dictionaries are perfect for word counts, vocabulary mappings, and feature representations:
# Using dictionaries for word frequency
text = "to be or not to be that is the question to be or not"
words = text.lower().split()
word_freq = {word: words.count(word) for word in set(words)}
print(f"Word frequencies: {word_freq}")
# Dictionaries can also map words to embedding vectors; word_vectors and
# cosine_similarity are assumed here (see the sketch below)
king_queen_similarity = cosine_similarity(word_vectors["king"], word_vectors["queen"])
print(f"Similarity between 'king' and 'queen': {king_queen_similarity:.4f}")
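The similarity example above assumes a word_vectors dictionary and a cosine_similarity helper. A minimal, self-contained sketch of those pieces, using made-up toy vectors rather than real embeddings, might look like this:

import math

# Toy word-to-vector mapping (illustrative values, not real embeddings)
word_vectors = {
    "king": [0.9, 0.7, 0.1],
    "queen": [0.88, 0.72, 0.11],
}

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0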
Sets for Unique Words and Operations
Sets are useful for vocabulary management and text comparison:
# Using sets for NLP tasks
text1 = "Python is great for natural language processing"
text2 = "Natural language processing is easy with Python"
stopwords = {"is", "the", "with", "for"} filtered_words = words1.difference(stopwords) print(f"After stopword removal: {filtered_words}")
File Handling for NLP
NLP often involves processing large text files or corpora. Python's file handling capabilities are essential for these tasks:
# Writing text to a file
with open("sample_text.txt", "w") as file:
file.write("This is a sample text file.
")
file.write("It contains multiple lines.
")
file.write("We'll use it for NLP processing examples.")
with open("sample_text.txt", "r") as file: text = file.read().lower() words = re.findall(r'w+', text) # Extract words using regex word_freq = Counter(words) print(" Word frequency:") for word, count in word_freq.most_common(): print(f"{word}: {count}")
For larger datasets, you might need to process files in chunks:
# Processing large files in chunks
def process_large_file(filename, chunk_size=1024):
word_count = 0
with open(filename, "r") as file:
while True:
chunk = file.read(chunk_size)
if not chunk:
break
# Process the chunk
            words = re.findall(r'\w+', chunk.lower())
word_count += len(words)
return word_count
try:
    count = process_large_file("sample_text.txt")
    print(f"Total words in file: {count}")
except FileNotFoundError:
    print("File not found. Please check the path.")
Functions and Lambda Expressions for Text Processing
Well-designed functions make NLP code more modular and reusable:
# Basic text cleaning function
def clean_text(text):
"""
Clean text by converting to lowercase, removing punctuation and extra whitespace.
Args:
text (str): Input text to clean
Returns:
str: Cleaned text
"""
# Convert to lowercase
text = text.lower()
# Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
return text
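To see the function in action, here is a quick check on a made-up noisy string (the sample sentence is purely illustrative):

# Example: cleaning a noisy string
messy = "  Hello,   WORLD!!  This is   NLP...  "
print(clean_text(messy))  # hello world this is nlp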
# Lambda expressions for quick, throwaway transformations
texts = ["Natural language processing is fun", "Python makes NLP easy"]  # illustrative sample texts
text_info = list(map(lambda x: (x.lower(), len(x.split())), texts))
print(f"Text info (lowercase, word count): {text_info}")
Iterators and Generators for Efficient Text Processing
When working with large text datasets, memory efficiency becomes crucial. Iterators and generators help process data without loading everything into memory:
# Generator function for reading a file line by line
def file_line_generator(filename):
"""Yield each line from a file."""
with open(filename, "r") as file:
for line in file:
yield line.strip()
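The trigram example that follows relies on a generate_ngrams generator and a sample_tokens list that are not defined in this excerpt. A minimal sketch under that assumption:

# Generator that yields n-grams lazily from a token list
def generate_ngrams(tokens, n):
    """Yield successive n-grams (as tuples) from a list of tokens."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

# Illustrative token list (assumed for the example below)
sample_tokens = ["natural", "language", "processing", "with", "python", "is", "fun"]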
print(" Trigrams:") for trigram in generate_ngrams(sample_tokens, 3): print(trigram)
Error Handling in NLP Applications
Robust NLP applications need proper error handling to deal with unexpected inputs or processing failures:
# Error handling for text processing
def process_text_file_safely(filename):
"""Process a text file with proper error handling."""
try:
with open(filename, "r", encoding="utf-8") as file:
text = file.read()
            word_count = len(re.findall(r'\w+', text))
return word_count
except FileNotFoundError:
print(f"Error: File '{filename}' not found.")
return 0
except UnicodeDecodeError:
print(f"Error: File '{filename}' has encoding issues. Trying with different encoding...")
try:
with open(filename, "r", encoding="latin-1") as file:
text = file.read()
                word_count = len(re.findall(r'\w+', text))
return word_count
except Exception as e:
print(f"Error: Failed to process file with alternative encoding: {e}")
return 0
except Exception as e:
print(f"Error: An unexpected error occurred: {e}")
return 0
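The loop below refers to an extract_numbers_safely helper and a texts list that are not shown in this excerpt. One plausible sketch, assuming the goal is to pull numeric values out of free text without crashing on bad input:

def extract_numbers_safely(text):
    """Extract numeric values from text, returning an empty list on bad input."""
    try:
        return [float(match) for match in re.findall(r'\d+(?:\.\d+)?', text)]
    except (TypeError, ValueError):
        return []

# Illustrative inputs (assumed for the example below)
texts = ["The model scored 92.5 on 1000 examples", "No numbers here"]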
for text in texts:
    numbers = extract_numbers_safely(text)
    print(f"Text: '{text}'")
    print(f"Extracted numbers: {numbers}")
    print()
Classes and Object-Oriented Programming for NLP
Object-oriented programming helps organize complex NLP workflows:
# A simple document class for NLP
class Document:
"""Class representing a text document for NLP processing."""
def __init__(self, text, doc_id=None):
"""Initialize with text content."""
self.doc_id = doc_id
self.raw_text = text
self.clean_text = self._clean(text)
self.tokens = self._tokenize(self.clean_text)
self.word_count = len(self.tokens)
self.unique_words = set(self.tokens)
self.vocabulary_size = len(self.unique_words)
def _clean(self, text):
"""Clean the text."""
text = text.lower()
        text = re.sub(r'[^\w\s]', '', text)
        text = re.sub(r'\s+', ' ', text).strip()
return text
def _tokenize(self, text):
"""Tokenize the text."""
return text.split()
def get_word_frequencies(self):
"""Get word frequency dictionary."""
return Counter(self.tokens)
def contains_word(self, word):
"""Check if document contains a specific word."""
return word.lower() in self.unique_words
def similarity(self, other_doc):
"""Calculate Jaccard similarity with another document."""
if not isinstance(other_doc, Document):
raise TypeError("Comparison must be with another Document object")
intersection = len(self.unique_words.intersection(other_doc.unique_words))
union = len(self.unique_words.union(other_doc.unique_words))
return intersection / union if union > 0 else 0
def __str__(self):
"""String representation of the document."""
return f"Document(id={self.doc_id}, words={self.word_count}, vocab={self.vocabulary_size})"
def summary(self):
"""Return a summary of the document."""
freq = self.get_word_frequencies()
top_words = freq.most_common(5)
return {
"id": self.doc_id,
"word_count": self.word_count,
"vocabulary_size": self.vocabulary_size,
"top_words": top_words
}
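The word-presence check below assumes two Document instances, doc1 and doc2, created earlier. A minimal sketch under that assumption, with made-up sample texts:

# Create two sample documents (illustrative texts)
doc1 = Document("Natural language processing lets machines understand human language.", doc_id=1)
doc2 = Document("Machine learning and NLP are used to process large amounts of text.", doc_id=2)
print(doc1)
print(doc2)
print(f"Jaccard similarity: {doc1.similarity(doc2):.3f}")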
nlp_related_words = ["language", "processing", "nlp", "text", "machine", "learning"] print(" Word presence in documents:") for word in nlp_related_words: in_doc1 = doc1.contains_word(word) in_doc2 = doc2.contains_word(word) print(f"'{word}': Doc1: {in_doc1}, Doc2: {in_doc2}")
A More Complex Example: Building a Simple Text Classifier
Let's combine the concepts we've learned to build a simple text classifier:
# A simple text classifier using Python basics
import math
from collections import Counter
class SimpleTextClassifier:
"""A basic text classifier using bag-of-words approach."""
def __init__(self):
"""Initialize the classifier."""
self.class_word_counts = {} # Word counts per class
self.class_doc_counts = {} # Document counts per class
self.vocabulary = set() # All unique words
self.total_docs = 0
def _preprocess(self, text):
"""Clean and tokenize text."""
# Convert to lowercase
text = text.lower()
# Remove punctuation
        text = re.sub(r'[^\w\s]', '', text)
# Tokenize
tokens = text.split()
return tokens
def train(self, texts, labels):
"""Train the classifier with texts and their labels."""
if len(texts) != len(labels):
raise ValueError("Number of texts and labels must match")
for text, label in zip(texts, labels):
# Preprocess the text
tokens = self._preprocess(text)
# Update vocabulary
self.vocabulary.update(tokens)
# Update class document counts
self.class_doc_counts[label] = self.class_doc_counts.get(label, 0) + 1
self.total_docs += 1
# Update word counts for this class
if label not in self.class_word_counts:
self.class_word_counts[label] = Counter()
self.class_word_counts[label].update(tokens)
print(f"Training complete. {self.total_docs} documents, {len(self.vocabulary)} unique words.")
def predict(self, text):
"""Predict the class for a given text."""
tokens = self._preprocess(text)
best_score = float('-inf')
best_class = None
# Calculate score for each class
for label in self.class_doc_counts:
# Prior probability P(class)
prior = self.class_doc_counts[label] / self.total_docs
# Calculate log probability to avoid underflow
score = math.log(prior)
# Add log likelihood for each token
for token in tokens:
if token in self.vocabulary:
# Get count of this word in this class (with smoothing)
word_count = self.class_word_counts[label].get(token, 0) + 1
# Total words in this class (with smoothing for all vocabulary words)
total_words = sum(self.class_word_counts[label].values()) + len(self.vocabulary)
# Add log probability of this word given the class
score += math.log(word_count / total_words)
if score > best_score:
best_score = score
best_class = label
return best_class
def evaluate(self, test_texts, test_labels):
"""Evaluate the classifier on test data."""
if len(test_texts) != len(test_labels):
raise ValueError("Number of texts and labels must match")
correct = 0
for text, true_label in zip(test_texts, test_labels):
predicted_label = self.predict(text)
if predicted_label == true_label:
correct += 1
accuracy = correct / len(test_texts)
return accuracy
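The evaluation call below assumes a trained classifier together with held-out test_texts and test_labels. A small, made-up example of how that setup might look (the sentences and labels are invented for illustration):

# Tiny illustrative dataset
train_texts = [
    "I love this movie, it was fantastic",
    "Great acting and a wonderful plot",
    "Terrible film, a complete waste of time",
    "I hated every minute of this boring movie",
]
train_labels = ["positive", "positive", "negative", "negative"]

test_texts = ["What a wonderful fantastic movie", "Boring and terrible plot"]
test_labels = ["positive", "negative"]

classifier = SimpleTextClassifier()
classifier.train(train_texts, train_labels)
print(f"Prediction: {classifier.predict('A fantastic and wonderful experience')}")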
accuracy = classifier.evaluate(test_texts, test_labels)
print(f"Accuracy: {accuracy:.2f}")
Python's Standard Library for NLP Tasks
Python's standard library includes several modules useful for NLP:
# Collections module for NLP tasks
from collections import Counter, defaultdict, namedtuple
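The imports above go unused in the excerpt that follows, so here is a quick illustration of how each is typically applied to text (the sample sentence is arbitrary):

# Counter, defaultdict, and namedtuple in NLP-flavored use
sample = "the quick brown fox jumps over the lazy dog the fox".split()

# Counter: word frequencies in one line
print(Counter(sample).most_common(3))

# defaultdict: group words by their first letter without key checks
by_letter = defaultdict(list)
for word in sample:
    by_letter[word[0]].append(word)
print(dict(by_letter))

# namedtuple: lightweight record for an annotated token
Token = namedtuple("Token", ["text", "pos"])
print(Token(text="fox", pos="NOUN"))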
text = "Hellloooo woorrrllldd!!" grouped = [(char, len(list(group))) for char, group in itertools.groupby(text)] print(f" Grouped characters: {grouped}") compressed = ''.join(char for char, _ in grouped) print(f"Compressed text: {compressed}")
Working with Dates and Times in NLP
Processing dates and times is common in many NLP applications:
from datetime import datetime, timedelta
import re
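The loop below uses a relative_time_description helper and an example_dates list that are not shown here. One possible sketch, assuming dates arrive as 'YYYY-MM-DD' strings:

def relative_time_description(date_string):
    """Describe a 'YYYY-MM-DD' date relative to today (e.g. '3 days ago')."""
    date = datetime.strptime(date_string, "%Y-%m-%d")
    delta = datetime.now() - date
    if delta.days == 0:
        return "today"
    elif delta.days > 0:
        return f"{delta.days} days ago"
    else:
        return f"in {-delta.days} days"

# Illustrative dates (assumed for the example below)
example_dates = ["2023-01-15", "2024-06-30"]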
print(" Relative time descriptions:") for date in example_dates: description = relative_time_description(date) print(f"{date} → {description}")
Conclusion
This chapter has covered the essential Python concepts and techniques that form the foundation for NLP development. By mastering these basics, you'll be well-equipped to work with the specialized NLP libraries and frameworks we'll explore in subsequent chapters.
Remember that clean, efficient code is particularly important in NLP, where you often work with large datasets and complex processing pipelines. The Python skills you've learned here will help you write more maintainable and performant NLP applications.
In the next chapter, we'll dive into the essential Python libraries specifically designed for NLP tasks, building on the fundamental concepts we've covered here.
Practice exercises:
1. Create a function that calculates the lexical diversity of a text (unique words divided by total words).
2. Build a simple concordance tool that shows the context around each occurrence of a specific word.
3. Implement a basic spell checker using Python's built-in data structures.
4. Create a function to detect the language of a text based on character and word frequencies.
5. Build a simple text summarizer that extracts the most important sentences based on word frequency.