Basic Textual Analysis with Python
Overview
Textual analysis is a foundation for many research projects, including those in the digital humanities. This skill covers extracting basic quantitative features from text with Python and NLTK: word frequencies, concordances (keyword-in-context), and corpus statistics.
When to Use
- Exploring a new text corpus
- Identifying key terms and their contexts
- Comparing word usage across documents
- Preparing data for more advanced analysis (topic modeling, NER)
Prerequisites
- Python 3.8+
- Basic Python knowledge
- NLTK library (pip install nltk)
Steps
Step 1: Load and Tokenize Text
import nltk
from nltk.tokenize import word_tokenize
from collections import Counter
# Download required data (first time only)
nltk.download('punkt')
nltk.download('stopwords')
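# Note: newer NLTK releases (3.9+) may also need nltk.download('punkt_tab')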
# Load text
with open('corpus/document.txt', 'r', encoding='utf-8') as f:
    text = f.read()
# Tokenize
tokens = word_tokenize(text.lower())
print(f"Total tokens: {len(tokens)}")Step 2: Calculate Word Frequencies
Step 2: Calculate Word Frequencies
from nltk.corpus import stopwords
# Remove stopwords and punctuation
stop_words = set(stopwords.words('english'))
words = [w for w in tokens if w.isalpha() and w not in stop_words]
# Count frequencies
freq = Counter(words)
print("Top 20 words:")
for word, count in freq.most_common(20):
    print(f" {word}: {count}")
Step 3: Generate Concordance (KWIC)
from nltk.text import Text
# Create NLTK Text object
nltk_text = Text(tokens)
# Show concordance for a keyword
nltk_text.concordance('knowledge', width=75, lines=10)
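concordance() only prints to the screen. If your NLTK version provides Text.concordance_list() (available in recent releases), a sketch like this captures the matches programmatically instead:
# Each result has .left, .query, and .right attributes (lists of tokens)
for line in nltk_text.concordance_list('knowledge', width=75, lines=10):
    print(' '.join(line.left), f"[{line.query}]", ' '.join(line.right))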
Step 4: Compute Basic Statistics
# Type-token ratio (lexical diversity)
types = len(set(words))
tokens_count = len(words)
ttr = types / tokens_count
print(f"Types (unique words): {types}")
print(f"Tokens (total words): {tokens_count}")
print(f"Type-Token Ratio: {ttr:.3f}")Example
Example
Input
A plain text file containing a chapter from a 19th-century novel.
Process
- Tokenize into words
- Remove stopwords
- Count frequencies
- Generate concordance for “society”
Output
Top 10 words:
time: 45
man: 38
society: 32
...
Concordance for 'society':
the rules of society dictated that
excluded from society entirely
...
Statistics:
Types: 2,847
Tokens: 15,234
TTR: 0.187
Tips and Best Practices
- Always specify encoding (utf-8) when reading files
- Normalize text (lowercase) before analysis for consistent counts
- Consider lemmatization for more accurate frequency counts (see the sketch after this list)
- Save results to CSV/JSON for further analysis
- Document your preprocessing choices
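For the lemmatization tip above, a minimal sketch using NLTK's WordNetLemmatizer, continuing from the words list in Step 2 (note that without part-of-speech tags it treats every word as a noun):
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # first time only; some versions also want 'omw-1.4'
lemmatizer = WordNetLemmatizer()
# Collapse inflected forms (e.g. 'societies' -> 'society') before counting
lemmas = [lemmatizer.lemmatize(w) for w in words]
lemma_freq = Counter(lemmas)
print(lemma_freq.most_common(20))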
References
- NLTK Documentation: https://www.nltk.org/
- Bird, Steven, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O’Reilly, 2009.