Basic Textual Analysis with Python
Overview
Textual analysis is a foundation for many research projects, including those in the digital humanities. This skill covers extracting basic quantitative features from text with Python and NLTK: word frequencies, concordances (keyword-in-context), and corpus statistics.
When to Use
- Exploring a new text corpus
- Identifying key terms and their contexts
- Comparing word usage across documents
- Preparing data for more advanced analysis (topic modeling, NER)
Prerequisites
- Python 3.8+
- Basic Python knowledge
- NLTK library (pip install nltk)
Steps
Step 1: Load and Tokenize Text
import nltk
from nltk.tokenize import word_tokenize
from collections import Counter
# Download required data (first time only)
nltk.download('punkt')
nltk.download('stopwords')
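# Note: newer NLTK releases (3.9+) may also need nltk.download('punkt_tab')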
# Load text
with open('corpus/document.txt', 'r', encoding='utf-8') as f:
    text = f.read()
# Tokenize
tokens = word_tokenize(text.lower())
print(f"Total tokens: {len(tokens)}")Step 2: Calculate Word Frequencies
Step 2: Calculate Word Frequencies
from nltk.corpus import stopwords
# Remove stopwords and punctuation
stop_words = set(stopwords.words('english'))
words = [w for w in tokens if w.isalpha() and w not in stop_words]
# Count frequencies
freq = Counter(words)
print("Top 20 words:")
for word, count in freq.most_common(20):
    print(f" {word}: {count}")
Step 3: Generate Concordance (KWIC)
from nltk.text import Text
# Create NLTK Text object
nltk_text = Text(tokens)
# Show concordance for a keyword
nltk_text.concordance('knowledge', width=75, lines=10)
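concordance() only prints to the screen. If your NLTK version provides Text.concordance_list() (available in recent releases), a sketch like this captures the matches programmatically instead:
# Each result has .left, .query, and .right attributes (lists of tokens)
for line in nltk_text.concordance_list('knowledge', width=75, lines=10):
    print(' '.join(line.left), f"[{line.query}]", ' '.join(line.right))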
Step 4: Compute Basic Statistics
# Type-token ratio (lexical diversity)
types = len(set(words))
tokens_count = len(words)
ttr = types / tokens_count
print(f"Types (unique words): {types}")
print(f"Tokens (total words): {tokens_count}")
print(f"Type-Token Ratio: {ttr:.3f}")Example
Example
Input
A plain text file containing a chapter from a 19th-century novel.
Process
- Tokenize into words
- Remove stopwords
- Count frequencies
- Generate concordance for “society”
Output
Top 10 words:
time: 45
man: 38
society: 32
...
Concordance for 'society':
the rules of society dictated that
excluded from society entirely
...
Statistics:
Types: 2,847
Tokens: 15,234
TTR: 0.187
Tips and Best Practices
- Always specify encoding (utf-8) when reading files
- Normalize text (lowercase) before analysis for consistent counts
- Consider lemmatization for more accurate frequency counts (see the sketch after this list)
- Save results to CSV/JSON for further analysis
- Document your preprocessing choices
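For the lemmatization tip above, a minimal sketch using NLTK's WordNetLemmatizer, continuing from the words list in Step 2 (note that without part-of-speech tags it treats every word as a noun):
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # first time only; some versions also want 'omw-1.4'
lemmatizer = WordNetLemmatizer()
# Collapse inflected forms (e.g. 'societies' -> 'society') before counting
lemmas = [lemmatizer.lemmatize(w) for w in words]
lemma_freq = Counter(lemmas)
print(lemma_freq.most_common(20))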
References
- NLTK Documentation: https://www.nltk.org/
- Bird, Steven, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O’Reilly, 2009.