
Basic Textual Analysis with Python

Category: analysis
Workflow Stage: analyze
Author: AI Research Skills Contributors
Last Updated: 2026-01-11

Tags: text-analysis, python, nltk, word-frequency, concordance

Overview

Textual analysis is foundational to many research projects, including in the digital humanities. This skill covers using Python and NLTK to extract basic quantitative features from text: word frequencies, concordances (keyword-in-context), and corpus statistics.

When to Use

  • Exploring a new text corpus
  • Identifying key terms and their contexts
  • Comparing word usage across documents
  • Preparing data for more advanced analysis (topic modeling, NER)

Prerequisites

  • Python 3.8+
  • Basic Python knowledge
  • NLTK library (pip install nltk)

Steps

Step 1: Load and Tokenize Text

import nltk
from nltk.tokenize import word_tokenize
from collections import Counter

# Download required data (first time only)
nltk.download('punkt')
nltk.download('stopwords')

# Load text
with open('corpus/document.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Tokenize
tokens = word_tokenize(text.lower())
print(f"Total tokens: {len(tokens)}")
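
Note that newer NLTK releases (3.9 and later) load the tokenizer data from a resource called punkt_tab rather than punkt. If word_tokenize raises a LookupError, this one-time download should resolve it:

nltk.download('punkt_tab')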

Step 2: Calculate Word Frequencies

from nltk.corpus import stopwords

# Remove stopwords and punctuation
stop_words = set(stopwords.words('english'))
words = [w for w in tokens if w.isalpha() and w not in stop_words]

# Count frequencies
freq = Counter(words)
print("Top 20 words:")
for word, count in freq.most_common(20):
    print(f" {word}: {count}")
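
Raw counts are hard to compare across documents of different lengths. A minimal follow-on sketch, reusing the words list and freq counter from above, normalizes counts to a rate per 1,000 words:

# Relative frequencies are comparable across documents of any length
total = len(words)
rel_freq = {w: c / total * 1000 for w, c in freq.items()}
for word, rate in sorted(rel_freq.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(f" {word}: {rate:.2f} per 1,000 words")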

Step 3: Generate Concordance (KWIC)

from nltk.text import Text

# Create NLTK Text object
nltk_text = Text(tokens)

# Show concordance for a keyword
nltk_text.concordance('knowledge', width=75, lines=10)
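
concordance() only prints to the console. To work with the matches programmatically (for example, to export them), recent NLTK versions also provide Text.concordance_list(), which returns structured results. A short sketch:

# Each hit is a ConcordanceLine with .left, .query, .right, and .line
hits = nltk_text.concordance_list('knowledge', width=75, lines=10)
for hit in hits:
    print(hit.line)  # the formatted KWIC line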

Step 4: Compute Basic Statistics

# Type-token ratio (lexical diversity)
types = len(set(words))
tokens_count = len(words)
ttr = types / tokens_count
print(f"Types (unique words): {types}")
print(f"Tokens (total words): {tokens_count}")
print(f"Type-Token Ratio: {ttr:.3f}")
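
Note that TTR falls as texts grow longer, so raw TTRs mislead when comparing documents of different lengths. A common workaround is a standardized TTR averaged over fixed-size windows. A minimal sketch, reusing the words list from Step 2:

# Standardized TTR: mean TTR over consecutive 1,000-word windows
WINDOW = 1000
ttrs = [len(set(words[i:i + WINDOW])) / WINDOW
        for i in range(0, len(words) - WINDOW + 1, WINDOW)]
if ttrs:  # empty if the text is shorter than one window
    print(f"Standardized TTR ({WINDOW}-word windows): {sum(ttrs) / len(ttrs):.3f}")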

Example

Input

A plain text file containing a chapter from a 19th-century novel.

Process

  1. Tokenize into words
  2. Remove stopwords
  3. Count frequencies
  4. Generate concordance for “society”

Output

Top 10 words:
  time: 45
  man: 38
  society: 32
  ...

Concordance for 'society':
  the rules of society dictated that
  excluded from society entirely
  ...

Statistics:
  Types: 2,847
  Tokens: 15,234
  TTR: 0.187
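
The whole example can be reproduced with one short script combining Steps 1 through 4. This is a sketch; the input path 'corpus/chapter.txt' is a placeholder for your own file:

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.text import Text
from collections import Counter

def analyze(path, keyword):
    # Steps 1-2: load, tokenize, filter, count
    with open(path, 'r', encoding='utf-8') as f:
        tokens = word_tokenize(f.read().lower())
    stop_words = set(stopwords.words('english'))
    words = [w for w in tokens if w.isalpha() and w not in stop_words]
    freq = Counter(words)
    print("Top 10 words:")
    for word, count in freq.most_common(10):
        print(f"  {word}: {count}")
    # Step 3: keyword-in-context
    print(f"\nConcordance for '{keyword}':")
    Text(tokens).concordance(keyword, width=75, lines=10)
    # Step 4: corpus statistics
    types, total = len(set(words)), len(words)
    print("\nStatistics:")
    print(f"  Types: {types:,}")
    print(f"  Tokens: {total:,}")
    print(f"  TTR: {types / total:.3f}")

analyze('corpus/chapter.txt', 'society')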

Tips and Best Practices

  • Always specify encoding (utf-8) when reading files
  • Normalize text (lowercase) before analysis for consistent counts
  • Consider lemmatization for more accurate frequency counts
  • Save results to CSV/JSON for further analysis (both tips are sketched after this list)
  • Document your preprocessing choices
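
A minimal sketch of the lemmatization and export tips, reusing the words list and the Counter import from the steps above; 'word_frequencies.csv' is just a placeholder filename, and the 'wordnet' resource is a one-time download:

import csv
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # first time only

# Lemmatize (nouns by default; POS-aware lemmatization is more
# accurate but requires a tagger)
lemmatizer = WordNetLemmatizer()
lemma_freq = Counter(lemmatizer.lemmatize(w) for w in words)

# Save frequencies to CSV for further analysis
with open('word_frequencies.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['word', 'count'])
    writer.writerows(lemma_freq.most_common())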

References

  • NLTK Documentation: https://www.nltk.org/ 
  • Bird, Steven, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O’Reilly, 2009.