Introduction

Wikipedia, launched in 2001, has become a significant repository of knowledge, encompassing a wide array of historical events and narratives. This project analyzes and measures narrative changes in Wikipedia articles related to civil rights movements in Thailand from the 1970s to the present using natural language processing (NLP) techniques, particularly word cloud visualization and sentiment analysis, with the aim of uncovering significant shifts in how these historical events are portrayed and perceived over time.

However, it is essential to acknowledge an inherent limitation of using Wikipedia for historical narrative analysis: Wikipedia articles are secondary documentation, synthesized from various sources, rather than primary documents created contemporaneously with the events they describe. This project therefore examines the narratives constructed within these articles, which reflect broader societal perceptions and historical interpretations.

Word cloud visualization is used to surface the themes of each key event, highlighting prominent terms and their evolution across decades. Sentiment analysis complements this by examining the emotional tones embedded within the narratives, shedding light on societal perceptions and the impact of political interventions on civil liberties. Together, these methods show how historical memory and societal discourse evolve and are represented on a digital platform like Wikipedia.

Data Collection

Overview of Collected Articles

The following is a list of Wikipedia articles related to key civil rights events in Thailand from the 1970s to the present, organized by decade:

1970s: 1973 Thai popular uprising; 6 October 1976 massacre

1990s: Black May (1992)

2000s: 2006 Thai coup d'état

2010s: 2010 Thai political protests; 2014 Thai coup d'état

2020s: 2020–2021 Thai protests

Data Scraping

Scrape data from the selected Wikipedia articles:

# Install and load necessary packages
install.packages(c("rvest",
                   "xml2", 
                   "dplyr", 
                   "tokenizers", 
                   "tm", 
                   "SnowballC", 
                   "topicmodels", 
                   "ggplot2", 
                   "tidyverse", 
                   "wordcloud"))

library(rvest)
library(xml2)
library(dplyr)
library(tokenizers)
library(tm)
library(SnowballC)
library(topicmodels)
library(ggplot2)
library(tidyverse)
library(wordcloud)

# Define URLs
urls <- c(
  "<https://en.wikipedia.org/wiki/1973_Thai_popular_uprising>",
  "<https://en.wikipedia.org/wiki/6_October_1976_massacre>",
  "<https://en.wikipedia.org/wiki/Black_May_(1992)>",
  "<https://en.wikipedia.org/wiki/2006_Thai_coup_d%27%C3%A9tat>",
  "<https://en.wikipedia.org/wiki/2010_Thai_political_protests>",
  "<https://en.wikipedia.org/wiki/2014_Thai_coup_d%27%C3%A9tat>",
  "<https://en.wikipedia.org/wiki/2020%E2%80%932021_Thai_protests>"
)

# Data Scraping
article_content <- list()

for (url in urls) {
  webpage <- read_html(url)
  # The first <h1> holds the article title; <p> nodes hold the body text
  article_title <- html_text(html_node(webpage, "h1"))
  article_paragraphs <- html_text(html_nodes(webpage, "p"))
  article_content[[url]] <- list(title = article_title, content = article_paragraphs)
  Sys.sleep(1)  # pause briefly between requests to be polite to Wikipedia's servers
}

# View scraped data
print(article_content)

Data Cleaning and Preparation

Text Processing: Tokenization, Stopword Removal, Stemming

Data Structuring: Converting unstructured data into a data frame

# Data Cleaning and Preparation
cleaned_texts <- list()

for (url in names(article_content)) {
  paragraphs <- article_content[[url]]$content
  text <- paste(paragraphs, collapse = " ")
  # Tokenize, drop English stopwords, then stem each token
  tokens <- unlist(tokenize_words(text))
  tokens <- tokens[!tokens %in% stopwords("en")]
  tokens <- wordStem(tokens, language = "english")
  cleaned_text <- paste(tokens, collapse = " ")
  cleaned_texts[[url]] <- list(title = article_content[[url]]$title, content = cleaned_text)
}

# View cleaned data
print(cleaned_texts)

# Convert to Data Frame (one row per article)
structured_data <- bind_rows(lapply(names(cleaned_texts), function(url) {
  data.frame(
    url = url,
    title = cleaned_texts[[url]]$title,
    content = cleaned_texts[[url]]$content,
    stringsAsFactors = FALSE
  )
}))

# View structured data
View(structured_data)

Visualization and Analysis

Word Cloud Generation

Word clouds are generated to illustrate narrative shifts: by comparing the word cloud of each article, thematic shifts across different decades can be identified. The per-article clouds are produced first; a sketch of a combined comparison cloud follows the code below.

# Word Cloud Generation


# Create a function to generate and plot word clouds
generate_wordcloud <- function(article_text, article_title) {
  # Count word frequencies by splitting the cleaned text on whitespace
  word_freq <- table(unlist(strsplit(article_text, "\\s+")))
  # wordcloud() has no title argument, so draw the title in a small panel above the cloud
  layout(matrix(1:2, nrow = 2), heights = c(1, 6))
  par(mar = rep(0, 4))
  plot.new()
  text(0.5, 0.5, article_title, cex = 1.3, font = 2)
  wordcloud(words = names(word_freq), freq = as.numeric(word_freq), max.words = 100,
            random.order = FALSE, colors = brewer.pal(8, "Dark2"))
}

# Generate and plot word clouds one by one
for (i in 1:nrow(structured_data)) {
  dev.new() 
  generate_wordcloud(structured_data$content[i], structured_data$title[i])
  readline(prompt = "Press [enter] to see the next word cloud...")
}
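
Because the per-article clouds are judged side by side, it can also help to combine them into a single comparison cloud, in which each term is placed with the article where it is relatively most frequent. The sketch below is an optional extension rather than part of the original pipeline; it reuses the already-loaded tm package to build a term-document matrix from structured_data and passes it to wordcloud's comparison.cloud().

# Optional sketch: combine all articles into one comparison cloud
corpus <- VCorpus(VectorSource(structured_data$content))
tdm <- TermDocumentMatrix(corpus)               # rows = terms, columns = articles
term_matrix <- as.matrix(tdm)
colnames(term_matrix) <- structured_data$title  # label each column with its article title

# Each term is sized by how much more frequent it is in its article than elsewhere
comparison.cloud(term_matrix, max.words = 150, random.order = FALSE,
                 title.size = 0.8, colors = brewer.pal(7, "Dark2"))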
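
Sentiment Analysis

The introduction also calls for sentiment analysis of these narratives, which the pipeline above does not yet implement. The following is a minimal sketch of one way to add it, assuming the syuzhet package (not among the packages installed earlier): it scores each article sentence by sentence and plots the mean score per article. It deliberately works on the raw scraped paragraphs rather than the cleaned text, because sentiment lexicons match full word forms rather than stems.

# Sentiment analysis sketch (assumes the syuzhet package, which is not installed above)
install.packages("syuzhet")
library(syuzhet)

# Score the raw, unstemmed paragraphs: lexicons match full word forms, not stems
sentiment_scores <- sapply(article_content, function(article) {
  sentences <- get_sentences(paste(article$content, collapse = " "))
  mean(get_sentiment(sentences, method = "syuzhet"))
})

sentiment_df <- data.frame(
  title = sapply(article_content, function(article) article$title[1]),
  mean_sentiment = sentiment_scores
)

# Compare the average emotional tone of each article
ggplot(sentiment_df, aes(x = reorder(title, mean_sentiment), y = mean_sentiment)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Mean sentence-level sentiment (syuzhet)")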

Results

[Figure placeholder: word clouds generated for each article]

Interpretation and Insights

Narrative Shifts Across Decades

1970s: Student Uprising and Thammasat University Massacre

1990s: Black May

2000s: Military Coup and Political Instability

2010s: Red Shirt Protests and Subsequent Coup

2020s: Pro-Democracy Protests and Calls for Reform