
The insurgency in Southern Thailand has been a protracted and complex conflict, drawing international attention due to its impact on regional stability and human rights. Media coverage plays a crucial role in shaping public perception and understanding of such conflicts. This project aims to analyze the historical coverage of Southern Thailand's insurgency by The Guardian using web scraping and text analysis techniques. The research question guiding this analysis is: "What are the dominant sentiments and topics associated with the Southern Thailand insurgency portrayed in The Guardian over the past two decades?" By extracting and processing articles spanning nearly two decades, the project provides insights into the prominent topics (word frequency data), the tone of articles (sentiment scores), and the thematic content and structure (topic modeling data) of The Guardian's coverage. The findings from this analysis will contribute to a deeper understanding of how international media portrays long-standing regional conflicts.

Dataset Description

This project utilizes news articles from The Guardian to analyze the coverage of the Southern Thailand insurgency. The dataset consists of 25 news articles selected through web scraping, focusing on the period from 2004 to 2021. The choice of this timeframe allows for a comprehensive examination of how media narratives and reporting on the conflict have evolved over nearly two decades.

| Data Source | The Guardian |
| --- | --- |
| Quantity | 25 news articles |
| Timeframe | 2004 to 2021 |
| Search Terms | "southern Thailand conflict," "southern Thailand insurgency," "southern Thailand peace," "Pattani conflict," "Yala conflict," "Narathiwat conflict," "Songkhla conflict" |

The articles were chosen based on their relevance to the Southern Thailand insurgency, ensuring a diverse representation of topics and sentiments. The dataset encompasses a variety of articles, including news reports, feature stories, and opinion pieces, which collectively provide a broad perspective on the media portrayal of the conflict.

The selected articles were scraped and processed to extract relevant text data for analysis. This dataset will be used to explore the dominant themes, sentiments, and the narrative surrounding the Southern Thailand insurgency as reported by The Guardian.


Full R Code of the Analysis

### Step 1: Load Required Libraries ###

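# Install the packages (only needed once)
install.packages(c("rvest", "dplyr", "stringr", "tm", "tidytext", "ggplot2", "tidyr", "topicmodels", "SnowballC"))

# Load the libraries
library(rvest)        # read_html(), html_node(), html_text()
library(dplyr)        # %>%, count(), mutate(), inner_join()
library(stringr)      # str_extract()
library(tm)           # removePunctuation(), removeWords(), stopwords()
library(tidytext)     # unnest_tokens(), get_sentiments(), cast_dtm(), tidy()
library(ggplot2)      # plotting
library(tidyr)        # pivot_wider()
library(topicmodels)  # LDA()
library(SnowballC)    # wordStem()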

### Step 2: Scrape the Web Pages ###

# Function to scrape an article
scrape_article <- function(url) {
  webpage <- read_html(url)
  # Extract title
  title <- webpage %>%
    html_node("title") %>%
    html_text(trim = TRUE)
  # Extract date 
  date <- webpage %>%
    html_node("meta[property='article:published_time']") %>%
    html_attr("content") %>%
    str_extract("\\d{4}-\\d{2}-\\d{2}")  # Keep only the YYYY-MM-DD part of the timestamp
  # Extract article content 
  content <- webpage %>%
    html_nodes("p") %>%
    html_text(trim = TRUE) %>%
    paste(collapse = " ")
  # Return as a data frame
  return(data.frame(title = title, date = date, content = content, url = url, stringsAsFactors = FALSE))
}

# URLs of the articles to scrape
urls <- c(
  # article URLs go here
)

# Scrape all articles
articles <- do.call(rbind, lapply(urls, scrape_article))

# View the scraped data
head(articles)

### Step 3: Data Preprocessing ###

# Function to clean and preprocess text
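clean_text <- function(text) {
  text <- tolower(text)  # Convert to lowercase
  text <- removePunctuation(text)  # Remove punctuation
  text <- removeNumbers(text)  # Remove numbers
  text <- removeWords(text, stopwords("en"))  # Remove stopwords
  text <- stripWhitespace(text)  # Collapse extra whitespace
  # wordStem() expects a vector of single words, so split, stem, and rejoin
  words <- unlist(strsplit(trimws(text), " ", fixed = TRUE))
  words <- wordStem(words, language = "en")  # Perform stemming
  text <- paste(words, collapse = " ")
  return(text)
}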

# Apply the cleaning function to the article content
articles$content_clean <- sapply(articles$content, clean_text, USE.NAMES = FALSE)

# View the cleaned content
head(articles$content_clean)

### Step 4: Word Frequency ###

# Tokenize the cleaned content
articles_tidy <- articles %>%
  unnest_tokens(word, content_clean)

# Count word frequencies
word_freq <- articles_tidy %>%
  count(word, sort = TRUE)

# View the most frequent words
head(word_freq, 22)

# Visualize the most common words 
word_freq %>%
  filter(n > 30) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Most Common Words in Articles", x = "Word", y = "Frequency")

### Step 5: Sentiment Analysis ###

# Load sentiment lexicon
sentiment_lexicon <- get_sentiments("bing")

# Perform sentiment analysis
article_sentiments <- articles_tidy %>%
  inner_join(sentiment_lexicon, by = "word") %>%
  count(url, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = list(n = 0)) %>%
  mutate(sentiment_score = positive - negative)

# View sentiment scores by article
print(article_sentiments)

# Visualize sentiment scores 
article_sentiments %>%
  ggplot(aes(x = url, y = sentiment_score, fill = factor(sentiment_score > 0))) +
  geom_col() +
  coord_flip() +
  labs(title = "Sentiment Scores by Article", x = "Article", y = "Sentiment Score")

### Step 6: Topic Modeling ###

# Create a document-term matrix
dtm <- articles_tidy %>%
  count(url, word) %>%
  cast_dtm(url, word, n)

# Fit the LDA model with a chosen number of topics (k = 3)
lda_model <- LDA(dtm, k = 3, control = list(seed = 1234))

# Extract topics
topics <- tidy(lda_model, matrix = "beta")

# View top terms in each topic
top_terms <- topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
print(top_terms)

# Visualize top terms in each topic 
top_terms %>%
  ggplot(aes(x = reorder_within(term, beta, topic), y = beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_x_reordered() +
  labs(title = "Top Terms in Each Topic", x = "Term", y = "Beta")
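
The sentiment score in Step 5 is the number of positive lexicon words minus the number of negative lexicon words in each article. The following is a minimal base R sketch of that arithmetic on toy input. The mini-lexicon, the tokens, and the article labels "a" and "b" are all hypothetical and stand in for the Bing lexicon and the scraped articles; it illustrates the calculation only and is not the analysis itself.

```r
# Hypothetical mini-lexicon (illustration only, not the Bing lexicon)
lexicon <- data.frame(
  word = c("peace", "hope", "attack", "violence"),
  sentiment = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Hypothetical tokens from two made-up articles, "a" and "b"
tokens <- data.frame(
  url = c("a", "a", "a", "b", "b", "b"),
  word = c("peace", "attack", "violence", "hope", "peace", "talks"),
  stringsAsFactors = FALSE
)

# Keep only tokens found in the lexicon (the inner join step);
# "talks" is not in the lexicon, so it drops out here
matched <- merge(tokens, lexicon, by = "word")

# Count positive and negative words per article; fixing the factor
# levels keeps a column even when one sentiment never occurs
counts <- table(
  factor(matched$url, levels = unique(tokens$url)),
  factor(matched$sentiment, levels = c("positive", "negative"))
)

# Sentiment score = positive count minus negative count
sentiment_score <- counts[, "positive"] - counts[, "negative"]
print(sentiment_score)
```

In this toy case article "a" scores -1 (one positive word, two negative) and article "b" scores 2 (two positive words, none negative). Words missing from the lexicon contribute nothing, so a score reflects only the share of an article's vocabulary that the lexicon covers.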

Rationale Behind Key Decisions

  1. Word Frequency Analysis:
  2. Sentiment Analysis:
  3. Topic Modeling:

Findings and Interpretation

Word Frequency Data



The word frequency analysis of the collected news articles provides insights into the most prominent topics and recurring themes related to the Southern Thailand insurgency. The analysis focuses on the most frequent words, which help highlight key subjects discussed in the articles.
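
The counting behind these figures can be sketched in base R on toy input. The two sentences and the short stop word list below are hypothetical and are not drawn from the dataset; the sketch only shows how a reporting verb such as "said" rises to the top once common function words are removed.

```r
# Two made-up sentences standing in for article text
docs <- c(
  "Officials said the talks would resume",
  "Residents said the attack came at night"
)

# Lowercase, then split on anything that is not a letter
tokens <- unlist(strsplit(tolower(docs), "[^a-z]+"))
tokens <- tokens[tokens != ""]

# Hypothetical stop word list, far shorter than a real English list
stop_words_demo <- c("the", "at", "would")
tokens <- tokens[!tokens %in% stop_words_demo]

# Count occurrences and sort from most to least frequent
word_freq <- sort(table(tokens), decreasing = TRUE)
print(word_freq)
```

Here "said" appears twice and every other remaining word once, so it heads the ranking. Standard stop word lists generally leave reporting verbs in place, which is one reason they tend to dominate frequency counts of news text.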

  1. "Said" (113 occurrences): "Said" is the most frequent word. This is typical of news writing, where the word attributes direct quotes to sources such as officials, locals, and other stakeholders. Its frequency suggests that the articles rely heavily on statements and opinions from people involved in or affected by the insurgency.