110 min read

Text Mining, Networks and Visualization: Plebiscito Tweets

Nowadays social media generates a vast amount of raw data (text, images, videos, etc). It is a very interesting challenge to discover techniques to get insights on the content and development of social media data. In addition, as a fundamental component of the analysis, it is important to find ways of communicating the results, i.e. data visualization. In this post I want to present a small case study where I analyze Twitter text data. The aim is not to give a complete analysis (as it would require many interations), but rather to describe how to to start. The emphasis of this post is in the data manipulation and data visualization. In particular, I describe how networks (graphs) can be used as data structures to describe text relations (some measure of pairwise count occurrences).

The topic I chose to run the analysis is the Colombian peace agreement referendum (Plebiscito), celebrated on 2016-10-02. You can find more information about it here. In a previous post I described how to get (scraping) the referendum results data per town.

The analysis is done in R and it is mainly motivated by the techniques presented in the book Text Mining with R.

1. Data Source

The data for the analysis consists of ~ 33.7K Twitter posts, generated between the 2016-10-02 and 2016-10-03, containing relevant hashtags related the the Plebiscito. The data is freely available at Plebicito Tweets 2016 on the website data.world. The raw data was collected (using Twitter API) by Victor Ramirez. On his website you can find the data gathering description and a brief initial analysis. The hashtags tracked were:

  • #HoyVotoSi, #SiALaPaz, #YoVotoSi
  • #HoyVotoNo, #SoyColombiaNO, #YoVotoNo

Remark: No user-specific data was used. All account references on the tweets were deleted.

2. Prepare Notebook

Let us load the required libraries.

# Data Wrangling and Visualization
library(glue)
library(cowplot)
library(magrittr)
library(plotly)
library(tidyverse)
library(widyr)
# Date & Time Manipulation.
library(hms)
library(lubridate) 
# Text Mining
library(tidytext)
library(tm)
library(wordcloud)
# Network Analysis
library(igraph)
# Network Visualization (D3.js)
library(networkD3)

# Set notebook directory.
MAIN.DIR <- here::here()

3. Read Data

# Set file path.
file.name <- glue('{MAIN.DIR}/static/Data/plebiscito.json')

# Read each line. 
tweets.raw.list <- map(.x = read_lines(file = file.name), 
                       .f = rjson::fromJSON)

# Parse subset of the data into a tibble. 
tweets.raw.df <- tweets.raw.list %>% 
  
                    map_df(.f = ~ data.frame(
                                    # Select non-user related data.
                                    ID = .x$id_str,
                                    Created_At = .x$created_at,
                                    Text = .x$text, 
                                    stringsAsFactors = FALSE
                                    )
                           ) %>% 
                    as_tibble()

tweets.raw.df %>% 
  # We  do not want to display accounts.
  filter(!str_detect(string = Text, pattern = '@')) %>% 
  head 
## # A tibble: 6 x 3
##   ID           Created_At           Text                                  
##   <chr>        <chr>                <chr>                                 
## 1 78261639629… Sun Oct 02 16:21:03… #HappySunday  #FelizDomingo #DomingoD…
## 2 78261640341… Sun Oct 02 16:21:05… Por una Colombia diferente. #VotoSiAL…
## 3 78261640705… Sun Oct 02 16:21:06… Me voy acercando a mi puesto de votac…
## 4 78261640845… Sun Oct 02 16:21:06… Hoy todos los que vamos por el SI, al…
## 5 78261641049… Sun Oct 02 16:21:07… "#HoyVotoSi\nPor un gran futuro para …
## 6 78261641683… Sun Oct 02 16:21:08… #VotoSiALaPaz #HoyVotoSi #VotacionesS…

Let us see the structure of this tibble.

tweets.raw.df %>% glimpse
## Observations: 33,762
## Variables: 3
## $ ID         <chr> "782616396299730944", "782616402310094849", "782616...
## $ Created_At <chr> "Sun Oct 02 16:21:03 +0000 2016", "Sun Oct 02 16:21...
## $ Text       <chr> "#HappySunday  #FelizDomingo #DomingoDeGanarSeguido...

We parse the Created_At column into a date format. Note that in the raw file it has type character.

tweets.raw.df %>% slice(1:4) %>% pull(Created_At) 
## [1] "Sun Oct 02 16:21:03 +0000 2016" "Sun Oct 02 16:21:05 +0000 2016"
## [3] "Sun Oct 02 16:21:05 +0000 2016" "Sun Oct 02 16:21:06 +0000 2016"
tweets.raw.df %<>% 
  mutate(Created_At = Created_At %>% 
                        # Remove zeros.
                        str_remove_all(pattern = '\\+0000') %>%
                        # Parse date.
                        parse_date_time(orders = '%a %b %d %H%M%S %Y'))

tweets.raw.df %>% 
  filter(!str_detect(string = Text, pattern = '@')) %>% 
  head
## # A tibble: 6 x 3
##   ID           Created_At          Text                                   
##   <chr>        <dttm>              <chr>                                  
## 1 78261639629… 2016-10-02 16:21:03 #HappySunday  #FelizDomingo #DomingoDe…
## 2 78261640341… 2016-10-02 16:21:05 Por una Colombia diferente. #VotoSiALa…
## 3 78261640705… 2016-10-02 16:21:06 Me voy acercando a mi puesto de votaci…
## 4 78261640845… 2016-10-02 16:21:06 Hoy todos los que vamos por el SI, ale…
## 5 78261641049… 2016-10-02 16:21:07 "#HoyVotoSi\nPor un gran futuro para C…
## 6 78261641683… 2016-10-02 16:21:08 #VotoSiALaPaz #HoyVotoSi #VotacionesSe…

4. Timeline Analysis

As the Created_At column makes reference to the UTC time, we neet to substract 5 hours from it to get the Colombian time.

# We substract seconds, that is why we need three factors. 
tweets.raw.df %<>% mutate(Created_At = Created_At - 5*60*60)

Let us compute the time range:

tweets.raw.df %>% pull(Created_At) %>% min
## [1] "2016-10-02 11:21:03 UTC"
tweets.raw.df %>% pull(Created_At) %>% max
## [1] "2016-10-03 17:43:47 UTC"

We see that the time frame covered by this data set is essentially the day and the day after the referendum.

We create a new variable which “forgets” the seconds of Create_At column.

tweets.raw.df %<>% 
  mutate(Created_At_Round = Created_At %>% round(units = 'mins') %>% as.POSIXct)

We now plot the time series of tweets count per minute.

plt <- tweets.raw.df %>% 
        count(Created_At_Round) %>% 
        ggplot(mapping = aes(x = Created_At_Round, y = n)) +
        theme_light() +
        geom_line() +
        xlab(label = 'Date') +
        ylab(label = NULL) +
        ggtitle(label = 'Number of Tweets per Minute')

plt %>% ggplotly

There is an interesting peak at around 2016-10-02 19:28:00, which is essentially when the referendum results where known. Let us have a look at some tweets after durinng this peak:

results.time <- as.POSIXct(x = '2016-10-02 19:28:00')

tweets.raw.df %>% 
  filter(Created_At_Round > results.time ) %>% 
  select(Text) %>% 
  filter(!str_detect(string = Text, pattern = '@')) %>% 
  pull(Text) %>% 
  head(20) 
##  [1] "Espero que todos los que votaron por el NO mañana mismo se vayan para el monte con su rifle... \U0001f611\U0001f611\n\n#ColombiaDecide #YoVotoSi"
##  [2] "Colombia quiere paz, pero no con tanta impunidad y beneficios. Colombia despertó!! \U0001f64f\U0001f44f\U0001f4aa. #SoyColombiaNO"               
##  [3] "Re-coherentes los hptas. https://t.co/UbUeqsUiD6"                                                                                                
##  [4] "Y Colombia botó su voto dejando ganar el no... #ColombiaDecide #HoyVotoSi"                                                                       
##  [5] "Ay martincito...sos una caricatura! Que pesar! https://t.co/0kokGc52w2"                                                                          
##  [6] "En las zonas golpeadas por la guerra gano el #SiALaPaz, siento #dolor de #patria #ColombiaDecide #pais #sinmemoria."                             
##  [7] "Ganamos Colombia #ColombiaDecide no lo puedo creer!  #sialapaz #noalplebiscito"                                                                  
##  [8] "Sé que he cumplido con mi deber, tristemente todos hemos perdido, llamo a la unidad porque en este país cabemos todos.\n#SiALaPAZ ¿cuándo?"      
##  [9] "Jaja! https://t.co/9HlsYeT6w7"                                                                                                                   
## [10] "Deberían priorizar los resultados de los lugares afectados por la guerra. Dijeron #SiaLAPaz , no es justo que otros decidan por ellos."          
## [11] "Día triste para Colombia gana el no \ntodo este tiempo fue en vano #SiALaPaz https://t.co/bq36CRvvvx"                                            
## [12] "Tristesa #SiaLAPazenColombia #SiALaPaz https://t.co/r5yZDFLTC1"                                                                                  
## [13] "Ganó el #VotoNo #ColombiaDecide\nEs clave saber  qué va a pasar con el proceso de paz y saber si van a renegociar lo… https://t.co/6DqGBz20XF"   
## [14] "#ColombiaDecide #HoyVotoSi \nVivo en el país más ignorante del mundo,\nQue lastima pueblo!"                                                      
## [15] "#DolordePatria \n#HoyVotoSi #VotoSiSinSerSantista https://t.co/CVXZCBNNwx"                                                                       
## [16] "el que mas ayudo al #VotoNo fue Mateo, al no poder salir la gente a votar en la costa, bastión de la Mermelada y compra de votos de Santos."     
## [17] "76. Cada cartucho de fusil cuesta $ 1.040 y este dispara 300 cartuchos por minuto, lo que cuestan 900 huevos… https://t.co/NtbpwCZFD4"           
## [18] "Brexit otra vez y los más cuerdos parecen los de las FARC . Incertidumbre para los colombianos #SiALaPaz https://t.co/3wYYyU6wC2"                
## [19] "#ColombiaDecide\nLa decepción electoral debe fortalecernos para desde mañana mismo seguir generando conciencia en favor del pueblo.\n#SiALaPaz"  
## [20] "Porque será que nunca hemos podido aprender a perdonar? Porque será que nos puede el odio y la ignorancia?? #ColombiaDecide #SiALaPaz"

Indeed, the comments reflect the reactions around the referendum results.

5. Text Normalization

We want to clean and normalize the text for the analysis. We are mainly interested in the content of the tweets, not the hashtags or accounts.

tweets.df <- tweets.raw.df %>% 
               # Remove column.
               select(-  Created_At) %>% 
               # Convert to lowercase. 
               mutate(Text = Text %>% str_to_lower) %>% 
               # Remove unwanted characters. 
               mutate(Text= Text %>% str_remove_all(pattern = '\\n')) %>% 
               mutate(Text = Text %>% str_remove_all(pattern = '&amp')) %>% 
               mutate(Text = Text %>% str_remove_all(pattern = 'https://t.co/[a-z,A-Z,0-9]*')) %>% 
               mutate(Text = Text %>% str_remove_all(pattern = 'http://t.co/[a-z,A-Z,0-9]*')) %>% 
               mutate(Text = Text %>% str_remove_all(pattern = 'https')) %>% 
               mutate(Text = Text %>% str_remove_all(pattern = 'http')) %>% 
               # Remove hashtags.
               mutate(Text = Text %>% str_remove_all(pattern = '#[a-z,A-Z]*')) %>% 
               # Remove accounts.
               mutate(Text = Text %>% str_remove_all(pattern = '@[a-z,A-Z]*')) %>% 
                # Remove retweets.
               mutate(Text = Text %>% str_remove_all(pattern = 'rt [a-z,A-Z]*: ')) %>% 
               mutate(Text = Text %>% str_remove(pattern = '^(rt)')) %>% 
               mutate(Text = Text %>% str_remove_all(pattern = '\\_')) 

# Replace accents. 
replacement.list <- list('á' = 'a', 'é' = 'e', 'í' = 'i', 'ó' = 'o', 'ú' = 'u')

tweets.df %<>% 
  mutate(Text = chartr(old = names(replacement.list) %>% str_c(collapse = ''), 
                       new = replacement.list %>% str_c(collapse = ''),
                       x = Text))

In addition, we convert out text into a corpus to use the tm library.

corpus <-  Corpus(x = VectorSource(x = tweets.df$Text))

tweets.text <- corpus %>% 
                tm_map(removePunctuation) %>% 
                tm_map(removeNumbers) %>% 
                tm_map(removeWords, stopwords('spanish')) %>% 
                tm_map(PlainTextDocument) # %>% 
                # We could also use stemming by uncommenting the folowing line. 
                # tm_map(stemDocument, 'spanish')

# Recover data into original tibble.
tweets.df %<>% mutate(Text = tweets.text[[1]]$content)

We now want to extract only the hashtags of each tweet. We implement a function for this purpose.

GetHashtags <- function(tweet) {

  hashtag.vector <- str_extract_all(string = tweet, pattern = '#\\S+', simplify = TRUE) %>% 
                    as.character
  
  hashtag.string <- NA
  
  if (length(hashtag.vector) > 0) {
    
    hashtag.string <-   hashtag.vector %>% str_c(collapse = ', ')
    
  } 

  return(hashtag.string)
}

And apply it to our data:

hashtags.df <- tibble(Hashtags = tweets.raw.df$Text %>% 
                                  map_chr(.f = ~ GetHashtags(tweet = .x)))

hashtags.df %>% head
## # A tibble: 6 x 1
##   Hashtags                                                                 
##   <chr>                                                                    
## 1 #HappySunday, #FelizDomingo, #DomingoDeGanarSeguidores, #SiaLAPazenColom…
## 2 #SiALaPaz, #PlebiscitoPorLaPaz                                           
## 3 #VotoSiALaPaz, #VotoSi, #HoyVotoSi                                       
## 4 #HoyVotoSi                                                               
## 5 #HoyVotoSi                                                               
## 6 #HoyVotoSi, #SialaPaz, #SíaLAPazenColombia, #FelizDomingo

We merge these data frames together.

tweets.df %<>% bind_cols(hashtags.df) 

Finally, let us split the data before and after the results of the referendum are known, i.e. we split the Created_At_Round column with respect to the results.time.

# "m" will represent before. results.time. 
tweets.m.df <- tweets.df %>% 
                filter(Created_At_Round < results.time) %>% 
                select(ID, Text)

# "p" will represent after results.time. 
tweets.p.df <- tweets.df %>% 
                filter(Created_At_Round >= results.time) %>% 
                select(ID, Text)

6. Words Count

6.1 Tweets

We begin by counting the most popular words in the tweets.

# Remove the shortcut 'q' for 'que'.
extra.stop.words <- c('q')

stopwords.df <- tibble(word = c(stopwords(kind = 'es'), 
                              # We have some tweets in english.
                                stopwords(kind = 'en'),  
                                extra.stop.words))

words.df <- tweets.df %>% 
             unnest_tokens(input = Text, output = word) %>% 
             anti_join(y = stopwords.df, by = 'word')


word.count <- words.df %>% count(word, sort = TRUE)

word.count %>% head(10)
## # A tibble: 10 x 2
##    word            n
##    <chr>       <int>
##  1 si           7642
##  2 paz          6166
##  3 colombia     4131
##  4 pais         3577
##  5 hoy          3477
##  6 llenos       3399
##  7 mas          3180
##  8 guerra       2709
##  9 colombianos  2459
## 10 voto         2086

We can visualize these counts in a bar plot.

plt <- word.count %>% 
        # Set count threshold. 
        filter(n > 700) %>%
        mutate(word = reorder(word, n)) %>%
        ggplot(aes(x = word, y = n)) +
        theme_light() + 
        geom_col(fill = 'black', alpha = 0.8) +
        xlab(NULL) +
        coord_flip() +
        ggtitle(label = 'Top Word Count')

plt %>% ggplotly

Another popular method to visualize word count data is through a word cloud.

wordcloud(words = word.count$word, 
          freq = word.count$n, 
          min.freq = 200, 
          colors = brewer.pal(8, 'Dark2'))

We can do the same for the split data:

# Before results. 
words.m.df <- tweets.m.df %>% 
               unnest_tokens(input = Text, output = word) %>% 
               anti_join(y = stopwords.df, by = 'word')

word.count.m <- words.m.df %>% count(word, sort = TRUE)

plt.m <- word.count.m %>% 
          filter(n > 500) %>%
          mutate(word = reorder(word, n)) %>%
          ggplot(aes(x = word, y = n)) +
          theme_light() + 
          geom_col(fill = 'blue', alpha = 0.8) +
          xlab(NULL) +
          coord_flip() +
          ggtitle(label = 'Top Word Count (Before Results)')

# After results. 
words.p.df <- tweets.p.df %>% 
               unnest_tokens(input = Text, output = word) %>% 
               anti_join(y = stopwords.df, by = 'word')


word.count.p <- words.p.df %>% count(word, sort = TRUE)

plt.p <- word.count.p %>% 
          filter(n > 500) %>%
          mutate(word = reorder(word, n)) %>%
          ggplot(aes(x = word, y = n)) +
          theme_light() + 
          geom_col(fill = 'red', alpha = 0.8) +
          xlab(NULL) +
          coord_flip() +
          ggtitle(label = 'Top Word Count (After Results)')

plot_grid(... = plt.m, plt.p)

# Before the results. 
wc.m <- wordcloud(words = word.count.m$word, 
          freq = word.count.m$n, 
          min.freq = 200, 
          colors=brewer.pal(8, 'Dark2'))

# After the results.
wordcloud(words = word.count.p$word, 
          freq = word.count.p$n, 
          min.freq = 100, 
          colors=brewer.pal(8, 'Dark2'))

We can indeed see how the wording changes in these two time frames.

6.2 Hashtags

We can run an analogous analysis for hastags.

hashtags.unnested.df <- tweets.df %>% 
                          select(Created_At_Round, Hashtags) %>% 
                          unnest_tokens(input = Hashtags, output = hashtag)
  
hashtags.unnested.count <- hashtags.unnested.df %>% 
                            count(hashtag) %>% 
                            drop_na 

We plot the correspondinng word cloud.

wordcloud(words = str_c('#',hashtags.unnested.count$hashtag), 
          freq = hashtags.unnested.count$n, 
          min.freq = 40, 
          colors=brewer.pal(8, 'Dark2'))

The most popular hashtag for the ‘YES’ and ‘NO’ are #hoyvotosi #notono respectively. Let us see the volume development of these hastags.

plt <- hashtags.unnested.df %>% 
        filter(hashtag %in% c('hoyvotosi', 'votono')) %>% 
        count(Created_At_Round, hashtag) %>% 
        ggplot(mapping = aes(x  = Created_At_Round, y = n, color = hashtag)) +
        theme_light() + 
        xlab(label = 'Date') +
        ggtitle(label = 'Top Hastags Counts') +
        geom_line() + 
        scale_color_manual(values = c('hoyvotosi' = 'green3', 'votono' = 'red'))

plt %>% ggplotly

Overall, the tweets supporting the ‘YES’ had much more volume. But this was not reflected on the results.

Remark: Compare with a brief initial analysis on the hastag volume over time.

7. Network Analysis

In this section we are going to describe how to encode and visualize tex data as a weighted netwok (graph). The main ideas is to count pairwise relative occurence of words.

7.1 Bigram Analysis

7.1.1 Network Definition

We want to count pairwise occurences of words which apperar together in the text, this is what is known as bigram count.

bi.gram.words <- tweets.df %>% 
                  unnest_tokens(input = Text, 
                                output = bigram, 
                                token = 'ngrams', 
                                n = 2) %>% 
                  filter(! is.na(bigram))

bi.gram.words %>% select(bigram) %>% head(10)
## # A tibble: 10 x 1
##    bigram            
##    <chr>             
##  1 paris bogota      
##  2 bogota empanadas  
##  3 colombia diferente
##  4 voy acercando     
##  5 acercando puesto  
##  6 puesto votacion   
##  7 votacion siento   
##  8 siento mariposas  
##  9 mariposas estomago
## 10 estomago siento

Next, we filter for stop words and remove white spaces.

bi.gram.words %<>% 
  separate(col = bigram, into = c('word1', 'word2'), sep = ' ') %>% 
  filter(! word1 %in% stopwords.df$word) %>% 
  filter(! word2 %in% stopwords.df$word) %>% 
  filter(! is.na(word1)) %>% 
  filter(! is.na(word2)) 

Finally, we group and count by bigram.

bi.gram.count <- bi.gram.words %>% count(word1, word2, sort = TRUE)

bi.gram.count %>% head
## # A tibble: 6 x 3
##   word1       word2             n
##   <chr>       <chr>         <int>
## 1 llenos      resentimiento  1688
## 2 ambicion    votaron        1686
## 3 colombianos llenos         1686
## 4 liderados   llenos         1686
## 5 llenos      ambicion       1686
## 6 negar       vict           1686

How to define a weighted network from a bigram count?

  • Each word wis going to represent a node.
  • Two words ae going to be connected if they appear as a bigram.
  • The weight of an edge is the numer of times the bigram appears in the corpus.
  • (Optional) We are free to decide if we want the graph to be directed or not.

We are going to use the igraph library to work with networks.

Remark: The reference A User’s Guide to Network Analysis in R is highly recomended if you want to go deeper into network analysis in R.

For visualization purposes, we can set a threshold which defines the minimal weight allowed in the graph:

threshold <- 280

network <-  bi.gram.count %>%
              filter(n > threshold) %>%
              graph_from_data_frame(directed = FALSE)

Let us see how the network object looks like:

network
## IGRAPH ef1c0d8 UN-- 30 27 -- 
## + attr: name (v/c), n (e/n)
## + edges from ef1c0d8 (vertex names):
##  [1] llenos   --resentimiento ambicion --votaron      
##  [3] llenos   --colombianos   llenos   --liderados    
##  [5] llenos   --ambicion      negar    --vict         
##  [7] liderados--resentimiento negar    --votaron      
##  [9] resultado--plebiscito    debe     --buscar       
## [11] debe     --plebiscito    unidad   --pais         
## [13] unidad   --buscar        humildad --mucha        
## [15] mucha    --responsabilid pais     --recibimos    
## + ... omitted several edges

7.1.2 Visualization

To visualize the network (here is a great reference for it) we can simply use the plot function with some additional parameters:

 plot(network, 
      vertex.size = 1,
      vertex.label.color = 'black', 
      vertex.label.cex = 0.7, 
      vertex.label.dist = 1,
      edge.color = 'gray', 
      main = 'Bigram Count Network', 
      sub = glue('Weight Threshold: {threshold}'), 
      alpha = 50)

We can add some additional information to the visualization: Set the sizes of the nodes and the edges by the dregree and weight respectively.

# Store the degree.
V(network)$degree <- degree(graph = network)

# Compute the weight shares.
E(network)$width <- E(network)$n/max(E(network)$n)

 plot(network, 
      vertex.color = 'lightblue',
      # Scale node size by degree.
      vertex.size = 2*V(network)$degree,
      vertex.label.color = 'black', 
      vertex.label.cex = 0.6, 
      vertex.label.dist = 1.6,
      edge.color = 'gray', 
      # Set edge width proportional to the weight relative value.
      edge.width = 3*E(network)$width ,
      main = 'Bigram Count Network', 
      sub = glue('Weight Threshold: {threshold}'), 
      alpha = 50)

We can extract the biggest connected component of the network as follows:

# Get all connected components
clusters(graph = network)
## $membership
##        llenos      ambicion   colombianos     liderados         negar 
##             1             1             1             1             1 
## resentimiento       votaron     resultado          debe    plebiscito 
##             1             1             2             2             2 
##        unidad        buscar      humildad         mucha          pais 
##             2             2             2             2             2 
##     recibimos          vote       ilusion         votar        guerra 
##             2             2             2             2             3 
##         deber            si           paz            da          vict 
##             4             2             2             5             1 
## responsabilid           mas      cumplido      colombia          gana 
##             2             3             4             2             5 
## 
## $csize
## [1]  8 16  2  2  2
## 
## $no
## [1] 5
# Select biggest connected component.  
V(network)$cluster <- clusters(graph = network)$membership

cc.network <- induced_subgraph(graph = network,
                               vids = which(V(network)$cluster == which.max(clusters(graph = network)$csize)))

cc.network 
## IGRAPH 33e034b UN-- 16 16 -- 
## + attr: name (v/c), degree (v/n), cluster (v/n), n (e/n), width
## | (e/n)
## + edges from 33e034b (vertex names):
##  [1] resultado--plebiscito    debe     --buscar       
##  [3] debe     --plebiscito    unidad   --pais         
##  [5] unidad   --buscar        humildad --mucha        
##  [7] mucha    --responsabilid pais     --recibimos    
##  [9] humildad --recibimos     vote     --paz          
## [11] vote     --ilusion       vote     --ilusion      
## [13] votar    --si            pais     --paz          
## + ... omitted several edges
# Store the degree.
V(cc.network)$degree <- degree(graph = cc.network)

# Compute the weight shares.
E(cc.network)$width <- E(cc.network)$n/max(E(cc.network)$n)

 plot(cc.network, 
      vertex.color = 'lightblue',
      # Scale node size by degree.
      vertex.size = 2*V(cc.network)$degree,
      vertex.label.color = 'black', 
      vertex.label.cex = 0.6, 
      vertex.label.dist = 1.6,
      edge.color = 'gray', 
      # Set edge width proportional to the weight relative value.
      edge.width = 3*E(cc.network)$width ,
      main = 'Bigram Count Network (Biggest Connected Component)', 
      sub = glue('Weiight Threshold: {threshold}'), 
      alpha = 50)

We can go a steph further and make our visualization more dynamic using the networkD3 library.

# Treshold
threshold <- 250

network <-  bi.gram.count %>%
              filter(n > threshold) %>%
              graph_from_data_frame(directed = FALSE)

# Store the degree.
V(network)$degree <- degree(graph = network)
# Compute the weight shares.
E(network)$width <- E(network)$n/max(E(network)$n)

# Create networkD3 object.
network.D3 <- igraph_to_networkD3(g = network)
# Define node size.
network.D3$nodes %<>% mutate(Degree = 10*V(network)$degree)
# Degine color group (I will explore this feature later).
network.D3$nodes %<>% mutate(Group = 1)
# Define edges width. 
network.D3$links$Width <- 10*E(network)$width

forceNetwork(Links = network.D3$links, 
             Nodes = network.D3$nodes, 
             Source = 'source', 
             Target = 'target',
             NodeID = 'name',
             Group = 'Group', 
             opacity = 0.9,
             Value = 'Width',
             Nodesize = 'Degree', 
             # We input a JavaScript function.
             linkWidth = JS("function(d) { return Math.sqrt(d.value); }"), 
             fontSize = 12,
             zoom = TRUE, 
             opacityNoHover = 1)

Let us now decrease the threshold to get a more complex network (zoom out to see it all!).

# Treshold
threshold <- 80

network <-  bi.gram.count %>%
              filter(n > threshold) %>%
              graph_from_data_frame(directed = FALSE)

# Store the degree.
V(network)$degree <- degree(graph = network)
# Compute the weight shares.
E(network)$width <- E(network)$n/max(E(network)$n)

# Create networkD3 object.
network.D3 <- igraph_to_networkD3(g = network)
# Define node size.
network.D3$nodes %<>% mutate(Degree = 3*V(network)$degree)
# Degine color group (I will explore this feature later).
network.D3$nodes %<>% mutate(Group = 1)
# Define edges width. 
network.D3$links$Width <- 10*E(network)$width

forceNetwork(Links = network.D3$links, 
             Nodes = network.D3$nodes, 
             Source = 'source', 
             Target = 'target',
             NodeID = 'name',
             Group = 'Group', 
             opacity = 0.9,
             Value = 'Width',
             Nodesize = 'Degree', 
             # We input a JavaScript function.
             linkWidth = JS("function(d) { return Math.sqrt(d.value); }"), 
             fontSize = 12,
             zoom = TRUE, 
             opacityNoHover = 1)

7.2 Skipgram Analyis

7.2.1 Network Definition

Now we are going to consider skipgrams, which allow a “jump” in thw word count:

skip.window <- 2

skip.gram.words <- tweets.df %>% 
                    unnest_tokens(input = Text, 
                                  output = skipgram, 
                                  token = 'skip_ngrams', 
                                  n = skip.window) %>% 
                    filter(! is.na(skipgram))

For example, consider the tweet:

tweets.df %>% slice(4) %>% pull(Text)
## [1] " voy acercando   puesto  votacion  siento mariposas   estomago siento nervios  felicidad "

The skipgrams are:

skip.gram.words %>% select(skipgram) %>% slice(10:20)
## # A tibble: 11 x 1
##    skipgram          
##    <chr>             
##  1 voy               
##  2 voy acercando     
##  3 voy puesto        
##  4 acercando         
##  5 acercando puesto  
##  6 acercando votacion
##  7 puesto            
##  8 puesto votacion   
##  9 puesto siento     
## 10 votacion          
## 11 votacion siento

We now count the skipgrams containing two words.

skip.gram.words$num_words <- skip.gram.words$skipgram %>% 
                              map_int(.f = ~ ngram::wordcount(.x))

skip.gram.words %<>% filter(num_words == 2) %>% select(- num_words)

skip.gram.words %<>% 
  separate(col = skipgram, into = c('word1', 'word2'), sep = ' ') %>% 
  filter(! word1 %in% stopwords.df$word) %>% 
  filter(! word2 %in% stopwords.df$word) %>% 
  filter(! is.na(word1)) %>% 
  filter(! is.na(word2)) 

skip.gram.count <- skip.gram.words  %>% count(word1, word2, sort = TRUE)

skip.gram.count %>% head
## # A tibble: 6 x 3
##   word1       word2             n
##   <chr>       <chr>         <int>
## 1 llenos      resentimiento  1688
## 2 colombianos llenos         1687
## 3 llenos      votaron        1687
## 4 ambicion    negar          1686
## 5 ambicion    votaron        1686
## 6 colombianos resentimiento  1686

7.2.2 Visualization

Similarly as above, we construct and visualize the corresponding network (we select the biggest connected component):

# Treshold
threshold <- 80

network <-  skip.gram.count %>%
              filter(n > threshold) %>%
              graph_from_data_frame(directed = FALSE)

# Select biggest connected component.  
V(network)$cluster <- clusters(graph = network)$membership

cc.network <- induced_subgraph(graph = network,
                               vids = which(V(network)$cluster == which.max(clusters(graph = network)$csize)))

# Store the degree.
V(cc.network)$degree <- degree(graph = cc.network)
# Compute the weight shares.
E(cc.network)$width <- E(cc.network)$n/max(E(cc.network)$n)

# Create networkD3 object.
network.D3 <- igraph_to_networkD3(g = cc.network)
# Define node size.
network.D3$nodes %<>% mutate(Degree = 2*V(cc.network)$degree)
# Degine color group (I will explore this feature later).
network.D3$nodes %<>% mutate(Group = 1)
# Define edges width. 
network.D3$links$Width <- 10*E(cc.network)$width

forceNetwork(Links = network.D3$links, 
             Nodes = network.D3$nodes, 
             Source = 'source', 
             Target = 'target',
             NodeID = 'name',
             Group = 'Group', 
             opacity = 0.9,
             Value = 'Width',
             Nodesize = 'Degree', 
             # We input a JavaScript function.
             linkWidth = JS("function(d) { return Math.sqrt(d.value); }"), 
             fontSize = 12,
             zoom = TRUE, 
             opacityNoHover = 1)

7.2.3 Node Importance

There are many notions of node importance in a network (A User’s Guide to Network Analysis in R, Section 7.2). Here we compare three of them

  • Degree centrality
  • Closeness centrality
  • Betweenness centrality
# Compute the centrality measures for the biggest connected component from above.
node.impo.df <- tibble(word = V(cc.network)$name,  
                       degree = degree(graph = cc.network),
                       closeness = closeness(graph = cc.network), 
                       betweenness = betweenness(graph = cc.network))

Now we rank the nodes with respect to these centrality measures:

  • Degree centrality
node.impo.df %>% arrange(- degree) %>% head(10)
## # A tibble: 10 x 4
##    word     degree closeness betweenness
##    <chr>     <dbl>     <dbl>       <dbl>
##  1 paz          50   0.00211       8673.
##  2 si           45   0.00212       8969.
##  3 mas          32   0.00183       3038.
##  4 hoy          24   0.00191       4206.
##  5 colombia     24   0.00190       3903.
##  6 pais         21   0.00173       2350.
##  7 voto         20   0.00181       2487.
##  8 pueblo       16   0.00172       3318.
##  9 guerra       15   0.00172       1087.
## 10 vote         10   0.00163       1692.
  • Closeness centrality
node.impo.df %>% arrange(- closeness) %>% head(10)
## # A tibble: 10 x 4
##    word     degree closeness betweenness
##    <chr>     <dbl>     <dbl>       <dbl>
##  1 si           45   0.00212       8969.
##  2 paz          50   0.00211       8673.
##  3 hoy          24   0.00191       4206.
##  4 colombia     24   0.00190       3903.
##  5 mas          32   0.00183       3038.
##  6 voto         20   0.00181       2487.
##  7 pais         21   0.00173       2350.
##  8 guerra       15   0.00172       1087.
##  9 pueblo       16   0.00172       3318.
## 10 vota          9   0.00169        495.
  • Betweenness centrality
node.impo.df %>% arrange(- betweenness) %>% head(10)
## # A tibble: 10 x 4
##    word       degree closeness betweenness
##    <chr>       <dbl>     <dbl>       <dbl>
##  1 si             45   0.00212       8969.
##  2 paz            50   0.00211       8673.
##  3 hoy            24   0.00191       4206.
##  4 colombia       24   0.00190       3903.
##  5 pueblo         16   0.00172       3318.
##  6 mas            32   0.00183       3038.
##  7 voto           20   0.00181       2487.
##  8 pais           21   0.00173       2350.
##  9 vote           10   0.00163       1692.
## 10 colombiano      8   0.00143       1253.

Let us see the distribution of these centrality measures.

plt.deg <- node.impo.df %>% 
            ggplot(mapping = aes(x = degree)) +
            theme_light() +
            geom_histogram(fill = 'blue', alpha = 0.8, bins = 30)

plt.clo <- node.impo.df %>% 
            ggplot(mapping = aes(x = closeness)) +
            theme_light() +
            geom_histogram(fill = 'red', alpha = 0.8, bins = 30)

plt.bet <- node.impo.df %>% 
            ggplot(mapping = aes(x = betweenness)) +
            theme_light() +
            geom_histogram(fill = 'green4', alpha = 0.8, bins = 30)

plot_grid(... = plt.deg, plt.clo, plt.bet, ncol = 1, align = 'v')

7.2.4 Community Detection

We can try to find clusters within the network. We use the Louvain Method for community detection:

comm.det.obj <- cluster_louvain(graph = cc.network, weights = E(cc.network)$n)

comm.det.obj
## IGRAPH clustering multi level, groups: 12, mod: 0.7
## + groups:
##   $`1`
##    [1] "vote"       "ilusion"    "colombia"   "paz"        "deber"     
##    [6] "dia"        "niños"      "nueva"      "cumplido"   "acuerdos"  
##   [11] "patria"     "ponen"      "dijeron"    "farcsantos" "peligro"   
##   [16] "justicia"   "estable"    "construir"  "gabriel"    "garcia"    
##   [21] "proceso"    "encuentre"  "mata"       "jaime"      "sueño"     
##   [26] "viva"       "social"     "queremos"   "infantil"   "democracia"
##   [31] "duradera"   "marquez"    "garzon"     "solo"       "t"         
##   [36] "pardo"     
##   
##   + ... omitted several groups/vertices

We see that 12 groups where identified and the modularity is 0.7 (which is good, as it is close to 1).

Modularity is as chance-corrected statistic, and is defined as the fraction of ties that fall within the given groups minus the expected such fraction if the ties were distributed at random. (A User’s Guide to Network Analysis in R, Section 8.3.1)

Now we encode the membership as a node atribute (zoom and click on each node to explore the clusters).

V(cc.network)$membership <- membership(comm.det.obj)
# We use the membership label to color the nodes.
network.D3$nodes$Group <- V(cc.network)$membership

forceNetwork(Links = network.D3$links, 
             Nodes = network.D3$nodes, 
             Source = 'source', 
             Target = 'target',
             NodeID = 'name',
             Group = 'Group', 
             opacity = 0.9,
             Value = 'Width',
             Nodesize = 'Degree', 
             # We input a JavaScript function.
             linkWidth = JS("function(d) { return Math.sqrt(d.value); }"), 
             fontSize = 12,
             zoom = TRUE, 
             opacityNoHover = 1)

Let us collect the words per cluster:

membership.df <- tibble(word = V(cc.network) %>% names,
                        cluster = V(cc.network)$membership)

V(cc.network)$membership %>%
  unique %>% 
  sort %>% 
  map_chr(.f = function(cluster.id) {
    
    membership.df %>% 
      filter(cluster == cluster.id) %>% 
      # Get 15 at most 15 words per cluster.
      slice(1:15) %>% 
      pull(word) %>% 
      str_c(collapse = ', ')
    
  }) 
##  [1] "vote, ilusion, colombia, paz, deber, dia, niños, nueva, cumplido, acuerdos, patria, ponen, dijeron, farcsantos, peligro"                
##  [2] "pueblo, civilizado, dijo, esquizofrenico, listo, mujica, voltear, siento, valiente, lucha, altivo, dijimos, gobernar, nariño, orgulloso"
##  [3] "si, hoy, votar, voto, sabremos, madrid, derecho, amsterdam, berlin, barcelona, shanghai, sibelgica, valencia, va, vota"                 
##  [4] "guerra, mas, acuerdo, amor, salida, ahora, cambiar, puede, importante, buen, gobierno, hacer, historia, mal, victimas"                  
##  [5] "hacemos, acompañan, estan, gaviria, militantes, pacho, espiritu, tc"                                                                    
##  [6] "hijo, arrebato, intolerancia, leonardo, posada, reina"                                                                                  
##  [7] "llenos, colombianos, ambicion, liderados, negar, resentimiento, votaron, vict"                                                          
##  [8] "resultado, debe, buscar, plebiscito, unidad, humildad, mucha, pais, recibimos, mejor, mama, soño, seguir, luchando, conviccion"         
##  [9] "da, hace, denuncia, estadio, haciendo, incumpliendo, norma, santos, trampa, gana"                                                       
## [10] "años, enemigo, quede, claro, matado, verdadero"                                                                                         
## [11] "votamos, familia, ninguna, abuelo, asesinado, farc, perdon, secuestrado, sino, premio, pase"                                            
## [12] "bandera, carmena, disimula, venezuela, vergonzoso"

7.3 Correlation Analysis (Phi Coefficient)

7.3.1 Network Definition

The focus of the phi coefficient is how much more likely it is that either both word X and Y appear, or neither do, than that one appears without the other. (Text Mining with R, Section 4.2.2). You can read more about this correlation measure here.

cor.words <- words.df %>% 
              group_by(word) %>% 
              filter(n() > 10) %>% 
              pairwise_cor(item = word, feature = ID) 

7.3.1 Visualization

Let us visualize the correlation of two important nodes in the network:

topic.words <- c('uribe', 'santos', 'farc')
# Set correlation threshold. 
threshold = 0.1

network <-  cor.words %>%
              # filter for relevant nodes.
              filter((item1 %in% topic.words | item2 %in% topic.words)) %>% 
              rename(n = correlation) %>% 
              filter(n > threshold) %>%
              graph_from_data_frame()
  
V(network)$degree <- degree(graph = network)

E(network)$width <- E(network)$n/max(E(network)$n)

network.D3 <- igraph_to_networkD3(g = network)

network.D3$nodes %<>% mutate(Degree = 4*V(network)$degree)

# Define color groups. 
network.D3$nodes$Group <- network.D3$nodes$name %>% 
                            as.character %>% 
                            map_dbl(.f = function(name) {
                                        index <- which(name == topic.words) 
                                        ifelse(test = length(index) > 0,
                                               yes = index, 
                                               no = 0)
                            })

network.D3$links %<>% mutate(Width = 10*E(network)$width)

forceNetwork(Links = network.D3$links, 
             Nodes = network.D3$nodes, 
             Source = 'source', 
             Target = 'target',
             NodeID = 'name',
             Group = 'Group', 
             # We color the nodes using JavaScript code.
             colourScale = JS('d3.scaleOrdinal().domain([0,1,2]).range(["gray", "blue", "red", "black"])'), 
             opacity = 0.8,
             Value = 'Width',
             Nodesize = 'Degree', 
             # We define edge properties using JavaScript code.
             linkWidth = JS("function(d) { return Math.sqrt(d.value); }"), 
             linkDistance = JS("function(d) { return 550/(d.value + 1); }"), 
             fontSize = 18,
             zoom = TRUE, 
             opacityNoHover = 1)

8. Conclusions & Remarks

In this post we explored how to get first insights from social media text data (Twitter). First, we presented how clean and normalize text data. Next, as a first approach, we saw how (pairwise) word counts give important information about the content and relations of the input text corpus. In addition, we studied how use networks as data structures to analyze and represent various count measures (bigram, skipgram, correlation). We also showed how we can cluster these words to get meaningful groups defining topics. But this is just the beginning! Next steps will include going into the meaning and class of the words (part of speech tagging and named entity recognition). We will explore these techniques in a future post.