Tutorial 7: Introduction to String Analysis.
Exercise 1: grep()/grepl()
a 1.1 Extract only the strings that contain an email address from the
following character vector
a <- c("www.google.com", "www.yahoo.com", "fisher@gmail.com", "www.youtube.com", "thompson@hotmail.com")
.
[1] "fisher@gmail.com" "thompson@hotmail.com"
b 1.2 Display TRUE
if the last name of the person starts with A or
G and FALSE
otherwise using the following character vector
b <- c("Anderson", "Abel", "Armstrong", "Barbosa", "Brunton", "Boucher", "Crossley", "Cameron", "Cleveland", "Delatorre", "Durrant", "Ellwood", "Eaton", "Gibbins", "Griff", "Guzman")
[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE TRUE TRUE TRUE
Exercise 2: sub()/gsub()
a 2.1 Clean the following strings and transform this character vector
c <- c("cSVgKl9yyb", "e7w4o11oh8", "iYdWYvV7b2", "Epal3cNuGH", "NNhbMR0ocT", "fYaRvoag8B", "LO4fkHm7Kn", "JK8jKhS5De", "DcMAZ7Rxtp", "sV0tqC8XSd")
into an integer type of vector.
[1] 9 74118 72 3 0 8 47 85 7 8
- 2.2 Clean the following strings by removing the quotes the following
character vector
d <- c("\"abilene\"" , "\"christian\"", "\"university\"", "\"adelphi\"", "\"university\"", "\"adrian\"", "\"college\"", "\"agnes\"", "\"scott\"", "\"college\"", "\"alaska\"", "\"pacific\"", "\"university\"")
[1] "abilene" "christian" "university" "adelphi" "university"
[6] "adrian" "college" "agnes" "scott" "college"
[11] "alaska" "pacific" "university"
Exercise 3: strsplit
- 3.1 Split the following string of characters, identifying each
sentence in the string by using
"\\."
(the period) at the end of each sentence.f <- "Down, down, down. There was nothing else to do, so Alice soon began talking again. "Dinah'll miss me very much to-night, I should think!" (Dinah was the cat.) "I hope they'll remember her saucer of milk at tea-time. Dinah my dear! I wish you were down here with me! There are no mice in the air, I'm afraid, but you might catch a bat, and that's very like a mouse, you know. But do cats eat bats, I wonder?" And here Alice began to get rather sleepy, and went on saying to herself, in a dreamy sort of way, "Do cats eat bats? Do cats eat bats?" and sometimes, "Do bats eat cats?" for, you see, as she couldn't answer either question, it didn't much matter which way she put it. She felt that she was dozing off, and had just begun to dream that she was walking hand in hand with Dinah, and saying to her very earnestly, "Now, Dinah, tell me the truth: did you ever eat a bat?" when suddenly, thump! thump! down she came upon a heap of sticks and dry leaves, and the fall was over."
[[1]]
[1] "Down, down, down"
[2] " There was nothing else to do, so Alice soon began talking again"
[3] " "Dinah'll miss me very much to-night, I should think!" (Dinah was the cat"
[4] ") "I hope they'll remember her saucer of milk at tea-time"
[5] " Dinah my dear! I wish you were down here with me! There are no mice in the air, I'm afraid, but you might catch a bat, and that's very like a mouse, you know"
[6] " But do cats eat bats, I wonder?" And here Alice began to get rather sleepy, and went on saying to herself, in a dreamy sort of way, "Do cats eat bats? Do cats eat bats?" and sometimes, "Do bats eat cats?" for, you see, as she couldn't answer either question, it didn't much matter which way she put it"
[7] " She felt that she was dozing off, and had just begun to dream that she was walking hand in hand with Dinah, and saying to her very earnestly, "Now, Dinah, tell me the truth: did you ever eat a bat?" when suddenly, thump! thump! down she came upon a heap of sticks and dry leaves, and the fall was over"
- Create a matrix by splitting
f<-c("20-04-2018","15-07-2021","11-11-2022","08-12-2021","28-01-2020")
the following string, allocate one column for the day, month and year.
[,1] [,2] [,3]
[1,] 20 4 2018
[2,] 15 7 2021
[3,] 11 11 2022
[4,] 8 12 2021
[5,] 28 1 2020
Exercise 4: Semantic Analysis
a 4.1 Using the data set Corona_NLP_test_udpipe.csv
, (DOWNLOAD THE
DATA),
load the data, and with the udpipe
package generate barplot
of the
most common terms in the corpus (VERB, NOUN, ADJ, …).
b 4.2 Using the data set Corona_NLP_test_udpipe.csv
, load the data,
and with the udpipe
package generate barplot
of the most common
VERB, NOUN and ADJ in the corpus.
c 4.3 Using the packages wordcloud
, igraph
and udpipe
, generate a
cloud of words using the most common frequent VERB, NOUN and ADJ in the
corpus. What can you interpret from the current and previos plot?
Loading required package: RColorBrewer
Attaching package: 'igraph'
The following object is masked from 'package:tidyr':
crossing
The following object is masked from 'package:tibble':
as_data_frame
The following objects are masked from 'package:purrr':
compose, simplify
The following objects are masked from 'package:dplyr':
as_data_frame, groups, union
The following objects are masked from 'package:dials':
degree, neighbors
The following objects are masked from 'package:stats':
decompose, spectrum
The following object is masked from 'package:base':
union