Saturday, November 10, 2018

How to manipulate the Text Strings data with Regular Expressions in R

Text Strings data manipulation with Regular Expressions in R
In R, packages such as stringr and stringi provide a full set of string manipulation functions.
R also has several base functions for string manipulation. These functions are designed to complement regular expressions. The main differences between string manipulation functions and regular expressions are:
We use string manipulation functions for simple tasks such as splitting a string or extracting its first three letters. We use regular expressions for more complicated tasks such as extracting email IDs or dates from a set of text.
String manipulation functions are designed to respond in a certain way and do not deviate from that behavior, whereas regular expressions can be customized in any way we want.
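As a rough illustration of that difference, here is a small sketch. The sample string and the variable names are made up for this example. Fixed-position tasks are handled by plain string functions, while the date is found by pattern, wherever it sits in the text.

```r
# Made-up sample text for illustration
x <- "Order placed on 2018-11-10 by anna@test.com"

# Simple, position-based tasks: plain string functions are enough
first_three <- substr(x, 1, 3)       # first three letters
words <- strsplit(x, " ")[[1]]       # split on single spaces

# Pattern-based task: a regular expression finds the date by its shape
date <- regmatches(x, regexpr("[0-9]{4}-[0-9]{2}-[0-9]{2}", x))

print(first_three)
print(words)
print(date)
```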
Regular Expressions :
Regular expressions (a.k.a. regex) are a set of pattern-matching commands used to detect string sequences in large text data. These commands are designed to match a family of text (alphanumeric characters, digits, words), which makes them versatile enough to handle any kind of text or string.
Regular expressions are used to get more out of text data.
For example, let's say we have scraped some messy data from the web. In such situations, you can use regular expressions to extract email IDs, postal codes, numbers, and date or time values from the text.

List of Regular Expression Commands
R has several functions specially designed to work with regular expressions, and it is really powerful when it comes to parsing text data. In regex, there are usually multiple ways of doing a certain task.
The available base regex functions are grep(), grepl(), sub(), gsub(), regexpr(), gregexpr(), regexec(), and regmatches().
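To get a feel for how some of these functions differ, here is a minimal sketch on a made-up vector (the data and variable names are assumptions for illustration only).

```r
fruits <- c("apple pie", "banana", "cherry pie")   # made-up data

idx   <- grep("pie", fruits)                # indices of the matching elements
hits  <- grep("pie", fruits, value = TRUE)  # the matching elements themselves
flags <- grepl("pie", fruits)               # one TRUE/FALSE per element
no_vowels <- gsub("[aeiou]", "", fruits)    # replace every match in each element

m <- regexpr("[a-z]+", fruits)              # position of the first match per element
first_words <- regmatches(fruits, m)        # the text of those first matches

print(idx); print(hits); print(flags)
print(no_vowels); print(first_words)
```

grep() and grepl() answer "which elements match?", gsub() rewrites the matches, and regexpr() combined with regmatches() pulls the matched text out.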

Regular expressions in R can be divided into 5 categories:
1.Metacharacters
2.Quantifiers
3.Sequences
4.Character Classes
5.POSIX character classes


1.Metacharacters :
Metacharacters are a set of special operators that regex does not match literally, because each one has its own meaning in the pattern syntax. These characters are very common in ordinary text.
They include: . \ | ( ) [ ] { } ^ $ * + ? If any of these characters appear in a string, regex won't match them literally unless they are prefixed with a double backslash (\\) in R.

Now we will see how to escape these characters in R.
From a given vector, we want to detect the string "percent%". We'll use the base grep() function, which detects strings that match a given pattern.

Also, we'll use the gsub() function to make replacements, as follows:

>dt <- c("percent%","percent")
>grep(pattern = "percent\\%",x = dt,value = T)
[1] "percent%"

> grep(pattern = "[a-z]\\%",x = dt,value = T)
[1] "percent%"

#detect all strings; inside [ ] the characters ? $ & are literal and need no escaping
> dt <- c("may?","money$","and&")
> grep(pattern = "[a-z][?$&]",x = dt,value = T)
[1] "may?" "money$" "and&"
> grep(pattern = "[a-z][?$&]",x = dt,value = F)
[1] 1 2 3

> gsub(pattern = "[?$&]",replacement = "",x = dt)
[1] "may" "money" "and"

If the string contains a literal backslash (written as "\\" in R source code), the pattern needs four backslashes ("\\\\") to match it, as shown below:
>gsub(pattern = "\\\\",replacement = "-",x = "Barcelona\\Spain")
[1] "Barcelona-Spain"
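To see why escaping matters, here is a small sketch with made-up data that compares an unescaped dot, an escaped dot, and the fixed = TRUE argument of grep(), which treats the whole pattern as plain text.

```r
prices <- c("cost: 3.50", "cost: 3150", "cost: 3-50")  # made-up data

# Unescaped dot is a wildcard, so it also matches "3150" and "3-50"
loose <- grep("3.50", prices, value = TRUE)

# Escaped dot matches only a literal "."
strict <- grep("3\\.50", prices, value = TRUE)

# fixed = TRUE switches regex off entirely, so no escaping is needed
literal <- grep("3.50", prices, value = TRUE, fixed = TRUE)

print(loose); print(strict); print(literal)
```

When the pattern is just a literal piece of text, fixed = TRUE is often the simpler choice than escaping every metacharacter by hand.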


2. Quantifiers :
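Quantifiers are mainly used to determine the length of the resulting match. A quantifier acts on the item to its immediate left, so moving it by one position can change the entire output.
Quantifiers can be used with metacharacters, sequences, and character classes to build complex patterns.

Following is the list of quantifiers commonly used in detecting patterns in text:
.      matches any single character (strictly a wildcard rather than a quantifier, but usually combined with one)
?      the item to its left is optional and is matched at most once
*      the item to its left is matched zero or more times
+      the item to its left is matched one or more times
{n}    the item to its left is matched exactly n times
{n,}   the item to its left is matched n or more times
{n,m}  the item to its left is matched at least n times but not more than m times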

The behavior of these quantifiers is best understood in two modes:
Greedy Quantifiers
The pattern .* is greedy. Once it starts matching, it consumes as many characters as it can while still letting the rest of the pattern match.

Non-Greedy Quantifiers
Adding ? after a quantifier, as in .*?, makes it non-greedy (lazy). A non-greedy quantifier consumes as few characters as possible, so the match stops at the first point where the rest of the pattern succeeds. Note that .? on its own is not a non-greedy quantifier; it simply matches zero or one character.

Let's consider an example of greedy vs. non-greedy quantifiers. Starting from the first digit of the given number, we want to extract everything up to the next '1'. The desired result is 101.

> number <- "101000000000100"
#greedy quantifier
> regmatches(number, gregexpr(pattern = "1.*1",text = number))
[[1]]
[1] "1010000000001"

#non greedy quantifier
> regmatches(number, gregexpr(pattern = "1.*?1",text = number))
[[1]]
[1] "101"


It works like this:
The greedy match starts from the first digit, moves ahead, and finds the second '1'. Being greedy, it keeps going and finds the third '1'. It then checks the rest of the string, finds no further '1', and returns "1010000000001". The non-greedy quantifier, on the other hand, stops at the first match and returns "101".
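The same contrast shows up when pulling tags out of markup. The sketch below uses a made-up string; a lazy match is written by appending ? to the quantifier (.*?), and perl = TRUE is passed here only as a cautious choice of regex engine for the lazy patterns.

```r
html <- "<b>bold</b> and <i>italic</i>"   # made-up input

# Greedy: .* runs on to the last ">" in the string
greedy <- regmatches(html, regexpr("<.*>", html))

# Lazy: .*? stops at the first ">" that completes the match
lazy <- regmatches(html, regexpr("<.*?>", html, perl = TRUE))

# All lazy matches in the string
tags <- regmatches(html, gregexpr("<.*?>", html, perl = TRUE))[[1]]

print(greedy); print(lazy); print(tags)
```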


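> names <- c("anna","crissy","puerto","cristian","aannna","steven","aannnnaa","gracia")
#'z*' matches zero or more 'z', so every string matches whether or not it contains 'z'
> grep(pattern = "z*",x = names,value = T)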
[1] "anna"     "crissy"   "puerto"   "cristian" "aannna"   "steven"   "aannnnaa" "gracia" 
> grep(pattern = "*",x = names,value = T)
[1] "anna"     "crissy"   "puerto"   "cristian" "aannna"   "steven"   "aannnnaa" "gracia"

#must match 't' one or more times in the string
> grep(pattern = "t+",x = names,value = T)

[1] "puerto" "cristian" "steven"


#must contain 'n' at least 2 times in a row
> grep(pattern = "n{2}",x = names,value = T)
[1] "anna"     "aannna"   "aannnnaa"
> grep(pattern = "n{2,}",x = names,value = T)
[1] "anna"     "aannna"   "aannnnaa"

#must contain 'n' at least 3 times in a row
> grep(pattern = "n{3}",x = names,value = T)
[1] "aannna"   "aannnnaa"
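Two quantifier forms are worth a quick sketch: the range form {n,m} and the optional marker ?. The identifiers below are made up, and the anchors ^ (start of string) and $ (end of string) are added here so that the count applies to the whole string rather than to a run somewhere inside it.

```r
ids <- c("a1", "a12", "a123", "a1234")  # made-up identifiers

# Anchored: the whole string must be "a" followed by exactly 2 or 3 digits
two_or_three <- grep("^a[0-9]{2,3}$", ids, value = TRUE)

# Unanchored: a run of 2 to 3 digits anywhere is enough, so "a1234" also matches
unanchored <- grep("[0-9]{2,3}", ids, value = TRUE)

# ? makes the item to its left optional
colours <- grepl("^colou?r$", c("color", "colour", "colouur"))

print(two_or_three); print(unanchored); print(colours)
```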

3. Sequences :

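Sequences are special escaped characters used to describe a pattern in a given string.
Following are the commonly used sequences in R:
\d  matches a digit character
\D  matches a non-digit character
\s  matches a space character
\S  matches a non-space character
\w  matches a word character (letter, digit, or underscore)
\W  matches a non-word character
\b  matches a word boundary
\B  matches a non-word boundary
In an R string each backslash must be doubled, so \d is written as "\\d".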
> mystring <- "I have been to London 20 times"
#match a digit
> gsub(pattern = "\\d+",replacement = "_",x = mystring)

[1] "I have been to London _ times"
> regmatches(mystring,regexpr(pattern = "\\d+",text = mystring))
[1] "20"

#match a non-digit
> gsub(pattern = "\\D+",replacement = "_",x = mystring)
[1] "_20_" 
> regmatches(mystring,regexpr(pattern = "\\D+",text = mystring))
[1] "I have been to London "

#match whitespace - returns the start position and length of each run of spaces
> spstring<-"I have  been   to    London 20 times"
> gregexpr(pattern = "\\s+",text = spstring)
[[1]]
[1] 2 7 13 18 28 31 #position of the spaces in the spstring
attr(,"match.length")
[1] 1 2 3 4 1 1 #no.of spaces in the spstring
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

#match a non space, replacing non-space characters with a string 'abc'
> mystring <- "I have been to London 20 times"
> gsub(pattern = "\\S+",replacement = "abc",x = mystring)
[1] "abc abc abc abc abc abc abc"

#match a word character, replacing each word character with the string 'k'
> mystring <- "I have been to London 20 times"
> gsub(pattern = "\\w",replacement = "k",x = mystring)
[1] "k kkkk kkkk kk kkkkkk kk kkkkk"

#match a non-word character, replacing with a string 'k'
>  mystring <- "I have been to London 20 times"
> gsub(pattern = "\\W",replacement = "k",x = mystring)
[1] "IkhavekbeenktokLondonk20ktimes"
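One more sequence that is handy in practice is the word boundary, written "\\b" in R. The sketch below uses a made-up sentence, and perl = TRUE is an assumption made here to get consistent boundary handling; it is not required by the examples above.

```r
s <- "cat concat cat's catalog"   # made-up text

# Without boundaries, "cat" is replaced inside longer words too
anywhere <- gsub("cat", "dog", s)

# With \\b on both sides, only "cat" as a whole word is replaced
whole <- gsub("\\bcat\\b", "dog", s, perl = TRUE)

print(anywhere)
print(whole)
```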

4. Character Classes
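A character class is a set of characters enclosed in square brackets [ ]. It matches any one of the characters listed inside the brackets. Character classes can also be used in conjunction with quantifiers. A caret (^) at the start of a character class negates it, so the class matches everything except the listed characters.
Following are the various types of character classes used in regex:
[aeiou]       matches lowercase vowels
[AEIOU]       matches uppercase vowels
[0-9]         matches digits
[a-z]         matches lowercase letters
[A-Z]         matches uppercase letters
[a-zA-Z0-9]   matches letters and digits
[^aeiou]      matches everything except lowercase vowels
[^0-9]        matches everything except digits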

Let's look at some examples using character classes:
> mystring <- "25 students failed in the test. 15 got passed in that test"

#extract numbers from a string
> regmatches(x = mystring,gregexpr("[0-9]+",text = mystring))
[[1]]
[1] "25" "15"

#extract string without digits
> regmatches(x = mystring,gregexpr("[^0-9]+",text = mystring)) 
[[1]]
[1] " students failed in the test. " " got passed in that test" 
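Here is a further sketch of character classes combined with negation and quantifiers, on a made-up vector of words. The anchors ^ and $ are used so that the negated class has to cover the whole word.

```r
words <- c("sky", "tree", "queue", "rhythm")   # made-up words

# Words containing at least one lowercase vowel
has_vowel <- grep("[aeiou]", words, value = TRUE)

# Words made up only of characters that are not lowercase vowels
no_vowel <- grep("^[^aeiou]+$", words, value = TRUE)

# Runs of two or more consecutive vowels
runs <- unlist(regmatches(words, gregexpr("[aeiou]{2,}", words)))

print(has_vowel); print(no_vowel); print(runs)
```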

5. POSIX Character Classes
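In R, POSIX character classes are written inside double square brackets ([[ ]]). They work like character classes, and a caret placed after the first bracket, as in [^[:digit:]], negates the class. These classes are more intuitive than the rest, and hence easier to learn.
Following are the POSIX character classes available in R:
[[:lower:]]   lowercase letters
[[:upper:]]   uppercase letters
[[:alpha:]]   letters
[[:digit:]]   digits 0-9
[[:alnum:]]   letters and digits
[[:space:]]   whitespace characters such as space, tab, and newline
[[:blank:]]   space and tab only
[[:cntrl:]]   control characters
[[:punct:]]   punctuation characters
[[:xdigit:]]  hexadecimal digits (0-9, A-F, a-f)
[[:print:]]   printable characters, including space
[[:graph:]]   printable characters, excluding space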


> mystring <- c("I work 8 hours\n, a day","He work 10 hours\n a day.","Who work many\t hours ?")
> mystring
[1] "I work 8 hours\n, a day"   "He work 10 hours\n a day." "Who work many\t hours ?" 

#extract digits from a string
> unlist(regmatches(mystring,gregexpr("[[:digit:]]+",text = mystring)))
[1] "8"  "10"

#remove punctuations

> gsub(pattern = "[[:punct:]]+",replacement = "",x = mystring)
[1] "I work 8 hours\n a day" "He work 10 hours\n a day" "Who work many\t hours "

#replace blanks (spaces and tabs) with '-'

> gsub(pattern = "[[:blank:]]",replacement = "-",x = mystring)
[1] "I-work-8-hours\n,-a-day"   "He-work-10-hours\n-a-day." "Who-work-many--hours-?"

#replace control characters (such as \n and \t) with a space
> gsub(pattern = "[[:cntrl:]]+",replacement = " ",x = mystring)

[1] "I work 8 hours , a day"   "He work 10 hours  a day." "Who work many  hours ?"

#remove non graphical characters
> gsub(pattern = "[^[:graph:]]+",replacement = "",x = mystring)

[1] "Iwork8hours,aday"   "Hework10hoursaday." "Whoworkmanyhours?"
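A few more POSIX classes in action, as a small sketch on a made-up string (plain ASCII text is assumed, since some of these classes depend on the locale).

```r
s <- "Room 42B, Floor 3"   # made-up text

# Every uppercase letter
upper <- regmatches(s, gregexpr("[[:upper:]]", s))[[1]]

# Keep letters only, by deleting everything that is not a letter
letters_only <- gsub("[^[:alpha:]]", "", s)

# Split into runs of letters and digits
alnum_tokens <- regmatches(s, gregexpr("[[:alnum:]]+", s))[[1]]

print(upper); print(letters_only); print(alnum_tokens)
```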

Examples on Regular Expressions :
#Extract digits from a string of characters

> mystring<-"The sampler order number is abc1006cde781"
> gsub(pattern = "[^0-9]",replacement = "",x = mystring)
[1] "1006781"
> stringi::stri_extract_all_regex(str = mystring,pattern = "\\d+") #list
[[1]]
[1] "1006" "781"

> regmatches(mystring, regexpr("[0-9]+",mystring))
[1] "1006"
> regmatches(mystring, regexpr("[[:digit:]]+",mystring))
[1] "1006"

#Remove spaces from a line of strings
> gsub(pattern = "[[:space:]]",replacement = "",x=mystring)
[1] "Thesamplerordernumberisabc1006cde781"
> gsub(pattern = "[[:blank:]]",replacement = "",x=mystring)
[1] "Thesamplerordernumberisabc1006cde781"
> gsub(pattern = "\\s",replacement = "",x=mystring)
[1] "Thesamplerordernumberisabc1006cde781"

#Return if a value is present in a vector
>det <- c("A1","A2","A3","A4","A5","A6","A7")
>grep(pattern = "A1|A4",x = det,value =T)
[1] "A1" "A4"

#Extract strings which are available in key value pairs
> dat <- c("(monday :: 0.1231313213)","tomorrow","(tuesday :: 0.1434343412)")
> grep(pattern = "\\([a-z]+ :: (0\\.[0-9]+)\\)",x = dat,value = T)

[1] "(monday :: 0.1231313213)" "(tuesday :: 0.1434343412)"

Explanation: This may look complicated, so let's go through it bit by bit.
"\\(" escapes the opening parenthesis so that it is matched literally, and "\\)" does the same for the closing one. "[a-z]+" matches one or more letters. "(0\\.[0-9]+)" matches the decimal value: the period is a metacharacter, so it is escaped with a double backslash, and the digits after it are matched by "[0-9]+".

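The parentheses in the key value pattern also act as capture groups, which can be referred to in a replacement as "\\1", "\\2", and so on. The sketch below is one possible way to pull the key and the value apart; the extra group around [a-z]+ and the variable names are additions made for this illustration.

```r
dat <- c("(monday :: 0.1231313213)", "tomorrow", "(tuesday :: 0.1434343412)")

# Group 1 captures the key, group 2 captures the decimal value
pattern <- "\\(([a-z]+) :: (0\\.[0-9]+)\\)"

hits   <- grep(pattern, dat, value = TRUE)       # keep only well-formed pairs
keys   <- sub(pattern, "\\1", hits)              # replace the match with group 1
values <- as.numeric(sub(pattern, "\\2", hits))  # replace the match with group 2

print(keys)
print(values)
```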
#In a key value pair, extract the values
> mystring<- c("G1:E001", "G2:E002", "G3:E003")
> gsub(pattern = ".*:",replacement = "",x = mystring)
[1] "E001" "E002" "E003"

Explanation: In the regex above, ".*:" matches as many characters as it can up to and including the colon (:), and gsub() replaces that match with an empty string. Hence, we get the desired output.
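Because .* is greedy, the result changes when a value contains more than one colon. Here is a small sketch with a made-up value, along with one possible pattern that stops at the first colon instead.

```r
x <- "G1:E001:rev2"   # made-up value with two colons

# Greedy .* removes everything through the LAST colon
after_last <- sub(".*:", "", x)

# [^:]* cannot cross a colon, so this removes only through the FIRST colon
after_first <- sub("^[^:]*:", "", x)

print(after_last)
print(after_first)
```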


#Remove punctuation from a line of text
> mystring<- "a1~!@#$%^&*bcd(){}_+:efg\"<>?,./;'[]-="
> gsub(pattern = "[[:punct:]]+",replacement = "",x = mystring)
[1] "a1bcdefg"

#Remove digits from a string which contains alphanumeric characters

> mystring <- "the day of 2nd ID5 Conference 19 12 2005"
#remove only standalone numbers
> gsub(pattern = "\\b\\d+\\b",replacement = "",x = mystring)
[1] "the day of 2nd ID5 Conference   "
#remove every digit
> gsub(pattern = "[[:digit:]]+",replacement = "",x = mystring)
[1] "the day of nd ID Conference   "

#Find the location of digits in a string

> mystring <- "The day of 2nd ID5 Conference"
> gregexpr(pattern = '\\d',text = mystring )
[[1]]
[1] 12 18
attr(,"match.length")
[1] 1 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

> unlist(gregexpr(pattern = '\\d',text =mystring ))
[1] 12 18

#Extract information available inside parentheses (brackets) in a string

> mystring<-"which (game) we play (chess) tomorrow ? (cricket) or (badminton)?"
> gsub("[\\(\\)]","",regmatches(mystring, gregexpr("\\(.*?\\)", mystring))[[1]])
[1] "game" "chess" "cricket" "badminton"

#Extract only the first number in a range
> x <- c("75 to 79", "80 to 84", "85 to 89")
> gsub(" .*\\d+", "", x)
[1] "75" "80" "85"


> x <- c("75at55 to 79", "80 to 84", "85at65 to 89")
> gsub(" .*\\d+", "", x)
[1] "75at55" "80" "85at65"

#Extract email addresses from a given string
> mystring <- c("My email address is abc@test.com","my email address is def@temp.com","lotus leaf","white rose")
> unlist(regmatches(x = mystring, gregexpr(pattern = "[[:alnum:]]+\\@[[:alpha:]]+\\.com",text = mystring)))
[1] "abc@test.com" "def@temp.com"
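The pattern above only finds addresses with a letters-only domain ending in .com. Here is a slightly broader sketch; the pattern is a simplified assumption made for illustration, not a full email validator, and the sample text is made up.

```r
txt <- c("Contact john.doe@mail.example.org or anna_1@test.co.in",
         "no address here")   # made-up text

# Assumed, simplified shape: local part, "@", dotted domain, suffix of 2+ letters
pattern <- "[[:alnum:]._-]+@[[:alnum:]-]+(\\.[[:alnum:]-]+)*\\.[[:alpha:]]{2,}"

emails <- unlist(regmatches(txt, gregexpr(pattern, txt)))
print(emails)
```

Elements without a match contribute nothing, so unlist() returns only the addresses that were found.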


--------------------------------------------------------------------------------------------------------
Thanks, TAMATAM ; Business Intelligence & Analytics Professional
--------------------------------------------------------------------------------------------------------
