Friday, November 9, 2018

How to manipulate text string data with String Functions in R

Text Strings data manipulation with String Functions in R
String Manipulation :
String manipulation comprises a set of functions used to extract information from text variables. In machine learning, these functions are widely used for feature engineering, i.e., creating new features out of existing string features.
In R, packages such as stringr and stringi provide a comprehensive set of string manipulation functions.
R also has several base functions for string manipulation. These functions are designed to complement regular expressions. The practical differences between string manipulation functions and regular expressions are :

We use string manipulation functions for simple tasks such as splitting a string or extracting the first three letters. We use regular expressions for more complicated tasks such as extracting email IDs or dates from a set of text.
String manipulation functions are designed to respond in a fixed way and do not deviate from that behavior, whereas regular expressions can be customized in any way we want.

Regular Expressions :
Regular expressions (a.k.a. regex) are a set of pattern-matching commands used to detect string sequences in large text data. These commands are designed to match a family of text (alphanumeric characters, digits, words), which makes them versatile enough to handle any kind of text / string.
Regular expressions are used to get more out of text data.
For example, let's say we have scraped some data from the web which is really messy. In such situations, we can use regular expressions to extract email IDs, postal codes, numbers, and date or time values from the text.
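
As a small illustration of the difference described above, the sketch below uses a plain string function for a simple task and a regular expression for a harder one. The sample text and variable names are made up for this example.

```r
# Hypothetical messy text, invented for this illustration
messy <- "Order 66 shipped to PIN 500081 on 2018-11-09"

# Simple task: a string function takes the first three letters
first_three <- substr(messy, 1, 3)

# Harder task: a regular expression finds every run of digits
digits <- regmatches(messy, gregexpr("[0-9]+", messy))[[1]]

print(first_three)  # "Ord"
print(digits)       # "66" "500081" "2018" "11" "09"
```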

In this post, we will discuss only the string manipulation functions.

String Manipulation Functions :
In R, a string is any value enclosed in quotes (" "). We can even store numbers as strings. R stores strings under the class character :
text <- "United States of India"
typeof(text)
[1] "character"

num <- c("15","25","35")
typeof(num)
[1] "character"
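
Because numbers stored in quotes are text, they have to be converted before we can do arithmetic on them. A minimal sketch, reusing the num vector from above (the names num_values and total are only for illustration):

```r
num <- c("15", "25", "35")

# Both checks confirm that the values are text, not numbers
num_class <- class(num)        # "character"
is_text <- is.character(num)   # TRUE

# Arithmetic needs an explicit conversion first
num_values <- as.numeric(num)
total <- sum(num_values)       # 75
print(total)
```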

# paste() function :
R's base paste function is used to combine (or paste) a set of strings. In machine learning, it is quite frequently used for creating / restructuring variable names.
We can pass vectors of values to the paste function; if the vector lengths aren't equal, the shorter vector is recycled until it matches the length of the longest vector.
Examples :
> var1<-("United")
> var2<-c("States","of")
> var3<-paste(var1,var2)
> var3
[1] "United States" "United of"
  
> var4<-paste(var1,var2,"India")
> var4
[1] "United States India" "United of India"
 
#specifying the separator between the strings   
> var5<-paste(var1,var2,"India",sep="")
> var5
[1] "UnitedStatesIndia" "UnitedofIndia"  
 
> var6<-paste(var1,var2,"India",sep="_")
> var6
[1] "United_States_India" "United_of_India"

In the example below, we take a vector of length 5 (1:5) and combine it with a vector of length 2, c("?","!"), which gets recycled:
> paste(1:5,c("?","!"),sep = "-")
[1] "1-?" "2-!" "3-?" "4-!" "5-?"
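
Two related tools are worth knowing when building variable names: paste0(), which is paste() with sep = "", and the collapse argument, which joins the resulting vector into one string. The sketch below assumes some hypothetical column names var_1 to var_3.

```r
ids <- 1:3

# paste0() is shorthand for paste(..., sep = "")
labels <- paste0("var_", ids)                 # "var_1" "var_2" "var_3"

# collapse joins the whole vector into one single string
one_string <- paste(labels, collapse = "+")   # "var_1+var_2+var_3"

# A typical use: building a model formula as text
formula_text <- paste("y ~", one_string)      # "y ~ var_1+var_2+var_3"
print(formula_text)
```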

# cat() function :
As we saw above, paste returns its results within quotes because they are character vectors. R also allows us to concatenate and print strings without quotes, using the cat function. Note that cat only prints its arguments; it does not return a string.
In the stringr package, str_c() is the counterpart of paste() (str_join() is an older name for it that has since been deprecated).
> str1<-"Los Angeles"
> cat(str1,"USA",sep = "-")
Los Angeles-USA

> cat(month.name[1:5],sep = " ")
January February March April May
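
The practical difference between paste and cat shows up when we try to store the result. A minimal sketch (the variable names joined and printed are only for illustration):

```r
str1 <- "Los Angeles"

# paste() returns a character vector that can be stored and reused
joined <- paste(str1, "USA", sep = "-")

# cat() writes to the console; the value it returns is NULL
printed <- cat(str1, "USA\n", sep = "-")

print(joined)            # "Los Angeles-USA"
print(is.null(printed))  # TRUE
```

So cat is the right choice for messages and logs, while paste is the one to use when the combined string is needed later.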


# toString() function :
The toString function converts a non-character value to a single string, with the elements separated by a comma and a space.
> toString(1:10)
[1] "1, 2, 3, 4, 5, 6, 7, 8, 9, 10"
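
For a plain vector, toString behaves like paste with collapse = ", ". The sketch below checks this on a small numeric vector (the names x, s1 and s2 are only for illustration):

```r
x <- c(1.5, 2, 3)

# toString() collapses the whole vector into one string
s1 <- toString(x)

# The same result, spelled out with paste()
s2 <- paste(x, collapse = ", ")

print(s1)                 # "1.5, 2, 3"
print(identical(s1, s2))  # TRUE
```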

Below we demonstrate some of the most commonly used base R functions for string manipulation, along with their stringr equivalents where available.
>library(stringr)
>mystring <- "Los Angeles, officially the City of Los Angeles and often known by its initials L.A., is the second-most populous city in the United States (after New York City), the most populous city in California and the county seat of Los Angeles County. Situated in Southern California, Los Angeles is known for its Mediterranean climate, ethnic diversity, sprawling metropolis, and as a major center of the American entertainment industry."

> strwrap(mystring)
[1] "Los Angeles, officially the City of Los Angeles and often known by its initials L.A., is the second-most populous city in the United States"
[2] "(after New York City), the most populous city in California and the county seat of Los Angeles County. Situated in Southern California, Los"
[3] "Angeles is known for its Mediterranean climate, ethnic diversity, sprawling metropolis, and as a major center of the American entertainment"
[4] "industry."

#count number of characters
> nchar(mystring)
[1] 429
> str_length(mystring)
[1] 429
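
nchar is vectorized, so it can also be used to count characters for every element of a vector and to filter on the result. A small sketch with made-up city names:

```r
cities <- c("Los Angeles", "New York", "")

# One count per element; the empty string has zero characters
lengths_chr <- nchar(cities)            # 11 8 0

# Keep only the names longer than 8 characters
long_names <- cities[nchar(cities) > 8]

print(lengths_chr)
print(long_names)                       # "Los Angeles"
```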

#convert the string to lower case
>tolower(mystring)
>str_to_lower(mystring)

#convert the string to upper case

>toupper(mystring)
>str_to_upper(mystring)
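
A common use of case conversion in feature engineering is to normalize values that differ only in case before comparing or counting them. A minimal sketch with hypothetical survey answers:

```r
answers <- c("Yes", "YES", "no", "yEs")

# Bring everything to one case before comparing
normalized <- tolower(answers)
is_yes <- normalized == "yes"                   # TRUE TRUE FALSE TRUE

# The number of distinct values drops once case is ignored
n_raw <- length(unique(answers))                # 4
n_normalized <- length(unique(normalized))      # 2
print(table(normalized))
```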

#replace characters and strings :
> chartr("and","for",x = mystring) #letters a, n, d get replaced by f, o, r
[1] "Los Aogeles, officiflly the City of Los Aogeles for ofteo koowo by its ioitifls L.A., is the secoor-most populous city io the Uoiter Stftes (ffter New York City), the most populous city io Cfliforoif for the couoty seft of Los Aogeles Couoty. Situfter io Southero Cfliforoif, Los Aogeles is koowo for its Meriterrfoefo climfte, ethoic riversity, sprfwliog metropolis, for fs f mfjor ceoter of the Americfo eotertfiomeot iorustry."
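
Note that chartr translates single characters, not the word "and" as a whole. The sketch below contrasts it with a whole-substring replacement on a short made-up string, which makes the difference easier to see than in the long paragraph:

```r
x <- "banana band"

# chartr(): every a becomes f, every n becomes o, every d becomes r
translated <- chartr("and", "for", x)

# gsub() with fixed = TRUE: only the exact substring "and" is replaced
replaced <- gsub("and", "for", x, fixed = TRUE)

print(translated)  # "bfofof bfor"
print(replaced)    # "banana bfor"
```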

> str_replace_all(string = mystring, pattern = c("City"),replacement = "state")
#this is case sensitive
[1] "Los Angeles, officially the state of Los Angeles and often known by its initials L.A., is the second-most populous city in the United States (after New York state), the most populous city in California and the county seat of Los Angeles County. Situated in Southern California, Los Angeles is known for its Mediterranean climate, ethnic diversity, sprawling metropolis, and as a major center of the American entertainment industry."

#extracting the parts(substring) of string

> substr(x = mystring,start = 5,stop = 11)

[1] "Angeles"
> str_sub(string = mystring, start = 5, end = 11)
[1] "Angeles"
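
substr also works on whole vectors and can be used on the left side of an assignment to overwrite part of a string. A small sketch with hypothetical ID codes:

```r
codes <- c("ID-101", "ID-102", "ID-103")

# Extract the prefix and the numeric part of every element
prefix <- substr(codes, 1, 2)                          # "ID" "ID" "ID"
number <- as.integer(substr(codes, 4, nchar(codes)))   # 101 102 103

# Replacement form: overwrite characters 1 to 2 in place
substr(codes, 1, 2) <- "NO"
print(codes)                                           # "NO-101" "NO-102" "NO-103"
```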

#get the elements of the first vector that are not in the second
>setdiff(c("monday","tuesday","wednesday"),c("monday","thursday","friday"))
[1] "tuesday"   "wednesday"
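
setdiff is not symmetric: swapping the arguments gives a different answer. The sketch below shows both directions together with the related intersect and union functions (the variable names are only for illustration):

```r
week_a <- c("monday", "tuesday", "wednesday")
week_b <- c("monday", "thursday", "friday")

only_a <- setdiff(week_a, week_b)    # "tuesday" "wednesday"
only_b <- setdiff(week_b, week_a)    # "thursday" "friday"
in_both <- intersect(week_a, week_b) # "monday"
in_any <- union(week_a, week_b)      # all five distinct days

print(only_b)
print(in_any)
```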

> abbreviate(c("monday","tuesday","wednesday"),minlength = 3)
   monday   tuesday wednesday
    "mnd"     "tsd"     "wdn"

#splitting strings
> strsplit(x = c("ID-101","ID-102","ID-103","ID-104"),split = "-")
[[1]]
[1] "ID"  "101"

[[2]]
[1] "ID"  "102"

[[3]]
[1] "ID"  "103"

[[4]]
[1] "ID"  "104"

> str_split(string = c("ID-101","ID-102","ID-103","ID-104"),pattern = "-",simplify = T)
     [,1] [,2]
[1,] "ID" "101"
[2,] "ID" "102"
[3,] "ID" "103"
[4,] "ID" "104"
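
Since strsplit returns a list, we usually need one more step to get a usable column out of it. A minimal base R sketch that pulls out the numeric part and also builds the same matrix shape that simplify = T gives in stringr:

```r
ids <- c("ID-101", "ID-102", "ID-103", "ID-104")

# fixed = TRUE treats "-" as literal text rather than a regular expression
parts <- strsplit(ids, split = "-", fixed = TRUE)

# Take the second piece of every element and convert it to a number
numbers <- as.integer(sapply(parts, `[`, 2))   # 101 102 103 104

# Stack the pieces row by row into a character matrix
parts_matrix <- do.call(rbind, parts)
print(numbers)
print(dim(parts_matrix))                       # 4 2
```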


#find and replace first match
> sub(pattern = "L",replacement = "B",x = mystring,ignore.case = T)

#replaces the first occurrence of "L" (matched case-insensitively) with "B"
[1] "Bos Angeles, officially the City of Los Angeles and often known by its initials L.A., is the second-most populous city in the United States (after New York City), the most populous city in California and the county seat of Los Angeles County. Situated in Southern California, Los Angeles is known for its Mediterranean climate, ethnic diversity, sprawling metropolis, and as a major center of the American entertainment industry."

#find and replace all matches
> gsub(pattern = "Los",replacement = "Bos",x = mystring,ignore.case = T)

[1] "Bos Angeles, officially the City of Bos Angeles and often known by its initials L.A., is the second-most populous city in the United States (after New York City), the most populous city in California and the county seat of Bos Angeles County. Situated in Southern California, Bos Angeles is known for its Mediterranean climate, ethnic diversity, sprawling metropolis, and as a major center of the American entertainment industry."

Note :
The pattern parameter in the functions above also accepts regular expressions. Combined with regular expressions, these functions can perform highly complex search operations.
String manipulation with regular expressions will be discussed in another post.
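
One practical consequence of the note above: because the pattern is read as a regular expression, characters such as "." have a special meaning unless we pass fixed = TRUE. A short sketch (the variable names are only for illustration):

```r
initials <- "L.A."

# "." is a regex wildcard, so the very first character is replaced
as_regex <- sub(".", "-", initials)                  # "-.A."

# fixed = TRUE treats the pattern as literal text
as_literal <- sub(".", "-", initials, fixed = TRUE)  # "L-A."

# A character class in gsub() removes every digit
no_digits <- gsub("[0-9]", "", "ID-101")             # "ID-"

print(c(as_regex, as_literal, no_digits))
```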

--------------------------------------------------------------------------------------------------------
Thanks, TAMATAM ; Business Intelligence & Analytics Professional
--------------------------------------------------------------------------------------------------------
