Sunday, January 28, 2018

What Factors Are and How to Create Them in R Programming

The Factors in R Programming
Factors are data objects used to categorize data and store it as levels. They can be built from both strings and integers. They are useful for columns that have a limited number of unique values, such as "Male" and "Female", or "True" and "False".

Categorical (nominal) and ordered categorical (ordinal) variables in R are called factors. Factors are crucial in R because they determine how data is analyzed and presented visually.


Variables can be described as nominal, ordinal, or continuous. Nominal variables are categorical, without an implied order. Diabetes (Type1, Type2) is an example of a nominal variable. Even if Type1 is coded as a 1 and Type2 is coded as a 2 in the data, no order is implied.

Ordinal variables imply order but not amount. Status (poor, improved, excellent) is a good example of an ordinal variable. We know that a patient with a poor status isn’t doing as well as a patient with an improved status, but not by how much.

Continuous variables can take on any value within some range, and both order and amount are implied. Age in years is a continuous variable and can take on values such as 14.5 or 22.8 and any value in between. You know that someone who is 15 is one year older than someone who is 14.

Factors are created using the factor() function, which takes a vector as input.

Create a Factor:
# Creating a vector as input
> data<-c("East","West","East","North","North","East","West","West","West","East","North")

> print(data)
 [1] "East"  "West"  "East"  "North" "North" "East"  "West"  "West"  "West"  "East"  "North"

> print(is.factor(data))
[1] FALSE

# Creating a Factor with above vector as input with factor() function
> factor_data<-factor(data)
> print(factor_data)
 [1] East  West  East  North North East  West  West  West  East  North
Levels: East North West
> print(is.factor(factor_data))
[1] TRUE
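
Once a factor exists, a few base R helpers make its categories easier to inspect. The following is a minimal sketch that reuses the same illustrative vector; it only relies on levels(), nlevels(), and table() from base R.

```r
# Same illustrative vector as in the example above
data <- c("East","West","East","North","North","East","West","West","West","East","North")
factor_data <- factor(data)

levels(factor_data)    # the distinct categories, sorted alphabetically
nlevels(factor_data)   # the number of categories
table(factor_data)     # how many observations fall into each category
```

table() is often the quickest way to check that a factor contains the categories you expect.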

Factors in Data Frame:
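When you create a data frame (dataset) with a column of text data, R can treat the text column as categorical data and store it as a factor. Before R 4.0.0 this conversion happened by default. From R 4.0.0 onward, data.frame() keeps text columns as character vectors unless you pass stringsAsFactors=TRUE.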
# Creating the vectors for a data frame
> height<-c(132,151,162,139,166,147,122)
> weight<-c(48,49,66,53,67,52,40)
> gender<-c("male","male","female","female","male","female","male")


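# Creating the data frame from the above vectors; stringsAsFactors=TRUE makes the text column a factor in every R version
> input_data<-data.frame(height,weight,gender,stringsAsFactors=TRUE)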
> print(input_data)
  height weight gender
1    132     48   male
2    151     49   male
3    162     66 female
4    139     53 female
5    166     67   male
6    147     52 female
7    122     40   male

# Test which columns are factors
> print(is.factor(input_data$height))
[1] FALSE
> print(is.factor(input_data$weight))
[1] FALSE

> print(is.factor(input_data$gender))
[1] TRUE

# Print the gender column to see the factor levels
> print(input_data$gender)
[1] male   male   female female male   female male  
Levels: female male
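
If a text column arrives as plain character data, it can be converted to a factor explicitly. The sketch below assumes a small data frame named df_chr (a hypothetical name used only for this illustration) and forces the character case with stringsAsFactors = FALSE so that it behaves the same in every R version.

```r
# Hypothetical data frame whose text column is kept as character
gender <- c("male","male","female","female","male","female","male")
df_chr <- data.frame(gender, stringsAsFactors = FALSE)

is.factor(df_chr$gender)                 # FALSE: still a character column

# Convert the column explicitly
df_chr$gender <- factor(df_chr$gender)

is.factor(df_chr$gender)                 # TRUE
levels(df_chr$gender)                    # "female" "male"
```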

Changing the Order of Levels in Factor:
# Creating a Factor with above vector as input with factor() function
> data<-c("East","West","East","North","North","East","West","West","West","East","North")
> factor_data<-factor(data)
> print(factor_data)
 [1] East  West  East  North North East  West  West  West  East  North
Levels: East North West
By default, the levels are shown in alphabetical (a, b, c, ...) order. We can change the order by specifying the levels argument of factor(), as shown below.
> new_order_data<-factor( factor_data, levels=c("East","West","North"))
> new_order_data
 [1] East  West  East  North North East  West  West  West  East  North
Levels: East West North
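
Changing the level order also changes the integer codes stored behind each value. The sketch below uses a shortened vector and hypothetical variable names (f_default, f_custom, f_west_first) to show the effect, along with base R's relevel(), which moves a single level to the front of an unordered factor.

```r
data <- c("East","West","East","North")

f_default <- factor(data)                                      # levels: East North West
f_custom  <- factor(data, levels = c("East","West","North"))   # levels: East West North

as.integer(f_default)   # 1 3 1 2
as.integer(f_custom)    # 1 2 1 3

# relevel() makes one level the first (reference) level
f_west_first <- relevel(f_default, ref = "West")
levels(f_west_first)    # "West" "East" "North"
```

The displayed values are identical in all three factors; only the underlying codes and the order used by summaries and plots differ.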

Generating Factor Levels:
We can generate factor levels by using the gl() function. It takes two integers (n, k) as input, where n is the number of levels and k is the number of times each level should be replicated (repeated).

syntax:
gl(n,k, labels)

Here, n is the number of levels, k is the number of times each level is replicated, and labels is an optional vector of labels for the resulting factor levels.

> v<-gl(3,4,labels=c("East","West","North"))
> v
 [1] East  East  East  East  West  West  West  West  North North North North
Levels: East West North
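
gl() also accepts length and ordered arguments, which the syntax above does not show. The sketch below uses made-up labels ("Control", "Treat", "Low", "Medium", "High") purely for illustration.

```r
# k = 1 with length = 6: the two levels alternate until six values exist
alternating <- gl(2, 1, length = 6, labels = c("Control", "Treat"))
alternating

# ordered = TRUE returns an ordered factor instead of a nominal one
sizes <- gl(3, 2, labels = c("Low", "Medium", "High"), ordered = TRUE)
sizes
```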

Nominal vs Ordinal Factors:
The function factor() stores the categorical values as a vector of integers in the range [1...k] (where k is the number of unique values in the nominal variable), together with an internal vector of character strings (the original values) mapped to these integers.
For example, assume that we have this vector:
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> diabetes
[1] "Type1" "Type2" "Type1" "Type1"
> diabetes <- factor(diabetes)
> diabetes
[1] Type1 Type2 Type1 Type1
Levels: Type1 Type2

The statement diabetes <- factor(diabetes) stores this vector as (1, 2, 1, 1) and associates it with 1 = Type1 and 2 = Type2 internally (the assignment is alphabetical).
Any analyses performed on the vector diabetes will treat the variable as nominal and select the statistical methods appropriate for this level of measurement.
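
The internal coding described above can be inspected directly. This is a small sketch using only base R functions (as.integer(), levels(), typeof()).

```r
diabetes <- factor(c("Type1", "Type2", "Type1", "Type1"))

as.integer(diabetes)    # the stored codes: 1 2 1 1
levels(diabetes)        # the labels the codes point to: "Type1" "Type2"
typeof(diabetes)        # "integer": a factor is an integer vector with a levels attribute

# Indexing the levels by the codes rebuilds the original strings
levels(diabetes)[as.integer(diabetes)]
```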

For vectors representing ordinal variables, you add the parameter ordered=TRUE to the factor() function.
> status <- c("Poor", "Improved", "Excellent", "Poor")
> status <- factor(status, ordered=TRUE)
> status
[1] Poor      Improved  Excellent Poor     
Levels: Excellent < Improved < Poor
The statement status <- factor(status, ordered=TRUE) will encode the vector as (3, 2, 1, 3) and associate these values internally as 1 = Excellent, 2 = Improved, and 3 = Poor. Additionally, any analyses performed on this vector will treat the variable as ordinal and select the appropriate statistical methods.

By default, factor levels for character vectors are created in alphabetical order. This worked for the status factor, because the order “Excellent,” “Improved,” “Poor” made sense. 

There would have been a problem if “Poor” had been coded as “Average” instead, because the order would have been “Average,” “Excellent,” “Improved.” A similar problem would exist if the desired order was “Poor,” “Improved,” “Excellent.” 

For ordered factors, the alphabetical default is rarely sufficient. You can override the default by specifying a levels option.
Example:
> status <- factor(status, ordered=TRUE, levels=c("Poor", "Improved", "Excellent"))

> status
[1] Poor      Improved  Excellent Poor
Levels: Poor < Improved < Excellent

The above statement assigns the levels as 1 = Poor, 2 = Improved, 3 = Excellent. Be sure the specified levels match your actual data values. Any data values not in the list will be set to missing.
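
With the levels in the intended order, comparisons on an ordered factor follow that order. The sketch below rebuilds the status factor from the example above and uses only base R operations.

```r
status <- c("Poor", "Improved", "Excellent", "Poor")
status <- factor(status, ordered = TRUE,
                 levels = c("Poor", "Improved", "Excellent"))

status < "Excellent"    # TRUE TRUE FALSE TRUE
max(status)             # Excellent
sort(status)            # Poor Poor Improved Excellent
```

These comparisons are only defined for ordered factors; on a nominal factor R returns NA with a warning.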

Numeric variables can be coded as factors using the levels and labels options. If gender was coded as 1 for male and 2 for female in the original data, then calling factor() alone keeps the numeric codes as the level names:
> gender<-c(1,2)
> gender<- factor(gender)
> gender
[1] 1 2
Levels: 1 2
Supplying the levels and labels options converts the variable to an unordered factor with readable names:
> gender<- factor(gender, levels=c(1, 2), labels=c("Male", "Female"))
> gender
[1] Male   Female
Levels: Male Female

Note that the order of the labels must match the order of the levels. In this example, gender would be treated as categorical, the labels “Male” and “Female” would appear in the output instead of 1 and 2, and any gender value that wasn’t initially coded as a 1 or 2 would be set to missing.
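
The missing-value behaviour can be checked with a code that is not listed in levels. The sketch below assumes a hypothetical vector named codes that contains a stray value of 3.

```r
# Hypothetical raw codes; 3 is not a valid gender code here
codes <- c(1, 2, 2, 3)

gender <- factor(codes, levels = c(1, 2), labels = c("Male", "Female"))
gender                            # Male Female Female <NA>

# Count the categories, including any values that became missing
table(gender, useNA = "ifany")
```

Counting the NA values after conversion is a quick way to spot codes that were mistyped or left out of levels.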

Example:
The following listing demonstrates how specifying factors and ordered factors impacts data analyses.
First, you enter the data as vectors. Then you specify that diabetes is a factor and status is an ordered factor. Finally, you combine the data into a data frame.
> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> diabetes <- factor(diabetes)
> status <- factor(status, ordered=TRUE)

> patientdata <- data.frame(patientID, age, diabetes, status)
> str(patientdata)
'data.frame':	4 obs. of  4 variables:
 $ patientID: num  1 2 3 4
 $ age      : num  25 34 28 52
 $ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1 1
 $ status   : Ord.factor w/ 3 levels "Excellent"<"Improved"<..: 3 2 1 3

> summary(patientdata)
   patientID         age         diabetes       status
 Min.   :1.00   Min.   :25.00   Type1:3   Excellent:1
 1st Qu.:1.75   1st Qu.:27.25   Type2:1   Improved :1
 Median :2.50   Median :31.00             Poor     :2
 Mean   :2.50   Mean   :34.75
 3rd Qu.:3.25   3rd Qu.:38.50
 Max.   :4.00   Max.   :52.00


Notes:
The function str(object) provides information about an object (here, the data frame) in R. It clearly shows that diabetes is a factor and status is an ordered factor, along with how they are coded internally.

Note that the summary() function treats the variables differently. It provides the minimum, maximum, mean, and quartiles for the continuous variable age, and frequency counts for the categorical variables diabetes and status.


Thanks, TAMATAM
