How to do Subsetting the data from a Dataframe in R

Subsetting the data from a Dataframe in R
R has powerful indexing features for accessing object elements. These features can be used to select and exclude variables and observations.

Suppose we have the dataframe "mydata" on which we perform various subsetting operations..

> filepath="T:/My_Tech_Lab/R_Lab/Dataframes/Dataset_FY16.txt"
> mydata<-read.csv(filepath,header=T,sep="\t",as.is=T,stringsAsFactors =F)
> mydata # with 48 observations and 9 variables

#Subsetting the data by excluding/removing only the Variables (2 to 4 and 6) , then storing the result into a new Dataframe subset1

> subset1<-mydata[,-c(2:4,6)]

> head(subset1)
SalesOrder_Id Cust_Segment Gross_Sales COGS Net_Sales
1     123456889       Pharma      735000 58800    654150
2     123456932 Automobiles        7000   350      6440
3     123456926 Automobiles        5100   306      4794
4     123456895       Pharma      420000 29400    378000
5     123456913      Banking        4500   225      4275
6     123456919      Banking      150000 12000    130500

Note :
Since the result is assigning to new Dataframe "subset1", the data in the original Dataframe "mydata" will remain as it is.

#Subsetting the data by excluding/removing the Observations whose row id >15 and the Variables 2 to 6 , then storing the result into a new Dataframe subset2

> subset2<-mydata[as.numeric(rownames(mydata))<=15,-c(2:6)]

> subset2
SalesOrder_Id Gross_Sales   COGS Net_Sales
1      123456889      735000 58800    654150
2      123456932        7000    350      6440
3      123456926        5100    306      4794
4      123456895      420000 29400    378000
5      123456913        4500    225      4275
6      123456919      150000 12000    130500
7      123456927      975000 87750    877500
8      123456933        5100    306      4743
9      123456922      450000 40500    391500
10     123456900        3600    216      3276
11     123456916     1505000 105350   1354500
12     123456906        4000    200      3800
13     123456907        5850    351      5265
14     123456901      500000 45000    445000
15     123456892      315000 25200    277200

#Subsetting the data by Filtering Observations based on a Variable "FiscalQuarter" and selecting only specific Variables (1,3,9) , then storing the result into a new Dataframe subset3.

> subset3<-mydata[mydata$FiscalQuarter=="Q1-2016"|mydata$FiscalQuarter=="Q4-2016",c(1,3,9)]

> attach(mydata)

> subset3<-mydata[FiscalQuarter=="Q1-2016"|FiscalQuarter=="Q4-2016",c(1,3,9)]

> detach(mydata)

> subset3

   SalesOrder_Id FiscalQuarter Net_Sales
1      123456889       Q1-2016  654150.0
2      123456932       Q1-2016    6440.0
3      123456926       Q1-2016    4794.0
4      123456895       Q1-2016  378000.0
29     123456911       Q4-2016 1080000.0
30     123456890       Q4-2016    2136.0
31     123456905       Q4-2016 1236900.0
32     123456896       Q4-2016    4700.0
33     123456891       Q4-2016  556250.0
34     123456928       Q4-2016  382200.0
35     123456897       Q4-2016    4777.5
36     123456934       Q4-2016 1464750.0
37     123456915       Q4-2016 1214850.0
38     123456935       Q4-2016  586950.0
39     123456921       Q4-2016  413250.0
40     123456929       Q4-2016    6110.0
41     123456908       Q1-2016 1610000.0
42     123456912       Q1-2016  360000.0
43     123456902       Q1-2016   67500.0
44     123456918       Q1-2016    4725.0
45     123456888       Q1-2016  477750.0
46     123456909       Q1-2016  535050.0
47     123456903       Q1-2016   15015.0
48     123456894       Q1-2016  946000.0

#Subsetting the data by Filtering Observations based on the Variables "FiscalQuarter" and Net_Sales, selecting only specific Variables (1,3,9) , then storing the result into a new Dataframe subset4.
> attach(mydata)

> subset4<-mydata[(FiscalQuarter=="Q1-2016"|FiscalQuarter=="Q4-2016") &
as.integer(Net_Sales)<=16500,c(1,3,9)]
> detach(mydata)

#Selection using the Subset Function:
The subset( ) function is the easiest way to select variables and observations.
>subset4<-subset(mydata,(FiscalQuarter=="Q1-2016"|FiscalQuarter=="Q4-2016") & as.integer(Net_Sales)<=16500,select=c(1,3,9))

> subset4
SalesOrder_Id FiscalQuarter Net_Sales
2      123456932       Q1-2016    6440.0
3      123456926       Q1-2016    4794.0
30     123456890       Q4-2016    2136.0
32     123456896       Q4-2016    4700.0
35     123456897       Q4-2016    4777.5
40     123456929       Q4-2016    6110.0
44     123456918       Q1-2016    4725.0
47     123456903       Q1-2016   15015.0

# Taking a random sample of size 20 from a dataset mydata, sample without replacement
Use the sample( ) function to take a random sample of size n from a dataset.
> mysample <- mydata[sample(1:nrow(mydata), 20,replace=FALSE),c(1,3,9)]

> mysample
   SalesOrder_Id FiscalQuarter Net_Sales
37     123456915       Q4-2016 1214850.0
1      123456889       Q1-2016 654150.0
35     123456897       Q4-2016    4777.5
17     123456924       Q3-2016 740600.0
18     123456930       Q3-2016 1057500.0
43     123456902       Q1-2016   67500.0
5      123456913       Q2-2016    4275.0
28     123456910       Q3-2016    5076.0
21     123456917       Q3-2016    6762.0
12     123456906       Q2-2016    3800.0
7      123456927       Q2-2016 877500.0
13     123456907       Q2-2016    5265.0
26     123456904       Q3-2016 652500.0
14     123456901       Q2-2016 445000.0
23     123456923       Q3-2016   17100.0
30     123456890       Q4-2016    2136.0
29     123456911       Q4-2016 1080000.0
6      123456919       Q2-2016 130500.0
22     123456931       Q3-2016 1399650.0
38     123456935       Q4-2016 586950.0

# Taking a random sample of size 20 from a dataset mydata, sample with replacement

> mysample <- mydata[sample(1:nrow(mydata), 20,replace=TRUE),c(1,3,9)]

> mysample
     SalesOrder_Id FiscalQuarter Net_Sales
46       123456909       Q1-2016 535050.0
16       123456898       Q2-2016 1464050.0
25       123456920       Q3-2016 495900.0
10       123456900       Q2-2016    3276.0
30       123456890       Q4-2016    2136.0
15       123456892       Q2-2016 277200.0
35       123456897       Q4-2016    4777.5
27       123456914       Q3-2016   13500.0
40       123456929       Q4-2016    6110.0
31       123456905       Q4-2016 1236900.0
28       123456910       Q3-2016    5076.0
15.1     123456892       Q2-2016 277200.0
6        123456919       Q2-2016 130500.0
22       123456931       Q3-2016 1399650.0
2        123456932       Q1-2016    6440.0
36       123456934       Q4-2016 1464750.0
25.1     123456920       Q3-2016 495900.0
2.1      123456932       Q1-2016    6440.0
6.1      123456919       Q2-2016 130500.0
48       123456894       Q1-2016 946000.0

Note :
When we say replace=TRUE, few Observations will be repeated in the Random sample with a unique row id (Eg:15.1 ,15.2... for 15)

--------------------------------------------------------------------------------------------------------
Thanks, TAMATAM ; Business Intelligence & Analytics Professional
--------------------------------------------------------------------------------------------------------

Power BI Analytics Kingdom | Blog

Search This Blog

Monday, November 12, 2018

How to do Subsetting the data from a Dataframe in R

No comments:

Post a Comment

Featured Post from this Blog

How to compare Current Snapshot Data with Previous Snapshot in Power BI

Popular Posts from this Blog