Monday, November 12, 2018

How to do Subsetting the data from a Dataframe in R

Subsetting the data from a Dataframe in R
R has powerful indexing features for accessing object elements. These features can be used to select and exclude variables and observations.
Suppose we have the dataframe  "mydata" on which we perform various subsetting operations..
> filepath="T:/My_Tech_Lab/R_Lab/Dataframes/Dataset_FY16.txt"
> mydata<-read.csv(filepath,header=T,sep="\t",as.is=T,stringsAsFactors =F)

> mydata 
# with 48 observations and 9 variables


#Subsetting the data by excluding/removing only the Variables (2 to 4 and 6) , then storing the result into a new Dataframe subset1

> subset1<-mydata[,-c(2:4,6)]

> head(subset1)

  SalesOrder_Id Cust_Segment Gross_Sales  COGS Net_Sales
1     123456889       Pharma      735000 58800    654150
2     123456932  Automobiles        7000   350      6440
3     123456926  Automobiles        5100   306      4794
4     123456895       Pharma      420000 29400    378000
5     123456913      Banking        4500   225      4275
6     123456919      Banking      150000 12000    130500


Note :

Since the result is assigning to new Dataframe "subset1", the data in the original Dataframe "mydata" will remain as it is.

#Subsetting the data by excluding/removing the Observations whose row id >15 and the Variables 2 to 6 , then storing the result into a new Dataframe subset2
> subset2<-mydata[as.numeric(rownames(mydata))<=15,-c(2:6)]

> subset2   
SalesOrder_Id Gross_Sales   COGS Net_Sales
1      123456889      735000  58800    654150
2      123456932        7000    350      6440
3      123456926        5100    306      4794
4      123456895      420000  29400    378000
5      123456913        4500    225      4275
6      123456919      150000  12000    130500
7      123456927      975000  87750    877500
8      123456933        5100    306      4743
9      123456922      450000  40500    391500
10     123456900        3600    216      3276
11     123456916     1505000 105350   1354500
12     123456906        4000    200      3800
13     123456907        5850    351      5265
14     123456901      500000  45000    445000
15     123456892      315000  25200    277200

#Subsetting the data by Filtering Observations based on a Variable "FiscalQuarter"  and selecting only specific Variables (1,3,9) , then storing the result into a new Dataframe subset3.
> subset3<-mydata[mydata$FiscalQuarter=="Q1-2016"|mydata$FiscalQuarter=="Q4-2016",c(1,3,9)]

OR

> attach(mydata)
> subset3<-mydata[FiscalQuarter=="Q1-2016"|FiscalQuarter=="Q4-2016",c(1,3,9)]
> detach(mydata)

> subset3

   SalesOrder_Id FiscalQuarter Net_Sales
1      123456889       Q1-2016  654150.0
2      123456932       Q1-2016    6440.0
3      123456926       Q1-2016    4794.0
4      123456895       Q1-2016  378000.0
29     123456911       Q4-2016 1080000.0
30     123456890       Q4-2016    2136.0
31     123456905       Q4-2016 1236900.0
32     123456896       Q4-2016    4700.0
33     123456891       Q4-2016  556250.0
34     123456928       Q4-2016  382200.0
35     123456897       Q4-2016    4777.5
36     123456934       Q4-2016 1464750.0
37     123456915       Q4-2016 1214850.0
38     123456935       Q4-2016  586950.0
39     123456921       Q4-2016  413250.0
40     123456929       Q4-2016    6110.0
41     123456908       Q1-2016 1610000.0
42     123456912       Q1-2016  360000.0
43     123456902       Q1-2016   67500.0
44     123456918       Q1-2016    4725.0
45     123456888       Q1-2016  477750.0
46     123456909       Q1-2016  535050.0
47     123456903       Q1-2016   15015.0
48     123456894       Q1-2016  946000.0

#Subsetting the data by Filtering Observations based on the Variables "FiscalQuarter" and Net_Sales, selecting only specific Variables (1,3,9) , then storing the result into a new Dataframe subset4.
> attach(mydata)
> subset4<-mydata[(FiscalQuarter=="Q1-2016"|FiscalQuarter=="Q4-2016") &
as.integer(Net_Sales)<=16500,c(1,3,9)]

> detach(mydata)


OR

#Selection using the Subset Function:
The subset( ) function is the easiest way to select variables and observations.

>subset4<-subset(mydata,(FiscalQuarter=="Q1-2016"|FiscalQuarter=="Q4-2016") & as.integer(Net_Sales)<=16500,select=c(1,3,9))


> subset4 
SalesOrder_Id FiscalQuarter Net_Sales
2      123456932       Q1-2016    6440.0
3      123456926       Q1-2016    4794.0
30     123456890       Q4-2016    2136.0
32     123456896       Q4-2016    4700.0
35     123456897       Q4-2016    4777.5
40     123456929       Q4-2016    6110.0
44     123456918       Q1-2016    4725.0
47     123456903       Q1-2016   15015.0

# Taking a random sample of size 20 from a dataset mydata, sample without replacement
Use the sample( ) function to take a random sample of size n from a dataset.
> mysample <- mydata[sample(1:nrow(mydata), 20,replace=FALSE),c(1,3,9)]


> mysample
 
  SalesOrder_Id FiscalQuarter Net_Sales
37     123456915       Q4-2016 1214850.0
1      123456889       Q1-2016  654150.0
35     123456897       Q4-2016    4777.5
17     123456924       Q3-2016  740600.0
18     123456930       Q3-2016 1057500.0
43     123456902       Q1-2016   67500.0
5      123456913       Q2-2016    4275.0
28     123456910       Q3-2016    5076.0
21     123456917       Q3-2016    6762.0
12     123456906       Q2-2016    3800.0
7      123456927       Q2-2016  877500.0
13     123456907       Q2-2016    5265.0
26     123456904       Q3-2016  652500.0
14     123456901       Q2-2016  445000.0
23     123456923       Q3-2016   17100.0
30     123456890       Q4-2016    2136.0
29     123456911       Q4-2016 1080000.0
6      123456919       Q2-2016  130500.0
22     123456931       Q3-2016 1399650.0
38     123456935       Q4-2016  586950.0


# Taking a random sample of size 20 from a dataset mydata, sample with replacement
> mysample <- mydata[sample(1:nrow(mydata), 20,replace=TRUE),c(1,3,9)]

> mysample

    
SalesOrder_Id FiscalQuarter Net_Sales
46       123456909       Q1-2016  535050.0
16       123456898       Q2-2016 1464050.0
25       123456920       Q3-2016  495900.0
10       123456900       Q2-2016    3276.0
30       123456890       Q4-2016    2136.0
15       123456892       Q2-2016  277200.0
35       123456897       Q4-2016    4777.5
27       123456914       Q3-2016   13500.0
40       123456929       Q4-2016    6110.0
31       123456905       Q4-2016 1236900.0
28       123456910       Q3-2016    5076.0
15.1     123456892       Q2-2016  277200.0
6        123456919       Q2-2016  130500.0
22       123456931       Q3-2016 1399650.0
2        123456932       Q1-2016    6440.0
36       123456934       Q4-2016 1464750.0
25.1     123456920       Q3-2016  495900.0
2.1      123456932       Q1-2016    6440.0
6.1      123456919       Q2-2016  130500.0
48       123456894       Q1-2016  946000.0

Note :
When we say replace=TRUE, few Observations will be repeated in the Random sample with a unique row id (Eg:15.1 ,15.2... for 15)

--------------------------------------------------------------------------------------------------------
Thanks, TAMATAM ; Business Intelligence & Analytics Professional
--------------------------------------------------------------------------------------------------------

No comments:

Post a Comment

Hi User, Thank You for visiting My Blog. Please post your genuine Feedback or comments only related to this Blog Posts. Please do not post any Spam comments or Advertising kind of comments which will be Ignored.

Featured Post from this Blog

How to compare Current Snapshot Data with Previous Snapshot in Power BI

How to Dynamically compare two Snapshots Data in Power BI Scenario: Suppose, we have a sample Sales data, which is stored with Monthly Snaps...

Popular Posts from this Blog