Subsetting the data from a Dataframe in R
R has powerful indexing features for accessing object elements. These features can be used to select and exclude variables and observations.
Note :
Since the result is assigning to new Dataframe "subset1", the data in the original Dataframe "mydata" will remain as it is.
#Subsetting the data by excluding/removing the Observations whose row id >15 and the Variables 2 to 6 , then storing the result into a new Dataframe subset2
#Subsetting the data by Filtering Observations based on a Variable "FiscalQuarter" and selecting only specific Variables (1,3,9) , then storing the result into a new Dataframe subset3.
> subset3<-mydata[mydata$FiscalQuarter=="Q1-2016"|mydata$FiscalQuarter=="Q4-2016",c(1,3,9)]
> subset3
#Subsetting the data by Filtering Observations based on the Variables "FiscalQuarter" and Net_Sales, selecting only specific Variables (1,3,9) , then storing the result into a new Dataframe subset4.
> attach(mydata)
# Taking a random sample of size 20 from a dataset mydata, sample without replacement
Use the sample( ) function to take a random sample of size n from a dataset.
> mysample <- mydata[sample(1:nrow(mydata), 20,replace=FALSE),c(1,3,9)]
> mysample
SalesOrder_Id FiscalQuarter Net_Sales
37 123456915 Q4-2016 1214850.0
1 123456889 Q1-2016 654150.0
35 123456897 Q4-2016 4777.5
17 123456924 Q3-2016 740600.0
18 123456930 Q3-2016 1057500.0
43 123456902 Q1-2016 67500.0
5 123456913 Q2-2016 4275.0
28 123456910 Q3-2016 5076.0
21 123456917 Q3-2016 6762.0
12 123456906 Q2-2016 3800.0
7 123456927 Q2-2016 877500.0
13 123456907 Q2-2016 5265.0
26 123456904 Q3-2016 652500.0
14 123456901 Q2-2016 445000.0
23 123456923 Q3-2016 17100.0
30 123456890 Q4-2016 2136.0
29 123456911 Q4-2016 1080000.0
6 123456919 Q2-2016 130500.0
22 123456931 Q3-2016 1399650.0
38 123456935 Q4-2016 586950.0
Note :
When we say replace=TRUE, few Observations will be repeated in the Random sample with a unique row id (Eg:15.1 ,15.2... for 15)
R has powerful indexing features for accessing object elements. These features can be used to select and exclude variables and observations.
Suppose we have the dataframe "mydata" on which we perform various subsetting operations..
> filepath="T:/My_Tech_Lab/R_Lab/Dataframes/Dataset_FY16.txt"
> mydata<-read.csv(filepath,header=T,sep="\t",as.is=T,stringsAsFactors =F)
> mydata # with 48 observations and 9 variables
> mydata<-read.csv(filepath,header=T,sep="\t",as.is=T,stringsAsFactors =F)
> mydata # with 48 observations and 9 variables
#Subsetting the data by excluding/removing only the Variables (2 to 4 and 6) , then storing the result into a new Dataframe subset1
> subset1<-mydata[,-c(2:4,6)]
> head(subset1)
SalesOrder_Id Cust_Segment Gross_Sales COGS Net_Sales
1 123456889 Pharma 735000 58800 654150
2 123456932 Automobiles 7000 350 6440
3 123456926 Automobiles 5100 306 4794
4 123456895 Pharma 420000 29400 378000
5 123456913 Banking 4500 225 4275
6 123456919 Banking 150000 12000 130500
> head(subset1)
SalesOrder_Id Cust_Segment Gross_Sales COGS Net_Sales
1 123456889 Pharma 735000 58800 654150
2 123456932 Automobiles 7000 350 6440
3 123456926 Automobiles 5100 306 4794
4 123456895 Pharma 420000 29400 378000
5 123456913 Banking 4500 225 4275
6 123456919 Banking 150000 12000 130500
Note :
Since the result is assigning to new Dataframe "subset1", the data in the original Dataframe "mydata" will remain as it is.
#Subsetting the data by excluding/removing the Observations whose row id >15 and the Variables 2 to 6 , then storing the result into a new Dataframe subset2
> subset2<-mydata[as.numeric(rownames(mydata))<=15,-c(2:6)]
> subset2
SalesOrder_Id Gross_Sales COGS Net_Sales
1 123456889 735000 58800 654150
2 123456932 7000 350 6440
3 123456926 5100 306 4794
4 123456895 420000 29400 378000
5 123456913 4500 225 4275
6 123456919 150000 12000 130500
7 123456927 975000 87750 877500
8 123456933 5100 306 4743
9 123456922 450000 40500 391500
10 123456900 3600 216 3276
11 123456916 1505000 105350 1354500
12 123456906 4000 200 3800
13 123456907 5850 351 5265
14 123456901 500000 45000 445000
15 123456892 315000 25200 277200
> subset2
SalesOrder_Id Gross_Sales COGS Net_Sales
1 123456889 735000 58800 654150
2 123456932 7000 350 6440
3 123456926 5100 306 4794
4 123456895 420000 29400 378000
5 123456913 4500 225 4275
6 123456919 150000 12000 130500
7 123456927 975000 87750 877500
8 123456933 5100 306 4743
9 123456922 450000 40500 391500
10 123456900 3600 216 3276
11 123456916 1505000 105350 1354500
12 123456906 4000 200 3800
13 123456907 5850 351 5265
14 123456901 500000 45000 445000
15 123456892 315000 25200 277200
#Subsetting the data by Filtering Observations based on a Variable "FiscalQuarter" and selecting only specific Variables (1,3,9) , then storing the result into a new Dataframe subset3.
OR
> attach(mydata)
> subset3<-mydata[FiscalQuarter=="Q1-2016"|FiscalQuarter=="Q4-2016",c(1,3,9)]
> detach(mydata)
> subset3
SalesOrder_Id FiscalQuarter Net_Sales
1 123456889 Q1-2016 654150.0
2 123456932 Q1-2016 6440.0
3 123456926 Q1-2016 4794.0
4 123456895 Q1-2016 378000.0
29 123456911 Q4-2016 1080000.0
30 123456890 Q4-2016 2136.0
31 123456905 Q4-2016 1236900.0
32 123456896 Q4-2016 4700.0
33 123456891 Q4-2016 556250.0
34 123456928 Q4-2016 382200.0
35 123456897 Q4-2016 4777.5
36 123456934 Q4-2016 1464750.0
37 123456915 Q4-2016 1214850.0
38 123456935 Q4-2016 586950.0
39 123456921 Q4-2016 413250.0
40 123456929 Q4-2016 6110.0
41 123456908 Q1-2016 1610000.0
42 123456912 Q1-2016 360000.0
43 123456902 Q1-2016 67500.0
44 123456918 Q1-2016 4725.0
45 123456888 Q1-2016 477750.0
46 123456909 Q1-2016 535050.0
47 123456903 Q1-2016 15015.0
48 123456894 Q1-2016 946000.0
> attach(mydata)
> subset4<-mydata[(FiscalQuarter=="Q1-2016"|FiscalQuarter=="Q4-2016") &
as.integer(Net_Sales)<=16500,c(1,3,9)]
> detach(mydata)
as.integer(Net_Sales)<=16500,c(1,3,9)]
> detach(mydata)
OR
#Selection using the Subset Function:
The subset( ) function is the easiest way to select variables and observations.
>subset4<-subset(mydata,(FiscalQuarter=="Q1-2016"|FiscalQuarter=="Q4-2016") & as.integer(Net_Sales)<=16500,select=c(1,3,9))
The subset( ) function is the easiest way to select variables and observations.
>subset4<-subset(mydata,(FiscalQuarter=="Q1-2016"|FiscalQuarter=="Q4-2016") & as.integer(Net_Sales)<=16500,select=c(1,3,9))
> subset4
SalesOrder_Id FiscalQuarter Net_Sales
2 123456932 Q1-2016 6440.0
3 123456926 Q1-2016 4794.0
30 123456890 Q4-2016 2136.0
32 123456896 Q4-2016 4700.0
35 123456897 Q4-2016 4777.5
40 123456929 Q4-2016 6110.0
44 123456918 Q1-2016 4725.0
47 123456903 Q1-2016 15015.0
SalesOrder_Id FiscalQuarter Net_Sales
2 123456932 Q1-2016 6440.0
3 123456926 Q1-2016 4794.0
30 123456890 Q4-2016 2136.0
32 123456896 Q4-2016 4700.0
35 123456897 Q4-2016 4777.5
40 123456929 Q4-2016 6110.0
44 123456918 Q1-2016 4725.0
47 123456903 Q1-2016 15015.0
# Taking a random sample of size 20 from a dataset mydata, sample without replacement
Use the sample( ) function to take a random sample of size n from a dataset.
> mysample <- mydata[sample(1:nrow(mydata), 20,replace=FALSE),c(1,3,9)]
> mysample
SalesOrder_Id FiscalQuarter Net_Sales
37 123456915 Q4-2016 1214850.0
1 123456889 Q1-2016 654150.0
35 123456897 Q4-2016 4777.5
17 123456924 Q3-2016 740600.0
18 123456930 Q3-2016 1057500.0
43 123456902 Q1-2016 67500.0
5 123456913 Q2-2016 4275.0
28 123456910 Q3-2016 5076.0
21 123456917 Q3-2016 6762.0
12 123456906 Q2-2016 3800.0
7 123456927 Q2-2016 877500.0
13 123456907 Q2-2016 5265.0
26 123456904 Q3-2016 652500.0
14 123456901 Q2-2016 445000.0
23 123456923 Q3-2016 17100.0
30 123456890 Q4-2016 2136.0
29 123456911 Q4-2016 1080000.0
6 123456919 Q2-2016 130500.0
22 123456931 Q3-2016 1399650.0
38 123456935 Q4-2016 586950.0
# Taking a random sample of size 20 from a dataset mydata, sample with replacement
> mysample <- mydata[sample(1:nrow(mydata), 20,replace=TRUE),c(1,3,9)]
> mysample
SalesOrder_Id FiscalQuarter Net_Sales
46 123456909 Q1-2016 535050.0
16 123456898 Q2-2016 1464050.0
25 123456920 Q3-2016 495900.0
10 123456900 Q2-2016 3276.0
30 123456890 Q4-2016 2136.0
15 123456892 Q2-2016 277200.0
35 123456897 Q4-2016 4777.5
27 123456914 Q3-2016 13500.0
40 123456929 Q4-2016 6110.0
31 123456905 Q4-2016 1236900.0
28 123456910 Q3-2016 5076.0
15.1 123456892 Q2-2016 277200.0
6 123456919 Q2-2016 130500.0
22 123456931 Q3-2016 1399650.0
2 123456932 Q1-2016 6440.0
36 123456934 Q4-2016 1464750.0
25.1 123456920 Q3-2016 495900.0
2.1 123456932 Q1-2016 6440.0
6.1 123456919 Q2-2016 130500.0
48 123456894 Q1-2016 946000.0
> mysample
SalesOrder_Id FiscalQuarter Net_Sales
46 123456909 Q1-2016 535050.0
16 123456898 Q2-2016 1464050.0
25 123456920 Q3-2016 495900.0
10 123456900 Q2-2016 3276.0
30 123456890 Q4-2016 2136.0
15 123456892 Q2-2016 277200.0
35 123456897 Q4-2016 4777.5
27 123456914 Q3-2016 13500.0
40 123456929 Q4-2016 6110.0
31 123456905 Q4-2016 1236900.0
28 123456910 Q3-2016 5076.0
15.1 123456892 Q2-2016 277200.0
6 123456919 Q2-2016 130500.0
22 123456931 Q3-2016 1399650.0
2 123456932 Q1-2016 6440.0
36 123456934 Q4-2016 1464750.0
25.1 123456920 Q3-2016 495900.0
2.1 123456932 Q1-2016 6440.0
6.1 123456919 Q2-2016 130500.0
48 123456894 Q1-2016 946000.0
Note :
When we say replace=TRUE, few Observations will be repeated in the Random sample with a unique row id (Eg:15.1 ,15.2... for 15)
--------------------------------------------------------------------------------------------------------
Thanks, TAMATAM ; Business Intelligence & Analytics Professional
--------------------------------------------------------------------------------------------------------
Thanks, TAMATAM ; Business Intelligence & Analytics Professional
--------------------------------------------------------------------------------------------------------
No comments:
Post a Comment
Hi User, Thank You for visiting My Blog. Please post your genuine Feedback or comments only related to this Blog Posts. Please do not post any Spam comments or Advertising kind of comments which will be Ignored.