Following from my long theoretical post about the people in the middle, I now need to split a very long list of users into three groups, according to the number of posts (frequency), so that each group is responsible for 1/3 of posts.
In other words, group one will be responsible for percentiles 0% to 33% of posts. Group two will be responsible for percentiles 34% to 66% of posts. Finally, group three will be responsible for percentiles 67% to 100% of posts.
To visualise it, and to have a better idea of what type of data we want to end up with, we should probably test the procedure on a smaller data set, where we can easily check the answers. The following table is a sample I created, and the aim is to provide the data for columns 1 and 2, and then use R to calculate the data for columns 3 to 5.
user  Freq  pcp   cumSum  groups
1     11    0.11  0.11    1
2     11    0.11  0.22    1
3     11    0.11  0.33    1
4     7     0.07  0.40    2
5     7     0.07  0.47    2
6     7     0.07  0.54    2
7     6     0.06  0.60    2
8     6     0.06  0.66    2
9     5     0.05  0.71    3
10    5     0.05  0.76    3
11    5     0.05  0.81    3
12    5     0.05  0.86    3
13    5     0.05  0.91    3
14    5     0.05  0.96    3
15    4     0.04  1.00    3
So we have a total of 15 users, with 100 posts between them. The pcp column is the user’s post percentage, calculated as user posts divided by total posts. The cumSum column is the cumulative sum of those percentages. Finally, the last column, groups, is the one we are really interested in, where users are divided along percentiles as described above.
So the first step is to add the data for the first two columns.
user <- c(1:15)
Freq <- c(11,11,11,7,7,7,6,6,5,5,5,5,5,5,4)
The 1:15
part is a neat little trick which is short for “all the values between 1 and 15”. This trick can be used for other things too, like if you want to choose columns 4, 5, 6, 7, and 8 all you need to do is write 4:8
and you’re done.
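As a quick illustration of the colon trick, here is a minimal sketch. The demo data frame is made up purely for this example and is not part of the analysis.

```r
# 1:15 expands to the whole numbers 1, 2, ..., 15
user <- 1:15
length(user)   # 15 values

# 'demo' is a hypothetical data frame with 10 columns, for illustration only
demo <- as.data.frame(matrix(1:20, nrow = 2, ncol = 10))

# Keep only columns 4 to 8
middle <- demo[, 4:8]
names(middle)  # "V4" "V5" "V6" "V7" "V8"
```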
Then we create the data frame called dfAL
and put the two above columns in it
dfAL <- data.frame(user,Freq)
Then we do a quick
dfAL
Just to make sure it’s all there.
Okay, then we need to add the pcp
column, like this
dfAL$pcp <- dfAL$Freq/sum(dfAL$Freq)
A more detailed explanation of what all that means can be found here, but for now, that’s how it’s done… honest.
Again, let’s do a quick check and make sure we’ve not messed things up…
dfAL
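Beyond eyeballing the table, a slightly stricter check is to confirm that the shares add up to 1. This is only a sketch; all.equal() is used instead of == because fractions stored as floating-point numbers can be off by a tiny amount.

```r
user <- 1:15
Freq <- c(11, 11, 11, 7, 7, 7, 6, 6, 5, 5, 5, 5, 5, 5, 4)
dfAL <- data.frame(user, Freq)
dfAL$pcp <- dfAL$Freq / sum(dfAL$Freq)

# The shares of all users together should add up to 1 (100% of posts).
# all.equal() tolerates tiny floating-point rounding differences.
isTRUE(all.equal(sum(dfAL$pcp), 1))

# The first user wrote 11 of the 100 posts
dfAL$pcp[1]
```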
Now, let’s go for the cumulative sum in the cumSum
column. This is why R is awesome, no need to do any complicated stuff, just a simple cumsum( )
command will suffice.
However, we still need to tell it where to put the data.
dfAL$cumSum <- cumsum(dfAL$pcp)
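To see what cumsum() does on its own, here is a small sketch using the first four post counts from the sample, followed by two sanity checks on the full column. The standalone vectors pcp and cumSum are used here only to keep the example short.

```r
# cumsum() returns the running total of a vector
cumsum(c(11, 11, 11, 7))   # 11 22 33 40

Freq <- c(11, 11, 11, 7, 7, 7, 6, 6, 5, 5, 5, 5, 5, 5, 4)
pcp <- Freq / sum(Freq)
cumSum <- cumsum(pcp)

# Every user adds a positive share, so the running total always increases
all(diff(cumSum) > 0)

# The last value should be 1, allowing for floating-point rounding
isTRUE(all.equal(cumSum[length(cumSum)], 1))
```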
Freaking AWESOME!!
Now for the last part. I have to admit, I am sure there is a better way of doing this, but this is the only way I know. If you have a better method, PLEASE let me know.
Anyhow, here is how I do it.
First we tell R to create a new column called groups and, if the value in column cumSum
is smaller than or equal to 0.33, then it should write the number 1 in the new groups column. Like this…
dfAL$groups[(dfAL$cumSum)<=0.33]<-1
Doing this will grab all users whose cumulative share lies between the 0 and 33 percentiles and put them in group 1. Then, we need to tell it that all users that fall between the 0.33 and 0.66 percentiles should be labelled 2.
Therefore, the value of cumSum should be more than 0.33 AND smaller or equal to 0.66, like this
dfAL$groups[(dfAL$cumSum)>0.33 & (dfAL$cumSum)<=0.66 ]<-2
Finally, all users that are in the top percentiles (more than 0.66) are to be placed in group 3, like this
dfAL$groups[(dfAL$cumSum)>0.66]<-3
So, if we check the data frame with a quick
dfAL
we should get the exact same table we have at the beginning of this post.
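Since a tidier method was asked for earlier, one possible option is base R's cut(), which assigns each value to an interval in a single call. This is only a sketch of an alternative to the three-step method above. The -Inf and Inf end points are an assumption, chosen so that no value can fall outside the breaks because of floating-point rounding.

```r
user <- 1:15
Freq <- c(11, 11, 11, 7, 7, 7, 6, 6, 5, 5, 5, 5, 5, 5, 4)
dfAL <- data.frame(user, Freq)
dfAL$pcp <- dfAL$Freq / sum(dfAL$Freq)
dfAL$cumSum <- cumsum(dfAL$pcp)

# Intervals are closed on the right by default:
# (-Inf, 0.33], (0.33, 0.66], (0.66, Inf)
# labels = FALSE returns the interval number (1, 2 or 3) instead of a factor
dfAL$groups <- cut(dfAL$cumSum,
                   breaks = c(-Inf, 0.33, 0.66, Inf),
                   labels = FALSE)

table(dfAL$groups)
```

Because cut() uses the same "greater than the lower break, less than or equal to the upper break" comparisons, it should label every user the same way as the three separate assignments.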
We can also check how many users are in each group with
table(dfAL$groups)
which should return
1 2 3
3 5 7
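Because the goal is for each group to account for roughly a third of all posts, it may also be worth totalling the posts per group. Here is a minimal sketch using base R's tapply(). The postsPerGroup name is just for illustration. If the groups match the table at the top of this post, the totals should come out as 33, 33 and 34 posts.

```r
user <- 1:15
Freq <- c(11, 11, 11, 7, 7, 7, 6, 6, 5, 5, 5, 5, 5, 5, 4)
dfAL <- data.frame(user, Freq)
dfAL$pcp <- dfAL$Freq / sum(dfAL$Freq)
dfAL$cumSum <- cumsum(dfAL$pcp)

# Start with an empty column, then fill it in group by group
dfAL$groups <- NA
dfAL$groups[dfAL$cumSum <= 0.33] <- 1
dfAL$groups[dfAL$cumSum > 0.33 & dfAL$cumSum <= 0.66] <- 2
dfAL$groups[dfAL$cumSum > 0.66] <- 3

# Total number of posts written by the users in each group
postsPerGroup <- tapply(dfAL$Freq, dfAL$groups, sum)
postsPerGroup

# Share of all posts per group
postsPerGroup / sum(dfAL$Freq)
```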
The Lazy Man’s Copy and Paste
user <- c(1:15)
Freq <- c(11,11,11,7,7,7,6,6,5,5,5,5,5,5,4)
dfAL <- data.frame(user,Freq)
dfAL$pcp <- dfAL$Freq/sum(dfAL$Freq)
dfAL$cumSum <- cumsum(dfAL$pcp)
dfAL$groups[(dfAL$cumSum)<=0.33] <- 1
dfAL$groups[(dfAL$cumSum)>0.33 & (dfAL$cumSum)<=0.66] <- 2
dfAL$groups[(dfAL$cumSum)>0.66] <- 3