Following from my long theoretical post about the people in the middle, I now need to split a very long list of users into three groups, according to the number of posts (frequency), so that each group is responsible for 1/3 of posts.
In other words, group one will be responsible for percentiles 0% to 33% of posts. Group two will be responsible for percentiles 34% to 66% of posts. Finally, group three will be responsible for percentiles 67% to 100% of posts.
To visualise it, and to have a better idea of what type of data we want to end up with, we should probably test the procedure on a smaller data set, where we can easily check the answers. The following table is a sample I created, and the aim is to provide the data for columns 1 and 2, and then use R to calculate the data for columns 3 to 5.
user  Freq  pcp   cumSum  groups
1     11    0.11  0.11    1
2     11    0.11  0.22    1
3     11    0.11  0.33    1
4     7     0.07  0.40    2
5     7     0.07  0.47    2
6     7     0.07  0.54    2
7     6     0.06  0.60    2
8     6     0.06  0.66    2
9     5     0.05  0.71    3
10    5     0.05  0.76    3
11    5     0.05  0.81    3
12    5     0.05  0.86    3
13    5     0.05  0.91    3
14    5     0.05  0.96    3
15    4     0.04  1.00    3
So we have a total of 15 users, with 100 posts between them. The pcp column is the user’s post percentage, calculated as user posts divided by total posts. The cumSum column is the cumulative sum of those percentages. Finally, the last column, groups, is the one we are really interested in, where users are divided along percentiles as described above.
So the first step is to add the data for the first two columns.
user <- c(1:15)
Freq <- c(11,11,11,7,7,7,6,6,5,5,5,5,5,5,4)
The 1:15 part is a neat little trick which is short for “all the whole numbers from 1 to 15”. This trick can be used for other things too: if you want to choose columns 4, 5, 6, 7, and 8, all you need to do is write 4:8 and you’re done.
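Here is a small sketch of the colon trick in action. The data frame called demo is made up purely for illustration (it is not part of the data used in this post); it just gives us ten columns to pick from.

```r
# The colon operator builds a sequence of whole numbers.
idx <- 4:8
print(idx)                      # 4 5 6 7 8

# Wrapping a single sequence in c() changes nothing, so c(1:15) equals 1:15.
print(identical(c(1:15), 1:15))

# Hypothetical 10-column data frame, only here to show column selection.
demo <- as.data.frame(matrix(1:20, nrow = 2, ncol = 10))
subset_cols <- demo[, 4:8]      # keeps columns 4 to 8
print(names(subset_cols))
```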
Then we create the data frame called dfAL and put the two above columns in it
dfAL <- data.frame(user,Freq)
Then we do a quick check, just to make sure it’s all there.
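A quick check could look something like the sketch below. Which functions to use is a matter of taste; head(), str() and dim() are simply the ones I would assume are handy here.

```r
user <- c(1:15)
Freq <- c(11, 11, 11, 7, 7, 7, 6, 6, 5, 5, 5, 5, 5, 5, 4)
dfAL <- data.frame(user, Freq)

head(dfAL)             # first six rows
str(dfAL)              # column names and types
dim(dfAL)              # number of rows and columns
total_posts <- sum(dfAL$Freq)
print(total_posts)     # total number of posts in the sample
```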
Okay, then we need to add the pcp column, like this
dfAL$pcp <- dfAL$Freq/sum(dfAL$Freq)
A more detailed explanation of what all that means can be found here, but for now, that’s how it’s done… honest.
Again, let’s do a quick check and make sure we’ve not messed things up…
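One possible sanity check, as a sketch: the shares in pcp should add up to 1. I am assuming all.equal() is the right tool here because it compares decimals with a small tolerance instead of demanding an exact match.

```r
user <- c(1:15)
Freq <- c(11, 11, 11, 7, 7, 7, 6, 6, 5, 5, 5, 5, 5, 5, 4)
dfAL <- data.frame(user, Freq)
dfAL$pcp <- dfAL$Freq / sum(dfAL$Freq)

# The shares of all users together should come to 1 (within tolerance).
shares_ok <- isTRUE(all.equal(sum(dfAL$pcp), 1))
print(shares_ok)

# Rounded to two decimals, this should match the pcp column of the sample table.
print(round(dfAL$pcp, 2))
```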
Now, let’s go for the cumulative sum in the cumSum column. This is why R is awesome: no need to do any complicated stuff, just a simple cumsum( ) command will suffice.
However, we still need to tell it where to put the data.
dfAL$cumSum <- cumsum(dfAL$pcp)
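To see what cumsum( ) does on its own, here is a tiny sketch with a throwaway vector, followed by the same step on our sample data.

```r
# Each element is the running total of everything up to and including it.
running <- cumsum(c(1, 2, 3, 4))
print(running)                   # 1 3 6 10

user <- c(1:15)
Freq <- c(11, 11, 11, 7, 7, 7, 6, 6, 5, 5, 5, 5, 5, 5, 4)
dfAL <- data.frame(user, Freq)
dfAL$pcp <- dfAL$Freq / sum(dfAL$Freq)
dfAL$cumSum <- cumsum(dfAL$pcp)

# Rounded to two decimals, this should match the cumSum column of the sample table.
print(round(dfAL$cumSum, 2))
```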
Now for the last part. I have to admit, I am sure there is a better way of doing this, but this is the only way I know. If you have a better method, PLEASE let me know.
Anyhow, here is how I do it.
First we tell R to create a new column called groups and, if the value in column cumSum is smaller than or equal to 0.33, to write the number 1 in the new groups column. Like this:
dfAL$groups[(dfAL$cumSum)<=0.33] <-1
Doing this will grab all percentiles from 0 to 33 and put them in group 1. Then, we need to tell it that all users whose cumulative share falls between 0.33 and 0.66 should be labelled 2.
Therefore, the value of cumSum should be more than 0.33 AND smaller than or equal to 0.66, like this
dfAL$groups[(dfAL$cumSum)>0.33 & (dfAL$cumSum)<=0.66 ]<-2
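The pattern in that line is logical indexing: the condition inside the square brackets produces TRUE or FALSE for every row, and only the TRUE rows receive the new value. A minimal sketch with a made-up vector x (not our real data) shows the mechanics.

```r
# Hypothetical cumulative shares, chosen to sit on both sides of the cut-offs.
x <- c(0.10, 0.40, 0.66, 0.90)

# & combines the two conditions element by element.
in_middle <- x > 0.33 & x <= 0.66
print(in_middle)                 # FALSE TRUE TRUE FALSE

# Start with missing values, then label only the rows where the condition holds.
labels <- rep(NA_real_, length(x))
labels[in_middle] <- 2
print(labels)
```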
Finally, all users that are in the top percentiles (more than 0.66) are to be placed in group 3, like this
dfAL$groups[(dfAL$cumSum)>0.66] <-3
So, if we check the data frame with a quick look at dfAL, we should get the same table we have at the beginning of this post.
We can also check how many users are in each group, for example with table(dfAL$groups), which should return
1 2 3
3 5 7
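On the question of a better method: one possible alternative, offered as a sketch and not as the definitive way, is base R’s cut( ), which assigns all three groups in one go. Comparing a running sum of decimals against exact cut-offs like 0.33 and 0.66 can be sensitive to floating-point rounding, so this version works out the cumulative share from the whole-number post counts instead. The name cumShare is my own invention for this example.

```r
user <- 1:15
Freq <- c(11, 11, 11, 7, 7, 7, 6, 6, 5, 5, 5, 5, 5, 5, 4)
dfAL <- data.frame(user, Freq)

# Cumulative share from whole-number counts: each value is one division,
# not a running sum of decimals.
cumShare <- cumsum(dfAL$Freq) / sum(dfAL$Freq)

# Intervals are (0, 0.33], (0.33, 0.66] and (0.66, 1];
# labels = FALSE returns the interval number 1, 2 or 3.
dfAL$groups <- cut(cumShare, breaks = c(0, 0.33, 0.66, 1), labels = FALSE)

# For this sample: 3 users in group 1, 5 in group 2, 7 in group 3.
print(table(dfAL$groups))
```

The cut-offs stay at 0.33 and 0.66 to match the rest of this post; swapping them for 1/3 and 2/3 would be a small change to the breaks argument.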
The Lazy Man’s Copy and Paste
user <- c(1:15)
Freq <- c(11,11,11,7,7,7,6,6,5,5,5,5,5,5,4)
dfAL <- data.frame(user,Freq)
dfAL$pcp <- dfAL$Freq/sum(dfAL$Freq)
dfAL$cumSum <- cumsum(dfAL$pcp)
dfAL$groups[(dfAL$cumSum)<=0.33] <-1
dfAL$groups[(dfAL$cumSum)>0.33 & (dfAL$cumSum)<=0.66 ] <-2
dfAL$groups[(dfAL$cumSum)>0.66] <-3