Get a Frequency Count
The user file looks like this
userID,user,posts
1,user1,581
2,user2,281
3,user3,196
...
2002,usern-2,1
2003,usern-1,1
2004,usern,1
First thing is to read the file
> df<-read.csv('path/to/file.csv')
Then we get the frequency count
# first we figure out what the names
# of the columns are
> names(df)
[1] "userID" "user"   "posts"
# we want to count the posts...so
# (df$posts passes a plain vector, so the count column is named Var1)
> postFreqCount<-data.frame(table(df$posts))
The frequency count should now return something like this
> postFreqCount
   Var1 Freq
1     1  723
2     2  314
3     3  186
...
84  196    1
85  281    1
86  851    1
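If you want to try the counting step without the real file, here is a minimal sketch on made-up data. The six rows and the names `toy`, `toyFreq` and `postCounts` are assumptions for illustration only; `read.csv(text = ...)` simply stands in for reading a file from disk.

```r
# Made-up sample in the same userID,user,posts layout (illustrative only)
toy <- read.csv(text = "userID,user,posts
1,a,5
2,b,2
3,c,2
4,d,1
5,e,1
6,f,1")

# table() counts how many users share each post count;
# data.frame() turns the result into Var1 (post count) and Freq (users)
toyFreq <- data.frame(table(toy$posts))
print(toyFreq)

# Var1 is stored as a factor, so convert via character before doing arithmetic
postCounts <- as.numeric(as.character(toyFreq$Var1))
```

Note that Var1 comes back as a factor rather than a number, which matters as soon as you want to take its maximum or its log.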
Building the Scatter Plot
We need to use Freq as the x coordinates and Var1 as the y coordinates.
# Var1 is a factor, so convert it to numbers via character
> x<-postFreqCount$Freq
> y<-as.numeric(as.character(postFreqCount$Var1))
Now a simple scatter plot can be made like so
> plot(x,y)
Which will look like this…
Which does not look that great, so we will have to apply the log scales.
A simple way of doing it is like this…
> plot(log(x),log(y))
Which will give you something like this…
Which does look a bit better, except that the scales on the axes are from 0 to 6 instead of the real values.
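The 0 to 6 range appears because `log()` in R is the natural logarithm, so the ticks are exponents of e rather than counts. A quick check, using made-up values (`vals`, `logged` and `restored` are just illustrative names):

```r
vals <- c(1, 10, 100, 400)   # made-up counts, for illustration
logged <- log(vals)          # natural log: what plot(log(x), log(y)) actually draws
print(round(logged, 2))

# exp() undoes the transform, which is the mental arithmetic a reader
# would have to do to read real values off the logged axes
restored <- exp(logged)
print(restored)
```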
Applying the log scales
To get the scales right we need to change the way we call the plot() function.
First, we use the xy.coords() function to set the coordinates for the plot.
Then we add the scale range for each axis using xlim and ylim, both running from 1 to their maximum value (starting from zero will give you an error, since log(0) is not finite).
Now we can apply the log to both axes using log="xy".
Finally, we can label both axes with xlab and ylab.
The final function with all its parameters looks a bit like this…
plot(
  xy.coords(x,y),
  xlim=c(1,max(x)),
  ylim=c(1,max(y)),
  log="xy",
  xlab="Frequency",
  ylab="Posts"
)
The graph now looks like this…
PERFECT!
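To experiment with these parameters without the forum data, the sketch below draws the same kind of plot from made-up numbers. The vectors `posts` and `users` are assumptions, and `pdf(NULL)` is only there so the example also runs on a machine without a display.

```r
# Made-up frequency table: 'users' users each wrote 'posts' posts
posts <- c(1, 2, 3, 10, 50, 200)
users <- c(700, 300, 180, 20, 3, 1)

pdf(NULL)  # null graphics device, so no file or window is created
plot(
  xy.coords(users, posts),
  xlim = c(1, max(users)),   # limits must be > 0 on a log axis
  ylim = c(1, max(posts)),
  log  = "xy",               # log scale on both axes, ticks show real values
  xlab = "Frequency",
  ylab = "Posts"
)
# par() reports whether each axis is currently on a log scale
logFlags <- c(par("xlog"), par("ylog"))
invisible(dev.off())
```

Drop the `pdf(NULL)` and `dev.off()` lines to see the plot in an interactive session.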
Function that saves it to pdf
logPlot<-function(fileDir,name){
  df<-read.csv(fileDir)
  pfc<-data.frame(table(df$posts))
  x<-pfc$Freq
  # Var1 is a factor: convert through character to get the real post counts
  y<-as.numeric(as.character(pfc$Var1))
  # paste0() joins the pieces without the spaces that paste() would insert
  pdfPath<-paste0('Desktop/',name,'.pdf')
  pdf(pdfPath)
  plot(
    xy.coords(x,y),
    xlim=c(1,max(x)),
    ylim=c(1,max(y)),
    log="xy",
    xlab="Frequency",
    ylab="Posts",
    main=name
  )
  dev.off()
}
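Since the function above hard-codes `Desktop/`, one possible variation is to pass the output directory in as well. This is only a sketch: `logPlotTo` and its `outDir` argument are hypothetical names, not part of the original, and the demo writes a made-up CSV to a temporary location.

```r
# Hypothetical variant of logPlot() that takes the output directory as an argument
logPlotTo <- function(csvPath, name, outDir = tempdir()) {
  df  <- read.csv(csvPath)
  pfc <- data.frame(table(df$posts))
  x <- pfc$Freq
  y <- as.numeric(as.character(pfc$Var1))  # factor levels -> real post counts
  pdfPath <- file.path(outDir, paste0(name, ".pdf"))
  pdf(pdfPath)
  on.exit(dev.off())  # close the device even if plot() fails
  plot(xy.coords(x, y), xlim = c(1, max(x)), ylim = c(1, max(y)),
       log = "xy", xlab = "Frequency", ylab = "Posts", main = name)
  invisible(pdfPath)  # return the path so the caller can find the file
}

# Demo on made-up data written to a temporary file
csv <- tempfile(fileext = ".csv")
write.csv(data.frame(userID = 1:5, user = letters[1:5],
                     posts = c(9, 4, 4, 1, 1)), csv, row.names = FALSE)
out <- logPlotTo(csv, "demo")
```

Returning the path invisibly keeps the console quiet while still letting you do something like `file.exists(out)` afterwards.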