I’m moving!

Got a new host and I’m moving the site over to them. Same address – http://www.josechristian.com/

I’ve been wanting to get a better host so I can upload some d3.js things I’ve been working on.

The blog will continue at

http://www.josechristian.com/blog

Right now I’m a bit busy, so it will take me a while to put up a home page.

But it should all be up and running soon.


Dealing with dates in pandas – Python

First things first: import the pandas module and read the CSV file.

>>> import pandas as pd
>>> df = pd.read_csv('path/to/file.csv', encoding='utf-8')

So, we have a simple data frame like this…

>>> df
        start         end
0  2001-06-01  2004-02-01
1  2001-11-01  2003-12-01
2  2005-04-01  2007-03-01
3  2005-05-01  2007-03-01

…and we want to calculate the amount of time between the start and the end.

The main problem is that the dates are read as strings (when read from a file), so there is very little we can do with them right now.

To change the column to dates, we can use the to_datetime function in pandas. We also don’t want to be completely destructive and risk messing up the data, so we put the newly formatted dates into new columns (start_d and end_d), like this…

>>> df['start_d'] = pd.to_datetime(df['start'])
>>> df['end_d'] = pd.to_datetime(df['end'])

Now, our data frame should look a bit like this

>>> df
        start         end    start_d      end_d
0  2001-06-01  2004-02-01 2001-06-01 2004-02-01
1  2001-11-01  2003-12-01 2001-11-01 2003-12-01
2  2005-04-01  2007-03-01 2005-04-01 2007-03-01
3  2005-05-01  2007-03-01 2005-05-01 2007-03-01
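
To confirm that the conversion worked, it can help to check the column types. This is a minimal sketch for Python 3. It assumes the data is typed in by hand instead of read from a file (so that it runs on its own) and passes an explicit format to to_datetime, which the steps above did not do.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# the same data as above, typed in by hand instead of read from a csv
df = pd.DataFrame({
    'start': ['2001-06-01', '2001-11-01', '2005-04-01', '2005-05-01'],
    'end':   ['2004-02-01', '2003-12-01', '2007-03-01', '2007-03-01'],
})

# an explicit format avoids any guessing about day/month order
df['start_d'] = pd.to_datetime(df['start'], format='%Y-%m-%d')
df['end_d'] = pd.to_datetime(df['end'], format='%Y-%m-%d')

# the original column is still text, the new one holds real dates
bool_start_is_date = is_datetime64_any_dtype(df['start'])
bool_start_d_is_date = is_datetime64_any_dtype(df['start_d'])
print(bool_start_is_date, bool_start_d_is_date)
```

If the second value is not True, the strings probably did not match the format that was given.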

To calculate the length of time between start and end, we simply subtract start_d from end_d, like this

>>> df['len'] = df['end_d'] - df['start_d']

which will result in the difference being calculated in days, leaving the data frame looking like this

        start         end    start_d      end_d      len
0  2001-06-01  2004-02-01 2001-06-01 2004-02-01 975 days
1  2001-11-01  2003-12-01 2001-11-01 2003-12-01 760 days
2  2005-04-01  2007-03-01 2005-04-01 2007-03-01 699 days
3  2005-05-01  2007-03-01 2005-05-01 2007-03-01 669 days
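
The len column holds timedelta values rather than plain numbers. If whole numbers are easier to work with (for sorting or plotting, say), the .dt.days accessor pulls them out. A small sketch for Python 3, with the data typed in by hand and len_days as a made-up column name:

```python
import pandas as pd

# the same data as above, typed in by hand instead of read from a csv
df = pd.DataFrame({
    'start': ['2001-06-01', '2001-11-01', '2005-04-01', '2005-05-01'],
    'end':   ['2004-02-01', '2003-12-01', '2007-03-01', '2007-03-01'],
})

# convert the strings to dates and subtract, as before
df['start_d'] = pd.to_datetime(df['start'])
df['end_d'] = pd.to_datetime(df['end'])
df['len'] = df['end_d'] - df['start_d']

# .dt.days turns each timedelta into a whole number of days
df['len_days'] = df['len'].dt.days

list_days = df['len_days'].tolist()
print(list_days)
```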

Using the .isin() function in Pandas – Python

The .isin() function is a powerful tool that lets you search for several values in a data frame at once.

This is how it’s done.

We start by creating a simple data frame


import pandas as pd

df = pd.DataFrame({'col_1':[1,2,3,4],'col_2':[2,3,4,1]})

The data frame should look something like this

   col_1  col_2
0      1      2
1      2      3
2      3      4
3      4      1

Now, we will use the .isin() function to select all the rows that have either the number 1 or 4 in col_1, and we’ll put them in a new data frame called df_14.

We do this by making a list of the values and passing that list to the .isin() function.

Either like this, if we have a short list…

df_14 = df[df['col_1'].isin([1,4])]

…or like this, when we have a longer list.

list_numbers = [1,4]
df_14 = df[df['col_1'].isin(list_numbers)]

The new data frame should look like this

   col_1  col_2
0      1      2
3      4      1

Simple.

But what if we want all the values that DO NOT match those in the list?

We can do this by adding ==False after the .isin() call, like this

df_not_14 = df[df['col_1'].isin([1,4])==False]

and this new data frame would look like this

   col_1  col_2
1      2      3
2      3      4
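
Another way to get the same rows is the ~ operator, which flips a True/False column. This sketch (Python 3) also prints what .isin() returns on its own, since that is what the filtering is built on; mask_14 is just a name picked for the example.

```python
import pandas as pd

df = pd.DataFrame({'col_1': [1, 2, 3, 4], 'col_2': [2, 3, 4, 1]})

# .isin() returns one True/False value per row
mask_14 = df['col_1'].isin([1, 4])
print(mask_14.tolist())

# ~ flips every value, so only the rows that do NOT match are kept
df_not_14 = df[~mask_14]
print(df_not_14)
```

Both versions give the same result, so it mostly comes down to which one you find easier to read.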

Create a range of dates using Pandas – Python

Here is how to create a range of dates using the Pandas module.

The range will start on April 2, 2014 and end on October 1, 2014, with one date per day.

Well…here it is

# import pandas; the dates it produces have their own
# strftime method, so the datetime module is not needed
import pandas as pd

# start and end dates
str_start = '2014-04-02'
str_end = '2014-10-01'

# create the range using the pandas function
pd_date_range = pd.date_range(str_start,str_end)

# now we can print it in the yyyy-mm-dd format.
# (print with brackets works in both Python 2 and 3)
for pd_date in pd_date_range:
    print(pd_date.strftime('%Y-%m-%d'))

That’s pretty much it!
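
If you want the formatted dates as a list rather than printed one by one, strftime can also be called on the whole range. A short sketch for Python 3; the monthly range at the end uses the 'MS' (month start) frequency, which is an extra not covered above.

```python
import pandas as pd

# the same daily range as above
pd_date_range = pd.date_range('2014-04-02', '2014-10-01')

# format every date in one go and turn the result into a list
list_dates = pd_date_range.strftime('%Y-%m-%d').tolist()
print(len(list_dates))
print(list_dates[0], list_dates[-1])

# same start and end, but only the first day of each month
pd_month_starts = pd.date_range('2014-04-02', '2014-10-01', freq='MS')
list_months = pd_month_starts.strftime('%Y-%m-%d').tolist()
print(list_months)
```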

Reddit User Info – Python

This one is just for me. No explanation. As is.

This script will let you download all the posts submitted by any Reddit user. Just put the user name in line 9.

# user_info_download.py

import urllib2 as ul2
import json
import pandas as pd
from datetime import datetime

# user name
str_un = 'Any_user_name'

# page counter
int_counter = 1

# list page ids
list_pids = ['']

# this is the output list with all the data
list_output = []

# do the loop
while int_counter!=0:

    # raw url
    str_url = 'http://www.reddit.com/user/%s/submitted.json' % str_un

    # if its not the first page
    if int_counter>=2:
        # complete url
        str_url = '%s?after=%s' % (str_url, list_pids[0])

    # request, connect, then read
    ul_req = ul2.Request(str_url, headers={'User-agent':'Mozilla/5.0'})
    str_json = ul2.urlopen(ul_req).read()

    js_d = json.loads(str_json)


    for js_post in js_d['data']['children']:

        # get the correct date format; created_utc is the UTC timestamp
        dt_uf = datetime.utcfromtimestamp(js_post['data']['created_utc'])
        str_date = dt_uf.strftime('%Y%m%d')

        # all the data
        tup_out = (js_post['data']['id'],
                    str_date,
                    js_post['data']['title'],
                    js_post['data']['subreddit'],
                    js_post['data']['score'],
                    js_post['data']['gilded'],
                    js_post['data']['stickied'],
                    js_post['data']['over_18'])
        # append it 
        list_output.append(tup_out)

    # tell me when page is done
    print('page %i done' % int_counter)

    # is there another page
    if js_d['data']['after'] is not None:
        list_pids[0] = js_d['data']['after']
        int_counter += 1

    # if not, then end the loop
    else:
        int_counter = 0
        print('done')

# turn output into data frame
df_d = pd.DataFrame(list_output)
df_d.columns = ['id','created','title','subreddit',
                'score','gilded','stickied','over_18']

# save as csv
df_d.to_csv('%s.csv' % str_un, index=False)
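
The core of the script is the paging loop: keep asking for the next page until the after value comes back empty. Here is that idea on its own as a rough Python 3 sketch, not a replacement for the script. The collect_pages function, the fetch_page argument and the fake pages are all made up for the example, so it runs without touching Reddit.

```python
def collect_pages(fetch_page):
    """Follow the 'after' token until there are no more pages.

    fetch_page is a hypothetical callable: it takes the 'after' token
    (None for the first page) and returns the decoded JSON as a dict.
    """
    list_posts = []
    str_after = None
    while True:
        js_d = fetch_page(str_after)
        for js_post in js_d['data']['children']:
            list_posts.append(js_post['data'])
        # the token for the next page, or None on the last page
        str_after = js_d['data']['after']
        if str_after is None:
            break
    return list_posts


# two fake pages standing in for the real responses (made-up data)
dict_fake_pages = {
    None: {'data': {'children': [{'data': {'id': 'a1'}},
                                 {'data': {'id': 'a2'}}],
                    'after': 't3_a2'}},
    't3_a2': {'data': {'children': [{'data': {'id': 'a3'}}],
                       'after': None}},
}

list_posts = collect_pages(dict_fake_pages.get)
list_ids = [js_post['id'] for js_post in list_posts]
print(list_ids)
```

Keeping the download step outside the loop like this makes it possible to try out the paging without a connection.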

Exploring JSON files – Python

Working with JSON files can be freaking horrible, especially if you don’t know what data is in the file.

Let me give you an example of how unreadable it can be.

If you use Apple’s iTunes Search API to look up ID 112018, you get this in return

{
 "resultCount":1,
 "results": [
{"wrapperType":"artist", "artistType":"Artist", "artistName":"Nirvana", "artistLinkUrl":"https://itunes.apple.com/us/artist/nirvana/id112018?uo=4", "artistId":112018, "amgArtistId":5034, "primaryGenreName":"Rock", "primaryGenreId":21, "radioStationUrl":"https://itunes.apple.com/station/idra.112018"}]
}

Which is crappy because you don’t know what data you have, especially if it’s a large file.

The best and quickest way of sorting that out is by using the .dumps() function in the json module, which works in a similar way to the .prettify() function in BeautifulSoup.

This is how you do it.

# pretty_json.py
import urllib2 as ul2
import json

# first, the url for the query
str_url = 'https://itunes.apple.com/lookup?id=112018'

# the request
ul_req = ul2.Request(str_url, 
                    headers={'User-agent':'Mozilla/5.0'})

# open connection and read info
str_json = ul2.urlopen(ul_req).read()

# read the json data
js_d = json.loads(str_json)

# this bit will prettify it,
# it's the indent argument that does it
# (indent=True is treated as an indent of 1 space)
str_output = json.dumps(js_d, indent=True)

# save the readable json to a file
with open('output.json','w') as fw:
    fw.write(str_output)

And the output file looks a bit like this…nicer and readable.

{
 "resultCount": 1, 
 "results": [
  {
   "artistType": "Artist", 
   "amgArtistId": 5034, 
   "wrapperType": "artist", 
   "artistId": 112018, 
   "artistLinkUrl": "https://itunes.apple.com/us/artist/nirvana/id112018?uo=4", 
   "radioStationUrl": "https://itunes.apple.com/station/idra.112018", 
   "artistName": "Nirvana", 
   "primaryGenreId": 21, 
   "primaryGenreName": "Rock"
  }
 ]
}

Now you know what data you have.
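
The same trick works without a download step if you already have the JSON text. This is a small Python 3 sketch that assumes a shortened copy of the response above pasted in as a string. It uses indent=4 and sort_keys=True, which are choices made for the example rather than what the script above does.

```python
import json

# a shortened copy of the response above, pasted in as a string
str_json = ('{"resultCount":1, "results": [{"wrapperType":"artist", '
            '"artistName":"Nirvana", "artistId":112018, '
            '"primaryGenreName":"Rock"}]}')

# read the json data
js_d = json.loads(str_json)

# indent sets the number of spaces per level,
# sort_keys puts the keys in alphabetical order
str_output = json.dumps(js_d, indent=4, sort_keys=True)
print(str_output)

# a quick list of the fields in the first result
list_keys = sorted(js_d['results'][0].keys())
print(list_keys)
```

Sorting the keys makes it easier to compare two responses side by side.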