First thing first, import the pandas module and read the csv file
>>> import pandas as pd >>> df = pd.read_csv('path/to/file.csv', encoding='utf-8')
So, we have a simple data frame like this…
>>> df start end 0 2001-06-01 2004-02-01 1 2001-11-01 2003-12-01 2 2005-04-01 2007-03-01 3 2005-05-01 2007-03-01
…and we want to calculate the amount of time between the start and the end.
The main problems is that the dates are read as string (when read from a file), and therefore there is very little we can do with this right now.
To change the column to dates, we can use the to_datetime
function in pandas. We also don’t want to be completely destructive and risk messing up the data, so we put the newly formatted dates into new columns (start_d
and end_d
), like this…
>>> df['start_d'] = pd.to_datetime(df['start']) >>> df['end_d'] = pd.to_datetime(df['end'])
Now, our data frame should look a bit like this
>>> df start end start_d end_d 0 2001-06-01 2004-02-01 2001-06-01 2004-02-01 1 2001-11-01 2003-12-01 2001-11-01 2003-12-01 2 2005-04-01 2007-03-01 2005-04-01 2007-03-01 3 2005-05-01 2007-03-01 2005-05-01 2007-03-01
to calculate the length of time between start and end, we simply subtract start_d
from end_d
, like this
>>> df['len'] = df['end_d'] - df['start_d']
which will result in the difference being calculated in days, leaving the data frame looking like this
start end start_d end_d len 0 2001-06-01 2004-02-01 2001-06-01 2004-02-01 975 days 1 2001-11-01 2003-12-01 2001-11-01 2003-12-01 760 days 2 2005-04-01 2007-03-01 2005-04-01 2007-03-01 699 days 3 2005-05-01 2007-03-01 2005-05-01 2007-03-01 669 days