BibTeX to YAML – Python

I’m writing my thesis right now, so I haven’t had much time to post.

I am now going through my literature review and I was looking for ways of storing and analysing all my citations so I can do a bit of bibliometrics.

Long story short, after trying JSON and XML, I stumbled across YAML. So now I feed a YAML file into Python and look at how many publications each author has, which years were the most active, which journals are the most common, and so on.
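As a rough idea of what that counting can look like, here is a minimal sketch. It assumes the YAML file has already been parsed by a YAML library into a list of dicts, one per citation, with the field names used later in this post. The 'Doe, Jane' entries, the placeholder journal and the count_field helper are made up for illustration.

```python
from collections import Counter

# Hypothetical sample: the shape a YAML parser is assumed to return.
# Only the first entry comes from this post; the others are placeholders.
entries = [
    {"author": ["von Hippel, Eric"],
     "journal": "Industrial and Corporate Change", "year": 2007},
    {"author": ["Doe, Jane", "von Hippel, Eric"],
     "journal": "Placeholder Journal", "year": 2009},
    {"author": ["Doe, Jane"],
     "journal": "Placeholder Journal", "year": 2009},
]


def count_field(entries, field):
    """Count the values of `field`; list values (authors) count once per item."""
    counts = Counter()
    for entry in entries:
        value = entry.get(field)
        if value is None:
            continue  # entry does not have this field
        if isinstance(value, list):
            counts.update(value)
        else:
            counts[value] += 1
    return counts


author_counts = count_field(entries, "author")
year_counts = count_field(entries, "year")
journal_counts = count_field(entries, "journal")

print(author_counts.most_common())
print(year_counts.most_common(1))
print(journal_counts.most_common(1))
```

Counter.most_common() sorts by count, so the most active years and most common journals come out at the top.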

All my citations, however, are in BibTeX format (I use LaTeX), so I need to convert them to YAML. In other words, the input bib entries look like this:

@article{Hippel:horiInno:2007,
	Author = {von Hippel, Eric},
	Date-Modified = {2014-07-02 10:47:11 +0000},
	Journal = {Industrial and Corporate Change},
	Number = {2},
	Pages = {293 - 315},
	Read = {1},
	Title = {Horizontal Innovation Networks By and For Users},
	Volume = {16},
	Year = {2007}}

and I need them to look like this:

- article: &HippelhoriInno2007
  author:
    - von Hippel, Eric
  date-modified: "2014-07-02 10:47:11 +0000"
  journal: "Industrial and Corporate Change"
  number: 2
  pages: "293 - 315"
  read: 1
  title: "Horizontal Innovation Networks By and For Users"
  volume: 16
  year: 2007

Pretty straightforward, if you ask me.

To make things easier, I have also added very basic command-line arguments for the input and output files. This means that if you save the script as ‘bib2yaml.py’, you can use it like so:

$ python bib2yaml.py my_input_file.bib my_output_file.yaml
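The script reads the two paths straight from sys.argv, so a missing argument ends in an IndexError. If a usage message is preferable, the same two positional arguments could be read with argparse instead. This is an optional variant sketched under that assumption, not what the script does; the parse_args helper name is made up.

```python
import argparse


def parse_args(argv=None):
    """Read the input and output paths; argv=None means use sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description="Convert a BibTeX file to YAML.")
    parser.add_argument("input", help="path to the .bib file to read")
    parser.add_argument("output", help="path to the .yaml file to write")
    return parser.parse_args(argv)


# Passing the arguments explicitly here so the example runs anywhere
args = parse_args(["my_input_file.bib", "my_output_file.yaml"])
print(args.input, args.output)
```

With this in place, args.input and args.output would take the place of sys.argv[1] and sys.argv[2].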

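Anyhow, here is the code. Right now I have only tried it on articles; I haven’t tested it for books or media. It also expects each field on its own tab-indented line in the form Key = {value}, as in the example above.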

# bib2yaml.py
import re
import sys

# from terminal arguments
str_input = sys.argv[1]
str_output = sys.argv[2]

# open the file
with open(str_input, 'r') as fr:
    list_lines = fr.readlines()

# collect the output lines here
list_output = []

# go through the lines
for str_line in list_lines:
    
    # first line with id
    if str_line.startswith('@'):
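        # raw string with escaped brace; group 1 is the entry type, group 2 the key
        sg_t1 = re.search(r'^@(.*)\{(.*),$', str_line)
        str_id = re.sub(':','',sg_t1.group(2))
        # YAML needs a space after the dash of a list item
        str_first = '\n- %s: &%s' % (sg_t1.group(1), str_id)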
        list_output.append(str_first)

    # for the other lines
    elif str_line.startswith('\t'):
        sg_tn = re.search(r'^\t(.*) = \{(.*?)\}', str_line)
        # skip tab-indented lines that are not of the form Key = {value}
        if sg_tn is None:
            continue
        str_cat = sg_tn.group(1).lower()
        str_val = sg_tn.group(2)

        # fields written unquoted, as integers
        list_ints = ['number','volume','read','year']

        # make a list of all the authors
        if str_cat=='author':
            list_authors = str_val.split(' and ')
            str_auths = '\n    - '.join(list_authors)
            str_aut_out = '  %s:\n    - %s' % (str_cat, str_auths)
            list_output.append(str_aut_out)

        # all the integer values (elif, so authors are not written a second time)
        elif str_cat in list_ints:
            str_jt_out = '  %s: %s' % (str_cat, str_val)
            list_output.append(str_jt_out)
            
        # all the string values
        else:
            str_gen_out = '  %s: "%s"' % (str_cat, str_val)
            list_output.append(str_gen_out)

with open(str_output,'w') as fw:
    fw.write('\n'.join(list_output))
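To see what the two regular expressions pick out, here is a small check run against the sample entry from above. It is only a sketch: the patterns are written as raw strings with escaped braces, the second author 'Doe, Jane' and the nested-brace title are made up, and the variable names are mine.

```python
import re

# The script's two patterns, as raw strings with the braces escaped
RE_HEADER = re.compile(r'^@(.*)\{(.*),$')
RE_FIELD = re.compile(r'^\t(.*) = \{(.*?)\}')

# Header line: entry type before the brace, citation key after it
header = '@article{Hippel:horiInno:2007,'
m = RE_HEADER.search(header)
entry_type = m.group(1)
entry_id = m.group(2).replace(':', '')  # colons dropped for the YAML anchor

# Field line: 'Doe, Jane' is a made-up second author
author_line = '\tAuthor = {von Hippel, Eric and Doe, Jane},'
m = RE_FIELD.search(author_line)
field_name = m.group(1).lower()
authors = m.group(2).split(' and ')

# Nested braces: the lazy (.*?) stops at the first closing brace
nested_line = '\tTitle = {The {RNA} World},'  # hypothetical title
truncated = RE_FIELD.search(nested_line).group(2)

print(entry_type, entry_id)
print(field_name, authors)
print(repr(truncated))
```

The last case is worth knowing about: a value that contains its own braces, which BibTeX titles often do to protect capitalisation, gets cut off at the first closing brace.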
