Dataset parser

The Python 3 script for the parser can be found here.

The parser takes a Web of Science (WoS) database export file and converts it to JSON format so that the manual classifier can read it.
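To give a feel for what this kind of conversion involves, here is a minimal sketch that turns tab-delimited text into JSON using only the standard library. It is not the actual wostojson.py and its output layout may well differ. The function name tab_delimited_to_json and the sample values are made up for illustration, and the column names AU, TI and PY are used only as example headers.

```python
import csv
import io
import json


def tab_delimited_to_json(text):
    """Convert tab-delimited text (header row first) into a JSON string.

    Hypothetical helper for illustration; not taken from wostojson.py.
    """
    # A UTF-8 export may start with a byte-order mark; drop it if present.
    text = text.lstrip("\ufeff")
    reader = csv.DictReader(
        io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    records = []
    for row in reader:
        # Rows with more cells than the header store the extras under the
        # key None; those extras are dropped here.
        records.append({key: value for key, value in row.items() if key is not None})
    return json.dumps(records, ensure_ascii=False, indent=2)


# Placeholder data, not a real record.
sample = "AU\tTI\tPY\nAuthor A\tExample title\t2001\n"
print(tab_delimited_to_json(sample))
```

Each row of the export becomes one JSON object keyed by the column headers, which is the general shape a downstream tool can load without knowing anything about tabs or encodings.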

How to use it

1 – Searching and exporting WoS data

First things first: you have to run a search on the Web of Science. In this example I have searched for the term “User Innovation”.

My search returned 362 hits. If you get more, feel free to refine the results; it’s really up to you. For the algorithms to work, a training set of around 150 records or more is typically recommended.

Once you are happy with the results list, export the data by first selecting “Other File Formats”.

Next, for the “Record Content” select “Full Record”. This means that you retain as much metadata as possible. Then for the actual file format, select “Tab-delimited” and make sure it’s “UTF-8” encoded (whether it’s Mac or Win doesn’t really matter).

You should end up with a file called savedrecs.txt in your download folder. This is the one you will enter into the parser.
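If you want to confirm that the export really is tab-delimited UTF-8 before running the parser, a quick check along these lines can help. This is only a sketch: summarize_export is a hypothetical helper, not part of the parser, and it assumes the first line of the file is a header row with one record per line after it.

```python
import csv
import os


def summarize_export(path):
    """Return (column_names, record_count) for a tab-delimited export file."""
    # "utf-8-sig" reads plain UTF-8 and also strips a leading byte-order mark.
    # A file in a different encoding will usually raise UnicodeDecodeError here.
    with open(path, encoding="utf-8-sig", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader, [])
        # Count only rows that contain at least one non-empty cell.
        count = sum(1 for row in reader if any(cell.strip() for cell in row))
    return header, count


if __name__ == "__main__" and os.path.exists("savedrecs.txt"):
    columns, total = summarize_export("savedrecs.txt")
    print(len(columns), "columns,", total, "records")
```

The record count printed here should match the number of hits you exported. If it is wildly off, or the header comes back as a single long column, the file was probably exported in a different format.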

2 – Using the parser

For this next step I have created a new folder titled litreview, and I’ve moved the savedrecs.txt file into it.

The next step is to download the parser from here by clicking on the “download zip” button on the top right. Unzip the file, open the folder and move the wostojson.py file so that it sits next to your savedrecs.txt. The litreview folder should now contain both wostojson.py and savedrecs.txt.

Now you need to open the terminal and change the working directory to the litreview folder. On a Mac you can do this by typing cd followed by a space in the terminal, then dragging and dropping the litreview folder onto the terminal window. Then press Enter.

Now, all you have to do is type in the following command.

python3 wostojson.py

If everything goes well, you should get the following message in the terminal.

362 records parsed successfully

A file named output_records.json will now appear in your folder. This is the file you will use in the manual classifier to create the training set.
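As an optional sanity check, you can load the output file and count its entries. The sketch below assumes that the top level of output_records.json is either a JSON list of records or a JSON object with one key per record. The layout is not documented here, so treat that as an assumption; count_records is a hypothetical helper.

```python
import json
import os


def count_records(path):
    """Load a JSON file and return the number of top-level entries."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    # Assumption: a list of records, or an object keyed by record.
    if isinstance(data, (list, dict)):
        return len(data)
    raise ValueError("unexpected top-level JSON type: " + type(data).__name__)


if __name__ == "__main__" and os.path.exists("output_records.json"):
    print(count_records("output_records.json"), "entries in output_records.json")
```

If the number differs from the count the parser reported, the file most likely uses a different layout than the one assumed here, rather than being broken.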