
Character encoding with JSON and Python

I was scraping data the other day and saving the output to a JSON file, but the text in one of the entries was coming out wrong.

Instead of “Antal Dovcsák”, it was coming out as “Antal Dovcs\u00e1k”.

For context, my code looked something like this:

import requests
import json
import bs4

...

dicData = {'name': foundData.text, 'url': foundData['href']}

with open('output_file.json', 'w') as fw:
    fw.write(json.dumps(dicData, indent=2))

And the output was this…

{
    "name": "Antal Dovcs\u00e1k",
    "url": "/wiki/Antal_Dovcs%C3%A1k"
}
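As far as I can tell, the escaped form is not actually corrupt: json.dumps defaults to ensure_ascii=True, which writes every non-ASCII character as a \uXXXX escape. Here is a minimal sketch of that behaviour, where the record dictionary is made-up sample data standing in for the scraped entry. Both forms load back to the same string:

```python
import json

# Sample record standing in for the scraped data (illustrative values only).
record = {"name": "Antal Dovcsák"}

escaped = json.dumps(record)                        # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)   # keep non-ASCII characters as-is

print(escaped)
print(readable)

# Both strings decode to the same Python object.
same = json.loads(escaped) == json.loads(readable) == record
print(same)
```

So the escaping only affects how readable the file is to a human, not what a JSON parser gets out of it.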

After some googling, I found this awesome entry on Stack Overflow.

My code now looks like this:

import requests
import json
import bs4

...

dicData = {'name': foundData.text, 'url': foundData['href']}

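# ensure_ascii=False keeps characters such as "á" instead of escaping them as \u00e1
jsonOutput = json.dumps(dicData, indent=2, ensure_ascii=False)

# Open the file as UTF-8 explicitly so the result does not depend on the platform default
with open('output_file.json', 'w', encoding='utf-8') as fw:
    fw.write(jsonOutput)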

This now gives me the following output:

{
    "name": "Antal Dovcsák",
    "url": "/wiki/Antal_Dovcs%C3%A1k"
}

Perfect!
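To double-check that the file on disk really holds UTF-8 text rather than escapes, a small round trip helps. This is only a sketch: the record is sample data, and the file goes into a temporary directory (my assumption, so that nothing real gets overwritten).

```python
import json
import os
import tempfile

# Sample record standing in for the scraped data (illustrative values only).
record = {"name": "Antal Dovcsák", "url": "/wiki/Antal_Dovcs%C3%A1k"}

# Write into a temporary directory so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "output_file.json")

# json.dump writes straight to the file object; the encoding is set explicitly.
with open(path, "w", encoding="utf-8") as fw:
    json.dump(record, fw, indent=2, ensure_ascii=False)

# Read the raw bytes back to see what was actually stored.
with open(path, "rb") as fr:
    raw = fr.read()

text = raw.decode("utf-8")
loaded = json.loads(text)

print(text)
print(loaded == record)
```

If the name shows up unescaped in the printed text and the loaded data matches the original, the file is good.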
