Scraping HTML Generated by JavaScript – Python

Say we want to get some data from stats.cyanogenmod.com.

A plain wget or curl on that URL returns only the static page, something like this:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>CyanogenMod Statistics</title>
    <script src="http://code.jquery.com/jquery-1.8.2.min.js" type="text/javascript"></script>
    <script src="js/underscore.min.js" type="text/javascript"></script>
    <script src="js/utils.js" type="text/javascript"></script>
    <script src="js/app.js" type="text/javascript"></script>
    <link rel="stylesheet" href="css/bootstrap.min.css">
    <link rel="stylesheet" href="css/bootstrap-responsive.min.css">
    <style type="text/css">
      body {
        padding: 40px;
      }
      table tbody tr td:first-child {
        width: 150px;
      }
    </style>
  </head>
  <body>
    <div class="container">
      <!-- Page Header -->
      <div class="row">
        <div class="span12"><div class="page-header"><h1>CyanogenMod Statistics</h1></div></div>
      </div>

      <!-- Content -->
      <div id="totalInstalls"></div>
    </div>
    <script type="text/javascript">
      $(document).ready(function() {
        CMStats.init();
      });
    </script>
  </body>
</html>

Within the HTML we can see the following lines:

      $(document).ready(function() {
        CMStats.init();
      });

This is the bit that generates the actual data we want to download. The data is not in the static page; it only appears once the JavaScript has run.
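As a quick check that the static markup really carries no data, here is a minimal sketch that parses a trimmed copy of the HTML above and collects whatever text sits inside the totalInstalls div. The names ElementTextCollector and text_in_element are made up for this illustration, and only the standard library is used.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ElementTextCollector(HTMLParser):
    """Collect the text found inside the element with a given id."""

    def __init__(self, element_id):
        HTMLParser.__init__(self)
        self.element_id = element_id
        self.depth = 0    # greater than 0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def text_in_element(html, element_id):
    """Return the stripped text inside the element with the given id."""
    parser = ElementTextCollector(element_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# Trimmed copy of the static page shown above.
STATIC_HTML = """
<body>
  <div class="container">
    <h1>CyanogenMod Statistics</h1>
    <div id="totalInstalls"></div>
  </div>
</body>
"""

# The div exists in the static page but holds no text.
print(repr(text_in_element(STATIC_HTML, "totalInstalls")))
```

The same helper can be pointed at the rendered HTML later on to see whether the div has been filled in.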

One way to get at it is to render the page with WebKit, let the JavaScript run, and then read back the resulting HTML.

This solution requires the PyQt4 library with the Qt Framework.

The main part that will do most of the work was taken from here.

All I have done is move it into a separate file so that it can be imported and used from other scripts.

So the main file will be called wkRender.py, and will contain the following code.

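import sys

from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage


class Render(QWebPage):
    """Load a URL in an off-screen WebKit page and keep the rendered frame."""

    def __init__(self, url):
        # A QApplication must exist before a QWebPage can be created.
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        # Run the Qt event loop until _loadFinished() quits it.
        self.app.exec_()

    def _loadFinished(self, result):
        # result is True if the load succeeded; it is not checked here.
        self.frame = self.mainFrame()
        self.app.quit()


def getHtml(str_url):
    """Render str_url and return the HTML of the resulting page."""
    r_html = Render(str_url)
    return r_html.frame.toHtml()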

The bit you will actually call is the getHtml() function, which renders the page and returns its HTML.

To use it, import that function into another file, which we will call dataCollect.py.

from wkRender import getHtml

This assumes both files are in the same directory.
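If wkRender.py lives somewhere else, one option is to put its directory on sys.path before the import. This is only a sketch: the helper name add_module_dir and the path below are placeholders, not part of the original setup.

```python
import os
import sys


def add_module_dir(path):
    """Put a directory on sys.path (once) so modules in it can be imported."""
    path = os.path.abspath(path)
    if path not in sys.path:
        sys.path.insert(0, path)
    return path


# Hypothetical location of wkRender.py; adjust to your own layout.
add_module_dir("/path/to/scripts")
# After this, "from wkRender import getHtml" would look in that directory too.
```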

Once it has been imported, all we need to do is declare the URL and get the HTML as a string.

str_url = 'http://stats.cyanogenmod.com/'

str_html = getHtml(str_url)

print(str_html)
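Printing is fine for a quick look, but the rendered markup is usually worth keeping. The sketch below writes a string to a UTF-8 file. The function save_html and the file name are assumptions for illustration, and the sample string is a stand-in for the value returned by getHtml(), not real output from the site. Under Python 2, PyQt4 may hand back a QString rather than a plain string, in which case it would need converting with unicode() first.

```python
import io
import os
import tempfile


def save_html(html, path):
    """Write html (a text string) to path as UTF-8 and return the path."""
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(html)
    return path


# Stand-in for: str_html = getHtml(str_url)
sample_html = u'<html><body><div id="totalInstalls"></div></body></html>'

# Hypothetical output location in the system temp directory.
out_path = os.path.join(tempfile.gettempdir(), "rendered_page.html")
save_html(sample_html, out_path)
```

With the page on disk, it can be parsed or compared against later runs without rendering it again.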
