Wednesday, November 5, 2014

Web scraping with Go and PhantomJS

Some time ago I wrote a blog post called "Web scrapping with Julia and PhantomJS"...then I wrote another one called "Web scrapping with Haskell and PhantomJS"...

This time...it's Go's time -;)

The concept is the same...we create a PhantomJS script that will open a user's Twitter page, scroll down through the first 5 pages of tweets and get the hashtags...here's the PhantomJS script...

Hashtags.js
var system = require('system');

var webpage = require('webpage').create();
webpage.viewportSize = { width: 1280, height: 800 };
webpage.scrollPosition = { top: 0, left: 0 };

// The Twitter user name is passed as the first command-line argument.
var userid = system.args[1];
var profileUrl = "http://www.twitter.com/" + userid;

webpage.open(profileUrl, function(status) {
 if (status === 'fail') {
  console.error('webpage did not open successfully');
  phantom.exit(1);
  // phantom.exit() does not stop the callback right away, so bail out here.
  return;
 }
 var i = 0,
 top,
 queryFn = function() {
  return document.body.scrollHeight;
 };
 // Every 3 seconds, scroll to the bottom so the next page of tweets loads.
 setInterval(function() {
  top = webpage.evaluate(queryFn);
  i++;

  webpage.scrollPosition = { top: top + 1, left: 0 };

  // After 5 scrolls, collect the text of every hashtag link and print it.
  if (i >= 5) {
   var twitter = webpage.evaluate(function () {
    var twitter = [];
    // Declared with var so it does not leak into the page as a global.
    var forEach = Array.prototype.forEach;
    var tweets = document.querySelectorAll('[data-query-source="hashtag_click"]');
    forEach.call(tweets, function(el) {
     twitter.push(el.innerText);
    });
    return twitter;
   });

   // One hashtag per line, so the Go program can split on newlines.
   twitter.forEach(function(t) {
    console.log(t);
   });

   phantom.exit();
  }
 }, 3000);
});

If we run the script we're going to see the following output...


Now...what I want to do with this information...is to send it to Go...and get the most used hashtags...so I will count how many times each one appears and then get rid of the ones that appear fewer than 5 times...

Let's see the Go code...

TwitterHashtags.go
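package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

func main() {
	// Run the PhantomJS script for the user "Blag" and capture its standard output.
	cmd := exec.Command("phantomjs", "--ssl-protocol=any", "Hashtags.js", "Blag")
	out, err := cmd.Output()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Each output line is one hashtag; count how often each one appears.
	tweets := strings.Split(string(out), "\n")
	counts := make(map[string]int)
	for _, value := range tweets {
		value = strings.TrimSpace(value)
		if value != "" {
			counts[value]++
		}
	}

	// Print only the hashtags that appear at least 5 times.
	for key, value := range counts {
		if value >= 5 {
			fmt.Printf("(%s, %d)\n", key, value)
		}
	}
}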

The only problem with this script is that there's no easy way to sort a map[string]int directly (Go maps have no defined order, so the output order can change between runs)...so I will simply leave it like that -:)

Here's the result...


If someone knows an easy way to sort this...please let me know -:)

Greetings,

Blag.
Development Culture.
