Seven Languages in Seven Weeks Scala Day 3

On the last day with Scala, the book introduces XML support and concurrency.

Scala and XML

XML is a rather unfortunate part of modern computing, one that is tedious to work with in most languages.

Scala chose to solve this by making XML part of the language. This means that the following fragment (from Burak's draft Scala XML book, linked to from A Tour of Scala: XML Processing):

val labPhoneBook =
    <phonebook>
      <descr>
        This is the <b>phonebook</b> of the
        <a href="http://acme.org">ACME</a> corporation.
      </descr>
      <entry>
        <name>Burak Emir</name>
        <phone where="work">+41 21 693 68 67</phone>
      </entry>
    </phonebook>

println(labPhoneBook)

stores an XML fragment, not a string, in labPhoneBook.

Braces can be used to insert Scala code directly into the XML:

Scala: dynamic XML
scala> val name = "Fred"
name: java.lang.String = Fred

scala> val fragment = <blog>
     | <post>
     | <author>{name}</author>
     | </post>
     | </blog>
fragment: scala.xml.Elem =
<blog>
<post>
<author>Fred</author>
</post>
</blog>

The interactive Scala shell recognizes XML fragments, so it knew the expression was not complete until </blog>. Very, very nice.

Braces can also be used to extract information from the XML, as seen in the book. This, combined with the \ XPath projection operator (which, of course, returns a data structure that can be iterated over with the usual methods), provides a very pleasant way to parse XML.
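
For example, projecting into the labPhoneBook fragment above looks like this (a small sketch of my own, not taken from the book):

// \ selects child elements by label, \\ selects descendants, and "@name" reads attributes
val names = (labPhoneBook \ "entry" \ "name") map { _.text }   // names.head == "Burak Emir"
val where = (labPhoneBook \\ "phone" \ "@where").text          // "work"

// braces in a pattern pull the nodes back out of an XML literal
labPhoneBook match {
  case <phonebook>{children @ _*}</phonebook> =>
    println("phonebook has " + children.size + " child nodes")
}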

There is far more to Scala's XML support (see the link above). I don't usually do much XML processing (most of my handling of XML goes through dedicated libraries that abstract XML away entirely), but if I had to, I'd certainly give Scala a try.

Scala Concurrency

Exercises

Displaying links

To collect the links, I thought about using HTML or XML parsing, but could not find a parser robust enough to handle the pages. In the end, I just used a regular expression.

The code is fairly short:

  • first import the relevant classes
  • define the regular expression
  • slightly reorganize PageLoader to list all the links found through the regular expression; the links are added to a Set to ensure uniqueness, then iterated over
import scala.io.Source                  // Source.fromURL is used below
import scala.util.matching.Regex

// capture each anchor's href value under the group name "link"
val linkPattern = new Regex("""<a +href=\"([^\"]+)\"[^>]*>""", "link")

object PageLoader {
  def getPageSize(url : String) = {
    val text = Source.fromURL(url).mkString
    // fold the matches into a Set to deduplicate the links, then print each one
    ((Set(): Set[String]) /: linkPattern.findAllIn(text).matchData) {
      (s, md) => s + md.group("link") } foreach {
        link => println(url + " => " + link) }
    text.length
  }
}

The code to add the links to a Set is not the simplest I could think of, but it is the simplest that worked. The problem I have is that the collection methods often return iterators, and I cannot simply build a new collection from an iterator. I guess there must be a simpler way, but right now it is eluding me.
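
In retrospect, mapping the match data to the captured group and converting straight to a Set should do the same thing more directly (a sketch, reusing linkPattern and text from the snippet above, that I have not folded back into the code):

// Iterator supports map and toSet, so the explicit fold is not strictly needed
val links = linkPattern.findAllIn(text).matchData.map(_.group("link")).toSet
links foreach { link => println(url + " => " + link) }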

Full code:

sizer.scala (full code)
import scala.io._
import scala.actors._
import Actor._
import scala.util.matching.Regex

val linkPattern = new Regex("""<a +href=\"([^\"]+)\"[^>]*>""", "link")

object PageLoader {
  def getPageSize(url : String) = {
    val text = Source.fromURL(url).mkString
    ((Set(): Set[String]) /: linkPattern.findAllIn(text).matchData) {
      (s, md) => s + md.group("link") } foreach {
        link => println(url + " => " + link) }
    text.length
  }
}

val urls = List("http://www.amazon.com/",
                "http://www.twitter.com/",
                "http://www.google.com/",
                "http://www.cnn.com/" )

def timeMethod(method: () => Unit) = {
  val start = System.nanoTime
  method()
  val end = System.nanoTime
  println("Method took " + (end - start)/1000000000.0 + " seconds.")
}

def getPageSizeSequentially() = {
  for(url <- urls) {
    println("Size for " + url + ": " + PageLoader.getPageSize(url)) }
}

def getPageSizeConcurrently() = {
  val caller = self

  for(url <- urls) {
    actor { caller ! (url, PageLoader.getPageSize(url)) }
  }

  for(i <- 1 to urls.size) {
    receive {
      case (url, size) =>
        println("Size for " + url + ": " + size)
    }
  }
}

println("Sequential run:")
timeMethod { getPageSizeSequentially }

println("Concurrent run")
timeMethod { getPageSizeConcurrently }

Following links

This exercise builds on the previous one. Now that we have the links, we should try to follow them and add up their sizes.

First, the links should be somewhat normalized. A basic idea would be to make sure each contains a host; those that do not are prefixed with the main URL (the logic is not foolproof: checking the errors, I found a few JavaScript fragments that were interpreted as links).
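
Distilled from the listings below, the normalization amounts to roughly this helper (a sketch only; the actual code inlines the logic):

// absolute links pass through; root-relative and relative links are
// prefixed with the page's URL (crude, but good enough for the exercise)
def normalize(url: String, link: String): String =
  if (link startsWith "http") link
  else if (link startsWith "/") url + link
  else url + '/' + link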

Then there is a second, more serious issue: some links take forever to load. So I rewrote the fetching logic to add a timeout. Because try blocks are actually expressions in Scala, I just return an empty string as the content of pages that cannot be read.
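
In isolation, the fetching logic boils down to something like this (a condensed sketch of what appears in the listings below):

import scala.io.Source
import java.net.URL

// try/catch is an expression: the block evaluates to the page text,
// or to "" when the read fails or times out
def fetch(url: String): String = {
  val con = new URL(url).openConnection
  con.setConnectTimeout(5000)
  con.setReadTimeout(5000)
  try Source.fromInputStream(con.getInputStream).mkString
  catch { case e: Exception => "" }
}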

Finally, at this stage I found it hard to propagate the concurrency method (sequential or parallel) down to the PageLoader.getPageSize method, so I just implemented two separate versions:

Sequential version

This version takes over 10 minutes to run on my machine.

sizer_links_seq.scala
import scala.io._
import scala.actors._
import Actor._
import scala.util.matching.Regex

val linkPattern = new Regex("""<a +href=\"([^\"]+)\"[^>]*>""", "link")

object PageLoader {
  def getPageSize(url : String, withLinks : Boolean): Int = {
    //println("Reading " + url)
    import java.net._
    val urlCon = new URL(url).openConnection
    urlCon.setConnectTimeout(5000)
    urlCon.setReadTimeout(5000)

    val text =
      try {
        Source.fromInputStream(urlCon.getInputStream).mkString
      } catch {
        case e: Exception => {
          Console.err.println("Could not read [" + url + "]")
          ""
        }
      }
    val size = text.length
    if (withLinks) {
      val links = ((Set(): Set[String]) /: linkPattern.findAllIn(text).matchData) {
      (s, md) =>
        val link = md.group("link")
        s + (if (link.indexOf("http") != 0) {
              if (link(0) == '/')
                (url + link)
              else
                (url + '/' + link)
            } else
              link
            )
      }

      (size /: links) { (s, link) => s + PageLoader.getPageSize(link, false) }
    } else
      size
  }
}

val urls = List("http://www.amazon.com",
                "http://www.twitter.com",
                "http://www.google.com",
                "http://www.cnn.com" )

def timeMethod(method : () => Unit) = {
  val start = System.nanoTime
  method()
  val end = System.nanoTime

  println("Method took " + (end - start) / 1000000000.0 + " seconds.")
}

def getPageSizeSequentially() = {
  for(url <- urls) {
    println("Size for " + url + ": " + PageLoader.getPageSize(url, true))
  }
}

println("Sequential run:")
timeMethod { getPageSizeSequentially }

Output:

seq_output.txt
Sequential run:
Size for http://www.amazon.com: 12869997
Size for http://www.twitter.com: 291445
Size for http://www.google.com: 280876
Size for http://www.cnn.com: 11343152
Method took 667.821383 seconds.

Parallel version

This version takes about 100 seconds, less than a sixth of the sequential version's time. Obviously the code is more complex, but it is clearly much faster as well.

sizer_links_par.scala
import scala.io._
import scala.actors._
import Actor._
import scala.util.matching.Regex

val linkPattern = new Regex("""<a +href=\"([^\"]+)\"[^>]*>""", "link")

object PageLoader {
  def getPageSize(url : String, withLinks : Boolean): Int = {
    //println("Reading " + url)
    import java.net._
    val urlCon = new URL(url).openConnection
    urlCon.setConnectTimeout(5000)
    urlCon.setReadTimeout(5000)

    val text =
      try {
        Source.fromInputStream(urlCon.getInputStream).mkString
      } catch {
        case e: Exception => {
          Console.err.println("Could not read [" + url + "]")
          ""
        }
      }
    val size = text.length
    if (withLinks) {
      val links = ((Set(): Set[String]) /: linkPattern.findAllIn(text).matchData) {
      (s, md) =>
        val link = md.group("link")
        s + (if (link.indexOf("http") != 0) {
              if (link(0) == '/')
                (url + link)
              else
                (url + '/' + link)
            } else
              link
            )
      }

      val caller = self

      for (link <- links) {
        actor { caller ! (link, PageLoader.getPageSize(link, false) ) }
      }

      (size /: links) { (sum, link) =>
        receive {
          case(url, s: Int) =>
            sum + s
        }
      }
    } else
      size
  }
}

val urls = List("http://www.amazon.com",
                "http://www.twitter.com",
                "http://www.google.com",
                "http://www.cnn.com" )

def timeMethod(method : () => Unit) = {
  val start = System.nanoTime
  method()
  val end = System.nanoTime

  println("Method took " + (end - start) / 1000000000.0 + " seconds.")
}

def getPageSizeConcurrently() {
  val caller = self

  for(url <- urls) {
    actor { caller ! (url, PageLoader.getPageSize(url, true)) }
  }

  for (i <- 1 to urls.size) {
    receive {
      case(url, size) =>
        println("Size for " + url + ": " + size)
    }
  }
}

println("Concurrent run:")
timeMethod { getPageSizeConcurrently }

Output:

par_output.txt
Concurrent run:
Size for http://www.google.com: 280972
Size for http://www.twitter.com: 291430
Size for http://www.amazon.com: 12387773
Size for http://www.cnn.com: 11463136
Method took 103.645557 seconds.

Wrapping up Day 3 and Scala

I really feel I would enjoy working full time with Scala. I disagree with the author's statement that Scala has an academic syntax. It does not take that much time to get used to, and it feels quite a bit lighter than Java's (of course, coming from Ruby would provide an entirely different perspective).

As I was doing the exercises, I found a few dark corners, most notably about collections and iterators.

Welcome to Scala version 2.9.1.final (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_26).
Type in expressions to have them evaluated.
Type :help for more information.

scala> import scala.util.matching.Regex
import scala.util.matching.Regex

scala>

scala> val re = new Regex("a")
re: scala.util.matching.Regex = a

scala>

scala> val str = "abracadabra"
str: java.lang.String = abracadabra

scala>

scala> val matches = re.findAllIn(str).matchData
matches: java.lang.Object with Iterator[scala.util.matching.Regex.Match] = non-empty iterator

scala>

scala> matches foreach {md => println(md)}
a
a
a
a
a

scala> println(matches.size)
0

The code just finds regular expression matches in a string, prints each of them, then prints how many it found (at least, that is what I expected it to do).

But apparently the foreach call consumes the iterator, so size then returns 0.

Scala does try to tell me something when the REPL describes the result as a non-empty iterator. Unfortunately, the same code run non-interactively gives no such hint.

Obviously, it is not safe to keep references to some intermediate data structures (like iterators). I hope that as I learn more about the language, I will also learn to recognize such situations.
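
One way to side-step the issue is to materialize the iterator into a strict collection before doing anything else with it (a quick sketch against the same session):

// toList forces the iterator once; the resulting List can be traversed repeatedly
val matches = re.findAllIn(str).matchData.toList
matches foreach { md => println(md) }   // prints the five matches
println(matches.size)                   // 5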
