Channel: SCN : Blog List - All Communities

[HANA Vora] The Simple Word Count Example


For an overview of what SAP HANA Vora is, please check out:

SAP HANA Vora: An Overview

[SAP HANA Academy] Learn How to Install SAP HANA Vora on a Single Node

SAP HANA Vora - YouTube

SAP HANA Vora 1.0 – SAP Help Portal Page

 

With the introductions aside, Spark is fast becoming the de facto data processing engine for Hadoop; it's fast, flexible and operates "in-memory" (when the dataset can fit). I like to think of HANA Vora as an add-on to Spark. It provides added business features as well as "best in class" integration with HANA databases.

 

Let's now dive right into the typical "hello world" style example for Hadoop: the simple word count.

 

How often is "Watson" referred to directly in "The Adventures of Sherlock Holmes"?

Is the answer:

A) 42

B) 81

C) 136

D) The Sum of the Above

 

Note: The answer is at the bottom.

 

Spark and Vora support several languages such as Scala, Python and Java. Since Scala is still slightly ahead in popularity with Spark, I'll use that. To work with Vora you can use the Spark shell or a notebook application such as Zeppelin. In this example I use Zeppelin, which is also covered in the Vora installation steps and in the HANA Academy videos.

 

First, let's download a free copy of the book, strip out all special characters, collect the words, aggregate the results and finally store them as an in-memory resilient distributed dataset (RDD):

 

Scala: Process The File

import java.net.URL
import java.io.File
import org.apache.commons.io.FileUtils

// Download the external file to the target path.
// Note: java.io.File does not interpret the hdfs:// scheme, so this path is resolved on the driver's local file system.
val HDFS_NAMENODE = "107.20.0.138:8020"
val HDFS_DIR      = "/user/vora"
val tmpFile = new File(s"""hdfs://${HDFS_NAMENODE}${HDFS_DIR}/SherlockHolmes.txt""")
FileUtils.copyURLToFile(new URL("https://ia600300.us.archive.org/10/items/TheAdventuresOfHolmesSherlock/DoyleArthurConan-AdventuresOfSherlockHolmesThe.txt"), tmpFile)

// Read the file's lines as an Array[String] into a Spark RDD
val textFile = sc.parallelize(FileUtils.readLines(tmpFile).toArray.map(x => x.toString))

println("----------------------------------------")

// Print the first 2 lines of the file
textFile.take(2).foreach(println)

// Number of rows
println("Rows in File: " + textFile.count())

// Perform the full word count, stripping out special characters
// (everything except letters, decimal digits and whitespace)
val word_counts = textFile.flatMap(line => line.replaceAll("[^\\p{L}\\p{Nd}\\s]+", "").toLowerCase.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

// Put the results into a resilient distributed dataset (RDD) of case class instances
case class WordCount(word: String, wordcount: Long)
val wcRDD = word_counts.map(t => WordCount(t._1, t._2))

// First 10 rows of the word count
println("----------------------------------------")
println("First 10 Rows of Word Count:")
wcRDD.take(10).foreach(println)

 

In Zeppelin it appears as follows:

bl1.PNG

 

The Results of Executing are:

bl2.PNG
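The cleaning and counting step can also be tried without a cluster. The following is a minimal sketch that uses plain Scala collections as a stand-in for the RDD operations (flatMap, map and reduceByKey); the object name WordCountSketch, its methods and the sample sentences are made up for illustration and are not part of Spark or Vora. Unlike the code in this post, it splits on runs of whitespace and drops empty tokens.

```scala
object WordCountSketch {
  // Same cleaning rule as the Spark job: keep only letters, decimal digits and whitespace.
  def tokenize(line: String): Seq[String] =
    line.replaceAll("[^\\p{L}\\p{Nd}\\s]+", "")
      .toLowerCase
      .split("\\s+")       // split on runs of whitespace (the post splits on a single space)
      .filter(_.nonEmpty)  // drop the empty token that leading whitespace produces
      .toSeq

  // Local stand-in for flatMap + map + reduceByKey on an RDD.
  def countWords(lines: Seq[String]): Map[String, Long] =
    lines.flatMap(tokenize)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size.toLong }

  def main(args: Array[String]): Unit = {
    // Made-up sample lines, not taken from the book.
    val counts = countWords(Seq("My dear Watson.", "Watson, come here!", "Watson's hat"))
    println(counts.getOrElse("watson", 0L))
  }
}
```

Because the apostrophe is stripped, the possessive "Watson's" becomes the separate token "watsons" and is not counted under "watson", which is why the question asks about direct references.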

 

Next, let's use Vora to register the RDD as a temporary in-memory table and then run a SQL query to find how many times "Watson" appears in the book:

 

Scala: Use Vora to Register the RDD as a Temporary Table then Query Results

import org.apache.spark.sql._

val sapSqlContext = new SapSQLContext(sc)

val wordCountDataFrame = sapSqlContext.createDataFrame(wcRDD)

wordCountDataFrame.registerTempTable("wc")

val results = sapSqlContext.sql("SELECT word, wordcount FROM wc WHERE word = 'watson'").map {
  case Row(word: String, wordcount: Long) => word + "\t" + wordcount
}.collect()

 

Execute in Zeppelin:

bl3.PNG
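To see what the query and the pattern match produce without a Vora installation, here is a rough local equivalent on a plain Scala collection. It is only a sketch: WatsonLookupSketch and lookup are hypothetical names, and the rows in any test data are invented rather than real counts from the book.

```scala
object WatsonLookupSketch {
  // Mirrors the case class used for the temporary table in this post.
  case class WordCount(word: String, wordcount: Long)

  // Local equivalent of: SELECT word, wordcount FROM wc WHERE word = '<target>'
  // followed by the map that joins the two columns with a tab.
  def lookup(table: Seq[WordCount], target: String): Seq[String] =
    table.collect {
      case WordCount(word, wordcount) if word == target => word + "\t" + wordcount
    }
}
```

Each matching row comes back as a single tab-separated string, the same shape the results array has after collect().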

 

Finally, let's use Zeppelin to visualise the results:

Visualise with Zeppelin
println("%table Word\tCount\n" + results.mkString("\n"))

 

Execute in Zeppelin:

bl4.PNG
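The println above works because Zeppelin renders output that starts with %table as a table, reading tabs as column separators and newlines as row separators, with the first row as the header. A small helper along these lines could build that string for any header and rows; ZeppelinTableSketch and toTable are hypothetical names used for this sketch, not part of Zeppelin.

```scala
object ZeppelinTableSketch {
  // Builds the text Zeppelin's %table display expects:
  // tab-separated columns, newline-separated rows, header row first.
  def toTable(header: Seq[String], rows: Seq[Seq[Any]]): String = {
    val lines = header.mkString("\t") +: rows.map(_.mkString("\t"))
    "%table " + lines.mkString("\n")
  }
}
```

In a notebook paragraph this would be used as println(ZeppelinTableSketch.toTable(Seq("Word", "Count"), rows)), where rows holds one sequence of values per table row.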

 

Note: Zeppelin's visualisation capabilities are better demonstrated with %vora SQL statements if the results have been stored to HDFS.

 

 

So the answer is B) 81

 

Did you guess right?

 

At TechEd 2015, SAP HANA Vora was also demonstrated processing a petabyte of data for Intel, so hopefully some more challenging Vora examples will follow on SCN from Mr Appleby and co. soon.

 

But in the meantime it's still always fun to use a sledgehammer to crack a nut. I hope you enjoyed it.

