For an overview of what SAP HANA Vora is, please check out:
- [SAP HANA Academy] Learn How to Install SAP HANA Vora on a Single Node
- SAP HANA Vora 1.0 – SAP Help Portal Page
With the introductions aside, Spark is fast becoming the de facto data processing engine for Hadoop; it's fast, flexible and operates in-memory (when the dataset fits). I like to think of HANA Vora as an add-on to Spark: it provides added business features as well as best-in-class integration with HANA databases.
Let's now dive right into the typical "hello world"-style example for Hadoop: the simple word count.
How often is "Watson" referred to directly in "The Adventures of Sherlock Holmes"?
Is the answer:
A) 42
B) 81
C) 136
D) The Sum of the Above
Note: The answer is at the bottom.
Spark and Vora support several languages, such as Scala, Python and Java. Since Scala is still slightly ahead in terms of popularity with Spark, I'll use that. For utilising Vora you can use the Spark shell or a notebook application such as Zeppelin. In this example I use Zeppelin, which is covered in the Vora installation steps as well as in the HANA Academy videos.
First, let's download a free copy of the book, strip out all special characters, collect the words, aggregate the results and finally store them as an in-memory resilient distributed dataset (RDD):
Scala: Process The File

```scala
import java.net.URL
import java.io.File
import org.apache.commons.io.FileUtils

// Download the external file (target path constructed from the HDFS namenode and directory)
val HDFS_NAMENODE = "107.20.0.138:8020"
val HDFS_DIR = "/user/vora"
val tmpFile = new File(s"""hdfs://${HDFS_NAMENODE}${HDFS_DIR}/SherlockHolmes.txt""")
FileUtils.copyURLToFile(new URL("https://ia600300.us.archive.org/10/items/TheAdventuresOfHolmesSherlock/DoyleArthurConan-AdventuresOfSherlockHolmesThe.txt"), tmpFile)

// Read the file, one line per element, into a Spark RDD[String]
val textFile = sc.parallelize(FileUtils.readLines(tmpFile).toArray.map(x => x.toString))

println("----------------------------------------")

// Print the first 2 lines of the file
textFile.take(2).foreach(println)

// Number of rows
println("Rows in File: " + textFile.count())

// Perform the full word count: strip out special characters, lower-case, split and reduce
val word_counts = textFile
  .flatMap(line => line.replaceAll("[^\\p{L}\\p{Nd}\\s]+", "").toLowerCase.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Put the results into a resilient distributed dataset (RDD) of case class instances
case class WordCount(word: String, wordcount: Long)
var wcRDD = word_counts.map(t => WordCount(t._1, t._2))

// First 10 rows of the word count
println("----------------------------------------")
println("First 10 Rows of Word Count:")
wcRDD.take(10).foreach(println)
```
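As an aside, if the book were already sitting in HDFS, the same RDD could be built directly with `sc.textFile` rather than staging a copy with FileUtils. A minimal sketch, assuming the same namenode and directory as above:

```scala
// Sketch only: build the RDD straight from HDFS. This assumes the file has
// already been copied into HDFS at the path below (e.g. with `hdfs dfs -put`).
val textFileHdfs = sc.textFile(s"hdfs://${HDFS_NAMENODE}${HDFS_DIR}/SherlockHolmes.txt")

// One element per line, exactly like the parallelize-based version
println("Rows in file (HDFS read): " + textFileHdfs.count())
```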
In Zeppelin it appears as follows:
The results of executing it are:
Next, let's use Vora to register the RDD as a temporary "in-memory" table and then run a SQL query to find how many times "Watson" appears in the book:
Scala: Use Vora to Register the RDD as a Temporary Table, then Query the Results

```scala
import org.apache.spark.sql._

// Create the SAP (Vora) SQL context
val sapSqlContext = new SapSQLContext(sc)

// Convert the WordCount RDD into a DataFrame
val wordCountDataFrame = sapSqlContext.createDataFrame(wcRDD)

// Register it as a temporary "in-memory" table
wordCountDataFrame.registerTempTable("wc")

// Query the table: how many times does "watson" appear?
val results = sapSqlContext.sql("SELECT word, wordcount FROM wc WHERE word = 'watson'")
  .map { case Row(word: String, wordcount: Long) => word + "\t" + wordcount }
  .collect()
```
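Since `createDataFrame` returns a standard Spark DataFrame, the same filter can also be expressed with the DataFrame API instead of SQL. A quick sketch for comparison (not part of the original example):

```scala
// Equivalent filter via the DataFrame API (sketch; same result as the SQL above)
val watsonDF = wordCountDataFrame.filter(wordCountDataFrame("word") === "watson")
watsonDF.show()
```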
Execute in Zeppelin:
Finally, let's use Zeppelin to visualise the results:
Visualise with Zeppelin

```scala
// Emit Zeppelin's %table display directive followed by tab-separated rows
println("%table Word\tCount\n" + results.mkString("\n"))
```
Execute in Zeppelin:
Note: Zeppelin's visualisation capabilities are better demonstrated with %vora SQL statements once the results have been stored to HDFS.
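For reference, one way to get the results into HDFS from this example is to save the word-count RDD as plain text; the output path below is an assumption for illustration only. A Vora table could then be defined over those files (using the com.sap.spark.vora data source described in the Vora documentation) and queried through the %vora interpreter.

```scala
// Sketch only: persist the word counts to HDFS as comma-separated text so a
// Vora table can later be created over the files. The output path is an
// assumption; adjust it to your cluster.
wcRDD.map(wc => s"${wc.word},${wc.wordcount}")
  .saveAsTextFile(s"hdfs://${HDFS_NAMENODE}${HDFS_DIR}/wordcounts")
```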
So the Answer is B) 81
Did you guess right?
At SAP TechEd 2015, SAP HANA Vora was also demonstrated processing a petabyte of data for Intel, so hopefully some more challenging Vora examples will follow on SCN from Mr Appleby and co soon.
But in the meantime, it's always fun to use a sledgehammer to crack a nut. I hope you enjoyed it.