The Native Spark Modeling feature has been available since SAP BusinessObjects Predictive Analytics version 2.5, which supported it for classification scenarios. The latest release of SAP BusinessObjects Predictive Analytics (version 3.0) supports regression scenarios as well. The main business benefit of Native Spark Modeling is the ability to train more models in a shorter period of time, and therefore to gain better insight into business challenges by learning from the predictive models and targeting the right customers quickly.
Native Spark Modeling is also known as IDBM (In-Database Modeling). With this feature of SAP BusinessObjects Predictive Analytics, model training and scoring can be pushed down to the Hadoop database level through the Spark layer. The capability is delivered in Hadoop through a Scala program running in the Spark engine.
In this blog you will get familiar with the end-to-end configuration of Native Spark Modeling on Hadoop using SAP BusinessObjects Predictive Analytics.
Let’s review the steps to set up Native Spark Modeling in detail:
1. Install SAP BusinessObjects Predictive Analytics
Depending on your deployment choice, install either the desktop or the client/server mode. Refer to the steps described in the installation overview or in the installation guides.
During installation, all the required configuration files and the pre-delivered packages for Native Spark Modeling will be installed in the local desktop or server location.
2. Check SAP BusinessObjects Predictive Analytics installation
In this scenario, the SAP BusinessObjects Predictive Analytics server has been chosen as the deployment option and is installed on a Windows server. After a successful installation, you will find the installation folder structure in the local directory of the Windows server.
Because SAP BusinessObjects Predictive Analytics 3.0 Server is installed, navigate to the SAP Predictive Analytics\Server 3.0\ folder on the Windows server. There you will see the SparkConnector folder, which contains all the required configuration files and the Native Spark Modeling functionality in the form of ‘jar’ files.
Open the SparkConnector folder to check its directory structure. It contains the “bin”, “hadoopConfig” and “Jars” folders referred to in the following steps.
3. Check whether the winutils.exe file exists in the “bin” folder for Windows installations
Apache Spark requires the executable file winutils.exe to function correctly on the Windows Operating System when running against a non-Windows cluster.
4. Check the required client configuration XML files in the “hadoopConfig” folder
Create a sub folder for each Hive ODBC DSN. For example, in this scenario the sub folder is named “IDBM_HIVE_DUB_CLOUDERA”. (Note: This is not a fixed name; you can name it as you prefer.)
Each sub folder should contain the 3 Hadoop client XML configuration files for the cluster (core-site.xml, hive-site.xml, yarn-site.xml). You can download these files with admin tools such as Hortonworks Ambari or Cloudera Manager.
Note: This sub folder is linked to the Hive ODBC DSN by the SparkConnections.ini file property "HadoopConfigDir", not by the subfolder name.
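The checks in steps 3 and 4 can also be scripted. The following is only a minimal sketch in Python (Python is not part of the product, and the function name check_spark_connector is made up for this illustration). It assumes the “bin” and “hadoopConfig” folders sit directly under the SparkConnector folder, as described above.

```python
from pathlib import Path

# The three Hadoop client configuration files expected for each DSN sub folder.
REQUIRED_XML = ("core-site.xml", "hive-site.xml", "yarn-site.xml")


def check_spark_connector(spark_connector_dir, config_subfolder):
    """Return the paths of expected files that are missing.

    spark_connector_dir: path of the SparkConnector folder.
    config_subfolder: name of the DSN sub folder under hadoopConfig.
    """
    root = Path(spark_connector_dir)
    expected = [root / "bin" / "winutils.exe"]
    expected += [root / "hadoopConfig" / config_subfolder / name
                 for name in REQUIRED_XML]
    return [str(path) for path in expected if not path.is_file()]
```

Calling check_spark_connector with the SparkConnector folder of your installation and the sub folder name (for example “IDBM_HIVE_DUB_CLOUDERA”) returns an empty list when winutils.exe and the three XML files are all in place.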
5. Download the required Spark version jar into the “Jars” folder
Download the additional assembly jar files from the links below and copy them into the SparkConnector/Jars folder.
- http://help.sap.com/disclaimer?site=http://archive.apache.org/dist/spark/spark-1.5.0/spark-1.5.0-bin-hadoop2.6.tgz
- http://help.sap.com/disclaimer?site=http://archive.apache.org/dist/spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.6.tgz
6. Configure Spark.cfg (for client/server mode) or KJWizardJNI.ini (for desktop mode) to set the right Spark version and path
As the SAP BusinessObjects Predictive Analytics server is installed here, open the Spark.cfg file within the Server 3.0 folder in Notepad or any other text editor. Native Spark Modeling supports the Spark versions currently offered by the two major Hadoop enterprise vendors (Cloudera and Hortonworks).
As a Cloudera Hadoop server is being used in this scenario, keep the configuration path of Spark version 1.5.0 for the Cloudera server active in the Spark.cfg configuration file and comment out the Spark version for the Hortonworks server. The path to the connection folders and some tuning options can also be set here.
For the server, navigate to the folder C:\Program Files\SAP Predictive Analytics\Server 3.0\SparkConnector\ and edit the Spark.cfg file.
For Desktop, navigate to the folder C:\Program Files\SAP Predictive Analytics\Server 3.0\EXE\Clients\KJWizardJNI and edit the KJWizardJNI.ini file.
7. Set up Model Training Delegation for Native Spark Modeling
In the Automated Analytics menu, navigate to File -> Preferences -> Model Training Delegation.
By default, the “Native Spark Modeling when possible” flag should be switched on. If it is not, switch it on, then press the OK button.
8. Create an ODBC connection to Hive Server as a data source for Native Spark Modeling
This connection will later be used in Automated Analytics to select an Analytic Data Source (ADS) or Hive tables as the input data source for Native Spark Modeling.
- Open the Windows ODBC Data Source Administrator
- In the User DSN tab press Add
- Select 'DataDirect 7.1 SP5 Apache Hive Wire Protocol' from the driver list and press Finish
- In the General tab enter:
  - Data Source Name: IDBM_HIVE_DUB_CLOUDERA (this is just an example; no particular name is required)
  - Host Name: xxxxx.xxx.corp
  - PortNumber: 10000
  - Database Name: default
- In the Security tab set Authentication Method to '0 - User ID/Password' and set the User Name and Password.
- Switch on the “Use Native Catalog Functions” flag. This disables the SQL Connector feature and allows the driver to execute HiveQL directly.
- Press the "Test Connect" button.
- If the connection is successful, press Apply and then OK. If the connection test fails even though the connection information is correct, make sure that the Hive Thrift server is running.
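As a complement to the "Test Connect" button, the DSN can also be checked from a script. The sketch below is only an illustration: it assumes Python with the third-party pyodbc package, neither of which is part of SAP BusinessObjects Predictive Analytics, and the function names are made up for this example. The user name and password are whatever you entered in the Security tab.

```python
def build_connection_string(dsn, user, password):
    """Build a standard ODBC connection string for an existing DSN."""
    return f"DSN={dsn};UID={user};PWD={password}"


def list_hive_tables(dsn, user, password):
    """Connect through the DSN and return the table names reported by Hive."""
    import pyodbc  # assumption: installed separately, not shipped with the product

    # autocommit=True is commonly needed with Hive, which does not
    # support ODBC transactions.
    conn = pyodbc.connect(build_connection_string(dsn, user, password),
                          autocommit=True)
    try:
        cursor = conn.cursor()
        cursor.execute("SHOW TABLES")
        return [row[0] for row in cursor.fetchall()]
    finally:
        conn.close()
```

If list_hive_tables("IDBM_HIVE_DUB_CLOUDERA", user, password) returns the tables of the default database, the DSN is usable as a data source; an exception at connect time points to the same causes as a failed "Test Connect".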
9. Set up the SparkConnections.ini file for your individual ODBC DSN
This file contains the Spark connection entries, specific to each Hive data source name (DSN). For example, if there are 3 Hive ODBC DSNs, you have the flexibility to run two of them on IDBM and not the third: a DSN that is not present in the SparkConnections.ini file falls back to the normal modeling process using the Automated Analytics engine. To set the required configuration parameters for Native Spark Modeling, navigate to the SAP BusinessObjects Predictive Analytics 3.0 Desktop/Server installation folder (for a server installation, C:\Program Files\SAP Predictive Analytics\Server 3.0\SparkConnector\; for a Desktop installation, C:\Program Files\SAP Predictive Analytics\Desktop 3.0\Automated\SparkConnector), edit the SparkConnections.ini file and save it.
As a Cloudera Hadoop box is being used in this scenario, you need to set the parameters in the file according to the configuration requirements of Cloudera clusters.
For Cloudera Clusters:
- To enable Native Spark Modeling against a Hive data source, you need to define at least the minimum properties below. Each entry after "SparkConnection" needs to match the Hive ODBC DSN (Data Source Name) exactly.
- Upload the Spark 1.5.0 assembly jar to HDFS and reference the HDFS location.
SparkConnection.IDBM_HIVE_DUB_CLOUDERA.native."spark.yarn.jar"="hdfs://hostname:8020/jars/spark-assembly-1.5.0-hadoop2.6.0.jar"
- Set hadoopConfigDir and hadoopUserName. These two parameters are mandatory for each DSN.
- hadoopConfigDir (the directory containing the core-site.xml, yarn-site.xml and hive-site.xml files for this DSN). Use a relative path to the Hadoop client XML config files.
For example:
SparkConnection.IDBM_HIVE_DUB_CLOUDERA.hadoopConfigDir="../../../SparkConnector/hadoopConfig/IDBM_HIVE_DUB_CLOUDERA"
- hadoopUserName (a user name with privileges to run Spark on YARN)
For example:
SparkConnection.IDBM_HIVE_DUB_CLOUDERA.hadoopUserName="hive"
- It is possible to pass native Spark parameters to set Spark properties by using "native" in the property name.
For example:
SparkConnection.MY_HDP_HIVE_DSN.native."spark.executor.instances"="4"
(This parameter sets the number of executors. Note that this property is incompatible with spark.dynamicAllocation.enabled. If both spark.dynamicAllocation.enabled and spark.executor.instances are specified, dynamic allocation is turned off and the specified number of executors is used.)
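To avoid typing mistakes when several DSNs have to be configured, the entries above can be generated. The following Python sketch is only an illustration (the function name spark_connection_entries is hypothetical, and Python is not part of the product); it simply reproduces the entry format shown above for one DSN.

```python
def spark_connection_entries(dsn, hadoop_config_dir, hadoop_user,
                             spark_yarn_jar, native=None):
    """Return the SparkConnections.ini lines for one Hive ODBC DSN.

    dsn must match the Hive ODBC data source name exactly.
    native is an optional dict of additional Spark properties,
    for example {"spark.executor.instances": 4}.
    """
    prefix = f"SparkConnection.{dsn}"
    lines = [
        # The two mandatory parameters for each DSN.
        f'{prefix}.hadoopConfigDir="{hadoop_config_dir}"',
        f'{prefix}.hadoopUserName="{hadoop_user}"',
        # HDFS location of the uploaded Spark assembly jar.
        f'{prefix}.native."spark.yarn.jar"="{spark_yarn_jar}"',
    ]
    for key, value in (native or {}).items():
        lines.append(f'{prefix}.native."{key}"="{value}"')
    return lines
```

Printing the result with print("\n".join(...)) gives text that can be pasted into the SparkConnections.ini file.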
For Hortonworks Clusters:
In addition to the configuration described above, the following 2 properties are mandatory and need to match the HDP version exactly:
- SparkConnection.IDBM_HIVE_HDW.native."spark.yarn.am.extraJavaOptions"="-Dhdp.version=2.3.2.0-2950"
- SparkConnection.IDBM_HIVE_HDW.native."spark.driver.extraJavaOptions"="-Dhdp.version=2.3.2.0-2950"
(Where IDBM_HIVE_HDW is the ODBC connection to the Hortonworks Hadoop system.)
(Note: This is a one-time configuration activity. You can get help from your IT administrator to set up the SparkConnections.ini file.)
The tuning options for Spark can also be set here. This is useful when the dataset is too large for the memory available under the default Spark settings on the cluster.
# (Optional) performance tuning parameters
#SparkConnection.IDBM_HIVE_DUB_CLOUDERA.native."spark.executor.instances"="4"
#SparkConnection.IDBM_HIVE_DUB_CLOUDERA.native."spark.executor.cores"="2"
#SparkConnection.IDBM_HIVE_DUB_CLOUDERA.native."spark.executor.memory"="4g"
#SparkConnection.IDBM_HIVE_DUB_CLOUDERA.native."spark.driver.maxResultSize"="4g"
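Because a DSN without the two mandatory parameters will not work as expected, it can be worth checking the finished file. The sketch below is a hedged illustration in Python (not an SAP tool; the function name missing_mandatory is hypothetical). It assumes the entry format shown in this blog, that lines starting with # are comments, and that DSN names contain no dots.

```python
import re
from collections import defaultdict

# SparkConnection.<DSN>.<property>="<value>"
ENTRY = re.compile(r'^SparkConnection\.([^.\s]+)\.(.+?)\s*=\s*"(.*)"\s*$')
MANDATORY = ("hadoopConfigDir", "hadoopUserName")


def missing_mandatory(ini_text):
    """Map each DSN found in the text to the mandatory parameters it lacks."""
    props = defaultdict(dict)
    for raw in ini_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out entries
        match = ENTRY.match(line)
        if match:
            dsn, key, value = match.groups()
            props[dsn][key] = value
    report = {}
    for dsn, found in props.items():
        missing = [key for key in MANDATORY if key not in found]
        if missing:
            report[dsn] = missing
    return report
```

An empty dictionary means every DSN listed in the file has both hadoopConfigDir and hadoopUserName set.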
10. Run Native Spark Modeling using the Automated Analytics Modeler
Open Automated Analytics and select the classification or regression data mining method.
Select the ODBC connection to the Hadoop server (e.g. IDBM_HIVE_DUB_CLOUDERA) and select the Hive table as the data source.
Choose an existing Hive table using the 'Use Database Table' option, or an Analytical Dataset based on Hive tables using the 'Use Data Manager' option.
Click the NEXT button and load the description of the dataset from a local file, or click Analyze to pull the metadata of the Hive table into Automated Analytics.
On the next screen, select the input and target variables. (For a classification scenario, an example of a target variable could be one that indicates whether a customer of a bank has a credit card: Credit_card_Exist (=Yes/No).)
Then navigate to the next screen and click Generate to start the model training.
Now the process is delegated to the Spark layer. In the progress bar you will notice several processing steps taking place in sequence.
You can observe the Spark jobs in detail through the application Web UI, which can be opened by typing http://localhost:4040 in the browser.
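The same information can be read programmatically. As a sketch only: Spark (1.4 and later) exposes a monitoring REST API under /api/v1 on the same port as the Web UI, and the Python code below (function names are made up for this example) counts the jobs of the running application by status. It assumes the Web UI is reachable at http://localhost:4040 while the training is running.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_json(url, timeout=5):
    """Download a URL and decode its JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def count_statuses(jobs):
    """Count job entries by their 'status' field."""
    return dict(Counter(job["status"] for job in jobs))


def job_status_counts(base_url="http://localhost:4040"):
    """Return the number of Spark jobs per status for all listed applications."""
    counts = Counter()
    for app in fetch_json(f"{base_url}/api/v1/applications"):
        jobs = fetch_json(f"{base_url}/api/v1/applications/{app['id']}/jobs")
        counts.update(count_statuses(jobs))
    return dict(counts)
```

Calling job_status_counts() during model training returns a dictionary of job counts keyed by status; once the Spark application has finished, the Web UI on port 4040 is no longer available and the call fails with a connection error.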
Configuring Native Spark Modeling doesn’t require any coding. In a very short amount of time, you can set up Native Spark Modeling within SAP BusinessObjects Predictive Analytics and work with a Hadoop data source as you would with any other database. From the user’s point of view, you won’t experience any difference.
For more information on how Native Spark Modeling works please refer to the blog: http://scn.sap.com/community/predictive-analytics/blog/2016/03/18/big-data-native-spark-modeling-in-sap-predictive-analytics-25.
Please let us know about your experience when using Native Spark Modeling. If you encounter any problems, feel free to ask questions on SCN.