In this blog post, we explain how to set up Jupyter as a browser-based frontend to easily query and visualize your data.
Jupyter is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text (see Project Jupyter).
This tutorial consists of two parts.
You are currently reading part one, which explains the basic steps to set up and configure Jupyter.
It is essential to complete part one before continuing with part two!
Part two demonstrates how to run queries in Python and how to visualize data using matplotlib.
Prerequisites
Before starting this tutorial, please make sure your cluster is up and running.
You should have started the Spark shell at least once and run some queries to test its functionality.
To complete part two of this tutorial, you need sample data, which can be downloaded here:
https://mdocs.sap.com/mcm/public/v1/open?shr=LMV6pH_012dtA13N-rtiwGZUAnulqt2zX4MSfGXQ51w
This file contains TPC-H sample data at scale factor 0.001.
Please download the file and extract its content to your HDFS.
Alternatively, you may generate the sample data on your own by downloading and compiling DBGEN:
http://www.tpc.org/tpch/tools_download/dbgen-download-request.asp
Installation
To get started, we need to install several packages that should come bundled with your Linux distribution.
Please run the following commands on a RedHat-based machine:
sudo yum install python-pip
sudo yum install python-matplotlib
sudo yum install gcc-c++
sudo pip install --upgrade pip
sudo pip install jupyter
You may install Jupyter on a jumpbox outside the cluster, for example, on an Ubuntu-based system.
In that case, the first three commands are slightly different:
sudo apt-get install python-pip
sudo apt-get install python-matplotlib
sudo apt-get install g++
sudo pip install --upgrade pip
sudo pip install jupyter
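After the installation, it can be worth confirming that the packages are visible to Python. The following is a minimal sketch, assuming a Python 3 interpreter (importlib.util.find_spec is not available on Python 2). The helper name missing_modules is made up for this illustration, and "matplotlib" and "notebook" are assumed to be the importable module names of the packages installed above.

```python
import importlib.util


def missing_modules(names):
    """Return those top-level module names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "matplotlib" and "notebook" are assumed import names for the packages above.
    missing = missing_modules(["matplotlib", "notebook"])
    if missing:
        print("Not importable: " + ", ".join(missing))
    else:
        print("All required modules were found.")
```

If a module is reported as not importable, check that pip installed it for the same Python interpreter that you use to run the script.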
Environment
Next, we need to set some environment variables to inform Jupyter about our Spark and Python settings.
Please adjust the paths and version number below according to your local environment. Then either run these commands in the shell as the "vora" user, or put them in your ".profile" to have them loaded every time you log in. Note that the commands expect SPARK_HOME to be set already:
export PYTHONPATH=/home/vora/vora/python:$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip
export ADD_JARS=/home/vora/vora/lib/spark-sap-datasources-<version>-assembly.jar
export SPARK_CLASSPATH=$ADD_JARS
export PYSPARK_SUBMIT_ARGS="--master yarn-client --jars $ADD_JARS pyspark-shell"
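A typo in one of these variables may only surface later, when the notebook tries to use Spark. As a rough sanity check, the sketch below reports variables that are unset and PYTHONPATH entries that do not exist on disk. The function name check_environment and the list of required variables are assumptions made for this illustration, derived from the export commands above.

```python
import os

# Variables set or referenced by the export commands in this section.
REQUIRED_VARS = (
    "SPARK_HOME",
    "PYTHONPATH",
    "ADD_JARS",
    "SPARK_CLASSPATH",
    "PYSPARK_SUBMIT_ARGS",
)


def check_environment(environ):
    """Return a list of human-readable problems found in `environ` (a dict-like)."""
    problems = []
    for name in REQUIRED_VARS:
        if not environ.get(name):
            problems.append("{} is not set".format(name))
    # PYTHONPATH entries are separated by ":" on Linux.
    for entry in environ.get("PYTHONPATH", "").split(":"):
        if entry and not os.path.exists(entry):
            problems.append("PYTHONPATH entry does not exist: {}".format(entry))
    return problems


if __name__ == "__main__":
    for problem in check_environment(os.environ):
        print(problem)
```

Run the script in the same shell session in which you exported the variables. No output means that none of these particular checks failed.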
Configure Jupyter
Please run this command as the user "vora" to generate the initial configuration for Jupyter:
jupyter notebook --generate-config
Now, open an editor and edit the file "~/.jupyter/jupyter_notebook_config.py".
Since we are running on a remote machine with no window manager, we configure Jupyter not to open a web browser on startup.
Please uncomment the line
# c.NotebookApp.open_browser = False
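Uncommenting means removing the pound sign at the beginning of the line. If the generated file shows a different value for this setting, change it to the one shown here.
To be able to access Jupyter remotely, we need to uncomment the following line as well and make sure its value is '*':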
# c.NotebookApp.ip = '*'
Notice: This will give everyone who can reach the machine access to the Jupyter web interface.
In a production environment, you should set up access control.
Please refer to this guide on how to secure your Jupyter installation:
After applying the above changes to the config file, please save your changes and close the editor.
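If you prefer to script the edit rather than use an editor, the idea can be sketched in Python. The helper below is hypothetical (enable_setting is not part of Jupyter): it uncomments a setting in the text of the config file and assigns the given value, or appends the setting if the file does not mention it. It works on a string, so nothing is written to disk until you choose to save the result. The sample text at the bottom is illustrative and not copied from a real generated file.

```python
import re


def enable_setting(config_text, key, value):
    """Return config_text with `key` uncommented and set to `value`.

    `value` is the Python literal to write, given as a string, e.g. "False".
    """
    # Match the setting at the start of a line, with or without a leading "#".
    pattern = re.compile(
        r"^#?[ \t]*" + re.escape(key) + r"[ \t]*=.*$", re.MULTILINE
    )
    line = "{} = {}".format(key, value)
    if pattern.search(config_text):
        # A function replacement keeps `line` from being parsed for escapes.
        return pattern.sub(lambda match: line, config_text, count=1)
    # Setting not present at all: append it on its own line.
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + line + "\n"


if __name__ == "__main__":
    sample = "# c.NotebookApp.open_browser = True\n# c.NotebookApp.ip = 'localhost'\n"
    sample = enable_setting(sample, "c.NotebookApp.open_browser", "False")
    sample = enable_setting(sample, "c.NotebookApp.ip", "'*'")
    print(sample)
```

Keeping the helper free of file access makes it easy to inspect the result before overwriting the real config file.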
Notice: Cloud providers and IT departments are often restrictive and may block access to Jupyter's TCP port (default: 8888). Please make sure the firewall configuration includes a rule allowing access to that port on the machine running Jupyter. Consult the provider's documentation or your IT department for details.
Running Jupyter
To run Jupyter, first create an empty folder in which to store your notebooks and change into it, then start the notebook server.
For example, run the following commands as the user "vora":
mkdir notebooks
cd notebooks
jupyter notebook
This will start a Jupyter notebook server, listening on port 8888 for connections.
The console output will be similar to this:
[I 09:39:29.176 NotebookApp] Writing notebook server cookie secret to /run/user/1000/jupyter/notebook_cookie_secret
[W 09:39:29.200 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using encryption. This is not recommended.
[W 09:39:29.200 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using authentication. This is highly insecure and not recommended.
[I 09:39:29.204 NotebookApp] Serving notebooks from local directory: /home/d062985/notebooks
[I 09:39:29.204 NotebookApp] 0 active kernels
[I 09:39:29.204 NotebookApp] The IPython Notebook is running at: http://[all ip addresses on your system]:8888/
[I 09:39:29.204 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
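Once the server is running, a quick way to find out whether a firewall still blocks the port is to attempt a plain TCP connection from the machine you will browse from. The sketch below uses only the Python 3 standard library; the function name is_port_open is made up for this example.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and host names that do not resolve.
        return False
```

For example, if is_port_open("jumpbox.yourcluster.internal", 8888) returns False on your workstation while the server is running, a missing firewall rule is a likely cause. The check only tells you that the port accepts connections, not that the service behind it is Jupyter.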
Now we can open a web browser on another machine and navigate to the URL of the host running Jupyter, e.g. http://jumpbox.yourcluster.internal:8888/
You should see a website like this:
By clicking New, you can start a new notebook that is waiting for your input:
After clicking, the empty notebook will open up:
Now, we can start submitting queries by entering the query into a cell and hitting the play button on top.
This executes the snippet in the background and returns the results to the web page.
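Keep in mind that the code runs on the machine hosting the notebook server, not in your browser. A harmless first cell to try this out is sketched below; it only uses the Python standard library, and the function name kernel_summary is made up for this example.

```python
import platform


def kernel_summary():
    """Describe the Python interpreter and host that execute this cell."""
    return "Python {} on host {}".format(platform.python_version(), platform.node())


print(kernel_summary())
```

The host name in the output should be that of the Jupyter machine rather than the one your browser runs on.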
Submitting queries and plotting data
The final part of this tutorial will take place in Jupyter.
Please download the attached Jupyter Notebook "PythonBindings.ipynb.zip", unzip it, and copy it to the notebook folder on the machine running Jupyter.
Then, open the file in the Jupyter web interface in your web browser.