CASSANDRA-SPARK INSTALLATION AND CONNECTION
- Install JDK, Python, Scala
- Set the path for JDK and Python 3
- Install & configure Cassandra
- Load data into Cassandra
- Install & configure Spark
- Set the path for Spark
- Set up the Cassandra-Spark connector and test it
INSTALLATION OF JAVA ON CENTOS 7
ISSUE RESOLUTION RELATED TO THE SUDOERS FILE:
Add the cass user to the wheel group so it can run sudo, then install the JDK:
[root@cass-1 ~]# usermod -aG wheel cass
[root@cass-1 ~]# su - cass
[cass@cass-1 ~]$ sudo yum install java-11-openjdk
Verify the installation:
[cass@cass-1 ~]$ java -version
INSTALL SCALA ON CENTOS 7
[cass@cass-1 ~]$ sudo yum install wget -y
[sudo] password for cass:
[cass@cass-1 ~]$ wget https://downloads.lightbend.com/scala/2.13.0/scala-2.13.0.rpm
[cass@cass-1 ~]$ sudo yum install scala-2.13.0.rpm
Verify Scala Version:
[cass@cass-1 ~]$ scala -version
INSTALL PYTHON 3 ON CENTOS 7
[cass@cass-1 ~]$ sudo yum install -y python3
Verify Python:
[cass@cass-1 ~]$ python3 --version
Set the bash profile for JDK, Python 3, and Spark
[cass@cass-1 ~]$ vi .bash_profile
# .bash_profile
#Get the aliases and functions
if [ -f ~/.bashrc ]; then
        . ~/.bashrc
fi

# User specific environment and startup programs
# Adjust JAVA_HOME to match the JDK version installed above
JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.282.b08-1.el7_9.x86_64
JRE_HOME=$JAVA_HOME/jre
SPARK_HOME=/spark-ser/spark
PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip
PATH=$PATH:$HOME/.local/bin:$HOME/bin:$JAVA_HOME/bin:$JRE_HOME/bin:$SPARK_HOME/bin
export PATH JAVA_HOME JRE_HOME SPARK_HOME PYTHONPATH

Note: SPARK_HOME is defined before PYTHONPATH because PYTHONPATH references it; the /spark-ser/spark symlink is created in the Spark installation section below.

Reload the bash profile:
[cass@cass-1 ~]$ source ~/.bash_profile
INSTALLATION & CONFIGURATION OF CASSANDRA
[cass@cass-1 ~]$ sudo vi /etc/yum.repos.d/cassandra.repo
[sudo] password for cass:
[cassandra]
name=Apache Cassandra
baseurl=https://www.apache.org/dist/cassandra/redhat/22x/
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://www.apache.org/dist/cassandra/KEYS
[cass@cass-1 ~]$ sudo yum install cassandra -y
[cass@cass-1 ~]$ sudo mkdir /cassandra-ser
[cass@cass-1 ~]$ sudo chown -R cass:cass /cassandra-ser
[cass@cass-1 ~]$ sudo chmod -R 777 /cassandra-ser
CHANGES MADE IN /etc/cassandra/conf/cassandra.yaml
cluster_name: 'test_clustert'
seeds: "192.168.1.200"
listen_address: 192.168.1.200
rpc_address: 192.168.1.200
endpoint_snitch: GossipingPropertyFileSnitch
data_file_directories:
    - /cassandra-ser/data
commitlog_directory: /cassandra-ser/commitlog
saved_caches_directory: /cassandra-ser/saved_caches
Start Cassandra and enable it to start with the system:
[root@cass-1 conf]# sudo service cassandra start
[root@cass-1 conf]# nodetool status
[root@cass-1 conf]# sudo systemctl enable cassandra.service
NOTE:
#Start Cassandra
sudo service cassandra start
#Start Cassandra at startup
sudo systemctl enable cassandra.service
#Stop Cassandra
sudo service cassandra stop
INSERTING DATA INTO CASSANDRA
Download a movies & ratings dataset (MovieLens-style movies.csv and ratings.csv files were used here) and transfer the files to /cassandra-ser.
[cass@cass-1 ~]$ cd /cassandra-ser/
[cass@cass-1 cassandra-ser]$ ls -ltrh
total 727M
drwxr-xr-x. 2 cassandra cassandra    6 Apr 27 06:59 saved_caches
drwxr-xr-x. 2 cassandra cassandra   80 Apr 27 06:59 commitlog
drwxr-xr-x. 6 cassandra cassandra   86 Apr 27 06:59 data
-rw-rw-r--. 1 cass      cass      2.8M Apr 27 07:11 movies.csv
-rw-rw-r--. 1 cass      cass      725M Apr 27 07:11 ratings.csv
[cass@cass-1 cassandra-ser]$ cqlsh 192.168.1.200
cqlsh> create keyspace lab with replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
cqlsh> create table lab.movies (movie_id int primary key, title text, genres text);
cqlsh> use lab;
cqlsh:lab> copy movies(movie_id, title, genres) from '/cassandra-ser/movies.csv' with header = true;
cqlsh:lab> create table lab.ratings (user_id int, movie_id int, rating double, timestamp bigint, primary key((user_id), movie_id));
cqlsh:lab> copy ratings(user_id, movie_id, rating, timestamp) from '/cassandra-ser/ratings.csv' with header = true;
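Optionally, the load can be sanity-checked from outside cqlsh. A minimal sketch using the DataStax Python driver (an extra dependency, installable with pip3 install cassandra-driver; not part of the original setup):
>>> from cassandra.cluster import Cluster
>>> cluster = Cluster(['192.168.1.200'])    # same host configured in cassandra.yaml above
>>> session = cluster.connect('lab')        # keyspace created above
>>> session.execute('SELECT movie_id, title FROM movies LIMIT 1').one()  # returns one Row if the copy succeeded
>>> cluster.shutdown()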
INSTALLATION & CONFIGURATION OF SPARK
[cass@cass-1 cassandra-ser]$ sudo mkdir /spark-ser
[cass@cass-1 cassandra-ser]$ sudo chown cass:cass /spark-ser
[cass@cass-1 cassandra-ser]$ cd /spark-ser/
[cass@cass-1 spark-ser]$ wget https://mirrors.estointernet.in/apache/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
[cass@cass-1 spark-ser]$ tar -xvf spark-3.1.1-bin-hadoop3.2.tgz
[cass@cass-1 spark-ser]$ ln -s /spark-ser/spark-3.1.1-bin-hadoop3.2 /spark-ser/spark
[cass@cass-1 spark-ser]$ ls -ltrh
total 219M
drwxr-xr-x. 13 cass cass  211 Feb 21 21:11 spark-3.1.1-bin-hadoop3.2
-rw-rw-r--.  1 cass cass 219M Feb 21 21:45 spark-3.1.1-bin-hadoop3.2.tgz
lrwxrwxrwx.  1 cass cass   36 Apr 27 07:34 spark -> /spark-ser/spark-3.1.1-bin-hadoop3.2
[cass@cass-1 spark-ser]$ cd spark/conf
[cass@cass-1 conf]$ ls -ltrh
total 36K
-rw-r--r--. 1 cass cass  865 Feb 21 21:11 workers.template
-rwxr-xr-x. 1 cass cass 4.4K Feb 21 21:11 spark-env.sh.template
-rw-r--r--. 1 cass cass 1.3K Feb 21 21:11 spark-defaults.conf.template
-rw-r--r--. 1 cass cass 9.0K Feb 21 21:11 metrics.properties.template
-rw-r--r--. 1 cass cass 2.0K Feb 21 21:11 log4j.properties.template
-rw-r--r--. 1 cass cass 1.1K Feb 21 21:11 fairscheduler.xml.template
[cass@cass-1 conf]$ cp workers.template workers
[cass@cass-1 conf]$ cp spark-defaults.conf.template spark-defaults.conf
Edit the two config files (see the minimal sketch below):
[cass@cass-1 conf]$ vi workers
[cass@cass-1 conf]$ vi spark-defaults.conf
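The contents of these files aren't shown above; for this single-node setup they can be as minimal as the following sketch (the IP is the host used throughout; the spark.master line is an assumption, standalone masters listen on port 7077 by default):
# workers — one worker hostname/IP per line
192.168.1.200
# spark-defaults.conf — optional defaults, e.g. the standalone master URL
spark.master    spark://192.168.1.200:7077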
Establish passwordless SSH connectivity among the nodes.
For one node:
[cass@cass-1 ~]$ cd
[cass@cass-1 ~]$ ssh-keygen
[cass@cass-1 ~]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
[cass@cass-1 ~]$ chmod 600 ~/.ssh/authorized_keys
[cass@cass-1 ~]$ sudo vi /etc/systemd/system/spark.service
[Service]
User=cass
Type=forking
ExecStart=/spark-ser/spark/sbin/start-all.sh
ExecStop=/spark-ser/spark/sbin/stop-all.sh
TimeoutSec=30
Restart=on-failure
RestartSec=30
StartLimitInterval=350
StartLimitBurst=10

[Install]
WantedBy=multi-user.target

Note: systemd does not expand shell variables such as $SPARK_HOME in ExecStart/ExecStop, so absolute paths are used here.

[cass@cass-1 ~]$ sudo systemctl daemon-reload
[cass@cass-1 ~]$ sudo systemctl start spark.service
[cass@cass-1 ~]$ sudo systemctl enable spark.service
NOTE:
#Reload the daemon processes
sudo systemctl daemon-reload
#Start Spark using systemctl
sudo systemctl start spark.service
#Enable Spark to start at system startup
sudo systemctl enable spark.service
#Check the web UI for status
http://192.168.1.200:8080/
#Access spark-shell
$SPARK_HOME/bin/spark-shell
#Start Spark manually
$SPARK_HOME/sbin/start-all.sh
#Stop Spark manually
$SPARK_HOME/sbin/stop-all.sh
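The same status check can be scripted. A small Python sketch against the standalone master UI's JSON view (assumes the UI is reachable at the URL above; /json is the JSON endpoint the standalone master UI serves):
>>> import json, urllib.request
>>> with urllib.request.urlopen('http://192.168.1.200:8080/json/') as resp:
...     status = json.load(resp)
>>> print(status['status'], 'workers:', len(status['workers']))   # expect ALIVE and one worker per entry in the workers file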
Find the latest Cassandra connector for Spark in the Maven repository. Launch the Spark shell with the Cassandra connector package and the Cassandra host configuration.
#Scala/Spark-shell
$SPARK_HOME/bin/spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.1 --conf spark.cassandra.connection.host=192.168.1.200
#Python Shell
$SPARK_HOME/bin/pyspark --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.1 --conf spark.cassandra.connection.host=192.168.1.200
#Once you're in the interactive shell, you can test your connectivity (the spark session is already created for you):
>>> load_options = {"table": "movies", "keyspace": "lab"}
>>> df = spark.read.format("org.apache.spark.sql.cassandra").options(**load_options).load()
>>> df.show()
>>> df.write.csv('/tmp/mycsv.csv')
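As a further connectivity test, the two tables can be joined in the same pyspark session. A short sketch (reuses the df of movies loaded above and assumes lab.ratings was populated earlier):
>>> from pyspark.sql import functions as F
>>> ratings = spark.read.format("org.apache.spark.sql.cassandra").options(table="ratings", keyspace="lab").load()
>>> avg_ratings = ratings.groupBy("movie_id").agg(F.avg("rating").alias("avg_rating"), F.count("*").alias("num_ratings"))
>>> avg_ratings.join(df, "movie_id").orderBy(F.desc("num_ratings")).select("title", "avg_rating", "num_ratings").show(10, truncate=False)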