A technical troubleshooting blog about Oracle with other Databases & Cloud Technologies.

CASSANDRA-SPARK INSTALLATION AND CONNECTION

  1. Install JDK, PYTHON, SCALA
  2. Set path for JDK and Python3
  3. Install & Configure Cassandra
  4. Load Data In Cassandra
  5. Install & Configure Spark
  6. Set Path for Spark
  7. Set up the Cassandra-Spark connector and test

INSTALLATION OF JAVA ON CENTOS 7

GRANT SUDO ACCESS TO THE cass USER (ADD IT TO THE wheel GROUP):

[root@cass-1 ~]# usermod -aG wheel cass
[root@cass-1 ~]# su - cass
[cass@cass-1 ~]$ sudo yum install java-1.8.0-openjdk
Verify the installation:

[cass@cass-1 ~]$ java -version

INSTALL SCALA ON CENTOS7

[cass@cass-1 ~]$ sudo yum install wget -y

[sudo] password for cass:

[cass@cass-1 ~]$ wget https://downloads.lightbend.com/scala/2.13.0/scala-2.13.0.rpm
[cass@cass-1 ~]$ sudo yum install scala-2.13.0.rpm

Verify Scala Version:

[cass@cass-1 ~]$ scala -version

INSTALL PYTHON 3 ON CENTOS 7

[cass@cass-1 ~]$ sudo yum install -y python3

Verify Python:

[cass@cass-1 ~]$ python3 --version

Set bash-profile for JDK and Python3

[cass@cass-1 ~]$ vi .bash_profile

# .bash_profile

#Get the aliases and functions

if [ -f ~/.bashrc ]; then
      . ~/.bashrc
fi

#User specific environment and startup programs

# Adjust JAVA_HOME to the JDK build actually installed (check with: java -version)
JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.282.b08-1.el7_9.x86_64
JRE_HOME=$JAVA_HOME/jre

# SPARK_HOME must be defined before PYTHONPATH references it
SPARK_HOME=/spark-ser/spark
PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip

PATH=$PATH:$HOME/.local/bin:$HOME/bin:$JAVA_HOME/bin:$JRE_HOME/bin:$SPARK_HOME/bin
export PATH JAVA_HOME JRE_HOME SPARK_HOME PYTHONPATH

Reload Bash Profile

[cass@cass-1 ~]$ source ~/.bash_profile
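After reloading the profile, all four variables should be visible in the shell. As a quick sanity check, a small Python helper (a sketch; the variable names mirror the profile above) reports which required variables are missing from an environment mapping:

```python
import os

# Variables exported in the .bash_profile above
REQUIRED = ["JAVA_HOME", "JRE_HOME", "SPARK_HOME", "PYTHONPATH"]

def missing_vars(env):
    """Return the required variables that are absent (or empty) in the mapping."""
    return [name for name in REQUIRED if not env.get(name)]

# Example: a partially configured environment is flagged
sample = {"JAVA_HOME": "/usr/lib/jvm/java-1.8.0-openjdk", "SPARK_HOME": "/spark-ser/spark"}
print(missing_vars(sample))       # JRE_HOME and PYTHONPATH are not set
print(missing_vars(os.environ))   # check the live shell environment
```

An empty list from the second call confirms the profile was sourced correctly.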

INSTALLATION & CONFIGURATION OF CASSANDRA

[cass@cass-1 ~]$ sudo vi /etc/yum.repos.d/cassandra.repo

[sudo] password for cass:

[cassandra]

name=Apache Cassandra

baseurl=https://www.apache.org/dist/cassandra/redhat/22x/

gpgcheck=1

repo_gpgcheck=1

gpgkey=https://www.apache.org/dist/cassandra/KEYS
[cass@cass-1 ~]$ sudo yum install cassandra -y
[cass@cass-1 ~]$ sudo mkdir /cassandra-ser

[cass@cass-1 ~]$ sudo chown -R cass:cass /cassandra-ser

[cass@cass-1 ~]$ sudo chmod -R 777 /cassandra-ser

CHANGES MADE IN /etc/cassandra/conf/cassandra.yaml

cluster_name: 'test_clustert'
seeds: "192.168.1.200"
listen_address: 192.168.1.200
rpc_address: 192.168.1.200
endpoint_snitch: GossipingPropertyFileSnitch
data_file_directories:
    - /cassandra-ser/data
commitlog_directory: /cassandra-ser/commitlog
saved_caches_directory: /cassandra-ser/saved_caches
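YAML is whitespace-sensitive, so a stray tab or smart quote in cassandra.yaml will stop Cassandra from starting. As a sketch (plain Python, no PyYAML; it only handles the flat `key: value` lines edited above, not nested sections), you can confirm the expected keys are present:

```python
# Flat top-level settings we expect to find after editing cassandra.yaml
EXPECTED = {"cluster_name", "listen_address", "rpc_address", "endpoint_snitch"}

def flat_settings(text):
    """Parse top-level 'key: value' lines into a dict, skipping comments/nesting."""
    out = {}
    for line in text.splitlines():
        if line and not line[0].isspace() and not line.startswith("#") and ":" in line:
            key, _, value = line.partition(":")
            out[key.strip()] = value.strip()
    return out

sample = """cluster_name: 'test_clustert'
listen_address: 192.168.1.200
rpc_address: 192.168.1.200
endpoint_snitch: GossipingPropertyFileSnitch
"""
settings = flat_settings(sample)
print(sorted(EXPECTED - settings.keys()))   # [] when everything is set
```

In practice you would read `/etc/cassandra/conf/cassandra.yaml` instead of the inline sample.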

Start Cassandra and enable it to start at boot

 [root@cass-1 conf]# sudo service cassandra start

 [root@cass-1 conf]# nodetool status

 [root@cass-1 conf]# sudo systemctl enable cassandra.service

NOTE:

            #Start Cassandra:

                        sudo service cassandra start

            #Start Cassandra at startup

                        sudo systemctl enable cassandra.service

            #Stop cassandra

                        sudo service cassandra stop

INSERTING DATA INTO CASSANDRA

DOWNLOAD A MOVIES & RATINGS DATASET (E.G., MOVIELENS) AND TRANSFER IT TO THE DATA DIRECTORY.

[cass@cass-1 ~]$ cd /cassandra-ser/

[cass@cass-1 cassandra-ser]$ ls -ltrh
total 727M

drwxr-xr-x. 2 cassandra cassandra    6 Apr 27 06:59 saved_caches

drwxr-xr-x. 2 cassandra cassandra   80 Apr 27 06:59 commitlog

drwxr-xr-x. 6 cassandra cassandra   86 Apr 27 06:59 data

-rw-rw-r--. 1 cass      cass      2.8M Apr 27 07:11 movies.csv

-rw-rw-r--. 1 cass      cass      725M Apr 27 07:11 ratings.csv
[cass@cass-1 cassandra-ser]$ cqlsh 192.168.1.200

cqlsh> create keyspace lab with replication = {'class': 'SimpleStrategy', 'replication_factor': 3} ;

(Note: on a single-node cluster a replication_factor of 1 is sufficient; with a factor of 3 and only one node, reads and writes at consistency levels above ONE will fail until more nodes join.)

cqlsh> create table lab.movies (movie_id int primary key, title text, genres text);

cqlsh> use lab;

cqlsh:lab> copy movies(movie_id, title, genres) from '/cassandra-ser/movies.csv' with header = true;

cqlsh:lab> create table lab.ratings (user_id int, movie_id int, rating double, timestamp bigint, primary key((user_id), movie_id));

cqlsh:lab>  copy ratings(user_id, movie_id, rating, timestamp) from '/cassandra-ser/ratings.csv' with header = true;
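Before running COPY, it helps to confirm the CSV columns line up with the table definition. A small sketch using Python's csv module (the header names in the sample assume a MovieLens-style layout; adjust to your file):

```python
import csv
import io

# Columns of the lab.movies table defined above
TABLE_COLS = ["movie_id", "title", "genres"]

def check_header(csv_text, expected_cols):
    """Return True when the CSV header row has the expected number of columns."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return len(header) == len(expected_cols)

# Toy two-line sample in the assumed MovieLens-style layout
sample = 'movieId,title,genres\n1,"Toy Story (1995)",Adventure|Children\n'
rows = list(csv.reader(io.StringIO(sample)))
print(check_header(sample, TABLE_COLS))   # True: three columns either way
print(rows[1][1])                         # quoted titles containing commas parse intact
```

The csv module handles quoted fields, which matters here because movie titles contain commas.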

INSTALLATION & CONFIGURATION OF SPARK

[cass@cass-1 cassandra-ser]$ sudo mkdir /spark-ser

[cass@cass-1 cassandra-ser]$ sudo chown  cass:cass /spark-ser

[cass@cass-1 cassandra-ser]$ cd /spark-ser/

[cass@cass-1 spark-ser]$ wget https://mirrors.estointernet.in/apache/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz

[cass@cass-1 spark-ser]$ tar -xvf spark-3.1.1-bin-hadoop3.2.tgz

[cass@cass-1 spark-ser]$ ln -s /spark-ser/spark-3.1.1-bin-hadoop3.2 /spark-ser/spark

[cass@cass-1 spark-ser]$ ls -ltrh
 total 219M

drwxr-xr-x. 13 cass cass  211 Feb 21 21:11 spark-3.1.1-bin-hadoop3.2
-rw-rw-r--.  1 cass cass 219M Feb 21 21:45 spark-3.1.1-bin-hadoop3.2.tgz
lrwxrwxrwx.  1 cass cass   36 Apr 27 07:34 spark -> /spark-ser/spark-3.1.1-bin-hadoop3.2

[cass@cass-1 spark-ser]$ cd spark/conf

[cass@cass-1 conf]$ ls -ltrh
total 36K

-rw-r--r--. 1 cass cass  865 Feb 21 21:11 workers.template
-rwxr-xr-x. 1 cass cass 4.4K Feb 21 21:11 spark-env.sh.template
-rw-r--r--. 1 cass cass 1.3K Feb 21 21:11 spark-defaults.conf.template
-rw-r--r--. 1 cass cass 9.0K Feb 21 21:11 metrics.properties.template
-rw-r--r--. 1 cass cass 2.0K Feb 21 21:11 log4j.properties.template
-rw-r--r--. 1 cass cass 1.1K Feb 21 21:11 fairscheduler.xml.template

[cass@cass-1 conf]$ cp workers.template workers

[cass@cass-1 conf]$ cp spark-defaults.conf.template spark-defaults.conf

 Change the config file:

[cass@cass-1 conf]$ vi workers
[cass@cass-1 conf]$ vi spark-defaults.conf

Establish passwordless SSH connectivity among the nodes (for a single node, to itself):

[cass@cass-1 ~]$ cd

[cass@cass-1 ~]$ ssh-keygen

[cass@cass-1 ~]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

[cass@cass-1 ~]$ chmod 600 ~/.ssh/authorized_keys


[cass@cass-1 ~]$ sudo vi /etc/systemd/system/spark.service

Note: systemd does not expand shell variables such as $SPARK_HOME, so the Exec lines must use absolute paths.

[Unit]
Description=Apache Spark standalone cluster
After=network.target

[Service]
User=cass
Type=forking
ExecStart=/spark-ser/spark/sbin/start-all.sh
ExecStop=/spark-ser/spark/sbin/stop-all.sh
TimeoutSec=30
Restart=on-failure
RestartSec=30
StartLimitInterval=350
StartLimitBurst=10

[Install]
WantedBy=multi-user.target
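Systemd unit files use an INI-like layout, so Python's configparser can give the file a quick syntax check before you reload the daemon. This is only a sketch: it catches a missing section header or a malformed `key=value` line, not systemd semantics, and the sample below is an abbreviated unit, not the full file.

```python
import configparser

# Abbreviated spark.service content for illustration
unit_text = """[Unit]
Description=Apache Spark standalone cluster

[Service]
User=cass
Type=forking
ExecStart=/spark-ser/spark/sbin/start-all.sh
ExecStop=/spark-ser/spark/sbin/stop-all.sh

[Install]
WantedBy=multi-user.target
"""

parser = configparser.ConfigParser()
parser.optionxform = str   # systemd keys are case-sensitive; keep them as-is
parser.read_string(unit_text)
print(parser.sections())            # ['Unit', 'Service', 'Install']
print(parser["Service"]["ExecStart"])
```

A `configparser` error here usually means a typo that would also make `systemctl start` fail.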

[cass@cass-1 ~]$ sudo systemctl daemon-reload

[cass@cass-1 ~]$ sudo systemctl start spark.service

[cass@cass-1 ~]$ sudo systemctl enable spark.service
NOTE:

#Reload the daemon processes:

            sudo systemctl daemon-reload

#Start Spark using systemctl

            sudo systemctl start spark.service

#Enable Spark to start at system startup

            sudo systemctl enable spark.service
#Check the web ui for status

http://192.168.1.200:8080/

#Access Spark-Shell

            $SPARK_HOME/bin/spark-shell

#Start Spark Manually

            $SPARK_HOME/sbin/start-all.sh

#Stop Spark Manually

            $SPARK_HOME/sbin/stop-all.sh

Find the latest Cassandra connector for Spark in the Maven repository, then launch the Spark shell with the connector package and the Cassandra host configuration.

#Scala/Spark-shell

$SPARK_HOME/bin/spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.1 --conf spark.cassandra.connection.host=192.168.1.200
#Python Shell

$SPARK_HOME/bin/pyspark --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.1 --conf spark.cassandra.connection.host=192.168.1.200
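Both launch commands share the same two pieces of configuration: the connector's Maven coordinates (the `_2.12` suffix must match Spark's Scala build, which is 2.12 for Spark 3.1.1) and the Cassandra host. A small helper (a sketch; `build_launch_cmd` is a hypothetical name, not a Spark API) makes that structure explicit:

```python
def build_launch_cmd(shell, package, cassandra_host):
    """Assemble the spark-shell/pyspark launch command shown above (sketch)."""
    return (f"$SPARK_HOME/bin/{shell} "
            f"--packages {package} "
            f"--conf spark.cassandra.connection.host={cassandra_host}")

cmd = build_launch_cmd(
    "pyspark",
    "com.datastax.spark:spark-cassandra-connector_2.12:3.0.1",
    "192.168.1.200",
)
print(cmd)
```

Swapping `"pyspark"` for `"spark-shell"` reproduces the Scala variant with identical configuration.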

#Once you're in the interactive shell, load the required Python libraries and test your connectivity:

>>> from pyspark import SparkContext, SparkConf

>>> from pyspark.sql import SQLContext

>>> load_options = { "table": "movies", "keyspace": "lab"}

>>> df=spark.read.format("org.apache.spark.sql.cassandra").options(**load_options).load()

>>> df.show()

>>> df.write.csv('/tmp/mycsv.csv')
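With the connection working, the `ratings` table can be aggregated in Spark (e.g., a `groupBy`/`avg` for the mean rating per movie). As a plain-Python sketch of what that computation does, run on a toy sample of rows shaped like `lab.ratings` (hypothetical data, not from the dataset):

```python
from collections import defaultdict

# Toy sample of (user_id, movie_id, rating) rows, shaped like lab.ratings
rows = [(1, 10, 4.0), (2, 10, 5.0), (1, 20, 3.0)]

def avg_rating_per_movie(rows):
    """Group ratings by movie_id and average them (what a Spark groupBy/avg would do)."""
    totals = defaultdict(lambda: [0.0, 0])
    for _, movie_id, rating in rows:
        totals[movie_id][0] += rating
        totals[movie_id][1] += 1
    return {movie: s / n for movie, (s, n) in totals.items()}

print(avg_rating_per_movie(rows))   # {10: 4.5, 20: 3.0}
```

In pyspark the equivalent would be `df.groupBy("movie_id").avg("rating")`, with Spark distributing the grouping across executors.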