Data Science

Learning resources

Part of the skillset of a data scientist is knowing how to obtain a sufficient corpus of usable data, possibly from multiple sources, and possibly from sites which require specific query syntax. At a minimum, a data scientist should know how to do this from the command line, e.g., in a UNIX environment.

Shell scripting suffices for many tasks, but we recommend learning a programming or scripting language that can automate data retrieval, make calls asynchronously, and manage the resulting data. Python is a current favorite at the time of writing (Fall 2010).
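As a rough illustration of automated, asynchronous retrieval, the sketch below downloads several sources concurrently using only the Python standard library. The function names fetch_url and fetch_all are hypothetical helpers introduced for this example, and the approach (blocking downloads run in worker threads) is one of several reasonable options rather than a recommended design.

```python
import asyncio
import urllib.request


def fetch_url(url, timeout=10):
    """Blocking download of one URL; returns the body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


async def fetch_all(urls, fetch=fetch_url):
    """Run the blocking fetch for every URL concurrently in worker threads."""
    loop = asyncio.get_running_loop()
    tasks = [loop.run_in_executor(None, fetch, url) for url in urls]
    # return_exceptions=True keeps one failed source from discarding the rest.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    # Map each URL to its body, or to the exception raised while fetching it.
    return dict(zip(urls, results))
```

Calling asyncio.run(fetch_all(list_of_urls)) returns a dictionary keyed by URL. The fetch parameter makes it possible to swap in a different download function, for example one that adds site-specific query syntax or authentication.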

Data Science Methodology

The Data Science Methodology aims to answer the following 10 questions in this prescribed sequence:


Command-line based


Web-based notebooks

Interpreters: md, pyspark, sh, sql

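import java.net.URL
import java.nio.charset.Charset
import org.apache.commons.io.IOUtils

// Load the bank data as an RDD of lines. The dataset URL is empty in the source and is left as-is.
// IOUtils (Apache Commons IO) is assumed here for reading the URL, since the original line is truncated.
val bankText = sc.parallelize(
    IOUtils.toString(
        new URL(""),
        Charset.forName("utf8")).split("\n"))

// Schema of one cleaned record.
case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

// Split each line on ";", drop the header row, and build Bank records.
val bankRDD = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
    s => Bank(s(0).toInt,
            s(1).replaceAll("\"", ""),
            s(2).replaceAll("\"", ""),
            s(3).replaceAll("\"", ""),
            s(5).replaceAll("\"", "").toInt
        )
)

val bank = bankRDD.toDF()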


The input is a CSV file, referred to below as the dataset. Zeppelin creates and injects sc (SparkContext) and sqlContext (HiveContext or SQLContext) by default, so there is no need to create them manually.

  1. Convert the CSV file to an RDD

RDD: Resilient Distributed Dataset

RDD is a fundamental data structure of Spark.

It is an immutable distributed collection of objects. Each dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.

Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic operations on either data on stable storage or other RDDs. RDD is a fault-tolerant collection of elements that can be operated on in parallel.

There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.

Spark uses RDDs to make MapReduce-style operations faster and more efficient. Classic MapReduce writes intermediate results to disk between jobs, which slows iterative and interactive workloads; RDDs can keep intermediate data in memory and reuse it across operations.
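To make the idea of a read-only, partitioned collection more concrete, here is a toy model in plain Python. It is not the Spark API: partition, map_partitions, and collect are hypothetical helpers written for this illustration, and threads stand in for the cluster nodes that would process each partition.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_partitions):
    """Split records into num_partitions contiguous, roughly equal, read-only slices."""
    records = tuple(records)  # tuples are immutable, like the contents of an RDD
    size, extra = divmod(len(records), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        parts.append(records[start:end])
        start = end
    return parts


def map_partitions(parts, func):
    """Apply func to every record, one worker per partition; returns new partitions."""
    with ThreadPoolExecutor(max_workers=len(parts) or 1) as pool:
        return list(pool.map(lambda part: tuple(func(x) for x in part), parts))


def collect(parts):
    """Gather all partitions back into a single list, preserving order."""
    return [x for part in parts for x in part]


parts = partition(range(10), 3)
squared = map_partitions(parts, lambda x: x * x)
print(parts)
print(collect(squared))
```

The original partitions are never modified; each transformation produces a new set of partitions, which mirrors how an RDD is derived from another RDD through deterministic operations.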

  2. Data cleaning using map and filter

Create an RDD of tuples from the original CSV dataset. Create the schema (or case class) that defines the name and type of each column. Apply the schema to the RDD.
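The same split, filter, and map steps can be tried without Spark. The sketch below uses Python's built-in map and filter on a few invented sample lines; the rows and the name of the skipped fifth column are assumptions made for illustration, chosen to match the semicolon-separated, quoted layout that the notebook code expects.

```python
from collections import namedtuple

# Record type holding the five columns kept by the cleaning step.
Bank = namedtuple("Bank", ["age", "job", "marital", "education", "balance"])

# Invented sample lines: a header row followed by two data rows.
lines = [
    '"age";"job";"marital";"education";"default";"balance"',
    '41;"technician";"single";"tertiary";"no";250',
    '28;"services";"married";"secondary";"no";-15',
]


def strip_quotes(value):
    """Remove the double quotes that wrap text fields."""
    return value.replace('"', "")


# 1. Split every line into columns.
split_rows = map(lambda line: line.split(";"), lines)
# 2. Drop the header row, recognised by its first column.
data_rows = filter(lambda cols: cols[0] != '"age"', split_rows)
# 3. Convert types and keep columns 0-3 and 5 (column 4 is skipped).
records = list(map(
    lambda c: Bank(int(c[0]), strip_quotes(c[1]), strip_quotes(c[2]),
                   strip_quotes(c[3]), int(strip_quotes(c[5]))),
    data_rows,
))

print(records)
```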

  3. Create a DataFrame

We create a DataFrame using the toDF() function. A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R or Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs.

Spark SQL supports operating on a variety of data sources through the DataFrame interface. The entry point into all relational functionality in Spark 1.x is the SQLContext class; from Spark 2.0 onward, SparkSession takes that role. With a SQLContext, you can create DataFrames from an existing RDD. The Scala interface to Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. The case class defines the schema of the table: the names of its arguments are read using reflection and become the names of the columns.
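The reflection step can be imitated in plain Python, which may help show what "the case class defines the schema" means. In this sketch a dataclass plays the role of the Scala case class; infer_schema and to_columns are hypothetical helpers, not part of Spark, and the two records are invented.

```python
from dataclasses import dataclass, fields


@dataclass
class Bank:
    age: int
    job: str
    marital: str
    education: str
    balance: int


def infer_schema(record_type):
    """Read column names and type names off the class definition."""
    return [(f.name, getattr(f.type, "__name__", str(f.type)))
            for f in fields(record_type)]


def to_columns(records):
    """Arrange records as named columns: {column name: list of values}."""
    if not records:
        return {}
    names = [f.name for f in fields(records[0])]
    return {name: [getattr(r, name) for r in records] for name in names}


records = [
    Bank(41, "technician", "single", "tertiary", 250),
    Bank(28, "services", "married", "secondary", -15),
]
schema = infer_schema(Bank)
table = to_columns(records)
print(schema)
print(table["age"])
```

The column names come from the class definition alone, so no separate schema has to be written by hand.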

  4. Register the DataFrame as a temporary table

The DataFrame can be registered as a table. Tables can be used in subsequent SQL statements. The sql function on a SQLContext enables applications to run SQL queries programmatically and returns the result as a DataFrame.
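The kind of SQL statement that can be run once a table is registered can be tried without a Spark cluster. The sketch below uses Python's built-in sqlite3 module as a stand-in for Spark SQL; the table name bank, the invented rows, and the query itself are assumptions for illustration only.

```python
import sqlite3

# Invented rows in the (age, job, marital, education, balance) layout.
rows = [
    (25, "student", "single", "secondary", 120),
    (25, "services", "single", "secondary", 40),
    (41, "technician", "married", "tertiary", 250),
    (28, "admin.", "married", "secondary", -15),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bank (age INTEGER, job TEXT, marital TEXT, "
    "education TEXT, balance INTEGER)"
)
conn.executemany("INSERT INTO bank VALUES (?, ?, ?, ?, ?)", rows)

# Count customers under 30, grouped by age.
query = (
    "SELECT age, COUNT(*) AS n FROM bank "
    "WHERE age < 30 GROUP BY age ORDER BY age"
)
result = conn.execute(query).fetchall()
conn.close()

print(result)
```

In Spark the result of such a query would come back as a DataFrame rather than a list of tuples, so it can be transformed further or registered as another table.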

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Load the CSV file as an RDD of lines.
bankDataset = sc.textFile("bank.csv")
print(bankDataset.take(5))

# Split each line on ";", drop the header row, and keep age, job, marital, education, balance.
bankRDD = bankDataset.map(lambda s: s.split(";")).filter(lambda s: s[0] != "\"age\"").map(
    lambda s: (int(s[0]), str(s[1]).replace("\"", ""), str(s[2]).replace("\"", ""),
               str(s[3]).replace("\"", ""), int(s[5])))
print(bankRDD.take(5))

# Schema: column name, type, and nullable=False for every field.
bankSchema = StructType([
    StructField("age", IntegerType(), False),
    StructField("job", StringType(), False),
    StructField("marital", StringType(), False),
    StructField("education", StringType(), False),
    StructField("balance", IntegerType(), False),
])

bankdf = spark.createDataFrame(bankRDD, bankSchema)


Python-based tools/technologies for Data Science

scikit-learn is an open-source machine learning library that supports supervised and unsupervised learning.

