Introduction to Data Science with R

Title Introduction to Data Science with R
Quarter Spring 2017
Instructors Adam Ginensky, Nicolas Fermin Cota
Syllabus
Course Description
Introduction to Data Science with R is a fast-paced 10-week course. It is an old adage in the analytics community that as much as 90% of a project consists of analyzing the data before the actual modeling begins, and this class takes that adage seriously. That preliminary work can include cleaning the data, but the heart of any predictive analytics project is understanding the basic properties of the data and the interconnections between the different variables. This amounts to carrying out a number of standard statistical procedures and goes by various names, including data wrangling, data munging, and ETL (extract, transform, and load). At the heart of any modern data analysis, therefore, is first loading the data into a computer and then using software to understand its basic statistical properties. I should add that visualizing the data is an important component of this process and will be an important component of this class too!

Twenty years ago, this class would have been called ’Statistics’. However, in modern statistics it is pointless to understand a technique without understanding how to implement it in code. Similarly, I think it is wrong to learn how to perform statistical procedures in R without understanding what one is doing. The output of a statistical function shouldn’t be just a number, but rather a better understanding of the data. So the goal of this class is to learn how to understand any data set by using R to visualize and analyze it.

Course Contents
    Statistics Topics
    Summary Statistics
  • Histograms
  • mean, median, mode
  • standard deviation and variance
  • correlation
  • Linear model
  • information in linear model
  • anova
  • t test and z test
  • chi-squared test, KS (Kolmogorov-Smirnov) test
    R Topics
    The R console
  • RStudio
  • mean, median, mode
  • Packages
  • Functions, Scripts, Reports
  • Basic graphing and plotting
    Data importing
  • Basic read functions and their defaults
  • data.table and other packages for large data sets
    Manipulating Data
  • dplyr
  • ggplot2
  • other packages in ’the Hadleyverse’
    Advanced R functions and packages
  • lm (running regressions in R)
  • glm (running logistic regressions in R)
  • anova
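To give a flavor of how the statistics topics and the R topics above fit together, here is a minimal sketch. It assumes R's built-in mtcars data set and variable names chosen purely for illustration; the data sets and examples actually used in class may differ.

```r
# Illustrative sketch only: mtcars ships with base R and is an assumed example data set.
data(mtcars)

# Summary statistics for fuel economy (miles per gallon)
mpg_mean   <- mean(mtcars$mpg)
mpg_median <- median(mtcars$mpg)
mpg_sd     <- sd(mtcars$mpg)
mpg_var    <- var(mtcars$mpg)

# Correlation between fuel economy and car weight
mpg_wt_cor <- cor(mtcars$mpg, mtcars$wt)

# Histogram bins, computed without drawing; use plot = TRUE (the default) to draw it
mpg_hist <- hist(mtcars$mpg, plot = FALSE)

# Linear model: fuel economy as a function of weight
fit_lm <- lm(mpg ~ wt, data = mtcars)
print(summary(fit_lm))

# Logistic regression: transmission type (0/1) as a function of weight
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)

# ANOVA comparing two nested linear models
fit_lm2 <- lm(mpg ~ wt + hp, data = mtcars)
print(anova(fit_lm, fit_lm2))

# Two-sample t test: does mean mpg differ by transmission type?
tt <- t.test(mpg ~ am, data = mtcars)
print(tt)
```

In each case the point is less the number that comes back than what it says about the data: the sign of the correlation, the size of the fitted coefficients, and whether the extra predictor in the second model explains anything the first does not.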
Course Objectives:
At the completion of the course, students will be able to do the following:
  • Load data into RStudio, transform it, and store it in various data structures
  • Read and write data in various formats
  • Use R functions to develop regression models
  • Apply statistical tests for checking robustness of models
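As a rough illustration of the first two objectives, the sketch below writes a data frame to a CSV file, reads it back, and places the data in a few different structures. The built-in iris data set, the temporary file, and the variable names are assumptions made for the example, not course materials.

```r
# Illustrative sketch only: iris ships with base R and is an assumed example data set.
csv_path <- tempfile(fileext = ".csv")

# Write the data frame to a CSV file, then read it back in
write.csv(iris, csv_path, row.names = FALSE)
iris_back <- read.csv(csv_path, stringsAsFactors = TRUE)

# Inspect the column types that read.csv inferred
str(iris_back)

# The same data in different R data structures
sepal_vec  <- iris_back$Sepal.Length                       # numeric vector
measure_m  <- as.matrix(iris_back[, 1:4])                  # numeric matrix
by_species <- split(iris_back$Sepal.Length,
                    iris_back$Species)                     # list of vectors, one per species

# A simple transformation: add a derived column
iris_back$Sepal.Ratio <- iris_back$Sepal.Length / iris_back$Sepal.Width
```

Checking the result with str() right after reading is a useful habit, because the defaults of the read functions decide how each column is typed.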
Instruction Format Coursework will have the following important components:
  • Self Study: The course instructor will provide reading material and short videos explaining key concepts.
  • Weekly Lectures: The instructor will hold weekly live online lectures to go over concepts.
  • TA Sessions:
    • Discussion of topics from the week
    • Working on homework problems
    • Working on group and individual projects
  • Reading assignments and homework for the week.
Assessment A letter grade (A, B, C, D, or F) for the course will be decided based on:
  • Project: 50% of the final grade. The project will take the place of a final exam.
    • There will be only one project, but it will be a multi-week project, including a ’week 11’ for the students to complete their work.
  • Midterm Exam: 15% of the final grade.
    • 30 minutes long; it may include both multiple-choice and subjective problems.
  • Homework: 20% of the final grade.
    • There will be 5-7 homework assignments, which will be graded by the TA, with feedback provided.
  • Class Participation: 15% of the final grade.
    • Your participation will be evaluated based on lab discussion, questions and comments, replies on the discussion forum, and teamwork on the group projects.
Textbook There will be two texts for the class.
  • "Introductory R: A Beginner’s Guide to Data Visualisation, Statistical Analysis and Programming in R" by Robert Knell. This e-book is available from either Amazon or Google Play. The cost of the book is $5.00.
  • "R for Data Science" by Garrett Grolemund and Hadley Wickham. It is available online. Hard copies can also be purchased. It is not an expensive book, and you may prefer to own your own copy.
Prerequisite The prerequisites are the ability to do basic algebraic manipulations and some programming experience. In particular, you must be able to download and install R and RStudio on your computer.
Time Lecture Time: Saturday 10 AM EST / 7:30 PM IST
Lab Time: Sunday 10 AM EST / 7:30 PM IST
Location   Online
TA Information Steven Thornton
Effort Required
  • Each student is expected to devote six to ten hours per week to the class.
  • This includes reading, attending online sessions, watching assigned videos, and doing labs at the weekly sessions.
Certification Students who successfully complete the course will receive an instructor-signed certificate with a letter grade