# Workshop 1A: Split Data

## What you need to do before the workshop

Before you begin the workshop exercises, complete all the activities from the first week of Module 3A. Doing so will help you understand and apply what you have learned.

#### Optional

To refresh your Python skills, go to the 'course resources' section in the 'Welcome to the course' module.

## Guide to using Jupyter Notebook


### What is Jupyter Notebook?

The Jupyter Notebook is a powerful tool for interactively developing and presenting programming exercises and assignments. 

A Jupyter Notebook combines code and its output in a single document. In that document, you can mix visualisations, narrative text, mathematical equations, and code to explain the exercise. You can run the code, display the output, and add explanations, formulas, and charts, which makes the exercise more transparent and easier to understand.


### What is an .ipynb file?

Each Jupyter notebook is stored as a single file with the .ipynb extension. When you create or download a notebook, a new .ipynb file is created or downloaded.


### Jupyter Notebook Interface

![blank-notebook-ui.png](attachment:blank-notebook-ui.png)



There are two key terms that you should understand: cells and kernels.

A kernel is a “computational engine” that executes the code contained in a notebook document.
A cell is a container for text to be displayed in the notebook or code to be executed by the notebook’s kernel.


### Cells

Cells form the body of a notebook: a notebook is a sequence of cells. For example, this notebook consists of many cells.

There are two main cell types that we will use in this module:

#### Code Cell

A code cell contains code to be executed in the kernel. When the code is run, the notebook displays the output below the code cell that generated it.

#### Markdown Cell

A Markdown cell contains text formatted using Markdown and displays the rendered text in place when the cell is run.
Every new cell starts off as a code cell, but you can change its type using the drop-down on the toolbar (which initially shows “Code”).
Running a Markdown cell only renders its text. Because it is not a code cell, it executes no code and produces no output below it.
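
To make the idea of cell types more concrete: an .ipynb file is stored as JSON, and each cell records whether it is a code cell or a Markdown cell. The following is a minimal sketch that parses a tiny hand-written notebook and lists its cell types. The notebook text is a made-up two-cell example, not the contents of this workshop's file.

```python
import json

# A tiny, hand-written notebook in the .ipynb JSON layout (illustrative only)
notebook_text = """
{
  "cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["# A heading"]},
    {"cell_type": "code", "execution_count": null, "metadata": {},
     "outputs": [], "source": ["print(1 + 1)"]}
  ],
  "metadata": {},
  "nbformat": 4,
  "nbformat_minor": 4
}
"""

notebook = json.loads(notebook_text)

# Each entry in "cells" carries a "cell_type" field
cell_types = [cell["cell_type"] for cell in notebook["cells"]]
print(cell_types)  # ['markdown', 'code']
```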

#### Run a cell
To run a code cell, click the 'Run' button on the toolbar, or click 'Cell' on the menu bar and select 'Run Cells'.


You can name a Jupyter Notebook by clicking 'Untitled' at the top of the notebook, as shown in the screenshot above. Click 'Save' on the toolbar to save your work. To download your notebook, click 'File' on the menu bar and download it as a notebook with the .ipynb extension.

## Let's get started

This week you learned about the concept of machine learning and its types. You covered linear regression as a simple method of implementing supervised learning. You also explored how to generalise a model to avoid underfitting or overfitting. Let's apply linear regression using Python.

In this exercise, you'll build and evaluate a machine learning model based on linear regression. The steps are as follows:
1. Import the libraries.
2. Load input data (.csv file).
3. Pre-process data.
4. Create functions to calculate the mean, variance and covariance, and to estimate the coefficients and the root mean squared error. To implement and evaluate a simple linear regression model, you first need to find the mean, variance, covariance and coefficients that you learned about in the weekly activities. To do this, you need to write Python functions that calculate each of these quantities.
5. Create functions to implement, evaluate and visualise a linear regression model. After developing all the functions, you'll use them to implement a linear regression model on the 'insurance' dataset and then evaluate and visualise the model.
6. Implement the linear regression model on the 'insurance' dataset.
7. Evaluate the linear regression model on the 'insurance' dataset.
8. Visualise the linear regression model on the 'insurance' dataset.

Note that each step listed above is performed by a Python function that returns one or more values, which another function can then use as inputs. The assessments are designed in a similar fashion: you'll complete the code for the functions and check that the results are correct.
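
As an illustration of how such functions can feed into one another, here is a minimal sketch of the quantities named in step 4. The function names and the toy data are assumptions made for this example and may differ from the ones used in the workshop's `functions` module. Variance and covariance are computed here as plain sums of squared deviations and cross-products; the usual division by the number of points cancels in the slope formula.

```python
from math import sqrt

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / float(len(values))

def variance(values, mean_value):
    """Sum of squared deviations from the mean (not divided by n)."""
    return sum((v - mean_value) ** 2 for v in values)

def covariance(x, x_mean, y, y_mean):
    """Sum of cross-products of deviations (not divided by n)."""
    return sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))

def coefficients(x, y):
    """Estimate intercept b0 and slope b1 of the line y = b0 + b1 * x."""
    x_mean, y_mean = mean(x), mean(y)
    b1 = covariance(x, x_mean, y, y_mean) / variance(x, x_mean)
    b0 = y_mean - b1 * x_mean
    return b0, b1

def rmse(actual, predicted):
    """Root mean squared error between two equally long lists."""
    squared_errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / float(len(actual)))

# Toy data that lies exactly on the line y = 1 + 2x
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

b0, b1 = coefficients(x, y)
predictions = [b0 + b1 * xi for xi in x]
print(b0, b1)                # 1.0 2.0
print(rmse(y, predictions))  # 0.0
```

Each function returns a value that the next one consumes: `mean` feeds `variance` and `covariance`, which feed `coefficients`, whose output is used to make the predictions that `rmse` scores.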

## 1. Import the needed libraries

The first step is to import the following Python libraries. 

Import the needed libraries

In [2]:
from random import seed
import functions
from csv import reader
from random import randrange

Load a CSV file

In [3]:
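def load_csv(filename, skip=False):
    """Read a CSV file into a list of rows; skip=True drops the header row."""
    dataset = list()
    # newline='' is the mode recommended by the csv module documentation
    with open(filename, 'r', newline='') as file:
        csv_reader = reader(file)
        if skip:
            # Discard the first line (the column headers)
            next(csv_reader, None)
        for row in csv_reader:
            # Each row is a list of strings, one per column
            dataset.append(row)
    return dataset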
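
Because `csv.reader` returns every field as a string, the rows usually need converting to numbers before they can be used in calculations (the pre-processing step). The sketch below shows one possible way to do that. The helper name `str_columns_to_float`, the column names and the sample values are made up for illustration, and the CSV text is read from memory rather than from a file so that the example runs on its own.

```python
from csv import reader
from io import StringIO

def str_columns_to_float(dataset):
    """Return a new dataset with every field converted to float."""
    return [[float(value.strip()) for value in row] for row in dataset]

# In-memory stand-in for a small CSV file with a header row (made-up values)
sample_text = "age,sex,target\n66,0,1\n51,1,1\n"

csv_reader = reader(StringIO(sample_text))
next(csv_reader, None)              # skip the header, as load_csv does with skip=True
rows = [row for row in csv_reader]

numeric_rows = str_columns_to_float(rows)
print(rows)          # [['66', '0', '1'], ['51', '1', '1']]
print(numeric_rows)  # [[66.0, 0.0, 1.0], [51.0, 1.0, 1.0]]
```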

Split the dataset into training and test data

In [7]:
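def train_test_split(dataset, split):
    """Randomly split dataset; split is the fraction of rows used for training."""
    train = list()
    train_size = split * len(dataset)
    # Work on a copy so that the original dataset is left unchanged
    test = list(dataset)

    # Move randomly chosen rows from the copy into train until it is large enough
    while len(train) < train_size:
        index = randrange(len(test))
        train.append(test.pop(index))

    # The rows that were never picked form the test set
    return train, test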
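
To see what the split does, it can help to try it on a small toy dataset first. The sketch below repeats the function so that it runs on its own. The ten-row toy dataset is an assumption made for illustration. It checks that the two parts together contain every row exactly once and that the original list is untouched.

```python
from random import randrange, seed

def train_test_split(dataset, split):
    """Same logic as the function above, repeated so the sketch is self-contained."""
    train = list()
    train_size = split * len(dataset)
    test = list(dataset)
    while len(train) < train_size:
        index = randrange(len(test))
        train.append(test.pop(index))
    return train, test

seed(1)
toy = [[i] for i in range(10)]          # ten single-column rows: [0], [1], ..., [9]
train, test = train_test_split(toy, 0.5)

print(len(train), len(test))            # 5 5
print(sorted(train + test) == toy)      # True: no row is lost or duplicated
print(len(toy))                         # 10: the original dataset is unchanged
```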

Seed the random number generator

In [8]:
seed(1)
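
Seeding matters because `randrange` then produces the same sequence of indices on every run, so the split is reproducible. A minimal sketch of that behaviour (the range of 100 and the five draws are arbitrary choices for the demonstration):

```python
from random import randrange, seed

seed(1)
first_run = [randrange(100) for _ in range(5)]

seed(1)  # resetting the seed restarts the same pseudo-random sequence
second_run = [randrange(100) for _ in range(5)]

print(first_run == second_run)  # True
```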

Load and prepare data

In [9]:
filename = 'small_heart.csv'
# The commented-out lines make the same calls through the functions module instead
# dataset = functions.load_csv(filename, skip=True)
dataset = load_csv(filename, skip=True)
# training, test = functions.train_test_split(dataset, 0.5)
training, test = train_test_split(dataset, 0.5)

In [10]:
print(len(training))

50


In [11]:
print(len(test))

49
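
The two parts are not the same size because the counts above add up to an odd number of rows (50 + 49 = 99). With a split of 0.5, the target training size is 49.5, and the loop in `train_test_split` keeps adding rows while the training set is smaller than that target, so it stops at 50. The sketch below reproduces only this arithmetic, assuming the file has 99 data rows as the printed counts suggest.

```python
n_rows = 99                    # assumed: 50 training rows + 49 test rows
split = 0.5
train_size = split * n_rows    # 49.5, not a whole number

# Mimic the loop condition `len(train) < train_size`
n_train = 0
while n_train < train_size:
    n_train += 1

n_test = n_rows - n_train
print(train_size, n_train, n_test)  # 49.5 50 49
```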


In [12]:
print(training)

[['66', '0', '3', '150', '226', '0', '1', '114', '0', '2.6', '0', '0', '2', '1'], ['51', '1', '0', '140', '261', '0', '0', '186', '1', '0', '2', '0', '2', '1'], ['52', '1', '2', '172', '199', '1', '1', '162', '0', '0.5', '2', '0', '3', '1'], ['51', '1', '3', '125', '213', '0', '0', '125', '1', '1.4', '2', '1', '2', '1'], ['58', '0', '2', '120', '340', '0', '1', '172', '0', '0', '2', '0', '2', '1'], ['45', '0', '1', '130', '234', '0', '0', '175', '0', '0.6', '1', '0', '2', '1'], ['54', '1', '1', '108', '309', '0', '1', '156', '0', '0', '2', '0', '3', '1'], ['35', '0', '0', '138', '183', '0', '1', '182', '0', '1.4', '2', '0', '2', '1'], ['57', '1', '0', '132', '207', '0', '1', '168', '1', '0', '2', '0', '3', '1'], ['62', '1', '2', '130', '231', '0', '1', '146', '0', '1.8', '1', '3', '3', '1'], ['53', '1', '2', '130', '197', '1', '0', '152', '0', '1.2', '0', '0', '2', '1'], ['64', '1', '3', '110', '211', '0', '0', '144', '1', '1.8', '1', '0', '2', '1'], ['29', '1', '1', '130', '204', '0',