Linear regression is one of the simplest algorithms in machine learning (ML). It is easy to understand and apply, and it is a supervised learning method. Our main task is to predict the variable of interest, y, which is conventionally called the “dependent variable” (also the endogenous or output variable), using x, the “independent variables” (also the exogenous or input variables). We will use the terms output variable and input variable as our convention.
A typical equation for linear regression is: y = α + βx, where α and β are constants.
Suppose we have a dataset containing the areas of different houses and their prices.

| Area (square feet) | Price in ₹ (×10000) |
Here we have a single input variable, area (our x), and the output variable is price (y). Our objective is to find appropriate values of α and β using the given examples. To do that, we will define a hypothesis (h), and using our given data (which we can call training data) we will improve it so that we end up with a hypothesis that produces approximately correct values of y for unknown x.
Let us define the hypothesis as: hθ(x) = θ0 + θ1x, where θ0 and θ1 are constants called ‘parameters’ or ‘weights’. hθ(x) gives the predicted value (y) for a given input variable x. If we take a close look at our hypothesis, we can see that different values of θ produce many different equations (hypotheses). So what we have defined here is a “hypothesis space”, and using our knowledge (the training examples) we will try to find the correct hypothesis (the values of θ0 and θ1).
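As a small sketch in Python (the function and parameter names here are illustrative, not from the original text), the hypothesis is simply a line parameterized by θ0 and θ1:

```python
# Minimal sketch of the hypothesis h_theta(x) = theta0 + theta1 * x.
def hypothesis(theta0, theta1, x):
    """Predicted output (e.g. house price) for a given input x (e.g. area)."""
    return theta0 + theta1 * x

# For example, with theta0 = 10 and theta1 = 0.5,
# an input of x = 100 is predicted as 10 + 0.5 * 100 = 60.
```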
Suppose that using the training data we found the final hypothesis. You can see in the graph that not all the training points lie on our hypothesis line; some are near and some are far. This is because we always try to keep our hypothesis generic, so that it performs better on unknown input. If you look at the blue dots on the graph, which are for unknown input values, our hypothesis predicts a slightly lower price than the actual price. It is fine if our hypothesis is not 100% correct (you can see that 100% is not achievable). Our function has some error. We calculate this error and try to reduce it as much as possible to improve our results.
The cost function measures the correctness of our hypothesis: the higher the cost, the larger the error in our predictions. The cost function computes the squared error from the difference between the predicted value (h(x)) and the expected value (y):

J(θ0, θ1) = (1/2m) Σᵢ₌₁ᵐ (h(xᵢ) − yᵢ)²
Where: m – is number of training examples
h(xi) – is hypothesis output for example i
yi – is expected output (from training example)
To calculate the cost function J(θ0,θ1), we take the difference between h(x) and y, square it, and sum it over all the training examples. (Squaring makes the error symmetric: we consider predicting a lower price just as bad as predicting a higher one.)
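The calculation above can be sketched in Python as follows (a minimal version assuming the common 1/2m scaling; the names are illustrative):

```python
def cost(theta0, theta1, xs, ys):
    """Halved mean squared error of the hypothesis over the training set."""
    m = len(xs)  # number of training examples
    # Sum of (h(x_i) - y_i)^2 over all examples.
    squared_errors = sum((theta0 + theta1 * x - y) ** 2
                         for x, y in zip(xs, ys))
    return squared_errors / (2 * m)
```

A perfect fit gives a cost of zero, matching the observation below that (h(xᵢ) − yᵢ) = 0 for all i implies J = 0.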
Now that we have a function which measures the accuracy of our hypothesis, we can use it to improve our performance.
If our hypothesis perfectly predicts the value of y for every input variable, then (h(xᵢ) − yᵢ) will be zero for all i, and so will the cost function.
Now we will discuss the technique to find the values of θ0,θ1 for our hypothesis.
Gradient descent is a technique in which we use the cost function to improve our hypothesis step by step.
To start, we assume values for θ in a small range −ε < θ < ε (where ε is a very small number). We then use the following update rule to get new values for θ:

θⱼ := θⱼ − α · ∂/∂θⱼ J(θ0, θ1)   (for j = 0 and j = 1)
Where: α: learning rate (positive)
:= means we use a simultaneous update (we only replace the previous values of θ with the new ones after all the new θ values have been computed)
We repeat this again and again until it converges to a minimum.
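The whole loop can be sketched in Python as a minimal batch gradient descent (assuming the 1/2m cost convention above; the function name, sample data, and default settings here are illustrative):

```python
def gradient_descent(xs, ys, alpha=0.05, iterations=2000):
    """Fit theta0, theta1 by repeatedly stepping down the cost gradient."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0  # start near zero
    for _ in range(iterations):
        preds = [theta0 + theta1 * x for x in xs]
        # Partial derivatives of J(theta0, theta1) with the 1/2m scaling:
        grad0 = sum(p - y for p, y in zip(preds, ys)) / m
        grad1 = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / m
        # Simultaneous update: both gradients are computed before either
        # parameter is replaced.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Example: data generated from y = 2 + 3x; the fitted parameters
# should converge close to theta0 = 2 and theta1 = 3.
t0, t1 = gradient_descent([0, 1, 2, 3, 4], [2, 5, 8, 11, 14])
```

Note that the learning rate α must be small enough for the updates to converge; too large a value makes the steps overshoot and diverge.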