### Machine Learning Part 1 -- Linear Regression

In working with the Drone project, I ran into an issue. It turns out to be really hard to control the drone while transitioning from/to VR. That is to say, when the drone is in the air, flying it in VR is really comfortable and easy. However, getting it to the air and then transitioning to VR mode is a challenge. Also, what happens when the VR mode allows the drone to do something that would allow it to crash. There's two potential solutions to this problem. 1.) I could advocate for having a buddy control the drone and keep it in stable flight allowing manual switch over for when ready for VR control, or 2.) I could advocate for writing some AI to allow the drone to control itself.
I'm Opting for #2 for a few reasons, first I'd rather be able to fly the drone whenever I wish whether a buddy is available or not. Also I'd like to avoid creating multiple control capabilities and then require a certain skill level to control my drone. Finally, I'm a programmer, this is a programming solution. So with that, I refuse to introduce such an awesome topic as AI to one of my projects without first explaining it. So much like we did with quaternions, a quick tangental aside is needed here.
As this is my first time I've written about Machine Learning / AI concepts, I need to start with the basics and build up. My end goal will be to have a learning algorithm able to fly the drone, successfully control it in any situation, and land it safely without crashing into any obstacle and able to recover from any emergency (being hit with something).
Let's start with what an AI is. For the purposes of explaining the way I understand AI, I'll start by giving definitions. Intelligence is defined by the process with which it learns. Learning is the action of deciding if an experience is good or bad. An experience is defined by a set of actions or inactions inspired from some given stimuli. Stimuli is defined by sensing some kind of environment. Thus an AI has 5 parts:
• An environment and a way to sense it.
• An ability to look up similar past experiences from given input by querying the KB (Knowledge Base).
• An ability to decide what action to take.
• An ability to act upon the environment in a way that changes it, preferably in a way that the sensory can determine the results of the action.
• An ability to decide if the action taken was good or bad and then store it in the KB (Knowledge Base).
Now in the above basics, there's a distinct lack of discrete knowledge for what happens next. No where do we have a rule that says when stimuli happens, do this or that. The key here is that we don't know ahead of time what the decision will be, the AI will make a decision based upon what it senses and what it has from the experience of previous decisions. So in the world of Machine Learning, we need to start with the very basic idea of what it is and how it works. What we typically want in any problem is to come up with a way to predict what data will come out of some defined input. We want to target creating a function that will describe what the input will map to. So let's start with a very typical hypothetical example. Predict the value of a house based around given data for the square feet of the house in a market. In this case, you take the historical data showing what the market has selected the value of each house and plot it against the advertised area of the house. The graph will typically look something like this:

Data above is completely fictional, however, it shows what data typically will look like. For some input X, there comes a result y. Which I phrased that way on purpose, it relates to the way we think about functions. What our goal would be is to predict what a result y is for any given X. So we'd want to find some function which best predicts where the most amount of points comes from.

As an aside, the way I think about the above is that each data point is an instance where the AI has sensed the environment. When we fit a line in test, we're going to see if it holds for all past experiences. Then we'll test to see how good the line is we fit (test the distance of all points to the line). Next we'll move the line in some direction which will minimize the distance of all data points to the line. This would constitute the decision and then make a decision if it's good or bad. When the minimum is reached, then we have a "pretty good" prediction for all future points and an established correlation. From a conceptual point of view, we could either stop and assume that all future points will be close to the predicted line, or we can continuously update our predicted line by running over the new set of points (update the KB), which is called "online" AI. In an updating system, I recommend establishing a discrete upper bound where we "forget" old data points so as to keep the decision quick and light weight.

In any case, the next step is again, pick a line, then measure the distance of all points to the line, then move the line in a direction that minimizes the distance of all points. This conceptually will look like this for each iteration of guess until global optimum is reached:

To explain how this works, first we pick a line which is the generic form of y = mx + b. Now we need to know what the total "cost" is of this line that is picked. The cost can be found by calculating the distance of each point to the line. You can think of this as a simple distance calculation from a single point to a line. We know what the point on the line that we want to take by our original line function y = mx + b. This is to say, if you know what the m and the b are by guessing, then you know what y is for a given x.

Now we know from our dataset what actual y values are at a given x. We want to know the distance of the actual prediction to our predicted position on the line. So by remembering the distance formula dist = \sqrt{x^2 + y^2} we can calculate the distance of each point. If you add up all the actual points distances, then you'll get the cost of picking a particular line so concretely: y = \frac{1}{m} * \sum_{i=1}^m \sqrt{predictY_i^2 - actualY_i^2}. However, square root is a costly function; and we only really care about finding the minimum cost, so we can use something called the Mean Square Error we can eliminate the square root and maintain the ability of deciding which cost is the least so we can rewrite our cost function as this and gain an optimization bonus point: y = \frac{1}{m} * \sum_{i=1}^m (predictY_i - actualY_i)^2.

So now that we know what the cost is for our given prediction, we can use that to determine what a better prediction is. The idea here is we select a m and a b that can feed into our cost function until we reach a global minimum. The easiest function to grasp to do this is "Gradient Descent." Gradient Descent is the idea of exactly what we've been talking about, find the correct step to make for both m and b in our line function to reach the shortest distance from all points. It is a convex function we're looking at, so it always converges to the global minimum.

Putting everything together; here's the math for a gradient descent. First in partial derivative form:

\frac{\partial}{\partial\thetaj} \frac{1}{2m}\sum_{i=1}^m((\theta_0 + \theta_1*x_i) + y_i)^2

This function is run over the values of j = 0 and j = 1. So we can rewrite it as:

Where j := 0 = \frac{\partial}{\partial\theta0} = \frac{1}{m}\sum_{i=1}^m((\theta_0 + \theta_1*x_i) + y_i)
Where j := 1 = \frac{\partial}{\partial\theta1} = \frac{1}{m}\sum_{i=1}^m((\theta_0 + \theta_1*x_i) + y_i) * x_i
Update both functions simultaneously (that is note that theta0 and theta1 get updated in each step so if you do the first function then use the output from that within the second, then the results are going to be wrong. Repeat updating this function until you have convergence (no further repeats result in a lower cost.

Okay, if you're still with me, this is the very basics of linear regression. One can replace the hypothesis of saying it must be a linear line by introducing polynomials however, doing so can run into problems with over fitting and under fitting. Over fitting happens when we create a polynomial function that perfectly passes through every sample location in our dataset. The problem is that this often means that the next, future value might not fit on the line. Thus it turns into a perfect predictor for existing data, but not worthy for our desire to predict future values. The opposite situation, under fitting, is where we just have a simple line that goes through a large area and results in a poor performing prediction. In this situation, experimenting with adding complexity to a polynomial should result in a better predicting function.

I've avoided giving code here as I'm trying to keep this discussion really high level; as a simple introduction of the concepts. In a future post, I'll describe how to translate this into code. When looking at how to use ML in an embedded project that isn't on the internet such as in the case of our Snapdragon Flight drone, we need to be able to translate all this to code directly. However, when you look at this function, it's heavy on math and takes a lot of linear algebra happening often. So, we can use a library that's optimized for it on Snapdragon boards. Here's the best math lib for this job.

Stay tuned and I'll give you some code for a generic Linear Regression prediction model and then move onto a Neural Network. However, the next steps are coming up in our Drone.