Chapter 1
Introduction
Arthur Samuel (1959) described machine learning as the field of study that gives computers the ability to learn without being explicitly programmed. A more formal definition, usually attributed to Tom Mitchell, says that a computer program learns from experience (E), with respect to some task (T) and some performance measure (P), if its performance on T, as measured by P, improves with experience E.
For example, consider an email program that watches which emails you do or do not mark as spam and, based on that, learns how to filter spam better:
a) The task (T) is what I want to achieve: classifying emails as spam or not spam. It is the goal.
b) The performance measure (P) is the number (or fraction) of emails correctly classified as spam or not spam. It is the result I can measure after applying an algorithm.
c) The experience (E) is what the program learns from: watching which emails are labeled as spam or not spam.
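The three pieces above can be made concrete with a toy sketch. The word-counting learner below is an assumption made purely for illustration, not how a real spam filter works, and the names (`WordCountFilter`, `accuracy`) and the emails are hypothetical.

```python
from collections import Counter


class WordCountFilter:
    """Toy spam filter (hypothetical): counts words seen in spam vs. non-spam."""

    def __init__(self):
        self.spam_words = Counter()
        self.ham_words = Counter()

    def learn(self, text, is_spam):
        # Experience (E): one email the user marked as spam or not spam.
        target = self.spam_words if is_spam else self.ham_words
        target.update(text.lower().split())

    def predict(self, text):
        # Task (T): classify an email as spam (True) or not spam (False).
        words = text.lower().split()
        spam_score = sum(self.spam_words[w] for w in words)
        ham_score = sum(self.ham_words[w] for w in words)
        return spam_score > ham_score


def accuracy(model, emails):
    # Performance measure (P): fraction of emails classified correctly.
    correct = sum(model.predict(text) == is_spam for text, is_spam in emails)
    return correct / len(emails)


# Made-up data, for illustration only.
training = [
    ("win a free prize now", True),
    ("free money click now", True),
    ("meeting notes for monday", False),
    ("lunch on monday", False),
]
held_out = [
    ("claim your free prize", True),
    ("notes from the meeting", False),
]

model = WordCountFilter()
before = accuracy(model, held_out)  # P before any experience
for text, is_spam in training:
    model.learn(text, is_spam)
after = accuracy(model, held_out)   # P after watching labelled emails
print(before, after)
```

On this tiny data set the measured performance goes up once the program has seen labelled emails, which is exactly what "learning" means in the definition.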
Machine learning has two main types of learning algorithms: supervised learning and unsupervised learning. Others include reinforcement learning and recommender systems.
Supervised Learning
In supervised learning, the right answers are given. For example, I know the prices of houses of various sizes (in square feet) in a certain area, and by collecting them I can predict a base price for a specific house, of a specific size, that I want to sell. The prediction might not be entirely accurate, but it gets me close to the actual value.
If I fit a curve to that data, as a trend, and use it to predict a continuous-valued output (for example, a house price in 2008 versus 2021), that is called Regression. If instead I predict a discrete-valued output, that is called Classification.
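A minimal sketch of regression on the house-price example, assuming a straight line fitted by ordinary least squares is a good enough trend. The sizes and prices are made up for illustration, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up "right answers": house sizes (square feet) and the prices they sold for.
sizes = [1000, 1500, 2000, 2500]
prices = [200000, 275000, 350000, 425000]

slope, intercept = fit_line(sizes, prices)

# Predict a continuous value (a price) for a house size not in the data.
predicted_price = slope * 1800 + intercept
print(predicted_price)
```

The output is a number on a continuous scale, which is what makes this a regression problem rather than a classification problem.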
A better way to understand the difference between Classification and Regression problems is through examples:
a) You have an inventory of identical items, and you want to predict how many of them you will sell over the next 3 months. You define what the items are and the time period, look at how many were sold in the past, and use a predictive algorithm to get an approximate number of how many will be sold in the stipulated period. This is Regression.
b) I want to examine individual customer accounts and, for each account, decide whether it has been hacked or compromised. I can define a discrete value, 0 for not hacked and 1 for hacked, and after applying the algorithm obtain a result in one of these two categories. This is Classification.
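The hacked-account example can be sketched as a classifier that outputs 0 or 1. The sketch below assumes a single hypothetical feature (failed login attempts per day) and a simple nearest-mean rule; both are illustrative choices, not something the example above prescribes, and the data is made up.

```python
def fit_threshold(values, labels):
    """Learn a cut-off halfway between the mean of class 0 and the mean of class 1."""
    class0 = [v for v, y in zip(values, labels) if y == 0]
    class1 = [v for v, y in zip(values, labels) if y == 1]
    mean0 = sum(class0) / len(class0)
    mean1 = sum(class1) / len(class1)
    return (mean0 + mean1) / 2


def classify(value, threshold):
    # Assumption: hacked accounts (label 1) show the higher values.
    return 1 if value > threshold else 0


# Made-up labelled accounts: failed logins per day, 0 = not hacked, 1 = hacked.
failed_logins = [0, 1, 2, 1, 9, 12, 8, 11]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

threshold = fit_threshold(failed_logins, labels)
print(classify(3, threshold), classify(10, threshold))
```

Unlike the regression case, the prediction here can only take one of two discrete values.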
To remember: whenever the data set comes with the right answers, whether the output is a discrete value (Classification) or a continuous trend (Regression), we’re talking about supervised learning.
Unsupervised Learning
In unsupervised learning, the right answers are not given. I have a data set without labels and, based on that alone, I want the algorithm to discover structure in it.
A good example is news.google.com, which scans many news stories and groups the similar ones into a single category, bringing the main coverage together under a single link.
Another is a data set of genes from different individuals. By applying an algorithm, it is possible to group the individuals into different types, or to obtain other information, through an autonomous learning process. This applies not only to genomes, but also to organising computers, social analysis, market segmentation, and astronomical data analysis.
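The grouping described above can be sketched with k-means, one common clustering algorithm (the text does not name a specific one, so this is a swapped-in choice). Each point stands for two hypothetical measurements per individual; the data is made up, and starting from the first `k` points is a simplifying assumption.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances))


def kmeans(points, k, iterations=10):
    # Assumption: use the first k points as starting centroids.
    centroids = list(points[:k])
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of the points assigned to it.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
    return [nearest(p, centroids) for p in points]


# Made-up, unlabelled data: no "right answers" are supplied.
points = [
    (1.0, 1.0), (8.0, 8.0), (1.2, 0.8),
    (8.2, 7.9), (0.9, 1.1), (7.9, 8.1),
]

groups = kmeans(points, k=2)
print(groups)
```

The algorithm is never told which individual belongs to which group; the group numbers it returns are arbitrary names for the structure it found.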
The “cocktail party problem” shows how, from a set of data, machine learning can help separate the voices of two different people into two different categories for you. You don't choose which voice or language, who has the deeper or weaker voice, who is male or female, or even whether it is music or a person speaking in the background: the algorithm should separate these into categories for you automatically, rather than you defining them a priori.
Therefore, the best way to remember this is: when no labels or categories are defined in advance, the method is unsupervised.