The less uncertain the conditions, the more information we can deduce and effectively use.

By Matt DiCicco

Entropy Explained

Entropy is an essential idea from thermodynamics, where it tells physicists how much disorder a system contains.

Machine learning practitioners later adopted the idea because it is useful for classifying data in unsupervised settings (and in supervised ones, like decision trees).

Entropy quantifies the amount of uncertainty in an entire probability distribution. Some say it is a measure of chaos, but I don't like to describe it that way.

To me, entropy is simply a metric for information gained.

The more information we gain, the more scenarios we can rule out, and the better we can tell how something is going to happen.

Implications:

Low entropy -> less chaos -> more information gained

High entropy -> more chaos -> less information gained

The lower the entropy, the less chaos there is, or the purer the set is.

Meaning the next state might be easier to predict if the quantified entropy is low.

Common Application: One application of entropy is in decision trees. In a decision tree, you decide which feature to split your dataset on first.

To figure out which feature is most suitable, you can use entropy. You loop over your features and, whichever one yields the lowest entropy, you use that as the first split in the decision tree.

This is because the lowest entropy points to the feature that describes the data best, meaning it classifies most of the data into a specific class right off the bat.
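To make that loop concrete, here is a minimal Python sketch of the idea. It is not from any particular library; the helper names entropy, split_entropy, and best_first_split are my own, and it assumes categorical features stored as a list of dicts. It scores each feature by the weighted entropy of the subsets it produces and picks the lowest.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def split_entropy(feature_values, labels):
    """Weighted average entropy of the labels after splitting on one feature."""
    total = len(labels)
    weighted = 0.0
    for value in set(feature_values):
        subset = [lab for val, lab in zip(feature_values, labels) if val == value]
        weighted += (len(subset) / total) * entropy(subset)
    return weighted

def best_first_split(rows, feature_names, labels):
    """Return the feature whose split gives the lowest weighted entropy."""
    scores = {
        name: split_entropy([row[name] for row in rows], labels)
        for name in feature_names
    }
    return min(scores, key=scores.get)
```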

A day-to-day example of entropy:

Imagine you have the features height, skin tone, and day of the week, and the goal is to predict an individual's weight.

So now you loop over the features and find that the lowest entropy belongs to a person's height, as sketched below. This is because height is the best predictor of a person's weight, and hence gives the most information gained.
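Here is what that might look like on a hypothetical toy dataset, reusing the best_first_split sketch from above. Weight is bucketed into "light" and "heavy" classes (my own simplification) so that entropy applies:

```python
# Hypothetical toy data; weight is bucketed into classes so entropy applies.
people = [
    {"height": "tall",  "skin_tone": "light", "day": "Mon"},
    {"height": "tall",  "skin_tone": "dark",  "day": "Tue"},
    {"height": "short", "skin_tone": "light", "day": "Mon"},
    {"height": "short", "skin_tone": "dark",  "day": "Wed"},
]
weight_class = ["heavy", "heavy", "light", "light"]

# Height separates the weight classes perfectly here, so it has the lowest
# split entropy and would be chosen as the first split.
print(best_first_split(people, ["height", "skin_tone", "day"], weight_class))
# -> height
```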

Equation:

H = - sum from i = 1 to C of p_i * log_2(p_i)

C is the number of clusters (classes of data) you would like to go up to in an unsupervised case. But C can also be the number of classes, or dependent variables, that something could be classified as in a supervised situation.

(Though I believe these should be called independent variables, since the dependent variable is the result of the entropy calculation.)

An intuitive reader will realize that entropy could also be used to figure out the proper number of clusters for a dataset.

p_i is the probability of class i in the total dataset.

For example, suppose 3/10 people are male and 7/10 are female, and we are trying to find the entropy, or the information gained. We have 2 classes (male and female), with probabilities of 3/10 and 7/10.

So the equation would look like -3/10 log_2(3/10) - 7/10 log_2(7/10). This equals about 0.88, which is considered high, since with two classes entropy is bounded between 0 and 1.

What does this tell us? It tells us we don't have much information gain.
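As a quick sanity check on the arithmetic, the snippet below (the helper name binary_entropy is mine) reproduces the 0.88 figure:

```python
import math

def binary_entropy(p):
    """Entropy (base 2) of a two-class split with probabilities p and 1 - p."""
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(round(binary_entropy(3 / 10), 2))  # -> 0.88
```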

More on that topic later.

  • Matt DiCicco
  • 2107741848
  • matt.diciccomhs@aol
