How to increase information gain

Question

1) This is the "uncooked" raw dataset. It has the maximum entropy, "1".

2) This is the same dataset, but "cooked".

Any dataset on Socrata, ArcGIS Server, or ArcGIS Online is a "raw" dataset. A "raw" dataset is unclassified: heterogeneous data mixed together, with the most "uncertainty" and the most "surprise".

Its information gain is very low. That is why unclassified "raw" data does not feel very useful.

To make the unclassified 'raw' dataset more useful

We need to split the dataset by some criteria. This is called "classifying" it.

ESRI ArcGIS symbology is a good example of reducing entropy by classifying with color and symbol.

For example, a zoning map with color symbols feels useful because you can easily pick out the commercial zone by its color.

Since all our raw datasets are by nature structured in rows and columns,

a simple way to "classify" is to split by column.

Of course, if the raw dataset is not structured by row and column, the classification process changes accordingly. But that is not what I am going to talk about today.
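Here is a minimal sketch of what "split by column" could look like in Python. The rows, the column names ("zone", "district") and the helper `split_by_column` are all made up for illustration; they do not come from any real Socrata or ArcGIS dataset.

```python
from collections import defaultdict

# Hypothetical rows standing in for a raw, row-and-column dataset.
rows = [
    {"parcel": 1, "zone": "commercial", "district": "north"},
    {"parcel": 2, "zone": "residential", "district": "north"},
    {"parcel": 3, "zone": "commercial", "district": "south"},
    {"parcel": 4, "zone": "residential", "district": "south"},
]


def split_by_column(rows, column):
    """Group rows into subsets keyed by the value found in `column`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    return dict(groups)


# "Classify" the raw rows by the zone column.
by_zone = split_by_column(rows, "zone")
for value, subset in by_zone.items():
    print(value, [r["parcel"] for r in subset])
```

Each value in the chosen column becomes its own subset, which is roughly what a color symbol does on a zoning map.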

When you classify by a different column, you get a different decrease in entropy.

The more entropy you reduce, the more information gain you get.

Information gain = original entropy - entropy after the split
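As a rough illustration of the formula, the sketch below computes Shannon entropy in bits and the information gain of splitting by one column. It assumes "entropy after the split" means the size-weighted average entropy of the subsets, which is the usual decision tree definition. The rows, the column names and the function names are invented for this example.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy, in bits, of a list of labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, split_column, target_column):
    """Original entropy minus the weighted entropy after splitting."""
    original = entropy([r[target_column] for r in rows])
    groups = defaultdict(list)
    for r in rows:
        groups[r[split_column]].append(r[target_column])
    # Weight each subset's entropy by its share of the rows.
    after_split = sum(
        len(g) / len(rows) * entropy(g) for g in groups.values()
    )
    return original - after_split


# Hypothetical raw rows; "use" is the thing we want to be less uncertain about.
rows = [
    {"zone": "commercial", "district": "north", "use": "shop"},
    {"zone": "commercial", "district": "south", "use": "shop"},
    {"zone": "residential", "district": "north", "use": "home"},
    {"zone": "residential", "district": "south", "use": "home"},
]

print("original entropy:", entropy([r["use"] for r in rows]))
print("gain by zone:", information_gain(rows, "zone", "use"))
print("gain by district:", information_gain(rows, "district", "use"))
```

In this made-up data the unsplit "use" column is an even two-way mix, so its entropy is 1 bit. Splitting by "zone" leaves every subset pure and gains the full 1 bit, while splitting by "district" leaves every subset just as mixed and gains nothing.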

Spatial information is a 'column': it is a classification criterion and, by nature, a dimension.

The raw dataset can hold "anything"

Don't simply think of a raw dataset as just "rows" and "columns". Rows and columns are the fundamental structure of a dataset. Think of rows and columns as the floors and unit rooms in a self-storage building: the floor is the row, and the unit number is the column. A unit room can hold anything. It is the same with a dataset: the raw dataset can hold the data for a "decision tree", but to see the "decision tree" you need a special tool to extract it from the raw dataset. A raw dataset could hold the raw data for a two-dimensional model, a three-dimensional model, a multi-dimensional time series model, a spatial model, a multi-dimensional spatial model, a time series spatial model, and so on. The models you can imagine are endless.

All those super complex models can be stored in "rows" and "columns" in the form of a raw dataset.

Socrata, Esri ArcGIS Online, and ArcGIS Server hold a huge number of these super complex models, but in "raw" form.

Data in "raw" form is not very useful, just as "raw" food is not easy to digest.

You need to "cook" the raw dataset to make it useful, just as you need to "cook" raw food to make it easier to digest.

There are many ways to "cook" a raw dataset. "Classification by column" is just one commonly used method, the same way steaming is just one common cooking method.

There are many cooking methods, such as frying, boiling, microwaving, roasting in the oven, baking, etc.

In the same way, there are many methods to "cook" a raw dataset, and classification is just one of them.

I don't know what the others are. That is the unknown area, and it needs someone to explore it.

All the tools I invent here are about different ways of "cooking" raw datasets.

No matter what method you use to "cook" a dataset,

the goal is to reduce entropy, increase information gain,

make it useful, make it easier to digest.

A data author can inject a super complex model into the "raw" (row and column) form.

Socrata and ESRI ArcGIS Online provide data authors a tool to publish it in "raw" form.

To digest the raw dataset, the audience has 2 options:

1) Make simple tools themselves on the platform. Socrata and ESRI provide some simple tools that let users apply some simple "cook" methods.

2) Use third-party (well-made, efficient) tools to cook the raw data.

Third-party tool making is still an unexplored area, an unexplored market.

by hoogw
