Lab 9 - Data Mining


Objectives

  1. SQL or Data Mining?
  2. Decision Tree (example)
  3. Python Practice: Reading from files


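SQL or Data Mining?

For each task below, decide whether it can be answered with a SQL query over the data or whether it requires data mining.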

  1. Given records of hospital treatments, we need to find out how many of them took more than 2 days.
  2. Given records of patient check-ups, we need to predict, for each patient, the month (Jan, Feb, etc.) in the next 12 months when that patient will come in for a check-up.
  3. Assuming that our predictions from task 2 are correct, we need to find the month in which the hospital will perform the most check-ups.
  4. We have micro-array expression data for various genes. We need to determine which genes lead to a certain genetic condition.
  5. We have micro-array expression data for various genes. We need to find the number of genes expressed above a specified threshold t over all our features.
  6. We want to discover relationships between products sold by an e-store.
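
One rough way to tell the two apart: a task is a query when the answer is already in the stored records and only needs filtering, counting, or aggregating. The sketch below shows what such a query looks like from Python, using the standard sqlite3 module. The table name, column names, and rows are all made up for illustration and are not part of the lab data.

```python
import sqlite3

# Hypothetical table and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE treatments (patient TEXT, days INTEGER)")
conn.executemany(
    "INSERT INTO treatments VALUES (?, ?)",
    [("A", 1), ("B", 3), ("C", 5), ("D", 2)],
)

# The answer is already in the data: filter the rows, then count them.
(long_stays,) = conn.execute(
    "SELECT COUNT(*) FROM treatments WHERE days > 2"
).fetchone()
print(long_stays)
```

A task that asks for something not stored anywhere, such as a prediction or a previously unknown pattern, cannot be written as a single query like this one.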


Decision Tree

Here are our training data:



Here is the decision tree:

 

Here are some test records:

[22 , high , no , fair , yes]

[45 , high , no , excellent , yes]

[32 , low , yes , excellent , yes]

How would our decision tree classify these records?
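
Classifying a record means starting at the root, applying the test at each node, and following the matching branch until a leaf is reached. Here is a minimal sketch of that walk in Python. The tree below is hypothetical and is NOT the tree from this lab; the field order (age, income, student, credit rating, class label) and the age thresholds are also assumptions made for the example.

```python
# Assumed field order for a record; the lab does not name the fields.
FIELDS = ("age", "income", "student", "credit", "label")

def age_group(record):
    """Hypothetical test: bucket the age using invented thresholds."""
    if record["age"] <= 30:
        return "young"
    if record["age"] <= 40:
        return "middle"
    return "senior"

# Hypothetical tree, not the one from the lab.
# An internal node is (test_function, {outcome: subtree}); a leaf is a class label.
HYPOTHETICAL_TREE = (
    age_group,
    {
        "young": (lambda r: r["student"], {"yes": "yes", "no": "no"}),
        "middle": "yes",
        "senior": (lambda r: r["credit"], {"fair": "yes", "excellent": "no"}),
    },
)

def classify(tree, record):
    """Follow branches from the root until a leaf (a plain string) is reached."""
    while not isinstance(tree, str):
        test, branches = tree
        tree = branches[test(record)]
    return tree

# A made-up record (not one of the lab's test records).
example = dict(zip(FIELDS, [27, "medium", "yes", "fair", "yes"]))
predicted = classify(HYPOTHETICAL_TREE, example)
print(predicted, "(actual:", example["label"] + ")")
```

Comparing the predicted label with the last field of each test record shows whether the tree classified that record correctly.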


Python Practice: Reading from files

Download this Excel file. It contains the average number of children per woman in many different countries for the years 1989 and 2009.

Before we use data in any data mining or visualization procedure, we usually want to clean it: correct errors, remove unusable records, or transform values into something new. For example, the dataset you downloaded has some missing values. One way to handle them is the following: if only one value is missing (i.e. the country has a statistic for only one of the years 1989 and 2009), give it the value of the other year. If both values are missing, do not include that country in the final dataset.
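
The rule for a single country can be sketched as a small function. This is only one possible way to write it; the function name is made up, and it assumes a missing value has already been turned into None.

```python
def fill_missing(value_1989, value_2009):
    """Apply the missing-value rule to one country's pair of values.

    None marks a missing value (an assumption of this sketch).
    Returns the completed (1989, 2009) pair, or None if the country
    should be left out of the final dataset.
    """
    if value_1989 is None and value_2009 is None:
        return None                      # both missing: drop the country
    if value_1989 is None:
        return (value_2009, value_2009)  # copy 2009 into 1989
    if value_2009 is None:
        return (value_1989, value_1989)  # copy 1989 into 2009
    return (value_1989, value_2009)      # nothing missing: keep as-is
```

For instance, `fill_missing(None, 2.1)` gives back `(2.1, 2.1)`, while `fill_missing(None, None)` gives back `None`.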

  1. First, open the file in Excel and save it as .csv (comma-separated values).
  2. Write a Python program that does the preprocessing described above.
  3. Write the result to a new file.
  4. Upload the new file to Many-Eyes and see what visualizations you can create to depict this information.
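
Steps 2 and 3 could be sketched with Python's standard csv module as shown here. This is a starting point under several assumptions, not a reference solution: it assumes the .csv file has one header row followed by rows of the form country, 1989 value, 2009 value, and that a missing value shows up as an empty (or non-numeric) cell. The function names are made up. Check the layout of your own file before relying on it.

```python
import csv

def parse_cell(text):
    """Return the cell as a float, or None if it is blank or not a number."""
    try:
        return float(text)
    except ValueError:
        return None

def clean(infile, outfile):
    """Read rows from infile, apply the missing-value rule, write to outfile.

    Assumed layout: a header row, then rows of (country, 1989 value, 2009 value).
    """
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    writer.writerow(next(reader))        # copy the header row unchanged
    for row in reader:
        if len(row) < 3:
            continue                     # skip blank or malformed lines
        country = row[0]
        first, second = parse_cell(row[1]), parse_cell(row[2])
        if first is None and second is None:
            continue                     # both missing: drop the country
        if first is None:
            first = second               # copy 2009 into 1989
        if second is None:
            second = first               # copy 1989 into 2009
        writer.writerow([country, first, second])

def clean_file(src_path, dst_path):
    """Open both files and run clean(); newline="" is what the csv module expects."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        clean(src, dst)
```

Calling `clean_file("children.csv", "children_clean.csv")` would then produce the new file; both file names are placeholders for whatever names you chose when saving.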

