Lab 8 - Python for Data Processing


Objectives

In the rest of the course, we will learn data visualization and data mining. Typically a generic dataset might not be ready for this job. So as a first step, we need to perform some processing on the dataset to make it ready for visualization or mining. Today's lab provides an illustrative example of data processing using Python.


Problem Statement

Assume that our task is to draw a chart that shows the number of films in various runtime ranges in a database. We have a database as the source of information. The table we are concerned has the following schema:

Movie ( id , name , year , rating , runtime , genre , earnings_rank )

Our task is to count the number of films in the following runtime ranges: 60-69 min, 70-79 min, 80-89 min, 90-99 min, 100-109 min, 110-119 min, 120-129 min, 130-139 min, 140-149 min, .. , 230 - 239 min.


Get Ready

If you did not do this in last lab, download this database to your z:\db, also download this Python program to the same folder. Read the program and try to answer the following question: What does the program do?

Execute the program and look at the file generated.


Data Processing from the Database

Now lets accomplish our task by including a set of queries in a Python program named "getMovies1.py". It is straightforward to write several queries to get the number of movies in the runtime ranges above, however, a more scalable way is to write a parameterized query, and use a for loop to generate the runtime ranges, which in turn provide the parameters for the query.

For instance, you can use two variables to keep track of lower and upper values for a time range. These two variables are used for the parameterized query. In each iteration you modify the values of these two variables.

Write the results of the queries into a file named "movie1.txt". The format of the file is the following: except for the first line, each line records the information for a time range. There are two columns in each line: time range and the number of films. They are separated by "\t". The first line is the "headline", which tells the reader what information is provided in each column. Your file should look like this:

Run Time Number of  Films
Between 60 - 69 ...
Between 70 - 79 ...
Between 80 - 89 ...
... ...

You can refer to the program provided in "Get Ready" part for connecting to a database, performing queries and writing to a file.

In the next part, we will try to interpret this data.


Data Visualization

Many-eyes is a data visualization tool which can be found at http://www.many-eyes.com/
By using many-eyes, you can load your own data to the system, and visualize it with any type of chart you want. You can also publish your charts by using this website.

How can we visualize and interpret the data, which we found as a result of the question in the previous part?

Now, let's go to many-eyes web site, load our data into the system and visualize our data.

In order to load your own data, you have to register to the system. Don't worry, it doesn't take so much time.

After registering, login to the system and click on "Create Visualization" from the left menu.

Then, select "upload your own dataset".

Fill in the form. You can use this text file as a sample input.

When you click on upload after filling the form, you will be directed to a page where you can select the type of chart among a bunch of choices.

What kind of chart is convenient for your data? Bar chart, pie chart?


CS105
CS105 Labs