Today, we reviewed the first quarter test.  The answer for question #1 was incorrectly specified, so I will go back and change people's grades to reflect that.  Sorry for the error. 

Today, we talked about ways to think about the entire distribution of a variable.  We created frequency tables -- both on data that students found on the internet, and on income data as a class.  The PivotTable feature in Excel makes it easy to group cases together into intervals.  After creating the PivotTable, one just right-clicks on the numbers,  selects the "Group" option and then decides the size of the intervals with which to group the data. 

We created histograms, which are simply bar charts based on frequency distributions. 

A histogram can tell you the shape of the distribution.  Distributions can be symmetrical or skewed.  When the tail on the right side tail of the distribution is longer than the left side tail, then we say the distribution is skewed to the right.  When the left-hand tail is longer than the right-hand tail, then we say that the distribution is skewed to the left. 

An example of a distribution skewed to the right is income in the US (and just about everywhere.)  There are a small group of people making a lot of money, which pulls the tail of the income distribution out to the right.  This rightward skew causes mean (average) income to be higher than median income.  Recall that extreme values (called outliers) affect the mean but do not affect the median.

The homework for Tuesday is posted on the homework part of the website.


 
First, make sure that you have all of the in-class assignments and homeworks saved on your USB drive.  Next class, as you are doing the in-class activity, learning more about Excel, I will come around the room and check assignments.

For your homework, click the homework link on the left for instructions.
 
The test will be multiple choice on Blackboard.  You are allowed to consult your notes. 

The test covers the following:
  1. Level of measurement - (types of variables) - you should be able to determine whether a variable is nominal, ordinal or interval.
  2. What is data -- you should know what data is for our purposes -- that it is made up of variables and cases.  What are variables?  What are cases?  Know how to find out how many cases there are in a dataset.
  3. Measures of central tendency: mean (average), median and mode.  Know what they are and how to calculate them in Excel.  Know how the median can behave differently than the mean. 
  4. Standard deviation - know what it measures, and what it tells you.
  5. Rates.  Why do we use them rather than just simple counts or frequencies?  How do we calculate them?  What problems can arise if we calculate rates by place,
  6. Facility with Excel.  You should be comfortable opening data in Excel and creating tables (CTRL-T) based on the data.
  7. Using Pivot Tables to obtain counts, totals, and averages of one variable by different categories of another variable (remember the Major League Baseball team payrolls? the payrolls were sums of player salaries by team.)  
  8. Naming ranges of cells and then using functions to calculate summary statistics of these ranges of cells, e.g. =AVERAGE(), =MEDIAN(), =STDEV()
  9. Creating new variables based on existing variables.  (Remember that you needed to do this to create crime rates.)
 
This week, we calculated crime rates for different municipalities in New Jersey.  Our data come from the Uniform Crime Reports, a set of crime statistics compiled by the FBI from police reports.  We should keep in mind that our data only include crimes reported to the police. 

First, why use rates of crime rather than simply counts of crime? 


 
Today, we will first do an exercise that will help you review some of the material we have learned so far: functions, and basic statistics like the mean, median, mode and standard deviation. 

Next, we will launch into the main activity for the day in order to answer important question:

There's been a lot of talk about climate change and global warming in the last few years.  But you are probably more accustomed to hearing the rhetoric from politicians than hearing about the actual data. 

Has it been getting warmer in recent years?  In New Jersey?  We will use some data on average monthly temperatures to address this question.
 
Today, we're going to explore the baseball data.  Some people think that rich teams like the Yankees have too much of an advantage over poor teams like the Pittsburgh Pirates. 

How much variation is there in the resources of baseball teams?   This is a question we will examine today.  We will create a bar chart showing the payrolls of Major League Baseball (MLB) teams in the year 2009.

We will use tables from a baseball dataset that I downloaded for free online.  One table contains information on player salaries.  In order to get payroll information by team, we will use the PivotTable feature to sum up salaries by team. 

But we also want the names of the teams.  Team names are available in a team table.  I will show you how to combine information from the team table with information from the salary table.  The end goal is a bar chart displaying the payrolls of each baseball team in the Major Leagues in 2009. 

This exercise includes the following new tasks:

* import data from an MS Access database. 

We will not actually use the MS Access program.  Instead, you will learn how to import data from Access.  Why store data in Access?  Excel isn't so good for storing big datasets.  By contrast, Access allows you to store tons of data.  The baseball dataset is very large -- it has tables with statistics on players, teams, and other information.  This is why the baseball dataset is stored in Access.  Once you get the hang of it, its fairly straightforward to import tables from Access into Excel.

* use named ranges.  When you use a variable a bunch of times, it is handy to name it.  This way, you can refer to the variable by the name that you assigned to it.  This makes life easier.

* use a new function called =VLOOKUP. 

* Use Excel's bar chart feature.



 
Level of measurement

The first thing we did today was test our knowledge of level of measurement . Variables can be categorized according to level of measurement, namely: nominal, ordinal and interval.  Using the online polls allowed me to get a sense of how much the class knows about these topics.  With practice, you should be able to identify the level of measurement of a variable almost immediately!

Variables and cases - what datasets are made of

All this talk of level of measurement leads us back to the question: what is a variable anyway?  There is often confusion about this.  Technically speaking, variables are collections of characteristics of people, groups,  places, or things.
 
The most fundamental variable is a coin toss that can come up heads or come up tails.  It is important to be able to identify what is a variable and what is not a variable.  The variable here is set of possible outcomes of a coin toss.  So the variable consists of a set of items: heads / tails.  The coin ITSELF is not a variable.

To reiterate, variables are characteristics of things.  We call the things cases.  For our purposes, a dataset must have cases in the rows and variables in the columns.  Each case takes up one and only one row.  Each variable occupies one and only one column.

Datasets must contain cases that are all the SAME type of thing.  So if the dataset is a list of players in your favorite sports team, the cases are players, and you have to make sure there isn't anything else mixed up in there. 

If the cases are individual people, then ALL the variables in the dataset must be characteristics that describe individual people.  If the cases are nations, then ALL the variables in the dataset must be characteristics that describe nations. 

Telling a story with data

Telling stories about data is the point of this class.  It is important to know the levels of measurement of your variables, because this information tells you what techniques you might use to tell the story.

We considered the Forbes List of 100 Top Celebrities.  One story one might want to tell is about which types of celebrities get paid the most.  Or I can rephrase this as a question: what types of celebrities get paid the most? 

What kind of variable is pay?  Pay is an interval level variable, because it is measured precisely in (countable) dollars.  The variable "category" tells what type of business the celebrity is in.  Category is nominal because there is no order between the different categories.

A common data analysis technique is to take the average of an interval variable within different categories of a nominal variable.  This technique would help us answer the question: what types of celebrities get paid the most?  Specifically, we can look at average pay within categories.

How do you do this in Excel?  See the instructions for details.