Up until now, we have mainly focused on only ONE variable at a time.  Occasionally, a second variable crept in, but we have mainly focused on one. 

Now, we're going to start looking at two variables at one time.  In particular, we're going to look at associations between variables.  What does it mean to say that one variable is associated with another?  It doesn't mean that the two variables are similar.  Instead, a variable X is associated with another variable Y if knowledge of X enables you to predict Y with better than chance accu
 
Today, we reviewed the first quarter test.  The answer for question #1 was incorrectly specified, so I will go back and change people's grades to reflect that.  Sorry for the error. 

Today, we talked about ways to think about the entire distribution of a variable.  We created frequency tables -- both on data that students found on the internet, and on income data as a class.  The PivotTable feature in Excel makes it easy to group cases together into intervals.  After creating the PivotTable, one just right-clicks on the numbers,  selects the "Group" option and then decides the size of the intervals with which to group the data. 

We created histograms, which are simply bar charts based on frequency distributions. 

A histogram can tell you the shape of the distribution.  Distributions can be symmetrical or skewed.  When the tail on the right side tail of the distribution is longer than the left side tail, then we say the distribution is skewed to the right.  When the left-hand tail is longer than the right-hand tail, then we say that the distribution is skewed to the left. 

An example of a distribution skewed to the right is income in the US (and just about everywhere.)  There are a small group of people making a lot of money, which pulls the tail of the income distribution out to the right.  This rightward skew causes mean (average) income to be higher than median income.  Recall that extreme values (called outliers) affect the mean but do not affect the median.

The homework for Tuesday is posted on the homework part of the website.


 
First, make sure that you have all of the in-class assignments and homeworks saved on your USB drive.  Next class, as you are doing the in-class activity, learning more about Excel, I will come around the room and check assignments.

For your homework, click the homework link on the left for instructions.
 
The test will be multiple choice on Blackboard.  You are allowed to consult your notes. 

The test covers the following:
  1. Level of measurement - (types of variables) - you should be able to determine whether a variable is nominal, ordinal or interval.
  2. What is data -- you should know what data is for our purposes -- that it is made up of variables and cases.  What are variables?  What are cases?  Know how to find out how many cases there are in a dataset.
  3. Measures of central tendency: mean (average), median and mode.  Know what they are and how to calculate them in Excel.  Know how the median can behave differently than the mean. 
  4. Standard deviation - know what it measures, and what it tells you.
  5. Rates.  Why do we use them rather than just simple counts or frequencies?  How do we calculate them?  What problems can arise if we calculate rates by place,
  6. Facility with Excel.  You should be comfortable opening data in Excel and creating tables (CTRL-T) based on the data.
  7. Using Pivot Tables to obtain counts, totals, and averages of one variable by different categories of another variable (remember the Major League Baseball team payrolls? the payrolls were sums of player salaries by team.)  
  8. Naming ranges of cells and then using functions to calculate summary statistics of these ranges of cells, e.g. =AVERAGE(), =MEDIAN(), =STDEV()
  9. Creating new variables based on existing variables.  (Remember that you needed to do this to create crime rates.)
 
This week, we calculated crime rates for different municipalities in New Jersey.  Our data come from the Uniform Crime Reports, a set of crime statistics compiled by the FBI from police reports.  We should keep in mind that our data only include crimes reported to the police. 

First, why use rates of crime rather than simply counts of crime? 


 
Today, we will first do an exercise that will help you review some of the material we have learned so far: functions, and basic statistics like the mean, median, mode and standard deviation. 

Next, we will launch into the main activity for the day in order to answer important question:

There's been a lot of talk about climate change and global warming in the last few years.  But you are probably more accustomed to hearing the rhetoric from politicians than hearing about the actual data. 

Has it been getting warmer in recent years?  In New Jersey?  We will use some data on average monthly temperatures to address this question.
 
Today, we're going to explore the baseball data.  Some people think that rich teams like the Yankees have too much of an advantage over poor teams like the Pittsburgh Pirates. 

How much variation is there in the resources of baseball teams?   This is a question we will examine today.  We will create a bar chart showing the payrolls of Major League Baseball (MLB) teams in the year 2009.

We will use tables from a baseball dataset that I downloaded for free online.  One table contains information on player salaries.  In order to get payroll information by team, we will use the PivotTable feature to sum up salaries by team. 

But we also want the names of the teams.  Team names are available in a team table.  I will show you how to combine information from the team table with information from the salary table.  The end goal is a bar chart displaying the payrolls of each baseball team in the Major Leagues in 2009. 

This exercise includes the following new tasks:

* import data from an MS Access database. 

We will not actually use the MS Access program.  Instead, you will learn how to import data from Access.  Why store data in Access?  Excel isn't so good for storing big datasets.  By contrast, Access allows you to store tons of data.  The baseball dataset is very large -- it has tables with statistics on players, teams, and other information.  This is why the baseball dataset is stored in Access.  Once you get the hang of it, its fairly straightforward to import tables from Access into Excel.

* use named ranges.  When you use a variable a bunch of times, it is handy to name it.  This way, you can refer to the variable by the name that you assigned to it.  This makes life easier.

* use a new function called =VLOOKUP. 

* Use Excel's bar chart feature.



 
Level of measurement

The first thing we did today was test our knowledge of level of measurement . Variables can be categorized according to level of measurement, namely: nominal, ordinal and interval.  Using the online polls allowed me to get a sense of how much the class knows about these topics.  With practice, you should be able to identify the level of measurement of a variable almost immediately!

Variables and cases - what datasets are made of

All this talk of level of measurement leads us back to the question: what is a variable anyway?  There is often confusion about this.  Technically speaking, variables are collections of characteristics of people, groups,  places, or things.
 
The most fundamental variable is a coin toss that can come up heads or come up tails.  It is important to be able to identify what is a variable and what is not a variable.  The variable here is set of possible outcomes of a coin toss.  So the variable consists of a set of items: heads / tails.  The coin ITSELF is not a variable.

To reiterate, variables are characteristics of things.  We call the things cases.  For our purposes, a dataset must have cases in the rows and variables in the columns.  Each case takes up one and only one row.  Each variable occupies one and only one column.

Datasets must contain cases that are all the SAME type of thing.  So if the dataset is a list of players in your favorite sports team, the cases are players, and you have to make sure there isn't anything else mixed up in there. 

If the cases are individual people, then ALL the variables in the dataset must be characteristics that describe individual people.  If the cases are nations, then ALL the variables in the dataset must be characteristics that describe nations. 

Telling a story with data

Telling stories about data is the point of this class.  It is important to know the levels of measurement of your variables, because this information tells you what techniques you might use to tell the story.

We considered the Forbes List of 100 Top Celebrities.  One story one might want to tell is about which types of celebrities get paid the most.  Or I can rephrase this as a question: what types of celebrities get paid the most? 

What kind of variable is pay?  Pay is an interval level variable, because it is measured precisely in (countable) dollars.  The variable "category" tells what type of business the celebrity is in.  Category is nominal because there is no order between the different categories.

A common data analysis technique is to take the average of an interval variable within different categories of a nominal variable.  This technique would help us answer the question: what types of celebrities get paid the most?  Specifically, we can look at average pay within categories.

How do you do this in Excel?  See the instructions for details.


 
Check this website for all information about the course.  Make sure that you buy the the textbook THIS week if have not already.  Statistics for People Who (Think They) Hate Statistics. 

In addition, you will need to bring a flash drive to class with you.  You can get a 4 gigabyte Lexar flash drive for about $14 at Office Depot on Hamburg Turnpike.  You will need a flash drive to store the data files and class assignments.

Main Points from class today:

  • In this class, you will learn about how to manipulate data and generate statistics, tables and charts in Excel.  
  • This class is a resume-builder!  After you successfully complete this class, you can add the following statement to your resume: "Proficient in Excel."  
  • The world is awash in data.  Since so many of us are online so often, so many of our activities are recorded!  If you're online, you're generating data that is being collected by a computer somewhere and packaged together with information from millions of other people.   This is rather scary, but it is the world we live in.  In a world of data, it is very useful to know how to manipulate some of that data, and to know how to interpret the data and tell stories with the data.  We will use Excel to gain mastery over data!  
  • What do I mean by data?  When I refer to data, I'm referring to a rectangular matrix with rows and columns.  The rows often correspond to people, and the columns often contain characteristics of those people.  The Forbes list of the top 100 celebrities is an example of a dataset.  The rows are called cases.  The columns are called  variables.
  • Cases can be all kinds of things, individuals, organizations, nations, etc.  In one dataset however, all cases have to be the same KIND of thing.  You can't throw information about a state or a company into a dataset that consists of people  All cases in a dataset must be the same kind of thing.
  • Each class, I will give some pointers, and then pass out a set of instructions.  Your job will be to follow the instructions as carefully as you can and answer any questions along the way.  I encourage you to go at your own pace.  If you finish early, I encourage you to help others!  Teaching others will help you reinforce your knowledge.
  • Homework will consist of reading from the book and assignments in Excel as well.
  • Feel free to stay as long as you need to in order to finish the assignment.  Although the class technically ends at 8:30, I am perfectly willing to stay and help students answer questions and resolve problems until 10:00 pm if necessary.  I doubt anyone will want to stay beyond that time.
  • There will be three quarter tests and a final exam.  Dates of the tests are listed on the syllabus.  Note also the percentage of the grade that is based on each of the tests, and on in-class assignments.  All of this information is listed on the syllabus. 
  • On the tests, you will be expected to manipulate data in order to get information and calculate figures.  All tests will be online (blackboard) and open book.  The point here is to build data analysis skills, rather than to memorize a set of terms or facts.
  • We talked about different kinds of variables: nominal, ordinal and interval.  Some variables are not ordered -- these are nominal.  Examples include names of people, places and things, and social security numbers.  Other variables have order but are not precisely measured.  These variables are called ordinal.  Examples include survey responses about how happy you are (very happy, somewhat happy, not so happy), rankings of people as 1st, 2nd, 3rd, etc.   If a variable tells you that someone is at a higher level than someone else, but not how much higher, than the variable is ordinal.  The third type of variable, interval, include those variables that are precisely measured and ordered.  Examples include money when measured in dollars, and age measured in years.  If its countable, its interval.
  • No order = nominal.
  • Order but imprecise & not countable = ordinal
  • Ordered precisely and countable = interval.