Your challenge is to take what we have done in class and apply it to datasets of your choosing.
Your task is to find data to produce the following three sets of analyses. You have conducted each of these analyses in class before according to detailed instructions.
Now, the challenge is to execute these analyses in a creative way. You can use data posted on this website or data online. But your analyses must be original. (In other words, don't do something that we have already done in class.)
1. Cross-tabulations with chi-square test
For this analysis, you should find a dataset that contains two nominal-level variables that might plausibly relate to each other. In other words, you should find one nominal-level variable that could plausibly cause another nominal-level variable. Remember that the cause is the independent variable and the effect is the dependent variable. (Example: gender causes someone's attitudes about marijuana.)
For hints on how to create cross-tabulations, look back at what we did with the ratemyprofessor.com data. We produced three sets of identical tables, examining ratings of professors on helpfulness by clarity, and helpfulness by easiness. I would like you to create tables similar to these, except this time, you will choose your own variables.
In a cross-tabulation, your independent variable should go in the columns, and your dependent variable should go in the rows. Remember this when creating the PivotTable.
Your output here should include three tables: (1) counts of the dependent variable by the independent variable, (2) (column) percentages of the dependent variable by the independent variable, and (3) expected counts. Finally, (4) use the =CHITEST function, which compares the actual and expected counts and returns the p-value of the chi-square test.
The chi-square test statistic compares the ACTUAL counts in the table with the counts we would expect if there was no relationship between the variables.
Finally, interpret your findings. Is there a relationship between the variables or not? Why or why not? (Speculate.)
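To make the arithmetic behind the expected counts and the chi-square statistic concrete, here is a minimal sketch in Python. The counts and the gender/attitude labels are hypothetical, not from any class dataset:

```python
# Sketch of the chi-square computation, using made-up counts.
# Rows = categories of the dependent variable, columns = the independent variable.
observed = [
    [30, 10],   # e.g. "approve":    30 women, 10 men  (hypothetical data)
    [20, 40],   # e.g. "disapprove": 20 women, 40 men
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell = (row total * column total) / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic = sum of (observed - expected)^2 / expected over all cells
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
print(round(chi_square, 2))
```

In Excel, =CHITEST(actual_range, expected_range) takes care of the last step for you and converts this statistic into a p-value.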
2. Regression & correlation
For this analysis, find a dataset that contains two ordered (ordinal or interval-level) variables that might relate to each other. You should have one variable that could be a cause or independent variable, and another variable that could be an effect or dependent variable.
Example from class: students' ratings of a professor's easiness (independent) and students' overall ratings of the professor (dependent)
Your output for this analysis should include:
(1) a scatterplot with the dependent variable on the Y-axis and the independent variable on the X-axis. Insert a trend (regression) line into the scatterplot. Then add an overall title above, and label the X-axis and the Y-axis. Delete the legend on the side, as it is not informative in this case.
(2) the slope of the regression line (the slope is m in the equation y = mx + b). Interpret the slope. What does it mean in your data?
(3) the Y-intercept of the regression line. The Y-intercept is b in the equation y = mx + b. Interpret the Y-intercept. What does it mean in this case?
(4) predicted values for each case in the dataset. Predicted values are obtained by multiplying the value of the independent variable by the slope and then adding the y-intercept. (predicted y = mx+b.)
(5) residuals for each case in the dataset. Calculate the residual, or the difference between the actual value of the dependent variable and the predicted value. The residual = y - predicted y. Then sort the data by the residuals. Are there cases with very high or very low residuals? What is going on with those cases with very high residuals and those cases with very low residuals?
(6) a correlation between Y and X. Use the =CORREL function. Interpret the correlation. There are two aspects of a correlation that require interpretation: (a) direction and (b) magnitude. Direction: is the correlation positive or negative, and what does this say about the relationship between the independent and dependent variables? Magnitude: Correlations farther from zero are higher in magnitude. High correlations indicate that variables are strongly related to each other. Low correlations indicate that variables are weakly related to each other. Correlations of about 0.1 or -0.1 are weak. Correlations of 0.7 and -0.7 or above are strong. Correlations of about 0.3 or -0.3 are moderate in size.
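As a sketch of steps (2) through (6) outside Excel, the following Python computes the slope, intercept, predicted values, residuals, and correlation from the usual least-squares formulas. The easiness and overall ratings below are invented for illustration:

```python
# Hypothetical easiness (x) and overall-quality (y) ratings for five professors
x = [2.0, 3.0, 3.5, 4.0, 4.5]
y = [2.5, 3.2, 3.4, 4.1, 4.3]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Slope m = covariance(x, y) / variance(x); intercept b = mean(y) - m * mean(x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
m = sxy / sxx
b = mean_y - m * mean_x

# Predicted values and residuals for each case (residual = y - predicted y)
predicted = [m * xi + b for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

# Pearson correlation, equivalent to Excel's =CORREL(y_range, x_range)
syy = sum((yi - mean_y) ** 2 for yi in y)
r = sxy / (sxx * syy) ** 0.5
```

Note that the residuals from a least-squares line always sum to (essentially) zero; what matters for step (5) is which individual cases sit far above or below the line.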
3. Comparison of means between two groups, accompanied by a t-test
For this analysis, you need to find a dataset that contains (1) an independent variable that is nominal-level, and (2) a dependent variable that is interval-level.
Example: gender (cause) and income (effect). We would look at average incomes for men and women.
You should create a PivotTable to calculate averages, standard deviations, and counts of an interval-level variable by the categories of a nominal-level variable.
Compare the averages and standard deviations across the groups. Use the =TTEST function to test whether the difference in averages between two groups is statistically significant. (By statistically significant, I mean that the difference in means between the groups is unlikely to have occurred through chance alone.)
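The comparison can be sketched in Python as well. The incomes below are made up, and the statistic computed here is the unequal-variance (Welch) t statistic; Excel's =TTEST returns the corresponding p-value rather than the statistic itself:

```python
import statistics

# Hypothetical incomes for two groups of a nominal variable (e.g. gender)
group_a = [42000, 51000, 38000, 60000, 45500]
group_b = [36000, 40000, 52000, 33500, 39000]

mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
sd_a, sd_b = statistics.stdev(group_a), statistics.stdev(group_b)  # sample SDs
n_a, n_b = len(group_a), len(group_b)

# Welch's t statistic: difference in means divided by its standard error.
# Excel's =TTEST(range1, range2, 2, 3) gives the two-tailed p-value for
# this unequal-variance test.
se = (sd_a**2 / n_a + sd_b**2 / n_b) ** 0.5
t = (mean_a - mean_b) / se
```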
Imagine a realtor is trying to decide how to price a home for a client. Realtors often base their decision on the prices of similar homes that sold recently. But what if the realtor has difficulty finding very similar homes?
If the realtor has data on the attributes and prices of recently sold homes, then she can use regression to estimate a price for her client.
In this example, price is the dependent variable. Features of the home such as size, year built, value of the land, cost of materials, and so on, would be independent variables.
The ideal regression for a realtor would have many more than two variables, representing many features of the homes. For our purposes, however, we will focus on just one variable at a time.
We are going to get more practice with regression using some data on condo sales in two buildings in New York City.
Today, we are going to see how we can use regression lines to summarize the relationship between two ordered variables. We are also going to see how you can use regression lines to predict values of one variable based on the values of another variable.
Specifically, in the case of the data from RateMyProfessor.com, we can predict professor's overall rating based on his or her easiness rating.
Today, we will be looking at data that I collected from the RateMyProfessor (RMP) website on professors at William Paterson University. I placed the data in an MS Access database. You can download the data from the links page.
The database includes two tables:
- population. This data comes from the list for William Paterson. Each row is a different professor. I removed names to protect the innocent (and the guilty). Variables include department, average quality (the average of all helpfulness and clarity ratings for a professor), and n (the number of people rating the professor).
- sample. This data comes from individual faculty pages on RMP. Because this was a time-intensive process, I did not include all professors here. Instead, I took a random sample of approximately 90 professors from the list of all professors in the population table. For each professor, I extracted data from up to 20 students on helpfulness, clarity, easiness, and interest. The definitions of these concepts are provided here.
What sorts of questions can you address with this data? We are going to start today by considering what questions might be addressed by analyzing this data.
One thing we are going to look at is Bayesian averages
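As a preview, one common form of the Bayesian average shrinks a professor's own mean rating toward the overall mean, with a weight that acts like extra "phantom" ratings. The numbers and the prior weight of 10 below are illustrative assumptions, not values from the RMP data:

```python
# Sketch of a Bayesian average (assumed form: shrink each professor's mean
# toward the overall mean, weighted by how many ratings the professor has)
def bayesian_average(ratings, prior_mean, prior_weight):
    """prior_weight acts like that many 'phantom' ratings at prior_mean."""
    n = len(ratings)
    return (prior_weight * prior_mean + sum(ratings)) / (prior_weight + n)

# A professor with only two (high) ratings is pulled toward the overall mean 3.5
print(bayesian_average([5.0, 5.0], prior_mean=3.5, prior_weight=10))
```

With many ratings, the Bayesian average approaches the professor's raw average; with few ratings, it stays close to the overall mean.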
I have decided to have you complete Quarter Test 2 on your own time as well: this weekend. The test will be posted on Blackboard by the end of the day tomorrow (Thursday) and will be due by next Monday. The test will cover some concepts (like the chi-square test). You will also have to find the answers to questions about data using Excel. I will talk more about the test tomorrow in class.
Today, we are going to look at trends in the salaries of Major League Baseball players between 1985 and 2009. We are going to extract data from the salary table as we did before.
When comparing $ figures over time, we are confronted with a problem: inflation. A dollar was worth more twenty years ago than a dollar today.
Whenever you compare $ figures over time, you need to adjust them for inflation. We will do this by using data from the Consumer Price Index. This data can be downloaded on the Links page.
We will need to adjust the salaries of baseball players to 2009 dollars in order to make comparisons across the period for which we have data: 1985-2009.
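The adjustment itself is a single ratio: multiply the nominal salary by the 2009 CPI divided by the CPI for the salary's year. A sketch in Python, where the CPI numbers are illustrative placeholders (use the actual figures from the downloaded CPI data):

```python
# Sketch of adjusting a salary to 2009 dollars using CPI values.
# These CPI levels are hypothetical placeholders, not the official figures.
cpi = {1985: 107.6, 2009: 214.5}

def to_2009_dollars(salary, year, cpi=cpi):
    # Adjusted salary = nominal salary * (CPI in 2009 / CPI in the salary's year)
    return salary * cpi[2009] / cpi[year]

print(round(to_2009_dollars(500000, 1985)))
```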
Today, we will start by interpreting the line graphs that we all created yesterday. Next, I want to reinforce the concept of expected frequencies, and the use of a chi-square test, which compares expected with actual frequencies. For this, we will use this data on birthdays from 1980 to 1994 from a life insurance company.
Finally, we will return to the baseball data. This time, we will work on baseball salaries.
The main point of this class is to enable you to answer questions about data.
Today, we are going to be using a baseball dataset to examine two questions:
1. Many kids dream of being professional ball players. Of course, raw skill and amount of practice are both going to play a large role in determining whether someone gets a chance to play professionally. But there's something else that matters: what month you were born. We will take a look at this.
2. How have baseball player salaries changed over the last 25 years? Has the average increased, and by how much? Has inequality in pay between players increased?
Download the following: Baseball database in MS Access.
You will also need to import the data below:
The test is postponed until Thursday, 6/3/2010.
The test will cover the following material:
- Connection between variables and cases
- Individual versus aggregate data
- Level of measurement: nominal, ordinal and interval
- Frequency distribution (what is it? what is it good for?)
- Histogram (what is it? what does it tell you?)
- Shape of a distribution: skewness & kurtosis - p. 96-97.
- Measures of central tendency: mean / median / mode -- How do you find them?
- Why does the mean often differ from the median?
- Sample versus population
- Inferential statistics
As you have seen, you can take most of the world and divide it up into variables and cases. When the applied statistician thinks about data, he or she is usually thinking about a rectangular data matrix, with cases in the rows and variables in the columns. There are usually variable names in the first (or "header") row of the matrix.
- What is a variable? I think it is best to think about variables as coins or dice. A coin is like a variable with two categories or attributes. The attributes of a coin are "heads" and "tails." The six-sided die is like a variable with six categories or attributes. The attributes of a six-sided die are the numbers 1, 2, 3, 4, 5, and 6. But you can have variables with thousands of attributes like income, for example.
- What is a case? Cases are simply collections of attributes. Cases can be individuals or aggregations of individuals. If we had data on people, our variables might include gender, race, and income. So each case would be a collection of an individual's gender, race and income.
- What is the connection between variables and cases? If you were interested in attributes of potential voters, then the cases would be the potential voters. If you are interested in baseball player performance stats, then the cases are baseball players. If you're interested in the characteristics of nations, then the cases would be nations.
- What is the difference between individual data and aggregate data? You can have data on individuals or data on aggregates. Aggregates are simply collections of people or things. Aggregates may be baseball teams or states in the union. If the variable is a crime rate, then you are dealing with aggregate data. The cases in your dataset should be either all individual-level, or all aggregate-level. Don't mix cases of individuals with cases of aggregates.
- How can you tell the level of measurement of a variable? In this class, we considered three types of variables: nominal, ordinal, and interval/ratio. How can you tell a variable's level of measurement? See the decision tree and examples.
- What is a frequency distribution? A frequency distribution is one way we can summarize a variable. If our variables have many categories, we can choose a set of categories ("bins") within which to collapse our data. We would choose a set of exhaustive and mutually exclusive categories, and count the number of cases in each category. Then we would calculate percentages in each category.
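A quick sketch of that counting step in Python, with invented ages and bins:

```python
from collections import Counter

# Hypothetical ages, collapsed into exhaustive, mutually exclusive bins
ages = [19, 22, 25, 31, 34, 35, 41, 47, 52, 58]
bins = {"18-29": range(18, 30), "30-44": range(30, 45), "45-64": range(45, 65)}

# Count the number of cases falling in each bin...
counts = Counter(
    label for age in ages for label, r in bins.items() if age in r
)
# ...then convert the counts to percentages of all cases
percentages = {label: 100 * counts[label] / len(ages) for label in bins}
print(counts, percentages)
```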
- What is a histogram? A histogram is simply a bar chart representation of a frequency distribution. You can look at a histogram to get information about the shape of a distribution.
- What is skewness and kurtosis? These terms are defined on pages 96 and 97 of the book. These are different ways of describing the shape of the distribution of a variable. Skewness has to do with the lack of symmetry. Kurtosis has to do with how flat or peaked a distribution appears.
- What are the measures of central tendency, and how are they calculated? There are three commonly used measures of central tendency: mean, median and mode. The average or mean of a variable is equal to the sum of all of the values in the variable divided by the number of cases. The median is obtained by sorting the variable, and finding the middle number. If there is an odd number of cases, this is easy. If there is an even number of cases, then there are two middle numbers. The median is simply the average of these two numbers. Finally, the mode is equal to the most common value. In practice, we can use the Excel functions =AVERAGE, =MEDIAN, and =MODE to obtain these measures of central tendency. See demonstrations of these commands in the spreadsheet here.
- Why does the mean often differ from the median? Click on the link above for a demonstration on NY Mets salary data. Why is the median lower than the mean in this case?
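Outside Excel, the same three measures (and the gap between the mean and the median) can be sketched with Python's statistics module. The salaries below are invented, not the Mets data:

```python
import statistics

# Hypothetical salary data; mirrors Excel's =AVERAGE, =MEDIAN, and =MODE
salaries = [400000, 400000, 550000, 700000, 5000000]

mean = statistics.mean(salaries)      # sum of values / number of cases
median = statistics.median(salaries)  # middle value after sorting
mode = statistics.mode(salaries)      # most common value

# One very large salary pulls the mean well above the median
print(mean, median, mode)
```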
- Statistically speaking, what is a population and what is a sample? In simple terms, a sample is the cases we have, and the population is the group of cases we wish we had. The population is the group that we want to (and can) generalize about. Sometimes we have the entire population in our sample. But survey researchers rarely are able to sample an entire population. Instead, survey researchers use samples of populations.
- What are inferential statistics? Inferential statistics is the branch of statistics that is used when our sample is smaller than the population we wish to generalize about. The word 'inferential' is used because we are inferring from the sample to the population.