## Test will be Thursday 6/3/2010

6/1/2010

The test is postponed until Thursday, 6/3/2010.

The test will cover the following material:

1. Variables
2. Cases
3. Connection between variables and cases
4. Individual versus aggregate data
5. Level of measurement: nominal, ordinal and interval
6. Frequency distribution (what is it? what is it good for?)
7. Histogram (what is it? what does it tell you?)
8. Shape of a distribution: skewness & kurtosis - pp. 96-97.
9. Measures of central tendency: mean / median / mode -- How do you find them?
10. Why does the mean often differ from the median?
11. Sample versus population
12. Inferential statistics

As you have seen, you can take most of the world and divide it up into variables and cases. When the applied statistician thinks about data, he or she is usually thinking about a rectangular data matrix, with cases in the rows and variables in the columns. There are usually variable names in the first (or "header") row of the matrix.
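
To make the data matrix concrete, here is a minimal Python sketch. The three cases and their values are invented for illustration, and the helper name `read_data_matrix` is hypothetical. It reads a small table with a header row and treats each remaining row as one case.

```python
import csv
import io

# Hypothetical data: three people (cases) measured on three variables.
# The first row is the header row holding the variable names.
RAW = """gender,race,income
F,White,52000
M,Black,48000
F,Asian,61000
"""

def read_data_matrix(text):
    """Return (variable_names, cases) from CSV text that has a header row."""
    rows = list(csv.reader(io.StringIO(text)))
    variable_names = rows[0]  # the header row
    # Each case is a collection of attributes, one per variable.
    cases = [dict(zip(variable_names, row)) for row in rows[1:]]
    return variable_names, cases

variable_names, cases = read_data_matrix(RAW)
print(variable_names)        # the columns (variables)
print(len(cases))            # the number of rows (cases)
print(cases[0]["income"])    # one attribute of the first case
```

Each dictionary in `cases` is one row of the matrix, and each key is one column.
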
1. What is a variable?  I think it is best to think about variables as coins or dice.  A coin is like a variable with two categories or attributes.  The attributes of a coin are "heads" and "tails."  The six-sided die is like a variable with six categories or attributes.  The attributes of a six-sided die are the numbers 1, 2, 3, 4, 5, and 6.  But a variable can also have thousands of attributes; income is one example.
2. What is a case?  Cases are simply collections of attributes.  Cases can be individuals or aggregations of individuals.  If we had data on people, our variables might include gender, race, and income.  So each case would be a collection of an individual's gender, race and income.
3. What is the connection between variables and cases? The cases are the units you want to describe, and the variables are the attributes you record for each case. If you are interested in attributes of potential voters, then the cases are the potential voters. If you are interested in baseball player performance stats, then the cases are baseball players. If you are interested in the characteristics of nations, then the cases are nations.
4. What is the difference between individual data and aggregate data? You can have data on individuals or data on aggregates. Aggregates are simply collections of people or things. Aggregates may be baseball teams or states in the union. If the variable is a crime rate, then you are dealing with aggregate data. The cases in your dataset should be either all individual-level, or all aggregate-level. Don't mix cases of individuals with cases of aggregates.
5. How can you tell the level of measurement of a variable?  In this class, we considered three types of variables: nominal, ordinal and interval/ratio.  See the decision tree and examples.
6. What is a frequency distribution?  A frequency distribution is one way we can summarize a variable.  If a variable has many categories, we can choose a set of categories ("bins") within which to collapse our data.  We would choose a set of exhaustive and mutually exclusive categories, and count the number of cases in each category.  Then we would calculate the percentage of cases in each category.
7. What is a histogram?  A histogram is simply a bar chart representation of a frequency distribution.  You can look at a histogram to get information about the shape of a distribution.
8. What are skewness and kurtosis? These terms are defined on pages 96 and 97 of the book.  They are different ways of describing the shape of the distribution of a variable.  Skewness has to do with the lack of symmetry.  Kurtosis has to do with how flat or peaked a distribution appears.
9. What are the measures of central tendency, and how are they calculated?  There are three commonly used measures of central tendency: mean, median and mode.  The average or mean of a variable is equal to the sum of all of the values in the variable divided by the number of cases.  The median is obtained by sorting the variable and finding the middle number.  If there is an odd number of cases, this is easy.  If there is an even number of cases, then there are two middle numbers, and the median is simply the average of these two numbers.  Finally, the mode is equal to the most common value.  In practice, we can use the Excel functions =AVERAGE, =MEDIAN, and =MODE to obtain these measures of central tendency.  See demonstrations of these commands in the spreadsheet here.
10. Why does the mean often differ from the median?  Click on the link above for a demonstration on NY Mets salary data.  Why is the median lower than the mean in this case?
11. Statistically speaking, what is a population and what is a sample? In simple terms, a sample is the cases we have, and the population is the group of cases we wish we had.  The population is the group that we want to (and can) generalize about.  Sometimes we have the entire population in our sample.  But survey researchers are rarely able to sample an entire population.  Instead, survey researchers use samples of populations.
12. What are inferential statistics? Inferential statistics is the branch of statistics that is used when our sample is smaller than the population we wish to generalize about.  The word 'inferential' is used because we are inferring from the sample to the population.
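
For review, the frequency distribution (item 6) and the measures of central tendency (items 9 and 10) can be sketched in a few lines of Python instead of Excel. This is only an illustration under stated assumptions: the salary figures are invented (they are not the NY Mets data), the bin boundaries are arbitrary, and `frequency_distribution` is a hypothetical helper. The `mean`, `median`, and `mode` functions come from Python's standard `statistics` module.

```python
from collections import Counter
from statistics import mean, median, mode

# Hypothetical salaries in thousands of dollars. One very large value
# gives the distribution a long right tail (positive skew).
salaries = [400, 400, 450, 500, 600, 900, 12000]

def frequency_distribution(values, bins):
    """Count the cases, and the percentage of cases, in each bin.

    Each bin is a (low, high) pair: low is included, high is excluded.
    The bins are assumed to be exhaustive and mutually exclusive.
    """
    counts = Counter()
    for v in values:
        for low, high in bins:
            if low <= v < high:
                counts[(low, high)] += 1
                break
    n = len(values)
    return [(b, counts[b], 100.0 * counts[b] / n) for b in bins]

bins = [(0, 500), (500, 1000), (1000, 20000)]
table = frequency_distribution(salaries, bins)
for (low, high), count, percent in table:
    print(f"{low}-{high}: {count} cases ({percent:.1f}%)")

print("mean:  ", mean(salaries))    # sum of the values / number of cases
print("median:", median(salaries))  # middle value after sorting
print("mode:  ", mode(salaries))    # most common value
```

In this made-up example, the single very large salary pulls the mean far above the median, while the median stays at the middle value of the sorted list.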