There are many different ways to graphically display data. Simple displays like bar graphs, pie charts, and stem-and-leaf plots are commonplace. Sometimes, though, it is quicker and more informative to summarize data numerically rather than visually. Thus, we will talk about both methods for summarizing a dataset.
Parameters and Statistics
Parameters are values that describe a population, or a whole set of subjects. Statistics are values that describe a sample, or a subset of a population. A parameter is only known if we have a measurement for every subject in the population, so depending on how you’ve defined your population, you may never actually know the parameter’s value. We can, however, take a sample from that population, calculate the statistic, and then make inferences about what the population parameter would be. In general, parameters are represented by Greek letters, while statistics are usually Latin letters, often with a bar or caret on top.
Measures of center
The mean, or average, is easily calculated by taking the sum of all the observations and dividing by the total number of observations. The median is the middle value when the observations are put in order. When we’re working with a symmetric distribution, the mean and the median are identical. In a skewed distribution, however, the median is the better measure of center, because the mean gets pulled toward the long tail by extreme values.
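As a quick illustration with made-up numbers, a single extreme value drags the mean well away from the median:

```r
# Right-skewed toy data: one large outlier
x <- c(1, 2, 3, 4, 100)

mean(x)    # 22 -- pulled upward by the outlier
median(x)  # 3  -- resistant to the outlier
```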
Unfortunately, in inferential statistics the median is not easy to work with. The mean has some special properties, which we’ll discuss later, that make it easy to make inferences about. That’s one of the reasons we like working with symmetric distributions. If a variable is binary (Yes/No, 0/1, etc.), we can summarize it with a proportion, or percentage, which is simply the mean of the 0/1 values.
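For example, with hypothetical Yes/No responses coded as 1/0, the sample proportion is just the mean of the indicator values:

```r
# Five Yes/No responses coded as 1/0
yes_no <- c(1, 0, 1, 1, 0)

mean(yes_no)  # 0.6, i.e. 60% answered "Yes"
```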
Measures of spread
Knowing the center of a dataset is helpful, but it won’t describe the spread of the distribution. More often in statistics, we’ll look at variance and standard deviation to understand the spread of a dataset. Datasets with more spread will have larger variances, while datasets with little spread will have variances closer to zero. Taking the square root of the variance gives us the standard deviation. This is often preferred because standard deviation has a convenient interpretation: it is roughly the average distance of the observations from the mean. For instance, a standard deviation of 3 means that a typical observation sits about 3 units away from the mean.
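A small sketch, on toy data, of how variance, standard deviation, and sample size relate:

```r
x <- c(2, 4, 6, 8)

var(x)       # sample variance (R divides by n - 1, not n)
sd(x)        # standard deviation: the square root of the variance
length(x)    # sample size n = 4

# sd() is defined as the square root of var()
sqrt(var(x)) == sd(x)  # TRUE
```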
One final thing we need to talk about is sample size. Sample size is simply the number of subjects in our sample, and we represent it with n. If we were talking about a population size, we would represent that value with N, but this is rarely used because we rarely know the exact size of the population.
1. Numerical summaries
1) Mean: mean()
→ parameter(μ), statistic(x̄)
2) Median: median()
3) Range: range()
→ This function will give the min & max.
4) Variance: var()
→ parameter(σ²), statistic(s²)
5) Standard deviation: sd()
→ parameter(σ), statistic(s)
6) IQR: IQR()
7) Sample size: length()
* Proportions: parameter(p), statistic(p̂)
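Putting a couple of the functions above to work on made-up data (note that `range()` returns the two endpoints, not their difference):

```r
x <- c(3, 1, 4, 1, 5, 9, 2, 6)

range(x)   # c(1, 9): the min and max, not max minus min
IQR(x)     # interquartile range: 75th percentile minus 25th percentile
```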
2. Graphical summaries
1) Histogram: hist()
→ continuous variables
A. Probability distributions
a. r value (= correlation coefficient)
b. skewed right, skewed left, or symmetric
→ symmetric distributions: normal distribution (= Gaussian distribution), Student’s t-distribution
→ skewed distributions: Chi-squared distribution, F distribution
2) Box and Whisker plot: boxplot()
3) Contingency table: xtabs(~A+B)
* Contingency tables use qualitative data.
4) Scatterplot: plot(A,B)
→ two continuous variables
5) Quantile-Quantile plot (Q-Q plot): qqnorm()
→ normal distribution
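A sketch of how these plotting functions might be called, using R’s built-in `mtcars` dataset (the variable choices here are just for illustration):

```r
data(mtcars)

hist(mtcars$mpg)                  # histogram of one continuous variable
boxplot(mtcars$mpg)               # box-and-whisker plot
xtabs(~ cyl + am, data = mtcars)  # contingency table of two qualitative variables
plot(mtcars$wt, mtcars$mpg)       # scatterplot of two continuous variables
qqnorm(mtcars$mpg)                # Q-Q plot to assess normality
```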
3. Matching data to tests
1) Single sample methods
A. Dependent variable is continuous: most likely working with a mean and using a t-test
B. Dependent variable is binary: working with proportions and using a z-test
2) Two sample methods
A. Dependent variable is continuous: working with means and using a two-sample t-test
B. Dependent variable is binary: working with proportions and using a two-sample z-test
3) More than two samples (if dependent variable is continuous): ANOVA (analysis of variance)
4) Two categorical variables (= can use a contingency table): Chi-Squared test
5) Two continuous variables: a correlation test or a regression test
A. correlation test: measures the strength and direction of a linear relationship.
B. regression test: provides an estimated equation of a line that best fits the data.
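The matching above might look like this in R, again using the built-in `mtcars` dataset (the variables and the hypothesized values are purely illustrative; note that `prop.test()` is R’s standard stand-in for the one- and two-sample z-tests on proportions):

```r
data(mtcars)

t.test(mtcars$mpg, mu = 20)                      # single-sample t-test on a mean
prop.test(sum(mtcars$am), nrow(mtcars), p = 0.5) # single-sample test on a proportion
t.test(mpg ~ am, data = mtcars)                  # two-sample t-test
summary(aov(mpg ~ factor(cyl), data = mtcars))   # ANOVA: more than two samples
chisq.test(xtabs(~ cyl + am, data = mtcars))     # Chi-squared test on two categorical variables
cor.test(mtcars$wt, mtcars$mpg)                  # correlation test
lm(mpg ~ wt, data = mtcars)                      # regression: line of best fit
```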