If you have ever studied Statistics, you may recall the importance of summarizing data and looking at how values are distributed. This tutorial covers a handful of spreadsheet formulas that will allow you to perform basic Statistics functions. Along the way, I use real-life numbers and address why the formulas and functions used are meaningful.

The spreadsheet functions covered in this tutorial are: average(), median(), mode(), min(), max(), counta(), stdevp(), stdev(), percentile(), countif(), countifs(), and skew().

## Get exciting numeric data

This tutorial draws from this spreadsheet containing data on contributors to Jeb Bush’s gubernatorial campaigns in 1998 and 2002. I compiled it from data downloaded from the National Institute on Money in State Politics.

As I write this, it is a near-certainty that Jeb Bush will enter the 2016 race for U.S. President. Other than being from a famous family, he is known for having been Governor of Florida from 1999 to 2007.

This tutorial shows you how to summarize and analyze this contribution data with the goal of transforming it from mere data into useful and meaningful *information*.

## Where to start?

I usually approach data sets by reviewing how it is put together. I pay close attention to column titles and the values contained underneath. I often delete columns that are not necessary to my calculations or analysis so as to keep my data set as lean as possible.

The spreadsheet I created for this tutorial is the result of that process. And, having reviewed the data, I decided to focus on **contributors and the amounts donated**.

## Space, at the corner of No and Where

When analyzing numeric data, I typically start by focusing on, what I believe to be, an important variable and then finding various ways to summarize the entire set of corresponding values.

This tutorial uses records of contributions to Jeb Bush’s gubernatorial campaigns. I have decided that the single variable most important for study is *contribution totals*. I would like to summarize the entire set of contribution totals with one single number, *an average*.

### Centered

The number of contribution totals found in the data set is large. Let me next summarize these donations by finding the center of them all.

formula | result |
---|---|

=average(G2:G39482) | 485 |

=median(G2:G39482) | 500 |

=mode(G2:G39482) | 500 |

The functions *average()*, *median()*, and *mode()* each have the same argument – a range of cells. In this example it is *G2:G39482*, which is titled *Amount*, and is located in this spreadsheet.

### Size it up

I will next get a better sense of the size and extent of campaign contribution totals. Three similar spreadsheet functions will allow me to find the sample size and the range of values.

formula | result |
---|---|

=min(G2:G39482) | -500 |

=max(G2:G39482) | 4798785 |

The functions *min()* and *max()* each have only one argument – a range of cells. For this example, I used the range *G2:G39482*, which points to a column titled *Amount* found within this spreadsheet.

formula | result |
---|---|

=COUNTA(A2:A39482) | 39481 |

The range of cells in this example is *A2:A39482*. This range is in the column titled *Contributor:id* in this spreadsheet.

The idea behind using *counta()* in the example is this: since *Contributor:id* contains non-duplicate identification numbers that correspond with each contributor, I simply took a count of non-empty cells to find the total number of unique contributors.

#### What does the data look like so far?

I now know that 39,481 contributors gave Jeb Bush’s 1998 and 2002 gubernatorial campaigns donations totaling anywhere from -$500 to as much as $4,798,785. The median donation total was $500 and the mean was slightly lower at $485.

Note that the *center* of the set of contribution amounts is not located anywhere near the mid-point of the range -$500:$4,798,785, which would be somewhere around $2.4 million. Instead, I found that the set of 39,000+ donation totals centers around $500.

## What is *normal* anyway?

Do you remember the classic *bell curve*, also known as a *normal distribution curve*? It illustrates predicted observations of data, assuming the data has normal distribution.

The mean, median, and 50th percentile are the same in a normal distribution. It is around this central value that the bulk of observations are found.

In fact, 68.2% of observations within a normal bell curve are expected to fall within +/- 1 standard deviation (SD) from the mean. Ninety-five percent (95.4%) of observations are expected to fall within +/- 2 SD.

If you want to brush up on percentiles, standard deviation, or a normal distribution, I recommend StatisticalPractice.com.

formula | result |
---|---|

=stdevp(G2:G39482) | 24,150 |

The functions *stdev()* and *stdevp()* both use the same argument – a range of cells. In this example the range of cells is *G2:G39482*. The function *stdev()* is used when dealing with a sample, while *stdevp()* is used when dealing with a population.

As the table shows, the standard deviation for the contribution totals of the population of 39,000+ donors is +/-$24,150. This is a very large number, particularly when compared to the mean, which is $485. I can conclude that contribution totals to Jeb Bush’s 1998 and 2002 gubernatorial campaigns have a lot of variability.

If you think about it, the only way for the set of 39,000+ contribution totals to be distributed normally, with the mean and standard deviation being what they are, would be for them to be distributed within a remarkably *flat* bell curve.

SD | formula | SD, $ |
---|---|---|

-3 | =$J$18+(J53*$M$14) | -$71,952 |

-2 | =$J$18+(J54*$M$14) | -$47,802 |

-1 | =$J$18+(J55*$M$14) | -$23,651 |

0 | =$J$18+(J56*$M$14) | $500 |

1 | =$J$18+(J57*$M$14) | $24,651 |

2 | =$J$18+(J58*$M$14) | $48,802 |

3 | =$J$18+(J59*$M$14) | $72,952 |

The arguments in the example start with a reference to cell *J18*. It contains the median value *$500*. The second cell referenced in each argument, *J53,J54,… J59*, contains a number of desired standard deviations *-3,-2,…, 0,…, 3* as shown in the column titled *Standard deviation (SD)*. The final reference *M14* contains the standard deviation *$24,150*.

### How is it distributed?

I have found certain summary statistics for the set of contribution totals to Jeb Bush’s gubernatorial campaigns, including sample size, range, center, and degree of variability.

I know that campaign totals are highly variable, but let me next look at their distribution.

percentile | formula | result (max value) |
---|---|---|

1 | =PERCENTILE($G$2:$G$39482,0.01) | 5 |

5 | =PERCENTILE($G$2:$G$39482,0.05) | 20 |

10 | =PERCENTILE($G$2:$G$39482,0.1) | 25 |

20 | =PERCENTILE($G$2:$G$39482,0.2) | 100 |

30 | =PERCENTILE($G$2:$G$39482,0.3) | 125 |

40 | =PERCENTILE($G$2:$G$39482,0.4) | 250 |

50 | =PERCENTILE($G$2:$G$39482,0.5) | 500 |

60 | =PERCENTILE($G$2:$G$39482,0.6) | 500 |

70 | =PERCENTILE($G$2:$G$39482,0.7) | 500 |

80 | =PERCENTILE($G$2:$G$39482,0.8) | 500 |

90 | =PERCENTILE($G$2:$G$39482,0.9) | 500 |

95 | =PERCENTILE($G$2:$G$39482,0.95) | 750 |

99 | =PERCENTILE($G$2:$G$39482,0.99) | 1000 |

The function *percentile()*, has two arguments. First is the range of cells. The second is the desired percentile itself, as a percent in decimal format. For example, even though you may want the 40th percentile, you would enter *0.4* rather than *40* as the second argument for *percentile()*.

Let me review the first few percentiles listed in the table.

- At the first percentile, the bottom 1% of the set, contribution totals are less than or equal to $5.
- At the fifth percentile, the lowest 5% of the data set’s values, totals are $20 or less.
- At the tenth percentile, the bottom 10% of observed contribution totals from 39,000+ donors, the maximum is $25.

By calculating the maximum values for the percentiles in the example above, I created classes or categories, ideal for graphing. What remains is taking a count of the number of contribution totals that fall within each class.

percentile | formula | count |
---|---|---|

1 | =COUNTIF($G$2:$G$39482,”<=5″) | 462 |

5 | =COUNTIFS($G$2:$G$39482,”>5″,$G$2:$G$39482,”<=20″) | 1614 |

10 | =COUNTIFS($G$2:$G$39482,”>20″,$G$2:$G$39482,”<=25″) | 2364 |

20 | =COUNTIFS($G$2:$G$39482,”>25″,$G$2:$G$39482,”<=100″) | 7257 |

30 | =COUNTIFS($G$2:$G$39482,”>100″,$G$2:$G$39482,”<=125″) | 259 |

40 | =COUNTIFS($G$2:$G$39482,”>125″,$G$2:$G$39482,”<=250″) | 4018 |

50 | =COUNTIFS($G$2:$G$39482,”>250″,$G$2:$G$39482,”<=500″) | 21308 |

60 | =COUNTIFS($G$2:$G$39482,”>500″,$G$2:$G$39482,”<=500″) | 0 |

70 | =COUNTIFS($G$2:$G$39482,”>500″,$G$2:$G$39482,”<=500″) | 0 |

80 | =COUNTIFS($G$2:$G$39482,”>500″,$G$2:$G$39482,”<=500″) | 0 |

90 | =COUNTIFS($G$2:$G$39482,”>500″,$G$2:$G$39482,”<=500″) | 0 |

95 | =COUNTIFS($G$2:$G$39482,”>500″,$G$2:$G$39482,”<=750″) | 328 |

99 | =COUNTIFS($G$2:$G$39482,”>750″,$G$2:$G$39482,”<=1000″) | 1665 |

The function *countif()*, has two arguments. First is the range of cells from which you would like to count something. The range of cells in the example is *G2:G39482*. The second argument is the criteria, the something that you want to count. The function *countifs()* works much the same way, except that you are not limited to a single criteria. Each criteria should be preceded by a range: *COUNTIFS(range1, criteria1, range2, criteria2, …)*.

Now that I have made a frequency table, a *histogram* is not far behind.

A quick conversion back into dollars may help remove a level of abstraction.

## Abnormal it is?

Let me start laying out a series of observations. A good method is to look for departures from the ideal normal distribution curve. A great resource, which I found very helpful for this section in particular, can be found at StatisticalPractice.com.

**Symmetry:** The distribution is not symmetrical. The tail on the right is much shorter than the left tail.

**Tail length:** I cannot say that the data is *short-* or *long-tailed*, since it is not distributed symmetrically. The left tail is certainly much longer than the one on the right.

**Outliers:** While not shown in the histograms, there is a single value that is far removed from other observations. This was found earlier to be $4,798,785. As a comparison, note the that the maximum value at the 99th percentile is $1,000.

**Modes:** The set of contribution totals does not seem to have a normal distribution with a single peak at the center. Instead, the data appears to have a *multimodal* distribution.

**Skew:** The data is *skewed to the right*. It has a *positive skew*.

formula | result |
---|---|

=SKEW($G$2:$G$39482) | 198.66 |

The function *skew()*, has one argument, a range of cells. The example uses *G2:G39482*. The result *198.66* confirms the earlier observation that the data of contribution sums made by over 39,000 donor to Jeb Bush’s 1998 and 2002 gubernatorial campaigns are skewed to the right.

## Putting it all together

The data used for this tutorial were contribution records for Jeb Bush’s 1998 and 2002 campaigns to be Governor of Florida. The sums given by 39,481 contributors to Mr. Bush’s campaigns ranged from -$500 to as much as $4,798,785. The most common sum was $500; it was also the median. The mean was slightly lower at $485.

Contribution totals to Jeb Bush’s gubernatorial campaigns were highly variable. The standard deviation was +/-$24,150.

The distribution of values is not symmetrical. There appear to be three distinct clusters at approximately $100, $500, and $1,000, although the median value of $500 is most predominant. The data, contribution totals, are skewed to the right. It is safe to say that it does not have a normal distribution.

One outlier stood out. It was $4,798,785 in donations made by the Florida Republican Party. This is over 9,000 times the size of the median sum of $500.

## What if?

If you look closely at the spreadsheet I created for this tutorial, it includes an outlier of $4.8 million, a contribution total of -$500, and over 80 contribution totals of $0. What if we excluded these from our analysis? What might the results look like then?

Well, I did exactly that. To start with, the median and mode remain unchanged, but the mean decreased to $364. The presence of a single contributor’s total of $4.8 million had so much influence on the mean that it was $485 when I included it in the set previously.

The minimum sum that a contributor donated was $1. In fact, 52 contributors each gave a total of $1 to Jeb Bush’s 1998 and 2002 campaigns. The maximum contributor total dropped to $11,077 once we excluded the outlier. This new maximum contributor total is 22 times the median sum of $500, compared to over 9,000 times previously.

Contribution totals to both of Jeb Bush’s gubernatorial campaigns were still highly variable, even after excluding outliers. The standard deviation was +/- $270, which is more than half of the median and almost 75% of the average.

The histogram showing frequency of contributions at various levels looks much the same as it did earlier – not very symmetrical and with clusters around $100, $500, and $1,000. Contribution totals are skewed to the right.

## What do you think?

Did this tutorial work for you? Is it clear as mud? Did you find errors? Do you want to add your own interpretations or observations of the data?

Show some love! Ask questions, leave a comment, or simply share this with someone you know would really benefit.

If you find yourself in a pinch, you should know that **I am available to help with spreadsheet projects**.