Lesson 2: Data Source and Analysis
Simple Statistical Analysis Measures
Once your data have been collected and downloaded, you need to understand how to use the tools necessary to analyze them. In the remainder of this lesson, I am going to illustrate some simple statistical and descriptive measures that will enable you to properly describe a series of data. Regression analysis and more on economic modeling will be covered in the next lessons.
Mean: Describes the “center” of the distribution of a data series; it represents what you can expect, on average, to observe if you randomly select one instance from the series of data.
To find the mean, let X be a data series made up of n observations (x_1, …, x_n), that is, n data points. The mean of the data series (also called the average of the data series, or the sample average) is then defined by

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
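As a minimal sketch of this definition, the mean can be computed by summing the observations and dividing by n. The data values below are hypothetical and chosen only for illustration; `statistics.mean` from the Python standard library should give the same result.

```python
from statistics import mean

# Hypothetical data series X with n = 5 observations (illustrative values only).
x = [2.0, 4.0, 4.0, 5.0, 10.0]

n = len(x)
manual_mean = sum(x) / n   # sum of the observations divided by n
library_mean = mean(x)     # the same quantity from the standard library

print(manual_mean, library_mean)
```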
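Median: Represents the middle point of a data series. If all the observations in the data series are sorted in either ascending or descending order, the median is the data point that separates the sorted series into two halves. When the number of observations is even, there is no single middle point, and the median is usually taken as the average of the two middle values.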
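The sorting idea behind the median can be sketched as follows. The helper name `middle_value` and the data are hypothetical, and averaging the two middle values for an even number of observations is a common convention assumed here rather than something stated in the lesson.

```python
from statistics import median

def middle_value(values):
    """Return the median by sorting and picking the middle point."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of observations: a single middle point exists.
        return ordered[mid]
    # Even number of observations: average the two middle values (assumed convention).
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_series = [7, 1, 3, 9, 5]   # sorted: 1, 3, 5, 7, 9
even_series = [7, 1, 3, 9]     # sorted: 1, 3, 7, 9

print(middle_value(odd_series), median(odd_series))
print(middle_value(even_series), median(even_series))
```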
The mean and the median are also called centrality measures of a data series. If they are close to one another, the data distribution is likely to be symmetric. If they are far apart, the distribution is likely to be skewed, often because a few extreme observations pull the mean away from the median.
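A small comparison can make this concrete. In the sketch below, both series are hypothetical: the first is evenly spread around its center, while the second replaces the largest value with an extreme one, which moves the mean but leaves the median unchanged.

```python
from statistics import mean, median

# Hypothetical series, chosen only to illustrate the comparison.
symmetric = [1, 2, 3, 4, 5]
skewed = [1, 2, 3, 4, 50]   # same series, with one extreme value

sym_gap = abs(mean(symmetric) - median(symmetric))    # mean and median coincide
skew_gap = abs(mean(skewed) - median(skewed))         # mean is pulled upward

print("symmetric:", mean(symmetric), median(symmetric), sym_gap)
print("skewed:   ", mean(skewed), median(skewed), skew_gap)
```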
Mode: The data value with the largest number of observations, that is, the value that occurs most often in the data series. For a data series that takes only a few distinct values, the mode is an important indicator because it shows which event recurs most frequently.
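One way to find the mode is to count how many times each value appears and keep the value with the highest count. The sketch below uses hypothetical data and the standard library's `collections.Counter` and `statistics.mode`.

```python
from collections import Counter
from statistics import mode

# Hypothetical series that takes only a few distinct values.
observations = [3, 1, 2, 3, 2, 3]

counts = Counter(observations)                      # value -> number of occurrences
most_common_value, occurrences = counts.most_common(1)[0]

print(most_common_value, occurrences, mode(observations))
```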
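Variance: A measure of the dispersion (or variability) of the data around their mean; it is calculated as the sum of the squared differences between the individual observations and the mean of the data series, divided by the number of observations minus 1:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

where $\bar{x}$ is the mean of the data series and n is the number of observations.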
When most observations are close to the mean, the distribution of the data is tighter, or less dispersed, and the variance is small. When many observations are far from the mean, the distribution of the data is more dispersed and the variance is large. Data points that lie far from the rest are called outliers, and they can make the variance very large.
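The variance calculation described above, with the divisor n minus 1, can be sketched in a few lines. The helper name `sample_variance` and both data series are hypothetical; the second series swaps one value for an outlier to show how strongly a single distant point can inflate the variance. `statistics.variance` uses the same n minus 1 divisor.

```python
from statistics import variance

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    avg = sum(values) / n
    squared_deviations = [(v - avg) ** 2 for v in values]
    return sum(squared_deviations) / (n - 1)

# Hypothetical data series (illustrative values only).
x = [2.0, 4.0, 4.0, 5.0, 10.0]
x_with_outlier = [2.0, 4.0, 4.0, 5.0, 100.0]   # last value replaced by an outlier

var_x = sample_variance(x)
var_outlier = sample_variance(x_with_outlier)

print(var_x, variance(x))
print(var_outlier, variance(x_with_outlier))
```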
Standard Deviation: Another measure of dispersion, which is calculated as the square root of the variance. One of the main advantages of using the standard deviation in place of the variance is that it is expressed in the same unit of measure as the data.
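As a brief illustration, the standard deviation can be obtained by taking the square root of the sample variance. The data are hypothetical; if they were measured in dollars, for instance, the variance would be in squared dollars while the standard deviation would again be in dollars.

```python
import math
from statistics import stdev, variance

# Hypothetical data series (illustrative values only).
x = [2.0, 4.0, 4.0, 5.0, 10.0]

var_x = variance(x)              # sample variance (divisor n - 1)
sd_from_var = math.sqrt(var_x)   # standard deviation as the square root of the variance
sd_library = stdev(x)            # the same quantity from the standard library

print(var_x, sd_from_var, sd_library)
```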
It is common to assess the dispersion of the data by considering how much of the data are “contained” within plus or minus (±) a certain number of standard deviations from the mean. It is customary to calculate intervals equal to one and two standard deviations above and below the mean. For many data series, the largest part of the observations falls within these intervals. One classical example is the normal distribution, for which about 68% of the observations lie within one standard deviation of the mean and about 95% lie within two.
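The share of observations inside such intervals can be checked directly. The sketch below is only illustrative: it simulates a normally distributed series with `random.gauss` (the mean of 50, standard deviation of 10, sample size, and seed are all arbitrary assumptions) and counts how many points fall within one and two standard deviations of the sample mean. The helper name `share_within` is hypothetical.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the simulated series is reproducible
data = [random.gauss(50.0, 10.0) for _ in range(10_000)]  # simulated, not real data

m = mean(data)
s = stdev(data)

def share_within(values, center, spread, k):
    """Fraction of values inside [center - k*spread, center + k*spread]."""
    lower, upper = center - k * spread, center + k * spread
    inside = sum(1 for v in values if lower <= v <= upper)
    return inside / len(values)

share_1sd = share_within(data, m, s, 1)
share_2sd = share_within(data, m, s, 2)

print(f"within 1 sd: {share_1sd:.3f}, within 2 sd: {share_2sd:.3f}")
```

For a real data series, replacing `data` with the observed values gives the same kind of summary; shares far from roughly 68% and 95% suggest the series is not close to normally distributed.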