Main Content

Lesson 2: Data Source and Analysis

Video Tutorial

In the demonstration that follows, you will analyze two data series representing the U.S. domestic production of cheddar cheeses and Italian chesses to assess whether the two data series are related. Consider repeating this tutorial on your own as it will show you how to download and manipulate data from the USDA National Agricultural Statistics Services Quick Stats 2.0.

To complete the tutorial you can also access the file AGBM320_Lesson_2_complete.

Loading the player.

PROFESSOR: In this tutorial, we will download and manipulate some data series from publicly available sources. The data that we will be using would be accessed via the Quick Stats 2.0 database of the USDA National Agricultural Statistics Service. You will find the link here on the slides. Also, please open the file AGBM 320 underscore lesson 2 underscore complete, which contains the final output of all the data manipulation and analysis that we will perform during this tutorial.

So I have already accessed the Quick Stats 2.0 database. In order to unload the data we need to follow a series of steps. The first thing that we have to do is to select from the program menu the Survey Label. Why? Because we are going to look for data which are collected on an annual basis, even on a monthly basis. While the census of agriculture are the economic census only take place in certain specific years.

From the Sector menu you're going to select Animal and Products. Then from Group, Dairy. From the Commodity menu we are going to select Cheese. You will see that new menus appear. Every time you select a different commodity or a different group, this software will generate alternative menus for you to choose your variables from. In the Category menu we select Production.

We will focus on two products-- cheddar cheese, which here is label us Cheese American Cheddar, and then Italian cheese. In order to collect both data, we need to hold the Control button down. So if we push Control, then we can select multiple data entries at once. Once we have selected the two data items that we want to download from the domain, we select Total, which in this case is the only option available.

Then we need to select the geographic level at which we want to data. For this example we're going to use national data. So for the National Data that is the only option given to US total. Next thing that we need to select is the years. We will select years from 2005 to 2014. Last thing that we need to select is whether we want to download annual or monthly data. And we'll select Monthly.

Which months do we want to download? All of them, all the calendar months. At this point, we are ready to download the data, and we can click on Get The Data. You will see a new screen appearing. Once the information appears on the screen, let's just quickly assess whether the data that we wanted to download had actually been downloaded. In this case, we have the production of cheddar cheese [INAUDIBLE] production of Italian cheeses and we have-- it takes a while to load. And we have all the data that we were looking for.

The next step is to download the file. We're going to download it as a spreadsheet. As you open the file, download it as a spreadsheet, I would suggest that you save it in Excel format. And you can use any name that you want. I'm going to call it Cheese Data, and you're going save it as an Excel workbook.

We need to do quite a bit of data manipulation in order to have the data ready to be analyzed. The first thing that you are going to do is to eliminate columns, which in this specific case-- beware, not in all the cases-- but in this specific case are not useful for us. We would be left with the year period, the data item, which is-- the data item color, which contains the indication of the different types of products that we are analyzing, and last, the value.

In order to use this data though, we need to order them in a way that we can evaluate-- and this is the goal of this tutorial-- how much the production of both products have changed over time. So we need to go to Data and Sort. We need to sort the variables using three different levels. In the first level, we're going to do data item. In the second, we're going to use the year. In the third level, we're going to use the period, which is the month.

So, in the case of the month, we need to use some special way to rearrange the cells. Otherwise Excel will automatically arrange them either in ascending or descending alphabetical order. So let's click on Custom List and select the option that allows us to use the calendar months.

Let's make sure that the box My Data Has Headers has been checked. And then we'll see that the data is organized in a way that we're going to have first of all the cheddar cheese data and then the Italian cheese data. So let's see rename these columns. We can rename these Cheddar. We write in cell E1 Italian.

Since the data is already organized, we can copy the production of Italian cheese from the bottom of our database so that we have the production for the Italian and cheddar cheese for the same calendar months close together with one another. We can now delete all the remainder of the data.

At this point, we can also delete the Data Item column. And we are left with a database which is ready to be analyzed. The first thing that is advisable to do every time you're working with a new database is to create a graph and see how the database behaves. Let's select the data first, and then let's click select either a line or a scatter database. Now, scatter plot and lines allow us to work with the data in a very easy way if we want to isolate trends.

Let's work with the scatter plot database. Let's make sure that the software knows which ones are variables of interest, which ones are not. Now we see that we have the two data series that are already plotted over time. We can add labels. We can add a title to this chart.

Now, in order to save time, we're just going to look at the complete database and see how the final graph should look like. We have added a title for the axis, which you obtained from right clicking to the y-axis. From the format axis, we have also changed the range. And then we have, in the y-axis, we have also changed the range, and we also changed the display units. So that the display units appears in millions.

For each data series we have added trend lines. The trend lines are linear. And we wanted to display the equation for each one of the two trend lines. So as you calculate the trend lines yourself, select Linear and Display Equation in Chart. We have already done this for both the Italian and the cheddar cheese line.

So you should notice a few things in this graph. First, there is quite a bit of fluctuations for the two time series. There is a lot of seasonality. One of the things that you should keep in mind that there is one month in the data where the production will be lower by construction, which is February. We're going to see some seasonality in the first place because simply by looking at the data. In February it will include fewer production days.

But the trend lines we can see also that in spite of both productions to have increased considerably in the 10 years that we are analyzing, the production of Italian cheeses as increased on average monthly by 820,000 pounds per month, the one for cheddar cheese only by 100,000. Both markets are growing, but definitely the market for Italian cheeses is growing much faster.

The next thing that we need to look at when we look at data series is how the two databases are distributed. Excel gives us the option of running detailed data analysis using summary statistics or descriptive statistics. In order for you to have the data analysis tools installed in your Excel, you should perhaps look at whether or not you have it in your options. So, if you look at your Options and the Add-Ins, you should see that the Analysis Tool Pack is installed. Otherwise you need to select it and install it.

During the entire course, we will repeatedly use functions or tools which are contained in the Analysis Too Pack. So it would be a good idea if that was installed in your Excel. If you follow the instructions that are given to you, you could run descriptive statistics. In order to run descriptive statistics, the first thing that you need to do is select both data series.

Go to Data Analysis, go to Descriptive Statistics. You can set up your input range, again, if the software has not found it yet. Make sure that the software knows that you have selected columns that are labels in the first row and that we want summary statistics. We need these new output to be delivered in a new worksheet.

This was already done here in the complete file. So what we obtain are values of the mean, the standard error, this medians, the mode, and so forth. Now we can see that in both cases, the median and the mean are relatively close. This is a good indication that overall probably these two data serious might be close to normal.

We can see also that the variance of the cheddar cheese production is smaller than the one of the Italian, and the same for the standard deviation. That might be because the data series for Italian is larger in magnitude overall and also because there is larger variation as the upward trend is more marked. Another thing that you're asked to do is to calculate these two amounts, the mean plus 2 times the standard deviation, which was calculated, as you can see, and the mean minus 2 times the standard deviation. Here again, it was already calculated for you.

Why do we need to calculate these two measures? Because if the data are distributed normally, we will see that approximately 90% of the data points are within plus and minus two standard deviations from the mean. And let's compare the mean and the maximum with the mean plus 2 standard deviations, the mean minus 2 standard deviation. Well, the mean minus 2 standard deviation is slightly higher than the minimum for both the data series. The mean plus 2 standard deviation in the case of Italian is slightly lower than the maximum. In the case of the cheddar, it's instead slightly higher than the maximum. So if anything, the cheddar production data is slightly less likely to be distributed as a normal distribution, compared to the Italian cheeses data.

One last statistic so we want to evaluate is the correlation between the two data series. You can follow the instructions that are given to you. Simply look at the values that is reported in the AGBM 320 underscore lesson 2 underscore complete. In this case, the correlation between the two data series is less than 0.5, which means that the two data series are only weakly correlated from one another. Even though the two markets can be relate because we are looking at production of similar products, it seems like there might be different driving market forces. The last thing that we are going to do for these data sources is to use, again, the Data Analysis Toolkit to create histograms on each one of these two data series.

We're going to click on Data Analysis Histogram. Click on OK. Select the Input Range. We're going to do this for cheddar. Make sure that you check the box Labels and that you select both the Cumulative Percentage [INAUDIBLE] Output. Make sure that the graph and the output is delivered in a new worksheet. This will sort it out for you.

And we have the Histogram Cheddar and the Histogram Italian charts already ready for you to consult. We can look at them in the [INAUDIBLE] presentation. Overall we'll learn a few things about these two markets. The production of both cheddar cheese and Italian cheeses is increasing. The one for Italian cheeses is increasing much more than that of cheddar cheese. The data series show wide marked seasonality, and also the Italian cheese production data, given the evidence that we have found from the summary statistics and from the histogram, seems a lot more likely to be distributed as normal.

You can see that the distribution of the cheddar cheese production data is actually more skewed to the right, while the distribution of the Italian cheese production data is more concentrated towards the center. All the steps that we have taken just to have an need idea of our data serious behavior is something that you should keep in mind every time you download and analyze any data series, especially time series of data. 


Top of page