3.3 - Anatomy of SAS programming for ANOVA

In Lesson 1 we ran a SAS program by copying and pasting it, just to get familiar with accessing SAS. Here we need to understand how a SAS program is set up to run the ANOVA. If you are having difficulty with basic SAS procedures, please refer back to the STAT 480 course series for assistance.

Here is the program we used in Lesson 1 for illustration:

SAS code for Lesson 1:

data lesson1;
input Fert $ Height;

The first line begins with the word ‘data’ and invokes the datastep. Notice that each SAS statement ends with a semi-colon. This is essential. In the datastep we assign a name to the dataset (here, lesson1) and define the variables we will use. Note that SAS assumes variables in the input statement are numeric, so if a variable holds alpha-numeric values (e.g. F1 or Control), we have to follow the name of that variable in the input statement with a “$” sign.

A simple way to input small datasets is shown in this code, wherein we embed the data in the program. This is done with the word “datalines” or (from older SAS versions) “cards”.


datalines;
Control      21
Control      19.5
Control      22.5
Control      21.5
Control      20.5
Control      21
F1      32
F1      30.5
F1      25
F1      27.5
F1      28
F1      28.6
F2      22.5
F2      26
F2      28
F2      27
F2      26.5
F2      25.2
F3      28
F3      27.5
F3      31
F3      29.5
F3      30
F3      29.2
;

The semi-colon on its own line after the last data value marks the end of the data and ends the datastep.
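As a compact illustration of the pieces described above, here is a minimal sketch of a complete datastep. The dataset name demo and the variables Group and Score are hypothetical, chosen only for illustration; they are not part of the Lesson 1 program.

```sas
/* Sketch only: demo, Group and Score are hypothetical names.        */
/* Group is followed by $ because it holds character values;         */
/* Score has no $ and is therefore read as numeric.                  */
data demo;
  input Group $ Score;
  datalines;
A 10.5
A 11.0
B 12.5
B 13.0
;

/* Print the dataset to confirm it was read in as intended. */
proc print data=demo;
run;
```

Each statement ends with a semi-colon, while the data lines themselves do not; the lone semi-colon after the last data line closes the block of data.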

SAS then produces output of interest using “proc” statements, short for “procedure”. You only need to use the first four letters, so SAS code is full of “proc” statements to do various tasks. Here we just want to print the data to be sure SAS read it in correctly.

proc print data=lesson1;
title 'Raw Data for Lesson 1';
run;

Notice that we specified the data to be used in the proc print command. This is an important habit to develop. Technically, if no dataset is named, SAS will use the most recently created dataset, but this can get you in trouble if you create output datasets or are working with multiple input datasets. I make it a point to always specify the dataset so I know which data each procedure is using.
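To make the point about the most recently created dataset concrete, here is a small sketch. The datasets a and b and their variables x and y are hypothetical and exist only for this illustration.

```sas
/* Sketch only: a, b, x and y are hypothetical names. */
data a;
  input x;
  datalines;
1
2
;

data b;
  input y;
  datalines;
10
20
;

/* No data= option: SAS uses the most recently created dataset, which is b. */
proc print;
run;

/* Explicit data= option: prints a, regardless of what was created last. */
proc print data=a;
run;
```

The first proc print would silently switch to a different dataset if another datastep were later added above it, which is the kind of surprise that naming the dataset avoids.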

We then used the summary procedure. This is a very powerful procedure in SAS, useful both for EDA (exploratory data analysis) and, in some instances, for outputting means and other statistics for further analysis. In the summary procedure we identify the categorical variables with the “class” statement. Any variable NOT listed in the class statement is treated as a continuous variable. The target variable to be summarized is specified in the “var” (for variable) statement.

The output statement creates an output dataset, and the out= part assigns a name of your choice to that dataset. The summary procedure can compute means, standard deviations, standard errors, minimums, maximums, and many other descriptive statistics. In an expression such as mean=, the keyword on the left of the equals sign requests the statistic, and the name on the right becomes the output column for that quantity. I named the means “mean” and the standard errors “se”.

The summary procedure will simply run and not produce any output that we can see, unless we add the “proc print” command. In the proc print command we can place a title on the output to help stay organized with large outputs.

proc summary data=lesson1;
class fert;
var height;
output out=output1 mean=mean stderr=se;
run;
proc print data=output1;
title 'Summary Output for Lesson 1';
run;
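As a variation on the program above, the following sketch requests a few more statistics in the output statement. The dataset name output2 and the column names n, sd, min and max are hypothetical choices. The nway option is not used in the Lesson 1 program; it is added here on the assumption that only the per-fertilizer rows are wanted.

```sas
/* Sketch only: output2 and the column names n, sd, min, max are hypothetical. */
/* nway keeps only the rows for each level of fert; without it the output      */
/* dataset also contains an overall row (marked by _TYPE_ = 0).                */
proc summary data=lesson1 nway;
  class fert;
  var height;
  output out=output2 n=n mean=mean std=sd stderr=se min=min max=max;
run;

proc print data=output2;
  title 'Extended Summary Output (sketch)';
run;

title;   /* clear the title so it does not carry over to later output */
```

The output dataset also carries two automatic columns, _TYPE_ and _FREQ_, alongside the statistics named in the output statement.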

I then added the statement “title; run;” at the end of the program to erase the title assignment. SAS has a very aggravating feature: the most recently assigned title stays in effect and labels all output thereafter. So this is a matter of controlling the title assignment process.

title; run;
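The persistence of titles can be seen in a short sketch like the one below, which assumes the lesson1 dataset from the program above already exists.

```sas
/* Sketch only: assumes lesson1 has already been created. */
title 'Raw Data for Lesson 1';
proc print data=lesson1;
run;                      /* printed under 'Raw Data for Lesson 1' */

proc print data=lesson1;
run;                      /* no new title given, so the same title is reused */

title;                    /* a title statement with no text cancels the title */
proc print data=lesson1;
run;                      /* printed without the earlier title */
```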