When SAS reads the DATA statement at the beginning of each iteration of the DATA step, SAS places missing values in the program data vector for variables that were assigned by either an INPUT statement or an assignment statement within the DATA step. A RETAIN statement effectively overrides this default. That is, a RETAIN statement tells SAS not to set variables whose values are assigned by an INPUT or assignment statement to missing when going from the current iteration of the DATA step to the next. Instead, SAS retains the values. The RETAIN statement takes the generic form:
RETAIN variable1 variable2 ... variablen;
You can specify as few or as many variables as you want. If you specify no variable names, then SAS retains the values of all of the variables created in an INPUT or assignment statement. You may initialize the values of variables within a RETAIN statement. For example, in the statement:
RETAIN var1 0 var2 3 a b c 'XYZ'
the variable var1 is assigned the value 0; the variable var2 is assigned the value 3, and the variables a, b, and c are all assigned the character value 'XYZ'. If you do not specify an initial value, SAS sets the initial value of a variable to be retained to missing.
Note that it is redundant to name any of the following items in a RETAIN statement, since their values are automatically retained from one iteration of the DATA step to the next:
- variables read with a SET, MERGE, or UPDATE statement
- a variable whose value is assigned in a SUM statement
- variables created by the IN = option
Finally, since the RETAIN statement is not an executable statement, it can appear anywhere in the DATA step.
Example 17.5 Section
Throughout the remainder of the lesson, we will work with the grades data set that is created in the following DATA step:
DATA grades;
input idno 1-2 l_name $ 5-9 gtype $ 12-13 grade 15-17;
cards;
10 Smith E1 78
10 Smith E2 82
10 Smith E3 86
10 Smith E4 69
10 Smith P1 97
10 Smith F1 160
11 Simon E1 88
11 Simon E2 72
11 Simon E3 86
11 Simon E4 99
11 Simon P1 100
11 Simon F1 170
12 Jones E1 98
12 Jones E2 92
12 Jones E3 92
12 Jones E4 99
12 Jones P1 99
12 Jones F1 185
;
RUN;
PROC PRINT data = grades NOOBS;
title 'The grades data set';
RUN;
idno | l_name | gtype | grade |
---|---|---|---|
10 | Smith | E1 | 78 |
10 | Smith | E2 | 82 |
10 | Smith | E3 | 86 |
10 | Smith | E4 | 69 |
10 | Smith | P1 | 97 |
10 | Smith | F1 | 160 |
11 | Simon | E1 | 88 |
11 | Simon | E2 | 72 |
11 | Simon | E3 | 86 |
11 | Simon | E4 | 99 |
11 | Simon | P1 | 100 |
11 | Simon | F1 | 170 |
12 | Jones | E1 | 98 |
12 | Jones | E2 | 92 |
12 | Jones | E3 | 92 |
12 | Jones | E4 | 99 |
12 | Jones | P1 | 99 |
12 | Jones | F1 | 185 |
The grades data set is what we call a "subject- and grade-specific" data set. That is, there is one observation for each grade for each student. Students are identified by their id number (idno) and last name (l_name). The data set contains six different types of grades: exam 1 (E1), exam 2 (E2), exam 3 (E3), exam 4 (E4), each worth 100 points; one project (P1) worth 100 points; and a final exam (F1) worth 200 points. We'll suppose that the instructor agreed to drop the students' lowest exam grades (E1, E2, E3, E4) not including the final exam. Launch and run the SAS program so that we can work with the grades data set in the following examples. Review the output from the PRINT procedure to convince yourself that the data were properly read into the grades data set.
Example 17.6 Section
One of the most powerful uses of a RETAIN statement is to compare values across observations. The following program uses the RETAIN statement to compare values across observations, and in doing so determines each student's lowest grade for the four semester exams:
DATA exams;
set grades (where = (gtype in ('E1', 'E2', 'E3', 'E4')));
RUN;
DATA lowest (rename = (lowtype = gtype));
set exams;
by idno;
retain lowgrade lowtype;
if first.idno then lowgrade = grade;
lowgrade = min(lowgrade, grade);
if grade = lowgrade then lowtype = gtype;
if last.idno then output;
drop gtype;
RUN;
PROC PRINT data=lowest;
title 'Output Dataset: LOWEST';
RUN;
Note: In the upper right-hand corner of the code block you will have the option of copying ( ) the code to your clipboard or downloading ( ) the file to your computer.
DATA exams;
set grades (where = (gtype in ('E1', 'E2', 'E3', 'E4'))); *Because we’re not interested in grade types P1 and F1, don’t read them in;
RUN;
*The data step will find the lowest grade for each student and store those observations in the dataset 'grades';
DATA lowest (rename = (lowtype = gtype));
set exams;
by idno; *Process the exams dataset by idno;
retain lowgrade lowtype; *Retain lowgrade and lowtype as each student is processed;
if first.idno then lowgrade = grade; *The first grade for each student is the starts as the lowest;
lowgrade = min(lowgrade, grade); *Assign lowgrade the lower of the current grade and the lowest grade thus far;
if grade = lowgrade then lowtype = gtype; *If the current grade is the lowest, update lowtype;
if last.idno then output; *For the last observation for a student, output one observation containing the lowest grade;
drop gtype; *Drop gtype here and rename lowtype as gtype in the data statement;
RUN;
PROC PRINT data=lowest;
title 'Output Dataset: LOWEST';
RUN;
Obs | idno | l_name | grade | lowgrade | gtype |
---|---|---|---|---|---|
1 | 10 | Smith | 69 | 69 | E4 |
2 | 11 | Simon | 99 | 72 | E2 |
3 | 12 | Jones | 99 | 92 | E3 |
Because the instructor only wants to drop the lowest exam grade, the first DATA step tells SAS to create a data set called exams by selecting only the exam grades (E1, E2, E3, and E4) from the data set grades.
It's the second DATA step that is the meat of the program and the challenging one to understand. Because of that, we'll try to help you understand the code in three different ways. First, let's summarize the procedure. The DATA step searches through the exams data set for each subject ("by idno") and looks for the lowest grade ("min(lowgrade, grade)"). Because SAS would otherwise set the variables lowgrade and lowtype to missing for each new iteration, the RETAIN statement is used to keep track of the observation that contains the lowest grade. When SAS reads the last observation of the student ("last.idno") it outputs the data corresponding to the lowest exam type (lowtype) and grade (lowgrade) to the lowest data set. (Note that the statement "if last.idno then output;" effectively collapses multiple observations per student into one observation per student.) So that we can merge the lowest data set back into the grades data set, by idno and gtype, the variable lowtype is renamed back to gtype.
Now, let's dive in a bit deeper by investigating how SAS would process the exams data set. As you read through what follows, you'll want to refer to both the DATA step code and the exams data set (which is the same as the grades data set minus the P1 and F1 observations). As always, at the conclusion of the compile phase, SAS makes the program data vector. In this case, it contains the automatic variables (_N_ and _ERROR_), the four variables in the exams data set (idno, l_name, gtype, and grade), two variables defined within the DATA step (lowgrade and lowtype), and as a result of the BY statement, a first.idno and a last.idno variable. Here's what the program data vector looks like at the beginning of the first iteration of the DATA step:
_N_ | _ERROR_ | idno | l_name | gtype | grade | lowgrade | lowtype | first.idno | last.idno |
---|---|---|---|---|---|---|---|---|---|
1 | 0 | . | . | . | . | . |
SAS reads the first observation from the exams data set. The observation is the first in the group of id numbers that equal 10, therefore first.idno is assigned the value of 1 and last.idno is assigned a value of 0. Because first.idno equals 1, the lowgrade variable is assigned the same value as that of the grade variable, that is, 78. The lowgrade variable is then assigned the smallest value of the lowgrade and grade variables. Since both values are 78, the value of the lowgrade variable remains unchanged. Because grade equals lowgrade (they are both 78), SAS assigns the lowtype variable the same value as that of the gtype variable, that is, E1. Here's what the program data vector looks like now:
_N_ | _ERROR_ | idno | l_name | gtype | grade | lowgrade | lowtype | first.idno | last.idno |
---|---|---|---|---|---|---|---|---|---|
1 | 0 | 10 | Smith | E1 | 78 | 78 | E1 | 1 | 0 |
Since last.idno does not equal 1, SAS does not write the contents of the program data vector to the lowest data set. Instead, SAS returns to the top of the DATA step to begin processing the second observation. Typically, SAS would reset the values of the variables created within the DATA step, that is, lowgrade and lowtype, to missing. The RETAIN statement overrides that default and the values of those variables from the previous iteration are retained. SAS reads the second observation from the exams data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore first.idno and last.idno are both assigned a value of 0. The lowgrade variable is then assigned the smallest value of the lowgrade and grade variables. Since grade now equals 82 and lowgrade equals 78 from the previous iteration, the value of the lowgrade variable remains 78. Because grade does not equal lowgrade now, the value of the lowtype variable is not changed. Instead, it remains E1. Here's what the program data vector looks like now:
_N_ | _ERROR_ | idno | l_name | gtype | grade | lowgrade | lowtype | first.idno | last.idno |
---|---|---|---|---|---|---|---|---|---|
2 | 0 | 10 | Smith | E2 | 82 | 78 | E1 | 0 | 0 |
Again, since last.idno still does not equal 1, SAS does not write the contents of the program data vector to the lowest data set. Instead, SAS returns to the top of the DATA step to begin processing the third observation. Again, the values of lowgrade and lowtype are retained from the previous iteration. SAS reads the third observation from the exams data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore first.idno and last.idno are both assigned a value of 0. The lowgrade variable is then assigned the smallest value of the lowgrade and grade variables. Since grade now equals 86 and lowgrade still equals 78 from the previous iteration, the value of the lowgrade variable still remains 78. Because grade still does not equal lowgrade now, the value of the lowtype variable is not changed. Instead, it remains E1. Here's what the program data vector looks like now:
_N_ | _ERROR_ | idno | l_name | gtype | grade | lowgrade | lowtype | first.idno | last.idno |
---|---|---|---|---|---|---|---|---|---|
3 | 0 | 10 | smith | E3 | 86 | 78 | E1 | 0 | 0 |
Again, since last.idno still does not equal 1, SAS does not write the contents of the program data vector to the lowest data set. Instead, SAS returns to the top of the DATA step to begin processing the fourth observation. Again, the values of lowgrade and lowtype are retained from the previous iteration. SAS reads the fourth observation from the exams data set. The observation is the last in the group of id numbers that equal 10, therefore last.idno is assigned the value of 1 and first.idno is assigned a value of 0. The lowgrade variable is then assigned the smallest value of the lowgrade and grade variables. Since grade now equals 69 and lowgrade still equals 78 from the previous iteration, the value of the lowgrade variable is updated to 69. Because grade equals lowgrade (they are both 69), SAS assigns the lowtype variable the same value as that of the gtype variable, that is, E4. Here's what the program data vector looks like now:
_N_ | _ERROR_ | idno | l_name | gtype | grade | lowgrade | lowtype | first.idno | last.idno |
---|---|---|---|---|---|---|---|---|---|
4 | 0 | 10 | Smith | E4 | 69 | 69 | E4 | 0 | 1 |
Now, since last.idno equals 1, SAS writes the contents of the program data vector to the lowest data set. In doing so, SAS does not write the automatic variables _N_ and _ERROR, nor the first.idno and last.idno variables, to the data set. As instructed by the code, SAS drops the gtype variable and renames the lowtype variable to gtype. So, here's what the lowest data set looks like after processing the first four observations:
idno | l_name | grade | lowgrade | gtype |
---|---|---|---|---|
10 | Smith | 69 | 69 | E4 |
Here's where it should be clear that SAS has, by virtue of the code we've written, effectively taken four observations and collapsed them into one observation.
There's just one more thing you might want to do to help you understand that second DATA step. That is, click on the Inspect! button below the code to see an explanation of each line in the DATA step. Then, launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the lowest data set contains the lowest exam grade for each student. Also, note that the lowest data set contains one observation per student rather than six observations per student as in the original grades data set.