# 17.2 - The RETAIN Statement

When SAS reads the DATA statement at the beginning of each iteration of the DATA step, SAS places missing values in the program data vector for variables that were assigned by either an INPUT statement or an assignment statement within the DATA step. A RETAIN statement effectively overrides this default. That is, a RETAIN statement tells SAS not to set variables whose values are assigned by an INPUT or assignment statement to missing when going from the current iteration of the DATA step to the next. Instead, SAS retains the values. The RETAIN statement takes the generic form:

RETAIN variable1 variable2 ... variablen;

You can specify as few or as many variables as you want. If you specify no variable names, then SAS retains the values of all of the variables created in an INPUT or assignment statement. You may initialize the values of variables within a RETAIN statement. For example, in the statement:

RETAIN var1 0 var2 3 a b c 'XYZ';

the variable var1 is assigned the value 0; the variable var2 is assigned the value 3, and the variables a, b, and c are all assigned the character value 'XYZ'. If you do not specify an initial value, SAS sets the initial value of a variable to be retained to missing.
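
As a minimal sketch of initial values in action, the following step uses the RETAIN statement above together with a running total. The data set name retain_demo, the variable x, and the three data values are made up for illustration:

```sas
DATA retain_demo;
    /* var1 starts at 0, var2 at 3, and a, b, and c each start at 'XYZ' */
    retain var1 0 var2 3 a b c 'XYZ';
    input x;
    /* var1 is retained, so this accumulates a running total of x */
    var1 = var1 + x;
    datalines;
1
2
3
;
RUN;

PROC PRINT data = retain_demo NOOBS;
RUN;
```

Because var1 keeps its value from one iteration to the next, it should take the values 1, 3, and 6 across the three observations, while var2, a, b, and c keep their initial values throughout.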

Note that it is redundant to name any of the following items in a RETAIN statement, since their values are automatically retained from one iteration of the DATA step to the next:

• variables read with a SET, MERGE, or UPDATE statement
• a variable whose value is assigned in a SUM statement
• variables created by the IN = option
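
To illustrate the second item, here is a small sketch (the data set sum_demo and its variables are hypothetical) in which a sum statement and an explicit RETAIN with the SUM function build the same running total:

```sas
DATA sum_demo;
    input x;
    /* Sum statement: total1 is initialized to 0 and retained automatically */
    total1 + x;
    /* The same accumulation written out with an explicit RETAIN */
    retain total2 0;
    total2 = sum(total2, x);
    datalines;
4
5
6
;
RUN;

PROC PRINT data = sum_demo NOOBS;
RUN;
```

The two totals should agree on every observation, which is why naming total1 in a RETAIN statement would be redundant.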

Finally, since the RETAIN statement is not an executable statement, it can appear anywhere in the DATA step.
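
The following sketch shows both points at once: what RETAIN changes, and that its position in the step does not matter. The data set first_value and the variables x and first_x are invented for this illustration:

```sas
DATA first_value;
    input x;
    /* first_x is assigned only on the first iteration */
    if _N_ = 1 then first_x = x;
    /* RETAIN appears after the assignment; it still applies to the whole step */
    retain first_x;
    datalines;
10
20
30
;
RUN;

PROC PRINT data = first_value NOOBS;
RUN;
```

With the RETAIN statement, first_x should equal 10 on all three observations. If you remove it, SAS resets first_x to missing at the start of the second and third iterations, so only the first observation has a value.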

## Example 17.5

Throughout the remainder of the lesson, we will work with the grades data set that is created in the following DATA step:

DATA grades;
input idno 1-2 l_name $ 5-9 gtype $ 12-13 grade 15-17;
cards;
10  Smith  E1  78
10  Smith  E2  82
10  Smith  E3  86
10  Smith  E4  69
10  Smith  P1  97
10  Smith  F1 160
11  Simon  E1  88
11  Simon  E2  72
11  Simon  E3  86
11  Simon  E4  99
11  Simon  P1 100
11  Simon  F1 170
12  Jones  E1  98
12  Jones  E2  92
12  Jones  E3  92
12  Jones  E4  99
12  Jones  P1  99
12  Jones  F1 185
;
RUN;

PROC PRINT data = grades NOOBS;
RUN;

The grades data set is what we call a "subject- and grade-specific" data set. That is, there is one observation for each grade for each student. Students are identified by their id number (idno) and last name (l_name). The data set contains six different types of grades: four exams (E1, E2, E3, and E4), each worth 100 points; one project (P1) worth 100 points; and a final exam (F1) worth 200 points. We'll suppose that the instructor agreed to drop each student's lowest grade among the four exams (E1, E2, E3, E4), not including the final exam. Launch and run the SAS program so that we can work with the grades data set in the following examples. Review the output from the PRINT procedure to convince yourself that the data were properly read into the grades data set.
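
Besides reading the printed listing, one quick way to check the structure described above is a cross-tabulation of students by grade type. This is only a suggested check, not part of the original program:

```sas
/* Assumes the grades data set from the DATA step above has been created */
PROC FREQ data = grades;
    /* Cross-tabulate students by grade type; each cell should contain 1 */
    tables idno * gtype / norow nocol nopercent;
RUN;
```

If the data were read correctly, each of the three students should have exactly one observation for each of the six grade types.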

## Example 17.6

One of the most powerful uses of a RETAIN statement is to compare values across observations. The following program does exactly that to determine each student's lowest grade on the four semester exams:

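DATA exams;
set grades (where = (gtype in ('E1', 'E2', 'E3', 'E4')));
RUN;

DATA lowest (rename = (lowtype = gtype));
set exams;
by idno;
/* Keep lowgrade and lowtype from one observation to the next */
retain lowgrade lowtype;
/* Start each student's comparison with the first exam grade */
if first.idno then lowgrade = grade;
lowgrade = min(lowgrade, grade);
/* Record which exam produced the current lowest grade */
if grade = lowgrade then lowtype = gtype;
if last.idno then output;
drop gtype;
RUN;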

PROC PRINT data=lowest;
title 'Output Dataset: LOWEST';
RUN;

Because the instructor only wants to drop the lowest exam grade, the first DATA step tells SAS to create a data set called exams by selecting only the exam grades (E1, E2, E3, and E4) from the data set grades.
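
The same subset could also be selected with a WHERE statement instead of the WHERE= data set option. This is an equivalent sketch, not the form used in the program above:

```sas
/* Assumes the grades data set created in Example 17.5 exists */
DATA exams;
    set grades;
    /* WHERE statement: filters observations as they are read from grades */
    where gtype in ('E1', 'E2', 'E3', 'E4');
RUN;
```

Either form keeps only the twelve exam observations, four for each of the three students.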

Now, let's dive in a bit deeper by investigating how SAS would process the exams data set. As you read through what follows, you'll want to refer to both the DATA step code and the exams data set (which is the same as the grades data set minus the P1 and F1 observations). As always, at the conclusion of the compile phase, SAS makes the program data vector. In this case, it contains the automatic variables (_N_ and _ERROR_), the four variables in the exams data set (idno, l_name, gtype, and grade), two variables defined within the DATA step (lowgrade and lowtype), and as a result of the BY statement, a first.idno and a last.idno variable. Here's what the program data vector looks like at the beginning of the first iteration of the DATA step:

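| _N_ | _ERROR_ | idno | l_name | gtype | grade | lowgrade | lowtype | first.idno | last.idno |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | . | | | . | . | | . | . |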

SAS reads the first observation from the exams data set. The observation is the first in the group of id numbers that equal 10, therefore first.idno is assigned the value of 1 and last.idno is assigned a value of 0. Because first.idno equals 1, the lowgrade variable is assigned the same value as that of the grade variable, that is, 78. The lowgrade variable is then assigned the smallest value of the lowgrade and grade variables. Since both values are 78, the value of the lowgrade variable remains unchanged. Because grade equals lowgrade (they are both 78), SAS assigns the lowtype variable the same value as that of the gtype variable, that is, E1. Here's what the program data vector looks like now:

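| _N_ | _ERROR_ | idno | l_name | gtype | grade | lowgrade | lowtype | first.idno | last.idno |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 10 | Smith | E1 | 78 | 78 | E1 | 1 | 0 |

SAS processes Smith's next three observations in the same way. Because lowgrade and lowtype are retained, they keep the values 78 and E1 through the second and third iterations, and change only on the fourth, where the grade of 69 is lower. Here's what the program data vector looks like at the end of the second, third, and fourth iterations:

| _N_ | _ERROR_ | idno | l_name | gtype | grade | lowgrade | lowtype | first.idno | last.idno |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 0 | 10 | Smith | E2 | 82 | 78 | E1 | 0 | 0 |
| 3 | 0 | 10 | Smith | E3 | 86 | 78 | E1 | 0 | 0 |
| 4 | 0 | 10 | Smith | E4 | 69 | 69 | E4 | 0 | 1 |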
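
If you would like to check these program data vector values yourself, one option is to write the whole vector to the log on every iteration with PUT _ALL_. The step below is a sketch that follows the statements described in the walk-through; it uses DATA _NULL_ so that no data set is created:

```sas
/* Assumes the exams data set created earlier in this example exists */
DATA _null_;
    set exams;
    by idno;
    retain lowgrade lowtype;
    if first.idno then lowgrade = grade;
    lowgrade = min(lowgrade, grade);
    if grade = lowgrade then lowtype = gtype;
    /* Write every variable, including _N_, _ERROR_, FIRST.idno, and LAST.idno, to the log */
    put _all_;
RUN;
```

The log line for the fourth observation should show a lowgrade of 69, a lowtype of E4, and LAST.idno equal to 1, in agreement with the fourth iteration shown above.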

Now, since last.idno equals 1, SAS writes the contents of the program data vector to the lowest data set. In doing so, SAS does not write the automatic variables _N_ and _ERROR_, nor the first.idno and last.idno variables, to the data set. As instructed by the code, SAS drops the gtype variable and renames the lowtype variable to gtype. So, here's what the lowest data set looks like after processing the first four observations: