Lesson 17: Using the OUTPUT and RETAIN statements

Overview

When processing any DATA step, SAS follows two default procedures:

When SAS reads the DATA statement at the beginning of each iteration of the DATA step, SAS places missing values in the program data vector for variables that were assigned by either an INPUT statement or an assignment statement within the DATA step. (SAS does not reset variables to missing if they were created by a SUM statement, or if the values came from a SAS data set via a SET or MERGE statement.)
At the end of the DATA step after completing an iteration of the DATA step, SAS outputs the values of the variables in the program data vector to the SAS data set being created.

In this lesson, we'll learn how to modify these default processes by using the OUTPUT and RETAIN statements:

The OUTPUT statement allows you to control when and to which data set you want an observation written.
The RETAIN statement causes a variable created in the DATA step to retain its value from the current observation into the next observation rather than being set to missing at the beginning of each iteration of the DATA step.

Objectives

Upon completion of this lesson, you should be able to:

Upon completing this lesson, you should be able to do the following:

write a RETAIN statement to tell SAS to retain the value of a variable from one iteration of the data step to the next
state which kind of variables SAS automatically retains
write a RETAIN statement to compare values across observations
program successfully with the RETAIN statement
write the "FIRST." and "LAST." variables in conjunction with an OUTPUT statement in order to collapse multiple observations in a data set into a single observation
write a SUM statement to accumulate totals across a set of observations
write a "LAST." variable in conjunction with BY-group processing, a RETAIN statement, and an OUTPUT statement in order to transpose a data set
write an OUTPUT statement to tell SAS to output the current observation when the output statement is processed
write an OUTPUT statement to write observations to multiple data sets
write an OUTPUT statement to control the output of observations to data sets based on certain conditions
recall that if you plan to use any OUTPUT statements in a DATA step, you must use OUTPUT statements to program all of the output for that step
recall that assignment statements must precede OUTPUT statements
write the today( ) function to determine today's date

17.1 - The OUTPUT Statement

An OUTPUT statement overrides the default process by telling SAS to output the current observation when the OUTPUT statement is processed — not at the end of the DATA step. The OUTPUT statement takes the form:

 OUTPUT dataset1 dataset2 ... datasetn;

where you may name as few or as many data sets as you like. If you use an OUTPUT statement without specifying a data set name, SAS writes the current observation to each of the data sets named in the DATA step. Any data set name appearing in the OUTPUT statement must also appear in the DATA statement.

The OUTPUT statement is pretty powerful in that, among other things, it gives us a way:

to write observations to multiple data sets
to control the output of observations to data sets based on certain conditions
to transpose datasets using the OUTPUT statement in conjunction with the RETAIN statement, BY group processing, and the LAST.variable statement.

Throughout the rest of this section, we'll look at examples that illustrate how to use OUTPUT statements correctly. We'll work with the following subset of the ICDB Study's log data set:

The icblog data set

SUBJ	V_TYPE	V_DATE	FORM
210006	12	05/06/94	cmed
210006	12	05/06/94	diet
210006	12	05/06/94	med
210006	12	05/06/94	phytrt
210006	12	05/06/94	purg
210006	12	05/06/94	qul
210006	12	05/06/94	sympts
210006	12	05/06/94	urn
210006	12	05/06/94	void
310032	24	09/19/95	backf
310032	24	09/19/95	cmed
310032	24	09/19/95	diet
310032	24	09/19/95	med
310032	24	09/19/95	medhxf
310032	24	09/19/95	phs
310032	24	09/19/95	phytrt
310032	24	09/19/95	preg
310032	24	09/19/95	purg
310032	24	09/19/95	qul
310032	24	09/19/95	sympts
310032	24	09/19/95	urn
310032	24	09/19/95	void
410010	6	05/12/94	cmed
410010	6	05/12/94	diet
410010	6	05/12/94	med
410010	6	05/12/94	phytrt
410010	6	05/12/94	purg
410010	6	05/12/94	qul
410010	6	05/12/94	sympts
410010	6	05/12/94	urn
410010	6	05/12/94	void

As you can see, this log data set contains four variables:

subj: the subject's identification number
v_type: the type of clinic visit, which means the number of months since the subject was first seen in the clinic
v_date: the date of the clinic visit
form: codes that indicate the data forms that were completed during the subject's clinic visit

The log data set is a rather typical data set that arises from large national clinical studies in which there are a number of sites around the country where data are collected. Typically, the clinical sites collect the data on data forms and then "ship" the data forms either electronically or by mail to a centralized location called a Data Coordinating Center (DCC). As you can well imagine, keeping track of the data forms at the DCC is a monumental task. For the ICDB Study, for example, the DCC received more than 68,000 data forms over the course of the study.

In order to keep track of the data forms that arrive at the DCC, they are "logged" into a database and subsequently tracked as they are processed at the DCC. In reality, a log database will contain many more variables than we have in our subset, such as the dates the data on the forms were entered into the database, who entered the data, the dates the entered data was verified, who verified the data, and so on. To keep our lives simple, we'll just use the four variables described above.

Example 17.1

This example uses the OUTPUT statement to tell SAS to write observations to data sets based on certain conditions. Specifically, the following program uses the OUTPUT statement to create three SAS data sets — s210006, s310032, and s410010 — based on whether the subject identification numbers in the icdblog data set meet a certain condition:

OPTIONS PS=58 LS=80 NODATE NONUMBER;
LIBNAME stat481 'C:\yourdrivename\Stat481WC\05retain\sasndata';

DATA s210006 s310032 s410010;
    set stat481.icdblog;
        if (subj = 210006) then output s210006;
    else if (subj = 310032) then output s310032;
    else if (subj = 410010) then output s410010;
RUN;
PROC PRINT data = s210006 NOOBS;
    title 'The s210006 data set';
RUN;
PROC PRINT data = s310032 NOOBS;
    title 'The s310032 data set';
RUN;
PROC PRINT                NOOBS;
    title 'The s410010 data set';
RUN;

The s210006 data set
SUBJ	V_TYPE	V_DATE	FORM
210006	12	05/06/94	cmed
210006	12	05/06/94	diet
210006	12	05/06/94	med
210006	12	05/06/94	phytrt
210006	12	05/06/94	purg
210006	12	05/06/94	qul
210006	12	05/06/94	sympts
210006	12	05/06/94	urn
210006	12	05/06/94	void

The s310032 data set
SUBJ	V_TYPE	V_DATE	FORM
310032	24	09/19/95	backf
310032	24	09/19/95	cmed
310032	24	09/19/95	diet
310032	24	09/19/95	med
310032	24	09/19/95	medhxf
310032	24	09/19/95	phs
310032	24	09/19/95	phytrt
310032	24	09/19/95	preg
310032	24	09/19/95	purg
310032	24	09/19/95	qul
310032	24	09/19/95	sympts
310032	24	09/19/95	urn
310032	24	09/19/95	void

The s410010 data set
SUBJ	V_TYPE	V_DATE	FORM
410010	6	05/12/94	cmed
410010	6	05/12/94	diet
410010	6	05/12/94	med
410010	6	05/12/94	phytrt
410010	6	05/12/94	purg
410010	6	05/12/94	qul
410010	6	05/12/94	sympts
410010	6	05/12/94	urn
410010	6	05/12/94	void

As you can see, the DATA statement contains three data set names — s210006, s310032, and s410010. That tells SAS that we want to create three data sets with the given names. The SET statement, of course, tells SAS to read observations from the permanent data set called stat481.icdblog. Then comes the IF-THEN-ELSE and OUTPUT statements that make it all work. The first IF-THEN tells SAS to output any observations pertaining to subject 210006 to the s210006 data set; the second IF-THEN tells SAS to output any observations pertaining to subject 310032 to the s310032 data set; and, the third IF-THEN statement tells SAS to output any observations pertaining to subject 410010 to the s410010 data set. SAS will hiccup if you have a data set name that appears in an OUTPUT statement without it also appearing in the DATA statement.

The PRINT procedures, of course, tell SAS to print the three newly created data sets. Note that the last PRINT procedure does not have a DATA= option. That's because when you name more than one data set in a single DATA statement, the last name on the DATA statement is the most recently created data set, and the one that subsequent procedures use by default. Therefore, the last PRINT procedure will print the s410010 data set by default.

Now, before launching and running the SAS program, right-click to save the icdblog data set to a convenient location on your computer. Then, launch the SAS program and edit the LIBNAME statement so that it reflects the location in which you saved the data set. Then, run the program and review the output from the PRINT procedures. You should see that, as expected, the data set s210006 contains data on subject 210006; the data set s310032 contains data on subject 310032; and s410010 contains data on subject 410010.

Incidentally, note that the IF-THEN-ELSE construct used here in conjunction with the OUTPUT statement is comparable to attaching the WHERE= option to each of the data sets appearing in the DATA statement.

Example 17.2

Using an OUTPUT statement suppresses the automatic output of observations at the end of the DATA step. Therefore, if you plan to use any OUTPUT statements in a DATA step, you must use OUTPUT statements to program all of the output for that step. The following SAS program illustrates what happens if you fail to direct all of the observations to output:

DATA subj210006 subj310032;
    set stat481.icdblog;
    if (subj = 210006) then output subj210006;
RUN;
PROC PRINT data = subj210006 NOOBS;
    title 'The subj210006 data set';
RUN;
PROC PRINT data = subj310032 NOOBS;
    title 'The subj310032 data set';
RUN;

The subj210006 data set
SUBJ	V_TYPE	V_DATE	FORM
210006	12	05/06/94	cmed
210006	12	05/06/94	diet
210006	12	05/06/94	med
210006	12	05/06/94	phytrt
210006	12	05/06/94	purg
210006	12	05/06/94	qul
210006	12	05/06/94	sympts
210006	12	05/06/94	urn
210006	12	05/06/94	void

The DATA statement contains two data set names, subj210006 and subj310032, telling SAS that we intend to create two data sets. However, as you can see, the IF statement contains an OUTPUT statement that directs output to the subj210006 data set, but no OUTPUT statement directs output to the subj310032 data set. Launch and run the SAS program to convince yourself that the subj210006 data set contains data for subject 210006, while the subj310032 data set contains 0 observations. You should see a message like this in the log window:

PROC PRINT data = subj310032 NOOBS;
       title 'The subj310032 data set';
RUN;
NOTE: No observations in data set WORK.SUBJ310032.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds

as well as see that no output for the subj310032 data set appears in the output window.

Example 17.3

If you use an assignment statement to create a new variable in a DATA step in the presence of OUTPUT statements, you have to make sure that you place the assignment statement before the OUTPUT statements. Otherwise, SAS will have already written the observation to the SAS data set, and the newly created variable will be set to missing. The following SAS program illustrates an example of how two variables, current and days_vis, get set to missing in the output data sets because their values get calculated after SAS has already written the observation to the SAS data set:

DATA subj210006 subj310032 subj410010;
    set stat481.icdblog;
        if (subj = 210006) then output subj210006;
    else if (subj = 310032) then output subj310032;
    else if (subj = 410010) then output subj410010;
    current = today();
    days_vis = current - v_date;
    format current mmddyy8.;
RUN;
PROC PRINT data = subj310032 NOOBS;
    title 'The subj310032 data set';
RUN;

The subj310032 data set
SUBJ	V_TYPE	V_DATE	FORM	current	days_vis
310032	24	09/19/95	backf	.	.
310032	24	09/19/95	cmed	.	.
310032	24	09/19/95	diet	.	.
310032	24	09/19/95	med	.	.
310032	24	09/19/95	medhxf	.	.
310032	24	09/19/95	phs	.	.
310032	24	09/19/95	phytrt	.	.
310032	24	09/19/95	preg	.	.
310032	24	09/19/95	purg	.	.
310032	24	09/19/95	qul	.	.
310032	24	09/19/95	sympts	.	.
310032	24	09/19/95	urn	.	.
310032	24	09/19/95	void	.	.

The main thing to note in this program is that the current and days_vis assignment statements appear after the IF-THEN-ELSE and OUTPUT statements. That means that each observation will be written to one of the three output data sets before the current and days_vis values are even calculated. Because SAS sets variables created in the DATA step as missing at the beginning of each iteration of the DATA step, the values of current and days_vis will remain missing for each observation.

By the way, the today( ) function, which is assigned to the variable current, creates a date variable containing today's date. Therefore, the variable days_vis is meant to contain the number of days since the subject's recorded visit v_date. However, as described above, the values of current and days_vis get set to missing. Launch and run the SAS program to convince yourself that the current and days_vis variables in the subj310032 data set contain only missing values. If we were to print the subj210006 and subj410020 data sets, we would see the same thing.

The following SAS program illustrates the corrected code for the previous DATA step, that is, for creating new variables with assignment statements in the presence of OUTPUT statements:

DATA subj210006 subj310032 subj410010;
    set stat481.icdblog;
    current = today();
    days_vis = current - v_date;
    format current mmddyy8.;
        if (subj = 210006) then output subj210006;
    else if (subj = 310032) then output subj310032;
    else if (subj = 410010) then output subj410010;
RUN;
PROC PRINT data = subj310032 NOOBS;
    title 'The subj310032 data set';
RUN;

The subj310032 data set
SUBJ	V_TYPE	V_DATE	FORM	current	days_vis
310032	24	09/19/95	backf	09/06/23	10214
310032	24	09/19/95	cmed	09/06/23	10214
310032	24	09/19/95	diet	09/06/23	10214
310032	24	09/19/95	med	09/06/23	10214
310032	24	09/19/95	medhxf	09/06/23	10214
310032	24	09/19/95	phs	09/06/23	10214
310032	24	09/19/95	phytrt	09/06/23	10214
310032	24	09/19/95	preg	09/06/23	10214
310032	24	09/19/95	purg	09/06/23	10214
310032	24	09/19/95	qul	09/06/23	10214
310032	24	09/19/95	sympts	09/06/23	10214
310032	24	09/19/95	urn	09/06/23	10214
310032	24	09/19/95	void	09/06/23	10214

Now, since the assignment statements precede the OUTPUT statements, the variables are correctly written to the output data sets. That is, now the variable current contains the date in which the program was run and the variable days_vis contains the number of days since that date and the date of the subject's visit. Launch and run the SAS program to convince yourself that the current and days_vis variables are properly written to the subj310032 data set. If we were to print the subj210006 and subj410020 data sets, we would see similar results.

Example 17.4

After SAS processes an OUTPUT statement within a DATA step, the observation remains in the program data vector and you can continue programming with it. You can even output the observation again to the same SAS data set or to a different one! The following SAS program illustrates how you can create different data sets with some of the same observations. That is, the data sets created in your DATA statement do not have to be mutually exclusive:

DATA symptoms visitsix;
    set stat481.icdblog;
    if form = 'sympts' then output symptoms;
    if v_type = 6 then output visitsix;
RUN;
PROC PRINT data = symptoms NOOBS;
    title 'The symptoms data set';
RUN;
PROC PRINT data = visitsix NOOBS;
    title 'The visitsix data set';
RUN;

The symptoms data set


SUBJ	V_TYPE	V_DATE	FORM
210006	12	05/06/94	sympts
310032	24	09/19/95	sympts
410010	6	05/12/94	sympts

The visitsix data set
SUBJ	V_TYPE	V_DATE	FORM
410010	6	05/12/94	cmed
410010	6	05/12/94	diet
410010	6	05/12/94	med
410010	6	05/12/94	phytrt
410010	6	05/12/94	purg
410010	6	05/12/94	qul
410010	6	05/12/94	sympts
410010	6	05/12/94	urn
410010	6	05/12/94	void

The DATA step creates two temporary data sets, symptoms and visitsix. The symptoms data set contains only those observations containing a form code of sympts. The visitsix data set, on the other hand, contains observations for which v_type equals 6. The observations in the two data sets are therefore not necessarily mutually exclusive. In fact, launch and run the SAS program and review the output from the PRINT procedures. Note that the observation for subject 410010 in which form = sympts is contained in both the symptoms and visitsix data sets.

17.2 - The RETAIN Statement

When SAS reads the DATA statement at the beginning of each iteration of the DATA step, SAS places missing values in the program data vector for variables that were assigned by either an INPUT statement or an assignment statement within the DATA step. A RETAIN statement effectively overrides this default. That is, a RETAIN statement tells SAS not to set variables whose values are assigned by an INPUT or assignment statement to missing when going from the current iteration of the DATA step to the next. Instead, SAS retains the values. The RETAIN statement takes the generic form:

RETAIN variable1 variable2 ... variablen;

You can specify as few or as many variables as you want. If you specify no variable names, then SAS retains the values of all of the variables created in an INPUT or assignment statement. You may initialize the values of variables within a RETAIN statement. For example, in the statement:

RETAIN var1 0 var2 3 a b c 'XYZ'

the variable var1 is assigned the value 0; the variable var2 is assigned the value 3, and the variables a, b, and c are all assigned the character value 'XYZ'. If you do not specify an initial value, SAS sets the initial value of a variable to be retained to missing.

Note that it is redundant to name any of the following items in a RETAIN statement, since their values are automatically retained from one iteration of the DATA step to the next:

variables read with a SET, MERGE, or UPDATE statement
a variable whose value is assigned in a SUM statement
variables created by the IN = option

Finally, since the RETAIN statement is not an executable statement, it can appear anywhere in the DATA step.

Example 17.5

Throughout the remainder of the lesson, we will work with the grades data set that is created in the following DATA step:

DATA grades;
    input idno 1-2 l_name $ 5-9 gtype $ 12-13 grade 15-17;
    cards;
10  Smith  E1  78
10  Smith  E2  82
10  Smith  E3  86
10  Smith  E4  69
10  Smith  P1  97
10  Smith  F1 160
11  Simon  E1  88
11  Simon  E2  72
11  Simon  E3  86
11  Simon  E4  99
11  Simon  P1 100
11  Simon  F1 170
12  Jones  E1  98
12  Jones  E2  92
12  Jones  E3  92
12  Jones  E4  99
12  Jones  P1  99
12  Jones  F1 185
;
RUN;
PROC PRINT data = grades NOOBS;
    title 'The grades data set';
RUN;

The grades data set
idno	l_name	gtype	grade
10	Smith	E1	78
10	Smith	E2	82
10	Smith	E3	86
10	Smith	E4	69
10	Smith	P1	97
10	Smith	F1	160
11	Simon	E1	88
11	Simon	E2	72
11	Simon	E3	86
11	Simon	E4	99
11	Simon	P1	100
11	Simon	F1	170
12	Jones	E1	98
12	Jones	E2	92
12	Jones	E3	92
12	Jones	E4	99
12	Jones	P1	99
12	Jones	F1	185

The grades data set is what we call a "subject- and grade-specific" data set. That is, there is one observation for each grade for each student. Students are identified by their id number (idno) and last name (l_name). The data set contains six different types of grades: exam 1 (E1), exam 2 (E2), exam 3 (E3), exam 4 (E4), each worth 100 points; one project (P1) worth 100 points; and a final exam (F1) worth 200 points. We'll suppose that the instructor agreed to drop the students' lowest exam grades (E1, E2, E3, E4) not including the final exam. Launch and run the SAS program so that we can work with the grades data set in the following examples. Review the output from the PRINT procedure to convince yourself that the data were properly read into the grades data set.

Example 17.6

One of the most powerful uses of a RETAIN statement is to compare values across observations. The following program uses the RETAIN statement to compare values across observations, and in doing so determines each student's lowest grade for the four semester exams:

DATA exams;
    set grades (where = (gtype in ('E1', 'E2', 'E3', 'E4')));
RUN;
DATA lowest (rename = (lowtype = gtype));
    set exams;
    by idno;
    retain lowgrade lowtype;
    if first.idno then lowgrade = grade;
    lowgrade = min(lowgrade, grade);
    if grade = lowgrade then lowtype = gtype;
    if last.idno then output;
    drop gtype;
RUN;
PROC PRINT data=lowest;
    title 'Output Dataset: LOWEST';
RUN;

Note: In the upper right-hand corner of the code block you will have the option of copying () the code to your clipboard or downloading () the file to your computer.

DATA exams; 
    set grades (where = (gtype in ('E1', 'E2', 'E3', 'E4'))); *Because we’re not interested in grade types P1 and F1, don’t read them in; 
RUN; 
*The data step will find the lowest grade for each student and store those observations in the dataset 'grades'; 
DATA lowest (rename = (lowtype = gtype)); 
    set exams; 
    by idno; *Process the exams dataset by idno; 
    retain lowgrade lowtype; *Retain lowgrade and lowtype as each student is processed; 
 
    if first.idno then lowgrade = grade; *The first grade for each student is the starts as the lowest; 

    lowgrade = min(lowgrade, grade); *Assign lowgrade the lower of the current grade and the lowest grade thus far; 

    if grade = lowgrade then lowtype = gtype; *If the current grade is the lowest, update lowtype; 

    if last.idno then output; *For the last observation for a student, output one observation containing the lowest grade; 

    drop gtype; *Drop gtype here and rename lowtype as gtype in the data statement; 
RUN;  
PROC PRINT data=lowest; 
    title 'Output Dataset: LOWEST'; 
RUN;

Output Dataset: LOWEST
Obs	idno	l_name	grade	lowgrade	gtype
1	10	Smith	69	69	E4
2	11	Simon	99	72	E2
3	12	Jones	99	92	E3

Because the instructor only wants to drop the lowest exam grade, the first DATA step tells SAS to create a data set called exams by selecting only the exam grades (E1, E2, E3, and E4) from the data set grades.

It's the second DATA step that is the meat of the program and the challenging one to understand. Because of that, we'll try to help you understand the code in three different ways. First, let's summarize the procedure. The DATA step searches through the exams data set for each subject ("by idno") and looks for the lowest grade ("min(lowgrade, grade)"). Because SAS would otherwise set the variables lowgrade and lowtype to missing for each new iteration, the RETAIN statement is used to keep track of the observation that contains the lowest grade. When SAS reads the last observation of the student ("last.idno") it outputs the data corresponding to the lowest exam type (lowtype) and grade (lowgrade) to the lowest data set. (Note that the statement "if last.idno then output;" effectively collapses multiple observations per student into one observation per student.) So that we can merge the lowest data set back into the grades data set, by idno and gtype, the variable lowtype is renamed back to gtype.

Now, let's dive in a bit deeper by investigating how SAS would process the exams data set. As you read through what follows, you'll want to refer to both the DATA step code and the exams data set (which is the same as the grades data set minus the P1 and F1 observations). As always, at the conclusion of the compile phase, SAS makes the program data vector. In this case, it contains the automatic variables (_N_ and _ERROR_), the four variables in the exams data set (idno, l_name, gtype, and grade), two variables defined within the DATA step (lowgrade and lowtype), and as a result of the BY statement, a first.idno and a last.idno variable. Here's what the program data vector looks like at the beginning of the first iteration of the DATA step:

_N_	_ERROR_	idno	l_name	gtype	grade	lowgrade	lowtype	first.idno	last.idno
1	0	.			.	.		.	.

SAS reads the first observation from the exams data set. The observation is the first in the group of id numbers that equal 10, therefore first.idno is assigned the value of 1 and last.idno is assigned a value of 0. Because first.idno equals 1, the lowgrade variable is assigned the same value as that of the grade variable, that is, 78. The lowgrade variable is then assigned the smallest value of the lowgrade and grade variables. Since both values are 78, the value of the lowgrade variable remains unchanged. Because grade equals lowgrade (they are both 78), SAS assigns the lowtype variable the same value as that of the gtype variable, that is, E1. Here's what the program data vector looks like now:

_N_	_ERROR_	idno	l_name	gtype	grade	lowgrade	lowtype	first.idno	last.idno
1	0	10	Smith	E1	78	78	E1	1	0

Since last.idno does not equal 1, SAS does not write the contents of the program data vector to the lowest data set. Instead, SAS returns to the top of the DATA step to begin processing the second observation. Typically, SAS would reset the values of the variables created within the DATA step, that is, lowgrade and lowtype, to missing. The RETAIN statement overrides that default and the values of those variables from the previous iteration are retained. SAS reads the second observation from the exams data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore first.idno and last.idno are both assigned a value of 0. The lowgrade variable is then assigned the smallest value of the lowgrade and grade variables. Since grade now equals 82 and lowgrade equals 78 from the previous iteration, the value of the lowgrade variable remains 78. Because grade does not equal lowgrade now, the value of the lowtype variable is not changed. Instead, it remains E1. Here's what the program data vector looks like now:

_N_	_ERROR_	idno	l_name	gtype	grade	lowgrade	lowtype	first.idno	last.idno
2	0	10	Smith	E2	82	78	E1	0	0

Again, since last.idno still does not equal 1, SAS does not write the contents of the program data vector to the lowest data set. Instead, SAS returns to the top of the DATA step to begin processing the third observation. Again, the values of lowgrade and lowtype are retained from the previous iteration. SAS reads the third observation from the exams data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore first.idno and last.idno are both assigned a value of 0. The lowgrade variable is then assigned the smallest value of the lowgrade and grade variables. Since grade now equals 86 and lowgrade still equals 78 from the previous iteration, the value of the lowgrade variable still remains 78. Because grade still does not equal lowgrade now, the value of the lowtype variable is not changed. Instead, it remains E1. Here's what the program data vector looks like now:

_N_	_ERROR_	idno	l_name	gtype	grade	lowgrade	lowtype	first.idno	last.idno
3	0	10	smith	E3	86	78	E1	0	0

Again, since last.idno still does not equal 1, SAS does not write the contents of the program data vector to the lowest data set. Instead, SAS returns to the top of the DATA step to begin processing the fourth observation. Again, the values of lowgrade and lowtype are retained from the previous iteration. SAS reads the fourth observation from the exams data set. The observation is the last in the group of id numbers that equal 10, therefore last.idno is assigned the value of 1 and first.idno is assigned a value of 0. The lowgrade variable is then assigned the smallest value of the lowgrade and grade variables. Since grade now equals 69 and lowgrade still equals 78 from the previous iteration, the value of the lowgrade variable is updated to 69. Because grade equals lowgrade (they are both 69), SAS assigns the lowtype variable the same value as that of the gtype variable, that is, E4. Here's what the program data vector looks like now:

_N_	_ERROR_	idno	l_name	gtype	grade	lowgrade	lowtype	first.idno	last.idno
4	0	10	Smith	E4	69	69	E4	0	1

Now, since last.idno equals 1, SAS writes the contents of the program data vector to the lowest data set. In doing so, SAS does not write the automatic variables _N_ and _ERROR, nor the first.idno and last.idno variables, to the data set. As instructed by the code, SAS drops the gtype variable and renames the lowtype variable to gtype. So, here's what the lowest data set looks like after processing the first four observations:

idno	l_name	grade	lowgrade	gtype
10	Smith	69	69	E4

Here's where it should be clear that SAS has, by virtue of the code we've written, effectively taken four observations and collapsed them into one observation.

There's just one more thing you might want to do to help you understand that second DATA step. That is, click on the Inspect! button below the code to see an explanation of each line in the DATA step. Then, launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the lowest data set contains the lowest exam grade for each student. Also, note that the lowest data set contains one observation per student rather than six observations per student as in the original grades data set.

17.3 - Automatic Retention

As mentioned earlier, values of the following variables are automatically retained from one iteration of the DATA step to the next:

variables read with a SET, MERGE, or UPDATE statement
a variable whose value is assigned in a SUM statement
variables created by the IN = option

It is not necessary, therefore, to specify their variable names in a RETAIN statement. In this section, we'll take a look at an example that illustrates this point.

Example 17.7

The following SAS program calculates the students' final grades, and in so doing illustrates the automatic retention of two variables — a variable whose value is assigned in a SUM statement (total) and a variable created by the IN = option (lowest):

PROC SORT data=grades;
    by idno gtype;
RUN;
DATA final;
    merge grades lowest (in=lowest);
    by idno gtype;
    if lowest then delete;  
    if first.idno then total = 0;
    total + grade;
    if last.idno then fnl = (total/600)*100;
    format fnl 5.1;
    drop lowgrade gtype;
RUN;
PROC PRINT data=final;
    title 'Output Dataset: FINAL GRADES';
RUN;

Output Dataset: FINAL GRADES
Obs	idno	l_name	grade	total	fnl
1	10	Smith	78	78	.
2	10	Smith	82	160	.
3	10	Smith	86	246	.
4	10	Smith	160	406	.
5	10	Smith	97	503	83.8
6	11	Simon	88	88	.
7	11	Simon	86	174	.
8	11	Simon	99	273	.
9	11	Simon	170	443	.
10	11	Simon	100	543	90.5
11	12	Jones	98	98	.
12	12	Jones	92	190	.
13	12	Jones	99	289	.
14	12	Jones	185	474	.
15	12	Jones	99	573	95.5

Because the instructor wants to process the grades and lowest data sets by idno and gtype, the SORT procedure prepares the grades data set for doing so. The lowest data set is alread sorted by idno and gtype, and so no SORT procedure is necessary for it.

It's again the DATA step that is the meat of the program and the challenging one to understand. Because of that, we'll again try to help you understand the code in three different ways. First, a written summary ... the DATA step merges the grades and lowest data sets back together and calculates each student's final grade. Recall that the grades data set contains multiple observations for each student, while the lowest data set contains only one observation per student. Also, recall that the lowest data set indicates which exam grade (gtype) is lowest for each student. When SAS merges the two data sets by idno and gtype, the resulting merged data set final contains multiple observations for each student. Because the "in = lowest" option was attached to the lowest data set, the variable lowest will take the value 1 for the one observation for each student that contains their lowest exam score. The "if lowest then delete" statement deletes the observation pertaining to each student's lowest exam score from the final data set. That is, the final data set contains only those observations needed to calculate each students' final grade. Note that there is no RETAIN statement for the variable lowest, and yet its value is automatically retained from one iteration of the data step to the next.

The variable total is used to keep track of each student's total point accumulation. Therefore, for each student (first.idno), the variable total is initially set to 0. As you know, the expression:

total + grade;

is called a SUM statement. It merely adds the grade appearing in the current observation to the student's total accumulator. You could think of the SUM statement as this kind of statement that you'd typically see instead in other programming languages:

total = total + grade;

Note that there is no RETAIN statement for the variable total, and yet its value is automatically retained from one iteration of the DATA step to the next. When SAS reaches each student's last observation, the final grade (fnl) is calculated.

Now, let's dive in a bit deeper by investigating how SAS would process the grades and lowest data sets. As you read through what follows, you'll want to refer to both the DATA step code and the grades data set (you'll have to envision the data set as it is sorted by idno and gtype). As always, at the conclusion of the compile phase, SAS makes the program data vector. In this case, it contains the automatic variables (_N_ and _ERROR_), the four variables in the grades data set (idno, l_name, gtype, and grade), a variable in the lowest data set (lowgrade), a variable called lowest that is created as a result of the IN= option attached to the lowest data set, two variables defined within the DATA step (total and fnl), and as a result of the BY statement, a first.idno, a last.idno, a first.gtype and a last.gtype variable. Here's what the program data vector looks like at the beginning of the first iteration of the DATA step:

_N_	_ERR_	idno	l_name	gtype	grade	lowgrade	lowest	first.idno	last.idno	total	fnl
1	0	.			.	.	.	.	.	.	.

For the sake of room, note that the _ERROR_ variable has been abbreviated to _Err_. Also, the first.gtype and last.gtype variables are present, but not depicted, in the program data vector, because we never use them in the code. Now, SAS reads the first observation from the sorted grades data set. There is no observation in the lowest data set for which idno = 10 and gtype = E1. Therefore, the lowest data set does not contribute anything to the observation currently being built. So, the lowgrade variable retains its missing value, and the lowest variable is assigned a value of 0. The observation is the first in the group of id numbers that equal 10, therefore first.idno is assigned the value of 1, and last.idno is assigned a value of 0. Because first.idno equals 1, the total variable is initialized to 0, and then immediately updated to the sum of 0 and 78, or 78. Because last.idno equals 0, the fnl variable is not calculated. Here's what the program data vector looks like now:

_N_	_ERR_	idno	l_name	gtype	grade	lowgrade	lowest	first.idno	last.idno	total	fnl
1	0	10	Smith	E1	78	.	0	1	0	78	.

Because we've reached the end of the first iteration of the DATA step, SAS writes the contents of the program data vector to the output final data set. In so doing, the lowgrade and gtype variables are dropped, as are all of the internal variables (_N_, _ERROR_, lowest, first.idno, last.idno, first.gtype, and last.gtype). Here's what the final data set looks like after the first iteration of the DATA step:

idno	l_name	grade	total	fnl
10	Smith	78	78	.