Lesson 17: Using the OUTPUT and RETAIN statements
Lesson 17: Using the OUTPUT and RETAIN statementsOverview
When processing any DATA step, SAS follows two default procedures:
- When SAS reads the DATA statement at the beginning of each iteration of the DATA step, SAS places missing values in the program data vector for variables that were assigned by either an INPUT statement or an assignment statement within the DATA step. (SAS does not reset variables to missing if they were created by a SUM statement, or if the values came from a SAS data set via a SET or MERGE statement.)
- At the end of the DATA step after completing an iteration of the DATA step, SAS outputs the values of the variables in the program data vector to the SAS data set being created.
In this lesson, we'll learn how to modify these default processes by using the OUTPUT and RETAIN statements:
- The OUTPUT statement allows you to control when and to which data set you want an observation written.
- The RETAIN statement causes a variable created in the DATA step to retain its value from the current observation into the next observation rather than being set to missing at the beginning of each iteration of the DATA step.
Objectives
Upon completing this lesson, you should be able to do the following:
- write a RETAIN statement to tell SAS to retain the value of a variable from one iteration of the data step to the next
- state which kind of variables SAS automatically retains
- write a RETAIN statement to compare values across observations
- program successfully with the RETAIN statement
- write the "FIRST." and "LAST." variables in conjunction with an OUTPUT statement in order to collapse multiple observations in a data set into a single observation
- write a SUM statement to accumulate totals across a set of observations
- write a "LAST." variable in conjunction with BY-group processing, a RETAIN statement, and an OUTPUT statement in order to transpose a data set
- write an OUTPUT statement to tell SAS to output the current observation when the output statement is processed
- write an OUTPUT statement to write observations to multiple data sets
- write an OUTPUT statement to control the output of observations to data sets based on certain conditions
- recall that if you plan to use any OUTPUT statements in a DATA step, you must use OUTPUT statements to program all of the output for that step
- recall that assignment statements must precede OUTPUT statements
- write the today( ) function to determine today's date
17.1 - The OUTPUT Statement
17.1 - The OUTPUT StatementAn OUTPUT statement overrides the default process by telling SAS to output the current observation when the OUTPUT statement is processed — not at the end of the DATA step. The OUTPUT statement takes the form:
OUTPUT dataset1 dataset2 ... datasetn;
where you may name as few or as many data sets as you like. If you use an OUTPUT statement without specifying a data set name, SAS writes the current observation to each of the data sets named in the DATA step. Any data set name appearing in the OUTPUT statement must also appear in the DATA statement.
The OUTPUT statement is pretty powerful in that, among other things, it gives us a way:
- to write observations to multiple data sets
- to control the output of observations to data sets based on certain conditions
- to transpose datasets using the OUTPUT statement in conjunction with the RETAIN statement, BY group processing, and the LAST.variable statement.
Throughout the rest of this section, we'll look at examples that illustrate how to use OUTPUT statements correctly. We'll work with the following subset of the ICDB Study's log data set:
The icblog data set
SUBJ | V_TYPE | V_DATE | FORM |
---|---|---|---|
210006 | 12 | 05/06/94 | cmed |
210006 | 12 | 05/06/94 | diet |
210006 | 12 | 05/06/94 | med |
210006 | 12 | 05/06/94 | phytrt |
210006 | 12 | 05/06/94 | purg |
210006 | 12 | 05/06/94 | qul |
210006 | 12 | 05/06/94 | sympts |
210006 | 12 | 05/06/94 | urn |
210006 | 12 | 05/06/94 | void |
310032 | 24 | 09/19/95 | backf |
310032 | 24 | 09/19/95 | cmed |
310032 | 24 | 09/19/95 | diet |
310032 | 24 | 09/19/95 | med |
310032 | 24 | 09/19/95 | medhxf |
310032 | 24 | 09/19/95 | phs |
310032 | 24 | 09/19/95 | phytrt |
310032 | 24 | 09/19/95 | preg |
310032 | 24 | 09/19/95 | purg |
310032 | 24 | 09/19/95 | qul |
310032 | 24 | 09/19/95 | sympts |
310032 | 24 | 09/19/95 | urn |
310032 | 24 | 09/19/95 | void |
410010 | 6 | 05/12/94 | cmed |
410010 | 6 | 05/12/94 | diet |
410010 | 6 | 05/12/94 | med |
410010 | 6 | 05/12/94 | phytrt |
410010 | 6 | 05/12/94 | purg |
410010 | 6 | 05/12/94 | qul |
410010 | 6 | 05/12/94 | sympts |
410010 | 6 | 05/12/94 | urn |
410010 | 6 | 05/12/94 | void |
As you can see, this log data set contains four variables:
- subj: the subject's identification number
- v_type: the type of clinic visit, which means the number of months since the subject was first seen in the clinic
- v_date: the date of the clinic visit
- form: codes that indicate the data forms that were completed during the subject's clinic visit
The log data set is a rather typical data set that arises from large national clinical studies in which there are a number of sites around the country where data are collected. Typically, the clinical sites collect the data on data forms and then "ship" the data forms either electronically or by mail to a centralized location called a Data Coordinating Center (DCC). As you can well imagine, keeping track of the data forms at the DCC is a monumental task. For the ICDB Study, for example, the DCC received more than 68,000 data forms over the course of the study.
In order to keep track of the data forms that arrive at the DCC, they are "logged" into a database and subsequently tracked as they are processed at the DCC. In reality, a log database will contain many more variables than we have in our subset, such as the dates the data on the forms were entered into the database, who entered the data, the dates the entered data was verified, who verified the data, and so on. To keep our lives simple, we'll just use the four variables described above.
Example 17.1
This example uses the OUTPUT statement to tell SAS to write observations to data sets based on certain conditions. Specifically, the following program uses the OUTPUT statement to create three SAS data sets — s210006, s310032, and s410010 — based on whether the subject identification numbers in the icdblog data set meet a certain condition:
OPTIONS PS=58 LS=80 NODATE NONUMBER;
LIBNAME stat481 'C:\yourdrivename\Stat481WC\05retain\sasndata';
DATA s210006 s310032 s410010;
set stat481.icdblog;
if (subj = 210006) then output s210006;
else if (subj = 310032) then output s310032;
else if (subj = 410010) then output s410010;
RUN;
PROC PRINT data = s210006 NOOBS;
title 'The s210006 data set';
RUN;
PROC PRINT data = s310032 NOOBS;
title 'The s310032 data set';
RUN;
PROC PRINT NOOBS;
title 'The s410010 data set';
RUN;
SUBJ | V_TYPE | V_DATE | FORM |
---|---|---|---|
210006 | 12 | 05/06/94 | cmed |
210006 | 12 | 05/06/94 | diet |
210006 | 12 | 05/06/94 | med |
210006 | 12 | 05/06/94 | phytrt |
210006 | 12 | 05/06/94 | purg |
210006 | 12 | 05/06/94 | qul |
210006 | 12 | 05/06/94 | sympts |
210006 | 12 | 05/06/94 | urn |
210006 | 12 | 05/06/94 | void |
SUBJ | V_TYPE | V_DATE | FORM |
---|---|---|---|
310032 | 24 | 09/19/95 | backf |
310032 | 24 | 09/19/95 | cmed |
310032 | 24 | 09/19/95 | diet |
310032 | 24 | 09/19/95 | med |
310032 | 24 | 09/19/95 | medhxf |
310032 | 24 | 09/19/95 | phs |
310032 | 24 | 09/19/95 | phytrt |
310032 | 24 | 09/19/95 | preg |
310032 | 24 | 09/19/95 | purg |
310032 | 24 | 09/19/95 | qul |
310032 | 24 | 09/19/95 | sympts |
310032 | 24 | 09/19/95 | urn |
310032 | 24 | 09/19/95 | void |
SUBJ | V_TYPE | V_DATE | FORM |
---|---|---|---|
410010 | 6 | 05/12/94 | cmed |
410010 | 6 | 05/12/94 | diet |
410010 | 6 | 05/12/94 | med |
410010 | 6 | 05/12/94 | phytrt |
410010 | 6 | 05/12/94 | purg |
410010 | 6 | 05/12/94 | qul |
410010 | 6 | 05/12/94 | sympts |
410010 | 6 | 05/12/94 | urn |
410010 | 6 | 05/12/94 | void |
As you can see, the DATA statement contains three data set names — s210006, s310032, and s410010. That tells SAS that we want to create three data sets with the given names. The SET statement, of course, tells SAS to read observations from the permanent data set called stat481.icdblog. Then comes the IF-THEN-ELSE and OUTPUT statements that make it all work. The first IF-THEN tells SAS to output any observations pertaining to subject 210006 to the s210006 data set; the second IF-THEN tells SAS to output any observations pertaining to subject 310032 to the s310032 data set; and, the third IF-THEN statement tells SAS to output any observations pertaining to subject 410010 to the s410010 data set. SAS will hiccup if you have a data set name that appears in an OUTPUT statement without it also appearing in the DATA statement.
The PRINT procedures, of course, tell SAS to print the three newly created data sets. Note that the last PRINT procedure does not have a DATA= option. That's because when you name more than one data set in a single DATA statement, the last name on the DATA statement is the most recently created data set, and the one that subsequent procedures use by default. Therefore, the last PRINT procedure will print the s410010 data set by default.
Now, before launching and running the SAS program, right-click to save the icdblog data set to a convenient location on your computer. Then, launch the SAS program and edit the LIBNAME statement so that it reflects the location in which you saved the data set. Then, run the program and review the output from the PRINT procedures. You should see that, as expected, the data set s210006 contains data on subject 210006; the data set s310032 contains data on subject 310032; and s410010 contains data on subject 410010.
Incidentally, note that the IF-THEN-ELSE construct used here in conjunction with the OUTPUT statement is comparable to attaching the WHERE= option to each of the data sets appearing in the DATA statement.
Example 17.2
Using an OUTPUT statement suppresses the automatic output of observations at the end of the DATA step. Therefore, if you plan to use any OUTPUT statements in a DATA step, you must use OUTPUT statements to program all of the output for that step. The following SAS program illustrates what happens if you fail to direct all of the observations to output:
DATA subj210006 subj310032;
set stat481.icdblog;
if (subj = 210006) then output subj210006;
RUN;
PROC PRINT data = subj210006 NOOBS;
title 'The subj210006 data set';
RUN;
PROC PRINT data = subj310032 NOOBS;
title 'The subj310032 data set';
RUN;
SUBJ | V_TYPE | V_DATE | FORM |
---|---|---|---|
210006 | 12 | 05/06/94 | cmed |
210006 | 12 | 05/06/94 | diet |
210006 | 12 | 05/06/94 | med |
210006 | 12 | 05/06/94 | phytrt |
210006 | 12 | 05/06/94 | purg |
210006 | 12 | 05/06/94 | qul |
210006 | 12 | 05/06/94 | sympts |
210006 | 12 | 05/06/94 | urn |
210006 | 12 | 05/06/94 | void |
The DATA statement contains two data set names, subj210006 and subj310032, telling SAS that we intend to create two data sets. However, as you can see, the IF statement contains an OUTPUT statement that directs output to the subj210006 data set, but no OUTPUT statement directs output to the subj310032 data set. Launch and run the SAS program to convince yourself that the subj210006 data set contains data for subject 210006, while the subj310032 data set contains 0 observations. You should see a message like this in the log window:
PROC PRINT data = subj310032 NOOBS;
title 'The subj310032 data set';
RUN;
NOTE: No observations in data set WORK.SUBJ310032.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds
as well as see that no output for the subj310032 data set appears in the output window.
Example 17.3
If you use an assignment statement to create a new variable in a DATA step in the presence of OUTPUT statements, you have to make sure that you place the assignment statement before the OUTPUT statements. Otherwise, SAS will have already written the observation to the SAS data set, and the newly created variable will be set to missing. The following SAS program illustrates an example of how two variables, current and days_vis, get set to missing in the output data sets because their values get calculated after SAS has already written the observation to the SAS data set:
DATA subj210006 subj310032 subj410010;
set stat481.icdblog;
if (subj = 210006) then output subj210006;
else if (subj = 310032) then output subj310032;
else if (subj = 410010) then output subj410010;
current = today();
days_vis = current - v_date;
format current mmddyy8.;
RUN;
PROC PRINT data = subj310032 NOOBS;
title 'The subj310032 data set';
RUN;
SUBJ | V_TYPE | V_DATE | FORM | current | days_vis |
---|---|---|---|---|---|
310032 | 24 | 09/19/95 | backf | . | . |
310032 | 24 | 09/19/95 | cmed | . | . |
310032 | 24 | 09/19/95 | diet | . | . |
310032 | 24 | 09/19/95 | med | . | . |
310032 | 24 | 09/19/95 | medhxf | . | . |
310032 | 24 | 09/19/95 | phs | . | . |
310032 | 24 | 09/19/95 | phytrt | . | . |
310032 | 24 | 09/19/95 | preg | . | . |
310032 | 24 | 09/19/95 | purg | . | . |
310032 | 24 | 09/19/95 | qul | . | . |
310032 | 24 | 09/19/95 | sympts | . | . |
310032 | 24 | 09/19/95 | urn | . | . |
310032 | 24 | 09/19/95 | void | . | . |
The main thing to note in this program is that the current and days_vis assignment statements appear after the IF-THEN-ELSE and OUTPUT statements. That means that each observation will be written to one of the three output data sets before the current and days_vis values are even calculated. Because SAS sets variables created in the DATA step as missing at the beginning of each iteration of the DATA step, the values of current and days_vis will remain missing for each observation.
By the way, the today( ) function, which is assigned to the variable current, creates a date variable containing today's date. Therefore, the variable days_vis is meant to contain the number of days since the subject's recorded visit v_date. However, as described above, the values of current and days_vis get set to missing. Launch and run the SAS program to convince yourself that the current and days_vis variables in the subj310032 data set contain only missing values. If we were to print the subj210006 and subj410020 data sets, we would see the same thing.
The following SAS program illustrates the corrected code for the previous DATA step, that is, for creating new variables with assignment statements in the presence of OUTPUT statements:
DATA subj210006 subj310032 subj410010;
set stat481.icdblog;
current = today();
days_vis = current - v_date;
format current mmddyy8.;
if (subj = 210006) then output subj210006;
else if (subj = 310032) then output subj310032;
else if (subj = 410010) then output subj410010;
RUN;
PROC PRINT data = subj310032 NOOBS;
title 'The subj310032 data set';
RUN;
SUBJ | V_TYPE | V_DATE | FORM | current | days_vis |
---|---|---|---|---|---|
310032 | 24 | 09/19/95 | backf | 09/06/23 | 10214 |
310032 | 24 | 09/19/95 | cmed | 09/06/23 | 10214 |
310032 | 24 | 09/19/95 | diet | 09/06/23 | 10214 |
310032 | 24 | 09/19/95 | med | 09/06/23 | 10214 |
310032 | 24 | 09/19/95 | medhxf | 09/06/23 | 10214 |
310032 | 24 | 09/19/95 | phs | 09/06/23 | 10214 |
310032 | 24 | 09/19/95 | phytrt | 09/06/23 | 10214 |
310032 | 24 | 09/19/95 | preg | 09/06/23 | 10214 |
310032 | 24 | 09/19/95 | purg | 09/06/23 | 10214 |
310032 | 24 | 09/19/95 | qul | 09/06/23 | 10214 |
310032 | 24 | 09/19/95 | sympts | 09/06/23 | 10214 |
310032 | 24 | 09/19/95 | urn | 09/06/23 | 10214 |
310032 | 24 | 09/19/95 | void | 09/06/23 | 10214 |
Now, since the assignment statements precede the OUTPUT statements, the variables are correctly written to the output data sets. That is, now the variable current contains the date in which the program was run and the variable days_vis contains the number of days since that date and the date of the subject's visit. Launch and run the SAS program to convince yourself that the current and days_vis variables are properly written to the subj310032 data set. If we were to print the subj210006 and subj410020 data sets, we would see similar results.
Example 17.4
After SAS processes an OUTPUT statement within a DATA step, the observation remains in the program data vector and you can continue programming with it. You can even output the observation again to the same SAS data set or to a different one! The following SAS program illustrates how you can create different data sets with some of the same observations. That is, the data sets created in your DATA statement do not have to be mutually exclusive:
DATA symptoms visitsix;
set stat481.icdblog;
if form = 'sympts' then output symptoms;
if v_type = 6 then output visitsix;
RUN;
PROC PRINT data = symptoms NOOBS;
title 'The symptoms data set';
RUN;
PROC PRINT data = visitsix NOOBS;
title 'The visitsix data set';
RUN;
The symptoms data set
SUBJ | V_TYPE | V_DATE | FORM |
---|---|---|---|
210006 | 12 | 05/06/94 | sympts |
310032 | 24 | 09/19/95 | sympts |
410010 | 6 | 05/12/94 | sympts |
SUBJ | V_TYPE | V_DATE | FORM |
---|---|---|---|
410010 | 6 | 05/12/94 | cmed |
410010 | 6 | 05/12/94 | diet |
410010 | 6 | 05/12/94 | med |
410010 | 6 | 05/12/94 | phytrt |
410010 | 6 | 05/12/94 | purg |
410010 | 6 | 05/12/94 | qul |
410010 | 6 | 05/12/94 | sympts |
410010 | 6 | 05/12/94 | urn |
410010 | 6 | 05/12/94 | void |
The DATA step creates two temporary data sets, symptoms and visitsix. The symptoms data set contains only those observations containing a form code of sympts. The visitsix data set, on the other hand, contains observations for which v_type equals 6. The observations in the two data sets are therefore not necessarily mutually exclusive. In fact, launch and run the SAS program and review the output from the PRINT procedures. Note that the observation for subject 410010 in which form = sympts is contained in both the symptoms and visitsix data sets.
17.2 - The RETAIN Statement
17.2 - The RETAIN StatementWhen SAS reads the DATA statement at the beginning of each iteration of the DATA step, SAS places missing values in the program data vector for variables that were assigned by either an INPUT statement or an assignment statement within the DATA step. A RETAIN statement effectively overrides this default. That is, a RETAIN statement tells SAS not to set variables whose values are assigned by an INPUT or assignment statement to missing when going from the current iteration of the DATA step to the next. Instead, SAS retains the values. The RETAIN statement takes the generic form:
RETAIN variable1 variable2 ... variablen;
You can specify as few or as many variables as you want. If you specify no variable names, then SAS retains the values of all of the variables created in an INPUT or assignment statement. You may initialize the values of variables within a RETAIN statement. For example, in the statement:
RETAIN var1 0 var2 3 a b c 'XYZ'
the variable var1 is assigned the value 0; the variable var2 is assigned the value 3, and the variables a, b, and c are all assigned the character value 'XYZ'. If you do not specify an initial value, SAS sets the initial value of a variable to be retained to missing.
Note that it is redundant to name any of the following items in a RETAIN statement, since their values are automatically retained from one iteration of the DATA step to the next:
- variables read with a SET, MERGE, or UPDATE statement
- a variable whose value is assigned in a SUM statement
- variables created by the IN = option
Finally, since the RETAIN statement is not an executable statement, it can appear anywhere in the DATA step.
Example 17.5
Throughout the remainder of the lesson, we will work with the grades data set that is created in the following DATA step:
DATA grades;
input idno 1-2 l_name $ 5-9 gtype $ 12-13 grade 15-17;
cards;
10 Smith E1 78
10 Smith E2 82
10 Smith E3 86
10 Smith E4 69
10 Smith P1 97
10 Smith F1 160
11 Simon E1 88
11 Simon E2 72
11 Simon E3 86
11 Simon E4 99
11 Simon P1 100
11 Simon F1 170
12 Jones E1 98
12 Jones E2 92
12 Jones E3 92
12 Jones E4 99
12 Jones P1 99
12 Jones F1 185
;
RUN;
PROC PRINT data = grades NOOBS;
title 'The grades data set';
RUN;
idno | l_name | gtype | grade |
---|---|---|---|
10 | Smith | E1 | 78 |
10 | Smith | E2 | 82 |
10 | Smith | E3 | 86 |
10 | Smith | E4 | 69 |
10 | Smith | P1 | 97 |
10 | Smith | F1 | 160 |
11 | Simon | E1 | 88 |
11 | Simon | E2 | 72 |
11 | Simon | E3 | 86 |
11 | Simon | E4 | 99 |
11 | Simon | P1 | 100 |
11 | Simon | F1 | 170 |
12 | Jones | E1 | 98 |
12 | Jones | E2 | 92 |
12 | Jones | E3 | 92 |
12 | Jones | E4 | 99 |
12 | Jones | P1 | 99 |
12 | Jones | F1 | 185 |
The grades data set is what we call a "subject- and grade-specific" data set. That is, there is one observation for each grade for each student. Students are identified by their id number (idno) and last name (l_name). The data set contains six different types of grades: exam 1 (E1), exam 2 (E2), exam 3 (E3), exam 4 (E4), each worth 100 points; one project (P1) worth 100 points; and a final exam (F1) worth 200 points. We'll suppose that the instructor agreed to drop the students' lowest exam grades (E1, E2, E3, E4) not including the final exam. Launch and run the SAS program so that we can work with the grades data set in the following examples. Review the output from the PRINT procedure to convince yourself that the data were properly read into the grades data set.
Example 17.6
One of the most powerful uses of a RETAIN statement is to compare values across observations. The following program uses the RETAIN statement to compare values across observations, and in doing so determines each student's lowest grade for the four semester exams:
DATA exams;
set grades (where = (gtype in ('E1', 'E2', 'E3', 'E4')));
RUN;
DATA lowest (rename = (lowtype = gtype));
set exams;
by idno;
retain lowgrade lowtype;
if first.idno then lowgrade = grade;
lowgrade = min(lowgrade, grade);
if grade = lowgrade then lowtype = gtype;
if last.idno then output;
drop gtype;
RUN;
PROC PRINT data=lowest;
title 'Output Dataset: LOWEST';
RUN;
Note: In the upper right-hand corner of the code block you will have the option of copying ( ) the code to your clipboard or downloading ( ) the file to your computer.
DATA exams;
set grades (where = (gtype in ('E1', 'E2', 'E3', 'E4'))); *Because we’re not interested in grade types P1 and F1, don’t read them in;
RUN;
*The data step will find the lowest grade for each student and store those observations in the dataset 'grades';
DATA lowest (rename = (lowtype = gtype));
set exams;
by idno; *Process the exams dataset by idno;
retain lowgrade lowtype; *Retain lowgrade and lowtype as each student is processed;
if first.idno then lowgrade = grade; *The first grade for each student is the starts as the lowest;
lowgrade = min(lowgrade, grade); *Assign lowgrade the lower of the current grade and the lowest grade thus far;
if grade = lowgrade then lowtype = gtype; *If the current grade is the lowest, update lowtype;
if last.idno then output; *For the last observation for a student, output one observation containing the lowest grade;
drop gtype; *Drop gtype here and rename lowtype as gtype in the data statement;
RUN;
PROC PRINT data=lowest;
title 'Output Dataset: LOWEST';
RUN;
Obs | idno | l_name | grade | lowgrade | gtype |
---|---|---|---|---|---|
1 | 10 | Smith | 69 | 69 | E4 |
2 | 11 | Simon | 99 | 72 | E2 |
3 | 12 | Jones | 99 | 92 | E3 |
Because the instructor only wants to drop the lowest exam grade, the first DATA step tells SAS to create a data set called exams by selecting only the exam grades (E1, E2, E3, and E4) from the data set grades.
It's the second DATA step that is the meat of the program and the challenging one to understand. Because of that, we'll try to help you understand the code in three different ways. First, let's summarize the procedure. The DATA step searches through the exams data set for each subject ("by idno") and looks for the lowest grade ("min(lowgrade, grade)"). Because SAS would otherwise set the variables lowgrade and lowtype to missing for each new iteration, the RETAIN statement is used to keep track of the observation that contains the lowest grade. When SAS reads the last observation of the student ("last.idno") it outputs the data corresponding to the lowest exam type (lowtype) and grade (lowgrade) to the lowest data set. (Note that the statement "if last.idno then output;" effectively collapses multiple observations per student into one observation per student.) So that we can merge the lowest data set back into the grades data set, by idno and gtype, the variable lowtype is renamed back to gtype.
Now, let's dive in a bit deeper by investigating how SAS would process the exams data set. As you read through what follows, you'll want to refer to both the DATA step code and the exams data set (which is the same as the grades data set minus the P1 and F1 observations). As always, at the conclusion of the compile phase, SAS makes the program data vector. In this case, it contains the automatic variables (_N_ and _ERROR_), the four variables in the exams data set (idno, l_name, gtype, and grade), two variables defined within the DATA step (lowgrade and lowtype), and as a result of the BY statement, a first.idno and a last.idno variable. Here's what the program data vector looks like at the beginning of the first iteration of the DATA step:
_N_ | _ERROR_ | idno | l_name | gtype | grade | lowgrade | lowtype | first.idno | last.idno |
---|---|---|---|---|---|---|---|---|---|
1 | 0 | . | . | . | . | . |
SAS reads the first observation from the exams data set. The observation is the first in the group of id numbers that equal 10, therefore first.idno is assigned the value of 1 and last.idno is assigned a value of 0. Because first.idno equals 1, the lowgrade variable is assigned the same value as that of the grade variable, that is, 78. The lowgrade variable is then assigned the smallest value of the lowgrade and grade variables. Since both values are 78, the value of the lowgrade variable remains unchanged. Because grade equals lowgrade (they are both 78), SAS assigns the lowtype variable the same value as that of the gtype variable, that is, E1. Here's what the program data vector looks like now:
_N_ | _ERROR_ | idno | l_name | gtype | grade | lowgrade | lowtype | first.idno | last.idno |
---|---|---|---|---|---|---|---|---|---|
1 | 0 | 10 | Smith | E1 | 78 | 78 | E1 | 1 | 0 |
Since last.idno does not equal 1, SAS does not write the contents of the program data vector to the lowest data set. Instead, SAS returns to the top of the DATA step to begin processing the second observation. Typically, SAS would reset the values of the variables created within the DATA step, that is, lowgrade and lowtype, to missing. The RETAIN statement overrides that default and the values of those variables from the previous iteration are retained. SAS reads the second observation from the exams data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore first.idno and last.idno are both assigned a value of 0. The lowgrade variable is then assigned the smallest value of the lowgrade and grade variables. Since grade now equals 82 and lowgrade equals 78 from the previous iteration, the value of the lowgrade variable remains 78. Because grade does not equal lowgrade now, the value of the lowtype variable is not changed. Instead, it remains E1. Here's what the program data vector looks like now:
_N_ | _ERROR_ | idno | l_name | gtype | grade | lowgrade | lowtype | first.idno | last.idno |
---|---|---|---|---|---|---|---|---|---|
2 | 0 | 10 | Smith | E2 | 82 | 78 | E1 | 0 | 0 |
Again, since last.idno still does not equal 1, SAS does not write the contents of the program data vector to the lowest data set. Instead, SAS returns to the top of the DATA step to begin processing the third observation. Again, the values of lowgrade and lowtype are retained from the previous iteration. SAS reads the third observation from the exams data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore first.idno and last.idno are both assigned a value of 0. The lowgrade variable is then assigned the smallest value of the lowgrade and grade variables. Since grade now equals 86 and lowgrade still equals 78 from the previous iteration, the value of the lowgrade variable still remains 78. Because grade still does not equal lowgrade now, the value of the lowtype variable is not changed. Instead, it remains E1. Here's what the program data vector looks like now:
_N_ | _ERROR_ | idno | l_name | gtype | grade | lowgrade | lowtype | first.idno | last.idno |
---|---|---|---|---|---|---|---|---|---|
3 | 0 | 10 | smith | E3 | 86 | 78 | E1 | 0 | 0 |
Again, since last.idno still does not equal 1, SAS does not write the contents of the program data vector to the lowest data set. Instead, SAS returns to the top of the DATA step to begin processing the fourth observation. Again, the values of lowgrade and lowtype are retained from the previous iteration. SAS reads the fourth observation from the exams data set. The observation is the last in the group of id numbers that equal 10, therefore last.idno is assigned the value of 1 and first.idno is assigned a value of 0. The lowgrade variable is then assigned the smallest value of the lowgrade and grade variables. Since grade now equals 69 and lowgrade still equals 78 from the previous iteration, the value of the lowgrade variable is updated to 69. Because grade equals lowgrade (they are both 69), SAS assigns the lowtype variable the same value as that of the gtype variable, that is, E4. Here's what the program data vector looks like now:
_N_ | _ERROR_ | idno | l_name | gtype | grade | lowgrade | lowtype | first.idno | last.idno |
---|---|---|---|---|---|---|---|---|---|
4 | 0 | 10 | Smith | E4 | 69 | 69 | E4 | 0 | 1 |
Now, since last.idno equals 1, SAS writes the contents of the program data vector to the lowest data set. In doing so, SAS does not write the automatic variables _N_ and _ERROR, nor the first.idno and last.idno variables, to the data set. As instructed by the code, SAS drops the gtype variable and renames the lowtype variable to gtype. So, here's what the lowest data set looks like after processing the first four observations:
idno | l_name | grade | lowgrade | gtype |
---|---|---|---|---|
10 | Smith | 69 | 69 | E4 |
Here's where it should be clear that SAS has, by virtue of the code we've written, effectively taken four observations and collapsed them into one observation.
There's just one more thing you might want to do to help you understand that second DATA step. That is, click on the Inspect! button below the code to see an explanation of each line in the DATA step. Then, launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the lowest data set contains the lowest exam grade for each student. Also, note that the lowest data set contains one observation per student rather than six observations per student as in the original grades data set.
17.3 - Automatic Retention
17.3 - Automatic RetentionAs mentioned earlier, values of the following variables are automatically retained from one iteration of the DATA step to the next:
- variables read with a SET, MERGE, or UPDATE statement
- a variable whose value is assigned in a SUM statement
- variables created by the IN = option
It is not necessary, therefore, to specify their variable names in a RETAIN statement. In this section, we'll take a look at an example that illustrates this point.
Example 17.7
The following SAS program calculates the students' final grades, and in so doing illustrates the automatic retention of two variables — a variable whose value is assigned in a SUM statement (total) and a variable created by the IN = option (lowest):
PROC SORT data=grades;
by idno gtype;
RUN;
DATA final;
merge grades lowest (in=lowest);
by idno gtype;
if lowest then delete;
if first.idno then total = 0;
total + grade;
if last.idno then fnl = (total/600)*100;
format fnl 5.1;
drop lowgrade gtype;
RUN;
PROC PRINT data=final;
title 'Output Dataset: FINAL GRADES';
RUN;
Obs | idno | l_name | grade | total | fnl |
---|---|---|---|---|---|
1 | 10 | Smith | 78 | 78 | . |
2 | 10 | Smith | 82 | 160 | . |
3 | 10 | Smith | 86 | 246 | . |
4 | 10 | Smith | 160 | 406 | . |
5 | 10 | Smith | 97 | 503 | 83.8 |
6 | 11 | Simon | 88 | 88 | . |
7 | 11 | Simon | 86 | 174 | . |
8 | 11 | Simon | 99 | 273 | . |
9 | 11 | Simon | 170 | 443 | . |
10 | 11 | Simon | 100 | 543 | 90.5 |
11 | 12 | Jones | 98 | 98 | . |
12 | 12 | Jones | 92 | 190 | . |
13 | 12 | Jones | 99 | 289 | . |
14 | 12 | Jones | 185 | 474 | . |
15 | 12 | Jones | 99 | 573 | 95.5 |
Because the instructor wants to process the grades and lowest data sets by idno and gtype, the SORT procedure prepares the grades data set for doing so. The lowest data set is alread sorted by idno and gtype, and so no SORT procedure is necessary for it.
It's again the DATA step that is the meat of the program and the challenging one to understand. Because of that, we'll again try to help you understand the code in three different ways. First, a written summary ... the DATA step merges the grades and lowest data sets back together and calculates each student's final grade. Recall that the grades data set contains multiple observations for each student, while the lowest data set contains only one observation per student. Also, recall that the lowest data set indicates which exam grade (gtype) is lowest for each student. When SAS merges the two data sets by idno and gtype, the resulting merged data set final contains multiple observations for each student. Because the "in = lowest" option was attached to the lowest data set, the variable lowest will take the value 1 for the one observation for each student that contains their lowest exam score. The "if lowest then delete" statement deletes the observation pertaining to each student's lowest exam score from the final data set. That is, the final data set contains only those observations needed to calculate each students' final grade. Note that there is no RETAIN statement for the variable lowest, and yet its value is automatically retained from one iteration of the data step to the next.
The variable total is used to keep track of each student's total point accumulation. Therefore, for each student (first.idno), the variable total is initially set to 0. As you know, the expression:
total + grade;
is called a SUM statement. It merely adds the grade appearing in the current observation to the student's total accumulator. You could think of the SUM statement as this kind of statement that you'd typically see instead in other programming languages:
total = total + grade;
Note that there is no RETAIN statement for the variable total, and yet its value is automatically retained from one iteration of the DATA step to the next. When SAS reaches each student's last observation, the final grade (fnl) is calculated.
Now, let's dive in a bit deeper by investigating how SAS would process the grades and lowest data sets. As you read through what follows, you'll want to refer to both the DATA step code and the grades data set (you'll have to envision the data set as it is sorted by idno and gtype). As always, at the conclusion of the compile phase, SAS makes the program data vector. In this case, it contains the automatic variables (_N_ and _ERROR_), the four variables in the grades data set (idno, l_name, gtype, and grade), a variable in the lowest data set (lowgrade), a variable called lowest that is created as a result of the IN= option attached to the lowest data set, two variables defined within the DATA step (total and fnl), and as a result of the BY statement, a first.idno, a last.idno, a first.gtype and a last.gtype variable. Here's what the program data vector looks like at the beginning of the first iteration of the DATA step:
_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | . | . | . | . | . | . | . | . |
For the sake of room, note that the _ERROR_ variable has been abbreviated to _Err_. Also, the first.gtype and last.gtype variables are present, but not depicted, in the program data vector, because we never use them in the code. Now, SAS reads the first observation from the sorted grades data set. There is no observation in the lowest data set for which idno = 10 and gtype = E1. Therefore, the lowest data set does not contribute anything to the observation currently being built. So, the lowgrade variable retains its missing value, and the lowest variable is assigned a value of 0. The observation is the first in the group of id numbers that equal 10, therefore first.idno is assigned the value of 1, and last.idno is assigned a value of 0. Because first.idno equals 1, the total variable is initialized to 0, and then immediately updated to the sum of 0 and 78, or 78. Because last.idno equals 0, the fnl variable is not calculated. Here's what the program data vector looks like now:
_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 10 | Smith | E1 | 78 | . | 0 | 1 | 0 | 78 | . |
Because we've reached the end of the first iteration of the DATA step, SAS writes the contents of the program data vector to the output final data set. In so doing, the lowgrade and gtype variables are dropped, as are all of the internal variables (_N_, _ERROR_, lowest, first.idno, last.idno, first.gtype, and last.gtype). Here's what the final data set looks like after the first iteration of the DATA step:
idno | l_name | grade | total | fnl |
---|---|---|---|---|
10 | Smith | 78 | 78 | . |
Because SAS encounters a new idno and gtype, SAS resets the values in the program data vector to missing:
_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|
2 | 0 | . | . | . | 0 | . | . | 78 | . |
Note that the values of the lowest and total variables are automatically retained from the previous iteration. Now, SAS reads the second observation from the sorted grades data set. There is no observation in the lowest data set for which idno = 10 and gtype = E2. Therefore, the lowest data set does not contribute anything to the observation currently being built. So, the lowgrade variable retains its missing value and the lowest variable is assigned a value of 0. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore first.idno and last.idno are both assigned a value of 0. Because first.idno equals 0, the total variable is not re-initialized to 0 but is updated to the sum of 78 and 82, or 160. Because last.idno equals 0, the fnl variable is not calculated. Here's what the program data vector looks like now:
_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|
2 | 0 | 10 | Smith | E2 | 82 | . | 0 | 0 | 0 | 160 | . |
Because we've reached the end of the second iteration of the DATA step, SAS writes the contents of the program data vector to the output final data set, dropping appropriate variables along the way. Here's what the final data set looks like after the second iteration of the DATA step:
idno | l_name | grade | total | fnl |
---|---|---|---|---|
10 | Smith | 78 | 78 | . |
10 | Smith | 82 | 160 | . |
Because SAS encounters a new idno and gtype, SAS resets the values in the program data vector to missing:
_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|
3 | 0 | . | . | . | 0 | . | . | 160 | . |
Again, the values of the lowest and total variables are automatically retained from the previous iteration. Now, SAS reads the third observation from the sorted grades data set. There is no observation in the lowest data set for which idno = 10 and gtype = E3. Therefore, the lowest data set does not contribute anything to the observation currently being built. So, the lowgrade variable retains its missing value and the lowest variable is assigned a value of 0. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore first.idno and last.idno are both assigned a value of 0. Because first.idno equals 0, the total variable is not re-initialized to 0 but is updated to the sum of 160 and 86, or 246. Because last.idno equals 0, the fnl variable is not calculated. Here's what the program data vector looks like now:
_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|
3 | 0 | 10 | Smith | E3 | 86 | . | 0 | 0 | 0 | 246 | . |
Because we've reached the end of the third iteration of the DATA step, SAS writes the contents of the program data vector to the output final data set, dropping appropriate variables along the way. Here's what the final data set looks like after the third iteration of the DATA step:
idno | l_name | grade | total | fnl |
---|---|---|---|---|
10 | Smith | 78 | 78 | . |
10 | Smith | 82 | 160 | . |
10 | Smith | 86 | 246 | . |
Because SAS encounters a new idno and gtype, SAS resets the values in the program data vector to missing:
_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|
4 | 0 | . | . | . | 0 | . | . | 246 | . |
Again, the values of the lowest and total variables are automatically retained from the previous iteration. Now, SAS reads the fourth observation from the sorted grades data set. Now, there is an observation in the lowest data set for which idno = 10 and gtype = E4. Therefore, the lowest data set does contribute to the observation currently being built. The lowgrade variable is read from the lowest data set, getting a value of 69. The lowest variable is assigned a value of 1. Therefore, SAS does not write the observation to the final data set. The program data vector looks like this:
_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|
4 | 0 | 10 | Smith | E4 | 69 | 69 | 1 | 0 | 0 | 246 | . |
and the final data set still looks like this after the fourth iteration of the DATA step:
idno | l_name | grade | total | fnl |
---|---|---|---|---|
10 | Smith | 78 | 78 | . |
10 | Smith | 82 | 160 | . |
10 | Smith | 86 | 246 | . |
Because SAS encounters a new idno and gtype, SAS resets the values in the program data vector to missing:
_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|
5 | 0 | . | . | . | 1 | . | . | 246 | . |
Again, the values of the lowest and total variables are automatically retained from the previous iteration. Now, SAS reads the fifth observation from the sorted grades data set. There is no observation in the lowest data set for which idno = 10 and gtype = F1. (Don't forget that the grades data set was sorted by idno and gtype, so the observation for which gtype = F1 is processed before the observation for which gtype = P1.) Again, the lowest data set does not contribute anything to the observation currently being built. So, the lowgrade variable retains its missing value, and the lowest variable is assigned a value of 0. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore first.idno and last.idno are both assigned a value of 0. Because first.idno equals 0, the total variable is not re-initialized to 0 but is updated to the sum of 246 and 160, or 406. Because last.idno equals 0, the fnl variable is not calculated. Here's what the program data vector looks like now:
_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|
5 | 0 | 10 | Smith | F1 | 160 | . | 0 | 0 | 0 | 406 | . |
Because we've reached the end of the fifth iteration of the DATA step, SAS writes the contents of the program data vector to the output final data set, dropping appropriate variables along the way. Here's what the final data set looks like after the fifth iteration of the DATA step:
idno | l_name | grade | total | fnl |
---|---|---|---|---|
10 | Smith | 78 | 78 | . |
10 | Smith | 82 | 160 | . |
10 | Smith | 86 | 246 | . |
10 | Smith | 160 | 406 | . |
Because SAS encounters a new idno and gtype, SAS resets the values in the program data vector to missing:
_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|
6 | 0 | . | . | . | 0 | . | . | 406 | . |
Again, the values of the lowest and total variables are automatically retained from the previous iteration. Now, SAS reads the sixth observation from the sorted grades data set. There is no observation in the lowest data set for which idno = 10 and gtype = P1. Again, the lowest data set does not contribute anything to the observation currently being built. So, the lowgrade variable retains its missing value, and the lowest variable is assigned a value of 0. The observation is the last in the group of id numbers that equal 10, therefore last.idno is assigned the value of 1 and first.idno is assigned a value of 0. Because first.idno equals 0, the total variable is not re-initialized to 0 but is updated to the sum of 406 and 97, or 503. Because last.idno equals 1, the fnl variable is calculated to equal 503 divided by 600, and multiplied by 100, to get 83.8. Here's what the program data vector looks like now:
_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|
6 | 0 | 10 | Smith | P1 | 97 | . | 0 | 0 | 1 | 503 | 83.8 |
Because we've reached the end of the sixth iteration of the DATA step, SAS writes the contents of the program data vector to the output final data set, dropping appropriate variables along the way. Here's what the final data set looks like after the sixth iteration of the DATA step:
idno | l_name | grade | total | fnl |
---|---|---|---|---|
10 | Smith | 78 | 78 | . |
10 | Smith | 82 | 160 | . |
10 | Smith | 86 | 246 | . |
10 | Smith | 160 | 406 | . |
10 | Smith | 97 | 503 | 83.8 |
Whew!! This is a lot of work! You can take it from here... the process continues as described.
There's just one more thing you might want to do to help you understand the DATA step. That is, click on the Inspect! button below the code to see an explanation of each line in the DATA step. Then, launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the final grades are appropriately calculated.
17.4 - Transposing a Data Set
17.4 - Transposing a Data SetIt is very likely that some day you will encounter the need to turn what I call a "tall" data set, such as grades:
idno | l_name | gtype | grade |
---|---|---|---|
10 | Smith | E1 | 78 |
10 | Smith | E2 | 82 |
10 | Smith | E3 | 86 |
10 | Smith | E4 | 69 |
10 | Smith | P1 | 97 |
10 | Smith | F1 | 160 |
11 | Simon | E1 | 88 |
11 | Simon | E2 | 72 |
11 | Simon | E3 | 86 |
11 | Simon | E4 | 99 |
11 | Simon | P1 | 100 |
11 | Simon | F1 | 170 |
12 | Jones | E1 | 98 |
12 | Jones | E2 | 92 |
12 | Jones | E3 | 92 |
12 | Jones | E4 | 99 |
12 | Jones | P1 | 99 |
12 | Jones | F1 | 185 |
into a fat data set, say gradestoo, containing the same information but in a different structure:
idno | l_name | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|
10 | Smith | 78 | 82 | 86 | 69 | 97 | 160 |
11 | Simon | 88 | 72 | 86 | 99 | 100 | 170 |
12 | Jones | 98 | 92 | 92 | 99 | 99 | 185 |
We can use, as you'll soon see, a "LAST." variable in conjunction with BY-group processing, a RETAIN statement, and an OUTPUT statement to transpose the grades data set.
Example 17.8
The following program illustrates a failed attempt at transposing the original "tall" grades data set (one observation per grade) to a "fat" data set (one observation per student):
DATA grades2;
set grades;
by idno;
if gtype = 'E1' then E1 = grade;
else if gtype = 'E2' then E2 = grade;
else if gtype = 'E3' then E3 = grade;
else if gtype = 'E4' then E4 = grade;
else if gtype = 'P1' then P1 = grade;
else if gtype = 'F1' then F1 = grade;
if last.idno then output;
drop gtype grade;
RUN;
PROC PRINT data=grades2;
title 'Output Dataset: FAULTY TRANSPOSED GRADES';
RUN;
Obs | idno | l_name | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|
1 | 10 | Smith | . | . | . | . | 97 | . |
2 | 11 | Simon | . | . | . | . | 100 | . |
3 | 12 | Jones | . | . | . | . | 99 | . |
In summary, the program is trying to take the grades of each student ("by idno") that appear in the variable grade and depending on what type of grade they are ("if gtype =") assign them to the new variables E1, E2, ..., F1. Only when the last observation is encountered for each student ("if last.idno") are the data output to the grades2 data set. The only problem is that because there is no RETAIN statement to tell SAS to retain the values of E1, E2, ..., F1 from one iteration of the DATA step to the next. Therefore, by the time SAS goes to output the program data vector, the values of E1, E2, ..., F1 have been set to missing.
Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the grades E1, E2, ..., F1 are not retained from iteration to iteration and therefore are all missing. The reason the grade for P1 is not missing is that its value is encountered last in the sorted grades data set. Therefore, there are no more iterations of the DATA step after its value has been assigned, and SAS has no opportunity to set it to missing.
The following program correctly makes the transposition using the RETAIN statement:
DATA grades3;
DATA grades3;
set grades;
by idno;
if gtype = 'E1' then E1 = grade;
else if gtype = 'E2' then E2 = grade;
else if gtype = 'E3' then E3 = grade;
else if gtype = 'E4' then E4 = grade;
else if gtype = 'P1' then P1 = grade;
else if gtype = 'F1' then F1 = grade;
if last.idno then output;
retain E1 E2 E3 E4 P1 F1;
drop gtype grade;
RUN;
PROC PRINT data=grades3;
title 'Output Dataset: TRANSPOSED GRADES';
RUN;
Obs | idno | l_name | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|
1 | 10 | Smith | 78 | 82 | 86 | 69 | 97 | 160 |
2 | 11 | Simon | 88 | 72 | 86 | 99 | 100 | 170 |
3 | 12 | Jones | 98 | 92 | 92 | 99 | 99 | 185 |
Note that the DATA step in this program is identical to the DATA step in the previous program except this one has a RETAIN statement while the previous one did not. First, inspect the program so you understand what each line of the program is doing. Then, let's work our way through how SAS processes the DATA step.
The program data vector contains the automatic variables (_N_ and _ERROR_), the four variables in the grades data set (idno, l_name, gtype, and grade), six variables created within the DATA step (E1, E2, E3, E4, P1, and F1), and as a result of the BY statement, a first.idno and a last.idno variable. Noting that the _ERROR_ variable has been abbreviated to _Err_, here's what the program data vector looks like at the beginning of the first iteration of the DATA step:
_N_ | _ERR_ | idno | l_name | gtype | grade | first.idno | last.idno | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 0 | . | . | . | . | . | . | . | . | . |
Now, SAS reads the first observation from the grades data set. The observation is the first in the group of id numbers that equal 10, therefore first.idno is assigned the value of 1 and last.idno is assigned a value of 0. Because gtype equals E1, the E1 variable is assigned the value contained in the grade variable, that is, 78. Here's what the program data vector looks like now:
_N_ | _ERR_ | idno | l_name | gtype | grade | first.idno | last.idno | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 10 | Smith | E1 | 78 | 1 | 0 | 78 | . | . | . | . | . |
Because last.idno does not equal 1, SAS does not write the contents of the program data vector to the grades3 data set. Instead, SAS returns to the top of the DATA step to begin processing the second observation. Typically, SAS would reset the values of the variables created within the DATA step, that is, E1, E2, ..., and F1, to missing. The RETAIN statement overrides that default and the values of those variables from the previous iteration are retained. SAS reads the second observation from the grades data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore first.idno and last.idno are both assigned a value of 0. Because gtype equals E2, the E2 variable is assigned the value contained in the grade variable, that is, 82. Here's what the program data vector looks like now:
_N_ | _ERR_ | idno | l_name | gtype | grade | first.idno | last.idno | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 0 | 10 | Smith | E2 | 82 | 0 | 0 | 78 | 82 | . | . | . | . |
Again, because last.idno still does not equal 1, SAS does not write the contents of the program data vector to the grades3 data set. Instead, SAS returns to the top of the DATA step to begin processing the third observation. Again, the values of E1, E2, ..., and F1 are retained from the previous iteration. SAS reads the third observation from the grades data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore first.idno and last.idno are both assigned a value of 0. Because gtype equals E3, the E3 variable is assigned the value contained in the grade variable, that is, 86. Here's what the program data vector looks like now:
_N_ | _ERR_ | idno | l_name | gtype | grade | first.idno | last.idno | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 0 | 10 | Smith | E3 | 86 | 0 | 0 | 78 | 82 | 86 | . | . | . |
Again, because last.idno still does not equal 1, SAS does not write the contents of the program data vector to the grades3 data set. Instead, SAS returns to the top of the DATA step to begin processing the fourth observation. Again, the values of E1, E2, ..., and F1 are retained from the previous iteration. SAS reads the fourth observation from the grades data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore first.idno and last.idno are both assigned a value of 0. Because gtype equals E4, the E4 variable is assigned the value contained in the grade variable, that is, 69. Here's what the program data vector looks like now:
_N_ | _ERR_ | idno | l_name | gtype | grade | first.idno | last.idno | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | 0 | 10 | Smith | E4 | 69 | 0 | 0 | 78 | 82 | 86 | 69 | . | . |
Again, because last.idno still does not equal 1, SAS does not write the contents of the program data vector to the grades3 data set. Instead, SAS returns to the top of the DATA step to begin processing the fifth observation. Again, the values of E1, E2, ..., and F1 are retained from the previous iteration. SAS reads the fifth observation from the grades data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore first.idno and last.idno are both assigned a value of 0. Because gtype equals F1 (don't forget that we sorted the grades data set by idno and gtype), the F1 variable is assigned the value contained in the grade variable, that is, 160. Here's what the program data vector looks like now:
_N_ | _ERR_ | idno | l_name | gtype | grade | first.idno | last.idno | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 10 | Smith | F1 | 160 | 0 | 0 | 78 | 82 | 86 | 69 | . | 160 |
Again, because last.idno still does not equal 1, SAS does not write the contents of the program data vector to the grades3 data set. Instead, SAS returns to the top of the DATA step to begin processing the sixth observation. Again, the values of E1, E2, ..., and F1 are retained from the previous iteration. SAS reads the sixth observation from the grades data set. The observation is the last in the group of id numbers that equal 10, therefore last.idno is assigned the value of 1 and first.idno is assigned a value of 0.. Because gtype equals P1 (don't forget that we sorted the grades data set by idno and gtype), the P1 variable is assigned the value contained in the grade variable, that is, 97. Here's what the program data vector looks like now:
_N_ | _ERR_ | idno | l_name | gtype | grade | first.idno | last.idno | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 0 | 10 | Smith | P1 | 97 | 0 | 1 | 78 | 82 | 86 | 69 | 97 | 160 |
Ahhhh... finally.... now, because last.idno equals 1, SAS writes the contents of the program data vector to the grades3 data set. In doing so, SAS does not write the automatic variables _N_ and _ERROR, nor the first.idno and last.idno variables, to the data set. As instructed by the code, SAS drops the gtype and grade variables. So, here's what the grades3 data set looks like after processing the first six observations:
idno | l_name | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|
0 | Smith | 78 | 82 | 86 | 69 | 97 | 160 |
Here's where it should be clear that SAS has, by virtue of the code we've written, effectively taken six observations and collapsed them into one fat observation. Now, SAS moves back to the top of the DATA step and begins processing the seventh observation. Because every combination of student and grade type is represented in the data set, it is not necessary to reset the retained values to missing, however, you may wish to do so, just in case. You can take it from here.
Now, launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that this time the grades E1, E2, ..., F1 are appropriately assigned and retained. Also, note that we have successfully transposed the grades data set from a "tall" data set to a "fat" data set.
Just to close the loop, the following program calculates the final grades of the students using the newly transposed data set:
DATA grades4;
set grades3;
if E1 = min(E1, E2, E3, E4) then E1 = .;
else if E2 = min(E1, E2, E3, E4) then E2 = .;
else if E3 = min(E1, E2, E3, E4) then E3 = .;
else if E4 = min(E1, E2, E3, E4) then E4 = .;
fnl = sum(E1, E2, E3, E4, P1, F1)/6;
format fnl 5.1;
RUN;
PROC PRINT data=grades4;
title 'Output Dataset: FINAL GRADES calculated from TRANSPOSED GRADES';
RUN;
Obs | idno | l_name | E1 | E2 | E3 | E4 | P1 | F1 | fnl |
---|---|---|---|---|---|---|---|---|---|
1 | 10 | Smith | 78 | 82 | 86 | . | 97 | 160 | 83.8 |
2 | 11 | Simon | 88 | . | 86 | 99 | 100 | 170 | 90.5 |
3 | 12 | Jones | 98 | . | 92 | 99 | 99 | 185 | 95.5 |
The MIN function:
min(var1, var2, var3, .., varn)
returns the smallest number of the given arguments. Therefore, the IF-THEN-ELSE statement in the DATA step replaces the lowest exam grade encountered for each student with a missing value. For example, if SAS deems that a student's E1 value is the smallest of her four exam grades, then SAS will set E1 to missing. Then, SAS calculates the student's final (fnl) grade by merely summing the student's available grades and dividing by 6.
Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the student's final grades have been again appropriately calculated.
17.5 - Summary
17.5 - SummaryIn this lesson, we learned how the OUTPUT and RETAIN statements allow us to control the output of data sets.
The homework for this lesson will give you practice with these techniques.