A few lessons back, we learned how to transpose a data set by taking advantage of the last. variable and RETAIN and OUTPUT statements. In this section, we'll learn how to use an array to transpose a "tall" data set into a "fat" data set.
Example 19.15 Section
Throughout this section, we will work with the tallgrades data set that is created in the following DATA step:
DATA tallgrades;
input idno 1-2 l_name $ 5-9 gtype $ 12-13 grade 15-17;
cards;
10 Smith E1 78
10 Smith E2 82
10 Smith E3 86
10 Smith E4 69
10 Smith P1 97
10 Smith F1 160
11 Simon E1 88
11 Simon E2 72
11 Simon E3 86
11 Simon E4 99
11 Simon P1 100
11 Simon F1 170
12 Jones E1 98
12 Jones E2 92
12 Jones E3 92
12 Jones E4 99
12 Jones P1 99
12 Jones F1 185
;
RUN;
PROC PRINT data = tallgrades NOOBS;
TITLE 'The tall grades data set';
RUN;
idno | l_name | gtype | grade |
---|---|---|---|
10 | Smith | E1 | 78 |
10 | Smith | E2 | 82 |
10 | Smith | E3 | 86 |
10 | Smith | E4 | 69 |
10 | Smith | P1 | 97 |
10 | Smith | F1 | 160 |
11 | Simon | E1 | 88 |
11 | Simon | E2 | 72 |
11 | Simon | E3 | 86 |
11 | Simon | E4 | 99 |
11 | Simon | P1 | 100 |
11 | Simon | F1 | 170 |
12 | Jones | E1 | 98 |
12 | Jones | E2 | 92 |
12 | Jones | E3 | 92 |
12 | Jones | E4 | 99 |
12 | Jones | P1 | 99 |
12 | Jones | F1 | 185 |
The tallgrades data set contains one observation for each grade for each student. Students are identified by their ID number (idno) and last name (l_name). The data set contains six different types of grades: exam 1 (E1), exam 2 (E2), exam 3 (E3), exam 4 (E4), each worth 100 points; one project (P1) worth 100 points; and a final exam (F1) worth 200 points. Launch and run the SAS program so that we can work with the tallgrades data set in the next two examples.
Example 19.16 Section
(You might recall seeing this program a few lessons ago.) Using RETAIN and OUTPUT statements, the following program takes advantage of BY-group processing, as well as RETAIN and OUTPUT statements, to transpose the tallgrades data set (one observation per grade) into the fatgrades data set (one observation per student):
DATA fatgrades;
set tallgrades;
by idno;
if gtype = 'E1' then E1 = grade;
else if gtype = 'E2' then E2 = grade;
else if gtype = 'E3' then E3 = grade;
else if gtype = 'E4' then E4 = grade;
else if gtype = 'P1' then P1 = grade;
else if gtype = 'F1' then F1 = grade;
if last.idno then output;
retain E1 E2 E3 E4 P1 F1;
drop gtype grade;
RUN;
PROC PRINT data=fatgrades;
title 'The fat grades data set';
RUN;
Obs | idno | l_name | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|
1 | 10 | Smith | 78 | 82 | 86 | 69 | 97 | 160 |
2 | 11 | Simon | 88 | 72 | 86 | 99 | 100 | 170 |
3 | 12 | Jones | 98 | 92 | 92 | 99 | 99 | 185 |
First, note that the program takes the grades of each student ("by idno") that appear in the variable grade and, depending on what type of grade they are ("if gtype ="), assigns them to the new variables E1, E2, ..., F1. Only when the last observation is encountered for each student ("if last.idno") is the data output to the fatgrades data set. The RETAIN statement tells SAS to retain the values of E1, E2, ..., F1 from one iteration of the DATA step to the next. If the RETAIN statement was not present, by the time SAS went to output the program data vector, the values of E1, E2, ..., F1 would be set to missing.
Now, launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the grades E1, E2, ..., F1 are appropriately assigned and retained. Also, note that we have successfully transposed the tallgrades data set from a "tall" data set to a "fat" fatgrades data set.
Example 19.17 Section
The following program uses an array to transpose the tallgrades data set (one observation per grade) into the fatgrades data set (one observation per student):
DATA fatgrades;
set tallgrades;
by idno;
array allgrades (6) G1 - G6;
if first.idno then i = 1;
allgrades(i) = grade;
if last.idno then output;
i + 1;
retain G1 - G6;
drop i gtype grade;
RUN;
PROC PRINT data=fatgrades;
title 'The fat grades data set';
RUN;
Obs | idno | l_name | G1 | G2 | G3 | G4 | G5 | G6 |
---|---|---|---|---|---|---|---|---|
1 | 10 | Smith | 78 | 82 | 86 | 69 | 97 | 160 |
2 | 11 | Simon | 88 | 72 | 86 | 99 | 100 | 170 |
3 | 12 | Jones | 98 | 92 | 92 | 99 | 99 | 185 |
Yikes! This code looks even scarier! Let's dissect it a bit. First, just as in the previous program, the tallgrades data set is processed BY idno. Doing so makes the first.idno and last.idno variables available for us to use. The ARRAY statement defines an array called allgrades and, using a numbered range list, associates the array with six (uninitialized) variables G1, G2, ..., G6. The allgrades array is used to hold the six grades for each student before they are output in their transposed direction to the fatgrades data set. Because the elements of any array, and therefore allgrades, must be assigned using an index variable, this is how the transposition takes place:
- ("if first.idno then i = 1;") If the input observation contains a student idno that hasn't yet been encountered in the data set, then the index variable i is initialized to 1. If the input observation doesn't contain a new student idno, then do nothing other than advance to the next step.
- ("allgrades(i) = grade;") The grade from the current observation is assigned to the array allgrades. (For example, if the input observation contains Smith's first grade, then allgrades(1) is assigned the value 78. If the input observation contains Smith's second grade, then allgrades(2) are assigned the value 82. And so on.)
- ("if last.idno then output;") If the input observation is the last observation in the data set that contains the student idno, then dump the program data vector (which contains allgrades) to the output data set. (For example, if the input observation is Smith's final exam grade, then output the now fat observation containing his six grades). If the input observation is not the last observation in the data set that contains the student idno, do nothing other than advance to the next step.
- ("i + 1;") Then, increase the index variable i by 1. (For example, if i is 1, change i to 2.)
- ("retain G1-G6;") Rather than setting G1, G2, ..., G6 to missing at the beginning of the next iteration of the data step, retain their current values. (So, for example, for Smith, allgrades(1) would retain its value of 78, allgrades (2) would retain its value of 82, and so on.)
The program would keep cycling through the above five steps until it encountered the last observation in the data set. Then, the variables i, gtype, and grade would be dropped from the output fatgrades data set.
Now, launch and run the SAS program. Review the output from the PRINT procedure to convince yourself that we have now successfully used arrays to transpose the tallgrades data set from a "tall" data set to a "fat" fatgrades data set.