19.6 - Using Arrays to Transpose a Data Set

A few lessons back, we learned how to transpose a data set by taking advantage of the last. variable and RETAIN and OUTPUT statements. In this section, we'll learn how to use an array to transpose a "tall" data set into a "fat" data set.

Example 19.15 Section

Throughout this section, we will work with the tallgrades data set that is created in the following DATA step:

DATA tallgrades;
    input idno 1-2 l_name $ 5-9 gtype $ 12-13 grade 15-17;
    cards;
10  Smith  E1  78
10  Smith  E2  82
10  Smith  E3  86
10  Smith  E4  69
10  Smith  P1  97
10  Smith  F1 160
11  Simon  E1  88
11  Simon  E2  72
11  Simon  E3  86
11  Simon  E4  99
11  Simon  P1 100
11  Simon  F1 170
12  Jones  E1  98
12  Jones  E2  92
12  Jones  E3  92
12  Jones  E4  99
12  Jones  P1  99
12  Jones  F1 185
;
RUN;

PROC PRINT data = tallgrades NOOBS;
   TITLE 'The tall grades data set';
RUN;

The tall grades data set
idno	l_name	gtype	grade
10	Smith	E1	78
10	Smith	E2	82
10	Smith	E3	86
10	Smith	E4	69
10	Smith	P1	97
10	Smith	F1	160
11	Simon	E1	88
11	Simon	E2	72
11	Simon	E3	86
11	Simon	E4	99
11	Simon	P1	100
11	Simon	F1	170
12	Jones	E1	98
12	Jones	E2	92
12	Jones	E3	92
12	Jones	E4	99
12	Jones	P1	99
12	Jones	F1	185

The tallgrades data set contains one observation for each grade for each student. Students are identified by their ID number (idno) and last name (l_name). The data set contains six different types of grades: exam 1 (E1), exam 2 (E2), exam 3 (E3), exam 4 (E4), each worth 100 points; one project (P1) worth 100 points; and a final exam (F1) worth 200 points. Launch and run the SAS program so that we can work with the tallgrades data set in the next two examples.

Example 19.16 Section

(You might recall seeing this program a few lessons ago.) Using RETAIN and OUTPUT statements, the following program takes advantage of BY-group processing, as well as RETAIN and OUTPUT statements, to transpose the tallgrades data set (one observation per grade) into the fatgrades data set (one observation per student):

DATA fatgrades;
	set tallgrades;
	by idno;
			 if gtype = 'E1' then E1 = grade;
		else if gtype = 'E2' then E2 = grade;
		else if gtype = 'E3' then E3 = grade;
		else if gtype = 'E4' then E4 = grade;
		else if gtype = 'P1' then P1 = grade;
		else if gtype = 'F1' then F1 = grade;
	if last.idno then output;
	retain E1 E2 E3 E4 P1 F1;
	drop gtype grade;
RUN;
 
PROC PRINT data=fatgrades;
	title 'The fat grades data set';
RUN;

The fat grades data set
Obs	idno	l_name	E1	E2	E3	E4	P1	F1
1	10	Smith	78	82	86	69	97	160
2	11	Simon	88	72	86	99	100	170
3	12	Jones	98	92	92	99	99	185

First, note that the program takes the grades of each student ("by idno") that appear in the variable grade and, depending on what type of grade they are ("if gtype ="), assigns them to the new variables E1, E2, ..., F1. Only when the last observation is encountered for each student ("if last.idno") is the data output to the fatgrades data set. The RETAIN statement tells SAS to retain the values of E1, E2, ..., F1 from one iteration of the DATA step to the next. If the RETAIN statement was not present, by the time SAS went to output the program data vector, the values of E1, E2, ..., F1 would be set to missing.

Now, launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the grades E1, E2, ..., F1 are appropriately assigned and retained. Also, note that we have successfully transposed the tallgrades data set from a "tall" data set to a "fat" fatgrades data set.

Example 19.17 Section

The following program uses an array to transpose the tallgrades data set (one observation per grade) into the fatgrades data set (one observation per student):

DATA fatgrades;
	set tallgrades;
   by idno;
   array allgrades (6) G1 - G6;
   if first.idno then i = 1;
   allgrades(i) = grade;
   if last.idno then output;
   i + 1;
   retain G1 - G6;
   drop i gtype grade;
RUN;
 
PROC PRINT data=fatgrades;
  title 'The fat grades data set';
RUN;

The fat grades data set
Obs	idno	l_name	G1	G2	G3	G4	G5	G6
1	10	Smith	78	82	86	69	97	160
2	11	Simon	88	72	86	99	100	170
3	12	Jones	98	92	92	99	99	185

Yikes! This code looks even scarier! Let's dissect it a bit. First, just as in the previous program, the tallgrades data set is processed BY idno. Doing so makes the first.idno and last.idno variables available for us to use. The ARRAY statement defines an array called allgrades and, using a numbered range list, associates the array with six (uninitialized) variables G1, G2, ..., G6. The allgrades array is used to hold the six grades for each student before they are output in their transposed direction to the fatgrades data set. Because the elements of any array, and therefore allgrades, must be assigned using an index variable, this is how the transposition takes place:

("if first.idno then i = 1;") If the input observation contains a student idno that hasn't yet been encountered in the data set, then the index variable i is initialized to 1. If the input observation doesn't contain a new student idno, then do nothing other than advance to the next step.
("allgrades(i) = grade;") The grade from the current observation is assigned to the array allgrades. (For example, if the input observation contains Smith's first grade, then allgrades(1) is assigned the value 78. If the input observation contains Smith's second grade, then allgrades(2) are assigned the value 82. And so on.)
("if last.idno then output;") If the input observation is the last observation in the data set that contains the student idno, then dump the program data vector (which contains allgrades) to the output data set. (For example, if the input observation is Smith's final exam grade, then output the now fat observation containing his six grades). If the input observation is not the last observation in the data set that contains the student idno, do nothing other than advance to the next step.
("i + 1;") Then, increase the index variable i by 1. (For example, if i is 1, change i to 2.)
("retain G1-G6;") Rather than setting G1, G2, ..., G6 to missing at the beginning of the next iteration of the data step, retain their current values. (So, for example, for Smith, allgrades(1) would retain its value of 78, allgrades (2) would retain its value of 82, and so on.)

The program would keep cycling through the above five steps until it encountered the last observation in the data set. Then, the variables i, gtype, and grade would be dropped from the output fatgrades data set.

Now, launch and run the SAS program. Review the output from the PRINT procedure to convince yourself that we have now successfully used arrays to transpose the tallgrades data set from a "tall" data set to a "fat" fatgrades data set.