Let's suppose we want to read a raw data file that, in each record, contains a student ID number followed by the same number of grades, such as here:
Student Grades Per Exam | |||
---|---|---|---|
111000234 | 79 | 82 | 100 |
922232573 | 87 | 89 | 95 |
252359873 | 65 | 72 | 73 |
205804679 | 92 | 95 | 99 |
In some situations, it might work just fine to read in this data file using three different variables for the grades — grade1, grade2, and grade3, say — so that the resulting SAS data set looks like this:
ID | grade1 | grade2 | grade3 |
---|---|---|---|
111000231 | 79 | 82 | 100 |
922232573 | 87 | 89 | 95 |
252359873 | 65 | 72 | 73 |
205804679 | 92 | 95 | 99 |
In other cases, though, it makes more sense to read all of the grades into one score variable — if you wanted to calculate an overall average across all students and all grades, say. In that case, we'd want our data set to have this structure instead:
ID | exam | score |
---|---|---|
111000234 | 1 | 79 |
111000234 | 2 | 82 |
111000234 | 3 | 100 |
922232573 | 1 | 87 |
922232573 | 2 | 89 |
922232573 | 3 | 95 |
252359873 | 1 | 65 |
252359873 | 2 | 72 |
252359873 | 3 | 73 |
205804679 | 1 | 92 |
205804679 | 2 | 95 |
205804679 | 3 | 99 |
in which each record in the input data file produces three observations, with each observation containing three variables — ID, exam, and score.
In order to read in the data set as desired, we need to take advantage of the INPUT statement's single trailing at sign (@). Just like the double trailing at sign (@@), the single trailing @ is a line-hold specifier that tells SAS to hold the current record in the input buffer while SAS executes the next INPUT statement. The single trailing @ releases the current record from the input buffer:
- when SAS encounters an INPUT statement without a line-hold specifier, or
- when SAS returns to the top of the DATA step.
The trailing @ will do the trick for us here because we do not need SAS to hold the current record across multiple iterations of the DATA step. Let's make more sense of all of this by looking at an example!
Example 21.7 Section
The following program uses single trailing at signs (@) in multiple INPUT statements along with multiple output statements to read a raw data file into a temporary SAS data set called grades:
OPTIONS PS = 58 LS = 72 NODATE NONUMBER;
DATA grades;
input ID @;
input score @;
output;
input score @;
output;
input score @;
output;
DATALINES;
111000234 79 82 100
922232573 87 89 95
252359873 65 72 73
205804679 92 95 99
;
RUN;
PROC PRINT data = grades NOOBS;
title 'Grades data set';
RUN;
ID | score |
---|---|
111000234 | 79 |
111000234 | 82 |
111000234 | 100 |
922232573 | 87 |
922232573 | 89 |
922232573 | 95 |
252359873 | 65 |
252359873 | 72 |
252359873 | 73 |
205804679 | 92 |
205804679 | 95 |
205804679 | 99 |
Let's work our way through the program, from top to bottom, to make sure we understand what it is doing:
- The first INPUT statement tells SAS to place the first record into the input buffer, as well as read the value for the ID variable. The trailing @ tells SAS to hold the first record in the input buffer.
- The second INPUT statement tells SAS to read the next value in the input buffer into the variable called score. The trailing @ tells SAS to continue to hold the first record in the input buffer.
- The OUTPUT statement then tells SAS to write the current contents of the program data vector containing the values for the ID and score variables — 111000234 and 79, respectively — to create the first observation in the grades data set.
- The next INPUT statement tells SAS to read the next value in the input buffer into the variable called score. The value 82 writes over the 79 that's in the score field in the program data vector. The ID value, 111000234, in the program data vector remains unchanged. The trailing @ tells SAS to continue to hold the first record in the input buffer.
- The next OUTPUT statement then tells SAS to write the current contents of the program data vector containing the values for the ID and score variables — 111000234 and 82, respectively — to create the second observation in the grades data set.
- The next INPUT statement tells SAS to read the next value in the input buffer into the variable called score. The value 100 writes over the 82 that's in the score field in the program data vector. The ID value, 111000234, in the program data vector remains unchanged. The trailing @ tells SAS to continue to hold the input buffer. The trailing @ is really not necessary here, since we've reached the bottom of the DATA step anyway.
- The final OUTPUT statement then tells SAS to write the current contents of the program data vector containing the values for the ID and score variables — 111000234 and 100, respectively — to create the third observation in the grades data set.
- Having reached the bottom of the DATA step, SAS returns to the top of the DATA step to begin processing the next iteration. In so doing, the first record that was being held in the input buffer is released.
- The process continues as just described until SAS reads the last field in the last record.
Launch and run the SAS program, and review the output to convince yourself that the program does indeed create a data set called grades, in which each record contains just one of the grades for one of the students.
Now, you might be looking at that last program and wondering "Geez, what if I had an input data file with 100 repeating fields ... would I really need to write a program with 101 input statements and 100 output statements?" This program looks like a good candidate for an iterative DO loop, doesn't it?
Example 21.8 Section
The following program is identical in behavior to the previous program, except here the alternating sets of three INPUT and OUTPUT statements have been collapsed into one iterative DO loop:
OPTIONS PS = 58 LS = 72 NODATE NONUMBER;
DATA grades;
input ID @;
DO exam = 1 to 3;
input score @;
output;
END;
DATALINES;
111000234 79 82 100
922232573 87 89 95
252359873 65 72 73
205804679 92 95 99
;
RUN;
PROC PRINT data = grades NOOBS;
title 'Grades data set';
RUN;
ID | exam | score |
---|---|---|
111000234 | 1 | 79 |
111000234 | 2 | 82 |
111000234 | 3 | 100 |
922232573 | 1 | 87 |
922232573 | 2 | 89 |
922232573 | 3 | 95 |
252359873 | 1 | 65 |
252359873 | 2 | 72 |
252359873 | 3 | 73 |
205804679 | 1 | 92 |
205804679 | 2 | 95 |
205804679 | 3 | 99 |
Let's see how this program works! During the compile phase, SAS creates a program data vector containing the automatic variables — _N_ and _ERROR_ — and the three variables mentioned in the DATA step — ID, exam, and score. Then, SAS moves on to the execution phase.
During the first iteration, the first INPUT statement tells SAS to place the first record in the input data file into the input buffer:
1---+----↓0---+----20 |
---|
11100234 79 82 100 |
and to read the value for the ID variable into the program data vector:
_N_ | _ERROR_ | ID | exam | score |
---|---|---|---|---|
1 | 0 | 111000234 | . | . |
The trailing @ tells SAS to hold the record in the input buffer. Then, SAS begins the execution of the DO loop. On the first iteration of the DO loop, exam is set to 1, the INPUT statement reads the first repeating field (79) and assigns the value to score in the program data vector:
_N_ | _ERROR_ | ID | exam | score |
---|---|---|---|---|
1 | 0 | 111000234 | 1 | 79 |
The trailing @ tells SAS to continue to hold the record in the input buffer. The OUTPUT statement tells SAS to write the contents of the program data vector to create the first observation of the grades data set. Having reached the DO loop's END statement, SAS returns to the top of the DO loop and sets exam to 2. The INPUT statement reads the second repeating field (82) and assigns the value to score in the program data vector:
_N_ | _ERROR_ | ID | exam | score |
---|---|---|---|---|
1 | 0 | 111000234 | 2 | 82 |
The trailing @ tells SAS to continue to hold the record in the input buffer. The OUTPUT statement tells SAS to write the contents of the program data vector to create the second observation of the grades data set. Having again reached the DO loop's END statement, SAS returns to the top of the DO loop and sets exam to 3. The INPUT statement reads the third, and final, repeating field (100) and assigns the value to score in the program data vector:
_N_ | _ERROR_ | ID | exam | score |
---|---|---|---|---|
1 | 0 | 111000234 | 3 | 100 |
The trailing @ tells SAS to continue to hold the record in the input buffer. The OUTPUT statement tells SAS to write the contents of the program data vector to create the third observation of the grades data set. Having again reached the DO loop's END statement, SAS returns to the top of the DO loop and sets exam to 4. Because SAS is told to end the DO loop when exam is greater than 3, SAS ends the DO loop and moves on to the next statement in the DATA step. Oh, but it's the DATA step's RUN statement.
So, SAS returns to the top of the DATA step. The first record is released from the input buffer. The first INPUT statement now tells SAS to place the second record in the input data file into the input buffer:
↓---+----10---+----20 |
---|
922232573 87 89 95 |
The automatic variable _N_ is increased to 2, _ERROR_ is set to 0, and the other variables in the program data vector are reset to missing:
_N_ | _ERROR_ | ID | exam | score |
---|---|---|---|---|
2 | 0 | . | . | . |
Okay, let's get off of this merry-go-round! SAS proceeds through the program and the input data file just as described above until SAS reaches the last field in the last record. Let's have you now launch and run the SAS program, and review the output to convince yourself that SAS did indeed create the grades data set as described.
It probably won't hurt to summarize again the main differences between the double trailing at sign (@@) and the single trailing at sign (@):
- The double trailing @@ holds a record in the input buffer across multiple iterations of the DATA step until the end of the record is reached.
- The single trailing @ releases a record when SAS returns to the top of the DATA step to process the next iteration.
Now, let's go investigate how we have to modify our programs here to accommodate data files that contain a varying number of repeating fields in each record.