21.3 - Reading Repeating Blocks of Data
21.3 - Reading Repeating Blocks of DataLet's suppose we want to read a raw data file containing repeating blocks of data values — month and average high and low temperatures — in each record, such as here:
Monthly High/ Low Temps | ||||||||
---|---|---|---|---|---|---|---|---|
Jan | 32 | 16 | Feb | 35 | 18 | Mar | 46 | 26 |
Apr | 58 | 37 | May | 68 | 47 | Jun | 78 | 56 |
Jul | 82 | 60 | Aug | 80 | 58 | Sep | 72 | 51 |
Oct | 61 | 40 | Nov | 48 | 32 | Dec | 37 | 22 |
We might be tempted to read in this data file using three different variables for the months, three different variables for the high temperatures, and three different variables for the low temperatures, so that the resulting SAS data set looks like this:
Month1 | High1 | Low1 | Month2 | High2 | Low2 | Month3 | High3 | Low3 |
---|---|---|---|---|---|---|---|---|
Jan | 32 | 16 | Feb | 35 | 18 | Mar | 46 | 26 |
Apr | 58 | 37 | May | 68 | 47 | Jun | 78 | 56 |
Jul | 82 | 60 | Aug | 80 | 58 | Sep | 72 | 51 |
Oct | 61 | 40 | Nov | 48 | 32 | Dec | 37 | 22 |
That makes for a quick and easy way to read the data values, but a hard way to analyze the data! For example, we can't even use the MEANS procedure to calculate the annual average of the high temperatures, because the values are contained in three separate variables. What we want instead is a data set that looks like this:
Month | Avg High | Avg Low |
---|---|---|
Jan | 32 | 16 |
Feb | 35 | 18 |
Mar | 46 | 26 |
Apr | 58 | 37 |
May | 68 | 47 |
Jun | 78 | 56 |
Jul | 82 | 60 |
Aug | 80 | 58 |
Sep | 72 | 51 |
Oct | 61 | 40 |
Nov | 48 | 32 |
Dec | 37 | 22 |
in which each record in the input data file produces three observations, with each observation containing three variables — Month, AvgHigh, and AvgLow. When the data set is structured like this, it is very a simple matter to use the AvgHigh variable in the MEANS procedure to calculate the annual average of the high temperatures.
In order to read the data set as desired, we need to take advantage of the INPUT statement's double trailing at sign (@@). Errr ... what the heck is that? The double trailing at sign (@@) tells SAS rather than advancing to a new record, to hold the current input record for the execution of the next INPUT statement, even across iterations of the DATA step. Perfectly clear, eh? Let's take a look at an example to see if we can make sense of it!
Example 21.6
The following SAS program uses the double trailing at sign (@@) to read the average high and low temperatures in State College, PA for each month when each record in the input data file contains the data values for three observations:
OPTIONS PS = 58 LS = 72 NODATE NONUMBER;
DATA StateCollege;
input Month $ AvgHigh AvgLow @@;
DATALINES;
Jan 32 16 Feb 35 18 Mar 46 26
Apr 58 37 May 68 47 Jun 78 56
Jul 82 60 Aug 80 58 Sep 72 51
Oct 61 40 Nov 48 32 Dec 37 22
;
RUN;
PROC PRINT data = StateCollege;
title 'Average Temperatures for State College';
id Month;
RUN;
The only thing that should stand out as unusual in this program is the double trailing at sign (@@) that appears at the end of the INPUT statement. Incidentally, it's not a coincidence that it is called trailing @@. That's because it must always be the last item to appear in the INPUT statement.
First, let's have you go ahead and launch and run the SAS program, and review the output to convince yourself that SAS reads in the temperature data as we desired. Then, let's talk about how the trailing @@ works by investigating how SAS reads the data in this program into the StateCollege data set.
As you know from our work in Stat 480, during the compile phase, SAS creates a program data vector containing a storage location for each variable name that appears in the INPUT statement and subsequent assignment statements. So, this is what ours looks like at the beginning of the execution phase:
_N_ | _ERROR_ | Month | AvgHigh | AvgLow |
---|---|---|---|---|
1 | 0 | . | . |
And, as the execution phase begins and SAS encounters the INPUT statement, SAS reads the first line in the input data file into an input buffer, getting:
↓---+----10---+----20---+----30 |
---|
Jan 32 16 Feb 35 18 Mar 46 26 |
Of course, the numbers in the white row are just the data values for the first record in our input data file. You can think of the numbers in the gray row as a "ruler" that SAS uses to keep track of the column in which the input pointer resides. SAS won't really be needing the ruler, in this case, because we are using list input. The arrow indicates that the input pointer is in column 1, where SAS places it by default since it is not told to do otherwise.
During the first iteration of the DATA step, the INPUT statement tells SAS to read AvgHigh AvgLow Mean. After doing so, the input pointer points to the blank space after the low-temperature value of 16:
1---+----↓0---+----20---+----30 |
---|
Jan 32 16 Feb 35 18 Mar 46 26 |
and SAS places the values that are read into their appropriate place in the program data vector:
_N_ | _ERROR_ | Month | AvgHigh | AvgLow |
---|---|---|---|---|
1 | 0 | Jan | 32 | 16 |
Because the INPUT statement is the only statement in the program, after SAS executes it, the first iteration of the DATA step is complete. Therefore, SAS writes the first observation to the StateCollege data set and returns to the top of the DATA step to begin its second iteration. The automatic variable _N_ is increased to 2, and the variable values in the program data vector are reset to missing:
_N_ | _ERROR_ | Month | AvgHigh | AvgLow |
---|---|---|---|---|
2 | 0 | . | . |
Now, this is where we first see the effect of the trailing @@ in our INPUT statement. Because SAS is beginning another iteration of the DATA step, by default the INPUT statement should be telling SAS to put the next record in the input data file into the input buffer. The trailing @@ tells SAS not to do that though! Instead, the trailing @@ tells SAS to hold onto the record that is currently in the input buffer and to read the values for AvgHigh AvgLow Mean from it. After doing so, the input pointer points to the blank space after the low-temperature value of 18:
1---+----10---+----↓0---+----30 |
---|
Jan 32 16 Feb 35 18 Mar 46 26 |
and SAS places the values that are read into their appropriate place in the program data vector:
_N_ | _ERROR_ | Month | AvgHigh | AvgLow |
---|---|---|---|---|
2 | 0 | Feb | 35 | 18 |
The second iteration of the DATA step being complete, SAS writes the second observation to the StateCollege data set and returns to the top of the DATA step to begin its third iteration.
You should get the idea now. The automatic variable _N_ is increased to 3, and the variable values in the program data vector are reset to missing. The trailing @@ tells SAS to hold onto the record that is currently in the input buffer and to read the values for AvgHigh AvgLow Mean from it. After doing so, the input pointer points to the blank space after the low-temperature value of 26, which happens to coincide with the end-of-record marker:
1---+----10---+----20---+----↓0 |
---|
Jan 32 16 Feb 35 18 Mar 46 26 |
SAS places the values that are read into their appropriate place in the program data vector:
_N_ | _ERROR_ | Month | AvgHigh | AvgLow |
---|---|---|---|---|
3 | 0 | mar | 46 | 26 |
The third iteration of the DATA step being complete, SAS writes the third observation to the StateCollege data set and returns to the top of the DATA step to begin its fourth iteration.
Yikes, when can we stop this endless loop?! Soon, real soon! We have just one more important issue to discuss the trailing @@, namely, when does SAS know to release the current record from the input buffer to go get the next one? The answer is now! Well, okay, that doesn't quite answer the question. In our particular example, the answer is now, because SAS just encountered the end-of-record marker. In general, there are just two situations in which SAS releases a record being held by the trailing @@:
- When SAS moves the input pointer past the end of a record.
- When an INPUT statement that has no line-hold specifier executes. (The trailing @@ is one of two line-hold specifiers available. The single trailing @, which we'll learn about in the next section is the other.)
For our example, we're at the point at which SAS just encountered that first situation. Therefore, SAS releases the first record and reads the second record into the input buffer. And so on and so on ... SAS continues the execution phase until the last block of data is read from the input file.
Now can we say that we all get the idea? Let's assume so and stop there! Just keep in mind these few important points about the trailing @@:
- The trailing @@ holds a data line in the input buffer across multiple iterations of the DATA step.
- The trailing @@ is most frequently used to read multiple SAS observations from a single input record, just as was done in our example.
- The trailing @@ should not be used in conjunction with column input, the @n absolute pointer control, or the MISSOVER option.
Now, let's go learn what the single trailing at sign (@) can do for us, and how it differs from the double trailing at sign (@@).