Lesson 21: More on Importing Data - Part II

Lesson 21: More on Importing Data - Part II

Overview

In this lesson, we'll focus on two special situations that you might encounter when trying to read an input raw data file into a SAS data set, namely:

  • When you need to read across several records in the input raw data file in order to create one observation in a SAS data set.
  • When you need to read from just one record in the input raw data file in order to create multiple observations in a SAS data set.

We'll learn how to use two different line pointer controls — the forward slash (/) line pointer control and the pound-n (#n) line pointer control — to accomplish the first task. And, we'll learn how to use two different line-hold specifiers — the double trailing at sign (@@) and the single trailing at sign (@) — to accomplish the second task.

Objectives

Upon completion of this lesson, you should be able to:

Upon completing this lesson, you should be able to do the following:

  • read a raw data file into a SAS data set when the data values for one observation are spread out over several records in the file
  • use the forward slash (/) line pointer control to read records in a raw data file sequentially
  • use the pound-n (#n) line pointer control to read records in a raw data file non-sequentially
  • read a raw data file into a SAS data set when the data values for multiple observations are contained in just one record in the input data file
  • read raw data files in which each record contains a repeating block of values in which each block in the record represents a separate observation
  • read raw data files in which each record contains an ID field followed by an equal number of repeating fields that contribute values to separate observations
  • read raw data files in which each record contains an ID field followed by a varying number of repeating fields that contribute values to separate observations
  • use the double trailing at sign (@@) at the end of an INPUT statement to tell SAS to hold a record in the input buffer across multiple iterations of the DATA step
  • use the single trailing at sign (@) the end of an INPUT statement to tell SAS to hold a record in the input buffer
  • know when it is appropriate to use the double trailing @@
  • know when it is appropriate to use the single trailing @

21.1 - Creating a Single Observation From Multiple Records

21.1 - Creating a Single Observation From Multiple Records

In this section, we'll learn how to read in a raw data file when the data values for one observation are spread out over several records in the file. For example, suppose we are interested in reading the following data file:

State College PA
32 16 25
2.4
Miami FL
75 58 67
2.0
Honolulu HI
80 65 74
3.6

containing first a city name, and then on the next record, the average high, average low, and just plain old average temperature for that city in January, and then on the next record the average precipitation for that city in January. After reading in the first three records for the first city, we'd move on to read the next three records for the second city, and so on. When all is said and done, we'd like our SAS data set to look like this:

City State Avg High Avg Low Mean Percip
State College PA 32 15 25 2.4
Miami FL 75 58 67 2.0
Honolulu HI 80 65 74 3.6

We'll learn two different ways of reading multiple records in a raw data file while creating just one observation in a SAS data set. First, we'll learn how to use the forward slash (/) line pointer control to read the records in a raw data file sequentially. Then, we'll learn how to use the pound-n (#n) line pointer control to read the records in a raw data file non-sequentially.

Reading Multiple Records Sequentially

Just as you can use a +n relative pointer control in your INPUT statements to tell SAS to advance n columns within a record before reading the next data value, you can use one or more forward slash (/) line pointer controls in your INPUT statements to tell SAS to advance to a new record before reading the next data value. Because the forward slash (/) line pointer control only moves the input pointer forward in a record, it can only be used to read the records in a data file in order.

To make sure we understand the problem we are trying to solve in this section, let's start by looking at an example that illustrates how we'd have to read multiple records sequentially if the forward slash (/) line pointer control weren't available to us.

Example 21.1

The following program uses three INPUT statements to read in January weather statistics for three cities — State College, Miami, and Honolulu — when the data values for each city are recorded over three records in the input data file:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER; 
DATA january;
	input City & $13. State $;
    input AvgHigh AvgLow Mean;
    input Precip;
    DATALINES;
    State College  PA
    32 16 25
    2.4
    Miami  FL
    75 58 67
    2.0
    Honolulu  HI
    80 65 74
    3.6
    ;
RUN;

PROC PRINT data = january;
	title 'January Weather for Three U.S. Cities';
RUN;

The first INPUT statement tells SAS to read the City and State into the observation. (The ampersand (&) modifier is used because one of the cities — State College — contains an embedded blank, and the city and states are separated by two blank spaces.) Then, the second INPUT statement tells SAS to read into the observation the average high temperature (AvgHigh), the average low temperature (AvgLow) and the overall average temperature (Mean) for the city in January. Finally, the third INPUT statement tells SAS to read the average precipitation (Precip) in January for the city into the observation. Then, being at the end of the DATA step, the values in the program data vector are written to the january data set as the first observation. SAS then returns to the beginning of the DATA step, sets the variable values back to missing, and executes the three INPUT statements again to build the second observation. The DATA step continues to execute as such until SAS runs out of records to read in the input data file.

Launch and run the SAS program, and review the output to convince yourself that SAS reads in the temperature data correctly. Now, let's see how the forward slash (/) line pointer control allows us to accomplish the same thing.

Example 21.2

The following program uses the forward slash (/) line pointer control to read in January weather statistics for three cities — State College, Miami, and Honolulu — when the data values for each city are recorded over three records in the input data file:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER; 
DATA january;
	input   City & $13. State $
    	/ AvgHigh AvgLow Mean
        / Precip;
    DATALINES;
    State College  PA
    32 16 25
    2.4
    Miami  FL
    75 58 67
    2.0
    Honolulu  HI
    80 65 74
    3.6
    ;
RUN;

PROC PRINT data = january;
	title 'January Weather for Three U.S. Cities';
RUN;

Here, a single INPUT statement tells SAS to read the data values for the six variables (City, State, ..., Precip) across three records. Note that because the forward slash (/) line pointer control tells SAS to advance the input pointer to the next record, it is placed after the instructions for reading the values in the current record. For example, we first tell SAS to read the City and State values in the first record, then the / line pointer control tells SAS to move to the next record to read in the AvgHigh, AvgLow, and Mean values. Then, the / line pointer control tells SAS to move to the next record to read in the Precip value. Being at the end of the DATA step, the values in the program data vector are written to the data set as the first observation. SAS then returns to the beginning of the DATA step, sets the variable values back to missing, and executes the INPUT statement again to build the second observation. The DATA step continues to execute as such until SAS runs out of records to read in the input data file.

Launch and run the SAS program. Review the output to convince yourself that SAS reads in the temperature data correctly. You might also want to look at your log window for the NOTE that indicates that the temporary data set january contains only three observations (even though nine data records were input).

You'll want to keep in mind that when you use the forward slash (/) line pointer control in a single INPUT statement as we did here, your INPUT statement should contain just one semi-colon at the end of it. Some students get tripped up thinking that they need to put a semi-colon after each / line pointer control. One INPUT statement requires just one semi-colon.

Note that the input raw data files in the two previous examples contained the same number of records for each observation. You can probably well imagine that this is a requirement when using the forward slash (/) pointer control! Let's take a look at an example of what can go awry if your raw data file doesn't contain the same number of records for each observation.

Example 21.3

The following program attempts to use forward slash (/) pointer controls to read in the January weather statistics for our three cities, State College, Miami, and Honolulu, when the record containing the precipitation for Miami is omitted from the input data file:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER; 
DATA january;
	input   City & $13. State $
    	/ AvgHigh AvgLow Mean
        / Precip;
    DATALINES;
	State College  PA
    32 16 25
    2.4
    Miami  FL
    75 58 67
    Honolulu  HI
    80 65 74
    3.6
    ;
RUN;

PROC PRINT data = january;
	title 'January Weather for Three U.S. Cities';
RUN;

First, note that the input data file contains only two records for Miami — the city name record and the temperature record, but not the precipitation record. Then, launch and run the SAS program. Review the output to convince yourself that SAS had trouble reading in the temperature data correctly. As you can see by the message in the log window:

when SAS attempts to read the Precip value for Miami, it instead reads Honolulu. Because Precip is a numeric variable and Honolulu is a character value, SAS displays an Invalid data message. The moral of the story ... before you use a forward slash (/) pointer control in an INPUT statement, make sure that the raw data file contains the same number of records for each observation.

Reading Multiple Records Non-Sequentially

Just as you can use an @n absolute pointer control in your INPUT statements to tell SAS to move to a record's nth column before reading the next data value, you can use the pound-n (#n) line pointer control to tell SAS to advance to a specific record before reading the next data value. Because the #n line pointer control moves the input pointer to a particular record, it can be used to read the records in a data file in any order.

Example 21.4

The following program uses the pound-n (#n) line pointer control to read in January weather statistics for three cities — State College, Miami, and Honolulu — when the data values for each city are recorded over three records in the input data file:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER; 
DATA january;
	input #2 AvgHigh AvgLow Mean
    	#3 Precip
        #1  City & $13. State $;
    DATALINES;
    State College  PA
    32 16 25
    2.4
    Miami  FL
    75 58 67
    2.0
    Honolulu  HI
    80 65 74
    3.6
    ;
RUN;

PROC PRINT data = january;
	title 'January Weather for Three U.S. Cities';
RUN;

Because the #n line pointer control tells SAS the specific record to which we want to move the input pointer, it is placed before the instructions for reading the values in that particular record. For example, we first tell SAS to move to the second record to read in the AvgHigh, AvgLow, and Mean values. Then, we tell SAS to move to the third record to read in the Precip value. Then, we tell SAS to move back to the first record to read the City and State values. Being at the end of the DATA step, the values in the program data vector are written to the data set as the first observation. SAS then returns to the beginning of the DATA step, sets the variable values back to missing, and executes the INPUT statement again to build the second observation. The DATA step continues to execute as such until SAS runs out of records to read in the input data file.

Launch and run the SAS program, and review the output to convince yourself that SAS reads in the temperature data correctly. You might also want to note, as is always the case, that the variables in the resulting january data set appear in the same order as their variable names appear in the INPUT statement — AvgHigh, AvgLow, and Mean first, followed by Precip, and then finally City and State.

Reading Multiple Records Sequentially and Non-Sequentially

Although, as you'll see, it's not perfectly flexible, you can use the forward slash (/) and pound-n (#n) line pointer controls together in the same INPUT statement to read multiple records both sequentially and non-sequentially.

Example 21.5

The following program uses both the / and #n line pointer controls to read in January weather statistics for three cities — State College, Miami, and Honolulu — when the data values for each city are recorded over three records in the input data file:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER; 

DATA january;
    input #3 Precip 
            #1 City & $13. State $
            / AvgHigh AvgLow Mean;
    DATALINES;
    State College  PA
    32 16 25
    2.4
    Miami  FL
    75 58 67
    2.0
    Honolulu  HI
    80 65 74
    3.6
    ;
RUN;

PROC PRINT data = january;
    title 'January Weather for Three U.S. Cities';
RUN;

You should get the idea now. Here, the INPUT statement uses the #3 pointer control to tell SAS to move first to the third record to read in the Precip value, and then the #1 pointer control to tell SAS to move back to the first record to read in the City and State values, and finally the / pointer control to tell SAS to advance another record to read in the AvgHigh, AvgLow, and Mean values.

Launch and run the SAS program, and review the output to convince yourself that SAS reads in the temperature data correctly. You might again want to note that the variables in the resulting january data set appear in the same order as their variable names appear in the INPUT statement — Precip   first, followed by City and State, and then finally AvgHigh, AvgLow, and Mean.

To use this example to address that earlier point about using both pointer controls in the same INPUT statement not being perfectly flexible ... if you replace the INPUT statement with this one:

input #2 AvgHigh AvgLow Mean
       / Precip
       #1 City & $13. State $;

and re-run the SAS program, you should see that SAS hiccups. If you're not convinced of that look at error message displayed in the log window:

This example illustrates that you can't read records non-sequentially once you've invoked the sequential-reading forward slash (/) line pointer control.


21.2 - Creating Multiple Observations From a Single Record

21.2 - Creating Multiple Observations From a Single Record

In the next three sections, we'll pull in a number of different tools that we've learned throughout the course — as well as add a few new ones — in order to read raw data files that contain data values for multiple observations in just one record. We'll just introduce the situations here, and then investigate each more fully over the next three sections.

Reading Repeating Blocks of Data

First, we'll learn how to read raw data files in which each record contains a repeating block of values in which each block in the record represents a separate observation. For example, we'll learn how to read this data file:

Jan 32 16 Feb 35 18 Mar 46 26
Apr 58 37 May 68 47 Jun 78 56
Jul 82 60 Aug 80 58 Sep 72 51
Oct 61 40 Nov 48 32 Dec 37 22

in which each record contains three blocks of data values — the month and the average high and low temperature for that month in State College, PA. When all is said and done, we will have transformed the input raw data file into a SAS data set that looks like this:

Month Avg High Avg Low
Jan 32 16
Feb 35 18
Mar 46 26
Apr 58 37
May 68 47
Jun 78 56
Jul 82 60
Aug 80 58
Sep 72 51
Oct 61 40
Nov 48 32
Dec 37 22

Reading the Same Number of Repeating Fields

Then, we'll learn how to read raw data files in which each record contains an ID field followed by an equal number of repeating fields that contribute values to separate observations. For example, we'll learn how to read this data file:

111000234 79 82 100
922232573 87 89 95
252359873 65 72 73
205804679 92 95 99

in which each record contains a nine-digit student ID number followed by three exam scores. When all is said and done, we will have transformed the input raw data file into a SAS data set that looks like this:

Month Avg High Avg Low
Jan 32 16
Feb 35 18
Mar 46 26
Apr 58 37
May 68 47
Jun 78 56
Jul 82 60
Aug 80 58
Sep 72 51
Oct 61 40
Nov 48 32
Dec 37 22

Reading a Varying Number of Repeating Fields

Finally, we'll learn how to read raw data files in which each record contains an ID field followed by a varying number of repeating fields that contribute values to separate observations. For example, we'll learn how to read this raw data file:

1001 179 172 169    
1002 250 249      
1003 190 196 195 164 158
1004 232 224 219 212 208
1005 211 208 204 202  

in which each record contains a four-digit subject ID number followed by monthly weights (in pounds) of the subjects. Because some of the subjects dropped out of the diet program in which they were participating, the data file does not contain an equal number of weights in each record. When all is said and done, we will have transformed the input raw data file into a SAS data set that looks like this:

id weighin weight
1001 1 179
1001 2 172
1001 3 169
1002 1 250
1002 2 249
1003 1 190
1003 2 196
1003 3 195
1004 1 232
1004 2 224
1004 3 219
1004 4 212
1004 5 208
1005 1 211
1005 2 208
1005 3 204
1005 4 202

21.3 - Reading Repeating Blocks of Data

21.3 - Reading Repeating Blocks of Data

Let's suppose we want to read a raw data file containing repeating blocks of data values — month and average high and low temperatures — in each record, such as here:

Jan 32 16 Feb 35 18 Mar 46 26
Apr 58 37 May 68 47 Jun 78 56
Jul 82 60 Aug 80 58 Sep 72 51
Oct 61 40 Nov 48 32 Dec 37 22

We might be tempted to read in this data file using three different variables for the months, three different variables for the high temperatures, and three different variables for the low temperatures, so that the resulting SAS data set looks like this:

Month1 High1 Low1 Month2 High2 Low2 Month3 High3 Low3
Jan 32 16 Feb 35 18 Mar 46 26
Apr 58 37 May 68 47 Jun 78 56
Jul 82 60 Aug 80 58 Sep 72 51
Oct 61 40 Nov 48 32 Dec 37 22

That makes for an quick and easy way to read in the data values, but a hard way to analyze the data! For example, we can't even using the MEANS procedure to calculate the annual average of the high temperatures, because the values are contained in three separate variables. What we want instead is a data set that looks like this:

Month Avg High Avg Low
Jan 32 16
Feb 35 18
Mar 46 26
Apr 58 37
May 68 47
Jun 78 56
Jul 82 60
Aug 80 58
Sep 72 51
Oct 61 40
Nov 48 32
Dec 37 22

in which each record in the input data file produces three observations, with each observation containing three variables — Month, AvgHigh, and AvgLow. When the data set is structured like this, it is very a simple matter to use the AvgHigh variable in the MEANS procedure to calculate the annual average of the high temperatures.

In order to read in the data set as desired, we need to take advantage of the INPUT statement's double trailing at sign (@@). Errr ... what the heck is that? The double trailing at sign (@@) tells SAS rather than advancing to a new record, hold the current input record for the execution of the next INPUT statement, even across iterations of the DATA step. Perfectly clear, eh? Let's take a look at an example to see if we can make sense of it!

Example 21.6

The following SAS program uses the double trailing at sign (@@) to read in the average high and low temperatures in State College, PA for each month, when each record in the input data file contains the data values for three observations:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER; 

DATA StateCollege;
    input Month $ AvgHigh AvgLow @@;		 
    DATALINES;
Jan 32 16 Feb 35 18 Mar 46 26
Apr 58 37 May 68 47 Jun 78 56
Jul 82 60 Aug 80 58 Sep 72 51
Oct 61 40 Nov 48 32 Dec 37 22
    ;
RUN;

PROC PRINT data = StateCollege;
    title 'Average Temperatures for State College';
    id Month;
RUN;

The only thing that should stand out as unusual in this program is the double trailing at sign (@@) that appears at the end of the INPUT statement. Incidentally, it's not a coincidence that it is called trailing @@. That's because it must always be the last item to appear in the INPUT statement.

First, let's have you go ahead and launch and run the SAS program, and review the output to convince yourself that SAS reads in the temperature data as we desired. Then, let's talk about how the trailing @@ works by investigating how SAS reads the data in this program into the StateCollege data set.

As you know from our work in Stat 480, during the compile phase, SAS creates a program data vector containing a storage location for each variable name that appears in the INPUT statement and subsequent assignment statements. So, this is what ours looks like at the beginning of the execution phase:

_N_ _ERROR_ Month AvgHigh AvgLow
1 0   . .

And, as the execution phase begins and SAS encounters the INPUT statement, SAS reads the first line in the input data file into an input buffer, getting:

↓---+----10---+----20---+----30
Jan 32 16 Feb 35 18 Mar 46 26

Of course, the numbers in the white row are just the data values for the first record in our input data file. You can think of the numbers in the gray row as a "ruler" that SAS uses to keep track of the column in which the input pointer resides. SAS won't really be needing the ruler, in this case, because we are using list input. The arrow indicates that the input pointer is in column 1, where SAS places it by default since it is not told to do otherwise.

During the first iteration of the DATA step, the INPUT statement tells SAS to read Month, AvgHigh, and AvgLow. After doing so, the input pointer points to the blank space after the low-temperature value of 16:

1---+----↓0---+----20---+----30
Jan 32 16 Feb 35 18 Mar 46 26

and SAS places the values that are read into their appropriate place in the program data vector:

_N_ _ERROR_ Month AvgHigh AvgLow
1 0 Jan 32 16

Because the INPUT statement is the only statement in the program, after SAS executes it, the first iteration of the DATA step is complete. Therefore, SAS writes the first observation to the StateCollege data set, and returns to the top of the DATA step to begin its second iteration. The automatic variable _N_ is increased to 2, and the variable values in the program data vector are reset to missing:

_N_ _ERROR_ Month AvgHigh AvgLow
2 0   . .

Now, this is where we first see the effect of the trailing @@ in our INPUT statement. Because SAS is beginning another iteration of the DATA step, by default the INPUT statement should be telling SAS to put the next record in the input data file into the input buffer. The trailing @@ tells SAS not to do that though! Instead, the trailing @@ tells SAS to hold onto the record that is currently in the input buffer, and to read the values for Month, AvgHigh, and AvgLow from it. After doing so, the input pointer points to the blank space after the low temperature value of 18:

1---+----10---+----↓0---+----30
Jan 32 16 Feb 35 18 Mar 46 26

and SAS places the values that are read into their appropriate place in the program data vector:

_N_ _ERROR_ Month AvgHigh AvgLow
2 0 Feb 35 18

The second iteration of the DATA step being complete, SAS writes the second observation to the StateCollege data set, and returns to the top of the DATA step to begin its third iteration.

You should get the idea now. The automatic variable _N_ is increased to 3, and the variable values in the program data vector are reset to missing. The trailing @@ tells SAS to hold onto the record that is currently in the input buffer, and to read the values for Month, AvgHigh, and AvgLow from it. After doing so, the input pointer points to the blank space after the low temperature value of 26, which happens to coincide with the end-of-record marker:

1---+----10---+----20---+----↓0
Jan 32 16 Feb 35 18 Mar 46 26

SAS places the values that are read into their appropriate place in the program data vector:

_N_ _ERROR_ Month AvgHigh AvgLow
3 0 mar 46 26

The third iteration of the DATA step being complete, SAS writes the third observation to the StateCollege data set, and returns to the top of the DATA step to begin its fourth iteration.

Yikes, when can we stop this endless loop?! Soon, real soon! We have just one more important issue to discuss about the trailing @@, namely when does SAS know to release the current record from the input buffer to go get the next one? The answer is now! Well, okay, that doesn't quite answer the question. In our particular example, the answer is now, because SAS just encountered the end-of-record marker. In general, there are just two situations in which SAS releases a record being held by the trailing @@:

  1. When SAS moves the input pointer past the end of a record.
  2. When an INPUT statement that has no line-hold specifier executes. (The trailing @@ is one of two line-hold specifiers available. The single trailing @, which we'll learn about in the next section is the other.)

For our example, we're at the point at which SAS just encountered that first situation. Therefore, SAS releases the first record and reads the second record into the input buffer. And so on and so on ... SAS continues the execution phase until the last block of data is read from the input file.

Now can we say that we all get the idea? Let's assume so and stop there! Just keep in mind these few important points about the trailing @@:

  • The trailing @@ holds a data line in the input buffer across multiple iterations of the DATA step.
  • The trailing @@ is most frequently used to read multiple SAS observations from a single input record, just as was done in our example.
  • The trailing @@ should not be used in conjunction with column input, the @n absolute pointer control, nor with the MISSOVER option.

Now, let's go learn what the single trailing at sign (@) can do for us, and how it differs from the double trailing at sign (@@).


21.4 - Reading the Same Number of Repeating Fields

21.4 - Reading the Same Number of Repeating Fields

Let's suppose we want to read a raw data file that, in each record, contains a student ID number followed by the same number of grades, such as here:

111000234 79 82 100
922232573 87 89 95
252359873 65 72 73
205804679 92 95 99

In some situations, it might work just fine to read in this data file using three different variables for the grades — grade1, grade2, and grade3, say — so that the resulting SAS data set looks like this:

ID grade1 grade2 grade3
111000231 79 82 100
922232573 87 89 95
252359873 65 72 73
205804679 92 95 99

In other cases, though, it makes more sense to read all of the grades into one score variable — if you wanted to calculate an overall average across all students and all grades, say. In that case, we'd want our data set to have this structure instead:

ID exam score
111000234 1 79
111000234 2 82
111000234 3 100
922232573 1 87
922232573 2 89
922232573 3 95

252359873

1 65
252359873 2 72
252359873 3 73
205804679 1 92
205804679 2 95
205804679 3 99

in which each record in the input data file produces three observations, with each observation containing three variables — ID, exam, and score.

In order to read in the data set as desired, we need to take advantage of the INPUT statement's single trailing at sign (@). Just like the double trailing at sign (@@), the single trailing @ is a line-hold specifier that tells SAS to hold the current record in the input buffer while SAS executes the next INPUT statement. The single trailing @ releases the current record from the input buffer:

  1. when SAS encounters an INPUT statement without a line-hold specifier, or
  2. when SAS returns to the top of the DATA step.

The trailing @ will do the trick for us here, because we do not need SAS to hold the current record across multiple iterations of the DATA step. Let's make more sense of all of this by looking at an example!

Example 21.7

The following program uses single trailing at signs (@) in multiple INPUT statements along with multiple output statements to read a raw data file into a temporary SAS data set called grades:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER;

DATA grades;
    input ID @;
    input score @;
    output;
    input score @;
    output;
    input score @;
    output;
    DATALINES;
    111000234 79 82 100
    922232573 87 89  95
    252359873 65 72  73
    205804679 92 95  99
    ;
RUN;

PROC PRINT data = grades NOOBS;
    title 'Grades data set';
RUN;

Let's work our way through the program, from top to bottom, to make sure we understand what it is doing:

  • The first INPUT statement tells SAS to place the first record into the input buffer, as well as read the value for the ID variable. The trailing @ tells SAS to hold the first record in the input buffer.
  • The second INPUT statement tells SAS to read the next value in the input buffer into the variable called score. The trailing @ tells SAS to continue to hold the first record in the input buffer.
  • The OUTPUT statement then tells SAS to write the current contents of the program data vector containing the values for the ID and score variables — 111000234 and 79, respectively — to create the first observation in the grades data set.
  • The next INPUT statement tells SAS to read the next value in the input buffer into the variable called score. The value 82 writes over the 79 that's in the score field in the program data vector. The ID value, 111000234, in the program data vector remains unchanged. The trailing @ tells SAS to continue to hold the first record in the input buffer.
  • The next OUTPUT statement then tells SAS to write the current contents of the program data vector containing the values for the ID and score variables — 111000234 and 82, respectively — to create the second observation in the grades data set.
  • The next INPUT statement tells SAS to read the next value in the input buffer into the variable called score. The value 100 writes over the 82 that's in the score field in the program data vector. The ID value, 111000234, in the program data vector remains unchanged. The trailing @ tells SAS to continue to hold the input buffer. The trailing @ is really not necessary here, since we've reached the bottom of the DATA step anyway.
  • The final OUTPUT statement then tells SAS to write the current contents of the program data vector containing the values for the ID and score variables — 111000234 and 100, respectively — to create the third observation in the grades data set.
  • Having reached the bottom of the DATA step, SAS returns to the top of the DATA step to begin processing the next iteration. In so doing, the first record that was being held in the input buffer is released.
  • And the process continues as just described until SAS reads the last field in the last record.

Launch and run the SAS program, and review the output to convince yourself that the program does indeed create a data set called grades, in which each record contains just one of the grades for one of the students.

Now, you might be looking at that last program and wondering "geez, what if I had an input data file with 100 repeating fields ... would I really need to write a program with 101 input statements and 100 output statements?" This program looks like a good candidate for an iterative DO loop, doesn't it?

Example 21.8

The following program is identical in behavior to the previous program, except here the alternating sets of three INPUT and OUTPUT statements have been collapsed into one iterative DO loop:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER;

DATA grades;
    input ID @;
    DO exam = 1 to 3;
        input score @;
        output;
    END;
    DATALINES;
    111000234 79 82 100
    922232573 87 89  95
    252359873 65 72  73
    205804679 92 95  99
    ;
RUN;

PROC PRINT data = grades NOOBS;
    title 'Grades data set';
RUN;

Let's see how this program works! During the compile phase, SAS creates a program data vector containing the automatic variables — _N_ and _ERROR_ — and the three variables mentioned in the DATA step — ID, exam, and score. Then, SAS moves on to the execution phase.

During the first iteration, the first INPUT statement tells SAS to place the first record in the input data file into the input buffer:

1---+----↓0---+----20
11100234 79 82 100

and to read the value for the ID variable into the program data vector:

_N_ _ERROR_ ID exam score
1 0 111000234 . .

The trailing @ tells SAS to hold the record in the input buffer. Then, SAS begins execution of the DO loop. On the first iteration of the DO loop, exam is set to 1, the INPUT statement reads the first repeating field (79) and assigns the value to score in the program data vector:

_N_ _ERROR_ ID exam score
1 0 111000234 1 79

The trailing @ tells SAS to continue to hold the record in the input buffer. The OUTPUT statement tells SAS to write the contents of the program data vector to create the first observation of the grades data set. Having reached the DO loop's END statement, SAS returns to the top of the DO loop and sets exam to 2. The INPUT statement reads the second repeating field (82) and assigns the value to score in the program data vector:

_N_ _ERROR_ ID exam score
1 0 111000234 2 82

The trailing @ tells SAS to continue to hold the record in the input buffer. The OUTPUT statement tells SAS to write the contents of the program data vector to create the second observation of the grades data set. Having again reached the DO loop's END statement, SAS returns to the top of the DO loop and sets exam to 3. The INPUT statement reads the third, and final, repeating field (100) and assigns the value to score in the program data vector:

_N_ _ERROR_ ID exam score
1 0 111000234 3 100

The trailing @ tells SAS to continue to hold the record in the input buffer. The OUTPUT statement tells SAS to write the contents of the program data vector to create the third observation of the grades data set. Having again reached the DO loop's END statement, SAS returns to the top of the DO loop and sets exam to 4. Because SAS is told to end the DO loop when exam is greater than 3, SAS ends the DO loop, and moves on to the next statement in the DATA step. Oh, but it's the DATA step's RUN statement.

So, SAS returns to the top of the DATA step. The first record is released from the input buffer. The first INPUT statement now tells SAS to place the second record in the input data file into the input buffer:

↓---+----10---+----20
922232573 87 89 95

The automatic variable _N_ is increased to 2, _ERROR_ is set to 0, and the other variables in the program data vector are reset to missing:

_N_ _ERROR_ ID exam score
2 0 . . .

Okay, let's get off of this merry-go-round! SAS proceeds through the program and the input data file just as described above until SAS reaches the last field in the last record. Let's have you now launch and run the SAS program, and review the output to convince yourself that SAS did indeed create the grades data set as described.

It probably won't hurt to summarize again the main differences between the double trailing at sign (@@) and the single trailing at sign (@):

  • The double trailing @@ holds a record in the input buffer across multiple iterations of the DATA step until the end of the record is reached.
  • The single trailing @ releases a record when SAS returns to the top of the DATA step to process the next iteration.

Now, let's go investigate how we have to modify our programs here to accommodate data files that containing a varying number of repeating fields in each record.


21.5 - Reading a Varying Number of Repeating Fields

21.5 - Reading a Varying Number of Repeating Fields

Let's suppose we want to read a raw data file that, in each record, contains a four-digit subject ID number followed by monthly weights (in pounds) of the subjects:

1001 179 172 169    
1002 250 249      
1003 190 196 195 164 158
1004 232 224 219 212 208
1005 211 208 204 202  

As you can see, because some of the subjects dropped out of the diet program in which they were participating, the data file does not contain an equal number of weights in each record.

In some situations, it might work just fine to read in this data file using five different variables for the weights — wt1, wt2, wt3, wt4, and wt5, say — so that the resulting SAS data set looks like this:

id wt1 wt2 wt3 wt4 wt5
1001 179 172 169 . .
1002 250 249 . . .
1003 190 196 195 164 158
1004 232 224 219 212 208
1005 211 208 204 202 .

In other cases, though, it makes more sense to read all of the weights into one weight variable — if you wanted to calculate an overall average across all subjects and all weigh-ins, say. In that case, we'd want our data set to have this structure instead:

id weighin weight
1001 1 179
1001 2 172
1001 3 169
1002 1 250
1002 2 249
1003 1 190
1003 2 196
1003 3 195
1003 4 164
1003 5 158
1003 3 195
1004 1 232
1004 2 224
1004 3 219
1004 4 212
1004 5 208
1005 1 211
1005 2 208
1005 3 204
1005 4 202

in which each record in the input data file produces as many observations as it has weights, with each observation containing three variables — id, weighin, and weight.

Because the input data file contains a varying number of repeated fields in each record, our methods of the last section are going to have to be modified. We can still take advantage of the trailing @ to tell SAS to hold the records in the input buffer. We'll also have to take advantage, however, of the INFILE statement's MISSOVER option, as well as have to replace the iterative DO loop with a DO WHILE loop. Let's take a look!

Example 21.9

The following program uses trailing at signs (@) in conjunction with the MISSOVER option and a conditional DO WHILE loop to read in a raw data file containing a varying number of weights in each record:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER;

DATA dietdata;
    infile DATALINES MISSOVER;
    input id weight @;
    weighin = 0;
    DO WHILE (weight ne .);
        weighin+1;
        output;
        input weight @;
    END;
    DATALINES;
    1001 179 172 169
    1002 250 249
    1003 190 196 195 164 158
    1004 232 224 219 212 208
    1005 211 208 204 202
    ;
RUN;

PROC PRINT data = dietdata NOOBS;
    title 'The dietdata data set';
    var id weighin weight;
RUN;

Let's see how this program works! During the compile phase, SAS creates a program data vector containing the automatic variables — _N_ and _ERROR_ — and the three variables mentioned in the DATA step — id, weighin, and weight. Then, SAS moves on to the execution phase.

Even though we are reading the data instream, we've included an INFILE statement this time just so we can take advantage of the statement's MISSOVER option. Like the previous example with the same number of repeating fields, we must use trailing at signs (@) to tell SAS to read the same record more than once. But, here, we also have to tell SAS not to move the input pointer to the next record when there are missing weights. That's why the MISSOVER option.

During the first iteration, the first INPUT statement tells SAS to place the first record in the input data file into the input buffer:

↓---+----10---+----20---+
1001 179 172 169

and to read the values for the ID and weight variables into the program data vector. The trailing @ tells SAS to hold the record in the input buffer. The weighin variable is initialized to 0, so that the program data vector looks like this now:

_N_ _ERROR_ id weighin weight
1 0 1001 0 179

Then, SAS begins execution of the DO WHILE loop. The current value for weight is not missing — it's 179 — so SAS steps into the DO WHILE loop. The value of weighin is increased to 1, so that the program data vector looks like this:

_N_ _ERROR_ id weighin weight
1 0 1001 1 179

The OUTPUT statement tells SAS to write the contents of the program data vector to create the first observation of the dietdata data set. Then, the INPUT statement reads the repeating field (172) and assigns the value to the program data vector:

_N_ _ERROR_ id weighin weight
1 0 1001 1 172

The trailing @ tells SAS to continue to hold the record in the input buffer. Having reached the DO WHILE loop's END statement, SAS returns to the top of the DO loop and detemines that weight is again not missing — it's 172. So, SAS steps into the DO WHILE loop again. The value of weighin is increased to 1, so that the program data vector looks like this:

_N_ _ERROR_ id weighin weight
1 0 1001 2 172

The OUTPUT statement tells SAS to write the contents of the program data vector to create the second observation of the dietdata data set. Then, the INPUT statement reads the repeating field (169) and assigns the value to the program data vector:

_N_ _ERROR_ id weighin weight
1 0 1001 2 169

The trailing @ tells SAS to continue to hold the record in the input buffer. Having reached the DO WHILE loop's END statement, SAS returns to the top of the DO loop and detemines that weight is again not missing — it's 169. So, SAS steps into the DO WHILE loop again. The value of weighin is increased to 1, so that the program data vector looks like this:

_N_ _ERROR_ id weighin weight
1 0 1001 3 169

The OUTPUT statement tells SAS to write the contents of the program data vector to create the second observation of the dietdata data set. Then, the INPUT statement attempts to read another weight value and determines there to be none. The MISSOVER option prevents the input pointer from moving to the next record in search of another value for weight. SAS assigns a missing value to the weight variable in the program data vector:

_N_ _ERROR_ id weighin weight
1 0 1001 3 .

Having reached the DO WHILE loop's END statement, SAS returns to the top of the DO loop and detemines that weight is now missing, and so the statements in the loop are not executed. Having reached the bottom of the DATA step, SAS returns to the top of the DATA step to begin processing the next iteration. In so doing, the first record that was being held in the input buffer is released.

During the second iteration of the DATA step, the first INPUT statement tells SAS to place the second record in the input data file into the input buffer:

↓---+----10---+----20---+
1002 250 249

The automatic variable _N_ is increased to 2, _ERROR_ is set to 0, and the other variables in the program data vector are reset to missing:

_N_ _ERROR_ id weighin weight
2 0 . . .

Okay, let's get off of this merry-go-round! SAS proceeds through the program and the input data file just as described above until SAS reaches the last field in the last record. Let's have you now launch and run the SAS program, and review the output to convince yourself that SAS did indeed create the dietdata data set as described.


21.6 - Summary

21.6 - Summary

In this lesson, we extended our knowledge of how to read data from various types of data files into SAS data sets.

The homework for this lesson will give you more practice with these techniques so that you become even more familiar with how they work and can use them in your own SAS programming.


Legend
[1]Link
Has Tooltip/Popover
 Toggleable Visibility