Lesson 21: More on Importing Data - Part II

Overview

In this lesson, we'll focus on two special situations that you might encounter when trying to read an input raw data file into a SAS data set, namely:

When you need to read across several records in the input raw data file in order to create one observation in a SAS data set.
When you need to read from just one record in the input raw data file in order to create multiple observations in a SAS data set.

We'll learn how to use two different line pointer controls — the forward-slash (/) line pointer control and the pound-n (#n) line pointer control — to accomplish the first task. And, we'll learn how to use two different line-hold specifiers — the double trailing at sign (@@) and the single trailing at sign (@) — to accomplish the second task.

Objectives

Upon completion of this lesson, you should be able to:

Upon completing this lesson, you should be able to do the following:

read a raw data file into a SAS data set when the data values for one observation are spread out over several records in the file
use the forward-slash (/) line pointer control to read records in a raw data file sequentially
use the pound-n (#n) line pointer control to read records in a raw data file non-sequentially
read a raw data file into a SAS data set when the data values for multiple observations are contained in just one record in the input data file
read raw data files in which each record contains a repeating block of values in which each block in the record represents a separate observation
read raw data files in which each record contains an ID field followed by an equal number of repeating fields that contribute values to separate observations
read raw data files in which each record contains an ID field followed by a varying number of repeating fields that contribute values to separate observations
use the double trailing at sign (@@) at the end of an INPUT statement to tell SAS to hold a record in the input buffer across multiple iterations of the DATA step
use the single trailing at sign (@) at the end of an INPUT statement to tell SAS to hold a record in the input buffer
define when it is appropriate to use the double trailing @@
define when it is appropriate to use the single trailing @

21.1 - Creating a Single Observation From Multiple Records

In this section, we'll learn how to read in a raw data file when the data values for one observation are spread out over several records in the file. For example, suppose we are interested in reading the following data file:

City 1	State College PA
Temps	32	16	25
Precip	2.4
City 2	Miami FL
Temps	75	58	67
Precip	2.0
City 3	Honolulu HI
Temps	80	65	74
Precip	3.6

containing first a city name, and then on the next record, the average high, average low, and just plain old average temperature for that city in January, and then on the next record the average precipitation for that city in January. After reading in the first three records for the first city, we'd move on to read the next three records for the second city, and so on. When all is said and done, we'd like our SAS data set to look like this:

City	State	Avg High	Avg Low	Mean	Percip
State College	PA	32	15	25	2.4
Miami	FL	75	58	67	2.0
Honolulu	HI	80	65	74	3.6

We'll learn two different ways of reading multiple records in a raw data file while creating just one observation in a SAS data set. First, we'll learn how to use the forward slash (/) line pointer control to read the records in a raw data file sequentially. Then, we'll learn how to use the pound-n (#n) line pointer control to read the records in a raw data file non-sequentially.

Reading Multiple Records Sequentially

Just as you can use a +n relative pointer control in your INPUT statements to tell SAS to advance n columns within a record before reading the next data value, you can use one or more forward slash (/) line pointer controls in your INPUT statements to tell SAS to advance to a new record before reading the next data value. Because the forward-slash (/) line pointer control only moves the input pointer forward in a record, it can only be used to read the records in a data file in order.

To make sure we understand the problem we are trying to solve in this section, let's start by looking at an example that illustrates how we'd have to read multiple records sequentially if the forward-slash (/) line pointer control weren't available to us.

Example 21.1

The following program uses three INPUT statements to read January weather statistics for three cities — State College, Miami, and Honolulu — when the data values for each city are recorded over three records in the input data file:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER; 
DATA january;
	input City & $13. State $;
    input AvgHigh AvgLow Mean;
    input Precip;
    DATALINES;
    State College  PA
    32 16 25
    2.4
    Miami  FL
    75 58 67
    2.0
    Honolulu  HI
    80 65 74
    3.6
    ;
RUN;
 
PROC PRINT data = january;
	title 'January Weather for Three U.S. Cities';
RUN;

January Weather for Three U.S. Cities
Obs	City	State	AvgHigh	AvgLow	Mean	Precip
1	State College	PA	32	16	25	2.4
2	Miami	FL	75	58	67	2.0
3	Honolulu	HI	80	65	74	3.6

The first INPUT statement tells SAS to read the City and State into the observation. (The ampersand (&) modifier is used because one of the cities — State College — contains an embedded blank, and the city and states are separated by two blank spaces.) Then, the second INPUT statement tells SAS to read into the observation the average high temperature (AvgHigh), the average low temperature (AvgLow), and the overall average temperature (Mean) for the city in January. Finally, the third INPUT statement tells SAS to read the average precipitation (Precip) in January for the city into the observation. Then, at the end of the DATA step, the values in the program data vector are written to the january data set as the first observation. SAS then returns to the beginning of the DATA step, sets the variable values back to missing, and executes the three INPUT statements again to build the second observation. The DATA step continues to execute as such until SAS runs out of records to read in the input data file.

Launch and run the SAS program, and review the output to convince yourself that SAS reads the temperature data correctly. Now, let's see how the forward-slash (/) line pointer control allows us to accomplish the same thing.

Example 21.2

The following program uses the forward slash (/) line pointer control to read January weather statistics for three cities — State College, Miami, and Honolulu — when the data values for each city are recorded over three records in the input data file:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER; 
DATA january;
	input   City & $13. State $
    	/ AvgHigh AvgLow Mean
        / Precip;
    DATALINES;
    State College  PA
    32 16 25
    2.4
    Miami  FL
    75 58 67
    2.0
    Honolulu  HI
    80 65 74
    3.6
    ;
RUN;
 
PROC PRINT data = january;
	title 'January Weather for Three U.S. Cities';
RUN;

January Weather for Three U.S. Cities
Obs	City	State	AvgHigh	AvgLow	Mean	Precip
1	State College	PA	32	16	25	2.4
2	Miami	FL	75	58	67	2.0
3	Honolulu	HI	80	65	74	3.6

Here, a single INPUT statement tells SAS to read the data values for the six variables (City, State, ..., Precip) across three records. Note that because the forward slash (/) line pointer control tells SAS to advance the input pointer to the next record, it is placed after the instructions for reading the values in the current record. For example, we first tell SAS to read the City and State values in the first record, then the / line pointer control tells SAS to move to the next record to read in the AvgHigh, AvgLow, and Mean values. Then, the / line pointer control tells SAS to move to the next record to read in the Precip value. Being at the end of the DATA step, the values in the program data vector are written to the data set as the first observation. SAS then returns to the beginning of the DATA step, sets the variable values back to missing, and executes the INPUT statement again to build the second observation. The DATA step continues to execute as such until SAS runs out of records to read in the input data file.

Launch and run the SAS program. Review the output to convince yourself that SAS reads the temperature data correctly. You might also want to look at your log window for the NOTE that indicates that the temporary data set january contains only three observations (even though nine data records were input).

You'll want to keep in mind that when you use the forward slash (/) line pointer control in a single INPUT statement as we did here, your INPUT statement should contain just one semi-colon at the end of it. Some students get tripped up thinking that they need to put a semi-colon after each / line pointer control. One INPUT statement requires just one semi-colon.

Note that the input raw data files in the two previous examples contained the same number of records for each observation. You can probably well imagine that this is a requirement when using the forward-slash (/) pointer control! Let's take a look at an example of what can go awry if your raw data file doesn't contain the same number of records for each observation.

Example 21.3

The following program attempts to use forward slash (/) pointer controls to read the January weather statistics for our three cities, State College, Miami, and Honolulu, when the record containing the precipitation for Miami is omitted from the input data file:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER; 
DATA january;
	input   City & $13. State $
    	/ AvgHigh AvgLow Mean
        / Precip;
    DATALINES;
	State College  PA
    32 16 25
    2.4
    Miami  FL
    75 58 67
    Honolulu  HI
    80 65 74
    3.6
    ;
RUN;
 
PROC PRINT data = january;
	title 'January Weather for Three U.S. Cities';
RUN;

January Weather for Three U.S. Cities
Obs	City	State	AvgHigh	AvgLow	Mean	Precip
1	State College	PA	32	16	25	2.4
2	Miami	FL	75	58	67	.

First, note that the input data file contains only two records for Miami — the city name record and the temperature record, but not the precipitation record. Then, launch and run the SAS program. Review the output to convince yourself that SAS had trouble reading the temperature data correctly. As you can see by the message in the log window:

  OPTIONS PS = 58 LS = 72 NODATE NONUMBER;
  DATA january;
       input   City & $13. State $
           / AvgHigh AvgLow Mean
           / Precip;
               DATALINES;
NOTE: Invalid data for Precip in line 54 5-12.
RULE:      ----+----1----+----2----+----3----+----4----+----5----+----6-
                     Honolulu  HI
City=Miami State=FL AvgHigh=75 AvgLow=58 Mean=67 Precip=. _ERROR_=1
_N_=2
NOTE: LOST CARD.
                       ;
City=80 65 74 State=3.6 AvgHigh=. AvgLow=. Mean=. Precip=. _ERROR_=1
_N_=3
NOTE: SAS went to a new line when INPUT statement reached past the end
              of a line.
NOTE: The data set WORK.JANUARY has 2 observations and 6 variables.
NOTE: DATA statement used (Total process time):
              real time           0.00 seconds
              cpu time            0.01 seconds
               ;
      RUN;
      PROC PRINT data = january;
NOTE: Writing HTML Body file: sashtml2.htm
         title 'January Weather for Three U.S. Cities';
      RUN;
NOTE: There were 2 observations read from the data set WORK.JANUARY.
NOTE: PROCEDURE PRINT used (Total process time):
              real time           0.44 seconds
              cpu time            0.28 seconds

when SAS attempts to read the Precip value for Miami, it instead reads Honolulu. Because Precip is a numeric variable and Honolulu is a character value, SAS displays an Invalid data message. The moral of the story ... before you use a forward slash (/) pointer control in an INPUT statement, make sure that the raw data file contains the same number of records for each observation.

Reading Multiple Records Non-Sequentially

Just as you can use an @n absolute pointer control in your INPUT statements to tell SAS to move to a record's nth column before reading the next data value, you can use the pound-n (#n) line pointer control to tell SAS to advance to a specific record before reading the next data value. Because the #n line pointer control moves the input pointer to a particular record, it can be used to read the records in a data file in any order.

Example 21.4

The following program uses the pound-n (#n) line pointer control to read in January weather statistics for three cities — State College, Miami, and Honolulu — when the data values for each city are recorded over three records in the input data file:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER; 
DATA january;
	input #2 AvgHigh AvgLow Mean
    	#3 Precip
        #1  City & $13. State $;
    DATALINES;
    State College  PA
    32 16 25
    2.4
    Miami  FL
    75 58 67
    2.0
    Honolulu  HI
    80 65 74
    3.6
    ;
RUN;
 
PROC PRINT data = january;
	title 'January Weather for Three U.S. Cities';
RUN;

January Weather for Three U.S. Cities
Obs	AvgHigh	AvgLow	Mean	Precip	City	State
1	32	16	25	2.4	State College	PA
2	75	58	67	2.0	Miami	FL
3	80	65	74	3.6	Honolulu	HI

Because the #n line pointer control tells SAS the specific record to which we want to move the input pointer, it is placed before the instructions for reading the values in that particular record. For example, we first tell SAS to move to the second record to read in the AvgHigh, AvgLow, and Mean values. Then, we tell SAS to move to the third record to read in the Precip value. Then, we tell SAS to move back to the first record to read the City and State values. Being at the end of the DATA step, the values in the program data vector are written to the data set as the first observation. SAS then returns to the beginning of the DATA step, sets the variable values back to missing, and executes the INPUT statement again to build the second observation. The DATA step continues to execute as such until SAS runs out of records to read in the input data file.

Launch and run the SAS program, and review the output to convince yourself that SAS reads in the temperature data correctly. You might also want to note, as is always the case, that the variables in the resulting january data set appear in the same order as their variable names appear in the INPUT statement — AvgHigh, AvgLow, and Mean first, followed by Precip, and then finally City and State.

Reading Multiple Records Sequentially and Non-Sequentially

Although, as you'll see, it's not perfectly flexible, you can use the forward slash (/) and pound-n (#n) line pointer controls together in the same INPUT statement to read multiple records both sequentially and non-sequentially.

Example 21.5

The following program uses both the / and #n line pointer controls to read in January weather statistics for three cities — State College, Miami, and Honolulu — when the data values for each city are recorded over three records in the input data file:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER; 
DATA january;
    input #3 Precip 
            #1 City & $13. State $
            / AvgHigh AvgLow Mean;
    DATALINES;
    State College  PA
    32 16 25
    2.4
    Miami  FL
    75 58 67
    2.0
    Honolulu  HI
    80 65 74
    3.6
    ;
RUN;
 
PROC PRINT data = january;
    title 'January Weather for Three U.S. Cities';
RUN;

January Weather for Three U.S. Cities
Obs	Precip	City	State	AvgHigh	AvgLow	Mean
1	2.4	State College	PA	32	16	25
2	2.0	Miami	FL	75	58	67
3	3.6	Honolulu	HI	80	65	74

You should get the idea now. Here, the INPUT statement uses the #3 pointer control to tell SAS to move first to the third record to read in the Precip value, and then the #1 pointer control to tell SAS to move back to the first record to read in the City and State values, and finally the / pointer control to tell SAS to advance another record to read in the AvgHigh, AvgLow, and Mean values.

Launch and run the SAS program, and review the output to convince yourself that SAS reads in the temperature data correctly. You might again want to note that the variables in the resulting january data set appear in the same order as their variable names appear in the INPUT statement — Precip first, followed by City and State, and then finally AvgHigh, AvgLow, and Mean.

To use this example to address that earlier point about using both pointer controls in the same INPUT statement not being perfectly flexible ... if you replace the INPUT statement with this one:

input #2 AvgHigh AvgLow Mean
               / Precip
               #1 City & $13. State $;

and re-run the SAS program, you should see SAS hiccups. If you're not convinced of that look at the error message displayed in the log window:

This example illustrates that you can't read records non-sequentially once you've invoked the sequential-reading forward slash (/) line pointer control.

21.2 - Creating Multiple Observations From a Single Record

In the next three sections, we'll pull in a number of different tools that we've learned throughout the course — as well as add a few new ones — in order to read raw data files that contain data values for multiple observations in just one record. We'll just introduce the situations here and then investigate each more fully over the next three sections.

Reading Repeating Blocks of Data

First, we'll learn how to read raw data files in which each record contains a repeating block of values in which each block in the record represents a separate observation. For example, we'll learn how to read this data file:

Month	Av High	Av Low	Month	Av High	Av Low	Month	Av High	Av Low
Jan	32	16	Feb	35	18	Mar	46	26
Apr	58	37	May	68	47	Jun	78	56
Jul	82	60	Aug	80	58	Sep	72	51
Oct	61	40	Nov	48	32	Dec	37	22

in which each record contains three blocks of data values — the month and the average high and low temperature for that month in State College, PA. When all is said and done, we will have transformed the input raw data file into a SAS data set that looks like this:

Month	Avg High	Avg Low
Jan	32	16
Feb	35	18
Mar	46	26
Apr	58	37
May	68	47
Jun	78	56
Jul	82	60
Aug	80	58
Sep	72	51
Oct	61	40
Nov	48	32
Dec	37	22

Reading the Same Number of Repeating Fields

Then, we'll learn how to read raw data files in which each record contains an ID field followed by an equal number of repeating fields that contribute values to separate observations. For example, we'll learn how to read this data file:

id	exam 1	exam 2	exam 3
111000234	79	82	100
922232573	87	89	95
252359873	65	72	73
205804679	92	95	99

in which each record contains a nine-digit student ID number followed by three exam scores. When all is said and done, we will have transformed the input raw data file into a SAS data set that looks like this:

id	exam	score
111000234	1	79
111000234	2	82
111000234	3	100
922232573	1	87
922232573	2	89
922232573	3	95
252359873	1	65
252359873	2	72
252359873	3	73
205804679	1	92
205804679	2	95
205804679	3	99

Reading a Varying Number of Repeating Fields

Finally, we'll learn how to read raw data files in which each record contains an ID field followed by a varying number of repeating fields that contribute values to separate observations. For example, we'll learn how to read this raw data file:

id	weight 1	weight 2	weight 3	weight 4	weight 5
1001	179	172	169
1002	250	249
1003	190	196	195	164	158
1004	232	224	219	212	208
1005	211	208	204	202

in which each record contains a four-digit subject ID number followed by the monthly weights (in pounds) of the subjects. Because some of the subjects dropped out of the diet program in which they were participating, the data file does not contain an equal number of weights in each record. When all is said and done, we will have transformed the input raw data file into a SAS data set that looks like this:

id	weigh-in	weight
1001	1	179
1001	2	172
1001	3	169
1002	1	250
1002	2	249
1003	1	190
1003	2	196
1003	3	195
1004	1	232
1004	2	224
1004	3	219
1004	4	212
1004	5	208
1005	1	211
1005	2	208
1005	3	204
1005	4	202

21.3 - Reading Repeating Blocks of Data

Let's suppose we want to read a raw data file containing repeating blocks of data values — month and average high and low temperatures — in each record, such as here:

Monthly High/ Low Temps
Jan	32	16	Feb	35	18	Mar	46	26
Apr	58	37	May	68	47	Jun	78	56
Jul	82	60	Aug	80	58	Sep	72	51
Oct	61	40	Nov	48	32	Dec	37	22

We might be tempted to read in this data file using three different variables for the months, three different variables for the high temperatures, and three different variables for the low temperatures, so that the resulting SAS data set looks like this:

Month1	High1	Low1	Month2	High2	Low2	Month3	High3	Low3
Jan	32	16	Feb	35	18	Mar	46	26
Apr	58	37	May	68	47	Jun	78	56
Jul	82	60	Aug	80	58	Sep	72	51
Oct	61	40	Nov	48	32	Dec	37	22

That makes for a quick and easy way to read the data values, but a hard way to analyze the data! For example, we can't even use the MEANS procedure to calculate the annual average of the high temperatures, because the values are contained in three separate variables. What we want instead is a data set that looks like this:

Month	Avg High	Avg Low
Jan	32	16
Feb	35	18
Mar	46	26
Apr	58	37
May	68	47
Jun	78	56
Jul	82	60
Aug	80	58
Sep	72	51
Oct	61	40
Nov	48	32
Dec	37	22

in which each record in the input data file produces three observations, with each observation containing three variables — Month, AvgHigh, and AvgLow. When the data set is structured like this, it is very a simple matter to use the AvgHigh variable in the MEANS procedure to calculate the annual average of the high temperatures.

In order to read the data set as desired, we need to take advantage of the INPUT statement's double trailing at sign (@@). Errr ... what the heck is that? The double trailing at sign (@@) tells SAS rather than advancing to a new record, to hold the current input record for the execution of the next INPUT statement, even across iterations of the DATA step. Perfectly clear, eh? Let's take a look at an example to see if we can make sense of it!

Example 21.6

The following SAS program uses the double trailing at sign (@@) to read the average high and low temperatures in State College, PA for each month when each record in the input data file contains the data values for three observations:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER; 
DATA StateCollege;
     input Month $ AvgHigh AvgLow @@;		 
     DATALINES;
Jan 32 16 Feb 35 18 Mar 46 26
Apr 58 37 May 68 47 Jun 78 56
Jul 82 60 Aug 80 58 Sep 72 51
Oct 61 40 Nov 48 32 Dec 37 22
    ;
RUN;
PROC PRINT data = StateCollege;
    title 'Average Temperatures for State College';
    id Month;
RUN;

The only thing that should stand out as unusual in this program is the double trailing at sign (@@) that appears at the end of the INPUT statement. Incidentally, it's not a coincidence that it is called trailing @@. That's because it must always be the last item to appear in the INPUT statement.

First, let's have you go ahead and launch and run the SAS program, and review the output to convince yourself that SAS reads in the temperature data as we desired. Then, let's talk about how the trailing @@ works by investigating how SAS reads the data in this program into the StateCollege data set.

As you know from our work in Stat 480, during the compile phase, SAS creates a program data vector containing a storage location for each variable name that appears in the INPUT statement and subsequent assignment statements. So, this is what ours looks like at the beginning of the execution phase:

_N_	_ERROR_	Month	AvgHigh	AvgLow
1	0		.	.

And, as the execution phase begins and SAS encounters the INPUT statement, SAS reads the first line in the input data file into an input buffer, getting:

↓---+----10---+----20---+----30
Jan 32 16 Feb 35 18 Mar 46 26

Of course, the numbers in the white row are just the data values for the first record in our input data file. You can think of the numbers in the gray row as a "ruler" that SAS uses to keep track of the column in which the input pointer resides. SAS won't really be needing the ruler, in this case, because we are using list input. The arrow indicates that the input pointer is in column 1, where SAS places it by default since it is not told to do otherwise.

During the first iteration of the DATA step, the INPUT statement tells SAS to read AvgHigh AvgLow Mean. After doing so, the input pointer points to the blank space after the low-temperature value of 16:

1---+----↓0---+----20---+----30
Jan 32 16 Feb 35 18 Mar 46 26

and SAS places the values that are read into their appropriate place in the program data vector:

_N_	_ERROR_	Month	AvgHigh	AvgLow
1	0	Jan	32	16

Because the INPUT statement is the only statement in the program, after SAS executes it, the first iteration of the DATA step is complete. Therefore, SAS writes the first observation to the StateCollege data set and returns to the top of the DATA step to begin its second iteration. The automatic variable _N_ is increased to 2, and the variable values in the program data vector are reset to missing:

_N_	_ERROR_	Month	AvgHigh	AvgLow
2	0		.	.

Now, this is where we first see the effect of the trailing @@ in our INPUT statement. Because SAS is beginning another iteration of the DATA step, by default the INPUT statement should be telling SAS to put the next record in the input data file into the input buffer. The trailing @@ tells SAS not to do that though! Instead, the trailing @@ tells SAS to hold onto the record that is currently in the input buffer and to read the values for AvgHigh AvgLow Mean from it. After doing so, the input pointer points to the blank space after the low-temperature value of 18:

1---+----10---+----↓0---+----30
Jan 32 16 Feb 35 18 Mar 46 26

and SAS places the values that are read into their appropriate place in the program data vector:

_N_	_ERROR_	Month	AvgHigh	AvgLow
2	0	Feb	35	18

The second iteration of the DATA step being complete, SAS writes the second observation to the StateCollege data set and returns to the top of the DATA step to begin its third iteration.

You should get the idea now. The automatic variable _N_ is increased to 3, and the variable values in the program data vector are reset to missing. The trailing @@ tells SAS to hold onto the record that is currently in the input buffer and to read the values for AvgHigh AvgLow Mean from it. After doing so, the input pointer points to the blank space after the low-temperature value of 26, which happens to coincide with the end-of-record marker:

1---+----10---+----20---+----↓0
Jan 32 16 Feb 35 18 Mar 46 26

SAS places the values that are read into their appropriate place in the program data vector:

_N_	_ERROR_	Month	AvgHigh	AvgLow
3	0	mar	46	26

The third iteration of the DATA step being complete, SAS writes the third observation to the StateCollege data set and returns to the top of the DATA step to begin its fourth iteration.

Yikes, when can we stop this endless loop?! Soon, real soon! We have just one more important issue to discuss the trailing @@, namely, when does SAS know to release the current record from the input buffer to go get the next one? The answer is now! Well, okay, that doesn't quite answer the question. In our particular example, the answer is now, because SAS just encountered the end-of-record marker. In general, there are just two situations in which SAS releases a record being held by the trailing @@:

When SAS moves the input pointer past the end of a record.
When an INPUT statement that has no line-hold specifier executes. (The trailing @@ is one of two line-hold specifiers available. The single trailing @, which we'll learn about in the next section is the other.)

For our example, we're at the point at which SAS just encountered that first situation. Therefore, SAS releases the first record and reads the second record into the input buffer. And so on and so on ... SAS continues the execution phase until the last block of data is read from the input file.

Now can we say that we all get the idea? Let's assume so and stop there! Just keep in mind these few important points about the trailing @@:

The trailing @@ holds a data line in the input buffer across multiple iterations of the DATA step.
The trailing @@ is most frequently used to read multiple SAS observations from a single input record, just as was done in our example.
The trailing @@ should not be used in conjunction with column input, the @n absolute pointer control, or the MISSOVER option.

Now, let's go learn what the single trailing at sign (@) can do for us, and how it differs from the double trailing at sign (@@).

21.4 - Reading the Same Number of Repeating Fields

Let's suppose we want to read a raw data file that, in each record, contains a student ID number followed by the same number of grades, such as here:

Student Grades Per Exam
111000234	79	82	100
922232573	87	89	95
252359873	65	72	73
205804679	92	95	99

In some situations, it might work just fine to read in this data file using three different variables for the grades — grade1, grade2, and grade3, say — so that the resulting SAS data set looks like this:

ID	grade1	grade2	grade3
111000231	79	82	100
922232573	87	89	95
252359873	65	72	73
205804679	92	95	99

In other cases, though, it makes more sense to read all of the grades into one score variable — if you wanted to calculate an overall average across all students and all grades, say. In that case, we'd want our data set to have this structure instead:

ID	exam	score
111000234	1	79
111000234	2	82
111000234	3	100
922232573	1	87
922232573	2	89
922232573	3	95
252359873	1	65
252359873	2	72
252359873	3	73
205804679	1	92
205804679	2	95
205804679	3	99

in which each record in the input data file produces three observations, with each observation containing three variables — ID, exam, and score.

In order to read in the data set as desired, we need to take advantage of the INPUT statement's single trailing at sign (@). Just like the double trailing at sign (@@), the single trailing @ is a line-hold specifier that tells SAS to hold the current record in the input buffer while SAS executes the next INPUT statement. The single trailing @ releases the current record from the input buffer:

when SAS encounters an INPUT statement without a line-hold specifier, or
when SAS returns to the top of the DATA step.

The trailing @ will do the trick for us here because we do not need SAS to hold the current record across multiple iterations of the DATA step. Let's make more sense of all of this by looking at an example!

Example 21.7

The following program uses single trailing at signs (@) in multiple INPUT statements along with multiple output statements to read a raw data file into a temporary SAS data set called grades:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER;
DATA grades;
    input ID @;
    input score @;
    output;
    input score @;
    output;
    input score @;
    output;
    DATALINES;
    111000234 79 82 100
    922232573 87 89  95
    252359873 65 72  73
    205804679 92 95  99
    ;
RUN;
PROC PRINT data = grades NOOBS;
    title 'Grades data set';
RUN;

Grades data set
ID	score
111000234	79
111000234	82
111000234	100
922232573	87
922232573	89
922232573	95
252359873	65
252359873	72
252359873	73
205804679	92
205804679	95
205804679	99

Let's work our way through the program, from top to bottom, to make sure we understand what it is doing:

The first INPUT statement tells SAS to place the first record into the input buffer, as well as read the value for the ID variable. The trailing @ tells SAS to hold the first record in the input buffer.
The second INPUT statement tells SAS to read the next value in the input buffer into the variable called score. The trailing @ tells SAS to continue to hold the first record in the input buffer.
The OUTPUT statement then tells SAS to write the current contents of the program data vector containing the values for the ID and score variables — 111000234 and 79, respectively — to create the first observation in the grades data set.
The next INPUT statement tells SAS to read the next value in the input buffer into the variable called score. The value 82 writes over the 79 that's in the score field in the program data vector. The ID value, 111000234, in the program data vector remains unchanged. The trailing @ tells SAS to continue to hold the first record in the input buffer.
The next OUTPUT statement then tells SAS to write the current contents of the program data vector containing the values for the ID and score variables — 111000234 and 82, respectively — to create the second observation in the grades data set.
The next INPUT statement tells SAS to read the next value in the input buffer into the variable called score. The value 100 writes over the 82 that's in the score field in the program data vector. The ID value, 111000234, in the program data vector remains unchanged. The trailing @ tells SAS to continue to hold the input buffer. The trailing @ is really not necessary here, since we've reached the bottom of the DATA step anyway.
The final OUTPUT statement then tells SAS to write the current contents of the program data vector containing the values for the ID and score variables — 111000234 and 100, respectively — to create the third observation in the grades data set.
Having reached the bottom of the DATA step, SAS returns to the top of the DATA step to begin processing the next iteration. In so doing, the first record that was being held in the input buffer is released.
The process continues as just described until SAS reads the last field in the last record.

Launch and run the SAS program, and review the output to convince yourself that the program does indeed create a data set called grades, in which each record contains just one of the grades for one of the students.

Now, you might be looking at that last program and wondering "Geez, what if I had an input data file with 100 repeating fields ... would I really need to write a program with 101 input statements and 100 output statements?" This program looks like a good candidate for an iterative DO loop, doesn't it?

Example 21.8

The following program is identical in behavior to the previous program, except here the alternating sets of three INPUT and OUTPUT statements have been collapsed into one iterative DO loop:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER;
DATA grades;
    input ID @;
    DO exam = 1 to 3;
        input score @;
        output;
    END;
    DATALINES;
    111000234 79 82 100
    922232573 87 89  95
    252359873 65 72  73
    205804679 92 95  99
    ;
RUN;
 
PROC PRINT data = grades NOOBS;
    title 'Grades data set';
RUN;

Grades data set
ID	exam	score
111000234	1	79
111000234	2	82
111000234	3	100
922232573	1	87
922232573	2	89
922232573	3	95
252359873	1	65
252359873	2	72
252359873	3	73
205804679	1	92
205804679	2	95
205804679	3	99

Let's see how this program works! During the compile phase, SAS creates a program data vector containing the automatic variables — _N_ and _ERROR_ — and the three variables mentioned in the DATA step — ID, exam, and score. Then, SAS moves on to the execution phase.

During the first iteration, the first INPUT statement tells SAS to place the first record in the input data file into the input buffer:

1---+----↓0---+----20
11100234 79 82 100

and to read the value for the ID variable into the program data vector:

_N_	_ERROR_	ID	exam	score
1	0	111000234	.	.

The trailing @ tells SAS to hold the record in the input buffer. Then, SAS begins the execution of the DO loop. On the first iteration of the DO loop, exam is set to 1, the INPUT statement reads the first repeating field (79) and assigns the value to score in the program data vector:

_N_	_ERROR_	ID	exam	score
1	0	111000234	1	79

The trailing @ tells SAS to continue to hold the record in the input buffer. The OUTPUT statement tells SAS to write the contents of the program data vector to create the first observation of the grades data set. Having reached the DO loop's END statement, SAS returns to the top of the DO loop and sets exam to 2. The INPUT statement reads the second repeating field (82) and assigns the value to score in the program data vector:

_N_	_ERROR_	ID	exam	score
1	0	111000234	2	82

The trailing @ tells SAS to continue to hold the record in the input buffer. The OUTPUT statement tells SAS to write the contents of the program data vector to create the second observation of the grades data set. Having again reached the DO loop's END statement, SAS returns to the top of the DO loop and sets exam to 3. The INPUT statement reads the third, and final, repeating field (100) and assigns the value to score in the program data vector:

_N_	_ERROR_	ID	exam	score
1	0	111000234	3	100

The trailing @ tells SAS to continue to hold the record in the input buffer. The OUTPUT statement tells SAS to write the contents of the program data vector to create the third observation of the grades data set. Having again reached the DO loop's END statement, SAS returns to the top of the DO loop and sets exam to 4. Because SAS is told to end the DO loop when exam is greater than 3, SAS ends the DO loop and moves on to the next statement in the DATA step. Oh, but it's the DATA step's RUN statement.

So, SAS returns to the top of the DATA step. The first record is released from the input buffer. The first INPUT statement now tells SAS to place the second record in the input data file into the input buffer:

↓---+----10---+----20
922232573 87 89 95

The automatic variable _N_ is increased to 2, _ERROR_ is set to 0, and the other variables in the program data vector are reset to missing:

_N_	_ERROR_	ID	exam	score
2	0	.	.	.

Okay, let's get off of this merry-go-round! SAS proceeds through the program and the input data file just as described above until SAS reaches the last field in the last record. Let's have you now launch and run the SAS program, and review the output to convince yourself that SAS did indeed create the grades data set as described.

It probably won't hurt to summarize again the main differences between the double trailing at sign (@@) and the single trailing at sign (@):

The double trailing @@ holds a record in the input buffer across multiple iterations of the DATA step until the end of the record is reached.
The single trailing @ releases a record when SAS returns to the top of the DATA step to process the next iteration.

Now, let's go investigate how we have to modify our programs here to accommodate data files that contain a varying number of repeating fields in each record.

21.5 - Reading a Varying Number of Repeating Fields

Let's suppose we want to read a raw data file that, in each record, contains a four-digit subject ID number followed by monthly weights (in pounds) of the subjects:

Monthly Weight of Subjects
1001	179	172	169
1002	250	249
1003	190	196	195	164	158
1004	232	224	219	212	208
1005	211	208	204	202

As you can see, because some of the subjects dropped out of the diet program in which they were participating, the data file does not contain an equal number of weights in each record.

In some situations, it might work just fine to read in this data file using five different variables for the weights — wt1, wt2, wt3, wt4, and wt5, say — so that the resulting SAS data set looks like this:

id	wt1	wt2	wt3	wt4	wt5
1001	179	172	169	.	.
1002	250	249	.	.	.
1003	190	196	195	164	158
1004	232	224	219	212	208
1005	211	208	204	202	.

In other cases, though, it makes more sense to read all of the weights into one weight variable — if you wanted to calculate an overall average across all subjects and all weigh-ins, say. In that case, we'd want our data set to have this structure instead:

id	weigh in	weight
1001	1	179
1001	2	172
1001	3	169
1002	1	250
1002	2	249
1003	1	190
1003	2	196
1003	3	195
1003	4	164
1003	5	158
1003	3	195
1004	1	232
1004	2	224
1004	3	219
1004	4	212
1004	5	208
1005	1	211
1005	2	208
1005	3	204
1005	4	202

in which each record in the input data file produces as many observations as it has weights, with each observation containing three variables — id, weighin, and weight.

Because the input data file contains a varying number of repeated fields in each record, our methods for the last section are going to have to be modified. We can still take advantage of the trailing @ to tell SAS to hold the records in the input buffer. We'll also have to take advantage, however, of the INFILE statement's MISSOVER option, as well as replace the iterative DO loop with a DO WHILE loop. Let's take a look!

Example 21.9

The following program uses trailing at signs (@) in conjunction with the MISSOVER option and a conditional DO WHILE loop to read in a raw data file containing a varying number of weights in each record:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER;
DATA dietdata;
    infile DATALINES MISSOVER;
    input id weight @;
    weighin = 0;
    DO WHILE (weight ne .);
        weighin+1;
        output;
        input weight @;
    END;
    DATALINES;
    1001 179 172 169
    1002 250 249
    1003 190 196 195 164 158
    1004 232 224 219 212 208
    1005 211 208 204 202
    ;
RUN;
 
PROC PRINT data = dietdata NOOBS;
    title 'The dietdata data set';
    var id weighin weight;
RUN;

The dietdata data set
id	weighin	weight
1001	1	179
1001	2	172
1001	3	169
1002	1	250
1002	2	249
1003	1	190
1003	2	196
1003	3	195
1003	4	164
1003	5	158
1004	1	232
1004	2	224
1004	3	219
1004	4	212
1004	5	208
1005	1	211
1005	2	208
1005	3	204
1005	4	202

Let's see how this program works! During the compile phase, SAS creates a program data vector containing the automatic variables — _N_ and _ERROR_ — and the three variables mentioned in the DATA step — id, weighin, and weight. Then, SAS moves on to the execution phase.

Even though we are reading the data instream, we've included an INFILE statement this time just so we can take advantage of the statement's MISSOVER option. Like the previous example with the same number of repeating fields, we must use trailing at signs (@) to tell SAS to read the same record more than once. But, here, we also have to tell SAS not to move the input pointer to the next record when there are missing weights. That's why the MISSOVER option.

During the first iteration, the first INPUT statement tells SAS to place the first record in the input data file into the input buffer:

↓---+----10---+----20---+
1001 179 172 169

and to read the values for the ID and weight variables into the program data vector. The trailing @ tells SAS to hold the record in the input buffer. The weighin variable is initialized to 0, so that the program data vector looks like this now:

_N_	_ERROR_	id	weighin	weight
1	0	1001	0	179

Then, SAS begins execution of the DO WHILE loop. The current value for weight is not missing — it's 179 — so SAS steps into the DO WHILE loop. The value of weighin is increased to 1 so that the program data vector looks like this:

_N_	_ERROR_	id	weighin	weight
1	0	1001	1	179

The OUTPUT statement tells SAS to write the contents of the program data vector to create the first observation of the dietdata data set. Then, the INPUT statement reads the repeating field (172) and assigns the value to the program data vector:

_N_	_ERROR_	id	weighin	weight
1	0	1001	1	172

The trailing @ tells SAS to continue to hold the record in the input buffer. Having reached the DO WHILE loop's END statement, SAS returns to the top of the DO loop and detemines that weight is again not missing — it's 172. So, SAS steps into the DO WHILE loop again. The value of weighin is increased to 1, so that the program data vector looks like this:

_N_	_ERROR_	id	weighin	weight
1	0	1001	2	172

The OUTPUT statement tells SAS to write the contents of the program data vector to create the second observation of the dietdata data set. Then, the INPUT statement reads the repeating field (169) and assigns the value to the program data vector:

_N_	_ERROR_	id	weighin	weight
1	0	1001	2	169

The trailing @ tells SAS to continue to hold the record in the input buffer. Having reached the DO WHILE loop's END statement, SAS returns to the top of the DO loop and detemines that weight is again not missing — it's 169. So, SAS steps into the DO WHILE loop again. The value of weighin is increased to 1, so that the program data vector looks like this:

_N_	_ERROR_	id	weighin	weight
1	0	1001	3	169

The OUTPUT statement tells SAS to write the contents of the program data vector to create the second observation of the dietdata data set. Then, the INPUT statement attempts to read another weight value and determines there to be none. The MISSOVER option prevents the input pointer from moving to the next record in search of another value for weight. SAS assigns a missing value to the weight variable in the program data vector:

_N_	_ERROR_	id	weighin	weight
1	0	1001	3	.

Having reached the DO WHILE loop's END statement, SAS returns to the top of the DO loop and detemines that weight is now missing, and so the statements in the loop are not executed. Having reached the bottom of the DATA step, SAS returns to the top of the DATA step to begin processing the next iteration. In so doing, the first record that was being held in the input buffer is released.

During the second iteration of the DATA step, the first INPUT statement tells SAS to place the second record in the input data file into the input buffer:

↓---+----10---+----20---+
1002 250 249

The automatic variable _N_ is increased to 2, _ERROR_ is set to 0, and the other variables in the program data vector are reset to missing:

_N_	_ERROR_	id	weighin	weight
2	0	.	.	.

21.6 - Summary

In this lesson, we extended our knowledge of how to read data from various types of data files into SAS data sets.

The homework for this lesson will give you more practice with these techniques so that you become even more familiar with how they work and can use them in your own SAS programming.

^[1]	Link
↥	Has Tooltip/Popover
	Toggleable Visibility

Month	Av High	Av Low	Month	Av High	Av Low	Month	Av High	Av Low
Jan	32	16	Feb	35	18	Mar	46	26
Apr	58	37	May	68	47	Jun	78	56
Jul	82	60	Aug	80	58	Sep	72	51
Oct	61	40	Nov	48	32	Dec	37	22

Month	Avg High	Avg Low
Jan	32	16
Feb	35	18
Mar	46	26
Apr	58	37
May	68	47
Jun	78	56
Jul	82	60
Aug	80	58
Sep	72	51
Oct	61	40
Nov	48	32
Dec	37	22

Monthly High/ Low Temps
Jan	32	16	Feb	35	18	Mar	46	26
Apr	58	37	May	68	47	Jun	78	56
Jul	82	60	Aug	80	58	Sep	72	51
Oct	61	40	Nov	48	32	Dec	37	22

Month1	High1	Low1	Month2	High2	Low2	Month3	High3	Low3
Jan	32	16	Feb	35	18	Mar	46	26
Apr	58	37	May	68	47	Jun	78	56
Jul	82	60	Aug	80	58	Sep	72	51
Oct	61	40	Nov	48	32	Dec	37	22

Month	Avg High	Avg Low
Jan	32	16
Feb	35	18
Mar	46	26
Apr	58	37
May	68	47
Jun	78	56
Jul	82	60
Aug	80	58
Sep	72	51
Oct	61	40
Nov	48	32
Dec	37	22

Month	Av High	Av Low	Month	Av High	Av Low	Month	Av High	Av Low
Jan	32	16	Feb	35	18	Mar	46	26
Apr	58	37	May	68	47	Jun	78	56
Jul	82	60	Aug	80	58	Sep	72	51
Oct	61	40	Nov	48	32	Dec	37	22

Month	Avg High	Avg Low
Jan	32	16
Feb	35	18
Mar	46	26
Apr	58	37
May	68	47
Jun	78	56
Jul	82	60
Aug	80	58
Sep	72	51
Oct	61	40
Nov	48	32
Dec	37	22

Monthly High/ Low Temps
Jan	32	16	Feb	35	18	Mar	46	26
Apr	58	37	May	68	47	Jun	78	56
Jul	82	60	Aug	80	58	Sep	72	51
Oct	61	40	Nov	48	32	Dec	37	22

Month1	High1	Low1	Month2	High2	Low2	Month3	High3	Low3
Jan	32	16	Feb	35	18	Mar	46	26
Apr	58	37	May	68	47	Jun	78	56
Jul	82	60	Aug	80	58	Sep	72	51
Oct	61	40	Nov	48	32	Dec	37	22

Month	Avg High	Avg Low
Jan	32	16
Feb	35	18
Mar	46	26
Apr	58	37
May	68	47
Jun	78	56
Jul	82	60
Aug	80	58
Sep	72	51
Oct	61	40
Nov	48	32
Dec	37	22

Month	Av High	Av Low	Month	Av High	Av Low	Month	Av High	Av Low
Jan	32	16	Feb	35	18	Mar	46	26
Apr	58	37	May	68	47	Jun	78	56
Jul	82	60	Aug	80	58	Sep	72	51
Oct	61	40	Nov	48	32	Dec	37	22

Month	Avg High	Avg Low
Jan	32	16
Feb	35	18
Mar	46	26
Apr	58	37
May	68	47
Jun	78	56
Jul	82	60
Aug	80	58
Sep	72	51
Oct	61	40
Nov	48	32
Dec	37	22

Monthly High/ Low Temps
Jan	32	16	Feb	35	18	Mar	46	26
Apr	58	37	May	68	47	Jun	78	56
Jul	82	60	Aug	80	58	Sep	72	51
Oct	61	40	Nov	48	32	Dec	37	22

Month1	High1	Low1	Month2	High2	Low2	Month3	High3	Low3
Jan	32	16	Feb	35	18	Mar	46	26
Apr	58	37	May	68	47	Jun	78	56
Jul	82	60	Aug	80	58	Sep	72	51
Oct	61	40	Nov	48	32	Dec	37	22

Month	Avg High	Avg Low
Jan	32	16
Feb	35	18
Mar	46	26
Apr	58	37
May	68	47
Jun	78	56
Jul	82	60
Aug	80	58
Sep	72	51
Oct	61	40
Nov	48	32
Dec	37	22