21.1 - Creating a Single Observation From Multiple Records
21.1 - Creating a Single Observation From Multiple RecordsIn this section, we'll learn how to read in a raw data file when the data values for one observation are spread out over several records in the file. For example, suppose we are interested in reading the following data file:
City 1 | State College PA | ||
---|---|---|---|
Temps | 32 | 16 | 25 |
Precip | 2.4 | ||
City 2 | Miami FL | ||
Temps | 75 | 58 | 67 |
Precip | 2.0 | ||
City 3 | Honolulu HI | ||
Temps | 80 | 65 | 74 |
Precip | 3.6 |
containing first a city name, and then on the next record, the average high, average low, and just plain old average temperature for that city in January, and then on the next record the average precipitation for that city in January. After reading in the first three records for the first city, we'd move on to read the next three records for the second city, and so on. When all is said and done, we'd like our SAS data set to look like this:
City | State | Avg High | Avg Low | Mean | Percip |
---|---|---|---|---|---|
State College | PA | 32 | 15 | 25 | 2.4 |
Miami | FL | 75 | 58 | 67 | 2.0 |
Honolulu | HI | 80 | 65 | 74 | 3.6 |
We'll learn two different ways of reading multiple records in a raw data file while creating just one observation in a SAS data set. First, we'll learn how to use the forward slash (/) line pointer control to read the records in a raw data file sequentially. Then, we'll learn how to use the pound-n (#n) line pointer control to read the records in a raw data file non-sequentially.
Reading Multiple Records Sequentially
Just as you can use a +n relative pointer control in your INPUT statements to tell SAS to advance n columns within a record before reading the next data value, you can use one or more forward slash (/) line pointer controls in your INPUT statements to tell SAS to advance to a new record before reading the next data value. Because the forward-slash (/) line pointer control only moves the input pointer forward in a record, it can only be used to read the records in a data file in order.
To make sure we understand the problem we are trying to solve in this section, let's start by looking at an example that illustrates how we'd have to read multiple records sequentially if the forward-slash (/) line pointer control weren't available to us.
Example 21.1
The following program uses three INPUT statements to read January weather statistics for three cities — State College, Miami, and Honolulu — when the data values for each city are recorded over three records in the input data file:
OPTIONS PS = 58 LS = 72 NODATE NONUMBER;
DATA january;
input City & $13. State $;
input AvgHigh AvgLow Mean;
input Precip;
DATALINES;
State College PA
32 16 25
2.4
Miami FL
75 58 67
2.0
Honolulu HI
80 65 74
3.6
;
RUN;
PROC PRINT data = january;
title 'January Weather for Three U.S. Cities';
RUN;
Obs | City | State | AvgHigh | AvgLow | Mean | Precip |
---|---|---|---|---|---|---|
1 | State College | PA | 32 | 16 | 25 | 2.4 |
2 | Miami | FL | 75 | 58 | 67 | 2.0 |
3 | Honolulu | HI | 80 | 65 | 74 | 3.6 |
The first INPUT statement tells SAS to read the City and State into the observation. (The ampersand (&) modifier is used because one of the cities — State College — contains an embedded blank, and the city and states are separated by two blank spaces.) Then, the second INPUT statement tells SAS to read into the observation the average high temperature (AvgHigh), the average low temperature (AvgLow), and the overall average temperature (Mean) for the city in January. Finally, the third INPUT statement tells SAS to read the average precipitation (Precip) in January for the city into the observation. Then, at the end of the DATA step, the values in the program data vector are written to the january data set as the first observation. SAS then returns to the beginning of the DATA step, sets the variable values back to missing, and executes the three INPUT statements again to build the second observation. The DATA step continues to execute as such until SAS runs out of records to read in the input data file.
Launch and run the SAS program, and review the output to convince yourself that SAS reads the temperature data correctly. Now, let's see how the forward-slash (/) line pointer control allows us to accomplish the same thing.
Example 21.2
The following program uses the forward slash (/) line pointer control to read January weather statistics for three cities — State College, Miami, and Honolulu — when the data values for each city are recorded over three records in the input data file:
OPTIONS PS = 58 LS = 72 NODATE NONUMBER;
DATA january;
input City & $13. State $
/ AvgHigh AvgLow Mean
/ Precip;
DATALINES;
State College PA
32 16 25
2.4
Miami FL
75 58 67
2.0
Honolulu HI
80 65 74
3.6
;
RUN;
PROC PRINT data = january;
title 'January Weather for Three U.S. Cities';
RUN;
Obs | City | State | AvgHigh | AvgLow | Mean | Precip |
---|---|---|---|---|---|---|
1 | State College | PA | 32 | 16 | 25 | 2.4 |
2 | Miami | FL | 75 | 58 | 67 | 2.0 |
3 | Honolulu | HI | 80 | 65 | 74 | 3.6 |
Here, a single INPUT statement tells SAS to read the data values for the six variables (City, State, ..., Precip) across three records. Note that because the forward slash (/) line pointer control tells SAS to advance the input pointer to the next record, it is placed after the instructions for reading the values in the current record. For example, we first tell SAS to read the City and State values in the first record, then the / line pointer control tells SAS to move to the next record to read in the AvgHigh, AvgLow, and Mean values. Then, the / line pointer control tells SAS to move to the next record to read in the Precip value. Being at the end of the DATA step, the values in the program data vector are written to the data set as the first observation. SAS then returns to the beginning of the DATA step, sets the variable values back to missing, and executes the INPUT statement again to build the second observation. The DATA step continues to execute as such until SAS runs out of records to read in the input data file.
Launch and run the SAS program. Review the output to convince yourself that SAS reads the temperature data correctly. You might also want to look at your log window for the NOTE that indicates that the temporary data set january contains only three observations (even though nine data records were input).
You'll want to keep in mind that when you use the forward slash (/) line pointer control in a single INPUT statement as we did here, your INPUT statement should contain just one semi-colon at the end of it. Some students get tripped up thinking that they need to put a semi-colon after each / line pointer control. One INPUT statement requires just one semi-colon.
Note that the input raw data files in the two previous examples contained the same number of records for each observation. You can probably well imagine that this is a requirement when using the forward-slash (/) pointer control! Let's take a look at an example of what can go awry if your raw data file doesn't contain the same number of records for each observation.
Example 21.3
The following program attempts to use forward slash (/) pointer controls to read the January weather statistics for our three cities, State College, Miami, and Honolulu, when the record containing the precipitation for Miami is omitted from the input data file:
OPTIONS PS = 58 LS = 72 NODATE NONUMBER;
DATA january;
input City & $13. State $
/ AvgHigh AvgLow Mean
/ Precip;
DATALINES;
State College PA
32 16 25
2.4
Miami FL
75 58 67
Honolulu HI
80 65 74
3.6
;
RUN;
PROC PRINT data = january;
title 'January Weather for Three U.S. Cities';
RUN;
Obs | City | State | AvgHigh | AvgLow | Mean | Precip |
---|---|---|---|---|---|---|
1 | State College | PA | 32 | 16 | 25 | 2.4 |
2 | Miami | FL | 75 | 58 | 67 | . |
First, note that the input data file contains only two records for Miami — the city name record and the temperature record, but not the precipitation record. Then, launch and run the SAS program. Review the output to convince yourself that SAS had trouble reading the temperature data correctly. As you can see by the message in the log window:
OPTIONS PS = 58 LS = 72 NODATE NONUMBER;
DATA january;
input City & $13. State $
/ AvgHigh AvgLow Mean
/ Precip;
DATALINES;
NOTE: Invalid data for Precip in line 54 5-12.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6-
Honolulu HI
City=Miami State=FL AvgHigh=75 AvgLow=58 Mean=67 Precip=. _ERROR_=1
_N_=2
NOTE: LOST CARD.
;
City=80 65 74 State=3.6 AvgHigh=. AvgLow=. Mean=. Precip=. _ERROR_=1
_N_=3
NOTE: SAS went to a new line when INPUT statement reached past the end
of a line.
NOTE: The data set WORK.JANUARY has 2 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds
;
RUN;
PROC PRINT data = january;
NOTE: Writing HTML Body file: sashtml2.htm
title 'January Weather for Three U.S. Cities';
RUN;
NOTE: There were 2 observations read from the data set WORK.JANUARY.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.44 seconds
cpu time 0.28 seconds
when SAS attempts to read the Precip value for Miami, it instead reads Honolulu. Because Precip is a numeric variable and Honolulu is a character value, SAS displays an Invalid data message. The moral of the story ... before you use a forward slash (/) pointer control in an INPUT statement, make sure that the raw data file contains the same number of records for each observation.
Reading Multiple Records Non-Sequentially
Just as you can use an @n absolute pointer control in your INPUT statements to tell SAS to move to a record's nth column before reading the next data value, you can use the pound-n (#n) line pointer control to tell SAS to advance to a specific record before reading the next data value. Because the #n line pointer control moves the input pointer to a particular record, it can be used to read the records in a data file in any order.
Example 21.4
The following program uses the pound-n (#n) line pointer control to read in January weather statistics for three cities — State College, Miami, and Honolulu — when the data values for each city are recorded over three records in the input data file:
OPTIONS PS = 58 LS = 72 NODATE NONUMBER;
DATA january;
input #2 AvgHigh AvgLow Mean
#3 Precip
#1 City & $13. State $;
DATALINES;
State College PA
32 16 25
2.4
Miami FL
75 58 67
2.0
Honolulu HI
80 65 74
3.6
;
RUN;
PROC PRINT data = january;
title 'January Weather for Three U.S. Cities';
RUN;
Obs | AvgHigh | AvgLow | Mean | Precip | City | State |
---|---|---|---|---|---|---|
1 | 32 | 16 | 25 | 2.4 | State College | PA |
2 | 75 | 58 | 67 | 2.0 | Miami | FL |
3 | 80 | 65 | 74 | 3.6 | Honolulu | HI |
Because the #n line pointer control tells SAS the specific record to which we want to move the input pointer, it is placed before the instructions for reading the values in that particular record. For example, we first tell SAS to move to the second record to read in the AvgHigh, AvgLow, and Mean values. Then, we tell SAS to move to the third record to read in the Precip value. Then, we tell SAS to move back to the first record to read the City and State values. Being at the end of the DATA step, the values in the program data vector are written to the data set as the first observation. SAS then returns to the beginning of the DATA step, sets the variable values back to missing, and executes the INPUT statement again to build the second observation. The DATA step continues to execute as such until SAS runs out of records to read in the input data file.
Launch and run the SAS program, and review the output to convince yourself that SAS reads in the temperature data correctly. You might also want to note, as is always the case, that the variables in the resulting january data set appear in the same order as their variable names appear in the INPUT statement — AvgHigh, AvgLow, and Mean first, followed by Precip, and then finally City and State.
Reading Multiple Records Sequentially and Non-Sequentially
Although, as you'll see, it's not perfectly flexible, you can use the forward slash (/) and pound-n (#n) line pointer controls together in the same INPUT statement to read multiple records both sequentially and non-sequentially.
Example 21.5
The following program uses both the / and #n line pointer controls to read in January weather statistics for three cities — State College, Miami, and Honolulu — when the data values for each city are recorded over three records in the input data file:
OPTIONS PS = 58 LS = 72 NODATE NONUMBER;
DATA january;
input #3 Precip
#1 City & $13. State $
/ AvgHigh AvgLow Mean;
DATALINES;
State College PA
32 16 25
2.4
Miami FL
75 58 67
2.0
Honolulu HI
80 65 74
3.6
;
RUN;
PROC PRINT data = january;
title 'January Weather for Three U.S. Cities';
RUN;
Obs | Precip | City | State | AvgHigh | AvgLow | Mean |
---|---|---|---|---|---|---|
1 | 2.4 | State College | PA | 32 | 16 | 25 |
2 | 2.0 | Miami | FL | 75 | 58 | 67 |
3 | 3.6 | Honolulu | HI | 80 | 65 | 74 |
You should get the idea now. Here, the INPUT statement uses the #3 pointer control to tell SAS to move first to the third record to read in the Precip value, and then the #1 pointer control to tell SAS to move back to the first record to read in the City and State values, and finally the / pointer control to tell SAS to advance another record to read in the AvgHigh, AvgLow, and Mean values.
Launch and run the SAS program, and review the output to convince yourself that SAS reads in the temperature data correctly. You might again want to note that the variables in the resulting january data set appear in the same order as their variable names appear in the INPUT statement — Precip first, followed by City and State, and then finally AvgHigh, AvgLow, and Mean.
To use this example to address that earlier point about using both pointer controls in the same INPUT statement not being perfectly flexible ... if you replace the INPUT statement with this one:
input #2 AvgHigh AvgLow Mean
/ Precip
#1 City & $13. State $;
and re-run the SAS program, you should see SAS hiccups. If you're not convinced of that look at the error message displayed in the log window:
This example illustrates that you can't read records non-sequentially once you've invoked the sequential-reading forward slash (/) line pointer control.